Re: [PATCH v2 05/14] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-08-24 Thread  
On Wed, Aug 24, 2022 at 02:52:51PM -0700, Dan Williams wrote: > Shiyang Ruan wrote: > > This new function is a variant of mf_generic_kill_procs that accepts a > > file, offset pair instead of a struct to support multiple files sharing > > a DAX mapping. It is intended to be called by the file

Re: [PATCH v13 5/7] mm: Introduce mf_dax_kill_procs() for fsdax case

2022-04-21 Thread  
On Tue, Apr 19, 2022 at 12:50:43PM +0800, Shiyang Ruan wrote: > This new function is a variant of mf_generic_kill_procs that accepts a > file, offset pair instead of a struct to support multiple files sharing > a DAX mapping. It is intended to be called by the file systems as part > of the
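
For context, the interface described here can be sketched as follows (a hedged reconstruction from the patch description; the exact merged signature may differ):

    /*
     * Sketch: kill processes mapping a poisoned fsdax page, keyed by a
     * (mapping, index) pair rather than a struct page, so that several
     * files sharing one DAX extent can all be handled.
     */
    int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
                          unsigned long count, int mf_flags);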

Re: [PATCH v13 3/7] pagemap,pmem: Introduce ->memory_failure()

2022-04-21 Thread  
On Tue, Apr 19, 2022 at 12:50:41PM +0800, Shiyang Ruan wrote: > When memory-failure occurs, we call this function, which is implemented > by each kind of device. For the fsdax case, the pmem device driver > implements it. The pmem device driver will find out the filesystem in which > the corrupted page
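
The hook this patch adds sits in struct dev_pagemap_ops; a sketch of the shape being described (names from the series, details hedged):

    struct dev_pagemap_ops {
            /* ... existing ops ... */

            /*
             * Called from memory_failure() so the driver (pmem, for
             * fsdax) can route the error to the owning filesystem.
             */
            int (*memory_failure)(struct dev_pagemap *pgmap,
                                  unsigned long pfn,
                                  unsigned long nr_pages, int mf_flags);
    };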

Re: [PATCH v13 2/7] mm: factor helpers for memory_failure_dev_pagemap

2022-04-21 Thread  
On Tue, Apr 19, 2022 at 12:50:40PM +0800, Shiyang Ruan wrote: > The memory_failure_dev_pagemap code is a bit complex before introducing the RMAP > feature for fsdax, so some helper functions need to be factored out to > simplify the code. > > Signed-off-by: Shiyang Ruan > Reviewed-by: Darrick J. Wong >

Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-20 Thread  
Hi everyone, On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote: > On Wed, Apr 20, 2022 at 07:20:07PM -0700, Dan Williams wrote: > > [ add Andrew and Naoya ] > > > > On Wed, Apr 20, 2022 at 6:48 PM Shiyang Ruan > > wrote: > > > > > > Hi Dave, > > > > > > On 2022/4/21 9:20, Dave

Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-20 Thread  
On Tue, Apr 20, 2021 at 08:42:53AM -0700, Luck, Tony wrote: > On Mon, Apr 19, 2021 at 06:49:15PM -0700, Jue Wang wrote: > > On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote: > > ... > > > + * This function is intended to handle "Action Required" MCEs on already > > > + * hardware poisoned

Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-20 Thread  
On Tue, Apr 20, 2021 at 12:16:57PM +0200, Borislav Petkov wrote: > On Tue, Apr 20, 2021 at 07:46:26AM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > If you have any other suggestion, please let me know. > > Looks almost ok... > > > From: Tony Luck > > Date: Tue, 20 Apr 2

Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-20 Thread  
On Mon, Apr 19, 2021 at 06:49:15PM -0700, Jue Wang wrote: > On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote: > ... > > + * This function is intended to handle "Action Required" MCEs on already > > + * hardware poisoned pages. They could happen, for example, when > > + * memory_failure()

Re: [PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-20 Thread  
On Mon, Apr 19, 2021 at 07:05:38PM +0200, Borislav Petkov wrote: > On Tue, Apr 13, 2021 at 07:43:18AM +0900, Naoya Horiguchi wrote: > > From: Tony Luck > > > > There can be races when multiple CPUs consume poison from the same > > page. The first into memory_failure() atomically sets the

Re: [PATCH] mm/memory-failure: unnecessary amount of unmapping

2021-04-19 Thread  
On Mon, Apr 19, 2021 at 06:28:21PM -0600, Jane Chu wrote: > It appears that unmap_mapping_range() actually takes a 'size' as its > third argument rather than a location; the current calling fashion > causes an unnecessary amount of unmapping to occur. > > Fixes: 6100e34b2526e ("mm, memory_failure:
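
In other words, the third argument of unmap_mapping_range() is a byte length, not an end offset, so passing an end position unmaps far more than intended. A minimal illustration (hypothetical call site):

    /*
     * void unmap_mapping_range(struct address_space *mapping,
     *                          loff_t holebegin, loff_t holelen,
     *                          int even_cows);
     * holelen is a size in bytes, not an end position.
     */
    unmap_mapping_range(mapping, start, start + size, 0); /* wrong: unmaps too much */
    unmap_mapping_range(mapping, start, size, 0);         /* right: unmaps 'size' bytes */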

Re: [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

2021-04-18 Thread  
On Sat, Apr 17, 2021 at 01:47:51PM +0800, Aili Yao wrote: > On Tue, 13 Apr 2021 07:43:17 +0900 > Naoya Horiguchi wrote: > > > Hi, > > > > I wrote this patchset to materialize what I think is the current > > allowable solution mentioned by the previous discussion [1]. > > I simply borrowed

Re: [PATCH v7] mm/gup: check page hwpoison status for memory recovery failures.

2021-04-06 Thread  
On Tue, Apr 06, 2021 at 10:41:23AM +0800, Aili Yao wrote: > When we call get_user_pages() to pin a user page in memory, there may be > a hwpoison page; currently, we just handle the normal case where the memory > recovery job has correctly finished, and we will not return the hwpoison > page to callers, but

Re: [PATCH v3] mm,hwpoison: return -EHWPOISON when page already poisoned

2021-04-05 Thread  
On Fri, Apr 02, 2021 at 03:11:20PM +, Luck, Tony wrote: > >> Combined with my "mutex" patch (to get rid of races where 2nd process > >> returns > >> early, but first process is still looking for mappings to unmap and tasks > >> to signal) this patch moves forward a bit. But I think it needs

Re: [PATCH v5] mm/gup: check page hwpoison status for coredump.

2021-03-31 Thread  
On Wed, Mar 31, 2021 at 07:07:39AM +0100, Matthew Wilcox wrote: > On Wed, Mar 31, 2021 at 01:52:59AM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > If we successfully unmapped but failed in truncate_error_page() for example, > > the processes mapping the page would get -EFAULT as expe

Re: [PATCH v5] mm/gup: check page hwpoison status for coredump.

2021-03-30 Thread  
On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote: > On Wed, 31 Mar 2021 01:52:59 + HORIGUCHI NAOYA(堀口 直也) > wrote: > > On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote: > > > On 26.03.21 15:09, David Hildenbrand wrote: > > > >

Re: [PATCH v5] mm/gup: check page hwpoison status for coredump.

2021-03-30 Thread  
On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote: > On 26.03.21 15:09, David Hildenbrand wrote: > > On 22.03.21 12:33, Aili Yao wrote: > > > When we do a coredump for a user process signal, this may be a SIGBUS signal > > > with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-16 Thread  
On Fri, Mar 12, 2021 at 11:48:31PM +, Luck, Tony wrote: > >> will memory_failure() find it and unmap it? if succeed, then the current > >> will be > >> signaled with correct vaddr and shift? > > > > That's a very good question. I didn't see a SIGBUS when I first wrote this > > code, > >

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-11 Thread  
On Wed, Mar 10, 2021 at 02:10:42PM +0800, Aili Yao wrote: > On Fri, 5 Mar 2021 15:55:25 + > "Luck, Tony" wrote: > > > > From the walk, it seems we have got the virtual address, can we just send > > > a SIGBUS with it? > > > > If the walk wins the race and the pte for the poisoned page is

Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-10 Thread  
On Tue, Mar 09, 2021 at 12:01:40PM -0800, Luck, Tony wrote: > On Tue, Mar 09, 2021 at 08:28:24AM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote: > > > When the page is already poisoned, another memory_failure() call in the &

Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-09 Thread  
On Tue, Mar 09, 2021 at 02:35:34PM +0800, Aili Yao wrote: > When the page is already poisoned, another memory_failure() call on the > same page now returns 0, meaning OK. For nested MCE handling, this > behavior may lead to MCE looping. Example: > > 1. When LCME is enabled, and there are two
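
The direction discussed in the thread is to make the already-poisoned case visible to the caller instead of returning 0; a sketch of that shape (error value from the patch subject, surrounding code simplified):

    if (TestSetPageHWPoison(p)) {
            /*
             * The page was poisoned by an earlier memory_failure()
             * call; report that instead of returning 0, so the MCE
             * handler can still send SIGBUS rather than loop.
             */
            return -EBUSY;
    }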

Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-03-08 Thread  
On Tue, Mar 09, 2021 at 10:04:21AM +0800, Aili Yao wrote: > On Mon, 8 Mar 2021 14:55:04 -0800 > "Luck, Tony" wrote: > > > There can be races when multiple CPUs consume poison from the same > > page. The first into memory_failure() atomically sets the HWPoison > > page flag and begins hunting for

Re: [PATCH] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-03-08 Thread  
On Mon, Mar 08, 2021 at 02:55:04PM -0800, Luck, Tony wrote: > There can be races when multiple CPUs consume poison from the same > page. The first into memory_failure() atomically sets the HWPoison > page flag and begins hunting for tasks that map this page. Eventually > it invalidates those
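
A minimal sketch of the serialization being proposed (the helper split is hypothetical; the actual patch brackets the existing function body):

    static DEFINE_MUTEX(mf_mutex);

    int memory_failure(unsigned long pfn, int flags)
    {
            int res;

            /* serialize CPUs that consumed poison from the same page */
            mutex_lock(&mf_mutex);
            res = __memory_failure(pfn, flags); /* hypothetical: existing body */
            mutex_unlock(&mf_mutex);
            return res;
    }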

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-08 Thread  
On Mon, Mar 08, 2021 at 06:54:02PM +, Luck, Tony wrote: > >> So it should be safe to grab and hold a mutex. See patch below. > > > > The mutex approach looks simpler and safer, so I'm fine with it. > > Thanks. Is that an "Acked-by:"? Not yet, I intended to add it after full patch is

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-03-07 Thread  
On Fri, Mar 05, 2021 at 02:11:43PM -0800, Luck, Tony wrote: > This whole page table walking patch is trying to work around the > races caused by multiple calls to memory_failure() for the same > page. > > Maybe better to just avoid the races. The comment right above > memory_failure says: > >

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-25 Thread  
On Thu, Feb 25, 2021 at 10:15:42AM -0800, Luck, Tony wrote: > On Thu, Feb 25, 2021 at 12:38:06PM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > Thank you for shedding light on this, this race looks worrisome to me. > > We call try_to_unmap() inside memory_failure(), where we find af

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-25 Thread  
On Thu, Feb 25, 2021 at 12:39:30PM +0100, Oscar Salvador wrote: > On Thu, Feb 25, 2021 at 11:28:18AM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > Hi Aili, > > > > I agree that this set_mce_nospec() is not expected to be called for > > "already hwpoisoned" page b

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

2021-02-25 Thread  
On Thu, Feb 25, 2021 at 11:43:29AM +0800, Aili Yao wrote: > On Wed, 24 Feb 2021 11:31:55 +0100 Oscar Salvador wrote: ... > > > > > > > 3.The kill_me_maybe will check the return: > > > > > > 1244 static void kill_me_maybe(struct callback_head *cb) > > > 1245 { > > > > > > 1254 if

Re: [PATCH v2] x86/fault: Send a SIGBUS to user process always for hwpoison page access.

2021-02-03 Thread  
Hi Aili, On Mon, Feb 01, 2021 at 04:17:49PM +0800, Aili Yao wrote: > When one page is already hwpoisoned by an AO action, the process may not be > killed; the process mapping this page may make a syscall involving this > page and trigger a VM_FAULT_HWPOISON fault. If it's in kernel > mode it may
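
The behavior argued for is to deliver SIGBUS from the fault path as well; a sketch using the existing helper (call site and lsb choice are illustrative assumptions):

    /*
     * On VM_FAULT_HWPOISON from a user-mode access, deliver SIGBUS with
     * the machine-check si_code and the faulting virtual address.
     */
    force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, PAGE_SHIFT);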

Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

2021-01-13 Thread  
On Wed, Jan 13, 2021 at 10:18:09PM -0800, Dan Williams wrote: > On Wed, Jan 13, 2021 at 5:50 PM HORIGUCHI NAOYA(堀口 直也) > wrote: > > > > On Wed, Jan 13, 2021 at 04:43:32PM -0800, Dan Williams wrote: > > > The conversion to move pfn_to_online_page() internal to >

Re: [PATCH v4 5/5] mm: Fix memory_failure() handling of dax-namespace metadata

2021-01-13 Thread  
On Wed, Jan 13, 2021 at 04:43:37PM -0800, Dan Williams wrote: > Given 'struct dev_pagemap' spans both data pages and metadata pages, be > careful to consult the altmap if present to delineate metadata. In fact > the pfn_first() helper already identifies the first valid data pfn, so > export that

Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

2021-01-13 Thread  
On Wed, Jan 13, 2021 at 04:43:32PM -0800, Dan Williams wrote: > The conversion to move pfn_to_online_page() internal to > soft_offline_page() missed that the get_user_pages() reference taken by > the madvise() path needs to be dropped when pfn_to_online_page() fails. > Note the direct sysfs-path

Re: [PATCH] mm,hwpoison: Fix printing of page flags

2021-01-08 Thread  
On Fri, Jan 08, 2021 at 09:52:02AM +0100, Oscar Salvador wrote: > Format %pG expects a lower case 'p' in order to print the flags. > Fix it. > > Reported-by: Dan Carpenter > Signed-off-by: Oscar Salvador > Fixes: 8295d535e2aa ("mm,hwpoison: refactor get_any_page") Thank you! Acked-by: Naoya

Re: [External] Re: [PATCH 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page

2021-01-04 Thread  
On Tue, Jan 05, 2021 at 03:10:35PM +0800, Muchun Song wrote: > On Tue, Jan 5, 2021 at 2:38 PM HORIGUCHI NAOYA(堀口 直也) > wrote: > > > > On Mon, Jan 04, 2021 at 02:58:41PM +0800, Muchun Song wrote: > > > When dissolve_free_huge_page() races with __free_huge_page(), we ca

Re: [PATCH 4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page

2021-01-04 Thread  
On Mon, Jan 04, 2021 at 02:58:41PM +0800, Muchun Song wrote: > When dissolve_free_huge_page() races with __free_huge_page(), we can > retry, because the race window is small. > > Signed-off-by: Muchun Song > --- > mm/hugetlb.c | 16 +++- > 1 file changed, 11 insertions(+), 5
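
With -EAGAIN distinguishable from hard failures, a caller can retry across the small race window; a sketch (the retry loop is illustrative, not the patch's exact code):

    int ret;

    do {
            ret = dissolve_free_huge_page(page);
    } while (ret == -EAGAIN); /* lost the race with __free_huge_page(); retry */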

Re: [PATCH] mm,hwpoison: Return -EBUSY when migration fails

2020-12-09 Thread  
On Wed, Dec 09, 2020 at 10:28:18AM +0100, Oscar Salvador wrote: > Currently, we return -EIO when we fail to migrate the page. > > Migration failures are rather transient, as they can happen for > several reasons, e.g.: a high page refcount bump, mapping->migrate_page > failing, etc. All meaning

Re: [PATCH] mm,memory_failure: Always pin the page in madvise_inject_error

2020-12-07 Thread  
On Mon, Dec 07, 2020 at 10:48:18AM +0100, Oscar Salvador wrote: > madvise_inject_error() uses get_user_pages_fast to translate the > address we specified to a page. > After [1], we drop the extra reference count for memory_failure() path. > That commit says that memory_failure wanted to keep the

Re: [PATCH] mm,memory_failure: Always pin the page in madvise_inject_error

2020-12-07 Thread  
On Mon, Dec 07, 2020 at 06:22:00PM -0800, Andrew Morton wrote: > On Mon, 7 Dec 2020 10:48:18 +0100 Oscar Salvador wrote: > > > madvise_inject_error() uses get_user_pages_fast to translate the > > address we specified to a page. > > After [1], we drop the extra reference count for

Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2020-12-06 Thread  
On Sat, Dec 05, 2020 at 04:34:23PM +0100, Oscar Salvador wrote: > On Fri, Dec 04, 2020 at 06:25:31PM +0100, Vlastimil Babka wrote: > > OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but > > it's > > already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,

Re: [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page

2020-11-19 Thread  
On Thu, Nov 19, 2020 at 11:57:16AM +0100, Oscar Salvador wrote: > get_hwpoison_page already drains pcplists, previously disabling > them when trying to grab a refcount. > We do not need shake_page to take care of it anymore. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount

2020-11-19 Thread  
On Thu, Nov 19, 2020 at 11:57:15AM +0100, Oscar Salvador wrote: > Currently, we have a sort of retry mechanism to make sure pages in > pcp-lists are spilled to the buddy system, so we can handle those. > > We can save us this extra checks with the new disable-pcplist mechanism > that is available
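
The "disable-pcplist mechanism" referred to is the zone_pcp_disable()/zone_pcp_enable() pair; a sketch of how the refcount grab gets bracketed (simplified from the description):

    zone_pcp_disable(page_zone(page));   /* spill pcplists into the buddy system */
    ret = get_hwpoison_page(page);       /* page is now visible to buddy checks */
    zone_pcp_enable(page_zone(page));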

Re: [PATCH 1/7] mm,hwpoison: Refactor get_any_page

2020-11-19 Thread  
On Thu, Nov 19, 2020 at 11:57:10AM +0100, Oscar Salvador wrote: > When we want to grab a refcount via get_any_page, we call > __get_any_page that calls get_hwpoison_page to get the > actual refcount. > get_any_page is only there because we have a sort of retry > mechanism in case the page we met

Re: [PATCH 2/7] mm,hwpoison: Drop pfn parameter

2020-11-19 Thread  
On Thu, Nov 19, 2020 at 11:57:11AM +0100, Oscar Salvador wrote: > pfn parameter is no longer needed, drop it. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [PATCH] hugetlbfs: fix anon huge page migration race

2020-11-12 Thread  
On Thu, Nov 05, 2020 at 11:50:58AM -0800, Mike Kravetz wrote: > Qian Cai reported the following BUG in [1] > > [ 6147.019063][T45242] LTP: starting move_pages12 > [ 6147.475680][T64921] BUG: unable to handle page fault for address: > ffe0 > ... > [ 6147.525866][T64921] RIP:

Re: [RFC PATCH 2/3] hugetlbfs: introduce hinode_rwsem for pmd sharing synchronization

2020-10-15 Thread  
On Tue, Oct 13, 2020 at 04:10:59PM -0700, Mike Kravetz wrote: > Due to pmd sharing, the huge PTE pointer returned by huge_pte_alloc > may not be valid. This can happen if a call to huge_pmd_unshare for > the same pmd is made in another thread. > > To address this issue, add a rw_semaphore

Re: [PATCH v5 4/4] mm,hwpoison: drop unneeded pcplist draining

2020-10-13 Thread  
On Tue, Oct 13, 2020 at 04:44:47PM +0200, Oscar Salvador wrote: > memory_failure and soft_offline_path paths now drain pcplists by calling > get_hwpoison_page. > > memory_failure flags the page as HWPoison beforehand, so that page can > no longer go into a pcplist, and soft_offline_page only flags a

Re: [PATCH v5 3/4] mm,hwpoison: take free pages off the buddy freelists for hugetlb

2020-10-13 Thread  
On Tue, Oct 13, 2020 at 04:44:46PM +0200, Oscar Salvador wrote: > Currently, free hugetlb pages get dissolved, but we also need to make sure > to take the poisoned subpage off the buddy freelists, so no one stumbles > upon it (see the previous patch for more information). > > Signed-off-by: Oscar Salvador

Re: [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists

2020-09-24 Thread  
On Thu, Sep 17, 2020 at 10:10:43AM +0200, Oscar Salvador wrote: > The crux of the matter is that historically we left poisoned pages in the > buddy system because we have some checks in place when allocating a page > that act as a gatekeeper for poisoned pages. Unfortunately, we do have other > users

Re: [PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page

2020-09-24 Thread  
On Thu, Sep 17, 2020 at 10:10:47AM +0200, Oscar Salvador wrote: > A page with 0-refcount and !PageBuddy could perfectly well be a pcppage. > Currently, we bail out with an error if we encounter such a page, meaning > that we do not handle pcppages from either the hard-offline or the > soft-offline path.

Re: [PATCH v7 14/14] mm,hwpoison: Try to narrow window race for free pages

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:50PM +0200, Oscar Salvador wrote: > Aristeu Rozanski reported that a customer test case started > to report -EBUSY after the hwpoison rework patchset. > > There is a race window between spotting a free page and taking it off > its buddy freelist, so it might be that

Re: [PATCH v7 11/14] mm,hwpoison: return 0 if the page is already poisoned in soft-offline

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:47PM +0200, Oscar Salvador wrote: > Currently, there is an inconsistency when calling soft-offline from > different paths on a page that is already poisoned. > > 1) madvise: > > madvise_inject_error skips any poisoned page and continues > the loop. >

Re: [PATCH v7 10/14] mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:46PM +0200, Oscar Salvador wrote: > Merging soft_offline_huge_page and __soft_offline_page lets us get rid of > quite some duplicated code, and makes the code much easier to follow. > > Now, __soft_offline_page will handle both normal and hugetlb pages. > >

Re: [PATCH v7 09/14] mm,hwpoison: rework soft offline for in-use pages

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:45PM +0200, Oscar Salvador wrote: > This patch changes the way we set and handle in-use poisoned pages. Until > now, poisoned pages were released to the buddy allocator, trusting that > the checks that take place at allocation time would act as a safe net > and would

Re: [PATCH v7 08/14] mm,hwpoison: rework soft offline for free pages

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:44PM +0200, Oscar Salvador wrote: > When trying to soft-offline a free page, we need to first take it off the > buddy allocator. > Once we know it is out of reach, we can safely flag it as poisoned. > > take_page_off_buddy will be used to take a page meant to be
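
The ordering is the point: take the page off the freelists first, and only flag it once it is unreachable; a sketch:

    /* mark the page poisoned only once it can no longer be allocated */
    if (take_page_off_buddy(page))
            SetPageHWPoison(page);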

Re: [PATCH v7 07/14] mm,hwpoison: unify THP handling for hard and soft offline

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:43PM +0200, Oscar Salvador wrote: > Place the THP's page handling in a helper and use it from both hard and > soft-offline machinery, so we get rid of some duplicated code. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [PATCH v7 06/14] mm,hwpoison: kill put_hwpoison_page

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:42PM +0200, Oscar Salvador wrote: > After commit 4e41a30c6d50 ("mm: hwpoison: adjust for new thp > refcounting"), put_hwpoison_page got reduced to a put_page. Let us just > use put_page instead. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [PATCH v7 04/14] mm,hwpoison: unexport get_hwpoison_page and make it static

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:40PM +0200, Oscar Salvador wrote: > Since get_hwpoison_page is only used in memory-failure code now, let us > un-export it and make it private to that code. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [PATCH v7 05/14] mm,hwpoison: refactor madvise_inject_error

2020-09-23 Thread  
On Tue, Sep 22, 2020 at 03:56:41PM +0200, Oscar Salvador wrote: > Make a proper if-else condition for {hard,soft}-offline. > > [akpm: refactor comment] > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups

2020-09-17 Thread  
On Thu, Sep 17, 2020 at 03:40:06PM +0200, Oscar Salvador wrote: > On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote: > > static bool page_handle_poison(struct page *page, bool > > hugepage_or_freepage, bool release) > > { > > if (release) { > > put_page(page);

Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups

2020-09-17 Thread  
On Thu, Sep 17, 2020 at 10:10:42AM +0200, Oscar Salvador wrote: > This patchset includes some fixups (patch#1, patch#2 and patch#3) > and some cleanups (patch#4-7). > > Patch#1 is a fix to take HWPoison pages off a buddy freelist, since > leaving them there can lead us to having HWPoison pages back in the game

Re: [PATCH v3 0/5] HWpoison: further fixes and cleanups

2020-09-16 Thread  
On Tue, Sep 15, 2020 at 05:22:22PM -0400, Aristeu Rozanski wrote: > Hi Oscar, Naoya, > > On Mon, Sep 14, 2020 at 12:15:54PM +0200, Oscar Salvador wrote: > > The important bit of this patchset is patch#1, which is a fix to take > > HWPoison pages off a buddy freelist, since leaving them there can lead us to

Re: [PATCH v2 1/5] mm,hwpoison: Take free pages off the buddy freelists

2020-09-10 Thread  
On Tue, Sep 08, 2020 at 09:56:22AM +0200, Oscar Salvador wrote: > The crux of the matter is that historically we left poisoned pages > in the buddy system because we have some checks in place when > allocating a page that act as a gatekeeper for poisoned pages. > Unfortunately, we do have other users

Re: [PATCH 2/4] mm,hwpoison: Refactor madvise_inject_error

2020-09-03 Thread  
On Wed, Sep 02, 2020 at 11:45:08AM +0200, Oscar Salvador wrote: > Make a proper if-else condition for {hard,soft}-offline. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [PATCH 1/4] mm,hwpoison: Take free pages off the buddy freelists

2020-09-03 Thread  
On Wed, Sep 02, 2020 at 11:45:07AM +0200, Oscar Salvador wrote: > The crux of the matter is that historically we left poisoned pages > in the buddy system because we have some checks in place when > allocating a page that act as a gatekeeper for poisoned pages. > Unfortunately, we do have other users

Re: [PATCH] mm/memory-failure: Fix wrong return value when page isolation fails

2020-08-30 Thread  
On Sun, Aug 30, 2020 at 03:44:18PM -0400, Qian Cai wrote: > On Sun, Aug 30, 2020 at 04:10:53PM +0800, Muchun Song wrote: > > When page isolation fails, we should not return 0, because we do not > > set HWPoison on any page. > > > > Signed-off-by: Muchun Song > > This seems to solve the

Re: [RFD PATCH] x86/mce: Make sure to send SIGBUS even after losing the race to poison a page

2020-08-28 Thread  
Hi, On Thu, Aug 27, 2020 at 09:32:05AM -0700, Tony Luck wrote: > For discussion ... I'm 100% sure the patch below is the wrong way to > fix this ... for one thing it doesn't provide the virtual address of > the error to the user signal handler. For another it just looks like > a hack. I'm just

Re: [PATCH] mm/memory-failure: do pgoff calculation before for_each_process()

2020-08-18 Thread  
On Tue, Aug 18, 2020 at 04:26:47PM +0800, Xianting Tian wrote: > There is no need to calculate pgoff in each iteration of for_each_process(), > so move it to before for_each_process(), which can save some > CPU cycles. > > Signed-off-by: Xianting Tian Looks good to me. Thank you.
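
The change is a simple loop-invariant hoist; a sketch (helper and callee names may differ slightly across kernel versions):

    pgoff_t pgoff = page_to_pgoff(page); /* invariant: compute once */

    for_each_process(tsk) {
            /* ... pass the precomputed pgoff to add_to_kill() ... */
    }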

Re: [PATCH v6 00/12] HWPOISON: soft offline rework

2020-08-10 Thread  
On Mon, Aug 10, 2020 at 11:45:36PM -0400, Qian Cai wrote: > > > > On Aug 10, 2020, at 11:11 PM, HORIGUCHI NAOYA(堀口 直也) > > wrote: > > > > I'm still not sure why the test succeeded by reverting these because > > current mainline kernel provides similar me

Re: [PATCH v6 00/12] HWPOISON: soft offline rework

2020-08-10 Thread  
On Mon, Aug 10, 2020 at 11:22:55AM -0400, Qian Cai wrote: > On Thu, Aug 06, 2020 at 06:49:11PM +, nao.horigu...@gmail.com wrote: > > Hi, > > > > This patchset is the latest version of the soft offline rework patchset > > targeted for v5.9. > > > > Since v5, I dropped some patches which tweak

Re: [PATCH v5 00/16] HWPOISON: soft offline rework

2020-08-05 Thread  
On Mon, Aug 03, 2020 at 09:49:42PM -0400, Qian Cai wrote: > On Tue, Aug 04, 2020 at 01:16:45AM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote: > > > On Fri, Jul 31, 2020 at 12:20:56PM +, nao.horigu...@gmail.com wrote: >

Re: [PATCH v5 00/16] HWPOISON: soft offline rework

2020-08-05 Thread  
On Mon, Aug 03, 2020 at 11:19:09AM -0400, Qian Cai wrote: > On Mon, Aug 03, 2020 at 01:36:58PM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > Hello, > > > > On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote: > > > On Fri, Jul 31, 2020 at 12:20:56PM +,

Re: [PATCH v5 00/16] HWPOISON: soft offline rework

2020-08-03 Thread  
On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote: > On Fri, Jul 31, 2020 at 12:20:56PM +, nao.horigu...@gmail.com wrote: > > This patchset is the latest version of the soft offline rework patchset > > targeted for v5.9. > > > > Main focus of this series is to stabilize soft offline.

Re: [PATCH v5 00/16] HWPOISON: soft offline rework

2020-08-03 Thread  
Hello, On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote: > On Fri, Jul 31, 2020 at 12:20:56PM +, nao.horigu...@gmail.com wrote: > > This patchset is the latest version of the soft offline rework patchset > > targeted for v5.9. > > > > Main focus of this series is to stabilize soft

Re: [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages

2020-07-17 Thread  
On Thu, Jul 16, 2020 at 02:38:06PM +0200, Oscar Salvador wrote: > This patch changes the way we set and handle in-use poisoned pages. > Until now, poisoned pages were released to the buddy allocator, trusting > that the checks that take place prior to delivering the page to its end > user would act

Re: [PATCH v3 00/15] HWPOISON: soft offline rework

2020-07-01 Thread  
On Wed, Jul 01, 2020 at 10:22:07AM +0200, Oscar Salvador wrote: > On Tue, 2020-06-30 at 08:35 +0200, Oscar Salvador wrote: > > > Even after applied the compling fix, > > > > > > https://lore.kernel.org/linux-mm/20200628065409.GA546944@u2004/ > > > > > > madvise(MADV_SOFT_OFFLINE) will fail with

Re: [PATCH v3 00/15] HWPOISON: soft offline rework

2020-06-30 Thread  
On Mon, Jun 29, 2020 at 12:29:25PM +0200, Oscar Salvador wrote: > On Wed, 2020-06-24 at 15:01 +, nao.horigu...@gmail.com wrote: > > I rebased soft-offline rework patchset [1][2] onto the latest > > mmotm. The > > rebasing required some non-trivial changes to adjust, but mainly that > > was >

Re: [PATCH v3 13/15] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page

2020-06-29 Thread  
On Mon, Jun 29, 2020 at 05:22:40PM +1000, Stephen Rothwell wrote: > Hi Naoya, > > On Sun, 28 Jun 2020 15:54:09 +0900 Naoya Horiguchi > wrote: > > > > Andrew, could you append this diff on top of this series on mmotm? > > I have added that patch to linux-next today. Thank you! - Naoya

Re: [PATCH v3 00/15] HWPOISON: soft offline rework

2020-06-24 Thread  
On Wed, Jun 24, 2020 at 03:49:47PM -0700, Andrew Morton wrote: > On Wed, 24 Jun 2020 22:36:18 + HORIGUCHI NAOYA(堀口 直也) > wrote: > > > On Wed, Jun 24, 2020 at 12:17:42PM -0700, Andrew Morton wrote: > > > On Wed, 24 Jun 2020 15:01:22 + nao.horigu...@gmail.com

Re: [PATCH v3 00/15] HWPOISON: soft offline rework

2020-06-24 Thread  
On Wed, Jun 24, 2020 at 12:17:42PM -0700, Andrew Morton wrote: > On Wed, 24 Jun 2020 15:01:22 + nao.horigu...@gmail.com wrote: > > > I rebased soft-offline rework patchset [1][2] onto the latest mmotm. The > > rebasing required some non-trivial changes to adjust, but mainly that was > >

Re: [RFC PATCH v2 00/16] Hwpoison rework {hard,soft}-offline

2020-06-15 Thread  
Hi Dmitry, On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote: > Hello! > > We are faced with similar problems with hwpoisoned pages > on one of our production clusters after a kernel update to stable 4.19. > An application that does a lot of memory allocations sometimes catches SIGBUS >

Re: [PATCH V2] mm, memory_failure: don't send BUS_MCEERR_AO for action required error

2020-05-31 Thread  
On Sat, May 30, 2020 at 09:08:43AM +0200, Pankaj Gupta wrote: > > Some processes don't want to be killed early, but in the "Action Required" > > case, they may also be killed by BUS_MCEERR_AO when sharing memory > > with others that are accessing the failed memory. > > And sending SIGBUS with

Re: [PATCH] mm, memory_failure: only send BUS_MCEERR_AO to early-kill process

2020-05-29 Thread  
On Fri, May 29, 2020 at 01:56:25PM +0800, wetp wrote: > On 2020/5/29 10:12 AM, HORIGUCHI NAOYA(堀口 直也) wrote: ... > > > > > @@ -225,8 +225,9 @@ static int kill_proc(struct to_kill *tk, unsigned > > > > > long pfn, int flags) > > > > >

Re: [PATCH] mm, memory_failure: only send BUS_MCEERR_AO to early-kill process

2020-05-28 Thread  
On Thu, May 28, 2020 at 02:50:09PM +0800, wetp wrote: > > On 2020/5/28 10:22 AM, HORIGUCHI NAOYA(堀口 直也) wrote: > > Hi Zhang, > > > > Sorry for my late response. > > > > On Tue, May 26, 2020 at 03:06:41PM +0800, Wetp Zhang wrote: > > > From: Zhang Y

Re: [PATCH] mm, memory_failure: only send BUS_MCEERR_AO to early-kill process

2020-05-27 Thread  
Hi Zhang, Sorry for my late response. On Tue, May 26, 2020 at 03:06:41PM +0800, Wetp Zhang wrote: > From: Zhang Yi > > If a process doesn't need early-kill, it may not care about BUS_MCEERR_AO. > Let the process be killed when it really accesses the corrupted memory. > > Signed-off-by: Zhang Yi

Re: [RFC V2] mm/vmstat: Add events for PMD based THP migration without split

2020-05-20 Thread  
On Mon, May 18, 2020 at 12:12:36PM +0530, Anshuman Khandual wrote: > This adds the following two new VM events which will help in validating PMD > based THP migration without split. Statistics reported through these events > will help in performance debugging. > > 1. THP_PMD_MIGRATION_SUCCESS >
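
A sketch of what such accounting looks like at the migration site (THP_PMD_MIGRATION_SUCCESS is named in the snippet; the failure-side event name is an assumption, and the placement is illustrative):

    if (ret == MIGRATEPAGE_SUCCESS)
            count_vm_event(THP_PMD_MIGRATION_SUCCESS);
    else
            count_vm_event(THP_PMD_MIGRATION_FAILURE); /* assumed counterpart */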

Re: memory offline infinite loop after soft offline

2020-05-14 Thread  
On Thu, May 14, 2020 at 10:46:33PM -0400, Qian Cai wrote: > > > > On Oct 20, 2019, at 11:16 PM, Naoya Horiguchi > > wrote: > > > > On Fri, Oct 18, 2019 at 07:56:09AM -0400, Qian Cai wrote: > >> > >> > >>On Oct 18, 2019, at 2:35 AM, Naoya Horiguchi > >>wrote: > >> > >> > >>