[PATCH v3 2/3] mm,hwpoison: return -EHWPOISON when page already

2021-04-20 Thread Naoya Horiguchi
ich care about the return value of memory_failure() should check why they want to process a memory error which has already been processed. This behavior seems reasonable. Signed-off-by: Aili Yao Signed-off-by: Naoya Horiguchi --- mm/memory-failure.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(

[PATCH v3 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-20 Thread Naoya Horiguchi
From: Naoya Horiguchi The previous patch solves the infinite MCE loop issue when multiple MCE events race. The remaining issue is to make sure that all threads processing Action Required MCEs send SIGBUS to the current processes with the proper virtual address and the error size

[PATCH v3 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-20 Thread Naoya Horiguchi
. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. Signed-off-by: Tony Luck Signed-off
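The locking idea in this patch, combined with the -EHWPOISON return from patch 2/3 of the same series, can be sketched in user space as follows. This is only a toy model under stated assumptions, not the kernel code: `toy_memory_failure()`, `mf_mutex`, `poisoned[]` and `NR_PAGES` are hypothetical names, the "page state" is a plain boolean array, and returning -ENXIO for an out-of-range pfn is a choice made for this sketch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* Linux errno value; fallback for other libcs */
#endif

#define NR_PAGES 1024UL /* hypothetical size of the toy "memory" */

static pthread_mutex_t mf_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool poisoned[NR_PAGES]; /* stand-in for the per-page poison flag */

/* Toy stand-in for memory_failure(); NOT the kernel implementation. */
static int toy_memory_failure(unsigned long pfn)
{
	int ret = 0;

	if (pfn >= NR_PAGES)
		return -ENXIO;

	/* Serialize handlers so nobody sees "success" while work is ongoing. */
	pthread_mutex_lock(&mf_mutex);
	if (poisoned[pfn]) {
		/* Already handled: tell the caller instead of returning 0. */
		ret = -EHWPOISON;
		goto unlock;
	}
	poisoned[pfn] = true;
	/* ... the slow recovery work would run here, still under the mutex ... */
unlock:
	pthread_mutex_unlock(&mf_mutex);
	return ret;
}
```

In this model the first caller for a given pfn gets 0 only after the work under the mutex has finished, and any later caller for the same pfn gets -EHWPOISON, so it can tell "already processed" apart from "handled just now".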

[PATCH v3 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

2021-04-20 Thread Naoya Horiguchi
local variables and calculation logic of error virtual address, v1: https://lore.kernel.org/linux-mm/20210412224320.1747638-1-nao.horigu...@gmail.com/ v2 (only 3/3 is posted): https://lore.kernel.org/linux-mm/20210419023658.GA1962954@u2004/ Thanks, Naoya Horiguchi --- quote from cover letter of

[PATCH v2 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-18 Thread Naoya Horiguchi
eed "#ifdef CONFIG_TRANSPARENT_HUGEPAGE" to use it. I found that the #ifdef is not necessary because the whole "if (ptl)" is compiled out. So I don't add #ifdef. Here's the v2 of 3/3. Aili, could you test with it? Thanks, Naoya Horiguchi - From: Naoya Horiguchi Date:

[PATCH v1 2/3] mm,hwpoison: return -EHWPOISON when page already

2021-04-12 Thread Naoya Horiguchi
ich care about the return value of memory_failure() should check why they want to process a memory error which has already been processed. This behavior seems reasonable. Signed-off-by: Aili Yao Signed-off-by: Naoya Horiguchi --- mm/memory-failure.c | 4 ++-- 1 file changed, 2 insertions(+),

[PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address

2021-04-12 Thread Naoya Horiguchi
From: Naoya Horiguchi The previous patch solves the infinite MCE loop issue when multiple MCE events race. The remaining issue is to make sure that all threads processing Action Required MCEs send SIGBUS to the current processes with the proper virtual address and the error size

[PATCH v1 1/3] mm/memory-failure: Use a mutex to avoid memory_failure() races

2021-04-12 Thread Naoya Horiguchi
. But while all that work is going on, other CPUs see a "success" return code from memory_failure() and so they believe the error has been handled and continue executing. Fix by wrapping most of the internal parts of memory_failure() in a mutex. Signed-off-by: Tony Luck Signed-off

[PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

2021-04-12 Thread Naoya Horiguchi
that this is not a perfect solution, but should work for some typical cases. My simple testing showed this patchset seems to work as intended, but if you have the related testcases, could you please test and let me have some feedback? Thanks, Naoya Horiguchi [1]: https://lore.kernel.org/linux-mm

[PATCH v2] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers

2021-03-05 Thread Naoya Horiguchi
Hello Oscar, On Fri, Mar 05, 2021 at 08:26:58AM +0100, Oscar Salvador wrote: > On Thu, Mar 04, 2021 at 03:44:37PM +0900, Naoya Horiguchi wrote: > > From: Naoya Horiguchi > > Hi Naoya, > > good catch! > > > Currently me_huge_page() temporarily unlocks page to perfo

[PATCH v1] mm, hwpoison: do not lock page again when me_huge_page() successfully recovers

2021-03-03 Thread Naoya Horiguchi
From: Naoya Horiguchi Currently me_huge_page() temporarily unlocks the page to perform some actions, then locks it again later. My testcase (which calls hard-offline on some tail page in a hugetlb, then accesses the address of the hugetlb range) showed that page allocation code detects the page lock

Re: [PATCH v1] mm, hwpoison: enable error handling on shmem thp

2021-02-09 Thread Naoya Horiguchi
On Tue, Feb 09, 2021 at 11:46:40AM -0800, Andrew Morton wrote: > On Tue, 9 Feb 2021 15:21:28 +0900 Naoya Horiguchi > wrote: > > > Currently hwpoison code checks PageAnon() for thp and refuses to handle > > errors on non-anonymous thps (just for historical reasons). W

[PATCH v1] mm, hwpoison: enable error handling on shmem thp

2021-02-08 Thread Naoya Horiguchi
From: Naoya Horiguchi Currently hwpoison code checks PageAnon() for thp and refuses to handle errors on non-anonymous thps (just for historical reasons). We now support non-anonymous thps like shmem ones, so this patch proposes enabling error handling on shmem thps. Fortunately, we already have

Re: [PATCH v6 00/12] HWPOISON: soft offline rework

2020-08-11 Thread Naoya Horiguchi
On Tue, Aug 11, 2020 at 01:39:24PM -0400, Qian Cai wrote: > On Tue, Aug 11, 2020 at 03:11:40AM +, HORIGUCHI NAOYA(堀口 直也) wrote: > > I'm still not sure why the test succeeded by reverting these because > > current mainline kernel provides similar mechanism to prevent reuse of > > soft offlined

Re: [PATCH v3 13/15] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page

2020-06-28 Thread Naoya Horiguchi
d and sent into free list to make take_page_off_buddy() work properly. > > > > Signed-off-by: Oscar Salvador > > Signed-off-by: Naoya Horiguchi > > --- > > ChangeLog v2 -> v3: > > - use page_is_file_lru() instead of page_is_file_cache(), > > - add de

Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-22 Thread Naoya Horiguchi
hat their processes are killed by corrected (so non-urgent) errors. Thanks, Naoya Horiguchi

Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-22 Thread Naoya Horiguchi
> else > memory_failure(entry.pfn, entry.flags); > } > } > > AFAICS, for hard-offline case, a recovered event would be if: > > - the page to shut down is already free > - the page was unmapped > > In some cases we need to kill the process if it holds dirty pages. One caveat is that even if the process maps dirty error pages, we don't have to kill it unless the error data is consumed. Thanks, Naoya Horiguchi

[PATCH 17/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP

2019-10-21 Thread Naoya Horiguchi
On Mon, Oct 21, 2019 at 07:04:40AM +, Naoya Horiguchi wrote: > On Thu, Oct 17, 2019 at 04:21:16PM +0200, Oscar Salvador wrote: > > Place the THP's page handling in a helper and use it > > from both hard and soft-offline machinery, so we get rid > > of some duplicated code

Re: [RFC PATCH v2 14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline

2019-10-21 Thread Naoya Horiguchi
Please, note that this represents an user-api change, since now the return > error when calling soft_offline_page_store()->soft_offline_page() will be > different. > > Signed-off-by: Oscar Salvador Looks good to me. Acked-by: Naoya Horiguchi > --- > mm/madvise.c| 3

Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free pages

2019-10-21 Thread Naoya Horiguchi
buddy_pages(zone, page_head, page, 0, > + buddy_order, area, migratetype); > + ret = true; > + break; indent with whitespace? And you can find a few more coding style warnings with checkpatch.pl. BTW, if we cons

Re: [RFC PATCH v2 06/16] mm,hwpoison: Kill put_hwpoison_page

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:13PM +0200, Oscar Salvador wrote: > After ("4e41a30c6d50: mm: hwpoison: adjust for new thp refcounting"), > put_hwpoison_page got reduced to a put_page. > Let us just use put_page instead. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [RFC PATCH v2 05/16] mm,hwpoison: Un-export get_hwpoison_page and make it static

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:12PM +0200, Oscar Salvador wrote: > Since get_hwpoison_page is only used in memory-failure code now, > let us un-export it and make it private to that code. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [RFC PATCH v2 09/16] mm,hwpoison: Unify THP handling for hard and soft offline

2019-10-21 Thread Naoya Horiguchi
age(p, "Memory Failure") < 0) > return -EBUSY; Although this is not a cleanup thing, this failure path means that hwpoison is handled (PG_hwpoison is marked), so action_result() should be called. I'll add a patch for this later. Anyway, this cleanup patch lo

Re: [RFC PATCH v2 03/16] mm,madvise: Refactor madvise_inject_error

2019-10-21 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 04:21:10PM +0200, Oscar Salvador wrote: > Make a proper if-else condition for {hard,soft}-offline. > > Signed-off-by: Oscar Salvador Acked-by: Naoya Horiguchi

Re: [RFC PATCH v2 02/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2019-10-21 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 01:52:27PM +0200, Michal Hocko wrote: > On Thu 17-10-19 16:21:09, Oscar Salvador wrote: > > From: Naoya Horiguchi > > > > The call to get_user_pages_fast is only to get the pointer to a struct > > page of a given address, pinning it is mem

Re: [RFC PATCH v2 01/16] mm,hwpoison: cleanup unused PageHuge() check

2019-10-21 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 01:48:32PM +0200, Michal Hocko wrote: > On Thu 17-10-19 16:21:08, Oscar Salvador wrote: > > From: Naoya Horiguchi > > > > Drop the PageHuge check since memory_failure forks into > > memory_failure_hugetlb() > > for hugetlb pages. >

Re: memory offline infinite loop after soft offline

2019-10-20 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 07:56:09AM -0400, Qian Cai wrote: > > > On Oct 18, 2019, at 2:35 AM, Naoya Horiguchi > wrote: > > > You're right, then I don't see how this happens. If the error hugepage was > isolated without having PG_hwpoison set, it's un

Re: memory offline infinite loop after soft offline

2019-10-18 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 09:33:10AM +0200, Michal Hocko wrote: > On Fri 18-10-19 06:32:22, Naoya Horiguchi wrote: > > On Fri, Oct 18, 2019 at 08:06:35AM +0200, Michal Hocko wrote: > > > On Fri 18-10-19 02:19:06, Naoya Horiguchi wrote: > > > > On Thu, Oct 17, 2019 at

Re: memory offline infinite loop after soft offline

2019-10-18 Thread Naoya Horiguchi
On Fri, Oct 18, 2019 at 08:06:35AM +0200, Michal Hocko wrote: > On Fri 18-10-19 02:19:06, Naoya Horiguchi wrote: > > On Thu, Oct 17, 2019 at 08:27:59PM +0200, Michal Hocko wrote: > > > On Thu 17-10-19 14:07:13, Qian Cai wrote: > > > > On Thu, 2019-10-17 at 1

Re: memory offline infinite loop after soft offline

2019-10-18 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 08:27:59PM +0200, Michal Hocko wrote: > On Thu 17-10-19 14:07:13, Qian Cai wrote: > > On Thu, 2019-10-17 at 12:01 +0200, Michal Hocko wrote: > > > On Thu 17-10-19 09:34:10, Naoya Horiguchi wrote: > > > > On Mon, Oct 14, 2019 at 10:39:

Re: memory offline infinite loop after soft offline

2019-10-17 Thread Naoya Horiguchi
/* A HWPoisoned page cannot be also PageBuddy */ > pfn++; > else This fix looks good to me. The original code only addresses hwpoisoned 4kB pages; we seem to have had this issue since the following commit, commit b023f46813cde6e3b8a8c24f432ff9c1fd8e9a64 Author: Wen Congyang Date: Tue Dec 11 16:00:45 2012 -0800 memory-hotplug: skip HWPoisoned page when offlining pages and the extension of LTP coverage finally discovered this. Thanks, Naoya Horiguchi

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-17 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 10:03:21AM +0200, Oscar Salvador wrote: > On Thu, Oct 17, 2019 at 07:50:18AM +0000, Naoya Horiguchi wrote: > > Actually I guess that !pfn_valid() never happens when called from > > madvise_inject_error(), because madvise_inject_error() gets pfn via > >

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-17 Thread Naoya Horiguchi
On Thu, Oct 17, 2019 at 09:16:42AM +0200, David Hildenbrand wrote: > On 17.10.19 01:47, Naoya Horiguchi wrote: > > On Wed, Oct 16, 2019 at 10:57:57AM +0200, David Hildenbrand wrote: > > > On 16.10.19 10:54, Naoya Horiguchi wrote: > > > > On Wed, Oct 16, 2019 at 10:34

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
On Wed, Oct 16, 2019 at 10:57:57AM +0200, David Hildenbrand wrote: > On 16.10.19 10:54, Naoya Horiguchi wrote: > >On Wed, Oct 16, 2019 at 10:34:52AM +0200, David Hildenbrand wrote: > >>On 16.10.19 10:27, Naoya Horiguchi wrote: > >>>On Wed, Oct 16, 2019 at 09:56:19AM

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
On Wed, Oct 16, 2019 at 10:34:52AM +0200, David Hildenbrand wrote: > On 16.10.19 10:27, Naoya Horiguchi wrote: > > On Wed, Oct 16, 2019 at 09:56:19AM +0200, David Hildenbrand wrote: > > > On 16.10.19 09:09, Naoya Horiguchi wrote: > > > > Hi, > > > > >

Re: [PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
On Wed, Oct 16, 2019 at 09:56:19AM +0200, David Hildenbrand wrote: > On 16.10.19 09:09, Naoya Horiguchi wrote: > > Hi, > > > > I wrote a simple cleanup for parameter of soft_offline_page(), > > based on thread https://lkml.org/lkml/2019/10/11/57. > > >

[PATCH] mm, soft-offline: convert parameter to pfn

2019-10-16 Thread Naoya Horiguchi
as a separate one now. Thanks, Naoya Horiguchi --- From: Naoya Horiguchi Date: Wed, 16 Oct 2019 15:49:00 +0900 Subject: [PATCH] mm, soft-offline: convert parameter to pfn Currently soft_offline_page() receives struct page, and its sibling memory_failure() receives pfn. This discrepancy looks weird and makes

Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

2019-10-11 Thread Naoya Horiguchi
On Thu, Oct 10, 2019 at 09:17:42AM +0200, David Hildenbrand wrote: > On 10.10.19 02:26, Naoya Horiguchi wrote: > > On Wed, Oct 09, 2019 at 04:24:35PM +0200, David Hildenbrand wrote: > >> We should check for pfn_to_online_page() to not access uninitialized > >> memmap

Re: [PATCH v1] drivers/base/memory.c: Don't access uninitialized memmaps in soft_offline_page_store()

2019-10-11 Thread Naoya Horiguchi
/base/memory.c > @@ -540,6 +540,9 @@ static ssize_t soft_offline_page_store(struct device *dev, > pfn >>= PAGE_SHIFT; > if (!pfn_valid(pfn)) > return -ENXIO; > + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */ > +

Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

2019-10-11 Thread Naoya Horiguchi
>>>> On Wed 09-10-19 16:24:35, David Hildenbrand wrote: > >>>>> We should check for pfn_to_online_page() to not access uninitialized > >>>>> memmaps. Reshuffle the code so we don't have to duplicate the error > >>>>> message. > >

Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c

2019-10-10 Thread Naoya Horiguchi
On Fri, Oct 11, 2019 at 12:11:25AM +, Horiguchi Naoya(堀口 直也) wrote: > On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote: > > On 09.10.19 11:57, Naoya Horiguchi wrote: > > > Hi David, > > > > > > On Wed, Oct 09, 2019 at 11:12

Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c

2019-10-10 Thread Naoya Horiguchi
On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote: > On 09.10.19 11:57, Naoya Horiguchi wrote: > > Hi David, > > > > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote: > >> There are various places where we access uninitialized

Re: [PATCH v4 0/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-10-09 Thread Naoya Horiguchi
On Wed, Oct 09, 2019 at 04:55:10PM -0700, Andrew Morton wrote: > On Tue, 8 Oct 2019 23:18:31 +0000 Naoya Horiguchi > wrote: > > > I think that this patchset is good enough and ready to be merged. > > Andrew, could you consider queuing this series into your tree? > >

Re: [PATCH v2 2/2] mm/memory-failure.c: Don't access uninitialized memmaps in memory_failure()

2019-10-09 Thread Naoya Horiguchi
On Wed, Oct 09, 2019 at 04:24:35PM +0200, David Hildenbrand wrote: > We should check for pfn_to_online_page() to not access uninitialized > memmaps. Reshuffle the code so we don't have to duplicate the error > message. > > Cc: Naoya Horiguchi > Cc: Andrew Morton > Cc: Micha

Re: [PATCH v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c

2019-10-09 Thread Naoya Horiguchi
> Let's keep the existing interfaces working with ZONE_DEVICE memory. We > can later come back and fix these rare races and eventually speed-up the > ZONE_DEVICE detection. Actually, Toshiki is writing code to refactor and optimize the pfn walking part, where we find the pfn ranges covered b

Re: [PATCH v4 0/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS issue

2019-10-08 Thread Naoya Horiguchi
Hi Jane, I think that this patchset is good enough and ready to be merged. Andrew, could you consider queuing this series into your tree? Thanks, Naoya Horiguchi On Tue, Oct 08, 2019 at 11:13:23AM -0700, Jane Chu wrote: > Hi, Naoya, > > What is the status of the patches? > Is ther

Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2019-09-11 Thread Naoya Horiguchi
On Wed, Sep 11, 2019 at 12:27:22PM +0200, David Hildenbrand wrote: > On 10.09.19 12:30, Oscar Salvador wrote: > > From: Naoya Horiguchi > > > > Currently madvise_inject_error() pins the target via get_user_pages_fast. > > The call to get_user_pages_fast is only

Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

2019-09-11 Thread Naoya Horiguchi
Hi David, On Wed, Sep 11, 2019 at 12:23:24PM +0200, David Hildenbrand wrote: > On 10.09.19 12:30, Oscar Salvador wrote: > > From: Naoya Horiguchi > > > > Currently madvise_inject_error() pins the target via get_user_pages_fast. > > The call to get_user_pages_fast is

Re: [PATCH 00/10] Hwpoison soft-offline rework

2019-09-11 Thread Naoya Horiguchi
On Wed, Sep 11, 2019 at 08:35:26AM +0200, osalva...@suse.de wrote: > On 2019-09-11 08:22, Naoya Horiguchi wrote: > > I found another panic ... > > Hi Naoya, > > Thanks for giving it a try. Are these testcases public? > I will definitely take a look and try to solve thes

Re: [PATCH 00/10] Hwpoison soft-offline rework

2019-09-11 Thread Naoya Horiguchi
pages: > > > >* Normal pages: > > > >1) Take the page off the buddy freelist > >2) Set PageHWPoison flag and set refcount to 1 > > > >* Hugetlb pages > > > >1) Try to allocate a new hugetlb page to the pool > >2) Take

Re: [PATCH 00/10] Hwpoison soft-offline rework

2019-09-10 Thread Naoya Horiguchi
ight of the taken approach before touching more > code. > > Thanks > > [1] > https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horigu...@ah.jp.nec.com/ > [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u > > Naoya Horiguchi (5):

Re: poisoned pages do not play well in the buddy allocator

2019-08-26 Thread Naoya Horiguchi
On Mon, Aug 26, 2019 at 12:41:50PM +0200, Oscar Salvador wrote: > Hi, > > When analyzing a problem reported by one of our customers, I stumbled upon > an issue > that originates from the fact that poisoned pages end up in the buddy allocator. > > Let me break down the steps that lead to the

Re: ##freemail## Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-08-20 Thread Naoya Horiguchi
On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote: > Cc Mel Gorman, Kirill, Dave Hansen, > On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi > wrote: > > > > On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote: > > > On 5/28/19 2:49 AM, Wanp

Re: [PATCH] hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS

2019-08-07 Thread Naoya Horiguchi
the same check that must be made further > in the code even if page allocation is successful. > > Reported-by: Li Wang > Fixes: 290408d4a250 ("hugetlb: hugepage migration core") > Signed-off-by: Mike Kravetz > Tested-by: Li Wang Thanks for the work and nice des

Re: [PATCH v3 1/2] mm/memory-failure.c clean up around tk pre-allocation

2019-08-01 Thread Naoya Horiguchi
*tk argument. > > Signed-off-by: Jane Chu Acked-by: Naoya Horiguchi # somehow I sent 2 acks to 2/2, sorry about the noise.

Re: [PATCH v3 2/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-08-01 Thread Naoya Horiguchi
to unmap corrupted page > => to deliver SIGKILL > Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory > corruption > => to deliver SIGBUS > > Signed-off-by: Jane Chu > Suggested-by: Naoya Horiguchi Thanks for the fix. Acked-by: Naoya Horiguchi

Re: [PATCH v3 2/2] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-08-01 Thread Naoya Horiguchi
to unmap corrupted page > => to deliver SIGKILL > Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory > corruption > => to deliver SIGBUS > > Signed-off-by: Jane Chu > Suggested-by: Naoya Horiguchi > --- > mm/memory-failure.c | 22 +

Re: [PATCH v2 1/1] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-24 Thread Naoya Horiguchi
to unmap corrupted page > => to deliver SIGKILL > Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory > corruption > => to deliver SIGBUS > > Signed-off-by: Jane Chu > Suggested-by: Naoya Horiguchi > --- > mm/memory-failure.c | 62 >

Re: [PATCH] mm/memory-failure: Poison read receives SIGKILL instead of SIGBUS if mmaped more than once

2019-07-24 Thread Naoya Horiguchi
tk->tsk->comm, tk->tsk->pid); do_send_sig_info(SIGKILL, SEND_SIG_PRIV, > > } > > + if (tk == *tkc) > > + *tkc = NULL; > > get_task_struct(tsk); > > tk->tsk = tsk; > > list_add_tail(&tk->nd, to_kill); > > > Concept and policy looks good to me, and I never did understand what > the mremap() case was trying to protect against. > > The patch is a bit difficult to read (not your fault) because of the > odd way that add_to_kill() expects the first 'tk' to be pre-allocated. > May I ask for a lead-in cleanup that moves all the allocation internal > to add_to_kill() and drops the **tk argument? I totally agree with this cleanup. Thanks for the comment. Thanks, Naoya Horiguchi

[PATCH v3 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-17 Thread Naoya Horiguchi
(MADV_SOFT_OFFLINE) may not offline the original page and will not return an error. It might lead us to misjudge the test result when set_hwpoison_free_buddy_page() actually fails. Signed-off-by: Naoya Horiguchi Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Cc: # v4.19+ --- Ch

[PATCH v3 0/2] fix return value issue of soft offlining hugepages

2019-06-17 Thread Naoya Horiguchi
Hi everyone, This is v3 of the fix for the return value issue of hugepage soft-offlining (v2: https://lkml.org/lkml/2019/6/10/156). It straightforwardly applies the feedback on v2. Thanks, Naoya Horiguchi

[PATCH v3 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-17 Thread Naoya Horiguchi
(), which are cleaned up together. Reported-by: Chen, Jerry T Tested-by: Chen, Jerry T Signed-off-by: Naoya Horiguchi Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Cc: # v4.19+ --- ChangeLog v2->v3: - add PageHuge check in dissolve_free_huge_page() outside hugetlb_

Re: [PATCH v2 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-12 Thread Naoya Horiguchi
On Tue, Jun 11, 2019 at 03:20:26PM +0530, Anshuman Khandual wrote: > > On 06/10/2019 01:48 PM, Naoya Horiguchi wrote: > > madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline > > for hugepages with overcommitting enabled. That was caused by the suboptimal >

Re: [PATCH v2 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-12 Thread Naoya Horiguchi
On Tue, Jun 11, 2019 at 10:16:03AM -0700, Mike Kravetz wrote: > On 6/10/19 1:18 AM, Naoya Horiguchi wrote: > > madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline > > for hugepages with overcommitting enabled. That was caused by the suboptimal > > code in

Re: [PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-12 Thread Naoya Horiguchi
On Tue, Jun 11, 2019 at 01:44:46PM +0530, Anshuman Khandual wrote: > > > On 06/11/2019 06:27 AM, Naoya Horiguchi wrote: > > On Mon, Jun 10, 2019 at 05:19:45PM -0700, Mike Kravetz wrote: > >> On 6/10/19 1:18 AM, Naoya Horiguchi wrote: > >>> The pass/f

Re: [PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-10 Thread Naoya Horiguchi
On Mon, Jun 10, 2019 at 05:19:45PM -0700, Mike Kravetz wrote: > On 6/10/19 1:18 AM, Naoya Horiguchi wrote: > > The pass/fail of soft offline should be judged by checking whether the > > raw error page was finally contained or not (i.e. the result of > > set_hwpoison_free_buddy

Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-06-10 Thread Naoya Horiguchi
B/4KB page granularity, also split the KVM MMU 1GB SPTE > > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus > > will be delivered to VM at page fault next time for the offensive > > SPTE. Is this proposal acceptable? > > I am not sure of the error handling design, but this does sound reasonable. I agree that that's better. > That block of code which potentially dissolves a huge page on memory error > is hard to understand and I'm not sure if that is even the 'normal' > functionality. Certainly, we would hate to waste/poison an entire 1G page > for an error on a small subsection. Yes, that's not practical, so we first need to establish the code base for 2MB hugetlb splitting and then extend it to 1GB. Thanks, Naoya Horiguchi

Re: [PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-10 Thread Naoya Horiguchi
On Mon, Jun 10, 2019 at 02:20:33PM -0700, Andrew Morton wrote: > On Mon, 10 Jun 2019 17:18:05 +0900 Naoya Horiguchi > wrote: > > > The pass/fail of soft offline should be judged by checking whether the > > raw error page was finally contained or

[PATCH v2 1/2] mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails

2019-06-10 Thread Naoya Horiguchi
The pass/fail of soft offline should be judged by checking whether the raw error page was finally contained or not (i.e. the result of set_hwpoison_free_buddy_page()), but the current code does not work like that. This patch fixes it. Signed-off-by: Naoya Horiguchi Fixes

[PATCH v2 2/2] mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge

2019-06-10 Thread Naoya Horiguchi
(), which are cleaned up together. Reported-by: Chen, Jerry T Tested-by: Chen, Jerry T Signed-off-by: Naoya Horiguchi Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Cc: # v4.19+ --- mm/hugetlb.c| 15 +-- mm/memory-failure.c | 5 + 2 files c

[PATCH v2 00/02] fix return value issue of soft offlining hugepages

2019-06-10 Thread Naoya Horiguchi
. Thanks, Naoya Horiguchi

Re: [PATCH v1] mm: hugetlb: soft-offline: fix wrong return value of soft offline

2019-05-29 Thread Naoya Horiguchi
Hi Mike, On Wed, May 29, 2019 at 11:44:50AM -0700, Mike Kravetz wrote: > On 5/26/19 11:06 PM, Naoya Horiguchi wrote: > > Soft offline events for hugetlb pages return -EBUSY when page migration > > succeeded and dissolve_free_huge_page() failed, which can happen when > > there

[PATCH v1] mm: hugetlb: soft-offline: fix wrong return value of soft offline

2019-05-27 Thread Naoya Horiguchi
ed.") This change affects other callers of dissolve_free_huge_page(), which are also cleaned up by this patch. Reported-by: Chen, Jerry T Signed-off-by: Naoya Horiguchi Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Cc: # v4.19+ --- mm/hugetlb.c| 15

Re: [PATCH v2] mm, memory-failure: clarify error message

2019-05-20 Thread Naoya Horiguchi
e to hardware memory corruption" > Slightly modify the error message to improve clarity. > > Signed-off-by: Jane Chu Acked-by: Naoya Horiguchi Thanks!

Re: [PATCH] mm, memory-failure: clarify error message

2019-05-20 Thread Naoya Horiguchi
) && t->mm == current->mm). That might need an additional if statement, which I'm not sure is worth doing. I think that the simplest fix for the reported problem (a confusing message) is like below: - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n", + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", pfn, t->comm, t->pid); Or, if we have a good reason to separate the message for MF_ACTION_REQUIRED and MF_ACTION_OPTIONAL, that might be OK. Thanks, Naoya Horiguchi
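To see what the reworded message would look like, the format string from the suggestion above can be exercised in user space. This is a rough sketch only: `format_mf_message()` is a hypothetical helper that uses snprintf() in place of the kernel's pr_err(), and the pfn, command name and pid are sample values taken from the log lines quoted elsewhere in this listing.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical user-space helper (not kernel code): formats the proposed
 * message into buf using the same conversion specifiers as the pr_err()
 * call quoted above. Returns what snprintf() returns.
 */
static int format_mf_message(char *buf, size_t len, unsigned long pfn,
			     const char *comm, int pid)
{
	return snprintf(buf, len,
			"Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption",
			pfn, comm, pid);
}
```

Calling `format_mf_message(buf, sizeof(buf), 0xedbe201UL, "read_poison", 22434)` with a 128-byte buffer fills it with "Memory failure: 0xedbe201: Sending SIGBUS to read_poison:22434 due to hardware memory corruption"; the `%#lx` conversion is what supplies the `0x` prefix on the pfn.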

[PATCH v2] tools/power: turbostat: make output buffer extensible (Re: [PATCH v1] tools/power: turbostat: fix buffer overrun)

2019-04-18 Thread Naoya Horiguchi
I updated the patch with a trivial fix. Could you review it? - Naoya From: Naoya Horiguchi Date: Fri, 19 Apr 2019 09:21:59 +0900 Subject: [PATCH v2] tools/power: turbostat: make output buffer extensible "turbostat --Dump" could be terminated by general protection fault on s

[PATCH] tools/power: turbostat: make output buffer extensible (Re: [PATCH v1] tools/power: turbostat: fix buffer overrun)

2019-04-03 Thread Naoya Horiguchi
Hi Prarit, On Wed, Apr 03, 2019 at 07:42:45AM -0400, Prarit Bhargava wrote: > > > On 4/3/19 3:02 AM, Naoya Horiguchi wrote: > > turbostat could be terminated by general protection fault on some latest > > hardwares which (for example) support 9 levels of C-states and show

[PATCH v1] tools/power: turbostat: fix buffer overrun

2019-04-03 Thread Naoya Horiguchi
so removes duplicated "pc10:" line to reduce buffer usage. Signed-off-by: Naoya Horiguchi --- tools/power/x86/turbostat/turbostat.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git v5.1-rc3-mmotm-2019-04-02-17-16/tools/power/x86/turbostat/turbostat.c v5.1-rc3-mmotm-2019

Re: [PATCH] mm/hugetlb: Get rid of NODEMASK_ALLOC

2019-04-02 Thread Naoya Horiguchi
> This reduces some code churn and complexity. > > Signed-off-by: Oscar Salvador nice cleanup. Reviewed-by: Naoya Horiguchi

Re: [PATCH v2 0/2] A couple hugetlbfs fixes

2019-04-02 Thread Naoya Horiguchi
> mm/userfaultfd.c| 3 +-- > 4 files changed, 26 insertions(+), 31 deletions(-) Both fixes look fine to me. Reviewed-by: Naoya Horiguchi

Re: [PATCH REBASED] hugetlbfs: fix potential over/underflow setting node specific nr_hugepages

2019-03-28 Thread Naoya Horiguchi
dle the case of not being able to allocate a node mask would likely > result in incorrect behavior. Luckily, it is very unlikely we will > ever take this path. If we do, simply return ENOMEM. > > Reported-by: Jing Xiangfeng > Signed-off-by: Mike Kravetz Looks good to me. Reviewed-b

Re: [Qestion] Hit a WARN_ON_ONCE in try_to_unmap_one when runing syzkaller

2019-03-14 Thread Naoya Horiguchi
... unmap_success = try_to_unmap(hpage, ttu); ... So either of the above "ttu |= TTU_IGNORE_HWPOISON" should be executed. I'm not sure which one, but both paths show printk messages, so if you could have kernel message log, that might help ... Thanks, Naoya Horiguchi

Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

2019-03-04 Thread Naoya Horiguchi
o that final else if if we can not allocate > a node mask (kmalloc a few words). Right? I wonder why we should > even try to continue in this case. Why not just return right there? Simply returning on allocation failure looks better to me. As you mentioned below, current behavior for this 'el

Re: [PATCH v4] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

2019-03-03 Thread Naoya Horiguchi
> in underflow. To fix, the calculation is moved to within the routine > set_max_huge_pages() where the lock is held. > > Reported-by: Jing Xiangfeng > Signed-off-by: Mike Kravetz > Tested-by: Jing Xiangfeng > Acked-by: David Rientjes Looks good to me with improved co

Re: [PATCH] mm: hwpoison: fix thp split handing in soft_offline_in_use_page()

2019-02-28 Thread Naoya Horiguchi
etic-only change. > > > > Please elaborate. > When soft_offline_in_use_page runs on a thp tail page after pmd is split, > and we pass the head page to split_huge_page, Unfortunately, the tail page > can be free or count turn into zero. I guess that you have the similar fix on mem

Re: [PATCH] huegtlbfs: fix races and page leaks during migration

2019-02-25 Thread Naoya Horiguchi
a/mm/hugetlb.c > +++ b/mm/hugetlb.c ... > @@ -3863,6 +3864,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > } > > spin_unlock(ptl); > + > + /* Make newly allocated pages active */ You already have a perfect explanation about why we need this "if", > ... We could have got the page from the pagecache, and it could > be that the page is !page_huge_active() because it has been isolated for > migration. so you could improve this comment with it. Anyway, I agree to what/how you try to fix. Reviewed-by: Naoya Horiguchi Thanks, Naoya Horiguchi

Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-11 Thread Naoya Horiguchi
On Mon, Feb 11, 2019 at 03:06:27PM -0800, Mike Kravetz wrote: > On 2/7/19 11:31 PM, Naoya Horiguchi wrote: > > On Thu, Feb 07, 2019 at 09:50:30PM -0800, Mike Kravetz wrote: > >> On 2/7/19 6:31 PM, Naoya Horiguchi wrote: > >>> On Thu, Feb 07, 2019 at 10:50:

Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-07 Thread Naoya Horiguchi
On Thu, Feb 07, 2019 at 09:50:30PM -0800, Mike Kravetz wrote: > On 2/7/19 6:31 PM, Naoya Horiguchi wrote: > > On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote: > >> On 1/30/19 1:14 PM, Mike Kravetz wrote: > >>> +++ b/fs/hugetlbfs/inode.c > >

Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-07 Thread Naoya Horiguchi
that do not use PagePrivate/PagePrivate2. * * Pages are locked upon entry and exit. */ int migrate_page(struct address_space *mapping, ... So this common logic assumes that page_private is not used, so why do we explicitly clear page_private in migrate_page_states()? buffer_migrate_page(), which is commonly used for the case when page_private is used, does that clearing outside migrate_page_states(). So I thought that hugetlbfs_migrate_page() could do in the similar manner. IOW, migrate_page_states() should not do anything on PagePrivate. But there're a few other .migratepage callbacks, and I'm not sure all of them are safe for the change, so this approach might not fit for a small fix. # BTW, there seems a typo in $SUBJECT. Thanks, Naoya Horiguchi

Re: [PATCH] mm: hwpoison: use do_send_sig_info() instead of force_sig() (Re: PMEM error-handling forces SIGKILL causes kernel panic)

2019-01-16 Thread Naoya Horiguchi
Hi Jane, On Wed, Jan 16, 2019 at 09:56:02AM -0800, Jane Chu wrote: > Hi, Naoya, > > On 1/16/2019 1:30 AM, Naoya Horiguchi wrote: > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 7c72f2a95785..831be5ff5f4d 100644 > --- a/mm/memory-failure.c

[PATCH] mm: hwpoison: use do_send_sig_info() instead of force_sig() (Re: PMEM error-handling forces SIGKILL causes kernel panic)

2019-01-16 Thread Naoya Horiguchi
roper tasklist and rcu read > > lock(s). > > This reasoning and proposal sound right to me. I'm trying to reproduce > this race (for non-pmem case,) but no luck for now. I'll investigate more. I wrote/tested a patch for this issue. I think that switching signal API effectively does

Re: PMEM error-handling forces SIGKILL causes kernel panic

2019-01-11 Thread Naoya Horiguchi
> Before 4.19 this test should result in a machine-check reboot, not > much better than a kernel crash. > > > Should we not to force the SIGKILL, or find a way to close the race window? > > The race should be closed by holding the proper tasklist and rcu read lock(s). This reasoning and proposal sound right to me. I'm trying to reproduce this race (for non-pmem case,) but no luck for now. I'll investigate more. Thanks, Naoya Horiguchi

Re: [PATCH] tools/vm/page-types.c: fix "kpagecount returned fewer pages than expected" failures

2018-12-10 Thread Naoya Horiguchi
s: 7f1d23e60718 ("tools/vm/page-types.c: include shared map counts") > Signed-off-by: Anthony Yznaga Reviewed-by: Naoya Horiguchi > --- > tools/vm/page-types.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/vm/page-types.c b/tools/vm/page

Re: [PATCH v2] /proc/kpagecount: return 0 for special pages that are never mapped

2018-12-10 Thread Naoya Horiguchi
age_mapcount() for these pages. > > Signed-off-by: Anthony Yznaga > Acked-by: Matthew Wilcox Looks good to me. Reviewed-by: Naoya Horiguchi > --- > v2 - incorporated feedback from Matthew Wilcox > > fs/proc/page.c | 2 +- > include/linux/page-flag

Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-06 Thread Naoya Horiguchi
On Thu, Dec 06, 2018 at 09:32:06AM +0100, Michal Hocko wrote: > On Thu 06-12-18 05:21:38, Naoya Horiguchi wrote: > > On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote: > > > On Wed 05-12-18 13:29:18, Michal Hocko wrote: > > > [...] > > > > After

Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-05 Thread Naoya Horiguchi
le tree) might explain the difference you experienced: commit 286c469a988fbaf68e3a97ddf1e6c245c1446968 Author: Naoya Horiguchi Date: Wed May 3 14:56:22 2017 -0700

Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-04 Thread Naoya Horiguchi
On Tue, Dec 04, 2018 at 10:35:49AM +0100, Michal Hocko wrote: > On Tue 04-12-18 09:11:05, Naoya Horiguchi wrote: > > On Tue, Dec 04, 2018 at 09:48:26AM +0100, Michal Hocko wrote: > > > On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote: > > > > On Mon, Dec 03, 2018 at
