Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
On Tue, Apr 20, 2021 at 8:48 AM Luck, Tony wrote:
>
> On Mon, Apr 19, 2021 at 07:03:01PM -0700, Jue Wang wrote:
> > On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> > >
> > > This patch suggests to do page table walk to find the error virtual
> > > address. If we find multiple virtual addresses in walking, we now can't
> > > determine which one is correct, so we fall back to sending SIGBUS in
> > > kill_me_maybe() without error info as we do now. This corner case needs
> > > to be solved in the future.
> >
> > Instead of walking the page tables, I wonder what about the following idea:
> >
> > When failing to get vaddr, memory_failure just ensures the mapping is
> > removed and an hwpoisoned swap pte is put in place; or the original page
> > is flagged with PG_HWPOISONED and kept in the radix tree (e.g., for
> > SHMEM THP).
>
> To remove the mapping, you need to know the virtual address :-)

I meant that in this case (threads racing to access the same poisoned pages),
the page mapping should already have been removed, and the hwpoison swap pte
installed, by the winning thread. Can't the other racing threads then rely on
the subsequent #PFs to get the correct SIGBUS with accurate vaddr semantics?

Or is the goal to "give back a correct SIGBUS with accurate vaddr on _the
first MCE on ANY thread_"? I wonder if that goal is absolutely necessary, or
whether it can be relaxed a little to take subsequent #PFs into account.

> Well, I did try a patch that removed *all* user mappings (switched CR3 to
> swapper_pg_dir) and returned to user. Then have the resulting page fault
> report the address. But that didn't work very well.

Curious what didn't work well in this case? :-)

> > NOTE: no SIGBUS is sent to user space.
> >
> > Then do_machine_check just returns to user space to resume execution, the
> > re-execution will result in a #PF and should land to the exact page fault
> > handling code that generates a SIGBUS with the precise vaddr info:
>
> That's how SRAO (and other races) are supposed to work.

Hmm, I wonder why it doesn't apply to this race.

> -Tony
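
P.S. For reference, the hwpoison swap-pte machinery this idea relies on
already exists. A sketch of the helpers, roughly as they appear in
include/linux/swapops.h (details vary across kernel versions):

/*
 * Sketch of the existing hwpoison swap-entry helpers (roughly as in
 * include/linux/swapops.h; details vary across kernel versions).
 * Replacing a poisoned page's ptes with a SWP_HWPOISON entry lets any
 * later fault on that address be turned into a SIGBUS.
 */
static inline swp_entry_t make_hwpoison_entry(struct page *page)
{
        BUG_ON(!PageLocked(page));
        return swp_entry(SWP_HWPOISON, page_to_pfn(page));
}

static inline int is_hwpoison_entry(swp_entry_t entry)
{
        return swp_type(entry) == SWP_HWPOISON;
}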
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Tue, 13 Apr 2021 12:07:22 +0200, Petkov, Borislav wrote:
> > KVM apparently passes a machine check into the guest.
>
> Ah, there it is:
>
> static void kvm_send_hwpoison_signal(unsigned long address,
>                                      struct task_struct *tsk)
> {
>         send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address,
>                         PAGE_SHIFT, tsk);
> }

This path is taken when an EPT #PF finds an access to a hwpoisoned page and
sends SIGBUS to user space (KVM exits into user space), with the same
semantics as when a regular #PF finds an access to a hwpoisoned page. It is
the KVM_X86_SET_MCE ioctl that actually injects a machine check into the
guest.

We are in the process of launching a product with MCE recovery capability in
a KVM-based virtualization product, and we plan to expand the scope of its
application in the near future.

> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

The in-memory database and analytics domains are definitely using it.
A couple of examples:

SAP HANA - as we have tested, and planned to launch as a strategic
enterprise use case with MCE recovery capability in our product

SQL Server -
https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors

Cheers,
-Jue
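
P.S. To illustrate the injection side: a rough sketch of how a VMM might
forward a host BUS_MCEERR_AR SIGBUS into the guest via KVM_X86_SET_MCE.
The status/mcg_status bits are modeled loosely on what QEMU sets for an
action-required error; the bank number is an illustrative assumption, and
translating the host virtual address to a guest physical one is the VMM's
job (not shown). This is not our production code.

/*
 * Sketch: forward a host BUS_MCEERR_AR SIGBUS into the guest as an
 * "action required" machine check. Bit values mirror the x86 MCA
 * architectural encodings (as QEMU uses them); the bank number is an
 * illustrative assumption.
 */
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define MCI_STATUS_VAL   (1ULL << 63)   /* register contents valid */
#define MCI_STATUS_UC    (1ULL << 61)   /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)   /* error reporting enabled */
#define MCI_STATUS_ADDRV (1ULL << 58)   /* ADDR register valid */
#define MCI_STATUS_S     (1ULL << 56)   /* signaled via #MC */
#define MCI_STATUS_AR    (1ULL << 55)   /* action required */
#define MCG_STATUS_EIPV  (1ULL << 1)    /* error IP valid */
#define MCG_STATUS_MCIP  (1ULL << 2)    /* machine check in progress */

static void inject_srar(int vcpu_fd, __u64 gpa)
{
        struct kvm_x86_mce mce = {
                .bank       = 9,        /* illustrative bank number */
                .status     = MCI_STATUS_VAL | MCI_STATUS_UC |
                              MCI_STATUS_EN | MCI_STATUS_ADDRV |
                              MCI_STATUS_S | MCI_STATUS_AR |
                              0x134,    /* MCACOD: data load error */
                .mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV,
                .addr       = gpa,      /* guest-physical error address */
        };

        if (ioctl(vcpu_fd, KVM_X86_SET_MCE, &mce) < 0)
                perror("KVM_X86_SET_MCE");
}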
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Wed, Apr 14, 2021 at 6:10 AM Borislav Petkov wrote:
>
> On Tue, Apr 13, 2021 at 10:47:21PM -0700, Jue Wang wrote:
> > This path is taken when an EPT #PF finds an access to a hwpoisoned page
> > and sends SIGBUS to user space (KVM exits into user space), with the
> > same semantics as when a regular #PF finds an access to a hwpoisoned
> > page.
> >
> > It is the KVM_X86_SET_MCE ioctl that actually injects a machine check
> > into the guest.
> >
> > We are in the process of launching a product with MCE recovery
> > capability in a KVM-based virtualization product, and we plan to expand
> > the scope of its application in the near future.
>
> Any pointers to code or is this all non-public? Any text on what that
> product does with the MCEs?

These are non-public at this point. User-facing docs and a blog post are
expected to be released towards the launch (i.e., in 3-4 months from now).

> > The in-memory database and analytics domains are definitely using it.
> > A couple of examples:
> > SAP HANA - as we have tested, and planned to launch as a strategic
> > enterprise use case with MCE recovery capability in our product
> > SQL Server -
> > https://support.microsoft.com/en-us/help/2967651/inf-sql-server-may-display-memory-corruption-and-recovery-errors
>
> Aha, so they register callbacks for the processes to exec on a memory
> error. Good to know, thanks for those.

My other 2 cents: I can see this being useful in other domains too. For
example, on multi-tenant cloud servers where many VMs are collocated on the
same host, with proper recovery + live migration a single MCE would affect
at most a single VM.

Another generic use case may be services that can tolerate an abrupt crash:
either they periodically save checkpoints to persistent storage, or they
are stateless by nature and are managed by some process manager that
automatically restarts them and resumes from wherever the work left off at
the crash.

Thanks,
-Jue

> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison
On Thu, 8 Apr 2021 10:08:52 -0700, Tony Luck wrote:
> KVM apparently passes a machine check into the guest. Though it seems
> to be missing the MCG_STATUS information to tell the guest whether this
> is an "Action Required" machine check, or an "Action Optional" (i.e.
> whether the poison was found synchronously by execution of the current
> instruction, or asynchronously).

The KVM_X86_SET_MCE ioctl takes a parameter of struct kvm_x86_mce, so the
hypervisor can set whatever semantics are necessary:

#ifdef KVM_CAP_MCE
/* x86 MCE */
struct kvm_x86_mce {
        __u64 status;
        __u64 addr;
        __u64 misc;
        __u64 mcg_status;
        __u8 bank;
        __u8 pad1[7];
        __u64 pad2[3];
};
#endif

> > Are we documenting somewhere: "if your process gets a SIGBUS and this
> > and that, which means your page got offlined, you should do this and
> > that to recover"?
>
> Essentially it boils down to:
>
> SIGBUS handler gets additional data giving virtual address that has gone away
>
> 1) Can the application replace the lost page?
>    Use mmap(addr, MAP_FIXED, ...) to map a fresh page into the gap
>    and fill with replacement data. This case can return from SIGBUS
>    handler to re-execute failed instruction
> 2) Can the application continue in degraded mode w/o the lost page?
>    Hunt down pointers to lost page and update structures to say
>    "this data lost". Use siglongjmp() to go to preset recovery path
> 3) Can the application shut down gracefully?
>    Record details of the lost page. Inform next-of-kin. Exit.
> 4) Default - just exit

Two possible additions to these great points:

5) If for some reason the page cannot be unmapped (e.g., because unmapping
   would lose too much memory, as with hugetlbfs 1G pages, or because of a
   THP split failure for SHMEM THP), the kernel maintains consistent
   semantics (i.e., an MCE SIGBUS with vaddr) for all future accesses from
   user space, by leaving the hwpoisoned page mapped or in the radix tree.

6) If for some reason the vaddr is not available upon the first MCE
   recovery and the page is unmapped, the kernel provides the correct
   semantics (an MCE SIGBUS with vaddr) in subsequent page faults from user
   space accessing the same vaddr.
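
As a concrete illustration of Tony's option 1) above, a minimal sketch of
such a SIGBUS handler - assuming the application can regenerate the lost
data; refill_page() is a hypothetical application callback:

/*
 * Minimal sketch of recovery option 1): catch the memory-error SIGBUS,
 * map a fresh page over the poisoned one with MAP_FIXED, regenerate its
 * contents, and return to re-execute the faulting instruction.
 * refill_page() is a hypothetical application hook.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

void refill_page(void *addr, size_t len);        /* hypothetical */

static void mce_sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
        size_t len;
        void *lost;

        if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
                _exit(1);               /* not a memory error */

        len  = (size_t)1 << si->si_addr_lsb;     /* granularity of loss */
        lost = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(len - 1));

        /* Map a fresh zeroed page into the gap (option 1)... */
        if (mmap(lost, len, PROT_READ | PROT_WRITE,
                 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) ==
            MAP_FAILED)
                _exit(1);               /* ...or fall back to option 4 */

        refill_page(lost, len);         /* regenerate the data */
        /* returning re-executes the failed instruction */
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = mce_sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);
        /* ... application work ... */
        return 0;
}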
Re: [PATCH 4/4] x86/mce: Avoid infinite loop for copy from user recovery
On Thu, 25 Mar 2021 17:02:35 -0700, Tony Luck wrote:
...
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.

What about returning -EHWPOISON instead of -EFAULT, and updating the
callers to handle -EHWPOISON explicitly, i.e., not retry but give up on the
page?

My main concern is that the strong assumption that the kernel can't hit
more than a fixed number of poisoned cache lines before returning to user
space may simply not be true. When a DIMM goes bad, it can easily affect an
entire bank or an entire RAM device chip. Even with memory interleaving, it
is possible that a kernel control path touches lots of poisoned cache lines
in the buffer it is working through.
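
A hypothetical sketch of the caller-side change this would enable -
copyin_helper() is a stand-in name for whichever copy-from-user wrapper a
given call site uses; only the error-code dispatch is the point:

/*
 * Hypothetical caller pattern for the proposed -EHWPOISON convention.
 * copyin_helper() is a stand-in name; the point is that poison becomes
 * distinguishable from an ordinary fault, so the caller stops retrying
 * a poisoned source.
 */
static int consume_user_buffer(void *dst, const void __user *usrc,
                               size_t len)
{
        int ret = copyin_helper(dst, usrc, len);    /* hypothetical */

        if (ret == -EHWPOISON)
                return ret;     /* poison: give up, do not retry */
        if (ret == -EFAULT) {
                /* ordinary fault: keep today's fixup-and-retry path */
        }
        return ret;
}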
Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
...
> + * This function is intended to handle "Action Required" MCEs on already
> + * hardware poisoned pages. They could happen, for example, when
> + * memory_failure() failed to unmap the error page at the first call, or
> + * when multiple Action Optional MCE events races on different CPUs with
> + * Local MCE enabled.

+Tony Luck

Hey Tony, I thought SRAO MCEs are broadcast to all cores in the system,
since they come without an execution context - is that correct?

If yes, Naoya, I think we might want to remove the comment about the
"multiple Action Optional MCE racing" part.

Best,
-Jue
Re: [PATCH v1 3/3] mm,hwpoison: add kill_accessing_process() to find error virtual address
On Tue, 13 Apr 2021 07:43:20 +0900, Naoya Horiguchi wrote:
> This patch suggests to do page table walk to find the error virtual
> address. If we find multiple virtual addresses in walking, we now can't
> determine which one is correct, so we fall back to sending SIGBUS in
> kill_me_maybe() without error info as we do now. This corner case needs
> to be solved in the future.

Instead of walking the page tables, I wonder what about the following idea:

When failing to get vaddr, memory_failure just ensures the mapping is
removed and an hwpoisoned swap pte is put in place; or the original page is
flagged with PG_HWPOISONED and kept in the radix tree (e.g., for SHMEM THP).

NOTE: no SIGBUS is sent to user space.

Then do_machine_check just returns to user space to resume execution, and
the re-execution will result in a #PF that should land in the exact page
fault handling code that generates a SIGBUS with the precise vaddr info:

https://github.com/torvalds/linux/blob/7af08140979a6e7e12b78c93b8625c8d25b084e2/mm/memory.c#L3290
https://github.com/torvalds/linux/blob/7af08140979a6e7e12b78c93b8625c8d25b084e2/mm/memory.c#L3647

Thanks,
-Jue
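
P.S. For reference, the branch at the first link looks roughly like this -
a trimmed sketch of do_swap_page() in mm/memory.c; the exact code varies
across kernel versions:

/*
 * Trimmed sketch of the existing hwpoison branch in do_swap_page()
 * (mm/memory.c, first link above). A hwpoison swap pte faults straight
 * into VM_FAULT_HWPOISON, which the arch fault code turns into a SIGBUS
 * with BUS_MCEERR_AR and the precise faulting vaddr.
 */
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
        if (is_migration_entry(entry)) {
                migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);
        } else if (is_hwpoison_entry(entry)) {
                ret = VM_FAULT_HWPOISON;
        } else {
                print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
                ret = VM_FAULT_SIGBUS;
        }
        goto out;
}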
Re: [PATCH v2] mm,hwpoison: return -EBUSY when page already poisoned
I believe the mutex-type patch has its own value in protecting
memory_failure from other inherent races, e.g., races around
split_huge_page, where concurrent MCEs happen to different 4k pages under
the same THP. [1] This can realistically happen given the physical-locality
clustering effect of memory errors.

Thanks,
-Jue

[1] The split fails due to a page reference taken by another thread
concurrently calling into memory_failure on the same THP.
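
P.S. For illustration, a rough sketch of the serialization in question -
modeled on the patch under discussion; the names and the exact placement of
the check are illustrative:

/*
 * Rough sketch of the mutex approach (names and placement are
 * illustrative). Serializing memory_failure() closes the window where
 * two CPUs race on 4k pages under the same THP and the split sees a
 * stray reference taken by the other caller.
 */
static DEFINE_MUTEX(mf_mutex);

int memory_failure(unsigned long pfn, int flags)
{
        struct page *p = pfn_to_page(pfn);
        int res;

        mutex_lock(&mf_mutex);
        if (TestSetPageHWPoison(p)) {
                res = -EBUSY;   /* page already under handling */
                goto out;
        }
        res = __memory_failure_locked(p, pfn, flags);   /* hypothetical */
out:
        mutex_unlock(&mf_mutex);
        return res;
}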
Re: [PATCH v1] mm, hwpoison: enable error handling on shmem thp
Huge thanks to Hugh for his expertise in shmem THPs and for forwarding this
message!

Hi Naoya,

This is the first time I reply to an email in a Linux upstream thread;
apologies for any technical issues due to my email client. And my apologies
that the whole patch is not quite shareable with upstream due to some
kernel differences.

Some FYI comments inline.

Thanks,
-Jue

On Thu, Mar 11, 2021 at 7:14 AM HORIGUCHI NAOYA(堀口 直也) wrote:
>
> On Wed, Mar 10, 2021 at 11:22:18PM -0800, Hugh Dickins wrote:
> > On Tue, 9 Feb 2021, Naoya Horiguchi wrote:
> >
> > > From: Naoya Horiguchi
> > >
> > > Currently hwpoison code checks PageAnon() for thp and refuses to handle
> > > errors on non-anonymous thps (just for historical reason). We now
> > > support non-anonymous thp like the shmem one, so this patch suggests
> > > to enable handling shmem thps. Fortunately, we already have
> > > can_split_huge_page() to check if a given thp is splittable, so this
> > > patch relies on it.
> >
> > Fortunately? I don't understand. Why call can_split_huge_page()
> > at all, instead of simply trying split_huge_page() directly?
>
> The background of this change was that I've experienced triggering
> VM_BUG_ON_PAGE() at the beginning of split_huge_page_to_list() in an older
> kernel (I forgot the specific condition of the BUG_ON). I thought that was
> caused by a race between thp allocation and memory_failure. So I wanted
> some rigid way to confirm that a given thp is splittable. Then I found
> can_split_huge_page(), which sounded suitable to me because I expected the
> API to be maintained by the thp subsystem.
>
> But on rethinking, split_huge_page_to_list() seems to have a different set
> of VM_BUG_ON_PAGE()s now, and anyway split_huge_page_to_list() calls
> can_split_huge_page() internally, so I might have wrongly read the code.
>
> > And could it do better than -EBUSY when split_huge_page() fails?
>
> Yes it could.
>
> > > Signed-off-by: Naoya Horiguchi
> >
> > Thanks for trying to add shmem+file THP support, but I think this
> > does not work as intended - Andrew, if Naoya agrees, please drop from
> > mmotm for now, the fixup needed will be more than a line or two.
>
> I agree to drop it. I need to research more to address the following
> comments.
>
> > I'm not much into memory-failure myself, but Jue discovered that the
> > SIGBUS never arrives: because split_huge_page() on a shmem or file
> > THP unmaps all its pmds and ptes, and (unlike with anon) leaves them
> > unmapped - in normal circumstances, to be faulted back on demand.
> > So the page_mapped() check in hwpoison_user_mappings() fails,
> > and the intended SIGBUS is not delivered.
>
> Thanks for the information. The split behaves quite differently between
> anon thps and shmem thps. I saw some unexpected behavior in my testing;
> maybe that is due to the difference.
>
> > (Or, is it acceptable that the SIGBUS is not delivered to those who
> > have the huge page mapped: should it get delivered later, to anyone
> > who faults back in the bad 4k?)
>
> Later accesses should report the error in the page fault, so the worst
> scenario of consuming corrupted data does not happen, but the
> precautionary signal does not work, so it's not acceptable.

In our experiment with SHMEM THPs, later accesses resulted in a zero page
being allocated instead of a SIGBUS with BUS_MCEERR_AR reported by the page
fault handler.

That part might be an opportunity to prevent some silent data corruption,
just in case.
> >
> > We believe the tokill list has to be set up earlier, before
> > split_huge_page() is called, then passed in to hwpoison_user_mappings().
> >
> > Sorry, we don't have a proper patch for that right now, but I expect
> > you can see what needs to be done. But something we found on the way,
> > we do have a patch for: add_to_kill() uses page_address_in_vma(), but
> > that has not been used on file THP tails before - fix appended at the
> > end below, so as not to waste your time on that bit.
>
> Thank you very much, I'll work on top of it.
>
> Thanks,
> Naoya Horiguchi
>
> ...
>
> > [PATCH] mm: fix page_address_in_vma() on file THP tails
> > From: Jue Wang
> >
> > Anon THP tails were already supported, but memory-failure now needs to use
> > page_address_in_vma() on file THP tails, which its page->mapping check did
> > not permit: fix it.
> >
> > Signed-off-by: Jue Wang
> > Signed-off-by: Hugh Dickins
> > ---
> >
> > mm/rmap.c | 8
> > 1 file cha
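
To make Hugh's point about the lost SIGBUS concrete, a trimmed sketch of
the relevant check in mm/memory-failure.c (the signature and body are
approximate and vary by kernel version): split_huge_page() on a shmem/file
THP has already emptied the mappings, so the tokill list is never collected.

/*
 * Sketch of why the SIGBUS is lost (mm/memory-failure.c, trimmed and
 * approximate): by the time hwpoison_user_mappings() runs, the split
 * has unmapped every pte, so the early page_mapped() test bails out
 * before the tokill list - the set of tasks to signal - is collected.
 */
static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
                                   int flags, struct page **hpagep)
{
        struct page *hpage = *hpagep;
        LIST_HEAD(tokill);
        bool unmap_success;

        if (!page_mapped(hpage))
                return true;    /* nothing mapped: no one gets SIGBUS */

        /* tokill is collected from the page's current mappers... */
        collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
        /* ...so an already-split shmem THP would yield an empty list */
        unmap_success = try_to_unmap(hpage, TTU_IGNORE_MLOCK);
        kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success,
                   pfn, flags);
        return unmap_success;
}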