Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
On Wed, 23 Nov 2022 at 16:15, Christian König wrote:
>
> Am 23.11.22 um 16:08 schrieb Jason Gunthorpe:
> > On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
> >>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>> index 1376a47fedeedb..4161241fc3228c 100644
> >>> --- a/virt/kvm/kvm_main.c
> >>> +++ b/virt/kvm/kvm_main.c
> >>> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> >>>  		return r;
> >>>  	}
> >>>
> >>> +	/*
> >>> +	 * Special PTEs are never convertible into a struct page, even if the
> >>> +	 * driver that owns them might have put a PFN with a struct page into
> >>> +	 * the PFNMAP. If the arch doesn't support special then we cannot
> >>> +	 * safely process these pages.
> >>> +	 */
> >>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >>> +	if (pte_special(*ptep))
> >>> +		return -EINVAL;
> >>
> >> On second thought this won't work, because it completely defeats the
> >> point of why this code here exists. remap_pfn_range() (which is what
> >> the various dma_mmap functions and the ioremap functions are built on
> >> top of too) sets VM_PFNMAP too, so this check would even catch the
> >> static mappings.
> >
> > The problem with the way this code is designed is how it allows
> > returning the pfn without taking any reference based on things like
> > !pfn_valid or page_reserved. This allows it to then conditionally put
> > back the reference based on the same reasoning. It is impossible to
> > thread pte special into that since it is a PTE flag, not a property of
> > the PFN.
> >
> > I don't entirely understand why it needs the page reference at all,
>
> That's exactly what I've pointed out in the previous discussion about
> that code as well.
>
> As far as I can see it this is just another case where people assumed
> that grabbing a page reference somehow magically prevents the pte from
> changing.
>
> I have not the slightest idea how people got this impression, but I have
> heard it so many times from so many different sources that there must be
> some common cause for it. Is there maybe some book or tutorial on how to
> sophisticatedly break the kernel or something like this?

It's what get_user_pages does, so it does "work". Except this path here
is the fallback for when get_user_pages does not work (because of the
pte_special/VM_SPECIAL case). So essentially it's just a rather broken
get_user_pages that handrolls a bunch of things with bugs & races.

I have no idea why people don't realize they're just reinventing gup
without using gup, but that's essentially what's going on.

> Anyway as far as I can see the only correct approach would be to use an
> MMU notifier or the more high-level hmm_range_fault() + seq number.

Yeah, plus if you go through ptes you really have to obey all the flags
or things will break. Especially the RO pte flag.
-Daniel

> Regards,
> Christian.
>
> > even if it is available - so I can't guess why it is OK to ignore the
> > page reference in other cases, or why it is OK to be racy..
> >
> > Eg hmm_range_fault() does not obtain page references and implements a
> > very similar algorithm to kvm.
> >
> >> Plus these static mappings aren't all that static either, e.g. pci
> >> access also can revoke bar mappings nowadays.
> >
> > And there are already mmu notifiers to handle that, AFAIK.
> >
> > Jason

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
On Wed, 23 Nov 2022 at 16:04, Jason Gunthorpe wrote:
>
> On Wed, Nov 23, 2022 at 03:28:27PM +0100, Daniel Vetter wrote:
>
> > > This patch is known to be broken in so many ways. It also has a major
> > > security hole that it ignores the PTE flags making the page
> > > RO. Ignoring the special bit is somehow not surprising :(
> > >
> > > This probably doesn't work, but is the general idea of what KVM needs
> > > to do:
> >
> > Oh dear, when I dug around in there I entirely missed that
> > kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs
> > to grow a proper mmu notifier.
> >
> > Another thing I'm wondering right now, the follow_pte();
> > fixup_user_fault(); follow_pte(); approach does not make any
> > guarantees of actually being right. If you're sufficiently unlucky you
> > might race against an immediate pte invalidate between the fixup and
> > the 2nd follow_pte(). But you can also not loop, because that would
> > fail to catch permanent faults.
>
> Yes, it is pretty broken.
>
> kvm already has support for mmu notifiers and uses it for other
> stuff. I can't remember what exactly this code path was for, IIRC
> Paolo talked about having a big rework/fix for it when we last talked
> about the missing write protect. I also vaguely recall he had some
> explanation why this might be safe.
>
> > I think the iommu fault drivers have a similar pattern.
>
> Where? It shouldn't.
>
> The common code for SVA just calls handle_mm_fault() and restarts the
> PRI. Since the page table is physically shared there is no issue with
> a stale copy.
>
> > What am I missing here? Or is that also just broken. gup works around
> > this with the slow path that takes the mmap sem and walks the vma
> > tree; follow_pte/fixup_user_fault users don't.
>
> follow_pte() is just fundamentally broken, things must not use it.
>
> > Maybe mmu notifier based restarting would help with this too, if
> > done properly.
>
> That is called hmm_range_fault()

Ah right, I mixed that up on a quick grep, thanks for pointing me in the
right direction. Worries appeased.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
Am 23.11.22 um 16:08 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 1376a47fedeedb..4161241fc3228c 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > >  		return r;
> > >  	}
> > >
> > > +	/*
> > > +	 * Special PTEs are never convertible into a struct page, even if the
> > > +	 * driver that owns them might have put a PFN with a struct page into
> > > +	 * the PFNMAP. If the arch doesn't support special then we cannot
> > > +	 * safely process these pages.
> > > +	 */
> > > +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> > > +	if (pte_special(*ptep))
> > > +		return -EINVAL;
> >
> > On second thought this won't work, because it completely defeats the
> > point of why this code here exists. remap_pfn_range() (which is what
> > the various dma_mmap functions and the ioremap functions are built on
> > top of too) sets VM_PFNMAP too, so this check would even catch the
> > static mappings.
>
> The problem with the way this code is designed is how it allows
> returning the pfn without taking any reference based on things like
> !pfn_valid or page_reserved. This allows it to then conditionally put
> back the reference based on the same reasoning. It is impossible to
> thread pte special into that since it is a PTE flag, not a property of
> the PFN.
>
> I don't entirely understand why it needs the page reference at all,

That's exactly what I've pointed out in the previous discussion about
that code as well.

As far as I can see it this is just another case where people assumed
that grabbing a page reference somehow magically prevents the pte from
changing.

I have not the slightest idea how people got this impression, but I have
heard it so many times from so many different sources that there must be
some common cause for it. Is there maybe some book or tutorial on how to
sophisticatedly break the kernel or something like this?

Anyway as far as I can see the only correct approach would be to use an
MMU notifier or the more high-level hmm_range_fault() + seq number.

Regards,
Christian.

> even if it is available - so I can't guess why it is OK to ignore the
> page reference in other cases, or why it is OK to be racy..
>
> Eg hmm_range_fault() does not obtain page references and implements a
> very similar algorithm to kvm.
>
> > Plus these static mappings aren't all that static either, e.g. pci
> > access also can revoke bar mappings nowadays.
>
> And there are already mmu notifiers to handle that, AFAIK.
>
> Jason
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe wrote:
>
> On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
> > Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> > > On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > > > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > > > > Maybe a GFP flag to set the page reference count to zero or
> > > > > > > something like this?
> > > > > >
> > > > > > Hm yeah that might work. I'm not sure what it will all break
> > > > > > though? And we'd need to make sure that underflowing the page
> > > > > > refcount dies in a backtrace.
> > > > >
> > > > > Mucking with the refcount like this to protect against crazy out
> > > > > of tree drivers seems horrible..
> > > >
> > > > Well not only out of tree drivers. The in-tree KVM got that
> > > > horribly wrong as well; those were the latest ones complaining
> > > > about it.
> > >
> > > kvm was taking refs on special PTEs? That seems really unlikely?
> >
> > Well then look at this code here:
> >
> > commit add6a0cd1c5ba51b201e1361b05a5df817083618
> > Author: Paolo Bonzini
> > Date:   Tue Jun 7 17:51:18 2016 +0200
> >
> >     KVM: MMU: try to fix up page faults before giving up
> >
> >     The vGPU folks would like to trap the first access to a BAR by
> >     setting vm_ops on the VMAs produced by mmap-ing a VFIO device.
> >     The fault handler then can use remap_pfn_range to place some
> >     non-reserved pages in the VMA.
> >
> >     This kind of VM_PFNMAP mapping is not handled by KVM, but
> >     follow_pfn and fixup_user_fault together help supporting it. The
> >     patch also supports VM_MIXEDMAP vmas where the pfns are not
> >     reserved and thus subject to reference counting.
> >
> >     Cc: Xiao Guangrong
> >     Cc: Andrea Arcangeli
> >     Cc: Radim Krčmář
> >     Tested-by: Neo Jia
> >     Reported-by: Kirti Wankhede
> >     Signed-off-by: Paolo Bonzini
>
> This patch is known to be broken in so many ways. It also has a major
> security hole that it ignores the PTE flags making the page
> RO. Ignoring the special bit is somehow not surprising :(
>
> This probably doesn't work, but is the general idea of what KVM needs
> to do:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1376a47fedeedb..4161241fc3228c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		return r;
>  	}
>
> +	/*
> +	 * Special PTEs are never convertible into a struct page, even if the
> +	 * driver that owns them might have put a PFN with a struct page into
> +	 * the PFNMAP. If the arch doesn't support special then we cannot
> +	 * safely process these pages.
> +	 */
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +	if (pte_special(*ptep))
> +		return -EINVAL;

On second thought this won't work, because it completely defeats the
point of why this code here exists. remap_pfn_range() (which is what the
various dma_mmap functions and the ioremap functions are built on top of
too) sets VM_PFNMAP too, so this check would even catch the static
mappings.

Plus these static mappings aren't all that static either, e.g. pci
access also can revoke bar mappings nowadays.

I think nothing except full mmu_notifier will actually fix this.
-Daniel

> +#else
> +	return -EINVAL;
> +#endif
> +
>  	if (write_fault && !pte_write(*ptep)) {
>  		pfn = KVM_PFN_ERR_RO_FAULT;
>  		goto out;
>
> Jason

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe wrote:
>
> On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
> > Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> > > On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > > > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > > > > Maybe a GFP flag to set the page reference count to zero or
> > > > > > > something like this?
> > > > > >
> > > > > > Hm yeah that might work. I'm not sure what it will all break
> > > > > > though? And we'd need to make sure that underflowing the page
> > > > > > refcount dies in a backtrace.
> > > > >
> > > > > Mucking with the refcount like this to protect against crazy out
> > > > > of tree drivers seems horrible..
> > > >
> > > > Well not only out of tree drivers. The in-tree KVM got that
> > > > horribly wrong as well; those were the latest ones complaining
> > > > about it.
> > >
> > > kvm was taking refs on special PTEs? That seems really unlikely?
> >
> > Well then look at this code here:
> >
> > commit add6a0cd1c5ba51b201e1361b05a5df817083618
> > Author: Paolo Bonzini
> > Date:   Tue Jun 7 17:51:18 2016 +0200
> >
> >     KVM: MMU: try to fix up page faults before giving up
> >
> >     The vGPU folks would like to trap the first access to a BAR by
> >     setting vm_ops on the VMAs produced by mmap-ing a VFIO device.
> >     The fault handler then can use remap_pfn_range to place some
> >     non-reserved pages in the VMA.
> >
> >     This kind of VM_PFNMAP mapping is not handled by KVM, but
> >     follow_pfn and fixup_user_fault together help supporting it. The
> >     patch also supports VM_MIXEDMAP vmas where the pfns are not
> >     reserved and thus subject to reference counting.
> >
> >     Cc: Xiao Guangrong
> >     Cc: Andrea Arcangeli
> >     Cc: Radim Krčmář
> >     Tested-by: Neo Jia
> >     Reported-by: Kirti Wankhede
> >     Signed-off-by: Paolo Bonzini
>
> This patch is known to be broken in so many ways. It also has a major
> security hole that it ignores the PTE flags making the page
> RO. Ignoring the special bit is somehow not surprising :(
>
> This probably doesn't work, but is the general idea of what KVM needs
> to do:

Oh dear, when I dug around in there I entirely missed that
kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs to
grow a proper mmu notifier.

Another thing I'm wondering right now, the follow_pte();
fixup_user_fault(); follow_pte(); approach does not make any guarantees
of actually being right. If you're sufficiently unlucky you might race
against an immediate pte invalidate between the fixup and the 2nd
follow_pte(). But you can also not loop, because that would fail to
catch permanent faults.

I think the iommu fault drivers have a similar pattern.

What am I missing here? Or is that also just broken. gup works around
this with the slow path that takes the mmap sem and walks the vma tree;
follow_pte/fixup_user_fault users don't.

Maybe mmu notifier based restarting would help with this too, if done
properly.
-Daniel

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1376a47fedeedb..4161241fc3228c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  		return r;
>  	}
>
> +	/*
> +	 * Special PTEs are never convertible into a struct page, even if the
> +	 * driver that owns them might have put a PFN with a struct page into
> +	 * the PFNMAP. If the arch doesn't support special then we cannot
> +	 * safely process these pages.
> +	 */
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +	if (pte_special(*ptep))
> +		return -EINVAL;
> +#else
> +	return -EINVAL;
> +#endif
> +
>  	if (write_fault && !pte_write(*ptep)) {
>  		pfn = KVM_PFN_ERR_RO_FAULT;
>  		goto out;
>
> Jason

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
Am 23.11.22 um 13:53 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
> > Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> > > On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > > > Maybe a GFP flag to set the page reference count to zero or
> > > > > something like this?
> > > >
> > > > Hm yeah that might work. I'm not sure what it will all break
> > > > though? And we'd need to make sure that underflowing the page
> > > > refcount dies in a backtrace.
> > >
> > > Mucking with the refcount like this to protect against crazy out of
> > > tree drivers seems horrible..
> >
> > Well not only out of tree drivers. The in-tree KVM got that horribly
> > wrong as well; those were the latest ones complaining about it.
>
> kvm was taking refs on special PTEs? That seems really unlikely?

Well then look at this code here:

commit add6a0cd1c5ba51b201e1361b05a5df817083618
Author: Paolo Bonzini
Date:   Tue Jun 7 17:51:18 2016 +0200

    KVM: MMU: try to fix up page faults before giving up

    The vGPU folks would like to trap the first access to a BAR by
    setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The
    fault handler then can use remap_pfn_range to place some
    non-reserved pages in the VMA.

    This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
    and fixup_user_fault together help supporting it. The patch also
    supports VM_MIXEDMAP vmas where the pfns are not reserved and thus
    subject to reference counting.

    Cc: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Tested-by: Neo Jia
    Reported-by: Kirti Wankhede
    Signed-off-by: Paolo Bonzini

And see also the discussion here:

https://patchwork.freedesktop.org/patch/414123/

as well as here:

https://patchwork.freedesktop.org/patch/499190/

I can't count how often I have pointed out that this is absolutely
illegal and KVM can't touch pages in VMAs with VM_PFNMAP.

> The WARN_ON(page_count(p) != 1) seems like a reasonable thing to do
> though, though you must combine this with the special PTE flag..

That's not sufficient. The pages are released much later than things
actually go wrong. In most cases this WARN_ON here won't hit.

> How so? As long as the page is mapped into the PTE there is no issue
> with corruption. If dmabuf checks the refcount after it does the unmap
> mapping range it should catch any bogus pin that might be confused
> about address coherency.

Yeah, that would work. The problem is this WARN_ON() comes much later.
The device drivers usually keep the page around for a while even after
it is unmapped. IIRC the cleanup worker only runs every 10ms or so.

Christian.

> Jason
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
Am 23.11.22 um 13:46 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
> > > Maybe a GFP flag to set the page reference count to zero or
> > > something like this?
> >
> > Hm yeah that might work. I'm not sure what it will all break though?
> > And we'd need to make sure that underflowing the page refcount dies
> > in a backtrace.
>
> Mucking with the refcount like this to protect against crazy out of
> tree drivers seems horrible..

Well not only out of tree drivers. The in-tree KVM got that horribly
wrong as well; those were the latest ones complaining about it.

> The WARN_ON(page_count(p) != 1) seems like a reasonable thing to do
> though, though you must combine this with the special PTE flag..

That's not sufficient. The pages are released much later than things
actually go wrong. In most cases this WARN_ON here won't hit.

Christian.

> Jason
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
On Wed, 23 Nov 2022 at 10:39, Christian König wrote:
>
> Am 23.11.22 um 10:30 schrieb Daniel Vetter:
> > On Wed, 23 Nov 2022 at 10:06, Christian König wrote:
> >> Am 22.11.22 um 20:50 schrieb Daniel Vetter:
> >>> On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe wrote:
> >>>> On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
> >>>>> You nuke all the ptes. Drivers that move have slightly more than a
> >>>>> bare struct file, they also have a struct address_space so that
> >>>>> invalidate_mapping_range() works.
> >>>>
> >>>> Okay, this is one of the ways that this can be made to work
> >>>> correctly, as long as you never allow GUP/GUP_fast to succeed on
> >>>> the PTEs. (this was the DAX mistake)
> >>>
> >>> Hence this patch, to enforce that no dma-buf exporter gets this
> >>> wrong. Which some did, and then blamed bug reporters for the
> >>> resulting splats :-) One of the things we've reverted was the ttm
> >>> huge pte support, since that doesn't have the pmd_special flag (yet)
> >>> and so would let gup_fast through.
> >>
> >> The problem is not only gup, a lot of people seem to assume that when
> >> you are able to grab a reference to a page that the ptes pointing to
> >> that page can't change any more. And that's obviously incorrect.
> >>
> >> I witnessed tons of discussions about that already. Some customers
> >> even modified our code assuming that and then wondered why the heck
> >> they ran into data corruption.
> >>
> >> It's gotten so bad that I've even proposed intentionally mangling the
> >> page reference count on TTM allocated pages:
> >> https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1-christian.koe...@amd.com/
> >
> > Yeah maybe something like this could be applied after we land this
> > patch here.
>
> I think both should land ASAP. We don't have any other way than to
> clearly point out incorrect approaches if we want to prevent the
> resulting data corruption.
>
> > Well maybe should have the same check in gem mmap code to
> > make sure no driver
>
> Reads like the sentence is somehow cut off?

Yeah, just wanted to say that we need to make sure in as many places as
possible that VM_PFNMAP is set for bo mmaps.

> >> I think it would be better that instead of having special flags in
> >> the ptes and vmas that you can't follow them to a page structure we
> >> would add something to the page indicating that you can't grab a
> >> reference to it. But this might break some use cases as well.
> >
> > Afaik the problem with that is that there's no free page bits left for
> > these debug checks. Plus the pte+vma flags are the flags to make this
> > clear already.
>
> Maybe a GFP flag to set the page reference count to zero or something
> like this?

Hm yeah that might work. I'm not sure what it will all break though? And
we'd need to make sure that underflowing the page refcount dies in a
backtrace.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Intel-gfx] [Linaro-mm-sig] Re: [PATCH] dma-buf: Require VM_PFNMAP vma for mmap
Am 23.11.22 um 10:30 schrieb Daniel Vetter:
> On Wed, 23 Nov 2022 at 10:06, Christian König wrote:
> > Am 22.11.22 um 20:50 schrieb Daniel Vetter:
> > > On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe wrote:
> > > > On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
> > > > > You nuke all the ptes. Drivers that move have slightly more
> > > > > than a bare struct file, they also have a struct address_space
> > > > > so that invalidate_mapping_range() works.
> > > >
> > > > Okay, this is one of the ways that this can be made to work
> > > > correctly, as long as you never allow GUP/GUP_fast to succeed on
> > > > the PTEs. (this was the DAX mistake)
> > >
> > > Hence this patch, to enforce that no dma-buf exporter gets this
> > > wrong. Which some did, and then blamed bug reporters for the
> > > resulting splats :-) One of the things we've reverted was the ttm
> > > huge pte support, since that doesn't have the pmd_special flag
> > > (yet) and so would let gup_fast through.
> >
> > The problem is not only gup, a lot of people seem to assume that when
> > you are able to grab a reference to a page that the ptes pointing to
> > that page can't change any more. And that's obviously incorrect.
> >
> > I witnessed tons of discussions about that already. Some customers
> > even modified our code assuming that and then wondered why the heck
> > they ran into data corruption.
> >
> > It's gotten so bad that I've even proposed intentionally mangling the
> > page reference count on TTM allocated pages:
> > https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1-christian.koe...@amd.com/
>
> Yeah maybe something like this could be applied after we land this
> patch here.

I think both should land ASAP. We don't have any other way than to
clearly point out incorrect approaches if we want to prevent the
resulting data corruption.

> Well maybe should have the same check in gem mmap code to
> make sure no driver

Reads like the sentence is somehow cut off?

> > I think it would be better that instead of having special flags in
> > the ptes and vmas that you can't follow them to a page structure we
> > would add something to the page indicating that you can't grab a
> > reference to it. But this might break some use cases as well.
>
> Afaik the problem with that is that there's no free page bits left for
> these debug checks. Plus the pte+vma flags are the flags to make this
> clear already.

Maybe a GFP flag to set the page reference count to zero or something
like this?

Christian.

> -Daniel