Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Fri, Sep 11, 2020 at 5:20 AM Alexander Gordeev wrote: > > What if the entry is still pud_present, but got remapped after > READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere? That can't happen. The GUP walk doesn't hold any locks, but it *is* done with interrupts disabled, and anybody who is modifying the page tables needs to do the TLB flush, and/or RCU-free them. The interrupt disable means that on architectures where the TLB flush involves an IPI, it will be delayed until afterwards, but it also acts as a big RCU read lock hammer. So the page tables can get modified under us, but the old pages won't be released and re-used. Linus
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 07:11:16PM -0300, Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote: > > > Or am I way off here, and it really is possible (aside from the current > > s390 situation) to observe something that "is no longer a page table"? > > Yes, that is the issue. Remember there is no locking for GUP > fast. While a page table cannot be freed there is nothing preventing > the page table entry from being concurrently modified. > > Without the stack variable it looks like this: > >pud_t pud = READ_ONCE(*pudp); >if (!pud_present(pud)) > return >pmd_offset(pudp, address); > > And pmd_offset() expands to > > return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); > > Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value > of *pud can change, eg to !pud_present. > > Then pud_page_vaddr(*pud) will crash. It is not use after free, it > is using data that has not been validated. One thing I ask myself and it is probably a good moment to wonder. What if the entry is still pud_present, but got remapped after READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere? > Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Fri, Sep 11, 2020 at 09:09:39AM +0200, pet...@infradead.org wrote: > On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote: > > So, I suggest pXX_offset_unlocked() > > Urgh, no. Elsewhere in gup _unlocked() means it will take the lock > itself (get_user_pages_unlocked()) -- although often it seems to mean > the lock is already held (git grep _unlocked and marvel). > > What we want is _lockless(). This is clear to me! Thanks, Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote: > So, I suggest pXX_offset_unlocked() Urgh, no. Elsewhere in gup _unlocked() means it will take the lock itself (get_user_pages_unlocked()) -- although often it seems to mean the lock is already held (git grep _unlocked and marvel). What we want is _lockless().
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 07:57:49PM +0200, Gerald Schaefer wrote: > On Thu, 10 Sep 2020 10:02:33 -0300 > Jason Gunthorpe wrote: > > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > > Hopefully, one could make sense ot of it. > > > > I would say the page table API requires this invariant: > > > > pud = pud_offset(p4d, addr); > > do { > > WARN_ON(pud != pud_offset(p4d, addr); > > next = pud_addr_end(addr, end); > > } while (pud++, addr = next, addr != end); > > > > ie pud++ is supposed to be a shortcut for > > pud_offset(p4d, next) > > > > Hmm, IIUC, all architectures with static folding will simply return > the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level > pagetables. It is probably moot now, but since other arch's don't crash they also return pud_addr_end() == end so the loop only does one iteration. ie pud == pud_offset(p4d, addr) for all iterations as the pud++ never happens. Which is what this addr_end patch does for s390.. Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On 9/10/20 3:11 PM, Jason Gunthorpe wrote: On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote: Or am I way off here, and it really is possible (aside from the current s390 situation) to observe something that "is no longer a page table"? Yes, that is the issue. Remember there is no locking for GUP fast. While a page table cannot be freed there is nothing preventing the page table entry from being concurrently modified. OK, then we are saying the same thing after all, good. Without the stack variable it looks like this: pud_t pud = READ_ONCE(*pudp); if (!pud_present(pud)) return pmd_offset(pudp, address); And pmd_offset() expands to return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value of *pud can change, eg to !pud_present. Then pud_page_vaddr(*pud) will crash. It is not use after free, it is using data that has not been validated. Right, that matches what I had in mind, too: you can still have a problem even though you're in the same page table. I just wanted to confirm that there's not some odd way to launch out into completely non-page-table memory. thanks, -- John Hubbard NVIDIA
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote: > Or am I way off here, and it really is possible (aside from the current > s390 situation) to observe something that "is no longer a page table"? Yes, that is the issue. Remember there is no locking for GUP fast. While a page table cannot be freed there is nothing preventing the page table entry from being concurrently modified. Without the stack variable it looks like this: pud_t pud = READ_ONCE(*pudp); if (!pud_present(pud)) return pmd_offset(pudp, address); And pmd_offset() expands to return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value of *pud can change, eg to !pud_present. Then pud_page_vaddr(*pud) will crash. It is not use after free, it is using data that has not been validated. Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 12:32:05PM -0700, Linus Torvalds wrote: > Yeah, I get hung up on naming sometimes. I don't tend to care much > about private local variables ("i" is a perfectly fine variable name), > but these kinds of somewhat subtle cross-architecture definitions I > feel matter. One of the first replys to this patch was to ask "when would I use _orig vs normal", so you are not alone. The name should convey it.. So, I suggest pXX_offset_unlocked() Since it is safe to call without the page table lock, while pXX_offset() requires the page table lock to be held as the internal *pXX is a data race otherwise. Patch 1 might be OK for a stable backport, but to get to a clear pXX_offset_unlocked() all the arches would want to be changed to implement that API and the generic code would provide the wrapper: #define pXX_offset(pXXp, address) pXX_offset_unlocked(pXXp, *(pXXp), address) Arches would not have a *pXX inside their code. Then we can talk about auditing call sites of pXX_offset and think about using the _unlocked version in places where the page table lock is not held. For instance mm/pagewalk.c should be changed. So should huge_pte_offset() and probably other places. These places might already be exsting data-race bugs. It is code-as-documentation indicating an unlocked page table walk. Now it is not just a S390 story but a change that makes the data concurrency much clearer, so I think I prefer this version to the addr_end one too. Regards, Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On 9/10/20 11:13 AM, Jason Gunthorpe wrote: On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote: On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev wrote: It is only gup_fast case that exposes the issue. It hits because pointers to stack copies are passed to gup_pXd_range iterators, not pointers to real page tables itself. Can we possibly change fast-gup to not do the stack copies? I'd actually rather do something like that, than the "addr_end" thing. As you say, none of the other page table walking code does what the GUP code does, and I don't think it's required. As I understand it, the requirement is because fast-gup walks without the page table spinlock, or mmap_sem held so it must READ_ONCE the *pXX. It then checks that it is a valid page table pointer, then calls pXX_offset(). The arch implementation of pXX_offset() derefs again the passed pXX pointer. So it defeats the READ_ONCE and the 2nd load could observe something that is no longer a page table pointer and crash. Just to be clear, though, that makes it sound a little wilder and reckless than it really is, right? Because actually, the page tables cannot be freed while gup_fast is walking them, due to either IPI blocking during the walk, or the moral equivalent (MMU_GATHER_RCU_TABLE_FREE) for non-IPI architectures. So the pages tables can *change* underneath gup_fast, and for example pages can be unmapped. But they remain valid page tables, it's just that their contents are unstable. Even if pXd_none()==true. Or am I way off here, and it really is possible (aside from the current s390 situation) to observe something that "is no longer a page table"? thanks, -- John Hubbard NVIDIA
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 12:11 PM Gerald Schaefer wrote: > > That sounds a lot like the pXd_offset_orig() from Martins first approach > in this thread: > https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/ I have to admit to finding that name horrible, but aside from that, yes. I don't think "pXd_offset_orig()" makes any sense as a name. Yes, "orig" may make sense as the variable name (as in "this was the original value we read"), but a function name should describe what it *does*, not what the arguments are. Plus "original" doesn't make sense to me anyway, since we're not modifying it. To me, "original" means that there's a final version too, which this interface in no way implies. It's just "this is the value we already read". ("orig" does make some sense in that fault path - because by definition we *are* going to modify the page table entry, that's the whole point of the fault - we need to do something to not keep faulting. But here, we're not at all necessarily modifying the page table contents, we're just following them and readign the values once) Of course, I don't know what a better name would be to describe what is actually going on, I'm just explaining why I hate that naming. *Maybe* something like just "pXd_offset_value()" together with a comment explaining that it's given the upper pXd pointer _and_ the value behind it, and it needs to return the next level offset? I dunno. "value" doesn't really seem horribly descriptive either, but at least it doesn't feel actively misleading to me. Yeah, I get hung up on naming sometimes. I don't tend to care much about private local variables ("i" is a perfectly fine variable name), but these kinds of somewhat subtle cross-architecture definitions I feel matter. Linus
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, 10 Sep 2020 11:33:17 -0700 Linus Torvalds wrote: > On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe wrote: > > > > So.. To change away from the stack option I think we'd have to pass > > the READ_ONCE value to pXX_offset() as an extra argument instead of it > > derefing the pointer internally. > > Yeah, but I think that would actually be the better model than passing > an address to a random stack location. > > It's also effectively what we do in some other places, eg the whole > logic with "orig" in the regular pte fault handling is basically doing > unlocked loads of the pte, various decisions on that, and then doing a > final "is this still the same pte" after it has gotten the page table > lock. That sounds a lot like the pXd_offset_orig() from Martins first approach in this thread: https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/ It is also the "Patch 1" option from the start of this thread: https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/ I guess I chose wrongly there, should have had more trust in Martins approach, and not try so hard to do it like others... So, maybe we can start over again, from that patch option. It would of course also initially introduce some gup-specific helpers, like with the other approach. It seemed harder to generalize when I thought about it back then, but I guess it should not be a lot harder than the _addr_end stuff. Or, maybe this time, just not to risk Christian getting a heart attack, we could go for the gup-specific helper first, so that we would at least have a fix for the possible s390 data corruption. Jason, would you agree that we send a new RFC, this time with pXd_offset_orig() approach, and have that accepted as short-term fix? Or would you rather also wait for some proper generic change? Have lost that option from my radar, so cannot really judge how much more effort it would be. I'm on vacation next week anyway, but Alexander or Vasily (who did the option 1 patch) could look into this further.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe wrote: > > So.. To change away from the stack option I think we'd have to pass > the READ_ONCE value to pXX_offset() as an extra argument instead of it > derefing the pointer internally. Yeah, but I think that would actually be the better model than passing an address to a random stack location. It's also effectively what we do in some other places, eg the whole logic with "orig" in the regular pte fault handling is basically doing unlocked loads of the pte, various decisions on that, and then doing a final "is this still the same pte" after it has gotten the page table lock. (And yes, those other pte fault handling cases are different, since they _do_ hold the mmap lock, so they know the page *tables* are stable, and it's only the last level that then gets re-checked against the pte once the pte itself has also been stabilized with the page table lock). So I think it would actually be a better conceptual match to make the page table walking interface be "here, this is the value I read once carefully, and this is the address, now give me the next address". The folded case would then just return the address it was given, and the non-folded case would return the inner page table based on the value. I dunno. I don't actually feel all that strongly about this, so whatever works, I guess. Linus
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote: > On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev > wrote: > > > > It is only gup_fast case that exposes the issue. It hits because > > pointers to stack copies are passed to gup_pXd_range iterators, not > > pointers to real page tables itself. > > Can we possibly change fast-gup to not do the stack copies? > > I'd actually rather do something like that, than the "addr_end" thing. > As you say, none of the other page table walking code does what the > GUP code does, and I don't think it's required. As I understand it, the requirement is because fast-gup walks without the page table spinlock, or mmap_sem held so it must READ_ONCE the *pXX. It then checks that it is a valid page table pointer, then calls pXX_offset(). The arch implementation of pXX_offset() derefs again the passed pXX pointer. So it defeats the READ_ONCE and the 2nd load could observe something that is no longer a page table pointer and crash. Passing it the address of the stack value is a way to force pXX_offset() to use the READ_ONCE result which has already been tested to be a page table pointer. Other page walking code that holds the mmap_sem tends to use pmd_trans_unstable() which solves this problem by injecting a barrier. The load hidden in pte_offset() after a pmd_trans_unstable() can't be re-ordered and will only see a page table entry under the mmap_sem. However, I think that logic would have been much clearer following the GUP model of READ_ONCE vs extra reads and a hidden barrier. At least it took me a long time to work it out :( I also think there are real bugs here where places are reading *pXX multiple times without locking the page table. One was found recently in the wild in the huge tlb code IIRC. The mm/pagewalk.c has these missing READ_ONCE bugs too. So.. To change away from the stack option I think we'd have to pass the READ_ONCE value to pXX_offset() as an extra argument instead of it derefing the pointer internally. Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, 10 Sep 2020 10:02:33 -0300 Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > Hopefully, one could make sense ot of it. > > I would say the page table API requires this invariant: > > pud = pud_offset(p4d, addr); > do { > WARN_ON(pud != pud_offset(p4d, addr); > next = pud_addr_end(addr, end); > } while (pud++, addr = next, addr != end); > > ie pud++ is supposed to be a shortcut for > pud_offset(p4d, next) > Hmm, IIUC, all architectures with static folding will simply return the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level pagetables. There is no difference for s390. For gup_fast, that p4d pointer is not really a pointer to a value in a pagetable, but to some local copy of such a value, and not just for s390. So, pud = p4d = pointer to copy, and increasing that pud pointer cannot be the same as pud_offset(p4d, next). I do see your point however, at last I think :-) My problem is that I do not see where we would have an s390-specific issue here. Maybe my understanding of how it works for others with static folding is wrong. That would explain my difficulties in getting your point... > While S390 does not follow this. Fixing addr_end brings it into > alignment by preventing pud++ from happening. Exactly, only that nobody seems to follow it, IIUC. Fixing it up with pXd_addr_end was my impression of what we need to do, in order to have it work the same way as for others. > The only currently known side effect is that gup_fast crashes, but it > sure is an unexpected thing. Well, from my understanding it feels more unexpected that something that is supposed to be a pointer to an entry in a page table, really is just a pointer to some copy somewhere.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 07:07:57PM +0200, Gerald Schaefer wrote: > I might have lost track a bit. Are we still talking about possible > functional impacts of either our current pagetable walking with s390 > (apart from gup_fast), or the proposed generic change (for s390, or > others?)? I'm looking for an more understandable explanation what is wrong with the S390 implementation. If the page operations require the invariant I described then it is quite easy to explain the problem and understand the solution. Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev wrote: > > It is only gup_fast case that exposes the issue. It hits because > pointers to stack copies are passed to gup_pXd_range iterators, not > pointers to real page tables itself. Can we possibly change fast-gup to not do the stack copies? I'd actually rather do something like that, than the "addr_end" thing. As you say, none of the other page table walking code does what the GUP code does, and I don't think it's required. The GUP code is kind of strange, I'm not quite sure why. Some of it unusually came from the powerpc code that handled their special odd hugepage model, and that may be why it's so different. How painful would it be to just pass the pmd (etc) _pointers_ around, rather than do the odd "take the address of local copies"? Linus
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, 10 Sep 2020 12:10:26 -0300 Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote: > > On Thu, 10 Sep 2020 10:02:33 -0300 > > Jason Gunthorpe wrote: > > > > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > > > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > > > Hopefully, one could make sense ot of it. > > > > > > I would say the page table API requires this invariant: > > > > > > pud = pud_offset(p4d, addr); > > > do { > > > WARN_ON(pud != pud_offset(p4d, addr); > > > next = pud_addr_end(addr, end); > > > } while (pud++, addr = next, addr != end); > > > > > > ie pud++ is supposed to be a shortcut for > > > pud_offset(p4d, next) > > > > > > While S390 does not follow this. Fixing addr_end brings it into > > > alignment by preventing pud++ from happening. > > > > > > The only currently known side effect is that gup_fast crashes, but it > > > sure is an unexpected thing. > > > > It only is unexpected in a "top-level folding" world, see my other reply. > > Consider it an optimization, which was possible because of how our dynamic > > folding works, and e.g. because we can determine the correct pagetable > > level from a pXd value in pXd_offset. > > No, I disagree. The page walker API the arch presents has to have well > defined semantics. For instance, there is an effort to define tests > and invarients for the page table accesses to bring this understanding > and uniformity: > > mm/debug_vm_pgtable.c > > If we fix S390 using the pX_addr_end() change then the above should be > updated with an invariant to check it. I've added Anshuman for some > thoughts.. We are very aware of those tests, and actually a big supporter of the idea. Also part of the supported architectures already, and it has already helped us find / fix some s390 oddities. However, we did not see any issues wrt to our pagetable walking, neither with the current version, nor with the new generic approach. We do currently see other issues, Anshuman will know what I mean :-) > For better or worse, that invariant does exclude arches from using > other folding techniques. > > The other solution would be to address the other side of != and adjust > the pud++ > > eg replcae pud++ with something like: > pud = pud_next_entry(p4d, pud, next) > > Such that: > pud_next_entry(p4d, pud, next) === pud_offset(p4d, next) > > In which case the invarient changes to 'callers can never do pointer > arithmetic on the result of pXX_offset()' which is a bit harder to > enforce. I might have lost track a bit. Are we still talking about possible functional impacts of either our current pagetable walking with s390 (apart from gup_fast), or the proposed generic change (for s390, or others?)? Or is this rather some (other) generic issue / idea that you have, in order to put "some more structure / enforcement" to generic pagetable walkers?
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote: > On Thu, 10 Sep 2020 10:02:33 -0300 > Jason Gunthorpe wrote: > > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > > Hopefully, one could make sense ot of it. > > > > I would say the page table API requires this invariant: > > > > pud = pud_offset(p4d, addr); > > do { > > WARN_ON(pud != pud_offset(p4d, addr); > > next = pud_addr_end(addr, end); > > } while (pud++, addr = next, addr != end); > > > > ie pud++ is supposed to be a shortcut for > > pud_offset(p4d, next) > > > > While S390 does not follow this. Fixing addr_end brings it into > > alignment by preventing pud++ from happening. > > > > The only currently known side effect is that gup_fast crashes, but it > > sure is an unexpected thing. > > It only is unexpected in a "top-level folding" world, see my other reply. > Consider it an optimization, which was possible because of how our dynamic > folding works, and e.g. because we can determine the correct pagetable > level from a pXd value in pXd_offset. No, I disagree. The page walker API the arch presents has to have well defined semantics. For instance, there is an effort to define tests and invarients for the page table accesses to bring this understanding and uniformity: mm/debug_vm_pgtable.c If we fix S390 using the pX_addr_end() change then the above should be updated with an invariant to check it. I've added Anshuman for some thoughts.. For better or worse, that invariant does exclude arches from using other folding techniques. The other solution would be to address the other side of != and adjust the pud++ eg replcae pud++ with something like: pud = pud_next_entry(p4d, pud, next) Such that: pud_next_entry(p4d, pud, next) === pud_offset(p4d, next) In which case the invarient changes to 'callers can never do pointer arithmetic on the result of pXX_offset()' which is a bit harder to enforce. Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, 10 Sep 2020 10:02:33 -0300 Jason Gunthorpe wrote: > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > > > As Gerald mentioned, it is very difficult to explain in a clear way. > > Hopefully, one could make sense ot of it. > > I would say the page table API requires this invariant: > > pud = pud_offset(p4d, addr); > do { > WARN_ON(pud != pud_offset(p4d, addr); > next = pud_addr_end(addr, end); > } while (pud++, addr = next, addr != end); > > ie pud++ is supposed to be a shortcut for > pud_offset(p4d, next) > > While S390 does not follow this. Fixing addr_end brings it into > alignment by preventing pud++ from happening. > > The only currently known side effect is that gup_fast crashes, but it > sure is an unexpected thing. It only is unexpected in a "top-level folding" world, see my other reply. Consider it an optimization, which was possible because of how our dynamic folding works, and e.g. because we can determine the correct pagetable level from a pXd value in pXd_offset. > This suggests another fix, which is to say that pud++ is undefined and > pud_offset() must always be called, but I think that would cause worse > codegen on all other archs. There really is nothing to fix for s390 outside of gup_fast, or other potential future READ_ONCE pagetable walkers. We do take the side-effect of the generic change on all other pagetable walkers for s390, but it really is rather a slight degradation than a fix.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Wed, 9 Sep 2020 15:03:24 -0300 Jason Gunthorpe wrote: > On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote: > > I actually had to draw myself a picture to get some hold of > > this, or rather a walk-through with a certain pud-crossing > > range in a folded 3-level scenario. Not sure if I would have > > understood my explanation above w/o that, but I hope you can > > make some sense out of it. Or draw yourself a picture :-) > > What I don't understand is how does anything work with S390 today? That is totally comprehensible :-) > If the fix is only to change pxx_addr_end() then than generic code > like mm/pagewalk.c will iterate over a *different list* of page table > entries. > > It's choice of entries to look at is entirely driven by pxx_addr_end(). > > Which suggest to me that mm/pagewalk.c also doesn't work properly > today on S390 and this issue is not really about stack variables? I guess you are confused by the fact that the generic change will indeed change the logic for _all_ pagetable walkers on s390, not just for the gup_fast case. But that doesn't mean that they were doing it wrong before, we simply can do it both ways. However, we probably should make that (in theory useless) change more explicit. Let's compare before and after for mm/pagewalk.c on s390, with 3-level pagetables, range crossing 2 GB pud boundary. * Before (with pXd_addr_end always using static 5-level PxD_SIZE): walk_pgd_range() -> pgd_addr_end() will use static 2^53 PGDIR_SIZE, range is not cropped, no iterations needed, passed over to next level walk_p4d_range() -> p4d_addr_end() will use static 2^42 P4D_SIZE, range still not cropped walk_pud_range() -> pud_addr_end() now we're cropping, with 2^31 PUD_SIZE, need two iterations for range crossing pud boundary, doing that right here on a pudp which is actually the previously passed-through pgdp/p4dp (pointing to correct pagetable entry) * After (with dynamic pXd_addr_end using "correct" PxD_SIZE boundaries, should be similar to other archs static "top-level folding"): walk_pgd_range() -> pgd_addr_end() will now determine "correct" boundary based on pgd value, i.e. 2^31 PUD_SIZE, do cropping now, iteration will now happen here walk_p4d/pud_range() -> operate on cropped range, will not iterate, instead return to pgd level, which will then use the same pointer for iteration as in the "Before" case, but not on the same level. IMHO, our "Before" logic is more efficient, and also feels more natural. After all, it is not really necessary to return to pgd level, and it will surely cost some extra instructions. We are willing to take that cost for the sake of doing it in a more generic way, hoping that will reduce future issues. E.g. you already mentioned that you have plans for using the READ_ONCE logic also in other places, and that would be such a "future issue". > Fundamentally if pXX_offset() and pXX_addr_end() must be consistent > together, if pXX_offset() is folded then pXX_addr_end() must cause a > single iteration of that level. well, that sounds correct in theory, but I guess it depends on "how you fold it". E.g. what does "if pXX_offset() is folded" mean? Take pgd_offset() for the 3-level case above. From our previous "middle-level folding/iteration" perspective, I would say that pgd/p4d are folded into pud, so if you say "if pgd_offset() is folded then pgd_addr_end() must cause a single iteration of that level", we were doing it all correctly, i.e only having single iteration on pgd/p4d level. You could even say that all others are doing / using it wrong :-) Now take pgd_offset() from the "top-level folding/iteration". Here you would say that p4d/pud are folded into pgd, which again does not sound like the natural / most efficient way to me, but IIUC this has to be how it works for all other archs with (static) pagetable folding. Now you'd say "if pud/p4d_offset() is folded then pud/p4d_addr_end() must cause a single iteration of that level", and that would sound correct. At least until you look more closely, because e.g. p4d_addr_end() in include/asm-generic/pgtable-nop4d.h is simply this: #define p4d_addr_end(addr, end) (end) How can that cause a single iteration? It clearly won't, it only works because the previous pgd_addr_end already cropped the range so that there will be only single iterations for p4d/pud. The more I think of it, the more it sounds like s390 "middle-level folding/iteration" was doing it "the right way", and everybody else was wrong, or at least not in an optimally efficient way :-) Might also be that only we could do this because we can determine the pagetable level from a pagetable entry value. Anyway, if you are not yet confused enough, I recommend looking at the other option we had in mind, for fixing the gup_fast issue. See "Patch 1" from here: https://lore.kernel.org
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote: > As Gerald mentioned, it is very difficult to explain in a clear way. > Hopefully, one could make sense ot of it. I would say the page table API requires this invariant: pud = pud_offset(p4d, addr); do { WARN_ON(pud != pud_offset(p4d, addr); next = pud_addr_end(addr, end); } while (pud++, addr = next, addr != end); ie pud++ is supposed to be a shortcut for pud_offset(p4d, next) While S390 does not follow this. Fixing addr_end brings it into alignment by preventing pud++ from happening. The only currently known side effect is that gup_fast crashes, but it sure is an unexpected thing. This suggests another fix, which is to say that pud++ is undefined and pud_offset() must always be called, but I think that would cause worse codegen on all other archs. Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Wed, Sep 09, 2020 at 03:03:24PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote: > > I actually had to draw myself a picture to get some hold of > > this, or rather a walk-through with a certain pud-crossing > > range in a folded 3-level scenario. Not sure if I would have > > understood my explanation above w/o that, but I hope you can > > make some sense out of it. Or draw yourself a picture :-) > > What I don't understand is how does anything work with S390 today? > > If the fix is only to change pxx_addr_end() then than generic code > like mm/pagewalk.c will iterate over a *different list* of page table > entries. > > It's choice of entries to look at is entirely driven by pxx_addr_end(). > > Which suggest to me that mm/pagewalk.c also doesn't work properly > today on S390 and this issue is not really about stack variables? > > Fundamentally if pXX_offset() and pXX_addr_end() must be consistent > together, if pXX_offset() is folded then pXX_addr_end() must cause a > single iteration of that level. Your observation is correct. Another way to describe the problem is existing pXd_addr_end helpers could be applied to mismatching levels on s390 (e.g p4d_addr_end applied to pud or pgd_addr_end applied to p4d). As you noticed, all *_pXd_range iterators could be called with address ranges that exceed single pXd table. However, when it happens with pointers to real page tables (passed to *_pXd_range iterators) we still operate on valid tables, which just (lucky for us) happened to be folded. Thus we still reference correct table entries. It is only gup_fast case that exposes the issue. It hits because pointers to stack copies are passed to gup_pXd_range iterators, not pointers to real page tables itself. As Gerald mentioned, it is very difficult to explain in a clear way. Hopefully, one could make sense ot of it. > Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote: > I actually had to draw myself a picture to get some hold of > this, or rather a walk-through with a certain pud-crossing > range in a folded 3-level scenario. Not sure if I would have > understood my explanation above w/o that, but I hope you can > make some sense out of it. Or draw yourself a picture :-) What I don't understand is how does anything work with S390 today? If the fix is only to change pxx_addr_end() then than generic code like mm/pagewalk.c will iterate over a *different list* of page table entries. It's choice of entries to look at is entirely driven by pxx_addr_end(). Which suggest to me that mm/pagewalk.c also doesn't work properly today on S390 and this issue is not really about stack variables? Fundamentally if pXX_offset() and pXX_addr_end() must be consistent together, if pXX_offset() is folded then pXX_addr_end() must cause a single iteration of that level. Jason
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Wed, 9 Sep 2020 09:18:46 -0700 Dave Hansen wrote: > On 9/9/20 5:29 AM, Gerald Schaefer wrote: > > This only works well as long there are real pagetable pointers involved, > > that can also be used for iteration. For gup_fast, or any other future > > pagetable walkers using the READ_ONCE logic w/o lock, that is not true. > > There are pointers involved to local pXd values on the stack, because of > > the READ_ONCE logic, and our middle-level iteration will suddenly iterate > > over such stack pointers instead of pagetable pointers. > > By "There are pointers involved to local pXd values on the stack", did > you mean "locate" instead of "local"? That sentence confused me. > > Which code is it, exactly that allocates these troublesome on-stack pXd > values, btw? It is the gup_pXd_range() call sequence in mm/gup.c. It starts in gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)" and then the "pgd_t pgd = READ_ONCE(*pgdp)" which creates the first local stack variable "pgd". The next-level call to gup_p4d_range() gets this "pgd" value as input, but not the original pgdp pointer where it was read from. This is already the essential difference to other pagetable walkers like e.g. walk_pXd_range() in mm/pagewalk.c, where the original pointer is passed through. With READ_ONCE, that pointer must not be further de-referenced, so instead the value is passed over. In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)", with &pgd being a pointer to the passed over pgd value, so that's the first pXd pointer that does not point directly to the pXd in the page table, but a local stack variable. With folded p4d, p4d_offset(&pgd, addr) will simply return the passed-in &pgd pointer, so we now also have p4dp point to that. That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and that second stack variable passed to gup_huge_pud() and so on. Due to inlining, all those variables will not really be passed anywhere, but simply sit on the stack. So far, IIUC, that would also happen on x86 (or everywhere else actually) for folded levels, i.e. some pXd_offset() calls would simply return the passed in (stack) value pointer. This works as designed, and it will not lead to the "iteration over stack pointer" for anybody but s390, because the pXd_addr_end() boundaries usually take care that you always return to pgd level for iteration, and that is the only level with a real pagetable pointer. For s390, we stay at the first non-folded level and do the iteration there, which is fine for other pagetable walkers using the original pointers, but not for the READ_ONCE-style gup_fast. I actually had to draw myself a picture to get some hold of this, or rather a walk-through with a certain pud-crossing range in a folded 3-level scenario. Not sure if I would have understood my explanation above w/o that, but I hope you can make some sense out of it. Or draw yourself a picture :-)
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On 9/9/20 5:29 AM, Gerald Schaefer wrote: > This only works well as long there are real pagetable pointers involved, > that can also be used for iteration. For gup_fast, or any other future > pagetable walkers using the READ_ONCE logic w/o lock, that is not true. > There are pointers involved to local pXd values on the stack, because of > the READ_ONCE logic, and our middle-level iteration will suddenly iterate > over such stack pointers instead of pagetable pointers. By "There are pointers involved to local pXd values on the stack", did you mean "locate" instead of "local"? That sentence confused me. Which code is it, exactly that allocates these troublesome on-stack pXd values, btw? > This will be addressed by making the pXd_addr_end() dynamic, for which > we need to see the pXd value in order to determine its level / type. Thanks for the explanation!
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Tue, 8 Sep 2020 07:30:50 -0700 Dave Hansen wrote: > On 9/7/20 11:00 AM, Gerald Schaefer wrote: > > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > > code") introduced a subtle but severe bug on s390 with gup_fast, due to > > dynamic page table folding. > > Would it be fair to say that the "fake" page table entries s390 > allocates on the stack are what's causing the trouble here? That might > be a nice thing to open up with here. "Dynamic page table folding" > really means nothing to me. Sorry, I guess my previous reply does not really explain "what the heck is dynamic page table folding?". On s390, we can have different number of page table levels for different processes / mms. We always start with 3 levels, and update dynamically on process demand to 4 or 5 levels, hence the dynamic folding. Still, the PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will not reflect this dynamic behavior. For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE logic) this is no problem. With static folding, iteration over the folded levels will always happen at pgd level (top-level folding). For s390, we stay at the respective level and iterate there (dynamic middle-level folding), only return to pgd level if there really were 5 levels. This only works well as long there are real pagetable pointers involved, that can also be used for iteration. For gup_fast, or any other future pagetable walkers using the READ_ONCE logic w/o lock, that is not true. There are pointers involved to local pXd values on the stack, because of the READ_ONCE logic, and our middle-level iteration will suddenly iterate over such stack pointers instead of pagetable pointers. This will be addressed by making the pXd_addr_end() dynamic, for which we need to see the pXd value in order to determine its level / type.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Tue, 8 Sep 2020 07:30:50 -0700 Dave Hansen wrote: > On 9/7/20 11:00 AM, Gerald Schaefer wrote: > > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > > code") introduced a subtle but severe bug on s390 with gup_fast, due to > > dynamic page table folding. > > Would it be fair to say that the "fake" page table entries s390 > allocates on the stack are what's causing the trouble here? That might > be a nice thing to open up with here. "Dynamic page table folding" > really means nothing to me. We do not really allocate anything on the stack, it is the generic logic from gup_fast that passes over pXd values (read once before), and using pointers to such (stack) variables instead of real pXd pointers. That, combined with the fact that we just return the passed in pointer in pXd_offset() for folded levels. That works similar on x86 IIUC, but with static folding, and thus also proper pXd_addr_end() results because of statically (and correspondingly) defined Pxd_INDEX/SHIFT. We always have static 5-level PxD_INDEX/SHIFT, and that cannot really be made dynamic, so we just make pXd_addr_end() dynamic instead, and that requires the pXd value to determine the correct pagetable level. Still makes my head spin when trying to explain, sorry. It is a very special s390 oddity, or lets call it "feature", because I don't think any other architecture has "dynamic pagetable folding" capability, depending on process requirements, for whatever it is worth... > > > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long > > addr, unsigned long end, > > do { > > pmd_t pmd = READ_ONCE(*pmdp); > > > > - next = pmd_addr_end(addr, end); > > + next = pmd_addr_end_folded(pmd, addr, end); > > if (!pmd_present(pmd)) > > return 0; > > It looks like you fix this up later, but this would be a problem if left > this way. There's no documentation for whether I use > pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker. Yes, that is very unfortunate. We did have some lengthy comment in include/linux/pgtable.h where the pXd_addr_end(_folded) were defined. But that was moved to arch/s390/include/asm/pgtable.h in this version, probably because we already had the generalization in mind, where we would not need such explanation in common header any more. So, it might help better understand the issue that we have with dynamic page table folding and READ_ONCE-style pagetable walkers when looking at that comment. Thanks for pointing out, that comment should definitely go into include/linux/pgtable.h again. At least if we would still go for that "s390 fix first, generalization second" approach, but it seems we have other / better options now.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On 9/7/20 11:00 AM, Gerald Schaefer wrote: > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > code") introduced a subtle but severe bug on s390 with gup_fast, due to > dynamic page table folding. Would it be fair to say that the "fake" page table entries s390 allocates on the stack are what's causing the trouble here? That might be a nice thing to open up with here. "Dynamic page table folding" really means nothing to me. > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, > unsigned long end, > do { > pmd_t pmd = READ_ONCE(*pmdp); > > - next = pmd_addr_end(addr, end); > + next = pmd_addr_end_folded(pmd, addr, end); > if (!pmd_present(pmd)) > return 0; It looks like you fix this up later, but this would be a problem if left this way. There's no documentation for whether I use pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On Tue, 8 Sep 2020 14:40:10 +0200 Christophe Leroy wrote: > > > Le 08/09/2020 à 14:09, Christian Borntraeger a écrit : > > > > > > On 08.09.20 07:06, Christophe Leroy wrote: > >> > >> > >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : > >>> From: Alexander Gordeev > >>> > >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to > >>> dynamic page table folding. > >>> > >>> The question "What would it require for the generic code to work for s390" > >>> has already been discussed here > >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > >>> and ended with a promising approach here > >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 > >>> which in the end unfortunately didn't quite work completely. > >>> > >>> We tried to mimic static level folding by changing pgd_offset to always > >>> calculate top level page table offset, and do nothing in folded > >>> pXd_offset. > >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do > >>> not reflect this dynamic behaviour, and still act like static 5-level > >>> page tables. > >>> > >> > >> [...] > >> > >>> > >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an > >>> additional pXd entry value parameter, that can be used on s390 > >>> to determine the correct page table level and return corresponding > >>> end / boundary. With that, the pointer iteration will always > >>> happen in gup_pgd_range for s390. No change for other architectures > >>> introduced. > >> > >> Not sure pXd_addr_end_folded() is the best understandable name, allthough > >> I don't have any alternative suggestion at the moment. > >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in > >> the next patch, or pXd_addr_end_gup() ? > >> > >> Also, if it happens to be acceptable to get patch 2 in stable, I think you > >> should switch patch 1 and patch 2 to avoid the step through > >> pXd_addr_end_folded() > > > > given that this fixes a data corruption issue, wouldnt it be the best to go > > forward > > with this patch ASAP and then handle the other patches on top with all the > > time that > > we need? > > I have no strong opinion on this, but I feel rather tricky to have to > change generic part of GUP to use a new fonction then revert that change > in the following patch, just because you want the first patch in stable > and not the second one. > > Regardless, I was wondering, why do we need a reference to the pXd at > all when calling pXd_addr_end() ? > > Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the > passed addr ? Apart from performance impact when re-doing that what has already been done by the caller, I think we would also break the READ_ONCE semantics. After all, the pXd_offset() would also require some pXd pointer input, which we don't have. So we would need to start over again from mm->pgd. Also, it seems to be more in line with other primitives that take a pXd value or pointer.
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
Le 08/09/2020 à 14:09, Christian Borntraeger a écrit : On 08.09.20 07:06, Christophe Leroy wrote: Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : From: Alexander Gordeev Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. [...] Fix this by introducing new pXd_addr_end_folded helpers, which take an additional pXd entry value parameter, that can be used on s390 to determine the correct page table level and return corresponding end / boundary. With that, the pointer iteration will always happen in gup_pgd_range for s390. No change for other architectures introduced. Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment. Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ? Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded() given that this fixes a data corruption issue, wouldnt it be the best to go forward with this patch ASAP and then handle the other patches on top with all the time that we need? I have no strong opinion on this, but I feel rather tricky to have to change generic part of GUP to use a new fonction then revert that change in the following patch, just because you want the first patch in stable and not the second one. Regardless, I was wondering, why do we need a reference to the pXd at all when calling pXd_addr_end() ? Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the passed addr ? Christophe
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
On 08.09.20 07:06, Christophe Leroy wrote: > > > Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : >> From: Alexander Gordeev >> >> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast >> code") introduced a subtle but severe bug on s390 with gup_fast, due to >> dynamic page table folding. >> >> The question "What would it require for the generic code to work for s390" >> has already been discussed here >> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 >> and ended with a promising approach here >> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 >> which in the end unfortunately didn't quite work completely. >> >> We tried to mimic static level folding by changing pgd_offset to always >> calculate top level page table offset, and do nothing in folded pXd_offset. >> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do >> not reflect this dynamic behaviour, and still act like static 5-level >> page tables. >> > > [...] > >> >> Fix this by introducing new pXd_addr_end_folded helpers, which take an >> additional pXd entry value parameter, that can be used on s390 >> to determine the correct page table level and return corresponding >> end / boundary. With that, the pointer iteration will always >> happen in gup_pgd_range for s390. No change for other architectures >> introduced. > > Not sure pXd_addr_end_folded() is the best understandable name, allthough I > don't have any alternative suggestion at the moment. > Maybe could be something like pXd_addr_end_fixup() as it will disappear in > the next patch, or pXd_addr_end_gup() ? > > Also, if it happens to be acceptable to get patch 2 in stable, I think you > should switch patch 1 and patch 2 to avoid the step through > pXd_addr_end_folded() given that this fixes a data corruption issue, wouldnt it be the best to go forward with this patch ASAP and then handle the other patches on top with all the time that we need? > > >> >> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast >> code") >> Cc: # 5.2+ >> Reviewed-by: Gerald Schaefer >> Signed-off-by: Alexander Gordeev >> Signed-off-by: Gerald Schaefer >> --- >> arch/s390/include/asm/pgtable.h | 42 + >> include/linux/pgtable.h | 16 + >> mm/gup.c | 8 +++ >> 3 files changed, 62 insertions(+), 4 deletions(-) >> >> diff --git a/arch/s390/include/asm/pgtable.h >> b/arch/s390/include/asm/pgtable.h >> index 7eb01a5459cd..027206e4959d 100644 >> --- a/arch/s390/include/asm/pgtable.h >> +++ b/arch/s390/include/asm/pgtable.h >> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) >> } >> #define mm_pmd_folded(mm) mm_pmd_folded(mm) >> +/* >> + * With dynamic page table levels on s390, the static pXd_addr_end() >> functions >> + * will not return corresponding dynamic boundaries. This is no problem as >> long >> + * as only pXd pointers are passed down during page table walk, because >> + * pXd_offset() will simply return the given pointer for folded levels, and >> the >> + * pointer iteration over a range simply happens at the correct page table >> + * level. >> + * It is however a problem with gup_fast, or other places walking the page >> + * tables w/o locks using READ_ONCE(), and passing down the pXd values >> instead >> + * of pointers. In this case, the pointer given to pXd_offset() is a >> pointer to >> + * a stack variable, which cannot be used for pointer iteration at the >> correct >> + * level. Instead, the iteration then has to happen by going up to pgd level >> + * again. To allow this, provide pXd_addr_end_folded() functions with an >> + * additional pXd value parameter, which can be used on s390 to determine >> the >> + * folding level and return the corresponding boundary. >> + */ >> +static inline unsigned long rste_addr_end_folded(unsigned long rste, >> unsigned long addr, unsigned long end) > > What does 'rste' stands for ? > > Isn't this line a bit long ? this is region/segment table entry according to the architecture. On our platform we do have the pagetables with a different format that next levels (segment table -> 1MB granularity, region 3rd table -> 2 GB granularity, region 2nd table -> 4TB granularity, region 1st table -> 8 PB granularity. ST,R3,R2,R1 have the same format and are thus often called crste (combined region and segment table entry).
Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
Le 07/09/2020 à 20:00, Gerald Schaefer a écrit : From: Alexander Gordeev Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. [...] Fix this by introducing new pXd_addr_end_folded helpers, which take an additional pXd entry value parameter, that can be used on s390 to determine the correct page table level and return corresponding end / boundary. With that, the pointer iteration will always happen in gup_pgd_range for s390. No change for other architectures introduced. Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment. Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ? Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded() Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Cc: # 5.2+ Reviewed-by: Gerald Schaefer Signed-off-by: Alexander Gordeev Signed-off-by: Gerald Schaefer --- arch/s390/include/asm/pgtable.h | 42 + include/linux/pgtable.h | 16 + mm/gup.c| 8 +++ 3 files changed, 62 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..027206e4959d 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) } #define mm_pmd_folded(mm) mm_pmd_folded(mm) +/* + * With dynamic page table levels on s390, the static pXd_addr_end() functions + * will not return corresponding dynamic boundaries. This is no problem as long + * as only pXd pointers are passed down during page table walk, because + * pXd_offset() will simply return the given pointer for folded levels, and the + * pointer iteration over a range simply happens at the correct page table + * level. + * It is however a problem with gup_fast, or other places walking the page + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to + * a stack variable, which cannot be used for pointer iteration at the correct + * level. Instead, the iteration then has to happen by going up to pgd level + * again. To allow this, provide pXd_addr_end_folded() functions with an + * additional pXd value parameter, which can be used on s390 to determine the + * folding level and return the corresponding boundary. + */ +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end) What does 'rste' stands for ? Isn't this line a bit long ? +{ + unsigned long type = (rste & _REGION_ENTRY_TYPE_MASK) >> 2; + unsigned long size = 1UL << (_SEGMENT_SHIFT + type * 11); + unsigned long boundary = (addr + size) & ~(size - 1); + + /* +* FIXME The below check is for internal testing only, to be removed +*/ + VM_BUG_ON(type < (_REGION_ENTRY_TYPE_R3 >> 2)); + + return (boundary - 1) < (end - 1) ? boundary : end; +} + +#define pgd_addr_end_folded pgd_addr_end_folded +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, unsigned long end) +{ + return rste_addr_end_folded(pgd_val(pgd), addr, end); +} + +#define p4d_addr_end_folded p4d_addr_end_folded +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, unsigned long end) +{ + return rste_addr_end_folded(p4d_val(p4d), addr, end); +} + static inline int mm_has_pgste(struct mm_struct *mm) { #ifdef CONFIG_PGSTE diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e8cbc2e795d5..981c4c2a31fe 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm, }) #endif +#ifndef pgd_addr_end_folded +#define pgd_addr_end_folded(pgd, addr, end)pgd_addr_end(addr, end) +#endif + +#ifndef p4d_addr_end_folded +#define p4d_addr_end_folded(p4d, addr, end)p4d_addr_end(addr,
[RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
From: Alexander Gordeev Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. Here is an example of what happens with gup_fast on s390, for a task with 3-levels paging, crossing a 2 GB pud boundary: // addr = 0x1007000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteratation reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x1008000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } pud_addr_end = 0x1008000 is correct, but the previous pgd/p4d_addr_end should also have returned that limit, instead of the 5-level static pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end" parameter for gup_pud_range would also have been 0x1008000, and we would not iterate further in gup_pud_range, but rather go back and (correctly) do it in gup_pgd_range. So, for the second iteration in gup_pud_range, we will increase pudp, which pointed to a stack value and not the real pud table. This new pudp will then point to whatever lies behind the p4d stack value. In general, this happens to be the previously read pgd, but it probably could also be something different, depending on compiler decisions. Most unfortunately, if it happens to be the pgd value, which is the same as the p4d / pud due to folding, it is a valid and present entry. So after the increment, we would still point to the same pud entry. The addr however has been increased in the second iteration, so that we now have different pmd/pte_index values, which will result in very wrong behaviour for the remaining gup_pmd/pte_range calls. We will effectively operate on an address minus 2 GB, due to missing pudp increase. In the "good case", if nothing is mapped there, we will fall back to the slow gup path. But if something is mapped there, and valid for gup_fast, we will end up (silently) getting references on the wrong pages and also add the wrong pages to the **pages result array. This can cause data corruption. Fix this by introducing new pXd_addr_end_folded helpers, which take an additional pXd entry value parameter, that can be used on s390 to determine the correct page table level and return corresponding end / boundary. With that, the pointer iteration will always happen in gup_pgd_range for s390. No change for other architectures introduced. Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Cc: # 5.2+ Reviewed-by: Gerald Schaefer Signed-off-by: Alexander Gordeev Signed-off-by: Gerald Schaefer --- arch/s390/include/asm/pgtable.h | 42 + include/linux/pgtable.h | 16 + mm/gup.c| 8 +++ 3 files changed, 62 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..027206e4959d 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) } #define mm_pmd_folded(mm) mm_pmd_folded(mm) +/* + * With dynamic page table levels on s390, the static pXd_addr_end() functions + * will not return corresponding dynamic boundaries. This is no problem as long + * as only pXd pointers are passed down during page table walk, because + * pXd_offset() will simply return the given pointer for folded levels, and the + * pointer iteration over a range simply happens at the correct page table + * level. + * It is however a problem with gup_fast, or other places walking the page + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to + * a stack variable, which cannot be use