Re: [RFC] page-table walkers vs memory order
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
> On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:
> > A compiler could decide to dereference it using a non-faulting load,
> > do the calculations or whatever on the returned value of the
> > non-faulting load, and then check whether the load actually faulted,
> > and whether the address matched the prediction before it did a store
> > based on its guess.
>
> Or the compiler could record a recovery address in a per-thread variable
> before doing the speculative reference.  The page-fault handler could
> consult the per-thread variable and take appropriate action.

The difference is that I'd expect a compiler writer to think that they've
got a free hand in terms of instruction usage that the OS/library doesn't
see - if it's in the instruction manual and it's marked as user space and
non-faulting, I'd say it's fair game; once they know that they're going to
take a fault or mark pages specially, then they already know they're going
to have to cooperate with the OS, or worry about what other normal library
calls are going to do.

(A bit of googling seems to suggest IA64 and SPARC have played around with
non-faulting load optimisations, but I can't tell how much.)

> But both this approach and your approach are vulnerable to things like
> having the speculation area mapped to (say) MMIO space.  Not good!

Not good for someone doing MMIO, but from an evil-compiler point of view,
they might well assume that a pointer is to memory unless someone has made
an effort to tell them otherwise (not that there is a good standard to do
that).

> So I am with Andrea on this one -- there would need to be some handshake
> between kernel and compiler to avoid messing with possibly-unsafe
> mappings.  And I am still not much in favor of value speculation.  ;-)

Dave
--
 -Open up your eyes, open up your mind, open up your code ---
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ gro.gilbert @ treblig.org |                | In Hex /
 \ _|_ http://www.treblig.org |___/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
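[Editor's sketch] Dave's non-faulting-load sequence can be written out in plain C as a toy model. This is not real IA64/SPARC codegen: `nonfaulting_load`, `speculate_then_check`, and the pointer names are all invented for illustration, and the MMU's "is this mapped?" test is replaced by a simple address comparison. A real implementation would use something like IA64's `ld.s`/`chk.s` pair.

```c
#include <assert.h>
#include <stdbool.h>

struct spec_load { int value; bool faulted; };

static int backing = 99;          /* the one "mapped" word in the toy model */
static int unrelated;             /* a plausible-looking but wrong guess */
static int *live_ptr = &backing;  /* the pointer whose value gets speculated */

/* Stand-in for an IA64/SPARC-style non-faulting load: instead of
 * trapping on a bad address, it reports "faulted" in-band. */
static struct spec_load nonfaulting_load(const int *addr, const int *mapped)
{
	struct spec_load r = { 0, true };

	if (addr == mapped) {     /* toy replacement for the MMU check */
		r.value = *addr;
		r.faulted = false;
	}
	return r;
}

/* The full sequence Dave describes: load through the guess, do the
 * dependent work, then commit only if the load did not fault AND the
 * real pointer turns out to equal the guess; otherwise redo the work
 * non-speculatively via the real pointer. */
static int speculate_then_check(int *guess)
{
	struct spec_load s = nonfaulting_load(guess, &backing);
	int computed = s.value * 2;   /* work overlapped with the real load */
	int *actual = live_ptr;       /* the real (possibly slow) pointer load */

	if (!s.faulted && actual == guess)
		return computed;      /* speculation survived both checks */
	return *actual * 2;
}
```

Either way the result is the same; the only thing speculation buys is overlap between the guessed load and the real pointer load.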
Re: [RFC] page-table walkers vs memory order
On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:
> * Andrea Arcangeli (aarca...@redhat.com) wrote:
> > On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > > OK, I'll bite.  ;-)
> >
> > :))
> >
> > > The most sane way for this to happen is with feedback-driven techniques
> > > involving profiling, similar to what is done for basic-block reordering
> > > or branch prediction.  The idea is that you compile the kernel in an
> > > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > > the values of pointer loads and also measures the pointer-load latency.
> > > If a situation is found where a given pointer almost always has the
> > > same value but has high load latency (for example, is almost always a
> > > high-latency cache miss), this fact is recorded and fed back into a
> > > subsequent kernel build.  This subsequent kernel build might choose to
> > > speculate the value of the pointer concurrently with the pointer load.
> > >
> > > And of course, when interpreting the phrase "most sane way" at the
> > > beginning of the prior paragraph, it would probably be wise to keep
> > > in mind who wrote it.  And that "most sane way" might have little or
> > > no resemblance to anything that typical kernel hackers would consider
> > > anywhere near sanity.  ;-)
> >
> > I see.  The above scenario is surely a fair enough assumption.  We're
> > clearly stretching the constraints to see what is theoretically
> > possible, and this is a very clear explanation of how gcc could have a
> > hardcoded "guessed" address in the .text.
> >
> > The next step to clarify now is how gcc can safely dereference such a
> > "guessed" address without the kernel knowing about it.
> >
> > If gcc would really dereference a guessed address coming from a
> > profiling run without the kernel being aware of it, it would eventually
> > crash the kernel with an oops.  gcc cannot know what another CPU will
> > do with the kernel pagetables.  It'd be perfectly legitimate to
> > temporarily move the data at the "guessed address" to another page and
> > to update the pointer through stop_cpu during some weird "cpu
> > offlining scenario" or anything you can imagine.  I mean, gcc must
> > behave in all cases, so it's not allowed to dereference the guessed
> > address at any given time.
>
> A compiler could decide to dereference it using a non-faulting load,
> do the calculations or whatever on the returned value of the
> non-faulting load, and then check whether the load actually faulted,
> and whether the address matched the prediction before it did a store
> based on its guess.

Or the compiler could record a recovery address in a per-thread variable
before doing the speculative reference.  The page-fault handler could
consult the per-thread variable and take appropriate action.

But both this approach and your approach are vulnerable to things like
having the speculation area mapped to (say) MMIO space.  Not good!

So I am with Andrea on this one -- there would need to be some handshake
between kernel and compiler to avoid messing with possibly-unsafe
mappings.  And I am still not much in favor of value speculation.  ;-)

							Thanx, Paul
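[Editor's sketch] Paul's recovery-address scheme can be modeled in userspace C with setjmp/longjmp standing in for the trap path. Everything here is invented for illustration: `guarded_load` simulates the page-fault handler by checking the address instead of actually faulting, and the recovery point is a plain static (it would be per-thread, and consulted from the real fault handler, in the scheme Paul describes).

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf recovery_point;   /* per-thread in the real scheme */

static int mapped_word = 7;
static int unmapped_word;        /* stands in for a bad guessed address */
static int *real_ptr = &mapped_word;

/* Simulated page-fault handler: a real one would look up the
 * per-thread recovery address from the trap path and resume there. */
static int guarded_load(const int *addr)
{
	if (addr != &mapped_word)         /* "page not present" */
		longjmp(recovery_point, 1);
	return *addr;
}

static int read_with_recovery(int *guess)
{
	if (setjmp(recovery_point) == 0) {
		int speculated = guarded_load(guess);  /* may "fault" */
		if (real_ptr == guess)                 /* prediction check */
			return speculated;
	}
	/* recovery path: fall back to the plain, non-speculative load */
	return *real_ptr;
}
```

The MMIO objection in the mail applies directly: in this model the "fault" is recoverable, but a speculative load that reaches a device register has side effects no recovery path can undo.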
Re: [RFC] page-table walkers vs memory order
On Sun, Aug 05, 2012 at 12:47:05AM +0200, Andrea Arcangeli wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
>
> :))
>
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction.  The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build.  This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> >
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it.  And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity.  ;-)
>
> I see.  The above scenario is surely a fair enough assumption.  We're
> clearly stretching the constraints to see what is theoretically
> possible, and this is a very clear explanation of how gcc could have a
> hardcoded "guessed" address in the .text.
>
> The next step to clarify now is how gcc can safely dereference such a
> "guessed" address without the kernel knowing about it.
>
> If gcc would really dereference a guessed address coming from a
> profiling run without the kernel being aware of it, it would eventually
> crash the kernel with an oops.  gcc cannot know what another CPU will
> do with the kernel pagetables.  It'd be perfectly legitimate to
> temporarily move the data at the "guessed address" to another page and
> to update the pointer through stop_cpu during some weird "cpu
> offlining scenario" or anything you can imagine.  I mean, gcc must
> behave in all cases, so it's not allowed to dereference the guessed
> address at any given time.
>
> The only way gcc could do the alpha thing and dereference the guessed
> address before the real pointer is with cooperation from the kernel.
> The kernel should provide gcc "safe ranges" that won't crash the
> kernel, and/or gcc could provide a .fixup section similar to the
> current .fixup, and the kernel should look it up during the page fault
> handler in case the kernel is ok with temporarily getting faults in
> that range.  And in turn it can't happen unless we explicitly decide
> to allow gcc to do it.

And these are indeed some good reasons why I am not a fan of
pointer-value speculation.  ;-)

> > > Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
> > > pud/pgd can't prevent the compiler from reading a guessed pmdp
> > > address as a volatile variable, before reading the pmdp pointer and
> > > comparing it with the guessed address!  So if it's 5 you worry
> > > about, then adding ACCESS_ONCE in pudp/pgdp/pmdp is useless and
> > > won't fix it.  You should have added a barrier() instead.
> >
> > Most compiler writers I have discussed this with agreed that a volatile
> > cast would suppress value speculation.  The "volatile" keyword is not
> > all that well specified in the C and C++ standards, but as "nix" said
> > at http://lwn.net/Articles/509731/:
> >
> >	volatile's meaning as 'minimize optimizations applied to things
> >	manipulating anything of volatile type, do not duplicate, elide,
> >	move, fold, spindle or mutilate' is of long standing.
>
> Ok, so if the above optimization would be possible, volatile would
> stop it too; thanks for the quote and the explanation.
>
> On a side note, I believe there's a few barrier()s that may be worth
> converting to ACCESS_ONCE; that would take care of case 6) too, in
> addition to avoiding clobbering more CPU registers than strictly
> necessary.  Not very important, but a possible micro-optimization.

Agreed on both points.

> > That said, value speculation as a compiler optimization makes me a bit
> > nervous, so my current feeling is that it should be suppressed
> > entirely.
> >
> > Hey, you asked, even if only implicitly!  ;-)
>
> You're reading my mind!  :)

Or successfully carrying out value speculation on it.  ;-)

							Thanx, Paul
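[Editor's sketch] The "clobbering more CPU registers" point is the whole difference between the two primitives. The definitions below follow the kernel's own shape for this era (cf. include/linux/compiler.h and compiler-gcc.h); the toy `read_pmd_*` helpers are invented to show the two usage patterns side by side.

```c
#include <assert.h>

/* barrier() tells the compiler that all memory may have changed, so
 * every value it had cached in a register must be spilled/reloaded;
 * ACCESS_ONCE() forces exactly one access to exactly one object and
 * leaves everything else register-allocated. */
#define barrier() __asm__ __volatile__("" : : : "memory")
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static unsigned long pmd_slot = 0x1000;   /* toy page-table entry */

/* gup_fast-style pattern: one pinned read, then work on the copy. */
static unsigned long read_pmd_once(unsigned long *pmdp)
{
	unsigned long val = ACCESS_ONCE(*pmdp);
	return val;
}

/* The heavier alternative: an ordinary read fenced by barrier()s.
 * This also forbids duplicating or moving the load, but at the cost
 * of invalidating every cached value around it. */
static unsigned long read_pmd_barrier(unsigned long *pmdp)
{
	unsigned long val;

	barrier();
	val = *pmdp;
	barrier();
	return val;
}
```

Both functions return the same snapshot; the micro-optimization Andrea mentions is purely about how much unrelated register state the surrounding code gets to keep.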
Re: [RFC] page-table walkers vs memory order
* Andrea Arcangeli (aarca...@redhat.com) wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
>
> :))
>
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction.  The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build.  This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> >
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it.  And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity.  ;-)
>
> I see.  The above scenario is surely a fair enough assumption.  We're
> clearly stretching the constraints to see what is theoretically
> possible, and this is a very clear explanation of how gcc could have a
> hardcoded "guessed" address in the .text.
>
> The next step to clarify now is how gcc can safely dereference such a
> "guessed" address without the kernel knowing about it.
>
> If gcc would really dereference a guessed address coming from a
> profiling run without the kernel being aware of it, it would eventually
> crash the kernel with an oops.  gcc cannot know what another CPU will
> do with the kernel pagetables.  It'd be perfectly legitimate to
> temporarily move the data at the "guessed address" to another page and
> to update the pointer through stop_cpu during some weird "cpu
> offlining scenario" or anything you can imagine.  I mean, gcc must
> behave in all cases, so it's not allowed to dereference the guessed
> address at any given time.

A compiler could decide to dereference it using a non-faulting load,
do the calculations or whatever on the returned value of the
non-faulting load, and then check whether the load actually faulted,
and whether the address matched the prediction before it did a store
based on its guess.

Dave
--
 -Open up your eyes, open up your mind, open up your code ---
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ gro.gilbert @ treblig.org |                | In Hex /
 \ _|_ http://www.treblig.org |___/
Re: [RFC] page-table walkers vs memory order
On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> OK, I'll bite.  ;-)

:))

> The most sane way for this to happen is with feedback-driven techniques
> involving profiling, similar to what is done for basic-block reordering
> or branch prediction.  The idea is that you compile the kernel in an
> as-yet (and thankfully) mythical pointer-profiling mode, which records
> the values of pointer loads and also measures the pointer-load latency.
> If a situation is found where a given pointer almost always has the
> same value but has high load latency (for example, is almost always a
> high-latency cache miss), this fact is recorded and fed back into a
> subsequent kernel build.  This subsequent kernel build might choose to
> speculate the value of the pointer concurrently with the pointer load.
>
> And of course, when interpreting the phrase "most sane way" at the
> beginning of the prior paragraph, it would probably be wise to keep
> in mind who wrote it.  And that "most sane way" might have little or
> no resemblance to anything that typical kernel hackers would consider
> anywhere near sanity.  ;-)

I see.  The above scenario is surely a fair enough assumption.  We're
clearly stretching the constraints to see what is theoretically
possible, and this is a very clear explanation of how gcc could have a
hardcoded "guessed" address in the .text.

The next step to clarify now is how gcc can safely dereference such a
"guessed" address without the kernel knowing about it.

If gcc would really dereference a guessed address coming from a
profiling run without the kernel being aware of it, it would eventually
crash the kernel with an oops.  gcc cannot know what another CPU will
do with the kernel pagetables.  It'd be perfectly legitimate to
temporarily move the data at the "guessed address" to another page and
to update the pointer through stop_cpu during some weird "cpu
offlining scenario" or anything you can imagine.  I mean, gcc must
behave in all cases, so it's not allowed to dereference the guessed
address at any given time.

The only way gcc could do the alpha thing and dereference the guessed
address before the real pointer is with cooperation from the kernel.
The kernel should provide gcc "safe ranges" that won't crash the
kernel, and/or gcc could provide a .fixup section similar to the
current .fixup, and the kernel should look it up during the page fault
handler in case the kernel is ok with temporarily getting faults in
that range.  And in turn it can't happen unless we explicitly decide to
allow gcc to do it.

> > Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
> > pud/pgd can't prevent the compiler from reading a guessed pmdp
> > address as a volatile variable, before reading the pmdp pointer and
> > comparing it with the guessed address!  So if it's 5 you worry about,
> > then adding ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix
> > it.  You should have added a barrier() instead.
>
> Most compiler writers I have discussed this with agreed that a volatile
> cast would suppress value speculation.  The "volatile" keyword is not
> all that well specified in the C and C++ standards, but as "nix" said
> at http://lwn.net/Articles/509731/:
>
>	volatile's meaning as 'minimize optimizations applied to things
>	manipulating anything of volatile type, do not duplicate, elide,
>	move, fold, spindle or mutilate' is of long standing.

Ok, so if the above optimization would be possible, volatile would
stop it too; thanks for the quote and the explanation.

On a side note, I believe there's a few barrier()s that may be worth
converting to ACCESS_ONCE; that would take care of case 6) too, in
addition to avoiding clobbering more CPU registers than strictly
necessary.  Not very important, but a possible micro-optimization.

> That said, value speculation as a compiler optimization makes me a bit
> nervous, so my current feeling is that it should be suppressed entirely.
>
> Hey, you asked, even if only implicitly!  ;-)

You're reading my mind!  :)
Re: [RFC] page-table walkers vs memory order
On Sat, Aug 04, 2012 at 04:37:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > Since then, I think THP has made the rules more complicated; but I
> > believe Andrea paid a great deal of attention to that kind of issue.
>
> There were many issues, one unexpected was
> 1a5a9906d4e8d1976b701f889d8f35d54b928f25.

[ . . . ]

> 5) compiler behaving like alpha -> impossible (I may be wrong but I
>    believe so after thinking more on it)
>
> 6) I was told a decade ago by Honza to never touch any ram that can
>    change under the compiler unless it's declared volatile (could
>    crash over a switch/case statement implemented with a table if the
>    switch/case value is re-read by the compiler).  -> depends, we
>    don't always obey this rule; clearly gup_fast currently disobeys,
>    and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED
>    can zero the pmd).  If there's no "switch/case" I'm not aware of
>    other troubles.
>
> 7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
>    issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f
>
> Note: here I'm ignoring CPU reordering, this is only about the compiler.
>
> 5 is impossible because:
>
> a) the compiler can't read a guessed address or it can crash the
>    kernel
>
> b) the compiler has no memory to store a "guessed" valid address when
>    the function returns and the stack is unwound
>
> For the compiler to behave like alpha, the compiler should read the
> pteval before the pmdp, which it can't do, because it has no address to
> guess from and it would Oops if it really tried to guess it!
>
> So far it was said "the compiler can guess the address" but there was
> no valid explanation of how it could do it, and I don't see it, so
> please explain if I'm wrong about a and b above.

OK, I'll bite.  ;-)

The most sane way for this to happen is with feedback-driven techniques
involving profiling, similar to what is done for basic-block reordering
or branch prediction.  The idea is that you compile the kernel in an
as-yet (and thankfully) mythical pointer-profiling mode, which records
the values of pointer loads and also measures the pointer-load latency.
If a situation is found where a given pointer almost always has the
same value but has high load latency (for example, is almost always a
high-latency cache miss), this fact is recorded and fed back into a
subsequent kernel build.  This subsequent kernel build might choose to
speculate the value of the pointer concurrently with the pointer load.

And of course, when interpreting the phrase "most sane way" at the
beginning of the prior paragraph, it would probably be wise to keep
in mind who wrote it.  And that "most sane way" might have little or
no resemblance to anything that typical kernel hackers would consider
anywhere near sanity.  ;-)

> Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
> pud/pgd can't prevent the compiler from reading a guessed pmdp address
> as a volatile variable, before reading the pmdp pointer and comparing
> it with the guessed address!  So if it's 5 you worry about, then adding
> ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it.  You should
> have added a barrier() instead.

Most compiler writers I have discussed this with agreed that a volatile
cast would suppress value speculation.  The "volatile" keyword is not
all that well specified in the C and C++ standards, but as "nix" said
at http://lwn.net/Articles/509731/:

	volatile's meaning as 'minimize optimizations applied to things
	manipulating anything of volatile type, do not duplicate, elide,
	move, fold, spindle or mutilate' is of long standing.

That said, value speculation as a compiler optimization makes me a bit
nervous, so my current feeling is that it should be suppressed entirely.

Hey, you asked, even if only implicitly!  ;-)

							Thanx, Paul

> > I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> > gup_fast() breaks as many rules as it can, and in particular may
> > be racing with the freeing of page tables; but I'm not so sure
> > about the pagewalk mods - we could say "cannot do any harm",
> > but I don't like adding lines on that basis.
>
> I agree to add ACCESS_ONCE, but because it's case 2, 7 above and it
> could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED with
> pmd.
>
> The other part of the patch in pagewalk.c is superfluous and should be
> dropped: pud/pgd can't change in walk_page_table; it's required to
> hold the mmap_sem at least in read mode (it's not disabling irqs).
>
> The pmd side instead can change, but only with THP enabled, and only
> because of MADV_DONTNEED (never because of free_pgtables), but it's
> already fully handled through pmd_none_or_trans_huge_or_clear_bad.
> The ->pmd_entry callers are required to call pmd_trans_unstable()
> before proceeding as the pmd may have been zeroed out by the time we
> get there.
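[Editor's sketch] Andrea's item (6), quoted above, is about a switch lowered to a jump table: if the controlling value lives in RAM that another CPU can change and the compiler re-reads it between the bounds check and the table index, the second read can index outside the table. The safe shape is to switch on a local snapshot pinned by a volatile access. The function and names below are invented for illustration.

```c
#include <assert.h>
#include <string.h>

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int shared_state = 2;   /* may be changed by "another CPU" */

/* Switch on a one-shot snapshot, never on the live location, so the
 * compiler cannot legally re-read *state between the bounds check
 * and the jump-table dispatch. */
static const char *describe(int *state)
{
	int snap = ACCESS_ONCE(*state);

	switch (snap) {
	case 0: return "none";
	case 1: return "present";
	case 2: return "huge";
	default: return "bad";
	}
}
```

Had the body been `switch (*state)`, nothing in the C abstract machine stops the compiler from loading `*state` twice, which is exactly the crash mode Honza warned about.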
Re: [RFC] page-table walkers vs memory order
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> Since then, I think THP has made the rules more complicated; but I
> believe Andrea paid a great deal of attention to that kind of issue.

There were many issues, one unexpected was
1a5a9906d4e8d1976b701f889d8f35d54b928f25.

Keep in mind that when holding only mmap_sem in read mode (walk page
range speculative mode), or in gup_fast, the result is always undefined
and racy if on another CPU you have a munmap or mremap or any other pmd
mangling concurrently messing with the mapping you're walking; all we
have to do is not crash, it doesn't matter what happens.  The fact you
need a barrier() or ACCESS_ONCE to avoid a lockup in a
while (rcu_dereference()) is no good reason to worry about all possible
purely theoretical gcc issues.

One important thing that wasn't mentioned so far in this thread is also
that we entirely rely on gcc for all pagetable and device driver writes
(to do 1 movq instead of 8 movb), see native_set_pmd and writel.

We must separate all the different cases to avoid huge confusion:

1) tmp=*ptr, while(tmp) -> possible, needs barrier or better
   ACCESS_ONCE if possible

2) orig_pmd = *pmdp before do_wp_huge_page in memory.c -> possible,
   needs barrier() and it should be possible to convert to ACCESS_ONCE

3) native_set_pmd and friends -> possible but not worth fixing; I
   tried to fix it a decade ago for peace of mind, was advised to
   desist, and it hasn't bitten us yet

4) writel -> possible but same as 3

5) compiler behaving like alpha -> impossible (I may be wrong but I
   believe so after thinking more on it)

6) I was told a decade ago by Honza to never touch any ram that can
   change under the compiler unless it's declared volatile (could
   crash over a switch/case statement implemented with a table if the
   switch/case value is re-read by the compiler).  -> depends, we
   don't always obey this rule; clearly gup_fast currently disobeys,
   and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED
   can zero the pmd).  If there's no "switch/case" I'm not aware of
   other troubles.

7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
   issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f

Note: here I'm ignoring CPU reordering, this is only about the compiler.

5 is impossible because:

a) the compiler can't read a guessed address or it can crash the
   kernel

b) the compiler has no memory to store a "guessed" valid address when
   the function returns and the stack is unwound

For the compiler to behave like alpha, the compiler should read the
pteval before the pmdp, which it can't do, because it has no address to
guess from and it would Oops if it really tried to guess it!

So far it was said "the compiler can guess the address" but there was
no valid explanation of how it could do it, and I don't see it, so
please explain if I'm wrong about a and b above.

Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast
pud/pgd can't prevent the compiler from reading a guessed pmdp address
as a volatile variable, before reading the pmdp pointer and comparing
it with the guessed address!  So if it's 5 you worry about, then adding
ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it.  You should
have added a barrier() instead.

> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> gup_fast() breaks as many rules as it can, and in particular may
> be racing with the freeing of page tables; but I'm not so sure
> about the pagewalk mods - we could say "cannot do any harm",
> but I don't like adding lines on that basis.

I agree to add ACCESS_ONCE, but because it's case 2, 7 above and it
could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED with
pmd.

The other part of the patch in pagewalk.c is superfluous and should be
dropped: pud/pgd can't change in walk_page_table; it's required to
hold the mmap_sem at least in read mode (it's not disabling irqs).

The pmd side instead can change, but only with THP enabled, and only
because of MADV_DONTNEED (never because of free_pgtables), but it's
already fully handled through pmd_none_or_trans_huge_or_clear_bad.
The ->pmd_entry callers are required to call pmd_trans_unstable()
before proceeding as the pmd may have been zeroed out by the time we
get there.

The solution has zero barrier()/ACCESS_ONCE impact for THP=n.  If
there are smp_read_barrier_depends already in the alpha pte methods
we're fine.

Sorry for the long email, but without a list that separates all the
possible cases above, I don't think we can understand what we're
fixing in that patch and why the gup.c part is good.

Peter, I suggest resending the fix with a more detailed explanation of
the 2, 7 kind of race for gup.c only, and dropping the pagewalk.c part.

Thanks,
Andrea
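[Editor's sketch] The snapshot-then-classify shape behind pmd_none_or_trans_huge_or_clear_bad can be sketched in a few lines. This is not the kernel's code: `pmd_t`, `PMD_NONE`, and `PMD_TRANS_BIT` are toy stand-ins invented for the sketch, and the real function additionally handles the clear-bad case. What it shows is the rule Andrea states: read the unstable entry exactly once, then decide everything from the snapshot.

```c
#include <assert.h>

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

typedef unsigned long pmd_t;   /* toy stand-in for the real pmd type */

#define PMD_NONE      0UL
#define PMD_TRANS_BIT 0x80UL   /* invented "trans huge" flag bit */

/* Snapshot the entry exactly once, then classify the snapshot.  A
 * concurrent MADV_DONTNEED on another CPU may zero *pmdp at any
 * moment, so no decision may be based on a second read of the live
 * slot. */
static int pmd_none_or_trans_huge(pmd_t *pmdp)
{
	pmd_t val = ACCESS_ONCE(*pmdp);   /* the only read of *pmdp */

	if (val == PMD_NONE)
		return 1;                 /* vanished under us: skip it */
	if (val & PMD_TRANS_BIT)
		return 1;                 /* trans huge: must not be
					     walked as a page table */
	return 0;                         /* stable snapshot: walk using
					     'val', never *pmdp */
}
```

This is also why the pagewalk.c hunk is superfluous under mmap_sem: pud/pgd can't change there, so only the pmd level needs this one-shot discipline.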
Re: [RFC] page-table walkers vs memory order
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote: Since then, I think THP has made the rules more complicated; but I believe Andrea paid a great deal of attention to that kind of issue. There were many issues, one unexpected was 1a5a9906d4e8d1976b701f889d8f35d54b928f25. Keep in mind when holding only mmap_sem read mode (walk page range speculative mode) or gup_fast, the result is always undefined and racey if on the other CPU you have a munmap or mremap or any other pmd manging concurrently messing with the mapping you're walking, all we have to do is not to crash, it doesn't matter what happens. The fact you need a barrier() or ACCESS_ONCE to avoid a lockup in a while (rcu_dereference()), is no good reason to worry about all possible purely theoretical gcc issues. One important thing that wasn't mentioned so far in this thread is also that we entirely relay on gcc for all pagetable and device driver writes (to do 1 movq instead of 8 movb), see native_set_pmd and writel. We must separate all different cases to avoid huge confusion: 1) tmp=*ptr, while(tmp) - possible, needs barrier or better ACCESS_ONCE if possible 2) orig_pmd = *pmdp before do_wp_huge_page in memory.c - possible, needs barrier() and it should be possible to convert to ACCESS_ONCE 3) native_set_pmd and friends - possible but not worth fixing, tried to fix a decade ago for a peace of mind and I was suggested to desist and it didn't bite us yet 4) writel - possible but same as 3 5) compiler behaving like alpha - impossible (I may be wrong but I believe so after thinking more on it) 6) I was told a decade ago by Honza to never touch any ram that can change under the compiler unless it's declared volatile (could crash over switch/case statement implemented with a table if the switch/case value is re-read by the compiler). - depends, we don't always obey to this rule, clearly gup_fast currently disobeys and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED can zero the pmd). 
If there's no switch/case I'm not aware of other troubles. 7) barrier in pmd_none_or_trans_huge_or_clear_bad - possible, same issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f Note: here I'm ignoring CPU reordering, this is only about the compiler. 5 is impossible because: a) the compiler can't read a guessed address or it can crash the kernel b) the compiler has no memory to store a guessed valid address when the function return and the stack is unwind For the compiler to behave like alpha, the compiler should read the pteval before the pmdp, that it can't do, because it has no address to guess from and it would Oops if it really tries to guess it! So far it was said compiler can guess the address but there was no valid explanation of how it could do it, and I don't see it, so please explain if I'm wrong about the a, b above. Furthermore the ACCESS_ONCE that Peter's patch added to gup_fast pud/pgd can't prevent the compiler to read a guessed pmdp address as a volatile variable, before reading the pmdp pointer and compare it with the guessed address! So if it's 5 you worry about, when adding ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should have added a barrier() instead. I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary: gup_fast() breaks as many rules as it can, and in particular may be racing with the freeing of page tables; but I'm not so sure about the pagewalk mods - we could say cannot do any harm, but I don't like adding lines on that basis. I agree to add ACCESS_ONCE but because it's case 2, 7 above and it could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED with pmd. The other part of the patch in pagewalk.c is superflous and should be dropped: pud/pgd can't change in walk_page_table, it's required to hold the mmap_sem at least in read mode (it's not disabling irqs). 
The pmd side instead can change but only with THP enabled, and only because MADV_DONTNEED (never because of free_pgtables) but it's already fully handled through pmd_none_or_trans_huge_or_clear_bad. The -pmd_entry callers are required to call pmd_trans_unstable() before proceeding as the pmd may have been zeroed out by the time we get there. The solution is zero barrier()/ACCESS_ONCE impact for THP=n. If there are smp_read_barrier_depends already in alpha pte methods we're fine. Sorry for the long email but without a list that separates all possible cases above, I don't think we can understand what we're fixing in that patch and why the gup.c part is good. Peter, I suggest to resend the fix with a more detailed explanataion of the 2, 7 kind of race for gup.c only and drop the pagewalk.c. Thanks, Andrea -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] page-table walkers vs memory order
On Sat, Aug 04, 2012 at 04:37:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > Since then, I think THP has made the rules more complicated; but I
> > believe Andrea paid a great deal of attention to that kind of issue.
>
> There were many issues, one unexpected was
> 1a5a9906d4e8d1976b701f889d8f35d54b928f25.

[ . . . ]

> 5) compiler behaving like alpha - impossible (I may be wrong but I
>    believe so after thinking more on it)
>
> 6) I was told a decade ago by Honza to never touch any ram that can
>    change under the compiler unless it's declared volatile (could
>    crash over a switch/case statement implemented with a table, if the
>    switch/case value is re-read by the compiler). - depends, we don't
>    always obey this rule; clearly gup_fast currently disobeys, and
>    even the generic pmd_read_atomic still disobeys (MADV_DONTNEED can
>    zero the pmd). If there's no switch/case I'm not aware of other
>    troubles.
>
> 7) barrier in pmd_none_or_trans_huge_or_clear_bad - possible, same
>    issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f
>
> Note: here I'm ignoring CPU reordering, this is only about the
> compiler.
>
> 5 is impossible because:
>
> a) the compiler can't read a guessed address or it could crash the
>    kernel
>
> b) the compiler has no memory in which to store a guessed valid
>    address when the function returns and the stack is unwound
>
> For the compiler to behave like Alpha, it would have to read the
> pteval before the pmdp, which it can't do, because it has no address
> to guess from and it would Oops if it really tried to guess one! So
> far it was said the compiler can guess the address, but there was no
> valid explanation of how it could do it, and I don't see it, so please
> explain if I'm wrong about a and b above.

OK, I'll bite.  ;-)

The most sane way for this to happen is with feedback-driven techniques
involving profiling, similar to what is done for basic-block reordering
or branch prediction.

The idea is that you compile the kernel in an as-yet (and thankfully)
mythical pointer-profiling mode, which records the values of pointer
loads and also measures the pointer-load latency. If a situation is
found where a given pointer almost always has the same value but has
high load latency (for example, is almost always a high-latency cache
miss), this fact is recorded and fed back into a subsequent kernel
build. This subsequent kernel build might choose to speculate the value
of the pointer concurrently with the pointer load.

And of course, when interpreting the phrase "most sane way" at the
beginning of the prior paragraph, it would probably be wise to keep in
mind who wrote it. And that "most sane way" might have little or no
resemblance to anything that typical kernel hackers would consider
anywhere near sanity.  ;-)

> Furthermore, the ACCESS_ONCE that Peter's patch added to the gup_fast
> pud/pgd can't prevent the compiler from reading a guessed pmdp
> address as a volatile variable, before reading the pmdp pointer and
> comparing it with the guessed address! So if it's 5 you worry about,
> then adding ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix
> it. You should have added a barrier() instead.

Most compiler writers I have discussed this with agreed that a volatile
cast would suppress value speculation. The volatile keyword is not all
that well specified in the C and C++ standards, but as nix said at
http://lwn.net/Articles/509731/:

	volatile's meaning as 'minimize optimizations applied to things
	manipulating anything of volatile type, do not duplicate, elide,
	move, fold, spindle or mutilate' is of long standing.

That said, value speculation as a compiler optimization makes me a bit
nervous, so my current feeling is that it should be suppressed
entirely.

Hey, you asked, even if only implicitly!  ;-)

							Thanx, Paul

> > I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> > gup_fast() breaks as many rules as it can, and in particular may be
> > racing with the freeing of page tables; but I'm not so sure about
> > the pagewalk mods - we could say cannot do any harm, but I don't
> > like adding lines on that basis.
>
> I agree to add ACCESS_ONCE, but because it's case 2, 7 above and it
> could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED with
> the pmd.
>
> The other part of the patch in pagewalk.c is superfluous and should
> be dropped: pud/pgd can't change in walk_page_table, which is
> required to hold the mmap_sem at least in read mode (it's not
> disabling irqs). The pmd side instead can change, but only with THP
> enabled, and only because of MADV_DONTNEED (never because of
> free_pgtables), and it's already fully handled through
> pmd_none_or_trans_huge_or_clear_bad. The ->pmd_entry callers are
> required to call pmd_trans_unstable() before proceeding, as the pmd
> may have been zeroed out by the time we get there. The solution has
> zero barrier()/ACCESS_ONCE impact for THP=n.
Re: [RFC] page-table walkers vs memory order
On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> OK, I'll bite.  ;-)

:))

> The most sane way for this to happen is with feedback-driven
> techniques involving profiling, similar to what is done for
> basic-block reordering or branch prediction.
>
> The idea is that you compile the kernel in an as-yet (and thankfully)
> mythical pointer-profiling mode, which records the values of pointer
> loads and also measures the pointer-load latency. If a situation is
> found where a given pointer almost always has the same value but has
> high load latency (for example, is almost always a high-latency cache
> miss), this fact is recorded and fed back into a subsequent kernel
> build. This subsequent kernel build might choose to speculate the
> value of the pointer concurrently with the pointer load.
>
> And of course, when interpreting the phrase "most sane way" at the
> beginning of the prior paragraph, it would probably be wise to keep
> in mind who wrote it. And that "most sane way" might have little or
> no resemblance to anything that typical kernel hackers would consider
> anywhere near sanity.  ;-)

I see. The above scenario is surely a fair enough assumption. We're
clearly stretching the constraints to see what is theoretically
possible, and this is a very clear explanation of how gcc could have a
hardcoded guessed address in the .text.

The next step to clarify now is how gcc can safely dereference such a
guessed address without the kernel knowing about it. If gcc really
dereferenced a guessed address coming from a profiling run without the
kernel being aware of it, it would eventually crash the kernel with an
oops. gcc cannot know what another CPU will do with the kernel
pagetables. It'd be perfectly legitimate to temporarily move the data
at the guessed address to another page and to update the pointer
through stop_cpu during some weird cpu-offlining scenario, or anything
else you can imagine. I mean, gcc must behave in all cases, so it's not
allowed to dereference the guessed address at any given time.

The only way gcc could do the alpha thing and dereference the guessed
address before the real pointer is with cooperation from the kernel.
The kernel should provide gcc safe ranges that won't crash the kernel,
and/or gcc could provide a .fixup section similar to the current
.fixup, and the kernel should look it up during the page fault handler
in case the kernel is ok with temporarily getting faults in that range.
And in turn it can't happen unless we explicitly decide to allow gcc
to do it.

> > Furthermore, the ACCESS_ONCE that Peter's patch added to the
> > gup_fast pud/pgd can't prevent the compiler from reading a guessed
> > pmdp address as a volatile variable, before reading the pmdp
> > pointer and comparing it with the guessed address! So if it's 5 you
> > worry about, then adding ACCESS_ONCE in pudp/pgdp/pmdp is useless
> > and won't fix it. You should have added a barrier() instead.
>
> Most compiler writers I have discussed this with agreed that a
> volatile cast would suppress value speculation. The volatile keyword
> is not all that well specified in the C and C++ standards, but as nix
> said at http://lwn.net/Articles/509731/:
>
> 	volatile's meaning as 'minimize optimizations applied to things
> 	manipulating anything of volatile type, do not duplicate, elide,
> 	move, fold, spindle or mutilate' is of long standing.

Ok, so if the above optimization were possible, volatile would stop it
too; thanks for the quote and the explanation.

On a side note, I believe there are a few barrier()s that may be worth
converting to ACCESS_ONCE; that would take care of case 6) too, in
addition to avoiding clobbering more CPU registers than strictly
necessary. Not very important, but a possible microoptimization.

> That said, value speculation as a compiler optimization makes me a
> bit nervous, so my current feeling is that it should be suppressed
> entirely.
>
> Hey, you asked, even if only implicitly!  ;-)

You're reading my mind! :)
Re: [RFC] page-table walkers vs memory order
* Andrea Arcangeli (aarca...@redhat.com) wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
>
> :))
>
> > The most sane way for this to happen is with feedback-driven
> > techniques involving profiling, similar to what is done for
> > basic-block reordering or branch prediction.
> >
> > The idea is that you compile the kernel in an as-yet (and
> > thankfully) mythical pointer-profiling mode, which records the
> > values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always
> > a high-latency cache miss), this fact is recorded and fed back into
> > a subsequent kernel build. This subsequent kernel build might
> > choose to speculate the value of the pointer concurrently with the
> > pointer load.
> >
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it. And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would
> > consider anywhere near sanity.  ;-)
>
> I see. The above scenario is surely a fair enough assumption. We're
> clearly stretching the constraints to see what is theoretically
> possible, and this is a very clear explanation of how gcc could have
> a hardcoded guessed address in the .text.
>
> The next step to clarify now is how gcc can safely dereference such a
> guessed address without the kernel knowing about it. If gcc really
> dereferenced a guessed address coming from a profiling run without
> the kernel being aware of it, it would eventually crash the kernel
> with an oops. gcc cannot know what another CPU will do with the
> kernel pagetables. It'd be perfectly legitimate to temporarily move
> the data at the guessed address to another page and to update the
> pointer through stop_cpu during some weird cpu-offlining scenario, or
> anything else you can imagine.
>
> I mean, gcc must behave in all cases, so it's not allowed to
> dereference the guessed address at any given time.

A compiler could decide to dereference it using a non-faulting load, do
the calculations or whatever on the returned value of the non-faulting
load, and then check whether the load actually faulted, and whether the
address matched the prediction, before it did a store based on its
guess.

Dave
--
 -Open up your eyes, open up your mind, open up your code ---
/ Dr. David Alan Gilbert |  Running GNU/Linux  | Happy  \
\ gro.gilbert @ treblig.org |                  | In Hex /
 \ _|_ http://www.treblig.org                  |___/
Re: [RFC] page-table walkers vs memory order
On Sun, Aug 05, 2012 at 12:47:05AM +0200, Andrea Arcangeli wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
>
> :))
>
> > The most sane way for this to happen is with feedback-driven
> > techniques involving profiling, similar to what is done for
> > basic-block reordering or branch prediction.
> >
> > The idea is that you compile the kernel in an as-yet (and
> > thankfully) mythical pointer-profiling mode, which records the
> > values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always
> > a high-latency cache miss), this fact is recorded and fed back into
> > a subsequent kernel build. This subsequent kernel build might
> > choose to speculate the value of the pointer concurrently with the
> > pointer load.
> >
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it. And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would
> > consider anywhere near sanity.  ;-)
>
> I see. The above scenario is surely a fair enough assumption. We're
> clearly stretching the constraints to see what is theoretically
> possible, and this is a very clear explanation of how gcc could have
> a hardcoded guessed address in the .text.
>
> The next step to clarify now is how gcc can safely dereference such a
> guessed address without the kernel knowing about it. If gcc really
> dereferenced a guessed address coming from a profiling run without
> the kernel being aware of it, it would eventually crash the kernel
> with an oops. gcc cannot know what another CPU will do with the
> kernel pagetables. It'd be perfectly legitimate to temporarily move
> the data at the guessed address to another page and to update the
> pointer through stop_cpu during some weird cpu-offlining scenario, or
> anything else you can imagine.
>
> I mean, gcc must behave in all cases, so it's not allowed to
> dereference the guessed address at any given time.
>
> The only way gcc could do the alpha thing and dereference the guessed
> address before the real pointer is with cooperation from the kernel.
> The kernel should provide gcc safe ranges that won't crash the
> kernel, and/or gcc could provide a .fixup section similar to the
> current .fixup, and the kernel should look it up during the page
> fault handler in case the kernel is ok with temporarily getting
> faults in that range. And in turn it can't happen unless we
> explicitly decide to allow gcc to do it.

And these are indeed some good reasons why I am not a fan of
pointer-value speculation.  ;-)

> > > Furthermore, the ACCESS_ONCE that Peter's patch added to the
> > > gup_fast pud/pgd can't prevent the compiler from reading a
> > > guessed pmdp address as a volatile variable, before reading the
> > > pmdp pointer and comparing it with the guessed address! So if
> > > it's 5 you worry about, then adding ACCESS_ONCE in pudp/pgdp/pmdp
> > > is useless and won't fix it. You should have added a barrier()
> > > instead.
> >
> > Most compiler writers I have discussed this with agreed that a
> > volatile cast would suppress value speculation. The volatile
> > keyword is not all that well specified in the C and C++ standards,
> > but as nix said at http://lwn.net/Articles/509731/:
> >
> > 	volatile's meaning as 'minimize optimizations applied to things
> > 	manipulating anything of volatile type, do not duplicate, elide,
> > 	move, fold, spindle or mutilate' is of long standing.
>
> Ok, so if the above optimization were possible, volatile would stop
> it too; thanks for the quote and the explanation.
>
> On a side note, I believe there are a few barrier()s that may be
> worth converting to ACCESS_ONCE; that would take care of case 6) too,
> in addition to avoiding clobbering more CPU registers than strictly
> necessary. Not very important, but a possible microoptimization.

Agreed on both points.

> > That said, value speculation as a compiler optimization makes me a
> > bit nervous, so my current feeling is that it should be suppressed
> > entirely.
> >
> > Hey, you asked, even if only implicitly!  ;-)
>
> You're reading my mind! :)

Or successfully carrying out value speculation on it.  ;-)

							Thanx, Paul
Re: [RFC] page-table walkers vs memory order
On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:
> * Andrea Arcangeli (aarca...@redhat.com) wrote:
> > On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > > OK, I'll bite.  ;-)
> >
> > :))
> >
> > > The most sane way for this to happen is with feedback-driven
> > > techniques involving profiling, similar to what is done for
> > > basic-block reordering or branch prediction.
> > >
> > > The idea is that you compile the kernel in an as-yet (and
> > > thankfully) mythical pointer-profiling mode, which records the
> > > values of pointer loads and also measures the pointer-load
> > > latency. If a situation is found where a given pointer almost
> > > always has the same value but has high load latency (for example,
> > > is almost always a high-latency cache miss), this fact is
> > > recorded and fed back into a subsequent kernel build. This
> > > subsequent kernel build might choose to speculate the value of
> > > the pointer concurrently with the pointer load.
> > >
> > > And of course, when interpreting the phrase "most sane way" at
> > > the beginning of the prior paragraph, it would probably be wise
> > > to keep in mind who wrote it. And that "most sane way" might have
> > > little or no resemblance to anything that typical kernel hackers
> > > would consider anywhere near sanity.  ;-)
> >
> > I see. The above scenario is surely a fair enough assumption. We're
> > clearly stretching the constraints to see what is theoretically
> > possible, and this is a very clear explanation of how gcc could
> > have a hardcoded guessed address in the .text.
> >
> > The next step to clarify now is how gcc can safely dereference such
> > a guessed address without the kernel knowing about it. If gcc
> > really dereferenced a guessed address coming from a profiling run
> > without the kernel being aware of it, it would eventually crash the
> > kernel with an oops. gcc cannot know what another CPU will do with
> > the kernel pagetables. It'd be perfectly legitimate to temporarily
> > move the data at the guessed address to another page and to update
> > the pointer through stop_cpu during some weird cpu-offlining
> > scenario, or anything else you can imagine.
> >
> > I mean, gcc must behave in all cases, so it's not allowed to
> > dereference the guessed address at any given time.
>
> A compiler could decide to dereference it using a non-faulting load,
> do the calculations or whatever on the returned value of the
> non-faulting load, and then check whether the load actually faulted,
> and whether the address matched the prediction, before it did a store
> based on its guess.

Or the compiler could record a recovery address in a per-thread
variable before doing the speculative reference. The page-fault handler
could consult the per-thread variable and take appropriate action.

But both this approach and your approach are vulnerable to things like
having the speculation area mapped to (say) MMIO space. Not good!

So I am with Andrea on this one -- there would need to be some
handshake between kernel and compiler to avoid messing with
possibly-unsafe mappings. And I am still not much in favor of value
speculation.  ;-)

							Thanx, Paul
Re: [RFC] page-table walkers vs memory order
On Mon, Jul 30, 2012 at 08:21:40PM +0100, Jamie Lokier wrote:
> Paul E. McKenney wrote:
> > > Does some version of gcc, under the options which we insist upon,
> > > make such optimizations on any of the architectures which we
> > > support?
> >
> > Pretty much any production-quality compiler will do double-fetch
> > and old-value-reuse optimizations, the former especially on 32-bit
> > x86.  I don't know of any production-quality compilers that do value
> > speculation, which would make the compiler act like DEC Alpha
> > hardware, and I would hope that if this does appear, (1) we would
> > have warning and (2) it could be turned off.  But there has been a
> > lot of work on this topic, so we would be foolish to rule it out.
>
> GCC documentation for IA-64:
>
>     -msched-ar-data-spec
>     -mno-sched-ar-data-spec
>         (En/Dis)able data speculative scheduling after reload.  This
>         results in generation of ld.a instructions and the
>         corresponding check instructions (ld.c / chk.a).  The default
>         is 'enable'.
>
> I don't know if that results in value speculation of the relevant
> kind.

If I remember correctly, the chk.a instruction will detect failed
speculation via cache state and deal with the situation correctly, but
I really need to defer to someone with more recent IA-64 experience.

							Thanx, Paul
Re: [RFC] page-table walkers vs memory order
Paul E. McKenney wrote:
> > Does some version of gcc, under the options which we insist upon,
> > make such optimizations on any of the architectures which we
> > support?
>
> Pretty much any production-quality compiler will do double-fetch
> and old-value-reuse optimizations, the former especially on 32-bit
> x86.  I don't know of any production-quality compilers that do value
> speculation, which would make the compiler act like DEC Alpha
> hardware, and I would hope that if this does appear, (1) we would
> have warning and (2) it could be turned off.  But there has been a
> lot of work on this topic, so we would be foolish to rule it out.

GCC documentation for IA-64:

    -msched-ar-data-spec
    -mno-sched-ar-data-spec
        (En/Dis)able data speculative scheduling after reload.  This
        results in generation of ld.a instructions and the
        corresponding check instructions (ld.c / chk.a).  The default
        is 'enable'.

I don't know if that results in value speculation of the relevant kind.

-- Jamie
Re: [RFC] page-table walkers vs memory order
On Fri, Jul 27, 2012 at 12:22:46PM -0700, Hugh Dickins wrote:
> On Thu, 26 Jul 2012, Peter Zijlstra wrote:
> > On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> > > I do love the status quo, but an audit would be welcome.  When
> > > it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> > > smp_read_barrier_depends() and accompanying comments to be hidden
> > > away in the underlying macros or inlines where reasonable, rather
> > > than repeated all over; but I may have my priorities wrong on
> > > that.
>
> I notice from that old radix_tree thread you pointed to in the
> previous mail (for which many thanks: lots of meat to digest in there)
> that this is also Linus's preference.
>
> > Yeah, I was being lazy, and I totally forgot to actually look at the
> > alpha code.
> >
> > How about we do a generic (cribbed from rcu_dereference):
> >
> > #define page_table_deref(p)					\
> > ({								\
> > 	typeof(*p) *______p = (typeof(*p) __force *)ACCESS_ONCE(p);	\
> > 	smp_read_barrier_depends();				\
> > 	((typeof(*p) __force __kernel *)(______p));		\
> > })
> >
> > and use that all over to dereference page-tables.  That way all this
> > lives in one place.  Granted, I'll have to go edit all arch code,
> > but I seem to be doing that on a frequent basis anyway :/
>
> If you're convinced that we now have (or are in danger of growing)
> a number of places which need this safety, yes, I suppose so.
>
> Personally, I'd have gone for just adding the
> relatively-understandable ACCESS_ONCEs in all the arch/*/include/asm
> macros (which you're going to visit to make the above change), and
> leave the smp_read_barrier_depends() entirely in Alpha - one level of
> indirection less for the reader.  But that's just me, you're the one
> proposing to do the work, and you may have very good reason for the
> above.
>
> I'm unfamiliar with what value the __force __kernel annotations add.
>
> But I am interested to notice that you are only 6/9ths as insane as
> Paul: any chance of helping global underscore availability by not
> hoarding quite so many in there?

Heh!!!  The number of underscores for the original rcu_dereference()'s
local variable was the outcome of an argument about how obfuscated that
variable's name should be in order to avoid possible collisions with
names in the enclosing scope.  Nine leading underscores might seem
excessive, or even, as you say, insane, but on the other hand no name
collisions have ever come to my attention.  ;-)

							Thanx, Paul
Re: [RFC] page-table walkers vs memory order
On Thu, 26 Jul 2012, Peter Zijlstra wrote: > On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote: > > I do love the status quo, but an audit would be welcome. When > > it comes to patches, personally I tend to prefer ACCESS_ONCE() and > > smp_read_barrier_depends() and accompanying comments to be hidden away > > in the underlying macros or inlines where reasonable, rather than > > repeated all over; but I may have my priorities wrong on that. I notice from that old radix_tree thread you pointed to in the previous mail (for which many thanks: lots of meat to digest in there) that this is also Linus's preference. > > > > > Yeah, I was being lazy, and I totally forgot to actually look at the > alpha code. > > How about we do a generic (cribbed from rcu_dereference): > > #define page_table_deref(p) \ > ({\ > typeof(*p) *__p = (typeof(*p) __force *)ACCESS_ONCE(p);\ > smp_read_barrier_depends(); \ > ((typeof(*p) __force __kernel *)(__p)); \ > }) > > and use that all over to dereference page-tables. That way all this > lives in one place. Granted, I'll have to go edit all arch code, but I > seem to be doing that on a frequent basis anyway :/ If you're convinced that we now have (or are in danger of growing) a number of places which need this safety, yes, I suppose so. Personally, I'd have gone for just adding the relatively-understandable ACCESS_ONCEs in all the arch/*/include/asm macros (which you're going to visit to make the above change), and leave the smp_read_barrier_depends() entirely in Alpha - one level of indirection less for the reader. But that's just me, you're the one proposing to do the work, and you may have very good reason for the above. I'm unfamiliar with what value the __force __kernel annotations add. But I am interested to notice that you are only 6/9ths as insane as Paul: any chance of helping global underscore availability by not hoarding quite so many in there? 
Hugh
Re: [RFC] page-table walkers vs memory order
On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> I do love the status quo, but an audit would be welcome. When it comes to patches, personally I tend to prefer ACCESS_ONCE() and smp_read_barrier_depends() and accompanying comments to be hidden away in the underlying macros or inlines where reasonable, rather than repeated all over; but I may have my priorities wrong on that.

Yeah, I was being lazy, and I totally forgot to actually look at the alpha code.

How about we do a generic (cribbed from rcu_dereference):

#define page_table_deref(p)					\
({								\
	typeof(*p) *__p = (typeof(*p) __force *)ACCESS_ONCE(p);	\
	smp_read_barrier_depends();				\
	((typeof(*p) __force __kernel *)(__p));			\
})

and use that all over to dereference page-tables. That way all this lives in one place. Granted, I'll have to go edit all arch code, but I seem to be doing that on a frequent basis anyway :/
Re: [RFC] page-table walkers vs memory order
On Wed, 2012-07-25 at 15:09 -0700, Hugh Dickins wrote:
> We find out after it hits us, and someone studies the disassembly - if we're lucky enough to crash near the origin of the problem.

This is a rather painful way.. see https://lkml.org/lkml/2009/1/5/555 - we were lucky there in that the lack of ACCESS_ONCE() caused an infinite loop so we knew exactly where we got stuck.
Re: [RFC] page-table walkers vs memory order
On Wed, Jul 25, 2012 at 03:09:48PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > > >
> > > > > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
> > > >
> > > > The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
> > > >
> > > > ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.
> > >
> > > Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations.
> >
> > The compiler we use today can and has hurt us with double-fetching and old-value-reuse optimizations. There have been several that have "optimized" things like "while (foo)" into "tmp = foo; while (tmp)" in the Linux kernel, which have been dealt with by recoding.
>
> Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

;-) ;-) ;-)

> I consider the double-fetching ones more insidious, less obviously in need of the volatile cast.

Agreed!

> > You might argue that the compiler cannot reasonably apply such an optimization in some given case, but the compiler does much more detailed analysis of the code than most people are willing to do (certainly more than I am usually willing to do!), so I believe that a little paranoia is quite worthwhile.
> >
> > > There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.
> >
> > Unless we get the gcc folks to build and boot the Linux kernel as part of their test suite (maybe they already do, but not that I know of), how would either they or we know that they had deployed a destructive optimization?
>
> We find out after it hits us, and someone studies the disassembly - if we're lucky enough to crash near the origin of the problem.

Color me unreassured. ;-)

> > > Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support?
> >
> > Pretty much any production-quality compiler will do double-fetch and old-value-reuse optimizations, the former especially on 32-bit x86.
>
> That makes good sense, yes: so, under register pressure, they may refetch from global memory, instead of using a temporary on local stack.
>
> > I don't know of any production-quality compilers that do value speculation, which would make the compiler act like DEC Alpha hardware, and I would hope that if this does appear, (1) we would have warning and (2) it could be turned off. But there has been a lot of work on this topic, so we would be foolish to rule it out.
>
> I think you're justified in expecting both (1) and (2) there.

Here is hoping!

> > But the currently deployed optimizations can already cause enough trouble.
> >
> > > Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?
> > >
> > > I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know" if that's the case.
> > >
> > > But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?
> >
> > Both. If you are coding "while (foo)" where "foo" can be changed by an interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))" or something similar, because most compilers are happy to optimize your loop into an infinite loop in that case. There are places in the Linux kernel that would have
Re: [RFC] page-table walkers vs memory order
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > >
> > > > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
> > >
> > > The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
> > >
> > > ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.
> >
> > Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations.
>
> The compiler we use today can and has hurt us with double-fetching and old-value-reuse optimizations. There have been several that have "optimized" things like "while (foo)" into "tmp = foo; while (tmp)" in the Linux kernel, which have been dealt with by recoding.

Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

I consider the double-fetching ones more insidious, less obviously in need of the volatile cast.

> You might argue that the compiler cannot reasonably apply such an optimization in some given case, but the compiler does much more detailed analysis of the code than most people are willing to do (certainly more than I am usually willing to do!), so I believe that a little paranoia is quite worthwhile.
>
> > There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.
>
> Unless we get the gcc folks to build and boot the Linux kernel as part of their test suite (maybe they already do, but not that I know of), how would either they or we know that they had deployed a destructive optimization?

We find out after it hits us, and someone studies the disassembly - if we're lucky enough to crash near the origin of the problem.

> > Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support?
>
> Pretty much any production-quality compiler will do double-fetch and old-value-reuse optimizations, the former especially on 32-bit x86.

That makes good sense, yes: so, under register pressure, they may refetch from global memory, instead of using a temporary on local stack.

> I don't know of any production-quality compilers that do value speculation, which would make the compiler act like DEC Alpha hardware, and I would hope that if this does appear, (1) we would have warning and (2) it could be turned off. But there has been a lot of work on this topic, so we would be foolish to rule it out.

I think you're justified in expecting both (1) and (2) there.

> But the currently deployed optimizations can already cause enough trouble.
>
> > Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?
> >
> > I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know" if that's the case.
> >
> > But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?
>
> Both. If you are coding "while (foo)" where "foo" can be changed by an interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))" or something similar, because most compilers are happy to optimize your loop into an infinite loop in that case. There are places in the Linux kernel that would have problems if the compiler decided to refetch a value -- if a pointer was changed in the meantime, part of your code might be working on the old structure, and part on the new structure. This really can happen today, and this is why rcu_dereference() contains an ACCESS_ONCE().
>
> If you are making
Re: [RFC] page-table walkers vs memory order
On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > >
> > > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
> >
> > The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
> >
> > ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.
>
> Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations.

The compiler we use today can and has hurt us with double-fetching and old-value-reuse optimizations. There have been several that have "optimized" things like "while (foo)" into "tmp = foo; while (tmp)" in the Linux kernel, which have been dealt with by recoding.

You might argue that the compiler cannot reasonably apply such an optimization in some given case, but the compiler does much more detailed analysis of the code than most people are willing to do (certainly more than I am usually willing to do!), so I believe that a little paranoia is quite worthwhile.

> There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.

Unless we get the gcc folks to build and boot the Linux kernel as part of their test suite (maybe they already do, but not that I know of), how would either they or we know that they had deployed a destructive optimization?

> Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support?

Pretty much any production-quality compiler will do double-fetch and old-value-reuse optimizations, the former especially on 32-bit x86.

I don't know of any production-quality compilers that do value speculation, which would make the compiler act like DEC Alpha hardware, and I would hope that if this does appear, (1) we would have warning and (2) it could be turned off. But there has been a lot of work on this topic, so we would be foolish to rule it out.

But the currently deployed optimizations can already cause enough trouble.

> Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?
>
> I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know" if that's the case.
>
> But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?

Both. If you are coding "while (foo)" where "foo" can be changed by an interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))" or something similar, because most compilers are happy to optimize your loop into an infinite loop in that case. There are places in the Linux kernel that would have problems if the compiler decided to refetch a value -- if a pointer was changed in the meantime, part of your code might be working on the old structure, and part on the new structure. This really can happen today, and this is why rcu_dereference() contains an ACCESS_ONCE().

If you are making lockless non-atomic access to a variable, I strongly suggest ACCESS_ONCE() or something similar even if you cannot see how the compiler can mess you up, especially in cases involving a lot of inline functions. In this case, the compiler can be looking at quite a bit of code and optimizing across the entire mess.

/me wonders what he stepped into with this email thread. ;-)

							Thanx, Paul
Re: [RFC] page-table walkers vs memory order
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> >
> > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
>
> The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
>
> ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.

Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations.

There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.

Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support?

Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?

I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know" if that's the case.

But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?

Thanks, Hugh
Re: [RFC] page-table walkers vs memory order
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> >
> > While staring at mm/huge_memory.c I found a very under-commented smp_wmb() in __split_huge_page_map(). It turns out that its copied from __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference thereto).
> >
> > Now, afaict we're not good, as per that comment. Paul has since convinced some of us that compiler writers are pure evil and out to get us.
> >
> > Therefore we should do what rcu_dereference() does and use ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time we dereference a page-table pointer.
> >
> > In particular it looks like things like mm/memcontrol.c:mem_cgroup_count_precharge(), which use walk_page_range() under down_read(&mm->mmap_sem) and can thus be concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly itself) are quite broken on Alpha
>
> The Alpha pmd_offset() and pte_offset_map() already contain an smp_read_barrier_depends() (362a61ad from Nick); with comment that it's not needed on the pgd, and I presume the pud level is folded. Does Alpha really need more of them, as you have put below?
>
> > and subtly broken for those of us with 'creative' compilers.
>
> I don't want to fight against ACCESS_ONCE() (or barrier(): that's interesting, thank you, I hadn't seen it used as an ACCESS_ONCE() substitute before); but I do want to question it a little.
>
> I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
>
> Those may all be questions for Paul!

The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.

ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.

							Thanx, Paul

> > Should I go do a more extensive audit of page-table walkers or are we happy with the status quo?
>
> I do love the status quo, but an audit would be welcome. When it comes to patches, personally I tend to prefer ACCESS_ONCE() and smp_read_barrier_depends() and accompanying comments to be hidden away in the underlying macros or inlines where reasonable, rather than repeated all over; but I may have my priorities wrong on that.
>
> The last time we rewrote the main pgd-pud-pmd-pte walkers, we believed that no ACCESS_ONCE() was necessary, because although a pgd-pud-pmd entry might be racily instantiated at any instant, it could never change beneath us - the freeing of page tables happens only when we cannot reach them by other routes.
>
> (Never quite true: those _clear_bad() things can zero entries at any instant, but we're already in a bad place when those come into play, so we never worried about racing against them.)
>
> Since then, I think THP has made the rules more complicated; but I believe Andrea paid a great deal of attention to that kind of issue.
>
> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary: gup_fast() breaks as many rules as it can, and in particular may be racing with the freeing of page tables; but I'm not so sure about the pagewalk mods - we could say "cannot do any harm", but I don't like adding lines on that basis.
>
> Hugh
>
> > ---
> >  arch/x86/mm/gup.c |    6 +++---
> >  mm/pagewalk.c     |   24
> >  2 files changed, 27 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > index dd74e46..4958fb1 100644
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >
> >  	pmdp = pmd_offset(&pud, addr);
> >  	do {
> > -		pmd_t pmd = *pmdp;
> > +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> >
> >  		next = pmd_addr_end(addr, end);
> >  		/*
> > @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
> >
> >  	pudp = pud_offset(&pgd, addr);
> >  	do {
> > -		pud_t pud = *pudp;
> > +		pud_t pud = ACCESS_ONCE(*pudp);
> >
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(pud))
> > @@ -280,7 +280,7 @@ int
Re: [RFC] page-table walkers vs memory order
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> > While staring at mm/huge_memory.c I found a very under-commented smp_wmb() in __split_huge_page_map(). It turns out that it's copied from __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference thereto).
> >
> > Now, afaict we're not good, as per that comment. Paul has since convinced some of us that compiler writers are pure evil and out to get us.
> >
> > Therefore we should do what rcu_dereference() does and use ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time we dereference a page-table pointer.
> >
> > In particular it looks like things like mm/memcontrol.c:mem_cgroup_count_precharge(), which use walk_page_range() under down_read(&mm->mmap_sem) and can thus be concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly itself) are quite broken on Alpha
>
> The Alpha pmd_offset() and pte_offset_map() already contain an smp_read_barrier_depends() (362a61ad from Nick); with comment that it's not needed on the pgd, and I presume the pud level is folded. Does Alpha really need more of them, as you have put below?
>
> > and subtly broken for those of us with 'creative' compilers.
>
> I don't want to fight against ACCESS_ONCE() (or barrier(): that's interesting, thank you, I hadn't seen it used as an ACCESS_ONCE() substitute before); but I do want to question it a little.
>
> I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
>
> Those may all be questions for Paul!

The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.

ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.

							Thanx, Paul

> > Should I go do a more extensive audit of page-table walkers or are we happy with the status quo?
>
> I do love the status quo, but an audit would be welcome.
>
> When it comes to patches, personally I tend to prefer ACCESS_ONCE() and smp_read_barrier_depends() and accompanying comments to be hidden away in the underlying macros or inlines where reasonable, rather than repeated all over; but I may have my priorities wrong on that.
>
> The last time we rewrote the main pgd-pud-pmd-pte walkers, we believed that no ACCESS_ONCE() was necessary, because although a pgd-pud-pmd entry might be racily instantiated at any instant, it could never change beneath us - the freeing of page tables happens only when we cannot reach them by other routes. (Never quite true: those _clear_bad() things can zero entries at any instant, but we're already in a bad place when those come into play, so we never worried about racing against them.)
>
> Since then, I think THP has made the rules more complicated; but I believe Andrea paid a great deal of attention to that kind of issue.
>
> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary: gup_fast() breaks as many rules as it can, and in particular may be racing with the freeing of page tables; but I'm not so sure about the pagewalk mods - we could say "cannot do any harm", but I don't like adding lines on that basis.
>
> Hugh
>
> > ---
> >  arch/x86/mm/gup.c |    6 +++---
> >  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
> >  2 files changed, 27 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > index dd74e46..4958fb1 100644
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> >
> >  	pmdp = pmd_offset(&pud, addr);
> >  	do {
> > -		pmd_t pmd = *pmdp;
> > +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> >
> >  		next = pmd_addr_end(addr, end);
> >  		/*
> > @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
> >
> >  	pudp = pud_offset(&pgd, addr);
> >  	do {
> > -		pud_t pud = *pudp;
> > +		pud_t pud = ACCESS_ONCE(*pudp);
> >
> >  		next = pud_addr_end(addr, end);
> >  		if (pud_none(pud))
> > @@ -280,7 +280,7 @@ int
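[Editor's note: the ACCESS_ONCE() under discussion is, in kernels of this era, a volatile cast. Below is a userspace sketch of the same shape (the flag variable and function names are invented for illustration, not from the thread), showing the "while (foo)" hazard Paul describes elsewhere in this thread: without the volatile access, the compiler may legally hoist the load out of the loop.]

```c
#include <assert.h>

/*
 * Same shape as the kernel's include/linux/compiler.h definition of the
 * time: a volatile cast forces the compiler to emit exactly one load
 * (or store) per source-level access, instead of reusing a stale
 * register copy or refetching under register pressure.
 */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int stop_flag;

/*
 * Without ACCESS_ONCE() the compiler may turn "while (!stop_flag)" into
 * "tmp = stop_flag; while (!tmp)" -- an infinite loop if an interrupt
 * handler or another thread sets the flag.  The limit parameter only
 * keeps this single-threaded demo bounded.
 */
static int spin_until_set_once(int limit)
{
	int iters = 0;

	while (!ACCESS_ONCE(stop_flag) && iters < limit)
		iters++;
	return iters;
}

static void set_stop_flag(void)
{
	ACCESS_ONCE(stop_flag) = 1;	/* exactly one store */
}
```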
Re: [RFC] page-table walkers vs memory order
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
>
> The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
>
> ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.

Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations. There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.

Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support? Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?

I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know if that's the case."

But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?

Thanks,
Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] page-table walkers vs memory order
On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
> >
> > The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
> >
> > ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.
>
> Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations.

The compiler we use today can and has hurt us with double-fetching and old-value-reuse optimizations. There have been several that have optimized things like "while (foo)" into "tmp = foo; while (tmp)" in the Linux kernel, which have been dealt with by recoding.

You might argue that the compiler cannot reasonably apply such an optimization in some given case, but the compiler does much more detailed analysis of the code than most people are willing to do (certainly more than I am usually willing to do!), so I believe that a little paranoia is quite worthwhile.

> There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.

Unless we get the gcc folks to build and boot the Linux kernel as part of their test suite (maybe they already do, but not that I know of), how would either they or we know that they had deployed a destructive optimization?

> Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support?

Pretty much any production-quality compiler will do double-fetch and old-value-reuse optimizations, the former especially on 32-bit x86.

I don't know of any production-quality compilers that do value speculation, which would make the compiler act like DEC Alpha hardware, and I would hope that if this does appear, (1) we would have warning and (2) it could be turned off. But there has been a lot of work on this topic, so we would be foolish to rule it out.

But the currently deployed optimizations can already cause enough trouble.

> Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?
>
> I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know if that's the case."
>
> But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?

Both. If you are coding "while (foo)" where foo can be changed by an interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))" or something similar, because most compilers are happy to optimize your loop into an infinite loop in that case.

There are places in the Linux kernel that would have problems if the compiler decided to refetch a value -- if a pointer was changed in the meantime, part of your code might be working on the old structure, and part on the new structure. This really can happen today, and this is why rcu_dereference() contains an ACCESS_ONCE().

If you are making lockless non-atomic access to a variable, I strongly suggest ACCESS_ONCE() or something similar even if you cannot see how the compiler can mess you up, especially in cases involving a lot of inline functions. In this case, the compiler can be looking at quite a bit of code and optimizing across the entire mess.

/me wonders what he stepped into with this email thread. ;-)

							Thanx, Paul
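[Editor's note: the refetch hazard Paul describes - part of the code seeing the old structure and part the new - can be sketched in userspace. The struct and function names below are invented for illustration; the fix is the rcu_dereference() shape: snapshot the shared pointer exactly once, then work only on the snapshot.]

```c
#include <assert.h>

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct cfg {
	int lo;
	int hi;
};

static struct cfg cfg_a = { 1, 2 };
static struct cfg *shared_cfg = &cfg_a;	/* may be republished concurrently */

/*
 * Buggy shape: writing "shared_cfg->lo + shared_cfg->hi" permits the
 * compiler to load the pointer twice; a concurrent update between the
 * two loads mixes fields of the old and new structs.  Correct shape:
 * one volatile load of the pointer, then dereference only the local.
 */
static int cfg_sum(void)
{
	struct cfg *c = ACCESS_ONCE(shared_cfg);	/* exactly one load */

	return c->lo + c->hi;
}
```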
Re: [RFC] page-table walkers vs memory order
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
> > >
> > > The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
> > >
> > > ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.
> >
> > Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations.
>
> The compiler we use today can and has hurt us with double-fetching and old-value-reuse optimizations. There have been several that have optimized things like "while (foo)" into "tmp = foo; while (tmp)" in the Linux kernel, which have been dealt with by recoding.

Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

I consider the double-fetching ones more insidious, less obviously in need of the volatile cast.

> You might argue that the compiler cannot reasonably apply such an optimization in some given case, but the compiler does much more detailed analysis of the code than most people are willing to do (certainly more than I am usually willing to do!), so I believe that a little paranoia is quite worthwhile.
>
> > There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.
>
> Unless we get the gcc folks to build and boot the Linux kernel as part of their test suite (maybe they already do, but not that I know of), how would either they or we know that they had deployed a destructive optimization?

We find out after it hits us, and someone studies the disassembly - if we're lucky enough to crash near the origin of the problem.

> > Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support?
>
> Pretty much any production-quality compiler will do double-fetch and old-value-reuse optimizations, the former especially on 32-bit x86.

That makes good sense, yes: so, under register pressure, they may refetch from global memory, instead of using a temporary on local stack.

> I don't know of any production-quality compilers that do value speculation, which would make the compiler act like DEC Alpha hardware, and I would hope that if this does appear, (1) we would have warning and (2) it could be turned off. But there has been a lot of work on this topic, so we would be foolish to rule it out.

I think you're justified in expecting both (1) and (2) there.

> But the currently deployed optimizations can already cause enough trouble.
>
> > Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?
> >
> > I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know if that's the case."
> >
> > But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?
>
> Both. If you are coding "while (foo)" where foo can be changed by an interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))" or something similar, because most compilers are happy to optimize your loop into an infinite loop in that case.
>
> There are places in the Linux kernel that would have problems if the compiler decided to refetch a value -- if a pointer was changed in the meantime, part of your code might be working on the old structure, and part on the new structure. This really can happen today, and this is why rcu_dereference() contains an ACCESS_ONCE().
>
> If you are making lockless non-atomic access to a variable, I strongly suggest ACCESS_ONCE() or something similar even if you cannot see how the compiler can mess you up, especially in cases involving a lot of inline functions. In this case, the compiler can be looking at quite a bit of code and optimizing across the entire mess.
Re: [RFC] page-table walkers vs memory order
On Wed, Jul 25, 2012 at 03:09:48PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > > > I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?
> > > >
> > > > The problem is that, unless you tell it otherwise, the compiler is permitted to assume that the code that it is generating is the only thing active in that address space at that time. So the compiler might know that it already has a perfectly good copy of that value somewhere in its registers, or it might decide to fetch the value twice rather than once due to register pressure, either of which can be fatal in SMP code. And then there are more aggressive optimizations as well.
> > > >
> > > > ACCESS_ONCE() is a way of telling the compiler to access the value once, regardless of what cute single-threaded optimizations that it otherwise might want to apply.
> > >
> > > Right, but you say "might": I have never heard it asserted, that we do build the kernel with a compiler which actually makes such optimizations.
> >
> > The compiler we use today can and has hurt us with double-fetching and old-value-reuse optimizations. There have been several that have optimized things like "while (foo)" into "tmp = foo; while (tmp)" in the Linux kernel, which have been dealt with by recoding.
>
> Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

;-) ;-) ;-)

> I consider the double-fetching ones more insidious, less obviously in need of the volatile cast.

Agreed!

> > You might argue that the compiler cannot reasonably apply such an optimization in some given case, but the compiler does much more detailed analysis of the code than most people are willing to do (certainly more than I am usually willing to do!), so I believe that a little paranoia is quite worthwhile.
> >
> > > There's a lot of other surprising things which a compiler is permitted to do, but we would simply not use such a compiler to build the kernel.
> >
> > Unless we get the gcc folks to build and boot the Linux kernel as part of their test suite (maybe they already do, but not that I know of), how would either they or we know that they had deployed a destructive optimization?
>
> We find out after it hits us, and someone studies the disassembly - if we're lucky enough to crash near the origin of the problem.

Color me unreassured. ;-)

> > > Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support?
> >
> > Pretty much any production-quality compiler will do double-fetch and old-value-reuse optimizations, the former especially on 32-bit x86.
>
> That makes good sense, yes: so, under register pressure, they may refetch from global memory, instead of using a temporary on local stack.
>
> > I don't know of any production-quality compilers that do value speculation, which would make the compiler act like DEC Alpha hardware, and I would hope that if this does appear, (1) we would have warning and (2) it could be turned off. But there has been a lot of work on this topic, so we would be foolish to rule it out.
>
> I think you're justified in expecting both (1) and (2) there.

Here is hoping!

> > But the currently deployed optimizations can already cause enough trouble.
> >
> > > Or is there some other compiler in use on the kernel, which makes such optimizations? It seems a long time since I heard of building the kernel with icc. clang?
> > >
> > > I don't mind the answer "Yes, you idiot" - preferably with an example or two of which compiler and which piece of code it has bitten us on. I don't mind the answer "We just don't know if that's the case."
> > >
> > > But I'd like a better idea of how much to worry: is ACCESS_ONCE demonstrably needed today, or rather future-proofing and documentation?
> >
> > Both. If you are coding "while (foo)" where foo can be changed by an interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))" or something similar, because most compilers are happy to optimize your loop into an infinite loop in that case.
> >
> > There are places in the Linux kernel that would have problems if the compiler decided to refetch a value -- if a pointer was changed in the meantime, part of your code might be working on the old structure, and part on the new structure. This really can happen today, and this is why rcu_dereference() contains an ACCESS_ONCE().
Re: [RFC] page-table walkers vs memory order
On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> While staring at mm/huge_memory.c I found a very under-commented smp_wmb() in __split_huge_page_map(). It turns out that it's copied from __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference thereto).
>
> Now, afaict we're not good, as per that comment. Paul has since convinced some of us that compiler writers are pure evil and out to get us.
>
> Therefore we should do what rcu_dereference() does and use ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time we dereference a page-table pointer.
>
> In particular it looks like things like mm/memcontrol.c:mem_cgroup_count_precharge(), which use walk_page_range() under down_read(&mm->mmap_sem) and can thus be concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly itself) are quite broken on Alpha

The Alpha pmd_offset() and pte_offset_map() already contain an smp_read_barrier_depends() (362a61ad from Nick); with comment that it's not needed on the pgd, and I presume the pud level is folded. Does Alpha really need more of them, as you have put below?

> and subtly broken for those of us with 'creative' compilers.

I don't want to fight against ACCESS_ONCE() (or barrier(): that's interesting, thank you, I hadn't seen it used as an ACCESS_ONCE() substitute before); but I do want to question it a little.

I'm totally unclear whether the kernel ever gets built with these 'creative' compilers that you refer to. Is ACCESS_ONCE() a warning of where some future compiler would be permitted to mess with our assumptions? Or is it actually saving us already today? Would we know? Could there be a boottime test that would tell us? Is it likely that a future compiler would have an "--access_once" option that the kernel build would want to turn on?

Those may all be questions for Paul!

> Should I go do a more extensive audit of page-table walkers or are we happy with the status quo?

I do love the status quo, but an audit would be welcome.

When it comes to patches, personally I tend to prefer ACCESS_ONCE() and smp_read_barrier_depends() and accompanying comments to be hidden away in the underlying macros or inlines where reasonable, rather than repeated all over; but I may have my priorities wrong on that.

The last time we rewrote the main pgd-pud-pmd-pte walkers, we believed that no ACCESS_ONCE() was necessary, because although a pgd-pud-pmd entry might be racily instantiated at any instant, it could never change beneath us - the freeing of page tables happens only when we cannot reach them by other routes. (Never quite true: those _clear_bad() things can zero entries at any instant, but we're already in a bad place when those come into play, so we never worried about racing against them.)

Since then, I think THP has made the rules more complicated; but I believe Andrea paid a great deal of attention to that kind of issue.

I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary: gup_fast() breaks as many rules as it can, and in particular may be racing with the freeing of page tables; but I'm not so sure about the pagewalk mods - we could say "cannot do any harm", but I don't like adding lines on that basis.

Hugh

> ---
>  arch/x86/mm/gup.c |    6 +++---
>  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>
>  	pmdp = pmd_offset(&pud, addr);
>  	do {
> -		pmd_t pmd = *pmdp;
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
>
>  		next = pmd_addr_end(addr, end);
>  		/*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>
>  	pudp = pud_offset(&pgd, addr);
>  	do {
> -		pud_t pud = *pudp;
> +		pud_t pud = ACCESS_ONCE(*pudp);
>
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	local_irq_save(flags);
>  	pgdp = pgd_offset(mm, addr);
>  	do {
> -		pgd_t pgd = *pgdp;
> +		pgd_t pgd = ACCESS_ONCE(*pgdp);
>
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	int err = 0;
>
>  	pte = pte_offset_map(pmd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
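[Editor's note: the pairing the patch's comment describes - smp_wmb() on the allocator side versus a dependency barrier on the walker side - can be sketched in userspace. All names below are hypothetical stand-ins; smp_wmb() and smp_read_barrier_depends() are stubbed as compiler-only barriers, which is what they compile to on x86 (Alpha would need a real fence on the read side, which is the whole point).]

```c
#include <assert.h>
#include <stddef.h>

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define barrier() __asm__ __volatile__("" ::: "memory")
/* Userspace stand-ins; on x86 both are compiler-only barriers. */
#define smp_wmb()			barrier()
#define smp_read_barrier_depends()	barrier()

static int table[4];
static int *published;		/* stand-in for a page-table slot */

/* Writer, as in __pte_alloc(): fill the table, order the stores, then
 * publish the pointer. */
static void publish_table(void)
{
	table[0] = 123;
	smp_wmb();		/* contents visible before the pointer */
	ACCESS_ONCE(published) = table;
}

/* Reader, as in a page-table walker: one load of the slot, then a
 * dependency barrier before dereferencing through it. */
static int walk_table(void)
{
	int *p = ACCESS_ONCE(published);

	if (p == NULL)
		return -1;	/* not instantiated yet */
	smp_read_barrier_depends();
	return p[0];
}
```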
Re: [RFC] page-table walkers vs memory order
On Mon, Jul 23, 2012 at 07:34:30PM +0200, Peter Zijlstra wrote:
> While staring at mm/huge_memory.c I found a very under-commented smp_wmb() in __split_huge_page_map(). It turns out that it's copied from __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference thereto).
>
> Now, afaict we're not good, as per that comment. Paul has since convinced some of us that compiler writers are pure evil and out to get us.

I have seen the glint in their eyes when they discuss optimization techniques that you would not want your children to know about!

> Therefore we should do what rcu_dereference() does and use ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time we dereference a page-table pointer.
>
> In particular it looks like things like mm/memcontrol.c:mem_cgroup_count_precharge(), which use walk_page_range() under down_read(&mm->mmap_sem) and can thus be concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly itself) are quite broken on Alpha and subtly broken for those of us with 'creative' compilers.

Looks good to me, though given my ignorance of mm, not sure that counts for much.

							Thanx, Paul

> Should I go do a more extensive audit of page-table walkers or are we happy with the status quo?
>
> ---
>  arch/x86/mm/gup.c |    6 +++---
>  mm/pagewalk.c     |   24 ++++++++++++++++++++++++
>  2 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>
>  	pmdp = pmd_offset(&pud, addr);
>  	do {
> -		pmd_t pmd = *pmdp;
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);

Here, ACCESS_ONCE() suffices because this is x86-specific code. Core code would need to worry about Alpha, and would thus also need smp_read_barrier_depends(), as you in fact do have further down.

>  		next = pmd_addr_end(addr, end);
>  		/*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>
>  	pudp = pud_offset(&pgd, addr);
>  	do {
> -		pud_t pud = *pudp;
> +		pud_t pud = ACCESS_ONCE(*pudp);
>
>  		next = pud_addr_end(addr, end);
>  		if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	local_irq_save(flags);
>  	pgdp = pgd_offset(mm, addr);
>  	do {
> -		pgd_t pgd = *pgdp;
> +		pgd_t pgd = ACCESS_ONCE(*pgdp);
>
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	int err = 0;
>
>  	pte = pte_offset_map(pmd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	for (;;) {
>  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>  		if (err)
> @@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	int err = 0;
>
>  	pmd = pmd_offset(pud, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
> again:
>  		next = pmd_addr_end(addr, end);
> @@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	int err = 0;
>
>  	pud = pud_offset(pgd, addr);
> +	/*
> +	 * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +	 * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +	 * actual dereference of p[gum]d, but that's hidden deep within the
> +	 * bowels of {pte,pmd,pud}_offset.
> +	 */
> +	barrier();
> +	smp_read_barrier_depends();
>  	do {
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud)) {
> --