Re: [RFC] page-table walkers vs memory order

2012-08-04 Thread Dr. David Alan Gilbert
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
> On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:



> > A compiler could decide to dereference it using a non-faulting load,
> > do the calculations or whatever on the returned value of the non-faulting
> > load, and then check whether the load actually faulted, and whether the
> > address matched the prediction before it did a store based on its
> > guess.
> 
> Or the compiler could record a recovery address in a per-thread variable
> before doing the speculative reference.  The page-fault handler could
> consult the per-thread variable and take appropriate action.

The difference is that I'd expect a compiler writer to assume they have
a free hand in instruction usage that the OS/library never sees - if an
instruction is in the manual and is marked as user-space and
non-faulting, I'd say it's fair game. Once they know they're going to
take a fault or mark pages specially, they already know they have to
cooperate with the OS, or worry about what other normal library calls
are going to do.
(A bit of googling suggests IA64 and SPARC have played around with
non-faulting load optimisations, but I can't tell how much.)
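
To make that concrete, the shape I have in mind is roughly the sketch
below. This is illustrative C only: GUESSED_ADDR stands for a
profile-derived constant baked into .text, and __nonfaulting_load() is
a made-up stand-in for whatever speculative load the hardware offers -
no real compiler does this today.

struct item { int field; };

#define GUESSED_ADDR	((struct item *)0xffff880012345678UL)	/* hard-coded guess */

int speculate_field(struct item **pp)
{
	int faulted;
	/* hypothetical: yields garbage and sets 'faulted' instead of trapping */
	int spec = __nonfaulting_load(&GUESSED_ADDR->field, &faulted);

	struct item *p = *pp;			/* the real (slow) pointer load */

	if (!faulted && p == GUESSED_ADDR)
		return spec;			/* guess verified: use the early result */
	return p->field;			/* wrong guess or fault: ordinary path */
}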

> But both this approach and your approach are vulnerable to things like
> having the speculation area mapped to (say) MMIO space.  Not good!

Not good for someone doing MMIO, but from an evil-compiler point of
view, it might well assume that a pointer refers to ordinary memory
unless someone has made an effort to say otherwise (not that there is
a good standard way to do that).

> So I am with Andrea on this one -- there would need to be some handshake
> between kernel and compiler to avoid messing with possibly-unsafe
> mappings.  And I am still not much in favor of value speculation.  ;-)

Dave
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\ gro.gilbert @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: [RFC] page-table walkers vs memory order

2012-08-04 Thread Paul E. McKenney
On Sat, Aug 04, 2012 at 11:59:10PM +0100, Dr. David Alan Gilbert wrote:
> * Andrea Arcangeli (aarca...@redhat.com) wrote:
> > On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > > OK, I'll bite.  ;-)
> > 
> > :))
> > 
> > > The most sane way for this to happen is with feedback-driven techniques
> > > involving profiling, similar to what is done for basic-block reordering
> > > or branch prediction.  The idea is that you compile the kernel in an
> > > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > > the values of pointer loads and also measures the pointer-load latency.
> > > If a situation is found where a given pointer almost always has the
> > > same value but has high load latency (for example, is almost always a
> > > high-latency cache miss), this fact is recorded and fed back into a
> > > subsequent kernel build.  This subsequent kernel build might choose to
> > > speculate the value of the pointer concurrently with the pointer load.
> > > 
> > > And of course, when interpreting the phrase "most sane way" at the
> > > beginning of the prior paragraph, it would probably be wise to keep
> > > in mind who wrote it.  And that "most sane way" might have little or
> > > no resemblance to anything that typical kernel hackers would consider
> > > anywhere near sanity.  ;-)
> > 
> > I see. The above scenario is surely a fair enough assumption. We're
> > clearly stretching the constraints to see what is theoretically
> > possible, and this is a very clear explanation of how gcc could have a
> > hardcoded "guessed" address in the .text.
> > 
> > The next step to clarify now is how gcc can safely dereference such a
> > "guessed" address without the kernel knowing about it.
> > 
> > If gcc really dereferenced a guessed address coming from a
> > profiling run without the kernel being aware of it, it would eventually
> > crash the kernel with an oops. gcc cannot know what another CPU will
> > do with the kernel pagetables. It'd be perfectly legitimate to
> > temporarily move the data at the "guessed address" to another page and
> > to update the pointer through stop_cpu during some weird "cpu
> > offlining scenario" or anything else you can imagine. I mean gcc must
> > behave correctly in all cases, so it's not allowed to dereference the
> > guessed address at any given time.
> 
> A compiler could decide to dereference it using a non-faulting load,
> do the calculations or whatever on the returned value of the non-faulting
> load, and then check whether the load actually faulted, and whether the
> address matched the prediction before it did a store based on its
> guess.

Or the compiler could record a recovery address in a per-thread variable
before doing the speculative reference.  The page-fault handler could
consult the per-thread variable and take appropriate action.
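
(A minimal sketch of what I mean -- every name below is invented for
the illustration; no such task_struct field or compiler support exists:

	/* hypothetical new field in struct task_struct */
	unsigned long spec_recovery_ip;

	/* what the compiler would emit around a speculative dereference,
	 * using gcc's &&label extension to take the recovery address */
	current->spec_recovery_ip = (unsigned long)&&recover;
	val = *guessed_ptr;			/* may fault on a bad guess */
	current->spec_recovery_ip = 0;
	if (real_ptr == guessed_ptr)
		goto verified;			/* speculation paid off */
recover:
	current->spec_recovery_ip = 0;
	val = *real_ptr;			/* ordinary, non-speculative path */
verified:

	/* and in the page-fault handler, before deciding to oops: */
	if (current->spec_recovery_ip) {
		regs->ip = current->spec_recovery_ip;	/* resume at the recovery label */
		return;
	}

The compiler arms the recovery slot, and the fault handler uses it
instead of treating the fault as fatal.)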

But both this approach and your approach are vulnerable to things like
having the speculation area mapped to (say) MMIO space.  Not good!

So I am with Andrea on this one -- there would need to be some handshake
between kernel and compiler to avoid messing with possibly-unsafe
mappings.  And I am still not much in favor of value speculation.  ;-)

Thanx, Paul



Re: [RFC] page-table walkers vs memory order

2012-08-04 Thread Paul E. McKenney
On Sun, Aug 05, 2012 at 12:47:05AM +0200, Andrea Arcangeli wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
> 
> :))
> 
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction.  The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build.  This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> > 
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it.  And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity.  ;-)
> 
> I see. The above scenario is surely a fair enough assumption. We're
> clearly stretching the constraints to see what is theoretically
> possible, and this is a very clear explanation of how gcc could have a
> hardcoded "guessed" address in the .text.
> 
> The next step to clarify now is how gcc can safely dereference such a
> "guessed" address without the kernel knowing about it.
> 
> If gcc really dereferenced a guessed address coming from a
> profiling run without the kernel being aware of it, it would eventually
> crash the kernel with an oops. gcc cannot know what another CPU will
> do with the kernel pagetables. It'd be perfectly legitimate to
> temporarily move the data at the "guessed address" to another page and
> to update the pointer through stop_cpu during some weird "cpu
> offlining scenario" or anything else you can imagine. I mean gcc must
> behave correctly in all cases, so it's not allowed to dereference the
> guessed address at any given time.
> 
> The only way gcc could do the alpha thing and dereference the guessed
> address before the real pointer is with cooperation from the kernel.
> The kernel would have to provide gcc "safe ranges" that won't crash the
> kernel, and/or gcc could provide a .fixup section similar to the
> current .fixup, which the kernel would look up in the page fault
> handler in case the kernel is ok with temporarily getting faults in
> that range. And in turn it can't happen unless we explicitly decide to
> allow gcc to do it.

And these are indeed some good reasons why I am not a fan of pointer-value
speculation.  ;-)

> > > Furthermore, the ACCESS_ONCE that Peter's patch added to gup_fast
> > > pud/pgd can't prevent the compiler from reading a guessed pmdp address
> > > as a volatile variable before reading the pmdp pointer and comparing it
> > > with the guessed address! So if it's 5 you worry about, adding
> > > ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
> > > have added a barrier() instead.
> > 
> > Most compiler writers I have discussed this with agreed that a volatile
> > cast would suppress value speculation.  The "volatile" keyword is not
> > all that well specified in the C and C++ standards, but as "nix" said
> > at http://lwn.net/Articles/509731/:
> > 
> > volatile's meaning as 'minimize optimizations applied to things
> > manipulating anything of volatile type, do not duplicate, elide,
> > move, fold, spindle or mutilate' is of long standing.
> 
> Ok, so if the above optimization were possible, volatile would
> stop it too; thanks for the quote and the explanation.
> 
> On a side note, I believe there are a few barrier()s that may be worth
> converting to ACCESS_ONCE; that would take care of case 6) too, in
> addition to avoiding clobbering more CPU registers than strictly
> necessary. Not very important, but a possible micro-optimization.

Agreed on both points.

> > That said, value speculation as a compiler optimization makes me a bit
> > nervous, so my current feeling is that it should be suppressed entirely.
> > 
> > Hey, you asked, even if only implicitly!  ;-)
> 
> You're reading my mind! :)

Or successfully carrying out value speculation on it.  ;-)

Thanx, Paul



Re: [RFC] page-table walkers vs memory order

2012-08-04 Thread Dr. David Alan Gilbert
* Andrea Arcangeli (aarca...@redhat.com) wrote:
> On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> > OK, I'll bite.  ;-)
> 
> :))
> 
> > The most sane way for this to happen is with feedback-driven techniques
> > involving profiling, similar to what is done for basic-block reordering
> > or branch prediction.  The idea is that you compile the kernel in an
> > as-yet (and thankfully) mythical pointer-profiling mode, which records
> > the values of pointer loads and also measures the pointer-load latency.
> > If a situation is found where a given pointer almost always has the
> > same value but has high load latency (for example, is almost always a
> > high-latency cache miss), this fact is recorded and fed back into a
> > subsequent kernel build.  This subsequent kernel build might choose to
> > speculate the value of the pointer concurrently with the pointer load.
> > 
> > And of course, when interpreting the phrase "most sane way" at the
> > beginning of the prior paragraph, it would probably be wise to keep
> > in mind who wrote it.  And that "most sane way" might have little or
> > no resemblance to anything that typical kernel hackers would consider
> > anywhere near sanity.  ;-)
> 
> I see. The above scenario is surely a fair enough assumption. We're
> clearly stretching the constraints to see what is theoretically
> possible, and this is a very clear explanation of how gcc could have a
> hardcoded "guessed" address in the .text.
> 
> The next step to clarify now is how gcc can safely dereference such a
> "guessed" address without the kernel knowing about it.
> 
> If gcc really dereferenced a guessed address coming from a
> profiling run without the kernel being aware of it, it would eventually
> crash the kernel with an oops. gcc cannot know what another CPU will
> do with the kernel pagetables. It'd be perfectly legitimate to
> temporarily move the data at the "guessed address" to another page and
> to update the pointer through stop_cpu during some weird "cpu
> offlining scenario" or anything else you can imagine. I mean gcc must
> behave correctly in all cases, so it's not allowed to dereference the
> guessed address at any given time.

A compiler could decide to dereference it using a non-faulting load,
do the calculations or whatever on the returned value of the non-faulting
load, and then check whether the load actually faulted, and whether the
address matched the prediction before it did a store based on its
guess.

Dave
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\ gro.gilbert @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: [RFC] page-table walkers vs memory order

2012-08-04 Thread Andrea Arcangeli
On Sat, Aug 04, 2012 at 03:02:45PM -0700, Paul E. McKenney wrote:
> OK, I'll bite.  ;-)

:))

> The most sane way for this to happen is with feedback-driven techniques
> involving profiling, similar to what is done for basic-block reordering
> or branch prediction.  The idea is that you compile the kernel in an
> as-yet (and thankfully) mythical pointer-profiling mode, which records
> the values of pointer loads and also measures the pointer-load latency.
> If a situation is found where a given pointer almost always has the
> same value but has high load latency (for example, is almost always a
> high-latency cache miss), this fact is recorded and fed back into a
> subsequent kernel build.  This subsequent kernel build might choose to
> speculate the value of the pointer concurrently with the pointer load.
> 
> And of course, when interpreting the phrase "most sane way" at the
> beginning of the prior paragraph, it would probably be wise to keep
> in mind who wrote it.  And that "most sane way" might have little or
> no resemblance to anything that typical kernel hackers would consider
> anywhere near sanity.  ;-)

I see. The above scenario is surely a fair enough assumption. We're
clearly stretching the constraints to see what is theoretically
possible, and this is a very clear explanation of how gcc could have a
hardcoded "guessed" address in the .text.

The next step to clarify now is how gcc can safely dereference such a
"guessed" address without the kernel knowing about it.

If gcc really dereferenced a guessed address coming from a
profiling run without the kernel being aware of it, it would eventually
crash the kernel with an oops. gcc cannot know what another CPU will
do with the kernel pagetables. It'd be perfectly legitimate to
temporarily move the data at the "guessed address" to another page and
to update the pointer through stop_cpu during some weird "cpu
offlining scenario" or anything else you can imagine. I mean gcc must
behave correctly in all cases, so it's not allowed to dereference the
guessed address at any given time.

The only way gcc could do the alpha thing and dereference the guessed
address before the real pointer is with cooperation from the kernel.
The kernel would have to provide gcc "safe ranges" that won't crash the
kernel, and/or gcc could provide a .fixup section similar to the
current .fixup, which the kernel would look up in the page fault
handler in case the kernel is ok with temporarily getting faults in
that range. And in turn it can't happen unless we explicitly decide to
allow gcc to do it.
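
To make the cooperation concrete: the kernel already has fault-fixup
machinery that a guessed dereference could, in principle, be routed
through. A sketch (the guess-and-verify split is hypothetical, but
probe_kernel_read() is real and returns -EFAULT instead of oopsing
when the address isn't mapped):

#include <linux/uaccess.h>	/* probe_kernel_read() */

static unsigned long guess_and_verify(unsigned long **pp, unsigned long *guess)
{
	unsigned long spec;
	int bad = probe_kernel_read(&spec, guess, sizeof(spec));	/* never oopses */
	unsigned long *p = *pp;					/* the real pointer load */

	if (!bad && p == guess)
		return spec;		/* speculation succeeded */
	return *p;			/* fall back to the ordinary load */
}

probe_kernel_read() works via the same __ex_table/.fixup machinery that
uaccess relies on, i.e. roughly the kind of cooperation described above.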

> > Furthermore, the ACCESS_ONCE that Peter's patch added to gup_fast
> > pud/pgd can't prevent the compiler from reading a guessed pmdp address
> > as a volatile variable before reading the pmdp pointer and comparing it
> > with the guessed address! So if it's 5 you worry about, adding
> > ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
> > have added a barrier() instead.
> 
> Most compiler writers I have discussed this with agreed that a volatile
> cast would suppress value speculation.  The "volatile" keyword is not
> all that well specified in the C and C++ standards, but as "nix" said
> at http://lwn.net/Articles/509731/:
> 
>   volatile's meaning as 'minimize optimizations applied to things
>   manipulating anything of volatile type, do not duplicate, elide,
>   move, fold, spindle or mutilate' is of long standing.

Ok, so if the above optimization were possible, volatile would
stop it too; thanks for the quote and the explanation.

On a side note, I believe there are a few barrier()s that may be worth
converting to ACCESS_ONCE; that would take care of case 6) too, in
addition to avoiding clobbering more CPU registers than strictly
necessary. Not very important, but a possible micro-optimization.

> That said, value speculation as a compiler optimization makes me a bit
> nervous, so my current feeling is that it should be suppressed entirely.
> 
> Hey, you asked, even if only implicitly!  ;-)

You're reading my mind! :)


Re: [RFC] page-table walkers vs memory order

2012-08-04 Thread Paul E. McKenney
On Sat, Aug 04, 2012 at 04:37:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > Since then, I think THP has made the rules more complicated; but I
> > believe Andrea paid a great deal of attention to that kind of issue.
> 
> There were many issues, one unexpected was
> 1a5a9906d4e8d1976b701f889d8f35d54b928f25.

[ . . . ]

> 5) compiler behaving like alpha -> impossible (I may be wrong but I
>    believe so after thinking more on it)
> 
> 6) I was told a decade ago by Honza to never touch any ram that can
>    change under the compiler unless it's declared volatile (could
>    crash over switch/case statement implemented with a table if the
>    switch/case value is re-read by the compiler).  -> depends, we
>    don't always obey this rule, clearly gup_fast currently disobeys
>    and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED
>    can zero the pmd). If there's no "switch/case" I'm not aware of
>    other troubles.
> 
> 7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
>    issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f
> 
> Note: here I'm ignoring CPU reordering, this is only about the compiler.
> 
> 5 is impossible because:
> 
> a) the compiler can't read a guessed address or it can crash the
>    kernel
> 
> b) the compiler has no memory to store a "guessed" valid address when
>    the function returns and the stack is unwound
> 
> For the compiler to behave like alpha, it would have to read the
> pteval before the pmdp, which it can't do, because it has no address to
> guess from and it would Oops if it really tried to guess it!
> 
> So far it has been said that "the compiler can guess the address", but
> there was no valid explanation of how it could do it, and I don't see
> one, so please explain if I'm wrong about (a) and (b) above.

OK, I'll bite.  ;-)

The most sane way for this to happen is with feedback-driven techniques
involving profiling, similar to what is done for basic-block reordering
or branch prediction.  The idea is that you compile the kernel in an
as-yet (and thankfully) mythical pointer-profiling mode, which records
the values of pointer loads and also measures the pointer-load latency.
If a situation is found where a given pointer almost always has the
same value but has high load latency (for example, is almost always a
high-latency cache miss), this fact is recorded and fed back into a
subsequent kernel build.  This subsequent kernel build might choose to
speculate the value of the pointer concurrently with the pointer load.

And of course, when interpreting the phrase "most sane way" at the
beginning of the prior paragraph, it would probably be wise to keep
in mind who wrote it.  And that "most sane way" might have little or
no resemblance to anything that typical kernel hackers would consider
anywhere near sanity.  ;-)
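
(For concreteness, the kind of code such a profiling mode would flag
looks something like the fragment below -- the structure and names are
invented for the example:

struct device_stats { unsigned long packets; };

static unsigned long get_packets(struct device_stats **devp)
{
	/* profile says: almost always the same pointer, almost always a miss */
	struct device_stats *dev = *devp;

	/* dependent load: cannot start until *devp comes back from memory */
	return dev->packets;
}

A feedback-driven build could, in principle, compare against the
recorded likely value of "dev" and start the dependent load early,
overlapping it with the slow load of *devp and falling back to the
ordinary code when the guess turns out to be wrong.)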

> Furthermore, the ACCESS_ONCE that Peter's patch added to gup_fast
> pud/pgd can't prevent the compiler from reading a guessed pmdp address
> as a volatile variable before reading the pmdp pointer and comparing it
> with the guessed address! So if it's 5 you worry about, adding
> ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
> have added a barrier() instead.

Most compiler writers I have discussed this with agreed that a volatile
cast would suppress value speculation.  The "volatile" keyword is not
all that well specified in the C and C++ standards, but as "nix" said
at http://lwn.net/Articles/509731/:

	volatile's meaning as 'minimize optimizations applied to things
	manipulating anything of volatile type, do not duplicate, elide,
	move, fold, spindle or mutilate' is of long standing.
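
(For reference, the two primitives being compared are defined in the
kernel as:

#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))		/* compiler.h */
#define barrier()	__asm__ __volatile__("": : :"memory")	/* compiler-gcc.h */

so ACCESS_ONCE() forces exactly one ordinary load or store of that one
object at that point, while barrier() makes the compiler discard
everything it has cached in registers across it.)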

That said, value speculation as a compiler optimization makes me a bit
nervous, so my current feeling is that it should be suppressed entirely.

Hey, you asked, even if only implicitly!  ;-)

Thanx, Paul

> > I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> > gup_fast() breaks as many rules as it can, and in particular may
> > be racing with the freeing of page tables; but I'm not so sure
> > about the pagewalk mods - we could say "cannot do any harm",
> > but I don't like adding lines on that basis.
> 
> I agree with adding ACCESS_ONCE, but because it's cases 2 and 7 above
> and it could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED
> with pmd.
> 
> The other part of the patch in pagewalk.c is superfluous and should be
> dropped: pud/pgd can't change in walk_page_table, since it's required
> to hold the mmap_sem at least in read mode (it's not disabling irqs).
> 
> The pmd side instead can change, but only with THP enabled, and only
> because of MADV_DONTNEED (never because of free_pgtables), and it's
> already fully handled through pmd_none_or_trans_huge_or_clear_bad. The
> ->pmd_entry callers are required to call pmd_trans_unstable() before
> proceeding as the pmd may have been zeroed out by the time we get there.

Re: [RFC] page-table walkers vs memory order

2012-08-04 Thread Andrea Arcangeli
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> Since then, I think THP has made the rules more complicated; but I
> believe Andrea paid a great deal of attention to that kind of issue.

There were many issues, one unexpected was
1a5a9906d4e8d1976b701f889d8f35d54b928f25.

Keep in mind that when holding only the mmap_sem in read mode (walk
page range speculative mode) or in gup_fast, the result is always
undefined and racy if another CPU has a munmap or mremap or any other
pmd mangling concurrently messing with the mapping you're walking; all
we have to do is not crash, it doesn't matter what happens.

The fact that you need a barrier() or ACCESS_ONCE to avoid a lockup in
a while (rcu_dereference()) loop is no good reason to worry about all
possible purely theoretical gcc issues.

One important thing that hasn't been mentioned so far in this thread is
that we rely entirely on gcc for all pagetable and device driver
writes (to do 1 movq instead of 8 movb); see native_set_pmd and writel.

We must separate all different cases to avoid huge confusion:

1) tmp=*ptr, while(tmp) -> possible, needs barrier or better
   ACCESS_ONCE if possible

2) orig_pmd = *pmdp before do_wp_huge_page in memory.c -> possible, needs
   barrier() and it should be possible to convert to ACCESS_ONCE

3) native_set_pmd and friends -> possible but not worth fixing; I tried
   to fix it a decade ago for peace of mind, but I was advised to
   desist, and it hasn't bitten us yet

4) writel -> possible but same as 3

5) compiler behaving like alpha -> impossible (I may be wrong but I
   believe so after thinking more on it)

6) I was told a decade ago by Honza to never touch any ram that can
   change under the compiler unless it's declared volatile (could
   crash over switch/case statement implemented with a table if the
   switch/case value is re-read by the compiler).  -> depends, we
   don't always obey this rule, clearly gup_fast currently disobeys
   and even the generic pmd_read_atomic still disobeys (MADV_DONTNEED
   can zero the pmd). If there's no "switch/case" I'm not aware of
   other troubles.

7) barrier in pmd_none_or_trans_huge_or_clear_bad -> possible, same
   issue as 2, full explanation in git show 1a5a9906d4e8d1976b701f

Note: here I'm ignoring CPU reordering, this is only about the compiler.
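
To spell out case 1: without a barrier()/ACCESS_ONCE the compiler is
free to read *ptr once and spin on the value cached in a register,
since it sees no writer inside the loop:

	while (*ptr)			/* may become: tmp = *ptr; while (tmp) ; */
		;

	while (ACCESS_ONCE(*ptr))	/* forced to reload *ptr on every iteration */
		;

ACCESS_ONCE only constrains that single access, which is why it can be
cheaper than a full barrier() when it's sufficient.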

5 is impossible because:

a) the compiler can't read a guessed address or it can crash the
   kernel

b) the compiler has no memory to store a "guessed" valid address when
   the function returns and the stack is unwound

For the compiler to behave like alpha, it would have to read the
pteval before the pmdp, which it can't do, because it has no address to
guess from and it would Oops if it really tried to guess it!

So far it has been said that "the compiler can guess the address", but
there was no valid explanation of how it could do it, and I don't see
one, so please explain if I'm wrong about (a) and (b) above.

Furthermore, the ACCESS_ONCE that Peter's patch added to gup_fast
pud/pgd can't prevent the compiler from reading a guessed pmdp address
as a volatile variable before reading the pmdp pointer and comparing it
with the guessed address! So if it's 5 you worry about, adding
ACCESS_ONCE in pudp/pgdp/pmdp is useless and won't fix it. You should
have added a barrier() instead.

> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> gup_fast() breaks as many rules as it can, and in particular may
> be racing with the freeing of page tables; but I'm not so sure
> about the pagewalk mods - we could say "cannot do any harm",
> but I don't like adding lines on that basis.

I agree with adding ACCESS_ONCE, but because it's cases 2 and 7 above
and it could race with free_pgtables of pgd/pud/pmd or MADV_DONTNEED
with pmd.

The other part of the patch in pagewalk.c is superfluous and should be
dropped: pud/pgd can't change in walk_page_table, since it's required
to hold the mmap_sem at least in read mode (it's not disabling irqs).

The pmd side instead can change, but only with THP enabled, and only
because of MADV_DONTNEED (never because of free_pgtables), and it's
already fully handled through pmd_none_or_trans_huge_or_clear_bad. The
->pmd_entry callers are required to call pmd_trans_unstable() before
proceeding as the pmd may have been zeroed out by the time we get
there. The solution has zero barrier()/ACCESS_ONCE impact for THP=n. If
there are smp_read_barrier_depends calls already in the alpha pte
methods we're fine.
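
For reference, the pattern being relied on (a sketch modeled on
arch/x86/mm/gup.c, with the trans-huge checks trimmed) is to snapshot
the entry once and then only ever look at the local copy:

	pmd_t pmd = ACCESS_ONCE(*pmdp);	/* single read of the live entry */

	if (pmd_none(pmd))
		return 0;		/* bail out, the slow path will retry */
	/* from here on only "pmd" (the snapshot) is used, never *pmdp again */

so a concurrent MADV_DONTNEED zeroing the pmd cannot be observed half
way through: every check and use sees the same snapshotted value.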

Sorry for the long email but without a list that separates all
possible cases above, I don't think we can understand what we're
fixing in that patch and why the gup.c part is good.

Peter, I suggest resending the fix with a more detailed explanation
of the case 2 and 7 kind of race for gup.c only, and dropping the
pagewalk.c part.

Thanks,
Andrea


Re: [RFC] page-table walkers vs memory order

2012-07-30 Thread Paul E. McKenney
On Mon, Jul 30, 2012 at 08:21:40PM +0100, Jamie Lokier wrote:
> Paul E. McKenney wrote:
> > > Does some version of gcc, under the options which we insist upon,
> > > make such optimizations on any of the architectures which we support?
> > 
> > Pretty much any production-quality compiler will do double-fetch
> > and old-value-reuse optimizations, the former especially on 32-bit
> > x86.  I don't know of any production-quality compilers that do value
> > speculation, which would make the compiler act like DEC Alpha hardware,
> > and I would hope that if this does appear, (1) we would have warning
> > and (2) it could be turned off.  But there has been a lot of work on
> > this topic, so we would be foolish to rule it out.
> 
> GCC documentation for IA-64:
> 
>    -msched-ar-data-spec
>    -mno-sched-ar-data-spec
>      (En/Dis)able data speculative scheduling after reload. This results
>      in generation of ld.a instructions and the corresponding check
>      instructions (ld.c / chk.a). The default is 'enable'.
> 
> I don't know if that results in value speculation of the relevant kind.

If I remember correctly, the chk.a instruction will detect failed
speculation via cache state and deal with the situation correctly,
but I really need to defer to someone with more recent IA-64 experience.
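
For readers who have not met this IA-64 feature, here is a toy C-level
sketch (hand-written illustration, not actual compiler output) of what an
ld.a/chk.a pair amounts to: the load is hoisted above a store that might
alias it, and the check redoes the load if the store did hit it.

int f(int *p, int *q)
{
	int tmp = *p;		/* ld.a: speculative, hoisted load       */
	*q = 42;		/* store that may alias *p               */
	if (q == p)		/* chk.a: the ALAT tracks the collision  */
		tmp = *p;	/* recovery: reload the clobbered value  */
	return tmp + 1;
}

Note this is data (address) speculation rather than value speculation:
the pointer itself is never guessed.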

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-30 Thread Jamie Lokier
Paul E. McKenney wrote:
> > Does some version of gcc, under the options which we insist upon,
> > make such optimizations on any of the architectures which we support?
> 
> Pretty much any production-quality compiler will do double-fetch
> and old-value-reuse optimizations, the former especially on 32-bit
> x86.  I don't know of any production-quality compilers that do value
> speculation, which would make the compiler act like DEC Alpha hardware,
> and I would hope that if this does appear, (1) we would have warning
> and (2) it could be turned off.  But there has been a lot of work on
> this topic, so we would be foolish to rule it out.

GCC documentation for IA-64:

   -msched-ar-data-spec
   -mno-sched-ar-data-spec
 (En/Dis)able data speculative scheduling after reload. This results
 in generation of ld.a instructions and the corresponding check
 instructions (ld.c / chk.a). The default is 'enable'.

I don't know if that results in value speculation of the relevant kind.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-27 Thread Paul E. McKenney
On Fri, Jul 27, 2012 at 12:22:46PM -0700, Hugh Dickins wrote:
> On Thu, 26 Jul 2012, Peter Zijlstra wrote:
> > On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> > > I do love the status quo, but an audit would be welcome.  When
> > > it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> > > smp_read_barrier_depends() and accompanying comments to be hidden away
> > > in the underlying macros or inlines where reasonable, rather than
> > > repeated all over; but I may have my priorities wrong on that.
> 
> I notice from that old radix_tree thread you pointed to in the previous
> mail (for which many thanks: lots of meat to digest in there) that this
> is also Linus's preference.
> 
> > > 
> > > 
> > Yeah, I was being lazy, and I totally forgot to actually look at the
> > alpha code.
> > 
> > How about we do a generic (cribbed from rcu_dereference):
> > 
> > #define page_table_deref(p) \
> > ({  \
> > typeof(*p) *__p = (typeof(*p) __force *)ACCESS_ONCE(p);\
> > smp_read_barrier_depends(); \
> > ((typeof(*p) __force __kernel *)(__p)); \
> > })
> > 
> > and use that all over to dereference page-tables. That way all this
> > lives in one place. Granted, I'll have to go edit all arch code, but I
> > seem to be doing that on a frequent basis anyway :/
> 
> If you're convinced that we now have (or are in danger of growing)
> a number of places which need this safety, yes, I suppose so.
> 
> Personally, I'd have gone for just adding the relatively-understandable
> ACCESS_ONCEs in all the arch/*/include/asm macros (which you're going to
> visit to make the above change), and leave the smp_read_barrier_depends()
> entirely in Alpha - one level of indirection less for the reader.
> But that's just me, you're the one proposing to do the work, and
> you may have very good reason for the above.
> 
> I'm unfamiliar with what value the __force __kernel annotations add.
> But I am interested to notice that you are only 6/9ths as insane as
> Paul: any chance of helping global underscore availability by not
> hoarding quite so many in there? 

Heh!!!  The number of underscores for the original rcu_dereference()'s
local variable was the outcome of an argument about how obfuscated that
variable's name should be in order to avoid possible collisions with names
in the enclosing scope.  Nine leading underscores might seem excessive,
or even as you say, insane, but on the other hand no name collisions
have ever come to my attention.  ;-)
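
For reference, the definition being alluded to looked roughly like this at
the time (simplified from include/linux/rcupdate.h; the local variable
really does carry nine leading underscores):

#define rcu_dereference(p)	({ \
		typeof(p) _________p1 = ACCESS_ONCE(p); \
		smp_read_barrier_depends(); \
		(_________p1); \
	})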

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-27 Thread Hugh Dickins
On Thu, 26 Jul 2012, Peter Zijlstra wrote:
> On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> > I do love the status quo, but an audit would be welcome.  When
> > it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> > smp_read_barrier_depends() and accompanying comments to be hidden away
> > in the underlying macros or inlines where reasonable, rather than
> > repeated all over; but I may have my priorities wrong on that.

I notice from that old radix_tree thread you pointed to in the previous
mail (for which many thanks: lots of meat to digest in there) that this
is also Linus's preference.

> > 
> > 
> Yeah, I was being lazy, and I totally forgot to actually look at the
> alpha code.
> 
> How about we do a generic (cribbed from rcu_dereference):
> 
> #define page_table_deref(p)   \
> ({\
>   typeof(*p) *__p = (typeof(*p) __force *)ACCESS_ONCE(p);\
>   smp_read_barrier_depends(); \
>   ((typeof(*p) __force __kernel *)(__p)); \
> })
> 
> and use that all over to dereference page-tables. That way all this
> lives in one place. Granted, I'll have to go edit all arch code, but I
> seem to be doing that on a frequent basis anyway :/

If you're convinced that we now have (or are in danger of growing)
a number of places which need this safety, yes, I suppose so.

Personally, I'd have gone for just adding the relatively-understandable
ACCESS_ONCEs in all the arch/*/include/asm macros (which you're going to
visit to make the above change), and leave the smp_read_barrier_depends()
entirely in Alpha - one level of indirection less for the reader.
But that's just me, you're the one proposing to do the work, and
you may have very good reason for the above.
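
For concreteness, a sketch of that style using the x86 pmd_offset() of this
era as the example (illustrative only, not a submitted patch); the sole
change from the stock helper would be the ACCESS_ONCE() around the entry
load:

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
	/* was: return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address); */
	return (pmd_t *)pud_page_vaddr(ACCESS_ONCE(*pud)) + pmd_index(address);
}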

I'm unfamiliar with what value the __force __kernel annotations add.
But I am interested to notice that you are only 6/9ths as insane as
Paul: any chance of helping global underscore availability by not
hoarding quite so many in there? 

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-26 Thread Peter Zijlstra
On Tue, 2012-07-24 at 14:51 -0700, Hugh Dickins wrote:
> I do love the status quo, but an audit would be welcome.  When
> it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> smp_read_barrier_depends() and accompanying comments to be hidden away
> in the underlying macros or inlines where reasonable, rather than
> repeated all over; but I may have my priorities wrong on that.
> 
> 
Yeah, I was being lazy, and I totally forgot to actually look at the
alpha code.

How about we do a generic (cribbed from rcu_dereference):

#define page_table_deref(p) \
({  \
typeof(*p) *__p = (typeof(*p) __force *)ACCESS_ONCE(p);\
smp_read_barrier_depends(); \
((typeof(*p) __force __kernel *)(__p)); \
})

and use that all over to dereference page-tables. That way all this
lives in one place. Granted, I'll have to go edit all arch code, but I
seem to be doing that on a frequent basis anyway :/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-26 Thread Peter Zijlstra
On Wed, 2012-07-25 at 15:09 -0700, Hugh Dickins wrote:
> We find out after it hits us, and someone studies the disassembly -
> if we're lucky enough to crash near the origin of the problem. 

This is a rather painful way.. see

  https://lkml.org/lkml/2009/1/5/555

we were lucky there in that the lack of ACCESS_ONCE() caused an infinite
loop so we knew exactly where we got stuck.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Paul E. McKenney
On Wed, Jul 25, 2012 at 03:09:48PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > > > 
> > > > > I'm totally unclear whether the kernel ever gets built with these
> > > > > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > > > > of where some future compiler would be permitted to mess with our
> > > > > assumptions?  Or is it actually saving us already today?  Would we
> > > > > know?  Could there be a boottime test that would tell us?  Is it
> > > > > likely that a future compiler would have an "--access_once"
> > > > > option that the kernel build would want to turn on?
> > > > 
> > > > The problem is that, unless you tell it otherwise, the compiler is
> > > > permitted to assume that the code that it is generating is the only 
> > > > thing
> > > > active in that address space at that time.  So the compiler might know
> > > > that it already has a perfectly good copy of that value somewhere in
> > > > its registers, or it might decide to fetch the value twice rather than
> > > > once due to register pressure, either of which can be fatal in SMP code.
> > > > And then there are more aggressive optimizations as well.
> > > > 
> > > > ACCESS_ONCE() is a way of telling the compiler to access the value
> > > > once, regardless of what cute single-threaded optimizations that it
> > > > otherwise might want to apply.
> > > 
> > > Right, but you say "might": I have never heard it asserted, that we do
> > > build the kernel with a compiler which actually makes such optimizations.
> > 
> > The compiler we use today can and has hurt us with double-fetching
> > and old-value-reuse optimizations.  There have been several that have
> > "optimized" things like "while (foo)" into "tmp = foo; while (tmp)"
> > in the Linux kernel, which have been dealt with by recoding.
> 
> Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

;-) ;-) ;-)

> I consider the double-fetching ones more insidious,
> less obviously in need of the volatile cast.

Agreed!

> > You might argue that the compiler cannot reasonably apply such an
> > optimization in some given case, but the compiler does much more detailed
> > analysis of the code than most people are willing to do (certainly more
> > than I am usually willing to do!), so I believe that a little paranoia is
> > quite worthwhile.
> > 
> > > There's a lot of other surprising things which a compiler is permitted
> > > to do, but we would simply not use such a compiler to build the kernel.
> > 
> > Unless we get the gcc folks to build and boot the Linux kernel as part
> > of their test suite (maybe they already do, but not that I know of),
> > how would either they or we know that they had deployed a destructive
> > optimization?
> 
> We find out after it hits us, and someone studies the disassembly -
> if we're lucky enough to crash near the origin of the problem.

Color me unreassured.  ;-)

> > > Does some version of gcc, under the options which we insist upon,
> > > make such optimizations on any of the architectures which we support?
> > 
> > Pretty much any production-quality compiler will do double-fetch
> > and old-value-reuse optimizations, the former especially on 32-bit
> > x86.
> 
> That makes good sense, yes: so, under register pressure, they may
> refetch from global memory, instead of using a temporary on local stack.
> 
> > I don't know of any production-quality compilers that do value
> > speculation, which would make the compiler act like DEC Alpha hardware,
> > and I would hope that if this does appear, (1) we would have warning
> > and (2) it could be turned off.  But there has been a lot of work on
> > this topic, so we would be foolish to rule it out.
> 
> I think you're justified in expecting both (1) and (2) there.

Here is hoping!

> > But the currently deployed optimizations can already cause enough trouble.
> > 
> > > Or is there some other compiler in use on the kernel, which makes
> > > such optimizations?  It seems a long time since I heard of building
> > > the kernel with icc.  clang?
> > > 
> > > I don't mind the answer "Yes, you idiot" - preferably with an example
> > > or two of which compiler and which piece of code it has bitten us on.
> > > I don't mind the answer "We just don't know" if that's the case.
> > > 
> > > But I'd like a better idea of how much to worry: is ACCESS_ONCE
> > > demonstrably needed today, or rather future-proofing and documentation?
> > 
> > Both.  If you are coding "while (foo)" where "foo" can be changed by an
> > interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
> > or something similar, because most compilers are happy to optimize your
> > loop into an infinite loop in that case.  There are places in the Linux
> > kernel that would have 

Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Hugh Dickins
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> > On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > > 
> > > > I'm totally unclear whether the kernel ever gets built with these
> > > > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > > > of where some future compiler would be permitted to mess with our
> > > > assumptions?  Or is it actually saving us already today?  Would we
> > > > know?  Could there be a boottime test that would tell us?  Is it
> > > > likely that a future compiler would have an "--access_once"
> > > > option that the kernel build would want to turn on?
> > > 
> > > The problem is that, unless you tell it otherwise, the compiler is
> > > permitted to assume that the code that it is generating is the only thing
> > > active in that address space at that time.  So the compiler might know
> > > that it already has a perfectly good copy of that value somewhere in
> > > its registers, or it might decide to fetch the value twice rather than
> > > once due to register pressure, either of which can be fatal in SMP code.
> > > And then there are more aggressive optimizations as well.
> > > 
> > > ACCESS_ONCE() is a way of telling the compiler to access the value
> > > once, regardless of what cute single-threaded optimizations that it
> > > otherwise might want to apply.
> > 
> > Right, but you say "might": I have never heard it asserted, that we do
> > build the kernel with a compiler which actually makes such optimizations.
> 
> The compiler we use today can and has hurt us with double-fetching
> and old-value-reuse optimizations.  There have been several that have
> "optimized" things like "while (foo)" into "tmp = foo; while (tmp)"
> in the Linux kernel, which have been dealt with by recoding.

Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

I consider the double-fetching ones more insidious,
less obviously in need of the volatile cast.

> 
> You might argue that the compiler cannot reasonably apply such an
> optimization in some given case, but the compiler does much more detailed
> analysis of the code than most people are willing to do (certainly more
> than I am usually willing to do!), so I believe that a little paranoia is
> quite worthwhile.
> 
> > There's a lot of other surprising things which a compiler is permitted
> > to do, but we would simply not use such a compiler to build the kernel.
> 
> Unless we get the gcc folks to build and boot the Linux kernel as part
> of their test suite (maybe they already do, but not that I know of),
> how would either they or we know that they had deployed a destructive
> optimization?

We find out after it hits us, and someone studies the disassembly -
if we're lucky enough to crash near the origin of the problem.

> 
> > Does some version of gcc, under the options which we insist upon,
> > make such optimizations on any of the architectures which we support?
> 
> Pretty much any production-quality compiler will do double-fetch
> and old-value-reuse optimizations, the former especially on 32-bit
> x86.

That makes good sense, yes: so, under register pressure, they may
refetch from global memory, instead of using a temporary on local stack.

> I don't know of any production-quality compilers that do value
> speculation, which would make the compiler act like DEC Alpha hardware,
> and I would hope that if this does appear, (1) we would have warning
> and (2) it could be turned off.  But there has been a lot of work on
> this topic, so we would be foolish to rule it out.

I think you're justified in expecting both (1) and (2) there.

> 
> But the currently deployed optimizations can already cause enough trouble.
> 
> > Or is there some other compiler in use on the kernel, which makes
> > such optimizations?  It seems a long time since I heard of building
> > the kernel with icc.  clang?
> > 
> > I don't mind the answer "Yes, you idiot" - preferably with an example
> > or two of which compiler and which piece of code it has bitten us on.
> > I don't mind the answer "We just don't know" if that's the case.
> > 
> > But I'd like a better idea of how much to worry: is ACCESS_ONCE
> > demonstrably needed today, or rather future-proofing and documentation?
> 
> Both.  If you are coding "while (foo)" where "foo" can be changed by an
> interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
> or something similar, because most compilers are happy to optimize your
> loop into an infinite loop in that case.  There are places in the Linux
> kernel that would have problems if the compiler decided to refetch a
> value -- if a pointer was changed in the meantime, part of your code
> might be working on the old structure, and part on the new structure.
> This really can happen today, and this is why rcu_dereference() contains
> an ACCESS_ONCE().
> 
> If you are making 

Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Paul E. McKenney
On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> > On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > > 
> > > I'm totally unclear whether the kernel ever gets built with these
> > > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > > of where some future compiler would be permitted to mess with our
> > > assumptions?  Or is it actually saving us already today?  Would we
> > > know?  Could there be a boottime test that would tell us?  Is it
> > > likely that a future compiler would have an "--access_once"
> > > option that the kernel build would want to turn on?
> > 
> > The problem is that, unless you tell it otherwise, the compiler is
> > permitted to assume that the code that it is generating is the only thing
> > active in that address space at that time.  So the compiler might know
> > that it already has a perfectly good copy of that value somewhere in
> > its registers, or it might decide to fetch the value twice rather than
> > once due to register pressure, either of which can be fatal in SMP code.
> > And then there are more aggressive optimizations as well.
> > 
> > ACCESS_ONCE() is a way of telling the compiler to access the value
> > once, regardless of what cute single-threaded optimizations that it
> > otherwise might want to apply.
> 
> Right, but you say "might": I have never heard it asserted, that we do
> build the kernel with a compiler which actually makes such optimizations.

The compiler we use today can and has hurt us with double-fetching
and old-value-reuse optimizations.  There have been several that have
"optimized" things like "while (foo)" into "tmp = foo; while (tmp)"
in the Linux kernel, which have been dealt with by recoding.

You might argue that the compiler cannot reasonably apply such an
optimization in some given case, but the compiler does much more detailed
analysis of the code than most people are willing to do (certainly more
than I am usually willing to do!), so I believe that a little paranoia is
quite worthwhile.

> There's a lot of other surprising things which a compiler is permitted
> to do, but we would simply not use such a compiler to build the kernel.

Unless we get the gcc folks to build and boot the Linux kernel as part
of their test suite (maybe they already do, but not that I know of),
how would either they or we know that they had deployed a destructive
optimization?

> Does some version of gcc, under the options which we insist upon,
> make such optimizations on any of the architectures which we support?

Pretty much any production-quality compiler will do double-fetch
and old-value-reuse optimizations, the former especially on 32-bit
x86.  I don't know of any production-quality compilers that do value
speculation, which would make the compiler act like DEC Alpha hardware,
and I would hope that if this does appear, (1) we would have warning
and (2) it could be turned off.  But there has been a lot of work on
this topic, so we would be foolish to rule it out.
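
To make "value speculation" concrete, a hypothetical sketch of the
transformation (again, no production compiler is known to do this; "gp"
and "guess" are made-up names for illustration):

struct node { int data; };
extern struct node *gp;			/* shared, updated by other CPUs */

int reader_as_written(void)
{
	struct node *p = gp;
	return p->data;			/* dependent load                */
}

int reader_as_speculated(struct node *guess)
{
	struct node *p = guess;		/* profile-predicted value of gp */
	int x = p->data;		/* dependent load issued early   */

	if (gp != guess) {
		p = gp;			/* guessed wrong: redo the load  */
		x = p->data;
	}
	return x;
}

Even when the guess is right, the early p->data load is no longer ordered
after the load of gp, which is the same effect as DEC Alpha's reordering
of dependent loads -- the thing smp_read_barrier_depends() was introduced
for.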

But the currently deployed optimizations can already cause enough trouble.

> Or is there some other compiler in use on the kernel, which makes
> such optimizations?  It seems a long time since I heard of building
> the kernel with icc.  clang?
> 
> I don't mind the answer "Yes, you idiot" - preferably with an example
> or two of which compiler and which piece of code it has bitten us on.
> I don't mind the answer "We just don't know" if that's the case.
> 
> But I'd like a better idea of how much to worry: is ACCESS_ONCE
> demonstrably needed today, or rather future-proofing and documentation?

Both.  If you are coding "while (foo)" where "foo" can be changed by an
interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
or something similar, because most compilers are happy to optimize your
loop into an infinite loop in that case.  There are places in the Linux
kernel that would have problems if the compiler decided to refetch a
value -- if a pointer was changed in the meantime, part of your code
might be working on the old structure, and part on the new structure.
This really can happen today, and this is why rcu_dereference() contains
an ACCESS_ONCE().
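
A minimal sketch of the infinite-loop case (hypothetical code, not from
the kernel tree; "stop" is a made-up flag written by an interrupt handler):

static int stop;			/* set from an interrupt handler */

static void wait_for_event_bad(void)
{
	while (!stop)			/* may become: tmp = stop; while (!tmp) */
		cpu_relax();		/* ...and spin forever                  */
}

static void wait_for_event_good(void)
{
	while (!ACCESS_ONCE(stop))	/* forces a fresh load each iteration   */
		cpu_relax();
}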

If you are making lockless non-atomic access to a variable, I strongly
suggest ACCESS_ONCE() or something similar even if you cannot see how
the compiler can mess you up, especially in cases involving a lot of
inline functions.  In this case, the compiler can be looking at quite
a bit of code and optimizing across the entire mess.

/me wonders what he stepped into with this email thread.  ;-)

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Hugh Dickins
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
> On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> > 
> > I'm totally unclear whether the kernel ever gets built with these
> > 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> > of where some future compiler would be permitted to mess with our
> > assumptions?  Or is it actually saving us already today?  Would we
> > know?  Could there be a boottime test that would tell us?  Is it
> > likely that a future compiler would have an "--access_once"
> > option that the kernel build would want to turn on?
> 
> The problem is that, unless you tell it otherwise, the compiler is
> permitted to assume that the code that it is generating is the only thing
> active in that address space at that time.  So the compiler might know
> that it already has a perfectly good copy of that value somewhere in
> its registers, or it might decide to fetch the value twice rather than
> once due to register pressure, either of which can be fatal in SMP code.
> And then there are more aggressive optimizations as well.
> 
> ACCESS_ONCE() is a way of telling the compiler to access the value
> once, regardless of what cute single-threaded optimizations that it
> otherwise might want to apply.

Right, but you say "might": I have never heard it asserted, that we do
build the kernel with a compiler which actually makes such optimizations.

There's a lot of other surprising things which a compiler is permitted
to do, but we would simply not use such a compiler to build the kernel.

Does some version of gcc, under the options which we insist upon,
make such optimizations on any of the architectures which we support?

Or is there some other compiler in use on the kernel, which makes
such optimizations?  It seems a long time since I heard of building
the kernel with icc.  clang?

I don't mind the answer "Yes, you idiot" - preferably with an example
or two of which compiler and which piece of code it has bitten us on.
I don't mind the answer "We just don't know" if that's the case.

But I'd like a better idea of how much to worry: is ACCESS_ONCE
demonstrably needed today, or rather future-proofing and documentation?

Thanks,
Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Paul E. McKenney
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
> On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> > 
> > While staring at mm/huge_memory.c I found a very under-commented
> > smp_wmb() in __split_huge_page_map(). It turns out that its copied from
> > __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
> > thereto).
> > 
> > Now, afaict we're not good, as per that comment. Paul has since
> > convinced some of us that compiler writers are pure evil and out to get
> > us.
> > 
> > Therefore we should do what rcu_dereference() does and use
> > ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
> > we dereference a page-table pointer.
> > 
> > 
> > In particular it looks like things like
> > mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> > walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> > concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> > itself) are quite broken on Alpha
> 
> The Alpha pmd_offset() and pte_offset_map() already contain an
> smp_read_barrier_depends() (362a61ad from Nick); with comment that
> it's not needed on the pgd, and I presume the pud level is folded.
> Does Alpha really need more of them, as you have put below?
> 
> > and subtly broken for those of us with 'creative' compilers.
> 
> I don't want to fight against ACCESS_ONCE() (or barrier(): that's
> interesting, thank you, I hadn't seen it used as an ACCESS_ONCE()
> substitute before); but I do want to question it a little.
> 
> I'm totally unclear whether the kernel ever gets built with these
> 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
> of where some future compiler would be permitted to mess with our
> assumptions?  Or is it actually saving us already today?  Would we
> know?  Could there be a boottime test that would tell us?  Is it
> likely that a future compiler would have an "--access_once"
> option that the kernel build would want to turn on?
> 
> Those may all be questions for Paul!

The problem is that, unless you tell it otherwise, the compiler is
permitted to assume that the code that it is generating is the only thing
active in that address space at that time.  So the compiler might know
that it already has a perfectly good copy of that value somewhere in
its registers, or it might decide to fetch the value twice rather than
once due to register pressure, either of which can be fatal in SMP code.
And then there are more aggressive optimizations as well.

ACCESS_ONCE() is a way of telling the compiler to access the value
once, regardless of what cute single-threaded optimizations that it
otherwise might want to apply.
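
The helper itself is nothing more than a volatile cast; roughly, from
include/linux/compiler.h of this era:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

And a hypothetical sketch of the double-fetch case ("shared" and
"struct foo" are made-up names for illustration):

struct foo { int a, b; };
extern struct foo *shared;		/* republished by other CPUs      */

void reader(int *a, int *b)
{
	struct foo *p = shared;		/* compiler may re-read "shared"  */
	*a = p->a;
	*b = p->b;			/* ...possibly from a new object  */
}

void reader_fixed(int *a, int *b)
{
	struct foo *p = ACCESS_ONCE(shared);	/* exactly one load       */
	*a = p->a;
	*b = p->b;
}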

Thanx, Paul

> > Should I go do a more extensive audit of page-table walkers or are we
> > happy with the status quo?
> 
> I do love the status quo, but an audit would be welcome.  When
> it comes to patches, personally I tend to prefer ACCESS_ONCE() and
> smp_read_barrier_depends() and accompanying comments to be hidden away
> in the underlying macros or inlines where reasonable, rather than
> repeated all over; but I may have my priorities wrong on that.
> 
> The last time we rewrote the main pgd-pud-pmd-pte walkers,
> we believed that no ACCESS_ONCE() was necessary, because although a
> pgd-pud-pmd entry might be racily instantiated at any instant, it
> could never change beneath us - the freeing of page tables happens
> only when we cannot reach them by other routes.
> 
> (Never quite true: those _clear_bad() things can zero entries at any
> instant, but we're already in a bad place when those come into play,
> so we never worried about racing against them.)
> 
> Since then, I think THP has made the rules more complicated; but I
> believe Andrea paid a great deal of attention to that kind of issue.
> 
> I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
> gup_fast() breaks as many rules as it can, and in particular may
> be racing with the freeing of page tables; but I'm not so sure
> about the pagewalk mods - we could say "cannot do any harm",
> but I don't like adding lines on that basis.
> 
> Hugh
> 
> > 
> > ---
> >  arch/x86/mm/gup.c |6 +++---
> >  mm/pagewalk.c |   24 
> >  2 files changed, 27 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > index dd74e46..4958fb1 100644
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
> > unsigned long end,
> >  
> > pmdp = pmd_offset(&pud, addr);
> > do {
> > -   pmd_t pmd = *pmdp;
> > +   pmd_t pmd = ACCESS_ONCE(*pmdp);
> >  
> > next = pmd_addr_end(addr, end);
> > /*
> > @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
> > unsigned long end,
> >  
> > pudp = pud_offset(&pgd, addr);
> > do {
> > -   pud_t pud = *pudp;
> > +   pud_t pud = ACCESS_ONCE(*pudp);

Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Paul E. McKenney
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
 On Mon, 23 Jul 2012, Peter Zijlstra wrote:
  
  While staring at mm/huge_memory.c I found a very under-commented
  smp_wmb() in __split_huge_page_map(). It turns out that its copied from
  __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
  thereto).
  
  Now, afaict we're not good, as per that comment. Paul has since
  convinced some of us that compiler writers are pure evil and out to get
  us.
  
  Therefore we should do what rcu_dereference() does and use
  ACCESS_ONCE()/barrier() followed smp_read_barrier_depends() every time
  we dereference a page-table pointer.
  
  
  In particular it looks like things like
  mm/memcontrol.c:mem_cgroup_count_precharge(), which use
  walk_page_range() under down_read(mm-mmap_sem) and can thus be
  concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
  itself) are quite broken on Alpha
 
 The Alpha pmd_offset() and pte_offset_map() already contain an
 smp_read_barrier_depends() (362a61ad from Nick); with comment that
 it's not needed on the pgd, and I presume the pud level is folded.
 Does Alpha really need more of them, as you have put below?
 
  and subtly broken for those of us with 'creative' compilers.
 
 I don't want to fight against ACCESS_ONCE() (or barrier(): that's
 interesting, thank you, I hadn't seen it used as an ACCESS_ONCE()
 substitute before); but I do want to question it a little.
 
 I'm totally unclear whether the kernel ever gets built with these
 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
 of where some future compiler would be permitted to mess with our
 assumptions?  Or is it actually saving us already today?  Would we
 know?  Could there be a boottime test that would tell us?  Is it
 likely that a future compiler would have an --access_once
 option that the kernel build would want to turn on?
 
 Those may all be questions for Paul!

The problem is that, unless you tell it otherwise, the compiler is
permitted to assume that the code that it is generating is the only thing
active in that address space at that time.  So the compiler might know
that it already has a perfectly good copy of that value somewhere in
its registers, or it might decide to fetch the value twice rather than
once due to register pressure, either of which can be fatal in SMP code.
And then there are more aggressive optimizations as well.

ACCESS_ONCE() is a way of telling the compiler to access the value
once, regardless of what cute single-threaded optimizations that it
otherwise might want to apply.

Thanx, Paul

  Should I go do a more extensive audit of page-table walkers or are we
  happy with the status quo?
 
 I do love the status quo, but an audit would be welcome.  When
 it comes to patches, personally I tend to prefer ACCESS_ONCE() and
 smp_read_barrier_depends() and accompanying comments to be hidden away
 in the underlying macros or inlines where reasonable, rather than
 repeated all over; but I may have my priorities wrong on that.
 
 The last time we rewrote the main pgd-pud-pmd-pte walkers,
 we believed that no ACCESS_ONCE() was necessary, because although a
 pgd-pud-pmd entry might be racily instantiated at any instant, it
 could never change beneath us - the freeing of page tables happens
 only when we cannot reach them by other routes.
 
 (Never quite true: those _clear_bad() things can zero entries at any
 instant, but we're already in a bad place when those come into play,
 so we never worried about racing against them.)
 
 Since then, I think THP has made the rules more complicated; but I
 believe Andrea paid a great deal of attention to that kind of issue.
 
 I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
 gup_fast() breaks as many rules as it can, and in particular may
 be racing with the freeing of page tables; but I'm not so sure
 about the pagewalk mods - we could say cannot do any harm,
 but I don't like adding lines on that basis.
 
 Hugh
 
  
  ---
   arch/x86/mm/gup.c |6 +++---
   mm/pagewalk.c |   24 
   2 files changed, 27 insertions(+), 3 deletions(-)
  
  diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
  index dd74e46..4958fb1 100644
  --- a/arch/x86/mm/gup.c
  +++ b/arch/x86/mm/gup.c
  @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
  unsigned long end,
   
  pmdp = pmd_offset(pud, addr);
  do {
  -   pmd_t pmd = *pmdp;
  +   pmd_t pmd = ACCESS_ONCE(*pmdp);
   
  next = pmd_addr_end(addr, end);
  /*
  @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
  unsigned long end,
   
  pudp = pud_offset(pgd, addr);
  do {
  -   pud_t pud = *pudp;
  +   pud_t pud = ACCESS_ONCE(*pudp);
   
  next = pud_addr_end(addr, end);
  if (pud_none(pud))
  @@ -280,7 +280,7 @@ int 

Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Hugh Dickins
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
 On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
  
  I'm totally unclear whether the kernel ever gets built with these
  'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
  of where some future compiler would be permitted to mess with our
  assumptions?  Or is it actually saving us already today?  Would we
  know?  Could there be a boottime test that would tell us?  Is it
  likely that a future compiler would have an --access_once
  option that the kernel build would want to turn on?
 
 The problem is that, unless you tell it otherwise, the compiler is
 permitted to assume that the code that it is generating is the only thing
 active in that address space at that time.  So the compiler might know
 that it already has a perfectly good copy of that value somewhere in
 its registers, or it might decide to fetch the value twice rather than
 once due to register pressure, either of which can be fatal in SMP code.
 And then there are more aggressive optimizations as well.
 
 ACCESS_ONCE() is a way of telling the compiler to access the value
 once, regardless of what cute single-threaded optimizations that it
 otherwise might want to apply.

Right, but you say might: I have never heard it asserted, that we do
build the kernel with a compiler which actually makes such optimizations.

There's a lot of other surprising things which a compiler is permitted
to do, but we would simply not use such a compiler to build the kernel.

Does some version of gcc, under the options which we insist upon,
make such optimizations on any of the architectures which we support?

Or is there some other compiler in use on the kernel, which makes
such optimizations?  It seems a long time since I heard of building
the kernel with icc.  clang?

I don't mind the answer Yes, you idiot - preferably with an example
or two of which compiler and which piece of code it has bitten us on.
I don't mind the answer We just don't know if that's the case.

But I'd like a better idea of how much to worry: is ACCESS_ONCE
demonstrably needed today, or rather future-proofing and documentation?

Thanks,
Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Paul E. McKenney
On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
 On Wed, 25 Jul 2012, Paul E. McKenney wrote:
  On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
   
   I'm totally unclear whether the kernel ever gets built with these
   'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
   of where some future compiler would be permitted to mess with our
   assumptions?  Or is it actually saving us already today?  Would we
   know?  Could there be a boottime test that would tell us?  Is it
   likely that a future compiler would have an --access_once
   option that the kernel build would want to turn on?
  
  The problem is that, unless you tell it otherwise, the compiler is
  permitted to assume that the code that it is generating is the only thing
  active in that address space at that time.  So the compiler might know
  that it already has a perfectly good copy of that value somewhere in
  its registers, or it might decide to fetch the value twice rather than
  once due to register pressure, either of which can be fatal in SMP code.
  And then there are more aggressive optimizations as well.
  
  ACCESS_ONCE() is a way of telling the compiler to access the value
  once, regardless of what cute single-threaded optimizations that it
  otherwise might want to apply.
 
 Right, but you say might: I have never heard it asserted, that we do
 build the kernel with a compiler which actually makes such optimizations.

The compiler we use today can and has hurt us with double-fetching
and old-value-reuse optimizations.  There have been several that have
optimized things like while (foo) into tmp = foo; while (tmp)
in the Linux kernel, which have been dealt with by recoding.

You might argue that the compiler cannot reasonably apply such an
optimization in some given case, but the compiler does much more detailed
analysis of the code than most people are willing to do (certainly more
than I am usually willing to do!), so I believe that a little paranoia is
quite worthwhile.

 There's a lot of other surprising things which a compiler is permitted
 to do, but we would simply not use such a compiler to build the kernel.

Unless we get the gcc folks to build and boot the Linux kernel as part
of their test suite (maybe they already do, but not that I know of),
how would either they or we know that they had deployed a destructive
optimization?

 Does some version of gcc, under the options which we insist upon,
 make such optimizations on any of the architectures which we support?

Pretty much any production-quality compiler will do double-fetch
and old-value-reuse optimizations, the former especially on 32-bit
x86.  I don't know of any production-quality compilers that do value
speculation, which would make the compiler act like DEC Alpha hardware,
and I would hope that if this does appear, (1) we would have warning
and (2) it could be turned off.  But there has been a lot of work on
this topic, so we would be foolish to rule it out.

But the currently deployed optimizations can already cause enough trouble.

 Or is there some other compiler in use on the kernel, which makes
 such optimizations?  It seems a long time since I heard of building
 the kernel with icc.  clang?
 
 I don't mind the answer Yes, you idiot - preferably with an example
 or two of which compiler and which piece of code it has bitten us on.
 I don't mind the answer We just don't know if that's the case.
 
 But I'd like a better idea of how much to worry: is ACCESS_ONCE
 demonstrably needed today, or rather future-proofing and documentation?

Both.  If you are coding while (foo) where foo can be changed by an
interrupt handler, you had better instead write while (ACCESS_ONCE(foo))
or something similar, because most compilers are happy to optimize your
loop into an infinite loop in that case.  There are places in the Linux
kernel that would have problems if the compiler decided to refetch a
value -- if a pointer was changed in the meantime, part of your code
might be working on the old structure, and part on the new structure.
This really can happen today, and this is why rcu_dereference() contains
an ACCESS_ONCE().

If you are making lockless non-atomic access to a variable, I strongly
suggest ACCESS_ONCE() or something similar even if you cannot see how
the compiler can mess you up, especially in cases involving a lot of
inline functions.  In this case, the compiler can be looking at quite
a bit of code and optimizing across the entire mess.

/me wonders what he stepped into with this email thread.  ;-)

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Hugh Dickins
On Wed, 25 Jul 2012, Paul E. McKenney wrote:
 On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
  On Wed, 25 Jul 2012, Paul E. McKenney wrote:
   On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:

I'm totally unclear whether the kernel ever gets built with these
'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
of where some future compiler would be permitted to mess with our
assumptions?  Or is it actually saving us already today?  Would we
know?  Could there be a boottime test that would tell us?  Is it
likely that a future compiler would have an --access_once
option that the kernel build would want to turn on?
   
   The problem is that, unless you tell it otherwise, the compiler is
   permitted to assume that the code that it is generating is the only thing
   active in that address space at that time.  So the compiler might know
   that it already has a perfectly good copy of that value somewhere in
   its registers, or it might decide to fetch the value twice rather than
   once due to register pressure, either of which can be fatal in SMP code.
   And then there are more aggressive optimizations as well.
   
   ACCESS_ONCE() is a way of telling the compiler to access the value
   once, regardless of what cute single-threaded optimizations that it
   otherwise might want to apply.
  
  Right, but you say might: I have never heard it asserted, that we do
  build the kernel with a compiler which actually makes such optimizations.
 
 The compiler we use today can and has hurt us with double-fetching
 and old-value-reuse optimizations.  There have been several that have
 optimized things like while (foo) into tmp = foo; while (tmp)
 in the Linux kernel, which have been dealt with by recoding.

Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

I consider the double-fetching ones more insidious,
less obviously in need of the volatile cast.

 
 You might argue that the compiler cannot reasonably apply such an
 optimization in some given case, but the compiler does much more detailed
 analysis of the code than most people are willing to do (certainly more
 than I am usually willing to do!), so I believe that a little paranoia is
 quite worthwhile.
 
  There's a lot of other surprising things which a compiler is permitted
  to do, but we would simply not use such a compiler to build the kernel.
 
 Unless we get the gcc folks to build and boot the Linux kernel as part
 of their test suite (maybe they already do, but not that I know of),
 how would either they or we know that they had deployed a destructive
 optimization?

We find out after it hits us, and someone studies the disassembly -
if we're lucky enough to crash near the origin of the problem.

 
  Does some version of gcc, under the options which we insist upon,
  make such optimizations on any of the architectures which we support?
 
 Pretty much any production-quality compiler will do double-fetch
 and old-value-reuse optimizations, the former especially on 32-bit
 x86.

That makes good sense, yes: so, under register pressure, they may
refetch from global memory, instead of using a temporary on local stack.

 I don't know of any production-quality compilers that do value
 speculation, which would make the compiler act like DEC Alpha hardware,
 and I would hope that if this does appear, (1) we would have warning
 and (2) it could be turned off.  But there has been a lot of work on
 this topic, so we would be foolish to rule it out.

I think you're justified in expecting both (1) and (2) there.

 
 But the currently deployed optimizations can already cause enough trouble.
 
  Or is there some other compiler in use on the kernel, which makes
  such optimizations?  It seems a long time since I heard of building
  the kernel with icc.  clang?
  
   I don't mind the answer "Yes, you idiot" - preferably with an example
   or two of which compiler and which piece of code it has bitten us on.
   I don't mind the answer "We just don't know if that's the case."
  
  But I'd like a better idea of how much to worry: is ACCESS_ONCE
  demonstrably needed today, or rather future-proofing and documentation?
 
 Both.  If you are coding "while (foo)" where foo can be changed by an
 interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
 or something similar, because most compilers are happy to optimize your
 loop into an infinite loop in that case.  There are places in the Linux
 kernel that would have problems if the compiler decided to refetch a
 value -- if a pointer was changed in the meantime, part of your code
 might be working on the old structure, and part on the new structure.
 This really can happen today, and this is why rcu_dereference() contains
 an ACCESS_ONCE().
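
Schematically, the refetch hazard with a pointer looks like this (invented
structure, not kernel code); rcu_dereference() wraps the same ACCESS_ONCE()
plus, on Alpha, the dependency barrier:

	struct conf {
		int a;
		int b;
	};

	struct conf *global_conf;	/* writers install a whole new struct */

	int conf_sum(void)
	{
		struct conf *c = global_conf;	/* compiler may refetch      */
		int a = c->a;			/* global_conf for each use, */
		int b = c->b;			/* so a and b can come from  */
		return a + b;			/* two different structures  */
	}

	int conf_sum_fixed(void)
	{
		struct conf *c = ACCESS_ONCE(global_conf);	/* one snapshot */

		return c->a + c->b;
	}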
 
 If you are making lockless non-atomic access to a variable, I strongly
 suggest ACCESS_ONCE() or something similar even if you cannot see how
 the compiler can mess you up, especially in 

Re: [RFC] page-table walkers vs memory order

2012-07-25 Thread Paul E. McKenney
On Wed, Jul 25, 2012 at 03:09:48PM -0700, Hugh Dickins wrote:
 On Wed, 25 Jul 2012, Paul E. McKenney wrote:
  On Wed, Jul 25, 2012 at 01:26:43PM -0700, Hugh Dickins wrote:
   On Wed, 25 Jul 2012, Paul E. McKenney wrote:
On Tue, Jul 24, 2012 at 02:51:05PM -0700, Hugh Dickins wrote:
 
 I'm totally unclear whether the kernel ever gets built with these
 'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
 of where some future compiler would be permitted to mess with our
 assumptions?  Or is it actually saving us already today?  Would we
 know?  Could there be a boottime test that would tell us?  Is it
 likely that a future compiler would have an --access_once
 option that the kernel build would want to turn on?

The problem is that, unless you tell it otherwise, the compiler is
permitted to assume that the code that it is generating is the only 
thing
active in that address space at that time.  So the compiler might know
that it already has a perfectly good copy of that value somewhere in
its registers, or it might decide to fetch the value twice rather than
once due to register pressure, either of which can be fatal in SMP code.
And then there are more aggressive optimizations as well.

ACCESS_ONCE() is a way of telling the compiler to access the value
once, regardless of what cute single-threaded optimizations that it
otherwise might want to apply.
   
   Right, but you say might: I have never heard it asserted, that we do
   build the kernel with a compiler which actually makes such optimizations.
  
  The compiler we use today can and has hurt us with double-fetching
  and old-value-reuse optimizations.  There have been several that have
  optimized things like "while (foo)" into "tmp = foo; while (tmp)"
  in the Linux kernel, which have been dealt with by recoding.
 
 Ah yes, those: I think we need ACCESS_EVERY_TIME() for those ones ;)

;-) ;-) ;-)

 I consider the double-fetching ones more insidious,
 less obviously in need of the volatile cast.

Agreed!

  You might argue that the compiler cannot reasonably apply such an
  optimization in some given case, but the compiler does much more detailed
  analysis of the code than most people are willing to do (certainly more
  than I am usually willing to do!), so I believe that a little paranoia is
  quite worthwhile.
  
   There's a lot of other surprising things which a compiler is permitted
   to do, but we would simply not use such a compiler to build the kernel.
  
  Unless we get the gcc folks to build and boot the Linux kernel as part
  of their test suite (maybe they already do, but not that I know of),
  how would either they or we know that they had deployed a destructive
  optimization?
 
 We find out after it hits us, and someone studies the disassembly -
 if we're lucky enough to crash near the origin of the problem.

Color me unreassured.  ;-)

   Does some version of gcc, under the options which we insist upon,
   make such optimizations on any of the architectures which we support?
  
  Pretty much any production-quality compiler will do double-fetch
  and old-value-reuse optimizations, the former especially on 32-bit
  x86.
 
 That makes good sense, yes: so, under register pressure, they may
 refetch from global memory, instead of using a temporary on local stack.
 
  I don't know of any production-quality compilers that do value
  speculation, which would make the compiler act like DEC Alpha hardware,
  and I would hope that if this does appear, (1) we would have warning
  and (2) it could be turned off.  But there has been a lot of work on
  this topic, so we would be foolish to rule it out.
 
 I think you're justified in expecting both (1) and (2) there.

Here is hoping!
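
For concreteness, value speculation would amount to a transformation like
the following (invented names; as noted, no production compiler in use is
known to do this today):

	struct obj {
		int field;
	};

	struct obj *gp;		/* published with rcu_assign_pointer()    */
	struct obj *last_gp;	/* a stale value the compiler "remembers" */

	int reader(void)
	{
		struct obj *p = rcu_dereference(gp);

		return p->field;
	}

	/* What speculation could effectively turn reader() into: */
	int reader_speculated(void)
	{
		int val = last_gp->field;	/* dependent load issued first... */
		struct obj *p = gp;		/* ...pointer itself read later   */

		if (p != last_gp)
			val = p->field;		/* redo only if the guess failed  */
		return val;
	}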

  But the currently deployed optimizations can already cause enough trouble.
  
   Or is there some other compiler in use on the kernel, which makes
   such optimizations?  It seems a long time since I heard of building
   the kernel with icc.  clang?
   
   I don't mind the answer "Yes, you idiot" - preferably with an example
   or two of which compiler and which piece of code it has bitten us on.
   I don't mind the answer "We just don't know if that's the case."
   
   But I'd like a better idea of how much to worry: is ACCESS_ONCE
   demonstrably needed today, or rather future-proofing and documentation?
  
  Both.  If you are coding "while (foo)" where foo can be changed by an
  interrupt handler, you had better instead write "while (ACCESS_ONCE(foo))"
  or something similar, because most compilers are happy to optimize your
  loop into an infinite loop in that case.  There are places in the Linux
  kernel that would have problems if the compiler decided to refetch a
  value -- if a pointer was changed in the meantime, part of your code
  might be working on the old structure, and part on the new structure.
  This really can happen today, and this is why rcu_dereference() 

Re: [RFC] page-table walkers vs memory order

2012-07-24 Thread Hugh Dickins
On Mon, 23 Jul 2012, Peter Zijlstra wrote:
> 
> While staring at mm/huge_memory.c I found a very under-commented
> smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
> __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
> thereto).
> 
> Now, afaict we're not good, as per that comment. Paul has since
> convinced some of us that compiler writers are pure evil and out to get
> us.
> 
> Therefore we should do what rcu_dereference() does and use
> ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
> we dereference a page-table pointer.
> 
> 
> In particular it looks like things like
> mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> itself) are quite broken on Alpha

The Alpha pmd_offset() and pte_offset_map() already contain an
smp_read_barrier_depends() (362a61ad from Nick); with comment that
it's not needed on the pgd, and I presume the pud level is folded.
Does Alpha really need more of them, as you have put below?

> and subtly broken for those of us with 'creative' compilers.

I don't want to fight against ACCESS_ONCE() (or barrier(): that's
interesting, thank you, I hadn't seen it used as an ACCESS_ONCE()
substitute before); but I do want to question it a little.

I'm totally unclear whether the kernel ever gets built with these
'creative' compilers that you refer to.  Is ACCESS_ONCE() a warning
of where some future compiler would be permitted to mess with our
assumptions?  Or is it actually saving us already today?  Would we
know?  Could there be a boottime test that would tell us?  Is it
likely that a future compiler would have an "--access_once"
option that the kernel build would want to turn on?

Those may all be questions for Paul!

> 
> Should I go do a more extensive audit of page-table walkers or are we
> happy with the status quo?

I do love the status quo, but an audit would be welcome.  When
it comes to patches, personally I tend to prefer ACCESS_ONCE() and
smp_read_barrier_depends() and accompanying comments to be hidden away
in the underlying macros or inlines where reasonable, rather than
repeated all over; but I may have my priorities wrong on that.

The last time we rewrote the main pgd-pud-pmd-pte walkers,
we believed that no ACCESS_ONCE() was necessary, because although a
pgd-pud-pmd entry might be racily instantiated at any instant, it
could never change beneath us - the freeing of page tables happens
only when we cannot reach them by other routes.

(Never quite true: those _clear_bad() things can zero entries at any
instant, but we're already in a bad place when those come into play,
so we never worried about racing against them.)
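
For reference, the instantiation/walk pairing that the smp_wmb() in question
is there for, very schematically (hypothetical helper names; the real
__pte_alloc() also takes page_table_lock and rechecks pmd_none() before
populating):

	/* Instantiation side: contents must be visible before the entry. */
	static int install_pte_table(struct mm_struct *mm, pmd_t *pmd,
				     unsigned long address)
	{
		pgtable_t new = pte_alloc_one(mm, address);

		if (!new)
			return -ENOMEM;
		smp_wmb();			/* table contents, then pmd entry */
		pmd_populate(mm, pmd, new);	/* now reachable through *pmd     */
		return 0;
	}

	/* Lockless-walker side, the shape of the gup_fast()-style read: */
	static pte_t *lookup_pte(pmd_t *pmdp, unsigned long addr)
	{
		pmd_t entry = ACCESS_ONCE(*pmdp);	/* one snapshot of the entry  */

		smp_read_barrier_depends();		/* Alpha: order dependent use */
		return pte_offset_map(&entry, addr);
	}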

Since then, I think THP has made the rules more complicated; but I
believe Andrea paid a great deal of attention to that kind of issue.

I suspect your arch/x86/mm/gup.c ACCESS_ONCE()s are necessary:
gup_fast() breaks as many rules as it can, and in particular may
be racing with the freeing of page tables; but I'm not so sure
about the pagewalk mods - we could say "cannot do any harm",
but I don't like adding lines on that basis.

Hugh

> 
> ---
>  arch/x86/mm/gup.c |6 +++---
>  mm/pagewalk.c |   24 
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
> unsigned long end,
>  
>   pmdp = pmd_offset(&pud, addr);
>   do {
> - pmd_t pmd = *pmdp;
> + pmd_t pmd = ACCESS_ONCE(*pmdp);
>  
>   next = pmd_addr_end(addr, end);
>   /*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
> unsigned long end,
>  
>   pudp = pud_offset(&pgd, addr);
>   do {
> - pud_t pud = *pudp;
> + pud_t pud = ACCESS_ONCE(*pudp);
>  
>   next = pud_addr_end(addr, end);
>   if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>   local_irq_save(flags);
>   pgdp = pgd_offset(mm, addr);
>   do {
> - pgd_t pgd = *pgdp;
> + pgd_t pgd = ACCESS_ONCE(*pgdp);
>  
>   next = pgd_addr_end(addr, end);
>   if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
> unsigned long end,
>   int err = 0;
>  
>   pte = pte_offset_map(pmd, addr);
> + /*
> +  * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +  * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +  * actual dereference of p[gum]d, but that's 

Re: [RFC] page-table walkers vs memory order

2012-07-23 Thread Paul E. McKenney
On Mon, Jul 23, 2012 at 07:34:30PM +0200, Peter Zijlstra wrote:
> 
> While staring at mm/huge_memory.c I found a very under-commented
> smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
> __{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
> thereto).
> 
> Now, afaict we're not good, as per that comment. Paul has since
> convinced some of us that compiler writers are pure evil and out to get
> us.

I have seen the glint in their eyes when they discuss optimization
techniques that you would not want your children to know about!

> Therefore we should do what rcu_dereference() does and use
> ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
> we dereference a page-table pointer.
> 
> 
> In particular it looks like things like
> mm/memcontrol.c:mem_cgroup_count_precharge(), which use
> walk_page_range() under down_read(&mm->mmap_sem) and can thus be
> concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
> itself) are quite broken on Alpha and subtly broken for those of us with
> 'creative' compilers.

Looks good to me, though given my ignorance of mm, not sure that counts
for much.

Thanx, Paul

> Should I go do a more extensive audit of page-table walkers or are we
> happy with the status quo?
> 
> ---
>  arch/x86/mm/gup.c |6 +++---
>  mm/pagewalk.c |   24 
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index dd74e46..4958fb1 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
> unsigned long end,
>  
>   pmdp = pmd_offset(&pud, addr);
>   do {
> - pmd_t pmd = *pmdp;
> + pmd_t pmd = ACCESS_ONCE(*pmdp);

Here, ACCESS_ONCE() suffices because this is x86-specific code.
Core code would need to worry about Alpha, and would thus also need
smp_read_barrier_depends(), as you in fact do have further down.
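
Schematically, a core-code version of that load would have to look something
like the following (hypothetical helper, not part of the patch):

	/* Snapshot a pmd entry and order later dependent accesses against it,
	 * in the spirit of rcu_dereference(). */
	#define pmd_read_once(pmdp)				\
	({							\
		pmd_t __entry = ACCESS_ONCE(*(pmdp));		\
		smp_read_barrier_depends();			\
		__entry;					\
	})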

>   next = pmd_addr_end(addr, end);
>   /*
> @@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
> unsigned long end,
>  
>   pudp = pud_offset(&pgd, addr);
>   do {
> - pud_t pud = *pudp;
> + pud_t pud = ACCESS_ONCE(*pudp);
>  
>   next = pud_addr_end(addr, end);
>   if (pud_none(pud))
> @@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>   local_irq_save(flags);
>   pgdp = pgd_offset(mm, addr);
>   do {
> - pgd_t pgd = *pgdp;
> + pgd_t pgd = ACCESS_ONCE(*pgdp);
>  
>   next = pgd_addr_end(addr, end);
>   if (pgd_none(pgd))
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6c118d0..2ba2a74 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
> unsigned long end,
>   int err = 0;
>  
>   pte = pte_offset_map(pmd, addr);
> + /*
> +  * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +  * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +  * actual dereference of p[gum]d, but that's hidden deep within the
> +  * bowels of {pte,pmd,pud}_offset.
> +  */
> + barrier();
> + smp_read_barrier_depends();
>   for (;;) {
>   err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>   if (err)
> @@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
> unsigned long end,
>   int err = 0;
>  
>   pmd = pmd_offset(pud, addr);
> + /*
> +  * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +  * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +  * actual dereference of p[gum]d, but that's hidden deep within the
> +  * bowels of {pte,pmd,pud}_offset.
> +  */
> + barrier();
> + smp_read_barrier_depends();
>   do {
>  again:
>   next = pmd_addr_end(addr, end);
> @@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, 
> unsigned long end,
>   int err = 0;
>  
>   pud = pud_offset(pgd, addr);
> + /*
> +  * Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
> +  * __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
> +  * actual dereference of p[gum]d, but that's hidden deep within the
> +  * bowels of {pte,pmd,pud}_offset.
> +  */
> + barrier();
> + smp_read_barrier_depends();
>   do {
>   next = pud_addr_end(addr, end);
>   if (pud_none_or_clear_bad(pud)) {
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] page-table walkers vs memory order

2012-07-23 Thread Peter Zijlstra

While staring at mm/huge_memory.c I found a very under-commented
smp_wmb() in __split_huge_page_map(). It turns out that it's copied from
__{pte,pmd,pud}_alloc() but forgot the useful comment (or a reference
thereto).

Now, afaict we're not good, as per that comment. Paul has since
convinced some of us that compiler writers are pure evil and out to get
us.

Therefore we should do what rcu_dereference() does and use
ACCESS_ONCE()/barrier() followed by smp_read_barrier_depends() every time
we dereference a page-table pointer.
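
The ordering-relevant core of rcu_dereference() is roughly the following
(the real macro also carries sparse/lockdep annotations):

	#define rcu_dereference_sketch(p)			\
	({							\
		typeof(p) __p = ACCESS_ONCE(p);			\
		smp_read_barrier_depends();			\
		__p;						\
	})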


In particular it looks like things like
mm/memcontrol.c:mem_cgroup_count_precharge(), which use
walk_page_range() under down_read(&mm->mmap_sem) and can thus be
concurrent with __{pte,pmd,pud}_alloc() from faults (and possibly
itself) are quite broken on Alpha and subtly broken for those of us with
'creative' compilers.

Should I go do a more extensive audit of page-table walkers or are we
happy with the status quo?

---
 arch/x86/mm/gup.c |6 +++---
 mm/pagewalk.c |   24 
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..4958fb1 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -150,7 +150,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
unsigned long end,
 
pmdp = pmd_offset(&pud, addr);
do {
-   pmd_t pmd = *pmdp;
+   pmd_t pmd = ACCESS_ONCE(*pmdp);
 
next = pmd_addr_end(addr, end);
/*
@@ -220,7 +220,7 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
 
pudp = pud_offset(&pgd, addr);
do {
-   pud_t pud = *pudp;
+   pud_t pud = ACCESS_ONCE(*pudp);
 
next = pud_addr_end(addr, end);
if (pud_none(pud))
@@ -280,7 +280,7 @@ int __get_user_pages_fast(unsigned long start, int 
nr_pages, int write,
local_irq_save(flags);
pgdp = pgd_offset(mm, addr);
do {
-   pgd_t pgd = *pgdp;
+   pgd_t pgd = ACCESS_ONCE(*pgdp);
 
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..2ba2a74 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -10,6 +10,14 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
int err = 0;
 
pte = pte_offset_map(pmd, addr);
+   /*
+* Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+* __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+* actual dereference of p[gum]d, but that's hidden deep within the
+* bowels of {pte,pmd,pud}_offset.
+*/
+   barrier();
+   smp_read_barrier_depends();
for (;;) {
err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
@@ -32,6 +40,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
int err = 0;
 
pmd = pmd_offset(pud, addr);
+   /*
+* Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+* __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+* actual dereference of p[gum]d, but that's hidden deep within the
+* bowels of {pte,pmd,pud}_offset.
+*/
+   barrier();
+   smp_read_barrier_depends();
do {
 again:
next = pmd_addr_end(addr, end);
@@ -77,6 +93,14 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
int err = 0;
 
pud = pud_offset(pgd, addr);
+   /*
+* Pairs with the smp_wmb() in __{pte,pmd,pud}_alloc() and
+* __split_huge_page_map(). Ideally we'd use ACCESS_ONCE() on the
+* actual dereference of p[gum]d, but that's hidden deep within the
+* bowels of {pte,pmd,pud}_offset.
+*/
+   barrier();
+   smp_read_barrier_depends();
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud)) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

