Re: [PATCH] Convert struct pid count to refcount_t

2019-04-04 Thread Joel Fernandes
On Thu, Apr 04, 2019 at 11:19:46AM -0700, Paul E. McKenney wrote:
[snip]
> > > > > Further, from the herd simulator output (below), according to the 
> > > > > "States",
> > > > > r1==1 means P1() AFAICS would have already finished the the read and 
> > > > > set the
> > > > > r1 register to 1.  Then I am wondering why it couldn't take the 
> > > > > branch to set
> > > > > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid 
> > > > > state
> > > > > for the below program. I still couldn't see in my mind how for the 
> > > > > below
> > > > > program, this is possible - in terms of compiler optimizations or 
> > > > > other kinds
> > > > > of ordering. Because there is a smp_mb() between the 2 plain writes 
> > > > > in P0()
> > > > > and P1() did establish that r1 is 1 in the positive case. :-/.  I am 
> > > > > surely
> > > > > missing something :-)
> > > > > 
> > > > > ---8<---
> > > > > C Joel-put_pid
> > > > > 
> > > > > {}
> > > > > 
> > > > > P0(int *x, int *y)
> > > > > {
> > > > >   *x = 1;
> > > > >   smp_mb();
> > > > >   *y = 1;
> > > > > }
> > > > > 
> > > > > P1(int *x, int *y)
> > > > > {
> > > > >   int r1;
> > > > > 
> > > > >   r1 = READ_ONCE(*y);
> > > > >   if (r1)
> > > > >   WRITE_ONCE(*x, 2);
> > > > > }
> > > > > 
> > > > > exists (1:r1=1 /\ ~x=2)
> > > > > 
> > > > > ---8<---
> > > > > Output:
> > > > > 
> > > > > Test Joel-put_pid Allowed
> > > > > States 3
> > > > > 1:r1=0; x=1;
> > > > > 1:r1=1; x=1;  <-- Can't figure out why r1=1 and x != 2 here.
> > > > 
> > > > I must defer to Alan on this, but my guess is that this is due to
> > > > the fact that there is a data race.
> > > 
> > > Yes, and because the plain-access/data-race patch for the LKMM was
> > > intended only to detect data races, not to be aware of all the kinds of
> > > ordering that plain accesses can induce.  I said as much at the start
> > > of the patch's cover letter and it bears repeating.
> > > 
> > > In this case it is certainly true that the "*x = 1" write must be
> > > complete before the value of *y is changed from 0.  Hence P1's
> > > WRITE_ONCE will certainly overwrite the 1 with 2, and the final value
> > > of *x will be 2 if r1=1.
> > > 
> > > But the notion of "visibility" of a write that I put into the LKMM
> > > patch only allows for chains of marked accesses.  Since "*y = 1" is a
> > > plain access, the model does not recognize that it can enforce the
> > > ordering between the two writes to *x.
> > > 
> > > Also, you must remember, the LKMM's prediction about whether an outcome
> > > will or will not occur are meaningless if a data race is present.  
> > > Therefore the most fundamental the answer to why the "1:r1=1; x=1;"  
> > > line is there is basically what Paul said: It's there because the
> > > herd model gets completely messed up by data races.
> > 
> > Makes sense to me. Thanks for the good work on this.
> > 
> > FWIW, thought to mention (feel free ignore the suggestion if its
> > meaningless): If there is any chance that the outcome can be better
> > outputted, like r1=X; x=1; Where X stands for the result of a data race, 
> > that
> > would be lovely.  I don't know much about herd internals (yet) to say if the
> > suggestion makes sense but as a user, it would certainly help reduce
> > confusion.
> 
> The "Flag data-race" that appeared in the herd output is your friend in
> this case.  If you see that in the output, that means that herd detected
> a data race, and the states output might or might not be reliable.

Thanks Paul and Alan, the "Flag data-race" indication sounds good to me. I
will watch out for that :)
 - Joel



Re: [PATCH] Convert struct pid count to refcount_t

2019-04-04 Thread Alan Stern
On Thu, 4 Apr 2019, Joel Fernandes wrote:

> FWIW, thought to mention (feel free ignore the suggestion if its
> meaningless): If there is any chance that the outcome can be better
> outputted, like r1=X; x=1; Where X stands for the result of a data race, that
> would be lovely.  I don't know much about herd internals (yet) to say if the
> suggestion makes sense but as a user, it would certainly help reduce
> confusion.

I don't think that is feasible.  For one thing, according to the C 
Standard, in the presence of a data race a program is allowed to do 
anything at all.  There's no point trying to display all possible 
outcomes of an execution!

If we adopted the proposal that certain kinds of data races only result
in undefined values, things would be a little better.  It's possible
that the herd program could be changed to support undefined values.  
At present, however, it does not.

Alan Stern



Re: [PATCH] Convert struct pid count to refcount_t

2019-04-04 Thread Paul E. McKenney
On Thu, Apr 04, 2019 at 02:08:42PM -0400, Joel Fernandes wrote:
> Thanks a lot Paul and Allen, I replied below.
> 
> On Thu, Apr 04, 2019 at 12:01:32PM -0400, Alan Stern wrote:
> > On Thu, 4 Apr 2019, Paul E. McKenney wrote:
> > 
> > > On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> > 
> > > > > > So I would have expected the following litmus to result in Never, 
> > > > > > but it
> > > > > > doesn't with Alan's patch:
> > > > > > 
> > > > > > P0(int *x, int *y)
> > > > > > {
> > > > > > *x = 1;
> > > > > > smp_mb();
> > > > > > *y = 1;
> > > > > > }
> > > > > > 
> > > > > > P1(int *x, int *y)
> > > > > > {
> > > > > > int r1;
> > > > > > 
> > > > > > r1 = READ_ONCE(*y);
> > > > > > if (r1)
> > > > > > WRITE_ONCE(*x, 2);
> > > > > > }
> > > > > > 
> > > > > > exists (1:r1=1 /\ ~x=2)
> > > > > 
> > > > > The problem is that the compiler can turn both of P0()'s writes into 
> > > > > reads:
> > > > > 
> > > > > P0(int *x, int *y)
> > > > > {
> > > > >   if (*x != 1)
> > > > >   *x = 1;
> > > > >   smp_mb();
> > > > >   if (*y != 1)
> > > > >   *y = 1;
> > > > > }
> > > > > 
> > > > > These reads will not be ordered by smp_wmb(), so you have to do 
> > > > > WRITE_ONCE()
> > > > > to get an iron-clad ordering guarantee.
> > > > 
> > > > But the snippet above has smp_mb() which does order writes, even for the
> > > > plain accesses.
> > > 
> > > True, but that code has a data race, namely P0()'s plain write to y
> > > races with P1()'s READ_ONCE().  Data races give the compiler absolutely
> > > astonishing levels of freedom to rearrange your code.  So, if you
> > > as a developer or maintainer choose to have data races, it is your
> > > responsibility to ensure that the compiler cannot mess you up.  So what
> > > you should do in that case is to list the transformed code the compiler's
> > > optimizations can produce and feed the corresponding litmus tests to LKMM,
> > > using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.
> > 
> > In this case, I strongly suspect the compiler would not mess up the
> > code badly enough to allow the unwanted end result.  But the LKMM tries
> > to avoid making strong assumptions about what compilers will or will
> > not do.
> 
> Here I was just trying to understand better how any kind of code
> transformation can cause an issue. Certainly I can see an issue if the
> compiler uses the memory location as a temporary variable as Paul pointed, to
> which I agree that a WRITE_ONCE is better. I am in favor of being careful,
> but here I was just trying to understand this better.
> 
> > > > I understand. Are we talking about load/store tearing being the issue 
> > > > here?
> > > > Even under load/store tearing, I feel the program will produce correct
> > > > results because r1 is either 0 or 1 (a single bit cannot be torn).
> > > 
> > > The compiler can (at least in theory) also transform *y = 1 as follows:
> > > 
> > >   *y = r1; /* Use *y as temporary storage. */
> > >   do_something();
> > >   r1 = *y;
> > >   *y = 1;
> > > 
> > > Here r1 is some register and do_something() is an inline function visible
> > > to the compiler (hence not implying barrier() upon call and return).
> > > 
> > > I don't know of any compilers that actually do this, but for me use
> > > of WRITE_ONCE() is a small price to pay to prevent all past, present,
> > > and future compiler from ever destroying^W optimizing my code in this way.
> > > 
> > > In your own code, if you dislike WRITE_ONCE() so much that you are willing
> > > to commit to (now and forever!!!) checking all applicable versions of
> > > compilers to make sure that they don't do this, well and good, knock
> > > yourself out.  But it is your responsibility to do that checking, and you
> > > can attest to LKMM that you have done so by giving LKMM litmus tests that
> > > substitute WRITE_ONCE() for that plain C-language assignment statement.
> > 
> > Remember that the LKMM does not produce strict bounds.  That is, the
> > LKMM will say that some outcomes are allowed even though no existing
> > compiler/CPU combination would ever actually produce them.  This litmus
> > test is an example.
> 
> Ok.
> 
> > > > Further, from the herd simulator output (below), according to the 
> > > > "States",
> > > > r1==1 means P1() AFAICS would have already finished the the read and 
> > > > set the
> > > > r1 register to 1.  Then I am wondering why it couldn't take the branch 
> > > > to set
> > > > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid 
> > > > state
> > > > for the below program. I still couldn't see in my mind how for the below
> > > > program, this is possible - in terms of compiler optimizations or other 
> > > > kinds
> > > > of ordering. Because there is a smp_mb() between the 2 plain writes in 
> > > > P0()
> > > > and P1() did establish that r1 is 1 in the positive case. :-/.  I am 
> > > > surely
> > > > missing 

Re: [PATCH] Convert struct pid count to refcount_t

2019-04-04 Thread Joel Fernandes
Thanks a lot Paul and Allen, I replied below.

On Thu, Apr 04, 2019 at 12:01:32PM -0400, Alan Stern wrote:
> On Thu, 4 Apr 2019, Paul E. McKenney wrote:
> 
> > On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> 
> > > > > So I would have expected the following litmus to result in Never, but 
> > > > > it
> > > > > doesn't with Alan's patch:
> > > > > 
> > > > > P0(int *x, int *y)
> > > > > {
> > > > >   *x = 1;
> > > > >   smp_mb();
> > > > >   *y = 1;
> > > > > }
> > > > > 
> > > > > P1(int *x, int *y)
> > > > > {
> > > > >   int r1;
> > > > > 
> > > > >   r1 = READ_ONCE(*y);
> > > > >   if (r1)
> > > > >   WRITE_ONCE(*x, 2);
> > > > > }
> > > > > 
> > > > > exists (1:r1=1 /\ ~x=2)
> > > > 
> > > > The problem is that the compiler can turn both of P0()'s writes into 
> > > > reads:
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > if (*x != 1)
> > > > *x = 1;
> > > > smp_mb();
> > > > if (*y != 1)
> > > > *y = 1;
> > > > }
> > > > 
> > > > These reads will not be ordered by smp_wmb(), so you have to do 
> > > > WRITE_ONCE()
> > > > to get an iron-clad ordering guarantee.
> > > 
> > > But the snippet above has smp_mb() which does order writes, even for the
> > > plain accesses.
> > 
> > True, but that code has a data race, namely P0()'s plain write to y
> > races with P1()'s READ_ONCE().  Data races give the compiler absolutely
> > astonishing levels of freedom to rearrange your code.  So, if you
> > as a developer or maintainer choose to have data races, it is your
> > responsibility to ensure that the compiler cannot mess you up.  So what
> > you should do in that case is to list the transformed code the compiler's
> > optimizations can produce and feed the corresponding litmus tests to LKMM,
> > using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.
> 
> In this case, I strongly suspect the compiler would not mess up the
> code badly enough to allow the unwanted end result.  But the LKMM tries
> to avoid making strong assumptions about what compilers will or will
> not do.

Here I was just trying to understand better how any kind of code
transformation can cause an issue. Certainly I can see an issue if the
compiler uses the memory location as a temporary variable as Paul pointed, to
which I agree that a WRITE_ONCE is better. I am in favor of being careful,
but here I was just trying to understand this better.

> > > I understand. Are we talking about load/store tearing being the issue 
> > > here?
> > > Even under load/store tearing, I feel the program will produce correct
> > > results because r1 is either 0 or 1 (a single bit cannot be torn).
> > 
> > The compiler can (at least in theory) also transform *y = 1 as follows:
> > 
> > *y = r1; /* Use *y as temporary storage. */
> > do_something();
> > r1 = *y;
> > *y = 1;
> > 
> > Here r1 is some register and do_something() is an inline function visible
> > to the compiler (hence not implying barrier() upon call and return).
> > 
> > I don't know of any compilers that actually do this, but for me use
> > of WRITE_ONCE() is a small price to pay to prevent all past, present,
> > and future compiler from ever destroying^W optimizing my code in this way.
> > 
> > In your own code, if you dislike WRITE_ONCE() so much that you are willing
> > to commit to (now and forever!!!) checking all applicable versions of
> > compilers to make sure that they don't do this, well and good, knock
> > yourself out.  But it is your responsibility to do that checking, and you
> > can attest to LKMM that you have done so by giving LKMM litmus tests that
> > substitute WRITE_ONCE() for that plain C-language assignment statement.
> 
> Remember that the LKMM does not produce strict bounds.  That is, the
> LKMM will say that some outcomes are allowed even though no existing
> compiler/CPU combination would ever actually produce them.  This litmus
> test is an example.

Ok.

> > > Further, from the herd simulator output (below), according to the 
> > > "States",
> > > r1==1 means P1() AFAICS would have already finished the the read and set 
> > > the
> > > r1 register to 1.  Then I am wondering why it couldn't take the branch to 
> > > set
> > > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > > for the below program. I still couldn't see in my mind how for the below
> > > program, this is possible - in terms of compiler optimizations or other 
> > > kinds
> > > of ordering. Because there is a smp_mb() between the 2 plain writes in 
> > > P0()
> > > and P1() did establish that r1 is 1 in the positive case. :-/.  I am 
> > > surely
> > > missing something :-)
> > > 
> > > ---8<---
> > > C Joel-put_pid
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >   *x = 1;
> > >   smp_mb();
> > >   *y = 1;
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >   int r1;
> > > 
> > >   r1 = 

Re: [PATCH] Convert struct pid count to refcount_t

2019-04-04 Thread Alan Stern
On Thu, 4 Apr 2019, Paul E. McKenney wrote:

> On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:

> > > > So I would have expected the following litmus to result in Never, but it
> > > > doesn't with Alan's patch:
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > *x = 1;
> > > > smp_mb();
> > > > *y = 1;
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > int r1;
> > > > 
> > > > r1 = READ_ONCE(*y);
> > > > if (r1)
> > > > WRITE_ONCE(*x, 2);
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ ~x=2)
> > > 
> > > The problem is that the compiler can turn both of P0()'s writes into 
> > > reads:
> > > 
> > > P0(int *x, int *y)
> > > {
> > >   if (*x != 1)
> > >   *x = 1;
> > >   smp_mb();
> > >   if (*y != 1)
> > >   *y = 1;
> > > }
> > > 
> > > These reads will not be ordered by smp_wmb(), so you have to do 
> > > WRITE_ONCE()
> > > to get an iron-clad ordering guarantee.
> > 
> > But the snippet above has smp_mb() which does order writes, even for the
> > plain accesses.
> 
> True, but that code has a data race, namely P0()'s plain write to y
> races with P1()'s READ_ONCE().  Data races give the compiler absolutely
> astonishing levels of freedom to rearrange your code.  So, if you
> as a developer or maintainer choose to have data races, it is your
> responsibility to ensure that the compiler cannot mess you up.  So what
> you should do in that case is to list the transformed code the compiler's
> optimizations can produce and feed the corresponding litmus tests to LKMM,
> using READ_ONCE() and WRITE_ONCE() for the post-optimization accesses.

In this case, I strongly suspect the compiler would not mess up the
code badly enough to allow the unwanted end result.  But the LKMM tries
to avoid making strong assumptions about what compilers will or will
not do.


> > I understand. Are we talking about load/store tearing being the issue here?
> > Even under load/store tearing, I feel the program will produce correct
> > results because r1 is either 0 or 1 (a single bit cannot be torn).
> 
> The compiler can (at least in theory) also transform *y = 1 as follows:
> 
>   *y = r1; /* Use *y as temporary storage. */
>   do_something();
>   r1 = *y;
>   *y = 1;
> 
> Here r1 is some register and do_something() is an inline function visible
> to the compiler (hence not implying barrier() upon call and return).
> 
> I don't know of any compilers that actually do this, but for me use
> of WRITE_ONCE() is a small price to pay to prevent all past, present,
> and future compiler from ever destroying^W optimizing my code in this way.
> 
> In your own code, if you dislike WRITE_ONCE() so much that you are willing
> to commit to (now and forever!!!) checking all applicable versions of
> compilers to make sure that they don't do this, well and good, knock
> yourself out.  But it is your responsibility to do that checking, and you
> can attest to LKMM that you have done so by giving LKMM litmus tests that
> substitute WRITE_ONCE() for that plain C-language assignment statement.

Remember that the LKMM does not produce strict bounds.  That is, the
LKMM will say that some outcomes are allowed even though no existing
compiler/CPU combination would ever actually produce them.  This litmus
test is an example.

> > Further, from the herd simulator output (below), according to the "States",
> > r1==1 means P1() AFAICS would have already finished the the read and set the
> > r1 register to 1.  Then I am wondering why it couldn't take the branch to 
> > set
> > *x to 2.  According to herd, r1 == 1 AND x == 1 is a perfectly valid state
> > for the below program. I still couldn't see in my mind how for the below
> > program, this is possible - in terms of compiler optimizations or other 
> > kinds
> > of ordering. Because there is a smp_mb() between the 2 plain writes in P0()
> > and P1() did establish that r1 is 1 in the positive case. :-/.  I am surely
> > missing something :-)
> > 
> > ---8<---
> > C Joel-put_pid
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_mb();
> > *y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > int r1;
> > 
> > r1 = READ_ONCE(*y);
> > if (r1)
> > WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > 
> > ---8<---
> > Output:
> > 
> > Test Joel-put_pid Allowed
> > States 3
> > 1:r1=0; x=1;
> > 1:r1=1; x=1;<-- Can't figure out why r1=1 and x != 2 here.
> 
> I must defer to Alan on this, but my guess is that this is due to
> the fact that there is a data race.

Yes, and because the plain-access/data-race patch for the LKMM was
intended only to detect data races, not to be aware of all the kinds of
ordering that plain accesses can induce.  I said as much at the start
of the patch's cover letter and it bears repeating.

In this case it is certainly true that the "*x = 1" write must be

Re: [PATCH] Convert struct pid count to refcount_t

2019-04-04 Thread Paul E. McKenney
On Mon, Apr 01, 2019 at 05:11:39PM -0400, Joel Fernandes wrote:
> Thanks a lot Alan and Paul for the replies. I replied inline.
> 
> On Sun, Mar 31, 2019 at 02:55:31PM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote:
> > > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > > > On 03/28, Jann Horn wrote:
> > > > > >
> > > > > > Since we're just talking about RCU stuff now, adding Paul McKenney 
> > > > > > to
> > > > > > the thread.
> > > > > 
> > > > > Since you added Paul let me add more confusion to this thread ;)
> > > > 
> > > > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> > > 
> > > Nice to take part in the confusion fun too!!! ;-)
> > > 
> > > > > There were some concerns about the lack of barriers in put_pid(), but 
> > > > > I can't
> > > > > find that old discussion and I forgot the result of that discussion...
> > > > > 
> > > > > Paul, could you confirm that this code
> > > > > 
> > > > >   CPU_0   CPU_1
> > > > > 
> > > > >   X = 1;  if (READ_ONCE(Y))
> > > > >   mb();   X = 2;
> > > > >   Y = 1;  BUG_ON(X != 2);
> > > > > 
> > > > > 
> > > > > is correct? I think it is, control dependency pairs with mb(), right?
> > > > 
> > > > The BUG_ON() is supposed to happen at the end of time, correct?
> > > > As written, there is (in the strict sense) a data race between the load
> > > > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > > > you could of course argue that this data race is harmless, especially
> > > > if X is a single byte.  But the more I talk to compiler writers, the
> > > > less comfortable I become with data races in general.  :-/
> > > > 
> > > > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > > > 
> > > > On the other hand, this is a great opportunity to try out Alan Stern's
> > > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > > > 
> > > > https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org
> > > > 
> > > > Also adding Alan on CC.
> > > > 
> > > > Here is what I believe is the litmus test that your are interested in:
> > > > 
> > > > 
> > > > C OlegNesterov-put_pid
> > > > 
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > *x = 1;
> > > > smp_mb();
> > > > *y = 1;
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > int r1;
> > > > 
> > > > r1 = READ_ONCE(*y);
> > > > if (r1)
> > > > *x = 2;
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ ~x=2)
> > > > 
> > > > 
> > > > Running this through herd with Alan's patch detects the data race
> > > > and says that the undesired outcome is allowed:
> > > > 
> > > > $ herd7  -conf linux-kernel.cfg 
> > > > /tmp/OlegNesterov-put_pid.litmus 
> > > > Test OlegNesterov-put_pid Allowed
> > > > States 3
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=1;
> > > > 1:r1=1; x=2;
> > > > Ok
> > > > Witnesses
> > > > Positive: 1 Negative: 2
> > > > Flag data-race
> > > > Condition exists (1:r1=1 /\ not (x=2))
> > > > Observation OlegNesterov-put_pid Sometimes 1 2
> > > > Time OlegNesterov-put_pid 0.00
> > > > Hash=a3e0043ad753effa860fea37eeba0a76
> > > > 
> > > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > > > although it does remove the "Flag data-race".
> > > > 
> > > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > > > gets rid of both the "Flag data-race" and the undesired outcome:
> > > > 
> > > > $ herd7  -conf linux-kernel.cfg 
> > > > /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > > > Test OlegNesterov-put_pid-WO-WO Allowed
> > > > States 2
> > > > 1:r1=0; x=1;
> > > > 1:r1=1; x=2;
> > > > No
> > > > Witnesses
> > > > Positive: 0 Negative: 2
> > > > Condition exists (1:r1=1 /\ not (x=2))
> > > > Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > > > Time OlegNesterov-put_pid-WO-WO 0.01
> > > > Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > > > 
> > > > Here is the corresponding litmus test, in case I messed something up:
> > > > 
> > > > 
> > > > C OlegNesterov-put_pid-WO-WO
> > > > 
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > > *x = 1;
> > > > smp_mb();
> > > > WRITE_ONCE(*y, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > > int r1;
> > > > 
> > > > r1 = READ_ONCE(*y);
> > > > if (r1)
> > > > 

Re: [PATCH] Convert struct pid count to refcount_t

2019-04-01 Thread Joel Fernandes
Thanks a lot Alan and Paul for the replies. I replied inline.

On Sun, Mar 31, 2019 at 02:55:31PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote:
> > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > > On 03/28, Jann Horn wrote:
> > > > >
> > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > > the thread.
> > > > 
> > > > Since you added Paul let me add more confusion to this thread ;)
> > > 
> > > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> > 
> > Nice to take part in the confusion fun too!!! ;-)
> > 
> > > > There were some concerns about the lack of barriers in put_pid(), but I 
> > > > can't
> > > > find that old discussion and I forgot the result of that discussion...
> > > > 
> > > > Paul, could you confirm that this code
> > > > 
> > > > CPU_0   CPU_1
> > > > 
> > > > X = 1;  if (READ_ONCE(Y))
> > > > mb();   X = 2;
> > > > Y = 1;  BUG_ON(X != 2);
> > > > 
> > > > 
> > > > is correct? I think it is, control dependency pairs with mb(), right?
> > > 
> > > The BUG_ON() is supposed to happen at the end of time, correct?
> > > As written, there is (in the strict sense) a data race between the load
> > > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > > you could of course argue that this data race is harmless, especially
> > > if X is a single byte.  But the more I talk to compiler writers, the
> > > less comfortable I become with data races in general.  :-/
> > > 
> > > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > > 
> > > On the other hand, this is a great opportunity to try out Alan Stern's
> > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > > 
> > > https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org
> > > 
> > > Also adding Alan on CC.
> > > 
> > > Here is what I believe is the litmus test that your are interested in:
> > > 
> > > 
> > > C OlegNesterov-put_pid
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >   *x = 1;
> > >   smp_mb();
> > >   *y = 1;
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >   int r1;
> > > 
> > >   r1 = READ_ONCE(*y);
> > >   if (r1)
> > >   *x = 2;
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > > 
> > > 
> > > Running this through herd with Alan's patch detects the data race
> > > and says that the undesired outcome is allowed:
> > > 
> > >   $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > >   Test OlegNesterov-put_pid Allowed
> > >   States 3
> > >   1:r1=0; x=1;
> > >   1:r1=1; x=1;
> > >   1:r1=1; x=2;
> > >   Ok
> > >   Witnesses
> > >   Positive: 1 Negative: 2
> > >   Flag data-race
> > >   Condition exists (1:r1=1 /\ not (x=2))
> > >   Observation OlegNesterov-put_pid Sometimes 1 2
> > >   Time OlegNesterov-put_pid 0.00
> > >   Hash=a3e0043ad753effa860fea37eeba0a76
> > > 
> > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > > although it does remove the "Flag data-race".
> > > 
> > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > > gets rid of both the "Flag data-race" and the undesired outcome:
> > > 
> > >   $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > >   Test OlegNesterov-put_pid-WO-WO Allowed
> > >   States 2
> > >   1:r1=0; x=1;
> > >   1:r1=1; x=2;
> > >   No
> > >   Witnesses
> > >   Positive: 0 Negative: 2
> > >   Condition exists (1:r1=1 /\ not (x=2))
> > >   Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > >   Time OlegNesterov-put_pid-WO-WO 0.01
> > >   Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > > 
> > > Here is the corresponding litmus test, in case I messed something up:
> > > 
> > > 
> > > C OlegNesterov-put_pid-WO-WO
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >   *x = 1;
> > >   smp_mb();
> > >   WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >   int r1;
> > > 
> > >   r1 = READ_ONCE(*y);
> > >   if (r1)
> > >   WRITE_ONCE(*x, 2);
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > 
> > I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE 
> > in
> > P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be
> > sufficient to prevent the exists condition. Shouldn't the compiler know 
> > that,
> > in P0(), it should not reorder the store to y=1 before the x=1 because there
> > is an explicit barrier between the 2 stores? Looks me to me like a broken
> > compiler :-|. 
> > 
> > So I would have expected the following litmus to result in Never, but it
> > doesn't with 

RE: [PATCH] Convert struct pid count to refcount_t

2019-04-01 Thread David Laight
From: Alan Stern
> Sent: 29 March 2019 19:45
...
> There is a big difference between WRITE_ONCE() and plain assignment.
> Given "WRITE_ONCE(X, 2)", the compiler will emit a simple store
> instruction.  But given "X = 2", the compiler is allowed to emit
> instructions equivalent to:
> 
>   if (X != 2)
>   X = 2;

Worse for you, it can also emit:
X = 0;
X = 2;

Many years ago I fell foul of a compiler (not C) that implemented
a write to a 2 bit wide bitfield as:
X &= ~3
X |= value
even when 'value' was a compile time constant of 3.
Took a while to find out why the linked list got f*cked.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH] Convert struct pid count to refcount_t

2019-03-31 Thread Paul E. McKenney
On Sat, Mar 30, 2019 at 11:16:01AM -0400, Alan Stern wrote:
> On Fri, 29 Mar 2019, Joel Fernandes wrote:
> > On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > > On 03/28, Jann Horn wrote:
> > > > >
> > > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > > the thread.
> > > > 
> > > > Since you added Paul let me add more confusion to this thread ;)
> > > 
> > > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> > 
> > Nice to take part in the confusion fun too!!! ;-)
> > 
> > > > There were some concerns about the lack of barriers in put_pid(), but I 
> > > > can't
> > > > find that old discussion and I forgot the result of that discussion...
> > > > 
> > > > Paul, could you confirm that this code
> > > > 
> > > > CPU_0   CPU_1
> > > > 
> > > > X = 1;  if (READ_ONCE(Y))
> > > > mb();   X = 2;
> > > > Y = 1;  BUG_ON(X != 2);
> > > > 
> > > > 
> > > > is correct? I think it is, control dependency pairs with mb(), right?
> > > 
> > > The BUG_ON() is supposed to happen at the end of time, correct?
> > > As written, there is (in the strict sense) a data race between the load
> > > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > > you could of course argue that this data race is harmless, especially
> > > if X is a single byte.  But the more I talk to compiler writers, the
> > > less comfortable I become with data races in general.  :-/
> > > 
> > > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > > 
> > > On the other hand, this is a great opportunity to try out Alan Stern's
> > > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > > 
> > > https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org
> > > 
> > > Also adding Alan on CC.
> > > 
> > > Here is what I believe is the litmus test that your are interested in:
> > > 
> > > 
> > > C OlegNesterov-put_pid
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >   *x = 1;
> > >   smp_mb();
> > >   *y = 1;
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >   int r1;
> > > 
> > >   r1 = READ_ONCE(*y);
> > >   if (r1)
> > >   *x = 2;
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > > 
> > > 
> > > Running this through herd with Alan's patch detects the data race
> > > and says that the undesired outcome is allowed:
> > > 
> > >   $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > >   Test OlegNesterov-put_pid Allowed
> > >   States 3
> > >   1:r1=0; x=1;
> > >   1:r1=1; x=1;
> > >   1:r1=1; x=2;
> > >   Ok
> > >   Witnesses
> > >   Positive: 1 Negative: 2
> > >   Flag data-race
> > >   Condition exists (1:r1=1 /\ not (x=2))
> > >   Observation OlegNesterov-put_pid Sometimes 1 2
> > >   Time OlegNesterov-put_pid 0.00
> > >   Hash=a3e0043ad753effa860fea37eeba0a76
> > > 
> > > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > > although it does remove the "Flag data-race".
> > > 
> > > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > > gets rid of both the "Flag data-race" and the undesired outcome:
> > > 
> > >   $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > >   Test OlegNesterov-put_pid-WO-WO Allowed
> > >   States 2
> > >   1:r1=0; x=1;
> > >   1:r1=1; x=2;
> > >   No
> > >   Witnesses
> > >   Positive: 0 Negative: 2
> > >   Condition exists (1:r1=1 /\ not (x=2))
> > >   Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > >   Time OlegNesterov-put_pid-WO-WO 0.01
> > >   Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > > 
> > > Here is the corresponding litmus test, in case I messed something up:
> > > 
> > > 
> > > C OlegNesterov-put_pid-WO-WO
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >   *x = 1;
> > >   smp_mb();
> > >   WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >   int r1;
> > > 
> > >   r1 = READ_ONCE(*y);
> > >   if (r1)
> > >   WRITE_ONCE(*x, 2);
> > > }
> > > 
> > > exists (1:r1=1 /\ ~x=2)
> > 
> > I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE 
> > in
> > P0() is required,
> 
> If the "WRITE_ONCE(*y, 1)" in P0 were written instead as "*y = 1", it
> would race with P1's "READ_ONCE(*y)".
> 
> >  and why would the READ_ONCE / WRITE_ONCE in P1() not be
> > sufficient to prevent the exists condition. Shouldn't the compiler know 
> > that,
> > in P0(), it should not reorder the store to y=1 before the x=1 because there
> > is an explicit barrier between the 2 stores? Looks me to me like a broken
> > compiler :-|. 
> > 
> > So I would have expected the following litmus to result 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-31 Thread Paul E. McKenney
On Fri, Mar 29, 2019 at 10:36:39PM -0400, Joel Fernandes wrote:
> On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > On 03/28, Jann Horn wrote:
> > > >
> > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > the thread.
> > > 
> > > Since you added Paul let me add more confusion to this thread ;)
> > 
> > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> 
> Nice to take part in the confusion fun too!!! ;-)
> 
> > > There were some concerns about the lack of barriers in put_pid(), but I 
> > > can't
> > > find that old discussion and I forgot the result of that discussion...
> > > 
> > > Paul, could you confirm that this code
> > > 
> > >   CPU_0   CPU_1
> > > 
> > >   X = 1;  if (READ_ONCE(Y))
> > >   mb();   X = 2;
> > >   Y = 1;  BUG_ON(X != 2);
> > > 
> > > 
> > > is correct? I think it is, control dependency pairs with mb(), right?
> > 
> > The BUG_ON() is supposed to happen at the end of time, correct?
> > As written, there is (in the strict sense) a data race between the load
> > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > you could of course argue that this data race is harmless, especially
> > if X is a single byte.  But the more I talk to compiler writers, the
> > less comfortable I become with data races in general.  :-/
> > 
> > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > 
> > On the other hand, this is a great opportunity to try out Alan Stern's
> > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > 
> > https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org
> > 
> > Also adding Alan on CC.
> > 
> > Here is what I believe is the litmus test that your are interested in:
> > 
> > 
> > C OlegNesterov-put_pid
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_mb();
> > *y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > int r1;
> > 
> > r1 = READ_ONCE(*y);
> > if (r1)
> > *x = 2;
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > 
> > 
> > Running this through herd with Alan's patch detects the data race
> > and says that the undesired outcome is allowed:
> > 
> > $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > Test OlegNesterov-put_pid Allowed
> > States 3
> > 1:r1=0; x=1;
> > 1:r1=1; x=1;
> > 1:r1=1; x=2;
> > Ok
> > Witnesses
> > Positive: 1 Negative: 2
> > Flag data-race
> > Condition exists (1:r1=1 /\ not (x=2))
> > Observation OlegNesterov-put_pid Sometimes 1 2
> > Time OlegNesterov-put_pid 0.00
> > Hash=a3e0043ad753effa860fea37eeba0a76
> > 
> > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > although it does remove the "Flag data-race".
> > 
> > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > gets rid of both the "Flag data-race" and the undesired outcome:
> > 
> > $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > Test OlegNesterov-put_pid-WO-WO Allowed
> > States 2
> > 1:r1=0; x=1;
> > 1:r1=1; x=2;
> > No
> > Witnesses
> > Positive: 0 Negative: 2
> > Condition exists (1:r1=1 /\ not (x=2))
> > Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > Time OlegNesterov-put_pid-WO-WO 0.01
> > Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > 
> > Here is the corresponding litmus test, in case I messed something up:
> > 
> > 
> > C OlegNesterov-put_pid-WO-WO
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_mb();
> > WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > int r1;
> > 
> > r1 = READ_ONCE(*y);
> > if (r1)
> > WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> 
> I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in
> P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be
> sufficient to prevent the exists condition. Shouldn't the compiler know that,
> in P0(), it should not reorder the store to y=1 before the x=1 because there
> is an explicit barrier between the 2 stores? Looks me to me like a broken
> compiler :-|. 
> 
> So I would have expected the following litmus to result in Never, but it
> doesn't with Alan's patch:
> 
> P0(int *x, int *y)
> {
>   *x = 1;
>   smp_mb();
>   *y = 1;
> }
> 
> P1(int *x, int *y)
> {
>   int r1;
> 
>   r1 = READ_ONCE(*y);
>   if (r1)
>   WRITE_ONCE(*x, 2);
> }
> 
> exists (1:r1=1 /\ ~x=2)

The problem is that the compiler can turn both of P0()'s writes into reads:

P0(int *x, int *y)
{
if 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-30 Thread Alan Stern
On Fri, 29 Mar 2019, Joel Fernandes wrote:

> On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > > On 03/28, Jann Horn wrote:
> > > >
> > > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > > the thread.
> > > 
> > > Since you added Paul let me add more confusion to this thread ;)
> > 
> > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> 
> Nice to take part in the confusion fun too!!! ;-)
> 
> > > There were some concerns about the lack of barriers in put_pid(), but I 
> > > can't
> > > find that old discussion and I forgot the result of that discussion...
> > > 
> > > Paul, could you confirm that this code
> > > 
> > >   CPU_0   CPU_1
> > > 
> > >   X = 1;  if (READ_ONCE(Y))
> > >   mb();   X = 2;
> > >   Y = 1;  BUG_ON(X != 2);
> > > 
> > > 
> > > is correct? I think it is, control dependency pairs with mb(), right?
> > 
> > The BUG_ON() is supposed to happen at the end of time, correct?
> > As written, there is (in the strict sense) a data race between the load
> > of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> > you could of course argue that this data race is harmless, especially
> > if X is a single byte.  But the more I talk to compiler writers, the
> > less comfortable I become with data races in general.  :-/
> > 
> > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> > 
> > On the other hand, this is a great opportunity to try out Alan Stern's
> > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> > 
> > https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org
> > 
> > Also adding Alan on CC.
> > 
> > Here is what I believe is the litmus test that your are interested in:
> > 
> > 
> > C OlegNesterov-put_pid
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_mb();
> > *y = 1;
> > }
> > 
> > P1(int *x, int *y)
> > {
> > int r1;
> > 
> > r1 = READ_ONCE(*y);
> > if (r1)
> > *x = 2;
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> > 
> > 
> > Running this through herd with Alan's patch detects the data race
> > and says that the undesired outcome is allowed:
> > 
> > $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
> > Test OlegNesterov-put_pid Allowed
> > States 3
> > 1:r1=0; x=1;
> > 1:r1=1; x=1;
> > 1:r1=1; x=2;
> > Ok
> > Witnesses
> > Positive: 1 Negative: 2
> > Flag data-race
> > Condition exists (1:r1=1 /\ not (x=2))
> > Observation OlegNesterov-put_pid Sometimes 1 2
> > Time OlegNesterov-put_pid 0.00
> > Hash=a3e0043ad753effa860fea37eeba0a76
> > 
> > Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> > although it does remove the "Flag data-race".
> > 
> > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > gets rid of both the "Flag data-race" and the undesired outcome:
> > 
> > $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
> > Test OlegNesterov-put_pid-WO-WO Allowed
> > States 2
> > 1:r1=0; x=1;
> > 1:r1=1; x=2;
> > No
> > Witnesses
> > Positive: 0 Negative: 2
> > Condition exists (1:r1=1 /\ not (x=2))
> > Observation OlegNesterov-put_pid-WO-WO Never 0 2
> > Time OlegNesterov-put_pid-WO-WO 0.01
> > Hash=6e1643e3c5e4739b590bde0a8e8a918e
> > 
> > Here is the corresponding litmus test, in case I messed something up:
> > 
> > 
> > C OlegNesterov-put_pid-WO-WO
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_mb();
> > WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > int r1;
> > 
> > r1 = READ_ONCE(*y);
> > if (r1)
> > WRITE_ONCE(*x, 2);
> > }
> > 
> > exists (1:r1=1 /\ ~x=2)
> 
> I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in
> P0() is required,

If the "WRITE_ONCE(*y, 1)" in P0 were written instead as "*y = 1", it
would race with P1's "READ_ONCE(*y)".

>  and why would the READ_ONCE / WRITE_ONCE in P1() not be
> sufficient to prevent the exists condition. Shouldn't the compiler know that,
> in P0(), it should not reorder the store to y=1 before the x=1 because there
> is an explicit barrier between the 2 stores? Looks me to me like a broken
> compiler :-|. 
> 
> So I would have expected the following litmus to result in Never, but it
> doesn't with Alan's patch:
> 
> P0(int *x, int *y)
> {
>   *x = 1;
>   smp_mb();
>   *y = 1;
> }
> 
> P1(int *x, int *y)
> {
>   int r1;
> 
>   r1 = READ_ONCE(*y);
>   if (r1)
>   WRITE_ONCE(*x, 2);
> }
> 
> exists (1:r1=1 /\ ~x=2)

You have to 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-29 Thread Joel Fernandes
On Thu, Mar 28, 2019 at 10:37:07AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > On 03/28, Jann Horn wrote:
> > >
> > > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > > the thread.
> > 
> > Since you added Paul let me add more confusion to this thread ;)
> 
> Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)

Nice to take part in the confusion fun too!!! ;-)

> > There were some concerns about the lack of barriers in put_pid(), but I 
> > can't
> > find that old discussion and I forgot the result of that discussion...
> > 
> > Paul, could you confirm that this code
> > 
> > CPU_0   CPU_1
> > 
> > X = 1;  if (READ_ONCE(Y))
> > mb();   X = 2;
> > Y = 1;  BUG_ON(X != 2);
> > 
> > 
> > is correct? I think it is, control dependency pairs with mb(), right?
> 
> The BUG_ON() is supposed to happen at the end of time, correct?
> As written, there is (in the strict sense) a data race between the load
> of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
> you could of course argue that this data race is harmless, especially
> if X is a single byte.  But the more I talk to compiler writers, the
> less comfortable I become with data races in general.  :-/
> 
> So I would also feel better if the "Y = 1" was WRITE_ONCE().
> 
> On the other hand, this is a great opportunity to try out Alan Stern's
> prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> 
> https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org
> 
> Also adding Alan on CC.
> 
> Here is what I believe is the litmus test that your are interested in:
> 
> 
> C OlegNesterov-put_pid
> 
> {}
> 
> P0(int *x, int *y)
> {
>   *x = 1;
>   smp_mb();
>   *y = 1;
> }
> 
> P1(int *x, int *y)
> {
>   int r1;
> 
>   r1 = READ_ONCE(*y);
>   if (r1)
>   *x = 2;
> }
> 
> exists (1:r1=1 /\ ~x=2)
> 
> 
> Running this through herd with Alan's patch detects the data race
> and says that the undesired outcome is allowed:
> 
>   $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
>   Test OlegNesterov-put_pid Allowed
>   States 3
>   1:r1=0; x=1;
>   1:r1=1; x=1;
>   1:r1=1; x=2;
>   Ok
>   Witnesses
>   Positive: 1 Negative: 2
>   Flag data-race
>   Condition exists (1:r1=1 /\ not (x=2))
>   Observation OlegNesterov-put_pid Sometimes 1 2
>   Time OlegNesterov-put_pid 0.00
>   Hash=a3e0043ad753effa860fea37eeba0a76
> 
> Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
> although it does remove the "Flag data-race".
> 
> Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> gets rid of both the "Flag data-race" and the undesired outcome:
> 
>   $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
>   Test OlegNesterov-put_pid-WO-WO Allowed
>   States 2
>   1:r1=0; x=1;
>   1:r1=1; x=2;
>   No
>   Witnesses
>   Positive: 0 Negative: 2
>   Condition exists (1:r1=1 /\ not (x=2))
>   Observation OlegNesterov-put_pid-WO-WO Never 0 2
>   Time OlegNesterov-put_pid-WO-WO 0.01
>   Hash=6e1643e3c5e4739b590bde0a8e8a918e
> 
> Here is the corresponding litmus test, in case I messed something up:
> 
> 
> C OlegNesterov-put_pid-WO-WO
> 
> {}
> 
> P0(int *x, int *y)
> {
>   *x = 1;
>   smp_mb();
>   WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
>   int r1;
> 
>   r1 = READ_ONCE(*y);
>   if (r1)
>   WRITE_ONCE(*x, 2);
> }
> 
> exists (1:r1=1 /\ ~x=2)

I ran the above examples too. Its a bit confusing to me why the WRITE_ONCE in
P0() is required, and why would the READ_ONCE / WRITE_ONCE in P1() not be
sufficient to prevent the exists condition. Shouldn't the compiler know that,
in P0(), it should not reorder the store to y=1 before the x=1 because there
is an explicit barrier between the 2 stores? Looks me to me like a broken
compiler :-|. 

So I would have expected the following litmus to result in Never, but it
doesn't with Alan's patch:

P0(int *x, int *y)
{
*x = 1;
smp_mb();
*y = 1;
}

P1(int *x, int *y)
{
int r1;

r1 = READ_ONCE(*y);
if (r1)
WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)

> 
> 
> > If not, then put_pid() needs atomic_read_acquire() as it was proposed in 
> > that
> > discussion.
> 
> Good point, let's try with smp_load_acquire() in P1():
> 
>   $ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
>   Test OlegNesterov-put_pid-WO-sla Allowed
>   States 2

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-29 Thread Alan Stern
On Fri, 29 Mar 2019, Oleg Nesterov wrote:

> On 03/28, Paul E. McKenney wrote:
> >
> > On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> > >
> > > Since you added Paul let me add more confusion to this thread ;)
> >
> > Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)
> 
> OK, thanks, you certainly managed to confused me much more than I expected!
> 
> > > There were some concerns about the lack of barriers in put_pid(), but I 
> > > can't
> > > find that old discussion and I forgot the result of that discussion...
> > >
> > > Paul, could you confirm that this code
> > >
> > >   CPU_0   CPU_1
> > >
> > >   X = 1;  if (READ_ONCE(Y))
> > >   mb();   X = 2;
> > >   Y = 1;  BUG_ON(X != 2);
> > >
> > >
> > > is correct? I think it is, control dependency pairs with mb(), right?
> >
> > The BUG_ON() is supposed to happen at the end of time, correct?
> 
> Yes,
> 
> > As written, there is (in the strict sense) a data race between the load
> > of X in the BUG_ON() and CPU_0's store to X.
> 
> Well, this pseudo code is simply wrong, I meant that "X = 1" on CPU 0
> must not happen after "X = 2" on CPU 1 or we have a problem.

The situation is worse than that.  Strictly speaking, any time there is
a data race you have a problem.  In particular, the C11 Standard
permits a data race to cause undefined behavior.  For example, the
Standard wouldn't be violated if a data race caused your computer to
catch on fire.

In real life the situation isn't that bad, and you can do things the
Standard doesn't cover if you are very familiar with the properties of
the compiler.  But compilers change over time, so relying on special 
knowledge about it might not be a good idea in the long run.

> > But the more I talk to compiler writers, the
> > less comfortable I become with data races in general.  :-/
> >
> > So I would also feel better if the "Y = 1" was WRITE_ONCE().
> 
> If we forget about potential compiler bugs, then it is not clear to me how
> WRITE_ONCE() can help in this case. mb() implies the compiler barrier, and
> we do not really need the "once" semantics?

There is a big difference between WRITE_ONCE() and plain assignment.  
Given "WRITE_ONCE(X, 2)", the compiler will emit a simple store
instruction.  But given "X = 2", the compiler is allowed to emit
instructions equivalent to:

if (X != 2)
X = 2;

This is not a simple store; it also contains a load.  (And in some 
situations, a transformation like this might even be beneficial, 
because it can avoid a cache line transfer.)

CPUs are allowed to execute loads out of order with respect to control
dependencies.  So for example, if your pseudo-code for CPU_1 got
compiled to object code equivalent to:

if (READ_ONCE(Y)) {
if (X != 2)
X = 2;
}

then the CPU might read the value of X before reading the value of Y.  
This might not look like a problem, but it would be an example of a
race.

> But as for put_pid(), it actually does atomic_dec_and_test(), so this is
> probably not relevant.
> 
> > On the other hand, this is a great opportunity to try out Alan Stern's
> > prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
> >
> > https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org
> 
> Heh. Will do, but only after I buy more brains.
> 
> > Here is what I believe is the litmus test that your are interested in:
> >
> > 
> > C OlegNesterov-put_pid
> >
> > {}
> >
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_mb();
> > *y = 1;
> > }
> >
> > P1(int *x, int *y)
> > {
> > int r1;
> >
> > r1 = READ_ONCE(*y);
> > if (r1)
> > *x = 2;
> > }
> >
> > exists (1:r1=1 /\ ~x=2)
> 
> I am not familiar with litmus, and I do not really understand what (and why)
> it reports.
> 
> > Running this through herd with Alan's patch detects the data race
> > and says that the undesired outcome is allowed:
> 
> OK, so it says that "*x = 2" can happen before "*x = 1" even if P1() observes
> *y == 1.
> 
> Still can't understand how this can happen... Nevermind ;)

You might imagine that the compiler could generate object code for P1() 
equivalent to:

tmp = *x;
*x = 2;
if (!READ_ONCE(*y))
*x = tmp;

Given this object code, P1 might indeed set *x to 2 before P0 sets *x
to 1.  I believe the Standard allows the compiler to do this sort of 
thing, although of course it would be a stupid thing to do.

Using plain accesses gives the compiler a tremendous amount of freedom
-- a lot more than many people think.

Alan

> > Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> > gets rid of both the "Flag data-race" and the undesired outcome:
> 
> ...
> 
> > P1(int *x, int *y)
> > {
> > int r1;
> >
> > r1 = READ_ONCE(*y);
> > if (r1)
> > 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-29 Thread Oleg Nesterov
On 03/28, Joel Fernandes wrote:
>
> Also Oleg, why not just call refcount_dec_and_test like below?

Of course we can, this is just optimization to avoid the atomic op if possible.

Does this optimization really help? I have no idea ;)

But as you can see, it surely helps to provoke the interesting discussions!

Oleg.



Re: [PATCH] Convert struct pid count to refcount_t

2019-03-29 Thread Oleg Nesterov
On 03/28, Paul E. McKenney wrote:
>
> On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> >
> > Since you added Paul let me add more confusion to this thread ;)
>
> Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)

OK, thanks, you certainly managed to confused me much more than I expected!

> > There were some concerns about the lack of barriers in put_pid(), but I 
> > can't
> > find that old discussion and I forgot the result of that discussion...
> >
> > Paul, could you confirm that this code
> >
> > CPU_0   CPU_1
> >
> > X = 1;  if (READ_ONCE(Y))
> > mb();   X = 2;
> > Y = 1;  BUG_ON(X != 2);
> >
> >
> > is correct? I think it is, control dependency pairs with mb(), right?
>
> The BUG_ON() is supposed to happen at the end of time, correct?

Yes,

> As written, there is (in the strict sense) a data race between the load
> of X in the BUG_ON() and CPU_0's store to X.

Well, this pseudo code is simply wrong, I meant that "X = 1" on CPU 0
must not happen after "X = 2" on CPU 1 or we have a problem.


> But the more I talk to compiler writers, the
> less comfortable I become with data races in general.  :-/
>
> So I would also feel better if the "Y = 1" was WRITE_ONCE().

If we forget about potential compiler bugs, then it is not clear to me how
WRITE_ONCE() can help in this case. mb() implies the compiler barrier, and
we do not really need the "once" semantics?

But as for put_pid(), it actually does atomic_dec_and_test(), so this is
probably not relevant.

> On the other hand, this is a great opportunity to try out Alan Stern's
> prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!
>
> https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org

Heh. Will do, but only after I buy more brains.

> Here is what I believe is the litmus test that your are interested in:
>
> 
> C OlegNesterov-put_pid
>
> {}
>
> P0(int *x, int *y)
> {
>   *x = 1;
>   smp_mb();
>   *y = 1;
> }
>
> P1(int *x, int *y)
> {
>   int r1;
>
>   r1 = READ_ONCE(*y);
>   if (r1)
>   *x = 2;
> }
>
> exists (1:r1=1 /\ ~x=2)

I am not familiar with litmus, and I do not really understand what (and why)
it reports.

> Running this through herd with Alan's patch detects the data race
> and says that the undesired outcome is allowed:

OK, so it says that "*x = 2" can happen before "*x = 1" even if P1() observes
*y == 1.

Still can't understand how this can happen... Nevermind ;)

> Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
> gets rid of both the "Flag data-race" and the undesired outcome:

...

> P1(int *x, int *y)
> {
>   int r1;
>
>   r1 = READ_ONCE(*y);
>   if (r1)
>   WRITE_ONCE(*x, 2);
> }

And this is what Documentation/memory-barriers.txt says in the "CONTROL
DEPENDENCIES" section:

q = READ_ONCE(a);
if (q) {
WRITE_ONCE(b, 1);
}

Control dependencies pair normally with other types of barriers.
That said, please note that neither READ_ONCE() nor WRITE_ONCE()
are optional!

but again, I fail to really understand why WRITE_ONCE() is not optional
in this particular case.

Thanks!

Oleg.



Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Joel Fernandes
On Thu, Mar 28, 2019 at 10:39:58AM -0400, Joel Fernandes wrote:
> On Thu, Mar 28, 2019 at 03:26:19PM +0100, Oleg Nesterov wrote:
> > On 03/27, Joel Fernandes wrote:
> > >
> > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > put_pid can race in this way in the original code right?
> > >
> > > get_pid   put_pid
> > >
> > >   atomic_dec_and_test returns 1
> > > atomic_inc
> > >   kfree
> > >
> > > deref pid /* boom */
> > > -
> > >
> > > I think get_pid needs to call atomic_inc_not_zero()
> > 
> > No.
> > 
> > get_pid() should only be used if you already have a reference or you do
> > something like
> > 
> > rcu_read_lock();
> > pid = find_vpid();
> > get_pid();
> > rcu_read_lock();
> > 
> > in this case we rely on call_rcu(delayed_put_pid) which drops the initial
> > reference.
> > 
> > If put_pid() sees pid->count == 1, then a) nobody else has a reference and
> > b) nobody else can find this pid on rcu-protected lists, so it is safe to
> > free it.
> 
> I agree. Check my reply to Jann, I already replied to him about this. thanks!
> 

Also Oleg, why not just call refcount_dec_and_test like below? If count is 1,
then it will decrement to 0 and return true anyway. Is this because we want
to avoid writes at the cost of more reads? Did I miss something? Thank you.

I don't remember very clearly, but I think Kees also asked about the same thing.

diff --git a/kernel/pid.c b/kernel/pid.c
index 2095c7da644d..89c4849fab5d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -106,8 +106,7 @@ void put_pid(struct pid *pid)
return;
 
ns = pid->numbers[pid->level].ns;
-   if ((refcount_read(>count) == 1) ||
-refcount_dec_and_test(>count)) {
+   if (refcount_dec_and_test(>count)) {
kmem_cache_free(ns->pid_cachep, pid);
put_pid_ns(ns);
}


Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Joel Fernandes
On Thu, Mar 28, 2019 at 04:00:52PM -0400, Joel Fernandes wrote:
> On Thu, Mar 28, 2019 at 04:17:50PM +0100, Jann Horn wrote:
> > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > the thread.
> > 
> > On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes  
> > wrote:
> > > On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> > > > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes  
> > > > wrote:
> > > > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook  
> > > > > > wrote:
> > > > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > > > > refcount_t for it which is basically atomic_t but does 
> > > > > > > > additional
> > > > > > > > checking to prevent use-after-free bugs. No change in behavior 
> > > > > > > > if
> > > > > > > > CONFIG_REFCOUNT_FULL=n.
> > > > > > > >
> > > > > > > > Cc: keesc...@chromium.org
> > > > > > > > Cc: kernel-t...@android.com
> > > > > > > > Cc: kernel-harden...@lists.openwall.com
> > > > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > > > [...]
> > > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > > > > --- a/kernel/pid.c
> > > > > > > > +++ b/kernel/pid.c
> > > > > > > > @@ -37,7 +37,7 @@
> > > > > > > >  #include 
> > > > > > > >  #include 
> > > > > > > >  #include 
> > > > > > > > -#include 
> > > > > > > > +#include 
> > > > > > > >  #include 
> > > > > > > >  #include 
> > > > > > > >
> > > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > > > > return;
> > > > > > > >
> > > > > > > > ns = pid->numbers[pid->level].ns;
> > > > > > > > -   if ((atomic_read(>count) == 1) ||
> > > > > > > > -atomic_dec_and_test(>count)) {
> > > > > > > > +   if ((refcount_read(>count) == 1) ||
> > > > > > > > +refcount_dec_and_test(>count)) {
> > > > > > >
> > > > > > > Why is this (and the original code) safe in the face of a race 
> > > > > > > against
> > > > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > > > > don't see this code pattern anywhere else in the kernel.
> > > > > >
> > > > > > Semantically, it doesn't make a difference whether you do this or
> > > > > > leave out the "refcount_read(>count) == 1". If you read a 1 
> > > > > > from
> > > > > > refcount_read(), then you have the only reference to "struct pid", 
> > > > > > and
> > > > > > therefore you want to free it. If you don't get a 1, you have to
> > > > > > atomically drop a reference, which, if someone else is concurrently
> > > > > > also dropping a reference, may leave you with the last reference (in
> > > > > > the case where refcount_dec_and_test() returns true), in which case
> > > > > > you still have to take care of freeing it.
> > > > >
> > > > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > > > put_pid can race in this way in the original code right?
> > > > >
> > > > > get_pid put_pid
> > > > >
> > > > > atomic_dec_and_test returns 1
> > > >
> > > > This can't happen. get_pid() can only be called on an existing
> > > > reference. If you are calling get_pid() on an existing reference, and
> > > > someone else is dropping another reference with put_pid(), then when
> > > > both functions start running, the refcount must be at least 2.
> > >
> > > Sigh, you are right. Ok. I was quite tired last night when I wrote this.
> > > Obviously, I should have waited a bit and thought it through.
> > >
> > > Kees can you describe more the race you had in mind?
> > >
> > > > > atomic_inc
> > > > > kfree
> > > > >
> > > > > deref pid /* boom */
> > > > > -
> > > > >
> > > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > > > > not test for pid->count == 1 as condition for freeing, but rather 
> > > > > just do
> > > > > atomic_dec_and_test. So something like the following diff. (And I see 
> > > > > a
> > > > > similar pattern used in drivers/net/mac.c)
> > > >
> > > > get_pid() can only be called when you already have a refcounted
> > > > reference; in other words, when the reference count is at least one.
> > > > The lifetime management of struct pid differs from the lifetime
> > > > management of most other objects in the kernel; the usual patterns
> > > > don't quite apply here.
> > > >
> > > > Look at put_pid(): When the refcount has reached zero, there is no RCU
> > > > grace period (unlike most other objects with RCU-managed lifetimes).
> > > > Instead, free_pid() has an RCU grace period *before* it invokes
> > > > delayed_put_pid() to drop a reference; and free_pid() is also the
> > > > function that removes a PID from the namespace's IDR, 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Joel Fernandes
On Thu, Mar 28, 2019 at 04:17:50PM +0100, Jann Horn wrote:
> Since we're just talking about RCU stuff now, adding Paul McKenney to
> the thread.
> 
> On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes  wrote:
> > On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> > > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes  
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook  
> > > > > wrote:
> > > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > > >  wrote:
> > > > > > >
> > > > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > > > CONFIG_REFCOUNT_FULL=n.
> > > > > > >
> > > > > > > Cc: keesc...@chromium.org
> > > > > > > Cc: kernel-t...@android.com
> > > > > > > Cc: kernel-harden...@lists.openwall.com
> > > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > > [...]
> > > > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > > > --- a/kernel/pid.c
> > > > > > > +++ b/kernel/pid.c
> > > > > > > @@ -37,7 +37,7 @@
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > > -#include 
> > > > > > > +#include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >
> > > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > > > return;
> > > > > > >
> > > > > > > ns = pid->numbers[pid->level].ns;
> > > > > > > -   if ((atomic_read(>count) == 1) ||
> > > > > > > -atomic_dec_and_test(>count)) {
> > > > > > > +   if ((refcount_read(>count) == 1) ||
> > > > > > > +refcount_dec_and_test(>count)) {
> > > > > >
> > > > > > Why is this (and the original code) safe in the face of a race 
> > > > > > against
> > > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > > > don't see this code pattern anywhere else in the kernel.
> > > > >
> > > > > Semantically, it doesn't make a difference whether you do this or
> > > > > leave out the "refcount_read(>count) == 1". If you read a 1 from
> > > > > refcount_read(), then you have the only reference to "struct pid", and
> > > > > therefore you want to free it. If you don't get a 1, you have to
> > > > > atomically drop a reference, which, if someone else is concurrently
> > > > > also dropping a reference, may leave you with the last reference (in
> > > > > the case where refcount_dec_and_test() returns true), in which case
> > > > > you still have to take care of freeing it.
> > > >
> > > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > > put_pid can race in this way in the original code right?
> > > >
> > > > get_pid put_pid
> > > >
> > > > atomic_dec_and_test returns 1
> > >
> > > This can't happen. get_pid() can only be called on an existing
> > > reference. If you are calling get_pid() on an existing reference, and
> > > someone else is dropping another reference with put_pid(), then when
> > > both functions start running, the refcount must be at least 2.
> >
> > Sigh, you are right. Ok. I was quite tired last night when I wrote this.
> > Obviously, I should have waited a bit and thought it through.
> >
> > Kees can you describe more the race you had in mind?
> >
> > > > atomic_inc
> > > > kfree
> > > >
> > > > deref pid /* boom */
> > > > -
> > > >
> > > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > > > not test for pid->count == 1 as condition for freeing, but rather just 
> > > > do
> > > > atomic_dec_and_test. So something like the following diff. (And I see a
> > > > similar pattern used in drivers/net/mac.c)
> > >
> > > get_pid() can only be called when you already have a refcounted
> > > reference; in other words, when the reference count is at least one.
> > > The lifetime management of struct pid differs from the lifetime
> > > management of most other objects in the kernel; the usual patterns
> > > don't quite apply here.
> > >
> > > Look at put_pid(): When the refcount has reached zero, there is no RCU
> > > grace period (unlike most other objects with RCU-managed lifetimes).
> > > Instead, free_pid() has an RCU grace period *before* it invokes
> > > delayed_put_pid() to drop a reference; and free_pid() is also the
> > > function that removes a PID from the namespace's IDR, and it is used
> > > by __change_pid() when a task loses its reference on a PID.
> > >
> > > In other words: Most refcounted objects with RCU guarantee that the
> > > object waits for a grace period after its refcount has reached zero;
> > > and during the grace period, the refcount is zero and you're not
> > > allowed to increment it 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Paul E. McKenney
On Thu, Mar 28, 2019 at 05:26:42PM +0100, Oleg Nesterov wrote:
> On 03/28, Jann Horn wrote:
> >
> > Since we're just talking about RCU stuff now, adding Paul McKenney to
> > the thread.
> 
> Since you added Paul let me add more confusion to this thread ;)

Woo-hoo!!!  More confusion!  Bring it on!!!  ;-)

> There were some concerns about the lack of barriers in put_pid(), but I can't
> find that old discussion and I forgot the result of that discussion...
> 
> Paul, could you confirm that this code
> 
>   CPU_0   CPU_1
> 
>   X = 1;  if (READ_ONCE(Y))
>   mb();   X = 2;
>   Y = 1;  BUG_ON(X != 2);
> 
> 
> is correct? I think it is, control dependency pairs with mb(), right?

The BUG_ON() is supposed to happen at the end of time, correct?
As written, there is (in the strict sense) a data race between the load
of X in the BUG_ON() and CPU_0's store to X.  In a less strict sense,
you could of course argue that this data race is harmless, especially
if X is a single byte.  But the more I talk to compiler writers, the
less comfortable I become with data races in general.  :-/

So I would also feel better if the "Y = 1" was WRITE_ONCE().

On the other hand, this is a great opportunity to try out Alan Stern's
prototype plain-accesses patch to the Linux Kernel Memory Model (LKMM)!

https://lkml.kernel.org/r/pine.lnx.4.44l0.1903191459270.1593-200...@iolanthe.rowland.org

Also adding Alan on CC.

Here is what I believe is the litmus test that your are interested in:


C OlegNesterov-put_pid

{}

P0(int *x, int *y)
{
*x = 1;
smp_mb();
*y = 1;
}

P1(int *x, int *y)
{
int r1;

r1 = READ_ONCE(*y);
if (r1)
*x = 2;
}

exists (1:r1=1 /\ ~x=2)


Running this through herd with Alan's patch detects the data race
and says that the undesired outcome is allowed:

$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid.litmus 
Test OlegNesterov-put_pid Allowed
States 3
1:r1=0; x=1;
1:r1=1; x=1;
1:r1=1; x=2;
Ok
Witnesses
Positive: 1 Negative: 2
Flag data-race
Condition exists (1:r1=1 /\ not (x=2))
Observation OlegNesterov-put_pid Sometimes 1 2
Time OlegNesterov-put_pid 0.00
Hash=a3e0043ad753effa860fea37eeba0a76

Using WRITE_ONCE() for P0()'s store to y still allows this outcome,
although it does remove the "Flag data-race".

Using WRITE_ONCE() for both P0()'s store to y and P1()'s store to x
gets rid of both the "Flag data-race" and the undesired outcome:

$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-WO.litmus 
Test OlegNesterov-put_pid-WO-WO Allowed
States 2
1:r1=0; x=1;
1:r1=1; x=2;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r1=1 /\ not (x=2))
Observation OlegNesterov-put_pid-WO-WO Never 0 2
Time OlegNesterov-put_pid-WO-WO 0.01
Hash=6e1643e3c5e4739b590bde0a8e8a918e

Here is the corresponding litmus test, in case I messed something up:


C OlegNesterov-put_pid-WO-WO

{}

P0(int *x, int *y)
{
*x = 1;
smp_mb();
WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
int r1;

r1 = READ_ONCE(*y);
if (r1)
WRITE_ONCE(*x, 2);
}

exists (1:r1=1 /\ ~x=2)


> If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
> discussion.

Good point, let's try with smp_load_acquire() in P1():

$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-WO-sla.litmus 
Test OlegNesterov-put_pid-WO-sla Allowed
States 2
1:r1=0; x=1;
1:r1=1; x=2;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (1:r1=1 /\ not (x=2))
Observation OlegNesterov-put_pid-WO-sla Never 0 2
Time OlegNesterov-put_pid-WO-sla 0.01
Hash=4fb0276eabf924793dec1970199db3a6

This also works.  Here is the litmus test:


C OlegNesterov-put_pid-WO-sla

{}

P0(int *x, int *y)
{
*x = 1;
smp_mb();
WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
int r1;

r1 = smp_load_acquire(y);
if (r1)
*x = 2;
}

exists (1:r1=1 /\ ~x=2)


Demoting P0()'s WRITE_ONCE() to a plain write while leaving P1()'s
smp_load_acquire() gets us a data race and allows the undesired
outcome:

$ herd7  -conf linux-kernel.cfg /tmp/OlegNesterov-put_pid-sla.litmus 
Test OlegNesterov-put_pid-sla Allowed

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Kees Cook
On Thu, Mar 28, 2019 at 7:37 AM Joel Fernandes  wrote:
> Kees can you describe more the race you had in mind?

I just didn't see any barriers, so it seemed racey to me.

> Also note to the on looker, the original patch I sent is not wrong, that
> still applies and is correct. We are just discussing here any possible issues
> with the *existing* code.

Correct. And akpm has (correctly) taken this for -mm.

-- 
Kees Cook


Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Oleg Nesterov
On 03/28, Jann Horn wrote:
>
> Since we're just talking about RCU stuff now, adding Paul McKenney to
> the thread.

Since you added Paul let me add more confusion to this thread ;)

There were some concerns about the lack of barriers in put_pid(), but I can't
find that old discussion and I forgot the result of that discussion...

Paul, could you confirm that this code

CPU_0   CPU_1

X = 1;  if (READ_ONCE(Y))
mb();   X = 2;
Y = 1;  BUG_ON(X != 2);


is correct? I think it is, control dependency pairs with mb(), right?

If not, then put_pid() needs atomic_read_acquire() as it was proposed in that
discussion.

Oleg.



Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Jann Horn
Since we're just talking about RCU stuff now, adding Paul McKenney to
the thread.

On Thu, Mar 28, 2019 at 3:37 PM Joel Fernandes  wrote:
> On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> > On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes  
> > wrote:
> > > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook  wrote:
> > > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > > >  wrote:
> > > > > >
> > > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > > CONFIG_REFCOUNT_FULL=n.
> > > > > >
> > > > > > Cc: keesc...@chromium.org
> > > > > > Cc: kernel-t...@android.com
> > > > > > Cc: kernel-harden...@lists.openwall.com
> > > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > > [...]
> > > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > > --- a/kernel/pid.c
> > > > > > +++ b/kernel/pid.c
> > > > > > @@ -37,7 +37,7 @@
> > > > > >  #include 
> > > > > >  #include 
> > > > > >  #include 
> > > > > > -#include 
> > > > > > +#include 
> > > > > >  #include 
> > > > > >  #include 
> > > > > >
> > > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > > return;
> > > > > >
> > > > > > ns = pid->numbers[pid->level].ns;
> > > > > > -   if ((atomic_read(>count) == 1) ||
> > > > > > -atomic_dec_and_test(>count)) {
> > > > > > +   if ((refcount_read(>count) == 1) ||
> > > > > > +refcount_dec_and_test(>count)) {
> > > > >
> > > > > Why is this (and the original code) safe in the face of a race against
> > > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > > don't see this code pattern anywhere else in the kernel.
> > > >
> > > > Semantically, it doesn't make a difference whether you do this or
> > > > leave out the "refcount_read(>count) == 1". If you read a 1 from
> > > > refcount_read(), then you have the only reference to "struct pid", and
> > > > therefore you want to free it. If you don't get a 1, you have to
> > > > atomically drop a reference, which, if someone else is concurrently
> > > > also dropping a reference, may leave you with the last reference (in
> > > > the case where refcount_dec_and_test() returns true), in which case
> > > > you still have to take care of freeing it.
> > >
> > > Also, based on Kees comment, I think it appears to me that get_pid and
> > > put_pid can race in this way in the original code right?
> > >
> > > get_pid put_pid
> > >
> > > atomic_dec_and_test returns 1
> >
> > This can't happen. get_pid() can only be called on an existing
> > reference. If you are calling get_pid() on an existing reference, and
> > someone else is dropping another reference with put_pid(), then when
> > both functions start running, the refcount must be at least 2.
>
> Sigh, you are right. Ok. I was quite tired last night when I wrote this.
> Obviously, I should have waited a bit and thought it through.
>
> Kees can you describe more the race you had in mind?
>
> > > atomic_inc
> > > kfree
> > >
> > > deref pid /* boom */
> > > -
> > >
> > > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > > not test for pid->count == 1 as condition for freeing, but rather just do
> > > atomic_dec_and_test. So something like the following diff. (And I see a
> > > similar pattern used in drivers/net/mac.c)
> >
> > get_pid() can only be called when you already have a refcounted
> > reference; in other words, when the reference count is at least one.
> > The lifetime management of struct pid differs from the lifetime
> > management of most other objects in the kernel; the usual patterns
> > don't quite apply here.
> >
> > Look at put_pid(): When the refcount has reached zero, there is no RCU
> > grace period (unlike most other objects with RCU-managed lifetimes).
> > Instead, free_pid() has an RCU grace period *before* it invokes
> > delayed_put_pid() to drop a reference; and free_pid() is also the
> > function that removes a PID from the namespace's IDR, and it is used
> > by __change_pid() when a task loses its reference on a PID.
> >
> > In other words: Most refcounted objects with RCU guarantee that the
> > object waits for a grace period after its refcount has reached zero;
> > and during the grace period, the refcount is zero and you're not
> > allowed to increment it again.
>
> Can you give an example of this "most refcounted objects with RCU" usecase?
> I could not find any good examples of such. I want to document this pattern
> and possibly submit to Documentation/RCU.

E.g. struct posix_acl is a relatively straightforward example:
posix_acl_release() is a 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Joel Fernandes
On Thu, Mar 28, 2019 at 03:26:19PM +0100, Oleg Nesterov wrote:
> On 03/27, Joel Fernandes wrote:
> >
> > Also, based on Kees comment, I think it appears to me that get_pid and
> > put_pid can race in this way in the original code right?
> >
> > get_pid put_pid
> >
> > atomic_dec_and_test returns 1
> > atomic_inc
> > kfree
> >
> > deref pid /* boom */
> > -
> >
> > I think get_pid needs to call atomic_inc_not_zero()
> 
> No.
> 
> get_pid() should only be used if you already have a reference or you do
> something like
> 
>   rcu_read_lock();
>   pid = find_vpid();
>   get_pid();
>   rcu_read_lock();
> 
> in this case we rely on call_rcu(delayed_put_pid) which drops the initial
> reference.
> 
> If put_pid() sees pid->count == 1, then a) nobody else has a reference and
> b) nobody else can find this pid on rcu-protected lists, so it is safe to
> free it.

I agree. Check my reply to Jann, I already replied to him about this. thanks!



Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Joel Fernandes
On Thu, Mar 28, 2019 at 03:57:44AM +0100, Jann Horn wrote:
> On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes  wrote:
> > On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook  wrote:
> > > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > > >  wrote:
> > > > >
> > > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > > refcount_t for it which is basically atomic_t but does additional
> > > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > > CONFIG_REFCOUNT_FULL=n.
> > > > >
> > > > > Cc: keesc...@chromium.org
> > > > > Cc: kernel-t...@android.com
> > > > > Cc: kernel-harden...@lists.openwall.com
> > > > > Signed-off-by: Joel Fernandes (Google) 
> > > > > [...]
> > > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > > index 20881598bdfa..2095c7da644d 100644
> > > > > --- a/kernel/pid.c
> > > > > +++ b/kernel/pid.c
> > > > > @@ -37,7 +37,7 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > -#include 
> > > > > +#include 
> > > > >  #include 
> > > > >  #include 
> > > > >
> > > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > > return;
> > > > >
> > > > > ns = pid->numbers[pid->level].ns;
> > > > > -   if ((atomic_read(>count) == 1) ||
> > > > > -atomic_dec_and_test(>count)) {
> > > > > +   if ((refcount_read(>count) == 1) ||
> > > > > +refcount_dec_and_test(>count)) {
> > > >
> > > > Why is this (and the original code) safe in the face of a race against
> > > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > > don't see this code pattern anywhere else in the kernel.
> > >
> > > Semantically, it doesn't make a difference whether you do this or
> > > leave out the "refcount_read(>count) == 1". If you read a 1 from
> > > refcount_read(), then you have the only reference to "struct pid", and
> > > therefore you want to free it. If you don't get a 1, you have to
> > > atomically drop a reference, which, if someone else is concurrently
> > > also dropping a reference, may leave you with the last reference (in
> > > the case where refcount_dec_and_test() returns true), in which case
> > > you still have to take care of freeing it.
> >
> > Also, based on Kees comment, I think it appears to me that get_pid and
> > put_pid can race in this way in the original code right?
> >
> > get_pid put_pid
> >
> > atomic_dec_and_test returns 1
> 
> This can't happen. get_pid() can only be called on an existing
> reference. If you are calling get_pid() on an existing reference, and
> someone else is dropping another reference with put_pid(), then when
> both functions start running, the refcount must be at least 2.

Sigh, you are right. Ok. I was quite tired last night when I wrote this.
Obviously, I should have waited a bit and thought it through.

Kees can you describe more the race you had in mind?

> > atomic_inc
> > kfree
> >
> > deref pid /* boom */
> > -
> >
> > I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> > not test for pid->count == 1 as condition for freeing, but rather just do
> > atomic_dec_and_test. So something like the following diff. (And I see a
> > similar pattern used in drivers/net/mac.c)
> 
> get_pid() can only be called when you already have a refcounted
> reference; in other words, when the reference count is at least one.
> The lifetime management of struct pid differs from the lifetime
> management of most other objects in the kernel; the usual patterns
> don't quite apply here.
> 
> Look at put_pid(): When the refcount has reached zero, there is no RCU
> grace period (unlike most other objects with RCU-managed lifetimes).
> Instead, free_pid() has an RCU grace period *before* it invokes
> delayed_put_pid() to drop a reference; and free_pid() is also the
> function that removes a PID from the namespace's IDR, and it is used
> by __change_pid() when a task loses its reference on a PID.
> 
> In other words: Most refcounted objects with RCU guarantee that the
> object waits for a grace period after its refcount has reached zero;
> and during the grace period, the refcount is zero and you're not
> allowed to increment it again.

Can you give an example of this "most refcounted objects with RCU" usecase?
I could not find any good examples of such. I want to document this pattern
and possibly submit to Documentation/RCU.

> But for struct pid, the guarantee is
> instead that there is an RCU grace period after it has been removed
> from the IDRs and the task, and during the grace period, refcounting
> is guaranteed to still work normally.

Ok, thanks. Here I think in scrappy but simple pseudo code form, the struct
pid flow is something like (replaced "pid" with data");

get_data:
atomic_inc(data->refcount);

some_user_of_data:
  

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-28 Thread Oleg Nesterov
On 03/27, Joel Fernandes wrote:
>
> Also, based on Kees comment, I think it appears to me that get_pid and
> put_pid can race in this way in the original code right?
>
> get_pid   put_pid
>
>   atomic_dec_and_test returns 1
> atomic_inc
>   kfree
>
> deref pid /* boom */
> -
>
> I think get_pid needs to call atomic_inc_not_zero()

No.

get_pid() should only be used if you already have a reference or you do
something like

rcu_read_lock();
pid = find_vpid();
get_pid();
rcu_read_lock();

in this case we rely on call_rcu(delayed_put_pid) which drops the initial
reference.

If put_pid() sees pid->count == 1, then a) nobody else has a reference and
b) nobody else can find this pid on rcu-protected lists, so it is safe to
free it.

Oleg.



Re: [PATCH] Convert struct pid count to refcount_t

2019-03-27 Thread Jann Horn
On Thu, Mar 28, 2019 at 3:34 AM Joel Fernandes  wrote:
> On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> > On Thu, Mar 28, 2019 at 1:06 AM Kees Cook  wrote:
> > > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> > >  wrote:
> > > >
> > > > struct pid's count is an atomic_t field used as a refcount. Use
> > > > refcount_t for it which is basically atomic_t but does additional
> > > > checking to prevent use-after-free bugs. No change in behavior if
> > > > CONFIG_REFCOUNT_FULL=n.
> > > >
> > > > Cc: keesc...@chromium.org
> > > > Cc: kernel-t...@android.com
> > > > Cc: kernel-harden...@lists.openwall.com
> > > > Signed-off-by: Joel Fernandes (Google) 
> > > > [...]
> > > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > > index 20881598bdfa..2095c7da644d 100644
> > > > --- a/kernel/pid.c
> > > > +++ b/kernel/pid.c
> > > > @@ -37,7 +37,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > -#include 
> > > > +#include 
> > > >  #include 
> > > >  #include 
> > > >
> > > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > > return;
> > > >
> > > > ns = pid->numbers[pid->level].ns;
> > > > -   if ((atomic_read(>count) == 1) ||
> > > > -atomic_dec_and_test(>count)) {
> > > > +   if ((refcount_read(>count) == 1) ||
> > > > +refcount_dec_and_test(>count)) {
> > >
> > > Why is this (and the original code) safe in the face of a race against
> > > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > > don't see this code pattern anywhere else in the kernel.
> >
> > Semantically, it doesn't make a difference whether you do this or
> > leave out the "refcount_read(>count) == 1". If you read a 1 from
> > refcount_read(), then you have the only reference to "struct pid", and
> > therefore you want to free it. If you don't get a 1, you have to
> > atomically drop a reference, which, if someone else is concurrently
> > also dropping a reference, may leave you with the last reference (in
> > the case where refcount_dec_and_test() returns true), in which case
> > you still have to take care of freeing it.
>
> Also, based on Kees comment, I think it appears to me that get_pid and
> put_pid can race in this way in the original code right?
>
> get_pid put_pid
>
> atomic_dec_and_test returns 1

This can't happen. get_pid() can only be called on an existing
reference. If you are calling get_pid() on an existing reference, and
someone else is dropping another reference with put_pid(), then when
both functions start running, the refcount must be at least 2.

> atomic_inc
> kfree
>
> deref pid /* boom */
> -
>
> I think get_pid needs to call atomic_inc_not_zero() and put_pid should
> not test for pid->count == 1 as condition for freeing, but rather just do
> atomic_dec_and_test. So something like the following diff. (And I see a
> similar pattern used in drivers/net/mac.c)

get_pid() can only be called when you already have a refcounted
reference; in other words, when the reference count is at least one.
The lifetime management of struct pid differs from the lifetime
management of most other objects in the kernel; the usual patterns
don't quite apply here.

Look at put_pid(): When the refcount has reached zero, there is no RCU
grace period (unlike most other objects with RCU-managed lifetimes).
Instead, free_pid() has an RCU grace period *before* it invokes
delayed_put_pid() to drop a reference; and free_pid() is also the
function that removes a PID from the namespace's IDR, and it is used
by __change_pid() when a task loses its reference on a PID.

In other words: Most refcounted objects with RCU guarantee that the
object waits for a grace period after its refcount has reached zero;
and during the grace period, the refcount is zero and you're not
allowed to increment it again. But for struct pid, the guarantee is
instead that there is an RCU grace period after it has been removed
from the IDRs and the task, and during the grace period, refcounting
is guaranteed to still work normally.

> Is the above scenario valid? I didn't see any locking around get_pid or
> pud_pid to avoid such a race.
>
> ---8<---
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 8cb86d377ff5..3d79834e3180 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
>
>  static inline struct pid *get_pid(struct pid *pid)
>  {
> -   if (pid)
> -   refcount_inc(>count);
> +   if (!pid || !refcount_inc_not_zero(>count))
> +   return NULL;
> return pid;
>  }

Nope, this is wrong. Once the refcount is zero, the object goes away,
refcount_inc_not_zero() makes no sense here.

>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 2095c7da644d..89c4849fab5d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ 

Re: [PATCH] Convert struct pid count to refcount_t

2019-03-27 Thread Joel Fernandes
On Thu, Mar 28, 2019 at 01:59:45AM +0100, Jann Horn wrote:
> On Thu, Mar 28, 2019 at 1:06 AM Kees Cook  wrote:
> > On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
> >  wrote:
> > >
> > > struct pid's count is an atomic_t field used as a refcount. Use
> > > refcount_t for it which is basically atomic_t but does additional
> > > checking to prevent use-after-free bugs. No change in behavior if
> > > CONFIG_REFCOUNT_FULL=n.
> > >
> > > Cc: keesc...@chromium.org
> > > Cc: kernel-t...@android.com
> > > Cc: kernel-harden...@lists.openwall.com
> > > Signed-off-by: Joel Fernandes (Google) 
> > > [...]
> > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > index 20881598bdfa..2095c7da644d 100644
> > > --- a/kernel/pid.c
> > > +++ b/kernel/pid.c
> > > @@ -37,7 +37,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > -#include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >
> > > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > > return;
> > >
> > > ns = pid->numbers[pid->level].ns;
> > > -   if ((atomic_read(>count) == 1) ||
> > > -atomic_dec_and_test(>count)) {
> > > +   if ((refcount_read(>count) == 1) ||
> > > +refcount_dec_and_test(>count)) {
> >
> > Why is this (and the original code) safe in the face of a race against
> > get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> > don't see this code pattern anywhere else in the kernel.
> 
> Semantically, it doesn't make a difference whether you do this or
> leave out the "refcount_read(>count) == 1". If you read a 1 from
> refcount_read(), then you have the only reference to "struct pid", and
> therefore you want to free it. If you don't get a 1, you have to
> atomically drop a reference, which, if someone else is concurrently
> also dropping a reference, may leave you with the last reference (in
> the case where refcount_dec_and_test() returns true), in which case
> you still have to take care of freeing it.

Also, based on Kees comment, I think it appears to me that get_pid and
put_pid can race in this way in the original code right?

get_pid put_pid

atomic_dec_and_test returns 1
atomic_inc
kfree

deref pid /* boom */
-

I think get_pid needs to call atomic_inc_not_zero() and put_pid should
not test for pid->count == 1 as condition for freeing, but rather just do
atomic_dec_and_test. So something like the following diff. (And I see a
similar pattern used in drivers/net/mac.c)

Is the above scenario valid? I didn't see any locking around get_pid or
pud_pid to avoid such a race.

---8<---

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8cb86d377ff5..3d79834e3180 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -69,8 +69,8 @@ extern struct pid init_struct_pid;
 
 static inline struct pid *get_pid(struct pid *pid)
 {
-   if (pid)
-   refcount_inc(>count);
+   if (!pid || !refcount_inc_not_zero(>count))
+   return NULL;
return pid;
 }
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 2095c7da644d..89c4849fab5d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -106,8 +106,7 @@ void put_pid(struct pid *pid)
return;
 
ns = pid->numbers[pid->level].ns;
-   if ((refcount_read(>count) == 1) ||
-refcount_dec_and_test(>count)) {
+   if (refcount_dec_and_test(>count)) {
kmem_cache_free(ns->pid_cachep, pid);
put_pid_ns(ns);
}




Re: [PATCH] Convert struct pid count to refcount_t

2019-03-27 Thread Jann Horn
On Thu, Mar 28, 2019 at 1:06 AM Kees Cook  wrote:
> On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
>  wrote:
> >
> > struct pid's count is an atomic_t field used as a refcount. Use
> > refcount_t for it which is basically atomic_t but does additional
> > checking to prevent use-after-free bugs. No change in behavior if
> > CONFIG_REFCOUNT_FULL=n.
> >
> > Cc: keesc...@chromium.org
> > Cc: kernel-t...@android.com
> > Cc: kernel-harden...@lists.openwall.com
> > Signed-off-by: Joel Fernandes (Google) 
> > [...]
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..2095c7da644d 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -37,7 +37,7 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> > +#include 
> >  #include 
> >  #include 
> >
> > @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> > return;
> >
> > ns = pid->numbers[pid->level].ns;
> > -   if ((atomic_read(>count) == 1) ||
> > -atomic_dec_and_test(>count)) {
> > +   if ((refcount_read(>count) == 1) ||
> > +refcount_dec_and_test(>count)) {
>
> Why is this (and the original code) safe in the face of a race against
> get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
> don't see this code pattern anywhere else in the kernel.

Semantically, it doesn't make a difference whether you do this or
leave out the "refcount_read(>count) == 1". If you read a 1 from
refcount_read(), then you have the only reference to "struct pid", and
therefore you want to free it. If you don't get a 1, you have to
atomically drop a reference, which, if someone else is concurrently
also dropping a reference, may leave you with the last reference (in
the case where refcount_dec_and_test() returns true), in which case
you still have to take care of freeing it.

My guess is that the goal of this is to make the "drop last reference"
case a little bit faster by avoiding the cacheline dirtying and the
atomic op, at the expense of an extra memory op and branch every time
we drop a non-final reference. But that's a pretty low-level
optimization, and forking by itself isn't exactly fast... I think the
clean thing to do would be to either move this detail into the
refcount implementation (if it turns out to actually be valuable in at
least a microbenchmark), or just get rid of it. Given the overhead of
fork()/clone(), I would be surprised if you could actually measure
this effect here.

Eric, can you remember the rationale for doing it that way in commit
92476d7fc0326a409ab1d3864a04093a6be9aca7? Am I guessing correctly?


Re: [PATCH] Convert struct pid count to refcount_t

2019-03-27 Thread Kees Cook
On Wed, Mar 27, 2019 at 7:53 AM Joel Fernandes (Google)
 wrote:
>
> struct pid's count is an atomic_t field used as a refcount. Use
> refcount_t for it which is basically atomic_t but does additional
> checking to prevent use-after-free bugs. No change in behavior if
> CONFIG_REFCOUNT_FULL=n.
>
> Cc: keesc...@chromium.org
> Cc: kernel-t...@android.com
> Cc: kernel-harden...@lists.openwall.com
> Signed-off-by: Joel Fernandes (Google) 
> [...]
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..2095c7da644d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -37,7 +37,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>  #include 
>
> @@ -106,8 +106,8 @@ void put_pid(struct pid *pid)
> return;
>
> ns = pid->numbers[pid->level].ns;
> -   if ((atomic_read(>count) == 1) ||
> -atomic_dec_and_test(>count)) {
> +   if ((refcount_read(>count) == 1) ||
> +refcount_dec_and_test(>count)) {

Why is this (and the original code) safe in the face of a race against
get_pid()? i.e. shouldn't this only use refcount_dec_and_test()? I
don't see this code pattern anywhere else in the kernel.

Beyond that, yes please. :)

-Kees


> kmem_cache_free(ns->pid_cachep, pid);
> put_pid_ns(ns);
> }
> @@ -210,7 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> }
>
> get_pid_ns(ns);
> -   atomic_set(>count, 1);
> +   refcount_set(>count, 1);
> for (type = 0; type < PIDTYPE_MAX; ++type)
> INIT_HLIST_HEAD(>tasks[type]);
>
> --
> 2.21.0.392.gf8f6787159e-goog
>


--
Kees Cook