Re: Soft lockup in inet_put_port on 4.6

2016-12-19 Thread Josef Bacik

> On Dec 19, 2016, at 11:52 PM, Eric Dumazet  wrote:
> 
> On Tue, 2016-12-20 at 03:40 +, Josef Bacik wrote:
>>> On Dec 19, 2016, at 9:42 PM, Eric Dumazet  wrote:
>>> 
 On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
 
 When sockets created with SO_REUSEPORT move to TW state they are placed
 back on the tb->owners list. fastreuseport is no longer set, so we have
 to walk a potentially long list of sockets in tb->owners to open a new
 listener socket. I imagine this happens when we try to open a new
 SO_REUSEPORT listener after the system has been running a while and so
 we hit the long tb->owners list.
>>> 
>>> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
>>> 
>>> So where is tb->fastreuse cleared?
>>> 
>>> If all your sockets have SO_REUSEPORT set, this should not happen.
>>> 
>> 
>> The app starts out with no SO_REUSEPORT, and then we restart it with
>> that option enabled.
> 
> But... why would the application do this dance?
> 
> I now better understand why we never had these issues...
> 

It doesn't do it as part of its normal operation.  The old version didn't 
use SO_REUSEPORT; then somebody added support for it, restarted the service 
with the new option enabled, and boom.  They immediately stopped doing anything 
and gave it to me to figure out.
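For reference, the application-side change amounts to setting the option
before bind().  A minimal, illustrative listener (not their app; the port
number is made up) looks like this:

    /* Toy SO_REUSEPORT listener: several processes can bind the same port. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
            int one = 1;
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in addr = {
                    .sin_family = AF_INET,
                    .sin_port = htons(8080),        /* example port */
                    .sin_addr.s_addr = htonl(INADDR_ANY),
            };

            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) || listen(fd, 128))
                    return 1;
            pause();        /* accept loop omitted */
            close(fd);
            return 0;
    }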

> 
>>  What I suspect is we have all the twsks from the original service,
>> and the fastreuse stuff is cleared.  My naive patch resets it once we
>> add a reuseport sk to the tb and that makes the problem go away.  I'm
>> reworking all of this logic and adding some extra info to the tb to
>> make the reset actually safe.  I'll send those patches out tomorrow.
>> Thanks,
> 
> Okay, we will review them ;)
> 
> Note that Willy Tarreau wants some mechanism to be able to freeze a
> listener, to allow haproxy to be replaced without closing any sessions.
> 

I assume that's what these guys would want as well.  They have some weird 
handoff thing they do when the app starts but I'm not entirely convinced it 
does what they think it does.  Thanks,

Josef

Re: Soft lockup in inet_put_port on 4.6

2016-12-19 Thread Eric Dumazet
On Tue, 2016-12-20 at 03:40 +, Josef Bacik wrote:
> > On Dec 19, 2016, at 9:42 PM, Eric Dumazet  wrote:
> > 
> >> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
> >> 
> >> When sockets created with SO_REUSEPORT move to TW state they are placed
> >> back on the tb->owners list. fastreuseport is no longer set, so we have
> >> to walk a potentially long list of sockets in tb->owners to open a new
> >> listener socket. I imagine this happens when we try to open a new
> >> SO_REUSEPORT listener after the system has been running a while and so
> >> we hit the long tb->owners list.
> > 
> > Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> > 
> > So where is tb->fastreuse cleared?
> > 
> > If all your sockets have SO_REUSEPORT set, this should not happen.
> > 
> 
> The app starts out with no SO_REUSEPORT, and then we restart it with
> that option enabled.

But... why would the application do this dance?

I now better understand why we never had these issues...


>   What I suspect is we have all the twsks from the original service,
> and the fastreuse stuff is cleared.  My naive patch resets it once we
> add a reuseport sk to the tb and that makes the problem go away.  I'm
> reworking all of this logic and adding some extra info to the tb to
> make the reset actually safe.  I'll send those patches out tomorrow.
> Thanks,

Okay, we will review them ;)

Note that Willy Tarreau wants some mechanism to be able to freeze a
listener, to allow haproxy to be replaced without closing any sessions.





Re: Soft lockup in inet_put_port on 4.6

2016-12-19 Thread Josef Bacik

> On Dec 19, 2016, at 9:42 PM, Eric Dumazet  wrote:
> 
>> On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:
>> 
>> When sockets created with SO_REUSEPORT move to TW state they are placed
>> back on the tb->owners list. fastreuseport is no longer set, so we have
>> to walk a potentially long list of sockets in tb->owners to open a new
>> listener socket. I imagine this happens when we try to open a new
>> SO_REUSEPORT listener after the system has been running a while and so
>> we hit the long tb->owners list.
> 
> Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse
> 
> So where is tb->fastreuse cleared?
> 
> If all your sockets have SO_REUSEPORT set, this should not happen.
> 

The app starts out with no SO_REUSEPORT, and then we restart it with that 
option enabled.  What I suspect is we have all the twsks from the original 
service, and the fastreuse stuff is cleared.  My naive patch resets it once we 
add a reuseport sk to the tb and that makes the problem go away.  I'm reworking 
all of this logic and adding some extra info to the tb to make the reset 
actually safe.  I'll send those patches out tomorrow. Thanks,
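For reference, a rough sketch of the kind of reset described above (a guess 
at the shape of the change, not the actual patch) would be, in the 
inet_csk_get_port() success path:

    /* Sketch only: after the bind_conflict check passes, let a SO_REUSEPORT
     * listener re-arm the fast path that the old timewait sockets cleared. */
    if (sk->sk_reuseport && !rcu_access_pointer(sk->sk_reuseport_cb)) {
            tb->fastreuseport = 1;
            tb->fastuid = uid;
    }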

Josef

Re: Soft lockup in inet_put_port on 4.6

2016-12-19 Thread Eric Dumazet
On Mon, 2016-12-19 at 18:07 -0800, Tom Herbert wrote:

> When sockets created with SO_REUSEPORT move to TW state they are placed
> back on the tb->owners list. fastreuseport is no longer set, so we have
> to walk a potentially long list of sockets in tb->owners to open a new
> listener socket. I imagine this happens when we try to open a new
> SO_REUSEPORT listener after the system has been running a while and so
> we hit the long tb->owners list.

Hmm...  __inet_twsk_hashdance() does not change tb->fastreuse

So where is tb->fastreuse cleared?

If all your sockets have SO_REUSEPORT set, this should not happen.






Re: Soft lockup in inet_put_port on 4.6

2016-12-19 Thread Tom Herbert
On Mon, Dec 19, 2016 at 5:56 PM, David Miller  wrote:
> From: Josef Bacik 
> Date: Sat, 17 Dec 2016 13:26:00 +
>
>> So take my current duct tape fix and augment it with more
>> information in the bind bucket?  I'm not sure how to make this work
>> without at least having a list of the bound addrs as well to make
>> sure we are really ok.  I suppose we could save the fastreuseport
>> address that last succeeded to make it work properly, but I'd have
>> to make it protocol agnostic and then have a callback so the
>> protocol can make sure we don't have to do the bind_conflict run.  Is
>> that what you were thinking of?  Thanks,
>
> So there isn't a deadlock or lockup here, something is just running
> really slow, right?
>
Correct.

> And that "something" is a scan of the sockets on a tb list, and
> there's lots of timewait sockets hung off of that tb.
>
Yes.

> As far as I can tell, this scan is happening in
> inet_csk_bind_conflict().
>
Yes.

> Furthermore, reuseport is somehow required to make this problem
> happen.  How exactly?

When sockets created with SO_REUSEPORT move to TW state they are placed
back on the tb->owners list. fastreuseport is no longer set, so we have
to walk a potentially long list of sockets in tb->owners to open a new
listener socket. I imagine this happens when we try to open a new
SO_REUSEPORT listener after the system has been running a while and so
we hit the long tb->owners list.
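For reference, the walk that gets slow is the per-socket scan over everything
still hashed to the bind bucket, roughly (simplified from
inet_csk_bind_conflict(), not verbatim):

    /* Simplified sketch: every socket on tb->owners, including the
     * thousands of TIME_WAIT entries, is visited under the bucket lock. */
    sk_for_each_bound(sk2, &tb->owners) {
            if (sk2 == sk)
                    continue;
            /* reuse/reuseport/bound-address comparisons per socket ... */
    }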

Tom


Re: Soft lockup in inet_put_port on 4.6

2016-12-19 Thread David Miller
From: Josef Bacik 
Date: Sat, 17 Dec 2016 13:26:00 +

> So take my current duct tape fix and augment it with more
> information in the bind bucket?  I'm not sure how to make this work
> without at least having a list of the bound addrs as well to make
> sure we are really ok.  I suppose we could save the fastreuseport
> address that last succeeded to make it work properly, but I'd have
> to make it protocol agnostic and then have a callback so the
> protocol can make sure we don't have to do the bind_conflict run.  Is
> that what you were thinking of?  Thanks,

So there isn't a deadlock or lockup here, something is just running
really slow, right?

And that "something" is a scan of the sockets on a tb list, and
there's lots of timewait sockets hung off of that tb.

As far as I can tell, this scan is happening in
inet_csk_bind_conflict().

Furthermore, reuseport is somehow required to make this problem
happen.  How exactly?


Re: Soft lockup in inet_put_port on 4.6

2016-12-17 Thread Josef Bacik

> On Dec 17, 2016, at 6:09 AM, Hannes Frederic Sowa 
>  wrote:
> 
>> On 16.12.2016 23:50, Josef Bacik wrote:
>>> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert  wrote:
 On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik  wrote:
> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik  wrote:
> 
>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:
>> 
>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>  wrote:
>>> 
>>> Hi Josef,
>>> 
 On 15.12.2016 19:53, Josef Bacik wrote:
 
  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
 wrote:
> 
>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
> 
>  wrote:
>> 
>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>> 
>>  wrote:
>>> 
>>>   I think there may be some suspicious code in
>>> inet_csk_get_port. At
>>>   tb_found there is:
>>> 
>>>   if (((tb->fastreuse > 0 && reuse) ||
>>>        (tb->fastreuseport > 0 &&
>>>         !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>         sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>>       smallest_size == -1)
>>>           goto success;
>>>   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>>>           if ((reuse ||
>>>                (tb->fastreuseport > 0 &&
>>>                 sk->sk_reuseport &&
>>>                 !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>>                 uid_eq(tb->fastuid, uid))) &&
>>>               smallest_size != -1 && --attempts >= 0) {
>>>                   spin_unlock_bh(&head->lock);
>>>                   goto again;
>>>           }
>>>           goto fail_unlock;
>>>   }
>>> 
>>>   AFAICT there is redundancy in these two conditionals.  The
>>> same
>>> clause
>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>> sk->sk_reuseport &&
>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>> is true
>>> the
>>>   first conditional should be hit, goto done,  and the second
>>> will
>>> never
>>>   evaluate that part to true-- unless the sk is changed (do
>>> we need
>>>   READ_ONCE for sk->sk_reuseport_cb?).
>> 
>>   That's an interesting point... It looks like this function also
>>   changed in 4.6 from using a single local_bh_disable() at the
>> beginning
>>   with several spin_lock(&head->lock) to exclusively
>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps
>> the full
>> bh
>>   disable variant was preventing the timers in your stack
>> trace from
>>   running interleaved with this function before?
> 
> 
>  Could be, although dropping the lock shouldn't be able to
> affect the
>  search state. TBH, I'm a little lost in reading this function, the
>  SO_REUSEPORT handling is pretty complicated. For instance,
>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
> that
>  function and also in every call to inet_csk_bind_conflict. I
> wonder
> if
>  we can simplify this under the assumption that SO_REUSEPORT is only
>  allowed if the port number (snum) is explicitly specified.
 
 
  Ok first I have data for you Hannes, here's the time distributions
  before during and after the lockup (with all the debugging in
 place
 the
  box eventually recovers).  I've attached it as a text file
 since it is
  long.
>>> 
>>> 
>>> Thanks a lot!
>>> 
  Second is I was thinking about why we would spend so much time
 doing
 the
  ->owners list, and obviously it's because of the massive amount of
  timewait sockets on the owners list.  I wrote the following
 dumb patch
  and tested it and the problem has disappeared completely.  Now
 I don't
  know if this is right at all, but I thought it was weird we
 weren't
  copying the soreuseport option from the original socket onto
 the twsk.
  Is there a reason 

Re: Soft lockup in inet_put_port on 4.6

2016-12-17 Thread Hannes Frederic Sowa
On 16.12.2016 23:50, Josef Bacik wrote:
> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert  wrote:
>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik  wrote:
>>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik  wrote:

  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:
>
>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>   wrote:
>>
>>  Hi Josef,
>>
>>  On 15.12.2016 19:53, Josef Bacik wrote:
>>>
>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
>>>  wrote:

   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
 
   wrote:
>
>On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
> 
>   wrote:
>>
>>I think there may be some suspicious code in
>> inet_csk_get_port. At
>>tb_found there is:
>>
>>   if (((tb->fastreuse > 0 && reuse) ||
>>        (tb->fastreuseport > 0 &&
>>         !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>         sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>>       smallest_size == -1)
>>           goto success;
>>   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>>           if ((reuse ||
>>                (tb->fastreuseport > 0 &&
>>                 sk->sk_reuseport &&
>>                 !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>                 uid_eq(tb->fastuid, uid))) &&
>>               smallest_size != -1 && --attempts >= 0) {
>>                   spin_unlock_bh(&head->lock);
>>                   goto again;
>>           }
>>           goto fail_unlock;
>>   }
>>
>>AFAICT there is redundancy in these two conditionals.  The
>> same
>>  clause
>>is being checked in both: (tb->fastreuseport > 0 &&
>>!rcu_access_pointer(sk->sk_reuseport_cb) &&
>> sk->sk_reuseport &&
>>uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>> is true
>>  the
>>first conditional should be hit, goto done,  and the second
>> will
>>  never
>>evaluate that part to true-- unless the sk is changed (do
>> we need
>>READ_ONCE for sk->sk_reuseport_cb?).
>
>That's an interesting point... It looks like this function also
>changed in 4.6 from using a single local_bh_disable() at the
>  beginning
>with several spin_lock(&head->lock) to exclusively
>spin_lock_bh(&head->lock) at each locking point.  Perhaps
> the full
>  bh
>disable variant was preventing the timers in your stack
> trace from
>running interleaved with this function before?


   Could be, although dropping the lock shouldn't be able to
 affect the
   search state. TBH, I'm a little lost in reading this function, the
   SO_REUSEPORT handling is pretty complicated. For instance,
   rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
  that
   function and also in every call to inet_csk_bind_conflict. I
 wonder
  if
   we can simplify this under the assumption that SO_REUSEPORT is only
   allowed if the port number (snum) is explicitly specified.
>>>
>>>
>>>   Ok first I have data for you Hannes, here's the time distributions
>>>   before during and after the lockup (with all the debugging in
>>> place
>>>  the
>>>   box eventually recovers).  I've attached it as a text file
>>> since it is
>>>   long.
>>
>>
>>  Thanks a lot!
>>
>>>   Second is I was thinking about why we would spend so much time
>>> doing
>>>  the
>>>   ->owners list, and obviously it's because of the massive amount of
>>>   timewait sockets on the owners list.  I wrote the following
>>> dumb patch
>>>   and tested it and the problem has disappeared completely.  Now
>>> I don't
>>>   know if this is right at all, but I thought it was weird we
>>> weren't
>>>   copying the soreuseport option from the original socket onto
>>> the twsk.
>>>   Is there a reason we aren't doing this currently?  Does this
>>> help
>>>   explain what is happening?  Thanks,
>>
>>
>>  The patch is 

Re: Soft lockup in inet_put_port on 4.6

2016-12-16 Thread Josef Bacik
On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert  
wrote:

On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik  wrote:

 On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik  wrote:


 On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:


 On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
  wrote:


 Hi Josef,

 On 15.12.2016 19:53, Josef Bacik wrote:


  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 


 wrote:


  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 


  wrote:


   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 


  wrote:


   I think there may be some suspicious code in 
inet_csk_get_port. At

   tb_found there is:

   if (((tb->fastreuse > 0 && reuse) ||
        (tb->fastreuseport > 0 &&
         !rcu_access_pointer(sk->sk_reuseport_cb) &&
         sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
       smallest_size == -1)
           goto success;
   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
           if ((reuse ||
                (tb->fastreuseport > 0 &&
                 sk->sk_reuseport &&
                 !rcu_access_pointer(sk->sk_reuseport_cb) &&
                 uid_eq(tb->fastuid, uid))) &&
               smallest_size != -1 && --attempts >= 0) {
                   spin_unlock_bh(&head->lock);
                   goto again;
           }
           goto fail_unlock;
   }

   AFAICT there is redundancy in these two conditionals.  The 
same

 clause
   is being checked in both: (tb->fastreuseport > 0 &&
   !rcu_access_pointer(sk->sk_reuseport_cb) && 
sk->sk_reuseport &&
   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this 
is true

 the
   first conditional should be hit, goto done,  and the 
second will

 never
   evaluate that part to true-- unless the sk is changed (do 
we need

   READ_ONCE for sk->sk_reuseport_cb?).


   That's an interesting point... It looks like this function 
also

   changed in 4.6 from using a single local_bh_disable() at the
 beginning
   with several spin_lock(&head->lock) to exclusively
   spin_lock_bh(&head->lock) at each locking point.  Perhaps 
the full

 bh
   disable variant was preventing the timers in your stack 
trace from

   running interleaved with this function before?



  Could be, although dropping the lock shouldn't be able to 
affect the

  search state. TBH, I'm a little lost in reading this function, the
  SO_REUSEPORT handling is pretty complicated. For instance,
  rcu_access_pointer(sk->sk_reuseport_cb) is checked three 
times in

 that
  function and also in every call to inet_csk_bind_conflict. I 
wonder

 if
  we can simplify this under the assumption that SO_REUSEPORT is 
only

  allowed if the port number (snum) is explicitly specified.



  Ok first I have data for you Hannes, here's the time 
distributions
  before during and after the lockup (with all the debugging in 
place

 the
  box eventually recovers).  I've attached it as a text file 
since it is

  long.



 Thanks a lot!

  Second is I was thinking about why we would spend so much time 
doing

 the
  ->owners list, and obviously it's because of the massive 
amount of
  timewait sockets on the owners list.  I wrote the following 
dumb patch
  and tested it and the problem has disappeared completely.  Now 
I don't
  know if this is right at all, but I thought it was weird we 
weren't
  copying the soreuseport option from the original socket onto 
the twsk.
  Is there a reason we aren't doing this currently?  Does this 
help

  explain what is happening?  Thanks,



 The patch is interesting and a good clue, but I am immediately a 
bit
 concerned that we don't copy/tag the socket with the uid also to 
keep
 the security properties for SO_REUSEPORT. I have to think a bit 
more

 about this.

 We have seen hangs during connect. I am afraid this patch 
wouldn't help

 there while also guaranteeing uniqueness.




 Yeah so I looked at the code some more and actually my patch is 
really
 bad.  If sk2->sk_reuseport is set we'll look at 
sk2->sk_reuseport_cb, which

 is outside of the timewait sock, so that's definitely bad.

 But we should at least be setting it to 0 so that we don't do this
 normally.  Unfortunately simply setting it to 0 doesn't fix the 
problem.  So
 for some reason having ->sk_reuseport set to 1 on a timewait 
socket makes

 this problem non-existent, which is strange.

 So back to the drawing board I guess.  I wonder if doing what 
craig
 suggested and batching the timewait timer expires so it hurts 
less would

 accomplish the same results.  Thanks,



 Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  
This is the

 code


Re: Soft lockup in inet_put_port on 4.6

2016-12-16 Thread Tom Herbert
On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik  wrote:
> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik  wrote:
>>
>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:
>>>
>>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>>  wrote:

 Hi Josef,

 On 15.12.2016 19:53, Josef Bacik wrote:
>
>  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
> wrote:
>>
>>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>  wrote:
>>>
>>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>>  wrote:

   I think there may be some suspicious code in inet_csk_get_port. At
   tb_found there is:

   if (((tb->fastreuse > 0 && reuse) ||
        (tb->fastreuseport > 0 &&
         !rcu_access_pointer(sk->sk_reuseport_cb) &&
         sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
       smallest_size == -1)
           goto success;
   if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
           if ((reuse ||
                (tb->fastreuseport > 0 &&
                 sk->sk_reuseport &&
                 !rcu_access_pointer(sk->sk_reuseport_cb) &&
                 uid_eq(tb->fastuid, uid))) &&
               smallest_size != -1 && --attempts >= 0) {
                   spin_unlock_bh(&head->lock);
                   goto again;
           }
           goto fail_unlock;
   }

   AFAICT there is redundancy in these two conditionals.  The same
 clause
   is being checked in both: (tb->fastreuseport > 0 &&
   !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true
 the
   first conditional should be hit, goto done,  and the second will
 never
   evaluate that part to true-- unless the sk is changed (do we need
   READ_ONCE for sk->sk_reuseport_cb?).
>>>
>>>   That's an interesting point... It looks like this function also
>>>   changed in 4.6 from using a single local_bh_disable() at the
>>> beginning
>>>   with several spin_lock(&head->lock) to exclusively
>>>   spin_lock_bh(&head->lock) at each locking point.  Perhaps the full
>>> bh
>>>   disable variant was preventing the timers in your stack trace from
>>>   running interleaved with this function before?
>>
>>
>>  Could be, although dropping the lock shouldn't be able to affect the
>>  search state. TBH, I'm a little lost in reading this function, the
>>  SO_REUSEPORT handling is pretty complicated. For instance,
>>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
>> that
>>  function and also in every call to inet_csk_bind_conflict. I wonder
>> if
>>  we can simplify this under the assumption that SO_REUSEPORT is only
>>  allowed if the port number (snum) is explicitly specified.
>
>
>  Ok first I have data for you Hannes, here's the time distributions
>  before during and after the lockup (with all the debugging in place
> the
>  box eventually recovers).  I've attached it as a text file since it is
>  long.


 Thanks a lot!

>  Second is I was thinking about why we would spend so much time doing
> the
>  ->owners list, and obviously it's because of the massive amount of
>  timewait sockets on the owners list.  I wrote the following dumb patch
>  and tested it and the problem has disappeared completely.  Now I don't
>  know if this is right at all, but I thought it was weird we weren't
>  copying the soreuseport option from the original socket onto the twsk.
>  Is there a reason we aren't doing this currently?  Does this help
>  explain what is happening?  Thanks,


 The patch is interesting and a good clue, but I am immediately a bit
 concerned that we don't copy/tag the socket with the uid also to keep
 the security properties for SO_REUSEPORT. I have to think a bit more
 about this.

 We have seen hangs during connect. I am afraid this patch wouldn't help
 there while also guaranteeing uniqueness.
>>>
>>>
>>>
>>> Yeah so I looked at the code some more and actually my patch is really
>>> bad.  If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, which
>>> is outside of the timewait sock, so that's definitely bad.
>>>
>>> But we should at least be 

Re: Soft lockup in inet_put_port on 4.6

2016-12-16 Thread Josef Bacik

On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik  wrote:

On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:
On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
 wrote:

Hi Josef,

On 15.12.2016 19:53, Josef Bacik wrote:
 On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
 wrote:
 On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 


 wrote:
  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 


 wrote:
  I think there may be some suspicious code in 
inet_csk_get_port. At

  tb_found there is:

  if (((tb->fastreuse > 0 && reuse) ||
       (tb->fastreuseport > 0 &&
        !rcu_access_pointer(sk->sk_reuseport_cb) &&
        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
      smallest_size == -1)
          goto success;
  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
          if ((reuse ||
               (tb->fastreuseport > 0 &&
                sk->sk_reuseport &&
                !rcu_access_pointer(sk->sk_reuseport_cb) &&
                uid_eq(tb->fastuid, uid))) &&
              smallest_size != -1 && --attempts >= 0) {
                  spin_unlock_bh(&head->lock);
                  goto again;
          }
          goto fail_unlock;
  }

  AFAICT there is redundancy in these two conditionals.  The 
same clause

  is being checked in both: (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport 
&&
  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
true the
  first conditional should be hit, goto done,  and the second 
will never
  evaluate that part to true-- unless the sk is changed (do we 
need

  READ_ONCE for sk->sk_reuseport_cb?).

  That's an interesting point... It looks like this function also
  changed in 4.6 from using a single local_bh_disable() at the 
beginning

  with several spin_lock(&head->lock) to exclusively
  spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
full bh
  disable variant was preventing the timers in your stack trace 
from

  running interleaved with this function before?


 Could be, although dropping the lock shouldn't be able to affect 
the

 search state. TBH, I'm a little lost in reading this function, the
 SO_REUSEPORT handling is pretty complicated. For instance,
 rcu_access_pointer(sk->sk_reuseport_cb) is checked three times 
in that
 function and also in every call to inet_csk_bind_conflict. I 
wonder if

 we can simplify this under the assumption that SO_REUSEPORT is only
 allowed if the port number (snum) is explicitly specified.


 Ok first I have data for you Hannes, here's the time distributions
 before during and after the lockup (with all the debugging in 
place the
 box eventually recovers).  I've attached it as a text file since 
it is

 long.


Thanks a lot!

 Second is I was thinking about why we would spend so much time 
doing the

 ->owners list, and obviously it's because of the massive amount of
 timewait sockets on the owners list.  I wrote the following dumb 
patch
 and tested it and the problem has disappeared completely.  Now I 
don't
 know if this is right at all, but I thought it was weird we 
weren't
 copying the soreuseport option from the original socket onto the 
twsk.
  Is there a reason we aren't doing this currently?  Does this 
help

 explain what is happening?  Thanks,


The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to 
keep

the security properties for SO_REUSEPORT. I have to think a bit more
about this.

We have seen hangs during connect. I am afraid this patch wouldn't 
help

there while also guaranteeing uniqueness.



Yeah so I looked at the code some more and actually my patch is 
really bad.  If sk2->sk_reuseport is set we'll look at 
sk2->sk_reuseport_cb, which is outside of the timewait sock, so 
that's definitely bad.


But we should at least be setting it to 0 so that we don't do this 
normally.  Unfortunately simply setting it to 0 doesn't fix the 
problem.  So for some reason having ->sk_reuseport set to 1 on a 
timewait socket makes this problem non-existent, which is strange.


So back to the drawing board I guess.  I wonder if doing what craig 
suggested and batching the timewait timer expires so it hurts less 
would accomplish the same results.  Thanks,


Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This 
is the code


   if ((!reuse || !sk2->sk_reuse ||
   sk2->sk_state == TCP_LISTEN) &&
   (!reuseport || !sk2->sk_reuseport ||
rcu_access_pointer(sk->sk_reuseport_cb) ||
  

Re: Soft lockup in inet_put_port on 4.6

2016-12-16 Thread Josef Bacik

On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:
On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
 wrote:

Hi Josef,

On 15.12.2016 19:53, Josef Bacik wrote:
 On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert  
wrote:
 On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 


 wrote:
  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 


 wrote:
  I think there may be some suspicious code in 
inet_csk_get_port. At

  tb_found there is:

  if (((tb->fastreuse > 0 && reuse) ||
       (tb->fastreuseport > 0 &&
        !rcu_access_pointer(sk->sk_reuseport_cb) &&
        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
      smallest_size == -1)
          goto success;
  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
          if ((reuse ||
               (tb->fastreuseport > 0 &&
                sk->sk_reuseport &&
                !rcu_access_pointer(sk->sk_reuseport_cb) &&
                uid_eq(tb->fastuid, uid))) &&
              smallest_size != -1 && --attempts >= 0) {
                  spin_unlock_bh(&head->lock);
                  goto again;
          }
          goto fail_unlock;
  }

  AFAICT there is redundancy in these two conditionals.  The 
same clause

  is being checked in both: (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
true the
  first conditional should be hit, goto done,  and the second 
will never
  evaluate that part to true-- unless the sk is changed (do we 
need

  READ_ONCE for sk->sk_reuseport_cb?).

  That's an interesting point... It looks like this function also
  changed in 4.6 from using a single local_bh_disable() at the 
beginning

  with several spin_lock(&head->lock) to exclusively
  spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
full bh
  disable variant was preventing the timers in your stack trace 
from

  running interleaved with this function before?


 Could be, although dropping the lock shouldn't be able to affect 
the

 search state. TBH, I'm a little lost in reading this function, the
 SO_REUSEPORT handling is pretty complicated. For instance,
 rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 
that
 function and also in every call to inet_csk_bind_conflict. I 
wonder if

 we can simplify this under the assumption that SO_REUSEPORT is only
 allowed if the port number (snum) is explicitly specified.


 Ok first I have data for you Hannes, here's the time distributions
 before during and after the lockup (with all the debugging in 
place the
 box eventually recovers).  I've attached it as a text file since 
it is

 long.


Thanks a lot!

 Second is I was thinking about why we would spend so much time 
doing the

 ->owners list, and obviously it's because of the massive amount of
 timewait sockets on the owners list.  I wrote the following dumb 
patch
 and tested it and the problem has disappeared completely.  Now I 
don't

 know if this is right at all, but I thought it was weird we weren't
 copying the soreuseport option from the original socket onto the 
twsk.

 Is there a reason we aren't doing this currently?  Does this help
 explain what is happening?  Thanks,


The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to keep
the security properties for SO_REUSEPORT. I have to think a bit more
about this.

We have seen hangs during connect. I am afraid this patch wouldn't 
help

there while also guaranteeing uniqueness.



Yeah so I looked at the code some more and actually my patch is 
really bad.  If sk2->sk_reuseport is set we'll look at 
sk2->sk_reuseport_cb, which is outside of the timewait sock, so 
that's definitely bad.


But we should at least be setting it to 0 so that we don't do this 
normally.  Unfortunately simply setting it to 0 doesn't fix the 
problem.  So for some reason having ->sk_reuseport set to 1 on a 
timewait socket makes this problem non-existent, which is strange.


So back to the drawing board I guess.  I wonder if doing what craig 
suggested and batching the timewait timer expires so it hurts less 
would accomplish the same results.  Thanks,


Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's.  This is 
the code


   if ((!reuse || !sk2->sk_reuse ||
   sk2->sk_state == TCP_LISTEN) &&
   (!reuseport || !sk2->sk_reuseport ||
rcu_access_pointer(sk->sk_reuseport_cb) ||
(sk2->sk_state != TCP_TIME_WAIT &&

Re: Soft lockup in inet_put_port on 4.6

2016-12-16 Thread Josef Bacik
On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa 
 wrote:

Hi Josef,

On 15.12.2016 19:53, Josef Bacik wrote:
 On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert  
wrote:
 On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 


 wrote:
  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 


 wrote:
  I think there may be some suspicious code in inet_csk_get_port. 
At

  tb_found there is:

  if (((tb->fastreuse > 0 && reuse) ||
       (tb->fastreuseport > 0 &&
        !rcu_access_pointer(sk->sk_reuseport_cb) &&
        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
      smallest_size == -1)
          goto success;
  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
          if ((reuse ||
               (tb->fastreuseport > 0 &&
                sk->sk_reuseport &&
                !rcu_access_pointer(sk->sk_reuseport_cb) &&
                uid_eq(tb->fastuid, uid))) &&
              smallest_size != -1 && --attempts >= 0) {
                  spin_unlock_bh(&head->lock);
                  goto again;
          }
          goto fail_unlock;
  }

  AFAICT there is redundancy in these two conditionals.  The same 
clause

  is being checked in both: (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is 
true the
  first conditional should be hit, goto done,  and the second 
will never
  evaluate that part to true-- unless the sk is changed (do we 
need

  READ_ONCE for sk->sk_reuseport_cb?).

  That's an interesting point... It looks like this function also
  changed in 4.6 from using a single local_bh_disable() at the 
beginning

  with several spin_lock(&head->lock) to exclusively
  spin_lock_bh(&head->lock) at each locking point.  Perhaps the 
full bh
  disable variant was preventing the timers in your stack trace 
from

  running interleaved with this function before?


 Could be, although dropping the lock shouldn't be able to affect 
the

 search state. TBH, I'm a little lost in reading this function, the
 SO_REUSEPORT handling is pretty complicated. For instance,
 rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in 
that
 function and also in every call to inet_csk_bind_conflict. I 
wonder if

 we can simplify this under the assumption that SO_REUSEPORT is only
 allowed if the port number (snum) is explicitly specified.


 Ok first I have data for you Hannes, here's the time distributions
 before during and after the lockup (with all the debugging in place 
the
 box eventually recovers).  I've attached it as a text file since it 
is

 long.


Thanks a lot!

 Second is I was thinking about why we would spend so much time 
doing the

 ->owners list, and obviously it's because of the massive amount of
 timewait sockets on the owners list.  I wrote the following dumb 
patch
 and tested it and the problem has disappeared completely.  Now I 
don't

 know if this is right at all, but I thought it was weird we weren't
 copying the soreuseport option from the original socket onto the 
twsk.

 Is there a reason we aren't doing this currently?  Does this help
 explain what is happening?  Thanks,


The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to keep
the security properties for SO_REUSEPORT. I have to think a bit more
about this.

We have seen hangs during connect. I am afraid this patch wouldn't 
help

there while also guaranteeing uniqueness.



Yeah so I looked at the code some more and actually my patch is really 
bad.  If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb, 
which is outside of the timewait sock, so that's definitely bad.


But we should at least be setting it to 0 so that we don't do this 
normally.  Unfortunately simply setting it to 0 doesn't fix the 
problem.  So for some reason having ->sk_reuseport set to 1 on a 
timewait socket makes this problem non-existent, which is strange.


So back to the drawing board I guess.  I wonder if doing what craig 
suggested and batching the timewait timer expires so it hurts less 
would accomplish the same results.  Thanks,


Josef



Re: Soft lockup in inet_put_port on 4.6

2016-12-15 Thread Hannes Frederic Sowa
Hi Josef,

On 15.12.2016 19:53, Josef Bacik wrote:
> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert  wrote:
>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>> wrote:
>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>> wrote:
  I think there may be some suspicious code in inet_csk_get_port. At
  tb_found there is:

  if (((tb->fastreuse > 0 && reuse) ||
       (tb->fastreuseport > 0 &&
        !rcu_access_pointer(sk->sk_reuseport_cb) &&
        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
      smallest_size == -1)
          goto success;
  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
          if ((reuse ||
               (tb->fastreuseport > 0 &&
                sk->sk_reuseport &&
                !rcu_access_pointer(sk->sk_reuseport_cb) &&
                uid_eq(tb->fastuid, uid))) &&
              smallest_size != -1 && --attempts >= 0) {
                  spin_unlock_bh(&head->lock);
                  goto again;
          }
          goto fail_unlock;
  }

  AFAICT there is redundancy in these two conditionals.  The same clause
  is being checked in both: (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
  first conditional should be hit, goto done,  and the second will never
  evaluate that part to true-- unless the sk is changed (do we need
  READ_ONCE for sk->sk_reuseport_cb?).
>>>  That's an interesting point... It looks like this function also
>>>  changed in 4.6 from using a single local_bh_disable() at the beginning
>>>  with several spin_lock(&head->lock) to exclusively
>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
>>>  disable variant was preventing the timers in your stack trace from
>>>  running interleaved with this function before?
>>
>> Could be, although dropping the lock shouldn't be able to affect the
>> search state. TBH, I'm a little lost in reading this function, the
>> SO_REUSEPORT handling is pretty complicated. For instance,
>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>> function and also in every call to inet_csk_bind_conflict. I wonder if
>> we can simplify this under the assumption that SO_REUSEPORT is only
>> allowed if the port number (snum) is explicitly specified.
> 
> Ok first I have data for you Hannes, here's the time distributions
> before during and after the lockup (with all the debugging in place the
> box eventually recovers).  I've attached it as a text file since it is
> long.

Thanks a lot!

> Second is I was thinking about why we would spend so much time doing the
> ->owners list, and obviously it's because of the massive amount of
> timewait sockets on the owners list.  I wrote the following dumb patch
> and tested it and the problem has disappeared completely.  Now I don't
> know if this is right at all, but I thought it was weird we weren't
> copying the soreuseport option from the original socket onto the twsk. 
> Is there a reason we aren't doing this currently?  Does this help
> explain what is happening?  Thanks,

The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to keep
the security properties for SO_REUSEPORT. I have to think a bit more
about this.

We have seen hangs during connect. I am afraid this patch wouldn't help
there while also guaranteeing uniqueness.

Thanks a lot!


Re: Soft lockup in inet_put_port on 4.6

2016-12-15 Thread Craig Gallek
On Thu, Dec 15, 2016 at 5:39 PM, Tom Herbert  wrote:
> On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik  wrote:
>> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert  wrote:
>>>
>>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>>> wrote:

  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
 wrote:
>
>  I think there may be some suspicious code in inet_csk_get_port. At
>  tb_found there is:
>
>  if (((tb->fastreuse > 0 && reuse) ||
>       (tb->fastreuseport > 0 &&
>        !rcu_access_pointer(sk->sk_reuseport_cb) &&
>        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>      smallest_size == -1)
>          goto success;
>  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>          if ((reuse ||
>               (tb->fastreuseport > 0 &&
>                sk->sk_reuseport &&
>                !rcu_access_pointer(sk->sk_reuseport_cb) &&
>                uid_eq(tb->fastuid, uid))) &&
>              smallest_size != -1 && --attempts >= 0) {
>                  spin_unlock_bh(&head->lock);
>                  goto again;
>          }
>          goto fail_unlock;
>  }
>
>  AFAICT there is redundancy in these two conditionals.  The same clause
>  is being checked in both: (tb->fastreuseport > 0 &&
>  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>  first conditional should be hit, goto done,  and the second will never
>  evaluate that part to true-- unless the sk is changed (do we need
>  READ_ONCE for sk->sk_reuseport_cb?).

  That's an interesting point... It looks like this function also
  changed in 4.6 from using a single local_bh_disable() at the beginning
  with several spin_lock(&head->lock) to exclusively
  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
  disable variant was preventing the timers in your stack trace from
  running interleaved with this function before?
>>>
>>>
>>> Could be, although dropping the lock shouldn't be able to affect the
>>> search state. TBH, I'm a little lost in reading this function, the
>>> SO_REUSEPORT handling is pretty complicated. For instance,
>>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>>> function and also in every call to inet_csk_bind_conflict. I wonder if
>>> we can simplify this under the assumption that SO_REUSEPORT is only
>>> allowed if the port number (snum) is explicitly specified.
>>
>>
>> Ok first I have data for you Hannes, here's the time distributions before
>> during and after the lockup (with all the debugging in place the box
>> eventually recovers).  I've attached it as a text file since it is long.
>>
>> Second is I was thinking about why we would spend so much time doing the
>> ->owners list, and obviously it's because of the massive amount of timewait
>> sockets on the owners list.  I wrote the following dumb patch and tested it
>> and the problem has disappeared completely.  Now I don't know if this is
>> right at all, but I thought it was weird we weren't copying the soreuseport
>> option from the original socket onto the twsk.  Is there a reason we
>> aren't doing this currently?  Does this help explain what is happening?
>> Thanks,
>>
> I think that would explain it. We would be walking long lists of TW
> sockets in inet_bind_bucket_for_each(tb, &head->chain). This should
> break, although now I'm wondering if there's other ways we can get
> into this situation. reuseport ensures that we can have long lists of
> sockets in a single bucket, TW sockets can make that list really long.
What if the time-wait timer implementation was changed to do more
opportunistic removals?  In this case, you seem to have a coordinated
timer event causing many independent locking events on the bucket in
question.  If one of those firing events realized it could handle all
of them, you could greatly reduce the contention.  The fact that they
all hash to the same bucket may make this even easier...

> Tom
>
>> Josef


Re: Soft lockup in inet_put_port on 4.6

2016-12-15 Thread Tom Herbert
On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik  wrote:
> On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert  wrote:
>>
>> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek 
>> wrote:
>>>
>>>  On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert 
>>> wrote:

  I think there may be some suspicious code in inet_csk_get_port. At
  tb_found there is:

  if (((tb->fastreuse > 0 && reuse) ||
       (tb->fastreuseport > 0 &&
        !rcu_access_pointer(sk->sk_reuseport_cb) &&
        sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
      smallest_size == -1)
          goto success;
  if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
          if ((reuse ||
               (tb->fastreuseport > 0 &&
                sk->sk_reuseport &&
                !rcu_access_pointer(sk->sk_reuseport_cb) &&
                uid_eq(tb->fastuid, uid))) &&
              smallest_size != -1 && --attempts >= 0) {
                  spin_unlock_bh(&head->lock);
                  goto again;
          }
          goto fail_unlock;
  }

  AFAICT there is redundancy in these two conditionals.  The same clause
  is being checked in both: (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
  uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
  first conditional should be hit, goto done,  and the second will never
  evaluate that part to true-- unless the sk is changed (do we need
  READ_ONCE for sk->sk_reuseport_cb?).
>>>
>>>  That's an interesting point... It looks like this function also
>>>  changed in 4.6 from using a single local_bh_disable() at the beginning
>>>  with several spin_lock(&head->lock) to exclusively
>>>  spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
>>>  disable variant was preventing the timers in your stack trace from
>>>  running interleaved with this function before?
>>
>>
>> Could be, although dropping the lock shouldn't be able to affect the
>> search state. TBH, I'm a little lost in reading this function, the
>> SO_REUSEPORT handling is pretty complicated. For instance,
>> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
>> function and also in every call to inet_csk_bind_conflict. I wonder if
>> we can simplify this under the assumption that SO_REUSEPORT is only
>> allowed if the port number (snum) is explicitly specified.
>
>
> Ok first I have data for you Hannes, here's the time distributions before
> during and after the lockup (with all the debugging in place the box
> eventually recovers).  I've attached it as a text file since it is long.
>
> Second is I was thinking about why we would spend so much time doing the
> ->owners list, and obviously it's because of the massive amount of timewait
> sockets on the owners list.  I wrote the following dumb patch and tested it
> and the problem has disappeared completely.  Now I don't know if this is
> right at all, but I thought it was weird we weren't copying the soreuseport
> option from the original socket onto the twsk.  Is there a reason we
> aren't doing this currently?  Does this help explain what is happening?
> Thanks,
>
I think that would explain it. We would be walking long lists of TW
sockets in inet_bind_bucket_for_each(tb, &head->chain). This should
break, although now I'm wondering if there's other ways we can get
into this situation. reuseport ensures that we can have long lists of
sockets in a single bucket, TW sockets can make that list really long.

Tom

> Josef


Re: Soft lockup in inet_put_port on 4.6

2016-12-15 Thread Josef Bacik
On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert  
wrote:
On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek  
wrote:
 On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert  
wrote:

 I think there may be some suspicious code in inet_csk_get_port. At
 tb_found there is:

 if (((tb->fastreuse > 0 && reuse) ||
      (tb->fastreuseport > 0 &&
       !rcu_access_pointer(sk->sk_reuseport_cb) &&
       sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
     smallest_size == -1)
         goto success;
 if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
         if ((reuse ||
              (tb->fastreuseport > 0 &&
               sk->sk_reuseport &&
               !rcu_access_pointer(sk->sk_reuseport_cb) &&
               uid_eq(tb->fastuid, uid))) &&
             smallest_size != -1 && --attempts >= 0) {
                 spin_unlock_bh(&head->lock);
                 goto again;
         }
         goto fail_unlock;
 }

 AFAICT there is redundancy in these two conditionals.  The same clause
 is being checked in both: (tb->fastreuseport > 0 &&
 !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
 uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
 first conditional should be hit, goto done,  and the second will never
 evaluate that part to true-- unless the sk is changed (do we need
 READ_ONCE for sk->sk_reuseport_cb?).

 That's an interesting point... It looks like this function also
 changed in 4.6 from using a single local_bh_disable() at the beginning
 with several spin_lock(&head->lock) to exclusively
 spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
 disable variant was preventing the timers in your stack trace from
 running interleaved with this function before?


Could be, although dropping the lock shouldn't be able to affect the
search state. TBH, I'm a little lost in reading this function, the
SO_REUSEPORT handling is pretty complicated. For instance,
rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
function and also in every call to inet_csk_bind_conflict. I wonder if
we can simplify this under the assumption that SO_REUSEPORT is only
allowed if the port number (snum) is explicitly specified.


Ok first I have data for you Hannes, here's the time distributions 
before during and after the lockup (with all the debugging in place the 
box eventually recovers).  I've attached it as a text file since it is 
long.


Second is I was thinking about why we would spend so much time doing 
the ->owners list, and obviously it's because of the massive amount of 
timewait sockets on the owners list.  I wrote the following dumb patch 
and tested it and the problem has disappeared completely.  Now I don't 
know if this is right at all, but I thought it was weird we weren't 
copying the soreuseport option from the original socket onto the twsk.  
Is there a reason we aren't doing this currently?  Does this help 
explain what is happening?  Thanks,
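A guess at the shape of the patch being described (not the actual diff) would
be to carry the flag over when the timewait socket is created, e.g. in
inet_twsk_alloc(); the tw_reuseport field is assumed here:

    /* Sketch only: copy the owner's SO_REUSEPORT bit onto the timewait
     * socket so bind_conflict can treat the TW entries like the reuseport
     * listeners they came from.  tw_reuseport is a hypothetical field. */
    tw->tw_reuseport = sk->sk_reuseport;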


Josef
 inet_csk_get_port   : count distribution
 0 -> 1  : 0||
 2 -> 3  : 0||
 4 -> 7  : 0||
 8 -> 15 : 0||
16 -> 31 : 0||
32 -> 63 : 0||
64 -> 127: 0||
   128 -> 255: 0||
   256 -> 511: 0||
   512 -> 1023   : 0||
  1024 -> 2047   : 4|*   |
  2048 -> 4095   : 100  ||
  4096 -> 8191   : 64   |*   |
  8192 -> 16383  : 35   |**  |
 16384 -> 32767  : 2||
 inet_csk_bind_conflict : count distribution
 0 -> 1  : 0||
 2 -> 3  : 0||
 4 -> 7  : 0||
 8 -> 15 : 0||
16 -> 31 : 0||
 

Re: Soft lockup in inet_put_port on 4.6

2016-12-13 Thread Tom Herbert
On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek  wrote:
> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert  wrote:
>> I think there may be some suspicious code in inet_csk_get_port. At
>> tb_found there is:
>>
>> if (((tb->fastreuse > 0 && reuse) ||
>>  (tb->fastreuseport > 0 &&
>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>   sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>> smallest_size == -1)
>> goto success;
>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>> if ((reuse ||
>>  (tb->fastreuseport > 0 &&
>>   sk->sk_reuseport &&
>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>   uid_eq(tb->fastuid, uid))) &&
>> smallest_size != -1 && --attempts >= 0) {
>> spin_unlock_bh(&head->lock);
>> goto again;
>> }
>> goto fail_unlock;
>> }
>>
>> AFAICT there is redundancy in these two conditionals.  The same clause
>> is being checked in both: (tb->fastreuseport > 0 &&
>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>> first conditional should be hit, goto done,  and the second will never
>> evaluate that part to true-- unless the sk is changed (do we need
>> READ_ONCE for sk->sk_reuseport_cb?).
> That's an interesting point... It looks like this function also
> changed in 4.6 from using a single local_bh_disable() at the beginning
> with several spin_lock(&head->lock) to exclusively
> spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
> disable variant was preventing the timers in your stack trace from
> running interleaved with this function before?

Could be, although dropping the lock shouldn't be able to affect the
search state. TBH, I'm a little lost in reading this function, the
SO_REUSEPORT handling is pretty complicated. For instance,
rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
function and also in every call to inet_csk_bind_conflict. I wonder if
we can simplify this under the assumption that SO_REUSEPORT is only
allowed if the port number (snum) is explicitly specified.

Tom


Re: Soft lockup in inet_put_port on 4.6

2016-12-13 Thread Craig Gallek
On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert  wrote:
> I think there may be some suspicious code in inet_csk_get_port. At
> tb_found there is:
>
> if (((tb->fastreuse > 0 && reuse) ||
>  (tb->fastreuseport > 0 &&
>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>   sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
> smallest_size == -1)
> goto success;
> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
> if ((reuse ||
>  (tb->fastreuseport > 0 &&
>   sk->sk_reuseport &&
>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>   uid_eq(tb->fastuid, uid))) &&
> smallest_size != -1 && --attempts >= 0) {
> spin_unlock_bh(&head->lock);
> goto again;
> }
> goto fail_unlock;
> }
>
> AFAICT there is redundancy in these two conditionals.  The same clause
> is being checked in both: (tb->fastreuseport > 0 &&
> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
> first conditional should be hit, goto done,  and the second will never
> evaluate that part to true-- unless the sk is changed (do we need
> READ_ONCE for sk->sk_reuseport_cb?).
That's an interesting point... It looks like this function also
changed in 4.6 from using a single local_bh_disable() at the beginning
with several spin_lock(&head->lock) to exclusively
spin_lock_bh(&head->lock) at each locking point.  Perhaps the full bh
disable variant was preventing the timers in your stack trace from
running interleaved with this function before?


Re: Soft lockup in inet_put_port on 4.6

2016-12-13 Thread Tom Herbert
I think there may be some suspicious code in inet_csk_get_port. At
tb_found there is:

if (((tb->fastreuse > 0 && reuse) ||
 (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) &&
  sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
smallest_size == -1)
goto success;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
if ((reuse ||
 (tb->fastreuseport > 0 &&
  sk->sk_reuseport &&
  !rcu_access_pointer(sk->sk_reuseport_cb) &&
  uid_eq(tb->fastuid, uid))) &&
smallest_size != -1 && --attempts >= 0) {
spin_unlock_bh(&head->lock);
goto again;
}
goto fail_unlock;
}

AFAICT there is redundancy in these two conditionals.  The same clause
is being checked in both: (tb->fastreuseport > 0 &&
!rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
first conditional should be hit, goto done,  and the second will never
evaluate that part to true-- unless the sk is changed (do we need
READ_ONCE for sk->sk_reuseport_cb?).
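
In other words, if that clause really is redundant it could presumably be
hoisted so it is only written once; a sketch of an equivalent rewrite
(same behaviour intended, ignoring the READ_ONCE question):

    bool fastreuseport_ok = tb->fastreuseport > 0 &&
                            !rcu_access_pointer(sk->sk_reuseport_cb) &&
                            sk->sk_reuseport && uid_eq(tb->fastuid, uid);

    if (((tb->fastreuse > 0 && reuse) || fastreuseport_ok) &&
        smallest_size == -1)
            goto success;
    if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
            if ((reuse || fastreuseport_ok) &&
                smallest_size != -1 && --attempts >= 0) {
                    spin_unlock_bh(&head->lock);
                    goto again;
            }
            goto fail_unlock;
    }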

Another potential issue is that the goto again goes back to doing
the port scan, but if snum had been set originally that doesn't seem
like what we want.

Thanks,
Tom




On Mon, Dec 12, 2016 at 2:24 PM, Josef Bacik  wrote:
>
> On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa
>  wrote:
>>
>> On 12.12.2016 19:05, Josef Bacik wrote:
>>>
>>>  On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 
>>>  wrote:

  On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:

>
>   Hmm... Does your ephemeral port range include the port your load
>   balancing app is using?


>   I suspect that you might have processes doing bind(port = 0) that are
>   trapped into the bind_conflict() scan?

  With 100,000 + timewaits there, this possibly hurts.

  Can you try the following loop breaker?
>>>
>>>
>>>  It doesn't appear that the app is doing bind(port = 0) during normal
>>>  operation.  I tested this patch and it made no difference.  I'm going to
>>>  test simply restarting the app without changing to the SO_REUSEPORT
>>>  option.  Thanks,
>>
>>
>> Would it be possible to trace the time the function uses with trace? If
>> we don't see the number growing considerably over time we probably can
>> rule out that we loop somewhere in there (I would instrument
>> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
>>
>> __inet_hash_connect -> __inet_check_established also takes a lock
>> (inet_ehash_lockp) which can be locked from inet_diag code path during
>> socket diag info dumping.
>>
>> Unfortunately we couldn't reproduce it so far. :/
>
>
> So I had a bcc script running to time how long we spent in
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but of
> course I'm an idiot and didn't actually separate out the stats so I could
> tell _which_ one was taking forever.  But anyway here's a normal
> distribution on the box
>
> Some shit           : count
>      0 -> 1         : 0
>      2 -> 3         : 0
>      4 -> 7         : 0
>      8 -> 15        : 0
>     16 -> 31        : 0
>     32 -> 63        : 0
>     64 -> 127       : 0
>    128 -> 255       : 0
>    256 -> 511       : 0
>    512 -> 1023      : 0
>   1024 -> 2047      : 74
>   2048 -> 4095      : 10537
>   4096 -> 8191      : 8497
>   8192 -> 16383     : 3745
>  16384 -> 32767     : 300
>  32768 -> 65535     : 250
>  65536 -> 131071    : 180
> 131072 -> 262143    : 71
> 262144 -> 524287    : 18
>  

Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Josef Bacik


On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa 
 wrote:

On 12.12.2016 19:05, Josef Bacik wrote:
 On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet wrote:

 On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:



  Hmm... Does your ephemeral port range include the port your load
  balancing app is using ?


 I suspect that you might have processes doing bind( port = 0) that are
 trapped into the bind_conflict() scan ?

 With 100,000 + timewaits there, this possibly hurts.

 Can you try the following loop breaker ?


 It doesn't appear that the app is doing bind(port = 0) during normal
 operation.  I tested this patch and it made no difference.  I'm going to
 test simply restarting the app without changing to the SO_REUSEPORT
 option.  Thanks,


Would it be possible to trace the time the function uses with trace? If
we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.

Unfortunately we couldn't reproduce it so far. :/


So I had a bcc script running to time how long we spent in 
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but 
of course I'm an idiot and didn't actually separate out the stats so I 
could tell _which_ one was taking forever.  But anyway here's a normal 
distribution on the box


Some shit           : count
     0 -> 1         : 0
     2 -> 3         : 0
     4 -> 7         : 0
     8 -> 15        : 0
    16 -> 31        : 0
    32 -> 63        : 0
    64 -> 127       : 0
   128 -> 255       : 0
   256 -> 511       : 0
   512 -> 1023      : 0
  1024 -> 2047      : 74
  2048 -> 4095      : 10537
  4096 -> 8191      : 8497
  8192 -> 16383     : 3745
 16384 -> 32767     : 300
 32768 -> 65535     : 250
 65536 -> 131071    : 180
131072 -> 262143    : 71
262144 -> 524287    : 18
524288 -> 1048575   : 5


With the times in nanoseconds, and here's the distribution during the 
problem


Some shit           : count
     0 -> 1         : 0
     2 -> 3         : 0
     4 -> 7         : 0
     8 -> 15        : 0
    16 -> 31        : 0
    32 -> 63        : 0
    64 -> 127       : 0
   128 -> 255       : 0
   256 -> 511       : 0
   512 -> 1023      : 0
  1024 -> 2047      : 21
  2048 -> 4095      : 21820
  4096 -> 8191      : 11598
  8192 -> 16383     : 4337
 16384 -> 32767     : 290
 32768 -> 65535     : 59
 65536 -> 131071    : 23
131072 -> 262143    : 12
262144 -> 524287    : 6
524288 -> 1048575   : 19
1048576 -> 2097151  : 1079
  

Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Josef Bacik
On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa 
 wrote:

On 12.12.2016 19:05, Josef Bacik wrote:
 On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet wrote:

 On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:



  Hmm... Does your ephemeral port range include the port your load
  balancing app is using ?


 I suspect that you might have processes doing bind( port = 0) that are
 trapped into the bind_conflict() scan ?

 With 100,000 + timewaits there, this possibly hurts.

 Can you try the following loop breaker ?


 It doesn't appear that the app is doing bind(port = 0) during normal
 operation.  I tested this patch and it made no difference.  I'm going to
 test simply restarting the app without changing to the SO_REUSEPORT
 option.  Thanks,


Would it be possible to trace the time the function uses with trace? If
we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.

Unfortunately we couldn't reproduce it so far. :/


Working on getting the timing info, will probably be tomorrow due to 
meetings.  I did test simply restarting the app without changing to the 
config that enabled the use of SO_REUSEPORT and the problem didn't 
occur, so it definitely has something to do with SO_REUSEPORT.  Thanks,


Josef



Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Hannes Frederic Sowa
On 12.12.2016 19:05, Josef Bacik wrote:
> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 
> wrote:
>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>
>>>
>>>  Hmm... Does your ephemeral port range include the port your load
>>>  balancing app is using ?
>>
>> I suspect that you might have processes doing bind( port = 0) that are
>> trapped into the bind_conflict() scan ?
>>
>> With 100,000 + timewaits there, this possibly hurts.
>>
>> Can you try the following loop breaker ?
> 
> It doesn't appear that the app is doing bind(port = 0) during normal
> operation.  I tested this patch and it made no difference.  I'm going to
> test simply restarting the app without changing to the SO_REUSEPORT
> option.  Thanks,

Would it be possible to trace the time the function uses with trace? If
we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
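
A minimal sketch of that kind of instrumentation (a hypothetical wrapper
around the bind_conflict callback; the threshold and message are invented
for illustration and this is not code from the thread):

static int timed_bind_conflict(const struct sock *sk,
			       const struct inet_bind_bucket *tb, bool relax)
{
	u64 t0 = ktime_get_ns();
	int conflict = inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, relax);
	u64 dt = ktime_get_ns() - t0;

	if (dt > 10 * NSEC_PER_MSEC)	/* arbitrary "this is slow" threshold */
		pr_warn_ratelimited("bind_conflict walk took %llu ns\n",
				    (unsigned long long)dt);
	return conflict;
}

Calling this in place of the direct icsk_af_ops->bind_conflict() call would
show whether the tb->owners walk is what grows over time.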

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.

Unfortunately we couldn't reproduce it so far. :/

Thanks,
Hannes



Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Josef Bacik
On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet  
wrote:

On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:



 Hmm... Does your ephemeral port range include the port your load
 balancing app is using ?


I suspect that you might have processes doing bind( port = 0) that are
trapped into the bind_conflict() scan ?

With 100,000 + timewaits there, this possibly hurts.

Can you try the following loop breaker ?


It doesn't appear that the app is doing bind(port = 0) during normal 
operation.  I tested this patch and it made no difference.  I'm going 
to test simply restarting the app without changing to the SO_REUSEPORT 
option.  Thanks,


Josef



Re: Soft lockup in inet_put_port on 4.6

2016-12-09 Thread Eric Dumazet
On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:

> 
> Hmm... Does your ephemeral port range include the port your load
> balancing app is using ?

I suspect that you might have processes doing bind( port = 0) that are
trapped into the bind_conflict() scan ?

With 100,000 + timewaits there, this possibly hurts.

Can you try the following loop breaker ?

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d5d3ead0a6c31e42e8843d30f8c643324a91b8e9..74f0f5ee6a02c624edb0263b9ddd27813f68d0a5 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -51,7 +51,7 @@ int inet_csk_bind_conflict(const struct sock *sk,
int reuse = sk->sk_reuse;
int reuseport = sk->sk_reuseport;
kuid_t uid = sock_i_uid((struct sock *)sk);
-
+   unsigned int max_count;
/*
 * Unlike other sk lookup places we do not check
 * for sk_net here, since _all_ the socks listed
@@ -59,6 +59,7 @@ int inet_csk_bind_conflict(const struct sock *sk,
 * one this bucket belongs to.
 */
 
+   max_count = relax ? ~0U : 100;
sk_for_each_bound(sk2, &tb->owners) {
if (sk != sk2 &&
!inet_v6_ipv6only(sk2) &&
@@ -84,6 +85,8 @@ int inet_csk_bind_conflict(const struct sock *sk,
break;
}
}
+   if (--max_count == 0)
+   return 1;
}
return sk2 != NULL;
 }
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 1c86c478f578b49373e61a4c397f23f3dc7f3fc6..4f63d06e0d601da94eb3f2b35a988abd060e156c 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -35,12 +35,14 @@ int inet6_csk_bind_conflict(const struct sock *sk,
int reuse = sk->sk_reuse;
int reuseport = sk->sk_reuseport;
kuid_t uid = sock_i_uid((struct sock *)sk);
+   unsigned int max_count;
 
/* We must walk the whole port owner list in this case. -DaveM */
/*
 * See comment in inet_csk_bind_conflict about sock lookup
 * vs net namespaces issues.
 */
+   max_count = relax ? ~0U : 100;
sk_for_each_bound(sk2, &tb->owners) {
if (sk != sk2 &&
(!sk->sk_bound_dev_if ||
@@ -61,6 +63,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
ipv6_rcv_saddr_equal(sk, sk2, true))
break;
}
+   if (--max_count == 0)
+   return 1;
}
 
return sk2 != NULL;





Re: Soft lockup in inet_put_port on 4.6

2016-12-09 Thread Eric Dumazet
On Fri, 2016-12-09 at 20:59 -0500, Josef Bacik wrote:
> On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik  wrote:
> > 
> >>  On Dec 8, 2016, at 7:32 PM, Eric Dumazet  
> >> wrote:
> >> 
> >>>  On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
> >>> 
> >>>  We can reproduce the problem at will, still trying to run down the
> >>>  problem.  I'll try and find one of the boxes that dumped a core and get
> >>>  a bt of everybody.  Thanks,
> >> 
> >>  OK, sounds good.
> >> 
> >>  I had a look and :
> >>  - could not spot a fix that came after 4.6.
> >>  - could not spot an obvious bug.
> >> 
> >>  Anything special in the program triggering the issue ?
> >>  SO_REUSEPORT and/or special socket options ?
> >> 
> > 
> > So they recently started using SO_REUSEPORT, that's what triggered 
> > it; if they don't use it then everything is fine.
> > 
> > I added some instrumentation for get_port to see if it was looping in 
> > there and none of my printk's triggered.  The softlockup messages are 
> > always on the inet_bind_bucket lock, sometimes in the process context 
> > in get_port or in the softirq context either through inet_put_port or 
> > inet_kill_twsk.  On the box that I have a coredump for there's only 
> > one processor in the inet code so I'm not sure what to make of that.  
> > That was a box from last week so I'll look at a more recent core and 
> > see if it's different.  Thanks,
> 
> Ok more investigation today, a few bullet points
> 
> - With all the debugging turned on the boxes seem to recover after 
> about a minute.  I'd get the spam of the soft lockup messages all on 
> the inet_bind_bucket, and then the box would be fine.
> - I looked at a core I had from before I started investigating things 
> and there's only one process trying to get the inet_bind_bucket of all 
> the 48 cpus.
> - I noticed that there was over 100k twsk's in that original core.
> - I put a global counter of the twsk's (since most of the softlockup 
> messages have the twsk timers in the stack) and noticed with the 
> debugging kernel it started around 16k twsk's and once it recovered it 
> was down to less than a thousand.  There's a jump where it goes from 8k 
> to 2k and then there's only one more softlockup message and the box is 
> fine.
> - This happens when we restart the service with the config option to 
> start using SO_REUSEPORT.
> 
> The application is our load balancing app, so obviously has lots of 
> connections opened at any given time.  What I'm wondering and will test 
> on Monday is if the SO_REUSEPORT change even matters, or if simply 
> restarting the service is what triggers the problem.  One thing I 
> forgot to mention is that it's also using TCP_FASTOPEN in both the 
> non-reuseport and reuseport variants.
> 
> What I suspect is happening is the service stops, all of the sockets it 
> had open go into TIMEWAIT with relatively the same timer period, and 
> then suddenly all wake up at the same time which coupled with the 
> massive amount of traffic that we see per box anyway results in so much 
> contention and ksoftirqd usage that the box livelocks for a while.  
> With the lock debugging and stuff turned on we aren't able to service 
> as much traffic so it recovers relatively quickly, whereas a normal 
> production kernel never recovers.
> 
> Please keep in mind that I'm a file system developer so my conclusions 
> may be completely insane, any guidance would be welcome.  I'll continue 
> hammering on this on Monday.  Thanks,

Hmm... Does your ephemeral port range include the port your load
balancing app is using ?





Re: Soft lockup in inet_put_port on 4.6

2016-12-09 Thread Josef Bacik

On Thu, Dec 8, 2016 at 8:01 PM, Josef Bacik  wrote:


 On Dec 8, 2016, at 7:32 PM, Eric Dumazet  
wrote:



 On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:

 We can reproduce the problem at will, still trying to run down the
 problem.  I'll try and find one of the boxes that dumped a core and get
 a bt of everybody.  Thanks,


 OK, sounds good.

 I had a look and :
 - could not spot a fix that came after 4.6.
 - could not spot an obvious bug.

 Anything special in the program triggering the issue ?
 SO_REUSEPORT and/or special socket options ?



So they recently started using SO_REUSEPORT, that's what triggered 
it; if they don't use it then everything is fine.


I added some instrumentation for get_port to see if it was looping in 
there and none of my printk's triggered.  The softlockup messages are 
always on the inet_bind_bucket lock, sometimes in the process context 
in get_port or in the softirq context either through inet_put_port or 
inet_kill_twsk.  On the box that I have a coredump for there's only 
one processor in the inet code so I'm not sure what to make of that.  
That was a box from last week so I'll look at a more recent core and 
see if it's different.  Thanks,


Ok more investigation today, a few bullet points

- With all the debugging turned on the boxes seem to recover after 
about a minute.  I'd get the spam of the soft lockup messages all on 
the inet_bind_bucket, and then the box would be fine.
- I looked at a core I had from before I started investigating things 
and there's only one process trying to get the inet_bind_bucket of all 
the 48 cpus.

- I noticed that there was over 100k twsk's in that original core.
- I put a global counter of the twsk's (since most of the softlockup 
messages have the twsk timers in the stack) and noticed with the 
debugging kernel it started around 16k twsk's and once it recovered it 
was down to less than a thousand.  There's a jump where it goes from 8k 
to 2k and then there's only one more softlockup message and the box is 
fine.
- This happens when we restart the service with the config option to 
start using SO_REUSEPORT.


The application is our load balancing app, so obviously has lots of 
connections opened at any given time.  What I'm wondering and will test 
on Monday is if the SO_REUSEPORT change even matters, or if simply 
restarting the service is what triggers the problem.  One thing I 
forgot to mention is that it's also using TCP_FASTOPEN in both the 
non-reuseport and reuseport variants.


What I suspect is happening is the service stops, all of the sockets it 
had open go into TIMEWAIT with relatively the same timer period, and 
then suddenly all wake up at the same time which coupled with the 
massive amount of traffic that we see per box anyway results in so much 
contention and ksoftirqd usage that the box livelocks for a while.  
With the lock debugging and stuff turned on we aren't able to service 
as much traffic so it recovers relatively quickly, whereas a normal 
production kernel never recovers.
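
For scale, a rough back-of-envelope estimate (only the timewait count comes
from the reports above; the per-node cost is an assumption):

/*
 * ~100,000 tw entries on one bind bucket, assume ~100 ns per list node
 * touched (pointer chase, likely a cache miss):
 *
 *     100,000 * 100 ns ~= 10 ms per bind_conflict() walk,
 *
 * all of it spent holding the bucket spinlock.  A burst of tw timers plus
 * a few rebinding listeners can keep that lock contended for seconds,
 * which is what the soft lockup watchdog then reports.
 */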


Please keep in mind that I'm a file system developer so my conclusions 
may be completely insane, any guidance would be welcome.  I'll continue 
hammering on this on Monday.  Thanks,


Josef



Re: Soft lockup in inet_put_port on 4.6

2016-12-08 Thread Josef Bacik

> On Dec 8, 2016, at 7:32 PM, Eric Dumazet  wrote:
> 
>> On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:
>> 
>> We can reproduce the problem at will, still trying to run down the 
>> problem.  I'll try and find one of the boxes that dumped a core and get 
>> a bt of everybody.  Thanks,
> 
> OK, sounds good.
> 
> I had a look and :
> - could not spot a fix that came after 4.6. 
> - could not spot an obvious bug.
> 
> Anything special in the program triggering the issue ?
> SO_REUSEPORT and/or special socket options ?
> 

So they recently started using SO_REUSEPORT, that's what triggered it; if they 
don't use it then everything is fine.

I added some instrumentation for get_port to see if it was looping in there and 
none of my printk's triggered.  The softlockup messages are always on the 
inet_bind_bucket lock, sometimes in the process context in get_port or in the 
softirq context either through inet_put_port or inet_kill_twsk.  On the box 
that I have a coredump for there's only one processor in the inet code so I'm 
not sure what to make of that.  That was a box from last week so I'll look at a 
more recent core and see if it's different.  Thanks,

Josef

Re: Soft lockup in inet_put_port on 4.6

2016-12-08 Thread Eric Dumazet
On Thu, 2016-12-08 at 16:36 -0500, Josef Bacik wrote:

> We can reproduce the problem at will, still trying to run down the 
> problem.  I'll try and find one of the boxes that dumped a core and get 
> a bt of everybody.  Thanks,

OK, sounds good.

I had a look and :
- could not spot a fix that came after 4.6. 
- could not spot an obvious bug.

Anything special in the program triggering the issue ?
SO_REUSEPORT and/or special socket options ?

Thanks.




Re: Soft lockup in inet_put_port on 4.6

2016-12-08 Thread Josef Bacik
On Thu, Dec 8, 2016 at 4:03 PM, Hannes Frederic Sowa 
 wrote:

Hello Tom,

On Wed, Dec 7, 2016, at 00:06, Tom Herbert wrote:
 We are seeing a fair number of machines getting into softlockup in 4.6
 kernel. As near as I can tell this is happening on the spinlock in
 bind hash bucket. When inet_csk_get_port exits and does spin_unlock_bh
 the TCP timer runs and we hit lockup in inet_put_port (presumably on
 same lock). It seems like the lock isn't properly being unlocked
 somewhere but I don't readily see it.

 Any ideas?


Likewise we received reports that pretty much look the same on our
heavily patched kernel. Did you have a chance to investigate or
reproduce the problem?

I am wondering if you would be able to take a complete thread stack dump
if you can reproduce this to check if one of the user space processes is
looping inside finding a free port?


We can reproduce the problem at will, still trying to run down the 
problem.  I'll try and find one of the boxes that dumped a core and get 
a bt of everybody.  Thanks,


Josef



Re: Soft lockup in inet_put_port on 4.6

2016-12-08 Thread Hannes Frederic Sowa
Hello Tom,

On Wed, Dec 7, 2016, at 00:06, Tom Herbert wrote:
> We are seeing a fair number of machines getting into softlockup in 4.6
> kernel. As near as I can tell this is happening on the spinlock in
> bind hash bucket. When inet_csk_get_port exits and does spin_unlock_bh
> the TCP timer runs and we hit lockup in inet_put_port (presumably on
> same lock). It seems like the lock isn't properly being unlocked
> somewhere but I don't readily see it.
> 
> Any ideas?

Likewise we received reports that pretty much look the same on our
heavily patched kernel. Did you have a chance to investigate or
reproduce the problem?

I am wondering if you would be able to take a complete thread stack dump
if you can reproduce this to check if one of the user space processes is
looping inside finding a free port?

Thanks,
Hannes


Soft lockup in inet_put_port on 4.6

2016-12-06 Thread Tom Herbert
Hello,

We are seeing a fair number of machines getting into softlockup in 4.6
kernel. As near as I can tell this is happening on the spinlock in
bind hash bucket. When inet_csk_get_port exits and does spin_unlock_bh
the TCP timer runs and we hit lockup in inet_put_port (presumably on
same lock). It seems like the lock isn't properly being unlocked
somewhere but I don't readily see it.

Any ideas?

Thanks,
Tom

NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [proxygend:4152094]
Modules linked in: fuse nf_log_ipv6 ip6t_REJECT nf_reject_ipv6
nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 xt_limit
xt_multiport ipip ip_tunnel tunnel4 ip6_tunnel tunnel6 coretemp mptctl
mptbase cls_bpf ipmi_watchdog tcp_diag inet_diag ip6table_filter
xt_NFLOG nfnetlink_log xt_comment xt_statistic iptable_filter xt_mark
tpm_crb i2c_piix4 dm_crypt loop ipmi_devintf acpi_cpufreq iTCO_wdt
iTCO_vendor_support ipmi_si ipmi_msghandler efivars i2c_i801 sg
lpc_ich mfd_core hpilo xhci_pci xhci_hcd button nvme nvme_core
CPU: 22 PID: 4152094 Comm: proxygend Tainted: G W L
4.6.7-13_fbk3_1119_g367d67b #13
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
task: 88168c52d100 ti: 881c12fb task.ti: 881c12fb
RIP: 0010:[] []
queued_spin_lock_slowpath+0xf8/0x170
RSP: 0018:883fff303da0 EFLAGS: 0246
RAX:  RBX: 881257163e00 RCX: 0001
RDX: 883fff375e40 RSI: 005c RDI: c90018d6bae0
RBP: 883fff303da0 R08: 883fff315e40 R09: 
R10: 0020 R11: 01c0 R12: c90018d6bae0
R13: 820f8a80 R14: 881257163f30 R15: 
FS: 7fa7bb7ff700() GS:883fff30() knlGS:
CS: 0010 DS:  ES:  CR0: 80050033
CR2: 7ff4be114d90 CR3: 00243f99c000 CR4: 003406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Stack: 883fff303db0 817e5910 883fff303dd8 8172f6b4
881257163e00 0007 0004 883fff303e00
81733237 881257163e00  81ce7cc0
Call Trace:

[] _raw_spin_lock+0x20/0x30
[] inet_put_port+0x54/0xb0
[] tcp_set_state+0x67/0xc0
[] tcp_done+0x33/0x90
[] tcp_write_err+0x31/0x50
[] tcp_retransmit_timer+0x119/0x7d0
[] ? tcp_write_timer_handler+0x1e0/0x1e0
[] tcp_write_timer_handler+0x8e/0x1e0
[] tcp_write_timer+0x67/0x70
[] call_timer_fn+0x35/0x120
[] ? tcp_write_timer_handler+0x1e0/0x1e0
[] run_timer_softirq+0x1fc/0x2b0
[] __do_softirq+0xcc/0x26c
[] do_softirq_own_stack+0x1c/0x30 
[] do_softirq+0x31/0x40
[] __local_bh_enable_ip+0x78/0x80
[] _raw_spin_unlock_bh+0x1a/0x20
[] inet_csk_get_port+0x1c1/0x5a0
[] ? sock_poll+0x47/0xb0
[] inet_csk_listen_start+0x65/0xc0
[] inet_listen+0x9c/0xe0
[] SyS_listen+0x80/0x90
[] entry_SYSCALL_64_fastpath+0x13/0x8f
Code: c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 5e 01 00 48 03 14
c5 c0 d4 d1 81 4c 89 02 41 8b 40 08 85 c0 75 0a f3 90 41 8b 40 08 <85>
c0 74 f6 4d 8b 08 4d 85 c9 74 08 41 0f 0d 09 eb 02 f3 90 8b