Re: Soft lockup in inet_put_port on 4.6

Josef Bacik Fri, 16 Dec 2016 15:26:16 -0800

On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert <t...@herbertland.com> wrote:

On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik <jba...@fb.com> wrote:
 On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik <jba...@fb.com> wrote:
 On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik <jba...@fb.com> wrote:
 On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
 <han...@stressinduktion.org> wrote:
 Hi Josef,

 On 15.12.2016 19:53, Josef Bacik wrote:
On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <t...@herbertland.com> wrote:
On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatg...@gmail.com> wrote:
On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <t...@herbertland.com> wrote:
I think there may be some suspicious code in inet_csk_get_port. At
tb_found there is:

                if (((tb->fastreuse > 0 && reuse) ||
                     (tb->fastreuseport > 0 &&
                      !rcu_access_pointer(sk->sk_reuseport_cb) &&
                      sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
                    smallest_size == -1)
                        goto success;
                if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
                        if ((reuse ||
                             (tb->fastreuseport > 0 &&
                              sk->sk_reuseport &&
                              !rcu_access_pointer(sk->sk_reuseport_cb) &&
                              uid_eq(tb->fastuid, uid))) &&
                            smallest_size != -1 && --attempts >= 0) {
                                spin_unlock_bh(&head->lock);
                                goto again;
                        }
                        goto fail_unlock;
                }
AFAICT there is redundancy in these two conditionals. The same clause
is being checked in both: (tb->fastreuseport > 0 &&
!rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
first conditional should be hit, goto done, and the second will never
evaluate that part to true-- unless the sk is changed (do we need
READ_ONCE for sk->sk_reuseport_cb?).
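
One way to sanity-check the reasoning above is to model the two conditions as plain booleans and enumerate every input. The sketch below is user-space C, not kernel code. The function and parameter names are illustrative, and reuseport_ok stands in for the whole shared clause (tb->fastreuseport > 0 && sk->sk_reuseport && !rcu_access_pointer(sk->sk_reuseport_cb) && uid_eq(tb->fastuid, uid)). It also ignores that the second test only runs when bind_conflict() reports a conflict, so it says nothing about concurrent changes to sk.

```c
#include <assert.h>
#include <stdbool.h>

/* First test at tb_found: take the fast path to success. */
bool fast_success(bool fastreuse_pos, bool reuse, bool reuseport_ok,
		  bool smallest_unset)
{
	return ((fastreuse_pos && reuse) || reuseport_ok) && smallest_unset;
}

/* Retry test inside the bind_conflict() branch. */
bool retry_after_conflict(bool reuse, bool reuseport_ok, bool smallest_unset,
			  bool attempts_left)
{
	return (reuse || reuseport_ok) && !smallest_unset && attempts_left;
}

struct tally {
	int missed_fast_path;	/* clause holds, size == -1, no fast path */
	int retry_when_unset;	/* retry taken while size == -1 */
	int retry_via_clause;	/* retry taken only because of the clause */
};

/* Walk all 32 combinations of the five boolean inputs. */
struct tally enumerate_cases(void)
{
	struct tally t = { 0, 0, 0 };

	for (int bits = 0; bits < 32; bits++) {
		bool fastreuse_pos = (bits >> 0) & 1;
		bool reuse = (bits >> 1) & 1;
		bool reuseport_ok = (bits >> 2) & 1;
		bool smallest_unset = (bits >> 3) & 1;	/* smallest_size == -1 */
		bool attempts_left = (bits >> 4) & 1;
		bool fast = fast_success(fastreuse_pos, reuse, reuseport_ok,
					 smallest_unset);
		bool retry = retry_after_conflict(reuse, reuseport_ok,
						  smallest_unset, attempts_left);

		if (reuseport_ok && smallest_unset && !fast)
			t.missed_fast_path++;
		if (smallest_unset && retry)
			t.retry_when_unset++;
		if (retry && !reuse)
			t.retry_via_clause++;
	}
	return t;
}

/* Self-check: the two cases that should be impossible never occur. */
void check_tally(void)
{
	struct tally t = enumerate_cases();

	assert(t.missed_fast_path == 0);
	assert(t.retry_when_unset == 0);
}
```

In this model the clause always takes the fast path when smallest_size == -1, and the retry branch can only be reached when smallest_size != -1, which is the case the first test excludes.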
That's an interesting point... It looks like this function also
changed in 4.6 from using a single local_bh_disable() at the beginning
with several spin_lock(&head->lock) to exclusively
spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh
disable variant was preventing the timers in your stack trace from
running interleaved with this function before?
Could be, although dropping the lock shouldn't be able to affect the
search state. TBH, I'm a little lost in reading the function, the
SO_REUSEPORT handling is pretty complicated. For instance,
rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
function and also in every call to inet_csk_bind_conflict. I wonder if
we can simplify this under the assumption that SO_REUSEPORT is only
allowed if the port number (snum) is explicitly specified.
Ok first I have data for you Hannes, here's the time distributions
before, during and after the lockup (with all the debugging in place
the box eventually recovers). I've attached it as a text file since it
is long.
 Thanks a lot!
Second is I was thinking about why we would spend so much time doing
the ->owners list, and obviously it's because of the massive amount of
timewait sockets on the owners list. I wrote the following dumb patch
and tested it and the problem has disappeared completely. Now I don't
know if this is right at all, but I thought it was weird we weren't
copying the soreuseport option from the original socket onto the twsk.
Is there a reason we aren't doing this currently? Does this help
explain what is happening?  Thanks,
The patch is interesting and a good clue, but I am immediately a bit
concerned that we don't copy/tag the socket with the uid also to keep
the security properties for SO_REUSEPORT. I have to think a bit more
about this.
We have seen hangs during connect. I am afraid this patch wouldn't help
there while also guaranteeing uniqueness.
Yeah so I looked at the code some more and actually my patch is really
bad. If sk2->sk_reuseport is set we'll look at sk2->sk_reuseport_cb,
which is outside of the timewait sock, so that's definitely bad.

But we should at least be setting it to 0 so that we don't do this
normally. Unfortunately simply setting it to 0 doesn't fix the problem.
So for some reason having ->sk_reuseport set to 1 on a timewait socket
makes this problem non-existent, which is strange.
So back to the drawing board I guess. I wonder if doing what Craig
suggested and batching the timewait timer expires so it hurts less
would accomplish the same results.  Thanks,
Wait no I lied, we access the sk->sk_reuseport_cb, not sk2's. This is
the code

                        if ((!reuse || !sk2->sk_reuse ||
                            sk2->sk_state == TCP_LISTEN) &&
                            (!reuseport || !sk2->sk_reuseport ||
                             rcu_access_pointer(sk->sk_reuseport_cb) ||
                             (sk2->sk_state != TCP_TIME_WAIT &&
                             !uid_eq(uid, sock_i_uid(sk2))))) {
                                if (!sk2->sk_rcv_saddr || !sk->sk_rcv_saddr ||
                                    sk2->sk_rcv_saddr == sk->sk_rcv_saddr)
                                        break;
                        }
so in my patch's case we now have reuseport == 1, sk2->sk_reuseport ==
1. But now we are using reuseport, so sk->sk_reuseport_cb should be
non-NULL right? So really setting the timewait sock's sk_reuseport
should have no bearing on how this loop plays out right?  Thanks,
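
Taking the premise in the message at face value (that sk->sk_reuseport_cb is non-NULL for the binding socket), the outer test of the quoted loop can be modelled with booleans to see whether sk2->sk_reuseport can change its outcome. This is a user-space sketch, not kernel code. outer_test, count_sensitive_inputs and the parameter names are illustrative. The model also treats every input as independent, even though sk2 cannot be in TCP_LISTEN and TCP_TIME_WAIT at once.

```c
#include <assert.h>
#include <stdbool.h>

/* Boolean model of the outer test in the quoted bind_conflict loop. */
bool outer_test(bool reuse, bool sk2_reuse, bool sk2_listen,
		bool reuseport, bool sk2_reuseport, bool sk_has_cb,
		bool sk2_timewait, bool uid_match)
{
	return (!reuse || !sk2_reuse || sk2_listen) &&
	       (!reuseport || !sk2_reuseport || sk_has_cb ||
		(!sk2_timewait && !uid_match));
}

/*
 * For a fixed sk_has_cb, count the input combinations where flipping
 * sk2_reuseport changes the result of outer_test().
 */
int count_sensitive_inputs(bool sk_has_cb)
{
	int n = 0;

	for (int bits = 0; bits < 64; bits++) {
		bool reuse = (bits >> 0) & 1;
		bool sk2_reuse = (bits >> 1) & 1;
		bool sk2_listen = (bits >> 2) & 1;
		bool reuseport = (bits >> 3) & 1;
		bool sk2_timewait = (bits >> 4) & 1;
		bool uid_match = (bits >> 5) & 1;
		bool off = outer_test(reuse, sk2_reuse, sk2_listen, reuseport,
				      false, sk_has_cb, sk2_timewait,
				      uid_match);
		bool on = outer_test(reuse, sk2_reuse, sk2_listen, reuseport,
				     true, sk_has_cb, sk2_timewait,
				     uid_match);

		if (off != on)
			n++;
	}
	return n;
}

/* Self-check: with the callback pointer set, sk2_reuseport is irrelevant. */
void check_premise(void)
{
	assert(count_sensitive_inputs(true) == 0);
}
```

If the premise does not hold and the pointer is NULL, count_sensitive_inputs(false) is non-zero in this model, so the timewait socket's flag would then matter.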
 So more messing around and I noticed that we basically don't do the
tb->fastreuseport logic at all if we've ended up with a non
SO_REUSEPORT socket on that tb. So before I fully understood what I was
doing I fixed it so that after we go through ->bind_conflict() once
with a SO_REUSEPORT socket, we reset tb->fastreuseport to 1 and set the
uid to match the uid of the socket. This made the problem go away. Tom
pointed out that if we bind to the same port on a different address and
we have a non SO_REUSEPORT socket with the same address on this tb then
we'd be screwed with my code.
Which brings me to his proposed solution. We need another hash table
that is indexed based on the binding address. Then each node
corresponds to one address/port binding, with non-SO_REUSEPORT entries
having only one entry, and normal SO_REUSEPORT entries having many.
This cleans up the need to search all the possible sockets on any given
tb, we just go and look at the one we care about.  Does this make
sense?  Thanks,
Hi Josef,

Thinking about it some more the hash table won't work because of the
rules of binding different addresses to the same port. What I think we
can do is to change inet_bind_bucket to be a structure that contains
all the information used to detect conflicts (reuse*, if, address, uid,
etc.) and a list of sockets that share that exact same information--
for instance all sockets in timewait state created through some
listener socket should wind up on a single bucket. When we do the
bind_conflict function we should only have to walk these buckets, not
the full list of sockets.

Thoughts on this?


This sounds good, maybe tb->owners could be a list of, say,

struct inet_unique_shit {
        struct sock_common sk;          /* copy of the conflict-relevant fields */
        struct hlist_head socks;        /* real sockets sharing those fields */
};

Then we make inet_unique_shit like twsks', just copy the relevant
information, then hang the real sockets off of the socks hlist.
Something like that? Thanks,


Josef
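
The grouping idea discussed above can be sketched in user-space C as follows. Everything here is hypothetical: struct bind_key, struct bind_group, struct bind_bucket and the three functions are illustrative names, not kernel types. The key is reduced to address, uid and reuseport, and the conflict rule is a deliberate simplification of what bind_conflict checks. The only point is that sockets with identical conflict-relevant fields collapse into one group, so the conflict walk visits groups rather than individual sockets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified set of fields a conflict check looks at. */
struct bind_key {
	uint32_t rcv_saddr;	/* 0 stands for the wildcard address */
	uint32_t uid;
	bool reuseport;
};

/* One entry per distinct key; sockets sharing the key are only counted. */
struct bind_group {
	struct bind_key key;
	size_t nsocks;
	struct bind_group *next;
};

struct bind_bucket {
	struct bind_group *groups;
	size_t ngroups;
};

static bool key_equal(const struct bind_key *a, const struct bind_key *b)
{
	return a->rcv_saddr == b->rcv_saddr && a->uid == b->uid &&
	       a->reuseport == b->reuseport;
}

/* Join the group with the same key or start a new one; false on ENOMEM. */
bool bucket_add(struct bind_bucket *tb, struct bind_key key)
{
	struct bind_group *g;

	assert(tb);
	for (g = tb->groups; g; g = g->next) {
		if (key_equal(&g->key, &key)) {
			g->nsocks++;
			return true;
		}
	}
	g = malloc(sizeof(*g));
	if (!g)
		return false;
	g->key = key;
	g->nsocks = 1;
	g->next = tb->groups;
	tb->groups = g;
	tb->ngroups++;
	return true;
}

/*
 * Simplified rule (an assumption): overlapping addresses conflict unless
 * both sides set reuseport with the same uid. Walks groups, not sockets,
 * and reports how many groups were visited.
 */
bool bucket_conflict(const struct bind_bucket *tb, struct bind_key key,
		     size_t *visited)
{
	const struct bind_group *g;

	assert(tb && visited);
	*visited = 0;
	for (g = tb->groups; g; g = g->next) {
		bool addr_overlap = !g->key.rcv_saddr || !key.rcv_saddr ||
				    g->key.rcv_saddr == key.rcv_saddr;
		bool may_share = g->key.reuseport && key.reuseport &&
				 g->key.uid == key.uid;

		(*visited)++;
		if (addr_overlap && !may_share)
			return true;
	}
	return false;
}

/* Release all groups and reset the bucket to empty. */
void bucket_free(struct bind_bucket *tb)
{
	assert(tb);
	while (tb->groups) {
		struct bind_group *g = tb->groups;

		tb->groups = g->next;
		free(g);
	}
	tb->ngroups = 0;
}
```

With this layout, adding many sockets that share one key (for example timewait sockets created through the same listener) leaves a single group, so a later conflict check does one comparison instead of one per socket.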
