OK, long mail, but if you have *any* familiarity with the network stack I really want your opinion! Please try skimming through it at least and point out anything you think might be a problem. You don't need to do much research; a simple "I think there might be a problem in the interaction with $FOO, please recheck" will do.
If you haven't been paying attention, see here too http://wiki.dragonflybsd.org/index.cgi/NetMP Short story: sockets are accessed both from the protocol threads and processes running in the kernel. Currently, only the mp lock protects the socket. This needs to change. I've been going through the struct socket fields, and so_state stands out as a potential source of trouble. Let's focus on the send/receive paths at first: flags used in tcp_input: SS_CANTRCVMORE, SS_NOFDREF, SS_RCVATMARK soreceive: SS_ISCONFIRMING, SS_CANTRCVMORE, SS_ISCONNECTED|SS_ISCONNECTING, SS_RCVATMARK tcp_output: none sosend: SS_CANTSENDMORE, SS_ISCONNECTED, SS_ISCONFIRMING, udp_input: none udp_output: none SS_ISCONFIRMING is never set and will be removed. Taking a closer look, soreceive() tests SS_ISCONNECTED|SS_ISCONNECTING in if ((so->so_state & (SS_ISCONNECTED|SS_ISCONNECTING)) == 0 && (pr->pr_flags & PR_CONNREQUIRED)) { error = ENOTCONN; goto release; } Now, for TCP, if my understanding is correct, if either SS_ISCONNECTED or SS_ISCONNECTING are 1, they will stay 1 while soreceive is running. This is because the connection cannot be disconnected until we do a close() or at least a shutdown() and we assume that only one thread/process can be messing with the socket at a time. OTOH, if SS_ISCONNECTED and SS_ISCONNECTING are both 0, they cannot go to 1 unless we issue a connect(), so again we're safe. For UDP, PR_CONNREQUIRED is not set; other protocols will be under the BGL. sosend() uses SS_ISCONNECTED like this: if ((so->so_state & SS_ISCONNECTED) == 0) { /* * `sendto' and `sendmsg' is allowed on a connection- * based socket if it supports implied connect. * Return ENOTCONN if not connected and no address is * supplied. */ [S1] if ((so->so_proto->pr_flags & PR_CONNREQUIRED) && (so->so_proto->pr_flags & PR_IMPLOPCL) == 0) { if ((so->so_state & SS_ISCONFIRMING) == 0 && !(resid == 0 && clen != 0)) gotoerr(ENOTCONN); } else if (addr == 0) [S2] gotoerr(so->so_proto->pr_flags & PR_CONNREQUIRED ? ENOTCONN : EDESTADDRREQ); } Happily, PR_IMPLOPCL is set for TCP and PR_CONNREQUIRED is clear for UDP, so the "if" in [S1] is always false for the cases I care about. In [S2], if the socket is not connected then either a) it's not in SS_ISCONNECTING: unless we issue a connect() (which as explained above can't happen while sosend() is running), so_state can't go to SS_ISCONNECTED so we're OK. b) it is in SS_ISCONNECTING, so we're talking about TCP: the socket state CAN change to SS_ISCONNECTED after our check. But, since we're the only ones accessing the socket and we are not in a connect() system call (since we're in sosend()), this means a connect() completed asynchronously. In that case, we can just pretend that userspace tried to send while we were still trying to connect; userspace has no way of knowing that is not the case. BTW, you'd think the connection state would only be modified by the protocol thread, but kern_connect() takes it upon itself to clear SS_ISCONNECTING if there was an error. I'm inclined to change so->so_state &= ~SS_ISCONNECTING to KKASSERT(!(so->so_state & SS_ISCONNECTING)) and fix any protocol connect functions that hit the assertion to clean up after themselves... Am I missing anything in the above analysis? It all seems a bit too neat; if that's how things are, we needn't bother about the connection state changing from under us... I'm worried I'm "seeing" what I'd like to be true here. Turning to SS_CANTRCVMORE, tcp_input() references it in two ways: if (so->so_state & SS_CANTRCVMORE) { m_freem(m); } else { m_adj(m, drop_hdrlen); /* delayed header drop */ ssb_appendstream(&so->so_rcv, m); } and if (so->so_state & SS_CANTRCVMORE) { soisdisconnected(so); callout_reset(tp->tt_2msl, tcp_maxidle, tcp_timer_2msl, tp); } this appears hopeless; one can maybe try to live with the race in the first case (doing extra cleanup) but in the second case that would be very ugly. The obvious "solution" is to require the protocol thread to obtain a socket lock before messing with this flag (this requires "breaking" so_state in two different fields). Same goes for SS_CANTSENDMORE. This isn't a "common" case, but it has the disadvantage that it can force the protocol thread to spin for a while... Great Ideas (tm) welcome. Oh and if that's acceptable, requiring locking for so_oobmark and SS_RCVATMARK is a no-brainer. I don't think it's worth my time to even investigate lockless oob data delivery... AFAICT SS_ASYNC needs no changes. It's only set/cleared by soo_ioctl and tested in sowakeup(). It should be okay if some thread changes it asynchronously. WRT SS_NOFDREF: - it is set on error by socreate(), but at this point the socket is not reachable yet - it is set by soclose() and I think that if the socket is reachable at this point it's a bug anyway (someone may try to access it after it's been free()ed). [edit: well, apparently that is not the case: soo_close() and some network filesystems remove the reference to the socket after calling soclose(). That should be easy to fix, right?] - it is cleared by soaccept(); soaccept() is called by the old netgraph (do we care?) and kern_accept(). The socket passed to soaccept() is definitely not reachable from any other userspace thread (no fd pointing to it yet). From what I can tell the proto thread shouldn't be messing with SS_NOFDREF for a socket that is just being accepted (XXX: not 100% sure though). - it is set by sonewconn(); the socket was just created and is not yet visible - it is cleared by sys_sctp_peeloff(): don't care, don't want to know - it is cleared in tcp_attach(). Two paths end up calling tcp_attach() (through ->pru_attach()): one starts from sonewconn() and the other from socreate(). In both cases this is a new socket. So, unless I'm missing something, SS_NOFDREF is only modified before the socket becomes relevant or after the socket is no longer reachable. Moving to so_incomp: it's modified by sofree() (only called by protocol thread), sonewconn (same) and soisconnected(). The latter is called from process context (fifo_open, portal_connect at least, maybe sctp too) and from the bluetooth netisr. ->so_comp is modified by soclose (called from process context), soisconnected, sonewconn (see above), soaccept_predicate, sctp_get_peeloff (of course!), ng_ksocket_finish_accept(). Assuming we can move all list modification in protocol thread context, the list traversal would still be tricky. Maybe a spinlock protecting the lists and queue lengths is in order for now? The same lock could be reused for protecting SS_INCOMP, SS_COMP. In the future we might try something more clever if this becomes a performance issue. Opinions? The other so_* fields seem to be easier; I'll send a separate mail for those. TIA, Aggelos
