> Date: Fri, 10 Aug 2018 19:48:40 +0200
> From: Edgar Fuß <e...@math.uni-bonn.de>
>
> > Yes -- isn't that the symptom you're seeing, or did I miss something?
> It's the mutex_oncpu in the while condition that crashes, not the one in the
> if condition above the do.

Are you sure it _only_ happens in the do/while and _never_ in the
preceding if?  I don't see any reason to distinguish ordering on other
CPUs between the two calls to mutex_oncpu.

It is possible that the spinlock backoff in the do/while opens a
window for a race condition wide enough that you only see it in the
do/while condition.

> > It doesn't really matter since (a) only one thread ever sets the
> > variable, (b) there are no invariants around it, and (c) you never
> > dereference it.  So, as soon as unp_gc decides it will use a
> > particular socket, it should just store the pointer to that socket in
> > some global unp_gc_current_socket, and when it's done (before closef),
> > it should set unp_gc_current_socket to null; then in soput/sofree,
> > just KASSERT(so != unp_gc_current_socket).
> But couldn't the thread that KASSERTs read a stale copy that unp_gc()
> nulled out but the null value didn't make it to the right CPU/cache/whatever?

Conceivably, yes.  Then you would have a false positive for your test.
I would guess that (a) that won't happen a lot, and (b) it'll be clear
on scrutiny that it's a false positive.

But I also don't think false positives are very likely, because in
correct code, soclose won't be called with a socket associated with a
file that has a positive reference count, which is all managed under
fp->f_lock.

In any case, it's just a diagnostic, not a protocol for a robust
software system to rely on.  If it doesn't work, we can try another
one.
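
For concreteness, here is a rough userland sketch of the diagnostic
described above.  It is not a patch against uipc_usrreq.c: struct
socket is a stub, the helper names (unp_gc_mark_current,
unp_gc_clear_current, so_free_is_safe) are made up for illustration,
and the volatile qualifier is my own assumption to force a fresh load
on each check.  It gives no ordering guarantees between CPUs, which is
acceptable only because this is a diagnostic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's struct socket; only its address matters. */
struct socket {
	int so_dummy;
};

/*
 * Diagnostic only: the socket unp_gc is currently working on, or NULL.
 * Written by the gc thread alone and never dereferenced.
 */
static struct socket *volatile unp_gc_current_socket = NULL;

/* Hypothetical helper: unp_gc calls this once it has picked a socket. */
void
unp_gc_mark_current(struct socket *so)
{
	unp_gc_current_socket = so;
}

/* Hypothetical helper: unp_gc calls this when done, before closef. */
void
unp_gc_clear_current(void)
{
	unp_gc_current_socket = NULL;
}

/*
 * The condition that would go into soput/sofree as
 *	KASSERT(so != unp_gc_current_socket);
 * Returned as a bool here so it can be exercised outside the kernel.
 */
bool
so_free_is_safe(const struct socket *so)
{
	return so != unp_gc_current_socket;
}
```

In the kernel, the two helpers would just be plain assignments inside
unp_gc, and the check would be the KASSERT itself.  If the assertion
fires, someone tried to free the very socket that unp_gc still
believes it holds, subject to the stale-read caveat discussed above.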