On Mon, 23 Sep 2019 at 10:59, Dmitry Vyukov <dvyu...@google.com> wrote:
>
> On Mon, Sep 23, 2019 at 10:54 AM Boqun Feng <boqun.f...@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 10:21:38AM +0200, Dmitry Vyukov wrote:
> > > On Mon, Sep 23, 2019 at 6:31 AM Boqun Feng <boqun.f...@gmail.com> wrote:
> > > >
> > > > On Fri, Sep 20, 2019 at 04:54:21PM +0100, Will Deacon wrote:
> > > > > Hi Marco,
> > > > >
> > > > > On Fri, Sep 20, 2019 at 04:18:57PM +0200, Marco Elver wrote:
> > > > > > We would like to share a new data-race detector for the Linux kernel:
> > > > > > Kernel Concurrency Sanitizer (KCSAN) --
> > > > > > https://github.com/google/ktsan/wiki/KCSAN (Details:
> > > > > > https://github.com/google/ktsan/blob/kcsan/Documentation/dev-tools/kcsan.rst)
> > > > > >
> > > > > > To those of you who we mentioned at LPC that we're working on a
> > > > > > watchpoint-based KTSAN inspired by DataCollider [1], this is it (we
> > > > > > renamed it to KCSAN to avoid confusion with KTSAN).
> > > > > > [1] http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf
> > > > >
> > > > > Oh, spiffy!
> > > > >
> > > > > > In the coming weeks we're planning to:
> > > > > > * Set up a syzkaller instance.
> > > > > > * Share the dashboard so that you can see the races that are found.
> > > > > > * Attempt to send fixes for some races upstream (if you find that the
> > > > > >   kcsan-with-fixes branch contains an important fix, please feel free to
> > > > > >   point it out and we'll prioritize that).
> > > > >
> > > > > Curious: do you take into account things like alignment and/or access size
> > > > > when looking at READ_ONCE/WRITE_ONCE? Perhaps you could initially prune
> > > > > naturally aligned accesses for which __native_word() is true?
> > > > >
> > > > > > There are a few open questions:
> > > > > > * The big one: most of the reported races are due to unmarked
> > > > > >   accesses; prioritization or pruning of races to focus initial efforts
> > > > > >   to fix races might be required. Comments on how best to proceed are
> > > > > >   welcome. We're aware that these are issues that have recently received
> > > > > >   attention in the context of the LKMM
> > > > > >   (https://lwn.net/Articles/793253/).
> > > > >
> > > > > This one is tricky. What I think we need to avoid is an onslaught of
> > > > > patches adding READ_ONCE/WRITE_ONCE without a concrete analysis of the
> > > > > code being modified. My worry is that Joe Developer is eager to get their
> > > > > first patch into the kernel, so runs this tool and starts spamming
> > > > > maintainers with these things to the point that they start ignoring KCSAN
> > > > > reports altogether because of the time they take up.
> > > > >
> > > > > I suppose one thing we could do is to require each new READ_ONCE/WRITE_ONCE
> > > > > to have a comment describing the racy access, a bit like we do for memory
> > > > > barriers. Another possibility would be to use atomic_t more widely if
> > > > > there is genuine concurrency involved.
> > > > >
> > > >
> > > > Instead of commenting READ_ONCE/WRITE_ONCE()s, how about adding
> > > > annotations for data fields/variables that might be accessed without
> > > > holding a lock? Because if all accesses to a variable are protected by
> > > > proper locks, we mostly don't need to worry about data races caused by
> > > > not using READ_ONCE/WRITE_ONCE(). Bad things happen when we write to a
> > > > variable using locks but read it outside a lock critical section for
> > > > better performance, for example, rcu_node::qsmask.
> > > > So I'm thinking maybe we can introduce a new annotation similar to
> > > > __rcu, maybe call it __lockfree ;-) as follows:
> > > >
> > > >     struct rcu_node {
> > > >             ...
> > > >             unsigned long __lockfree qsmask;
> > > >             ...
> > > >     }
> > > >
> > > > , and __lockfree indicates that by design the maintainer of this data
> > > > structure or variable believes there will be accesses outside lock
> > > > critical sections. Note that not all accesses to a __lockfree field need
> > > > to be READ_ONCE/WRITE_ONCE(); if the developer manages to build a
> > > > complex but working wake/wait state machine so that it could not be
> > > > accessed at the same time, READ_ONCE()/WRITE_ONCE() is not needed.
> > > >
> > > > If we have such an annotation, I think it won't be hard to configure
> > > > KCSAN to only examine accesses to variables with this annotation. Also
> > > > this annotation could help other checkers in the future.
> > > >
> > > > If KCSAN (at least the upstream version) only checks accesses with
> > > > such an annotation, "spamming with KCSAN warnings/fixes" will be the
> > > > choice of each maintainer ;-)
> > > >
> > > > Thoughts?
> > >
> > > But doesn't this defeat the main goal of any race detector -- finding
> > > concurrent accesses to complex data structures, e.g. forgotten
> > > spinlock around rbtree manipulation? Since rbtree is not meant for
> > > concurrent accesses, it won't have __lockfree annotation, and thus we
> > > will ignore races on it...
> >
> > Maybe, but for forgotten locks detection, we already have lockdep and
> > also sparse can help a little.
>
> They don't do this at all, or to the necessary degree.
>
> > Having a __lockfree annotation could be
> > beneficial for KCSAN to focus on checking the accesses whose race
> > conditions could only be detected by KCSAN at this time. I think this
> > could help KCSAN find problems more easily (and fast).

Just to confirm, the annotation is supposed to mean "this variable
should not be accessed concurrently". '__lockfree' may be confusing,
as "lock-free" has a very specific meaning ("lock-free algorithm"),
and I initially thought the annotation means the opposite. Maybe more
intuitive would be '__nonatomic'.

My view, however, is that this will not scale. 1) Our goal is to
*avoid* more annotations if possible. 2) Furthermore, any such
annotation assumes the developer already has an understanding of all
concurrently accessed variables; however, this may not be the case for
the next person touching the code, resulting in an error. By
"whitelisting" variables, we would likely miss almost every serious
bug.

To enable/disable KCSAN for entire subsystems, it's already possible
to use 'KCSAN_SANITIZE := n' in the Makefile, or
'KCSAN_SANITIZE_file.o := n' for individual files.

> > Out of curiosity, does KCSAN ever find a problem with forgotten locks
> > involved? I didn't see any in the -with-fixes branch (that's
> > understandable, given the seriousness, the fixes of this kind of
> > problems could already be submitted to upstream once KCSAN found it.)

The sheer volume of 'benign' data-races makes it difficult to filter
through and get to these, but it certainly detects such issues.

Thanks,
-- Marco

> This one comes to mind:
> https://www.spinics.net/lists/linux-mm/msg92677.html
>
> Maybe some others here, but I don't remember which ones now:
> https://github.com/google/ktsan/wiki/KTSAN-Found-Bugs
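
For reference, the two Makefile opt-outs mentioned in the reply above
could look roughly like the fragment below. This is a minimal sketch
only: the Makefile is a hypothetical subsystem Makefile, and 'file.o'
is the placeholder object name used in the message, not a real kernel
object. The two settings are alternatives and are shown together only
for comparison.

```make
# Hypothetical subsystem Makefile; 'file.o' is a placeholder name.

# Alternative 1: disable KCSAN instrumentation for every object
# built by this Makefile.
KCSAN_SANITIZE := n

# Alternative 2: disable KCSAN instrumentation for one object only.
KCSAN_SANITIZE_file.o := n
```

Either setting turns the checker off at file or directory granularity,
which is coarser than the per-variable annotation discussed in the
thread.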