On Fri, Oct 13, 2017 at 2:35 AM, Ben Maurer <bmau...@fb.com> wrote: > > I'm really excited to hear that you're open to this patch set and totally > understand the desire for some more numbers.
So the patch-set actually looks very reasonable today. I looked through it (ok, I wasn't cc'd on the ppc-only patches so I didn't look at those, but I don't think they are likely objectionable either), and everything looked fine from a patch standpoint. But it's not _just_ numbers for real loads I'm looking for, it's actually an _existence proof_ for a real load too. I'd like to know that the suggested interface _really_ works in practice too for all the expected users. In particular, it's easy to make test-cases to show basic functionality, but that does not necessarily show that the interface then works in "real life". For example, if this is supposed to work for a malloc library, it's important that people show that yes, this can really work in a *LIBRARY*. That sounds so obvious and stupid that you might go "What do you mean?", but for things to work for libraries, they have to work together with *other* users, and with *independent* users. For example, say that you're some runtime that wants to use the percpu thing for percpu counters - because you want to avoid cache ping-pong, and you want to avoid per-thread allocation overhead (or per-thread scaling for just summing up the counters) when you have potentially tens of thousands of threads. Now, how does this runtime work *together* with - CPU hotplug adding new cpu's while you are running (and after you allocated your percpu areas) - libraries and system admins that limit - or extend - you to a certain set of CPUs - another library (like the malloc library) that wants to use the same interface for its percpu allocation queues. maybe all of this "just works", but I really want to see an existence proof. Not just a "dedicated use of the interface for one benchmark". So yes, I want to see numbers, but I really want to see something much more fundamental. I want to feel like there is a good reason to believe that the interface really is sufficient and that it really does work, even when a single thread may have multiple *different* uses for this. Statistics, memory allocation queues, RCU, per-cpu locking, yadda yadda. All these things may want to use this, but they want to use it *together*, and without you having to write special code where every user needs to know about every other user statically. Can you load two different *dynamic* libraries that each independently uses this thing for their own use, without having to be built together for each other? >> A "increment percpu value" simply isn't relevant. > > While I understand it seems trivial, my experience has been that this type of > operation can actually be important in many server workloads. Oh, I'm not saying that it's not relevant to have high-performance statistics gathering using percpu data structures. Of _course_ that is important, we do that very much in the kernel itself. But a benchmark that does nothing else really isn't relevant. If the *only* thing somebody uses this for is statistics, it's simply not good enough. >> Because without real-world uses, it's not obvious that there won't be >> somebody who goes "oh, this isn't quite enough for us, the semantics >> are subtly incompatible with our real-world use case". > > Is your concern mainly this question (is this patchset a good way to > bring per-cpu algorithms to userspace)? I'm hoping that given the > variety of ways that per-cpu data structures are used in the kernel > the concerns around this patch set are mainly around what approach we > should take rather than if per-cpu algorithms are a good idea at all. > If this is your main concern perhaps our focus should be around > demonstrating that a number of useful per-cpu algorithms can be > implemented using restartable sequences. The important thing for me is that it should demonstrate that you can have users co-exists, and that the interface is sufficient for that. So I do want to see "just numbers" in the sense that I would want to see that people have actually written code that takes advantage of the percpu nature to do real things (like an allocator). But more than that, I want to see *use*. > Ultimately I'm worried there's a chicken and egg problem here. This patch-set has been around for *years* in some form. It's improved over the years, but the basic approaches are not new. Honestly, if people still don't have any actual user-level code that really _uses_ this, I'm not interested in merging it. There's no chicken-and-egg here. Anybody who wants to push this patch-set needs to write the user level code to validate that the patch-set makes sense. That's not chicken-and-egg, that's just "without the user-space code, the kernel code has never been tested, validated or used". And if nobody can be bothered to write the user-level code and test this patch-series, then obviously it's not important enough for the kernel to merge it. Linus