On Fri, Oct 13, 2017 at 2:35 AM, Ben Maurer wrote:
>
> I'm really excited to hear that you're open to this patch set and totally
> understand the desire for some more numbers.
So the patch-set actually looks very reasonable today. I looked
through it (ok, I wasn't cc'd on the ppc-only patches so I didn't look
at those, but I don't think they are likely objectionable either), and
everything looked fine from a patch standpoint.
But it's not _just_ numbers for real loads I'm looking for, it's
actually an _existence proof_ for a real load too. I'd like to know
that the suggested interface _really_ works in practice too for all
the expected users.
In particular, it's easy to make test-cases to show basic
functionality, but that does not necessarily show that the interface
then works in "real life".
For example, if this is supposed to work for a malloc library, it's
important that people show that yes, this can really work in a
*LIBRARY*.
That sounds so obvious and stupid that you might go "What do you
mean?", but for things to work for libraries, they have to work
together with *other* users, and with *independent* users.
For example, say that you're some runtime that wants to use the percpu
thing for percpu counters - because you want to avoid cache ping-pong,
and you want to avoid per-thread allocation overhead (or per-thread
scaling for just summing up the counters) when you have potentially
tens of thousands of threads.
Now, how does this runtime work *together* with
 - CPU hotplug adding new CPUs while you are running (and after you
   allocated your percpu areas)
 - libraries and system admins that limit - or extend - you to a
   certain set of CPUs
 - another library (like the malloc library) that wants to use the
   same interface for its percpu allocation queues.
Maybe all of this "just works", but I really want to see an existence
proof. Not just a "dedicated use of the interface for one benchmark".
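To make the questions above concrete, here is a minimal sketch of such a percpu counter in user space. It is an illustration only and does NOT use the interface from the patch-set: it relies on sched_getcpu() plus an atomic add. All names (percpu_counter_init and friends) are hypothetical, and the 64-byte cache line is an assumption. The point is where the awkward cases show up: the array is sized once at init, so a CPU number that appears later has to fall back to a shared slot.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

/* One slot per cache line to avoid ping-pong between CPUs.
 * 64 bytes is an assumption, not the line size everywhere. */
struct slot {
    _Alignas(64) atomic_long value;
};

struct percpu_counter {
    long nslots;        /* fixed at init time */
    struct slot *slots;
};

static int percpu_counter_init(struct percpu_counter *c)
{
    /* Sized by configured CPUs. CPU numbers seen later can still be
     * larger than this (hotplug, sparse ids), hence the fallback in
     * percpu_counter_add(). */
    long n = sysconf(_SC_NPROCESSORS_CONF);
    if (n < 1)
        n = 1;
    c->slots = aligned_alloc(64, (size_t)n * sizeof(struct slot));
    if (!c->slots)
        return -1;
    for (long i = 0; i < n; i++)
        atomic_init(&c->slots[i].value, 0);
    c->nslots = n;
    return 0;
}

static void percpu_counter_add(struct percpu_counter *c, long delta)
{
    assert(c->nslots > 0);
    int cpu = sched_getcpu();
    /* Unknown or out-of-range CPU: share slot 0 rather than crash. */
    if (cpu < 0 || cpu >= c->nslots)
        cpu = 0;
    /* Atomic because the thread can migrate between sched_getcpu()
     * and the add, so another thread may hit the same slot. */
    atomic_fetch_add_explicit(&c->slots[cpu].value, delta,
                              memory_order_relaxed);
}

static long percpu_counter_sum(struct percpu_counter *c)
{
    long sum = 0;
    /* Summing scales with the number of slots, not the number of threads. */
    for (long i = 0; i < c->nslots; i++)
        sum += atomic_load_explicit(&c->slots[i].value,
                                    memory_order_relaxed);
    return sum;
}
```

Note that nothing here answers the third question: a second library doing the same thing would simply keep its own array, which works for this atomic fallback but says nothing about whether two independent users can share a kernel-assisted interface.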
So yes, I want to see numbers, but I really want to see something much
more fundamental. I want to feel like there is a good reason to
believe that the interface really is sufficient and that it really
does work, even when a single thread may have multiple *different*
uses for this. Statistics, memory allocation queues, RCU, per-cpu
locking, yadda yadda. All these things may want to use this, but they
want to use it *together*, and without you having to write special
code where every user needs to know about every other user statically.
Can you load two different *dynamic* libraries that each independently
uses this thing for their own use, without having to be built together
for each other?
>> An "increment percpu value" simply isn't relevant.
>
> While I understand it seems trivial, my experience has been that this type of
> operation can actually be important in many server workloads.
Oh, I'm not saying that it's not relevant to have high-performance
statistics gathering using percpu data structures. Of _course_ that is
important, we do that very much in the kernel itself.
But a benchmark that does nothing else really isn't relevant. If the
*only* thing somebody uses this for is statistics, it's simply not
good enough.
>> Because without real-world uses, it's not obvious that there won't be
>> somebody who goes "oh, this isn't quite enough for us, the semantics
>> are subtly incompatible with our real-world use case".
>
> Is your concern mainly this question (is this patchset a good way to
> bring per-cpu algorithms to userspace)? I'm hoping that given the
> variety of ways that per-cpu data structures are used in the kernel
> the concerns around this patch set are mainly around what approach we
> should take rather than if per-cpu algorithms are a good idea at all.
> If this is your main concern perhaps our focus should be around
> demonstrating that a number of useful per-cpu algorithms can be
> implemented using restartable sequences.
The important thing for me is that it should demonstrate that you can
have users co-exist, and that the interface is sufficient for that.
So I do want to see "just numbers" in the sense that I would want to
see that people have actually written code that takes advantage of the
percpu nature to do real things (like an allocator). But more than
that, I want to see *use*.
> Ultimately I'm worried there's a chicken and egg problem here.
This patch-set has been around for *years* in some form. It's improved
over the years, but the basic approaches are not new.
Honestly, if people still don't have any actual user-level code that
really _uses_ this, I'm not interested in merging it.
There's no chicken-and-egg here. Anybody who wants to push this
patch-set needs to write the user level code to validate that the
patch-set makes sense. That's not chicken-and-egg, that's just
"without the user-space code, the kernel code has never been tested,
validated or used".
And if nobody can be bothered to write the user-level code and test
this patch-series, then obviously it's not important enough for the
kernel to merge it.
Linus