On Fri, Oct 13, 2017 at 2:35 AM, Ben Maurer <bmau...@fb.com> wrote:
>
> I'm really excited to hear that you're open to this patch set and totally 
> understand the desire for some more numbers.

So the patch-set actually looks very reasonable today. I looked
through it (ok, I wasn't cc'd on the ppc-only patches so I didn't look
at those, but I don't think they are likely objectionable either), and
everything looked fine from a patch standpoint.

But it's not _just_ numbers for real loads I'm looking for, it's
actually an _existence proof_ for a real load too. I'd like to know
that the suggested interface _really_ works in practice too for all
the expected users.

In particular, it's easy to make test-cases to show basic
functionality, but that does not necessarily show that the interface
then works in "real life".

For example, if this is supposed to work for a malloc library, it's
important that people show that yes, this can really work in a
*LIBRARY*.

That sounds so obvious and stupid that you might go "What do you
mean?", but for things to work for libraries, they have to work
together with *other* users, and with *independent* users.

For example, say that you're some runtime that wants to use the percpu
thing for percpu counters - because you want to avoid cache ping-pong,
and you want to avoid per-thread allocation overhead (or per-thread
scaling for just summing up the counters) when you have potentially
tens of thousands of threads.

Now, how does this runtime work *together* with

 - CPU hotplug adding new cpu's while you are running (and after you
allocated your percpu areas)

 - libraries and system admins that limit - or extend - you to a
certain set of CPUs

 - another library (like the malloc library) that wants to use the
same interface for its percpu allocation queues.

maybe all of this "just works", but I really want to see an existence
proof.  Not just a "dedicated use of the interface for one benchmark".

So yes, I want to see numbers, but I really want to see something much
more fundamental. I want to feel like there is a good reason to
believe that the interface really is sufficient and that it really
does work, even when a single thread may have multiple *different*
uses for this. Statistics, memory allocation queues, RCU, per-cpu
locking, yadda yadda. All these things may want to use this, but they
want to use it *together*, and without you having to write special
code where every user needs to know about every other user statically.

Can you load two different *dynamic* libraries that each independently
uses this thing for their own use, without having to be built together
for each other?

>> A "increment percpu value" simply isn't relevant.
>
> While I understand it seems trivial, my experience has been that this type of 
> operation can actually be important in many server workloads.

Oh, I'm not saying that it's not relevant to have high-performance
statistics gathering using percpu data structures. Of _course_ that is
important, we do that very much in the kernel itself.

But a benchmark that does nothing else really isn't relevant.  If the
*only* thing somebody uses this for is statistics, it's simply not
good enough.


>> Because without real-world uses, it's not obvious that there won't be
>> somebody who goes "oh, this isn't quite enough for us, the semantics
>> are subtly incompatible with our real-world use case".
>
> Is your concern mainly this question (is this patchset a good way to
> bring per-cpu algorithms to userspace)? I'm hoping that given the
> variety of ways that per-cpu data structures are used in the kernel
> the concerns around this patch set are mainly around what approach we
> should take rather than if per-cpu algorithms are a good idea at all.
> If this is your main concern perhaps our focus should be around
> demonstrating that a number of useful per-cpu algorithms can be
> implemented using restartable sequences.

The important thing for me is that it should demonstrate that you can
have users co-exists, and that the interface is sufficient for that.

So I do want to see "just numbers" in the sense that I would want to
see that people have actually written code that takes advantage of the
percpu nature to do real things (like an allocator). But more than
that, I want to see *use*.

> Ultimately I'm worried there's a chicken and egg problem here.

This patch-set has been around for *years* in some form. It's improved
over the years, but the basic approaches are not new.

Honestly, if people still don't have any actual user-level code that
really _uses_ this, I'm not interested in merging it.

There's no chicken-and-egg here. Anybody who wants to push this
patch-set needs to write the user level code to validate that the
patch-set makes sense. That's not chicken-and-egg, that's just
"without the user-space code, the kernel code has never been tested,
validated or used".

And if nobody can be bothered to write the user-level code and test
this patch-series, then obviously it's not important enough for the
kernel to merge it.

                      Linus

Reply via email to