Hey,

I'm really excited to hear that you're open to this patch set and totally 
understand the desire for some more numbers. I have a few thoughts and 
questions -- hopefully ones that will help me better understand where you'd 
like to see more data (potentially from myself and other Facebook folks).

> A "increment percpu value" simply isn't relevant.

While I understand it seems trivial, my experience has been that this type of 
operation can actually be important in many server workloads. In applications 
with thousands of threads, keeping a counter like this poses a real challenge. 
One can use a per-thread variable, but the size overhead adds up very quickly 
(8 bytes per counter per thread), and a single shared counter updated with 
atomic instructions quickly becomes a contention point. Server programs tend 
to have many statistical counters, and being able to implement them 
efficiently without size bloat is a real-world win.
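
To make that concrete, here's roughly what the fallback looks like today 
without rseq (a sketch only -- the names are mine, not from the patch set): 
shard the counter by CPU, but still pay for an atomic RMW on every increment, 
because the thread can be migrated between reading the CPU id and doing the 
update.

/* Sketch: per-cpu statistical counter without rseq. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CPUS 256                         /* illustrative bound */

struct stat_counter {
    atomic_uint_fast64_t slot[MAX_CPUS];     /* one slot per CPU, not per thread */
};

static void stat_counter_inc(struct stat_counter *c)
{
    int cpu = sched_getcpu();
    /* Still an atomic RMW: another thread may be migrated onto this CPU. */
    atomic_fetch_add_explicit(&c->slot[cpu], 1, memory_order_relaxed);
}

static uint64_t stat_counter_read(struct stat_counter *c)
{
    uint64_t sum = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        sum += atomic_load_explicit(&c->slot[i], memory_order_relaxed);
    return sum;
}

With rseq, the fetch-add on the fast path could become a plain load/add/store 
inside a restartable region, and the per-counter footprint stays bounded by 
the CPU count rather than the thread count.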

This type of per-cpu counter can also be used as a building block for other 
abstractions in common use -- e.g. an asymmetric reader-writer lock, or a 
reference-counted object that is changed infrequently. While thread-local 
storage can also be used in these cases, it carries a substantial size 
overhead and can require cooperation between the application and the library 
to manage thread lifecycle.
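
As an illustration of the reader-writer case (again a sketch with made-up 
names, with the writer side and the slow path elided): the reader fast path 
is little more than a per-cpu increment plus a check for a rare writer.

/* Sketch: reader fast path of an asymmetric rwlock over per-cpu counts. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CPUS 256

struct asym_rwlock {
    atomic_int_fast64_t readers[MAX_CPUS]; /* signed: unlock may hit another CPU */
    atomic_bool         writer_active;
};

static bool read_trylock_fast(struct asym_rwlock *l)
{
    atomic_fetch_add_explicit(&l->readers[sched_getcpu()], 1,
                              memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&l->writer_active, memory_order_relaxed)) {
        /* Writer present: back out, caller falls back to a normal lock. */
        atomic_fetch_sub_explicit(&l->readers[sched_getcpu()], 1,
                                  memory_order_relaxed);
        return false;
    }
    return true;
}

static void read_unlock_fast(struct asym_rwlock *l)
{
    atomic_fetch_sub_explicit(&l->readers[sched_getcpu()], 1,
                              memory_order_relaxed);
}

The (rare) writer sets writer_active, issues a fence, and waits for the sum of 
the readers[] slots to drain to zero. With rseq the reader-side atomics become 
plain per-cpu increments, which is where the win comes from.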

At least from what I've seen of our usage of these types of abstractions 
within Facebook, if rseq met these use cases and did absolutely nothing else, 
it would still be a feature that our applications would benefit from. 
Hopefully we can find evidence that it can do even more than this, but I 
think this "trivial" use case is actually addressing a real-world problem.

> When I asked last time, people pointed me to potential uses, including
> malloc libraries that could get per-thread performance with just
> per-cpu (not per thread) data structure overhead. I see that you once
> more point to the slides from 2013 that again talks about it.
>
> But people didn't post code, people didn't post numbers, and people
> didn't point to actual real uses, just "this could happen".

At Facebook we did some work to experiment with rseq and jemalloc. Qi and 
David (cc'd) may be able to provide more context on the current state.

> Because without real-world uses, it's not obvious that there won't be
> somebody who goes "oh, this isn't quite enough for us, the semantics
> are subtly incompatible with our real-world use case".

Is your concern mainly this question (i.e. whether this patch set is a good 
way to bring per-cpu algorithms to userspace)? I'm hoping that, given the 
variety of ways per-cpu data structures are used in the kernel, the concerns 
around this patch set are mainly about which approach we should take rather 
than whether per-cpu algorithms are a good idea at all. If that is your main 
concern, perhaps our focus should be on demonstrating that a number of useful 
per-cpu algorithms can be implemented on top of restartable sequences.

Ultimately I'm worried there's a chicken-and-egg problem here. It's hard to 
get people to commit to investing in rseq without a clear path to the patch 
set seeing the light of day, and it's easy to understand why you'd be 
reluctant to merge such a substantial and unfamiliar API without extensive 
data. If we're still not able to get compelling data, I'm wondering if there 
are other approaches that could get us unstuck, e.g.:

(1) Could we merge a subset of this patch set -- e.g. the faster getcpu() 
operation, which seems like a clear improvement over the current approach? If 
we keep the remaining patches small enough, it may be easier for sophisticated 
users to apply them out of tree, maintain them, and provide real-world 
operational experience with this abstraction.
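
For the getcpu() piece, what I have in mind is roughly the following (the 
struct and field names are just illustrative, not the ABI from the patches): 
the kernel keeps a per-thread cpu id up to date across migrations, so getcpu 
becomes a single load.

/* Sketch: a cached, kernel-maintained cpu id (registration elided). */
#define _GNU_SOURCE
#include <sched.h>

struct rseq_area {                 /* hypothetical layout */
    volatile int cpu_id;           /* written by the kernel on migration */
};

static __thread struct rseq_area rseq_area = { .cpu_id = -1 };

static inline int fast_getcpu(void)
{
    int cpu = rseq_area.cpu_id;
    if (cpu < 0)                   /* area not registered: fall back */
        cpu = sched_getcpu();
    return cpu;
}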

(2) Could we implement restartable sequences in the kernel but only allow the 
vdso to make use of them? We could have the vdso export a number of high-level 
operations (like the ones suggested in Paul Turner's original presentation -- 
per-cpu spin lock, per-cpu atomic increment/decrement, per-cpu list push/pop). 
This would let us gather real-world data about how these primitives are used 
without committing to a complex ABI -- we would only be committing to support 
the specific operations. If the whole idea flops, we could eventually provide 
a slow/naive implementation of the vdso functions and kill restartable 
sequences entirely.
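
To be concrete about what I'm picturing, the exported surface might look 
something like the declarations below. These names and signatures are purely 
hypothetical (mine, not from any proposal); the point is that the kernel would 
own the implementations and could change them -- rseq, plain atomics, or 
something else -- without an ABI break.

#include <stddef.h>
#include <stdint.h>

/* Add 'delta' to the current CPU's slot in a per-cpu array of counters. */
int __vdso_percpu_add(uint64_t *slots, size_t stride, int64_t delta);

/* Push/pop on the current CPU's singly linked list head. */
struct percpu_node { struct percpu_node *next; };
int __vdso_percpu_list_push(struct percpu_node **heads, size_t stride,
                            struct percpu_node *node);
struct percpu_node *__vdso_percpu_list_pop(struct percpu_node **heads,
                                           size_t stride);

/* Acquire the current CPU's spinlock word; returns the index acquired. */
int __vdso_percpu_trylock(uint32_t *locks, size_t stride);

Userspace would resolve and call these the same way it uses the existing vdso 
entry points today.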

-b
