Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
Hey - I think the ideas Daniel brings up here are interesting -- specifically the notion that a thread could set a "pre-sleep wish" to signal that it's sleeping. As this conversation shows, I think there's a fair bit of depth to that. For example, FUTEX_LOCK is an alternative approach. Another idea might be using the "currently running cpu" field of the rseq area to tell whether the process in question was sleeping (assuming the kernel were modified to set this to -1 every time a process was descheduled).

The idea discussed here seems orthogonal to the core thesis of rseq. I'm wondering if we can focus on getting rseq in, maybe with an eye toward making sure this use case could be handled long term. My sense is that this is possible. We could use the flags field in the per-thread rseq area, or perhaps extend the meaning of the structure rseq_cs points to, to signal that there is information about how to flag the sleeping of the current process. That seems like a natural way to add the functionality Daniel talks about, if it's desired in the future.

Daniel - do you think there's anything we should do in the patch set today that would make it easier to implement your idea in the future without expanding the scope of the patch today? I.e., is there anything else we need to do to lay the groundwork for your idea?

I'd really love to see us get this patch in. There's a whole host of primitives it unlocks (more efficient RCU, better malloc implementations, fast reader-writer locks). I'm sure we'll have more ideas about APIs to provide once we've explored these use cases, but to me this patch is the MVP we need to ship to get that feedback. It's a solid framework that we can build upon, e.g. with the "opv" syscall or the idea in this thread, if user feedback shows those things are necessary.

-b
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
>>> int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
>>
>> Really dumb question -- and one I'm sorry to bring up at the last minute.
>> Should we consider making the syscall name something more generic, like
>> "register_tls_abi"?
>
> I proposed that approach back in 2016 ("tls abi" system call), and the
> feedback I received back then is that it was preferred to have a dedicated
> "rseq" system call than an "open ended" and generic "tls abi" system call.

Ultimately I'm fine either way. I do think that in the past few months of review it has become clear that creating this TLS ABI requires a fair bit of work. It'd be a shame to see a future attempt to use such an ABI made difficult by forcing the author to figure out the registration process yet again. I assume the maintainers of glibc would also like to avoid the need to register multiple ABIs.

-b
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
(apologies for the duplicate email; the previous one bounced as it was accidentally using HTML formatting)

If I understand correctly this is run on every context switch, so we probably want to make it really fast.

> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
> +{
> +	bool need_restart = false;
> +	uint32_t flags;
> +
> +	/* Get thread flags. */
> +	if (__get_user(flags, &t->rseq->flags))
> +		return -EFAULT;
> +
> +	/* Take into account critical section flags. */
> +	flags |= cs_flags;
> +
> +	/*
> +	 * Restart on signal can only be inhibited when restart on
> +	 * preempt and restart on migrate are inhibited too. Otherwise,
> +	 * a preempted signal handler could fail to restart the prior
> +	 * execution context on sigreturn.
> +	 */
> +	if (flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) {
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +			return -EINVAL;
> +		if (!(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +			return -EINVAL;
> +	}

How does this error even get to userspace? Is it worth doing this check on every execution?

> +	if (t->rseq_migrate
> +	    && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE))
> +		need_restart = true;
> +	else if (t->rseq_preempt
> +	    && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> +		need_restart = true;
> +	else if (t->rseq_signal
> +	    && !(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL))
> +		need_restart = true;

This could potentially be sped up by having the rseq_* fields in t share a single bitmask with the same bit offsets as the RSEQ_CS_FLAG_NO_RESTART_ON_* flags, then using bit operations to check the appropriate overlap.
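To make the bitmask suggestion concrete, here's a minimal userspace sketch. The flag values and the helper's signature are assumptions for illustration (they mirror the patch's RSEQ_CS_FLAG_NO_RESTART_ON_* names, not the actual kernel code): if the per-task "event occurred" bits are kept at the same offsets as the inhibit flags, the three-way if/else chain collapses into one mask operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag values; what matters is only that the per-task event
 * bits share offsets with the corresponding inhibit flags. */
#define RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT  (1U << 0)
#define RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL   (1U << 1)
#define RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE  (1U << 2)

/* event_mask: bits set for the events (preempt/signal/migrate) that
 * hit this task; flags: the merged thread + critical-section flags.
 * A restart is needed iff some event's inhibit bit is clear. */
static bool rseq_need_restart_mask(uint32_t event_mask, uint32_t flags)
{
	return (event_mask & ~flags) != 0;
}
```

The branchy chain becomes a single AND/NOT, which is both shorter and friendlier to the branch predictor on this per-context-switch path.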
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call
> int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);

Really dumb question -- and one I'm sorry to bring up at the last minute. Should we consider making the syscall name something more generic, like "register_tls_abi"? I'm assuming that if we ever want another per-thread userspace/kernel ABI we'll want to reuse this area, given the difficulty of getting adoption of registration, the need to involve glibc, etc. It seems like there could be future use cases of this TLS area that have nothing to do with rseq.
Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
> if (!((long)ip - (long)start_ip <= (long)post_commit_offset))
> 	return 1;
>
> This introduces an issue here: if "ip" is lower than "start_ip", we
> can incorrectly think we are in a critical section, when we are in
> fact not.

This shouldn't be an issue if we used unsigned numbers. E.g., if start_ip is X and post_commit_offset is L, then in (ip - X <= L), if ip is less than X, ip - X will be negative, which becomes a large unsigned value, so the comparison correctly fails.

> or to the kernel to set it back to NULL if it finds out that it is
> preempting/delivering a signal over an instruction pointer outside
> of the current rseq_cs start_ip/post_commit_ip range (lazy clear).

I see, lazy clear makes sense. Still, if during most execution periods the user code enters some rseq section (likely if rseq is used for something like malloc), this code will have to be run on every context switch.

> Moreover, this modification would add a subtraction on the common case
> (ip - start_ip), and makes the ABI slightly uglier.

We could benchmark it, but the subtraction should be similar in cost to the extra comparison, and reducing the number of branches seems like it will help as well. FWIW GCC attempts to translate this kind of sequence to a subtract and compare: https://godbolt.org/g/5DGLvo. I agree the ABI is uglier, but since we're mucking with every context switch I thought I'd point it out.

> If I understand well, you are proposing to speed up .so load time by
> means of removing relocations of pointers within rseq_cs, done by
> making those relative to the rseq_cs address.

Yeah, I think this may be overkill as an optimization.
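A small sketch of the signed-vs-unsigned point above. The function names are hypothetical stand-ins for the kernel-side check, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buggy signed form: when ip < start_ip the difference is negative,
 * "negative <= L" holds, and we falsely report "in section". */
static bool in_cs_signed(uintptr_t ip, uintptr_t start_ip,
			 uintptr_t post_commit_offset)
{
	return (intptr_t)ip - (intptr_t)start_ip
		<= (intptr_t)post_commit_offset;
}

/* Unsigned form: ip < start_ip wraps to a huge unsigned value, which
 * can never be <= post_commit_offset, so addresses below the section
 * are correctly rejected by the same single comparison. */
static bool in_cs_unsigned(uintptr_t ip, uintptr_t start_ip,
			   uintptr_t post_commit_offset)
{
	return ip - start_ip <= post_commit_offset;
}
```

For an ip one byte below start_ip, the signed version reports a false positive while the unsigned version does not.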
Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
> The layout of struct rseq_cs is as follows:
>
> start_ip
>     Instruction pointer address of the first instruction of the
>     sequence of consecutive assembly instructions.
>
> post_commit_ip
>     Instruction pointer address after the last instruction of
>     the sequence of consecutive assembly instructions.
>
> abort_ip
>     Instruction pointer address where to move the execution
>     flow in case of abort of the sequence of consecutive
>     assembly instructions.

Two really minor performance thoughts here.

1) In the kernel at context switch time you'd need code like:

	if (ip >= start_ip && ip <= post_commit_ip)

This branch would be hard to predict because most instruction pointers would be either before or after the range. If post_commit_ip were instead an offset relative to start_ip you could do:

	if (ip - start_ip <= post_commit_offset)

which is a single comparison that would be more predictable.

2) In a shared library, an rseq_cs structure would have to be relocated at runtime, because the final address of the library isn't known at compilation time. I'm not sure this is important enough to address, but it could be solved by making the pointers relative to the address of rseq_cs. That said, it would make for an uglier ABI.
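Point (1) can be checked in isolation. This is a hypothetical userspace model (the names are mine, not the patch's) showing that the absolute two-comparison form and the relative single-comparison form agree for in-range and out-of-range instruction pointers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Absolute form: two comparisons, typically two conditional branches. */
static bool in_cs_absolute(uintptr_t ip, uintptr_t start_ip,
			   uintptr_t post_commit_ip)
{
	return ip >= start_ip && ip <= post_commit_ip;
}

/* Relative form: one unsigned comparison.  ip below start_ip wraps to
 * a huge value and fails the test, so no separate lower-bound check
 * is needed. */
static bool in_cs_relative(uintptr_t ip, uintptr_t start_ip,
			   uintptr_t post_commit_offset)
{
	return ip - start_ip <= post_commit_offset;
}
```

Both forms classify the same addresses identically when post_commit_offset == post_commit_ip - start_ip; the relative form just does it with one compare-and-branch.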
Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
> I have a use-case for keeping the reference counting in place though. It's
> use of rseq in signal handlers.

Would this be solved by saying that the rseq API will return an error if you register and there's already a block registered? In this case the signal handler would register the rseq ABI area just as the non-signal code is trying to do the same. The non-signal code would see this error code, realize that its job had been done for it, and go on its way. It would be unsafe for signal handler code to *unregister* the area, but I don't think that's necessary. Basically we'd support a refcount of either 0 or 1, but nothing else.

If a signal handler registers the ABI area, how will it ensure the ABI is cleaned up at thread exit? I can't imagine pthread_key is signal safe.
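A sketch of the caller pattern I have in mind. Everything here is hypothetical (the function names and the use of a plain flag to stand in for the kernel's per-thread registration state are mine); it just shows the "refcount of 0 or 1" semantics where an already-registered error is benign:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel's per-thread registration state; in real
 * code this would be kernel state keyed by the calling thread. */
static bool registered;

/* Hypothetical syscall wrapper: fails with -EBUSY if an rseq block
 * is already registered for this thread. */
static int do_rseq_register(void)
{
	if (registered)
		return -EBUSY;
	registered = true;
	return 0;
}

/* Caller pattern: -EBUSY just means someone (e.g. a signal handler
 * that fired earlier) already did our job, so treat it as success. */
static int rseq_register_once(void)
{
	int ret = do_rseq_register();

	if (ret == -EBUSY)
		return 0;
	return ret;
}
```

With these semantics neither the signal handler nor the mainline code needs to coordinate: whichever registers first wins, and the loser's error is ignored.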
Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
Hey,

> So far the restrictions I see for libraries using this symbol are:
> - They should never be unloaded,
> - They should never be loaded with dlopen RTLD_LOCAL flag.

We talked a bit about this off-list, but I wanted to state publicly that I think this model works well for our use case. Specifically:

(1) It reduces complexity by focusing on the common case -- long term we expect glibc to manage the process of using this feature and registering/deregistering threads for rseq. Unloading isn't a challenge in that situation, so why add the complexity for it?

(2) It still allows early adopters to use rseq before there is glibc support. I believe the vast majority of real-world applications meet the two criteria you've listed. If not, they can create a thin shared library whose sole purpose is to provide the weak symbol and that never gets unloaded.

(3) It allows applications to provide __rseq_abi themselves, so they can ensure it uses the initial-exec TLS model and optimize in-application assembly code for it. This is a good optimization for server applications, which tend to link statically.

If others agree with this, would it make sense to remove the concept of reference counting from the system call that defines and redefines the per-thread area? Seems like it would remove complexity.

-b
Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
Hey, I'm really excited to hear that you're open to this patch set, and I totally understand the desire for more numbers. I have a few thoughts and questions -- hopefully ones that could help us better understand where you'd like to see more data (potentially from myself and other Facebook folks).

> A "increment percpu value" simply isn't relevant.

While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads. In applications with thousands of threads, keeping a counter like this can pose a challenge. One can use a per-thread variable, but the size overhead can be very large (8 bytes per counter per thread adds up quickly), and atomic instructions can cause contention quickly. Server programs tend to have many statistical counters; being able to implement them efficiently without size bloat is a real-world win.

This type of per-cpu counter can also very quickly be used to implement other abstractions in common use -- e.g. an asymmetric reader-writer lock or a reference-counted object that is changed infrequently. While thread-local storage can also be used in these cases, it carries a substantial size overhead and can require cooperation between the application and the library to manage thread lifecycle.

At least from what I've seen of our usage of these types of abstractions within Facebook, if rseq met these use cases and did absolutely nothing else, it would still be a feature our applications would benefit from. Hopefully we can find evidence that it can do even more than this, but I think this "trivial" use case is actually addressing a real-world problem.

> When I asked last time, people pointed me to potential uses, including
> malloc libraries that could get per-thread performance with just
> per-cpu (not per thread) data structure overhead. I see that you once
> more point to the slides from 2013 that again talks about it.
> > But people didn't post code, people didn't post numbers, and people
> > didn't point to actual real uses, just "this could happen".

At Facebook we did some work to experiment with rseq and jemalloc. Qi and David (cc'd) may be able to provide more context on the current state.

> Because without real-world uses, it's not obvious that there won't be
> somebody who goes "oh, this isn't quite enough for us, the semantics
> are subtly incompatible with our real-world use case".

Is your concern mainly this question (whether this patchset is a good way to bring per-cpu algorithms to userspace)? I'm hoping that, given the variety of ways per-cpu data structures are used in the kernel, the concerns around this patch set are mainly about what approach we should take rather than whether per-cpu algorithms are a good idea at all. If this is your main concern, perhaps our focus should be on demonstrating that a number of useful per-cpu algorithms can be implemented using restartable sequences.

Ultimately I'm worried there's a chicken-and-egg problem here. It's hard to get people to commit to investing in rseq without a clear path to the patchset seeing the light of day. It's also easy to understand why you'd be reluctant to merge such a substantial and unfamiliar API without extensive data. If we're still not able to get compelling data, I'm wondering if there are other approaches that could get us unstuck, e.g.:

(1) Could we merge part of this patchset on its own (e.g. things like the faster getcpu() operation, which seems like a good improvement over the current approach)? If we make the remaining patches small enough, it may be easier for sophisticated users to apply them, maintain them, and provide real-world operational experience with this abstraction.

(2) Could we implement restartable sequences in the kernel but only allow the vdso to make use of them?
We could have the vdso export a number of high-level operations (like the ones suggested in Paul Turner's original presentation -- per-cpu spin lock, per-cpu atomic increment/decrement, per-cpu list push/pop). This would allow us to get real-world data about how these primitives are used without committing to a complex ABI -- only committing to support the specific operations. If the whole idea flops, we could eventually create a slow/naive implementation of the vdso functions and kill restartable sequences entirely.

-b
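To make the statistical-counter argument earlier in this message concrete, here's a minimal sketch of the O(num CPUs) layout. This is not rseq code: with rseq the increment could be a plain non-atomic store inside a critical section, but as a stand-in this version uses sched_getcpu() plus a relaxed atomic add, which already captures the memory-footprint point (one slot per CPU instead of one per thread):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CPUS 4096	/* assumed upper bound for illustration */

/* One slot per possible CPU: storage is O(num CPUs), not O(threads). */
static _Atomic uint64_t counter[MAX_CPUS];

/* Bump the slot for whatever CPU we're currently running on.  The
 * thread may migrate between sched_getcpu() and the add; the relaxed
 * atomic keeps the total correct regardless. */
static void counter_inc(void)
{
	int cpu = sched_getcpu();

	if (cpu < 0 || cpu >= MAX_CPUS)
		cpu = 0;
	atomic_fetch_add_explicit(&counter[cpu], 1, memory_order_relaxed);
}

/* Reads are the slow path: sum across all per-cpu slots. */
static uint64_t counter_read(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < MAX_CPUS; i++)
		sum += atomic_load_explicit(&counter[i], memory_order_relaxed);
	return sum;
}
```

With 10,000 threads and 64 CPUs, this is 64 slots instead of 10,000 per counter; rseq would additionally remove the atomic from the fast path.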
Re: [RFC PATCH v8 1/9] Restartable sequences system call
On 8/25/16, 10:08 AM, "Mathieu Desnoyers" wrote:

> The most appealing application we have so far is Dave Watson's Facebook
> services using jemalloc as a memory allocator. It would be nice if he
> could re-run those benchmarks with my rseq implementation. The trade-offs
> here are about speed and memory usage:

One piece of context I’d like to provide about our investigation into using rseq for malloc: I think it’s really important to keep in mind that we’ve optimized the software we write to survive well in a world where jemalloc has a high per-thread memory cost and is unable to have allocation caches as large as we would like. We’re not going to have real-world benchmarks that show a magical improvement with rseq, because over time we’ve baked the constraints of our environment into the design of our programs and optimized for the current set of APIs the kernel provides.

I do think rseq provides a benefit even for applications optimized for today’s malloc implementations. But the real benefit is the new types of application design that rseq enables, and the ability for rseq to provide predictable performance for low-level APIs with much less investment from users. I’ll illustrate the costs that rseq would let us avoid with two examples of design choices we’ve made:

1) Because jemalloc uses a per-thread cache, threads that are sleeping have a high memory cost. For example, if you have a thread pool with 100 threads but only 10 are used most of the time, the other 90 threads will still have a dormant cache consuming memory. To combat this we have an abstraction called MemoryIdler (https://github.com/facebook/folly/blob/master/folly/detail/MemoryIdler.h), which is essentially a wrapper around futex that signals jemalloc to release its caches when the thread is idle.
From what I can tell this is a practice that isn’t widely adopted, even though it can save a substantial amount of memory. rseq makes this a moot point, since caches can be per-cpu and the memory allocator does not need to worry about an idle thread hogging the cache.

2) The per-thread nature of malloc implementations has generally led people to avoid thread-per-request designs. Things like MemoryIdler can help if a thread is going to be idle for seconds before it is used again, but if your thread makes a 100 ms RPC to another server, clearing the cache is far too expensive to justify. So you still have a bunch of memory sitting around unused for 100 ms; multiply that by many concurrent requests and you are consuming a lot of memory. This has forced people to handle multiple requests in a single thread, which leads to problems of its own, like a contended lock in one request impacting many other requests on the same thread.

rseq opens up a whole world of algorithms to userspace -- algorithms that are O(num CPUs) and where one can have an extremely fast fast path at the cost of a slower slow path. Many of these algorithms are in use in the kernel today: per-cpu allocators, RCU, lightweight reader-writer locks, etc. Even in cases where these APIs can be implemented today, an rseq implementation is often superior in terms of predictability and usability (e.g. per-thread counters consume more memory and are more expensive to read than per-cpu counters).

Isn’t the large number of uses of rseq-like algorithms in the kernel a pretty substantial sign that there would be demand for similar algorithms by user-space systems programmers?

-b
RE: [RFC PATCH] thread_local_abi system call: caching current CPU number (x86)
Mathieu Desnoyers wrote:

> Expose a new system call allowing threads to register a userspace memory
> area where to store the current CPU number. Scheduler migration sets the

I really like that this approach makes it easier to add per-thread interactions between userspace and the kernel in the future.

> +	if (!tlap || t->thread_local_abi_len <
> +			offsetof(struct thread_local_abi, cpu)
> +			+ sizeof(tlap->cpu))

Could you save a branch here by enforcing that thread_local_abi_len == 0 whenever thread_local_abi == NULL?

-b
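A small sketch of the suggested invariant. The struct layout and helper are assumptions for illustration (the real struct has more fields): if registration guarantees that a NULL pointer always comes with a zero length, the length comparison alone rejects the unregistered case and the NULL test drops out of the hot path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the registered per-thread area. */
struct thread_local_abi {
	int32_t cpu;
};

/* Assumed invariant, enforced at registration time:
 *   tlap == NULL  implies  len == 0.
 * Then len alone tells us whether the cpu field is usable; no
 * separate NULL check is needed on the fast path.  (sizeof does not
 * dereference tlap, so a NULL argument is safe here.) */
static bool cpu_field_usable(const struct thread_local_abi *tlap, size_t len)
{
	return len >= offsetof(struct thread_local_abi, cpu)
		      + sizeof(tlap->cpu);
}
```

The unregistered case (NULL, 0) and a too-short registration both fail the single comparison, while a full-size registration passes.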