On 8/25/16, 10:08 AM, "Mathieu Desnoyers" <[email protected]> wrote:
> The most appealing application we have so far is Dave Watson's Facebook
>    services using jemalloc as a memory allocator. It would be nice if he
>    could re-run those benchmarks with my rseq implementation. The trade-offs
>    here are about speed and memory usage:
 
One piece of context I'd like to provide about our investigation into using
rseq for malloc: it's important to keep in mind that we've optimized the
software we write to survive in a world where jemalloc imposes a high
per-thread memory cost and cannot keep allocation caches as large as we would
like. We're not going to see real-world benchmarks showing a magical
improvement from rseq, because over time we've baked the constraints of our
environment into the design of our programs and optimized for the APIs the
kernel currently provides. I do think rseq benefits even applications tuned
for today's malloc implementations. But the real benefit is the new kinds of
application design that rseq enables, and its ability to deliver predictable
performance for low-level APIs with much less investment from users. I'll
illustrate the costs rseq would let us avoid with two examples of design
choices we've made:

1) Because jemalloc uses a per-thread cache, sleeping threads carry a high
memory cost. For example, if you have a thread pool with 100 threads but only
10 are busy most of the time, the other 90 threads each keep a dormant cache
pinned in memory. To combat this we have an abstraction called MemoryIdler
(https://github.com/facebook/folly/blob/master/folly/detail/MemoryIdler.h),
which is essentially a wrapper around futex that signals jemalloc to release
its caches when the thread goes idle (a minimal sketch of the idea follows
this list). From what I can tell this practice isn't widely adopted, even
though it can save a substantial amount of memory. rseq makes this a moot
point: caches can be per-CPU, and the memory allocator no longer needs to
worry about an idle thread hogging a cache.
2) The per-thread nature of malloc implementations has generally led people to
avoid thread-per-request designs. Things like MemoryIdler can help if a thread
will be idle for seconds before it is used again, but if your thread makes a
100 ms RPC to another server, clearing the cache is far too expensive to
justify; the cache just sits unused for those 100 ms. Multiply that by many
concurrent requests and you are consuming a lot of memory. This has pushed
people toward handling multiple requests in a single thread, which brings
problems of its own, like a contended lock in one request stalling many other
requests on the same thread.
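
To make the MemoryIdler point in (1) concrete, here is a rough sketch of the
idea; this is not folly's actual implementation. A worker waits for work with
a timeout, and once it has been idle past the timeout it asks jemalloc to
flush its thread cache before blocking indefinitely. The mallctl name
"thread.tcache.flush" is real jemalloc API; the queue type and timeout are
purely illustrative:

// Sketch only: an idle-aware worker that flushes jemalloc's per-thread
// cache after sitting idle, in the spirit of folly's MemoryIdler.
// Build with: g++ -std=c++17 idler.cpp -ljemalloc -lpthread
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

#include <jemalloc/jemalloc.h>  // for mallctl()

struct WorkQueue {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> q;

  void post(std::function<void()> task) {
    { std::lock_guard<std::mutex> lg(m); q.push(std::move(task)); }
    cv.notify_one();
  }

  // Returns a task, flushing the thread cache if we idle past idleTimeout.
  std::function<void()> take(std::chrono::milliseconds idleTimeout) {
    std::unique_lock<std::mutex> lk(m);
    // First wait with a timeout; if no work shows up, this thread is "idle".
    if (!cv.wait_for(lk, idleTimeout, [&] { return !q.empty(); })) {
      lk.unlock();
      // Idle: release this thread's jemalloc cache so the memory can be
      // reused elsewhere, then block indefinitely until work arrives.
      mallctl("thread.tcache.flush", nullptr, nullptr, nullptr, 0);
      lk.lock();
      cv.wait(lk, [&] { return !q.empty(); });
    }
    auto task = std::move(q.front());
    q.pop();
    return task;
  }
};

With per-CPU caches, none of this machinery would be needed: an idle thread
simply holds no cache at all.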

rseq opens up a whole world of algorithms to userspace: algorithms that are
O(num CPUs) in memory and that trade an extremely fast fast path for a slower
slow path. Many of these algorithms are in use in the kernel today: per-CPU
allocators, RCU, lightweight reader-writer locks, etc. Even in cases where
these APIs can be implemented today, an rseq implementation is often superior
in predictability and usability (e.g. per-thread counters consume more memory
and are more expensive to read than per-CPU counters).
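
To illustrate the counter case: a real rseq fast path needs an
architecture-specific restartable critical section, so the sketch below does
not use rseq itself. It only shows the per-CPU data layout and read-side cost,
using sched_getcpu() plus an atomic add as a stand-in for what rseq would turn
into a plain store. All names here are illustrative:

// Sketch of a per-CPU counter: O(num CPUs) memory and an O(num CPUs) read,
// versus per-thread counters whose cost grows with the thread count.
#include <atomic>
#include <vector>
#include <sched.h>   // sched_getcpu()
#include <unistd.h>  // sysconf()

class PerCpuCounter {
 public:
  PerCpuCounter() : slots_(sysconf(_SC_NPROCESSORS_CONF)) {}

  void inc() {
    // sched_getcpu() tells us which CPU we are *probably* on; we may be
    // migrated right after, which is why the add must stay atomic here.
    // With rseq, migration would abort and restart the sequence instead,
    // letting the increment be a plain, unsynchronized store.
    int cpu = sched_getcpu();
    if (cpu < 0 || cpu >= static_cast<int>(slots_.size())) cpu = 0;
    slots_[cpu].v.fetch_add(1, std::memory_order_relaxed);
  }

  // Reading sums num-CPUs slots, independent of how many threads exist.
  long read() const {
    long sum = 0;
    for (const auto& s : slots_) sum += s.v.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  struct alignas(64) Slot {  // pad to a cache line to avoid false sharing
    std::atomic<long> v{0};
  };
  std::vector<Slot> slots_;  // sized once at construction; never resized
};

A per-thread scheme instead needs one slot per thread plus a registry to walk
on reads, which is exactly the memory and read cost referred to above.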

Isn't the large number of rseq-like algorithms in use in the kernel today a
pretty substantial sign that user-space systems programmers would have demand
for similar algorithms?

-b
