Yo,

On Mon, 16 Feb 2015, Roman Leventov wrote:

> Hello Memcached users and developers,
> I've analyzed Memcached implementation and suggested some improvements
> here: 
> http://key-value-stories.blogspot.com/2015/02/memcached-internals-design.html
>
> If I'm wrong in some details or conclusions, please correct me rather than
> blaming me; my only concern is making the article as relevant as possible.

The basic overview of internals seems largely correct, but as soon as you
start recommending things it goes weird. I'll answer points below, but
I'll counter some unsolicited advice with some of my own: you should
really verify ideas with simple tests rather than simply write them out
without backing them up.

"There is a surprising text area " flags length\r\n" (the value length is
meant) between the stored key and value. Apparently, it is added for
debugging,"

It's not for debugging. This is part of the ASCII protocol:
$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
set foo 0 0 2
hi
STORED
get foo
VALUE foo 0 2
hi
END

Note the "VALUE key flags length\r\n" - precisely because it's slow to
format that line on every get, the second half is pre-formatted at set
time. It's not used if you're purely dealing with the binary protocol though.
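
To make the "pre-formatted on set" point concrete, here's a rough sketch in
plain C of the idea (illustrative only; the struct and function names are
not memcached's):

/* The expensive snprintf() happens once per store; every ASCII get
 * just copies the saved bytes. Illustrative sketch, not memcached code. */
#include <stdio.h>

struct item {
    char key[251];         /* keys are capped at 250 bytes */
    char suffix[40];       /* " <flags> <datalen>\r\n", built at set time */
    int  nsuffix;
    const char *data;
    size_t ndata;
};

static void item_set(struct item *it, const char *key,
                     unsigned int flags, const char *data, size_t ndata)
{
    snprintf(it->key, sizeof(it->key), "%s", key);
    it->data = data;
    it->ndata = ndata;
    /* formatted once here, reused verbatim by every ASCII get */
    it->nsuffix = snprintf(it->suffix, sizeof(it->suffix),
                           " %u %zu\r\n", flags, ndata);
}

static void ascii_get_response(const struct item *it, FILE *out)
{
    /* "VALUE <key><saved suffix><data>\r\n": no number formatting on
     * the read path */
    fprintf(out, "VALUE %s", it->key);
    fwrite(it->suffix, 1, (size_t)it->nsuffix, out);
    fwrite(it->data, 1, it->ndata, out);
    fputs("\r\n", out);
}

int main(void)
{
    struct item it;
    item_set(&it, "foo", 0, "hi", 2);
    ascii_get_response(&it, stdout);   /* prints the VALUE line plus "hi" */
    return 0;
}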

---

"Even through the whole memory dedicated to for entry allocation could be
malloc()-ed in the start of memcached process (it is done so if memcached
is configured to use large pages), entries of the size class are
"established" by chunks of maximum entry size (i. e. 1 MB). This chunk
itself also called "slab" that is confusing a bit."

It originally did that, but so many people couldn't make large allocations
at start time that it was changed to allocate as needed.
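
Today it looks more like this (an illustrative sketch, not the real
structures or names):

/* A slabclass only malloc()s a new ~1MB page when its freelist runs
 * dry, then carves the page into fixed-size chunks. Nothing is
 * pre-allocated up front. */
#include <stdlib.h>

#define SLAB_PAGE_SIZE (1024 * 1024)

struct slabclass {
    size_t chunk_size;   /* item size this class serves */
    void  *freelist;     /* singly-linked list of free chunks */
};

static int grow_slabclass(struct slabclass *sc)
{
    char *page = malloc(SLAB_PAGE_SIZE);
    if (page == NULL)
        return -1;
    size_t n = SLAB_PAGE_SIZE / sc->chunk_size;
    for (size_t i = 0; i < n; i++) {
        void **chunk = (void **)(page + i * sc->chunk_size);
        *chunk = sc->freelist;   /* push chunk onto the freelist */
        sc->freelist = chunk;
    }
    return 0;
}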

---

"While bucket lock is well striped, other three (memory allocator, LRU
cache and statistics) locks are single, that limits scalability to only 5
threads."

I'll address the first two shortly, but the statistics lock doesn't cause
much strife. The remaining global STATS lock acquisitions are relatively
rare on a busy server. The rest are behind per-thread mutexes, which are
nearly free since they never collide. I've tested completely removing the
stats locks and the improvement wasn't much above noise.
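
The per-thread arrangement looks roughly like this (simplified sketch; the
real struct and function names differ):

/* Each worker only ever locks its own stats mutex, so it's effectively
 * uncontended; only the occasional "stats" command walks every thread. */
#include <pthread.h>
#include <stdint.h>

struct thread_stats {
    pthread_mutex_t lock;    /* initialized at worker startup (omitted) */
    uint64_t get_hits;
    uint64_t get_misses;
};

static void stat_get_hit(struct thread_stats *ts)
{
    pthread_mutex_lock(&ts->lock);   /* never collides in the fast path */
    ts->get_hits++;
    pthread_mutex_unlock(&ts->lock);
}

static uint64_t total_get_hits(struct thread_stats *all, int nthreads)
{
    uint64_t sum = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_mutex_lock(&all[i].lock);
        sum += all[i].get_hits;
        pthread_mutex_unlock(&all[i].lock);
    }
    return sum;
}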

---

"At very least, having allocator and LRU cache locks per slabclass should
help in some cases, and should be easy to implement, because slabclasses
don't share state."

I did this and quite a bit more already. I'll talk about it at the end
(it was a bit harder than it seems).

---

"Mutex locks are slow, because they cause two context switches. Context
switch cost estimates vary in a wide range, depending on if this is just a
in-process thread switch, or an OS-level process switch (requiring much
more kernel operations and TLB cache flush), CPU model, workload, if the
server is dedicated for Memcached or have other intensively-working
processes, and so on."

This is all baseless speculation. Memcached used to spinlock on a few
mutexes (without a time limit, I guess). I was never able to get it to be
an improvement in performance, and as the number of worker threads grew
the performance would worsen. Removing all of them produced significantly
higher performance and better latency averages. The locks are held for
generally consistent amounts of time (no syscalls/etc), so spinning while
waiting for a lock just burns CPU for no reason.

Spinlocks don't seem to be much help outside of the kernel. I don't really
understand why people are so obsessed with them. If two worker threads
block on the same mutex, the OS can tell which one to wake up sooner. In
the spinlock case you could spend an entire timeslice (or two) on the
wrong thread. Unless there's a POSIX primitive for spinlocking which
introduces dependencies...
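
To make the contrast concrete, a naive userspace spinlock versus a
futex-backed mutex looks like this (illustrative sketch, nothing from
memcached):

/* The spinner burns its whole timeslice if the lock holder has been
 * preempted; the mutex waiter is put to sleep by the kernel and woken
 * when the lock is released. */
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void bump_with_spinlock(void)
{
    while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire))
        ;   /* spins, burning CPU, until the holder clears the flag */
    counter++;
    atomic_flag_clear_explicit(&spin, memory_order_release);
}

static void bump_with_mutex(void)
{
    pthread_mutex_lock(&mtx);   /* sleeps in the kernel if contended */
    counter++;
    pthread_mutex_unlock(&mtx);
}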

Even within the linux kernel they're pretty obnoxious. I speak from having
to peel them out constantly at $dayjob.

Also, don't forget futexes: on Linux an uncontended mutex lock never even
enters the kernel, which makes the per-thread statistics locks an agreeable
situation. Assuming they're terrible is how we got twemcache's release
announcement claiming they've "significantly improved performance" despite
simple benchmarking proving otherwise.

In the future I would like to move more stats under the per-thread locks,
and then remove the per-thread locks entirely on x86_64 systems (8-byte
aligned writes/reads being atomic there). Since the performance difference
is so small, the priority is low.
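
Revisiting the per-thread stats sketch from above without the mutex, the
idea is roughly this (illustrative; C11 relaxed atomics standing in for
"plain aligned 8-byte accesses"):

/* Each counter has exactly one writer (its owning worker). An aligned
 * 8-byte load/store is atomic on x86_64; relaxed atomics say the same
 * thing portably, so readers can sum without any locks. */
#include <stdatomic.h>
#include <stdint.h>

struct thread_stats {
    _Atomic uint64_t get_hits;   /* written only by its owning worker */
};

static inline void stat_get_hit(struct thread_stats *ts)
{
    /* single writer: read-modify-write of our own slot needs no lock */
    uint64_t v = atomic_load_explicit(&ts->get_hits, memory_order_relaxed);
    atomic_store_explicit(&ts->get_hits, v + 1, memory_order_relaxed);
}

static uint64_t total_get_hits(struct thread_stats *all, int nthreads)
{
    uint64_t sum = 0;
    for (int i = 0; i < nthreads; i++) {
        /* may be a moment stale, but never torn */
        sum += atomic_load_explicit(&all[i].get_hits, memory_order_relaxed);
    }
    return sum;
}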

---

"Hash table bucket locks should be read-write granular."

rwlocks are slower than mutexes, and I've not gotten the bucket locks to
be a major source of pain. In recent benchmarks I've been able to fetch 20
million keys/sec from 16+ cores.

Major slowdowns come from the write path, which until recently would mix
badly with reads while the LRU locks were held.
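
For reference, the bucket ("item") locks are striped over plain mutexes,
roughly like this (lock count and names illustrative):

/* The key's hash picks one mutex out of a fixed pool, so two requests
 * only contend when they land on the same stripe. An rwlock here would
 * add acquisition overhead on every operation without removing the
 * contention that actually matters (the write path). */
#include <pthread.h>
#include <stdint.h>

#define ITEM_LOCK_COUNT 4096   /* power of two */

static pthread_mutex_t item_locks[ITEM_LOCK_COUNT];

static void item_locks_init(void)
{
    for (int i = 0; i < ITEM_LOCK_COUNT; i++)
        pthread_mutex_init(&item_locks[i], NULL);
}

static pthread_mutex_t *item_lock_for(uint32_t hv)
{
    /* hv = hash(key, nkey), computed once per request */
    return &item_locks[hv & (ITEM_LOCK_COUNT - 1)];
}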

---

"Concurrent memory allocation performs better, when it is spit into
independent regions, where the region for the allocation depends on the
current thread (jemalloc) or simply choosed randomly (one-nio). However,
this isn't easily applicable to Memcached."

This is actually a problem with the connection-related buffers. We also
have no CPU pinning mechanism yet. Those are on my short list of things to
improve: per-worker-thread connection structures, and much better buffer
handling.
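
The general direction is per-worker buffer pools, something in the spirit
of the sketch below (purely illustrative; not what the eventual change will
necessarily look like):

/* Each worker thread keeps its own freelist of connection buffers, so
 * buffer churn never crosses threads or touches a shared allocator
 * lock. Sizes and names are made up for the example. */
#include <stdlib.h>

#define CONN_BUF_SIZE 16384

struct conn_buf {
    struct conn_buf *next;
    char data[CONN_BUF_SIZE];
};

struct worker {
    struct conn_buf *buf_freelist;   /* touched only by this worker */
};

static struct conn_buf *worker_get_buf(struct worker *w)
{
    struct conn_buf *b = w->buf_freelist;
    if (b != NULL) {
        w->buf_freelist = b->next;
        return b;
    }
    return malloc(sizeof(*b));   /* fall back to malloc when empty */
}

static void worker_put_buf(struct worker *w, struct conn_buf *b)
{
    b->next = w->buf_freelist;
    w->buf_freelist = b;
}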

---

"Using a single byte for storing key size, while the minimum entry
overhead is 63 bytes, seems to be an almost pointless optimization."

You'd be surprised. I fret about not being able to add even a second byte
for internal item flags, given how sensitive some users are to per-item
overhead. It also acts as a sanity limit for folks who would otherwise
blindly use 16-kilobyte keys and destroy their performance (hashing takes
a lot longer, memory efficiency is garbage, etc).
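
For context, the key length is a single byte in a per-item header shaped
roughly like this (field layout illustrative, not the exact struct):

/* One byte for nkey is why keys are capped at 250 bytes, and every
 * extra header byte gets multiplied by the hundreds of millions of
 * items a busy instance can hold. */
#include <stdint.h>

typedef struct _stritem {
    struct _stritem *next;     /* LRU links */
    struct _stritem *prev;
    struct _stritem *h_next;   /* hash chain */
    uint32_t time;             /* last access */
    uint32_t exptime;          /* expiration */
    uint32_t nbytes;           /* value length */
    uint16_t refcount;
    uint8_t  nsuffix;          /* length of the pre-formatted suffix */
    uint8_t  it_flags;
    uint8_t  slabs_clsid;
    uint8_t  nkey;             /* key length: one byte, so <= 250 */
    /* key, suffix and data follow inline */
} item;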

---

WRT performance changes, here are some recent benchmarks I did over a
twitter conversation:

https://gist.github.com/dormando/be32c2ed08dd529fe96e
^ using 16 and 32 worker threads (on 16+ core machine), multiget fetches
(stressing internal locks more than syscalls/network stack)
https://gist.github.com/dormando/8833c8db028d73d71915
^ similar, but fetching single keys per get. I believe there's room for
improvement here, but it's a lot more difficult since you're spending all
your time in syscalls (with few batching opportunities)
https://gist.github.com/dormando/af7a4418210ea4609d64
^ illustrating how the current performance levels match up to reality: it's
very easy to soak a 10G interface, even with relatively small values. The
grand bulk of users have multi-kilobyte values.

This is a read-only workload, which shows the threads scaling pretty well.
Once you mix in sets, things don't go as well though.

---

Further performance/efficiency changes:

https://github.com/memcached/memcached/pull/97

for which I've been soliciting testing for over a month now, and have just
recently gotten some.

Primary differences in speed:

- Per-LRU locks
- LRU locks no longer held during the read path (normally taken only once
per item per 60 seconds, but that still adds up; see the sketch after this
list)
- A number of changes and background threads to improve memory efficiency.
Some users are finally testing this in production, and while some see
negligible improvements, others have had memory requirements drop to at
least 1/4th of what they were previously, and it uses less CPU overall.
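
The once-per-60-seconds bump mentioned above boils down to a check like
this (simplified sketch; names illustrative):

/* On a hit, the item is only relinked to the head of its LRU if it
 * hasn't been bumped in the last 60 seconds; every other hit on a hot
 * key skips the LRU lock entirely. */
#include <stdbool.h>
#include <stdint.h>

#define ITEM_UPDATE_INTERVAL 60   /* seconds between LRU bumps per item */

struct item_hdr {
    uint32_t time;   /* last time this item was bumped */
};

static bool needs_lru_bump(const struct item_hdr *it, uint32_t current_time)
{
    /* only when this is true does the read path take an LRU lock and
     * relink the item */
    return it->time < current_time - ITEM_UPDATE_INTERVAL;
}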

Due to the above changes, mixing higher set loads into a high read
workload causes less impact. If you're doing enough sets to overwhelm
all of the worker threads there isn't much you can do (still), but the
ceiling for that is much higher.

This does end up stressing the slabs_lock as the next "problem global
lock". I'm fine with this for now and will attempt to improve it at a
later date.

I'm waiting to see how the production tests fare from the weekend, fixing
a few more bugs, then will release this code I think.

have fun,
-Dormando
