> Hello Dormando, thanks for the detailed and constructive reply.
>
> On Tuesday, 17 February 2015 01:39:02 UTC+7, Dormando wrote:
>       Yo,
>
>       On Mon, 16 Feb 2015, Roman Leventov wrote:
>
>       > Hello Memcached users and developers,
>       > I've analyzed the Memcached implementation and suggested some improvements
>       > here:
>       > http://key-value-stories.blogspot.com/2015/02/memcached-internals-design.html
>       >
>       > If I'm wrong in some details or conclusions, please correct me
>       > instead of blaming me; my only concern is making the article as relevant
>       > as possible.
>
>       The basic overview of internals seems largely correct, but as soon as you
>       start recommending things it goes weird. I'll answer points below, but
>       I'll counter some unsolicited advice with some of my own: you should
>       really verify ideas with simple tests rather than simply write them out
>       without backing them up.
>
> In this case preparation of a single post would take a month instead of a week.
> I would throw away some hypotheses which haven't been proven in my
> synthetic tests; however, this could be because I built a wrong workload model
> or applied a slightly wrong approach. Just pronouncing many ideas (even
> controversial and imprecise ones) seems to be better: it generates discussions,
> from which some really valuable ideas or conclusions could emerge.

This isn't generally true. In my long years of maintaining memcached,
every post like yours builds up technical debt that I have to "prove
against" when I make my own statements. People will blindly accept what
you've written as truth. It's not really your fault... it's just the way
it is. People still cite blog posts and papers from 6+ years ago when
comparing memcached to things.

Discussions on mailing lists/etc tend to go just fine though.

>       "There is a surprising text area " flags length\r\n" (the value length 
> is
>       meant) between the stored key and value. Apparently, it is added for
>       debugging,"
>
>       It's not for debugging. This is part of the ASCII protocol:
>
> Ok, updated the post. I have no one to ask about Memcached (or any other
> database), so I use another approach:
> make an assumption and wait until someone proves/refutes it.

Why couldn't you have asked here first? You also pinged me on twitter so
you could've just asked :P
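
To spell out why that text lives between the key and the value: the
" <flags> <bytes>\r\n" suffix is pre-rendered once at store time so a GET
reply ("VALUE <key> <flags> <bytes>\r\n<data>\r\n") can be assembled from
the stored bytes without any formatting work on the read path. Roughly
like this (an illustrative sketch, not the actual item code; the names are
made up):

    /* Sketch only: the item suffix is rendered once at store time so reads
     * can reuse it verbatim instead of calling snprintf on every GET. */
    #include <stdio.h>

    static int make_suffix(char *buf, size_t bufsz,
                           unsigned int flags, size_t nbytes)
    {
        return snprintf(buf, bufsz, " %u %zu\r\n", flags, nbytes);
    }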

>       "Even through the whole memory dedicated to for entry allocation could 
> be
>       malloc()-ed in the start of memcached process (it is done so if 
> memcached
>       is configured to use large pages), entries of the size class are
>       "established" by chunks of maximum entry size (i. e. 1 MB). This chunk
>       itself also called "slab" that is confusing a bit."
>
>       It originally did that, but so many people couldn't make large allocations
>       at start time it was changed to as-needed.
>
> Hm. https://github.com/memcached/memcached/blob/master/memcached.c#L5245

I'm not talking about largepages (which aren't even implemented for linux
right now). It's assumed that if you intend to use large/hugepages you're
equipped enough to make sure there's enough free contiguous memory when
you start the daemon.

>       "Mutex locks are slow, because they cause two context switches. Context
>       switch cost estimates vary in a wide range, depending on if this is just a
>       in-process thread switch, or an OS-level process switch (requiring much
>       more kernel operations and TLB cache flush), CPU model, workload, if the
>       server is dedicated for Memcached or have other intensively-working
>       processes, and so on."
>
>       This is all baseless speculation. It was spinlocking on a few mutexes
>       (without a time limit, I guess). I was never able to get it to be an
>       improvement in performance, and as the number of worker threads grows the
>       performance would worsen. Removing all of them produced significantly
>       higher performance and better latency averages. The locks are held for
>       generally consistent amounts of times (no syscalls/etc) so spinlocking
>       waiting for a lock just burns cpu for no reason.
>
>       Spinlocks don't seem to be much help outside of the kernel. I don't really
>       understand why people are so obsessed with them. If two worker threads
>       block on the same mutex, the OS can tell which one to wake up sooner. In
>       the spinlock case you could spend an entire timeslice (or two) on the
>       wrong thread. Unless there's a POSIX primitive for spinlocking which
>       introduces dependencies...
>
>       Even within the linux kernel they're pretty obnoxious. I speak from having
>       to peel them out constantly at $dayjob.
>
>       Also, don't forget futexes. For the per-thread statistics lock futexes
>       make it an agreeable situation. Assuming it's terrible is how we got
>       twemcache's release announcement to claim they've "significantly improved
>       performance" despite simple benchmarking proving otherwise.
>
>       In the future I would like to move more stats into the per-thread locks,
>       and then remove the per-thread locks on 64bit x86_64 systems (8 byte
>       aligned writes/reads being atomic). Since the performance difference is
>       so small the priority is low.
>
>
> Well, I was not precise (I've now rephrased the post a bit, but I understand
> that something is still not properly explained).
> There is simple arithmetic: with mutexes, you pay 2 context switches. With
> spin locks, you pay the average time each thread spends under the lock,
> multiplied by the average number of threads concurrently waiting to acquire
> the lock, plus some context switches that still need to occur for OS
> background tasks (if the server is dedicated to the cache instance).

I'm stating a fact that they did not work in benchmarks. It's only simple
arithmetic if you completely ignore everything else happening on the
system, ignore the scheduler, ignore interrupts, etc. With a futex (and
many pthread primitives under linux) you do not incur a full context switch.
Instead they do a kernel switch, or, in the case of a futex, stay
in userspace *unless* the lock is contended; they do not incur an
automatic context switch at all (half or full).
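
To make that concrete, here's roughly what a futex-backed lock looks like
(a Drepper-style sketch, not glibc's or memcached's actual code). The
uncontended path is a single CAS in userspace; only contention ever
touches the kernel:

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* state: 0 = free, 1 = locked, 2 = locked with waiters */
    typedef struct { atomic_int state; } futex_lock;

    static void futex_lock_acquire(futex_lock *l)
    {
        int c = 0;
        /* Fast path: uncontended acquire is one userspace CAS, no syscall. */
        if (atomic_compare_exchange_strong(&l->state, &c, 1))
            return;
        /* Slow path: mark the lock contended, sleep in the kernel until woken. */
        if (c != 2)
            c = atomic_exchange(&l->state, 2);
        while (c != 0) {
            syscall(SYS_futex, &l->state, FUTEX_WAIT, 2, NULL, NULL, 0);
            c = atomic_exchange(&l->state, 2);
        }
    }

    static void futex_lock_release(futex_lock *l)
    {
        /* Wake one sleeper only if someone actually marked the lock contended. */
        if (atomic_exchange(&l->state, 0) == 2)
            syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }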

On a normal linux OS, running a spinlock too long can still context switch
several times, as the scheduler decides to migrate your process to run
something else (interrupt, another process, kernel thread, etc). In this
case a mutex would've simply put you to sleep, then re-woken you once when
ready, so the spinlock can rapidly become less efficient. When the kernel
spinlocks it tends to do it in uninterruptible space. You end up losing the
scalability improvements of multiple CPUs. It can also kill performance
of the entire system since that CPU is now fully locked from doing other
potential work (such as servicing unrelated sockets, if a particular
network socket is under heavy contention).
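
For contrast, this is all a bare userspace spinlock does (an illustrative
sketch, not what memcached ships): while a waiter sits in this loop it burns
its entire timeslice, and if the holder gets preempted every spinner wastes
CPU until the holder runs again.

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } spin_lock;
    /* e.g. static spin_lock l = { ATOMIC_FLAG_INIT }; */

    static void spin_lock_acquire(spin_lock *l)
    {
        /* Busy-wait: never sleeps, never tells the scheduler anything useful. */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;  /* spin */
    }

    static void spin_lock_release(spin_lock *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }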

There're some tunables to control how often the kernel is allowed to
migrate threads. They can significantly reduce context switch overhead,
increasing potential latency as a compromise.

The 'perf' and 'sar' utils can show cycles lost in this mess.

> I'm not even sure statistics updates should be under locks (couldn't they
> be just a number of atomic
> counters?), but if they should, the operation is so thin that even with pretty
> high contention spin locks are far more efficient than mutexes.

Did you read my entire response? I addressed this specifically.
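
To restate the direction I mentioned above in code form (a rough sketch with
made-up names, not memcached's actual structs): each worker is the sole
writer of its own cache-line-aligned counters, so the hot path needs no lock
at all, and on x86_64 the aligned 8-byte store/load is already atomic. The
stats reader just sums a slightly stale snapshot.

    #include <stdint.h>
    #include <stdatomic.h>

    #define MAX_WORKERS 64

    struct thread_stats {
        _Alignas(64) _Atomic uint64_t get_hits;  /* one cache line per worker */
        _Atomic uint64_t get_misses;
    };

    static struct thread_stats tstats[MAX_WORKERS];

    /* Only the owning worker calls this, so there is no write-write race;
     * the relaxed load+store compiles to a plain aligned increment on x86_64. */
    static inline void stat_get_hit(int tid)
    {
        uint64_t v = atomic_load_explicit(&tstats[tid].get_hits,
                                          memory_order_relaxed);
        atomic_store_explicit(&tstats[tid].get_hits, v + 1,
                              memory_order_relaxed);
    }

    /* The "stats" command handler tolerates a slightly stale total. */
    static uint64_t stat_total_get_hits(void)
    {
        uint64_t total = 0;
        for (int i = 0; i < MAX_WORKERS; i++)
            total += atomic_load_explicit(&tstats[i].get_hits,
                                          memory_order_relaxed);
        return total;
    }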

> Problems cause one another: for contended locks wrapping mildly expensive
> operations mutexes are better, but locks shouldn't be so contended.

They should not contend ideally. Which is why I improved it a bit.

>       "Hash table bucket locks should be read-write granular."
>
>       rwlocks are slower than mutexes, and I've not gotten the bucket locks to
>       be a major source of pain. In recent benchmarks I've been able to fetch 20
>       million keys/sec from 16+ cores.
>
> Bucket locks are well striped, so you can afford to have just a
> 64-bit word acting as a hand-made unfair read-write spin-lock.

Again, in actual benchmarks I've not been able to prove them to be a
problem. In one of the gist links I provided before I show an "all miss"
case, which acquires/releases a bucket lock but does not have the overhead
of processing the value. In those cases it was able to process over 50
million keys per second. The internals don't tend to be the slow part
anymore.

There's plenty of work to do that's not based off of speculation.

I did not use a hand-rolled spinlock in my own tests, though one may eventually
be useful for scaling the slabs_lock since it's typically held for such a
small amount of time. However for that particular lock it might be
possible to replace the typical case with a few CAS operations or use
per-thread free queues.
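
A loose sketch of the per-thread free queue idea (illustrative only, not the
slab allocator's real code): each worker keeps a small private list of free
chunks so the common alloc/free path touches no shared lock, and only refills
from the global, mutex-protected pool when it runs dry.

    #include <pthread.h>
    #include <stddef.h>

    struct chunk { struct chunk *next; };

    static struct chunk *global_free;                /* shared pool */
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
    static __thread struct chunk *local_free;        /* per-thread cache */

    static struct chunk *chunk_alloc(void)
    {
        struct chunk *c = local_free;
        if (c) {                                     /* fast path: no lock */
            local_free = c->next;
            return c;
        }
        pthread_mutex_lock(&global_lock);            /* slow path: refill */
        c = global_free;
        if (c)
            global_free = c->next;
        pthread_mutex_unlock(&global_lock);
        return c;                                    /* NULL if pool is empty */
    }

    static void chunk_free(struct chunk *c)
    {
        c->next = local_free;                        /* fast path: no lock */
        local_free = c;
    }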

But, again, the performance ceiling is so high I intend to do more work
elsewhere. You lose a lot of performance when crossing NUMA boundaries,
for instance, so pinning threads and using thread-local memory for
connections can improve that. Otherwise there're real features to add, and
memory efficiency gains, such as the focus of this latest branch, help
improve end-users' hit ratio, which is the most important factor beyond
raw speed. It can also save them real money by reducing the size
requirements of their cluster.
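
For the pinning part, the mechanics are simple enough (a sketch using the GNU
pthread_setaffinity_np() extension; the function name here is made up for
illustration):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Bind one worker thread to one CPU so its connection state and memory
     * stay on the local NUMA node. */
    static void pin_worker_to_cpu(pthread_t worker, int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (pthread_setaffinity_np(worker, sizeof(set), &set) != 0)
            fprintf(stderr, "failed to pin worker to cpu %d\n", cpu);
    }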
