Hey,

No actually... in -t 1 mode the only producer/consumer hand-off is between
the accept thread and the worker thread. Once a connection is open, the
socket events are local to that thread. Persistent connections would remove
almost all of the overhead, aside from the futexes simply existing.
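
To make that concrete, here's roughly the shape of the hand-off I mean. This
is a sketch of the pattern with made-up names, not code lifted out of
memcached:

/* Sketch of the accept-thread -> worker hand-off described above.
 * Illustrative only: names and structure are invented, not memcached's. */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

struct conn_item {
    int sfd;                    /* freshly accepted socket */
    struct conn_item *next;
};

struct conn_queue {
    struct conn_item *head, *tail;
    pthread_mutex_t lock;       /* the futex-backed lock in question */
    int notify_send_fd;         /* write end of the worker's notify pipe */
};

/* Accept thread: queue the new fd, then poke the worker's event loop. */
void dispatch_new_conn(struct conn_queue *cq, int sfd)
{
    struct conn_item *item = malloc(sizeof(*item));
    item->sfd = sfd;
    item->next = NULL;

    pthread_mutex_lock(&cq->lock);      /* uncontended: stays in userspace */
    if (cq->tail)
        cq->tail->next = item;
    else
        cq->head = item;
    cq->tail = item;
    pthread_mutex_unlock(&cq->lock);

    write(cq->notify_send_fd, "c", 1);  /* wake the worker's event base */
}

/* Worker thread: runs when the notify pipe becomes readable. Once the
 * connection is registered with the worker's event base, all further
 * socket events stay local to this thread -- no more queueing. */
int take_new_conn(struct conn_queue *cq)
{
    struct conn_item *item;
    pthread_mutex_lock(&cq->lock);
    item = cq->head;
    if (item) {
        cq->head = item->next;
        if (!cq->head)
            cq->tail = NULL;
    }
    pthread_mutex_unlock(&cq->lock);
    if (!item)
        return -1;
    int sfd = item->sfd;
    free(item);
    return sfd;
}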

There's also work we're presently doing to scale that lock, but it's not
really necessary yet, since nobody actually hits that limit.

There is loss... there's actually a lot of loss to running multiple
instances:

- Management overhead of running multiple instances instead of one (yes, I
know, blah blah blah, go hire a junior guy and have him set it up right.
It's harder than it looks).

- Less efficient multigets. "Natural" multigets are more efficient if you
use fewer instances (there's some back-of-envelope math after this list).
Less natural multigets don't care nearly as much, but they can suffer if
you accidentally cluster too much data on a single (too small) instance. I
*have* seen this happen.

- Socket overhead. This is a big deal if you're running 8x or more
instances on a single box. Now, in order to fetch individual keys spread
about, both clients and servers need to manage *much, much* more socket
"crap". This includes the DMA allocation overhead I think you noted in
your paper. If you watch /proc/slabinfo while maintaining 100 sockets
versus 800, you'll see a large memory loss, and the kernel has to do a lot
more work to manage all of that extra shuffling (there's a sketch of a
slabinfo watcher after this list too). Memcached itself also loses memory
to connection states and buffers.

You can tune it down with various /proc knobs, but adding 8x+ connection
overhead is a very real loss.
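
Back-of-envelope for the multiget point: if keys hash uniformly, a k-key
multiget touches about n*(1 - (1 - 1/n)^k) of your n instances, and every
instance touched is its own round trip. Quick throwaway program to eyeball
it (pure illustration; assumes uniform hashing):

/* Expected fanout of a k-key multiget over n instances.
 * Compile with -lm. Illustration only. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const int k = 10; /* keys in a single multiget */
    for (int n = 1; n <= 64; n *= 2) {
        double touched = n * (1.0 - pow(1.0 - 1.0 / n, k));
        printf("%2d instances: ~%4.1f round trips for a %d-key multiget\n",
               n, touched, k);
    }
    return 0;
}

Same ten keys: one round trip against one instance, nine-ish against 64.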
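
And here's the sort of thing I mean by watching /proc/slabinfo -- a rough
sketch that sums up the TCP/socket slab caches. Slab cache names vary by
kernel version, and you'll need root; compare the totals with ~100 versus
~800 connections open:

/* Rough sketch: sum memory held by TCP/socket-related slab caches.
 * Name matching is approximate and kernel-dependent. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/slabinfo", "r");
    if (!f) {
        perror("fopen /proc/slabinfo (need root?)");
        return 1;
    }
    char line[512];
    unsigned long total = 0;
    while (fgets(line, sizeof(line), f)) {
        char name[64];
        unsigned long active, num, objsize;
        /* format: <name> <active_objs> <num_objs> <objsize> ... */
        if (sscanf(line, "%63s %lu %lu %lu",
                   name, &active, &num, &objsize) != 4)
            continue;
        if (strstr(name, "TCP") || strstr(name, "tcp_") ||
            strstr(name, "sock")) {
            printf("%-24s %8lu objs x %5lu bytes = %8lu KB\n",
                   name, num, objsize, num * objsize / 1024);
            total += num * objsize;
        }
    }
    fclose(f);
    printf("total: %lu KB\n", total / 1024);
    return 0;
}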

On the note of a performance test suite... I've been trying to fix one up
to be exactly that. I have Matt's "memcachetest" able to saturate memcached
instead of itself, and I've gotten the redis test to do some damage:
http://dormando.livejournal.com/525147.html

...but it needs more work over the next few weeks.

Also, finally, I'm asking about real, actual usage. Academically it's
interesting that memcached should be able to run 1 million+ sets per second
off of a 48-core box, but reality always *always* pins performance to
other areas.

Namely:

1) Memory overhead. The more crap you can stuff in memcached, the faster
your app goes per dollar.
2) Other very important scientific-style advancements in cache usage are
more fixated on the benefits shown from the binary protocol. Utilizing
multiset and asynchronous get/set stacking can shave real milliseconds off
of serving real responses.
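
By "stacking" I mean something like this: queue up a bunch of quiet GETQs
and terminate the pipeline with a NOOP, so N lookups cost one round trip.
The opcodes and header layout are from the binary protocol spec; everything
else here is a throwaway sketch, not how a real client library does it:

/* Sketch of binprot get stacking. Assumes the usual struct alignment
 * (sizeof == 24); real code would serialize field by field. */
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define OP_GETQ 0x09  /* quiet get: misses produce no response */
#define OP_NOOP 0x0a  /* forces the server to flush pending replies */

/* 24-byte binary protocol request header */
struct bin_header {
    uint8_t  magic;      /* 0x80 = request */
    uint8_t  opcode;
    uint16_t keylen;
    uint8_t  extlen;
    uint8_t  datatype;
    uint16_t reserved;
    uint32_t bodylen;
    uint32_t opaque;     /* echoed back: lets us match responses to keys */
    uint64_t cas;
};

/* Append one request to an output buffer; returns bytes written. */
static size_t pack_request(uint8_t *buf, uint8_t opcode,
                           const char *key, uint32_t opaque)
{
    uint16_t klen = key ? (uint16_t)strlen(key) : 0;
    struct bin_header h = {0};
    h.magic   = 0x80;
    h.opcode  = opcode;
    h.keylen  = htons(klen);
    h.bodylen = htonl(klen);  /* no extras for get/getq requests */
    h.opaque  = htonl(opaque);
    memcpy(buf, &h, sizeof(h));
    if (klen)
        memcpy(buf + sizeof(h), key, klen);
    return sizeof(h) + klen;
}

int main(void)
{
    const char *keys[] = { "user:1", "user:2", "user:3" };
    uint8_t buf[1024];
    size_t off = 0;

    for (uint32_t i = 0; i < 3; i++)
        off += pack_request(buf + off, OP_GETQ, keys[i], i);
    off += pack_request(buf + off, OP_NOOP, NULL, 99);

    /* write(sockfd, buf, off) would send all three lookups in one round
     * trip; hits come back tagged with their opaque values. */
    printf("3 gets + terminator packed into %zu bytes\n", off);
    return 0;
}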

Except people aren't using binprot, and they're finding bugs when they do
(it seems to generate more packets? But I haven't gotten far enough in
fixing the benchmarks to confirm that).

I wish we could focus over there for a while.

Btw, your paper was very good. There's no question that a lot of the
kernel-level fixes in there will have real-world benefits.

I just question the hyperfocus here on an aspect of a practical
application that is never the actual problem.

On Mon, 4 Oct 2010, Tudor Marian wrote:

> I agree that if you have futexes on your platform and you don't contend
> (i.e. don't even have to call into the kernel) the overhead is small(er);
> however, there is also the overhead between the producer-consumer
> contexts, namely the event base and the `-t 1' thread (again, unless I
> misread the code, in which case I apologize).
>
> I am not sure what paper you mean, but my thesis did deal with the
> scalability of sockets (raw sockets in particular) and how a
> single-producer multiple-consumer approach doesn't scale nearly as well
> as ``one instance per core.'' Sure enough, the details are more gory and
> I won't get into them.  I'll try to find some time to test and compare
> and get back to you with some numbers.  By the way, I do not see how
> there would be ``loss when running multiple instances'' since your
> traffic is disjoint, and if you do have ``loss'' then your OS kernel is
> broken.
>
> As a matter of fact, I do have a 10Gbps machine, actually two of them,
> each with a dual-socket Xeon X5570 and 2x Myri-10G NICs that I was
> planning on using for tests. Would you be so kind as to tell me if
> there's any standard performance test suite for memcached that is
> typically used? Or should I just write my own trivial client---in
> particular, as you mentioned, I am interested in the scalability of
> memcached (-t 4 versus proper singlethreaded/multi-process) with respect
> to the key and/or value size.
>
> Regards,
> T
>
> On Mon, Oct 4, 2010 at 4:21 PM, dormando <dorma...@rydia.net> wrote:
>       We took it out for a reason, + if you run with -t 1 you won't really see
>       contention. 'Cuz it's running single threaded and using futexes under
>       linux. Those don't have much of a performance hit until you do contend.
>
>       I know some paper just came out which showed people using multiple
>       memcached instances and scaling some kernel locks, along with the whole
>       redis "ONE INSTANCE PER CORE IS AWESOME GUYS" thing.
>
>       But I'd really love it if you would prove that this is better, and prove
>       that there is no loss when running multiple instances. This is all
>       conjecture.
>
>       I'm slowly chipping away at the 1.6 branch and some lock scaling
>       patches, which feels a lot more productive than anecdotally
>       naysaying progress.
>
>       memcached -t 4 will run 140,000 sets and 300,000+ gets per second
>       on a box of mine. An unrefined patch on an older version from trond
>       gets that to 400,000 sets and 630,000 gets. I expect to get that to
>       be a bit higher.
>
>       I assume you have some 10GE memcached instances pushing 5gbps+ of
>       traffic in order for this patch to be worth your time?
>
>       Or are all of your keys 1 byte and you're fetching 1 million of them per
>       second?
>
> On Mon, 4 Oct 2010, tudorica wrote:
>
> > The current memcached-1.4.5 version I downloaded appears to always be
> > built with multithreaded support (unless something subtle is happening
> > during configure that I haven't noticed).  Would it be OK if I
> > submitted a patch that allows a single-threaded memcached build? Here
> > is the rationale: instead of peppering the code with expensive user-
> > space locking and events (e.g. pthread_mutex_lock, and the producer-
> > consumers), why not just have the alternative to deploy N instances of
> > plain singlethreaded memcached distinct/isolated processes, where N is
> > the number of available CPUs (e.g. each instance on a different port)?
> > Each such memcached process will utilize 1/Nth of the memory that a
> > `memcached -t N' would have otherwise utilized, and there would be no
> > user-space locking (unlike when memcached is launched with `-t 1'),
> > i.e. all locking is performed by the in-kernel network stack when
> > traffic is demuxed onto the N sockets.  Sure, this would mean that the
> > clients will have to deal with more memcached instances (albeit
> > virtual), but my impression is that this is already the norm (see the
> > consistent hashing libraries like libketama), and proper hashing (in
> > the client) to choose the target memcached server (ip:port) is already
> > commonplace.  The only downside I can envision is clients utilizing
> > non-uniform hash functions to choose the target memcached server, but
> > that's their problem.
> >
> > Regards,
> > T
> >
