Troubleshooting client timeouts

2010-10-04 Thread dormando
Hey,

A common issue folks have is clients giving inexplicable timeout
errors. You ask the client "what were you doing? why do you hate me so?"
but it won't answer you.

http://code.google.com/p/memcached/wiki/Timeouts

I wrote a little utility while helping a friend diagnose similar issues.
So here it is (public domain) along with a wiki page on how to use it.
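
(A minimal sketch of the kind of probe such a utility might implement;
this is not the linked tool, and the host, port, and threshold below
are made up. It times gets over the text protocol and flags responses
slower than a typical client-side timeout.)

  # Minimal latency probe (a sketch, not the linked utility): time gets
  # over the text protocol and flag anything slower than a client-style
  # timeout. Host, port, and threshold are illustrative.
  import socket, time

  HOST, PORT, TIMEOUT_MS = "127.0.0.1", 11211, 100

  def probe(n=1000):
      s = socket.create_connection((HOST, PORT))
      f = s.makefile("rb")
      for i in range(n):
          start = time.time()
          s.sendall(b"get probe:%d\r\n" % i)
          while f.readline() != b"END\r\n":  # drain VALUE/data lines, if any
              pass
          ms = (time.time() - start) * 1000
          if ms > TIMEOUT_MS:
              print("slow response: %.1fms on request %d" % (ms, i))
      s.close()

  if __name__ == "__main__":
      probe()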

-Dormando




Re: memcached-1.4.5 without multithread support (or with `-t 0')

2010-10-04 Thread dormando
We took it out for a reason, + if you run with -t 1 you won't really see
contention. 'Cuz it's running single threaded and using futexes under
linux. Those don't have much of a performance hit until you do contend.

I know some paper just came out which showed people using multiple
memcached instances and scaling some kernel locks, along with the whole
redis ONE INSTANCE PER CORE IS AWESOME GUYS thing.

But I'd really love it if you would prove that this is better, and prove
that there is no loss when running multiple instances. This is all
conjecture.

I'm slowly chipping away at the 1.6 branch and some lock scaling patches,
which feels a lot more productive than anecdotally naysaying progress.

memcached -t 4 will run 140,000 sets and 300,000+ gets per second on a box
of mine. An unrefined patch on an older version from trond gets that to
400,000 sets and 630,000 gets. I expect to get that to be a bit higher.

I assume you have some 10GE memcached instances pushing 5gbps+ of traffic
in order for this patch to be worth your time?

Or are all of your keys 1 byte and you're fetching 1 million of them per
second?

On Mon, 4 Oct 2010, tudorica wrote:

 The current memcached-1.4.5 version I downloaded appears to always be
 built with multithreaded support (unless something subtle is happening
 during configure that I haven't noticed).  Would it be OK if I
 submitted a patch that allows a single-threaded memcached build? Here
 is the rationale: instead of peppering the code with expensive user-
 space locking and events (e.g. pthread_mutex_lock and the producer-
 consumer handoff), why not have the alternative of deploying N
 distinct/isolated plain single-threaded memcached processes, where N
 is the number of available CPUs (e.g. each instance on a different
 port)?  Each such memcached process would utilize 1/Nth of the memory
 that a `memcached -t N' would have otherwise utilized, and there would
 be no user-space locking (unlike when memcached is launched with
 `-t 1'), i.e. all locking is performed by the in-kernel network stack
 when traffic is demuxed onto the N sockets.  Sure, this would mean
 that clients have to deal with more memcached instances (albeit
 virtual), but my impression is that this is already the norm (see
 consistent hashing libraries like libketama), and proper hashing (in
 the client) to choose the target memcached server (ip:port) is already
 commonplace.  The only downside I can envision is clients utilizing
 non-uniform hash functions to choose the target memcached server, but
 that's their problem.

 Regards,
 T
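
(For illustration, the client-side selection described above reduced to
a minimal sketch: a plain modulo scheme over hypothetical local
instances. Real clients use libketama-style consistent hashing, since
modulo remaps most keys when N changes.)

  # Sketch: map each key onto one of N single-threaded instances (one
  # per core, each on its own port). Modulo hashing for brevity only;
  # consistent hashing avoids remapping most keys when N changes.
  import zlib

  SERVERS = [("127.0.0.1", 11211 + i) for i in range(4)]  # hypothetical N=4

  def pick_server(key):
      return SERVERS[zlib.crc32(key.encode()) % len(SERVERS)]

  print(pick_server("user:1234"))  # -> a stable (ip, port) for this key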



Re: memcached-1.4.5 without multithread support (or with `-t 0')

2010-10-04 Thread Tudor Marian
I agree that if you have futexes on your platform and you don't contend
(i.e. don't even have to call into the kernel) the overhead is small(er);
however, there is also the overhead between the producer-consumer contexts,
namely the event base and the `-t 1' thread (again, unless I misread the
code, in which case I apologize).

I am not sure which paper you mean, but my thesis did deal with the
scalability of sockets (raw sockets in particular) and how a single-producer
multiple-consumer approach doesn't scale nearly as well as a ``one instance
per core'' design. Sure enough, the details are gorier and I won't get into
them.  I'll try to find some time to test and compare and get back to you
with some numbers.  By the way, I do not see how there would be ``loss when
running multiple instances'' since your traffic is disjoint, and if you do
have ``loss'' then your OS kernel is broken.

As a matter of fact, I do have a 10Gbps machine, actually two of them, each
with a dual-socket Xeon X5570 and 2x Myri-10G NICs that I was planning on
using for tests. Would you be so kind as to tell me if there's any standard
performance test suite for memcached that is typically used? Or should I
just write my own trivial client? In particular, as you mentioned, I am
interested in the scalability of memcached (-t 4 versus proper
singlethreaded/multi-process) with respect to the key and/or value size.

Regards,
T
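
(A trivial client along those lines, as a sketch: raw sockets and the
text protocol, timing sets then gets. Host, port, request count, and
value size below are placeholders to vary.)

  # Throughput sketch, not a standard suite: N sets then N gets of a
  # fixed-size value over one connection. Single-connection numbers are
  # latency-bound; run many of these in parallel to saturate the server.
  import socket, time

  HOST, PORT = "127.0.0.1", 11211   # assumed local test instance
  N, VALUE = 100000, b"x" * 100     # vary these for key/value scaling

  def bench():
      s = socket.create_connection((HOST, PORT))
      f = s.makefile("rb")
      start = time.time()
      for i in range(N):
          s.sendall(b"set k%d 0 0 %d\r\n%s\r\n" % (i, len(VALUE), VALUE))
          assert f.readline() == b"STORED\r\n"
      print("sets/sec: %.0f" % (N / (time.time() - start)))
      start = time.time()
      for i in range(N):
          s.sendall(b"get k%d\r\n" % i)
          while f.readline() != b"END\r\n":  # VALUE line, data, then END
              pass
      print("gets/sec: %.0f" % (N / (time.time() - start)))
      s.close()

  if __name__ == "__main__":
      bench()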

On Mon, Oct 4, 2010 at 4:21 PM, dormando dorma...@rydia.net wrote:

 [quoted message snipped]


Re: memcached-1.4.5 without multithread support (or with `-t 0')

2010-10-04 Thread dormando
Hey,

No, actually... in -t 1 mode the only producer/consumer handoff is between
the accept thread and the worker thread. Once a connection is open, the
socket events are local to that thread. Persistent connections would remove
almost all of the overhead aside from the futexes existing.

There's also work we're presently doing to scale that lock, but it's not
really necessary as nobody hits that limit.
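
(To picture that: a toy model of the handoff, not memcached's actual
code. The accept thread is the sole producer; each accepted connection
is handed off once, and every later socket event stays on the worker.)

  # Toy model of the accept/worker handoff (not memcached's code): one
  # queue hit per new connection, then all I/O stays on the worker.
  import socket, threading, queue

  handoff = queue.Queue()  # the only producer/consumer point

  def worker():
      while True:
          conn = handoff.get()  # one synchronized handoff per connection
          # ...from here on, all events for `conn` are local to this thread
          conn.sendall(b"SERVER_ERROR toy model\r\n")
          conn.close()

  threading.Thread(target=worker, daemon=True).start()
  listener = socket.socket()
  listener.bind(("127.0.0.1", 11311))  # arbitrary demo port
  listener.listen(128)
  while True:
      conn, _ = listener.accept()      # accept thread: the producer
      handoff.put(conn)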

There is loss... There's a lot of loss to running multiple instances:

- Management overhead of running multiple instances instead of one (yes,
I know: blah blah blah, go hire a junior guy and have him set it up right.
It's harder than it looks).

- Less efficient multigets. Natural multigets are more efficient if you
use fewer instances. Less natural multigets don't care nearly as much, but
can suffer if you accidentally cluster too much data on a single (too
small) instance. I *have* seen this happen (see the sketch below).

- Socket overhead. This is a big deal if you're running 8x or more
instances on a single box. Now in order to fetch individual keys spread
about, both clients and servers need to manage *much, much* more socket
crap. This includes the DMA allocation overhead I think you noted in
your paper. If you watch /proc/slabinfo between maintaining 100 sockets or
800 sockets you'll see that there's a large memory loss, and the kernel
has to do a lot more work to manage all of that extra shuffling. Memcached
itself will lose memory to connection states and buffers.

You can tune it down with various /proc knobs but adding 8x+ connection
overhead is a very real loss.
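
(Concretely, on the multiget point: a sketch, assuming a plain modulo
key-to-server mapping, of how one natural multiget turns into several
smaller round trips as the instance count grows.)

  # Sketch: fan one logical multiget out across instances. One instance
  # means a single "get k1 k2 ... kM" round trip; N instances mean up
  # to N round trips plus N sets of per-connection socket state.
  import zlib

  def fan_out(keys, servers):
      batches = {}
      for k in keys:
          server = servers[zlib.crc32(k.encode()) % len(servers)]
          batches.setdefault(server, []).append(k)
      return batches  # one "get <keys...>" request per entry

  keys = ["page:42:frag:%d" % i for i in range(20)]
  print(len(fan_out(keys, ["s1"])))                         # 1 round trip
  print(len(fan_out(keys, ["s%d" % i for i in range(8)])))  # up to 8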

On the note of a performance test suite... I've been trying to fix one up
to be that sort of thing. I have Matt's memcachetest able to saturate
memcached instead of itself, and I've gotten the redis test to do some
damage: http://dormando.livejournal.com/525147.html

...but it needs more work over the next few weeks.

Also, finally, I'm asking about real actual usage. Academically it's
interesting that memcached should be able to run 1 million+ sets per second
off of a 48-core box, but reality always *always* pins performance to
other areas.

Namely:

1) Memory overhead. The more crap you can stuff in memcached, the faster
your app goes per dollar.
2) Other very important scientific-style advancements in cache usage are
more fixated on benefits shown from the binary protocol. Utilizing
multiset and asynchronous get/set stacking can shave real milliseconds off
of serving real responses.

Except people aren't using binprot, and they're finding bugs when they do
(it seems to generate more packets? But I haven't gotten that far yet in
fixing the benchmarks).

I wish we could focus over there for a while.
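
(The stacking idea in miniature, as a sketch over the plain text
protocol: put many gets on the wire before reading any reply, so the
round trip is paid once rather than per key. The binary protocol makes
the same pattern explicit and cheaper.)

  # Pipelining sketch: send all gets first, then drain replies in order.
  # Each reply is either "VALUE <key> <flags> <bytes>" + data + "END"
  # (hit) or a bare "END" (miss). keys: list of bytes, e.g. [b"k1"].
  def pipelined_get(sock, keys):
      sock.sendall(b"".join(b"get %s\r\n" % k for k in keys))
      f = sock.makefile("rb")
      results = {}
      for _ in keys:
          line = f.readline()
          if line.startswith(b"VALUE"):
              _, key, _, nbytes = line.split()
              results[key] = f.read(int(nbytes) + 2)[:-2]  # strip \r\n
              f.readline()  # consume the trailing END
      return results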

Btw, your paper was very good. There's no question that a lot of the
kernel-level fixes in there will have real world benefits.

I just question the hyperfocus here on an aspect of a practical
application that is never the actual problem.

On Mon, 4 Oct 2010, Tudor Marian wrote:

 [quoted message snipped]