I'm still fumbling around in the dark a bit, but I might be getting closer. I'm now testing with Perl's Memcached::libmemcached and libmemcached 1.0.17.

So, I ran a new test where memcached is running in a different data center (over a 300Mbps link, IIRC). I'm seeing three different timeout situations, but the most common one is where poll() in libmemcached times out. When this happens I see this from the server:

    Failed to read, and not due to blocking:
    errno: 104 Connection reset by peer
    rcurr=7fa3f8088b38 ritem=7fa383f7cdf8 rbuf=7fa3f8088590 rlbytes=640186 rsize=8192

The timeouts happen in this libmemcached call:

    static memcached_return_t io_wait(memcached_instance_st* instance, const memc_read_or_write read_or_write)

and specifically, this call is returning zero:

    int active_fd= poll(&fds, 1, instance->root->poll_timeout);

With the poll timeout at 2000ms I see timeouts sometimes, at 1000ms quite often, and at 5000ms not at all. Again, it seems related to how much data I'm sending to Memcached: even 1000ms is fine for, say, 10K writes, but 2000ms isn't enough for 700K writes.

My guess is that we were using a shorter poll timeout, and as our usage grew, 2s is no longer enough. But 5s is a long time to wait, no?

Perhaps this is a question for the libmemcached list, but what's happening that makes two seconds too short? Does it make sense that the size of the data I'm storing would require a longer poll time?

Of the other two timeouts I'm seeing, one is where the socket connect() fails, and this seems related to how busy the client machine is.

The other is a timeout from libmemcached in:

    static memcached_return_t connect_poll(memcached_instance_st* server, const int connection_error)

in this poll call:

    if ((number_of= poll(fds, 1, server->root->connect_timeout)) == -1)

That call is returning zero, and connect_timeout is 4000ms. But that's a much rarer type of timeout.


On Tue, Oct 1, 2013 at 3:41 PM, Bill Moseley <mose...@hank.org> wrote:

> Just a quick follow up on this timeout issue. No solution yet -- but it
> sure seems like a client network issue.
>
> I have three servers on the same subnet.
> One is called "mem", where I'm running a single instance of Memcached.
> Then I have dev-1 and dev-2, each running mc_conn_tester.pl
> <http://consoleninja.net/code/memcached/mc_conn_tester.pl>.
> It's not reporting any timeouts on either machine.
>
> I then start another script on dev-1 that forks 30 processes, connects,
> then sends large set requests (almost 1MB in size) in a loop. This is
> supposed to emulate a busy forking web server, for example.
>
> Then I start seeing timeouts from mc_conn_tester.pl on the dev-1 machine
> but not on the dev-2 machine. Likewise, if I move the load generator to
> dev-2, then I see the timeouts on dev-2 and not on dev-1. Not a lot of
> timeouts in either case, but it's clear they happen where the load script
> is running.
>
> If the load generating script is changed to send much smaller data,
> the timeouts stop.
>
> That has me thinking this isn't a problem related to Memcached itself,
> but rather some network problem. The network is not close to saturation,
> so maybe a temporary buffer overrun. I've asked our network people to
> look into it.
>
> Agreed?
>
>
> On Tue, Sep 24, 2013 at 7:09 AM, Bill Moseley <mose...@hank.org> wrote:
>
>> I'm using the notes at https://code.google.com/p/memcached/wiki/Timeouts
>> to debug timeout errors against a single 1.4.4 Memcached server with
>> 8GB of RAM on CentOS 6.2 started with
>>
>>     memcached -d -p 11211 -u memcached -m 4096 -c 8192
>>
>> I could not get http://consoleninja.net/code/memcached/mc_conn_tester.pl
>> to issue a timeout running by itself.
>>
>> So I wrote another script using Perl's Memcached::libmemcached that
>> forked 20 or so processes and set ~1/2MB of data using keys generated by
>> Data::UUID. I didn't specify an expires time for these sets.
>>
>> I then started to see a few timeouts w/o connecting like in the examples:
>>
>>     Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
>>
>> I'm just starting to look at this now, but the network cards are not
>> showing errors or dropped packets. I couldn't get enough timeouts where
>> changing the timeout value made much difference.
>>
>> Anyone have any additional suggestions for debugging these?
>>
>>
>> And I assume unrelated to the timeout errors, but while testing I started
>> to get server errors on my script writing the large data to Memcached:
>>
>>     SERVER_ERROR out of memory storing object
>>
>> Are those failed malloc calls? I'm suspecting that this is related to my
>> old version of Memcached (per this thread):
>>
>> https://groups.google.com/forum/#!topic/memcached/QD7a-6JdqgA
>>
>> But, I just started up another instance of Memcached using the defaults
>> (-m 64) and cannot get it to fail with that error.
>>
>> The machine where I was getting the out of memory errors has plenty of
>> room:
>>
>> $ free
>>              total       used       free     shared    buffers     cached
>> Mem:       8059188    5006444    3052744          0     284740     215796
>> -/+ buffers/cache:    4505908    3553280
>> Swap:     10289144          0   10289144
>>
>> Any chance the timeouts are somehow related?
>>
>> --
>> Bill Moseley
>> mose...@hank.org
>
> --
> Bill Moseley
> mose...@hank.org

--
Bill Moseley
mose...@hank.org
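
P.S. As a rough illustration of the poll() question at the top of this message: the sketch below shows one generic way poll() can return zero while a client is trying to write. It is not libmemcached code, and poll_full_socket() is a hypothetical helper made up for this example. It assumes a local socketpair whose far end never reads, so the kernel send buffer fills up and poll() for POLLOUT simply waits out the timeout. Whether a full send buffer is what is actually happening with the large sets over the inter-data-center link is only a guess.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of libmemcached): fill a socket's send
 * buffer while the peer reads nothing, then poll() for writability.
 * Returns poll()'s result: 0 = timed out, 1 = writable, -1 = error. */
int poll_full_socket(int timeout_ms)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* Non-blocking, so write() fails with EAGAIN instead of hanging. */
    int flags = fcntl(sv[0], F_GETFL, 0);
    if (flags == -1 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Keep writing until the kernel buffer is full; sv[1] never reads. */
    char chunk[8192] = {0};
    ssize_t n;
    do {
        n = write(sv[0], chunk, sizeof chunk);
    } while (n > 0 || (n == -1 && errno == EINTR));

    int result = -1;
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* Same shape as the io_wait() call: one fd, wait for it to be ready. */
        struct pollfd fds = { .fd = sv[0], .events = POLLOUT, .revents = 0 };
        result = poll(&fds, 1, timeout_ms);
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

Calling poll_full_socket(100) from a small main() should wait roughly 100ms and then return 0, the same "nothing became ready before the timeout" result that io_wait() is getting. With a real server the buffer drains only as fast as the network and the peer allow, which would be one reason a larger payload needs a longer poll timeout.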