If the issue goes away at 3.5 seconds but not 2.5 seconds, you're dropping
SYN packets, though it sounds like you're polling for a conect in that
first case...

Have you carefully looked at your network settings for packet loss/etc? It
sounds like you're just losing packets on the network.

If memcached is responding to some requests fine but some requests time
out, it's always your network. Since it's an event based server it's not
blocking on anything that could cause requests to take any amount of time
to write on the network. If memcached is hanging, all requests to all
clients will stop (you can validate this by running it with -t 1, just a
single worker thread).

If it's some requests to some clients, you have packet loss or your
clients aren't being prioritized well, most likely.

On Wed, 2 Oct 2013, Bill Moseley wrote:

> I'm really bumbling around a bit in the dark, but might be getting closer.
> Now I'm testing with Perl's Memcached::libmemcached with libmemcached 1.0.17.
>
>
> So, I did a new tests where memcached is running in a different data center 
> (over a 300Mbps link, IIRC).
>
> I'm seeing three different timeout situations.   But, the most common one is 
> where poll in libmemcached is timing out.   When this happens I see this from 
> the server:
>
>       Failed to read, and not due to blocking:
> errno: 104 Connection reset by peer
> rcurr=7fa3f8088b38 ritem=7fa383f7cdf8 rbuf=7fa3f8088590 rlbytes=640186 
> rsize=8192
>
>
>
>
> The timeouts happen in this libmemcached call:
>
>       static memcached_return_t io_wait(memcached_instance_st* instance,
>                                   const memc_read_or_write read_or_write)
>
>
> and specifically what's happening is this is returning zero:
>
>     int active_fd= poll(&fds, 1, instance->root->poll_timeout);
>
>
> With poll timeout at 2000ms I see them sometimes.  At 1000ms quite often, and 
> 5000ms not at all.  Again, it seems related to how much data I'm sending to 
> Memcached.   So even 1000ms is fine for say 10K writes, but 2000ms isn't for 
> 700K writes.
>
> My guess, is we were using a shorter poll timeout and as our usage grew 2s is 
> no longer enough.   But, 5s is a long time to wait, no?
>
>
> Perhaps this is a question for the libmemcached list, but what's happening 
> where two seconds is too short?   Does that make sense that the size of the 
> data I'm storing would require a longer poll time?
>
>
> The other two timeouts I'm seeing is one where the socket connect() fails -- 
> and this seems related to how busy the client server is. I'm also seeing 
> another timeout from libmemcache in:
>
>       static memcached_return_t connect_poll(memcached_instance_st* server, 
> const int connection_error)
>
>
> and also in this poll call:
>
>     if ((number_of= poll(fds, 1, server->root->connect_timeout)) == -1)
>
> That is returning zero and connect_timeout is 4000ms.   But, that's a much 
> more rare type of timeout.
>
>
>
>
> On Tue, Oct 1, 2013 at 3:41 PM, Bill Moseley <mose...@hank.org> wrote:
>       Just a quick follow up on this timeout issue.   No solution yet -- but 
> sure seems like a client network issue.
> I have three servers on the same subnet.   One called "mem" where I'm running 
> a single instance of Memcached.  Then I have dev-1 and dev-2 with each 
> running mc_conn_tester.pl.  It's not reporting any timeouts on either machine.
>
> I then start another script on dev-1 that forks 30 processes, connects, then 
> sends large set requests (almost 1MB in size) in a loop. This is suppose to 
> emulate a busy forking web server, for example.
>
> Then I start seeing timeouts from mc_conn_tester.pl on the dev-1 machine but 
> not the dev-2 machine.  And likewise, if I move the load generator to dev-2 
> then I see the timeouts on dev-2 not on dev-1.   Not a lot of timeouts in 
> either case, but it's clear it
> happens where the load script is running.
>
> If the load generating script is changed to send much smaller data size then 
> the timeouts stop.
>
> That has me thinking this isn't a problem related to Memcached itself, rather 
> some network problem.   The network is not close to saturation so maybe a 
> temporary buffer overrun.   I've asked our network people to look into it.
>
> Agreed?
>
>
>
>
>
>
> On Tue, Sep 24, 2013 at 7:09 AM, Bill Moseley <mose...@hank.org> wrote:
>       I'm using the notes at https://code.google.com/p/memcached/wiki/Timeouts
> to debug timeout errors against a single 1.4.4 Memcached server with 8GB of 
> RAM on CentOS 6.2 started with
>
>       memcached -d -p 11211 -u memcached -m 4096 -c 8192
>
>
> I could not get http://consoleninja.net/code/memcached/mc_conn_tester.pl to 
> issue a timeout running by itself.
>
> So I wrote another script using Perl's Memcached::libmemcached that forked 20 
> or so processes and set ~1/2MB of data using keys generated by Data::UUID.  I 
> didn't specify an expires time for these sets.
>
> I then started to see a few timeouts w/o connecting like in the examples:
>
>
> Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.00000000) (set: 0.00000000) 
> (get: 0.00000000)
> I'm just starting to look at this now, but the network cards are not showing 
> errors or dropped packets.  I couldn't get enough timeouts where changing the 
> timeout value made much difference.
>
> Anyone have any additional suggestions for debugging these?
>
>
> And I assume unrelated to the timeout errors, but while testing I started to 
> get server errors on my script writing the large data to Memcached:
>
>       SERVER_ERROR out of memory storing object
>
>
> Are those failed malloc calls?  I'm suspecting that this is related to my old 
> version of Memcached (per this thread):
>
> https://groups.google.com/forum/#!topic/memcached/QD7a-6JdqgA
>
> But, I just started up another instance of Memcached using the defaults (-m 
> 64) and cannot get it to fail with that error.
>
> The machine where I was getting the out of memory errors has plenty of room:
>
>  $ free
>              total       used       free     shared    buffers     cached
> Mem:       8059188    5006444    3052744          0     284740     215796
> -/+ buffers/cache:    4505908    3553280
> Swap:     10289144          0   10289144
>
> Any chance the timeouts are somehow related?
>
> --
> Bill Moseley
> mose...@hank.org
>
>
>
>
> --
> Bill Moseley
> mose...@hank.org
>
>
>
>
> --
> Bill Moseley
> mose...@hank.org
>
> --
>  
> ---
> You received this message because you are subscribed to the Google Groups 
> "memcached" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to memcached+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to memcached+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Reply via email to