Re: Timeouts (and out of memory)

2013-10-01 Thread Bill Moseley
Just a quick follow up on this timeout issue.   No solution yet -- but sure
seems like a client network issue.

I have three servers on the same subnet.   One called "mem" where I'm
running a single instance of Memcached.  Then I have dev-1 and dev-2 with
each running 
mc_conn_tester.pl.
 It's not reporting any timeouts on either machine.

I then start another script on dev-1 that forks 30 processes, connects,
then sends large set requests (almost 1MB in size) in a loop. This is
suppose to emulate a busy forking web server, for example.

Then I start seeing timeouts from mc_conn_tester.pl on the dev-1 machine
but not the dev-2 machine.  And likewise, if I move the load generator to
dev-2 then I see the timeouts on dev-2 not on dev-1.   Not a lot of
timeouts in either case, but it's clear it happens where the load script is
running.

If the load generating script is changed to send much smaller data size
then the timeouts stop.

That has me thinking this isn't a problem related to Memcached itself,
rather some network problem.   The network is not close to saturation so
maybe a temporary buffer overrun.   I've asked our network people to look
into it.

Agreed?






On Tue, Sep 24, 2013 at 7:09 AM, Bill Moseley  wrote:

> I'm using the notes at https://code.google.com/p/memcached/wiki/Timeouts
>  to debug timeout errors against a single 1.4.4 Memcached server with 8GB
> of RAM on CentOS 6.2 started with
>
> memcached -d -p 11211 -u memcached -m 4096 -c 8192
>
>
> I could not get http://consoleninja.net/code/memcached/mc_conn_tester.pl to
> issue a timeout running by itself.
>
> So I wrote another script using Perl's Memcached::libmemcached that forked
> 20 or so processes and set ~1/2MB of data using keys generated by
> Data::UUID.  I didn't specify an expires time for these sets.
>
> I then started to see a few timeouts w/o connecting like in the examples:
>
> Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.) (set: 0.) 
> (get: 0.)
>
> I'm just starting to look at this now, but the network cards are not
> showing errors or dropped packets.  I couldn't get enough timeouts where
> changing the timeout value made much difference.
>
> Anyone have any additional suggestions for debugging these?
>
>
> And I assume unrelated to the timeout errors, but while testing I started
> to get server errors on my script writing the large data to Memcached:
>
> SERVER_ERROR out of memory storing object
>
>
> Are those failed malloc calls?  I'm suspecting that this is related to my
> old version of Memcached (per this thread):
>
> https://groups.google.com/forum/#!topic/memcached/QD7a-6JdqgA
>
> But, I just started up another instance of Memcached using the defaults
> (-m 64) and cannot get it to fail with that error.
>
> The machine where I was getting the out of memory errors has plenty of
> room:
>
>  $ free
>  total   used   free sharedbuffers cached
> Mem:   805918850064443052744  0 284740 215796
> -/+ buffers/cache:45059083553280
> Swap: 10289144  0   10289144
>
> Any chance the timeouts are somehow related?
>
> --
> Bill Moseley
> mose...@hank.org
>



-- 
Bill Moseley
mose...@hank.org

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to memcached+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Timeouts (and out of memory)

2013-10-02 Thread Bill Moseley
I'm really bumbling around a bit in the dark, but might be getting closer.

Now I'm testing with Perl's Memcached::libmemcached with libmemcached
1.0.17.


So, I did a new tests where memcached is running in a different data center
(over a 300Mbps link, IIRC).

I'm seeing three different timeout situations.   But, the most common one
is where poll in libmemcached is timing out.   When this happens I see this
from the server:

Failed to read, and not due to blocking:
errno: 104 Connection reset by peer
rcurr=7fa3f8088b38 ritem=7fa383f7cdf8 rbuf=7fa3f8088590 rlbytes=640186
rsize=8192




The timeouts happen in this libmemcached call:

static memcached_return_t io_wait(memcached_instance_st* instance,
  const memc_read_or_write read_or_write)


and specifically what's happening is this is returning zero:

int active_fd= poll(&fds, 1, instance->root->poll_timeout);


With poll timeout at 2000ms I see them sometimes.  At 1000ms quite often,
and 5000ms not at all.  Again, it seems related to how much data I'm
sending to Memcached.   So even 1000ms is fine for say 10K writes, but
2000ms isn't for 700K writes.

My guess, is we were using a shorter poll timeout and as our usage grew 2s
is no longer enough.   But, 5s is a long time to wait, no?


Perhaps this is a question for the libmemcached list, but what's happening
where two seconds is too short?   Does that make sense that the size of the
data I'm storing would require a longer poll time?


The other two timeouts I'm seeing is one where the socket connect() fails
-- and this seems related to how busy the client server is. I'm also seeing
another timeout from libmemcache in:

static memcached_return_t connect_poll(memcached_instance_st* server, const
int connection_error)


and also in this poll call:

if ((number_of= poll(fds, 1, server->root->connect_timeout)) == -1)

That is returning zero and connect_timeout is 4000ms.   But, that's a much
more rare type of timeout.




On Tue, Oct 1, 2013 at 3:41 PM, Bill Moseley  wrote:

> Just a quick follow up on this timeout issue.   No solution yet -- but
> sure seems like a client network issue.
>
> I have three servers on the same subnet.   One called "mem" where I'm
> running a single instance of Memcached.  Then I have dev-1 and dev-2 with
> each running 
> mc_conn_tester.pl.
>  It's not reporting any timeouts on either machine.
>
> I then start another script on dev-1 that forks 30 processes, connects,
> then sends large set requests (almost 1MB in size) in a loop. This is
> suppose to emulate a busy forking web server, for example.
>
> Then I start seeing timeouts from mc_conn_tester.pl on the dev-1 machine
> but not the dev-2 machine.  And likewise, if I move the load generator to
> dev-2 then I see the timeouts on dev-2 not on dev-1.   Not a lot of
> timeouts in either case, but it's clear it happens where the load script is
> running.
>
> If the load generating script is changed to send much smaller data size
> then the timeouts stop.
>
> That has me thinking this isn't a problem related to Memcached itself,
> rather some network problem.   The network is not close to saturation so
> maybe a temporary buffer overrun.   I've asked our network people to look
> into it.
>
> Agreed?
>
>
>
>
>
>
> On Tue, Sep 24, 2013 at 7:09 AM, Bill Moseley  wrote:
>
>> I'm using the notes at https://code.google.com/p/memcached/wiki/Timeouts
>>  to debug timeout errors against a single 1.4.4 Memcached server with
>> 8GB of RAM on CentOS 6.2 started with
>>
>> memcached -d -p 11211 -u memcached -m 4096 -c 8192
>>
>>
>> I could not get http://consoleninja.net/code/memcached/mc_conn_tester.pl to
>> issue a timeout running by itself.
>>
>> So I wrote another script using Perl's Memcached::libmemcached that
>> forked 20 or so processes and set ~1/2MB of data using keys generated by
>> Data::UUID.  I didn't specify an expires time for these sets.
>>
>> I then started to see a few timeouts w/o connecting like in the examples:
>>
>> Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.) (set: 
>> 0.) (get: 0.)
>>
>> I'm just starting to look at this now, but the network cards are not
>> showing errors or dropped packets.  I couldn't get enough timeouts where
>> changing the timeout value made much difference.
>>
>> Anyone have any additional suggestions for debugging these?
>>
>>
>> And I assume unrelated to the timeout errors, but while testing I started
>> to get server errors on my script writing the large data to Memcached:
>>
>> SERVER_ERROR out of memory storing object
>>
>>
>> Are those failed malloc calls?  I'm suspecting that this is related to my
>> old version of Memcached (per this thread):
>>
>> https://groups.google.com/forum/#!topic/memcached/QD7a-6JdqgA
>>
>> But, I just started up another instance of Memcached using the defaults
>> (-m 64) and cannot get it to fail with that error.
>>
>> The machine whe

Re: Timeouts (and out of memory)

2013-10-02 Thread dormando
If the issue goes away at 3.5 seconds but not 2.5 seconds, you're dropping
SYN packets, though it sounds like you're polling for a conect in that
first case...

Have you carefully looked at your network settings for packet loss/etc? It
sounds like you're just losing packets on the network.

If memcached is responding to some requests fine but some requests time
out, it's always your network. Since it's an event based server it's not
blocking on anything that could cause requests to take any amount of time
to write on the network. If memcached is hanging, all requests to all
clients will stop (you can validate this by running it with -t 1, just a
single worker thread).

If it's some requests to some clients, you have packet loss or your
clients aren't being prioritized well, most likely.

On Wed, 2 Oct 2013, Bill Moseley wrote:

> I'm really bumbling around a bit in the dark, but might be getting closer.
> Now I'm testing with Perl's Memcached::libmemcached with libmemcached 1.0.17.
>
>
> So, I did a new tests where memcached is running in a different data center 
> (over a 300Mbps link, IIRC).
>
> I'm seeing three different timeout situations.   But, the most common one is 
> where poll in libmemcached is timing out.   When this happens I see this from 
> the server:
>
>   Failed to read, and not due to blocking:
> errno: 104 Connection reset by peer
> rcurr=7fa3f8088b38 ritem=7fa383f7cdf8 rbuf=7fa3f8088590 rlbytes=640186 
> rsize=8192
>
>
>
>
> The timeouts happen in this libmemcached call:
>
>   static memcached_return_t io_wait(memcached_instance_st* instance,
>                                   const memc_read_or_write read_or_write)
>
>
> and specifically what's happening is this is returning zero:
>
>     int active_fd= poll(&fds, 1, instance->root->poll_timeout);
>
>
> With poll timeout at 2000ms I see them sometimes.  At 1000ms quite often, and 
> 5000ms not at all.  Again, it seems related to how much data I'm sending to 
> Memcached.   So even 1000ms is fine for say 10K writes, but 2000ms isn't for 
> 700K writes.
>
> My guess, is we were using a shorter poll timeout and as our usage grew 2s is 
> no longer enough.   But, 5s is a long time to wait, no?
>
>
> Perhaps this is a question for the libmemcached list, but what's happening 
> where two seconds is too short?   Does that make sense that the size of the 
> data I'm storing would require a longer poll time?
>
>
> The other two timeouts I'm seeing is one where the socket connect() fails -- 
> and this seems related to how busy the client server is. I'm also seeing 
> another timeout from libmemcache in:
>
>   static memcached_return_t connect_poll(memcached_instance_st* server, 
> const int connection_error)
>
>
> and also in this poll call:
>
>     if ((number_of= poll(fds, 1, server->root->connect_timeout)) == -1)
>
> That is returning zero and connect_timeout is 4000ms.   But, that's a much 
> more rare type of timeout.
>
>
>
>
> On Tue, Oct 1, 2013 at 3:41 PM, Bill Moseley  wrote:
>   Just a quick follow up on this timeout issue.   No solution yet -- but 
> sure seems like a client network issue.
> I have three servers on the same subnet.   One called "mem" where I'm running 
> a single instance of Memcached.  Then I have dev-1 and dev-2 with each 
> running mc_conn_tester.pl.  It's not reporting any timeouts on either machine.
>
> I then start another script on dev-1 that forks 30 processes, connects, then 
> sends large set requests (almost 1MB in size) in a loop. This is suppose to 
> emulate a busy forking web server, for example.
>
> Then I start seeing timeouts from mc_conn_tester.pl on the dev-1 machine but 
> not the dev-2 machine.  And likewise, if I move the load generator to dev-2 
> then I see the timeouts on dev-2 not on dev-1.   Not a lot of timeouts in 
> either case, but it's clear it
> happens where the load script is running.
>
> If the load generating script is changed to send much smaller data size then 
> the timeouts stop.
>
> That has me thinking this isn't a problem related to Memcached itself, rather 
> some network problem.   The network is not close to saturation so maybe a 
> temporary buffer overrun.   I've asked our network people to look into it.
>
> Agreed?
>
>
>
>
>
>
> On Tue, Sep 24, 2013 at 7:09 AM, Bill Moseley  wrote:
>   I'm using the notes at https://code.google.com/p/memcached/wiki/Timeouts
> to debug timeout errors against a single 1.4.4 Memcached server with 8GB of 
> RAM on CentOS 6.2 started with
>
>   memcached -d -p 11211 -u memcached -m 4096 -c 8192
>
>
> I could not get http://consoleninja.net/code/memcached/mc_conn_tester.pl to 
> issue a timeout running by itself.
>
> So I wrote another script using Perl's Memcached::libmemcached that forked 20 
> or so processes and set ~1/2MB of data using keys generated by Data::UUID.  I 
> didn't specify an expires time for these sets.
>
> I then started to see a few timeouts w/o connecting like in the examples:
>
>
> Fail: (ti

Re: Timeouts (and out of memory)

2013-10-03 Thread Bill Moseley
On Wed, Oct 2, 2013 at 2:44 PM, dormando  wrote:

> If the issue goes away at 3.5 seconds but not 2.5 seconds, you're dropping
> SYN packets, though it sounds like you're polling for a conect in that
> first case...
>

Dropping SYN packets would be during connect, right?   For
mc_conn_tester.plthat might be the case because those timeouts seems
to be on the connect()
call, plus I see more timeouts on the client machine that is also
generating the most traffic to the Memcached server.


But, in my later test using libmemcached it's timing out during set calls
and when I see the timeout I also see on the *server*

Failed to read, and not due to blocking:
errno: 104 Connection reset by peer
rcurr=7fa3f8088b38 ritem=7fa383f7cdf8 rbuf=7fa3f8088590 rlbytes=640186
rsize=8192

Dropping SYN packets would be before the connection was created and here is
seems there's already a connection.

Does libmemcached close a connection if it sees error?   That is, could it
be that the poll() times out on libmemcached and then it closes the
connection and then memcached issues that message?


Have you carefully looked at your network settings for packet loss/etc? It
> sounds like you're just losing packets on the network.
>

One of our network engineers has been looking carefully at the network
stack and even adjusting some buffer sizes.   But, nothing has jumped out
as a problem yet.   But, continuing to look at that.


>
> If memcached is responding to some requests fine but some requests time
> out, it's always your network. Since it's an event based server it's not
> blocking on anything that could cause requests to take any amount of time
> to write on the network. If memcached is hanging, all requests to all
> clients will stop (you can validate this by running it with -t 1, just a
> single worker thread).
>

On average the set calls are taking about .3 seconds for about a 700K write
(when testing with Memcahed in a different data center).  And on timeout,
well, it takes the timeout amount of time (slightly over a second when
timeout is 1000ms).

I do agree, sure seems like a network issue.


In general, how do you determine if Memcached is "too busy"?   Is there a
good measure?  We do have quite a few client servers and only one Memcached
sever.

$ memcached-tool mem1 stats
#mem1:11211Field   Value
 accepting_conns   1
   auth_cmds   0
 auth_errors   0
   bytes  6233635236
  bytes_read 194386466846632
   bytes_written 217827636391684
  cas_badval   0
cas_hits   0
  cas_misses   0
   cmd_flush   0
 cmd_get 43101326351
 cmd_set 18392114626
 conn_yields  58
*   connection_structures3924*
*curr_connections1957*
  curr_items41147780
   decr_hits   0
 decr_misses   0
 delete_hits   368818846
   delete_misses14333688
   evictions  1769516705
get_hits 40831898577
  get_misses  2269427774
   incr_hits   0
 incr_misses   0
  limit_maxbytes  8589934592
 listen_disabled_num   0
 pid   15407
pointer_size  64
   reclaimed  4247869349
   rusage_system 1560721.515265
 rusage_user 685338.331737
 threads   4
time  1380807067
   total_connections   126605752
 total_items  1212245441
  uptime66589667
 version   1.4.5



>
> If it's some requests to some clients, you have packet loss or your
> clients aren't being prioritized well, most likely.
>
> On Wed, 2 Oct 2013, Bill Moseley wrote:
>
> > I'm really bumbling around a bit in the dark, but might be getting
> closer.
> > Now I'm testing with Perl's Memcached::libmemcached with libmemcached
> 1.0.17.
> >
> >
> > So, I did a new tests where memcached is running in a different data
> center (over a 300Mbps link, IIRC).
> >
> > I'm seeing three different timeout situations.   But, the most common
> one is where poll in libmemcached is timing out.   When this happens I see
> this from the server:
> >
> >   Failed to read, and not due to blocking:
> > errno: 104 Connection reset by peer
> > rcurr=7fa3f8088b38 ritem=7fa383f7cdf8 rbuf=7fa3f8088590 rlbytes=640186
> rsize=8192
> >
> >
> >
> >
> > The timeouts happen in this libmemcached call:
> >
> >   static memcached_return_t io_wait(memcached_instance_st* instance,
> >   const memc_read_or_write read_or_write)
> >
> >
> > and specifically what's happening is this is returning zero:
> >
> > int active_fd= poll(&fds, 1, instance->root->poll_timeout);
> >
> >
> > With poll timeout at 2000ms I see