I'm really bumbling around a bit in the dark, but might be getting closer.

Now I'm testing with Perl's Memcached::libmemcached with libmemcached
1.0.17.


So, I ran a new test where memcached is running in a different data center
(over a 300Mbps link, IIRC).

I'm seeing three different timeout situations. The most common one is
where poll() in libmemcached times out. When this happens I see this
from the server:

Failed to read, and not due to blocking:
errno: 104 Connection reset by peer
rcurr=7fa3f8088b38 ritem=7fa383f7cdf8 rbuf=7fa3f8088590 rlbytes=640186
rsize=8192




The timeouts happen in this libmemcached call:

static memcached_return_t io_wait(memcached_instance_st* instance,
                                  const memc_read_or_write read_or_write)


and specifically what's happening is this is returning zero:

    int active_fd= poll(&fds, 1, instance->root->poll_timeout);


With poll timeout at 2000ms I see them sometimes.  At 1000ms quite often,
and 5000ms not at all.  Again, it seems related to how much data I'm
sending to Memcached.   So even 1000ms is fine for say 10K writes, but
2000ms isn't for 700K writes.
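
To illustrate what a zero return from poll() looks like on the write side,
here is a minimal Python sketch (not libmemcached code). It assumes a
Unix-like system where select.poll is available, and it uses a local
socketpair as a stand-in for the client/server connection. The function
names are made up for the example. Once the kernel send buffer is full and
the peer is not reading, waiting for writability times out, which
corresponds to poll() returning 0 in C.

```python
import select
import socket


def fill_send_buffer(sock):
    """Write to a non-blocking socket until the kernel refuses more data."""
    chunk = b"x" * 65536
    queued = 0
    while True:
        try:
            queued += sock.send(chunk)
        except BlockingIOError:
            # The send buffer is full; a blocking client would now wait in poll().
            return queued


def wait_writable(sock, timeout_ms):
    """Return True if the socket is writable, False if poll() timed out."""
    poller = select.poll()
    poller.register(sock, select.POLLOUT)
    # An empty list here is the Python equivalent of poll() returning 0 in C.
    return bool(poller.poll(timeout_ms))


def demo(timeout_ms=100):
    """Show poll() succeeding on an empty buffer and timing out on a full one."""
    writer, reader = socket.socketpair()
    try:
        writer.setblocking(False)
        before = wait_writable(writer, timeout_ms)   # buffer empty: writable
        queued = fill_send_buffer(writer)            # peer never reads
        stalled = wait_writable(writer, timeout_ms)  # buffer full: times out
        return before, queued, stalled
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    before, queued, stalled = demo()
    print("writable before fill:", before)
    print("bytes queued before blocking:", queued)
    print("writable after fill:", stalled)
```

The larger the write, the more often the sender has to wait like this for
the far end to drain data, so each wait has to fit inside the poll timeout.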

My guess is that we were using a shorter poll timeout and, as our usage
grew, 2s is no longer enough. But 5s is a long time to wait, no?


Perhaps this is a question for the libmemcached list, but what's happening
where two seconds is too short? Does it make sense that the size of the
data I'm storing would require a longer poll timeout?
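
As a rough sanity check on the numbers, the sketch below estimates the
ideal time to push one payload over the link. It is back-of-the-envelope
arithmetic only. It assumes "700K" means 700 * 1024 bytes, that 300Mbps
means 300 * 10^6 bits per second, and that concurrent writers share the
link equally. It ignores round-trip latency, TCP windowing and
retransmits. The function name and the writer count of 30 are
hypothetical values for illustration, not measurements.

```python
def ideal_transfer_ms(payload_bytes, link_mbps, concurrent_writers=1):
    """Best-case time in milliseconds to send one payload over a shared link.

    Assumes the link bandwidth is split evenly between concurrent writers
    and that nothing but raw bandwidth limits the transfer.
    """
    bits = payload_bytes * 8 * concurrent_writers
    seconds = bits / (link_mbps * 1_000_000)
    return seconds * 1000.0


if __name__ == "__main__":
    # Payload sizes from the message, link speed as recalled (300Mbps).
    for label, size in (("10K", 10 * 1024), ("700K", 700 * 1024)):
        alone = ideal_transfer_ms(size, 300)
        shared = ideal_transfer_ms(size, 300, concurrent_writers=30)
        print(f"{label}: {alone:.2f} ms alone, {shared:.2f} ms with 30 writers")
```

Under these assumptions a single 700K write needs only about 19 ms of raw
bandwidth, so the link speed alone would not explain a 2000ms poll timeout.
Something else (latency, windowing, contention) would have to be adding
the rest.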


The other two timeouts I'm seeing are one where the socket connect() fails
(this seems related to how busy the client machine is) and another timeout
from libmemcached in:

static memcached_return_t connect_poll(memcached_instance_st* server,
                                       const int connection_error)


and also in this poll call:

    if ((number_of= poll(fds, 1, server->root->connect_timeout)) == -1)

That is returning zero and connect_timeout is 4000ms. But that's a much
rarer type of timeout.
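
For reference, the usual pattern behind a connect timeout like this is a
non-blocking connect() followed by poll() for writability and a check of
SO_ERROR. The sketch below shows that pattern in Python. It is not
libmemcached's code, and connect_with_poll and its return strings are
hypothetical names chosen for illustration. It assumes a Unix-like system
with select.poll and a numeric IP address, since name resolution would
block before the timeout applies.

```python
import errno
import select
import socket


def connect_with_poll(host, port, connect_timeout_ms):
    """Return "connected", "timeout", or an errno name such as "ECONNREFUSED"."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # A non-blocking connect normally reports EINPROGRESS right away.
        rc = sock.connect_ex((host, port))
        if rc not in (0, errno.EINPROGRESS):
            return errno.errorcode.get(rc, str(rc))

        poller = select.poll()
        poller.register(sock, select.POLLOUT)
        # An empty result is the equivalent of poll() returning 0 in C:
        # the handshake did not finish within connect_timeout_ms.
        if not poller.poll(connect_timeout_ms):
            return "timeout"

        # The socket is ready; SO_ERROR says whether the connect succeeded.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        return "connected" if err == 0 else errno.errorcode.get(err, str(err))
    finally:
        sock.close()


if __name__ == "__main__":
    # 11211 is the memcached port used earlier in this thread.
    print(connect_with_poll("127.0.0.1", 11211, 4000))
```

A "timeout" result here means the TCP handshake itself did not complete in
time, which is a different failure from a stalled read or write on an
already established connection.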




On Tue, Oct 1, 2013 at 3:41 PM, Bill Moseley <mose...@hank.org> wrote:

> Just a quick follow up on this timeout issue.   No solution yet -- but
> sure seems like a client network issue.
>
> I have three servers on the same subnet.   One called "mem" where I'm
> running a single instance of Memcached.  Then I have dev-1 and dev-2, each
> running mc_conn_tester.pl
> <http://consoleninja.net/code/memcached/mc_conn_tester.pl>.
> It's not reporting any timeouts on either machine.
>
> I then start another script on dev-1 that forks 30 processes, connects,
> then sends large set requests (almost 1MB in size) in a loop. This is
> supposed to emulate a busy forking web server, for example.
>
> Then I start seeing timeouts from mc_conn_tester.pl on the dev-1 machine
> but not the dev-2 machine.  And likewise, if I move the load generator to
> dev-2 then I see the timeouts on dev-2 not on dev-1.   Not a lot of
> timeouts in either case, but it's clear it happens where the load script is
> running.
>
> If the load generating script is changed to send much smaller data size
> then the timeouts stop.
>
> That has me thinking this isn't a problem related to Memcached itself,
> rather some network problem.   The network is not close to saturation so
> maybe a temporary buffer overrun.   I've asked our network people to look
> into it.
>
> Agreed?
>
>
>
>
>
>
> On Tue, Sep 24, 2013 at 7:09 AM, Bill Moseley <mose...@hank.org> wrote:
>
>> I'm using the notes at https://code.google.com/p/memcached/wiki/Timeouts
>>  to debug timeout errors against a single 1.4.4 Memcached server with
>> 8GB of RAM on CentOS 6.2 started with
>>
>> memcached -d -p 11211 -u memcached -m 4096 -c 8192
>>
>>
>> I could not get http://consoleninja.net/code/memcached/mc_conn_tester.pl to
>> issue a timeout running by itself.
>>
>> So I wrote another script using Perl's Memcached::libmemcached that
>> forked 20 or so processes and set ~1/2MB of data using keys generated by
>> Data::UUID.  I didn't specify an expires time for these sets.
>>
>> I then started to see a few timeouts w/o connecting like in the examples:
>>
>> Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.00000000) (set: 
>> 0.00000000) (get: 0.00000000)
>>
>> I'm just starting to look at this now, but the network cards are not
>> showing errors or dropped packets.  I couldn't get enough timeouts where
>> changing the timeout value made much difference.
>>
>> Anyone have any additional suggestions for debugging these?
>>
>>
>> And I assume unrelated to the timeout errors, but while testing I started
>> to get server errors on my script writing the large data to Memcached:
>>
>> SERVER_ERROR out of memory storing object
>>
>>
>> Are those failed malloc calls?  I'm suspecting that this is related to my
>> old version of Memcached (per this thread):
>>
>> https://groups.google.com/forum/#!topic/memcached/QD7a-6JdqgA
>>
>> But, I just started up another instance of Memcached using the defaults
>> (-m 64) and cannot get it to fail with that error.
>>
>> The machine where I was getting the out of memory errors has plenty of
>> room:
>>
>>  $ free
>>              total       used       free     shared    buffers     cached
>> Mem:       8059188    5006444    3052744          0     284740     215796
>> -/+ buffers/cache:    4505908    3553280
>> Swap:     10289144          0   10289144
>>
>> Any chance the timeouts are somehow related?
>>
>> --
>> Bill Moseley
>> mose...@hank.org
>>
>
>
>
> --
> Bill Moseley
> mose...@hank.org
>



-- 
Bill Moseley
mose...@hank.org
