Hey all,

we have a sharded memcached cluster of 5 nodes.
Last week one of our memcached servers ran into serious trouble,
and only a cold restart of the process brought it back.

The symptom was (and is) that lots of memcached clients got an
EPIPE. Once we had identified the failing memcached node, we could
not connect to it either, neither with fresh tiny test scripts
nor directly.

At first we thought it *MIGHT* also be a hardware issue, or whatnot,
so we wrote a monitoring test: every 10 seconds it connects to each
memcached node individually, writes a test value, reads it back,
compares the two, and reports any error.
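
For anyone who wants to reproduce this kind of check, here is a minimal
sketch of such a probe. It is not our actual monitoring script: it is
written in Python for brevity, it speaks the plain memcached text protocol
over a raw socket instead of going through a client library, and the names
probe_node and monitor, the key format, and the timeout are made up for
illustration.

```python
import os
import socket
import time


def _read_until(sock, terminator):
    """Read until the reply ends with `terminator`; raise EOFError on EOF."""
    buf = b""
    while not buf.endswith(terminator):
        chunk = sock.recv(4096)
        if not chunk:
            raise EOFError("server closed the connection (EOF)")
        buf += chunk
    return buf


def probe_node(host, port=11211, timeout=2.0):
    """Do one set/get round trip against a single node; return (ok, detail)."""
    # Hypothetical key layout; any key that is unique per probe will do.
    key = ("probe:%d:%d" % (os.getpid(), time.time_ns())).encode("ascii")
    value = os.urandom(8).hex().encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Text protocol: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
            sock.sendall(b"set %s 0 60 %d\r\n%s\r\n" % (key, len(value), value))
            reply = _read_until(sock, b"\r\n")
            if reply != b"STORED\r\n":
                return False, "unexpected set reply: %r" % reply
            # Read the value back and compare it with what was written.
            sock.sendall(b"get %s\r\n" % key)
            reply = _read_until(sock, b"END\r\n")
            expected = b"VALUE %s 0 %d\r\n%s\r\nEND\r\n" % (key, len(value), value)
            if reply != expected:
                return False, "value mismatch: %r" % reply
            return True, "ok"
    except EOFError as exc:
        return False, str(exc)
    except OSError as exc:
        # Covers EPIPE, ECONNRESET, ECONNREFUSED and socket timeouts.
        return False, "%s: %s" % (type(exc).__name__, exc)


def monitor(nodes, interval=10.0):
    """Probe every (host, port) pair individually, then sleep and repeat."""
    while True:
        for host, port in nodes:
            ok, detail = probe_node(host, port)
            if not ok:
                print("%s %s:%d FAILED: %s" % (time.ctime(), host, port, detail))
        time.sleep(interval)
```

It would be started with something like monitor([("cache1", 11211),
("cache2", 11211)]), where the host names are placeholders. Each probe opens
a fresh connection on purpose, so a node that no longer accepts connections
shows up as an error instead of being hidden by a pooled socket.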

The result, though, is that we get EPIPEs and even the occasional EOF
quite regularly. They are not so frequent that we would (as far as we
know yet) have to care, but we now know of **two** cases of an EPIPE
peak (one very big, one somewhat big), and in one of them only a
restart helped.

We came to the conclusion that it might well be an issue
with the software itself, and so we are seeking help from upstream, in the hope
of getting any kind of advice, help, or whatever you think
might help us get such things fixed.

FYI: the clients are all Ruby, using the standard gem "redis" version 2.1.1;
the memcached servers run version 1.4.4 (1 node) and 1.4.2 (4 nodes).

Many thanks for any thoughts :)
Christian Parpart.
