Hey all, we run a sharded memcached cluster of 5 nodes. Last week one of our memcached servers ran into serious trouble, and only a cold process restart brought it back.
The symptom was, and still is, that many memcached clients got EPIPE errors. Once we had identified the failing memcached node, we could not connect to it with fresh, tiny test scripts or directly either.

First we thought it *MIGHT* also be a hardware issue or the like, so we wrote a monitoring test: every 10 seconds it connects to each memcached node individually, writes a test value, rereads it, compares the two, and reports any error.

The result, though, is that we get EPIPEs and even the occasional EOF quite regularly. As far as we know so far, they are not frequent enough that we should worry, but we now know of **two** cases of a peak EPIPE scenario (one very big, one somewhat big), and in one of them only a restart helped.

We have come to the conclusion that it may well be an issue with the software itself, and so we are seeking help from upstream, in the hope of getting any kind of advice or pointers that might help us get this fixed.

FYI: the clients are all Ruby, using the standard gem "redis" version 2.1.1; the memcached versions are 1.4.4 on one node and 1.4.2 on the other four.

Many thanks for any thoughts :)

Christian Parpart.
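
For anyone wanting to reproduce the per-node check described above (connect, test-write, reread, compare, report), here is a minimal sketch. It assumes the memcached text protocol over a raw TCP socket rather than any client gem, so that EPIPE and EOF show up undisguised. The helper names `check_node` and `read_line`, the `monitor:` key prefix, the timeouts, and the host names in the trailing comment are all illustrative assumptions, not the actual monitoring script.

```ruby
require "socket"
require "securerandom"

# Hypothetical helper: read one CRLF-terminated line, failing loudly on
# timeout or on the peer closing the connection.
def read_line(sock, timeout)
  raise Errno::ETIMEDOUT unless IO.select([sock], nil, nil, timeout)
  line = sock.gets("\r\n")
  raise EOFError, "connection closed by peer" if line.nil?
  line.chomp("\r\n")
end

# Hypothetical helper: round-trip one random value through a single node.
# Returns :ok, or a symbol naming what went wrong.
def check_node(host, port, timeout: 2)
  key   = "monitor:#{SecureRandom.hex(8)}"
  value = SecureRandom.hex(16)

  Socket.tcp(host, port, connect_timeout: timeout) do |sock|
    # Test-write: "set <key> <flags> <exptime> <bytes>" followed by the data.
    sock.write("set #{key} 0 60 #{value.bytesize}\r\n#{value}\r\n")
    return :store_failed unless read_line(sock, timeout) == "STORED"

    # Reread: expect "VALUE <key> <flags> <bytes>", the data, then "END".
    sock.write("get #{key}\r\n")
    header = read_line(sock, timeout)
    return :miss unless header.start_with?("VALUE #{key} ")
    data = read_line(sock, timeout)
    read_line(sock, timeout) # consume the trailing "END"

    # Compare what came back with what was written.
    data == value ? :ok : :mismatch
  end
rescue Errno::EPIPE
  :epipe
rescue EOFError
  :eof
rescue SystemCallError, IOError, SocketError
  :io_error
end

# Example driver (host names are placeholders), checking every 10 seconds:
#
#   nodes = [["cache1.example", 11211], ["cache2.example", 11211]]
#   loop do
#     nodes.each do |host, port|
#       result = check_node(host, port)
#       warn "#{Time.now} #{host}:#{port} #{result}" unless result == :ok
#     end
#     sleep 10
#   end
```

Keeping EPIPE and EOF as separate results makes it easier to see from the logs whether a node is dropping established connections or merely refusing new ones.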