We were seeing sporadic errors in our product, and after some digging it
turned out that we were getting the "SERVER_ERROR out of memory" error when
storing items in our memcached cluster. But here's the weird part:

We only got the error occasionally. Most writes went OK, but some of them
just failed. We estimated the error rate at about 1 in 30.
All memcached servers had grown to the memory limit we set (512MB).
I ran stats slabs, and there were plenty of slabs of all sizes.
The number of evictions ticked up slowly, but definitely not as fast as it
should have, given the rate at which we were storing items.
The items that failed were all very small, with an expiry of 5 seconds.
And we were running version 1.2.5 for Windows.
And we weren't running with the -M option.
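
For anyone who wants to watch the same counters, here is a minimal Python
sketch that sends a stats command over memcached's text protocol and parses
the "STAT <name> <value>" lines of the reply. The helper names (parse_stats,
fetch_stats) and the default port 11211 are assumptions for illustration, and
the exact set of stat names differs between memcached versions, so treat this
as a starting point rather than the way we actually collected our numbers.

```python
import socket


def parse_stats(raw: str) -> dict:
    """Parse 'STAT <name> <value>' lines from a memcached stats reply.

    Works for both 'stats' and 'stats slabs' output; values are kept as
    strings because some of them are not numeric.
    """
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        if line == "END":  # memcached terminates the reply with END
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str, port: int = 11211, command: str = "stats",
                timeout: float = 2.0) -> dict:
    """Send a stats command to one server and return the parsed reply.

    Hypothetical helper: host and port are whatever your servers use.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection early
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", errors="replace"))
```

Calling fetch_stats("cache01") and fetch_stats("cache01", command="stats slabs")
at intervals (the host name is made up) and comparing the eviction counter
against the number of sets is one way to check whether evictions keep pace
with writes.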

We have now upgraded to version 1.4.4 and restarted the servers. It will
take a week or two for the caches to fill up again, and we're hoping the
error won't come back.


But what happened? How could we get that error, when the servers should
simply have evicted lots of objects instead? Why did only a fraction of the writes
failed that way? What does the error actually mean, since the servers
obviously weren't out of memory? And how can we prevent it from happening
again?


/Henrik Schröder
