> We were having some weird sporadic errors on our product, and after 
> scratching our heads a lot and digging down into it, it turned out that
> we were getting the "SERVER_ERROR out of memory" error when storing items on 
> our memcached cluster, but here's the weird part:
>
> We only got the error occasionally. Most writes went ok, but some of them 
> just failed. We estimated the error rate at about 1 in 30.
> All memcached servers had grown to the memory limit we set (512MB).
> I ran stats slabs, and there were plenty of slabs of all sizes.
> The number of evictions ticked up slowly, but definitely not as fast as it 
> should, given the rate at which we stored items.
> The items that failed were all very small, with an expiry of 5 seconds.
> And we were running version 1.2.5 for Windows.
> And we weren't running with the -M option.
>
> We upgraded to version 1.4.4 now, and restarted them, and it'll take a week 
> or two for the cache servers to get full again, and we're
> hoping the error won't come back.
>
>
> But what happened? How could we get that error, when the servers just should 
> have evicted lots of objects instead? How come only a fraction
> of the writes failed that way? What does the error actually mean, since the 
> servers obviously weren't out of memory? And how can we prevent
> it from happening again?

1.2.5 for Windows is hella old and badly ported, tons and tons of bugs
have been fixed since then, and it has no stats, so... you can only really
guess as to what happened :)

If I had to guess though, it'd be the old bug that was fixed with the
"tail repairs" function. Very rarely, an item's refcount wouldn't get fully
released and the item would get stuck in an unevictable state. Once you got
50 of those sitting at the bottom of the LRU you wouldn't be able to store
anything at all. Fewer than 50 and you'd get occasional errors.
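
To make the failure mode above concrete, here is a toy model in Python. It
is not memcached's code, just a sketch under stated assumptions: the class
and function names are made up, and the search depth of 50 is taken from
the number mentioned above. It only shows the hard-failure case (50 stuck
items at the tail); it makes no attempt to reproduce the occasional errors
seen with fewer stuck items.

```python
from dataclasses import dataclass

# Assumed search depth, taken from the "50" mentioned in the text above.
TAIL_SEARCH_LIMIT = 50


class OutOfMemory(Exception):
    """Stands in for the 'SERVER_ERROR out of memory' reply."""


@dataclass
class Item:
    key: str
    refcount: int = 0  # 0 means nothing holds the item, so it may be evicted


class ToyLRU:
    """Hypothetical fixed-size LRU; index 0 of self.items is the tail (oldest)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def store(self, key):
        if len(self.items) >= self.capacity:
            self._evict_one()
        self.items.append(Item(key))  # newest item goes to the head

    def _evict_one(self):
        # Inspect at most TAIL_SEARCH_LIMIT items, starting from the tail.
        for idx in range(min(len(self.items), TAIL_SEARCH_LIMIT)):
            if self.items[idx].refcount == 0:
                self.items.pop(idx)
                return
        # Every item we were willing to look at is still referenced.
        raise OutOfMemory("SERVER_ERROR out of memory")

    def leak_refs_at_tail(self, count):
        # Simulate the bug: a reference that is never released.
        for item in self.items[:count]:
            item.refcount = 1
```

With 49 leaked items at the tail the model can still evict the 50th and
store; once 50 are stuck, every store fails even though the cache is full
of perfectly evictable items further up the list.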

Seriously guys, running really old code and then asking wtf happened is really
hard to deal with. I'm sort of amazed how often this comes up. We've added
gobs of stats to the newer versions so you can see exactly what goes
right/wrong, but the further back you go the more guesswork is involved.
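
As a rough illustration of using those stats, the sketch below parses the
output of the "stats items" command and flags slab classes that report
out-of-memory errors or tail repairs. The counter names "outofmemory" and
"tailrepairs" are what I recall from the 1.4.x series, so check protocol.txt
for your version. The function names, host and port are placeholders, not
part of any client library.

```python
import socket


def parse_stats_items(text):
    """Parse 'STAT items:<slab>:<name> <value>' lines into {slab: {name: int}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue
        fields = parts[1].split(":")
        if len(fields) != 3 or fields[0] != "items":
            continue
        try:
            slab, value = int(fields[1]), int(parts[2])
        except ValueError:
            continue  # skip anything that is not a plain integer counter
        stats.setdefault(slab, {})[fields[2]] = value
    return stats


def suspicious_slabs(stats):
    """Return slab class ids with nonzero outofmemory or tailrepairs counters."""
    return sorted(
        slab
        for slab, counters in stats.items()
        if counters.get("outofmemory", 0) > 0 or counters.get("tailrepairs", 0) > 0
    )


def fetch_stats_items(host="127.0.0.1", port=11211, timeout=2.0):
    """Send 'stats items' over the text protocol and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats items\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")
```

Something like suspicious_slabs(parse_stats_items(fetch_stats_items()))
run from cron would tell you which slab class is in trouble before you have
to guess.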
