Thanks! It might take me a while to look into it more closely.

That conn_mwrite state is probably bad, but a single stuck connection
shouldn't be able to cause this on its own. Before giving up and returning
an OOM, memcached walks up the chain from the tail of the LRU, trying
roughly five items. So all of them would have to be locked, or possibly
something else is going on that I'm unaware of.
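
For reference, here's a toy, self-contained sketch of that tail-walk logic
(my own illustration; names like item/refcount/prev mirror memcached's, but
this is not the literal do_item_alloc code from items.c):

  #include <stdio.h>
  #include <stddef.h>

  typedef struct item {
      struct item *prev;  /* toward the head of the LRU */
      int refcount;       /* nonzero while a connection holds the item */
      int id;
  } item;

  /* Try up to 5 items from the tail; NULL means "give up, report OOM". */
  static item *try_evict_from_tail(item *tail) {
      int tries = 5;
      for (item *search = tail; tries > 0 && search != NULL;
           tries--, search = search->prev) {
          if (search->refcount != 0)
              continue;       /* held by a connection; skip it */
          return search;      /* free to unlink and reuse */
      }
      return NULL;            /* every candidate was held */
  }

  int main(void) {
      /* items[0] is the LRU tail; every tail item is held (refcount 1),
         which is the situation this bug would require. */
      item items[5] = {
          { NULL, 1, 0 }, { NULL, 1, 1 }, { NULL, 1, 2 },
          { NULL, 1, 3 }, { NULL, 1, 4 },
      };
      for (int i = 0; i < 4; i++) items[i].prev = &items[i + 1];

      item *victim = try_evict_from_tail(&items[0]);
      if (victim)
          printf("evicting item %d\n", victim->id);
      else
          printf("SERVER_ERROR out of memory storing object\n");
      return 0;
  }

The point being: for one slab class to OOM on every write, all five-ish
items at its tail have to stay refcounted, which one hung connection
shouldn't manage by itself.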

Great that you have some cores. Can you look at the tail of the LRU for
the slab class which was OOM'ing, and print the item struct there? If
possible, walk back 5-10 items from the tail and print each (anonymized,
of course). It'd be useful to see the refcount and flags on the items.
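
Something along these lines against the core should do it, assuming debug
symbols (tails[] is the per-class LRU tail array in items.c; the 12 here
is a made-up slab class id, substitute whichever class was OOM'ing):

  (gdb) set $it = tails[12]
  (gdb) p *$it
  (gdb) set $it = $it->prev
  (gdb) p *$it
  ... repeat a few more times, noting refcount and it_flags ...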

Have you tried re-enabling tail repairs on one of your .20 instances? It
could still crash sometimes, but you can set the timeout to a reasonably
low number and see if that helps at all while we figure this out.
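
If I remember right, in .20 that's the tail_repair_time tunable, in
seconds, e.g.:

  memcached -o tail_repair_time=600 <your usual options>

(600 is just an example of a "reasonably low" value; it was disabled by
default a few releases back because forcibly freeing an item that's
genuinely still in use can crash, hence the caveat above.)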

On Thu, 7 Aug 2014, Jay Grizzard wrote:

> (I work with Denis, who is out of town this week)
> So we finally got a more proper 1.4.20 deployment going, and we’ve seen this 
> issue quite a lot over the past week. When it
> happened this morning I was able to grab what you requested.
>
> I’ve included a couple of “stats conn” dumps, with anonymized addresses, 
> taken four minutes apart. It looks like there’s one
> connection that could possibly be hung:
>
>   STAT 2089:state conn_mwrite
>
> …would that be enough to cause this problem? (I’m assuming the answer is “it 
> depends”) I snagged a core file from the process
> that I should be able to muck through to answer questions if there’s 
> somewhere in there we would find useful information.
>
> Worth noting that while we’ve been able to reproduce the hang (a single slab 
> starts reporting oom for every write), we haven’t
> reproduced the “but recovers on its own” part because these are production 
> servers and the problem actually causes real issues,
> so we restart them rather than waiting several hours to see if the problem 
> clears up. 
>
> Also, reading up in the thread, it’s worth noting that a lack of TCP 
> keepalives (which we actually have; memcached enables them) wouldn’t 
> actually affect the “and automatically recover” aspect of things, 
> because TCP keepalives only fire when a connection is completely idle. 
> When there’s pending data (which there would be on a hung write), 
> standard TCP retransmission timeouts (which are much faster) apply.
>
> (And yes, we do have lots of idle connections to our caches, but that’s not 
> something we can immediately fix, nor should it
> directly be the cause of these issues.)
>
> Anyhow… thoughts?
>
> -j
>
