Okay, I'm pretty sure I understand what's going on here now.

Here's what I think the sequence of events is:

- Client does gets for a very large number of keys. I'm not sure how to
actually see the request in the core (if that data is even still attached),
but isize ("list of items to write out") is 3200. I'm assuming that's the
size of the list of items pending write, anyhow.
- All the items to be written get a refcount++ and are queued for delivery.
Some of these items are on the LRU tail (or get moved there at some point).
- At some point during transmission, the client system either stops
processing, or starts processing *so* slowly it may as well have stopped.
- The connection sits there and stays healthy (since the client is still
online), but makes little/no progress, so the connection is essentially
stuck permanently in the conn_mwrite state, keeping all the items on the
transmit list permanently referenced.
- In the meantime, the tail gets used as normal, and as the genuinely free
entries get consumed, these referenced entries 'bubble up' until they occupy
the first five-ish slots.
- Presto: that slab no longer accepts writes until something happens to
force a TCP disconnect (the client process crashing), or the processing of
the response actually completes. (I've sketched the allocation walk I mean
just below.)
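
Here's that sketch -- a minimal version of the eviction walk as I understand
it. The types, the tails[] array, and the five-try limit are simplified
stand-ins rather than the actual memcached source:

#include <stddef.h>

/* Simplified stand-ins for memcached's item and per-class LRU tail;
 * the field names are illustrative, not the real definitions. */
typedef struct _item {
    struct _item *prev;       /* next item up the LRU (towards the head) */
    unsigned short refcount;  /* >1 means a connection still holds it */
} item;

#define MAX_SLAB_CLASSES 64
static item *tails[MAX_SLAB_CLASSES];  /* LRU tail per slab class */

/* When a slab class is full, only the last few entries of its LRU are
 * examined for eviction.  Items pinned by an in-flight response (e.g.
 * sitting on a stuck connection's ilist) are skipped, so if the bottom
 * handful are all pinned, every store in that class reports OOM. */
static item *find_evictable(unsigned int slab_id) {
    int tries = 5;
    item *search = tails[slab_id];

    for (; tries > 0 && search != NULL; tries--, search = search->prev) {
        if (search->refcount > 1)
            continue;     /* still referenced; can't evict this one */
        return search;    /* a candidate that could be evicted/reused */
    }
    return NULL;          /* nothing evictable -> OOM for this slab class */
}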


...the specific connection I was looking at above was fd 2089, the one that
had been in the conn_mwrite state for 10917 seconds in the previously
attached files.

Walking the tail for slab 16 (the hung slab), the first 20 entries all have
refcount=2; the first refcount=1 doesn't show up until entry 21.

If I take the first half dozen or so of those (that's all I tried), I can
find *every single one* of them listed in the array at conns[2089]->ilist.
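
For what it's worth, this is roughly the check I was doing by hand against
the core, written out as code. The conn/item shapes (and the nitems field
name) are simplified, hypothetical stand-ins for the real structs, not
memcached code:

#include <stdbool.h>

typedef struct _item item;   /* opaque here; only the pointer matters */

/* Cut-down stand-in for the relevant bits of the conn struct. */
typedef struct {
    item **ilist;    /* the list of items queued to write out */
    int    nitems;   /* how many entries of ilist are populated
                        (hypothetical field name) */
} conn;

/* Is this LRU-tail item pinned by this connection's pending-write list? */
static bool item_on_conn_ilist(const conn *c, const item *it) {
    for (int i = 0; i < c->nitems; i++) {
        if (c->ilist[i] == it)
            return true;   /* found it: this is where the extra
                              refcount is coming from */
    }
    return false;
}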

So unless I'm reading something horribly wrong (which I may be; I'm only
passingly familiar with memcached internals), that's why we're breaking.

Now, how to *fix* this, I'm not sure about. Obviously the client should
actually be processing the data it's being sent in a timely manner, and
requesting thousands of keys in one go may or may not be sane. But
regardless of that, it still probably shouldn't break the server.

An inactivity timer might help, as long as it's willing to kill connections
that are still in a writing state. That wouldn't actually *fix* the
problem, but it would certainly decrease the odds of it happening to the
point where it could be considered "fixed" for most practical purposes.
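
For the sake of discussion, something along these lines is what I have in
mind: a periodic sweep that kicks connections that have sat mid-write for
too long. All of the field and function names here are hypothetical, not
existing memcached code:

#include <stddef.h>
#include <time.h>

enum conn_states { conn_mwrite /* , other states elided */ };

/* Cut-down, hypothetical view of a connection for this sketch. */
typedef struct {
    int              sfd;
    enum conn_states state;
    time_t           last_progress;  /* updated whenever a write sends bytes */
} conn;

#define STUCK_WRITE_TIMEOUT 300      /* seconds; would want this tunable */

/* Placeholder: a real version would go through the normal conn_closing
 * path so the ilist references get dropped and refcounts decremented. */
static void close_stuck_conn(conn *c) { (void)c; }

/* Run periodically (timer event / maintenance thread): kill connections
 * stuck in conn_mwrite without progress, which releases the items they
 * have pinned on the LRU tail. */
static void reap_stuck_writers(conn **conns, size_t nconns, time_t now) {
    for (size_t i = 0; i < nconns; i++) {
        conn *c = conns[i];
        if (c && c->state == conn_mwrite &&
            now - c->last_progress > STUCK_WRITE_TIMEOUT) {
            close_stuck_conn(c);
        }
    }
}

Even a fairly generous timeout (a few minutes) would have caught the fd 2089
case above, which sat there for hours.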

What do you think?

-j


On Thu, Aug 7, 2014 at 5:17 PM, dormando <dorma...@rydia.net> wrote:

> Thanks! It might take me a while to look into it more closely.
>
> That conn_mwrite is probably bad; however, a single connection shouldn't be
> able to do it. Before the OOM is given up, memcached walks up the chain
> from the bottom of the LRU by 5ish. So all of them would have to be locked,
> or possibly there's something I'm unaware of.
>
> Great that you have some cores. Can you look at the tail of the LRU for
> the slab which was OOM'ing, and print the item struct there? If possible,
> walk up 5-10 items back from the tail and print each (anonymized, of
> course). It'd be useful to see the refcount and flags on the items.
>
> Have you tried re-enabling tailrepairs on one of your .20 instances? It
> could still crash sometimes, but you can set the timeout to a reasonably
> low number and see if that helps at all while we figure this out.
>
> On Thu, 7 Aug 2014, Jay Grizzard wrote:
>
> > (I work with Denis, who is out of town this week)
> > So we finally got a more proper 1.4.20 deployment going, and we’ve seen
> this issue quite a lot over the past week. When it
> > happened this morning I was able to grab what you requested.
> >
> > I’ve included a couple of “stats conn” dumps, with anonymized addresses,
> taken four minutes apart. It looks like there’s one
> > connection that could possibly be hung:
> >
> >   STAT 2089:state conn_mwrite
> >
> > …would that be enough to cause this problem? (I’m assuming the answer is
> “it depends”) I snagged a core file from the process
> > that I should be able to muck through to answer questions if there’s
> somewhere in there we would find useful information.
> >
> > Worth noting that while we’ve been able to reproduce the hang (a single
> slab starts reporting oom for every write), we haven’t
> > reproduced the “but recovers on its own” part because these are
> production servers and the problem actually causes real issues,
> > so we restart them rather than waiting several hours to see if the
> problem clears up.
> >
> > Also, reading up in the thread, it’s worth noting that lack of TCP
> keepalives (which we actually have, memcached enables it)
> > wouldn’t actually affect the “and automatically recover” aspect of
> things, because TCP keepalives only happen when a connection
> > is completely idle. When there’s pending data (which there would be on a
> hung write), standard TCP timeouts (which are much
> > faster) apply.
> >
> > (And yes, we do have lots of idle connections to our caches, but that’s
> not something we can immediately fix, nor should it
> > directly be the cause of these issues.)
> >
> > Anyhow… thoughts?
> >
> > -j
> >
