Thanks!

This is a little exciting actually, it's a new bug!

Tail repairs were only necessary when an item was legitimately leaked; if we
don't reap it, it never gets better. However, you stated that for three hours
all sets failed (and at the same time some .15's crashed), and then it
self-recovered.

The .15 crashes were likely from the bug I fixed, where an active item is
fetched from the tail but then reclaimed because it's old.

The .20 "OOM" is the defensive code working perfectly; something has
somehow retained a legitimate reference to an item for multiple hours!
More than one even, since the tail is walked up by several items while
looking for something to free.
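
To make that concrete, here's a rough sketch of the shape of that tail walk
(simplified and illustrative only, not the actual memcached source; the
struct, function name, and constants are made up for the example):

    /* Simplified stand-in for the real item struct; illustrative only. */
    typedef struct item_s {
        struct item_s *prev;      /* toward the LRU head (newer items) */
        struct item_s *next;      /* toward the LRU tail (older items) */
        unsigned short refcount;  /* >1 means a connection still holds it */
    } item;

    #define TAIL_SEARCH_DEPTH 5   /* only look a few items up from the tail */

    /* Returns an item that's safe to evict, or NULL -- which is when the
     * client sees an out-of-memory error, because everything near the tail
     * is still referenced and tail repairs are disabled. */
    static item *pull_from_tail(item *tail) {
        int tries = TAIL_SEARCH_DEPTH;
        for (item *search = tail; tries > 0 && search != NULL;
             tries--, search = search->prev) {
            if (search->refcount > 1)
                continue;         /* still held; skip it, don't force-free */
            return search;
        }
        return NULL;
    }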

Did you have any network blips, application server crashes, or the like? It
sounds like some connections are dying in a way where they only go away via
a timeout, and somehow that timeout is very long (no TCP keepalives?).

What's *extra* exciting is that 1.4.20 now has the "stats conns" command.

If this happens again, while a .20 machine is actively OOM'ing, can you
grab a couple copies of the "stats conns" output, a few minutes apart?
That should definitively tell us if there are stuck connections causing
this issue.
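
If it's easier to script than to telnet in, a tiny client like this can
capture it (a minimal sketch; the 127.0.0.1:11211 target and the simplistic
read loop are assumptions, so adjust for your setup):

    /* Minimal sketch: connect to memcached, send "stats conns" over the
     * text protocol, and print the reply.  Run it a few minutes apart and
     * keep each copy of the output. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(11211);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("connect");
            return 1;
        }

        const char *cmd = "stats conns\r\n";
        if (write(fd, cmd, strlen(cmd)) < 0) { perror("write"); return 1; }

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
            if (strstr(buf, "END\r\n"))   /* end of the stats listing */
                break;
        }
        close(fd);
        return 0;
    }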

Someone had a PR open for adding idle connection timeouts, but I asked them
to redo it on top of the 'stats conns' work as a more efficient background
thread. I could potentially finish that, and it would be usable as a
workaround. You could also enable TCP keepalives, or otherwise fix
whatever's causing these events.
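
On the keepalive front, the client side would need something like this on
each memcached connection (just a sketch; the fd handle and interval values
are placeholders, and the TCP_KEEP* options shown are Linux-specific):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Turn on TCP keepalives for a connected client socket so half-dead
     * connections eventually get torn down instead of sitting around
     * holding item references.  Values are illustrative. */
    static int enable_keepalive(int fd) {
        int on = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
            return -1;
    #ifdef TCP_KEEPIDLE
        int idle = 60, intvl = 10, cnt = 5;   /* probe after 60s idle */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    #endif
        return 0;
    }

Note that with stock Linux settings a keepalive only starts probing after
two hours of idle, so it's worth tuning the values rather than relying on
the defaults.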

I wonder if it's also worth attempting to relink an item that ends up in the
tail but still has references? That would at least potentially get such
items out of the way of memory reclamation.
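
Roughly the shape I'm imagining (a sketch of the idea only, not a patch;
it reuses the same simplified item struct as the earlier example):

    typedef struct item_s {
        struct item_s *prev;      /* toward the LRU head (newer items) */
        struct item_s *next;      /* toward the LRU tail (older items) */
        unsigned short refcount;
    } item;

    /* If a tail item is still referenced, unlink it and relink it at the
     * LRU head so the tail walk stops tripping over it. */
    static void relink_to_head(item **head, item **tail, item *it) {
        if (*head == it)
            return;                         /* already at the head */
        if (it->prev) it->prev->next = it->next;
        if (it->next) it->next->prev = it->prev;
        if (*tail == it) *tail = it->prev;  /* new tail: next-newer item */
        it->prev = NULL;
        it->next = *head;
        if (it->next) it->next->prev = it;
        *head = it;
    }

The refcount would still be leaked if the connection never comes back, but
at least it wouldn't keep blocking allocations for that slab class.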

Thanks!

On Wed, 2 Jul 2014, Denis Samoylov wrote:

> >1) OOM's on slab 13, but it recovered on its own? This is under version
> >1.4.20 and you did *not* enable tail repairs?
> correct
>
> >2) Can you share (with me at least) the full stats/stats items/stats slabs 
> >output from one of the affected servers running 1.4.20? 
> Sent you the _current_ stats from the server that had the OOM a couple of
> days ago and is still running (now with no issues).
>
> >3) Can you confirm that 1.4.20 isn't *crashing*, but is actually exhibiting 
> >write failures? 
> correct
>
> We will enable saving stderr to a log; maybe this can show something. If you
> have any other ideas, let me know.
>
> -denis
>
>
>
> On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>       Cool. That is disappointing.
>
>       Can you clarify a few things for me:
>
>       1) You're saying that you were getting OOM's on slab 13, but it
>       recovered on its own? This is under version 1.4.20 and you did *not*
>       enable tail repairs?
>
>       2) Can you share (with me at least) the full stats/stats items/stats
>       slabs output from one of the affected servers running 1.4.20?
>
>       3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
>       exhibiting write failures?
>
>       If it's not a crash, and your hash power level isn't expanding, I don't
>       think it's related to the other bug.
>
>       thanks!
>
>       On Wed, 2 Jul 2014, Denis Samoylov wrote:
>
>       > Dormando, sure, we will add an option to preset the hashtable (as I
>       > see, nn should be 26).
>       > One question: as I see in the logs for these servers, there is no
>       > change in hash_power_level before the incident (it would be hard to
>       > say for the crashed ones, but .20 just had out-of-memory errors and I
>       > have solid stats). Doesn't this contradict the idea of the cause? The
>       > server had hash_power_level = 26 for days before and still has 26
>       > days after. Just for three hours every set for slab 13 failed. We did
>       > not reboot/flush the server and it continues to work without problem.
>       > What do you think?
>       >
>       > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
>       >       Hey,
>       >
>       >       Can you presize the hash table? (-o hashpower=nn) to be large
>       >       enough on those servers such that hash expansion won't happen
>       >       at runtime? You can see what hashpower is on a long-running
>       >       server via stats to know what to set the value to.
>       >
>       >       If that helps, we might still have a bug in hash expansion. I
>       >       see someone finally reproduced a possible issue there under
>       >       .20. .17/.19 fix other causes of the problem pretty thoroughly
>       >       though.
>       >
>       >       On Tue, 1 Jul 2014, Denis Samoylov wrote:
>       >
>       >       > Hi,
>       >       > We had sporadic memory corruption due to tail repair in
>       >       > pre-.20 versions, so we updated some of our servers to .20.
>       >       > This Monday we observed several crashes in the .15 version
>       >       > and tons of "allocation failure" errors in the .20 version.
>       >       > This is expected, as .20 just disables "tail repair", but it
>       >       > seems the problem is still there. What is interesting:
>       >       > 1) there is no visible change in traffic, and usually only
>       >       > one slab is affected.
>       >       > 2) this always happens with several but not all servers :)
>       >       >
>       >       > Is there any way to catch this and help with debugging? I
>       >       > have all the slab and item stats for the time around the
>       >       > incident for .15 and .20. The .15 case is clearly memory
>       >       > corruption: gdb shows that the hash function returned 0
>       >       > (line 115: uint32_t hv = hash(ITEM_key(search),
>       >       > search->nkey, 0);).
>       >       >
>       >       > So it seems we are hitting this comment:
>       >       >             /* Old rare bug could cause a refcount leak. We haven't seen
>       >       >              * it in years, but we leave this code in to prevent failures
>       >       >              * just in case */
>       >       >
>       >       > :)
>       >       >
>       >       > Thank you,
>       >       > Denis
>       >       >
>       >
>
