> Dormando, sure, I waited until Monday (our usual tail repair/OOM errors day),
> but we did not have any issues today :). I will continue to monitor and
> will grab "stats conns" next time.

Great, thanks!

> As for network issues around the last incident - I do not see any yet, but
> I am still looking. That would be a good explanation for why we have such
> events grouped in time.
>
> As for keepalive - we use the default php-memcached/libmemcached settings (we
> do not change them), and as far as I can see libmemcached does not set
> SO_KEEPALIVE. Do you recommend setting it?

Let's see what "stats conns" says first. I guess it's theoretically possible
that an item was leaked, but was later fetched (and expired properly) at some
point, clearing the condition. The item would still have been leaked, though.
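If it helps when you grab them: "stats conns" is just a text-protocol command,
so anything that can open a TCP connection and send it will do; the reply is a
run of STAT lines ending with END. Below is a rough sketch of a tiny dumper in
C, not anything official - the localhost/11211 defaults and the argv handling
are only placeholders for whatever your setup looks like:

    /* stats_conns_dump.c - connect to a memcached instance and print the
     * output of "stats conns".  Host and port default to 127.0.0.1:11211;
     * both are assumptions, override them on the command line. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(int argc, char **argv) {
        const char *host = argc > 1 ? argv[1] : "127.0.0.1";
        const char *port = argc > 2 ? argv[2] : "11211";

        /* Resolve and connect. */
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, port, &hints, &res) != 0) {
            fprintf(stderr, "getaddrinfo failed\n");
            return 1;
        }
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
            perror("connect");
            return 1;
        }
        freeaddrinfo(res);

        /* Ask for the per-connection stats. */
        const char *cmd = "stats conns\r\n";
        if (write(fd, cmd, strlen(cmd)) < 0) {
            perror("write");
            return 1;
        }

        /* Stats replies are STAT lines terminated by "END". */
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
            if (strstr(buf, "END\r\n") != NULL)
                break;
        }
        close(fd);
        return 0;
    }

Running something like that a couple of times, a few minutes apart, and keeping
both dumps should be plenty.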

So if "stats conns" doesn't show any hung clients, we might still have a
reference leak somewhere, which would be sad since Steven Grimm fixed a
number of them just recently.
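On the keepalive question: I don't know off-hand what your libmemcached build
exposes for it, so treat this as a generic sketch rather than the blessed way.
At the socket level it's just a setsockopt on the connected descriptor, plus
the Linux-specific tuning knobs if you want the probes to fire sooner than the
kernel defaults. The idle/interval/count numbers below are made-up examples:

    /* Turn on TCP keepalive for an already-connected socket "fd".
     * Call it right after connect(); the values are illustrative only. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int enable_keepalive(int fd) {
        int on = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
            return -1;
    #ifdef TCP_KEEPIDLE
        /* Linux-only tuning: start probing after 60s idle, probe every 10s,
         * declare the peer dead after 5 unanswered probes. */
        int idle = 60, intvl = 10, cnt = 5;
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    #endif
        return 0;
    }

That only papers over whatever is killing the connections, though, so the
"stats conns" output is still the more interesting data point.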

> On Wednesday, July 2, 2014 7:32:14 PM UTC-7, Dormando wrote:
>       Thanks!
>
>       This is a little exciting actually, it's a new bug!
>
>       tailrepairs was only necessary when an item was legitimately leaked; if
>       we don't reap it, it never gets better. However you stated that for
>       three hours all sets fail (and at the same time some .15's crashed).
>       Then it self-recovered.
>
>       The .15 crashes were likely from the bug I fixed, where an active item
>       is fetched from the tail but then reclaimed because it's old.
>
>       The .20 "OOM" is the defensive code working perfectly; something has
>       somehow retained a legitimate reference to an item for multiple hours!
>       More than one even, since the tail is walked up by several items while
>       looking for something to free.
>
>       Did you have any network blips, application server crashes, or the like?
>       It sounds like some connections are dying in such a way that they time
>       out, which is a very long timeout somehow (no tcp keepalives?).
>
>       What's *extra* exciting is that 1.4.20 now has the "stats conns"
>       command.
>
>       If this happens again, while a .20 machine is actively OOM'ing, can you
>       grab a couple copies of the "stats conns" output, a few minutes apart?
>       That should definitively tell us if there are stuck connections causing
>       this issue.
>
>       Someone had a PR open for adding idle connection timeouts, but I asked
>       them to redo it on top of the 'stats conns' work as a more efficient
>       background thread. I could potentially finish this and it would be
>       usable as a workaround. You could also enable tcp keepalives, or
>       otherwise fix whatever's causing these events.
>
>       I wonder if it's also worth attempting to relink an item that ends up in
>       the tail but has references? That would at least potentially get them
>       out of the way of memory reclamation.
>
>       Thanks!
>
>       On Wed, 2 Jul 2014, Denis Samoylov wrote:
>
>       > > 1) OOM's on slab 13, but it recovered on its own? This is under
>       > > version 1.4.20 and you did *not* enable tail repairs?
>       > Correct.
>       >
>       > > 2) Can you share (with me at least) the full stats / stats items /
>       > > stats slabs output from one of the affected servers running 1.4.20?
>       > I sent you the _current_ stats from the server that had the OOM a
>       > couple of days ago and is still running (now with no issues).
>       >
>       > > 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
>       > > exhibiting write failures?
>       > Correct.
>       >
>       > We will enable saving stderr to a log; maybe that will show something.
>       > If you have any other ideas, let me know.
>       >
>       > -denis
>       >
>       >
>       >
>       > On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
>       >       Cool. That is disappointing.
>       >
>       >       Can you clarify a few things for me:
>       >
>       >       1) You're saying that you were getting OOM's on slab 13, but it
>       >       recovered on its own? This is under version 1.4.20 and you did
>       >       *not* enable tail repairs?
>       >
>       >       2) Can you share (with me at least) the full stats / stats
>       >       items / stats slabs output from one of the affected servers
>       >       running 1.4.20?
>       >
>       >       3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
>       >       exhibiting write failures?
>       >
>       >       If it's not a crash, and your hash power level isn't expanding,
>       >       I don't think it's related to the other bug.
>       >
>       >       thanks!
>       >
>       >       On Wed, 2 Jul 2014, Denis Samoylov wrote:
>       >
>       >       > Dormando, sure, we will add an option to presize the
>       >       > hashtable (as I see it, nn should be 26). One question: as
>       >       > far as I can see in the logs for these servers, there was no
>       >       > change in hash_power_level before the incident (it would be
>       >       > hard to say for the crashed .15s, but the .20 just had
>       >       > out-of-memory errors and I have solid stats for it). Doesn't
>       >       > this contradict that as the cause? The server had
>       >       > hash_power_level = 26 for days before and still has 26 days
>       >       > after. For just three hours, every set for slab 13 failed.
>       >       > We did not reboot/flush the server and it continues to work
>       >       > without problems. What do you think?
>       >       >
>       >       > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
>       >       >       Hey,
>       >       >
>       >       >       Can you presize the hash table (-o hashpower=nn) to be
>       >       >       large enough on those servers such that hash expansion
>       >       >       won't happen at runtime? You can see what hashpower is
>       >       >       on a long-running server via stats to know what to set
>       >       >       the value to.
>       >       >
>       >       >       If that helps, we might still have a bug in hash
>       >       >       expansion. I see someone finally reproduced a possible
>       >       >       issue there under .20. .17/.19 fix other causes of the
>       >       >       problem pretty thoroughly though.
>       >       >
>       >       >       On Tue, 1 Jul 2014, Denis Samoylov wrote:
>       >       >
>       >       >       > Hi,
>       >       >       > We had sporadic memory corruption due to tail repair
>       >       >       > in pre-.20 versions, so we updated some of our
>       >       >       > servers to .20. This Monday we observed several
>       >       >       > crashes in the .15 version and tons of "allocation
>       >       >       > failure" errors in the .20 version. This is expected,
>       >       >       > as .20 just disables "tail repair", but it seems the
>       >       >       > problem is still there. What is interesting:
>       >       >       > 1) There is no visible change in traffic, and usually
>       >       >       > only one slab is affected.
>       >       >       > 2) This always happens with several, but not all,
>       >       >       > servers :)
>       >       >       >
>       >       >       > Is there any way to catch this and help with
>       >       >       > debugging? I have all the slab and item stats for the
>       >       >       > time around the incident for both the .15 and .20
>       >       >       > versions. .15 is clearly memory corruption: gdb shows
>       >       >       > that the hash function returned 0 (line 115:
>       >       >       > uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
>       >       >       >
>       >       >       > So it seems we are hitting this comment:
>       >       >       >             /* Old rare bug could cause a refcount leak. We haven't seen
>       >       >       >              * it in years, but we leave this code in to prevent failures
>       >       >       >              * just in case */
>       >       >       >
>       >       >       > :)
>       >       >       >
>       >       >       > Thank you,
>       >       >       > Denis
>       >       >       >
