Dormando, sure. I waited until Monday (our usual day for tailrepair/OOM errors), but we did not have any issues today :). I will continue to monitor and will grab "stats conns" next time.
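In case it helps, this is roughly what I plan to run against an affected box while it is OOM'ing (a minimal sketch, not production code; the 127.0.0.1/11211 host and port are placeholders for one of our servers, and a real capture would be timestamped and repeated a few minutes apart as you suggested):

    /* Minimal sketch: connect to a memcached instance, send the
     * plain-text "stats conns" command, and print the reply. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void) {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("127.0.0.1", "11211", &hints, &res) != 0)
            return 1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
            return 1;
        freeaddrinfo(res);

        const char *cmd = "stats conns\r\n";
        if (write(fd, cmd, strlen(cmd)) != (ssize_t)strlen(cmd))
            return 1;

        /* The reply is a series of "STAT ..." lines terminated by
         * "END\r\n". This naive check can miss an END split across two
         * reads, which is fine for a throwaway sketch. */
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
            if (strstr(buf, "END\r\n"))
                break;
        }
        close(fd);
        return 0;
    }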
As for network issues during that time: I do not see any, but I am still looking. That would be a good explanation for why these events are grouped in time. As for keepalive: we use the default php-memcached/libmemcached setting (we do not change it), and as far as I can see libmemcached does not set SO_KEEPALIVE. Do you recommend setting it? (A rough sketch of what that would look like at the socket level is at the end of this message.)

On Wednesday, July 2, 2014 7:32:14 PM UTC-7, Dormando wrote:
> Thanks!
>
> This is a little exciting actually, it's a new bug!
>
> tailrepairs was only necessary when an item was legitimately leaked; if
> we don't reap it, it never gets better. However, you stated that for
> three hours all sets fail (and at the same time some .15's crashed).
> Then it self-recovered.
>
> The .15 crashes were likely from the bug I fixed, where an active item
> is fetched from the tail, but then reclaimed because it's old.
>
> The .20 "OOM" is the defensive code working perfectly; something has
> somehow retained a legitimate reference to an item for multiple hours!
> More than one even, since the tail is walked up by several items while
> looking for something to free.
>
> Did you have any network blips, application server crashes, or the like?
> It sounds like some connections are dying in such a way that they time
> out, which is a very long timeout somehow (no tcp keepalives?).
>
> What's *extra* exciting is that 1.4.20 now has the "stats conns"
> command.
>
> If this happens again, while a .20 machine is actively OOM'ing, can you
> grab a couple of copies of the "stats conns" output, a few minutes
> apart? That should definitively tell us if there are stuck connections
> causing this issue.
>
> Someone had a PR open for adding idle connection timeouts, but I asked
> them to redo it on top of the 'stats conns' work as a more efficient
> background thread. I could potentially finish this and it would be
> usable as a workaround. You could also enable tcp keepalives, or
> otherwise fix whatever's causing these events.
>
> I wonder if it's also worth attempting to relink an item that ends up in
> the tail but has references? That would at least potentially get them
> out of the way of memory reclamation.
>
> Thanks!
>
> On Wed, 2 Jul 2014, Denis Samoylov wrote:
> > > 1) OOM's on slab 13, but it recovered on its own? This is under
> > > version 1.4.20 and you did *not* enable tail repairs?
> > correct
> >
> > > 2) Can you share (with me at least) the full stats/stats items/stats
> > > slabs output from one of the affected servers running 1.4.20?
> > sent you _current_ stats from the server that had an OOM a couple of
> > days ago and is still running (now with no issues).
> >
> > > 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> > > exhibiting write failures?
> > correct
> >
> > we will enable saving stderr to a log; maybe that will show something.
> > If you have any other ideas, let me know.
> >
> > -denis
> >
> > On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
> > > Cool. That is disappointing.
> > >
> > > Can you clarify a few things for me:
> > >
> > > 1) You're saying that you were getting OOM's on slab 13, but it
> > > recovered on its own? This is under version 1.4.20 and you did *not*
> > > enable tail repairs?
> > >
> > > 2) Can you share (with me at least) the full stats/stats items/stats
> > > slabs output from one of the affected servers running 1.4.20?
> > >
> > > 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> > > exhibiting write failures?
> > >
> > > If it's not a crash, and your hash power level isn't expanding, I
> > > don't think it's related to the other bug.
> > > thanks!
> > >
> > > On Wed, 2 Jul 2014, Denis Samoylov wrote:
> > > > Dormando, sure, we will add an option to presize the hashtable.
> > > > (As I see it, nn should be 26.) One question: as I can see in the
> > > > logs for these servers, there is no change in hash_power_level
> > > > before the incident (it would be hard to say for the crashed ones,
> > > > but .20 just had out-of-memory errors and I have solid stats).
> > > > Doesn't this contradict the idea of the cause? The server had
> > > > hash_power_level = 26 for days before and still has 26 days after.
> > > > Just for three hours, every set for slab 13 failed. We did not
> > > > reboot/flush the server and it continues to work without problems.
> > > > What do you think?
> > > >
> > > > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
> > > > > Hey,
> > > > >
> > > > > Can you presize the hash table (-o hashpower=nn) to be large
> > > > > enough on those servers such that hash expansion won't happen at
> > > > > runtime? You can see what hashpower is on a long-running server
> > > > > via stats to know what to set the value to.
> > > > >
> > > > > If that helps, we might still have a bug in hash expansion. I
> > > > > see someone finally reproduced a possible issue there under .20.
> > > > > .17/.19 fix other causes of the problem pretty thoroughly,
> > > > > though.
> > > > >
> > > > > On Tue, 1 Jul 2014, Denis Samoylov wrote:
> > > > > > Hi,
> > > > > > We had sporadic memory corruption due to tail repair in
> > > > > > pre-.20 versions, so we updated some of our servers to .20.
> > > > > > This Monday we observed several crashes in the .15 version and
> > > > > > tons of "allocation failure" errors in the .20 version. This
> > > > > > is expected, as .20 just disables "tail repair", but it seems
> > > > > > the problem is still there. What is interesting:
> > > > > > 1) there is no visible change in traffic, and usually only one
> > > > > > slab is affected.
> > > > > > 2) this always happens with several, but not all, servers :)
> > > > > >
> > > > > > Is there any way to catch this and help with debugging? I have
> > > > > > all slab and item stats for the time around the incident for
> > > > > > the .15 and .20 versions. .15 is clearly memory corruption:
> > > > > > gdb shows that the hash function returned 0 (line 115:
> > > > > > uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
> > > > > >
> > > > > > So it seems we are hitting this comment:
> > > > > >     /* Old rare bug could cause a refcount leak. We haven't
> > > > > >      * seen it in years, but we leave this code in to prevent
> > > > > >      * failures just in case */
> > > > > > :)
> > > > > >
> > > > > > Thank you,
> > > > > > Denis
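P.S. On the presizing suggestion quoted above: our long-running servers report hash_power_level = 26 in stats, so if I understand the option correctly, the startup line would look something like this (the -p/-m values are placeholders for whatever a given server already uses):

    memcached -p 11211 -m 4096 -o hashpower=26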
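P.P.S. Regarding the keepalive question above: this is a minimal sketch of what enabling TCP keepalives could look like on our side, assuming we do it with plain POSIX socket calls rather than through the libmemcached API (the idle/interval/count values are made up for illustration):

    /* Sketch: turn on TCP keepalives for an already-connected socket.
     * The TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT tuning is
     * Linux-specific, so it is guarded by an #ifdef. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int enable_keepalive(int fd) {
        int on = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
    #ifdef TCP_KEEPIDLE
        int idle = 60;   /* start probing after 60s of silence */
        int intvl = 10;  /* then probe every 10s */
        int cnt = 5;     /* drop the connection after 5 failed probes */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    #endif
        return 0;
    }

With settings like these, a dead peer would be detected after roughly idle + intvl * cnt seconds instead of hanging for hours, which (if stuck connections are indeed the cause) should keep the tail items from being pinned that long.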