Dormando, sure. I waited until Monday (our usual day for tailrepair/OOM errors), but we did not have any issues today :). I will continue to monitor and will grab "stats conns" next time.
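In case it helps, this is roughly what I plan to run against an affected box while it is OOM'ing (a minimal sketch, not production code; the 127.0.0.1/11211 host and port are placeholders for one of our servers, and a real capture would be timestamped and repeated a few minutes apart as you suggested):

    /* Minimal sketch: connect to a memcached instance, send the
     * plain-text "stats conns" command, and print the reply. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void) {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("127.0.0.1", "11211", &hints, &res) != 0)
            return 1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
            return 1;
        freeaddrinfo(res);

        const char *cmd = "stats conns\r\n";
        if (write(fd, cmd, strlen(cmd)) != (ssize_t)strlen(cmd))
            return 1;

        /* The reply is a series of "STAT ..." lines terminated by
         * "END\r\n". This naive check can miss an END split across two
         * reads, which is fine for a throwaway sketch. */
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
            if (strstr(buf, "END\r\n"))
                break;
        }
        close(fd);
        return 0;
    }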
As for network issues during that time: I do not see any, but I am still looking. That would be a good explanation for why these events are grouped in time. As for keepalive: we use the default php-memcached/libmemcached setting (we do not change it), and as far as I can see libmemcached does not set SO_KEEPALIVE. Do you recommend setting it? (A rough sketch of what that would look like at the socket level is at the end of this message.)

On Wednesday, July 2, 2014 7:32:14 PM UTC-7, Dormando wrote:
> Thanks!
>
> This is a little exciting actually, it's a new bug!
>
> tailrepairs was only necessary when an item was legitimately leaked; if
> we don't reap it, it never gets better. However, you stated that for
> three hours all sets fail (and at the same time some .15's crashed).
> Then it self-recovered.
>
> The .15 crashes were likely from the bug I fixed, where an active item
> is fetched from the tail, but then reclaimed because it's old.
>
> The .20 "OOM" is the defensive code working perfectly; something has
> somehow retained a legitimate reference to an item for multiple hours!
> More than one even, since the tail is walked up by several items while
> looking for something to free.
>
> Did you have any network blips, application server crashes, or the like?
> It sounds like some connections are dying in such a way that they time
> out, which is a very long timeout somehow (no tcp keepalives?).
>
> What's *extra* exciting is that 1.4.20 now has the "stats conns"
> command.
>
> If this happens again, while a .20 machine is actively OOM'ing, can you
> grab a couple of copies of the "stats conns" output, a few minutes
> apart? That should definitively tell us if there are stuck connections
> causing this issue.
>
> Someone had a PR open for adding idle connection timeouts, but I asked
> them to redo it on top of the 'stats conns' work as a more efficient
> background thread. I could potentially finish this and it would be
> usable as a workaround. You could also enable tcp keepalives, or
> otherwise fix whatever's causing these events.
>
> I wonder if it's also worth attempting to relink an item that ends up in
> the tail but has references? That would at least potentially get them
> out of the way of memory reclamation.
>
> Thanks!
>
> On Wed, 2 Jul 2014, Denis Samoylov wrote:
> > > 1) OOM's on slab 13, but it recovered on its own? This is under
> > > version 1.4.20 and you did *not* enable tail repairs?
> > correct
> >
> > > 2) Can you share (with me at least) the full stats/stats items/stats
> > > slabs output from one of the affected servers running 1.4.20?
> > sent you _current_ stats from the server that had an OOM a couple of
> > days ago and is still running (now with no issues).
> >
> > > 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> > > exhibiting write failures?
> > correct
> >
> > we will enable saving stderr to a log; maybe that will show something.
> > If you have any other ideas, let me know.
> >
> > -denis
> >
> > On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote:
> > > Cool. That is disappointing.
> > >
> > > Can you clarify a few things for me:
> > >
> > > 1) You're saying that you were getting OOM's on slab 13, but it
> > > recovered on its own? This is under version 1.4.20 and you did *not*
> > > enable tail repairs?
> > >
> > > 2) Can you share (with me at least) the full stats/stats items/stats
> > > slabs output from one of the affected servers running 1.4.20?
> > >
> > > 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
> > > exhibiting write failures?
> > >
> > > If it's not a crash, and your hash power level isn't expanding, I
> > > don't think it's related to the other bug.
> > > thanks!
> > >
> > > On Wed, 2 Jul 2014, Denis Samoylov wrote:
> > > > Dormando, sure, we will add an option to presize the hashtable.
> > > > (As I see it, nn should be 26.) One question: as I can see in the
> > > > logs for these servers, there is no change in hash_power_level
> > > > before the incident (it would be hard to say for the crashed ones,
> > > > but .20 just had out-of-memory errors and I have solid stats).
> > > > Doesn't this contradict the idea of the cause? The server had
> > > > hash_power_level = 26 for days before and still has 26 days after.
> > > > Just for three hours, every set for slab 13 failed. We did not
> > > > reboot/flush the server and it continues to work without problems.
> > > > What do you think?
> > > >
> > > > On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
> > > > > Hey,
> > > > >
> > > > > Can you presize the hash table (-o hashpower=nn) to be large
> > > > > enough on those servers such that hash expansion won't happen at
> > > > > runtime? You can see what hashpower is on a long-running server
> > > > > via stats to know what to set the value to.
> > > > >
> > > > > If that helps, we might still have a bug in hash expansion. I
> > > > > see someone finally reproduced a possible issue there under .20.
> > > > > .17/.19 fix other causes of the problem pretty thoroughly,
> > > > > though.
> > > > >
> > > > > On Tue, 1 Jul 2014, Denis Samoylov wrote:
> > > > > > Hi,
> > > > > > We had sporadic memory corruption due to tail repair in
> > > > > > pre-.20 versions, so we updated some of our servers to .20.
> > > > > > This Monday we observed several crashes in the .15 version and
> > > > > > tons of "allocation failure" errors in the .20 version. This
> > > > > > is expected, as .20 just disables "tail repair", but it seems
> > > > > > the problem is still there. What is interesting:
> > > > > > 1) there is no visible change in traffic, and usually only one
> > > > > > slab is affected.
> > > > > > 2) this always happens with several, but not all, servers :)
> > > > > >
> > > > > > Is there any way to catch this and help with debugging? I have
> > > > > > all slab and item stats for the time around the incident for
> > > > > > the .15 and .20 versions. .15 is clearly memory corruption:
> > > > > > gdb shows that the hash function returned 0 (line 115:
> > > > > > uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
> > > > > >
> > > > > > So it seems we are hitting this comment:
> > > > > >     /* Old rare bug could cause a refcount leak. We haven't
> > > > > >      * seen it in years, but we leave this code in to prevent
> > > > > >      * failures just in case */
> > > > > > :)
> > > > > >
> > > > > > Thank you,
> > > > > > Denis
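P.S. On the presizing suggestion quoted above: our long-running servers report hash_power_level = 26 in stats, so if I understand the option correctly, the startup line would look something like this (the -p/-m values are placeholders for whatever a given server already uses):

    memcached -p 11211 -m 4096 -o hashpower=26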
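P.P.S. Regarding the keepalive question above: this is a minimal sketch of what enabling TCP keepalives could look like on our side, assuming we do it with plain POSIX socket calls rather than through the libmemcached API (the idle/interval/count values are made up for illustration):

    /* Sketch: turn on TCP keepalives for an already-connected socket.
     * The TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT tuning is
     * Linux-specific, so it is guarded by an #ifdef. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int enable_keepalive(int fd) {
        int on = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
    #ifdef TCP_KEEPIDLE
        int idle = 60;   /* start probing after 60s of silence */
        int intvl = 10;  /* then probe every 10s */
        int cnt = 5;     /* drop the connection after 5 failed probes */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    #endif
        return 0;
    }

With settings like these, a dead peer would be detected after roughly idle + intvl * cnt seconds instead of hanging for hours, which (if stuck connections are indeed the cause) should keep the tail items from being pinned that long.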