Cool. That is disappointing.

Can you clarify a few things for me:

1) You're saying that you were getting OOMs on slab 13, but it recovered
on its own? This is under version 1.4.20 and you did *not* enable tail
repairs?

2) Can you share (with me at least) the full stats/stats items/stats slabs
output from one of the affected servers running 1.4.20? (A quick way to
collect these is sketched after this list.)

3) Can you confirm that 1.4.20 isn't *crashing*, but is actually
exhibiting write failures?
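
For question 2, a minimal way to grab those dumps would be something like
the sketch below (hypothetical host name, default port 11211, and a stock
Python 3 standard library assumed):

    # dump_stats.py -- collect stats / stats items / stats slabs over the
    # memcached text protocol
    import socket

    HOST, PORT = "memc-affected-01", 11211  # hypothetical host, default port

    def dump(cmd):
        """Send one stats command and return the raw response up to END."""
        with socket.create_connection((HOST, PORT), timeout=5) as s:
            s.sendall((cmd + "\r\n").encode())
            buf = b""
            while not buf.endswith(b"END\r\n"):
                chunk = s.recv(4096)
                if not chunk:
                    break
                buf += chunk
        return buf.decode()

    for cmd in ("stats", "stats items", "stats slabs"):
        print("===", cmd, "===")
        print(dump(cmd))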

If it's not a crash, and your hash power level isn't expanding, I don't
think it's related to the other bug.

thanks!

On Wed, 2 Jul 2014, Denis Samoylov wrote:

> Dormando, sure, we will add the option to preset the hashtable (as I see,
> nn should be 26).
> One question: as I see in the logs for these servers, there is no change in
> hash_power_level before the incident (it would be hard to say for the
> crashed ones, but .20 just had out-of-memory errors and I have solid
> stats). Doesn't this contradict that idea of the cause? The server had
> hash_power_level = 26 for days before and still has 26 days after. Just for
> three hours, every set for slab 13 failed. We did not reboot/flush the
> server and it continues to work without problems. What do you think?
>
> On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote:
>       Hey,
>
>       Can you presize the hash table (-o hashpower=nn) to be large enough on
>       those servers that hash expansion won't happen at runtime? You can see
>       what hashpower is on a long-running server via stats to know what to
>       set the value to.
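>
>       For what it's worth, a minimal sketch of reading that value over the
>       text protocol (hypothetical host name, default port, stock Python 3
>       standard library only) could look like:
>
>           # check_hashpower.py -- read hash_power_level from a running server
>           import socket
>
>           HOST, PORT = "memc-affected-01", 11211  # hypothetical host, default port
>
>           with socket.create_connection((HOST, PORT), timeout=5) as s:
>               s.sendall(b"stats\r\n")
>               buf = b""
>               while not buf.endswith(b"END\r\n"):
>                   chunk = s.recv(4096)
>                   if not chunk:
>                       break
>                   buf += chunk
>
>           for line in buf.decode().splitlines():
>               if line.startswith("STAT hash_power_level"):
>                   level = int(line.split()[-1])
>                   print("current hash_power_level:", level)
>                   print("suggested startup option: -o hashpower=%d" % level)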
>
>       If that helps, we might still have a bug in hash expansion. I see
>       someone finally reproduced a possible issue there under .20. .17/.19
>       fix other causes of the problem pretty thoroughly though.
>
>       On Tue, 1 Jul 2014, Denis Samoylov wrote:
>
>       > Hi,
>       > We had sporadic memory corruption due to tail repair in pre-.20
>       > versions, so we updated some of our servers to .20. This Monday we
>       > observed several crashes in the .15 version and tons of "allocation
>       > failure" errors in the .20 version. This is expected, as .20 just
>       > disables "tail repair", but it seems the problem is still there.
>       > What is interesting:
>       > 1) there is no visible change in traffic, and usually only one slab
>       > is affected.
>       > 2) this always happens with several but not all servers :)
>       >
>       > Is there any way to catch this and help with debugging? I have all
>       > slab and item stats for the time around the incident for the .15 and
>       > .20 versions. The .15 case is clearly memory corruption: gdb shows
>       > that the hash function returned 0
>       > (line 115: uint32_t hv = hash(ITEM_key(search), search->nkey, 0);).
>       >
>       > So it seems we are hitting this comment:
>       >             /* Old rare bug could cause a refcount leak. We haven't seen
>       >              * it in years, but we leave this code in to prevent failures
>       >              * just in case */
>       >
>       > :)
>       >
>       > Thank you,
>       > Denis
>       >
