Thanks so much for sticking around and testing!
I have a number of bugs to go over as I mentioned before, so it may take a
little longer to bake this into a release. I still want to add a cap on
how much churn it allows, so for 10,000 items you might instead get a
handful of OOM's. This is to
Okay, so, we did some testing!
I deployed a test build last Thursday and let it run with no further
changes, graphing the ‘reflocked’ counter (which is the metric I added for
‘refcounted so moved to other end of LRU’). The graph for that ends up
looking like this: http://i.imgur.com/0CZfHWf.png
Hi, sorry about the slow response. Naturally, the daily problem we were
having stopped as soon as you checked in that patch. Typical, eh?
Anyhow, I’ve studied the patch and it seems to be pretty good — the only
worry I have is that if you end up with the extremely degenerate case of an
entire LRU
Okay cool.
As I mentioned with the original link, I will be adding some sort of sanity
checking to break the loop. I just have to reorganize the whole thing and
ran out of time (I got stuck for a while because unlink was wiping
search-prev and it kept bailing the loop :P)
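The loop being discussed can be sketched roughly like this (a simplified illustration with invented names, not the actual memcached patch): still-refcounted items are counted as 'reflocked' and moved to the other end of the LRU, the walker's prev pointer is saved before any relinking (unlink wipes it, which is the bug mentioned above), and a cap on tries bounds the churn.

```c
#include <stddef.h>

/* Doubly-linked LRU node; refcount != 0 means some connection still
 * holds the item ("reflocked"). */
typedef struct item {
    struct item *prev, *next;
    int refcount;
} item;

typedef struct { item *head, *tail; } lru;

static void lru_unlink(lru *l, item *it) {
    if (it->prev) it->prev->next = it->next;
    if (it->next) it->next->prev = it->prev;
    if (l->head == it) l->head = it->next;
    if (l->tail == it) l->tail = it->prev;
    it->prev = it->next = NULL;   /* note: this wipes the walker's prev */
}

static void lru_push_head(lru *l, item *it) {
    it->prev = NULL;
    it->next = l->head;
    if (l->head) l->head->prev = it;
    l->head = it;
    if (l->tail == NULL) l->tail = it;
}

/* Walk up from the tail looking for a reclaimable (refcount == 0) item.
 * Locked items are moved to the head instead of being reclaimed; the
 * max_tries cap bounds the churn, so a fully locked LRU produces an
 * OOM (NULL) instead of an unbounded walk. */
item *lru_find_victim(lru *l, int max_tries, int *reflocked) {
    item *search = l->tail;
    for (int tries = 0; tries < max_tries && search != NULL; tries++) {
        item *prev = search->prev;   /* save BEFORE unlink wipes it */
        if (search->refcount == 0)
            return search;
        (*reflocked)++;              /* the 'reflocked' counter */
        lru_unlink(l, search);
        lru_push_head(l, search);    /* move to the other end */
        search = prev;
    }
    return NULL;                     /* caller reports an OOM */
}
```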
I need someone to try it
Apparently I lied about the weekend, sorry...
On Mon, 11 Aug 2014, Jay Grizzard wrote:
Well, sounds like whatever process was asking for that data is dead (and
possibly pissing off a customer), so you should indeed figure out what
that's about.
Yeah, we'll definitely hunt this one down. I'll have to toss up a monitor to
look for things in a write state for extended
Okay, I'm pretty sure I understand what's going on here now.
This is what I think the sequence of events is:
- Client does gets for a very large number of keys. I'm not sure how to
actually see the request in the core (if that data is even still attached),
but isize (list of items to write
Thanks! It might take me a while to look into it more closely.
That conn_mwrite is probably bad; however, a single connection shouldn't be
able to do it. Before the OOM is given up, memcached walks up the chain
from the bottom of the LRU by 5ish. So all of them have to be locked, or
possibly some
Dormando,
Sure, I waited till Monday (our usual tailrepair/oom errors day) but we did
not have any issues today :). I will continue to monitor and will grab
stats conns next time.
As for network issues during the last time - I do not see any but am still
trying to find. This can be good
it does not use the whole memory :). Also, a small misprint - it is 25 for
this pool (26 is a different pool). I've sent you stats
On Wednesday, July 2, 2014 8:27:09 PM UTC-7, Zhiwei Chan wrote:
Hi,
with the hash power 26, slab 13, that means (2**26)*1.5*1488=142G
memory is needed. Could you
Dormando, Sure, I waited till Monday (our usual tailrepair/oom errors day) but
we did not have any issues today :). I will continue to monitor and
will grab stats conns next time.
Great, thanks!
As for network issues during the last time - i do not see any but still
trying to find. This
Dormando, sure, we will add the option to preset the hashtable (as I see, nn
should be 26).
One question: as I see in the logs for the servers, there is no change to
hash_power_level
before the incident (it would be hard to say for the crashed ones, but .20
just had outofmemory and I have solid stats). Does not this
Zhiwei,
thank you for the info. But I am still not sure that this relates to the hash
table growing (see my answer to Dormando in this thread), and it happened
over a 3-hour window and then disappeared... Or am I missing this part of the
code (do_item_alloc is small but with a fancy idea :) )?
-denis
On Tuesday, July 1, 2014
Cool. That is disappointing.
Can you clarify a few things for me:
1) You're saying that you were getting OOM's on slab 13, but it recovered
on its own? This is under version 1.4.20 and you did *not* enable tail
repairs?
2) Can you share (with me at least) the full stats/stats items/stats slabs
1) OOM's on slab 13, but it recovered on its own? This is under version
1.4.20 and you did *not* enable tail repairs?
correct
2) Can you share (with me at least) the full stats/stats items/stats
slabs output from one of the affected servers running 1.4.20?
sent you _current_ stats from the
Thanks!
This is a little exciting, actually: it's a new bug!
Tailrepairs was only necessary when an item was legitimately leaked; if we
don't reap it, it never gets better. However, you stated that for three
hours all sets failed (and at the same time some .15's crashed). Then it
self-recovered.
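For context, the tail-repair heuristic under discussion worked roughly like this (a from-memory sketch with invented names, not the actual pre-1.4.20 source; the timeout value in the test is an assumption):

```c
#include <stdbool.h>

/* Sketch of the old tail-repair check: if the LRU tail item has stayed
 * refcounted past a long timeout, assume the reference was leaked,
 * forcibly drop it, and unlink the item so its memory can be reused.
 * The danger (seen earlier in this thread) is memory corruption when
 * the "leaked" reference was actually still live; 1.4.20 disables the
 * repair, so a genuine leak instead OOMs until restart. */
typedef struct {
    int refcount;
    long last_access;   /* seconds, clock of the caller's choosing */
} tail_item;

/* tail_repair_time: how long a locked tail item must sit untouched
 * before being declared leaked (value is up to the caller). */
bool needs_tail_repair(const tail_item *it, long now, long tail_repair_time) {
    return it->refcount != 0 && it->last_access + tail_repair_time < now;
}
```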
Hi,
with the hash power 26, slab 13, that means (2**26)*1.5*1488=142G memory
is needed. Could you please post the stats info in this thread or send a
copy to me too? And are the tons of 'allocation failure' in the system
log, or the outofmemory statistic in memcached? Lastly, I think
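The arithmetic above can be checked directly (a sketch; the 1.5 factor is memcached's usual items-per-bucket threshold for growing the hash table, and 1488 is the slab 13 chunk size quoted in the message):

```c
#include <stdint.h>

/* Reproduces the (2**26)*1.5*1488 estimate: 2^26 hash buckets, times
 * the ~1.5 items-per-bucket threshold at which memcached grows the
 * table, times slab 13's 1488-byte chunks.  The product is
 * 149,786,984,448 bytes, i.e. exactly 139.5 GiB; in the same ballpark
 * as the "142G" quoted above. */
double implied_item_bytes(int hashpower, double load_factor, double chunk_size) {
    return (double)(1ULL << hashpower) * load_factor * chunk_size;
}
```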
Hi,
We had sporadic memory corruption due to tail repair in the pre-.20 versions,
so we updated some of our servers to .20. This Monday we observed several
crashes in the .15 version and tons of allocation failures in the .20
version. This is expected, as .20 just disables tail repair, but it seems
the problem is
Hey,
Can you presize the hash table? (-o hashpower=nn) to be large enough on
those servers such that hash expansion won't happen at runtime? You can
see what hashpower is on a long running server via stats to know what to
set the value to.
If that helps, we might still have a bug in hash
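A minimal sketch of what presizing buys (assuming the usual layout: the hash table is an array of item pointers with 1 << hashpower buckets; the 8-byte pointer size and the hashpower-16 starting point are assumptions):

```c
#include <stdint.h>

/* memcached's hash table is (roughly) an array of item pointers with
 * 1 << hashpower buckets, so -o hashpower=26 allocates the final table
 * at startup and no bucket-by-bucket migration has to run while
 * serving traffic.  8-byte pointers are assumed here. */
uint64_t hash_table_bytes(unsigned int hashpower) {
    return (1ULL << hashpower) * 8;   /* 8 bytes per bucket pointer */
}
/* hash_table_bytes(16) is 512 KiB (a typical starting size);
 * hash_table_bytes(26) is 512 MiB, allocated once, up front. */
```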
Hi,
I think it is the same bug as issue #370; I have found a way to reproduce
it and submitted a fix patch as a pull request on github.
On Wednesday, July 2, 2014 5:43:49 AM UTC+8, Dormando wrote:
Hey,
Can you presize the hash table? (-o hashpower=nn) to be large enough on
those servers such that hash expansion won't happen