Re: tail repair issue (1.4.20)

2014-08-28 Thread dormando
Thanks so much for sticking around and testing! I have a number of bugs to go over as I mentioned before, so it may take a little longer to bake this into a release. I still want to add a cap on how much churn it allows, so for 10,000 items you might instead get a handful of OOM's. This is to
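
For a sense of what such a cap might look like, here is a minimal sketch in C. It is not memcached's actual code; the item struct is simplified, and move_to_head() and CHURN_LIMIT are hypothetical:

    #include <stddef.h>

    typedef struct _item item;
    struct _item {
        item *next, *prev;
        unsigned short refcount;
    };

    void move_to_head(item *it);   /* hypothetical helper */

    #define CHURN_LIMIT 5          /* hypothetical cap per allocation */

    /* Walk up from the LRU tail; relink at most CHURN_LIMIT refcounted
     * items, then give up with NULL (surfaced to the client as an OOM)
     * instead of churning thousands of entries. */
    item *find_evictable(item *tail) {
        int moved = 0;
        item *search = tail;
        while (search != NULL) {
            item *prev = search->prev;   /* save before relinking */
            if (search->refcount == 0)
                return search;           /* safe to evict and reuse */
            if (++moved > CHURN_LIMIT)
                return NULL;             /* a handful of OOMs, not 10,000 moves */
            move_to_head(search);
            search = prev;
        }
        return NULL;
    }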

Re: tail repair issue (1.4.20)

2014-08-26 Thread Jay Grizzard
Okay, so, we did some testing! I deployed a test build last Thursday and let it run with no further changes, graphing the ‘reflocked’ counter (which is the metric I added for ‘refcounted so moved to other end of LRU’). The graph for that ends up looking like this: http://i.imgur.com/0CZfHWf.png
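
The 'reflocked' counter itself is conceptually tiny: bump a stat every time the tail search has to move a refcounted item instead of evicting it. A sketch of what such instrumentation could look like (the counter name is Jay's; the surrounding code and move_to_head() are illustrative, not his patch):

    #include <stdint.h>

    typedef struct _item item;
    void move_to_head(item *it);          /* hypothetical helper */

    static uint64_t stats_reflocked = 0;  /* exported through "stats" output */

    /* Called when the tail search finds a refcounted item: count the
     * event so the churn is graphable, then shunt the item to the
     * other end of the LRU instead of evicting it. */
    static void reflock_and_move(item *search) {
        stats_reflocked++;
        move_to_head(search);
    }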

Re: tail repair issue (1.4.20)

2014-08-21 Thread Jay Grizzard
Hi, sorry about the slow response. Naturally, the daily problem we were having stopped as soon as you checked in that patch. Typical, eh? Anyhow, I’ve studied the patch and it seems to be pretty good — the only worry I have is that if you end up with the extremely degenerate case of an entire LRU

Re: tail repair issue (1.4.20)

2014-08-21 Thread dormando
Okay cool. As I mentioned with the original link, I will be adding some sort of sanity checking to break the loop. I just have to reorganize the whole thing and ran out of time (I got stuck for a while because unlink was wiping search->prev and it kept bailing the loop :P) I need someone to try it
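
The snag described here (unlink wiping search->prev) is the classic iterate-while-unlinking bug: if the walk reads search->prev after the item's links have been cleared, the pointer is NULL and the loop exits early. A generic C illustration, not the actual patch; should_reap() and do_unlink() are hypothetical:

    #include <stddef.h>

    typedef struct _item item;
    struct _item { item *next, *prev; };

    int  should_reap(item *it);   /* hypothetical predicate */
    void do_unlink(item *it);     /* assumed to NULL it->prev and it->next */

    /* Buggy shape: do_unlink() wipes search->prev, so reading it
     * afterwards ends the walk one item in. */
    void walk_buggy(item *search) {
        while (search != NULL) {
            if (should_reap(search))
                do_unlink(search);
            search = search->prev;   /* BUG: just wiped by do_unlink() */
        }
    }

    /* Fixed shape: capture prev before touching the links. */
    void walk_fixed(item *search) {
        while (search != NULL) {
            item *prev = search->prev;   /* save first */
            if (should_reap(search))
                do_unlink(search);
            search = prev;
        }
    }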

Re: tail repair issue (1.4.20)

2014-08-11 Thread dormando
Apparently I lied about the weekend, sorry... On Mon, 11 Aug 2014, Jay Grizzard wrote: > > Well, sounds like whatever process was asking for that data is dead (and possibly pissing off a customer) so you should indeed figure out what that's about. > Yeah, we'll definitely hunt this one down.

Re: tail repair issue (1.4.20)

2014-08-11 Thread dormando
> Well, sounds like whatever process was asking for that data is dead (and possibly pissing off a customer) so you should indeed figure out what that's about. Yeah, we'll definitely hunt this one down. I'll have to toss up a monitor to look for things in a write state for extended

Re: tail repair issue (1.4.20)

2014-08-08 Thread Jay Grizzard
Okay, I'm pretty sure I understand what's going on here now. Here's what I think the sequence of events is: - Client does gets for a very large number of keys. I'm not sure how to actually see the request in the core (if that data is even still attached), but isize (list of items to write
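
For context on why a huge multiget can pin items: each item returned by a get keeps its reference count raised until the response has been fully written to the client, so a connection stuck in conn_mwrite holds every fetched item against eviction. A condensed sketch of that lifecycle (item_get(), item_remove(), add_iov(), ITEM_data, c->ilist and conn_mwrite are real memcached names; the exact bookkeeping here is simplified, and this is a fragment, not a standalone program):

    /* One key of a large multiget, simplified. */
    item *it = item_get(key, nkey);              /* refcount++: pinned for this conn */
    if (it != NULL) {
        add_iov(c, ITEM_data(it), it->nbytes);   /* queue the value for writeback */
        *(c->ilist + c->ileft++) = it;           /* remember it until the write ends */
    }
    /* ...the connection then sits in conn_mwrite while the response
     * drains to the client; only when that finishes does each item get
     * item_remove()'d (refcount--). A stalled client therefore keeps
     * every item in the batch refcounted and unevictable. */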

Re: tail repair issue (1.4.20)

2014-08-07 Thread dormando
Thanks! It might take me a while to look into it more closely. That conn_mwrite is probably bad; however, a single connection shouldn't be able to do it. Before giving up with an OOM, memcached walks up the chain from the bottom of the LRU by 5ish items. So all of them have to be locked, or possibly some
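
The walk being described is the tail search in do_item_alloc(): it tries a handful of items (five in the 1.4.x code) up from the LRU tail, and the OOM only surfaces if every candidate is held by another reference. Roughly this shape, condensed rather than quoted verbatim:

    /* 'tail' is the LRU tail of this slab class; refcount_incr() and
     * refcount_decr() are memcached's real refcount helpers. */
    int tries = 5;
    item *search, *victim = NULL;
    for (search = tail; tries > 0 && search != NULL;
         tries--, search = search->prev) {
        if (refcount_incr(&search->refcount) != 2) {
            /* someone else holds a reference: skip this candidate */
            refcount_decr(&search->refcount);
            continue;
        }
        victim = search;   /* expired or evictable: reuse its memory */
        break;
    }
    if (victim == NULL) {
        /* all ~5 tail candidates were locked: the client sees
         * "SERVER_ERROR out of memory storing object" */
    }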

Re: tail repair issue (1.4.20)

2014-07-07 Thread Denis Samoylov
Dormando, sure, I waited till Monday (our usual tailrepair/OOM errors day) but we did not have any issues today :). I will continue to monitor and will grab 'stats conns' next time. As for network issues during that time, I do not see any, but I am still trying to find them. This can be good

Re: tail repair issue (1.4.20)

2014-07-07 Thread Denis Samoylov
It does not use the whole memory :). Also, a small misprint: it is 25 for this pool (26 is a different pool). I've sent you the stats. On Wednesday, July 2, 2014 8:27:09 PM UTC-7, Zhiwei Chan wrote: > Hi, with hash power 26 and slab 13, that means (2**26)*1.5*1488 ≈ 140G of memory is needed. Could you

Re: tail repair issue (1.4.20)

2014-07-07 Thread dormando
> Dormando, sure, I waited till Monday (our usual tailrepair/OOM errors day) but we did not have any issues today :). I will continue to monitor and will grab 'stats conns' next time. Great, thanks! > As for network issues during that time, I do not see any, but I am still trying to find them. This

Re: tail repair issue (1.4.20)

2014-07-02 Thread Denis Samoylov
Dormando, sure, we will add an option to presize the hash table (as far as I can see, nn should be 26). One question: as I see in the logs for these servers, there is no change in hash_power_level before the incident (it would be hard to say for the crashed ones, but .20 just had out-of-memory errors and I have solid stats). Doesn't this

Re: tail repair issue (1.4.20)

2014-07-02 Thread Denis Samoylov
Zhiwei, thank you for the info. But I am still not sure that this relates to hash table growth (see my answer to Dormando in this thread); it went on for about three hours and then disappeared... Or did I miss that part of the code (do_item_alloc is small but with a fancy idea :))? -denis On Tuesday, July 1, 2014

Re: tail repair issue (1.4.20)

2014-07-02 Thread dormando
Cool. That is disappointing. Can you clarify a few things for me: 1) You're saying that you were getting OOM's on slab 13, but it recovered on its own? This is under version 1.4.20 and you did *not* enable tail repairs? 2) Can you share (with me at least) the full stats/stats items/stats slabs

Re: tail repair issue (1.4.20)

2014-07-02 Thread Denis Samoylov
> 1) You were getting OOM's on slab 13, but it recovered on its own? This is under version 1.4.20 and you did *not* enable tail repairs? Correct. > 2) Can you share (with me at least) the full stats/stats items/stats slabs output from one of the affected servers running 1.4.20? Sent you the _current_ stats from the

Re: tail repair issue (1.4.20)

2014-07-02 Thread dormando
Thanks! This is a little exciting actually: it's a new bug! Tail repairs were only necessary when an item was legitimately leaked; if we don't reap it, it never gets better. However, you stated that for three hours all sets failed (and at the same time some .15's crashed). Then it self-recovered.

Re: tail repair issue (1.4.20)

2014-07-02 Thread Zhiwei Chan
Hi, with hash power 26 and slab 13, that means (2**26)*1.5*1488 ≈ 140G of memory is needed. Could you please post the stats info to this thread, or send a copy to me too? And are those tons of 'allocation failure' messages from the system log, or the outofmemory statistic in memcached? Lastly, I think
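
Spelling that estimate out: hashpower 26 means 2**26 buckets, memcached grows the table once the item count passes 1.5x the bucket count, and 1488 bytes is the chunk size quoted for slab 13 in this thread. A quick check in C:

    #include <stdio.h>

    int main(void) {
        double buckets = (double)(1UL << 26);   /* hashpower 26 */
        double bytes   = buckets * 1.5 * 1488.0;
        printf("%.1f GB (%.1f GiB)\n",
               bytes / 1e9, bytes / (1024.0 * 1024.0 * 1024.0));
        /* prints: 149.8 GB (139.5 GiB), i.e. roughly 140G */
        return 0;
    }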

tail repair issue (1.4.20)

2014-07-01 Thread Denis Samoylov
Hi, we had sporadic memory corruption due to tail repair in pre-.20 versions, so we updated some of our servers to .20. This Monday we observed several crashes on the .15 version and tons of allocation failures on the .20 version. This is expected, as .20 just disables tail repair, but it seems the problem is

Re: tail repair issue (1.4.20)

2014-07-01 Thread dormando
Hey, can you presize the hash table (-o hashpower=nn) to be large enough on those servers that hash expansion won't happen at runtime? You can see what the hashpower is on a long-running server via stats, to know what to set the value to. If that helps, we might still have a bug in hash
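
In practice that means reading hash_power_level off a long-running instance and feeding it back at startup, for example (the stat and the flag are real; the hostname and -m size below are made up):

    $ echo stats | nc cache01 11211 | grep hash_power_level
    STAT hash_power_level 26
    $ memcached -m 16384 -o hashpower=26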

Re: tail repair issue (1.4.20)

2014-07-01 Thread Zhiwei Chan
Hi, I think it is the same bug as issue #370; I have found a way to reproduce it and submitted a fix patch on GitHub. On Wednesday, July 2, 2014 5:43:49 AM UTC+8, Dormando wrote: > Hey, can you presize the hash table (-o hashpower=nn) to be large enough on those servers that hash expansion won't happen