any luck?

> On Oct 6, 2015, at 12:23 AM, Dormando <dorma...@rydia.net> wrote:
> 
> ah. I pushed two more changes earlier. should fix mem_requested. just 
> cosmetic stuff though
> 
>> On Oct 6, 2015, at 12:13 AM, Scott Mansfield <smansfi...@netflix.com> wrote:
>> 
>> Oops, looks like the latest code didn't get into production today. I'm 
>> building it again, same plan as before.
>> 
>>> On Monday, October 5, 2015 at 4:38:00 PM UTC-7, Dormando wrote:
>>> Looking forward to the results. Thanks for getting on this so quickly. 
>>> 
>>> I think there's still a bug in tracking requested memory, and I want to 
>>> move the stats counters to a rollup at the end of a page move. 
>>> Otherwise I think this branch is complete pending any further stability 
>>> issues or feedback. 
>>> 
>>> On Mon, 5 Oct 2015, Scott Mansfield wrote: 
>>> 
>>>> I just put the newest code into production. I'm going to monitor it for a bit to see how it behaves. As long as there are no obvious issues I'll enable reads in a few hours, which are an order of magnitude more traffic. I'll let you know what I find.
>>>>
>>>> On Monday, October 5, 2015 at 1:29:03 AM UTC-7, Dormando wrote:
>>>>> It took a day of running torture tests which took 30-90 minutes to fail, but along with a bunch of house chores I believe I've found the problem:
>>>>>
>>>>> https://github.com/dormando/memcached/tree/slab_rebal_next - has a new commit, specifically this:
>>>>> https://github.com/dormando/memcached/commit/1c32e5eeff5bd2a8cc9b652a2ed808157e4929bb
>>>>>
>>>>> It's somewhat relieving that when I brained this super hard back in January I may have actually gotten the complex set of interactions correct; I simply failed to keep typing when converting the comments to code.
>>>>>
>>>>> So this has been broken since 1.4.24, but hardly anyone uses the page mover, apparently. Once fixed, it survived a 5 hour torture test (that I wrote in 2011!) where it previously died after 30-90 minutes. So please give this one a try and let me know how it goes.
>>>>>
>>>>> If it goes well I can merge up some other fixes from the PR list and cut a release, unless someone has feedback for something to change.
>>>>>
>>>>> thanks!
>>>>>
>>>>> On Thu, 1 Oct 2015, dormando wrote:
>>>>>>
>>>>>> I've seen items.c:1183 reported elsewhere in 1.4.24... so probably the bug was introduced when I rewrote the page mover for that.
>>>>>>
>>>>>> I didn't mean to send me a core file: I meant that if you dump the core you can load it in gdb and get the backtrace (bt + thread apply all bt).
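>>>>>>
>>>>>> Roughly like this, assuming the core lands in the working directory (paths and core naming vary by distro):
>>>>>>
>>>>>> $ gdb ./memcached core
>>>>>> (gdb) bt
>>>>>> (gdb) thread apply all bt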
>>>>>>
>>>>>> Don't have a handler for convenient attaching :(
>>>>>>
>>>>>> didn't get a chance to poke at this today... I'll need another day to try it out.
>>>>>>
>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>
>>>>>>> Sorry for the data dumps here, but I want to give you everything I have. I found 3 more addresses that showed up in the dmesg logs:
>>>>>>>
>>>>>>> $ for addr in 40e013 40eff4 40f7c4; do addr2line -e memcached $addr; done
>>>>>>> .../build/memcached-1.4.24-slab-rebal-next/slabs.c:265 (discriminator 1)
>>>>>>> .../build/memcached-1.4.24-slab-rebal-next/items.c:312 (discriminator 1)
>>>>>>> .../build/memcached-1.4.24-slab-rebal-next/items.c:1183
>>>>>>>
>>>>>>> I still haven't tried to attach a debugger, since the frequency of the error would make it hard to catch. Is there a handler I could add in to dump the stack trace when it segfaults? I'd get a core dump, but they would be HUGE and contain confidential information.
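>>>>>>>
>>>>>>> (Something like this minimal sketch is what I have in mind; it uses glibc's backtrace facilities and isn't in memcached today, so treat it as hypothetical. backtrace() can allocate on first use, so it's best-effort inside a signal handler:
>>>>>>>
>>>>>>> #include <execinfo.h>
>>>>>>> #include <signal.h>
>>>>>>> #include <unistd.h>
>>>>>>>
>>>>>>> static void segv_handler(int sig) {
>>>>>>>     void *frames[64];
>>>>>>>     int n = backtrace(frames, 64);
>>>>>>>     /* writes the raw frame addresses straight to stderr, no malloc */
>>>>>>>     backtrace_symbols_fd(frames, n, STDERR_FILENO);
>>>>>>>     signal(sig, SIG_DFL);  /* restore the default action... */
>>>>>>>     raise(sig);            /* ...and re-raise so we still get the core/dmesg line */
>>>>>>> }
>>>>>>>
>>>>>>> /* during startup; sigaction() would be the more robust way to install it */
>>>>>>> signal(SIGSEGV, segv_handler);
>>>>>>>
>>>>>>> That would print a trace without shipping a full core anywhere.)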
>>>>>>>
>>>>>>> Below are the full dmesg logs. Out of 205 servers, 35 had dmesg logs after a memcached crash, and only one crashed twice, both times on the original segfault. From the full unified set below you can get a sense of frequency.
>>>>>>>
>>>>>>> [47992.109269] memcached[2798]: segfault at 0 ip 000000000040e007 sp 00007f4d20d25eb0 error 4 in memcached[400000+1d000]
>>>>>>> [48960.851278] memcached[2805]: segfault at 0 ip 000000000040e007 sp 00007f3c30d15eb0 error 4 in memcached[400000+1d000]
>>>>>>> [46421.604609] memcached[2784]: segfault at 0 ip 000000000040e007 sp 00007fdb94612eb0 error 4 in memcached[400000+1d000]
>>>>>>> [48429.671534] traps: memcached[2768] general protection ip:40e013 sp:7f1c32676be0 error:0 in memcached[400000+1d000]
>>>>>>> [71838.979269] memcached[2792]: segfault at 0 ip 000000000040e007 sp 00007f0162feeeb0 error 4 in memcached[400000+1d000]
>>>>>>> [66763.091475] memcached[2804]: segfault at 0 ip 000000000040e007 sp 00007f8240170eb0 error 4 in memcached[400000+1d000]
>>>>>>> [102544.376092] traps: memcached[2792] general protection ip:40eff4 sp:7fa58095be18 error:0 in memcached[400000+1d000]
>>>>>>> [49932.757825] memcached[2777]: segfault at 0 ip 000000000040e007 sp 00007f1ff2131eb0 error 4 in memcached[400000+1d000]
>>>>>>> [50400.415878] memcached[2794]: segfault at 0 ip 000000000040e007 sp 00007f11a26daeb0 error 4 in memcached[400000+1d000]
>>>>>>> [48986.340345] memcached[2786]: segfault at 0 ip 000000000040e007 sp 00007f9235279eb0 error 4 in memcached[400000+1d000]
>>>>>>> [44742.175894] memcached[2796]: segfault at 0 ip 000000000040e007 sp 00007eff3a0cceb0 error 4 in memcached[400000+1d000]
>>>>>>> [49030.431879] memcached[2776]: segfault at 0 ip 000000000040e007 sp 00007fdef27cfbe0 error 4 in memcached[400000+1d000]
>>>>>>> [50211.611439] traps: memcached[2782] general protection ip:40e013 sp:7f9ee1723be0 error:0 in memcached[400000+1d000]
>>>>>>> [62534.892817] memcached[2783]: segfault at 0 ip 000000000040e007 sp 00007f37f2d4beb0 error 4 in memcached[400000+1d000]
>>>>>>> [78697.201195] memcached[2801]: segfault at 0 ip 000000000040e007 sp 00007f696ef1feb0 error 4 in memcached[400000+1d000]
>>>>>>> [48922.246712] memcached[2804]: segfault at 0 ip 000000000040e007 sp 00007f1ebb338eb0 error 4 in memcached[400000+1d000]
>>>>>>> [52170.371014] memcached[2809]: segfault at 0 ip 000000000040e007 sp 00007f5e62fcbeb0 error 4 in memcached[400000+1d000]
>>>>>>> [69531.775868] memcached[2785]: segfault at 0 ip 000000000040e007 sp 00007ff50ac2eeb0 error 4 in memcached[400000+1d000]
>>>>>>> [48926.661559] memcached[2799]: segfault at 0 ip 000000000040e007 sp 00007f71e0ac6be0 error 4 in memcached[400000+1d000]
>>>>>>> [49491.126885] memcached[2745]: segfault at 0 ip 000000000040e007 sp 00007f5737c4beb0 error 4 in memcached[400000+1d000]
>>>>>>> [104247.724294] traps: memcached[2793] general protection ip:40f7c4 sp:7f3af8c27eb0 error:0 in memcached[400000+1d000]
>>>>>>> [78098.528606] traps: memcached[2757] general protection ip:412b9d sp:7fc0700dbdd0 error:0 in memcached[400000+1d000]
>>>>>>> [71958.385432] memcached[2809]: segfault at 0 ip 000000000040e007 sp 00007f8b68cd0eb0 error 4 in memcached[400000+1d000]
>>>>>>> [48934.182852] memcached[2787]: segfault at 0 ip 000000000040e007 sp 00007f0aef774eb0 error 4 in memcached[400000+1d000]
>>>>>>> [104220.754195] traps: memcached[2802] general protection ip:40f7c4 sp:7ffa85a2deb0 error:0 in memcached[400000+1d000]
>>>>>>> [45807.670246] memcached[2755]: segfault at 0 ip 000000000040e007 sp 00007fd74a1d0eb0 error 4 in memcached[400000+1d000]
>>>>>>> [73640.102621] memcached[2802]: segfault at 0 ip 000000000040e007 sp 00007f7bb30bfeb0 error 4 in memcached[400000+1d000]
>>>>>>> [67690.640196] memcached[2787]: segfault at 0 ip 000000000040e007 sp 00007f299580feb0 error 4 in memcached[400000+1d000]
>>>>>>> [57729.895442] memcached[2786]: segfault at 0 ip 000000000040e007 sp 00007f204073deb0 error 4 in memcached[400000+1d000]
>>>>>>> [48009.284226] memcached[2801]: segfault at 0 ip 000000000040e007 sp 00007f7b30876eb0 error 4 in memcached[400000+1d000]
>>>>>>> [48198.211826] memcached[2811]: segfault at 0 ip 000000000040e007 sp 00007fd496d79eb0 error 4 in memcached[400000+1d000]
>>>>>>> [84057.439927] traps: memcached[2804] general protection ip:40f7c4 sp:7fbe75fffeb0 error:0 in memcached[400000+1d000]
>>>>>>> [50215.489124] memcached[2784]: segfault at 0 ip 000000000040e007 sp 00007f3234b73eb0 error 4 in memcached[400000+1d000]
>>>>>>> [46545.316351] memcached[2789]: segfault at 0 ip 000000000040e007 sp 00007f362ceedeb0 error 4 in memcached[400000+1d000]
>>>>>>> [102076.523474] memcached[29833]: segfault at 0 ip 000000000040e007 sp 00007f3c89b9ebe0 error 4 in memcached[400000+1d000]
>>>>>>> [55537.568254] memcached[2780]: segfault at 0 ip 000000000040e007 sp 00007fc1f6005eb0 error 4 in memcached[400000+1d000]
>>>>>>>
>>>>>>> On Thursday, October 1, 2015 at 5:40:35 PM UTC-7, Dormando wrote:
>>>>>>>> got it. that might be a decent hint actually... I had added a bugfix to the branch to not miscount the mem_requested counter, but it's not working or I missed a spot.
>>>>>>>>
>>>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>>>
>>>>>>>>> The number now, after maybe 90 minutes of writes, is 1,446. I think after disabling it, a lot of the data TTL'd out. I have to disable it for now, again (for unrelated reasons, again). The page that I screenshotted gives real-time data, so the numbers were from right then. Last night it should have shown better numbers in terms of "total_pages", but I didn't get a screenshot. That number is directly from the stats slabs output.
>>>>>>>>>
>>>>>>>>> On Thursday, October 1, 2015 at 4:21:42 PM UTC-7, Dormando wrote:
>>>>>>>>>> ok... slab class 12 claims to have 2 in "total_pages", yet 14g in mem_requested. is this stat wrong?
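>>>>>>>>>>
>>>>>>>>>> (A quick way to watch that pair on a host is something like the below; host and port are placeholders, and the STAT names are straight from "stats slabs":
>>>>>>>>>>
>>>>>>>>>> $ printf 'stats slabs\r\nquit\r\n' | nc 127.0.0.1 11211 | egrep '^STAT 12:(total_pages|mem_requested) '
>>>>>>>>>>
>>>>>>>>>> mem_requested should never exceed total_pages times the page size, so 2 pages vs 14g means the counter is being miscounted.)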
>>>>>>>>>>
>>>>>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>>>>>
>>>>>>>>>>> The ones that crashed (new code cluster) were set to only be written to from the client applications. The data is an index key and a series of data keys that are all written one after another. Each key might be hashed to a different server, though, so not all of them are written to the same server. I can give you a snapshot of one of the clusters that didn't crash (attached file). I can give more detail offline if you need it.
>>>>>>>>>>>
>>>>>>>>>>> On Thursday, October 1, 2015 at 2:32:53 PM UTC-7, Dormando wrote:
>>>>>>>>>>>> Any chance you could describe (perhaps privately?) in very broad strokes what the write load looks like? (they're getting only writes, too?) otherwise I'll have to devise arbitrary torture tests. I'm sure the bug's in there but it's not obvious yet.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 1 Oct 2015, dormando wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> perfect, thanks! I have $dayjob as well but will look into this as soon as I can. my torture test machines are in a box but I'll try to borrow one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes. Exact args:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -p 11211 -u <omitted> -l 0.0.0.0 -c 100000 -o slab_reassign -o lru_maintainer,lru_crawler,hash_algorithm=murmur3 -I 4m -m 56253
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thursday, October 1, 2015 at 12:41:06 PM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>> Were lru_maintainer/lru_crawler/etc enabled though? even if slab mover is off, those two were the big changes in .24.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The same cluster has > 400 servers happily running 1.4.24. It's been our standard deployment for a while now, and we haven't seen any crashes. The servers in the same cluster running 1.4.24 (with the same write load the new build was taking) have been up for 29 days. The start options do not contain the slab_automove option because it wasn't effective for us before. The memory given is possibly slightly different per server, as we calculate on startup how much we give. It's in the same ballpark, though (~56 gigs).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thursday, October 1, 2015 at 12:11:35 PM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>>>> Just before I sit in and try to narrow this down: have you run any host on 1.4.24 mainline with those same start options? just in case the crash is older.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Another message for you:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [78098.528606] traps: memcached[2757] general protection ip:412b9d sp:7fc0700dbdd0 error:0 in memcached[400000+1d000]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> addr2line shows:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> $ addr2line -e memcached 412b9d
>>>>>>>>>>>>>>>>>> /mnt/builds/slave/workspace/TL-SYS-memcached-slab_rebal_next/build/memcached-1.4.24-slab-rebal-next/assoc.c:119
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thursday, October 1, 2015 at 1:41:44 AM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>>>>>> Ok, thanks!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'll noodle this a bit... unfortunately a backtrace might be more helpful. will ask you to attempt to get one if I don't figure anything out in time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (allow it to core dump, or attach a GDB session and set an ignore handler for sigpipe/int/etc and run "continue")
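>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Roughly, as a sketch (the pid lookup and signal list will vary):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> $ ulimit -c unlimited          # core dump route: set before starting memcached
>>>>>>>>>>>>>>>>>>> $ gdb -p $(pidof memcached)    # or attach to the live process
>>>>>>>>>>>>>>>>>>> (gdb) handle SIGPIPE nostop noprint pass
>>>>>>>>>>>>>>>>>>> (gdb) handle SIGINT nostop noprint pass
>>>>>>>>>>>>>>>>>>> (gdb) continue
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> then after the crash, "bt" and "thread apply all bt".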
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> what were your full startup args, though?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The commit was the latest in slab_rebal_next at the time:
>>>>>>>>>>>>>>>>>>>> https://github.com/dormando/memcached/commit/bdd688b4f20120ad844c8a4803e08c6e03cb061a
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> addr2line gave me this output:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> $ addr2line -e memcached 0x40e007
>>>>>>>>>>>>>>>>>>>> /mnt/builds/slave/workspace/TL-SYS-memcached-slab_rebal_next/build/memcached-1.4.24-slab-rebal-next/slabs.c:264
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As well, this was running with production writes, but not reads. Even if we had reads on with the few servers crashing, we're OK architecturally. That's why I can get it out there without worrying too much. For now, I'm going to turn it off; I had a metrics issue anyway that needs to get fixed. Tomorrow I'm planning to test again with more metrics, but I can get any new code in pretty quickly.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thursday, October 1, 2015 at 1:01:36 AM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>>>>>>>> How many servers were you running it on? I hope it wasn't more than a handful. I'd recommend starting with one :P
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> can you do an addr2line? what were your startup args, and what was the commit sha1 for the branch you pulled?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> sorry about that :/
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, 1 Oct 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> A few different servers (5 / 205) experienced a segfault all within an hour or so. Unfortunately at this point I'm a bit out of my depth. I have the dmesg output, which is identical for all 5 boxes:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> [46545.316351] memcached[2789]: segfault at 0 ip 000000000040e007 sp 00007f362ceedeb0 error 4 in memcached[400000+1d000]
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I can possibly supply the binary file if needed, though we didn't do anything besides the standard setup and compile.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tuesday, September 29, 2015 at 10:27:59 PM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>>>>>>>>>> If you look at the new branch there's a commit explaining the new stats.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> You can watch slab_reassign_evictions vs slab_reassign_saves. you can also test automove=1 vs automove=2 (please also turn on the lru_maintainer and lru_crawler).
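>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (Something along these lines per host; the host/port are placeholders:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> $ printf 'stats\r\nquit\r\n' | nc 127.0.0.1 11211 | egrep 'slab_reassign_(evictions|saves)'
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> if evictions dwarf saves, page moves are mostly throwing away live items rather than rescuing them.)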
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The initial branch you were running didn't add any new stats. It just restored an old feature.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, 29 Sep 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> An unrelated prod problem meant I had to stop after about an hour. I'm turning it on again tomorrow morning. Are there any new metrics I should be looking at? Anything new in the stats output? I'm about to take a look at the diffs as well.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday, September 29, 2015 at 12:37:45 PM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> excellent. if automove=2 is too aggressive you'll see that come in as a hit ratio reduction.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> the new branch works with automove=2 as well, but it will attempt to rescue valid items in the old slab if possible. I'll still be working on it for another few hours today though. I'll mail again when I'm done.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 29 Sep 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I have the first commit (slab_automove=2) running in prod right now. Later today will be a full load production test of the latest code. I'll just let it run for a few days unless I spot any problems. We have good metrics for latency et al. from the client side, though network time normally dwarfs memcached time.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday, September 29, 2015 at 3:10:03 AM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> That's unfortunate.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I've done some more work on the branch: https://github.com/memcached/memcached/pull/112
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> It's not completely likely you would see enough of an improvement from the new default mode. However, if your item sizes change gradually, items are reclaimed during expiration, or get overwritten (and thus freed in the old class), it should work just fine. I have another patch coming which should help, though.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Open to feedback from any interested party.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 25 Sep 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have it running internally, and it runs fine under normal load. It's difficult to put it into the line of fire for a production workload because of social reasons... As well, it's a degenerate case that we normally don't run into (and actively try to avoid). I'm going to run some heavier load tests on it today.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, September 9, 2015 at 10:23:32 AM UTC-7, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm working on getting a test going internally. I'll let you know how it goes.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Scott Mansfield
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 7, 2015 at 2:33 PM, dormando wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yo,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/dormando/memcached/commits/slab_rebal_next - would you mind playing around with the branch here? You can see the start options in the test.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This is a dead simple modification (a restoration of a feature that was already there...). The test very aggressively writes and is able to shunt memory around appropriately.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The work I'm exploring right now will allow saving items from the page being rebalanced from, and increasing the aggression of page moving without being so brain damaged about it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> But while I'm poking around with that, I'd be interested in knowing if this simple branch is an improvement, and if so how much.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll push more code to the branch, but the changes should be gated behind a feature flag.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 18 Aug 2015, 'Scott Mansfield' via memcached wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> No worries man, you're doing us a favor. Let me know if there's anything you need from us, and I promise I'll be quicker this time :)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Aug 18, 2015 12:01 AM, "dormando" <dorm...@rydia.net> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm still really interested in working on this. I'll be taking a careful look soon, I hope.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, 3 Aug 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've tweaked the program slightly, so I'm adding a new version. It prints more stats as it goes and runs a bit faster.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, August 3, 2015 at 1:20:37 AM UTC-7, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Total brain fart on my part. Apparently I had memcached 1.4.13 on my path (who knows how...). Using the actual one that I've built works. Sorry for the confusion... can't believe I didn't realize that before. I'm testing against the compiled one now to see how it behaves.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, August 3, 2015 at 1:15:06 AM UTC-7, Dormando wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You sure that's 1.4.24? None of those fail for me :(
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, 3 Aug 2015, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The command line I've used that will start is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> memcached -m 64 -o slab_reassign,slab_automove
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the ones that fail are:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> memcached -o lru_crawler
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm sure I've missed something during compile, though I just used ./configure and make.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday, August 3, 2015 at 12:22:33 AM UTC-7, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've attached a pretty simple program to connect, fill a slab with data, and then fill another slab slowly with data of a different size. I've been trying to get memcached to run with the lru_crawler and lru_maintainer flags, but I get 'Illegal suboption "(null)"' every time I try to start with either in any configuration.
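>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (The fill pattern is roughly the sketch below; key names and sizes here are made up:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $ for i in $(seq 1 100000); do printf 'set small%d 0 0 100\r\n%0100d\r\n' "$i" 0; done | nc localhost 11211
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> then the same loop again, slowly, with a much larger value size so a second slab class has to grow.)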
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I haven't seen it start to move slabs automatically with a freshly installed 1.4.24.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday, July 21, 2015 at 4:55:17 PM UTC-7, Scott Mansfield wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I realize I've not given you the tests to reproduce the behavior. I should be able to soon. Sorry about the delay here.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In the meantime, I wanted to bring up a possible secondary use of the same logic that moves items on slab rebalancing. I think the system might benefit from using the same logic to crawl the pages in a slab and compact the data in the background. In the case where we have memory that is assigned to the slab but not being used because of replaced or TTL'd out data, returning that memory to a pool of free memory would let a slab grow with it first, instead of waiting for an event where memory is needed at that instant.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It's a change in approach, from reactive to proactive. What do you think?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ...