any luck?

On Oct 6, 2015, at 12:23 AM, Dormando <dorma...@rydia.net> wrote:

ah. I pushed two more changes earlier. should fix mem_requested. just cosmetic stuff though.

On Oct 6, 2015, at 12:13 AM, Scott Mansfield <smansfi...@netflix.com> wrote:

Oops, looks like the latest code didn't get into production today. I'm building it again, same plan as before.

On Monday, October 5, 2015 at 4:38:00 PM UTC-7, Dormando wrote:

Looking forward to the results. Thanks for getting on this so quickly.

I think there's still a bug in tracking requested memory, and I want to move the stats counters to a rollup at the end of a page move. Otherwise I think this branch is complete, pending any further stability issues or feedback.

On Mon, 5 Oct 2015, Scott Mansfield wrote:

I just put the newest code into production. I'm going to monitor it for a bit to see how it behaves. As long as there are no obvious issues I'll enable reads in a few hours, which are an order of magnitude more traffic. I'll let you know what I find.

On Monday, October 5, 2015 at 1:29:03 AM UTC-7, Dormando wrote:

It took a day of running torture tests which took 30-90 minutes to fail, but along with a bunch of house chores I believe I've found the problem:

https://github.com/dormando/memcached/tree/slab_rebal_next has a new commit, specifically this:

https://github.com/dormando/memcached/commit/1c32e5eeff5bd2a8cc9b652a2ed808157e4929bb

It's somewhat relieving that when I brained this super hard back in January, I may have actually gotten the complex set of interactions correct; I simply failed to keep typing when converting the comments to code.

So this has been broken since 1.4.24, but apparently hardly anyone uses the page mover. Once fixed, it survived a 5-hour torture test (that I wrote in 2011!) where it previously died after 30-90 minutes. So please give this one a try and let me know how it goes.

If it goes well I can merge up some other fixes from the PR list and cut a release, unless someone has feedback for something to change.

thanks!

On Thu, 1 Oct 2015, dormando wrote:

I've seen items.c:1183 reported elsewhere in 1.4.24... so the bug was probably introduced when I rewrote the page mover for that release.

I didn't mean for you to send me a core file: I meant that if you dump the core, you can load it in gdb and get the backtrace (bt, plus thread apply all bt).

Don't have a handler for convenient attaching :(

didn't get a chance to poke at this today... I'll need another day to try it out.
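(For reference, pulling both of those backtraces out of a core non-interactively looks something like this; the binary and core paths are illustrative:)

$ gdb -batch -ex 'bt' -ex 'thread apply all bt' ./memcached /path/to/core > backtrace.txt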
On Thu, 1 Oct 2015, Scott Mansfield wrote:

Sorry for the data dumps here, but I want to give you everything I have. I found 3 more addresses that showed up in the dmesg logs:

$ for addr in 40e013 40eff4 40f7c4; do addr2line -e memcached $addr; done

.../build/memcached-1.4.24-slab-rebal-next/slabs.c:265 (discriminator 1)
.../build/memcached-1.4.24-slab-rebal-next/items.c:312 (discriminator 1)
.../build/memcached-1.4.24-slab-rebal-next/items.c:1183

I still haven't tried to attach a debugger, since the error is infrequent enough that it would be hard to catch. Is there a handler I could add in to dump the stack trace when it segfaults? I'd get a core dump, but the cores would be HUGE and contain confidential information.

Out of 205 servers, 35 had dmesg logs after a memcached crash, and only one crashed twice, both times on the original segfault. Below is the full unified set of dmesg logs, from which you can get a sense of frequency:

[47992.109269] memcached[2798]: segfault at 0 ip 000000000040e007 sp 00007f4d20d25eb0 error 4 in memcached[400000+1d000]
[48960.851278] memcached[2805]: segfault at 0 ip 000000000040e007 sp 00007f3c30d15eb0 error 4 in memcached[400000+1d000]
[46421.604609] memcached[2784]: segfault at 0 ip 000000000040e007 sp 00007fdb94612eb0 error 4 in memcached[400000+1d000]
[48429.671534] traps: memcached[2768] general protection ip:40e013 sp:7f1c32676be0 error:0 in memcached[400000+1d000]
[71838.979269] memcached[2792]: segfault at 0 ip 000000000040e007 sp 00007f0162feeeb0 error 4 in memcached[400000+1d000]
[66763.091475] memcached[2804]: segfault at 0 ip 000000000040e007 sp 00007f8240170eb0 error 4 in memcached[400000+1d000]
[102544.376092] traps: memcached[2792] general protection ip:40eff4 sp:7fa58095be18 error:0 in memcached[400000+1d000]
[49932.757825] memcached[2777]: segfault at 0 ip 000000000040e007 sp 00007f1ff2131eb0 error 4 in memcached[400000+1d000]
[50400.415878] memcached[2794]: segfault at 0 ip 000000000040e007 sp 00007f11a26daeb0 error 4 in memcached[400000+1d000]
[48986.340345] memcached[2786]: segfault at 0 ip 000000000040e007 sp 00007f9235279eb0 error 4 in memcached[400000+1d000]
[44742.175894] memcached[2796]: segfault at 0 ip 000000000040e007 sp 00007eff3a0cceb0 error 4 in memcached[400000+1d000]
[49030.431879] memcached[2776]: segfault at 0 ip 000000000040e007 sp 00007fdef27cfbe0 error 4 in memcached[400000+1d000]
[50211.611439] traps: memcached[2782] general protection ip:40e013 sp:7f9ee1723be0 error:0 in memcached[400000+1d000]
[62534.892817] memcached[2783]: segfault at 0 ip 000000000040e007 sp 00007f37f2d4beb0 error 4 in memcached[400000+1d000]
[78697.201195] memcached[2801]: segfault at 0 ip 000000000040e007 sp 00007f696ef1feb0 error 4 in memcached[400000+1d000]
[48922.246712] memcached[2804]: segfault at 0 ip 000000000040e007 sp 00007f1ebb338eb0 error 4 in memcached[400000+1d000]
[52170.371014] memcached[2809]: segfault at 0 ip 000000000040e007 sp 00007f5e62fcbeb0 error 4 in memcached[400000+1d000]
[69531.775868] memcached[2785]: segfault at 0 ip 000000000040e007 sp 00007ff50ac2eeb0 error 4 in memcached[400000+1d000]
[48926.661559] memcached[2799]: segfault at 0 ip 000000000040e007 sp 00007f71e0ac6be0 error 4 in memcached[400000+1d000]
[49491.126885] memcached[2745]: segfault at 0 ip 000000000040e007 sp 00007f5737c4beb0 error 4 in memcached[400000+1d000]
[104247.724294] traps: memcached[2793] general protection ip:40f7c4 sp:7f3af8c27eb0 error:0 in memcached[400000+1d000]
[78098.528606] traps: memcached[2757] general protection ip:412b9d sp:7fc0700dbdd0 error:0 in memcached[400000+1d000]
[71958.385432] memcached[2809]: segfault at 0 ip 000000000040e007 sp 00007f8b68cd0eb0 error 4 in memcached[400000+1d000]
[48934.182852] memcached[2787]: segfault at 0 ip 000000000040e007 sp 00007f0aef774eb0 error 4 in memcached[400000+1d000]
[104220.754195] traps: memcached[2802] general protection ip:40f7c4 sp:7ffa85a2deb0 error:0 in memcached[400000+1d000]
[45807.670246] memcached[2755]: segfault at 0 ip 000000000040e007 sp 00007fd74a1d0eb0 error 4 in memcached[400000+1d000]
[73640.102621] memcached[2802]: segfault at 0 ip 000000000040e007 sp 00007f7bb30bfeb0 error 4 in memcached[400000+1d000]
[67690.640196] memcached[2787]: segfault at 0 ip 000000000040e007 sp 00007f299580feb0 error 4 in memcached[400000+1d000]
[57729.895442] memcached[2786]: segfault at 0 ip 000000000040e007 sp 00007f204073deb0 error 4 in memcached[400000+1d000]
[48009.284226] memcached[2801]: segfault at 0 ip 000000000040e007 sp 00007f7b30876eb0 error 4 in memcached[400000+1d000]
[48198.211826] memcached[2811]: segfault at 0 ip 000000000040e007 sp 00007fd496d79eb0 error 4 in memcached[400000+1d000]
[84057.439927] traps: memcached[2804] general protection ip:40f7c4 sp:7fbe75fffeb0 error:0 in memcached[400000+1d000]
[50215.489124] memcached[2784]: segfault at 0 ip 000000000040e007 sp 00007f3234b73eb0 error 4 in memcached[400000+1d000]
[46545.316351] memcached[2789]: segfault at 0 ip 000000000040e007 sp 00007f362ceedeb0 error 4 in memcached[400000+1d000]
[102076.523474] memcached[29833]: segfault at 0 ip 000000000040e007 sp 00007f3c89b9ebe0 error 4 in memcached[400000+1d000]
[55537.568254] memcached[2780]: segfault at 0 ip 000000000040e007 sp 00007fc1f6005eb0 error 4 in memcached[400000+1d000]

On Thursday, October 1, 2015 at 5:40:35 PM UTC-7, Dormando wrote:

got it. that might be a decent hint actually... I had added a bugfix to the branch to not miscount the mem_requested counter, but it's not working or I missed a spot.

On Thu, 1 Oct 2015, Scott Mansfield wrote:

The number now, after maybe 90 minutes of writes, is 1,446. I think a lot of the data TTL'd out after I disabled it. I have to disable it for now, again (for unrelated reasons, again). The page that I screenshotted gives real-time data, so the numbers were from right then. Last night it should have shown better numbers in terms of "total_pages", but I didn't get a screenshot. That number is directly from the stats slabs output.

On Thursday, October 1, 2015 at 4:21:42 PM UTC-7, Dormando wrote:

ok... slab class 12 claims to have 2 in "total_pages", yet 14g in mem_requested. is this stat wrong?
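(For reference, a quick cross-check of those two stats, assuming the 4 MB slab pages implied by the -I 4m start option quoted further down the thread; host and port are illustrative:)

$ printf 'stats slabs\r\nquit\r\n' | nc 127.0.0.1 11211 \
    | awk -F'[: ]' '{ sub(/\r$/, "") }
        $3 == "total_pages"   { pages[$2] = $4 }
        $3 == "mem_requested" { req[$2] = $4 }
        END { for (c in pages)
                if (req[c] > pages[c] * 4 * 1024 * 1024)
                  printf "class %s: mem_requested %d exceeds %d pages\n", c, req[c], pages[c] }'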
On Thu, 1 Oct 2015, Scott Mansfield wrote:

The ones that crashed (the new-code cluster) were set to only be written to from the client applications. The data is an index key and a series of data keys that are all written one after another. Each key might be hashed to a different server, though, so not all of them are written to the same server. I can give you a snapshot of one of the clusters that didn't crash (attached file). I can give more detail offline if you need it.

On Thursday, October 1, 2015 at 2:32:53 PM UTC-7, Dormando wrote:

Any chance you could describe (perhaps privately?) in very broad strokes what the write load looks like? (they're getting only writes, too?) Otherwise I'll have to devise arbitrary torture tests. I'm sure the bug's in there, but it's not obvious yet.

On Thu, 1 Oct 2015, dormando wrote:

perfect, thanks! I have $dayjob as well but will look into this as soon as I can. my torture test machines are in a box but I'll try to borrow one.

On Thu, 1 Oct 2015, Scott Mansfield wrote:

Yes. Exact args:

-p 11211 -u <omitted> -l 0.0.0.0 -c 100000 -o slab_reassign -o lru_maintainer,lru_crawler,hash_algorithm=murmur3 -I 4m -m 56253

On Thursday, October 1, 2015 at 12:41:06 PM UTC-7, Dormando wrote:

Were lru_maintainer/lru_crawler/etc enabled, though? Even if the slab mover is off, those two were the big changes in .24.

On Thu, 1 Oct 2015, Scott Mansfield wrote:

The same cluster has > 400 servers happily running 1.4.24. It's been our standard deployment for a while now, and we haven't seen any crashes. The servers in the same cluster running 1.4.24 (with the same write load the new build was taking) have been up for 29 days. The start options do not contain the slab_automove option because it wasn't effective for us before. The memory given is possibly slightly different per server, as we calculate on startup how much we give. It's in the same ballpark, though (~56 gigs).

On Thursday, October 1, 2015 at 12:11:35 PM UTC-7, Dormando wrote:

Just before I sit in and try to narrow this down: have you run any host on 1.4.24 mainline with those same start options? Just in case the crash is older.
On Thu, 1 Oct 2015, Scott Mansfield wrote:

Another message for you:

[78098.528606] traps: memcached[2757] general protection ip:412b9d sp:7fc0700dbdd0 error:0 in memcached[400000+1d000]

addr2line shows:

$ addr2line -e memcached 412b9d

/mnt/builds/slave/workspace/TL-SYS-memcached-slab_rebal_next/build/memcached-1.4.24-slab-rebal-next/assoc.c:119

On Thursday, October 1, 2015 at 1:41:44 AM UTC-7, Dormando wrote:

Ok, thanks!

I'll noodle this a bit... unfortunately a backtrace might be more helpful. I will ask you to attempt to get one if I don't figure anything out in time.

(allow it to core dump, or attach a GDB session, set an ignore handler for sigpipe/int/etc, and run "continue")

what were your full startup args, though?

On Thu, 1 Oct 2015, Scott Mansfield wrote:

The commit was the latest in slab_rebal_next at the time:

https://github.com/dormando/memcached/commit/bdd688b4f20120ad844c8a4803e08c6e03cb061a

addr2line gave me this output:

$ addr2line -e memcached 0x40e007

/mnt/builds/slave/workspace/TL-SYS-memcached-slab_rebal_next/build/memcached-1.4.24-slab-rebal-next/slabs.c:264

As well, this was running with production writes, but not reads. Even if we had reads on with the few servers crashing, we're ok architecturally. That's why I can get it out there without worrying too much. For now, I'm going to turn it off; I had a metrics issue anyway that needs to get fixed. Tomorrow I'm planning to test again with more metrics, but I can get any new code in pretty quickly.

On Thursday, October 1, 2015 at 1:01:36 AM UTC-7, Dormando wrote:

How many servers were you running it on? I hope it wasn't more than a handful. I'd recommend starting with one :P

can you do an addr2line? what were your startup args, and what was the commit sha1 for the branch you pulled?

sorry about that :/
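(The GDB-attach suggestion above, spelled out as a concrete invocation; the pidof lookup assumes a single memcached per host:)

$ gdb -p $(pidof memcached) \
      -ex 'handle SIGPIPE nostop noprint pass' \
      -ex 'handle SIGINT nostop noprint pass' \
      -ex 'continue'

When the segfault fires, gdb stops on it, and "bt" plus "thread apply all bt" capture the backtrace without a huge core ever touching disk.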
On Thu, 1 Oct 2015, Scott Mansfield wrote:

A few different servers (5 / 205) experienced a segfault, all within an hour or so. Unfortunately at this point I'm a bit out of my depth. I have the dmesg output, which is identical for all 5 boxes:

[46545.316351] memcached[2789]: segfault at 0 ip 000000000040e007 sp 00007f362ceedeb0 error 4 in memcached[400000+1d000]

I can possibly supply the binary file if needed, though we didn't do anything besides the standard setup and compile.

On Tuesday, September 29, 2015 at 10:27:59 PM UTC-7, Dormando wrote:

If you look at the new branch there's a commit explaining the new stats.

You can watch slab_reassign_evictions vs slab_reassign_saves. You can also test automove=1 vs automove=2 (please also turn on the lru_maintainer and lru_crawler).

The initial branch you were running didn't add any new stats. It just restored an old feature.

On Tue, 29 Sep 2015, Scott Mansfield wrote:

An unrelated prod problem meant I had to stop after about an hour. I'm turning it on again tomorrow morning. Are there any new metrics I should be looking at? Anything new in the stats output? I'm about to take a look at the diffs as well.

On Tuesday, September 29, 2015 at 12:37:45 PM UTC-7, Dormando wrote:

excellent. if automove=2 is too aggressive, you'll see that show up as a hit-ratio reduction.

the new branch works with automove=2 as well, but it will attempt to rescue valid items in the old slab if possible. I'll still be working on it for another few hours today, though. I'll mail again when I'm done.

On Tue, 29 Sep 2015, Scott Mansfield wrote:

I have the first commit (slab_automove=2) running in prod right now. Later today will be a full-load production test of the latest code. I'll just let it run for a few days unless I spot any problems. We have good metrics for latency et al. from the client side, though network time normally dwarfs memcached time.
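(A simple way to keep an eye on those counters, using the stat names as given above; host, port, and interval are illustrative:)

$ watch -n 60 "printf 'stats\r\nquit\r\n' | nc 127.0.0.1 11211 | grep -E 'slab_reassign_(evictions|saves)'"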
On Tuesday, September 29, 2015 at 3:10:03 AM UTC-7, Dormando wrote:

That's unfortunate.

I've done some more work on the branch:

https://github.com/memcached/memcached/pull/112

It's not completely likely you would see enough of an improvement from the new default mode. However, if your item sizes change gradually, items are reclaimed during expiration, or items get overwritten (and thus freed in the old class), it should work just fine. I have another patch coming which should help, though.

Open to feedback from any interested party.

On Fri, 25 Sep 2015, Scott Mansfield wrote:

I have it running internally, and it runs fine under normal load. It's difficult to put it into the line of fire for a production workload because of social reasons... As well, it's a degenerate case that we normally don't run into (and actively try to avoid). I'm going to run some heavier load tests on it today.

On Wednesday, September 9, 2015 at 10:23:32 AM UTC-7, Scott Mansfield wrote:

I'm working on getting a test going internally. I'll let you know how it goes.

Scott Mansfield

On Mon, Sep 7, 2015 at 2:33 PM, dormando wrote:

Yo,

https://github.com/dormando/memcached/commits/slab_rebal_next - would you mind playing around with the branch here? You can see the start options in the test.

This is a dead simple modification (a restoration of a feature that was already there...). The test writes very aggressively and is able to shunt memory around appropriately.

The work I'm exploring right now will allow saving items from pages being rebalanced away from, and increasing the aggression of page moving without being so brain damaged about it.

But while I'm poking around with that, I'd be interested in knowing if this simple branch is an improvement, and if so how much.

I'll push more code to the branch, but the changes should be gated behind a feature flag.
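(Illustrative only — the authoritative start options are in the branch's test, as noted above, but the general shape is:)

$ memcached -m 64 -o slab_reassign,slab_automove=2,lru_crawler,lru_maintainer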
On Tue, 18 Aug 2015, 'Scott Mansfield' via memcached wrote:

No worries man, you're doing us a favor. Let me know if there's anything you need from us, and I promise I'll be quicker this time :)

On Aug 18, 2015 12:01 AM, "dormando" <dorm...@rydia.net> wrote:

Hey,

I'm still really interested in working on this. I'll be taking a careful look soon, I hope.

On Mon, 3 Aug 2015, Scott Mansfield wrote:

I've tweaked the program slightly, so I'm adding a new version. It prints more stats as it goes and runs a bit faster.

On Monday, August 3, 2015 at 1:20:37 AM UTC-7, Scott Mansfield wrote:

Total brain fart on my part. Apparently I had memcached 1.4.13 on my path (who knows how...). Using the actual one that I've built works. Sorry for the confusion... can't believe I didn't realize that before. I'm testing against the compiled one now to see how it behaves.

On Monday, August 3, 2015 at 1:15:06 AM UTC-7, Dormando wrote:

You sure that's 1.4.24? None of those fail for me :(

On Mon, 3 Aug 2015, Scott Mansfield wrote:

The command line I've used that will start is:

memcached -m 64 -o slab_reassign,slab_automove

The ones that fail are:

memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer

memcached -o lru_crawler

I'm sure I've missed something during compile, though I just used ./configure and make.
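(The stale-binary-on-PATH mixup described above is easy to catch up front; the second command prints the version banner:)

$ which memcached
$ memcached -h | head -n 1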
On Monday, August 3, 2015 at 12:22:33 AM UTC-7, Scott Mansfield wrote:

I've attached a pretty simple program to connect, fill a slab with data, and then fill another slab slowly with data of a different size. I've been trying to get memcached to run with the lru_crawler and lru_maintainer flags, but I get 'Illegal suboption "(null)"' every time I try to start with either in any configuration.

I haven't seen it start to move slabs automatically with a freshly installed 1.4.24.

On Tuesday, July 21, 2015 at 4:55:17 PM UTC-7, Scott Mansfield wrote:

I realize I've not given you the tests to reproduce the behavior. I should be able to soon. Sorry about the delay here.

In the meantime, I wanted to bring up a possible secondary use of the same logic that moves items on slab rebalancing. I think the system might benefit from using that logic to crawl the pages in a slab and compact the data in the background. Where a slab holds memory that is assigned but no longer used, because of replaced or TTL'd-out data, returning that memory to a pool of free memory would let a slab grow with that memory first, instead of waiting for an event where memory is needed at that instant.

It's a change in approach, from reactive to proactive. What do you think?
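(A rough shell equivalent of the two-slab fill program described above — key names, value sizes, counts, and host are illustrative:)

# fill one slab class with 100-byte values
for i in $(seq 1 100000); do
    printf 'set small:%d 0 0 100\r\n%0100d\r\n' "$i" 0
done | nc 127.0.0.1 11211 > /dev/null

# then slowly trickle 4000-byte values into a different class
for i in $(seq 1 10000); do
    printf 'set large:%d 0 0 4000\r\n%04000d\r\n' "$i" 0
    sleep 0.01
done | nc 127.0.0.1 11211 > /dev/null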
--

---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.