Re: Check for orphaned items in lru crawler thread

2015-10-07 Thread Dormando
any luck? > On Oct 6, 2015, at 12:23 AM, Dormando wrote: > > ah. I pushed two more changes earlier. should fix mem_requested. just > cosmetic stuff though > >> On Oct 6, 2015, at 12:13 AM, Scott Mansfield wrote: >> >> Oops, looks like the latest

Re: Check for orphaned items in lru crawler thread

2015-10-06 Thread Scott Mansfield
Oops, looks like the latest code didn't get into production today. I'm building it again, same plan as before. On Monday, October 5, 2015 at 4:38:00 PM UTC-7, Dormando wrote: > > Looking forward to the results. Thanks for getting on this so quickly. > > I think there's still a bug in tracking

Re: Check for orphaned items in lru crawler thread

2015-10-06 Thread Dormando
ah. I pushed two more changes earlier. should fix mem_requested. just cosmetic stuff though > On Oct 6, 2015, at 12:13 AM, Scott Mansfield wrote: > > Oops, looks like the latest code didn't get into production today. I'm > building it again, same plan as before. > >>

Re: Check for orphaned items in lru crawler thread

2015-10-05 Thread Scott Mansfield
I just put the newest code into production. I'm going to monitor it for a bit to see how it behaves. As long as there are no obvious issues, I'll enable reads in a few hours, which are an order of magnitude more traffic. I'll let you know what I find. On Monday, October 5, 2015 at 1:29:03 AM

Re: Check for orphaned items in lru crawler thread

2015-10-05 Thread dormando
It took a day of running torture tests which took 30-90 minutes to fail, but along with a bunch of house chores I believe I've found the problem: https://github.com/dormando/memcached/tree/slab_rebal_next - has a new commit, specifically this:

Re: Check for orphaned items in lru crawler thread

2015-10-05 Thread dormando
Looking forward to the results. Thanks for getting on this so quickly. I think there's still a bug in tracking requested memory, and I want to move the stats counters to a rollup at the end of a page move. Otherwise I think this branch is complete pending any further stability issues or feedback.

Re: Check for orphaned items in lru crawler thread

2015-10-02 Thread dormando
I've seen items.c:1183 reported elsewhere in 1.4.24... so probably the bug was introduced when I rewrote the page mover for that. I didn't mean for you to send me a core file; I meant that if you dump the core you can load it in gdb and get the backtrace (bt + thread apply all bt). Don't have a handler for

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
The commit was the latest in slab_rebal_next at the time: https://github.com/dormando/memcached/commit/bdd688b4f20120ad844c8a4803e08c6e03cb061a addr2line gave me this output: $ addr2line -e memcached 0x40e007

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
A few different servers (5 / 205) experienced a segfault all within an hour or so. Unfortunately at this point I'm a bit out of my depth. I have the dmesg output, which is identical for all 5 boxes: [46545.316351] memcached[2789]: segfault at 0 ip 0040e007 sp 7f362ceedeb0 error 4

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
How many servers were you running it on? I hope it wasn't more than a handful. I'd recommend starting with one :P can you do an addr2line? what were your startup args, and what was the commit sha1 for the branch you pulled? sorry about that :/ On Thu, 1 Oct 2015, Scott Mansfield wrote: > A few

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Just before I sit in and try to narrow this down: have you run any host on 1.4.24 mainline with those same start options? just in case the crash is older On Thu, 1 Oct 2015, Scott Mansfield wrote: > Another message for you: > [78098.528606] traps: memcached[2757] general protection ip:412b9d >

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
The same cluster has > 400 servers happily running 1.4.24. It's been our standard deployment for a while now, and we haven't seen any crashes. The servers in the same cluster running 1.4.24 (with the same write load the new build was taking) have been up for 29 days. The start options do not

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
Yes. Exact args: -p 11211 -u -l 0.0.0.0 -c 10 -o slab_reassign -o lru_maintainer,lru_crawler,hash_algorithm=murmur3 -I 4m -m 56253 On Thursday, October 1, 2015 at 12:41:06 PM UTC-7, Dormando wrote: > > Were lru_maintainer/lru_crawler/etc enabled though? even if slab mover is > off, those

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
perfect, thanks! I have $dayjob as well but will look into this as soon as I can. my torture test machines are in a box but I'll try to borrow one On Thu, 1 Oct 2015, Scott Mansfield wrote: > Yes. Exact args: > -p 11211 -u -l 0.0.0.0 -c 10 -o slab_reassign -o >

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Were lru_maintainer/lru_crawler/etc enabled though? even if slab mover is off, those two were the big changes in .24 On Thu, 1 Oct 2015, Scott Mansfield wrote: > The same cluster has > 400 servers happily running 1.4.24. It's been our > standard deployment for a while now, and we haven't seen

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Any chance you could describe (perhaps privately?) in very broad strokes what the write load looks like? (they're getting only writes, too?). otherwise I'll have to devise arbitrary torture tests. I'm sure the bug's in there but it's not obvious yet On Thu, 1 Oct 2015, dormando wrote: > perfect,

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
ok... slab class 12 claims to have 2 in "total_pages", yet 14g in mem_requested. is this stat wrong? On Thu, 1 Oct 2015, Scott Mansfield wrote: > The ones that crashed (new code cluster) were set to only be written to from > the client applications. The data is an index key and a series of data

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
got it. that might be a decent hint actually... I had added a bugfix to the branch to not miscount the mem_requested counter, but it's not working or I missed a spot. On Thu, 1 Oct 2015, Scott Mansfield wrote: > The number now, after maybe 90 minutes of writes, is 1,446. I think after >

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
Sorry for the data dumps here, but I want to give you everything I have. I found 3 more addresses that showed up in the dmesg logs: $ for addr in 40e013 40eff4 40f7c4; do addr2line -e memcached $addr; done .../build/memcached-1.4.24-slab-rebal-next/slabs.c:265 (discriminator 1)

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
Oops, forgot the startup args: -p 11211 -u -l 0.0.0.0 -c 10 -o slab_reassign,slab_automove,lru_maintainer,lru_crawler,hash_algorithm=murmur3 -I 2m -m 56253 On Thursday, October 1, 2015 at 1:22:12 AM UTC-7, Scott Mansfield wrote: > > The commit was the latest in slab_rebal_next at the

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Ok, thanks! I'll noodle this a bit... unfortunately a backtrace might be more helpful. will ask you to attempt to get one if I don't figure anything out in time. (allow it to core dump or attach a GDB session and set an ignore handler for sigpipe/int/etc and run "continue") what were your full

Re: Check for orphaned items in lru crawler thread

2015-09-29 Thread dormando
That's unfortunate. I've done some more work on the branch: https://github.com/memcached/memcached/pull/112 It's not completely likely you would see enough of an improvement from the new default mode. However if your item sizes change gradually, items are reclaimed during expiration, or get

Re: Check for orphaned items in lru crawler thread

2015-09-29 Thread Scott Mansfield
An unrelated prod problem meant I had to stop after about an hour. I'm turning it on again tomorrow morning. Are there any new metrics I should be looking at? Anything new in the stats output? I'm about to take a look at the diffs as well. On Tuesday, September 29, 2015 at 12:37:45 PM UTC-7,

Re: Check for orphaned items in lru crawler thread

2015-09-29 Thread dormando
If you look at the new branch there's a commit explaining the new stats. You can watch slab_reassign_evictions vs slab_reassign_saves. you can also test automove=1 vs automove=2 (please also turn on the lru_maintainer and lru_crawler). The initial branch you were running didn't add any new

Re: Check for orphaned items in lru crawler thread

2015-09-29 Thread Scott Mansfield
I have the first commit (slab_automove=2) running in prod right now. Later today will be a full load production test of the latest code. I'll just let it run for a few days unless I spot any problems. We have good metrics for latency et al. from the client side, though network normally dwarfs

Re: Check for orphaned items in lru crawler thread

2015-09-29 Thread dormando
excellent. if automove=2 is too aggressive you'll see that show up as a hit ratio reduction. the new branch works with automove=2 as well, but it will attempt to rescue valid items in the old slab if possible. I'll still be working on it for another few hours today though. I'll mail again when

Re: Check for orphaned items in lru crawler thread

2015-09-25 Thread Scott Mansfield
I have it running internally, and it runs fine under normal load. It's difficult to put it into the line of fire for a production workload because of social reasons... As well, it's a degenerate case that we normally don't run into (and actively try to avoid). I'm going to run some heavier load

Re: Check for orphaned items in lru crawler thread

2015-09-09 Thread 'Scott Mansfield' via memcached
I'm working on getting a test going internally. I'll let you know how it goes. On Mon, Sep 7, 2015 at 2:33 PM, dormando

Re: Check for orphaned items in lru crawler thread

2015-09-07 Thread dormando
Yo, https://github.com/dormando/memcached/commits/slab_rebal_next - would you mind playing around with the branch here? You can see the start options in the test. This is a dead simple modification (a restoration of a feature that was already there...). The test very aggressively writes and is

Re: Check for orphaned items in lru crawler thread

2015-08-18 Thread dormando
Hey, I'm still really interested in working on this. I'll be taking a careful look soon I hope. On Mon, 3 Aug 2015, Scott Mansfield wrote: I've tweaked the program slightly, so I'm adding a new version. It prints more stats as it goes and runs a bit faster. On Monday, August 3, 2015 at

Re: Check for orphaned items in lru crawler thread

2015-08-18 Thread 'Scott Mansfield' via memcached
No worries man, you're doing us a favor. Let me know if there's anything you need from us, and I promise I'll be quicker this time :) On Aug 18, 2015 12:01 AM, dormando dorma...@rydia.net wrote: Hey, I'm still really interested in working on this. I'll be taking a careful look soon I hope.

Re: Check for orphaned items in lru crawler thread

2015-08-03 Thread Scott Mansfield
Total brain fart on my part. Apparently I had memcached 1.4.13 on my path (who knows how...) Using the actual one that I've built works. Sorry for the confusion... can't believe I didn't realize that before. I'm testing against the compiled one now to see how it behaves. On Monday, August 3,

Re: Check for orphaned items in lru crawler thread

2015-08-03 Thread Scott Mansfield
The command line I've used that will start is: memcached -m 64 -o slab_reassign,slab_automove the ones that fail are: memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer memcached -o lru_crawler I'm sure I've missed something during compile, though I just used

Re: Check for orphaned items in lru crawler thread

2015-08-03 Thread dormando
What are your startup args? On Mon, 3 Aug 2015, Scott Mansfield wrote: I've attached a pretty simple program to connect, fill a slab with data, and then fill another slab slowly with data of a different size. I've been trying to get memcached to run with the lru_crawler and lru_maintainer

Re: Check for orphaned items in lru crawler thread

2015-08-03 Thread Scott Mansfield
I've tweaked the program slightly, so I'm adding a new version. It prints more stats as it goes and runs a bit faster. On Monday, August 3, 2015 at 1:20:37 AM UTC-7, Scott Mansfield wrote: Total brain fart on my part. Apparently I had memcached 1.4.13 on my path (who knows how...) Using the

Re: Check for orphaned items in lru crawler thread

2015-08-03 Thread Scott Mansfield
I've attached a pretty simple program to connect, fill a slab with data, and then fill another slab slowly with data of a different size. I've been trying to get memcached to run with the lru_crawler and lru_maintainer flags, but I get ' Illegal suboption (null)' every time I try to start

Re: Check for orphaned items in lru crawler thread

2015-08-03 Thread dormando
You sure that's 1.4.24? None of those fail for me :( On Mon, 3 Aug 2015, Scott Mansfield wrote: The command line I've used that will start is: memcached -m 64 -o slab_reassign,slab_automove the ones that fail are: memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer

Re: Check for orphaned items in lru crawler thread

2015-07-21 Thread Scott Mansfield
I realize I've not given you the tests to reproduce the behavior. I should be able to soon. Sorry about the delay here. In the meantime, I wanted to bring up a possible secondary use of the same logic to move items on slab rebalancing. I think the system might benefit from using the same

Re: Check for orphaned items in lru crawler thread

2015-07-13 Thread Scott Mansfield
First, more detail for you: We are running 1.4.24 in production and haven't noticed any bugs as of yet. The new LRUs seem to be working well, though we nearly always run memcached scaled to hold all data without evictions. Those with evictions are behaving well. Those without evictions haven't

Re: Check for orphaned items in lru crawler thread

2015-07-13 Thread dormando
First, more detail for you: We are running 1.4.24 in production and haven't noticed any bugs as of yet. The new LRUs seem to be working well, though we nearly always run memcached scaled to hold all data without evictions. Those with evictions are behaving well. Those without evictions

Re: Check for orphaned items in lru crawler thread

2015-07-11 Thread dormando
Hey, On Fri, 10 Jul 2015, Scott Mansfield wrote: We've seen issues recently where we run a cluster that typically has the majority of items overwritten in the same slab every day and a sudden change in data size evicts a ton of data, affecting downstream systems. To be clear that is our

Check for orphaned items in lru crawler thread

2015-07-10 Thread Scott Mansfield
We've seen issues recently where we run a cluster that typically has the majority of items overwritten in the same slab every day and a sudden change in data size evicts a ton of data, affecting downstream systems. To be clear that is our problem, but I think there's a tweak in memcached that