Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
I've seen items.c:1183 reported elsewhere in 1.4.24... so probably the bug was introduced when I rewrote the page mover for that. I didn't mean for you to send me a core file; I meant that if you dump the core you can load it in gdb and get the backtrace (bt + thread apply all bt). Don't have a handler for conv

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
Sorry for the data dumps here, but I want to give you everything I have. I found 3 more addresses that showed up in the dmesg logs: $ for addr in 40e013 40eff4 40f7c4; do addr2line -e memcached $addr; done .../build/memcached-1.4.24-slab-rebal-next/slabs.c:265 (discriminator 1) .../build/memcac

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
got it. that might be a decent hint actually... I had added a bugfix to the branch to not miscount the mem_requested counter, but it's not working or I missed a spot. On Thu, 1 Oct 2015, Scott Mansfield wrote: > The number now, after maybe 90 minutes of writes, is 1,446. I think after > disabli

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
ok... slab class 12 claims to have 2 in "total_pages", yet 14g in mem_requested. is this stat wrong? On Thu, 1 Oct 2015, Scott Mansfield wrote: > The ones that crashed (new code cluster) were set to only be written to from > the client applications. The data is an index key and a series of data

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Any chance you could describe (perhaps privately?) in very broad strokes what the write load looks like? (they're getting only writes, too?). otherwise I'll have to devise arbitrary torture tests. I'm sure the bug's in there but it's not obvious yet On Thu, 1 Oct 2015, dormando wrote: > perfect,

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
perfect, thanks! I have $dayjob as well but will look into this as soon as I can. my torture test machines are in a box but I'll try to borrow one On Thu, 1 Oct 2015, Scott Mansfield wrote: > Yes. Exact args: > -p 11211 -u -l 0.0.0.0 -c 10 -o slab_reassign -o > lru_maintainer,lru_crawler,ha

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Were lru_maintainer/lru_crawler/etc enabled though? even if slab mover is off, those two were the big changes in .24 On Thu, 1 Oct 2015, Scott Mansfield wrote: > The same cluster has > 400 servers happily running 1.4.24. It's been our > standard deployment for a while now, and we haven't seen an

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
Yes. Exact args: -p 11211 -u -l 0.0.0.0 -c 10 -o slab_reassign -o lru_maintainer,lru_crawler,hash_algorithm=murmur3 -I 4m -m 56253 On Thursday, October 1, 2015 at 12:41:06 PM UTC-7, Dormando wrote: > > Were lru_maintainer/lru_crawler/etc enabled though? even if slab mover is > off, those t

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
The same cluster has > 400 servers happily running 1.4.24. It's been our standard deployment for a while now, and we haven't seen any crashes. The servers in the same cluster running 1.4.24 (with the same write load the new build was taking) have been up for 29 days. The start options do not co

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Just before I sit in and try to narrow this down: have you run any host on 1.4.24 mainline with those same start options? just in case the crash is older On Thu, 1 Oct 2015, Scott Mansfield wrote: > Another message for you: > [78098.528606] traps: memcached[2757] general protection ip:412b9d > s

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
Another message for you: [78098.528606] traps: memcached[2757] general protection ip:412b9d sp:7fc0700dbdd0 error:0 in memcached[40+1d000] addr2line shows: $ addr2line -e memcached 412b9d /mnt/builds/slave/workspace/TL-SYS-memcached-slab_rebal_next/build/memcached-1.4.24-slab-rebal-next/a
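
The step used in this message (take the `ip` value from the kernel log line and hand it to `addr2line`) can be sketched as below. This is only an illustration: the helper names `extract_ip` and `addr2line_command` are hypothetical, the regular expression is an assumption that covers just the two log styles quoted in this thread, and the binary name defaults to `memcached` as in the commands above.

```python
import re

# Assumed pattern: matches the instruction pointer in both kernel log styles
# quoted in this thread:
#   "segfault at 0 ip 0040e007 sp ..."   and   "general protection ip:412b9d sp:..."
IP_PATTERN = re.compile(r"\bip[: ]([0-9a-fA-F]+)\b")


def extract_ip(dmesg_line):
    """Return the faulting instruction pointer as an int, or None if not found."""
    match = IP_PATTERN.search(dmesg_line)
    return int(match.group(1), 16) if match else None


def addr2line_command(dmesg_line, binary="memcached"):
    """Build the addr2line command line for the address found in dmesg_line."""
    ip = extract_ip(dmesg_line)
    if ip is None:
        return None
    return f"addr2line -e {binary} {ip:#x}"


if __name__ == "__main__":
    line = "traps: memcached[2757] general protection ip:412b9d sp:7fc0700dbdd0 error:0"
    print(addr2line_command(line))
```

Passing the raw address works here because the addresses in the thread resolve directly against the binary. For a position-independent executable or a fault inside a shared library, the mapping base would have to be subtracted first, which this sketch does not attempt.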

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
Ok, thanks! I'll noodle this a bit... unfortunately a backtrace might be more helpful. will ask you to attempt to get one if I don't figure anything out in time. (allow it to core dump or attach a GDB session and set an ignore handler for sigpipe/int/etc and run "continue") what were your full s

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
Oops, forgot the startup args: -p 11211 -u -l 0.0.0.0 -c 10 -o slab_reassign,slab_automove,lru_maintainer,lru_crawler,hash_algorithm=murmur3 -I 2m -m 56253 On Thursday, October 1, 2015 at 1:22:12 AM UTC-7, Scott Mansfield wrote: > > The commit was the latest in slab_rebal_next at the tim

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
The commit was the latest in slab_rebal_next at the time: https://github.com/dormando/memcached/commit/bdd688b4f20120ad844c8a4803e08c6e03cb061a addr2line gave me this output: $ addr2line -e memcached 0x40e007 /mnt/builds/slave/workspace/TL-SYS-memcached-slab_rebal_next/build/memcached-1.4.24-sl

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread dormando
How many servers were you running it on? I hope it wasn't more than a handful. I'd recommend starting with one :P can you do an addr2line? what were your startup args, and what was the commit sha1 for the branch you pulled? sorry about that :/ On Thu, 1 Oct 2015, Scott Mansfield wrote: > A few

Re: Check for orphaned items in lru crawler thread

2015-10-01 Thread Scott Mansfield
A few different servers (5 / 205) experienced a segfault all within an hour or so. Unfortunately at this point I'm a bit out of my depth. I have the dmesg output, which is identical for all 5 boxes: [46545.316351] memcached[2789]: segfault at 0 ip 0040e007 sp 7f362ceedeb0 error 4 in