any luck?
> On Oct 6, 2015, at 12:23 AM, Dormando wrote:
Oops, looks like the latest code didn't get into production today. I'm
building it again, same plan as before.
On Monday, October 5, 2015 at 4:38:00 PM UTC-7, Dormando wrote:
ah. I pushed two more changes earlier. should fix mem_requested. just cosmetic
stuff though
> On Oct 6, 2015, at 12:13 AM, Scott Mansfield wrote:
I just put the newest code into production. I'm going to monitor it for a
bit to see how it behaves. As long as there's no obvious issues I'll enable
reads in a few hours, which are an order of magnitude more traffic. I'll
let you know what I find.
On Monday, October 5, 2015 at 1:29:03 AM
It took a day of running torture tests, each taking 30-90 minutes to fail,
but along with a bunch of house chores I believe I've found the problem:
https://github.com/dormando/memcached/tree/slab_rebal_next - has a new
commit, specifically this:
Looking forward to the results. Thanks for getting on this so quickly.
I think there's still a bug in tracking requested memory, and I want to
move the stats counters to a rollup at the end of a page move.
Otherwise I think this branch is complete pending any further stability
issues or feedback.
I've seen items.c:1183 reported elsewhere in 1.4.24... so probably the bug
was introduced when I rewrote the page mover for that.
I didn't mean for you to send me a core file: I meant that if you dump the
core you can load it in gdb and get the backtrace (bt, plus thread apply all bt)
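To make the core-dump route concrete, here's a minimal sketch; the binary and core paths are assumptions, and a core file only appears if the limit allows it:

```shell
# Allow core files before (re)starting the memcached under test.
ulimit -c unlimited

# After a crash, open the core against the exact binary that produced it
# and capture backtraces without an interactive session:
gdb -batch -ex 'bt' -ex 'thread apply all bt' ./memcached ./core

# Alternatively, attach to the live process, ignore the noisy signals,
# and let it run until it faults:
#   gdb -p "$(pidof memcached)"
#   (gdb) handle SIGPIPE nostop noprint pass
#   (gdb) continue
```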
Don't have a handler for
The commit was the latest in slab_rebal_next at the time:
https://github.com/dormando/memcached/commit/bdd688b4f20120ad844c8a4803e08c6e03cb061a
addr2line gave me this output:
$ addr2line -e memcached 0x40e007
A few different servers (5 / 205) experienced a segfault all within an hour
or so. Unfortunately at this point I'm a bit out of my depth. I have the
dmesg output, which is identical for all 5 boxes:
[46545.316351] memcached[2789]: segfault at 0 ip 000000000040e007 sp
00007f362ceedeb0 error 4
How many servers were you running it on? I hope it wasn't more than a
handful. I'd recommend starting with one :P
can you do an addr2line? what were your startup args, and what was the
commit sha1 for the branch you pulled?
sorry about that :/
On Thu, 1 Oct 2015, Scott Mansfield wrote:
Just before I sit down and try to narrow this down: have you run any host on
1.4.24 mainline with those same start options? just in case the crash is
older
On Thu, 1 Oct 2015, Scott Mansfield wrote:
> Another message for you:
> [78098.528606] traps: memcached[2757] general protection ip:412b9d
>
The same cluster has > 400 servers happily running 1.4.24. It's been our
standard deployment for a while now, and we haven't seen any crashes. The
servers in the same cluster running 1.4.24 (with the same write load the
new build was taking) have been up for 29 days. The start options do not
Yes. Exact args:
-p 11211 -u -l 0.0.0.0 -c 10 -o slab_reassign -o
lru_maintainer,lru_crawler,hash_algorithm=murmur3 -I 4m -m 56253
On Thursday, October 1, 2015 at 12:41:06 PM UTC-7, Dormando wrote:
perfect, thanks! I have $dayjob as well but will look into this as soon as
I can. my torture test machines are in a box but I'll try to borrow one
On Thu, 1 Oct 2015, Scott Mansfield wrote:
Were lru_maintainer/lru_crawler/etc enabled though? even if slab mover is
off, those two were the big changes in .24
On Thu, 1 Oct 2015, Scott Mansfield wrote:
Any chance you could describe (perhaps privately?) in very broad strokes
what the write load looks like? (they're getting only writes, too?).
otherwise I'll have to devise arbitrary torture tests. I'm sure the bug's
in there but it's not obvious yet
On Thu, 1 Oct 2015, dormando wrote:
ok... slab class 12 claims to have 2 in "total_pages", yet 14g in
mem_requested. is this stat wrong?
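For reference, that inconsistency can be screened for mechanically: in 1.4.x a slab page is item_size_max bytes (4 MB with -I 4m), so a class's mem_requested can never legitimately exceed total_pages times the page size. A sketch against canned `stats slabs` output; the class 12 numbers mirror the suspect ones, and everything here is illustrative:

```shell
# Sample 'stats slabs' lines: class 12 claims 2 pages yet ~14 GiB requested.
cat > /tmp/slabs.txt <<'EOF'
STAT 12:total_pages 2
STAT 12:mem_requested 15032385536
STAT 13:total_pages 10
STAT 13:mem_requested 20971520
EOF

# Flag any class whose requested bytes exceed pages * page_size (4 MB).
awk -v page=$((4 * 1024 * 1024)) '
  $2 ~ /:total_pages$/   { split($2, a, ":"); pages[a[1]] = $3 }
  $2 ~ /:mem_requested$/ { split($2, a, ":"); req[a[1]] = $3 }
  END {
    for (c in req)
      if (req[c] > pages[c] * page)
        printf "class %s: mem_requested %d exceeds cap %d\n", c, req[c], pages[c] * page
  }' /tmp/slabs.txt
```

This flags class 12 only; class 13's 20 MB fits comfortably in ten 4 MB pages.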
On Thu, 1 Oct 2015, Scott Mansfield wrote:
> The ones that crashed (new code cluster) were set to only be written to from
> the client applications. The data is an index key and a series of data
got it. that might be a decent hint actually... I had added a bugfix to
the branch to not miscount the mem_requested counter, but either it's not
working or I missed a spot.
On Thu, 1 Oct 2015, Scott Mansfield wrote:
> The number now, after maybe 90 minutes of writes, is 1,446. I think after
>
Sorry for the data dumps here, but I want to give you everything I have. I
found 3 more addresses that showed up in the dmesg logs:
$ for addr in 40e013 40eff4 40f7c4; do addr2line -e memcached $addr; done
.../build/memcached-1.4.24-slab-rebal-next/slabs.c:265 (discriminator 1)
Oops, forgot the startup args:
-p 11211 -u -l 0.0.0.0 -c 10 -o
slab_reassign,slab_automove,lru_maintainer,lru_crawler,hash_algorithm=murmur3
-I 2m -m 56253
On Thursday, October 1, 2015 at 1:22:12 AM UTC-7, Scott Mansfield wrote:
Ok, thanks!
I'll noodle this a bit... unfortunately a backtrace might be more helpful.
will ask you to attempt to get one if I don't figure anything out in time.
(allow it to core dump or attach a GDB session and set an ignore handler
for sigpipe/int/etc and run "continue")
what were your full
That's unfortunate.
I've done some more work on the branch:
https://github.com/memcached/memcached/pull/112
It's not completely likely you would see enough of an improvement from the
new default mode. However if your item sizes change gradually, items are
reclaimed during expiration, or get
An unrelated prod problem meant I had to stop after about an hour. I'm
turning it on again tomorrow morning.
Are there any new metrics I should be looking at? Anything new in the stats
output? I'm about to take a look at the diffs as well.
On Tuesday, September 29, 2015 at 12:37:45 PM UTC-7,
If you look at the new branch there's a commit explaining the new stats.
You can watch slab_reassign_evictions vs slab_reassign_saves. You can also
test automove=1 vs automove=2 (please also turn on the lru_maintainer and
lru_crawler).
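A quick way to poll those counters from a shell; the host/port and the exact stat names are taken from the discussion here, so adjust to whatever the branch actually emits:

```shell
# Dump only the rebalancer counters from a running instance:
printf 'stats\r\nquit\r\n' | nc 127.0.0.1 11211 \
  | grep -E 'slab_reassign_(evictions|saves)'
```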
The initial branch you were running didn't add any new
I have the first commit (slab_automove=2) running in prod right now. Later
today will be a full load production test of the latest code. I'll just let
it run for a few days unless I spot any problems. We have good metrics for
latency et al. from the client side, though network normally dwarfs
excellent. if automove=2 is too aggressive you'll see that show up as a
hit ratio reduction.
the new branch works with automove=2 as well, but it will attempt to
rescue valid items in the old slab if possible. I'll still be working on
it for another few hours today though. I'll mail again when
I have it running internally, and it runs fine under normal load. It's
difficult to put it into the line of fire for a production workload because
of social reasons... As well, it's a degenerate case that we normally don't
run into (and actively try to avoid). I'm going to run some heavier load
I'm working on getting a test going internally. I'll let you know how it
goes.
*Scott Mansfield*
Product Eng > Consumer Science Eng > Sr. Software Eng
{
M: 352-514-9452
E: smansfi...@netflix.com
K: {M: mobile, E: email, K: key}
}
On Mon, Sep 7, 2015 at 2:33 PM, dormando
Yo,
https://github.com/dormando/memcached/commits/slab_rebal_next - would you
mind playing around with the branch here? You can see the start options in
the test.
This is a dead simple modification (a restoration of a feature that was
already there...). The test very aggressively writes and is
Hey,
I'm still really interested in working on this. I'll be taking a careful
look soon I hope.
On Mon, 3 Aug 2015, Scott Mansfield wrote:
I've tweaked the program slightly, so I'm adding a new version. It prints
more stats as it goes and runs a bit faster.
On Monday, August 3, 2015 at
No worries man, you're doing us a favor. Let me know if there's anything
you need from us, and I promise I'll be quicker this time :)
On Aug 18, 2015 12:01 AM, dormando dorma...@rydia.net wrote:
Total brain fart on my part. Apparently I had memcached 1.4.13 on my path
(who knows how...) Using the actual one that I've built works. Sorry for
the confusion... can't believe I didn't realize that before. I'm testing
against the compiled one now to see how it behaves.
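Incidentally, that stale-binary trap is easy to reproduce, and to guard against. Everything below is illustrative; the dummy scripts just stand in for two installed versions:

```shell
# Two fake 'memcached' binaries that only report a version string.
mkdir -p /tmp/old/bin /tmp/new/bin
printf '#!/bin/sh\necho "memcached 1.4.13"\n' > /tmp/old/bin/memcached
printf '#!/bin/sh\necho "memcached 1.4.24"\n' > /tmp/new/bin/memcached
chmod +x /tmp/old/bin/memcached /tmp/new/bin/memcached

# Whichever directory comes first in $PATH wins:
PATH=/tmp/old/bin:/tmp/new/bin command -v memcached   # -> /tmp/old/bin/memcached

# Running the freshly built binary by explicit path sidesteps $PATH:
/tmp/new/bin/memcached                                # -> memcached 1.4.24
```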
On Monday, August 3,
The command line I've used that will start is:
memcached -m 64 -o slab_reassign,slab_automove
the ones that fail are:
memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer
memcached -o lru_crawler
I'm sure I've missed something during compile, though I just used
What are your startup args?
On Mon, 3 Aug 2015, Scott Mansfield wrote:
On Monday, August 3, 2015 at 1:20:37 AM UTC-7, Scott Mansfield wrote:
I've attached a pretty simple program to connect, fill a slab with data,
and then fill another slab slowly with data of a different size. I've been
trying to get memcached to run with the lru_crawler and lru_maintainer
flags, but I get 'Illegal suboption (null)' every time I try to start
You sure that's 1.4.24? None of those fail for me :(
On Mon, 3 Aug 2015, Scott Mansfield wrote:
I realize I've not given you the tests to reproduce the behavior. I should
be able to soon. Sorry about the delay here.
In the mean time, I wanted to bring up a possible secondary use of the same
logic to move items on slab rebalancing. I think the system might benefit
from using the same
First, more detail for you:
We are running 1.4.24 in production and haven't noticed any bugs as of yet.
The new LRUs seem to be working well, though we nearly always run memcached
scaled to hold all data without evictions. Those with evictions are
behaving well. Those without evictions haven't
Hey,
On Fri, 10 Jul 2015, Scott Mansfield wrote:
We've seen issues recently where we run a cluster that typically has the
majority of items overwritten in the same slab every day and a sudden
change in data size evicts a ton of data, affecting downstream systems. To
be clear that is our problem, but I think there's a tweak in memcached that