Re: tail repair issue (1.4.20)

dormando Mon, 11 Aug 2014 20:54:38 -0700

> > Well, sounds like whatever process was asking for that data is dead (and> 
> >possibly pissing off a customer) so you should
> indeed figure out what
> > that's about.
>
> Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor to 
> look for things in a write state for extended
> periods and then go do some tracing (rather than, say, waiting for it to 
> actually break again). We *do* have some legitimately
> long-running (multi-hour) things going on, so can’t just say “long connection 
> bad!”, but it would be nice if maybe those
> processes could slurp their entire response upfront or some such.
>
>
> > I think another thing we can do is actually throw a 
> > refcounted-for-a-long-time 
> > item back to the front of the LRU. I'll try a patch for that this weekend. 
> > It should
> > have no real overhead compared to other approaches of timing out 
> > connections.
>
> Is there any reason you can’t do “if refcount > 1 when walking the end of the 
> tail, send to the front” without requiring
> ‘refcounted for a long time’ (with, of course, still limiting it to 5ish 
> actions)? It seems like this would be pretty safe,
> since generally stuff at the end of LRU shouldn’t have a refcount, and then 
> you don’t need extra code for figuring out how long
> something has been refcounted.
>
> I guess there’s a slightly degenerate case in there, which is that if you 
> have a small slab that’s 100% refcounted, you end up
> cycling a bunch of pointers every write just to run the LRU in a big circle 
> and never write anything (similar to the case you
> suggest in your last paragraph), but that’s a situation I’m totally willing 
> to accept. ;)
>
> Anyhow, looking forward to a patch, and will gladly help test!
>


Here, try out this branch:
https://github.com/dormando/memcached/tree/refchuck

It needs some cleanup and sanity checking. I want to redo the loop instead
of the weird goto, add an arg to item_update intead of copypasta, and add
one or two sanity checks to break the loop if you're trying to alloc out
of a class that's 100% reflocked.

I added a test that works okay. Fails before, runs after. Can you try
this on one or two machines and see what the impact is?

If it works okay I'll clean it up and merge. Need to spend a little more
time on the PR queue before I can cut though.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to memcached+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: tail repair issue (1.4.20)

Reply via email to