Re: Check for orphaned items in lru crawler thread

dormando Mon, 13 Jul 2015 17:54:33 -0700

> First, more detail for you:
>
> We are running 1.4.24 in production and haven't noticed any bugs as of yet. 
> The new LRUs seem to be working well, though we nearly always run memcached 
> scaled to hold all data without evictions. Those with evictions are behaving 
> well. Those without evictions haven't seen crashing or any other noticeable 
> bad behavior.


Neat.

>
> OK, I think I see an area where I was speculating on functionality. If you 
> have a key in slab 21 and then the same key is written again at a larger size 
> in slab 23 I assumed that the space in 21 was not freed on the second write. 
> With that assumption, the LRU crawler would not free up that space. Also just 
> by observation in the macro, the space is not freed
> fast enough to be effective, in our use case, to accept the writes that are 
> happening. Think in the hundreds of millions of "overwrites" in a 6 - 10 hour 
> period across a cluster.

Internally, "items" (key/value pairs) are generally immutable. The only
exception is INCR/DECR, and even that falls back to being immutable if two
INCR/DECRs collide.

What this means is that the new item is staged in a piece of free memory
while the "upload" stage of the SET happens. When memcached has all of the
data in memory to replace the item, it does an internal swap under a lock.
The old item is removed from the hash table and LRU, and the new item gets
put in its place (at the head of the LRU).

Since items are refcounted, this means that if other users are downloading
an item which just got replaced, their memory doesn't get corrupted by the
item changing out from underneath them. They can continue to read the old
item until they're done. When the refcount reaches zero the old memory is
reclaimed.
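
As a rough illustration of the refcounting described above, here is a
minimal single-threaded sketch in C. The item struct, table_slot, and all
four function names are hypothetical and are not memcached's actual code;
the real server does the swap under a lock and returns the chunk to the
slab free list instead of setting a flag.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified item (not memcached's real struct). */
typedef struct item {
    int  refcount;   /* one ref for the table link, one per active reader */
    int  freed;      /* set when the memory would be reclaimed */
    char value[32];
} item;

/* Stands in for the hash table entry and LRU head for a single key. */
static item *table_slot;

/* Stage a new item in "free memory"; it is not yet visible to readers. */
item *item_new(const char *v)
{
    item *it = calloc(1, sizeof(*it));
    assert(it != NULL);
    snprintf(it->value, sizeof(it->value), "%s", v);
    return it;
}

/* A reader takes a reference so the item cannot be reclaimed under it. */
item *item_get(void)
{
    if (table_slot != NULL)
        table_slot->refcount++;
    return table_slot;
}

/* Drop a reference; the memory is reclaimed only when the count hits zero. */
void item_release(item *it)
{
    assert(it->refcount > 0);
    if (--it->refcount == 0)
        it->freed = 1;   /* assumption: real code would recycle the chunk */
}

/* Swap the new item in and drop the table's reference to the old one. */
void item_replace(item *new_it)
{
    item *old = table_slot;
    new_it->refcount++;          /* reference held by the table */
    table_slot = new_it;
    if (old != NULL)
        item_release(old);       /* freed now only if nobody is reading it */
}
```

If a reader holds the old item across an item_replace(), the old item's
freed flag stays 0 until that reader calls item_release().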

Most of the time, the item replacement happens and the old memory is
reclaimed immediately afterward.

However, this does mean that you need *one* piece of free memory to
replace the old one. Then the old memory gets freed after that set.

So if you take a memcached instance with 0 free chunks, and do a rolling
replacement of all items (within the same slab class as before), the first
one would cause an eviction from the tail of the LRU to get a free chunk.
Every SET after that would use the chunk freed from the replacement of the
previous memory.
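
The arithmetic of that rolling replacement can be sketched with a small
counting model. This is an illustration only: slab_class, take_chunk,
overwrite_existing and rolling_overwrite are made-up names, NCHUNKS is an
arbitrary size, and the model assumes that nobody is reading the old items
and that the one evicted tail item is not a key that gets rewritten later
(if it were, that later SET would be a fresh insert and would use up the
spare chunk).

```c
#include <assert.h>

#define NCHUNKS 8   /* hypothetical size of one slab class, in chunks */

/* Counters only; no real memory is managed in this model. */
typedef struct {
    int free_chunks;   /* chunks currently on the free list */
    int live_items;    /* items linked into the hash table and LRU */
    int evictions;     /* items evicted from the LRU tail */
} slab_class;

/* Get one chunk: use a free one, otherwise evict the LRU tail. */
void take_chunk(slab_class *sc)
{
    if (sc->free_chunks > 0) {
        sc->free_chunks--;
    } else {
        assert(sc->live_items > 0);
        sc->evictions++;
        sc->live_items--;   /* the evicted item's chunk is reused directly */
    }
}

/* SET on a key that already exists in this slab class. */
void overwrite_existing(slab_class *sc)
{
    take_chunk(sc);       /* stage the new item in free memory */
    sc->live_items++;     /* new item linked at the head of the LRU */
    sc->live_items--;     /* old item unlinked from hash table and LRU */
    sc->free_chunks++;    /* old chunk reclaimed (refcount reached zero) */
}

/* Overwrite nkeys existing keys in turn; returns total evictions. */
int rolling_overwrite(slab_class *sc, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
        overwrite_existing(sc);
    return sc->evictions;
}
```

Starting from a full class (free_chunks = 0, live_items = NCHUNKS) and
overwriting the other NCHUNKS - 1 keys, the model reports a single eviction
and ends with one free chunk.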

> After that last sentence I realized I also may not have explained well enough 
> the access pattern. The keys are all overwritten every day, but it takes some 
> time to write them all (obviously). We see a huge increase in the bytes 
> metric as if the new data for the old keys was being written for the first 
> time. Since the "old" slab for the same key doesn't
> proactively release memory, it starts to fill up the cache and then start 
> evicting data in the new slab. Once that happens, we see evictions in the old 
> slab because of the algorithm you mentioned (random picking / freeing of 
> memory). Typically we don't see any use for "upgrading" an item as the new 
> data would be entirely new and should wholesale replace the
> old data for that key. More specifically, the operation is always set, with 
> different data each day.

Right. Most of your problems will come from two areas. The first is that
when you write data aggressively into the new slab class, the mover makes
memory available more slowly than you can insert (unless you set the
rebalancer to always-replace mode). So you'll cause extra evictions in the
new slab class.

The second problem comes from the random evictions in the previous slab
class as stuff is chucked on the floor to make memory moveable.

> As for testing, we'll be able to put it under real production workload. I 
> don't know what kind of data you mean you need for testing. The data stored 
> in the caches are highly confidential. I can give you all kinds of metrics, 
> since we collect most of the ones that are in the stats and some from the 
> stats slabs output. If you have some specific ones that
> need collecting, I'll double check and make sure we can get those. 
> Alternatively, it might be most beneficial to see the metrics in person :)

I just need stats snapshots here and there, and someone to actually put
the thing under load. When I did the LRU work I had to beg for several
months before anyone tested it with a production load. This slows things
down and demotivates me from working on the project.

Unfortunately my dayjob keeps me pretty busy so ~internet~ would probably
be best.

> I can create a driver program to reproduce the behavior on a smaller scale. 
> It would write e.g. 10k keys of 10k size, then rewrite the same keys with 
> different size data. I'll work on that and post it to this thread when I can 
> reproduce the behavior locally.

Ok. There're slab rebalance unit tests in the t/ directory which do things
like this, and I've used mc-crusher to slam the rebalancer. It's pretty
easy to run one config to load up 10k objects, then flip to the other
using the same key namespace.
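
For a hand-rolled driver like the one described in the quote above, the
core is just building text-protocol SET commands of a chosen size. A
minimal sketch, assuming the ASCII protocol's
"set <key> <flags> <exptime> <bytes>\r\n<data>\r\n" framing; format_set and
its parameters are hypothetical names, and connecting and writing to the
socket is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write one text-protocol SET command into out:
 *   set <key> 0 <exptime> <bytes>\r\n<value_len copies of fill>\r\n
 * Returns the number of bytes written, or -1 if out is too small.
 * The output is a raw byte buffer and is not NUL-terminated.
 */
long format_set(char *out, size_t cap, const char *key,
                long exptime, size_t value_len, char fill)
{
    /* Header line; flags are fixed at 0 in this sketch. */
    int hdr = snprintf(out, cap, "set %s 0 %ld %zu\r\n",
                       key, exptime, value_len);
    if (hdr < 0 || (size_t)hdr >= cap)
        return -1;   /* header was truncated */

    size_t total = (size_t)hdr + value_len + 2;
    if (total > cap)
        return -1;   /* no room for the data block and trailing CRLF */

    memset(out + hdr, fill, value_len);           /* synthetic payload */
    memcpy(out + hdr + value_len, "\r\n", 2);     /* data block terminator */
    return (long)total;
}
```

A first pass would call this once per key with value_len around 10240, and
a second pass over the same keys with a different size such as 15360, with
exptime set to 604800 to match the 7 day TTL from the original report.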

> Thanks,
> Scott
>
> On Saturday, July 11, 2015 at 12:05:54 PM UTC-7, Dormando wrote:
>       Hey,
>
>       On Fri, 10 Jul 2015, Scott Mansfield wrote:
>
>       > We've seen issues recently where we run a cluster that typically has
>       > the majority of items overwritten in the same slab every day and a
>       > sudden change in data size evicts a ton of data, affecting downstream
>       > systems. To be clear that is our problem, but I think there's a tweak
>       > in memcached that might be useful and another possible feature that
>       > would be even better.
>       > The data that is written to this cache is overwritten every day,
>       > though the TTL is 7 days. One slab takes up the majority of the space
>       > in the cache. The application wrote e.g. 10KB (slab 21) every day for
>       > each key consistently. One day, a change occurred where it started
>       > writing 15KB (slab 23), causing a migration of data from one slab to
>       > another. We had -o slab_reassign,slab_automove=1 set on the server,
>       > causing large numbers of evictions on the initial slab. Let's say the
>       > cache could hold the data at 15KB per key, but the old data was not
>       > technically TTL'd out in its old slab. This means that memory was not
>       > being freed by the lru crawler thread (I think) because its expiry had
>       > not come around.
>       >
>       > lines 1199 and 1200 in items.c:
>       > if ((search->exptime != 0 && search->exptime < current_time) ||
>       >     is_flushed(search)) {
>       >
>       > If there was a check to see if this data was "orphaned," i.e. that
>       > the key, if accessed, would map to a different slab than the current
>       > one, then these orphans could be reclaimed as free memory. I am working
>       > on a patch to do this, though I have reservations about performing a
>       > hash on the key on the lru crawler thread (if the hash is not already
>       > available).
>       > I have very little experience in the memcached codebase so I don't
>       > know the most efficient way to do this. Any help would be appreciated.
>
>       There seems to be a misconception about how the slab classes work. A key,
>       if already existing in a slab, will always map to the slab class it
>       currently fits into. The slab classes always exist, but the amount of
>       memory reserved for each of them will shift with the slab_reassign. ie: 10
>       pages in slab class 21, then memory pressure on 23 causes it to move over.
>
>       So if you examine a key that still exists in slab class 21, it has no
>       reason to move up or down the slab classes.
>
>       > Alternatively, and possibly more beneficial is compaction of data in
>       > a slab using the same set of criteria as lru crawling. Understandably,
>       > compaction is a very difficult problem to solve since moving the data
>       > would be a pain in the ass. I saw a couple of discussions about this in
>       > the mailing list, though I didn't see any firm thoughts about it. I
>       > think it can probably be done in O(1) like the lru crawler by limiting
>       > the number of items it touches each time. Writing and reading are
>       > doable in O(1) so moving should be as well. Has anyone given more
>       > thought on compaction?
>
>       I'd be interested in hacking this up for you folks if you can provide me
>       testing and some data to work with. With all of the LRU work I did in
>       1.4.24, the next thing I wanted to do is a big improvement on the slab
>       reassignment code.
>
>       Currently it picks essentially a random slab page, empties it, and moves
>       the slab page into the class under pressure.
>
>       One thing we can do is first examine for free memory in the existing slab,
>       IE:
>
>       - Take a page from slab 21
>       - Scan the page for valid items which need to be moved
>       - Pull free memory from slab 21, migrate the item (moderately complicated)
>       - When the page is empty, move it (or give up if you run out of free
>         chunks).
>
>       The next step is to pull from the LRU on slab 21:
>
>       - Take page from slab 21
>       - Scan page for valid items
>       - Pull free memory from slab 21, migrate the item
>         - If no memory free, evict tail of slab 21. use that chunk.
>       - When the page is empty, move it.
>
>       Then, when you hit this condition your least-recently-used data gets
>       culled as new data migrates your page class. This should match a natural
>       occurrence if you would already be evicting valid (but old) items to make
>       room for new items.
>
>       A bonus to using the free memory trick, is that I can use the amount of
>       free space in a slab class as a heuristic to more quickly move slab pages
>       around.
>
>       If it's still necessary from there, we can explore "upgrading" items to a
>       new slab class, but that is much much more complicated since the item has
>       to shift LRU's. Do you put it at the head, the tail, the middle, etc? It
>       might be impossible to make a good generic decision there.
>
>       What version are you currently on? If 1.4.24, have you seen any
>       instability? I'm currently torn between fighting a few bugs and starting
>       on improving the slab rebalancer.
>
>       -Dormando
>
>