I've tweaked the program slightly, so I'm adding a new version. It prints more stats as it goes and runs a bit faster.
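For reference, the kind of stats polling the updated driver might do between writes can be approximated with the text protocol's "stats" command. This is a rough, illustrative sketch (the address and the choice of counters are arbitrary), not the attached program itself:

// Sketch: poll memcached's plain-text "stats" command and print a couple
// of counters. Address and counter selection are illustrative.
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

func dumpStats(addr string) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	// The text protocol replies with "STAT <name> <value>" lines, then "END".
	if _, err := fmt.Fprintf(conn, "stats\r\n"); err != nil {
		return err
	}
	sc := bufio.NewScanner(conn)
	for sc.Scan() {
		line := sc.Text()
		if line == "END" {
			break
		}
		if strings.HasPrefix(line, "STAT bytes ") ||
			strings.HasPrefix(line, "STAT evictions ") {
			fmt.Println(line)
		}
	}
	return sc.Err()
}

func main() {
	if err := dumpStats("127.0.0.1:11211"); err != nil {
		fmt.Println("stats error:", err)
	}
}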
On Monday, August 3, 2015 at 1:20:37 AM UTC-7, Scott Mansfield wrote:
> Total brain fart on my part. Apparently I had memcached 1.4.13 on my path
> (who knows how...). Using the actual one that I've built works. Sorry for
> the confusion... can't believe I didn't realize that before. I'm testing
> against the compiled one now to see how it behaves.
>
> On Monday, August 3, 2015 at 1:15:06 AM UTC-7, Dormando wrote:
>> You sure that's 1.4.24? None of those fail for me :(
>>
>> On Mon, 3 Aug 2015, Scott Mansfield wrote:
>>> The command line I've used that will start is:
>>>
>>>   memcached -m 64 -o slab_reassign,slab_automove
>>>
>>> The ones that fail are:
>>>
>>>   memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer
>>>   memcached -o lru_crawler
>>>
>>> I'm sure I've missed something during compile, though I just used
>>> ./configure and make.
>>>
>>> On Monday, August 3, 2015 at 12:22:33 AM UTC-7, Scott Mansfield wrote:
>>> I've attached a pretty simple program to connect, fill a slab with data,
>>> and then fill another slab slowly with data of a different size. I've
>>> been trying to get memcached to run with the lru_crawler and
>>> lru_maintainer flags, but I get 'Illegal suboption "(null)"' every time I
>>> try to start with either in any configuration.
>>>
>>> I haven't seen it start to move slabs automatically with a freshly
>>> installed 1.4.24.
>>>
>>> On Tuesday, July 21, 2015 at 4:55:17 PM UTC-7, Scott Mansfield wrote:
>>> I realize I've not given you the tests to reproduce the behavior. I
>>> should be able to soon. Sorry about the delay here.
>>>
>>> In the meantime, I wanted to bring up a possible secondary use of the
>>> same logic to move items on slab rebalancing. I think the system might
>>> benefit from using the same logic to crawl the pages in a slab and
>>> compact the data in the background. In the case where we have memory that
>>> is assigned to the slab but not being used, because of replaced or TTL'd
>>> out data, returning the memory to a pool of free memory will allow a slab
>>> to grow with that memory first instead of waiting for an event where
>>> memory is needed at that instant.
>>>
>>> It's a change in approach, from reactive to proactive. What do you think?
>>>
>>> On Monday, July 13, 2015 at 5:54:11 PM UTC-7, Dormando wrote:
>>>> First, more detail for you:
>>>>
>>>> We are running 1.4.24 in production and haven't noticed any bugs as of
>>>> yet. The new LRUs seem to be working well, though we nearly always run
>>>> memcached scaled to hold all data without evictions. Those with
>>>> evictions are behaving well. Those without evictions haven't seen
>>>> crashing or any other noticeable bad behavior.
>>>
>>> Neat.
>>>
>>>> OK, I think I see an area where I was speculating on functionality. If
>>>> you have a key in slab 21 and then the same key is written again at a
>>>> larger size in slab 23, I assumed that the space in 21 was not freed on
>>>> the second write. With that assumption, the LRU crawler would not free
>>>> up that space. Also, just by observation at the macro level, the space
>>>> is not freed fast enough to be effective, in our use case, to accept the
>>>> writes that are happening. Think in the hundreds of millions of
>>>> "overwrites" in a 6-10 hour period across a cluster.
>>>
>>> Internally, "items" (a key/value pair) are generally immutable.
>>> The only time when it's not is for INCR/DECR, and it still becomes
>>> immutable if two INCR/DECRs collide.
>>>
>>> What this means is that the new item is staged in a piece of free memory
>>> while the "upload" stage of the SET happens. When memcached has all of
>>> the data in memory to replace the item, it does an internal swap under a
>>> lock. The old item is removed from the hash table and LRU, and the new
>>> item gets put in its place (at the head of the LRU).
>>>
>>> Since items are refcounted, this means that if other users are
>>> downloading an item which just got replaced, their memory doesn't get
>>> corrupted by the item changing out from underneath them. They can
>>> continue to read the old item until they're done. When the refcount
>>> reaches zero, the old memory is reclaimed.
>>>
>>> Most of the time, the item replacement happens and then the old memory is
>>> immediately removed.
>>>
>>> However, this does mean that you need *one* piece of free memory to
>>> replace the old one. Then the old memory gets freed after that set.
>>>
>>> So if you take a memcached instance with 0 free chunks and do a rolling
>>> replacement of all items (within the same slab class as before), the
>>> first one would cause an eviction from the tail of the LRU to get a free
>>> chunk. Every SET after that would use the chunk freed by the replacement
>>> of the previous memory.
>>>
>>>> After that last sentence I realized I also may not have explained the
>>>> access pattern well enough. The keys are all overwritten every day, but
>>>> it takes some time to write them all (obviously). We see a huge increase
>>>> in the bytes metric, as if the new data for the old keys was being
>>>> written for the first time. Since the "old" slab for the same key
>>>> doesn't proactively release memory, it starts to fill up the cache and
>>>> then starts evicting data in the new slab. Once that happens, we see
>>>> evictions in the old slab because of the algorithm you mentioned (random
>>>> picking / freeing of memory). Typically we don't see any use for
>>>> "upgrading" an item, as the new data would be entirely new and should
>>>> wholesale replace the old data for that key. More specifically, the
>>>> operation is always set, with different data each day.
>>>
>>> Right. Most of your problems will come from two areas. One is that when
>>> you write data aggressively into the new slab class (unless you set the
>>> rebalancer to always-replace mode), the mover will make memory available
>>> more slowly than you can insert, so you'll cause extra evictions in the
>>> new slab class.
>>>
>>> The secondary problem is the random evictions in the previous slab class
>>> as stuff is chucked on the floor to make memory moveable.
>>>
>>>> As for testing, we'll be able to put it under real production workload.
>>>> I don't know what kind of data you mean you need for testing. The data
>>>> stored in the caches is highly confidential. I can give you all kinds of
>>>> metrics, since we collect most of the ones that are in the stats output
>>>> and some from the stats slabs output. If you have some specific ones
>>>> that need collecting, I'll double check and make sure we can get those.
>>>> Alternatively, it might be most beneficial to see the metrics in
>>>> person :)
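As an aside for readers of the archive: the replace-then-free semantics Dormando describes above can be modeled with a toy example. This Go sketch uses invented, simplified types (memcached itself is C); it shows why each replacement needs one free chunk and why in-flight readers keep the old item's memory alive:

// Toy model of replace-then-free with refcounting. The new item is staged
// first, swapped in under a lock, and the old item is only "freed" when
// its refcount drops to zero.
package main

import (
	"fmt"
	"sync"
)

type item struct {
	key      string
	value    []byte
	refcount int  // readers currently holding this item
	unlinked bool // removed from the hash table, awaiting last release
}

type store struct {
	mu    sync.Mutex
	items map[string]*item
	freed int // stand-in for chunks returned to the class's free list
}

// Set stages the new item in fresh memory (one free chunk is needed),
// then swaps it into the hash table under the lock.
func (s *store) Set(key string, value []byte) {
	newIt := &item{key: key, value: value} // staged before the swap
	s.mu.Lock()
	defer s.mu.Unlock()
	if old, ok := s.items[key]; ok {
		old.unlinked = true
		if old.refcount == 0 {
			s.freed++ // no readers: reclaim the old chunk immediately
		}
	}
	s.items[key] = newIt
}

// Get takes a reference so a concurrent Set can't free the item's memory
// out from underneath the reader.
func (s *store) Get(key string) *item {
	s.mu.Lock()
	defer s.mu.Unlock()
	if it, ok := s.items[key]; ok {
		it.refcount++
		return it
	}
	return nil
}

// Release drops the reference; the last reader of an unlinked item frees it.
func (s *store) Release(it *item) {
	s.mu.Lock()
	defer s.mu.Unlock()
	it.refcount--
	if it.refcount == 0 && it.unlinked {
		s.freed++
	}
}

func main() {
	s := &store{items: make(map[string]*item)}
	s.Set("k", []byte("v1"))
	it := s.Get("k")
	s.Set("k", []byte("v2"))               // old item survives: a reader holds it
	fmt.Println(string(it.value), s.freed) // "v1 0": reader still sees v1
	s.Release(it)
	fmt.Println(s.freed) // 1: old chunk reclaimed on last release
}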
>>> I just need stats snapshots here and there, and actually putting the
>>> thing under load. When I did the LRU work I had to beg for several
>>> months before anyone tested it with a production load. This slows things
>>> down and demotivates me from working on the project.
>>>
>>> Unfortunately my dayjob keeps me pretty busy, so ~internet~ would
>>> probably be best.
>>>
>>>> I can create a driver program to reproduce the behavior on a smaller
>>>> scale. It would write e.g. 10k keys of 10k size, then rewrite the same
>>>> keys with different size data. I'll work on that and post it to this
>>>> thread when I can reproduce the behavior locally.
>>>
>>> Ok. There are slab rebalance unit tests in the t/ directory which do
>>> things like this, and I've used mc-crusher to slam the rebalancer. It's
>>> pretty easy to run one config to load up 10k objects, then flip to the
>>> other using the same key namespace.
>>>
>>>> Thanks,
>>>> Scott
>>>>
>>>> On Saturday, July 11, 2015 at 12:05:54 PM UTC-7, Dormando wrote:
>>>> Hey,
>>>>
>>>> On Fri, 10 Jul 2015, Scott Mansfield wrote:
>>>>
>>>>> We've seen issues recently where we run a cluster that typically has
>>>>> the majority of items overwritten in the same slab every day, and a
>>>>> sudden change in data size evicts a ton of data, affecting downstream
>>>>> systems. To be clear, that is our problem, but I think there's a tweak
>>>>> in memcached that might be useful, and another possible feature that
>>>>> would be even better.
>>>>>
>>>>> The data that is written to this cache is overwritten every day, though
>>>>> the TTL is 7 days. One slab takes up the majority of the space in the
>>>>> cache. The application wrote e.g. 10KB (slab 21) every day for each key
>>>>> consistently. One day, a change occurred where it started writing 15KB
>>>>> (slab 23), causing a migration of data from one slab to another. We had
>>>>> -o slab_reassign,slab_automove=1 set on the server, causing large
>>>>> numbers of evictions on the initial slab. Let's say the cache could
>>>>> hold the data at 15KB per key, but the old data was not technically
>>>>> TTL'd out in its old slab. This means that memory was not being freed
>>>>> by the lru crawler thread (I think) because its expiry had not come
>>>>> around.
>>>>>
>>>>> Lines 1199 and 1200 in items.c:
>>>>>
>>>>>   if ((search->exptime != 0 && search->exptime < current_time) || is_flushed(search)) {
>>>>>
>>>>> If there was a check to see if this data was "orphaned," i.e. that the
>>>>> key, if accessed, would map to a different slab than the current one,
>>>>> then these orphans could be reclaimed as free memory. I am working on a
>>>>> patch to do this, though I have reservations about performing a hash on
>>>>> the key on the lru crawler thread (if the hash is not already
>>>>> available). I have very little experience in the memcached codebase, so
>>>>> I don't know the most efficient way to do this. Any help would be
>>>>> appreciated.
>>>>
>>>> There seems to be a misconception about how the slab classes work. A
>>>> key, if already existing in a slab, will always map to the slab class it
>>>> currently fits into. The slab classes always exist, but the amount of
>>>> memory reserved for each of them will shift with slab_reassign. ie: 10
>>>> pages in slab class 21, then memory pressure on 23 causes it to move
>>>> over.
>>>>
>>>> So if you examine a key that still exists in slab class 21, it has no
>>>> reason to move up or down the slab classes.
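For intuition on why ~10KB and ~15KB values land in different slab classes: chunk sizes grow geometrically (memcached's default growth factor is 1.25). The following toy Go sketch mimics the idea only; it is not memcached's exact slabs_clsid() arithmetic (real classes also account for item header overhead, so the class numbers won't line up with slab 21/23 exactly):

// Toy slab-class selection: geometric chunk sizes, first-fit lookup.
package main

import "fmt"

// buildClasses returns ascending chunk sizes up to maxSize.
func buildClasses(base, factor float64, maxSize int) []int {
	var sizes []int
	for s := base; int(s) <= maxSize; s *= factor {
		sizes = append(sizes, int(s))
	}
	return sizes
}

// clsid returns the 1-based index of the first class that fits the item.
func clsid(sizes []int, itemSize int) int {
	for i, s := range sizes {
		if itemSize <= s {
			return i + 1
		}
	}
	return 0 // too large for any class
}

func main() {
	sizes := buildClasses(96, 1.25, 1024*1024)
	fmt.Println(clsid(sizes, 10*1024)) // a ~10KB item
	fmt.Println(clsid(sizes, 15*1024)) // a ~15KB item lands in a higher class
}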
>>>>> Alternatively, and possibly more beneficial, is compaction of data in
>>>>> a slab using the same set of criteria as lru crawling. Understandably,
>>>>> compaction is a very difficult problem to solve, since moving the data
>>>>> would be a pain in the ass. I saw a couple of discussions about this on
>>>>> the mailing list, though I didn't see any firm thoughts about it. I
>>>>> think it can probably be done in O(1) like the lru crawler by limiting
>>>>> the number of items it touches each time. Writing and reading are
>>>>> doable in O(1), so moving should be as well. Has anyone given more
>>>>> thought to compaction?
>>>>
>>>> I'd be interested in hacking this up for you folks if you can provide me
>>>> testing and some data to work with. With all of the LRU work I did in
>>>> 1.4.24, the next thing I wanted to do is a big improvement to the slab
>>>> reassignment code.
>>>>
>>>> Currently it picks an essentially random slab page, empties it, and
>>>> moves the slab page into the class under pressure.
>>>>
>>>> One thing we can do is first examine for free memory in the existing
>>>> slab, ie:
>>>>
>>>> - Take a page from slab 21
>>>> - Scan the page for valid items which need to be moved
>>>> - Pull free memory from slab 21, migrate the item (moderately
>>>>   complicated)
>>>> - When the page is empty, move it (or give up if you run out of free
>>>>   chunks)
>>>>
>>>> The next step is to pull from the LRU on slab 21:
>>>>
>>>> - Take a page from slab 21
>>>> - Scan the page for valid items
>>>> - Pull free memory from slab 21, migrate the item
>>>> - If no memory is free, evict the tail of slab 21 and use that chunk
>>>> - When the page is empty, move it
>>>>
>>>> Then, when you hit this condition, your least-recently-used data gets
>>>> culled as new data migrates your page class. This should match a natural
>>>> occurrence, if you would already be evicting valid (but old) items to
>>>> make room for new items.
>>>>
>>>> A bonus to using the free memory trick is that I can use the amount of
>>>> free space in a slab class as a heuristic to more quickly move slab
>>>> pages around.
>>>>
>>>> If it's still necessary from there, we can explore "upgrading" items to
>>>> a new slab class, but that is much, much more complicated, since the
>>>> item has to shift LRUs. Do you put it at the head, the tail, the middle,
>>>> etc.? It might be impossible to make a good generic decision there.
>>>>
>>>> What version are you currently on? If 1.4.24, have you seen any
>>>> instability? I'm currently torn between fighting a few bugs and starting
>>>> on improving the slab rebalancer.
>>>>
>>>> -Dormando
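The two page-evacuation strategies listed above can be sketched as hypothetical Go pseudocode. Every type and helper here is invented for illustration; the real rebalancer lives in memcached's C code:

// Sketch of page evacuation: migrate live items into free chunks of the
// same class; optionally evict the LRU tail when no chunk is free.
package main

import "fmt"

type chunk struct{ live bool }

type page struct{ chunks []*chunk }

type slabClass struct {
	free []*chunk // free chunk list
	lru  []*chunk // LRU order; the last element is least recently used
}

func (c *slabClass) popFree() *chunk {
	if len(c.free) == 0 {
		return nil
	}
	ch := c.free[len(c.free)-1]
	c.free = c.free[:len(c.free)-1]
	return ch
}

// evictTail culls the least-recently-used item so its chunk can be reused.
func (c *slabClass) evictTail() *chunk {
	if len(c.lru) == 0 {
		return nil
	}
	ch := c.lru[len(c.lru)-1]
	c.lru = c.lru[:len(c.lru)-1]
	ch.live = false
	return ch
}

// evacuate drains one page. With evict=true (the second strategy) the LRU
// tail is culled when no chunk is free; with evict=false (the first
// strategy) we give up instead.
func evacuate(p *page, c *slabClass, evict bool) bool {
	for _, src := range p.chunks {
		if !src.live {
			continue
		}
		dst := c.popFree()
		if dst == nil && evict {
			dst = c.evictTail()
		}
		if dst == nil {
			return false // out of free chunks: give up on this page
		}
		// "Migrate" the item: real code would copy the item data and
		// relink the hash table and LRU to point at dst.
		dst.live = true
		src.live = false
	}
	return true // page is empty; it can move to the class under pressure
}

func main() {
	p := &page{chunks: []*chunk{{live: true}, {live: false}, {live: true}}}
	c := &slabClass{free: []*chunk{{}}, lru: []*chunk{{live: true}}}
	fmt.Println(evacuate(p, c, true)) // true: one free chunk plus one eviction
}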
--

---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
slab_pressure.go
Description: Binary data
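The attachment is binary in the archive. As a minimal sketch of what the thread describes such a driver doing (fill one slab class, then slowly rewrite the same keys at a larger size), assuming the github.com/bradfitz/gomemcache client; the address, key count, sizes, and delay are all illustrative:

// Sketch of a slab-pressure driver: fill one slab class quickly, then
// slowly overwrite the same keyspace with larger values.
package main

import (
	"bytes"
	"fmt"
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// fill writes n values of the given size, pausing between writes.
func fill(mc *memcache.Client, n, size int, delay time.Duration) {
	val := bytes.Repeat([]byte("x"), size)
	for i := 0; i < n; i++ {
		it := &memcache.Item{
			Key:        fmt.Sprintf("key:%d", i),
			Value:      val,
			Expiration: 7 * 24 * 3600, // 7-day TTL, as in the thread
		}
		if err := mc.Set(it); err != nil {
			fmt.Println("set error:", err)
		}
		time.Sleep(delay)
	}
}

func main() {
	mc := memcache.New("127.0.0.1:11211")
	// Fill one slab class quickly with ~10KB items...
	fill(mc, 10000, 10*1024, 0)
	// ...then slowly rewrite the same keys at ~15KB, putting pressure on a
	// higher slab class while the old class still holds un-expired memory.
	fill(mc, 10000, 15*1024, 10*time.Millisecond)
}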