Re: Check for orphaned items in lru crawler thread

dormando Mon, 07 Sep 2015 14:34:13 -0700

Yo,

https://github.com/dormando/memcached/commits/slab_rebal_next - would you
mind playing around with the branch here? You can see the start options in
the test.


This is a dead simple modification (a restoration of a feature that was
already there...). The test very aggressively writes and is able to shunt
memory around appropriately.

The work I'm exploring right now will allow saving items from the pages
being rebalanced away, and increasing the aggression of page moving without
being so brain damaged about it.

But while I'm poking around with that, I'd be interested in knowing if
this simple branch is an improvement, and if so how much.

I'll push more code to the branch, but the changes should be gated behind
a feature flag.

On Tue, 18 Aug 2015, 'Scott Mansfield' via memcached wrote:

>
> No worries man, you're doing us a favor. Let me know if there's anything you 
> need from us, and I promise I'll be quicker this time :)
>
> On Aug 18, 2015 12:01 AM, "dormando" <dorma...@rydia.net> wrote:
>       Hey,
>
>       I'm still really interested in working on this. I'll be taking a careful
>       look soon I hope.
>
>       On Mon, 3 Aug 2015, Scott Mansfield wrote:
>
>       > I've tweaked the program slightly, so I'm adding a new version. It 
> prints more stats as it goes and runs a bit faster.
>       >
>       > On Monday, August 3, 2015 at 1:20:37 AM UTC-7, Scott Mansfield wrote:
>       >       Total brain fart on my part. Apparently I had memcached 1.4.13 
> on my path (who knows how...) Using the actual one that I've built works. 
> Sorry for the confusion... can't believe I didn't realize that before. I'm 
> testing against the compiled one now to see how it behaves.
>       >       On Monday, August 3, 2015 at 1:15:06 AM UTC-7, Dormando wrote:
>       >             You sure that's 1.4.24? None of those fail for me :(
>       >
>       >             On Mon, 3 Aug 2015, Scott Mansfield wrote:
>       >
>       >             > The command line I've used that will start is:
>       >             >
>       >             > memcached -m 64 -o slab_reassign,slab_automove
>       >             >
>       >             >
>       >             > the ones that fail are:
>       >             >
>       >             >
>       >             > memcached -m 64 -o 
> slab_reassign,slab_automove,lru_crawler,lru_maintainer
>       >             >
>       >             > memcached -o lru_crawler
>       >             >
>       >             >
>       >             > I'm sure I've missed something during compile, though I 
> just used ./configure and make.
>       >             >
>       >             >
>       >             > On Monday, August 3, 2015 at 12:22:33 AM UTC-7, Scott 
> Mansfield wrote:
>       >             >       I've attached a pretty simple program to connect,
>       >             >       fill a slab with data, and then fill another slab
>       >             >       slowly with data of a different size. I've been
>       >             >       trying to get memcached to run with the lru_crawler
>       >             >       and lru_maintainer flags, but I get 'Illegal
>       >             >       suboption "(null)"' every time I try to start with
>       >             >       either in any configuration.
>       >             >
>       >             >
>       >             >       I haven't seen it start to move slabs
>       >             >       automatically with a freshly installed 1.4.24.
>       >             >
>       >             >
>       >             >       On Tuesday, July 21, 2015 at 4:55:17 PM UTC-7, 
> Scott Mansfield wrote:
>       >             >             I realize I've not given you the tests to 
> reproduce the behavior. I should be able to soon. Sorry about the delay here.
>       >             > In the mean time, I wanted to bring up a possible
>       >             > secondary use of the same logic to move items on slab
>       >             > rebalancing. I think the system might benefit from using
>       >             > the same logic to crawl the pages in a slab and compact
>       >             > the data in the background. In the case where we have
>       >             > memory that is assigned to the slab but not being used
>       >             > because of replaced or TTL'd out data, returning the
>       >             > memory to a pool of free memory will allow a slab to grow
>       >             > with that memory first instead of waiting for an event
>       >             > where memory is needed at that instant.
>       >             >
>       >             > It's a change in approach, from reactive to proactive. 
> What do you think?
>       >             >
>       >             > On Monday, July 13, 2015 at 5:54:11 PM UTC-7, Dormando 
> wrote:
>       >             >       > First, more detail for you:
>       >             >       >
>       >             >       > We are running 1.4.24 in production and haven't 
> noticed any bugs as of yet. The new LRUs seem to be working well, though we 
> nearly always run memcached scaled to hold all data without evictions. Those 
> with evictions are behaving well. Those without evictions haven't seen 
> crashing or any other noticeable bad behavior.
>       >             >
>       >             >       Neat.
>       >             >
>       >             >       >
>       >             >       > OK, I think I see an area where I was speculating
>       >             >       > on functionality. If you have a key in slab 21 and
>       >             >       > then the same key is written again at a larger size
>       >             >       > in slab 23 I assumed that the space in 21 was not
>       >             >       > freed on the second write. With that assumption, the
>       >             >       > LRU crawler would not free up that space. Also just
>       >             >       > by observation in the macro, the space is not freed
>       >             >       > fast enough to be effective, in our use case, to
>       >             >       > accept the writes that are happening. Think in the
>       >             >       > hundreds of millions of "overwrites" in a 6 - 10
>       >             >       > hour period across a cluster.
>       >             >
>       >             >       Internally, "items" (a key/value pair) are 
> generally immutable. The only
>       >             >       time when it's not is for INCR/DECR, and it still 
> becomes immutable if two
>       >             >       INCR/DECR's collide.
>       >             >
>       >             >       What this means, is that the new item is staged 
> in a piece of free memory
>       >             >       while the "upload" stage of the SET happens. When 
> memcached has all of the
>       >             >       data in memory to replace the item, it does an 
> internal swap under a lock.
>       >             >       The old item is removed from the hash table and 
> LRU, and the new item gets
>       >             >       put in its place (at the head of the LRU).
>       >             >
>       >             >       Since items are refcounted, this means that if 
> other users are downloading
>       >             >       an item which just got replaced, their memory 
> doesn't get corrupted by the
>       >             >       item changing out from underneath them. They can 
> continue to read the old
>       >             >       item until they're done. When the refcount 
> reaches zero the old memory is
>       >             >       reclaimed.
>       >             >
>       >             >       Most of the time, the item replacement happens and
>       >             >       then the old memory is immediately freed.
>       >             >
>       >             >       However, this does mean that you need *one* piece 
> of free memory to
>       >             >       replace the old one. Then the old memory gets 
> freed after that set.
>       >             >
>       >             >       So if you take a memcached instance with 0 free 
> chunks, and do a rolling
>       >             >       replacement of all items (within the same slab 
> class as before), the first
>       >             >       one would cause an eviction from the tail of the 
> LRU to get a free chunk.
>       >             >       Every SET after that would use the chunk freed 
> from the replacement of the
>       >             >       previous memory.
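
A minimal sketch of the replace-under-refcount behaviour described above. The `item` and `slot` types and the function names are hypothetical stand-ins, not memcached's real structures, and the lock around the swap is only mentioned in a comment. It shows why a reader holding the old item keeps a valid copy, and why the old chunk is only returned once the last reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified item; real items carry far more state. */
typedef struct item {
    int  refcount; /* one reference is held by the hash table while linked */
    bool linked;   /* still reachable through the hash table and LRU */
    bool freed;    /* chunk has been handed back to its slab class */
} item;

/* Stand-in for the hash table entry of a single key. */
typedef struct {
    item *current;
} slot;

/* Drop one reference; the chunk is reclaimed only when nobody holds it. */
static void item_release(item *it) {
    assert(it->refcount > 0);
    if (--it->refcount == 0) {
        it->freed = true; /* a real server would push the chunk onto a free list */
    }
}

/* A reader takes a reference before streaming the value to a client. */
static item *item_get(slot *s) {
    item *it = s->current;
    if (it != NULL) {
        it->refcount++;
    }
    return it;
}

/* SET on an existing key: new_it was already staged in its own free chunk. */
static void item_replace(slot *s, item *new_it) {
    item *old = s->current;
    new_it->refcount = 1;  /* the table's own reference */
    new_it->linked = true;
    s->current = new_it;   /* assumption: a real server does this under a lock */
    if (old != NULL) {
        old->linked = false;
        item_release(old); /* frees right away unless a reader still holds it */
    }
}
```

With no concurrent readers the old chunk is freed inside `item_replace`, which is the "one free chunk, then reuse the one just released" pattern of the rolling replacement.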
>       >             >
>       >             >       > After that last sentence I realized I also may not
>       >             >       > have explained well enough the access pattern. The
>       >             >       > keys are all overwritten every day, but it takes
>       >             >       > some time to write them all (obviously). We see a
>       >             >       > huge increase in the bytes metric as if the new data
>       >             >       > for the old keys was being written for the first
>       >             >       > time. Since the "old" slab for the same key doesn't
>       >             >       > proactively release memory, it starts to fill up the
>       >             >       > cache and then start evicting data in the new slab.
>       >             >       > Once that happens, we see evictions in the old slab
>       >             >       > because of the algorithm you mentioned (random
>       >             >       > picking / freeing of memory). Typically we don't see
>       >             >       > any use for "upgrading" an item as the new data
>       >             >       > would be entirely new and should wholesale replace
>       >             >       > the old data for that key. More specifically, the
>       >             >       > operation is always set, with different data each
>       >             >       > day.
>       >             >
>       >             >       Right. Most of your problems will come from two
>       >             >       areas. One being that when you write data
>       >             >       aggressively into the new slab class (unless you set
>       >             >       the rebalancer to always-replace mode), the mover
>       >             >       will make memory available more slowly than you can
>       >             >       insert. So you'll cause extra evictions in the new
>       >             >       slab class.
>       >             >
>       >             >       The secondary problem is from the random 
> evictions in the previous slab
>       >             >       class as stuff is chucked on the floor to make 
> memory moveable.
>       >             >
>       >             >       > As for testing, we'll be able to put it under real
>       >             >       > production workload. I don't know what kind of data
>       >             >       > you mean you need for testing. The data stored in
>       >             >       > the caches are highly confidential. I can give you
>       >             >       > all kinds of metrics, since we collect most of the
>       >             >       > ones that are in the stats and some from the stats
>       >             >       > slabs output. If you have some specific ones that
>       >             >       > need collecting, I'll double check and make sure we
>       >             >       > can get those. Alternatively, it might be most
>       >             >       > beneficial to see the metrics in person :)
>       >             >
>       >             >       I just need stats snapshots here and there, and 
> actually putting the thing
>       >             >       under load. When I did the LRU work I had to beg 
> for several months
>       >             >       before anyone tested it with a production load. 
> This slows things down and
>       >             >       demotivates me from working on the project.
>       >             >
>       >             >       Unfortunately my dayjob keeps me pretty busy so 
> ~internet~ would probably
>       >             >       be best.
>       >             >
>       >             >       > I can create a driver program to reproduce the 
> behavior on a smaller scale. It would write e.g. 10k keys of 10k size, then 
> rewrite the same keys with different size data. I'll work on that and post it 
> to this thread when I can reproduce the behavior locally.
>       >             >
>       >             >       Ok. There're slab rebalance unit tests in the t/ 
> directory which do things
>       >             >       like this, and I've used mc-crusher to slam the 
> rebalancer. It's pretty
>       >             >       easy to run one config to load up 10k objects, 
> then flip to the other
>       >             >       using the same key namespace.
>       >             >
>       >             >       > Thanks,
>       >             >       > Scott
>       >             >       >
>       >             >       > On Saturday, July 11, 2015 at 12:05:54 PM 
> UTC-7, Dormando wrote:
>       >             >       >       Hey,
>       >             >       >
>       >             >       >       On Fri, 10 Jul 2015, Scott Mansfield 
> wrote:
>       >             >       >
>       >             >       >       > We've seen issues recently where we run a
>       >             >       >       > cluster that typically has the majority of
>       >             >       >       > items overwritten in the same slab every day
>       >             >       >       > and a sudden change in data size evicts a ton
>       >             >       >       > of data, affecting downstream systems. To be
>       >             >       >       > clear that is our problem, but I think there's
>       >             >       >       > a tweak in memcached that might be useful and
>       >             >       >       > another possible feature that would be even
>       >             >       >       > better.
>       >             >       >       > The data that is written to this cache is
>       >             >       >       > overwritten every day, though the TTL is 7
>       >             >       >       > days. One slab takes up the majority of the
>       >             >       >       > space in the cache. The application wrote e.g.
>       >             >       >       > 10KB (slab 21) every day for each key
>       >             >       >       > consistently. One day, a change occurred where
>       >             >       >       > it started writing 15KB (slab 23), causing a
>       >             >       >       > migration of data from one slab to another. We
>       >             >       >       > had -o slab_reassign,slab_automove=1 set on the
>       >             >       >       > server, causing large numbers of evictions on
>       >             >       >       > the initial slab. Let's say the cache could
>       >             >       >       > hold the data at 15KB per key, but the old data
>       >             >       >       > was not technically TTL'd out in its old slab.
>       >             >       >       > This means that memory was not being freed by
>       >             >       >       > the lru crawler thread (I think) because its
>       >             >       >       > expiry had not come around.
>       >             >       >       >
>       >             >       >       > lines 1199 and 1200 in items.c:
>       >             >       >       > if ((search->exptime != 0 && 
> search->exptime < current_time) || is_flushed(search)) {
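
A minimal sketch of what a check like the one quoted above decides, assuming a simplified `item` with only an expiry time and a last-write time. The `oldest_live` flush cutoff and the body of `is_flushed` here are assumptions made for illustration, not a copy of items.c. The point is that an item whose TTL has not yet passed, and which predates no flush, is never reclaimed by this test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int rel_time_t; /* seconds relative to server start */

/* Hypothetical, trimmed-down item. */
typedef struct {
    rel_time_t exptime; /* 0 means "never expires" */
    rel_time_t time;    /* when the item was last written */
} item;

/* Assumption: a flush marks everything written at or before oldest_live as dead. */
static bool is_flushed(const item *it, rel_time_t now, rel_time_t oldest_live) {
    return oldest_live != 0 && oldest_live <= now && it->time <= oldest_live;
}

/* Same shape as the quoted condition: expired, or invalidated by a flush. */
static bool crawler_can_reclaim(const item *it, rel_time_t now,
                                rel_time_t oldest_live) {
    assert(it != NULL);
    return (it->exptime != 0 && it->exptime < now) ||
           is_flushed(it, now, oldest_live);
}
```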
>       >             >       >       >
>       >             >       >       > If there was a check to see if this data was
>       >             >       >       > "orphaned," i.e. that the key, if accessed,
>       >             >       >       > would map to a different slab than the current
>       >             >       >       > one, then these orphans could be reclaimed as
>       >             >       >       > free memory. I am working on a patch to do
>       >             >       >       > this, though I have reservations about
>       >             >       >       > performing a hash on the key on the lru crawler
>       >             >       >       > thread (if the hash is not already available).
>       >             >       >       > I have very little experience in the memcached
>       >             >       >       > codebase so I don't know the most efficient way
>       >             >       >       > to do this. Any help would be appreciated.
>       >             >       >
>       >             >       >       There seems to be a misconception about 
> how the slab classes work. A key,
>       >             >       >       if already existing in a slab, will 
> always map to the slab class it
>       >             >       >       currently fits into. The slab classes 
> always exist, but the amount of
>       >             >       >       memory reserved for each of them will 
> shift with the slab_reassign. ie: 10
>       >             >       >       pages in slab class 21, then memory 
> pressure on 23 causes it to move over.
>       >             >       >
>       >             >       >       So if you examine a key that still exists 
> in slab class 21, it has no
>       >             >       >       reason to move up or down the slab 
> classes.
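
A minimal sketch of how an item's size, rather than its key, picks the slab class. It assumes chunk sizes grow geometrically; 1.25 is memcached's default for `-f`, but the base size, the 8-byte rounding and every name below are placeholders, so the class numbers it produces will not match a real server's.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLASSES 64

static size_t class_size[MAX_CLASSES]; /* chunk size of each class, ascending */
static int class_count;

/* Build the size table: each class is roughly `factor` times the previous. */
static void slab_classes_init(size_t base, double factor, size_t max_item) {
    assert(base > 0 && factor > 1.0);
    size_t size = base;
    class_count = 0;
    while (class_count < MAX_CLASSES - 1 && size < max_item) {
        class_size[class_count++] = size;
        size = (size_t)((double)size * factor);
        size = (size + 7) & ~(size_t)7; /* round up to a multiple of 8 bytes */
    }
    class_size[class_count++] = max_item; /* last class holds the largest items */
}

/* Smallest class whose chunk fits nbytes, or -1 if the item is too large. */
static int slab_class_for(size_t nbytes) {
    for (int i = 0; i < class_count; i++) {
        if (nbytes <= class_size[i]) {
            return i;
        }
    }
    return -1;
}
```

An item stored at 10KB and later rewritten at 15KB lands in a higher class on the second write; the stored 10KB copy itself never changes class.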
>       >             >       >
>       >             >       >       > Alternatively, and possibly more beneficial is
>       >             >       >       > compaction of data in a slab using the same set
>       >             >       >       > of criteria as lru crawling. Understandably,
>       >             >       >       > compaction is a very difficult problem to solve
>       >             >       >       > since moving the data would be a pain in the
>       >             >       >       > ass. I saw a couple of discussions about this
>       >             >       >       > in the mailing list, though I didn't see any
>       >             >       >       > firm thoughts about it. I think it can probably
>       >             >       >       > be done in O(1) like the lru crawler by
>       >             >       >       > limiting the number of items it touches each
>       >             >       >       > time. Writing and reading are doable in O(1) so
>       >             >       >       > moving should be as well. Has anyone given more
>       >             >       >       > thought on compaction?
>       >             >       >
>       >             >       >       I'd be interested in hacking this up for 
> you folks if you can provide me
>       >             >       >       testing and some data to work with. With 
> all of the LRU work I did in
>       >             >       >       1.4.24, the next thing I wanted to do is 
> a big improvement on the slab
>       >             >       >       reassignment code.
>       >             >       >
>       >             >       >       Currently it picks essentially a random 
> slab page, empties it, and moves
>       >             >       >       the slab page into the class under 
> pressure.
>       >             >       >
>       >             >       >       One thing we can do is first examine for 
> free memory in the existing slab,
>       >             >       >       IE:
>       >             >       >
>       >             >       >       - Take a page from slab 21
>       >             >       >       - Scan the page for valid items which 
> need to be moved
>       >             >       >       - Pull free memory from slab 21, migrate 
> the item (moderately complicated)
>       >             >       >       - When the page is empty, move it (or 
> give up if you run out of free
>       >             >       >       chunks).
>       >             >       >
>       >             >       >       The next step is to pull from the LRU on 
> slab 21:
>       >             >       >
>       >             >       >       - Take page from slab 21
>       >             >       >       - Scan page for valid items
>       >             >       >       - Pull free memory from slab 21, migrate 
> the item
>       >             >       >         - If no memory free, evict tail of slab 
> 21. use that chunk.
>       >             >       >       - When the page is empty, move it.
>       >             >       >
>       >             >       >       Then, when you hit this condition your 
> least-recently-used data gets
>       >             >       >       culled as new data migrates your page 
> class. This should match a natural
>       >             >       >       occurrence if you would already be 
> evicting valid (but old) items to make
>       >             >       >       room for new items.
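
The two page-emptying variants listed above can be sketched as one routine with a flag. This is a toy model under stated assumptions: the slab class is reduced to counters, "migrating" an item just consumes a free chunk, and all names are hypothetical. It is not the rebalancer's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNKS_PER_PAGE 4

/* The page chosen to be moved out of the class. */
typedef struct {
    bool used[CHUNKS_PER_PAGE]; /* true where a valid item still lives */
} page;

/* Toy view of the rest of the slab class (everything outside that page). */
typedef struct {
    int free_chunks; /* free chunks available in other pages */
    int lru_items;   /* items in other pages, evictable from the LRU tail */
    int evictions;   /* how many tail evictions the move has caused */
} slab_class;

/* Returns true when the page is empty and can be handed to another class. */
static bool evacuate_page(slab_class *cls, page *pg, bool allow_evict) {
    assert(cls != NULL && pg != NULL);
    for (int i = 0; i < CHUNKS_PER_PAGE; i++) {
        if (!pg->used[i]) {
            continue; /* nothing to rescue in this chunk */
        }
        if (cls->free_chunks > 0) {
            cls->free_chunks--; /* migrate the item into existing free memory */
        } else if (allow_evict && cls->lru_items > 0) {
            cls->lru_items--;   /* evict the LRU tail and reuse its chunk */
            cls->evictions++;
        } else {
            return false;       /* out of free chunks: give up on this page */
        }
        pg->used[i] = false;
    }
    return true;
}
```

With `allow_evict` false this is the first variant (give up when free chunks run out); with it true, the least-recently-used items pay for the move instead of whatever happened to sit in the chosen page.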
>       >             >       >
>       >             >       >       A bonus to using the free memory trick, 
> is that I can use the amount of
>       >             >       >       free space in a slab class as a heuristic 
> to more quickly move slab pages
>       >             >       >       around.
>       >             >       >
>       >             >       >       If it's still necessary from there, we 
> can explore "upgrading" items to a
>       >             >       >       new slab class, but that is much much 
> more complicated since the item has
>       >             >       >       to shift LRU's. Do you put it at the 
> head, the tail, the middle, etc? It
>       >             >       >       might be impossible to make a good 
> generic decision there.
>       >             >       >
>       >             >       >       What version are you currently on? If 
> 1.4.24, have you seen any
>       >             >       >       instability? I'm currently torn between 
> fighting a few bugs and starting on
>       >             >       >       improving the slab rebalancer.
>       >             >       >
>       >             >       >       -Dormando
>       >             >       >
>       >             >       >
>       >             >       > --
>       >             >       >
>       >             >       > ---
>       >             >       > You received this message because you are 
> subscribed to the Google Groups "memcached" group.
>       >             >       > To unsubscribe from this group and stop 
> receiving emails from it, send an email to memcached+...@googlegroups.com.
>       >             >       > For more options, visit 
> https://groups.google.com/d/optout.
>       >             >       >
>       >             >       >
>       >             >
>       >             >
>       >             >
>       >
>       >
>       >
>
>
>
