Hey,

I'm still really interested in working on this. I'll be taking a careful
look soon, I hope.

On Mon, 3 Aug 2015, Scott Mansfield wrote:

> I've tweaked the program slightly, so I'm adding a new version. It prints 
> more stats as it goes and runs a bit faster.
>
> On Monday, August 3, 2015 at 1:20:37 AM UTC-7, Scott Mansfield wrote:
>       Total brain fart on my part. Apparently I had memcached 1.4.13 on my 
> path (who knows how...). Using the actual one that I've built works. Sorry for 
> the confusion... can't believe I didn't realize that before. I'm testing 
> against the compiled one now to see how it behaves.
>       On Monday, August 3, 2015 at 1:15:06 AM UTC-7, Dormando wrote:
>             You sure that's 1.4.24? None of those fail for me :(
>
>             On Mon, 3 Aug 2015, Scott Mansfield wrote:
>
>             > The command line I've used that will start is:
>             >
>             > memcached -m 64 -o slab_reassign,slab_automove
>             >
>             >
>             > the ones that fail are:
>             >
>             >
> memcached -m 64 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer
>             >
>             > memcached -o lru_crawler
>             >
>             >
>             > I'm sure I've missed something during compile, though I just 
> used ./configure and make.
>             >
>             >
>             > On Monday, August 3, 2015 at 12:22:33 AM UTC-7, Scott Mansfield 
> wrote:
>             >       I've attached a pretty simple program to connect, fill a 
> slab with data, and then fill another slab slowly with data of a different 
> size. I've been trying to get memcached to run with the lru_crawler and 
> lru_maintainer flags, but I get 'Illegal suboption "(null)"' every time I try to start 
> with either in any configuration.
>             >
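
For reference, a rough sketch of what such a driver could look like over the
ASCII protocol follows. This is not the attached program; the host, key count,
value sizes, TTL and pacing below are arbitrary assumptions chosen to mirror
the scenario described in this thread:

    /* fill_slabs.c - hypothetical driver: fill one slab class with ~10KB
     * values, then slowly overwrite the same keys with ~15KB values so a
     * second, larger slab class comes under pressure. Error handling and
     * partial reads/writes are ignored for brevity. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int mc_connect(const char *host, int port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        inet_pton(AF_INET, host, &sa.sin_addr);
        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
            perror("connect");
            exit(1);
        }
        return fd;
    }

    /* One ASCII-protocol set: "set <key> <flags> <exptime> <bytes>" */
    static void mc_set(int fd, const char *key, const char *val, size_t len) {
        char head[256], resp[256];
        int n = snprintf(head, sizeof(head), "set %s 0 604800 %zu\r\n", key, len);
        write(fd, head, n);
        write(fd, val, len);
        write(fd, "\r\n", 2);
        read(fd, resp, sizeof(resp));   /* expect "STORED\r\n" */
    }

    int main(void) {
        int fd = mc_connect("127.0.0.1", 11211);
        char key[64];
        char *small = malloc(10 * 1024);   /* ~10KB values: one slab class */
        char *large = malloc(15 * 1024);   /* ~15KB values: a larger class */
        memset(small, 'a', 10 * 1024);
        memset(large, 'b', 15 * 1024);

        /* Phase 1: fill the cache with ~10KB items. */
        for (int i = 0; i < 10000; i++) {
            snprintf(key, sizeof(key), "key%d", i);
            mc_set(fd, key, small, 10 * 1024);
        }
        /* Phase 2: slowly overwrite the same keys at ~15KB, forcing the
         * larger slab class to grow while the old copies still sit in
         * the smaller one. */
        for (int i = 0; i < 10000; i++) {
            snprintf(key, sizeof(key), "key%d", i);
            mc_set(fd, key, large, 15 * 1024);
            usleep(1000);   /* ~1ms pause to stretch the rewrite out */
        }
        free(small);
        free(large);
        close(fd);
        return 0;
    }
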
>             >
>             >       I haven't seen it start to move slabs automatically with 
> a freshly installed 1.4.24.
>             >
>             >
>             >       On Tuesday, July 21, 2015 at 4:55:17 PM UTC-7, Scott 
> Mansfield wrote:
>             >             I realize I've not given you the tests to reproduce 
> the behavior. I should be able to soon. Sorry about the delay here.
>             > In the meantime, I wanted to bring up a possible secondary use 
> of the same logic to move items on slab rebalancing. I think the system might 
> benefit from using the same logic to crawl the pages in a slab and compact 
> the data in the background. In the case where we have memory that is assigned 
> to the slab but not being used because of replaced
>             > or TTL'd out data, returning the memory to a pool of free 
> memory will allow a slab to grow with that memory first instead of waiting 
> for an event where memory is needed at that instant.
>             >
>             > It's a change in approach, from reactive to proactive. What do 
> you think?
>             >
>             > On Monday, July 13, 2015 at 5:54:11 PM UTC-7, Dormando wrote:
>             >       > First, more detail for you:
>             >       >
>             >       > We are running 1.4.24 in production and haven't noticed 
> any bugs as of yet. The new LRUs seem to be working well, though we nearly 
> always run memcached scaled to hold all data without evictions. Those with 
> evictions are behaving well. Those without evictions haven't seen crashing or 
> any other noticeable bad behavior.
>             >
>             >       Neat.
>             >
>             >       >
>             >       > OK, I think I see an area where I was speculating on 
> functionality. If you have a key in slab 21 and then the same key is written 
> again at a larger size in slab 23 I assumed that the space in 21 was not 
> freed on the second write. With that assumption, the LRU crawler would not 
> free up that space. Also just by observation in the
>             >       macro, the space is not freed
>             >       > fast enough to be effective, in our use case, to accept 
> the writes that are happening. Think in the hundreds of millions of 
> "overwrites" in a 6 - 10 hour period across a cluster.
>             >
>             >       Internally, "items" (key/value pairs) are generally 
> immutable. The only
>             >       time when it's not is for INCR/DECR, and it still becomes 
> immutable if two
>             >       INCR/DECR's collide.
>             >
>             >       What this means, is that the new item is staged in a 
> piece of free memory
>             >       while the "upload" stage of the SET happens. When 
> memcached has all of the
>             >       data in memory to replace the item, it does an internal 
> swap under a lock.
>             >       The old item is removed from the hash table and LRU, and 
> the new item gets
>             >       put in its place (at the head of the LRU).
>             >
>             >       Since items are refcounted, this means that if other 
> users are downloading
>             >       an item which just got replaced, their memory doesn't get 
> corrupted by the
>             >       item changing out from underneath them. They can continue 
> to read the old
>             >       item until they're done. When the refcount reaches zero 
> the old memory is
>             >       reclaimed.
>             >
>             >       Most of the time, the item replacement happens and then the 
> old memory is
>             >       immediately removed.
>             >
>             >       However, this does mean that you need *one* piece of free 
> memory to
>             >       replace the old one. Then the old memory gets freed after 
> that set.
>             >
>             >       So if you take a memcached instance with 0 free chunks, 
> and do a rolling
>             >       replacement of all items (within the same slab class as 
> before), the first
>             >       one would cause an eviction from the tail of the LRU to 
> get a free chunk.
>             >       Every SET after that would use the chunk freed from the 
> replacement of the
>             >       previous memory.
>             >
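
To make the replace-and-free sequence concrete, here is a toy, self-contained
model of it: the new item is built in a free chunk while the upload happens,
the swap is done under a lock, and the old chunk only goes back on the free
list once its refcount drops to zero. None of these names are memcached's
actual internals, and a real SET would first have to find that one free chunk
(or evict the LRU tail) as described above:

    /* Toy model of replace-on-SET with refcounting; illustrative only. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct item {
        char key[32];
        char *data;
        int refcount;             /* one reference held by the "hash table" */
    } item;

    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
    static item *table_slot = NULL;   /* stand-in for one hash-table bucket */

    static void item_release(item *it) {
        pthread_mutex_lock(&cache_lock);
        int refs = --it->refcount;
        pthread_mutex_unlock(&cache_lock);
        if (refs == 0) {              /* last reference gone: reclaim chunk */
            printf("chunk for '%s' (%s) returned to the free list\n",
                   it->key, it->data);
            free(it->data);
            free(it);
        }
    }

    static void complete_set(const char *key, const char *value) {
        /* Staged in "free memory" while the client uploads the value... */
        item *new_it = calloc(1, sizeof(item));
        snprintf(new_it->key, sizeof(new_it->key), "%s", key);
        new_it->data = strdup(value);
        new_it->refcount = 1;

        /* ...then swapped in under the lock: the old item leaves the hash
         * table/LRU and the new item takes its place. */
        pthread_mutex_lock(&cache_lock);
        item *old_it = table_slot;
        table_slot = new_it;
        pthread_mutex_unlock(&cache_lock);

        /* Readers still holding a reference keep the old copy alive; here
         * only the table's reference exists, so it frees immediately. */
        if (old_it != NULL)
            item_release(old_it);
    }

    int main(void) {
        complete_set("foo", "ten-kilobyte-value");      /* first write       */
        complete_set("foo", "fifteen-kilobyte-value");  /* rolling overwrite */
        item_release(table_slot);                       /* drop table's ref  */
        return 0;
    }
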
>             >       > After that last sentence I realized I also may not have 
> explained well enough the access pattern. The keys are all overwritten every 
> day, but it takes some time to write them all (obviously). We see a huge 
> increase in the bytes metric as if the new data for the old keys was being 
> written for the first time. Since the "old" slab for
>             >       the same key doesn't
>             >       > proactively release memory, it starts to fill up the 
> cache and then starts evicting data in the new slab. Once that happens, we see 
> evictions in the old slab because of the algorithm you mentioned (random 
> picking / freeing of memory). Typically we don't see any use for "upgrading" 
> an item as the new data would be entirely new and
>             >       should wholesale replace the
>             >       > old data for that key. More specifically, the operation 
> is always set, with different data each day.
>             >
>             >       Right. Most of your problems will come from two areas. 
> One is that when
>             >       writing data aggressively into the new slab class (unless 
> you set the
>             >       rebalancer to always-replace mode), the mover will make 
> memory available
>             >       more slowly than you can insert. So you'll cause extra 
> evictions in the
>             >       new slab class.
>             >
>             >       The secondary problem is from the random evictions in the 
> previous slab
>             >       class as stuff is chucked on the floor to make memory 
> moveable.
>             >
>             >       > As for testing, we'll be able to put it under real 
> production workload. I don't know what kind of data you mean you need for 
> testing. The data stored in the caches are highly confidential. I can give 
> you all kinds of metrics, since we collect most of the ones that are in the 
> stats and some from the stats slabs output. If you have
>             >       some specific ones that
>             >       > need collecting, I'll double check and make sure we can 
> get those. Alternatively, it might be most beneficial to see the metrics in 
> person :)
>             >
>             >       I just need stats snapshots here and there, and actually 
> putting the thing
>             >       under load. When I did the LRU work I had to beg for 
> several months
>             >       before anyone tested it with a production load. This 
> slows things down and
>             >       demotivates me from working on the project.
>             >
>             >       Unfortunately my dayjob keeps me pretty busy so 
> ~internet~ would probably
>             >       be best.
>             >
>             >       > I can create a driver program to reproduce the behavior 
> on a smaller scale. It would write e.g. 10k keys of 10k size, then rewrite 
> the same keys with different size data. I'll work on that and post it to this 
> thread when I can reproduce the behavior locally.
>             >
>             >       Ok. There're slab rebalance unit tests in the t/ 
> directory which do things
>             >       like this, and I've used mc-crusher to slam the 
> rebalancer. It's pretty
>             >       easy to run one config to load up 10k objects, then flip 
> to the other
>             >       using the same key namespace.
>             >
>             >       > Thanks,
>             >       > Scott
>             >       >
>             >       > On Saturday, July 11, 2015 at 12:05:54 PM UTC-7, 
> Dormando wrote:
>             >       >       Hey,
>             >       >
>             >       >       On Fri, 10 Jul 2015, Scott Mansfield wrote:
>             >       >
>             >       >       > We've seen issues recently where we run a 
> cluster that typically has the majority of items overwritten in the same slab 
> every day and a sudden change in data size evicts a ton of data, affecting 
> downstream systems. To be clear that is our problem, but I think there's a 
> tweak in memcached that might be useful and another
>             >       possible feature that
>             >       >       would be even
>             >       >       > better.
>             >       >       > The data that is written to this cache is 
> overwritten every day, though the TTL is 7 days. One slab takes up the 
> majority of the space in the cache. The application wrote e.g. 10KB (slab 21) 
> every day for each key consistently. One day, a change occurred where it 
> started writing 15KB (slab 23), causing a migration of data
>             >       from one slab to
>             >       >       another. We had -o
>             >       >       > slab_reassign,slab_automove=1 set on the 
> server, causing large numbers of evictions on the initial slab. Let's say the 
> cache could hold the data at 15KB per key, but the old data was not 
> technically TTL'd out in its old slab. This means that memory was not being 
> freed by the lru crawler thread (I think) because its expiry
>             >       had not come
>             >       >       around. 
>             >       >       >
>             >       >       > lines 1199 and 1200 in items.c:
>             >       >       > if ((search->exptime != 0 && search->exptime < 
> current_time) || is_flushed(search)) {
>             >       >       >
>             >       >       > If there was a check to see if this data was 
> "orphaned," i.e. that the key, if accessed, would map to a different slab 
> than the current one, then these orphans could be reclaimed as free memory. I 
> am working on a patch to do this, though I have reservations about performing 
> a hash on the key on the lru crawler thread (if
>             >       the hash is not
>             >       >       already available).
>             >       >       > I have very little experience in the memcached 
> codebase so I don't know the most efficient way to do this. Any help would be 
> appreciated.
>             >       >
>             >       >       There seems to be a misconception about how the 
> slab classes work. A key,
>             >       >       if already existing in a slab, will always map to 
> the slab class it
>             >       >       currently fits into. The slab classes always 
> exist, but the amount of
>             >       >       memory reserved for each of them will shift with 
> the slab_reassign. ie: 10
>             >       >       pages in slab class 21, then memory pressure on 
> 23 causes it to move over.
>             >       >
>             >       >       So if you examine a key that still exists in slab 
> class 21, it has no
>             >       >       reason to move up or down the slab classes.
>             >       >
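
In other words, the class an item lands in is purely a function of its total
size. A tiny sketch of that lookup follows; the 96-byte base chunk and 1.25
growth factor are assumed, illustrative defaults rather than exact memcached
settings, so the printed class numbers won't match the 21/23 in this thread,
but the point is the same: a ~10KB value and a ~15KB value always land in
different classes, and rebalancing only moves pages of memory between them.

    #include <stdio.h>

    /* Smallest class whose chunk fits the item wins. Constants here are
     * assumed defaults for illustration, not memcached's exact values. */
    static int slab_class_for(int item_size) {
        double chunk = 96.0;          /* assumed smallest chunk size */
        const double factor = 1.25;   /* assumed growth factor (-f)  */
        int id = 1;
        while (chunk < item_size && chunk < 1024 * 1024) {
            chunk *= factor;
            id++;
        }
        return id;
    }

    int main(void) {
        printf("~10KB item -> class %d\n", slab_class_for(10 * 1024));
        printf("~15KB item -> class %d\n", slab_class_for(15 * 1024));
        return 0;
    }
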
>             >       >       > Alternatively, and possibly more beneficial is 
> compaction of data in a slab using the same set of criteria as lru crawling. 
> Understandably, compaction is a very difficult problem to solve since moving 
> the data would be a pain in the ass. I saw a couple of discussions about this 
> in the mailing list, though I didn't see any
>             >       firm thoughts about
>             >       >       it. I think it
>             >       >       > can probably be done in O(1) like the lru 
> crawler by limiting the number of items it touches each time. Writing and 
> reading are doable in O(1) so moving should be as well. Has anyone given more 
> thought on compaction?
>             >       >
>             >       >       I'd be interested in hacking this up for you 
> folks if you can provide me
>             >       >       testing and some data to work with. With all of 
> the LRU work I did in
>             >       >       1.4.24, the next thing I wanted to do is a big 
> improvement on the slab
>             >       >       reassignment code.
>             >       >
>             >       >       Currently it picks essentially a random slab 
> page, empties it, and moves
>             >       >       the slab page into the class under pressure.
>             >       >
>             >       >       One thing we can do is first examine for free 
> memory in the existing slab,
>             >       >       IE:
>             >       >
>             >       >       - Take a page from slab 21
>             >       >       - Scan the page for valid items which need to be 
> moved
>             >       >       - Pull free memory from slab 21, migrate the item 
> (moderately complicated)
>             >       >       - When the page is empty, move it (or give up if 
> you run out of free
>             >       >       chunks).
>             >       >
>             >       >       The next step is to pull from the LRU on slab 21:
>             >       >
>             >       >       - Take page from slab 21
>             >       >       - Scan page for valid items
>             >       >       - Pull free memory from slab 21, migrate the item
>             >       >         - If no memory is free, evict the tail of slab 21 and use 
> that chunk.
>             >       >       - When the page is empty, move it.
>             >       >
>             >       >       Then, when you hit this condition your 
> least-recently-used data gets
>             >       >       culled as new data migrates your page class. This 
> should match a natural
>             >       >       occurrence if you would already be evicting valid 
> (but old) items to make
>             >       >       room for new items.
>             >       >
>             >       >       A bonus to using the free memory trick is that I 
> can use the amount of
>             >       >       free space in a slab class as a heuristic to more 
> quickly move slab pages
>             >       >       around.
>             >       >
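
As a toy model of that per-page flow (purely illustrative, not memcached
code): scan the victim page, move each live item into a free chunk elsewhere
in the same class, evict the class's LRU tail whenever no free chunk is left,
and hand the page over once it is empty.

    #include <stdbool.h>
    #include <stdio.h>

    #define CHUNKS_PER_PAGE 8

    int main(void) {
        /* victim page: true = live item, false = already-free chunk */
        bool victim[CHUNKS_PER_PAGE] = { true, false, true, true,
                                         false, true, true, true };
        int free_chunks_elsewhere = 2;  /* free chunks left in this class  */
        int evictions = 0;              /* LRU-tail evictions we had to do */
        int moved = 0;

        for (int i = 0; i < CHUNKS_PER_PAGE; i++) {
            if (!victim[i])
                continue;               /* chunk already empty             */
            if (free_chunks_elsewhere == 0) {
                evictions++;            /* no free chunk: evict LRU tail   */
                free_chunks_elsewhere++;
            }
            free_chunks_elsewhere--;    /* destination chunk consumed      */
            victim[i] = false;          /* item migrated off victim page   */
            moved++;
        }

        printf("moved %d items, evicted %d LRU-tail items; "
               "the empty page can now be reassigned\n", moved, evictions);
        return 0;
    }
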
>             >       >       If it's still necessary from there, we can 
> explore "upgrading" items to a
>             >       >       new slab class, but that is much much more 
> complicated since the item has
>             >       >       to shift LRU's. Do you put it at the head, the 
> tail, the middle, etc? It
>             >       >       might be impossible to make a good generic 
> decision there.
>             >       >
>             >       >       What version are you currently on? If 1.4.24, 
> have you seen any
>             >       >       instability? I'm currently torn between fighting 
> a few bugs and starting on 
>             >       >       improving the slab rebalancer.
>             >       >
>             >       >       -Dormando
>             >       >
>             >       >
>             >
>
