On Fri, 11 Apr 2014, Slawomir Pryczek wrote:

> Hi Dormando, more about the behaviour... when we're using "normal" memcached
> 1.4.13, 16GB of memory gets exhausted in ~1h, and then we start to see
> almost instant evictions of needed items (again, these items aren't really
> "needed" individually; it's just that when many of them get evicted it's
> unacceptable because of how badly it affects the system).

Almost instant evictions; so an item is stored into a 16GB instance, and
< 120 seconds later it's bumped out of the LRU?

You'll probably just ignore me again, but isn't this just slab imbalance?
Once your instance fills up there're probably a few slab classes with way
too little memory in them.

'stats slabs' (together with 'stats items') shows you per-class memory usage
and eviction counts, including how long ago the most recently evicted item in
each class was last accessed. What does this look like on one of your full
instances?
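
If it helps, here's a rough sketch of pulling those counters over the text
protocol; the host/port and the fields grepped for are placeholders for
whatever your setup looks like:

<?php
// Sketch: dump the per-class counters relevant to slab imbalance.
// 127.0.0.1:11211 is a placeholder; point it at one of your full instances.
$fp = fsockopen('127.0.0.1', 11211, $errno, $errstr, 2);
if (!$fp) { die("connect failed: $errstr\n"); }

foreach (array("stats items", "stats slabs") as $cmd) {
    fwrite($fp, "$cmd\r\n");
    while (($line = fgets($fp)) !== false) {
        $line = trim($line);
        if ($line === 'END') break;
        // evicted/evicted_time/reclaimed come back under "stats items";
        // total_pages/used_chunks under "stats slabs".
        if (preg_match('/(evicted|reclaimed|total_pages|used_chunks)/', $line)) {
            echo $line, "\n";
        }
    }
}
fclose($fp);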

The slab rebalance system lets you plug in your own algorithm by running
the page reassignment commands manually. Then you can smooth out the pages
to where you think they should be.
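
For example, something like this (a sketch only; the class numbers are made
up, and the daemon needs to have been started with -o slab_reassign):

<?php
// Sketch: move one page from a class with spare memory to a class that is
// evicting. Pick $src/$dst from what 'stats items' / 'stats slabs' show you.
$fp = fsockopen('127.0.0.1', 11211, $errno, $errstr, 2);
if (!$fp) { die("connect failed: $errstr\n"); }

$src = 10;  // placeholder: class with idle pages
$dst = 21;  // placeholder: class that keeps evicting
fwrite($fp, "slabs reassign $src $dst\r\n");
// The server answers OK, or BUSY/NOSPARE/etc. if it can't move a page right now.
echo trim(fgets($fp)), "\n";
fclose($fp);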

You mention long and short TTL, but what are they exactly? 120s and an
hour? A week?

I understand your desire to hack up something to solve this, but as you've
already seen, scanning memory to remove expired items is problematic: you
either do long walks from the tail, use a background thread and walk a "probe"
item through, or walk through random slab pages looking for expired memory.
None of these is very efficient, and they all tend to rely on luck.

A better way to do this is to bucket the memory by TTL. You have lots of
pretty decent options for this (and someone else already suggested one):

- In your client, use different memcached pools for major TTL buckets (i.e.
one instance only gets long-TTL items, one only short). Make sure the slabs
aren't imbalanced via the slab rebalancer. (There's a rough sketch of this
below.)

- Are the sizes of the items correlated with their TTL? Are 120s items
always in a ~300 byte range, while longer-lived items tend to be in a
different byte range? You could use length padding to shunt them into specific
slab classes, separating them internally at the cost of some RAM efficiency.
(A sketch of this is below as well.)

- A storage engine (god I wish we'd made 1.6 work...) which allows
bucketing by TTL ranges. You'd want a smaller set of slab classes to not
waste too much memory here, but the idea is the same as running multiple
individual instances, except internally splitting the storage engine
instead and storing everything in the same hash table.

Those three options completely avoid latency problems; the first one requires
no modifications to memcached itself and will work very well. The third is the
most work (and will be tricky due to things like slab rebalance, and none of
the slab class identification code will work). I would avoid it unless I were
really bored and wanted to maintain my own fork forever.
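
A minimal sketch of the first option, assuming the pecl "memcached" client,
two hypothetical instances (mc-short / mc-long) and a 300 second cutoff; all
of those names and numbers are placeholders for whatever fits your traffic:

<?php
// One pool only ever sees short-TTL items, the other only long-TTL ones,
// so neither LRU has mixed lifetimes fighting each other.
$short = new Memcached();
$short->addServer('mc-short.example.internal', 11211);

$long = new Memcached();
$long->addServer('mc-long.example.internal', 11211);

function cache_set($key, $value, $ttl) {
    global $short, $long;
    $pool = ($ttl <= 300) ? $short : $long;  // bucket purely by TTL
    return $pool->set($key, $value, $ttl);
}

function cache_get($key, $ttl_hint) {
    global $short, $long;
    // The caller knows which bucket the key belongs to (e.g. from its prefix).
    $pool = ($ttl_hint <= 300) ? $short : $long;
    return $pool->get($key);
}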
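
And a rough illustration of the length padding idea; the 1024 byte target is
an assumption, the point is just to pick a size that maps the short-TTL items
into a slab class the long-lived items never land in:

<?php
// Sketch: pad short-TTL values so they fall into their own slab class,
// trading some RAM for keeping the two lifetimes separated.
function set_padded(Memcached $m, $key, $value, $ttl, $target = 1024) {
    $payload = serialize($value);
    if (strlen($payload) < $target) {
        $payload = str_pad($payload, $target, "\0");
    }
    return $m->set($key, $payload, $ttl);
}

function get_padded(Memcached $m, $key) {
    $payload = $m->get($key);
    return ($payload === false) ? false : unserialize(rtrim($payload, "\0"));
}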

> ~2 years ago I created another version based on that 1.4.13, which does
> garbage collection using a custom stats handler. That version is able to run
> on half of the memory for ~2 weeks, with 0 evictions. But we gave it the
> full 16G and just restart it each week to be sure memory usage is kept in
> check and we're not throwing away good data. Actually, after changing -f1.25
> to -f1.041 the slabs fill with bad items much more slowly, because items are
> distributed better and this custom eviction function is able to catch more
> expired data. We have ~200GB of data evicted this way, daily. Because of the
> volume (~40k req/s peak, much of it writes) and differences in expire time,
> the LRU isn't able to reclaim items efficiently.
>
> Maybe people don't even realize the problem, but when we did some testing
> and turned off that "custom" eviction, we had ~100% memory used with 10%
> waste reported by memcached admin. But when we ran the custom eviction
> algorithm it turned out that 90% of memory was occupied by garbage. Reported
> waste grew to 80% instantly after running an unlimited "reclaim expired"
> pass over all items in the cache. So with the "standard" build, when people
> use different expire times for items (ours range from a 1 minute minimum to
> a 6h maximum)... they won't even be able to see how much memory they're
> wasting in some specific cases, when they have many items that are never hit
> after expiration, like we do.
>
> When using memcached as a buffer for mysql writes, we know exactly what to
> hit and when. Short-TTL expired items pile up near the head... long-TTL
> "live" items pile up near the tail, and that creates a barrier that prevents
> the LRU algo from reclaiming almost anything, if I'm understanding how it
> currently works correctly...
>
> > You made it sound like you had some data which never expired? Is this true?
> Yes, I think it's because of how evictions are made (to be clear, we're not
> setting non-expiring items). These short-expiring items pile up at the front
> of the linked list; something that is supposed to live for e.g. 120 or 180
> seconds lingers in memory forever, until we restart the cache... and new
> items are killed almost instantly because there are no expired items at the
> head.
>
> It's a special case, because after processing the list in memory,
> aggregating the data and putting it in mysql, these items are no longer
> touched. The list for the new time period will have a completely different
> set of keys, as we use a prefix to generate all items in the list:
>
> $time_slice = floor( self::$time / 60) - $time_slices_back;
> $prefix = ")ML){$list_id}-{$time_slice}";
>
> Again, I'm not saying the current implementation is bad... it's fast and
> doesn't thrash the CPU cache when expire times are ~equal, that was probably
> the idea... but we have an atypical use case which LRU isn't able to
> manage...
>
> Now I'm making ~the same changes I made for .13... but for .17, and I want
> to make them work a little better ;)
>
>
>
> On Friday, 11 April 2014 at 05:12:10 UTC+2, Dormando wrote:
>
>       > Hey Dormando, thanks again for some comments... appreciate the help.
>       >
>       > Maybe I wasn't clear enough. I need only 1 minute of persistence,
>       > and I can lose data sometimes; I just can't keep losing data every
>       > minute due to constant evictions caused by LRU. Actually I just wrote
>       > that in my previous post. We're losing about 1 minute of
>       > non-meaningful data every week because of the restart that we do when
>       > memory starts to fill up (even with our patch reclaiming using the
>       > linked list, we limit reclaiming to keep speed better)... so the
>       > memory fills up after a week, not 30 minutes...
>
>       Can you explain what you're seeing in more detail? Your data only
>       needs to persist for 1 minute, but it's being evicted before 1 minute
>       is up?
>
>       You made it sound like you had some data which never expired? Is this
>       true?
>
>       If your instance is 16GB, takes a week to fill up, but data only needs
>       to persist for a minute but isn't, something else is very broken? Or
>       am I still misunderstanding you?
>
>       > Now I'm creating a better solution, to limit locking as the linked
>       > list gets bigger.
>       >
>       > I explained the worst implications of unwanted evictions (or losing
>       > all data in the cache) in my use case:
>       > 1. losing ~1 minute of non-significant data that's about to be
>       > stored in sql
>       > 2. "flat" distribution of load to workers (not taking response times
>       > into account because of stats reset).
>       > 3. resorting to an alternative targeting algorithm (with global, not
>       > local statistics).
>       >
>       > I never, ever said I'm going to write data that has to be persistent
>       > permanently. It's actually the same idea as a delayed write. If
>       > power fails you lose 5s of data, but you can do 100x more writes. So
>       > you need the data to be persistent in memory; between writes the
>       > data **can't be lost**. However you can lose it sometimes; that's the
>       > tradeoff that some people can make and some can't. Obviously I can't
>       > keep losing this data each minute, because if I lose too much it'll
>       > become meaningful.
>       >
>       > Maybe I wasn't clear on that matter. I can lose all data even 20
>       > times a day. Sensitive data is stored using bulk updates or
>       > transactions, bypassing that "delayed write" layer. "0 evictions" is
>       > the kind of "persistence" I'm going for. So items are persistent for
>       > some very short periods of time (1-5 minutes) without being killed.
>       > It's just a different use case. Running in production for 2 years,
>       > based on 1.4.13, tested for correctness, monitored so we have enough
>       > memory and 0 evictions (just reclaims).
>       >
>       > When I came here with the same idea ~2 years ago you just said it's
>       > very stupid, and now you've even made me look like a moron :) I can
>       > understand why you don't want features that are not perfectly ~O(1),
>       > but please don't get so personal about different ideas for doing
>       > things and different use cases, just because these won't work for
>       > you.
>       >
>       >
>       >
>       >
>       >
>       > On Thursday, 10 April 2014 at 20:53:12 UTC+2, Dormando wrote:
>       >       You really really really really really *must* not put data in
>       >       memcached which you can't lose.
>       >
>       >       Seriously, really don't do it. If you need persistence, try
>       >       using a redis instance for the persistent stuff, and use
>       >       memcached for your cache stuff. I don't see why you feel like
>       >       you need to write your own thing; there are a lot of
>       >       persistent key/value stores (kyotocabinet/etc?). They have a
>       >       much lower request ceiling and don't handle the LRU/cache
>       >       pattern as well, but that's why you can use both.
>       >
>       >       Again, please please don't do it. You are damaging your
>       >       company. You are a *danger* to your company.
>       >
>       >       On Thu, 10 Apr 2014, Slawomir Pryczek wrote:
>       >
>       >       > Hi Dormando, thanks for suggestions, a background thread
>       >       > would be nice... The idea is actually that with 2-3GB I get
>       >       > plenty of evictions of items that need to be fetched later.
>       >       > And with 16GB I still get evictions; actually I could
>       >       > probably throw more memory than 16G at it and it'd only
>       >       > result in more expired items sitting in the middle of slabs,
>       >       > forever... Now I'm going for persistence. Sounds crazy,
>       >       > probably, but we have some data that we can't lose:
>       >       > 1. statistics: we aggregate writes to the DB using memcached
>       >       > (+ a list implementation). If these items get evicted we're
>       >       > losing rows in the db. Losing data sometimes isn't a big
>       >       > problem. E.g. we restart memcached once a week, so we're
>       >       > losing 1 minute of data every week. But if we have evictions
>       >       > we're losing data constantly (which we can't have).
>       >       > 2. we drive a load balancer using statistics data in
>       >       > memcached; again, not nice to lose data often because
>       >       > workers can get an incorrect amount of traffic.
>       >       > 3. we're doing some adserving optimizations, e.g. counting
>       >       > per-domain ad priority; for one domain it takes about 10
>       >       > seconds to analyze all the data and create the list of ads,
>       >       > so it can't be done online... we put the result of this in
>       >       > memcached, and if we lose too much of it the system will
>       >       > start to serve suboptimal ads (because it'll need to switch
>       >       > to more general data or a much simpler algorithm that can be
>       >       > done instantly).
>       >       >
>       >       > It would probably be best to rewrite all this in C or golang
>       >       > and use memcached just for caching, but that would take too
>       >       > much time, which we don't have currently...
>       >       >
>       >       > I have seen the twitter and nk implementations that seem to
>       >       > do what I need, but they look old (based on old code), so I
>       >       > prefer to modify the code of a recent "official" memcached,
>       >       > to not be stuck with old code or abandonware. Actually there
>       >       > are many topics about the limitations of the current
>       >       > eviction algo, and an option to enable some background
>       >       > thread to do scraping based on statistics of the most-filled
>       >       > slabs (with some parameter to specify whether it should take
>       >       > a light or aggressive approach) would be nice...
>       >       >
>       >       > As for the code... is that the slab_rebalance_move function
>       >       > in slab.c? It seems a little difficult to grasp without some
>       >       > docs on how things work... can you please write a very short
>       >       > description of how this "angry birds" mode works?
>       >
>       >       Look at doc/protocol.txt for explanations of the slab move
>       >       options. The names are greppable back to the source.
>       >
>       >       > I have a quick question about this above... "linked" is an
>       >       > item that's placed on the linked list, but what do the other
>       >       > flags mean, and why are the last 2 of them temporary?
>       >       > #define ITEM_LINKED 1
>       >       > #define ITEM_CAS 2
>       >       >
>       >       > /* temp */
>       >       > #define ITEM_SLABBED 4
>       >       > #define ITEM_FETCHED 8
>       >       >
>       >       > This from slab_rebalance_move seems interesting:
>       >       > refcount = refcount_incr(&it->refcount);
>       >       > ...
>       >       > if (refcount == 1) { /* item is unlinked, unused */
>       >       > ...
>       >       > } else if (refcount == 2) { /* item is linked but not busy */
>       >       >
>       >       > Are there some docs about refcounts, locks and item states?
>       >       > Basically, why is an item with refcount 2 not busy? You're
>       >       > increasing the refcount by 1 on select, then again when
>       >       > reading data? Can the refcount ever be higher than 2 (3 in
>       >       > the above case), meaning 2 threads can access the same item?
>       >
>       >       The comment on the same line is explaining exactly what it
>       >       means.
>       >
>       >       Unfortunately it's a bit of a crap shoot. I think I wrote a
>       >       threads explanation somewhere (some release notes, or in a
>       >       file in there, I can't quite remember offhand). Since scaling
>       >       the thread code it got a lot more complicated. You have to be
>       >       extremely careful about the circumstances under which you
>       >       access items (you must hold an item lock + the refcount must
>       >       be 2 if you want to unlink it).
>       >
>       >       You'll just have to study it a bit, sorry. Grep around to see
>       >       where the flags are used.
>       >
>       >       > Thanks.
>       >       >
>       >       > On Thursday, 10 April 2014 at 06:05:30 UTC+2, Dormando wrote:
>       >       >       > Hi Guys,
>       >       >       > I'm running a specific case where I don't want
>       >       >       > (actually can't have) evicted items (evictions = 0
>       >       >       > ideally)... Now I have created a simple algo that
>       >       >       > locks the cache, goes through the linked list and
>       >       >       > evicts items... it causes some problems, like
>       >       >       > 10-20ms cache locks in some cases.
>       >       >       >
>       >       >       > Now I'm thinking about going through each slab's
>       >       >       > memory (slabs keep a list of allocated memory
>       >       >       > regions)... looking for items, and if an expired
>       >       >       > item is found, evicting it... this way I can go e.g.
>       >       >       > 10k items or 1MB of memory at a time + pick slabs
>       >       >       > with high utilization and run this "additional"
>       >       >       > eviction only on them... so it'll prevent allocating
>       >       >       > memory just because unneeded data with a short TTL
>       >       >       > is occupying the HEAD of the list.
>       >       >       >
>       >       >       > With this linked list eviction I'm able to run on
>       >       >       > 2-3GB of memory... without it, 16GB of memory is
>       >       >       > exhausted in 1-2h and then memcached starts to kill
>       >       >       > "good" items (leaving expired ones wasting
>       >       >       > memory)...
>       >       >       >
>       >       >       > Any comments?
>       >       >       > Thanks.
>       >       >
>       >       >       You're going a bit against the base algorithm. If
>       >       >       stuff is falling out of 16GB of memory without ever
>       >       >       being utilized again, why is that critical? Sounds
>       >       >       like you're optimizing the numbers instead of actually
>       >       >       tuning anything useful.
>       >       >
>       >       >       That said, you can probably just extend the slab
>       >       >       rebalance code. There's a hook in there (which I
>       >       >       called "angry birds mode") that drives a slab
>       >       >       rebalance when it would otherwise run an eviction.
>       >       >       That code already safely walks the slab page for
>       >       >       unlocked memory and frees it; you could edit it
>       >       >       slightly to check for expiration and then freelist it
>       >       >       into the slab class instead.
>       >       >
>       >       >       Since it's already a background thread you could
>       >       >       further modify it to just wake up and walk pages for
>       >       >       stuff to evict.
>       >       >
>       >
>
