Cool. Yeah this agrees; zero outofmemory errors on all classes. Think I'm missing a counter for chunked items still, in cases of "late" allocation errors. Given the amount of memory free I can't see why that would happen though.
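The per-class `outofmemory` counters mentioned above can be tallied from a saved |stats items| dump with a few lines of Python. This is just a convenience sketch (the filename is a placeholder; the `STAT items:<class>:outofmemory <n>` line format matches the text-protocol output):

```python
import re

def outofmemory_by_class(stats_items_text):
    """Parse 'stats items' output and return {class_id: outofmemory_count}.

    Relevant lines look like: STAT items:31:outofmemory 0
    """
    counts = {}
    for line in stats_items_text.splitlines():
        m = re.match(r"STAT items:(\d+):outofmemory (\d+)", line.strip())
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts

# e.g. outofmemory_by_class(open("stats_items.txt").read())
```

If every value is zero, that agrees with the "zero outofmemory errors on all classes" observation above.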
Hopefully you're able to find the real error. Another thing I need to finish
doing is adding more logging endpoints so it's easier to gather data like
that :(

On Tue, 25 Apr 2017, Min-Zhong "John" Lu wrote:

> Annnnnd, I guess forgetting to attach the files I promised is a sign of
> dinosaurness.
> Here they are.
>
> On Tuesday, April 25, 2017 at 5:08:01 PM UTC-4, Min-Zhong "John" Lu wrote:
> Hello,
> Thanks for the response! So the slab automover is not the culprit.
>
> As for the exact server error: unfortunately I don't have that for now, as
> I use libmemcached (plus pylibmc for that matter). With that said, I have
> used the plain telnet protocol when doing "further get requests" (as in my
> original mail) to verify the success of set requests (and the item sizes
> shown there are exactly the same as I've calculated in my Python code,
> FWIW).
>
> I think I can set up a nice little netcat script to imitate those set
> requests, directly through the telnet protocol, to capture the exact error
> message. Not sure how the intermittent nature of the failures can come
> into play here, but I'll try my best to reproduce it.
>
> As for setting -o slab_chunk_size_max=1048576 --- I'll try that, but I
> need to schedule a maintenance window with my peers. Let me do the netcat
> script first, and I'll probably have the instance relaunched (with the new
> setting) within a couple of days, and a few days later I'll ping back on
> whether I'm still seeing the failures.
>
> I'm attaching |stats items| here. Also attaching those |stats| and
> |stats slabs| dumped at the same time for consistency.
>
> Will come back with more info for the fun,
> - Mnjul
>
> On Tuesday, April 25, 2017 at 4:40:52 PM UTC-4, Dormando wrote:
> Hey!
>
> Unfortunately you've summoned a dinosaur, as I am old now :P
>
> My main question: do you have the exact server error returned by
> memcached?
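The "netcat script" idea above can also be done with a short Python socket script, which makes it easy to capture the server's exact response line. This is only a sketch: the host/port and key are placeholders, and only the command construction follows the documented text protocol (`set <key> <flags> <exptime> <bytes>\r\n<data>\r\n`):

```python
import socket

def build_set(key, value, flags=0, exptime=0):
    """Build a raw text-protocol 'set' command as bytes:
    set <key> <flags> <exptime> <bytes>\r\n<data>\r\n"""
    header = "set %s %d %d %d\r\n" % (key, flags, exptime, len(value))
    return header.encode() + value + b"\r\n"

def try_set(host, port, key, value):
    """Send the set and return the server's exact response line,
    e.g. b'STORED' or b'SERVER_ERROR object too large for cache'."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(build_set(key, value))
        return s.makefile("rb").readline().rstrip(b"\r\n")

# Example (requires a running server; host/port are assumptions):
# print(try_set("127.0.0.1", 11211, "bigkey", b"x" * (760 * 1024)))
```

Logging the returned line alongside a timestamp would pin down which error string the intermittent failures actually produce.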
> If it is "SERVER_ERROR object too large for cache" - that error has
> nothing to do with memory allocation, and is just reflecting that the
> item attempted to store is too large (over 1MB). If it fails for that
> reason it should always fail.
>
> First off, unfortunately your assumption that the slab page mover is
> synchronous isn't correct. It's a fully backgrounded process that doesn't
> ever block anything. New memory allocations don't block on anything.
>
> Also; can you include "stats items"? It has some possibly relevant info.
>
> Especially in your instance, which isn't using all of the memory you've
> assigned to it (about 1/3rd?). The slab page mover is simply moving
> memory back into a free pool when there is too much memory free in any
> particular slab class.
>
> ie;
> STAT slab_global_page_pool 308
>
> When new memory is requested and none is readily available in a slab
> class, first a new page is pulled from the global page pool, if
> available. After that, a new page is malloc'ed. After that, items are
> pulled from the LRU and evicted. If nothing can be evicted for some
> reason, you would get an allocation error.
>
> So you really shouldn't be seeing any. "stats items" would tell me the
> nature of any allocation problems (hopefully) that you're seeing. Also,
> getting the exact error being thrown to you is very helpful. Most errors
> in the system are unique, so I can trace them back to particular code.
>
> It is possible there is a bug or weirdness with chunked allocation,
> which happens for items > 512k and has gone through a couple of
> revisions. You can test this theory by adding "-o
> slab_chunk_size_max=1048576" (the same as item size max). Would be great
> to know if this makes the problem go away, since it means I have some
> more stuff to tune there.
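The allocation order described above (global page pool, then malloc, then LRU eviction) can be sketched as pseudologic in Python. This is a deliberate simplification for intuition, not the server's real code; the function name, arguments, and return strings are all invented for the illustration:

```python
def allocate_page(slab_class, global_page_pool, malloced, limit_maxbytes,
                  page_size=1024 * 1024):
    """Simplified model of where a slab class gets a new page from.

    Order, per the explanation above:
      1) pull a page from the global page pool, if any are free;
      2) malloc a new page, if still under the memory limit;
      3) evict items from the class's LRU;
      4) otherwise, an allocation error.
    """
    if global_page_pool > 0:
        return "from_global_pool"
    if malloced + page_size <= limit_maxbytes:
        return "malloced"
    if slab_class["lru_tail"]:              # something exists to evict
        return "evicted"
    return "allocation_error"
```

With John's numbers (total_malloced far under limit_maxbytes, plus pages sitting in the global pool), this model would never reach the error branch, which is why an allocation failure here is surprising.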
>
> have fun,
> -Dormando
>
> On Mon, 24 Apr 2017, Min-Zhong "John" Lu wrote:
>
> > Hi there,
> >
> > I've recently been investigating an intermittent & transient
> > failure-to-set issue in a long-running memcached instance, and I
> > believe I could use some insight from you all.
> >
> > Let me list my configuration first. I have |stats| and |stats slabs|
> > dumps available as Google Groups attachments. If they fail to go
> > through, just lemme re-post them on some pastebin service.
> >
> > Configuration:
> > Command line args: -m 2900 -f 1.16 -c 10240 -k -o modern
> >
> > Using 1.4.36 (compiled by myself) on Ubuntu 14.04.4 x64.
> >
> > The -k flag has been verified to be effective (I've got limits
> > configured correctly).
> >
> > Growth factor of 1.16 is just an empirical value for my item sizes.
> >
> >
> > Symptom of the issue:
> > After running memcached for around 10 days, there have been occasions
> > where a set request for a large item (sized around 760KiB to 930KiB)
> > would fail, with memcached returning 37 (item too big). However, when
> > this happens, if I wait for around one minute and then send the same
> > set request again (with exactly the same key/item/expiration to
> > store), memcached will gladly store it. Further get requests verify
> > that the item is correctly stored.
> >
> > According to my logs, this happens intermittently, and I haven't been
> > able to correlate those transient failures with my slab stats.
> >
> >
> > Observation & Question 1:
> > Q1: Does my issue arise because, when the initial set request arrives
> > at memcached, memcached has to run the slab automover to produce a
> > slab (maybe two slabs, since the item is larger than 512KiB) to
> > accommodate the set request?
> >
> > This is my hunch --- I have yet to do a quick |stats| dump at the
> > exact moment of the set failure to confirm this.
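For intuition on what a growth factor of 1.16 produces: per-class chunk sizes grow roughly geometrically from a base size up to the chunk size max, and items in the 760KiB-930KiB range exceed the default 512KiB max chunk, so they take the chunked-item path mentioned elsewhere in the thread. The sketch below only approximates the class table (the base size is a guess, and real memcached rounds each size up for alignment, so actual classes differ slightly):

```python
def approx_chunk_sizes(base=96, factor=1.16, chunk_max=512 * 1024):
    """Approximate per-class chunk sizes for a given growth factor.

    Real memcached derives the base from item overhead plus -n, and
    aligns each size; this is only for a rough picture.
    """
    sizes = []
    size = float(base)
    while size <= chunk_max:
        sizes.append(int(size))
        size *= factor
    return sizes
```

Running `approx_chunk_sizes()` shows the largest class stays at or below 512KiB, so a ~760KiB item can never fit in a single chunk with the default settings.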
> > But I have seen [slab_reassign_busy_items = 10K] and
> > [slabs_moved = 16.9K] in my |stats| dumps, which means the slab
> > automover must have been triggered during memcached's entire lifetime.
> > This leads to my next question:
> >
> >
> > Observation & Questions 2 & 3:
> > Q2: When the slab automover is running, would it possibly block the
> > large-item set request, as in my case above?
> >
> > Q3: Why would memcached favor triggering the slab automover over
> > allocating new memory, when there is still host memory available?
> >
> > According to the stats dumps, my memcached instance has
> > [total_malloced = 793MiB] and a footprint of [bytes = 392.33MiB] ---
> > both fall far short of [limit_maxbytes = 2900MiB]. Furthermore,
> > nothing has been evicted, as I have [evictions = 0].
> >
> > (And the host system has plenty of free physical memory, per
> > |free -m|.)
> >
> > I would expect that allocating memory would be faster (*way* faster,
> > actually) than triggering the slab automover to reassign slabs to
> > accommodate the incoming set request, and that allocating memory would
> > allow the initial set request to be served immediately.
> >
> > In addition, if the slab automover just happens to be running when the
> > large-item set request arrives, and the answer to Q2 is "yes"... can
> > we make it not block if there's still host memory available?
> >
> >
> > I'm kinda out of clues here... and I might actually be on the wrong
> > route in my investigation.
> >
> > Any insight is appreciated, and it'd be great if I can get rid of
> > those set failures without having to summon a dinosaur.
> >
> > For example, would disabling the slab automover be an acceptable
> > band-aid fix? (And I'd launch the manual mover (mc_slab_mover) when I
> > know I have relatively lighter traffic.)
> >
> > Thanks a lot.
> >
> > p.s.
> > While 'retry this set request at a later time' will work
> > (anecdotally), I don't want to implement a retry mechanism on the
> > client side, since 1) the 'later time' is probably non-deterministic,
> > and 2) I don't have a readily available construct to decouple such a
> > retry from the rest of my task, and thus having to retry would
> > unnecessarily block the client side.
> >
> > --
> >
> > ---
> > You received this message because you are subscribed to the Google
> > Groups "memcached" group.
> > To unsubscribe from this group and stop receiving emails from it,
> > send an email to memcached+...@googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.
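The memory numbers quoted in the thread (bytes, total_malloced, limit_maxbytes) are easy to cross-check from a raw |stats| dump. A small helper, sketched below (the field names match the text-protocol `STAT <name> <value>` output; the summary format is invented for the example):

```python
def memory_summary(stats_text):
    """Pull the memory numbers discussed above out of a raw |stats| dump.

    Returns (bytes_used, total_malloced, limit_maxbytes), each in MiB.
    """
    stats = {}
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    mib = 1024 * 1024
    return (int(stats["bytes"]) / mib,
            int(stats["total_malloced"]) / mib,
            int(stats["limit_maxbytes"]) / mib)
```

Dumping this at the exact moment of a failed set (as John plans to) would show whether the instance really is far under its limit when the error fires.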