Re: Disappearing Keys
+1

On Tue, Jul 6, 2010 at 2:47 PM, dormando dorma...@rydia.net wrote:
> Or you could disable the failover feature...

-- awl
Re: Disappearing Keys
FYI - we made the change on one server and it does appear to have resolved premature key expiration. Effectively, what appears to have been happening is that every so often a client was unable to connect to one or more of the memcached servers. When this happened, it changed the key distribution. Because the connection was persistent, subsequent requests would reuse the same connection handle with the reduced server pool. Turning off persistent connections ensures that if we are unable to connect to a server in one instance, the failure does not persist for subsequent connections. We'll be rolling this change out to the entire server pool and I'll give the list another update with our findings.

Thanks,
Darryl

On Fri, Jul 2, 2010 at 8:34 AM, Darryl Kuhn darryl.k...@gmail.com wrote:
> Found the reset call - that was me being an idiot (I actually introduced it when I added logging to debug this issue)...
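For reference, a minimal sketch of the non-persistent setup described above, assuming the PECL memcache extension (the pool hostnames are placeholders). The third argument to Memcache::addServer() is the persistent flag, which defaults to true:

    <?php
    // Sketch only: hostnames are placeholders. Passing false as the
    // third argument opens a fresh connection instead of reusing a
    // pooled handle, so a transient connect failure can't linger
    // across requests and shift the key distribution.
    $mc = new Memcache();
    $mc->addServer('cache1.example.com', 11211, false); // persistent = false
    $mc->addServer('cache2.example.com', 11211, false);

    $value = $mc->get('popularProducts:2010-06-21:collegiate:en:7:image_wall:0');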
Re: Disappearing Keys
Or you could disable the failover feature...

On Tue, 6 Jul 2010, Darryl Kuhn wrote:
> FYI - we made the change on one server and it does appear to have resolved premature key expiration...
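The failover feature dormando refers to is the extension's memcache.allow_failover ini directive (it shows as 1 in the config dump at the bottom of this thread). A minimal sketch of turning it off, assuming the directive can be changed at runtime:

    <?php
    // In php.ini:
    //   memcache.allow_failover = 0
    // Or at runtime, assuming the directive is changeable per request:
    ini_set('memcache.allow_failover', '0');

With failover off, requests to a dead server fail outright rather than silently remapping its keys onto the rest of the pool.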
Re: Disappearing Keys
Found the reset call - that was me being an idiot (I actually introduced it when I added logging to debug this issue)... That's been removed; however, there was no flush command. Somebody else suggested it may have to do with the fact that we're running persistent connections, and that if a failure occurred, that failure would persist and alter hashing rules for subsequent requests on that connection. I do see a limited number of connection failures (~5-15) throughout the day. I'm going to alter the config to make connections non-persistent and see if it makes a difference (however, I'm doubtful this is the issue, as we've run with memcache server pools with a single instance - which would make it impossible to alter the hashing distribution). I'll report back what I find - thanks for your continued input!

-Darryl

On Thu, Jul 1, 2010 at 12:28 PM, dormando dorma...@rydia.net wrote:
> Dormando... Thanks for the response. I've moved one of our servers to use an upgraded version running 1.4.5...
Re: Disappearing Keys
> Dormando... Thanks for the response. I've moved one of our servers to use an upgraded version running 1.4.5. Couple of things:
> * I turned on logging last night
> * I'm only running -vv at the moment; -vvv generated way more logging than we could handle. As it stands we've generated ~6GB of logs since last night (using -vv). I'm looking at ways of reducing log volume by logging only specific data, or perhaps standing up 10 or 20 instances on one machine (using multiple ports) and turning on -vvv on only one instance. Any suggestions there?

Oh. I thought given your stats output that you had reproduced it on a server that was on a dev instance or local machine... but I guess that's related to below. Running logs on a production instance with a lot of traffic isn't that great of an idea, sorry about that :/

> Looking at the logs, two things jump out at me.
> * While I had -vvv turned on I saw a "stats reset" command being issued constantly (at least once a second). Nothing in the code that we have does this - do you know if the PHP client does this perhaps? Is this something you've seen in the past?

No, you probably have some code that's doing something intensely wrong. Now we should probably add a counter for the number of times a stats reset has been called...

> * Second, with -vv on I get something like this:
>
> 71 get resourceCategoryPath21:984097:
> 71 sending key resourceCategoryPath21:984097:
> 71 END
> 71 set popularProducts:2010-06-28:skinit.com:styleskins:en::2000:image_wall:0__type 0 86400 5
> 71 STORED
> 71 set popularProducts:2010-06-28:skinit.com:styleskins:en::2000:image_wall:0 1 86400 130230
> 59 get domain_host:www.bestbuyskins.com
> 59 sending key domain_host:www.bestbuyskins.com
> 59 END
>
> * Two questions on the output - what's the 71 and 59? Second - I would have thought I'd see an END after each get and set, however you can see that's not the case.
> Last question... other than trolling through code, is there a good place to go to understand how to parse out these log files (I'd prefer to self-help rather than bugging you)?

Looks like you figured that out. The numbers are the file descriptors (connections). END/STORED/etc. are the responses.

Honestly, I'm going to take a wild guess that something on your end is constantly trying to reset the memcached instance... it's probably doing a flush_all, then a stats reset, which would hide the flush counter. Do you see flush_all being called in the logs anywhere?

Go find where you're calling stats reset and make it stop... that'll probably help bubble up what the real problem is.
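As a side note, the flush counter dormando mentions is visible from the client. A quick sketch of polling it with the PECL extension (host/port are assumptions); note that a stats reset zeroes cmd_flush, which is exactly how a flush_all could hide:

    <?php
    // Sketch: host/port are assumptions. cmd_flush counts flush_all
    // commands received; a "stats reset" zeroes it, which is how the
    // flush was being hidden here.
    $mc = new Memcache();
    $mc->addServer('127.0.0.1', 11211);

    $stats = $mc->getStats();
    printf("cmd_flush: %s, uptime: %s\n", $stats['cmd_flush'], $stats['uptime']);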
Re: Disappearing Keys
> Any thoughts on what might be going on here? As for vitals/system config, here's a recent stats dump:
> STAT pid 19986
> STAT uptime 8526687
> STAT time 1277416425
> STAT version 1.2.8

You should probably upgrade, but I think this version has what you may need... If you can reproduce this against an idle memcached, start up an instance with -vvv (three v's), which will tell you why a GET failed, as well as all of the parameters of any SETs/GETs you run. That way you can see if the GET failed because it expired, was evicted, expired due to a flush command (which doesn't seem to be your case), etc. You can also see if your PHP client is actually setting the timeout correctly or not.
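For anyone following along, a minimal version of the repro dormando suggests (port and values are arbitrary; only the -vvv flag and the set/get flow come from the thread). Start an idle instance in the foreground and watch its output while driving it from PHP:

    <?php
    // First, in another terminal, start an idle test instance:
    //   memcached -vvv -p 11311 -m 64
    // (port 11311 is arbitrary, chosen to stay clear of production).
    // The -vvv output should explain each miss: expired, evicted, flushed.
    $mc = new Memcache();
    $mc->addServer('127.0.0.1', 11311);

    $key = 'popularProducts:2010-06-21:collegiate:en:7:image_wall:0';
    $mc->set($key, 'test-value', 0, 86400); // flags = 0, expire in 1 day
    var_dump($mc->get($key));               // should hit until the TTL lapses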
Disappearing Keys
We've been battling an issue for a while now (in fact, we're not sure quite how long it's been going on). It appears that keys are being evicted before their expiration time; however, when we check stats we see evicted: 0 on all slabs. To track this down we've added logging (see below) for a particular key to track activity.

Server       Date         Time      Request  Key                                                       Message
10.16.18.35  2010-Jun-23  1:00:46   15463    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Get failed
10.16.18.35  2010-Jun-23  1:00:47   15463    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Set to expire in 86400 seconds
10.16.18.24  2010-Jun-23  9:58:29   55378    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Get failed
10.16.18.24  2010-Jun-23  9:58:43   55378    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Set to expire in 86400 seconds
10.16.18.19  2010-Jun-23  10:10:54  87555    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Get failed
10.16.18.19  2010-Jun-23  10:11:08  87555    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Set to expire in 86400 seconds
10.16.18.14  2010-Jun-23  10:40:57  55731    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Get failed
10.16.18.14  2010-Jun-23  10:41:15  55731    popularProducts:2010-06-21:collegiate:en:7:image_wall:0   Set to expire in 86400 seconds

This is a small snippet on just one key, but it is representative of what we're seeing. As you can see, we're setting the key to expire in one day; however, subsequent GETs are failing well before the one-day mark... We thought that perhaps we were reaching memory limits and hitting the LRU, but that doesn't appear to be the case either: we're using ~18% of the memory allocated, and a stats items call shows evicted and evicted_time are 0 for all slabs. We've tried switching pools around to eliminate bad RAM as a cause; we've even run a pool on a local machine to eliminate network-related issues, and we see the same symptoms. Any thoughts on what might be going on here?

As for vitals/system config, here's a recent stats dump:

STAT pid 19986
STAT uptime 8526687
STAT time 1277416425
STAT version 1.2.8
STAT pointer_size 64
STAT rusage_user 1510.430396
STAT rusage_system 3328.800037
STAT curr_items 313957
STAT total_items 0
STAT bytes 449831564
STAT curr_connections 1025
STAT total_connections 1
STAT connection_structures 1276
STAT cmd_flush 0
STAT cmd_get 14
STAT cmd_set 0
STAT get_hits 14
STAT get_misses 0
STAT evictions 0
STAT bytes_read 491
STAT bytes_written 18449
STAT limit_maxbytes 2147483648
STAT threads 17
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT delete 16755
STAT replace 6515615

We're connecting to memcached through PHP 4.4.4 using the PECL memcache extension, configured thusly:

memcache support = enabled
memcache.allow_failover = 1
memcache.chunk_size = 8192
memcache.default_port = 11211
memcache.hash_function = crc32
memcache.hash_strategy = consistent
memcache.max_failover_attempts = 20
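For context, the logging in the table above would come from a wrapper along these lines - a rough sketch, where cache_log() and build_popular_products() are made-up stand-ins; only the get-miss / set-with-86400s-TTL flow is from the post:

    <?php
    // Rough sketch of instrumentation that would produce the table
    // above. cache_log() and build_popular_products() are hypothetical;
    // the miss-then-set flow and the 86400s TTL come from the post.
    function cache_log($key, $message) {
        error_log(sprintf('%s %s %s', date('Y-M-d G:i:s'), $key, $message));
    }

    function build_popular_products() {
        return 'recomputed product wall';   // placeholder payload
    }

    $mc = new Memcache();
    $mc->addServer('10.16.18.35', 11211);

    $key = 'popularProducts:2010-06-21:collegiate:en:7:image_wall:0';
    $value = $mc->get($key);
    if ($value === false) {
        cache_log($key, 'Get failed');
        $value = build_popular_products();
        $mc->set($key, $value, 0, 86400);   // flags = 0, expire in 1 day
        cache_log($key, 'Set to expire in 86400 seconds');
    }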