Inline responses below. Also, I dunno if I missed it, but what version were you on originally? Are the start arguments the same?
On Sat, 13 Oct 2018, Jim Jones wrote:

> The "-n" and "ext_item_size" options were picked to help handle the large volume of small key values our service generates. But if increasing either will prevent potential deadlocks, I'm sure we'd prefer to accept the efficiency hit.

Those values don't necessarily translate to doing what you want; I'll illustrate that once I get a gander at your stats.

> The two servers that stopped accepting connections were only getting write operations, that's correct. They are operating as shadow copies of the cache during this parallel testing period. I do see trivial numbers of "get" requests, but it's a tiny number compared to the number of "set" and "delete" commands.

Got it.

> The code generating the memcache requests is PHP, with "twemproxy" (v0.4.1) handling all the communications with the memcache servers. I can dig into the code to get more specific information on the PHP setup if that's helpful.

Ok, so ascii only, then. Just need to know the general commands being used and the protocol, at this point.

> I'll relay stats from one of the servers via private e-mail in case those counters help suggest things we can try.
>
> And we'll definitely gather the info you described in the event we hit this problem again.

Thanks! Appreciate your patience.

> -jj
>
> On Sat, Oct 13, 2018 at 10:07 AM dormando <dorma...@rydia.net> wrote:
>
> > Sounds like the daemon hardlocked. Some of your start arguments are fairly aggressive (ext_item_size and -n especially); I'll double check that those won't cause problems like this.
> >
> > First, to confirm: these two hung machines were only getting writes the whole time? No reads?
> >
> > Any info you can share about the client? What library, protocol (ascii or binary, etc)?
> >
> > If you can privately share with me any of your stats snapshots that would help a lot as well, since I can better determine which features are being exercised.
> >
> > Aside from that, there don't seem to be many hints here. If it happens again, please do:
> >
> > 1) See if you can still connect to it and run commands; grab stats if so.
> > 2) Grab a GDB backtrace before killing the daemon:
> >
> >     gdb -p $(pidof memcached)
> >     thread apply all bt
> >     ^ full thread backtrace.
> >
> > If it's hardlocked the backtrace *should* immediately tell me what's going on.
> >
> > Sorry for the trouble!
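A side note on steps 1 and 2 above: they're easy to pre-stage as a small script kept on the boxes, so whoever catches the next incident only has to run one thing before restarting the daemon. Just a sketch; the host, port and output paths are assumptions, and it presumes nc, gdb and pidof are installed:

    #!/bin/sh
    # Illustrative incident-capture helper -- adjust host/port/paths to your setup.
    HOST=127.0.0.1
    PORT=11211
    OUT=/var/tmp/memcached-incident-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$OUT"

    # Step 1: try to pull stats over the ascii protocol (this may stall if the
    # daemon is hardlocked, which is useful information by itself).
    for cmd in "stats" "stats items" "stats slabs" "stats conns"; do
        printf '%s\r\nquit\r\n' "$cmd" | nc -w 5 "$HOST" "$PORT" > "$OUT/$(echo "$cmd" | tr ' ' '_').txt"
    done

    # Step 2: full backtrace of every thread, taken before the daemon is killed.
    gdb -p "$(pidof memcached)" -batch -ex 'thread apply all bt' > "$OUT/backtrace.txt" 2>&1

That way whatever stats are still reachable and the full backtrace land in one timestamped directory.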
> > On Fri, 12 Oct 2018, Jim Jones wrote:
> >
> > > The commandline arguments used are:
> > >
> > > -u memcached -m 236544 -c 64000 -p 11211 -t 32 -C -n 5 -f 1.05 -o ext_path=/mnt/memcache:1700G,ext_path=/mnt1/memcache:1700G,ext_path=/mnt2/memcache:1700G,ext_path=/mnt3/memcache:1700G,ext_threads=32,ext_item_size=64
> > >
> > > And we have some data, but frankly when the issue was happening we focused on the Memcache servers late in the process. The initial errors suggested the problem was in a very different part of the collection of technical components in our larger service. When we realized the "memcached" processes were not accepting new connections, we wanted to correct the behavior quickly since a fair amount of time had already passed.
> > >
> > > First, sockets in use on the servers...
> > >
> > > One system, call it server-248, shows TCP sockets on the system hovering around 1900 after traffic ramped up for the day. It held at that level from ~6:45am until 10:06am. We collect SAR data every 2 minutes, so the next reading was at 10:08 and the TCP sockets jumped to 63842. Meaning it didn't grow slowly over time; it jumped from 1937 to 63842 in a 2 minute window. That number was 63842-63844 until 12:06pm when we restarted the "memcached" process. After that the number dropped over time back to a more typical level.
> > >
> > > 10:02am  1937
> > > 10:04am  1937
> > > 10:06am  1937
> > > 10:08am  63842
> > > 10:10am  63842
> > > ...etc...
> > > 12:04pm  63843
> > > 12:06pm  63844
> > > 12:08pm  18415
> > > 12:10pm  17202
> > > 12:12pm  16333
> > > 12:14pm  16197
> > > 12:16pm  16134
> > > 12:18pm  16099
> > > 12:20pm  1617
> > >
> > > The other system which ran into trouble, which I'll call server-85, exhibited similar behavior but started later. Here's a sample of TCP socket counts from that server.
> > >
> > > 11:30am  1805
> > > 11:32am  1801
> > > 11:34am  1830
> > > 11:36am  1817
> > > 11:38am  63905
> > > 11:40am  63905
> > > ...etc...
> > > 12:20pm  63908
> > > 12:22pm  63908
> > > 12:24pm  1708
> > > 12:26pm  1720
> > > 12:28pm  1747
> > >
> > > There were other network-centric datapoints that show the systems grind to a halt in terms of accepting new connections, like the bandwidth going in/out of the NICs, etc... But it's all in support of the same idea, that the "memcached" server stopped accepting new connections.
> > >
> > > Second, details about the sockets...
> > >
> > > During the incident, we did capture summary information on the socket states from various servers on the network, and a full "netstat -an" listing from one of the servers. Both server-248 and server-85 showed tens of thousands of sockets in a CLOSE_WAIT state, hundreds in a SYN_RCVD state, and a small number of ESTABLISHED sockets.
> > >
> > > There may continue to be traffic on the ESTABLISHED connections to the "memcached" servers, but if there is it's a trivial amount. Multiple people were running "memkeys" at the time and report seeing no activity.
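For those socket-state summaries: a quick per-state count aimed at the memcached port captures the same picture and is cheap to run repeatedly during an event. Something along these lines, assuming Linux "netstat -an" output and port 11211 (adjust the field positions if your netstat formats differently):

    # Per-state counts for sockets on the memcached port (state is field 6 in Linux "netstat -an").
    netstat -an | awk '$1 ~ /^tcp/ && $4 ~ /:11211$/ {print $6}' | sort | uniq -c | sort -rn

    # Keep the full listing too, in case the individual peers matter later.
    netstat -an > netstat-$(hostname)-$(date +%s).txt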
> > > Third, "stats" from the incapacitated "memcached" processes...
> > >
> > > We do not have stats from either server-248 or server-85 during the time they were in trouble. In hindsight, that was a big oversight. It's not clear that we could have gotten a connection to the server to pull the stats, but I'd really like to know what those counters said!
> > >
> > > I do have the results from "stats", "stats slabs", "stats items" and "stats conns" from 17:10 the previous evening. That doesn't show any obvious errors/problems slowly building up, waiting for some event to trigger a massive failure. But it's from ~15 hours before the server got into trouble, so I don't think it's all that helpful.
> > >
> > > -jj
> > >
> > > On Fri, Oct 12, 2018 at 6:24 PM dormando <dorma...@rydia.net> wrote:
> > >
> > > > Hey,
> > > >
> > > > Probably unrelated to the original thread. I'm highly interested in understanding what's going on here though, and would appreciate any help you can spare.
> > > >
> > > > What data did you collect? The more information you can share the better:
> > > >
> > > > 1) start arguments for the daemon
> > > > 2) stats output (stats, stats items, stats slabs, and importantly "stats conns") - you will probably have to redact some of the output.
> > > > 3) were you still able to connect to and run commands on those daemons while they had a mass of CLOSE_WAIT?
> > > >
> > > > Thanks,
> > > > -Dormando
> > > >
> > > > On Fri, 12 Oct 2018, jjo...@smugmug.com wrote:
> > > >
> > > > > We're testing version 1.5.10 as a possible upgrade candidate for our older memcached servers, using a pool of 9 servers. They are running in parallel with the production pool, also 9 servers. For the test all read requests are going to the production pool, and all updates (set, delete, etc...) are sent to one server in the production pool and one server in the 1.5.10 pool via the key hashing algorithm.
> > > > >
> > > > > That setup had been running without incident for about 12 days, then yesterday two of the servers experienced the mass of CLOSE_WAIT connections similar to what's been described here. We were able to collect some data, but not enough to figure out what's happening. So I'm hoping to kickstart a discussion here about how to diagnose what's going on. Until we can find a way to explain (and prevent) another problem like this, we're unable to upgrade.
> > > > >
> > > > > I can provide more information about our configuration. I'm just not sure what bits are useful/interesting. I will note that we're using "extstore" functionality on the new servers.
> > > > >
> > > > > -jj
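One more thought on the missing stats from the incident window: a dumb periodic snapshot is cheap and gives a before/during/after picture if a daemon wedges again. A sketch of a crontab entry, assuming nc is installed; the path and interval are arbitrary:

    # Illustrative crontab entry: dump "stats" every 2 minutes with a timestamp,
    # so there's a recent baseline to diff against the next hang.
    */2 * * * * ( date; printf 'stats\r\nquit\r\n' | nc -w 5 127.0.0.1 11211 ) >> /var/tmp/memcached-stats.log 2>&1

Trim or rotate that file however you normally would; "stats" output is small, but it adds up.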