Sounds like the daemon hardlocked. Some of your start arguments are fairly
aggressive (ext_item_size and -n especially); I'll double-check that those
won't cause problems like this.
First, to confirm: were these two hung machines only getting writes the
whole time, no reads? Any info you can share about the client? What
library, what protocol (ascii or binary, etc.)? If you can privately share
any of your stats snapshots, that would help a lot as well, since I can
better determine which features are being exercised.

Aside from that, there don't seem to be many hints here. If it happens
again, please do:

1) See if you can still connect to it and run commands; grab stats if so.
2) Grab a GDB backtrace before killing the daemon:

   gdb -p $(pidof memcached)
   thread apply all bt

   ^ full thread backtrace. If it's hardlocked, the backtrace *should*
   immediately tell me what's going on.

Sorry for the trouble!

On Fri, 12 Oct 2018, Jim Jones wrote:

> The commandline arguments used are:
>
> -u memcached -m 236544 -c 64000 -p 11211 -t 32 -C -n 5 -f 1.05 -o
> ext_path=/mnt/memcache:1700G,ext_path=/mnt1/memcache:1700G,ext_path=/mnt2/memcache:1700G,ext_path=/mnt3/memcache:1700G,ext_threads=32,ext_item_size=64
>
> And we have some data, but frankly, when the issue was happening we
> focused on the Memcache servers late in the process. The initial errors
> suggested the problem was in a very different part of our larger
> service's stack. When we realized the "memcached" processes were not
> accepting new connections, we wanted to correct the behavior quickly,
> since a fair amount of time had already passed.
>
> First, sockets in use on the servers...
>
> One system, call it server-248, shows TCP sockets on the system hovering
> around 1900 after traffic ramped up for the day. It held at that level
> from ~6:45am until 10:06am. We collect SAR data every 2 minutes, so the
> next reading was at 10:08am, when the TCP socket count jumped to 63842.
> Meaning it didn't grow slowly over time; it jumped from 1937 to 63842 in
> a 2-minute window. That number stayed at 63842-63844 until 12:06pm, when
> we restarted the "memcached" process.
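For convenience, the two capture steps above can be wrapped in one small
script to run the moment a daemon wedges again. This is only a sketch: the
localhost:11211 address, output file names, use of nc, and the assumption
of a single memcached instance per host are mine, not part of the
instructions above.

```shell
#!/bin/sh
# Sketch of the capture steps above. Assumes gdb, pidof, and nc are
# installed, and one memcached instance on the default port.

grab_diagnostics() {
    pid="$(pidof memcached)" || { echo "memcached not running" >&2; return 1; }

    # Step 1: try to connect and grab stats, in case the daemon still
    # answers commands. (File name and nc timeout are illustrative.)
    printf 'stats\r\nstats conns\r\nquit\r\n' \
        | nc -w 5 127.0.0.1 11211 > stats-during-hang.txt

    # Step 2: full thread backtrace before killing the daemon.
    # -batch runs the -ex command, dumps output, and detaches.
    gdb -p "$pid" -batch -ex 'thread apply all bt' > backtrace.txt 2>&1
}
```

Attaching gdb briefly pauses a healthy process but doesn't harm it, so
there's no risk in running this speculatively.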
> After that the number dropped over time back to a more typical level.
>
> 10:02am  1937
> 10:04am  1937
> 10:06am  1937
> 10:08am  63842
> 10:10am  63842
> ...etc...
> 12:04pm  63843
> 12:06pm  63844
> 12:08pm  18415
> 12:10pm  17202
> 12:12pm  16333
> 12:14pm  16197
> 12:16pm  16134
> 12:18pm  16099
> 12:20pm  1617
>
> The other system which ran into trouble, which I'll call server-85,
> exhibited similar behavior but started later. Here's a sample of TCP
> socket counts from that server.
>
> 11:30am  1805
> 11:32am  1801
> 11:34am  1830
> 11:36am  1817
> 11:38am  63905
> 11:40am  63905
> ...etc...
> 12:20pm  63908
> 12:22pm  63908
> 12:24pm  1708
> 12:26pm  1720
> 12:28pm  1747
>
> There were other network-centric datapoints that show the systems
> grinding to a halt in terms of accepting new connections, like the
> bandwidth going in/out of the NICs, etc. But it's all in support of the
> same idea: the "memcached" server stopped accepting new connections.
>
> Second, details about the sockets...
>
> During the incident, we did capture summary information on the socket
> states from various servers on the network, and a full "netstat -an"
> listing from one of the servers. Both server-248 and server-85 showed
> tens of thousands of sockets in a CLOSE_WAIT state, hundreds in a
> SYN_RCVD state, and a small number of ESTABLISHED sockets.
>
> There may still be traffic on the ESTABLISHED connections to the
> "memcached" servers, but if there is, it's a trivial amount. Multiple
> people were running "memkeys" at the time and reported seeing no
> activity.
>
> Third, "stats" from the incapacitated "memcached" processes...
>
> We do not have stats from either server-248 or server-85 during the time
> they were in trouble. In hindsight, that was a big oversight. It's not
> clear that we could have gotten a connection to the server to pull the
> stats, but I'd really like to know what those counters said!
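As an aside, a socket-state summary like the one described above can be
regenerated on demand with a short pipeline over `netstat -an` (or
`ss -tan`). The sample lines below are made up so the pipeline is
self-contained; during an incident you'd feed it the real listing instead.

```shell
#!/bin/sh
# Made-up sample of `netstat -an` TCP lines; in a real incident, replace
# the echo with:  netstat -an | awk '$1 ~ /^tcp/'   (or use `ss -tan`).
netstat_sample='tcp  0  0  10.0.0.1:11211  10.0.0.2:40001  CLOSE_WAIT
tcp  0  0  10.0.0.1:11211  10.0.0.3:40002  CLOSE_WAIT
tcp  0  0  10.0.0.1:11211  10.0.0.4:40003  SYN_RECV
tcp  0  0  10.0.0.1:11211  10.0.0.5:40004  ESTABLISHED'

# Count connections per TCP state (the state is the last field).
echo "$netstat_sample" | awk '{print $NF}' | sort | uniq -c | sort -rn
```

On the sample input this prints CLOSE_WAIT with a count of 2 at the top;
on a wedged server you'd expect that count in the tens of thousands.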
>
> I do have the results from "stats", "stats slabs", "stats items" and
> "stats conns" from 17:10 the previous evening. That doesn't show any
> obvious errors/problems slowly building up, waiting for some event to
> trigger a massive failure. But it's from ~15 hours before the server got
> into trouble, so I don't think it's all that helpful.
>
> -jj
>
> On Fri, Oct 12, 2018 at 6:24 PM dormando <dorma...@rydia.net> wrote:
> Hey,
>
> Probably unrelated to the original thread. I'm highly interested in
> understanding what's going on here, though, and would appreciate any
> help you can spare.
>
> What data did you collect? The more information you can share, the
> better:
>
> 1) start arguments for the daemon
> 2) stats output (stats, stats items, stats slabs, and importantly
>    "stats conns") - you will probably have to redact some of the output.
> 3) were you still able to connect to and run commands on those daemons
>    while they had a mass of CLOSE_WAIT?
>
> Thanks,
> -Dormando
>
> On Fri, 12 Oct 2018, jjo...@smugmug.com wrote:
>
> > We're testing version 1.5.10 as a possible upgrade candidate for our
> > older memcached servers, using a pool of 9 servers. They are running
> > in parallel with the production pool, also 9 servers. For the test,
> > all read requests are going to the production pool, and all updates
> > (set, delete, etc.) are sent to one server in the production pool and
> > one server in the 1.5.10 pool via the key hashing algorithm.
> >
> > That setup had been running without incident for about 12 days; then
> > yesterday two of the servers experienced the mass of CLOSE_WAIT
> > connections similar to what's been described here. We were able to
> > collect some data, but not enough to figure out what's happening. So
> > I'm hoping to kickstart a discussion here about how to diagnose
> > what's going on. Until we can find a way to explain (and prevent)
> > another problem like this, we're unable to upgrade.
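Since the big regret above is having no stats from during the incident,
one cheap safeguard is to snapshot stats on a schedule so the before/during
data exists next time. A hypothetical collector, where the host, port, and
the two-minute cadence (chosen to match the SAR interval) are all my
assumptions:

```shell
#!/bin/sh
# Hypothetical stats snapshotter; host, port, and interval are assumptions.
HOST="${MEMCACHED_HOST:-127.0.0.1}"
PORT="${MEMCACHED_PORT:-11211}"

# Timestamped file name so snapshots line up with the SAR readings.
snapshot_file() {
    echo "memcached-stats-$(date +%Y%m%d-%H%M%S).txt"
}

# One snapshot: the four stats commands, then `quit` so nc returns.
snapshot_once() {
    printf 'stats\r\nstats items\r\nstats slabs\r\nstats conns\r\nquit\r\n' \
        | nc -w 5 "$HOST" "$PORT" > "$(snapshot_file)"
}

# Uncomment to collect every two minutes, matching the SAR cadence:
# while :; do snapshot_once; sleep 120; done
```

Run per server (e.g. from cron or a tmux session); if a daemon wedges
again, the last file that still has content brackets the failure window.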
> >
> > I can provide more information about our configuration. I'm just not
> > sure what bits are useful/interesting. I will note that we're using
> > "extstore" functionality on the new servers.
> >
> > -jj
> >
> > --
> >
> > ---
> > You received this message because you are subscribed to the Google
> > Groups "memcached" group.
> > To unsubscribe from this group and stop receiving emails from it,
> > send an email to memcached+unsubscr...@googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.