Sounds like the daemon hardlocked. Some of your start arguments are fairly
aggressive (ext_item_size and -n especially); I'll double-check that those
won't cause problems like this.

First, to confirm: were these two hung machines only getting writes the
whole time? No reads?

Any info you can share about the client? What library, protocol (ASCII or
binary, etc.)?

If you can privately share with me any of your stats snapshots, that would
help a lot as well, since then I can better determine which features are
being exercised.
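
If you need to grab a fresh snapshot, something like this should do it (a
rough sketch; it assumes nc is installed, the daemon is reachable at
127.0.0.1:11211, and the output filename is just an example):

printf 'stats\r\nstats items\r\nstats slabs\r\nstats conns\r\nquit\r\n' | \
  nc 127.0.0.1 11211 > stats-snapshot.txt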

Aside from that, there don't seem to be many hints here. If it happens
again, please do:

1) see if you can still connect to it and run commands; grab stats if so.
2) grab a GDB backtrace before killing the daemon:
gdb -p $(pidof memcached)
(gdb) thread apply all bt
^ full thread backtrace.

If it's hardlocked the backtrace *should* immediately tell me what's going
on.
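
If attaching interactively is a pain, something like this should also capture
the same thing non-interactively (assuming gdb is installed and pidof returns
a single PID; the output filename is just an example):

gdb -batch -ex 'thread apply all bt' -p $(pidof memcached) > memcached-bt.txt 2>&1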

Sorry for the trouble!

On Fri, 12 Oct 2018, Jim Jones wrote:

> The commandline arguments used are:
>
> -u memcached -m 236544 -c 64000 -p 11211 -t 32 -C -n 5 -f 1.05 -o
> ext_path=/mnt/memcache:1700G,ext_path=/mnt1/memcache:1700G,ext_path=/mnt2/memcache:1700G,ext_path=/mnt3/memcache:1700G,ext_threads=32,ext_item_size=64
>
> And we have some data, but frankly when the issue was happening we focused on 
> the Memcache servers late in the process.  The initial errors suggested the
> problem was in a very different part of the collection of technical 
> components in our larger service.  When we realized the "memcached"
> processes were not accepting new connections, we wanted to correct the
> behavior quickly since a fair amount of time had already passed.
>
> First, sockets in use on the servers...
>
> One system, call it server-248, showed TCP sockets on the system hovering
> around 1900 after traffic ramped up for the day.  It held at that level from
> ~6:45am until 10:06am.  We collect SAR data every 2 minutes, so the next
> reading was at 10:08 and the TCP sockets jumped to 63842.  Meaning it didn't
> grow slowly over time; it jumped from 1937 to 63842 in a 2-minute window.
> That number held at 63842-63844 until 12:06pm when we restarted the
> "memcached" process.  After that the number dropped over time back to a more
> typical level.
>
> 10:02am   1937
> 10:04am   1937
> 10:06am   1937
> 10:08am  63842
> 10:10am  63842
> ...etc...
> 12:04pm  63843
> 12:06pm  63844
> 12:08pm  18415
> 12:10pm  17202
> 12:12pm  16333
> 12:14pm  16197
> 12:16pm  16134
> 12:18pm  16099
> 12:20pm   1617
>
> The other system that ran into trouble, which I'll call server-85, exhibited
> similar behavior but started later.  Here's a sample of TCP socket counts
> from that server.
>
> 11:30am   1805
> 11:32am   1801
> 11:34am   1830
> 11:36am   1817
> 11:38am  63905
> 11:40am  63905
> ...etc...
> 12:20pm  63908
> 12:22pm  63908
> 12:24pm   1708
> 12:26pm   1720
> 12:28pm   1747
>
> There were other network-centric datapoints that showed the systems grinding
> to a halt in terms of accepting new connections, like the bandwidth going
> in/out of the NICs, etc...  But it's all in support of the same idea: that
> the "memcached" server stopped accepting new connections.
>
> Second, details about the sockets...
>
> During the incident, we did capture summary information on the socket states 
> from various servers on the network, and a full "netstat -an" listing from
> one of the servers.  Both server-248 and server-85 showed tens of thousands 
> of sockets in a CLOSE_WAIT state, hundreds in a SYN_RCVD state, and a small
> number of ESTABLISHED sockets.
>
> There may have continued to be traffic on the ESTABLISHED connections to the
> "memcached" servers, but if there was, it was a trivial amount.  Multiple
> people were running "memkeys" at the time and reported seeing no activity.
>
> Third, "stats" from the incapacitated "memcached" processes...
>
> We do not have stats from either server-248 or server-85 during the time they 
> were in trouble.  In hindsight, that was a big oversight.
> It's not clear that we could have gotten a connection to the server to pull 
> the stats, but I'd really like to know what those counters said!
>
> I do have the results from "stats", "stats slabs", "stats items" and "stats 
> conns" from 17:10 the previous evening.  That doesn't show any obvious
> errors/problems slowly building up, waiting for some event to trigger a 
> massive failure.  But it's from ~15 hours before the server got into trouble 
> so I
> don't think it's all that helpful.
>
> -jj
>
> On Fri, Oct 12, 2018 at 6:24 PM dormando <dorma...@rydia.net> wrote:
>       Hey,
>
>       Probably unrelated to the original thread. I'm highly interested in
>       understanding what's going on here though, and would appreciate any help
>       you can spare.
>
>       What data did you collect? The more information you can share the
>       better:
>
>       1) start arguments for the daemon
>       2) stats output (stats, stats items, stats slabs, and importantly
>       "stats conns") - you will probably have to redact some of the output.
>       3) were you still able to connect to and run commands on those daemons
>       while they had a mass of CLOSE_WAIT?
>
>       Thanks,
>       -Dormando
>
>       On Fri, 12 Oct 2018, jjo...@smugmug.com wrote:
>
>       > We're testing version 1.5.10 as a possible upgrade candidate for our
>       > older memcached servers, using a pool of 9 servers.  They are running
>       > in parallel with the production pool, also 9 servers.  For the test,
>       > all read requests are going to the production pool, and all updates
>       > (set, delete, etc...) are sent to one server in the production pool
>       > and one server in the 1.5.10 pool via the key hashing algorithm.
>       >
>       > That setup had been running without incident for about 12 days, then
>       > yesterday two of the servers experienced the mass of CLOSE_WAIT
>       > connections similar to what's been described here.  We were able to
>       > collect some data, but not enough to figure out what's happening.  So
>       > I'm hoping to kickstart a discussion here about how to diagnose
>       > what's going on.  Until we can find a way to explain (and prevent)
>       > another problem like this, we're unable to upgrade.
>       >
>       > I can provide more information about our configuration.  I'm just not
>       > sure what bits are useful/interesting.  I will note that we're using
>       > "extstore" functionality on the new servers.
>       >
>       > -jj
