Inline responses below. Also, I dunno if I missed it, but what version were you on originally? Are the start arguments the same?
On Sat, 13 Oct 2018, Jim Jones wrote:

> The "-n" and "ext_item_size" options were picked to help handle the large volume of small key values our service generates. But if increasing either will prevent potential deadlocks, I'm sure we'd prefer to accept the efficiency hit.

Those values don't necessarily translate to doing what you want; I'll illustrate that once I get a gander at your stats.

> The two servers that stopped accepting connections were only getting write operations, that's correct. They are operating as shadow copies of the cache during this parallel testing period. I do see trivial numbers of "get" requests, but it's a tiny number compared to the number of "set" and "delete" commands.

Got it.

> The code generating the memcache requests is PHP, with "twemproxy" (v0.4.1) handling all the communications with the memcache servers. I can dig into the code to get more specific information on the PHP setup if that's helpful.

Ok, so ascii only, then. Just need to know the general commands being used and the protocol, at this point.

> I'll relay stats from one of the servers via private e-mail in case those counters help suggest things we can try.
>
> And we'll definitely gather the info you described in the event we hit this problem again.

Thanks! Appreciate your patience.

> -jj
>
> On Sat, Oct 13, 2018 at 10:07 AM dormando <dorma...@rydia.net> wrote:
>
> > Sounds like the daemon hardlocked. Some of your start arguments are fairly aggressive (ext_item_size and -n especially); I'll double check that those won't cause problems like this.
> >
> > First, to confirm: these two hung machines were only getting writes the whole time? No reads?
> >
> > Any info you can share about the client? What library, protocol (ascii or binary, etc)?
> >
> > If you can privately share with me any of your stats snapshots that would help a lot as well, since I can better determine which features are being exercised.
> >
> > Aside from that, there don't seem to be many hints here. If it happens again, please do:
> >
> > 1) See if you can still connect to it and run commands; grab stats if so.
> > 2) Grab a GDB backtrace before killing the daemon:
> >
> >     gdb -p $(pidof memcached)
> >     thread apply all bt
> >     ^ full thread backtrace.
> >
> > If it's hardlocked the backtrace *should* immediately tell me what's going on.
> >
> > Sorry for the trouble!
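A side note on steps 1 and 2 above: they're easy to pre-stage as a small script kept on the boxes, so whoever catches the next incident only has to run one thing before restarting the daemon. Just a sketch; the host, port and output paths are assumptions, and it presumes nc, gdb and pidof are installed:

    #!/bin/sh
    # Illustrative incident-capture helper -- adjust host/port/paths to your setup.
    HOST=127.0.0.1
    PORT=11211
    OUT=/var/tmp/memcached-incident-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$OUT"

    # Step 1: try to pull stats over the ascii protocol (this may stall if the
    # daemon is hardlocked, which is useful information by itself).
    for cmd in "stats" "stats items" "stats slabs" "stats conns"; do
        printf '%s\r\nquit\r\n' "$cmd" | nc -w 5 "$HOST" "$PORT" > "$OUT/$(echo "$cmd" | tr ' ' '_').txt"
    done

    # Step 2: full backtrace of every thread, taken before the daemon is killed.
    gdb -p "$(pidof memcached)" -batch -ex 'thread apply all bt' > "$OUT/backtrace.txt" 2>&1

That way whatever stats are still reachable and the full backtrace land in one timestamped directory.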
> > On Fri, 12 Oct 2018, Jim Jones wrote:
> >
> > > The commandline arguments used are:
> > >
> > > -u memcached -m 236544 -c 64000 -p 11211 -t 32 -C -n 5 -f 1.05 -o ext_path=/mnt/memcache:1700G,ext_path=/mnt1/memcache:1700G,ext_path=/mnt2/memcache:1700G,ext_path=/mnt3/memcache:1700G,ext_threads=32,ext_item_size=64
> > >
> > > And we have some data, but frankly when the issue was happening we focused on the Memcache servers late in the process. The initial errors suggested the problem was in a very different part of the collection of technical components in our larger service. When we realized the "memcached" processes were not accepting new connections, we wanted to correct the behavior quickly since a fair amount of time had already passed.
> > >
> > > First, sockets in use on the servers...
> > >
> > > One system, call it server-248, shows TCP sockets on the system hovering around 1900 after traffic ramped up for the day. It held at that level from ~6:45am until 10:06am. We collect SAR data every 2 minutes, so the next reading was at 10:08 and the TCP sockets jumped to 63842. Meaning it didn't grow slowly over time; it jumped from 1937 to 63842 in a 2 minute window. That number was 63842-63844 until 12:06pm when we restarted the "memcached" process. After that the number dropped over time back to a more typical level.
> > >
> > > 10:02am  1937
> > > 10:04am  1937
> > > 10:06am  1937
> > > 10:08am  63842
> > > 10:10am  63842
> > > ...etc...
> > > 12:04pm  63843
> > > 12:06pm  63844
> > > 12:08pm  18415
> > > 12:10pm  17202
> > > 12:12pm  16333
> > > 12:14pm  16197
> > > 12:16pm  16134
> > > 12:18pm  16099
> > > 12:20pm  1617
> > >
> > > The other system which ran into trouble, which I'll call server-85, exhibited similar behavior but started later. Here's a sample of TCP socket counts from that server.
> > >
> > > 11:30am  1805
> > > 11:32am  1801
> > > 11:34am  1830
> > > 11:36am  1817
> > > 11:38am  63905
> > > 11:40am  63905
> > > ...etc...
> > > 12:20pm  63908
> > > 12:22pm  63908
> > > 12:24pm  1708
> > > 12:26pm  1720
> > > 12:28pm  1747
> > >
> > > There were other network-centric datapoints that show the systems grind to a halt in terms of accepting new connections, like the bandwidth going in/out of the NICs, etc... But it's all in support of the same idea, that the "memcached" server stopped accepting new connections.
> > >
> > > Second, details about the sockets...
> > >
> > > During the incident, we did capture summary information on the socket states from various servers on the network, and a full "netstat -an" listing from one of the servers. Both server-248 and server-85 showed tens of thousands of sockets in a CLOSE_WAIT state, hundreds in a SYN_RCVD state, and a small number of ESTABLISHED sockets.
> > >
> > > There may continue to be traffic on the ESTABLISHED connections to the "memcached" servers, but if there is it's a trivial amount. Multiple people were running "memkeys" at the time and report seeing no activity.
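For those socket-state summaries: a quick per-state count aimed at the memcached port captures the same picture and is cheap to run repeatedly during an event. Something along these lines, assuming Linux "netstat -an" output and port 11211 (adjust the field positions if your netstat formats differently):

    # Per-state counts for sockets on the memcached port (state is field 6 in Linux "netstat -an").
    netstat -an | awk '$1 ~ /^tcp/ && $4 ~ /:11211$/ {print $6}' | sort | uniq -c | sort -rn

    # Keep the full listing too, in case the individual peers matter later.
    netstat -an > netstat-$(hostname)-$(date +%s).txt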
> > > Third, "stats" from the incapacitated "memcached" processes...
> > >
> > > We do not have stats from either server-248 or server-85 during the time they were in trouble. In hindsight, that was a big oversight. It's not clear that we could have gotten a connection to the server to pull the stats, but I'd really like to know what those counters said!
> > >
> > > I do have the results from "stats", "stats slabs", "stats items" and "stats conns" from 17:10 the previous evening. That doesn't show any obvious errors/problems slowly building up, waiting for some event to trigger a massive failure. But it's from ~15 hours before the server got into trouble, so I don't think it's all that helpful.
> > >
> > > -jj
> > >
> > > On Fri, Oct 12, 2018 at 6:24 PM dormando <dorma...@rydia.net> wrote:
> > >
> > > > Hey,
> > > >
> > > > Probably unrelated to the original thread. I'm highly interested in understanding what's going on here though, and would appreciate any help you can spare.
> > > >
> > > > What data did you collect? The more information you can share the better:
> > > >
> > > > 1) start arguments for the daemon
> > > > 2) stats output (stats, stats items, stats slabs, and importantly "stats conns") - you will probably have to redact some of the output.
> > > > 3) were you still able to connect to and run commands on those daemons while they had a mass of CLOSE_WAIT?
> > > >
> > > > Thanks,
> > > > -Dormando
> > > >
> > > > On Fri, 12 Oct 2018, jjo...@smugmug.com wrote:
> > > >
> > > > > We're testing version 1.5.10 as a possible upgrade candidate for our older memcached servers, using a pool of 9 servers. They are running in parallel with the production pool, also 9 servers. For the test all read requests are going to the production pool, and all updates (set, delete, etc...) are sent to one server in the production pool and one server in the 1.5.10 pool via the key hashing algorithm.
> > > > >
> > > > > That setup had been running without incident for about 12 days, then yesterday two of the servers experienced the mass of CLOSE_WAIT connections similar to what's been described here. We were able to collect some data, but not enough to figure out what's happening. So I'm hoping to kickstart a discussion here about how to diagnose what's going on. Until we can find a way to explain (and prevent) another problem like this, we're unable to upgrade.
> > > > >
> > > > > I can provide more information about our configuration. I'm just not sure what bits are useful/interesting. I will note that we're using "extstore" functionality on the new servers.
> > > > >
> > > > > -jj
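One more thought on the missing stats from the incident window: a dumb periodic snapshot is cheap and gives a before/during/after picture if a daemon wedges again. A sketch of a crontab entry, assuming nc is installed; the path and interval are arbitrary:

    # Illustrative crontab entry: dump "stats" every 2 minutes with a timestamp,
    # so there's a recent baseline to diff against the next hang.
    */2 * * * * ( date; printf 'stats\r\nquit\r\n' | nc -w 5 127.0.0.1 11211 ) >> /var/tmp/memcached-stats.log 2>&1

Trim or rotate that file however you normally would; "stats" output is small, but it adds up.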