Can you give me a list (privately, if need be) of a few things:

- The exact OS your server is running (CentOS/RHEL release, etc.)
- The exact kernel version (and where it came from: CentOS/RHEL proper, or a
3rd-party repo?)
- A full list of your 3rd-party repos, since I know you had some random
French thing in there.
- A full list of the packages installed from 3rd-party repos.

It is extremely important that all of the software versions match exactly.

- Hardware details:
  - Network card(s), speeds
  - CPU type, number of cores (hyperthreading?)
  - Amount of RAM

- Is this a hardware machine, or a VM somewhere? If a VM, what provider?

- memcached stats snapshots again, from your machine after it's been
running a while:
  - "stats", "stats slabs", "stats items", "stats settings", "stats
conns".
^ That's five commands, don't forget any.

It's too difficult to try to debug the issue remotely when you hit it.
Usually when I'm at a gdb console I'm issuing a command every second or
two, but it takes us 10 minutes to get through 3-4 commands. It'd be nice
if I could attempt to reproduce it here.

I went digging more and there are some dup() bugs with epoll, but your
libevent is new enough to have those patched, and we're not using dup()
in a way that would trigger them.
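
For reference, the underlying epoll/dup() behavior those bugs run into
looks like this (a minimal standalone sketch of what's documented in
epoll(7), not memcached or libevent code): epoll tracks the open file
description, so closing the fd you registered neither stops events nor
lets you delete the registration through that fd.

/* sketch: epoll registrations follow the open file description, not the fd */
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    pipe(pfd);                                   /* pfd[0] = read end we watch */

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pfd[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev);

    int dupfd = dup(pfd[0]);                     /* same description, new fd */
    close(pfd[0]);                               /* "close" the registered fd */

    write(pfd[1], "x", 1);                       /* make it readable */

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, 0);
    printf("events after close(): %d\n", n);     /* still reports 1 */

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, pfd[0], NULL) == -1)
        perror("EPOLL_CTL_DEL on closed fd");    /* fails with EBADF */

    close(dupfd);
    close(pfd[1]);
    close(epfd);
    return 0;
}

That prints 1 and then EBADF on any Linux box; it's documented behavior
rather than a bug, but it's the same shape of failure: events keep
arriving for an fd you think you've closed, and you can't unregister it
anymore. It just shouldn't apply here, since (as above) we're not
dup()ing the client sockets like that.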

There was also an EPOLL_CTL_MOD race condition in the kernel, but so far
as I can tell, even with libevent 2.x, libevent isn't using that feature
for us.

The issue does smell like the bug you get with dup()'d descriptors - the
events keep firing and the fd sits half closed - but again, we're never
closing those sockets.

I can also make a branch with the new dup() calls explicitly removed, but
this continues to be an obnoxious, multi-week-long debugging exercise.

I'm convinced that the code in memcached is correct and the bug exists
outside of it (libevent or the kernel). There's simply no way for it to
hit that code path without closing the socket, and doubly so: epoll
automatically deletes an event when the socket is closed. We delete the
event, then close the socket, and it still comes back.
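
To be concrete about that ordering, here's a minimal standalone sketch
using the libevent 2.x API (an illustration of the sequence described
above, not memcached's actual conn_close() path):

/* sketch: the delete-then-close order described above, via libevent 2.x */
#include <stdio.h>
#include <unistd.h>
#include <event2/event.h>

static void on_read(evutil_socket_t fd, short what, void *arg)
{
    char buf[64];
    read(fd, buf, sizeof(buf));                  /* drain the byte */
    printf("event fired on fd %d\n", (int)fd);
    event_base_loopbreak((struct event_base *)arg);
}

int main(void)
{
    int pfd[2];
    pipe(pfd);

    struct event_base *base = event_base_new();
    struct event *ev = event_new(base, pfd[0], EV_READ | EV_PERSIST, on_read, base);
    event_add(ev, NULL);

    write(pfd[1], "x", 1);
    event_base_dispatch(base);                   /* handles the one event */

    /* teardown order from the text: unregister first, then close */
    event_del(ev);                               /* no longer in the event base */
    close(pfd[0]);                               /* epoll drops closed fds anyway */
    /* nothing is registered and the fd is gone; another callback for this
     * fd would have to come from the layers below us */

    event_free(ev);
    close(pfd[1]);
    event_base_free(base);
    return 0;
}

(Compile with -levent.) Once event_del() has returned and the fd is
closed, there's nothing left registered in our layer, which is why
another callback for that fd points at libevent or the kernel.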

It's also not possible for a connection to end up in the wrong thread,
since both connection initialization and close happen local to a thread.
We would need a new connection to come in with a duplicated fd, and if
that happened, nothing on your machine would work.
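
Roughly, the ownership invariant I'm describing looks like this
(hypothetical struct and helper names just to illustrate the point; not
memcached's actual conn code):

/* sketch: thread ownership fixed at init, checked at close (hypothetical names) */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct conn_sketch {
    int sfd;
    pthread_t owner;                     /* worker thread that created the conn */
};

static void conn_init_local(struct conn_sketch *c, int sfd)
{
    c->sfd = sfd;
    c->owner = pthread_self();           /* ownership is fixed at creation */
}

static void conn_close_local(struct conn_sketch *c)
{
    /* init and close both run on the owning worker; a close from any
     * other thread would trip this */
    assert(pthread_equal(c->owner, pthread_self()));
    /* event_del(...) and close(c->sfd) would go here */
    c->sfd = -1;
}

int main(void)
{
    struct conn_sketch c;
    conn_init_local(&c, 42);             /* stand-in fd number */
    conn_close_local(&c);                /* same thread: invariant holds */
    puts("ok");
    return 0;
}

Since the owner never changes after init, a close on the wrong thread
would require the same fd number showing up on two live connections at
once, and as above, if that were happening nothing else on the box would
work either.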

thanks.

On Thu, 8 May 2014, notificati...@commando.io wrote:

> I am just speculating, and by no means have any idea what I am really
> talking about here. :)
> With 2 threads, still solid, no timeouts, no runaway 100% CPU. It's been
> days. Increasing from 2 threads to 4 does not generate any more traffic or
> requests to memcached. Thus I am speculating it is perhaps a race condition
> of some sort, only hitting with > 2 threads.
>
> Why do you say it will be less likely to happen with 2 threads than 4?
>
> On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
>       That doesn't really tell us anything about the nature of the problem
>       though. With 2 threads it might still happen, but is a lot less likely.
>
>       On Wed, 7 May 2014, notifi...@commando.io wrote:
>
>       > Bumped up to 2 threads and so far no timeout errors. I'm going to let
>       > it run for a few more days, then revert back to 4 threads and see if
>       > timeout errors come up again. That will tell us whether the problem
>       > lies in spawning more than 2 threads.
>       >
>       > On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
>       >       Hey,
>       >
>       >       try this branch:
>       >       https://github.com/dormando/memcached/tree/double_close
>       >
>       >       so far as I can tell that emulates the behavior in .17...
>       >
>       >       to build:
>       >       ./autogen.sh && ./configure && make
>       >
>       >       run it in screen like you were doing with the other tests, see
>       >       if it prints "ERROR: Double Close [somefd]". If it prints that
>       >       once then stops, I guess that's what .17 was doing... if it
>       >       print spams, then something else may have changed.
>       >
>       >       I'm mostly convinced something about your OS or build is
>       >       corrupt, but I have no idea what it is. The only other thing I
>       >       can think of is to instrument .17 a bit more and have you try
>       >       that (with the connection code laid out the old way, but with a
>       >       conn_closed flag to detect a double close attempt), and see if
>       >       the old .17 still did it.
>       >
>       >       On Tue, 6 May 2014, notifi...@commando.io wrote:
>       >
>       >       > Changing from 4 threads to 1 seems to have resolved the
>       >       > problem. No timeouts since. Should I set to 2 threads and
>       >       > wait and see how things go?
>       >       >
>       >       > On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
>       >       >       and how'd that work out?
>       >       >
>       >       >       Still no other reports :/ a few thousand more
>       >       >       downloads of .19...
>       >       >
>       >       >       On Sun, 4 May 2014, notifi...@commando.io wrote:
>       >       >
>       >       >       > I'm going to try switching threads from 4 to 1. This
>       >       >       > host web2 is the only one I am seeing it on, but it
>       >       >       > also is the only host that gets any real traffic.
>       >       >       > Super frustrating.
>       >       >       >
>       >       >       >       On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
>       >       >       >       I'm stumped. (also, your e-mails aren't
>       >       >       >       updating the ticket...).
>       >       >       >
>       >       >       >       It's impossible for a connection to get into
>       >       >       >       the closed state without having event_del() and
>       >       >       >       close() called on the socket. A socket slot
>       >       >       >       isn't event_add()'ed again until after the
>       >       >       >       state is reset to 'init_state'.
>       >       >       >
>       >       >       >       There was no code path for event_del to
>       >       >       >       actually fail so far as I could see.
>       >       >       >
>       >       >       >       I've e-mailed steven grimm for ideas but
>       >       >       >       either that's not his e-mail anymore or he's
>       >       >       >       not going to respond.
>       >       >       >
>       >       >       >       I really don't know. I guess the old code
>       >       >       >       would've just called conn_close again by
>       >       >       >       accident... I don't see how the logic changed
>       >       >       >       in any significant way in .18. Though again,
>       >       >       >       if it happened with any frequency people's
>       >       >       >       curr_conns stat would go negative.
>       >       >       >
>       >       >       >       So... either that always happened and we never
>       >       >       >       noticed, or your particular OS is corrupt.
>       >       >       >       There're probably 10,000+ installs of .18+ now
>       >       >       >       and only one complaint, so I'm a little
>       >       >       >       hesitant to spend a ton of time on this until
>       >       >       >       we get more reports.
>       >       >       >
>       >       >       >       You should downgrade to .17.
>       >       >       >
>       >       >       >       On Sun, 4 May 2014, notifi...@commando.io wrote:
>       >       >       >
>       >       >       >       > Damn it, got a network timeout. CPU 3 is at
>       >       >       >       > 100% from memcached.
>       >       >       >       > Here is the result of stats to verify I am
>       >       >       >       > using the new version of memcached and
>       >       >       >       > libevent:
>       >       >       >       >
>       >       >       >       > STAT version 1.4.19
>       >       >       >       > STAT libevent 2.0.18-stable
>       >       >       >       >
>       >       >       >       >
>       >       >       >       > On Saturday, May 3, 2014 11:55:31 PM UTC-7, notifi...@commando.io wrote:
>       >       >       >       >       Just upgraded all 5 web-servers to
>       >       >       >       >       memcached 1.4.19 with libevent 2.0.18.
>       >       >       >       >       Will advise if I see memcached
>       >       >       >       >       timeouts. Should be good though.
>       >       >       >       >
>       >       >       >       > Thanks so much for all the help and patience.
>       >       >       >       > Really appreciated.
>       >       >       >       >
>       >       >       >       > On Friday, May 2, 2014 10:20:26 PM UTC-7, memc...@googlecode.com wrote:
>       >       >       >       >       Updates:
>       >       >       >       >               Status: Invalid
>       >       >       >       >
>       >       >       >       >       Comment #20 on issue 363 by
>       >       >       >       >       dorma...@rydia.net: MemcachePool::get():
>       >       >       >       >       Server 127.0.0.1 (tcp 11211, udp 0)
>       >       >       >       >       failed with: Network timeout
>       >       >       >       >       http://code.google.com/p/memcached/issues/detail?id=363
>       >       >       >       >
>       >       >       >       >       Any repeat crashes? I'm going to close
>       >       >       >       >       this. It looks like remi shipped .19.
>       >       >       >       >       Reopen or open a new one if it hangs in
>       >       >       >       >       the same way somehow...
>       >       >       >       >
>       >       >       >       >       Well, .19 won't be printing anything,
>       >       >       >       >       and it won't hang, but if it's actually
>       >       >       >       >       our bug and not libevent it would end
>       >       >       >       >       up spinning CPU. Keep an eye out I
>       >       >       >       >       guess.
