> I am just speculating, and by no means have any idea what I am really talking
> about here. :)
> With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been days.
> Increasing from 2 threads to 4 does not generate any more traffic or
> requests to memcached. Thus I am speculating that perhaps it is a race
> condition of some sort, only hitting with more than 2 threads.

Doesn't tell me anything useful, since I'm already looking for potential
races and don't see any possibility outside of libevent.

> Why do you say it will be less likely to happen with 2 threads than 4?

Nature of race conditions: the more threads you have running, the more
likely you are to hit them, sometimes by orders of magnitude.
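
As a rough illustration of that point (a toy example, nothing to do with
memcached's code): an unsynchronized counter loses more updates as you add
threads, for the same reason a real race gets easier to trip.

  /* race_demo.c: toy illustration, not memcached code.
   * build: cc -pthread race_demo.c -o race_demo
   * try:   ./race_demo 2   vs   ./race_demo 4
   * More threads -> more lost updates on the unprotected counter,
   * i.e. the same race becomes easier to observe. */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define ITERS 1000000
  static volatile long counter = 0; /* shared, deliberately unlocked */

  static void *worker(void *arg) {
      (void)arg;
      for (long i = 0; i < ITERS; i++)
          counter++; /* unsynchronized read-modify-write */
      return NULL;
  }

  int main(int argc, char **argv) {
      long n = (argc > 1) ? atol(argv[1]) : 4;
      if (n < 1 || n > 64) return 1;
      pthread_t tids[64];
      for (long i = 0; i < n; i++)
          pthread_create(&tids[i], NULL, worker, NULL);
      for (long i = 0; i < n; i++)
          pthread_join(tids[i], NULL);
      /* expected vs. actual shows how many updates were lost */
      printf("expected %ld, got %ld (lost %ld)\n",
             n * ITERS, counter, n * ITERS - counter);
      return 0;
  }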

It doesn't really change the fact that this has worked for many years and
the code *barely* changed recently. I just don't see it.

> On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
>       That doesn't really tell us anything about the nature of the problem
>       though. With 2 threads it might still happen, but it's a lot less likely.
>
>       On Wed, 7 May 2014, notifi...@commando.io wrote:
>
>       > Bumped up to 2 threads and so far no timeout errors. I'm going to let
>       > it run for a few more days, then revert back to 4 threads and see if
>       > timeout errors come up again. That will tell us whether the problem
>       > lies in running more than 2 threads.
>       >
>       > On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
>       >       Hey,
>       >
>       >       try this branch:
>       >       https://github.com/dormando/memcached/tree/double_close
>       >
>       >       so far as I can tell that emulates the behavior in .17...
>       >
>       >       to build:
>       >       ./autogen.sh && ./configure && make
>       >
>       >       run it in screen like you were doing with the other tests, and
>       >       see if it prints "ERROR: Double Close [somefd]". If it prints
>       >       that once and then stops, I guess that's what .17 was doing...
>       >       if it spams that message, then something else may have changed.
>       >
>       >       I'm mostly convinced something about your OS or build is
>       >       corrupt, but I have no idea what it is. The only other thing I
>       >       can think of is to instrument .17 a bit more and have you try
>       >       that (with the connection code laid out the old way, but with
>       >       a conn_closed flag to detect a double-close attempt), and see
>       >       if the old .17 still did it.
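>       >
>       >       The guard would look something like this (a sketch of the
>       >       idea, with hypothetical struct/field names, not an actual
>       >       patch):
>       >
>       >       #include <event.h>
>       >       #include <stdbool.h>
>       >       #include <stdio.h>
>       >       #include <unistd.h>
>       >
>       >       typedef struct {
>       >           int sfd;
>       >           bool conn_closed;   /* set once this conn is torn down */
>       >           struct event event;
>       >           /* ... rest of the real conn struct ... */
>       >       } conn;
>       >
>       >       static void conn_close(conn *c) {
>       >           if (c->conn_closed) {
>       >               /* second teardown attempt: log it instead of
>       >                  closing somebody else's fd out from under them */
>       >               fprintf(stderr, "ERROR: Double Close %d\n", c->sfd);
>       >               return;
>       >           }
>       >           c->conn_closed = true;
>       >           event_del(&c->event);
>       >           close(c->sfd);
>       >       }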
>       >
>       >       On Tue, 6 May 2014, notifi...@commando.io wrote:
>       >
>       >       > Changing from 4 threads to 1 seems to have resolved the
>       >       > problem. No timeouts since. Should I set to 2 threads and
>       >       > wait and see how things go?
>       >       >
>       >       > On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
>       >       >       and how'd that work out?
>       >       >
>       >       >       Still no other reports :/ a few thousand more
>       >       >       downloads of .19...
>       >       >
>       >       >       On Sun, 4 May 2014, notifi...@commando.io wrote:
>       >       >
>       >       >       > I'm going to try switching threads from 4 to 1. This
>       >       >       > host, web2, is the only one I am seeing it on, but it
>       >       >       > is also the only host that gets any real traffic.
>       >       >       > Super frustrating.
>       >       >       >
>       >       >       > On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando
>       >       >       > wrote:
>       >       >       >       I'm stumped. (also, your e-mails aren't
>       >       >       >       updating the ticket...).
>       >       >       >
>       >       >       >       It's impossible for a connection to get into
>       >       >       >       the closed state without having event_del()
>       >       >       >       and close() called on the socket. A socket
>       >       >       >       slot isn't event_add()'ed again until after
>       >       >       >       the state is reset to 'init_state'.
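>       >       >       >
>       >       >       >       Roughly this ordering, as a simplified sketch
>       >       >       >       with hypothetical names (conn_reuse, the
>       >       >       >       conn_states enum), not the actual source:
>       >       >       >
>       >       >       >       #include <event.h>
>       >       >       >       #include <unistd.h>
>       >       >       >
>       >       >       >       enum conn_states { conn_init, conn_active, conn_dead };
>       >       >       >
>       >       >       >       typedef struct {
>       >       >       >           int sfd;
>       >       >       >           enum conn_states state;
>       >       >       >           struct event event;
>       >       >       >       } conn;
>       >       >       >
>       >       >       >       void event_handler(int fd, short which, void *arg);
>       >       >       >
>       >       >       >       static void conn_close(conn *c) {
>       >       >       >           event_del(&c->event);  /* 1: stop event notifications */
>       >       >       >           close(c->sfd);         /* 2: then release the socket */
>       >       >       >           c->state = conn_dead;  /* 3: slot is now dead */
>       >       >       >       }
>       >       >       >
>       >       >       >       static void conn_reuse(conn *c, int new_sfd) {
>       >       >       >           c->sfd = new_sfd;
>       >       >       >           c->state = conn_init;  /* 4: reset state first... */
>       >       >       >           event_set(&c->event, c->sfd, EV_READ | EV_PERSIST,
>       >       >       >                     event_handler, c);
>       >       >       >           event_add(&c->event, NULL); /* 5: ...only then re-arm */
>       >       >       >       }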
>       >       >       >
>       >       >       >       There was no code path for event_del() to
>       >       >       >       actually fail, so far as I could see.
>       >       >       >
>       >       >       >       I've e-mailed Steven Grimm for ideas, but
>       >       >       >       either that's not his e-mail anymore or he's
>       >       >       >       not going to respond.
>       >       >       >
>       >       >       >       I really don't know. I guess the old code
>       >       >       >       would've just called conn_close again by
>       >       >       >       accident... I don't see how the logic changed
>       >       >       >       in any significant way in .18. Though again,
>       >       >       >       if it happened with any frequency, people's
>       >       >       >       curr_conns stat would go negative.
>       >       >       >
>       >       >       >       So... either that always happened and we
>       >       >       >       never noticed, or your particular OS is
>       >       >       >       corrupt. There are probably 10,000+ installs
>       >       >       >       of .18+ now and only one complaint, so I'm a
>       >       >       >       little hesitant to spend a ton of time on this
>       >       >       >       until we get more reports.
>       >       >       >
>       >       >       >       You should downgrade to .17.
>       >       >       >
>       >       >       >       On Sun, 4 May 2014, notifi...@commando.io wrote:
>       >       >       >
>       >       >       >       > Damn it, got a network timeout. CPU 3 is
>       >       >       >       > using 100% cpu from memcached. Here is the
>       >       >       >       > result of "stats" to verify the new versions
>       >       >       >       > of memcached and libevent:
>       >       >       >       >
>       >       >       >       > STAT version 1.4.19
>       >       >       >       > STAT libevent 2.0.18-stable
>       >       >       >       >
>       >       >       >       >
>       >       >       >       > On Saturday, May 3, 2014 11:55:31 PM UTC-7,
>       >       >       >       > notifi...@commando.io wrote:
>       >       >       >       >       Just upgraded all 5 web-servers to
>       >       >       >       >       memcached 1.4.19 with libevent 2.0.18.
>       >       >       >       >       Will advise if I see memcached
>       >       >       >       >       timeouts. Should be good though.
>       >       >       >       >
>       >       >       >       > Thanks so much for all the help and
>       >       >       >       > patience. Really appreciated.
>       >       >       >       >
>       >       >       >       > On Friday, May 2, 2014 10:20:26 PM UTC-7,
>       >       >       >       > memc...@googlecode.com wrote:
>       >       >       >       >       Updates:
>       >       >       >       >               Status: Invalid
>       >       >       >       >
>       >       >       >       >       Comment #20 on issue 363 by
>       >       >       >       >       dorma...@rydia.net: MemcachePool::get():
>       >       >       >       >       Server 127.0.0.1 (tcp 11211, udp 0)
>       >       >       >       >       failed with: Network timeout
>       >       >       >       >       http://code.google.com/p/memcached/issues/detail?id=363
>       >       >       >       >
>       >       >       >       >       Any repeat crashes? I'm going to close
>       >       >       >       >       this. It looks like Remi shipped .19.
>       >       >       >       >       Reopen or open a new one if it hangs
>       >       >       >       >       in the same way somehow...
>       >       >       >       >
>       >       >       >       >       Well, .19 won't be printing anything,
>       >       >       >       >       and it won't hang, but if it's
>       >       >       >       >       actually our bug and not libevent it
>       >       >       >       >       would end up spinning CPU. Keep an eye
>       >       >       >       >       out, I guess.