Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-08 Thread notifications
I am just speculating, and by no means have any idea what I am really 
talking about here. :)

With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been 
days. Increasing from 2 threads to 4 does not generate any more traffic or 
requests to memcached. Thus I am speculating that perhaps it is a race 
condition of some sort, only hitting with > 2 threads.

Why do you say it will be less likely to happen with 2 threads than 4?

On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
>
> That doesn't really tell us anything about the nature of the problem 
> though. With 2 threads it might still happen, but is a lot less likely. 
>
> On Wed, 7 May 2014, notifi...@commando.io  wrote: 
>
> > Bumped up to 2 threads and so far no timeout errors. I'm going to let it 
> > run for a few more days, then revert back to 4 threads and see if timeout 
> > errors come up again. That will tell us whether the problem lies in spawning 
> > more than 2 threads. 
> > 
> > On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote: 
> >   Hey, 
> > 
> >   try this branch: 
> >   https://github.com/dormando/memcached/tree/double_close 
> > 
> >   so far as I can tell that emulates the behavior in .17... 
> > 
> >   to build: 
> >   ./autogen.sh && ./configure && make 
> > 
> >   run it in screen like you were doing with the other tests, see if it 
> >   prints "ERROR: Double Close [somefd]". If it prints that once then stops, 
> >   I guess that's what .17 was doing... if it print spams, then something 
> >   else may have changed. 
> > 
> >   I'm mostly convinced something about your OS or build is corrupt, but I 
> >   have no idea what it is. The only other thing I can think of is to 
> >   instrument .17 a bit more and have you try that (with the connection code 
> >   laid out the old way, but with a conn_closed flag to detect a double close 
> >   attempt), and see if the old .17 still did it. 
> > 
> >   On Tue, 6 May 2014, notifi...@commando.io wrote: 
> > 
> >   > Changing from 4 threads to 1 seems to have resolved the problem. 
> >   > No timeouts since. Should I set to 2 threads and wait and see how 
> >   > things go? 
> >   > 
> >   > On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote: 
> >   >   and how'd that work out? 
> >   > 
> >   >   Still no other reports :/ a few thousand more downloads of .19... 
> >   > 
> >   >   On Sun, 4 May 2014, notifi...@commando.io wrote: 
> >   > 
> >   >   > I'm going to try switching threads from 4 to 1. This host web2 is 
> >   >   > the only one I am seeing it on, but it also is the only host that 
> >   >   > gets any real traffic. Super frustrating. 
> >   >   > 
> >   >   > On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote: 
> >   >   >   I'm stumped. (also, your e-mails aren't updating the ticket...). 
> >   >   > 
> >   >   >   It's impossible for a connection to get into the closed state without 
> >   >   >   having event_del() and close() called on the socket. A socket slot isn't 
> >   >   >   event_add()'ed again until after the state is reset to 'init_state'. 
> >   >   > 
> >   >   >   There was no code path for event_del to actually fail so far as I could 
> >   >   >   see. 
> >   >   > 
> >   >   >   I've e-mailed steven grimm for ideas but either that's not his e-mail 
> >   >   >   anymore or he's not going to respond. 
> >   >   > 
> >   >   >   I really don't know. I guess the old code would've just called conn_close 
> >   >   >   again by accident... I don't see how the logic changed in any significant 
> >   >   >   way in .18. Though again, if it happened with any frequency people's 
> >   >   >   curr_conns stat would go negative. 
> >   >   > 
> >   >   >   So... either that always happened and we never noticed, or your particular 
> >   >   >   OS is corrupt. There're probably 10,000+ installs of .18+ now and only one 
> >   >   >   complaint, so I'm a little hesitant to spend a ton of time on this until 
> >   >   >   we get more reports. 
> >   >   > 
> >   >   >   You should downgrade to .17. 
> >   >   > 
> >   >   >   On Sun, 4 May 2014, notifi...@commando.io wrote: 
> >   >   > 
> >   >   >   > Damn it, got network timeout. CPU 3 is using 100% cpu from memcached. 
> >   >   >   > Here is the result of stat to verify using new version of memcached and libevent: 
> >   >   >   > 
> >   >   >   > STAT version 1.4.19 
> >   >   >   > STAT libevent 2.0.18-stable 
> >   > 

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-08 Thread dormando
> I am just speculating, and by no means have any idea what I am really talking 
> about here. :)
> With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been days. 
> Increasing from 2 threads to 4 does not generate any more traffic or
> requests to memcached. Thus I am speculating that perhaps it is a race 
> condition of some sort, only hitting with > 2 threads.

Doesn't tell me anything useful, since I'm already looking for potential
races and don't see any possibility outside of libevent.

> Why do you say it will be less likely to happen with 2 threads than 4?

Nature of race conditions: the more threads you have running the more
likely you are to hit them, sometimes by orders of magnitude.

It doesn't really change the fact that this has worked for many years and
the code *barely* changed recently. I just don't see it.
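
As a rough way to watch for the symptom while a multi-threaded run is going
(a sketch only, not from the original thread: it assumes memcached is on the
default 127.0.0.1:11211 and that nc is installed):

    # hypothetical check: sample CPU for the memcached process and the
    # connection counters once a minute; a repeated double close would
    # eventually show up as curr_connections drifting down (or going
    # negative), per the quoted mail below
    while sleep 60; do
        top -b -n1 -p "$(pidof memcached)" | tail -n1
        printf 'stats\r\nquit\r\n' | nc 127.0.0.1 11211 | grep -E 'curr_connections|threads|uptime'
    done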

> On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
>   That doesn't really tell us anything about the nature of the problem
>   though. With 2 threads it might still happen, but is a lot less likely.
>
>   On Wed, 7 May 2014, notifi...@commando.io wrote:
>
>   > Bumped up to 2 threads and so far no timeout errors. I'm going to let 
>   > it run for a few more days, then revert back to 4 threads and see if timeout 
>   > errors come up again. That will tell us whether the problem lies in spawning 
>   > more than 2 threads.
>   >
>   > On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
>   >       Hey,
>   >
>   >       try this branch:
>   >       https://github.com/dormando/memcached/tree/double_close
>   >
>   >       so far as I can tell that emulates the behavior in .17...
>   >
>   >       to build:
>   >       ./autogen.sh && ./configure && make
>   >
>   >       run it in screen like you were doing with the other tests, see if it
>   >       prints "ERROR: Double Close [somefd]". If it prints that once then stops,
>   >       I guess that's what .17 was doing... if it print spams, then something
>   >       else may have changed.
>   >
>   >       I'm mostly convinced something about your OS or build is corrupt, but I
>   >       have no idea what it is. The only other thing I can think of is to
>   >       instrument .17 a bit more and have you try that (with the connection code
>   >       laid out the old way, but with a conn_closed flag to detect a double close
>   >       attempt), and see if the old .17 still did it.
>   >
>   >       On Tue, 6 May 2014, notifi...@commando.io wrote:
>   >
>   >       > Changing from 4 threads to 1 seems to have resolved the problem. 
>   >       > No timeouts since. Should I set to 2 threads and wait and see how 
>   >       > things go?
>   >       >
>   >       > On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
>   >       >       and how'd that work out?
>   >       >
>   >       >       Still no other reports :/ a few thousand more downloads of .19...
>   >       >
>   >       >       On Sun, 4 May 2014, notifi...@commando.io wrote:
>   >       >
>   >       >       > I'm going to try switching threads from 4 to 1. This host web2 
>   >       >       > is the only one I am seeing it on, but it also is the only host 
>   >       >       > that gets any real traffic. Super frustrating.
>   >       >       >
>   >       >       > On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
>   >       >       >       I'm stumped. (also, your e-mails aren't updating the ticket...).
>   >       >       >
>   >       >       >       It's impossible for a connection to get into the closed state without
>   >       >       >       having event_del() and close() called on the socket. A socket slot isn't
>   >       >       >       event_add()'ed again until after the state is reset to 'init_state'.
>   >       >       >
>   >       >       >       There was no code path for event_del to actually fail so far as I could
>   >       >       >       see.
>   >       >       >
>   >       >       >       I've e-mailed steven grimm for ideas but either that's not his e-mail
>   >       >       >       anymore or he's not going to respond.
>   >       >       >
>   >       >       >       I really don't know. I guess the old code would've just called conn_close
>   >       >       >       again by accident... I don't see how the logic changed in any significant
>   >       >       >       way in .18. Though again, if it happened with any frequency people's
>   >       >       >       curr_conns stat would go negative.
>   >       >       >
>   >       >       >       So... either that always happened and we never noticed, or your particular
>   >       >       >       OS is corrupt. There're proba

Re: Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

2014-05-08 Thread dormando
To that note, it *is* useful if you try that branch I posted, since so far
as I can tell that should emulate the .17 behavior.
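
Spelled out, the test run described in the quoted mail below looks roughly
like this (the branch URL and build line are from that mail; the runtime
flags are only placeholders for whatever web2 normally runs with):

    # sketch only: fetch and build the double_close branch
    git clone https://github.com/dormando/memcached.git
    cd memcached && git checkout double_close
    ./autogen.sh && ./configure && make

    # run it in a screen session and watch for "ERROR: Double Close [somefd]"
    screen -S memcached-test
    ./memcached -u memcached -p 11211 -m 64 -t 4 -vv 2>&1 | tee double_close.log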

On Thu, 8 May 2014, dormando wrote:

> > I am just speculating, and by no means have any idea what I am really 
> > talking about here. :)
> > With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been 
> > days. Increasing from 2 threads to 4 does not generate any more traffic or
> > requests to memcached. Thus I am speculating that perhaps it is a race 
> > condition of some sort, only hitting with > 2 threads.
>
> Doesn't tell me anything useful, since I'm already looking for potential
> races and don't see any possibility outside of libevent.
>
> > Why do you say it will be less likely to happen with 2 threads than 4?
>
> Nature of race conditions: the more threads you have running the more
> likely you are to hit them, sometimes by orders of magnitude.
>
> It doesn't really change the fact that this has worked for many years and
> the code *barely* changed recently. I just don't see it.
>
> > On Wednesday, May 7, 2014 5:38:47 PM UTC-7, Dormando wrote:
> >   That doesn't really tell us anything about the nature of the problem
> >   though. With 2 threads it might still happen, but is a lot less likely.
> >
> >   On Wed, 7 May 2014, notifi...@commando.io wrote:
> >
> >   > Bumped up to 2 threads and so far no timeout errors. I'm going to 
> >   > let it run for a few more days, then revert back to 4 threads and see if 
> >   > timeout errors come up again. That will tell us whether the problem lies in 
> >   > spawning more than 2 threads.
> >   >
> >   > On Wednesday, May 7, 2014 5:19:13 PM UTC-7, Dormando wrote:
> >   >       Hey,
> >   >
> >   >       try this branch:
> >   >       https://github.com/dormando/memcached/tree/double_close
> >   >
> >   >       so far as I can tell that emulates the behavior in .17...
> >   >
> >   >       to build:
> >   >       ./autogen.sh && ./configure && make
> >   >
> >   >       run it in screen like you were doing with the other tests, see if it
> >   >       prints "ERROR: Double Close [somefd]". If it prints that once then stops,
> >   >       I guess that's what .17 was doing... if it print spams, then something
> >   >       else may have changed.
> >   >
> >   >       I'm mostly convinced something about your OS or build is corrupt, but I
> >   >       have no idea what it is. The only other thing I can think of is to
> >   >       instrument .17 a bit more and have you try that (with the connection code
> >   >       laid out the old way, but with a conn_closed flag to detect a double close
> >   >       attempt), and see if the old .17 still did it.
> >   >
> >   >       On Tue, 6 May 2014, notifi...@commando.io wrote:
> >   >
> >   >       > Changing from 4 threads to 1 seems to have resolved the problem. 
> >   >       > No timeouts since. Should I set to 2 threads and wait and see how 
> >   >       > things go?
> >   >       >
> >   >       > On Tuesday, May 6, 2014 12:07:08 AM UTC-7, Dormando wrote:
> >   >       >       and how'd that work out?
> >   >       >
> >   >       >       Still no other reports :/ a few thousand more downloads of .19...
> >   >       >
> >   >       >       On Sun, 4 May 2014, notifi...@commando.io wrote:
> >   >       >
> >   >       >       > I'm going to try switching threads from 4 to 1. This host web2 
> >   >       >       > is the only one I am seeing it on, but it also is the only host 
> >   >       >       > that gets any real traffic. Super frustrating.
> >   >       >       >
> >   >       >       > On Sunday, May 4, 2014 10:12:08 AM UTC-7, Dormando wrote:
> >   >       >       >       I'm stumped. (also, your e-mails aren't updating the ticket...).
> >   >       >       >
> >   >       >       >       It's impossible for a connection to get into the closed state without
> >   >       >       >       having event_del() and close() called on the socket. A socket slot isn't
> >   >       >       >       event_add()'ed again until after the state is reset to 'init_state'.
> >   >       >       >
> >   >       >       >       There was no code path for event_del to actually fail so far as I could
> >   >       >       >       see.
> >   >       >       >
> >   >       >       >       I've e-mailed steven grimm for ideas but either that's not his e-mail
> >   >       >       >       anymore or he's not going to respond.
> >   >       >       >
> >   >       >       >       I really don't know. I guess the old code would've just called conn_close
> >   >       >       >       again by accident... I don't see how the logic changed in any signi

Re: MEMCACHED_SERVER_MEMORY_ALLOCATION_FAILURE (SERVER_ERROR out of memory storing object) error

2014-05-08 Thread dormando
> Dormando, yes, I have to admit - we cache too aggressively (I just do not want 
> to use a different, less polite word :)).
>
> Going to do two test experiments: enable compression and auto reallocation. 
> Before doing this:
> 1) Why is auto reallocation not enabled by default, and what issues/disadvantages 
> should we expect?

Because it pulls memory from other places and evicts those items
regardless of whether they were still valid or expired. There's no way for it
to reassign slab pages of "just expired memory". Some people would prefer
to just let evictions fall from the tail (least used) rather than do this,
so we didn't change the defaults after introducing the feature.
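
For reference, turning the feature on looks roughly like this (a sketch
assuming a 1.4.11+ build where slab_reassign/slab_automove exist; the memory
size and port are placeholders):

    # start with slab page reassignment allowed and the automover enabled
    memcached -u memcached -m 24576 -p 11211 -o slab_reassign,slab_automove

    # or toggle the automover at runtime via the text protocol (see protocol.txt)
    printf 'slabs automove 1\r\nquit\r\n' | nc 127.0.0.1 11211

    # pages can also be moved by hand between slab classes (source -> destination)
    printf 'slabs reassign 5 12\r\nquit\r\n' | nc 127.0.0.1 11211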

> 2) Why does memcached not have compression on the server side when CPU is idle? 
> Is it because of the philosophy of keeping it simple and fast? (just asking)

I said this already: in the typical use case there are many more clients than
servers, and a very high rate of usage. If you flipped where the compression
happens, the server would run out of CPU very quickly and become much more
latent. We could support it in the server, but it would be a very low priority feature.
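
As an aside, the per-class eviction picture discussed in the quoted mail below
can be pulled with something like this (assumes the default 127.0.0.1:11211
and that nc is available):

    # per-class eviction counters (evicted, evicted_time, evicted_unfetched)
    printf 'stats items\r\nquit\r\n' | nc 127.0.0.1 11211 | grep evicted

    # per-class memory layout
    printf 'stats slabs\r\nquit\r\n' | nc 127.0.0.1 11211 | grep -E 'chunk_size|total_pages|used_chunks'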

> On Tuesday, May 6, 2014 6:40:07 PM UTC-7, Dormando wrote:
>   > Hi Dormando,
>   > Full Slabs and Items stats are below. The problem is that other slabs 
>   > are full too, so rebalancing is not trivial. I will try to create a wrapper 
>   > that will do some analysis and do slab rebalancing based on stats 
>   > (the idea is to try to shrink slabs with low eviction, but I need to 
>   > think more). But I see there is "Slabs Automove" in protocol.txt. Do you 
>   > recommend it?
>
>   If it fits your needs. Otherwise, write an external daemon that controls
>   the automover based on your own needs.
>
>   > > You either need to add more memory to the total system or rebalance them. 
>   > We run many memcached servers with 30GB+ of memory in each box, and 
>   > the problem occurs on some boxes periodically. So I am thinking about 
>   > how to convert the manual restart into an automatic action.
>
>   I'm not sure why restarting will fix it, if above you say "rebalancing is 
>   not trivial". If restarting would fix it, rebalancing would also fix it.
>
>   From the stats below, you do have a fair amount of memory spread out
>   among the higher order slab classes. Compression, or otherwise
>   re-evaluating how you store those values may make a big difference.
>
>   There's also a huge amount of stuff being evicted without ever being
>   fetched again. Are you caching too aggressively, or is memory just way too 
>   small and they never get a chance to be fetched after being set?
>
>   I'm just eyeballing it but evicted_time seems pretty short (a matter of
>   hours). That's the last access time of the last object to be evicted...
>   and it's like that across most of your slab classes.
>
>   So, shuffle and compress and whatnot, but I think you're out of ram dude.
>
>   > server
>   > stats
>   > STAT pid 15480
>   > STAT uptime 2476264
>   > STAT time 1399422427
>   > STAT version 1.4.15
>   > STAT libevent 1.4.13-stable
>   > STAT pointer_size 64
>   > STAT rusage_user 639012.117392
>   > STAT rusage_system 2076810.323840
>   > STAT curr_connections 5237
>   > STAT total_connections 122995977
>   > STAT connection_structures 23402
>   > STAT reserved_fds 40
>   > STAT cmd_get 91928675147
>   > STAT cmd_set 4358475896
>   > STAT cmd_flush 1
>   > STAT cmd_touch 0
>   > STAT get_hits 85005900667
>   > STAT get_misses 6922774480
>   > STAT delete_misses 4238049567
>   > STAT delete_hits 885535057
>   > STAT incr_misses 0
>   > STAT incr_hits 0
>   > STAT decr_misses 0
>   > STAT decr_hits 0
>   > STAT cas_misses 1074
>   > STAT cas_hits 4784930
>   > STAT cas_badval 14966
>   > STAT touch_hits 0
>   > STAT touch_misses 0
>   > STAT auth_cmds 0
>   > STAT auth_errors 0
>   > STAT bytes_read 32317259718167
>   > STAT bytes_written 221039272582722
>   > STAT limit_maxbytes 25769803776
>   > STAT accepting_conns 1
>   > STAT listen_disabled_num 0
>   > STAT threads 8
>   > STAT conn_yields 0
>   > STAT hash_power_level 25
>   > STAT hash_bytes 268435456
>   > STAT hash_is_expanding 0
>   > STAT slab_reassign_running 0
>   > STAT slabs_moved 0
>   > STAT bytes 23567307974
>   > STAT curr_items 32559669
>   > STAT total_items 61290586
>   > STAT expired_unfetched 6664504
>   > STAT evicted_unfetched 1244432758
>   > STAT evictions 2522683859
>   > STAT reclaimed 7626148
>   > END
>   >
>   >
>   >
>   > stats slabs
>   > STAT 1:chunk_size 96
>   > STAT 1:chunks_per_page 10922
>   > STAT 1:total_pages 1
>   > STAT 1:total_chunks 10922
>   > STAT 1:used_chunks 0
>   > STAT 1:free_chunks 10922
>   > STAT 1