memcached-1.4.20 stuck when "Too many open connections"

2014-10-28 Thread Samdy Sun
Hello,
  I hit a problem where memcached-1.4.20 gets stuck when EMFILE happens.
  Here is my memcached command line: "memcached -s /xxx/mc_usock.11201 -c 1024
-m 4000 -f 1.05 -o slab_automove -o slab_reassign -t 1 -p 11201".
 
  cat /proc/version 
  Linux version 2.6.32-358.el6.x86_64 
(mockbu...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 
(Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:47:41 EST 2013

  memcached-1.4.20 gets stuck and stops working after it has been running for
a period of time.

  Here is some information from gdb:
  (gdb) p stats
  $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0,
      __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
      __size = '\000' , __align = 0},
  curr_items = 149156, total_items = 9876811, curr_bytes = 3712501870,
  curr_conns = 5, total_conns = 39738, rejected_conns = 0,
  malloc_fails = 0, reserved_fds = 5, conn_structs = 1012,
  get_cmds = 0, set_cmds = 0, touch_cmds = 0,
  get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0,
  evictions = 0, reclaimed = 0,
  started = 0, accepting_conns = false, listen_disabled_num = 1,
  hash_power_level = 17, hash_bytes = 524288, hash_is_expanding = false,
  expired_unfetched = 0, evicted_unfetched = 0,
  slab_reassign_running = false, slabs_moved = 20, lru_crawler_running = false,
  disable_write_by_exptime = 0, disable_write_by_length = 0,
  disable_write_by_access = 0,
  evicted_write_reply_timeout_times = 0}

  (gdb) p allow_new_conns
  $4 = false

  And I found that "allow_new_conns" is only set to false when "accept" fails
and errno is "EMFILE".
  Here is the code:
static void drive_machine(conn *c) {
    ……
    } else if (errno == EMFILE) {
        if (settings.verbose > 0)
            fprintf(stderr, "Too many open connections\n");
        accept_new_conns(false);
        stop = true;
    } else {
    ……
}
  
  If I flip the flag "allow_new_conns" by hand, it works again:
  (gdb) set allow_new_conns=1
  (gdb) p allow_new_conns
  $5 = true
  (gdb) c
  Continuing.

  I know that "allow_new_conns" is set back to "true" when "conn_close" is
called. But how can this state arise when "accept" has failed with "EMFILE"
and the listening connection is the only one left accepting? Note that
curr_conns = 5.
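
  For context, here is roughly how the disable path looks as I read the
1.4.x source (a paraphrased, heavily trimmed sketch, so details may differ):

static pthread_mutex_t conn_lock;            /* guards allow_new_conns */
static bool allow_new_conns = true;
void do_accept_new_conns(const bool do_accept);

void accept_new_conns(const bool do_accept) {
    pthread_mutex_lock(&conn_lock);
    do_accept_new_conns(do_accept);
    pthread_mutex_unlock(&conn_lock);
}

void do_accept_new_conns(const bool do_accept) {
    /* ... (un)register the read event on every listening conn,
       and shrink the listen() backlog to 0 while disabled ... */
    if (do_accept) {
        stats.accepting_conns = true;
    } else {
        stats.accepting_conns = false;   /* matches my stats dump above */
        stats.listen_disabled_num++;     /* ... and listen_disabled_num = 1 */
        allow_new_conns = false;
        maxconns_handler(-42, 0, 0);     /* arm the recovery timer */
    }
}

  So once allow_new_conns is false, the only way back to accepting is for
some conn_close() to set it to true and for the recovery timer to notice.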
  It has not run out of fds:
  ls /proc/1748(memcached_pid)/fd | wc -l
  17
  



Re: memcached-1.4.20 stuck when "Too many open connections"

2014-10-28 Thread dormando
You're absolutely sure the running version was 1.4.20? That looks like a
bug that was fixed in .19 or .20.

hmmm... maybe a unix domain bug?



Re: memcached-1.4.20 stuck when "Too many open connections"

2014-10-29 Thread Samdy Sun
Thanks for the reply.
I'm absolutely sure the running version was 1.4.20, and I don't know how
this can happen.
Indeed, I use a unix domain socket in production; the inet port is just for
easy access via telnet.




Re: memcached-1.4.20 stuck when "Too many open connections"

2014-10-29 Thread Samdy Sun
In addition, the libevent version is "1.4.14b-stable".





Re: memcached-1.4.20 stuck when "Too many open connections"

2014-10-29 Thread Samdy Sun
There are no deadlocks:
(gdb) info thread
* 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit (fd=431, which=2, 
arg=0xfef8ce48)
at memcached.c:4044
  4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in __kernel_vsyscall ()
  3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in __kernel_vsyscall ()
  2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in __kernel_vsyscall ()
  1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in __kernel_vsyscall ()
(gdb) t 1
[Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0  0x007ad430 in 
__kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005c5366 in epoll_wait () from /lib/libc.so.6
#2  0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, 
tv=0xff8e0cdc) at epoll.c:198
#3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538
#4  0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
(gdb) 

(gdb) t 2
[Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0  0x007ad430 in 
__kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#2  0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 3
[Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0  0x007ad430 in 
__kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005838b6 in nanosleep () from /lib/libc.so.6
#2  0x005836e0 in sleep () from /lib/libc.so.6
#3  0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
#4  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#5  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 4
[Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0  0x007ad430 in 
__kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0
#2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0  0x007ad430 in 
__kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a68998 in sendmsg () from /lib/libpthread.so.0
#2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at 
memcached.c:4044
#3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370
#4  event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441
#5  0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at 
event.c:395
#6  event_base_loop (base=0x9310658, flags=0) at event.c:547
#7  0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
#8  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#9  0x005c4aee in clone () from /lib/libc.so.6
(gdb) 

strace info; is the only event left on epoll the one named maxconnsevent?
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0
epoll_wait(4, {}, 32, 10)   = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0
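
That 10ms epoll_wait loop looks consistent with the maxconns recovery
timer; roughly, as I read the 1.4.x source (paraphrased, details may
differ):

static struct event maxconnsevent;

static void maxconns_handler(const int fd, const short which, void *arg) {
    struct timeval t = {.tv_sec = 0, .tv_usec = 10000};   /* 10ms */

    if (fd == -42 || allow_new_conns == false) {
        /* still disabled: check the flag again in 10ms */
        evtimer_set(&maxconnsevent, maxconns_handler, 0);
        event_base_set(main_base, &maxconnsevent);
        evtimer_add(&maxconnsevent, &t);
    } else {
        /* some conn_close() set allow_new_conns = true: re-enable */
        evtimer_del(&maxconnsevent);
        accept_new_conns(true);
    }
}

So the main thread waking up every 10ms forever fits allow_new_conns being
stuck at false.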




Re: memcached-1.4.20 stuck when "Too many open connections"

2014-10-31 Thread dormando
Hey,

How are you reproducing this? How many connections do you typically have
open?

It's really bizarre that your curr_conns is "5" yet your connections are
disabled. Even if there's still a race, as more connections close they
each have an opportunity to flip the acceptor back on.

Can you print what "stats settings" shows? If it's adjusting your actual
maxconns downward it should show there...


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-10-31 Thread Samdy Sun
@Dormando,

  I have tried my best to reproduce this in my test environment, but failed;
it has only happened on my production servers.

  I use the "stats" command to check whether a memcached instance is
available. If it is unavailable, we stop sending requests to it.

  This is what I find strange: curr_conns is "5", yet memcached cannot
recover by itself. Could "conn_new" have failed, with its caller then
calling "close(fd)" directly rather than "conn_close()"? Such as below (see
the sketch after the list):

  1. malloc fails in "conn_new()"
  2. event_add fails in "conn_new()"
  3. some other case?
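
  For cases 1 and 2, the worker-side dispatch looks roughly like this as I
read thread.c (a paraphrased, trimmed sketch; names are from memory and may
be off), and the failure path calls a bare close():

static void thread_libevent_process(int fd, short which, void *arg) {
    LIBEVENT_THREAD *me = arg;
    CQ_ITEM *item = cq_pop(me->new_conn_queue);

    if (item != NULL) {
        conn *c = conn_new(item->sfd, item->init_state, item->event_flags,
                           item->read_buffer_size, item->transport, me->base);
        if (c == NULL) {
            /* conn_new() failed (malloc or event_add): the fd is closed
               directly, so allow_new_conns is never set back to true */
            if (settings.verbose > 0)
                fprintf(stderr, "Can't listen for events on fd %d\n",
                        item->sfd);
            close(item->sfd);
        }
        cqi_free(item);
    }
}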

  Note that I build "memcached" on a 32-bit system and run it on a 64-bit
system. Additionally, I start memcached with "-m 4000".

  Thanks,
  Samdy Sun


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-10-31 Thread dormando
Hey,

32-bit memcached with -m 4000 will never work; the best you can do is
probably -m 1600. 32-bit applications can typically only allocate up to 2G
of RAM.

memcached isn't protected against a lot of malloc failure scenarios, so what
you're doing will never work.

-m 4000 only limits the slab memory usage; there are a lot of buffers etc.
outside of that, plus the hash table, which is measured separately.
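
Rough arithmetic (illustrative numbers; the exact user/kernel split varies):

  usable 32-bit user address space : ~2-3 GB (more on some kernels)
  -m 4000                          : 4000 MB of slab memory alone -> over
  plus hash table                  : 2^17 buckets today, doubles as it grows
  plus everything else             : connection buffers, libevent, stacks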


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-02 Thread Samdy Sun
  Thanks, Dormando. I will try with a smaller memory limit.

  A 32-bit application can malloc about 3.8G of RAM on my 64-bit system, so
"-m 3200" may be OK?


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-02 Thread dormando
I don't really know offhand. Why are you running 32-bit memcached at all?
Just run it in 64-bit mode?


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-03 Thread Samdy Sun
Because we also have 32-bit systems, and 64-bit applications don't work on
them, we build memcached on a 32-bit system across the board for consistency.


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-05 Thread Samdy Sun
Hey,

  I also got a hang when specifying "-m 200".
  As mentioned previously, could it be one of these cases?
  1. malloc fails in "conn_new()"
  2. event_add fails in "conn_new()"
  3. some other case?

  And I found another case after reviewing the code. Suppose memcached
stalls for a while, so our client closes its connections after a 200ms
timeout. If the previous 1023 connections all time out and memcached then
calls "transmit" to write, a "Broken pipe" error will occur; memcached gets
"TRANSMIT_HARD_ERROR" and calls "conn_close" immediately.
  So, could it happen as below?
  accept(), errno == EMFILE
  fd1 close,
  fd2 close,
  fd3 close,
  ……
  fd1023 close,
  accept_new_conns(false) for EMFILE
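
  A toy program that forces this ordering deterministically shows why the
flag ends up stuck. This is not memcached code, just an illustration; the
pthread_join stands in for the unlucky scheduling:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static bool allow_new_conns = true;

/* worker side: what every conn_close() does to the flag */
static void *workers_close_conns(void *arg) {
    for (int fd = 1; fd <= 1023; fd++) {
        pthread_mutex_lock(&conn_lock);
        allow_new_conns = true;          /* re-enable accepting */
        pthread_mutex_unlock(&conn_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, workers_close_conns, NULL);
    pthread_join(t, NULL);               /* all 1023 closes finish first... */

    /* ...and only then does the listener react to the old EMFILE: */
    pthread_mutex_lock(&conn_lock);
    allow_new_conns = false;             /* accept_new_conns(false) */
    pthread_mutex_unlock(&conn_lock);

    /* no connections are left, so nothing ever sets it true again */
    printf("allow_new_conns = %s\n", allow_new_conns ? "true" : "false");
    return 0;
}

  It always prints "allow_new_conns = false", which matches the gdb state
above.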

  That is just a supposition, but I will try to log some information to
prove it.

  Anyway, would it be better to defer "conn_close" for a while when getting
a TRANSMIT_HARD_ERROR, such as waiting for the next event, instead of
calling conn_close immediately?


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-05 Thread dormando
There're too many things that will go wrong if malloc fails...

There's a stats counter (malloc_fails) covering some of them. Is that going
up for you?

Have you disabled overcommit memory? Have you observed the process size
when it hangs? malloc should almost never actually fail under normal
conditions...


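To make the deferred-close question above concrete, here is one possible
shape for it: a minimal sketch with hypothetical helpers (schedule_close,
reap_closing_conns) and a stand-in conn type. memcached's state machine does
have a conn_closing state, but this reaping loop is purely illustrative.
The idea is that the flood of closes then lands in a single event-loop pass
instead of racing the EMFILE handling.

#include <stddef.h>
#include <unistd.h>

typedef enum { conn_active, conn_closing } conn_state;  /* stand-ins */
typedef struct conn { int sfd; conn_state state; } conn;

/* On TRANSMIT_HARD_ERROR: park the connection instead of closing it. */
static void schedule_close(conn *c) {
    c->state = conn_closing;            /* no close() yet */
}

/* Run once per event-loop pass: perform all deferred closes together. */
static void reap_closing_conns(conn **conns, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (conns[i] != NULL && conns[i]->state == conn_closing) {
            close(conns[i]->sfd);
            conns[i] = NULL;            /* real code recycles the struct */
        }
    }
}
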
Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-05 Thread Samdy Sun
@Dormando,

  I don't think it's because malloc fails.
  I use "-m 200" and my server has 48G of RAM.
  After close(c->sfd), I print a trace line. Here is the code:

  static void conn_close(conn *c) {
      ……
      close(c->sfd);

      pthread_mutex_lock(&conn_lock);
      allow_new_conns = true;
      pthread_mutex_unlock(&conn_lock);
      // XXX: tracing fd close
      fprintf(stderr, "INFO`fd=%d`reason=%s`allow_new_conns=true\n",
              c->sfd, "conn_close");
      ……
  }

  Here are the logs,
  ……
  INFO`fd=816`reason=conn_close`allow_new_conns=true
  INFO`fd=817`reason=conn_close`allow_new_conns=true
  Failed to write, and not due to blocking: Broken pipe
  INFO`fd=818`reason=conn_close`allow_new_conns=true
  INFO`fd=819`reason=conn_close`allow_new_conns=true
  Failed to write, and not due to blocking: Broken pipe
  INFO`fd=820`reason=conn_close`allow_new_conns=true
  Failed to write, and not due to blocking: Broken pipe
  INFO`fd=821`reason=conn_close`allow_new_conns=true
  Failed to write, and not due to blocking: Broken pipe
  INFO`fd=822`reason=conn_close`allow_new_conns=true
  Failed to write, and not due to blocking: Broken pipe
  ……
  INFO`fd=1011`reason=conn_close`allow_new_conns=true
  INFO`fd=1012`reason=conn_close`allow_new_conns=true
  INFO`fd=1013`reason=conn_close`allow_new_conns=true
  INFO`fd=1014`reason=conn_close`allow_new_conns=true
  INFO`fd=1015`reason=conn_close`allow_new_conns=true
  INFO`fd=1016`reason=conn_close`allow_new_conns=true
  INFO`fd=1017`reason=conn_close`allow_new_conns=true
  INFO`fd=1018`reason=conn_close`allow_new_conns=true
  INFO`fd=1019`reason=conn_close`allow_new_conns=true
  INFO`fd=1020`reason=conn_close`allow_new_conns=true
  INFO`fd=1021`reason=conn_close`allow_new_conns=true
  INFO`fd=1022`reason=conn_close`allow_new_conns=true
  INFO`fd=1023`reason=conn_close`allow_new_conns=true
  accept(): Too many open files
  Too many open connections
  INFO`fd=-1`errno=24(Too many open files)`reason=Too many open connections
  allow_new_conns=false


It's very strange that accept() fails even after close(fd) has been called.
The flag "allow_new_conns" ends up set to false, so memcached is stuck.
But when I found memcached stuck, the number of fds in use was small:
ls /proc/6873/fd | wc -l
17
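
Putting the two paths from this thread side by side makes the suspected race
easier to see. Below is a minimal compilable sketch, not the real memcached
source: conn_lock and allow_new_conns are the actual globals under
discussion, while the conn stand-in and the two function names are
illustrative.

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

typedef struct conn { int sfd; } conn;  /* illustrative stand-in */

static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static bool allow_new_conns = true;

/* Listener thread: the EMFILE branch (stand-in for drive_machine). */
static void on_accept_failed(int err) {
    if (err == EMFILE) {
        /* If the thread stalls here on scheduling or lock contention,
         * the workers may close every connection first... */
        pthread_mutex_lock(&conn_lock);
        allow_new_conns = false;        /* ...and this stale verdict is
                                           then the last write, so the
                                           acceptor stays off for good. */
        pthread_mutex_unlock(&conn_lock);
    }
}

/* Worker thread: closing a connection (stand-in for conn_close). */
static void on_conn_close(conn *c) {
    close(c->sfd);
    pthread_mutex_lock(&conn_lock);
    allow_new_conns = true;             /* re-enable the acceptor */
    pthread_mutex_unlock(&conn_lock);
}

Because the flag is a plain last-writer-wins boolean, whichever thread
writes it last decides whether the daemon ever accepts again.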



On Thursday, November 6, 2014 at 8:08:50 AM UTC+8, Dormando wrote:
>
> There're too many things that will go wrong if malloc fails... 
>
> There's a stats counter covering some of them. Is that going up for you? 
>
> Have you disabled overcommit memory? Have you observed the process size 
> when it hangs? malloc should almost never actually fail under normal 
> conditions... 

Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-05 Thread Samdy Sun
Additionally, the malloc_fails stats counter looks fine; it is still 0:

(gdb) p stats
$1 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0,
      __list = {__next = 0x0}}}, __size = '\000' , __align = 0}, curr_items = 190673,
  total_items = 7728815, curr_bytes = 3765464159, curr_conns = 5, total_conns = 28838, rejected_conns = 0,
  malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0,
  get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 0, reclaimed = 0,
  started = 0, accepting_conns = false, listen_disabled_num = 1, hash_power_level = 17,
  hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, evicted_unfetched = 0,
  slab_reassign_running = false, slabs_moved = 19, lru_crawler_running = false,
  disable_write_by_exptime = 0, disable_write_by_length = 0, disable_write_by_access = 0,
  evicted_write_reply_timeout_times = 0}


Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-06 Thread Samdy Sun
Here is my assumption:

First, accept() fails with the EMFILE error, and then the main thread is
interrupted by a signal.
Second, the worker threads call the first 1023 conn_close() while the main
thread is interrupted.
Third, accept() is resumed and still returns the EMFILE error, because it is
reentrant.

Could this happen? It's very strange.

On Thursday, November 6, 2014 at 8:08:50 AM UTC+8, Dormando wrote:
>
> There're too many things that will go wrong if malloc fails... 
>
> There's a stats counter covering some of them. Is that going up for you? 
>
> Have you disabled overcommit memory? Have you observed the process size 
> when it hangs? malloc should almost never actually fail under normal 
> conditions... 

Re: memcached-1.4.20 stuck when "Too many open connections"

2014-11-06 Thread Samdy Sun
hmmm... I'm not sure about the assumption.
But here is a log from another occurrence:
26042 INFO`fd=568`reason=conn_close`allow_new_conns=true
26043 INFO`fd=569`reason=conn_close`allow_new_conns=true
26044 accept(): Too many open files
26045 INFO`fd=563`reason=conn_close`allow_new_conns=true
26046 Too many open connections
26047 INFO`fd=571`reason=conn_close`allow_new_conns=true
26048 INFO`fd=565`reason=conn_close`allow_new_conns=true
26049 Failed to write, and not due to blocking: Broken pipe
……
28050 INFO`fd=1020`reason=conn_close`allow_new_conns=true
28051 Failed to write, and not due to blocking: Broken pipe
28052 INFO`fd=1021`reason=conn_close`allow_new_conns=true
28053 Failed to write, and not due to blocking: Broken pipe
28054 INFO`fd=1022`reason=conn_close`allow_new_conns=true
28055 Failed to write, and not due to blocking: Broken pipe
28056 INFO`fd=1023`reason=conn_close`allow_new_conns=true
28057 INFO`fd=-1`errno=24(Too many open files)`reason=Too many open connections
28058 allow_new_conns=false

It seems that the main thread cannot acquire conn_lock while the worker
threads are closing connections, so the "allow_new_conns" flag is set to
"false" last.
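
If that reading is right, one hypothetical mitigation (a sketch only, not a
patch from this thread) is to make the disable decision conditional:
snapshot a close counter the moment EMFILE is seen, and skip disabling if
any connection was closed while waiting for conn_lock. The counter name
close_generation and both helper names are invented for illustration.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
static bool allow_new_conns = true;
static atomic_ulong close_generation;   /* hypothetical counter */

/* Worker thread (conn_close path): count every close under the lock. */
static void note_conn_closed(void) {
    pthread_mutex_lock(&conn_lock);
    allow_new_conns = true;
    atomic_fetch_add(&close_generation, 1);
    pthread_mutex_unlock(&conn_lock);
}

/* Listener thread, EMFILE branch: snapshot the counter before contending
 * for conn_lock, then only disable accepting if nothing was closed in
 * between; otherwise the EMFILE verdict is already stale. */
static void on_emfile(void) {
    unsigned long gen = atomic_load(&close_generation);
    pthread_mutex_lock(&conn_lock);
    if (atomic_load(&close_generation) == gen)
        allow_new_conns = false;
    pthread_mutex_unlock(&conn_lock);
}

With this ordering, the stuck case above (all 1023 closes finishing before
the main thread gets the lock) would leave allow_new_conns true.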
