Re: memcached-1.4.20 stuck when "Too many open connections"

dormando Wed, 05 Nov 2014 16:09:07 -0800

There're too many things that will go wrong if malloc fails...

There's a stats counter covering some of them. Is that going up for you?


Have you disabled overcommit memory? Have you observed the process size
when it hangs? malloc should almost never actually fail under normal
conditions...

On Wed, 5 Nov 2014, Samdy Sun wrote:

> Hey,  
>   I also got a stuck when specifing "-m 200". 
>   As mentioned previously, that case could happend as below?
>   1. malloc fails when "conn_new()"  2. event_add fails when "conn_new()"
>   3. other case?
>
>   And I find another case after code reviewing. Here is, memcached stuck for 
> a while, for which our client close the connection because
> 200ms-timeout. So, if the previous 1023 connections get timeout and memcached 
> calls "transmit" to write, "Broken pipe" error will happend. And
> then, memcached get "TRANSMIT_HARD_ERROR" error and calls "conn_close" 
> immediately.
>   So, it will happend as below?
>   accept(), errno == EMFILE
>   fd1 close,
>   fd2 close,
>   fd3 close,
>   ……
>   fd1023 close,
>   accept_new_conns(false) for EMFILE
>
>   That just is a supposition, but I will try to log some infomation to prove 
> it.
>   
>   Any way, is it better to call "conn_close" after for a while, such as 
> waiting for next event when getting TRANSMIT_HARD_ERROR error then to
> conn_close immediately?
>   
>
> 在 2014年10月31日星期五UTC+8下午3时01分06秒，Dormando写道：
>       Hey,
>
>       How are you reproducing this? How many connections do you typically have
>       open?
>
>       It's really bizarre that your curr_conns is "5", but your connections 
> are
>       disabled? Even if there's still a race, as more connections close they
>       each have an opportunity to flip the acceptor back on.
>
>       Can you print what "stats settings" shows? If it's adjusting your actual
>       maxconns downward it should show there...
>
>       On Wed, 29 Oct 2014, Samdy Sun wrote:
>
>       > There are no deadlocks, (gdb) info thread
>       > * 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit (fd=431, 
> which=2, arg=0xfef8ce48)
>       >     at memcached.c:4044
>       >   4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in __kernel_vsyscall ()
>       >   3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in __kernel_vsyscall ()
>       >   2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in __kernel_vsyscall ()
>       >   1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in __kernel_vsyscall ()
>       > (gdb) t 1
>       > [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0  0x007ad430 
> in __kernel_vsyscall ()
>       > (gdb) bt
>       > #0  0x007ad430 in __kernel_vsyscall ()
>       > #1  0x005c5366 in epoll_wait () from /lib/libc.so.6
>       > #2  0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, 
> tv=0xff8e0cdc) at epoll.c:198
>       > #3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at 
> event.c:538
>       > #4  0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
>       > (gdb) 
>       >
>       > (gdb) t 2
>       > [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0  0x007ad430 
> in __kernel_vsyscall ()
>       > (gdb) bt
>       > #0  0x007ad430 in __kernel_vsyscall ()
>       > #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib/libpthread.so.0
>       > #2  0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
>       > #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
>       > #4  0x005c4aee in clone () from /lib/libc.so.6
>       > (gdb) t 3
>       > [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0  0x007ad430 
> in __kernel_vsyscall ()
>       > (gdb) bt
>       > #0  0x007ad430 in __kernel_vsyscall ()
>       > #1  0x005838b6 in nanosleep () from /lib/libc.so.6
>       > #2  0x005836e0 in sleep () from /lib/libc.so.6
>       > #3  0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
>       > #4  0x00a61a49 in start_thread () from /lib/libpthread.so.0
>       > #5  0x005c4aee in clone () from /lib/libc.so.6
>       > (gdb) t 4
>       > [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0  0x007ad430 
> in __kernel_vsyscall ()
>       > (gdb) bt
>       > #0  0x007ad430 in __kernel_vsyscall ()
>       > #1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib/libpthread.so.0
>       > #2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
>       > #3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
>       > #4  0x005c4aee in clone () from /lib/libc.so.6
>       > (gdb) t 5
>       > [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0  0x007ad430 
> in __kernel_vsyscall ()
>       > (gdb) bt
>       > #0  0x007ad430 in __kernel_vsyscall ()
>       > #1  0x00a68998 in sendmsg () from /lib/libpthread.so.0
>       > #2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at 
> memcached.c:4044
>       > #3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at 
> memcached.c:4370
>       > #4  event_handler (fd=431, which=2, arg=0xfef8ce48) at 
> memcached.c:4441
>       > #5  0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at 
> event.c:395
>       > #6  event_base_loop (base=0x9310658, flags=0) at event.c:547
>       > #7  0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
>       > #8  0x00a61a49 in start_thread () from /lib/libpthread.so.0
>       > #9  0x005c4aee in clone () from /lib/libc.so.6
>       > (gdb) 
>       >
>       > strace info, there is the only event named maxconnsevent on epoll?
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0
>       > epoll_wait(4, {}, 32, 10)               = 0
>       > clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0
>       >
>       >
>       >
>       > 在 2014年10月29日星期三UTC+8下午2时47分23秒，Samdy Sun写道：
>       >       Hello,  I got a "memcached-1.4.20 stuck" problem when EMFILE 
> happen.
>       >   Here are my memcached's cmdline "memcached -s /xxx/mc_usock.11201 
> -c 1024 -m 4000 -f 1.05 -o slab_automove -o slab_reassign  -t 1
>       -p
>       > 11201".
>       >  
>       >   cat /proc/version 
>       >   Linux version 2.6.32-358.el6.x86_64 
> (mock...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
> Hat 4.4.7-3) (GCC)
>       ) #1
>       > SMP Tue Jan 29 11:47:41 EST 2013
>       >
>       >   memcached-1.4.20 stuck and don't work any more when it runs for a 
> period of time.
>       >
>       >   Here are some information for gdb:  (gdb) p stats
>       >   $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
> __kind = 0, __nusers = 0, {__spins = 0, 
>       >         __list = {__next = 0x0}}}, __size = '\000' <repeats 23 
> times>, __align = 0}, curr_items = 149156, 
>       >   total_items = 9876811, curr_bytes = 3712501870, curr_conns = 5, 
> total_conns = 39738, rejected_conns = 0, 
>       >   malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 
> 0, set_cmds = 0, touch_cmds = 0, 
>       >   get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, 
> evictions = 0, reclaimed = 0, 
>       >   started = 0, accepting_conns = false, listen_disabled_num = 1, 
> hash_power_level = 17, 
>       >   hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 
> 0, evicted_unfetched = 0, 
>       >   slab_reassign_running = false, slabs_moved = 20, 
> lru_crawler_running = false, 
>       >   disable_write_by_exptime = 0, disable_write_by_length = 0, 
> disable_write_by_access = 0, 
>       >   evicted_write_reply_timeout_times = 0}
>       >
>       >   (gdb) p allow_new_conns
>       >   $4 = false
>       >
>       >   And I found that "allow_new_conns" just set to false when "accept" 
> failed and errno is "EMFILE". 
>       >   Here are the codes:  
>       > static void drive_machine(conn *c) {
>       >                  ……
>       >                  } else if (errno == EMFILE) {
>       >                    if (settings.verbose > 0)
>       >                          fprintf(stderr, "Too many open 
> connections\n");
>       >                    accept_new_conns(false);
>       >                    stop = true;
>       >                  } else {
>       >                  ……
>       > }
>       >   
>       >   If I change the flag "allow_new_conns", it can work again. As below:
>       >   (gdb) set allow_new_conns=1
>       >   (gdb) p allow_new_conns
>       >   $5 = true
>       >   (gdb) c
>       >   Continuing.
>       >
>       >   I know that "allow_new_conns" will be set to "true" when 
> "conn_close" called. But how could it happen for the case that when
>       "accept"
>       > failed , and errno is "EMFILE", and this connection is the only one 
> for accepting. Notes that curr_conns = 5.
>       >   Not run out of fd:
>       >   ls /proc/1748(memcached_pid)/fd | wc -l
>       >   17
>       >   
>       >
>       > --
>       >
>       > ---
>       > You received this message because you are subscribed to the Google 
> Groups "memcached" group.
>       > To unsubscribe from this group and stop receiving emails from it, 
> send an email to memcached+...@googlegroups.com.
>       > For more options, visit https://groups.google.com/d/optout.
>       >
>       >
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups 
> "memcached" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to memcached+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to memcached+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: memcached-1.4.20 stuck when "Too many open connections"

Reply via email to