Hey, 32-bit memcached with -m 4000 will never work. The best you can do is probably -m 1600; 32-bit applications can typically only allocate up to 2 GB of RAM.
memcached isn't protected from a lot of malloc failure scenarios, so what you're doing will never work. -m 4000 only limits the slab memory usage; there are a lot of buffers etc. outside of that, and also the hash table, which is measured separately.

On Fri, 31 Oct 2014, Samdy Sun wrote:

> @Dormando,
> I tried my best to reproduce this in my environment, but failed. It just
> happened on my servers.
>
> I use the "stats" command to check whether memcached is available or not.
> If memcached is unavailable, we do not send requests to it.
>
> What I find strange is that my curr_conns is "5" and yet memcached can't
> recover itself. I think "conn_new" may fail and call "close(fd)" directly
> rather than "conn_close()". For example:
>
> 1. malloc fails in "conn_new()"
> 2. event_add fails in "conn_new()"
> 3. some other case?
>
> Note that I build memcached on a 32-bit system and run it on a 64-bit
> system. Additionally, I start memcached with "-m 4000".
>
> Thanks,
> Samdy Sun
>
> On Friday, October 31, 2014 at 3:01:06 PM UTC+8, Dormando wrote:
> Hey,
>
> How are you reproducing this? How many connections do you typically have
> open?
>
> It's really bizarre that your curr_conns is "5" but your connections are
> disabled. Even if there's still a race, as more connections close they
> each have an opportunity to flip the acceptor back on.
>
> Can you print what "stats settings" shows? If it's adjusting your actual
> maxconns downward, it should show there...
> > On Wed, 29 Oct 2014, Samdy Sun wrote:
> >
> > There are no deadlocks:
> >
> > (gdb) info thread
> > * 5 Thread 0xf7771b70 (LWP 24962) 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48)
> >     at memcached.c:4044
> >   4 Thread 0xf6d70b70 (LWP 24963) 0x007ad430 in __kernel_vsyscall ()
> >   3 Thread 0xf636fb70 (LWP 24964) 0x007ad430 in __kernel_vsyscall ()
> >   2 Thread 0xf596eb70 (LWP 24965) 0x007ad430 in __kernel_vsyscall ()
> >   1 Thread 0xf77b38d0 (LWP 24961) 0x007ad430 in __kernel_vsyscall ()
> >
> > (gdb) t 1
> > [Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0 0x007ad430 in __kernel_vsyscall ()
> > (gdb) bt
> > #0 0x007ad430 in __kernel_vsyscall ()
> > #1 0x005c5366 in epoll_wait () from /lib/libc.so.6
> > #2 0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198
> > #3 0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538
> > #4 0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
> >
> > (gdb) t 2
> > [Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0 0x007ad430 in __kernel_vsyscall ()
> > (gdb) bt
> > #0 0x007ad430 in __kernel_vsyscall ()
> > #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
> > #2 0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
> > #3 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > #4 0x005c4aee in clone () from /lib/libc.so.6
> >
> > (gdb) t 3
> > [Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0 0x007ad430 in __kernel_vsyscall ()
> > (gdb) bt
> > #0 0x007ad430 in __kernel_vsyscall ()
> > #1 0x005838b6 in nanosleep () from /lib/libc.so.6
> > #2 0x005836e0 in sleep () from /lib/libc.so.6
> > #3 0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
> > #4 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > #5 0x005c4aee in clone () from /lib/libc.so.6
> >
> > (gdb) t 4
> > [Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0 0x007ad430 in __kernel_vsyscall ()
> > (gdb) bt
> > #0 0x007ad430 in __kernel_vsyscall ()
> > #1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
> > #2 0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
> > #3 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > #4 0x005c4aee in clone () from /lib/libc.so.6
> >
> > (gdb) t 5
> > [Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0 0x007ad430 in __kernel_vsyscall ()
> > (gdb) bt
> > #0 0x007ad430 in __kernel_vsyscall ()
> > #1 0x00a68998 in sendmsg () from /lib/libpthread.so.0
> > #2 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044
> > #3 drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370
> > #4 event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441
> > #5 0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at event.c:395
> > #6 event_base_loop (base=0x9310658, flags=0) at event.c:547
> > #7 0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
> > #8 0x00a61a49 in start_thread () from /lib/libpthread.so.0
> > #9 0x005c4aee in clone () from /lib/libc.so.6
> >
> > From strace, is the event named maxconnsevent the only one on epoll?
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0
> > epoll_wait(4, {}, 32, 10) = 0
> > clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0
> >
> > On Wednesday, October 29, 2014 at 2:47:23 PM UTC+8, Samdy Sun wrote:
> > Hello, I ran into a "memcached-1.4.20 stuck" problem when EMFILE
> > happens. Here is my memcached command line: "memcached -s
> > /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o
> > slab_reassign -t 1 -p 11201".
> >
> > cat /proc/version
> > Linux version 2.6.32-358.el6.x86_64 (mock...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:47:41 EST 2013
> >
> > memcached-1.4.20 gets stuck and stops working after running for a
> > period of time.
> > Here is some information from gdb:
> >
> > (gdb) p stats
> > $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0,
> >       __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0}, curr_items = 149156,
> >   total_items = 9876811, curr_bytes = 3712501870, curr_conns = 5, total_conns = 39738, rejected_conns = 0,
> >   malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0,
> >   get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 0, reclaimed = 0,
> >   started = 0, accepting_conns = false, listen_disabled_num = 1, hash_power_level = 17,
> >   hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, evicted_unfetched = 0,
> >   slab_reassign_running = false, slabs_moved = 20, lru_crawler_running = false,
> >   disable_write_by_exptime = 0, disable_write_by_length = 0, disable_write_by_access = 0,
> >   evicted_write_reply_timeout_times = 0}
> >
> > (gdb) p allow_new_conns
> > $4 = false
> >
> > And I found that "allow_new_conns" is only set to false when "accept"
> > fails with errno "EMFILE". Here is the code:
> >
> > static void drive_machine(conn *c) {
> >     ……
> >     } else if (errno == EMFILE) {
> >         if (settings.verbose > 0)
> >             fprintf(stderr, "Too many open connections\n");
> >         accept_new_conns(false);
> >         stop = true;
> >     } else {
> >     ……
> >     }
> >
> > If I flip the flag "allow_new_conns", it works again:
> >
> > (gdb) set allow_new_conns=1
> > (gdb) p allow_new_conns
> > $5 = true
> > (gdb) c
> > Continuing.
> >
> > I know that "allow_new_conns" is set back to "true" when "conn_close" is
> > called. But how could it stay false when "accept" failed with errno
> > "EMFILE" and this connection was the only one accepting? Note that
> > curr_conns = 5.
> > Not running out of fds:
> >
> > ls /proc/1748(memcached_pid)/fd | wc -l
> > 17

--
---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to memcached+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.