Thank you for trying out this test. I have upgraded to release 1.2.5
and applied the fix posted for the leak to /dev/shm. Unfortunately,
when I run the test application (slightly modified to fix a couple of
bugs I found), I still find /dev/shm filling up with large files
(control_buffer-xxx, dispatch_buffer-xxx, fdata-xxxx,
request_buffer-xxx, response_buffer-xxx), even after corosync is
restarted and the application daemon is killed. It would appear that
there may still be a problem in the cleanup of the temporary files
that corosync (the library and/or the daemon?) creates in /dev/shm.
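For illustration, the kind of pattern I would have expected here is
the classic unlink-after-mmap trick, which guarantees the kernel
reclaims the file no matter how either process dies. This is only a
sketch of the general technique, not corosync's actual code (the real
library presumably has to keep the file name visible long enough for
the daemon to open the same file, so an immediate unlink may not be
possible there; the buffer name below is made up for the example):

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch only: create a shared buffer in /dev/shm and unlink it
 * immediately.  The mapping stays valid for every process that
 * already has the file open or mapped, but no directory entry is
 * left behind in /dev/shm, however the process exits.
 */
static void *create_shared_buffer (size_t size)
{
        char path[] = "/dev/shm/request_buffer-XXXXXX";
        int fd = mkstemp (path);        /* create the backing file */
        void *buf = MAP_FAILED;

        if (fd == -1) {
                return NULL;
        }
        if (ftruncate (fd, size) == 0) {
                buf = mmap (NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        }
        unlink (path);  /* kernel reclaims it on last close/unmap */
        close (fd);
        return (buf == MAP_FAILED) ? NULL : buf;
}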
Should the shutdown of the application (and its associated corosync
library) clean up the temporary files? Should the shutdown of the
daemon clean up the /dev/shm temporary files? Would a stopgap measure
be to 'rm -f /dev/shm/*' in the init.d script to clean up any
leftovers? Would that break the library if the applications were not
also shut down?

dan

On Thu, Jun 24, 2010 at 12:16 AM, Steven Dake <sd...@redhat.com> wrote:
> On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
>>
>> On Thu, Jun 24, 2010 at 1:50 AM, dan clark <2cla...@gmail.com> wrote:
>>>
>>> Dear Gentle Reader....
>>>
>>> Attached is a small test program to stress initializing and
>>> finalizing communication between a corosync cpg client and the
>>> corosync daemon. The test was run under version 1.2.4. Initial
>>> testing was with a single node; subsequent testing occurred on a
>>> system consisting of 3 nodes.
>>>
>>> 1) If the program is run in such a way that it loops on
>>> initialize/mcast_joined/dispatch/finalize AND the corosync daemon
>>> is restarted while the program is looping (service corosync
>>> restart), then the application locks up in the corosync client
>>> library in a variety of interesting locations. This is easiest to
>>> reproduce on a single-node system with a large iteration count and
>>> a usleep value between joins: 'stress_finalize -t 500 -i 10000 -u
>>> 1000 -v'. Sometimes it recovers in a few seconds (analysis of an
>>> strace indicated futex(...FUTEX_WAIT, 0, {1, 997888000}) ..., which
>>> would account for multiple 2-second delays in error recovery from a
>>> lost corosync daemon). Sometimes it locks up solid! What is the
>>> proper way of handling the loss of the corosync daemon? Is it
>>> possible to have the cpg library recover from errors quickly in the
>>> case of a failed daemon?
>>>
>>> Sample backtrace of the lockup:
>>>
>>> #0  0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>>> #1  0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>>>     handle=<value optimized out>, iov=<value optimized out>,
>>>     iov_len=1, res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>>> #2  0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>>>     group=<value optimized out>) at cpg.c:458
>>> #3  0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>>>     groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0",
>>>     ctx=0x6e1) at stress_finalize.c:101
>>> #4  0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>>>     at stress_finalize.c:243
>>
>> I've also started getting semaphore-related stack traces.
>
> The stack trace from Dan is different from yours, Andrew. Yours is
> during startup. Dan is more concerned about the fact that
> sem_timedwait sits around for 2 seconds before returning information
> indicating the server has exited or stopped
> (along with other issues).
>
>> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0)
>>     at sem_init.c:45
>> 45        isem->value = value;
>> Missing separate debuginfos, use: debuginfo-install
>> audit-libs-2.0.1-1.fc12.x86_64 libgcrypt-1.4.4-8.fc12.x86_64
>> libgpg-error-1.6-4.x86_64 libtasn1-2.3-1.fc12.x86_64
>> libuuid-2.16-10.2.fc12.x86_64
>> (gdb) where
>> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0)
>>     at sem_init.c:45
>> #1  0x00007ff01e601e8e in coroipcc_service_connect (
>>     socket_name=<value optimized out>,
>>     service=<value optimized out>, request_size=1048576,
>>     response_size=1048576, dispatch_size=1048576,
>>     handle=<value optimized out>) at coroipcc.c:706
>> #2  0x00007ff01ec1bb81 in init_ais_connection_once (
>>     dispatch=0x40e798 <cib_ais_dispatch>,
>>     destroy=0x40e8f2 <cib_ais_destroy>, our_uuid=0x0,
>>     our_uname=0x6182c0, nodeid=0x0) at ais.c:622
>> #3  0x00007ff01ec1ba22 in init_ais_connection (
>>     dispatch=0x40e798 <cib_ais_dispatch>,
>>     destroy=0x40e8f2 <cib_ais_destroy>, our_uuid=0x0,
>>     our_uname=0x6182c0, nodeid=0x0) at ais.c:585
>> #4  0x00007ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
>>     our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2,
>>     hb_conn=0x6182b0) at cluster.c:56
>> #5  0x000000000040e9fb in cib_init () at main.c:424
>> #6  0x000000000040df78 in main (argc=1, argv=0x7ffff194aaf8)
>>     at main.c:218
>> (gdb) print *isem
>> Cannot access memory at address 0x7ff01f81a008
>>
>> sigh
>
> This code literally hasn't been modified for over a year - strange to
> start seeing errors now.
>
> Is your /dev/shm full?
>
> Regards
> -steve
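(For reference, on the question above of how an application should
handle the loss of the daemon: one approach is to treat any persistent
non-CS_OK result from cpg_dispatch as a broken connection, finalize
the handle, and retry the initialize/join from scratch. A minimal
sketch against the corosync 1.x cpg API follows; the exact error code
returned when the daemon dies is an assumption here - and, as the
traces above show, the library may also block for a while before
reporting anything. Link with the usual -lcpg.)

#include <corosync/cpg.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* No message or confchg handling needed for this sketch. */
static cpg_callbacks_t callbacks = { NULL, NULL };

int main (void)
{
        cpg_handle_t handle;
        struct cpg_name group;
        cs_error_t err;

        strcpy (group.value, "stress_finalize_groupName-0");
        group.length = strlen (group.value);

        for (;;) {
                if (cpg_initialize (&handle, &callbacks) != CS_OK) {
                        sleep (1);      /* daemon down; retry later */
                        continue;
                }
                if (cpg_join (handle, &group) != CS_OK) {
                        cpg_finalize (handle);
                        sleep (1);
                        continue;
                }

                /* Dispatch until the connection breaks. */
                do {
                        err = cpg_dispatch (handle, CS_DISPATCH_ONE);
                } while (err == CS_OK || err == CS_ERR_TRY_AGAIN);

                /*
                 * Assumed: a hard error such as CS_ERR_LIBRARY means
                 * the daemon went away.  Drop the handle, reconnect.
                 */
                fprintf (stderr, "cpg connection lost (%d); "
                         "reconnecting\n", err);
                cpg_finalize (handle);
        }
        return 0;
}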