Dietmar, thanks for the test. With it I was able to reproduce the problem very quickly. This is definitely something for BZ. I will try to work on the issue and let you and the others know.
Regards,
  Honza

Dietmar Maurer wrote:
>> Best application for such test is testcpg.c. If there really is a bug,
>> can you please create BZ (ideally with way to reproduce, because I'm really
>> not able to reproduce such behavior).
>
> I am still waiting for a BZ account, so I post it here. The attached
> program 'cpgtest' reproduces the problem. Compile with:
>
> # gcc -Wall cpgtest.c $(shell pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
>
> It executes a simple loop:
>
> start:
>   cpg_initialize
>   cpg_join
>   cpg_dispatch
>   send one message in confchg_callback
>   cpg_finalize after receiving that message
>   goto start
>
> When I run that it executes several successful iterations, but sometimes
> the join fails:
>
> # cpgtest
> ...
> starting cpgtest
> calling cpg_initialize
> calling cpg_join
> cpg_join failed: 14
>
> And worse, sometimes it hangs in the main loop:
>
> # cpgtest
> ...
> starting cpgtest
> calling cpg_initialize
> calling cpg_join
> starting main loop (hangs here)
>
> When that happens, I abort with CTRL-C. After that there is
> a stale CPG member. After several runs I get:
>
> # corosync-cpgtool
> TESTGROUP\x00
>     4610 3 (192.168.2.8)
>     27678 3 (192.168.2.8)
>     21828 3 (192.168.2.8)
>     16841 3 (192.168.2.8)
>     10901 3 (192.168.2.8)
>     10773 3 (192.168.2.8)
>     10496 3 (192.168.2.8)
>     9866 3 (192.168.2.8)
>     8552 3 (192.168.2.8)
>     7439 3 (192.168.2.8)
>     6782 3 (192.168.2.8)
>
> Not a single one of those PIDs exists! I currently run on Debian squeeze,
> kernel 2.6.32 and corosync 1.2.0.
>
> Is somebody able to reproduce that issue?
>
> - Dietmar
>
>> Regards,
>> Honza
>>
>> Dietmar Maurer wrote:
>>> Just found the following commit:
>>>
>>> http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=bcc5fdef8473d99399c624a7bc15423a2af645c1
>>>
>>> The problematic test case looks very similar to my tests - maybe that
>>> problem still exists?
>>>
>>>> It's strange, but the problem only occurs when fencing is involved,
>>>> and cman kills a node. I will try to write a minimal CPG application
>>>> which triggers that bug.
>>>>
>>>> btw, can a memory corruption inside my application cause such
>>>> behavior?
>>>>
>>>> - Dietmar
>>>>
>>>>> Dietmar,
>>>>> process *should* be removed after IPC is finished.
>>>>>
>>>>> Maybe it is a bug. Do you have a reproducer?
>>>>>
>>>>> Thanks,
>>>>> Honza
>>>>>
>>>>> Dietmar Maurer wrote:
>>>>>>> Inside my CPG application, the confchg callback is called with
>>>>>>> 'dead' members:
>>>>>>>
>>>>>>> [debug] cpg member node 3 pid 1132
>>>>>>> [debug] cpg member node 3 pid 14640
>>>>>>>
>>>>>>> For example, process 1132 does not exist any longer on node 3. Any
>>>>>>> idea what can cause such 'ghost' entries?
>>>>>>
>>>>>> If I run corosync-cpgtool on the node I get:
>>>>>>
>>>>>>> # corosync-cpgtool
>>>>>>> Group Name PID Node ID
>>>>>>> mygroup
>>>>>>>     1132 3 (192.168.2.8)
>>>>>>>     14887 3 (192.168.2.8)
>>>>>>
>>>>>> But process 1132 does not exist? How can that happen? I thought a
>>>>>> process is automatically removed from the CPG member list if it exits
>>>>>> (or crashes)?
>>>>>>
>>>>>> - Dietmar
>>>>>>
>>>>>> _______________________________________________
>>>>>> Openais mailing list
>>>>>> Openais@lists.linux-foundation.org
>>>>>> https://lists.linux-foundation.org/mailman/listinfo/openais
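
The stale entries described above (PIDs listed by corosync-cpgtool that no longer exist on the node) can be cross-checked locally with a small helper. This is only a minimal sketch, assuming plain POSIX kill(pid, 0) semantics; pid_exists and count_stale are hypothetical names for illustration and are not part of corosync or libcpg. The check is only meaningful when run on the node that owns the PIDs.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: true if a process with this PID exists locally.
 * kill() with signal 0 performs the existence and permission checks
 * without delivering any signal. */
bool pid_exists(pid_t pid)
{
    if (pid <= 0)
        return false;          /* 0 and negative values address groups, not one PID */
    if (kill(pid, 0) == 0)
        return true;           /* process exists and we may signal it */
    return errno == EPERM;     /* exists, but is owned by another user */
}

/* Hypothetical helper: count entries of a member list (for example the
 * PIDs printed by corosync-cpgtool) that no longer exist on this node. */
size_t count_stale(const pid_t *pids, size_t n)
{
    size_t stale = 0;
    for (size_t i = 0; i < n; i++) {
        if (!pid_exists(pids[i]))
            stale++;
    }
    return stale;
}
```

Feeding the PIDs from the cpgtool listing into count_stale on node 3 would show how many of the reported members are ghosts. Note that PIDs get reused, so a "live" result does not prove the process is the original CPG member.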