Thanks for the patch. I merged it. I'll work out a clm patch shortly.
Regards
-steve

On Fri, 2009-10-09 at 13:44 +0200, Jerome Flesch wrote:
> Actually, it wasn't due to a BSD-ism, but it was related to the fact that
> I'm working on a BSD system (-> _POSIX_THREAD_PROCESS_SHARED == 0 -> use of
> semop() instead of sem_wait()). It made me find another, cross-platform,
> smaller issue in CLM (see the end of this mail).
>
> corosync/exec/coroipcs.c:
> pthread_ipc_consumer(), after calling semop(), didn't check the value
> returned by ipc_thread_active(). So if the semaphore was incremented to
> indicate a disconnection, pthread_ipc_consumer() still called
> zerocopy_operations_process() and invoked once again all the callbacks
> corresponding to the last message read. Only then did it look at the value
> returned by ipc_thread_active() and stop.
>
> openais/services/clm.c:
> When the problem above happened, the last message received from my test
> program was the one coming from SaClmClusterTrack(). So
> message_handler_req_lib_clm_clustertrack() was called a second time, and
> the same connection was added again to the list
> 'library_notification_send_listhead' by CLM. This second call to list_add()
> messed up the pointers in the list. Later, when the client disconnected,
> list_del(), instead of removing the connection, made things worse by
> placing pointers to 0xdeadb33f in the list and not removing the element.
> Result: when another Corosync node left the cluster, a loop walking this
> list to notify the client programs made Corosync crash.
>
> Remaining issue: I've also tried calling SaClmClusterTrack() twice in a
> row from the same connection without calling SaClmClusterTrackStop() at
> all. In CLM, this also calls list_add() twice for the same connection,
> again corrupting the list 'library_notification_send_listhead', and
> eventually crashing. This issue *is* cross-platform. However, I'm not sure
> whether it's a bug, since it's based on a misuse of the API.
>
> Regarding this last issue, to make things clear, you can have a look at:
> http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L148
> (If you want to test it, compile with 'make DEBUG_FLAGS="-g -ggdb
> -DCRASH_CLM"'). I have been able to reproduce it on a Debian cluster.
> Hopefully, I haven't messed up my test this time ...
>
> On Wed, Oct 07, 2009 at 11:17:59AM +0200, Jerome Flesch wrote:
> > On Tue, Oct 06, 2009 at 11:09:41AM -0500, Ryan O'Hara wrote:
> > > On Tue, Oct 06, 2009 at 09:45:17AM +0200, Jerome Flesch wrote:
> > > > On Tue, Oct 06, 2009 at 12:32:50AM -0700, Steven Dake wrote:
> > > > > how many nodes
> > > > >
> > > >
> > > > 2 Debians or 3 FreeBSDs (1 Corosync per OS). I'm simply using the
> > > > following scripts to start my test suite:
> > > > http://github.com/jflesch/Corotests/blob/master/corotests-debian.sh
> > > > http://github.com/jflesch/Corotests/blob/master/corotests-freebsd.sh
> > >
> > > Can you reproduce this bug without using these scripts? In other
> > > words, can you reproduce the problem by running corotests manually on
> > > 2-3 nodes? I don't think I will be able to use the scripts on my
> > > development nodes.
> > >
> >
> > As I said in a previous mail, the crash I reported actually seems to be
> > linked to a BSD-ism (my diagnosis was wrong, sorry again for that). I'm
> > still investigating the exact origin.
> >
> > However, in case you're still interested in running these tests, you
> > just need to know that these scripts are basically just here to quickly
> > dispatch new versions of Corosync/OpenAIS/Corotests on the test cluster
> > and start the tests. So if you want to start Corotests "by hand":
> > - Compile corotests
> > - Dispatch corotests on all the test machines
> > - Start Corosync on each test machine
> > - Start 'corotests' on each test machine.
> > It requires, as its only argument, the total number of Corotests
> > instances that will be started.
> > (I've added a README file in the git repo including this info)
> >
> > As one of the tests involves killing and resurrecting Corosync, the
> > 'corotests' binary will need root privileges on the test machines.
> >
> > > > > On Tue, 2009-10-06 at 09:33 +0200, Jerome Flesch wrote:
> > > > > > On Mon, Oct 05, 2009 at 03:18:03PM -0500, Ryan O'Hara wrote:
> > > > > > >
> > > > > > > I downloaded your app and compiled it. I also wrote my own
> > > > > > > test app that just does saClmInitialize, saClmClusterTrack,
> > > > > > > and saClmFinalize. I can't recreate the problem with my test
> > > > > > > app or your test app.
> > > > > > >
> > > > > > > When exactly are you killing node "b"? I think I need precise
> > > > > > > instructions on what to do to recreate it. Also, I am just
> > > > > > > running "corotests 2", FYI.
> > > > > > >
> > > > > >
> > > > > > My test program kills it. It's during the step
> > > > > > 'STEP_COROSYNC_MUST_DIE', where the master test program sends
> > > > > > the command 'MSG_KILL_COROSYNC' to all the other test programs:
> > > > > > http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L325
> > > > > >
> > > > > > Also, I'm sorry, I forgot to specify that if you want to
> > > > > > reproduce the CLM crash with my test program, you must disable
> > > > > > the code between "#ifndef CRASH_CLM" and "#endif" (
> > > > > > http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L156
> > > > > > )
> > > > > >
> > > > > > > Ryan
> > > > > > >
> > > > > > > PS - I am cc'ing Steve Dake.
> > > > > >
> > > > > > Argh, it's also my bad, I should have kept the mailing-list in
> > > > > > CC :/ (I'm actually allowed to share the test program source
> > > > > > with anyone)
> > > > > >
> > > > > > > On Mon, Oct 05, 2009 at 05:59:23PM +0200, Jerome Flesch wrote:
> > > > > > > > On Mon, Oct 05, 2009 at 09:30:26AM -0500, Ryan O'Hara wrote:
> > > > > > > > > On Mon, Oct 05, 2009 at 09:16:02AM -0500, Ryan O'Hara wrote:
> > > > > > > > > > On Mon, Oct 05, 2009 at 03:49:42PM +0200, Jerome Flesch wrote:
> > > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > Hi, Jerome.
> > > > > > > > > >
> > > > > > > > > > > I'm still stress-testing Corosync/Openais (trunk) on
> > > > > > > > > > > FreeBSD, and I've found a tiny bug:
> > > > > > > > > > >
> > > > > > > > > > > On peer A, my test program calls:
> > > > > > > > > > > - saClmInitialize()
> > > > > > > > > > > - saClmClusterTrack(SA_TRACK_CURRENT | SA_TRACK_CHANGES)
> > > > > > > > > > > - saClmFinalize()
> > > > > > > > > > > - (does various tests with CPG ..)
> > > > > > > > > > > Next, when I shut down/kill Corosync on peer B,
> > > > > > > > > > > Corosync on peer A segfaults.
> > > > > > > > >
> > > > > > > > > Node A segfaults, correct? See below.
> > > > > > > > >
> > > > > > > > > > Can you provide the exact test program? I'd like to see
> > > > > > > > > > all the details of each API call. Are you using a test
> > > > > > > > > > program from the openais tree or did you write your own?
> > > > > > > >
> > > > > > > > I wrote my own. My goal is to test Corosync (CPG) / Openais
> > > > > > > > (CLM) on a FreeBSD cluster as much as possible and to be
> > > > > > > > able to compare the results with the ones from a Debian
> > > > > > > > cluster as quickly as possible.
> > > > > > > > To do that, I have scripts that dispatch Corosync, Openais,
> > > > > > > > and the test program on a bunch of virtual machines (or real
> > > > > > > > machines, depending on the settings), and then start
> > > > > > > > corosync and the test program.
> > > > > > > >
> > > > > > > > I've created a public git repository:
> > > > > > > > git clone git://github.com/jflesch/Corotests.git corotests
> > > > > > > >
> > > > > > > > The code related to CLM that you are looking for is the following:
> > > > > > > > http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L134
> > > > > > > > CLM is only used during the initialization of this test suite.
> > > > > > > >
> > > > > > > > PS: I wrote this code on my work time, so legally the
> > > > > > > > copyright belongs to my company (Netasq). However, I just
> > > > > > > > got the authorization to share it (and patches are welcome,
> > > > > > > > of course :)
> > > > > > > >
> > > > > > > > > > > When my test program calls saClmClusterTrackStop()
> > > > > > > > > > > before saClmFinalize, Corosync doesn't crash on peer
> > > > > > > > > > > B. From that and the stacktrace (attached below)
> > > > > > > > >
> > > > > > > > > OK. I re-read this email and I am a bit confused. Here you
> > > > > > > > > said it crashed on node B. Above you said node A
> > > > > > > > > segfaults. Can you clarify?
> > > > > > > >
> > > > > > > > Oops, my bad. So the crash happens on node A (the one where
> > > > > > > > my test program called saClmInitialize(),
> > > > > > > > saClmClusterTrack() and saClmFinalize()) when I kill node B.
> > > > > > > > > > > I guess it tries to signal the change in the cluster
> > > > > > > > > > > to a program that is not connected anymore (-> missing
> > > > > > > > > > > disconnection notification to CLM ?). I also guess it
> > > > > > > > > > > means that Corosync will segfault if the client itself
> > > > > > > > > > > crashes.
> > > > > > > > > >
> > > > > > > > > > I'm guessing that a callback is sent to node A. If I
> > > > > > > > > > understand, you are enabling tracking on group A,
> > > > > > > > > > correct? If CLM is anything like the MSG service (and I
> > > > > > > > > > think it is with respect to how tracking works),
> > > > > > > > > > enabling tracking will generate callbacks on membership
> > > > > > > > > > changes *to the node that enabled tracking*.
> > > > > > > > >
> > > > > > > > > Sorry. I was trying to write a reply while in a meeting
> > > > > > > > > and I forgot to finish this thought.
> > > > > > > > >
> > > > > > > > > If a callback is being sent to node A after it has already
> > > > > > > > > called finalize, I believe it should be a no-op. I think
> > > > > > > > > it would be better if tracking callbacks weren't sent at
> > > > > > > > > all if the node that enabled tracking calls Finalize, but
> > > > > > > > > how CLM handles these things is an implementation detail
> > > > > > > > > that I will look into.
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
