Re: [Openais] [CLM] Crash if saClmClusterTrackStop() is not called before saClmFinalize()

Jerome Flesch Wed, 07 Oct 2009 02:22:53 -0700

On Tue, Oct 06, 2009 at 11:09:41AM -0500, Ryan O'Hara wrote:
> On Tue, Oct 06, 2009 at 09:45:17AM +0200, Jerome Flesch wrote:
> > On Tue, Oct 06, 2009 at 12:32:50AM -0700, Steven Dake wrote:
> > > how many nodes
> > >
> > 
> > 2 Debians or 3 FreeBSDs (1 Corosync per OS). I'm simply using the following 
> > scripts to start my test suite:
> > http://github.com/jflesch/Corotests/blob/master/corotests-debian.sh
> > http://github.com/jflesch/Corotests/blob/master/corotests-freebsd.sh
> 
> Can you reproduce this bug without using these scripts? In other
> words, can you reproduce the problem by running corotests manually on
> 2-3 nodes? I don't think I will be able to use the scripts on my
> development nodes.
>


As I said in a previous mail, the crash I reported actually seems to be linked
to a BSD-ism (I misdiagnosed it, sorry again for that). I'm still
investigating the exact cause.

However, in case you're still interested in running these tests, you just need
to know that these scripts are basically just there to quickly dispatch new
versions of Corosync/OpenAIS/Corotests to the test cluster and start the
tests. So if you want to start Corotests "by hand":
- Compile corotests
- Dispatch corotests to all the test machines
- Start Corosync on each test machine
- Start 'corotests' on each test machine. It requires, as its only argument,
  the total number of Corotests instances that will be started
(I've added a README file to the git repo with this info)

As one of the tests involves killing and restarting Corosync, the 'corotests'
binary will need root privileges on the test machines.

> > > On Tue, 2009-10-06 at 09:33 +0200, Jerome Flesch wrote:
> > > > On Mon, Oct 05, 2009 at 03:18:03PM -0500, Ryan O'Hara wrote:
> > > > > 
> > > > > I downloaded your app and compiled it. I also wrote my own test app
> > > > > that just does saClmInitialize, saClmClusterTrack, and
> > > > > saClmFinalize. I can't recreate the problem with my test app or your
> > > > > test app.
> > > > > 
> > > > > When exactly are you killing node "b"? I think I need precise
> > > > > instructions on what to do to recreate it. Also, I am just running
> > > > > "corotests 2", FYI.
> > > > > 
> > > > 
> > > > My test program kills it. It's during the step
> > > > 'STEP_COROSYNC_MUST_DIE', where the master test program sends the
> > > > command 'MSG_KILL_COROSYNC' to all the other test programs:
> > > > http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L325
> > > > 
> > > > Also, I'm sorry, I forgot to specify that if you want to reproduce the
> > > > CLM crash with my test program, you must disable the code between
> > > > "#ifndef CRASH_CLM" and "#endif":
> > > > http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L156
> > > > 
> > > > 
> > > > > Ryan
> > > > > 
> > > > > PS - I am cc'ing Steve Dake.
> > > > >
> > > > 
> > > > Argh, it's also my bad, I should have kept the mailing-list in CC :/ 
> > > > (I'm actually allowed to share the test program source with anyone)
> > > > 
> > > >  
> > > > > 
> > > > > On Mon, Oct 05, 2009 at 05:59:23PM +0200, Jerome Flesch wrote:
> > > > > > On Mon, Oct 05, 2009 at 09:30:26AM -0500, Ryan O'Hara wrote:
> > > > > > > On Mon, Oct 05, 2009 at 09:16:02AM -0500, Ryan O'Hara wrote:
> > > > > > > > On Mon, Oct 05, 2009 at 03:49:42PM +0200, Jerome Flesch wrote:
> > > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > Hi, Jerome.
> > > > > > > > 
> > > > > > > > > I'm still stress-testing Corosync/Openais (trunk) on FreeBSD,
> > > > > > > > > and I've found a tiny bug:
> > > > > > > > > 
> > > > > > > > > On peer A, my test program calls:
> > > > > > > > > - saClmInitialize()
> > > > > > > > > - saClmClusterTrack(SA_TRACK_CURRENT | SA_TRACK_CHANGES)
> > > > > > > > > - saClmFinalize()
> > > > > > > > > - (does various tests with CPG ..)
> > > > > > > > > Next, when I shut down/kill Corosync on peer B, Corosync on 
> > > > > > > > > peer A segfaults.
> > > > > > > 
> > > > > > > Node A segfaults, correct? See below.
> > > > > > > 
> > > > > > > > Can you provide the exact test program? I'd like to see all the
> > > > > > > > details of each API call. Are you using a test program from the
> > > > > > > > openais tree or did you write your own.
> > > > > > > > 
> > > > > > 
> > > > > > I wrote my own. My goal is to test Corosync (CPG) / Openais (CLM)
> > > > > > on a FreeBSD cluster as much as possible and to be able to compare
> > > > > > the results with the ones from a Debian cluster as quickly as
> > > > > > possible. To do that, I have scripts dispatching Corosync, Openais,
> > > > > > and the test program on a bunch of virtual machines (or real
> > > > > > machines, depending on the settings), and then starting corosync
> > > > > > and the test program.
> > > > > > 
> > > > > > I've created a public git repository:
> > > > > > git clone git://github.com/jflesch/Corotests.git corotests
> > > > > > 
> > > > > > The code related to CLM that you are looking for is the following:
> > > > > > http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L134
> > > > > > CLM is only used during the initialization of this test suite.
> > > > > > 
> > > > > > PS: I wrote this code on work time, so legally, the copyright
> > > > > > belongs to my company (Netasq). However, I just got the
> > > > > > authorization to share it (and patches are welcome, of course :)
> > > > > > 
> > > > > > 
> > > > > > > > > When my test program calls saClmClusterTrackStop() before
> > > > > > > > > saClmFinalize(), Corosync doesn't crash on peer B. From that
> > > > > > > > > and the stacktrace (attached below)
> > > > > > > 
> > > > > > > OK. I re-read this email and I am a bit confused. Here you say it
> > > > > > > crashed on node B. Above you said node A segfaults. Can you
> > > > > > > clarify?
> > > > > > > 
> > > > > > Oops, my bad. So the crash happens on node A (the one where my test
> > > > > > program called saClmInitialize(), saClmClusterTrack() and
> > > > > > saClmFinalize()) when I kill node B.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > > > I guess it tries to signal the change in the cluster to a
> > > > > > > > > program that is not connected anymore (-> missing
> > > > > > > > > disconnection notification to CLM ?). I also guess it means
> > > > > > > > > that Corosync will segfault if the client itself crashes.
> > > > > > > > 
> > > > > > > > I'm guessing that a callback is sent to node A. If I
> > > > > > > > understand, you are enabling tracking on group A, correct? If
> > > > > > > > CLM is anything like the MSG service (and I think it is with
> > > > > > > > respect to how tracking works), enabling tracking will generate
> > > > > > > > callbacks on membership changes *to the node that enabled
> > > > > > > > tracking*.
> > > > > > > 
> > > > > > > Sorry. I was trying to write a reply while in a meeting and I
> > > > > > > forgot to finish this thought.
> > > > > > > 
> > > > > > > If a callback is being sent to node A after it has already called
> > > > > > > finalize, I believe it should be a no-op. I think it would be
> > > > > > > better if tracking callbacks weren't sent at all if the node that
> > > > > > > enabled tracking calls Finalize, but how CLM handles these things
> > > > > > > is an implementation detail that I will look into.
> > > > > > > 
> > > > > 
> > > > 
> > > 
> > > 
> 
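
To make the behaviour discussed above concrete (a tracking callback aimed at
a client that has already finalized should be a no-op, or better, never be
sent), here is a minimal toy model in C. It is only a sketch: every toy_*
name is hypothetical, it is not the Corosync/OpenAIS code and not the real
saClm* API, and it does not describe the BSD-specific cause still under
investigation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names are hypothetical. This is NOT the
 * Corosync/OpenAIS implementation nor the SA Forum CLM API. */
#define TOY_MAX_CLIENTS 8

struct toy_client {
    bool connected;  /* true between toy_initialize() and toy_finalize() */
    bool tracking;   /* true between toy_track() and toy_track_stop() */
    int notified;    /* number of membership callbacks delivered */
};

static struct toy_client toy_clients[TOY_MAX_CLIENTS];

static bool toy_valid(int h)
{
    return h >= 0 && h < TOY_MAX_CLIENTS && toy_clients[h].connected;
}

/* Connect a client; returns a handle, or -1 if the table is full. */
int toy_initialize(void)
{
    for (int h = 0; h < TOY_MAX_CLIENTS; h++) {
        if (!toy_clients[h].connected) {
            /* toy_finalize() guarantees a free slot has no tracker left. */
            assert(!toy_clients[h].tracking);
            toy_clients[h] = (struct toy_client){ .connected = true };
            return h;
        }
    }
    return -1;
}

/* Ask for membership-change callbacks; -1 on an invalid handle. */
int toy_track(int h)
{
    if (!toy_valid(h))
        return -1;
    toy_clients[h].tracking = true;
    return 0;
}

/* Stop membership-change callbacks; -1 on an invalid handle. */
int toy_track_stop(int h)
{
    if (!toy_valid(h))
        return -1;
    toy_clients[h].tracking = false;
    return 0;
}

/* Disconnect a client. The tracker is dropped here as well, so a client
 * that forgets toy_track_stop() leaves nothing stale behind. */
int toy_finalize(int h)
{
    if (!toy_valid(h))
        return -1;
    toy_clients[h].tracking = false;
    toy_clients[h].connected = false;
    return 0;
}

/* Deliver one membership change; returns how many clients were notified. */
int toy_membership_change(void)
{
    int delivered = 0;
    for (int h = 0; h < TOY_MAX_CLIENTS; h++) {
        struct toy_client *c = &toy_clients[h];
        if (!c->tracking)
            continue;
        if (!c->connected)
            continue;  /* stale tracker: treat as a no-op, do not deliver */
        c->notified++;
        delivered++;
    }
    return delivered;
}
```

In this model, a client that calls toy_initialize(), toy_track() and then
toy_finalize() without toy_track_stop() simply stops receiving callbacks;
a later toy_membership_change() skips it instead of touching a dead
connection.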

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
