Re: [Openais] detecting cpg joiners

2009-05-06 Thread David Teigland
On Wed, May 06, 2009 at 02:10:27PM -0700, Steven Dake wrote:
> On Wed, 2009-05-06 at 15:04 -0500, David Teigland wrote:
> > On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote:
> > > On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
> > > > On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
> > > > > 0. configure token timeout to some long time that is longer than all the
> > > > >    following steps take
> > > > > 
> > > > > 1. cluster members are nodeid's: 1,2,3,4
> > > > > 
> > > > > 2. cpg foo has the following members:
> > > > >    nodeid 1, pid 10
> > > > >    nodeid 2, pid 20
> > > > >    nodeid 3, pid 30
> > > > >    nodeid 4, pid 40
> > > > > 
> > > > > 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40
> > > > >    (optionally reboot this node now)
> > > > > 
> > > > > 4. nodeid 4: ifup eth0, start corosync
> > > > > 
> > > > > 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg
> > > > >    showing that 4:40 is not a member
> > > > > 
> > > > > 6. nodeid 4: start process pid 41 that joins cpg foo
> > > > > 
> > > > > 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg
> > > > >    showing that 4:41 is a member
> > > > > 
> > > > > (Steps 6 and 7 should work the same even if the process started in step 6
> > > > > has pid 40 instead of pid 41.)
> > > 
> > > > 100% agree that is how it should work.  If it doesn't, we will fix it.
> > > > The only thing that may be strange is if pid in step 6 is the same pid
> > > > as 40.  Are you certain the test case which fails has a differing pid at
> > > > step 6?
> > > 
> > > If you fix step 5, then I suspect steps 6,7 will "just work".  After the test
> > > failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that
> > > the pid in step 6 was different (I didn't reboot the node).
> > 
> > It's not clear what the plan was for this, any recent related changes I should
> > try?
> > Dave
> > 
> 
> I haven't tried corosync with this test case, but it should work now.
> Did you try latest corosync on this case?   If it still fails Jan can
> address before 1.0.

Just tried it, and I get the same behavior as before.
Dave

___
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais


Re: [Openais] detecting cpg joiners

2009-05-06 Thread Steven Dake
On Wed, 2009-05-06 at 15:04 -0500, David Teigland wrote:
> On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote:
> > On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
> > > On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
> > > > 0. configure token timeout to some long time that is longer than all the
> > > >    following steps take
> > > > 
> > > > 1. cluster members are nodeid's: 1,2,3,4
> > > > 
> > > > 2. cpg foo has the following members:
> > > >    nodeid 1, pid 10
> > > >    nodeid 2, pid 20
> > > >    nodeid 3, pid 30
> > > >    nodeid 4, pid 40
> > > > 
> > > > 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40
> > > >    (optionally reboot this node now)
> > > > 
> > > > 4. nodeid 4: ifup eth0, start corosync
> > > > 
> > > > 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg
> > > >    showing that 4:40 is not a member
> > > > 
> > > > 6. nodeid 4: start process pid 41 that joins cpg foo
> > > > 
> > > > 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg
> > > >    showing that 4:41 is a member
> > > > 
> > > > (Steps 6 and 7 should work the same even if the process started in step 6
> > > > has pid 40 instead of pid 41.)
> > 
> > > 100% agree that is how it should work.  If it doesn't, we will fix it.
> > > The only thing that may be strange is if pid in step 6 is the same pid
> > > as 40.  Are you certain the test case which fails has a differing pid at
> > > step 6?
> > 
> > If you fix step 5, then I suspect steps 6,7 will "just work".  After the test
> > failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that
> > the pid in step 6 was different (I didn't reboot the node).
> 
> It's not clear what the plan was for this, any recent related changes I should
> try?
> Dave
> 

I haven't tried corosync with this test case, but it should work now.
Did you try latest corosync on this case?   If it still fails Jan can
address before 1.0.

Regards
-steve




Re: [Openais] detecting cpg joiners

2009-05-06 Thread David Teigland
On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote:
> On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
> > On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
> > > 0. configure token timeout to some long time that is longer than all the
> > >    following steps take
> > > 
> > > 1. cluster members are nodeid's: 1,2,3,4
> > > 
> > > 2. cpg foo has the following members:
> > >    nodeid 1, pid 10
> > >    nodeid 2, pid 20
> > >    nodeid 3, pid 30
> > >    nodeid 4, pid 40
> > > 
> > > 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40
> > >    (optionally reboot this node now)
> > > 
> > > 4. nodeid 4: ifup eth0, start corosync
> > > 
> > > 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg
> > >    showing that 4:40 is not a member
> > > 
> > > 6. nodeid 4: start process pid 41 that joins cpg foo
> > > 
> > > 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg
> > >    showing that 4:41 is a member
> > > 
> > > (Steps 6 and 7 should work the same even if the process started in step 6
> > > has pid 40 instead of pid 41.)
> 
> > 100% agree that is how it should work.  If it doesn't, we will fix it.
> > The only thing that may be strange is if pid in step 6 is the same pid
> > as 40.  Are you certain the test case which fails has a differing pid at
> > step 6?
> 
> If you fix step 5, then I suspect steps 6,7 will "just work".  After the test
> failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that
> the pid in step 6 was different (I didn't reboot the node).

It's not clear what the plan was for this, any recent related changes I should
try?
Dave



Re: [Openais] detecting cpg joiners

2009-04-13 Thread Joel Becker
On Mon, Apr 13, 2009 at 02:17:00PM -0500, David Teigland wrote:
> On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
> > On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
> > > 0. configure token timeout to some long time that is longer than all the
> > >    following steps take
> > > 
> > > 1. cluster members are nodeid's: 1,2,3,4
> > > 
> > > 2. cpg foo has the following members:
> > >    nodeid 1, pid 10
> > >    nodeid 2, pid 20
> > >    nodeid 3, pid 30
> > >    nodeid 4, pid 40
> > > 
> > > 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40
> > >    (optionally reboot this node now)
> > > 
> > > 4. nodeid 4: ifup eth0, start corosync
> > > 
> > > 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg
> > >    showing that 4:40 is not a member
> > > 
> > > 6. nodeid 4: start process pid 41 that joins cpg foo
> > > 
> > > 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg
> > >    showing that 4:41 is a member
> > > 
> > > (Steps 6 and 7 should work the same even if the process started in step 6
> > > has pid 40 instead of pid 41.)
> 
> > 100% agree that is how it should work.  If it doesn't, we will fix it.
> > The only thing that may be strange is if pid in step 6 is the same pid
> > as 40.  Are you certain the test case which fails has a differing pid at
> > step 6?
> 
> If you fix step 5, then I suspect steps 6,7 will "just work".  After the test
> failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that
> the pid in step 6 was different (I didn't reboot the node).

Yeah, if we reliably get "4:40 leaves; 4:40 joins", we still
have the information we need.  We need the event.  The pid-wrap concern
was based on the assumption that 4:40 leaving and a new process 4:40
joining would be considered as a steady-state and we would get no leave
event.

Joel

-- 

"Anything that is too stupid to be spoken is sung."  
- Voltaire

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-13 Thread David Teigland
On Mon, Apr 13, 2009 at 12:10:33PM -0700, Steven Dake wrote:
> On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
> > 0. configure token timeout to some long time that is longer than all the
> >    following steps take
> > 
> > 1. cluster members are nodeid's: 1,2,3,4
> > 
> > 2. cpg foo has the following members:
> >    nodeid 1, pid 10
> >    nodeid 2, pid 20
> >    nodeid 3, pid 30
> >    nodeid 4, pid 40
> > 
> > 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40
> >    (optionally reboot this node now)
> > 
> > 4. nodeid 4: ifup eth0, start corosync
> > 
> > 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg
> >    showing that 4:40 is not a member
> > 
> > 6. nodeid 4: start process pid 41 that joins cpg foo
> > 
> > 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg
> >    showing that 4:41 is a member
> > 
> > (Steps 6 and 7 should work the same even if the process started in step 6
> > has pid 40 instead of pid 41.)

> 100% agree that is how it should work.  If it doesn't, we will fix it.
> The only thing that may be strange is if pid in step 6 is the same pid
> as 40.  Are you certain the test case which fails has a differing pid at
> step 6?

If you fix step 5, then I suspect steps 6,7 will "just work".  After the test
failed at step 5 I didn't pay too much attention to 6,7... but I'm sure that
the pid in step 6 was different (I didn't reboot the node).

Dave



Re: [Openais] detecting cpg joiners

2009-04-13 Thread Steven Dake
On Mon, 2009-04-13 at 09:41 -0700, Joel Becker wrote:
> On Thu, Apr 09, 2009 at 09:58:15PM -0700, Joel Becker wrote:
> > On Thu, Apr 09, 2009 at 06:06:13PM -0700, Steven Dake wrote:
> > > I'd like to clear up that when Andrew talks about the membership not
> > > generating a leave event for totem processes in this scenario (which he
> > > integrates directly with), this is true.  But cpg should generate a
> > > leave event.
> > 
> > Even if the pid is the same?  That is, if my node reboots very
> > fast, and my daemon comes back.  What happens in cpg if a) my daemon has
> > a different pid, b) my daemon has the same pid?  I'd like to see a) a
> > leave event for the old nodeid+pid and a join event for the new
> > nodeid+pid, b) a leave and a join event for the nodeid+pid.
> 
> Steve,
>   I never got a reply for this.  I want to clarify cpg behavior
> before I fix up my daemon's routines.
> 
My reply was this:
http://marc.info/?l=openais&m=123932549923230&w=2

And I recently posted about the weakness in pid reuse in a rebooting
node which seems like a pretty serious problem.

Dave gives an excellent outline of the events we expect to see in a
followup message.  That fits your outlined events you would like to see.

Regards
-steve


> Joel
> 



Re: [Openais] detecting cpg joiners

2009-04-13 Thread Steven Dake
On Mon, 2009-04-13 at 13:35 -0500, David Teigland wrote:
> On Thu, Apr 09, 2009 at 06:02:38PM -0700, Steven Dake wrote:
> > The issue that Dave is talking about I believe is described in the
> > following bugzilla:
> > https://bugzilla.redhat.com/show_bug.cgi?id=489451
> 
> No, not at all.
> 
> > IMO you should get a leave event for any process that leaves the process
> > group independent of how totem works underneath.  CPG should provide the
> > guarantees you seek, and if it doesn't, it is defective.  
> 
> OK, good.  Here's what we expect:
> 
> 0. configure token timeout to some long time that is longer than all the
>    following steps take
> 
> 1. cluster members are nodeid's: 1,2,3,4
> 
> 2. cpg foo has the following members:
>    nodeid 1, pid 10
>    nodeid 2, pid 20
>    nodeid 3, pid 30
>    nodeid 4, pid 40
> 
> 3. nodeid 4: ifdown eth0, kill corosync, kill pid 40
>    (optionally reboot this node now)
> 
> 4. nodeid 4: ifup eth0, start corosync
> 
> 5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg
>    showing that 4:40 is not a member
> 
> 6. nodeid 4: start process pid 41 that joins cpg foo
> 
> 7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg
>    showing that 4:41 is a member
> 
> (Steps 6 and 7 should work the same even if the process started in step 6 has
> pid 40 instead of pid 41.)
> 
> Dave

100% agree that is how it should work.  If it doesn't, we will fix it.
The only thing that may be strange is if pid in step 6 is the same pid
as 40.  Are you certain the test case which fails has a differing pid at
step 6?

This points out a weakness in the current cpg protocol which could be
addressed by adding a pid start time to the multicast message to
uniquely identify node restarts with the same pid startup order.
Unfortunately this would have to be done in some backward compatible
fashion.
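
To illustrate the weakness, here is a minimal sketch in Python rather than the actual cpg code. It assumes a member is identified by a hypothetical (nodeid, pid, starttime) tuple; the MemberId type and the compare function are invented for this example and are not part of any corosync API. With only nodeid and pid as the key, a restart that reuses the pid looks like no change at all; adding the start time turns it into a leave plus a join.

```python
from typing import Iterable, List, NamedTuple, Tuple


class MemberId(NamedTuple):
    """Hypothetical member identity; starttime is the proposed extra field."""
    nodeid: int
    pid: int
    starttime: int


def compare(prev: Iterable[MemberId], curr: Iterable[MemberId],
            use_starttime: bool) -> Tuple[List[tuple], List[tuple]]:
    """Return (joined, left) between two membership snapshots."""
    if use_starttime:
        key = lambda m: (m.nodeid, m.pid, m.starttime)
    else:
        key = lambda m: (m.nodeid, m.pid)
    prev_keys = {key(m) for m in prev}
    curr_keys = {key(m) for m in curr}
    return sorted(curr_keys - prev_keys), sorted(prev_keys - curr_keys)


# Node 4 restarts and the new process happens to get pid 40 again.
old = [MemberId(1, 10, 500), MemberId(4, 40, 1000)]
new = [MemberId(1, 10, 500), MemberId(4, 40, 2000)]

print(compare(old, new, use_starttime=False))  # restart is invisible
print(compare(old, new, use_starttime=True))   # restart shows as join + leave
```

The backward compatibility concern is that older nodes would not send or understand the extra field, which this sketch does not attempt to model.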

Regards
-steve

> 



Re: [Openais] detecting cpg joiners

2009-04-13 Thread David Teigland
On Thu, Apr 09, 2009 at 06:02:38PM -0700, Steven Dake wrote:
> The issue that Dave is talking about I believe is described in the
> following bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=489451

No, not at all.

> IMO you should get a leave event for any process that leaves the process
> group independent of how totem works underneath.  CPG should provide the
> guarantees you seek, and if it doesn't, it is defective.  

OK, good.  Here's what we expect:

0. configure token timeout to some long time that is longer than all the
   following steps take

1. cluster members are nodeid's: 1,2,3,4

2. cpg foo has the following members:
   nodeid 1, pid 10
   nodeid 2, pid 20
   nodeid 3, pid 30
   nodeid 4, pid 40

3. nodeid 4: ifdown eth0, kill corosync, kill pid 40
   (optionally reboot this node now)

4. nodeid 4: ifup eth0, start corosync

5. members of cpg foo (1:10, 2:20, 3:30) all get a confchg
   showing that 4:40 is not a member

6. nodeid 4: start process pid 41 that joins cpg foo

7. members of cpg foo (1:10, 2:20, 3:30, 4:41) all get a confchg
   showing that 4:41 is a member

(Steps 6 and 7 should work the same even if the process started in step 6 has
pid 40 instead of pid 41.)
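
The steps above can be replayed as a small model. This is only a sketch in Python of the expected outcome, not corosync code; the function expected_events and the (nodeid, pid) tuple representation are made up for illustration. It records the two confchg events the surviving members should see, and shows that the same two events are expected when the new process reuses pid 40.

```python
from typing import List, Set, Tuple

Member = Tuple[int, int]  # (nodeid, pid), written "4:40" in the steps above


def expected_events(new_pid: int) -> List[Tuple[str, Member, List[Member]]]:
    """Replay steps 2-7 and record each confchg the members should see."""
    members: Set[Member] = {(1, 10), (2, 20), (3, 30), (4, 40)}  # step 2
    events = []

    # Steps 3-5: node 4's old process is gone, so 4:40 is reported as left.
    members.discard((4, 40))
    events.append(("left", (4, 40), sorted(members)))

    # Steps 6-7: the new process on node 4 joins cpg foo.
    members.add((4, new_pid))
    events.append(("joined", (4, new_pid), sorted(members)))
    return events


for pid in (41, 40):
    for kind, member, membership in expected_events(pid):
        print(pid, kind, member, membership)
```

In the pid 40 case the final membership equals the starting membership, so the leave event in the middle is the only evidence that the process restarted.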

Dave



Re: [Openais] detecting cpg joiners

2009-04-13 Thread Joel Becker
On Thu, Apr 09, 2009 at 09:58:15PM -0700, Joel Becker wrote:
> On Thu, Apr 09, 2009 at 06:06:13PM -0700, Steven Dake wrote:
> > I'd like to clear up that when Andrew talks about the membership not
> > generating a leave event for totem processes in this scenario (which he
> > integrates directly with), this is true.  But cpg should generate a
> > leave event.
> 
>   Even if the pid is the same?  That is, if my node reboots very
> fast, and my daemon comes back.  What happens in cpg if a) my daemon has
> a different pid, b) my daemon has the same pid?  I'd like to see a) a
> leave event for the old nodeid+pid and a join event for the new
> nodeid+pid, b) a leave and a join event for the nodeid+pid.

Steve,
I never got a reply for this.  I want to clarify cpg behavior
before I fix up my daemon's routines.

Joel

-- 

Life's Little Instruction Book #30

"Never buy a house without a fireplace."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 06:06:13PM -0700, Steven Dake wrote:
> I'd like to clear up that when Andrew talks about the membership not
> generating a leave event for totem processes in this scenario (which he
> integrates directly with), this is true.  But cpg should generate a
> leave event.

Even if the pid is the same?  That is, if my node reboots very
fast, and my daemon comes back.  What happens in cpg if a) my daemon has
a different pid, b) my daemon has the same pid?  I'd like to see a) a
leave event for the old nodeid+pid and a join event for the new
nodeid+pid, b) a leave and a join event for the nodeid+pid.

Joel

-- 

Life's Little Instruction Book #306

"Take a nap on Sunday afternoons."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Dietmar Maurer
> guarantees you seek, and if it doesn't, it is defective.  The only
> exception might be if the new process reuses the same PID since the
> pid/nodeid/group are the uniqifiers and if pid is the same, there is no
> way to detect the new process (and remove the old one).

PID reuse happens more often than you may think. We finally started to
use PID/starttime tuple to get unique process identifiers.
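
One way to build such a tuple on Linux is sketched below in Python. This is an assumption about how it could be done, not a description of the code referred to above: it reads the process start time (field 22 of /proc/<pid>/stat, in clock ticks since boot) and pairs it with the pid. The function names are hypothetical.

```python
import os
from typing import Tuple


def parse_starttime(stat_line: str) -> int:
    """Extract starttime (field 22) from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 22 is at index 22 - 3 = 19.
    return int(rest[19])


def process_identity(pid: int) -> Tuple[int, int]:
    """Return (pid, starttime) for a live process; Linux /proc only."""
    with open(f"/proc/{pid}/stat") as f:
        return pid, parse_starttime(f.read())


if __name__ == "__main__" and os.path.exists("/proc/self/stat"):
    print(process_identity(os.getpid()))
```

Note that the start time is relative to boot, so on its own it does not distinguish processes across a reboot of the same node.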

- Dietmar




Re: [Openais] detecting cpg joiners

2009-04-09 Thread Steven Dake
On Thu, 2009-04-09 at 17:17 -0700, Joel Becker wrote:
> On Thu, Apr 09, 2009 at 04:09:18PM -0500, David Teigland wrote:
> > On Thu, Apr 09, 2009 at 03:50:08PM -0500, David Teigland wrote:
> > > On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> > > > > On Thu, Apr 9, 2009 at 20:49, Joel Becker wrote:
> > > > > > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> > > > > >> For added fun, a node that restarts quickly enough (think a VM) won't
> > > > > >> even appear to have left (or rejoined) the cluster.
> > > > > >> At the next totem confchg event, It will simply just be there again
> > > > > >> with no indication that anything happened.
> > > > > >
> > > > > >        This had BETTER not happen.
> > > > > 
> > > > > It does, I've seen it enough times that Pacemaker has code to deal with it.
> > > > 
> > > > I'd call that a serious flaw we need to get fixed.  I'll see if I can make it
> > > > happen here.
> > > 
> > > That was pretty simple.
> > > - set token to 5 minutes
> > > - nodes 1,2,3,4 are cluster members and members of a cpg
> > > - on node4: ifdown eth0, kill corosync, ifup eth0, start corosync
> > > - nodes 1,2,3 seem completely unaware that 4 ever went away
> > > 
> > > When node 4 joins the cpg after coming back, the cpg on nodes 1,2,3 think that
> > > a new fifth process/node is joining the cpg.  The cpg on node 4 shows itself
> > > being added as a new fourth cpg member.
> 
> Steve,
>   If node 4's old process went away, shouldn't we be getting a
> 'leave' for that, rather than it persisting in the member list?
> 
> Joel
> 

I'd like to clear up that when Andrew talks about the membership not
generating a leave event for totem processes in this scenario (which he
integrates directly with), this is true.  But cpg should generate a
leave event.



Re: [Openais] detecting cpg joiners

2009-04-09 Thread Steven Dake
On Thu, 2009-04-09 at 17:17 -0700, Joel Becker wrote:
> On Thu, Apr 09, 2009 at 04:09:18PM -0500, David Teigland wrote:
> > On Thu, Apr 09, 2009 at 03:50:08PM -0500, David Teigland wrote:
> > > On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> > > > > On Thu, Apr 9, 2009 at 20:49, Joel Becker wrote:
> > > > > > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> > > > > >> For added fun, a node that restarts quickly enough (think a VM) won't
> > > > > >> even appear to have left (or rejoined) the cluster.
> > > > > >> At the next totem confchg event, It will simply just be there again
> > > > > >> with no indication that anything happened.
> > > > > >
> > > > > >        This had BETTER not happen.
> > > > > 
> > > > > It does, I've seen it enough times that Pacemaker has code to deal with it.
> > > > 
> > > > I'd call that a serious flaw we need to get fixed.  I'll see if I can make it
> > > > happen here.
> > > 
> > > That was pretty simple.
> > > - set token to 5 minutes
> > > - nodes 1,2,3,4 are cluster members and members of a cpg
> > > - on node4: ifdown eth0, kill corosync, ifup eth0, start corosync
> > > - nodes 1,2,3 seem completely unaware that 4 ever went away
> > > 
> > > When node 4 joins the cpg after coming back, the cpg on nodes 1,2,3 think that
> > > a new fifth process/node is joining the cpg.  The cpg on node 4 shows itself
> > > being added as a new fourth cpg member.
> 
> Steve,
>   If node 4's old process went away, shouldn't we be getting a
> 'leave' for that, rather than it persisting in the member list?
> 

The issue that Dave is talking about I believe is described in the
following bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=489451

The bugzilla is a little misleading.  I think sync prior to this bug fix
didn't work at all.

IMO you should get a leave event for any process that leaves the process
group independent of how totem works underneath.  CPG should provide the
guarantees you seek, and if it doesn't, it is defective.  The only
exception might be if the new process reuses the same PID since the
pid/nodeid/group are the uniqifiers and if pid is the same, there is no
way to detect the new process (and remove the old one).

How it works in reality, I am not sure.  Have you tried Dave's test case
with a recent whitetank?

Honza and I are working on a rework of the cpg service engine which
should have correct behavior in whitetank and corosync when it is
finished as well as fix race condition crashes.

Regards
-steve

> Joel
> 



Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 04:09:18PM -0500, David Teigland wrote:
> On Thu, Apr 09, 2009 at 03:50:08PM -0500, David Teigland wrote:
> > On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> > > On Thu, Apr 9, 2009 at 20:49, Joel Becker  wrote:
> > > > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> > > >> For added fun, a node that restarts quickly enough (think a VM) won't
> > > >> even appear to have left (or rejoined) the cluster.
> > > >> At the next totem confchg event, It will simply just be there again
> > > >> with no indication that anything happened.
> > > >
> > > >        This had BETTER not happen.
> > > 
> > > It does, I've seen it enough times that Pacemaker has code to deal with it.
> > 
> > I'd call that a serious flaw we need to get fixed.  I'll see if I can make it
> > happen here.
> 
> That was pretty simple.
> - set token to 5 minutes
> - nodes 1,2,3,4 are cluster members and members of a cpg
> - on node4: ifdown eth0, kill corosync, ifup eth0, start corosync
> - nodes 1,2,3 seem completely unaware that 4 ever went away
> 
> When node 4 joins the cpg after coming back, the cpg on nodes 1,2,3 think that
> a new fifth process/node is joining the cpg.  The cpg on node 4 shows itself
> being added as a new fourth cpg member.

Steve,
If node 4's old process went away, shouldn't we be getting a
'leave' for that, rather than it persisting in the member list?

Joel

-- 

"I don't want to achieve immortality through my work; I want to
 achieve immortality through not dying."
- Woody Allen

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> On Thu, Apr 9, 2009 at 20:49, Joel Becker  wrote:
> > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> >> For added fun, a node that restarts quickly enough (think a VM) won't
> >> even appear to have left (or rejoined) the cluster.
> >> At the next totem confchg event, It will simply just be there again
> >> with no indication that anything happened.
> >
> >        This had BETTER not happen.
> 
> It does, I've seen it enough times that Pacemaker has code to deal with it.

Andrew, I'm mad at you.  This is death for filesystems.  Next
time, please let us know when the system is this bad :-)

Joel

-- 

"Ninety feet between bases is perhaps as close as man has ever come
 to perfection."
- Red Smith

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 03:17:47PM -0700, Steven Dake wrote:
> A proper system using this model doesn't care - it synchronizes every
> time regardless of who left or joined based upon whether it has state to
> sync that is unique.

Dave,
If we're going to use cpg for our membership, we need to come up
with a scheme to detect these node downs.  We probably should do this
together, so we don't reinvent it.

Joel

-- 

"If you are ever in doubt as to whether or not to kiss a pretty girl, 
 give her the benefit of the doubt"
-Thomas Carlyle

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 03:17:47PM -0700, Steven Dake wrote:
> You want a guarantee that virtual synchrony doesn't provide.  Virtual
> synchrony doesn't provide indications of join or left, but only the
> current membership.  It has no way of knowing who joined, or left other
> than to take the previous membership list and compare it to the current.
> Keep that in mind when looking at the joined and left list in your
> callbacks.
> 
> A proper system using this model doesn't care - it synchronizes every
> time regardless of who left or joined based upon whether it has state to
> sync that is unique.
> 
> I was tempted long ago to remove the join and left lists from the
> callbacks, since they don't really make any sense, but the community
> said they could deal with this quirk.

Hmm, I don't think any of us in the world of dlms realized this.
You're providing the level-triggered case, and we mostly only care about
the edges.  ocfs2, for example, doesn't really care who the members are.
It just needs to know when one died.  And if we can't reliably detect
that, we're dead in the water.

Joel

-- 

Life's Little Instruction Book #497

"Go down swinging."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Steven Dake
On Thu, 2009-04-09 at 16:09 -0500, David Teigland wrote:
> On Thu, Apr 09, 2009 at 03:50:08PM -0500, David Teigland wrote:
> > On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> > > On Thu, Apr 9, 2009 at 20:49, Joel Becker  wrote:
> > > > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> > > >> For added fun, a node that restarts quickly enough (think a VM) won't
> > > >> even appear to have left (or rejoined) the cluster.
> > > >> At the next totem confchg event, It will simply just be there again
> > > >> with no indication that anything happened.
> > > >
> > > >        This had BETTER not happen.
> > > 
> > > It does, I've seen it enough times that Pacemaker has code to deal with it.
> > 
> > I'd call that a serious flaw we need to get fixed.  I'll see if I can make it
> > happen here.
> 
> That was pretty simple.
> - set token to 5 minutes
> - nodes 1,2,3,4 are cluster members and members of a cpg
> - on node4: ifdown eth0, kill corosync, ifup eth0, start corosync
> - nodes 1,2,3 seem completely unaware that 4 ever went away
> 
> When node 4 joins the cpg after coming back, the cpg on nodes 1,2,3 think that
> a new fifth process/node is joining the cpg.  The cpg on node 4 shows itself
> being added as a new fourth cpg member.
> 
> Dave
> 

You want a guarantee that virtual synchrony doesn't provide.  Virtual
synchrony doesn't provide indications of join or left, but only the
current membership.  It has no way of knowing who joined, or left other
than to take the previous membership list and compare it to the current.
Keep that in mind when looking at the joined and left list in your
callbacks.
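
A minimal sketch of that comparison, in Python and with invented names (diff_membership is not a corosync function), assuming each member is a (nodeid, pid) pair:

```python
from typing import Iterable, List, Tuple

Member = Tuple[int, int]  # (nodeid, pid)


def diff_membership(previous: Iterable[Member],
                    current: Iterable[Member]) -> Tuple[List[Member], List[Member]]:
    """Derive (joined, left) purely from two membership snapshots."""
    prev_set, curr_set = set(previous), set(current)
    joined = sorted(curr_set - prev_set)  # present now, absent before
    left = sorted(prev_set - curr_set)    # present before, absent now
    return joined, left


before = [(1, 10), (2, 20), (3, 30), (4, 40)]
after = [(1, 10), (2, 20), (3, 30), (4, 41)]

joined, left = diff_membership(before, after)
print("joined:", joined)
print("left:", left)
```

Anything that happens between two snapshots and leaves the membership identical, such as a member going away and returning with the same identity, produces empty joined and left lists under this scheme.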

A proper system using this model doesn't care - it synchronizes every
time regardless of who left or joined based upon whether it has state to
sync that is unique.

I was tempted long ago to remove the join and left lists from the
callbacks, since they don't really make any sense, but the community
said they could deal with this quirk.

regards
-steve



Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 03:50:08PM -0500, David Teigland wrote:
> On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> > On Thu, Apr 9, 2009 at 20:49, Joel Becker  wrote:
> > > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> > >> For added fun, a node that restarts quickly enough (think a VM) won't
> > >> even appear to have left (or rejoined) the cluster.
> > >> At the next totem confchg event, It will simply just be there again
> > >> with no indication that anything happened.
> > >
> > >        This had BETTER not happen.
> > 
> > It does, I've seen it enough times that Pacemaker has code to deal with it.
> 
> I'd call that a serious flaw we need to get fixed.  I'll see if I can make it
> happen here.

Yeah, if this is the way it works, ocfs2's going to have to go
drop openais, and I don't want to do that.

Joel

-- 

"All alone at the end of the evening
 When the bright lights have faded to blue.
 I was thinking about a woman who had loved me
 And I never knew"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread David Teigland
On Thu, Apr 09, 2009 at 03:50:08PM -0500, David Teigland wrote:
> On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> > On Thu, Apr 9, 2009 at 20:49, Joel Becker  wrote:
> > > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> > >> For added fun, a node that restarts quickly enough (think a VM) won't
> > >> even appear to have left (or rejoined) the cluster.
> > >> At the next totem confchg event, It will simply just be there again
> > >> with no indication that anything happened.
> > >
> > > > This had BETTER not happen.
> > 
> > It does, I've seen it enough times that Pacemaker has code to deal with it.
> 
> I'd call that a serious flaw we need to get fixed.  I'll see if I can make it
> happen here.

That was pretty simple.
- set token to 5 minutes
- nodes 1,2,3,4 are cluster members and members of a cpg
- on node4: ifdown eth0, kill corosync, ifup eth0, start corosync
- nodes 1,2,3 seem completely unaware that 4 ever went away

When node 4 joins the cpg after coming back, the cpg on nodes 1,2,3 thinks that
a new fifth process/node is joining the cpg.  The cpg on node 4 shows itself
being added as a new fourth cpg member.

Dave
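
The observation above can be mimicked with a toy model.  This is only a
sketch under stated assumptions: real cpg state lives inside corosync, and
the helper name apply_confchg and the use of plain Python sets are
hypothetical.  Members are (nodeid, pid) pairs, borrowing the pids from the
cpg "foo" example elsewhere in this thread.

```python
# Toy model only: real cpg membership is maintained by corosync, not by this code.

def apply_confchg(members, joined=(), left=()):
    """Return the member set after removing `left` and adding `joined`."""
    return (set(members) - set(left)) | set(joined)


# View held by nodes 1, 2 and 3 before node 4 restarts.
view_on_nodes_1_2_3 = {(1, 10), (2, 20), (3, 30), (4, 40)}

# Node 4 restarts inside the token timeout, so no leave for 4:40 is delivered.
# Its new process (pid 41) then joins: the stale entry stays in the set.
stale = apply_confchg(view_on_nodes_1_2_3, joined=[(4, 41)])

# What the thread asks for instead: a leave for 4:40, then a join for 4:41.
after_leave = apply_confchg(view_on_nodes_1_2_3, left=[(4, 40)])
expected = apply_confchg(after_leave, joined=[(4, 41)])

# Assumption: node 4 starts with no state and learns of the three survivors.
view_on_node_4 = apply_confchg({(1, 10), (2, 20), (3, 30)}, joined=[(4, 41)])
```

In this model `stale` ends up with five entries (both 4:40 and 4:41), while
`expected` and `view_on_node_4` each hold four, which matches the mismatch
between nodes 1,2,3 and node 4 described above.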



Re: [Openais] detecting cpg joiners

2009-04-09 Thread David Teigland
On Thu, Apr 09, 2009 at 10:12:43PM +0200, Andrew Beekhof wrote:
> On Thu, Apr 9, 2009 at 20:49, Joel Becker  wrote:
> > On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> >> For added fun, a node that restarts quickly enough (think a VM) won't
> >> even appear to have left (or rejoined) the cluster.
> >> At the next totem confchg event, It will simply just be there again
> >> with no indication that anything happened.
> >
> > This had BETTER not happen.
> 
> It does, I've seen it enough times that Pacemaker has code to deal with it.

I'd call that a serious flaw we need to get fixed.  I'll see if I can make it
happen here.

Dave



Re: [Openais] detecting cpg joiners

2009-04-09 Thread Andrew Beekhof
On Thu, Apr 9, 2009 at 19:15, David Teigland  wrote:
> On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
>> For added fun, a node that restarts quickly enough (think a VM) won't
>> even appear to have left (or rejoined) the cluster.
>> At the next totem confchg event, It will simply just be there again
>> with no indication that anything happened.
>>
>> At least this is true for the raw corosync/openais membership data,
>> perhaps CPG can infer this some other way.
>
> Cpg should not let a node go away and come back without notice.  In practice
> I'd expect back to back confchg's: one showing it leave and another showing it
> join.

If you mean the raw confchg's that lcrsos see, then nope.
Try this: set token: to longer than your node takes to reboot, then reboot a node.

For physical nodes this isn't a realistic scenario, but VMs can easily
boot in 10 seconds or so.

> As Chrissie mentioned earlier, cpg shouldn't show the same node both
> leaving and joining in a single confchg.  In theory I think it would be
> legitimate.
>
> Consider a couple examples.
> m: member list, j: joined list, l: left list
>
> 1. nodes A and B join at once
> A gets confchg: m=A,B j=A,B l=
> B gets confchg: m=A,B j=A,B l=
>
> 2. node C joins
> A gets confchg: m=A,B,C j=C l=
> B gets confchg: m=A,B,C j=C l=
> C gets confchg: m=A,B,C j=C l=
>
> 3. node C leaves and quickly rejoins in a single confchg
> A gets confchg: m=A,B,C j=C l=C
> B gets confchg: m=A,B,C j=C l=C
> C gets confchg: m=A,B,C j=C l=C
>
> 4. node D joins and quickly leaves (or fails) in a single confchg
> A gets confchg: m=A,B,C j=D l=D
> B gets confchg: m=A,B,C j=D l=D
> C gets confchg: m=A,B,C j=D l=D
> D gets confchg: m=A,B,C j=D l=D ?*
>
> * if D does a quick join+leave it may expect to see this confchg showing it in
> the joined list, the left list, and not in the member list.
>
> Again, the examples in 3 and 4 are, I think, legitimate in theory.  In
> practice it sounds like they won't occur.
>
> If a quick leave+join is guaranteed to be visible through cpg, then it must be
> possible to observe at the lower level from raw corosync data.
>
> Dave
>
>


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Andrew Beekhof
On Thu, Apr 9, 2009 at 20:49, Joel Becker  wrote:
> On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
>> For added fun, a node that restarts quickly enough (think a VM) won't
>> even appear to have left (or rejoined) the cluster.
>> At the next totem confchg event, It will simply just be there again
>> with no indication that anything happened.
>
>        This had BETTER not happen.

It does, I've seen it enough times that Pacemaker has code to deal with it.

> If it does, we can't recover the
> dead+restarted node, and our filesystems are going to corrupt all the
> time.
>
> Joel
>
> --
>
> "If you are ever in doubt as to whether or not to kiss a pretty girl,
>  give her the benefit of the doubt"
>                                        -Thomas Carlyle
>
> Joel Becker
> Principal Software Developer
> Oracle
> E-mail: joel.bec...@oracle.com
> Phone: (650) 506-8127
>


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 08:37:00AM +0100, Chrissie Caulfield wrote:
> 1) If member_count == join_count, then it's a safe bet that they are all
> new nodes, and yes, it is true that all nodes should see the same
> confchg messages.
> 
> 2) If join_count > 0 then leave_count will always be zero. That's a
> consequence of how CPG sends its messages really; join and leave
> messages are always separate. Don't rely on this behaviour though!
> Although I can't see any reason to change it, I'd rather not have it
> burned into the de facto specification.

I agree we shouldn't rely on this.  I'm more concerned with whether,
if member_count==join_count and leave_count>0, we can rely on
members == joiners, and thus treat it as a newly created group (all
members are in the "just joined" state).

Joel

-- 

"War doesn't determine who's right; war determines who's left."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Joel Becker
On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> For added fun, a node that restarts quickly enough (think a VM) won't
> even appear to have left (or rejoined) the cluster.
> At the next totem confchg event, It will simply just be there again
> with no indication that anything happened.

This had BETTER not happen.  If it does, we can't recover the
dead+restarted node, and our filesystems are going to corrupt all the
time.

Joel

-- 

"If you are ever in doubt as to whether or not to kiss a pretty girl, 
 give her the benefit of the doubt"
-Thomas Carlyle

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127


Re: [Openais] detecting cpg joiners

2009-04-09 Thread David Teigland
On Thu, Apr 09, 2009 at 01:50:18PM +0200, Andrew Beekhof wrote:
> For added fun, a node that restarts quickly enough (think a VM) won't
> even appear to have left (or rejoined) the cluster.
> At the next totem confchg event, It will simply just be there again
> with no indication that anything happened.
> 
> At least this is true for the raw corosync/openais membership data,
> perhaps CPG can infer this some other way.

Cpg should not let a node go away and come back without notice.  In practice
I'd expect back to back confchg's: one showing it leave and another showing it
join.  As Chrissie mentioned earlier, cpg shouldn't show the same node both
leaving and joining in a single confchg.  In theory I think it would be
legitimate.

Consider a couple examples.
m: member list, j: joined list, l: left list

1. nodes A and B join at once
A gets confchg: m=A,B j=A,B l=
B gets confchg: m=A,B j=A,B l=

2. node C joins
A gets confchg: m=A,B,C j=C l=
B gets confchg: m=A,B,C j=C l=
C gets confchg: m=A,B,C j=C l=

3. node C leaves and quickly rejoins in a single confchg
A gets confchg: m=A,B,C j=C l=C
B gets confchg: m=A,B,C j=C l=C
C gets confchg: m=A,B,C j=C l=C

4. node D joins and quickly leaves (or fails) in a single confchg
A gets confchg: m=A,B,C j=D l=D
B gets confchg: m=A,B,C j=D l=D
C gets confchg: m=A,B,C j=D l=D
D gets confchg: m=A,B,C j=D l=D ?*

* if D does a quick join+leave it may expect to see this confchg showing it in
the joined list, the left list, and not in the member list.

Again, the examples in 3 and 4 are, I think, legitimate in theory.  In
practice it sounds like they won't occur.

If a quick leave+join is guaranteed to be visible through cpg, then it must be
possible to observe at the lower level from raw corosync data.

Dave
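
The four examples above can be read mechanically.  The following is a
minimal sketch, assuming the m/j/l lists are available as Python sets; the
function name classify and its return strings are made up for illustration
and are not part of any cpg API.  It relies on the reading given above: an
entry in both j and l that is still in m left and rejoined, while one that
is not in m joined and left.

```python
def classify(node, m, j, l):
    """Describe what a single confchg (m, j, l) says about `node`."""
    in_m, in_j, in_l = node in m, node in j, node in l
    if in_j and in_l:
        # Both lists name the node: the member list tells us how it ended up.
        return "left and rejoined" if in_m else "joined and left"
    if in_j:
        return "joined"
    if in_l:
        return "left"
    return "unchanged" if in_m else "absent"


# Examples 1-4 from the message above, as (m, j, l) triples.
examples = {
    1: ({"A", "B"}, {"A", "B"}, set()),
    2: ({"A", "B", "C"}, {"C"}, set()),
    3: ({"A", "B", "C"}, {"C"}, {"C"}),
    4: ({"A", "B", "C"}, {"D"}, {"D"}),
}

for number, (m, j, l) in examples.items():
    for node in sorted(m | j | l):
        print(number, node, classify(node, m, j, l))
```

With these inputs, C in example 3 is classified as "left and rejoined" and
D in example 4 as "joined and left"; every other entry is either "joined"
or "unchanged".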



Re: [Openais] detecting cpg joiners

2009-04-09 Thread Chrissie Caulfield
Robert Wipfel wrote:
>>>> On 4/9/2009 at  5:50 AM, in message
> <26ef5e70904090450s40e92dcfgea0fc34826360...@mail.gmail.com>, Andrew Beekhof
>  wrote: 
>> On Thu, Apr 9, 2009 at 09:37, Chrissie Caulfield  wrote:
>>> Joel Becker wrote:
>>>> Steve, Dave, etc,
>>>>   Someone told me a while back that a node joining a cpg group
>>>> would be by its lonesome in the join message.  That is, when the node
>>>> gets its first confchg, it will be the only node in the list of joins.
>>>> I've been using this to detect the first joiner of the group ("I joined,
>>>> and the member count is 1").
>>>>   Dave's since told me that this assumption is not valid (if it
>>>> ever was).  So two or more nodes can join in parallel, and each can see
>>>> more than one node in the list of joins for its first confchg.  I'm now
>>>> trying to figure out an algorithm for "first joiner".  I have a couple
>>>> of questions:
>>>>
>>>> 1) If I see member_count == join_count, does that mean every member has
>>>> just joined, and all the members are receiving the same join message?
>>>>
>>>> 2) If member_count == join_count, can leave_count be non-zero?  If it
>>>> is, am I guaranteed that we're looking at "all old members left, all new
>>>> members joined"?
>>>>
>>>>   If these both are true, I can simply isolate a "first joiner" by
>>>> checking member_count == join_count and selecting the lowest node
>>>> number.
>>>
>>> I don't think you can detect a first-joiner using CPG. cman does it by
>>> reading the totem confchg messages. It is quite possible for two nodes
>>> to join at the same time ... during the same SYNC phase so you certainly
>>> can't rely on that.
>>>
>>> 1) If member_count == join_count, then it's a safe bet that they are all
>>> new nodes, and yes, it is true that all nodes should see the same
>>> confchg messages.
>>>
>>> 2) If join_count > 0 then leave_count will always be zero. That's a
>>> consequence of how CPG sends its messages really; join and leave
>>> messages are always separate. Don't rely on this behaviour though!
>>> Although I can't see any reason to change it, I'd rather not have it
>>> burned into the de facto specification.
>> For added fun, a node that restarts quickly enough (think a VM) won't
>> even appear to have left (or rejoined) the cluster.
>> At the next totem confchg event, It will simply just be there again
>> with no indication that anything happened.
>>
>> At least this is true for the raw corosync/openais membership data,
>> perhaps CPG can infer this some other way.
> 
> When a new node joins the group does it also create the group?
> e.g. http://www.opengroup.org/RI/technologies/cords/gipc.pdf
> has an epoch number with each join/leave message, the group is
> created by whoever joined in epoch 0.

That would work but it would also break the wire-protocol AND the API!

-- 

Chrissie


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Robert Wipfel
>>> On 4/9/2009 at  5:50 AM, in message
<26ef5e70904090450s40e92dcfgea0fc34826360...@mail.gmail.com>, Andrew Beekhof
 wrote: 
> On Thu, Apr 9, 2009 at 09:37, Chrissie Caulfield  wrote:
>> Joel Becker wrote:
>>> Steve, Dave, etc,
>>>   Someone told me a while back that a node joining a cpg group
>>> would be by its lonesome in the join message.  That is, when the node
>>> gets its first confchg, it will be the only node in the list of joins.
>>> I've been using this to detect the first joiner of the group ("I joined,
>>> and the member count is 1").
>>>   Dave's since told me that this assumption is not valid (if it
>>> ever was).  So two or more nodes can join in parallel, and each can see
>>> more than one node in the list of joins for its first confchg.  I'm now
>>> trying to figure out an algorithm for "first joiner".  I have a couple
>>> of questions:
>>>
>>> 1) If I see member_count == join_count, does that mean every member has
>>> just joined, and all the members are receiving the same join message?
>>>
>>> 2) If member_count == join_count, can leave_count be non-zero?  If it
>>> is, am I guaranteed that we're looking at "all old members left, all new
>>> members joined"?
>>>
>>>   If these both are true, I can simply isolate a "first joiner" by
>>> checking member_count == join_count and selecting the lowest node
>>> number.
>>
>>
>> I don't think you can detect a first-joiner using CPG. cman does it by
>> reading the totem confchg messages. It is quite possible for two nodes
>> to join at the same time ... during the same SYNC phase so you certainly
>> can't rely on that.
>>
>> 1) If member_count == join_count, then it's a safe bet that they are all
>> new nodes, and yes, it is true that all nodes should see the same
>> confchg messages.
>>
>> 2) If join_count > 0 then leave_count will always be zero. That's a
>> consequence of how CPG sends its messages really; join and leave
>> messages are always separate. Don't rely on this behaviour though!
>> Although I can't see any reason to change it, I'd rather not have it
>> burned into the de facto specification.
> 
> For added fun, a node that restarts quickly enough (think a VM) won't
> even appear to have left (or rejoined) the cluster.
> At the next totem confchg event, It will simply just be there again
> with no indication that anything happened.
> 
> At least this is true for the raw corosync/openais membership data,
> perhaps CPG can infer this some other way.

When a new node joins the group does it also create the group?
e.g. http://www.opengroup.org/RI/technologies/cords/gipc.pdf
has an epoch number with each join/leave message, the group is
created by whoever joined in epoch 0.

Hth,
Robert
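
The epoch idea can be sketched roughly as follows.  This is not the design
from the paper linked above and not anything cpg provides; the class
EpochGroup, its methods, and the choice to restart at epoch 0 once the
group is empty are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EpochGroup:
    """Hypothetical group in which every join/leave event carries an epoch."""

    epoch: int = 0
    members: dict = field(default_factory=dict)  # member -> epoch it joined in

    def join(self, member):
        """Record the join and return the epoch number it happened in."""
        joined_at = self.epoch
        self.members[member] = joined_at
        self.epoch += 1
        return joined_at

    def leave(self, member):
        """Record the leave; an emptied group starts over at epoch 0 (assumption)."""
        del self.members[member]
        self.epoch += 1
        if not self.members:
            self.epoch = 0

    def is_creator(self, member):
        """The creator is whoever joined in epoch 0."""
        return self.members.get(member) == 0


group = EpochGroup()
first = group.join("node1")   # joins in epoch 0, so node1 created the group
second = group.join("node2")  # joins in epoch 1
```

The sketch assumes a single agreed ordering of join/leave events, which is
exactly what every member would need to see for the epoch to be meaningful.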







Re: [Openais] detecting cpg joiners

2009-04-09 Thread Andrew Beekhof
On Thu, Apr 9, 2009 at 09:37, Chrissie Caulfield  wrote:
> Joel Becker wrote:
>> Steve, Dave, etc,
>>       Someone told me a while back that a node joining a cpg group
>> would be by its lonesome in the join message.  That is, when the node
>> gets its first confchg, it will be the only node in the list of joins.
>> I've been using this to detect the first joiner of the group ("I joined,
>> and the member count is 1").
>>       Dave's since told me that this assumption is not valid (if it
>> ever was).  So two or more nodes can join in parallel, and each can see
>> more than one node in the list of joins for its first confchg.  I'm now
>> trying to figure out an algorithm for "first joiner".  I have a couple
>> of questions:
>>
>> 1) If I see member_count == join_count, does that mean every member has
>> just joined, and all the members are receiving the same join message?
>>
>> 2) If member_count == join_count, can leave_count be non-zero?  If it
>> is, am I guaranteed that we're looking at "all old members left, all new
>> members joined"?
>>
>>       If these both are true, I can simply isolate a "first joiner" by
>> checking member_count == join_count and selecting the lowest node
>> number.
>
>
> I don't think you can detect a first-joiner using CPG. cman does it by
> reading the totem confchg messages. It is quite possible for two nodes
> to join at the same time ... during the same SYNC phase so you certainly
> can't rely on that.
>
> 1) If member_count == join_count, then it's a safe bet that they are all
> new nodes, and yes, it is true that all nodes should see the same
> confchg messages.
>
> 2) If join_count > 0 then leave_count will always be zero. That's a
> consequence of how CPG sends its messages really; join and leave
> messages are always separate. Don't rely on this behaviour though!
> Although I can't see any reason to change it, I'd rather not have it
> burned into the de facto specification.

For added fun, a node that restarts quickly enough (think a VM) won't
even appear to have left (or rejoined) the cluster.
At the next totem confchg event, It will simply just be there again
with no indication that anything happened.

At least this is true for the raw corosync/openais membership data,
perhaps CPG can infer this some other way.


Re: [Openais] detecting cpg joiners

2009-04-09 Thread Chrissie Caulfield
Joel Becker wrote:
> Steve, Dave, etc,
>   Someone told me a while back that a node joining a cpg group
> would be by its lonesome in the join message.  That is, when the node
> gets its first confchg, it will be the only node in the list of joins.
> I've been using this to detect the first joiner of the group ("I joined,
> and the member count is 1").
>   Dave's since told me that this assumption is not valid (if it
> ever was).  So two or more nodes can join in parallel, and each can see
> more than one node in the list of joins for its first confchg.  I'm now
> trying to figure out an algorithm for "first joiner".  I have a couple
> of questions:
> 
> 1) If I see member_count == join_count, does that mean every member has
> just joined, and all the members are receiving the same join message?
> 
> 2) If member_count == join_count, can leave_count be non-zero?  If it
> is, am I guaranteed that we're looking at "all old members left, all new
> members joined"?
> 
>   If these both are true, I can simply isolate a "first joiner" by
> checking member_count == join_count and selecting the lowest node
> number.


I don't think you can detect a first-joiner using CPG. cman does it by
reading the totem confchg messages. It is quite possible for two nodes
to join at the same time ... during the same SYNC phase so you certainly
can't rely on that.

1) If member_count == join_count, then it's a safe bet that they are all
new nodes, and yes, it is true that all nodes should see the same
confchg messages.

2) If join_count > 0 then leave_count will always be zero. That's a
consequence of how CPG sends its messages really; join and leave
messages are always separate. Don't rely on this behaviour though!
Although I can't see any reason to change it, I'd rather not have it
burned into the de facto specification.


-- 

Chrissie
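
One way to follow the "don't rely on this" advice is to write the handler
so that it gives the same answer whether or not joins and leaves arrive in
separate confchgs.  A minimal sketch, assuming the three lists have already
been copied into Python collections; the function name summarize_confchg
and the dictionary keys are hypothetical.

```python
def summarize_confchg(members, joined, left):
    """Summarize one confchg without assuming joined and left never coexist."""
    members, joined, left = set(members), set(joined), set(left)
    return {
        # Point 1 above: everyone in the member list is also a joiner.
        "all_new": bool(members) and len(members) == len(joined),
        # Entries reported as left that are not members any more.
        "gone": left - members,
        # Entries reported as joined that are members now.
        "arrived": joined & members,
    }


# Today's behaviour: a join-only confchg and a leave-only confchg.
join_only = summarize_confchg({1, 2, 3}, joined={3}, left=set())
leave_only = summarize_confchg({1, 2}, joined=set(), left={3})

# A combined confchg (not sent today) is handled by the same code path.
combined = summarize_confchg({1, 2, 4}, joined={4}, left={3})
```

The member list is treated as the authoritative result here, so the code
never needs to know in which order the joins and leaves were applied.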


[Openais] detecting cpg joiners

2009-04-08 Thread Joel Becker
Steve, Dave, etc,
Someone told me a while back that a node joining a cpg group
would be by its lonesome in the join message.  That is, when the node
gets its first confchg, it will be the only node in the list of joins.
I've been using this to detect the first joiner of the group ("I joined,
and the member count is 1").
Dave's since told me that this assumption is not valid (if it
ever was).  So two or more nodes can join in parallel, and each can see
more than one node in the list of joins for its first confchg.  I'm now
trying to figure out an algorithm for "first joiner".  I have a couple
of questions:

1) If I see member_count == join_count, does that mean every member has
just joined, and all the members are receiving the same join message?

2) If member_count == join_count, can leave_count be non-zero?  If it
is, am I guaranteed that we're looking at "all old members left, all new
members joined"?

If these both are true, I can simply isolate a "first joiner" by
checking member_count == join_count and selecting the lowest node
number.

Joel
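
For concreteness, the proposed check could look roughly like this.  It is
only a sketch of the idea as stated above, under the assumption that both
questions are answered "yes"; the function name is_first_joiner and the use
of plain lists of node numbers are hypothetical, and the replies in this
thread question whether cpg alone can give this guarantee.

```python
def is_first_joiner(my_nodeid, member_nodeids, joined_nodeids):
    """Proposed test: every member just joined and I have the lowest number."""
    if not member_nodeids:
        return False
    if len(member_nodeids) != len(joined_nodeids):
        # Some members were already present, so the group existed before.
        return False
    return my_nodeid == min(member_nodeids)


# Nodes 1 and 2 join in parallel and both see the same first confchg.
on_node_1 = is_first_joiner(1, [1, 2], [1, 2])
on_node_2 = is_first_joiner(2, [1, 2], [1, 2])

# Node 3 joins later: the member list is longer than the join list.
on_node_3 = is_first_joiner(3, [1, 2, 3], [3])
```

Only node 1 gets True in this run.  The check says nothing about a node
that restarted without the other members noticing.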


-- 

Life's Little Instruction Book #444

"Never underestimate the power of a kind word or deed."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127