Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Sebastian Kaps
Hi Steven,

On 04.08.2011, at 20:59, Steven Dake wrote:

> meaning the corosync community doesn't investigate redundant ring issues
> in corosync versions prior to 1.4.1.

Sadly, we need to use the SLES version for support reasons.
I'll try to convince them to supply us with a fix for this problem.

In the meantime: would it be safe to leave the backup ring marked faulty 
the next time this happens? Would this result in a state that is effectively 
like having no second ring, or is there a chance that this might still 
affect the cluster's stability? 
To my knowledge, changing the ring configuration requires a complete 
restart of the cluster framework on all nodes, right?
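
For reference, my understanding is that re-activating a faulty ring at 
runtime (as opposed to changing the ring configuration) should be possible 
with corosync-cfgtool - roughly like this, assuming the SLES build behaves 
the same as upstream:

  # show the current ring status on this node
  corosync-cfgtool -s
  # clear the FAULTY flag and re-enable redundant ring operation
  corosync-cfgtool -r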

> I expect the root of your problem (the retransmit list issue) is already
> fixed in the repos and the latest released versions.


I'll try to get an update as soon as possible. Thanks a lot!

-- 
Sebastian


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/04/2011 11:43 AM, Sebastian Kaps wrote:
> Hi Steven,
> 
> On 04.08.2011, at 18:27, Steven Dake wrote:
> 
>> redundant ring is only supported upstream in corosync 1.4.1 or later.
> 
> What does "supported" mean in this context, exactly? 
> 

meaning the corosync community doesn't investigate redundant ring issues
in corosync versions prior to 1.4.1.

I expect the root of your problem (the retransmit list issue) is already
fixed in the repos and the latest released versions.

Regards
-steve

> I'm asking, because we're having serious issues with these systems since 
> they went into production (the testing phase did not show any problems, 
> but we also couldn't use real workloads then).
> 
> Since the cluster went into production, we've been having seemingly random 
> STONITH events that appear to be related to high I/O load on a DRBD-mirrored
> OCFS2 volume - but I don't see any pattern yet. We've had these machines 
> running for nearly two weeks without major problems, and suddenly they went 
> back to killing each other :-(
> 
>> The retransmit list message issues you are having are fixed in corosync
>> 1.3.3 and later. This is what is triggering the redundant ring faulty
>> error.
> 
> Could it also cause the instability problems we're seeing?
> Thanks again for helping!

yes

> 




Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Sebastian Kaps
Hi Steven,

On 04.08.2011, at 18:27, Steven Dake wrote:

> redundant ring is only supported upstream in corosync 1.4.1 or later.

What does "supported" mean in this context, exactly? 

I'm asking, because we're having serious issues with these systems since 
they went into production (the testing phase did not show any problems, 
but we also couldn't use real workloads then).

Since the cluster went into production, we've been having seemingly random 
STONITH events that appear to be related to high I/O load on a DRBD-mirrored
OCFS2 volume - but I don't see any pattern yet. We've had these machines 
running for nearly two weeks without major problems, and suddenly they went 
back to killing each other :-(

> The retransmit list message issues you are having are fixed in corosync
> 1.3.3 and later. This is what is triggering the redundant ring faulty
> error.

Could it also cause the instability problems we're seeing?
Thanks again for helping!

-- 
Sebastian



Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-04 Thread Sebastian Kaps
Hi Steven,

thanks for looking into this!

> This process pause message indicates the scheduler doesn't schedule
> corosync for 11 seconds, which is greater than the failure detection
> timeouts. What does your config file look like? What load are you running?

The load at that point in time was around 1.2 - nothing serious.

The config file looks like this:

- snip -
aisexec {
group:  root
user:   root
}
service {
use_mgmtd:  yes
ver:0
name:   pacemaker
}
totem {
rrp_mode:   passive
join:   100
max_messages:   20
vsftype:none
consensus:  1
secauth:on
token_retransmits_before_loss_const:10
threads:16  
token:  1
version:2

interface {
bindnetaddr:192.168.1.0
mcastaddr:  239.250.1.1
mcastport:  5405
ringnumber: 0
}
# 1 GBit as Backup
interface {
bindnetaddr:x.y.z.0
mcastaddr:  239.250.1.2
mcastport:  5415
ringnumber: 1
}
clear_node_high_bit:yes
}
logging {
to_logfile: no
to_syslog:  yes
debug:  off
timestamp:  off
to_stderr:  yes
fileline:   off
syslog_facility:daemon

}
amf {
mode:   disable
}
- snip -

-- 
Sebastian



Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/02/2011 11:53 PM, Sebastian Kaps wrote:
> Hi Steven!
> 
> On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
>> Which version of corosync?
> 
> # corosync -v
> Corosync Cluster Engine, version '1.3.1'
> Copyright (c) 2006-2009 Red Hat, Inc.
> 
> It's the version that comes with SLES11-SP1-HA.
> 

redundant ring is only supported upstream in corosync 1.4.1 or later.

The retransmit list message issues you are having are fixed in corosync
1.3.3 and later. This is what is triggering the redundant ring faulty
error.

Regards
-steve



Re: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST

2011-08-04 Thread Steven Dake
On 08/03/2011 06:39 PM, Bob Schatz wrote:
> Steven,
> 
> Are you planning on recording/taping it if I want to watch it later?
> 
> Thanks,
> 
> Bob

Bob,

Yes, I will record it if I can beat Elluminate into submission.

Regards
-steve


> 
> 
> *From:* Steven Dake 
> *To:* pcmk-cl...@oss.clusterlabs.org
> *Cc:* aeolus-de...@lists.fedorahosted.org; Fedora Cloud SIG
> ; "open...@lists.linux-foundation.org"
> ; The Pacemaker cluster resource
> manager 
> *Sent:* Wednesday, August 3, 2011 9:42 AM
> *Subject:* [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday
> August 5th at 8am PST
> 
> Extending a general invitation to the high availability communities and
> other cloud community contributors to participate in a live demo I am
> giving on Friday, August 5th at 8am PST (GMT-7). The demo portion of the
> session is 15 minutes and will be presented first, followed by more details
> of our approach to high availability.
> 
> I will use Elluminate to show the demo on my desktop machine. To make
> Elluminate work, you will need icedtea-web installed on your system,
> which is not typically installed by default.
> 
> You will also need a conference # and bridge code.  Please contact me
> offlist with your location and I'll provide you with a hopefully toll
> free conference # and bridge code.
> 
> Elluminate link:
> https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F
> 
> 
> Bridge Code:  Please contact me off list with your location and I'll
> respond back with dial-in information.
> 




Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Steven Dake
On 08/03/2011 11:31 PM, Tegtmeier.Martin wrote:
> Hello again,
> 
> in my case it is always the slower ring that fails (the 100 MBit network).
> Does rrp_mode passive expect both rings to have the same speed?
> 
> Sebastian, can you confirm that in your environment it is also the slower
> ring that fails?
> 
> Thanks,
>   -Martin
> 
> 

Martin,

I have never tested faster+slower networks in redundant ring configs.
We just recently added support for this feature in the corosync project,
meaning we can start to tackle some of these issues going forward.

The protocol is designed to be limited to the speed of the slowest ring -
perhaps this is not working as intended.

Regards
-steve

> -Original Message-
> From: Tegtmeier.Martin [mailto:martin.tegtme...@realtech.com] 
> Sent: Wednesday, 3 August 2011 11:03
> To: The Pacemaker cluster resource manager
> Subject: AW: [Pacemaker] Backup ring is marked faulty
> 
> Hello,
> 
> we have exactly the same issue! Same version of corosync (1.3.1), also 
> running on SuSE Linux Enterprise Server 11 SP1 with HAE.
> 
> Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 6a
> 
> Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 63
> 
> Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 60
> 
> Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 6d
> 
> Aug 01 15:45:18 corosync [TOTEM ] Marking seqid 162 ringid 1 interface 
> 10.2.2.6 FAULTY - administrative intervention required.
> 
> rksaph06:/var/log/cluster # corosync-cfgtool -s
> 
> Printing ring status.
> 
> Local node ID 101717164
> 
> RING ID 0
> 
> id  = 172.20.16.6
> 
> status  = ring 0 active with no faults
> 
> RING ID 1
> 
> id  = 10.2.2.6
> 
> status  = Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - 
> administrative intervention required.
> 
> 
> 
> rrp_mode is set to "passive"
> Ring 0 (172.20.16.0) runs at 1 GBit and ring 1 (10.2.2.0) runs at 100 MBit. 
> There was no other network traffic on ring 1 - only corosync (!)
> 
> After re-activating both rings with "corosync-cfgtool -r", the problem is 
> reproducible by simply connecting a crm_gui and hitting "refresh" inside the 
> GUI 3-5 times. After that, ring 1 (10.2.2.0) will be marked as "faulty" again.
> 
> Thanks and best regards,
>   -Martin Tegtmeier
> 
> 
> 
> 
> -Original Message-
> From: Sebastian Kaps [mailto:sebastian.k...@imail.de]
> Sent: Wed 03.08.2011 08:53
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Backup ring is marked faulty
>  
>  Hi Steven!
> 
>  On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
>> Which version of corosync?
> 
>  # corosync -v
>  Corosync Cluster Engine, version '1.3.1'
>  Copyright (c) 2006-2009 Red Hat, Inc.
> 
>  It's the version that comes with SLES11-SP1-HA.
> 
> --
>  Sebastian
> 




Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-04 Thread Steven Dake
On 08/04/2011 05:46 AM, Sebastian Kaps wrote:
> Hello,
> 
> here's another problem we're having:
> 
> Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected
> for 11149 ms, flushing membership messages.

This process pause message indicates the scheduler doesn't schedule
corosync for 11 seconds, which is greater than the failure detection
timeouts. What does your config file look like? What load are you running?
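
For reference, the failure detection window here is essentially the totem
token timeout (plus the retransmit attempts), and consensus has to stay
above it. A rough sketch of totem values that would ride out longer
scheduling stalls - the numbers below are only illustrative, not a
recommendation for your cluster:

totem {
        # token timeout in milliseconds; membership is reformed once no
        # token has been seen for this long
        token:          10000
        # retransmit the token this many times before declaring loss
        token_retransmits_before_loss_const:   10
        # must be larger than token (typically at least 1.2 * token)
        consensus:      12000
}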

Regards
-steve

> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1)
> r(1) ip(x.y.z.3)
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.2)
> r(1) ip(x.y.z.1)
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
> Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] notice:
> pcmk_peer_update: Transitional membership event on ring 9708: memb=1,
> new=0, lost=1
> Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update:
> memb: node01 16885952
> Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update:
> lost: node02 33663168
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1)
> r(1) ip(x.y.z.3)
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
> Jul 31 03:51:11 node01 crmd: [5912]: notice: ais_dispatch_message:
> Membership 9708: quorum lost
> 
> Node01 gets Stonith'd shortly after that. There is no indication
> whatsoever that this would happen in the logs.
> For at least half an hour before that there's only the normal
> status-message noise from monitor ops etc.
> 
> Jul 31 03:51:01 node02 corosync[5810]:  [TOTEM ] A processor failed,
> forming new configuration.
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2)
> r(1) ip(x.y.z.1)
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.1)
> r(1) ip(x.y.z.3)
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
> Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] notice:
> pcmk_peer_update: Transitional membership event on ring 9708: memb=1,
> new=0, lost=1
> Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update:
> memb: node02 33663168
> Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update:
> lost: node01 16885952
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2)
> r(1) ip(x.y.z.1)
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
> Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
> 
> What does "Process pause detected" mean?
> 
> Quoting from my other recent post regarding the backup ring being marked
> faulty sporadically:
> 
> |We're running a two-node cluster with redundant rings.
> |Ring 0 is a 10 GBit direct connection; ring 1 consists of two 1 GBit
> |interfaces that are bonded in active-backup mode and routed through two
> |independent switches for each node. The ring 1 network is our "normal"
> |1 GBit LAN and should only be used in case the direct 10 GBit connection
> |should fail.
> |
> |Corosync Cluster Engine, version '1.3.1'
> |Copyright (c) 2006-2009 Red Hat, Inc.
> |
> |It's the version that comes with SLES11-SP1-HA.
> 
> Thanks in advance!
> 




Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Sebastian Kaps

Hello Martin,

On Thu, 4 Aug 2011 08:31:07 +0200, Tegtmeier.Martin wrote:


> in my case it is always the slower ring that fails (the 100 MBit
> network). Does rrp_mode passive expect both rings to have the same
> speed?
> 
> Sebastian, can you confirm that in your environment it is also the slower
> ring that fails?


I can confirm that, but that might also be a coincidence, since we both 
use the slower network for backup only, i.e. ring(1).
Maybe it's related to the backup ring and not to the speed.

--
Sebastian



[Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-04 Thread Sebastian Kaps

Hello,

here's another problem we're having:

Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected for 11149 ms, flushing membership messages.
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1) r(1) ip(x.y.z.3)
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.2) r(1) ip(x.y.z.1)
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 9708: memb=1, new=0, lost=1
Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update: memb: node01 16885952
Jul 31 03:51:11 node01 corosync[5870]:  [pcmk  ] info: pcmk_peer_update: lost: node02 33663168
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] New Configuration:
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ]   r(0) ip(192.168.1.1) r(1) ip(x.y.z.3)
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Left:
Jul 31 03:51:11 node01 corosync[5870]:  [CLM   ] Members Joined:
Jul 31 03:51:11 node01 crmd: [5912]: notice: ais_dispatch_message: Membership 9708: quorum lost


Node01 gets Stonith'd shortly after that. There is no indication 
whatsoever that this would happen in the logs.
For at least half an hour before that there's only the normal 
status-message noise from monitor ops etc.


Jul 31 03:51:01 node02 corosync[5810]:  [TOTEM ] A processor failed, forming new configuration.
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2) r(1) ip(x.y.z.1)
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.1) r(1) ip(x.y.z.3)
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:
Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 9708: memb=1, new=0, lost=1
Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update: memb: node02 33663168
Jul 31 03:51:11 node02 corosync[5810]:  [pcmk  ] info: pcmk_peer_update: lost: node01 16885952
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] New Configuration:
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ]   r(0) ip(192.168.1.2) r(1) ip(x.y.z.1)
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Left:
Jul 31 03:51:11 node02 corosync[5810]:  [CLM   ] Members Joined:

What does "Process pause detected" mean?

Quoting from my other recent post regarding the backup ring being 
marked faulty sporadically:


|We're running a two-node cluster with redundant rings.
|Ring 0 is a 10 GBit direct connection; ring 1 consists of two 1 GBit
|interfaces that are bonded in active-backup mode and routed through two
|independent switches for each node. The ring 1 network is our "normal"
|1 GBit LAN and should only be used in case the direct 10 GBit connection
|should fail.
|
|Corosync Cluster Engine, version '1.3.1'
|Copyright (c) 2006-2009 Red Hat, Inc.
|
|It's the version that comes with SLES11-SP1-HA.

Thanks in advance!

--
Sebastian



[Pacemaker] DMC 0.9.7 - Pacemaker/Storage GUI

2011-08-04 Thread Rasto Levrinc
Hi,

This is the next DMC release, 0.9.7. DMC is a Pacemaker, Cluster Virtual Manager
and Storage/DRBD GUI written in Java.

The LVM plugins are now a normal part of the GUI in the "Storage" view, where
you can create and delete physical volumes, volume groups and logical volumes,
resize them, and create snapshots across the (whole) cluster. Goodbye, command
line...

Some issues with KDE/Compiz were fixed, or at least made much less likely
to appear - most notably blank windows. Personally, I haven't used KDE since
4.0 came out and just assumed that everything worked there at least as well
as anywhere else, but it didn't.

Some issues with applets were fixed as well, so go ahead and use your browser
for everything. :)

Screenshot:

http://oss.linbit.com/drbd-mc/img/drbd-mc-0.9.7.png


You can get DRBD:MC here:
http://oss.linbit.com/drbd-mc/


1. You don't need Java on the cluster servers, only on your desktop.
2. Download the DMC-0.9.7.jar file.
3. Start it: java -jar DMC-0.9.7.jar, or click on the file.
4. It connects to the cluster via SSH.
5. If you use it without DRBD, you have to click the "Skip" button in a couple
   of places.
6. It is recommended to use it with Sun Java.

DRBD:MC is compatible with Heartbeat 2.1.3 through Pacemaker 1.1.5 with
Corosync or Heartbeat, and with DRBD 8.

The most important changes:
* fix fonts in dialogs
* convert plugin code to normal dialogs
* add --debug  option
* add VG remove dialog
* add VG create to the dialogs
* add PV remove dialog
* show in the graph whether the block device is physical volume
* add PV create dialog
* add --out option to redirect the stdout
* fix distro detection so that it detects centos6
* fix unexpected jumping from Service view to VM view
* make default button work in dialogs again
* allow setting both bds to primary if allow-two-primaries is set
* don't allow to enter the same ids for the same resources
* don't create new dependent resource with existing id
* use checkboxes for colocation and order only constraints
* fix "gray window" problem with popups
* "type to search" feature for popups
* fix problem with not editable fields in an applet
* workarounds for gray windows
* fix null pointer exception, while removing a DRBD volume
* fix problem with menu nodes, while adding a clone

Rasto Levrinc

--

DI Rastislav Levrinc
Senior Software Engineer



[Pacemaker] Some questions about pacemaker

2011-08-04 Thread Yingliang Yang
Hi, Andrew

I have some questions about pacemaker and gui.

As we know, the current version of Pacemaker supports cman, which connects
corosync and Pacemaker. And cman is used in RHCS, too, where it connects
corosync and rgmanager.

I'd like to know what the differences are between cman in Pacemaker and cman
in RHCS. Are they the same?

Will Pacemaker integrate some good features from RHCS, such as Qdisk and
rgmanager?

The current web-based GUI for Pacemaker is developed in Ruby. Is there any
plan to develop a web-based GUI in Java or another language?

Best regards,
Yingliang Yang


Re: [Pacemaker] default gateway for virtual IP

2011-08-04 Thread Max Arnold
On Sun, Jul 31, 2011 at 03:42:15AM +0700, Max Arnold wrote:
> I'm stuck trying to get a virtual IP working in combination with a default 
> gateway. I have two frontend VMs with Nginx and two gateway VMs...

Any ideas? Should I split the single asymmetric 4-node cluster into two separate 
2-node ones, so they won't compete for the default route?
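
For context, the kind of setup I mean looks roughly like the crm shell 
sketch below (resource names and addresses are made up; it uses the 
ocf:heartbeat:IPaddr2 and ocf:heartbeat:Route agents to keep the default 
route on whichever node holds the virtual IP):

primitive vip ocf:heartbeat:IPaddr2 \
        params ip="192.0.2.10" cidr_netmask="24"
primitive defgw ocf:heartbeat:Route \
        params destination="0.0.0.0/0" gateway="192.0.2.1"
# the default route must run where the VIP runs, and only after it
colocation gw-with-vip inf: defgw vip
order vip-before-gw inf: vip defgw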



Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Tegtmeier.Martin
Hello again,

in my case it is always the slower ring that fails (the 100 MBit network). Does 
rrp_mode passive expect both rings to have the same speed?

Sebastian, can you confirm that in your environment it is also the slower ring 
that fails?

Thanks,
  -Martin


-Original Message-
From: Tegtmeier.Martin [mailto:martin.tegtme...@realtech.com] 
Sent: Wednesday, 3 August 2011 11:03
To: The Pacemaker cluster resource manager
Subject: AW: [Pacemaker] Backup ring is marked faulty

Hello,

we have exactly the same issue! Same version of corosync (1.3.1), also running 
on SuSE Linux Enterprise Server 11 SP1 with HAE.

Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 6a

Aug 01 15:45:18 corosync [TOTEM ] Received ringid(172.20.16.2:308) seq 63

Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 60

Aug 01 15:45:18 corosync [TOTEM ] releasing messages up to and including 6d

Aug 01 15:45:18 corosync [TOTEM ] Marking seqid 162 ringid 1 interface 10.2.2.6 
FAULTY - administrative intervention required.

rksaph06:/var/log/cluster # corosync-cfgtool -s

Printing ring status.

Local node ID 101717164

RING ID 0

id  = 172.20.16.6

status  = ring 0 active with no faults

RING ID 1

id  = 10.2.2.6

status  = Marking seqid 162 ringid 1 interface 10.2.2.6 FAULTY - 
administrative intervention required.



rrp_mode is set to "passive"
Ring 0 (172.20.16.0) runs at 1 GBit and ring 1 (10.2.2.0) runs at 100 MBit. 
There was no other network traffic on ring 1 - only corosync (!)

After re-activating both rings with "corosync-cfgtool -r", the problem is 
reproducible by simply connecting a crm_gui and hitting "refresh" inside the 
GUI 3-5 times. After that, ring 1 (10.2.2.0) will be marked as "faulty" again.

Thanks and best regards,
  -Martin Tegtmeier




-Original Message-
From: Sebastian Kaps [mailto:sebastian.k...@imail.de]
Sent: Wed 03.08.2011 08:53
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Backup ring is marked faulty
 
 Hi Steven!

 On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote:
> Which version of corosync?

 # corosync -v
 Corosync Cluster Engine, version '1.3.1'
 Copyright (c) 2006-2009 Red Hat, Inc.

 It's the version that comes with SLES11-SP1-HA.

--
 Sebastian



