Re: [Pacemaker] mcast vs broadcast

2010-01-18 Thread Steven Dake
On Mon, 2010-01-18 at 11:25 -0500, Shravan Mishra wrote:
> Hi all,
> 
> 
> 
> Following is my corosync.conf.
> 
> Even though broadcast is enabled I see "mcasted" messages like these
> in corosync.log.
> 
> Is it ok, even when broadcast is on and not mcast?
> 

Yes, you are using broadcast; the debug output just doesn't print a
special case for "broadcast" (but it really is broadcasting).

This output is meant for developer consumption.  It is really not all
that useful for end users.
> ==
> Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
> Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
> Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
> Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
> 172 to pending delivery queue
> Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
> 173 to pending delivery queue
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
> Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
> Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173
> 
> 
> =
> 
> ===
> 
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
> 
> totem {
> version: 2
> token: 3000
> token_retransmits_before_loss_const: 10
> join: 60
> consensus: 1500
> vsftype: none
> max_messages: 20
> clear_node_high_bit: yes
> secauth: on
> threads: 0
> rrp_mode: passive
> 
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.2.0
> #   mcastaddr: 226.94.1.1
> broadcast: yes
> mcastport: 5405
> }
> interface {
> ringnumber: 1
> bindnetaddr: 172.20.20.0
> #mcastaddr: 226.94.2.1
> broadcast: yes
> mcastport: 5405
> }
> }
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: yes
> to_syslog: yes
> logfile: /tmp/corosync.log
> debug: on
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> 
> service {
> name: pacemaker
> ver: 0
> }
> 
> aisexec {
> user:root
> group: root
> }
> 
> amf {
> mode: disabled
> }
> =
> 
> 
> 
> Thanks
> Shravan
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Steven Dake
One possibility is you have a different cluster in your network on the
same multicast address and port.

Regards
-steve
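
For reference, the usual way to keep two corosync clusters on the same network
segment from seeing each other's traffic is to give each one a distinct
multicast group and/or mcastport in the totem interface stanza; with
broadcast: yes, every cluster in the broadcast domain that shares the port
will receive the packets. A minimal sketch, reusing the values that appear
(commented out) in the corosync.conf quoted below:

    interface {
            ringnumber: 0
            bindnetaddr: 192.168.2.0
            mcastaddr: 226.94.1.1   # distinct group per cluster
            mcastport: 5405         # or at least a distinct port per cluster
    }

Packets from a foreign cluster fail the secauth digest check, which is
consistent with the "invalid digest" messages quoted below.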

On Sat, 2010-01-16 at 15:20 -0500, Shravan Mishra wrote:
> Hi Guys,
> 
> I'm running the following version of pacemaker and corosync
> corosync=1.1.1-1-2
> pacemaker=1.0.9-2-1
> 
> Everything had been running fine for quite some time now, but then I
> started seeing the following errors in the corosync logs:
> 
> 
> =
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> 
> 
> I can perform all the crm shell commands and what not but it's
> troubling that the above is happening.
> 
> My crm_mon output looks good.
> 
> 
> I also checked the authkey and did an md5sum on both; it's the same.
> 
> Then I stopped corosync and regenerated the authkey with
> corosync-keygen and copied it to the other machine, but I still get
> the above message in the corosync log.
> 
> Is there anything other than the authkey that I should look into?
> 
> 
> corosync.conf
> 
> 
> 
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
> 
> totem {
> version: 2
> token: 3000
> token_retransmits_before_loss_const: 10
> join: 60
> consensus: 1500
> vsftype: none
> max_messages: 20
> clear_node_high_bit: yes
> secauth: on
> threads: 0
> rrp_mode: passive
> 
> interface {
> ringnumber: 0
> bindnetaddr: 192.168.2.0
> #mcastaddr: 226.94.1.1
> broadcast: yes
> mcastport: 5405
> }
> interface {
> ringnumber: 1
> bindnetaddr: 172.20.20.0
> #mcastaddr: 226.94.1.1
> broadcast: yes
> mcastport: 5405
> }
> }
> 
> 
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: yes
> to_syslog: yes
> logfile: /tmp/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
> 
> service {
> name: pacemaker
> ver: 0
> }
> 
> aisexec {
> user:root
> group: root
> }
> 
> amf {
> mode: disabled
> }
> 
> 
> ===
> 
> 
> Thanks
> Shravan
> 
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Miki Shapiro
Florian(and all), thanks for the reply.

I've gone over past threads on the DRBD list as you suggested, and found only 
this:
http://archives.free.net.ph/message/20090909.131635.ef640f6a.en.html

I am not entirely certain what specific problem the 
one-separate-cluster-at-each-site  design addresses that one-node-on-each-site 
does not.

From the above thread, the only roadblock explicitly mentioned was setting up 
cross-site multicast routing, which needs to be made to work. Fair enough.

I'd like to get a clear idea of what the roadblocks --actually are-- (not on a 
"The WAN link" level but what the WAN link -actually breaks-) to doing what I 
suggested.

Assuming I can get it to work, are there any other specific reasons it 
wouldn't? 

To recap, in my proposed solution, an outage will result in four things:
---
1. A "Race" by both nodes to a 3rd site, to perform an atomic operation (a 
mkdir for instance). Following it, it will be abundantly clear to both nodes 
"who is right, and who is dead".
---
2. A hard-iLO-poweroff STONITH (NOT reboot!) from the winner to the loser's 
iLO. It can  also iptables-block all comms from the loser until further notice 
as an extra safety-net. 
---
3. A hard-own-iLO-poweroff-else-kernel-halt SMITH (NOT reboot!) suicide by the 
loser (SMITH is our pet acronym for Shoot-Myself-...).
---
4. A "WAN-PROBLEM=[true|false] flag immediately raised (locally) by the winner 
based on pinging the OTHER SITE's ROUTER. A separate resource on the winner 
will, in the presence of this flag, monitor the same router of the other site 
for life, and when the other site comes back up (perhaps 
-and-stays-up-for-an-hour- or some similar flap-avoiding logic) issues a 
POWERON to the other node's iLO which will come back up as a drbd slave, resync 
and get re-promoted to master.

As an attractive side-benefit, this is a deathmatch-proof design.



NOTE: There's a departure from common wisdom here, and I am not sure whether 
this is one of the issues you're pointing at. 
Common wisdom states: SMITH BAD, not reliable (obvious reasons - no 
success/failure etc)

In this solution I claim: SMITH BAD, not reliable, except in one specific 
failure mode (WAN outage) where SMITH GOOD, is reliable, shortcomings can be 
worked around.

both steps [2] and [3] are issued on EVERY TYPE of outage, regardless of 
whether it's WAN-related or not. 
In non-WAN issues the loser is considered compromised, thus making [3] 
unreliable, but [2] is reliable.
In WAN issues, the WAN is considered compromised, thus making [2] unreliable, 
but the node itself is sound, so [3] still is reliable.

To sum up, it looks to me like the "data safety" is provided by the layer 
underneath DRBD, not DRBD itself, and if it works as advertised, DRBD should 
have no problem, thus we have a system sufficiently reliable to withstand any 
scenario short of a double failure. 

... thoughts?
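
For concreteness, the step-1 race to the third site can be sketched as below;
this is a minimal illustration only, the arbiter host and path are
hypothetical, and mkdir's atomic create-or-fail semantics are what decides
the winner:

    # run on each node once the peer becomes unreachable
    if ssh arbiter.example.com 'mkdir /tmp/split-brain-winner' 2>/dev/null; then
        echo "won the race: proceed to step 2 (fence the peer's iLO)"
    else
        echo "lost the race: proceed to step 3 (power myself off)"
    fi

The winner would also need to remove the marker once the situation is
resolved, otherwise the next outage has no race left to win.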
--

-Original Message-
From: Florian Haas [mailto:florian.h...@linbit.com] 
Sent: Monday, 18 January 2010 9:36 PM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Split Site 2-way clusters

On 2010-01-18 11:14, Andrew Beekhof wrote:
> On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro 
>  wrote:
>> Confused.
>>
>>
>>
>> I *am* running DRBD in dual-master mode
> 
> /me cringes... this sounds to me like an impossibly dangerous idea.
> Can someone from linbit comment on this please?  Am I imagining this?

Dual-Primary DRBD in a split site cluster? Really really bad idea.
Anyone attempting this, please search the drbd-user archives for multiple 
discussions about this in the past. Then reconsider.

Hope that makes it clear enough.
Florian









___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] 1.0.7 upgraded, restarting resources problem

2010-01-18 Thread Martin Gombač

Hi,

I have one m/s DRBD resource and one Xen instance on top. Both m/s are 
primary.
When I restart the node that's _not_ hosting the Xen instance (ibm1), 
pacemaker restarts the running Xen instance on the other node (ibm2). There 
is no need to do that. I thought this got fixed 
(http://developerbugs.linux-foundation.org/show_bug.cgi?id=2153). Didn't it?


Here is my config once more. Please note that the WARNING showed up only 
after the upgrade.
(BTW, setting the drbd0predHosting score to 0 doesn't restart it, but it 
doesn't help resource ordering either.)


[r...@ibm1 etc]# crm configure show
WARNING: notify: operation name not recognized
node $id="3d430f49-b915-4d52-a32b-b0799fa17ae7" ibm2
node $id="4b2047c8-f3a0-4935-84a2-967b548598c9" ibm1
primitive Hosting ocf:heartbeat:Xen \
   params xmfile="/etc/xen/Hosting.cfg" shutdown_timeout="303" \
   meta target-role="Started" allow-migrate="true" is-managed="true" \
   op monitor interval="120s" timeout="506s" start-delay="5s" \
   op migrate_to interval="0s" timeout="304s" \
   op migrate_from interval="0s" timeout="304s" \
   op stop interval="0s" timeout="304s" \
   op start interval="0s" timeout="202s"
primitive drbd_r0 ocf:linbit:drbd \
   params drbd_resource="r0" \
   op monitor interval="15s" role="Master" timeout="30s" \
   op monitor interval="30s" role="Slave" timeout="30s" \
   op stop interval="0s" timeout="501s" \
   op notify interval="0s" timeout="90s" \
   op demote interval="0s" timeout="90s" \
   op promote interval="0s" timeout="90s" \
   op start interval="0s" timeout="255s"
ms ms_drbd_r0 drbd_r0 \
   meta notify="true" master-max="2" inteleave="true" is-managed="true" 
target-role="Started"

order drbd0predHosting inf: ms_drbd_r0:promote Hosting:start
property $id="cib-bootstrap-options" \
   dc-version="1.0.7-b1191b11d4b56dcae8f34715d52532561b875cd5" \
   cluster-infrastructure="Heartbeat" \
   stonith-enabled="false" \
   no-quorum-policy="ignore" \
   default-resource-stickiness="10" \
   last-lrm-refresh="1263845352"

All I want is to have just the one Hosting resource started, after DRBD has 
been promoted (to primary) on the node where it's starting.

Please advise me if you can.

Thank you,
regards,
M.
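
For reference, an order constraint alone does not tie the Xen resource to the
node holding the master role; the usual pattern pairs it with a colocation
constraint. A minimal crm sketch, reusing the resource names from the
configuration above (note in passing that the ms meta attribute is spelled
"inteleave" in the posted config, while the recognized attribute name is
"interleave"):

    order drbd0predHosting inf: ms_drbd_r0:promote Hosting:start
    colocation Hosting_on_drbd_master inf: Hosting ms_drbd_r0:Master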

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released

2010-01-18 Thread Andrew Beekhof
On Mon, Jan 18, 2010 at 1:29 PM, Andrew Beekhof  wrote:
> On Mon, Jan 18, 2010 at 1:17 PM, Andreas Mock  wrote:
>>> -Original Message-
>>> From: "Andrew Beekhof" 
>>> Sent: 18.01.10 12:43:30
>>> To: The Pacemaker cluster resource manager 
>>> Subject: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released
>>
>>
>>> The latest installment of the Pacemaker 1.0 stable series is now ready for 
>>> general consumption.
>>
>> Great.
>>
>>> Pre-built packages for Pacemaker and its immediate dependencies are 
>>> currently building and will be available for openSUSE, SLES, Fedora, RHEL, 
>>> CentOS from the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) 
>>> shortly.
>>
>> Please don't forget openSuSE 10.2. I'm waiting...  ;-)
>
> I've not forgotten.
> Actually it was the first one i tried but there seems to be some issues there.
>

Done. Please let me know how it goes.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] mcast vs broadcast

2010-01-18 Thread Shravan Mishra
Hi all,



Following is my corosync.conf.

Even though broadcast is enabled I see "mcasted" messages like these
in corosync.log.

Is it ok, even when broadcast is on and not mcast?

==
Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
172 to pending delivery queue
Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
173 to pending delivery queue
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173


=

===

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
version: 2
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 1500
vsftype: none
max_messages: 20
clear_node_high_bit: yes
secauth: on
threads: 0
rrp_mode: passive

interface {
ringnumber: 0
bindnetaddr: 192.168.2.0
#   mcastaddr: 226.94.1.1
broadcast: yes
mcastport: 5405
}
interface {
ringnumber: 1
bindnetaddr: 172.20.20.0
#mcastaddr: 226.94.2.1
broadcast: yes
mcastport: 5405
}
}
logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /tmp/corosync.log
debug: on
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

service {
name: pacemaker
ver: 0
}

aisexec {
user:root
group: root
}

amf {
mode: disabled
}
=



Thanks
Shravan

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Shravan Mishra
Hi,

I'm seeing following messages in corosync.log
=
Jan 18 09:50:41 corosync [pcmk  ] ERROR: check_message_sanity: Message
payload is corrupted: expected 1929 bytes, got 669
Jan 18 09:50:41 corosync [pcmk  ] ERROR: check_message_sanity: Child
28857 spawned to record non-fatal assertion failure line 1286: sane
Jan 18 09:50:41 corosync [pcmk  ] ERROR: check_message_sanity: Invalid
message 70: (dest=local:cib, from=node1.itactics.com:cib.22575,
compressed=0, size=1929, total=2521)
..


I'm not entirely sure what's causing them.

Thanks
Shravan
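
As a reference point, the key check and redistribution described later in this
thread is typically done along these lines (hostnames are placeholders); the
checksum must match on every node, and corosync has to be restarted after the
key is replaced:

    md5sum /etc/corosync/authkey                          # compare the output on both nodes
    corosync-keygen                                       # regenerate the key on one node
    scp /etc/corosync/authkey node2:/etc/corosync/authkey

If the digest errors persist even with matching keys, traffic from another
cluster sharing the same broadcast domain and port remains the most likely
explanation, as suggested elsewhere in this thread.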

On Mon, Jan 18, 2010 at 9:03 AM, Shravan Mishra
 wrote:
> Hi ,
>
> Since the interfaces on the two nodes are connected via a crossover
> cable, there is no chance of that happening; and since I'm using rrp:
> passive, the other ring, i.e. ring 1, will come into
> play only when ring 0 fails, I assume.  I say this because the ring 1
> interface is on the network.
>
>
> One interesting thing that I observed was that
> libtomcrypt is being used for crypto reasons because I have secauth: on.
>
> But I couldn't find that library on my machine.
>
> I'm wondering if it's because of that.
>
> Basically we are using 3 interfaces eth0, eth1 and eth2.
>
> eth0 and eth2 are for ring 0 and ring 1 respectively. eth1 is the
> primary interface.
>
> This is what my drbd.conf looks like:
>
>
> ==
> # please have a look at the example configuration file in
> # /usr/share/doc/drbd82/drbd.conf
> #
> global {
>        usage-count no;
> }
> common {
>                protocol C;
>      startup {
>        wfc-timeout 120;
>        degr-wfc-timeout 120;
>      }
> }
> resource var_nsm {
>                syncer {
>                rate 333M;
>        }
>                handlers {
>                        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                        after-resync-target 
> "/usr/lib/drbd/crm-unfence-peer.sh";
>                }
>                net {
>                        after-sb-1pri discard-secondary;
>                }
>                on node1.itactics.com {
>        device /dev/drbd1;
>         disk /dev/sdb3;
>         address 172.20.20.1:7791;
>         meta-disk internal;
>      }
>    on node2.itactics.com {
>        device /dev/drbd1;
>         disk /dev/sdb3;
>         address 172.20.20.2:7791;
>         meta-disk internal;
>                }
> }
> =
>
>
> eth0's of the two nodes are connected via cross over as I mentioned
> and eth1 and eth2 are on the network.
>
> I'm not a networking expert, but is it possible that a broadcast done by,
> let's say, any node not in my cluster, will still reach
> my nodes through other interfaces which are attached to the network?
>
>
> We in the dev and the QA guys are testing this in parallel.
>
> And let's say there is QA cluster of two nodes and dev cluster of 2 nodes.
>
> And interfaces for both of them are hooked as I mentioned above and that
> corosync.conf for both the clusters have  "bindnetaddr: 192.168.2.0".
>
> Is there possibility of bad messages for the cluster casused by the other.
>
>
> We are in the final leg of the testing and this came up.
>
> Thanks for the help.
>
>
> Shravan
>
>
>
>
>
>
> On Mon, Jan 18, 2010 at 2:58 AM, Andrew Beekhof  wrote:
>> On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra
>>  wrote:
>>> Hi Guys,
>>>
>>> I'm running the following version of pacemaker and corosync
>>> corosync=1.1.1-1-2
>>> pacemaker=1.0.9-2-1
>>>
>>> Everything had been running fine for quite some time now, but then I
>>> started seeing the following errors in the corosync logs:
>>>
>>>
>>> =
>>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
>>> digest... ignoring.
>>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
>>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
>>> digest... ignoring.
>>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
>>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
>>> digest... ignoring.
>>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
>>> 
>>>
>>> I can perform all the crm shell commands and what not but it's
>>> troubling that the above is happening.
>>>
>>> My crm_mon output looks good.
>>>
>>>
>>> I also checked the authkey and did an md5sum on both; it's the same.
>>>
>>> Then I stopped corosync and regenerated the authkey with
>>> corosync-keygen and copied it to the other machine, but I still get
>>> the above message in the corosync log.
>>
>> Are you sure there's not a third node somewhere broadcasting on that
>> mcast and port combination?
>>
>>>
>>> Is there anything other than the authkey that I should look into?
>>>
>>>
>>> corosync.conf
>>>
>>> 
>>>
>>> # Please read the corosync.conf.5 manual page
>>> compatibility: whitetank
>>>
>>> totem {
>>>        version: 2
>>>        token: 3000
>>>        token_retransmits_before_loss_const: 10
>>>        join: 60
>>>        consensus: 1500
>>>        vsftype: none
>>>        m

Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Shravan Mishra
Hi ,

Since the interfaces on the two nodes are connected via a crossover
cable, there is no chance of that happening; and since I'm using rrp:
passive, the other ring, i.e. ring 1, will come into
play only when ring 0 fails, I assume.  I say this because the ring 1
interface is on the network.


One interesting thing that I observed was that
libtomcrypt is being used for crypto reasons because I have secauth: on.

But I couldn't find that library on my machine.

I'm wondering if it's because of that.

Basically we are using 3 interfaces eth0, eth1 and eth2.

eth0 and eth2 are for ring 0 and ring 1 respectively. eth1 is the
primary interface.

This is what my drbd.conf looks like:


==
# please have a look at the example configuration file in
# /usr/share/doc/drbd82/drbd.conf
#
global {
usage-count no;
}
common {
protocol C;
  startup {
wfc-timeout 120;
degr-wfc-timeout 120;
  }
}
resource var_nsm {
syncer {
rate 333M;
}
handlers {
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
net {
after-sb-1pri discard-secondary;
}
on node1.itactics.com {
device /dev/drbd1;
 disk /dev/sdb3;
 address 172.20.20.1:7791;
 meta-disk internal;
  }
on node2.itactics.com {
device /dev/drbd1;
 disk /dev/sdb3;
 address 172.20.20.2:7791;
 meta-disk internal;
}
}
=


eth0's of the two nodes are connected via cross over as I mentioned
and eth1 and eth2 are on the network.

I'm not a networking expert, but is it possible that a broadcast done by,
let's say, any node not in my cluster, will still reach
my nodes through other interfaces which are attached to the network?


We in the dev and the QA guys are testing this in parallel.

And let's say there is QA cluster of two nodes and dev cluster of 2 nodes.

And the interfaces for both of them are hooked up as I mentioned above, and the
corosync.conf for both clusters has "bindnetaddr: 192.168.2.0".

Is there a possibility of bad messages for one cluster caused by the other?


We are in the final leg of the testing and this came up.

Thanks for the help.


Shravan






On Mon, Jan 18, 2010 at 2:58 AM, Andrew Beekhof  wrote:
> On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra
>  wrote:
>> Hi Guys,
>>
>> I'm running the following version of pacemaker and corosync
>> corosync=1.1.1-1-2
>> pacemaker=1.0.9-2-1
>>
>> Everything had been running fine for quite some time now, but then I
>> started seeing the following errors in the corosync logs:
>>
>>
>> =
>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
>> digest... ignoring.
>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
>> digest... ignoring.
>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
>> digest... ignoring.
>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
>> 
>>
>> I can perform all the crm shell commands and what not but it's
>> troubling that the above is happening.
>>
>> My crm_mon output looks good.
>>
>>
>> I also checked the authkey and did an md5sum on both; it's the same.
>>
>> Then I stopped corosync and regenerated the authkey with
>> corosync-keygen and copied it to the other machine, but I still get
>> the above message in the corosync log.
>
> Are you sure there's not a third node somewhere broadcasting on that
> mcast and port combination?
>
>>
>> Is there anything other than the authkey that I should look into?
>>
>>
>> corosync.conf
>>
>> 
>>
>> # Please read the corosync.conf.5 manual page
>> compatibility: whitetank
>>
>> totem {
>>        version: 2
>>        token: 3000
>>        token_retransmits_before_loss_const: 10
>>        join: 60
>>        consensus: 1500
>>        vsftype: none
>>        max_messages: 20
>>        clear_node_high_bit: yes
>>        secauth: on
>>        threads: 0
>>        rrp_mode: passive
>>
>>        interface {
>>                ringnumber: 0
>>                bindnetaddr: 192.168.2.0
>>                #mcastaddr: 226.94.1.1
>>                broadcast: yes
>>                mcastport: 5405
>>        }
>>        interface {
>>                ringnumber: 1
>>                bindnetaddr: 172.20.20.0
>>                #mcastaddr: 226.94.1.1
>>                broadcast: yes
>>                mcastport: 5405
>>        }
>> }
>>
>>
>> logging {
>>        fileline: off
>>        to_stderr: yes
>>        to_logfile: yes
>>        to_syslog: yes
>>        logfile: /tmp/corosync.log
>>        debug: off
>>        timestamp: on
>>        logger_subsys {
>>                subsys: AMF
>>  

Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Lars Ellenberg
On Mon, Jan 18, 2010 at 11:14:58AM +0100, Andrew Beekhof wrote:
> > NodeX(Successfully) taking on data from clients while in
> > quorumless-freeze-still-providing-service, then discarding its hitherto
> > collected client data when realizing other node has quorum and discarding
> > own data isn’t good.
> 
> Agreed - freeze isn't an option if you're doing master/master.

no-quorum=freeze alone is not sufficient when doing master/slave, either:
if the current master risks being blown away later,
you lose all changes made between the replication link loss and being shot.

So you have to make sure there will be no changes between those two
events, you need to also freeze IO on the DRBD Primary.  The fence-peer
handler script hook, and the DRBD fencing policy are what can be used
for this.
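
For illustration, a minimal drbd.conf sketch of the fencing policy and
fence-peer hook Lars is referring to, assuming DRBD 8.3-style syntax (the
resource name is illustrative; the handler paths match the drbd.conf posted
elsewhere in this digest):

    resource r0 {
            disk {
                    fencing resource-and-stonith;  # suspend I/O on the Primary until the handler returns
            }
            handlers {
                    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
            }
    }

With resource-and-stonith, the Primary freezes I/O while the fence-peer
handler runs, which closes the window Lars describes between losing the
replication link and the stale node being shot.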

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released

2010-01-18 Thread Andrew Beekhof
On Mon, Jan 18, 2010 at 1:17 PM, Andreas Mock  wrote:
>> -Original Message-
>> From: "Andrew Beekhof" 
>> Sent: 18.01.10 12:43:30
>> To: The Pacemaker cluster resource manager 
>> Subject: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released
>
>
>> The latest installment of the Pacemaker 1.0 stable series is now ready for 
>> general consumption.
>
> Great.
>
>> Pre-built packages for Pacemaker and its immediate dependencies are 
>> currently building and will be available for openSUSE, SLES, Fedora, RHEL, 
>> CentOS from the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) 
>> shortly.
>
> Please don't forget openSuSE 10.2. I'm waiting...  ;-)

I've not forgotten.
Actually it was the first one i tried but there seems to be some issues there.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released

2010-01-18 Thread Andreas Mock
> -Original Message-
> From: "Andrew Beekhof" 
> Sent: 18.01.10 12:43:30
> To: The Pacemaker cluster resource manager 
> Subject: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released


> The latest installment of the Pacemaker 1.0 stable series is now ready for 
> general consumption.

Great.

> Pre-built packages for Pacemaker and its immediate dependencies are 
> currently building and will be available for openSUSE, SLES, Fedora, RHEL, 
> CentOS from the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) 
> shortly.

Please don't forget openSuSE 10.2. I'm waiting...  ;-)

Best regards + Thanks
Andreas



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Colin
On Mon, Jan 18, 2010 at 11:52 AM, Florian Haas  wrote:
>
> the current approach is to utilize 2 Pacemaker clusters, each highly
> available in its own right, and employing manual failover. As described
> here:

Thanks for the pointer! Perhaps "site" is not quite the correct term
for our setup, where we still have (multiple) Gbit-or-faster ethernet
links, think fire areas, at most in adjacent buildings.

For the next step up, two geographically different sites, I agree that
manual failover is more appropriate, but we feel that our case of the
fire areas should still be handled automatically…(?)

Can anybody judge how difficult it would be to integrate some kind of
quorum-support into the cluster? (All cluster nodes attempt a quorum
reservation; the node that gets it, has 1.5 or 2 votes towards the
quorum, rather than just one; this would ensure continued operation in
the case of a) a fire area losing power, b) the separate quorum-server
failing, or c) the cross-fire-area cluster-interconnects failing (but
not more than one failure at a time)…)

Regards, Colin

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released

2010-01-18 Thread Andrew Beekhof
The latest installment of the Pacemaker 1.0 stable series is now ready for 
general consumption.

In this release, we’ve made a number of improvements to clone handling - 
particularly the way ordering constraints are processed - as well as some 
really nice improvements to the shell.

The next 1.0 release is anticipated to be in mid-March. We will be switching to 
a bi-monthly release schedule to begin focusing on development for the next 
stable series (more details soon). So, if you have feature requests, now is the 
time to voice them and/or provide patches :-)

Pre-built packages for Pacemaker and its immediate dependencies are currently 
building and will be available for openSUSE, SLES, Fedora, RHEL, CentOS from 
the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) shortly.

Read the full announcement at:
   http://theclusterguy.clusterlabs.org/post/340780359/pacemaker-1-0-7-released

General installation instructions are available at from the ClusterLabs wiki:
   http://clusterlabs.org/wiki/Install

-- Andrew




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pre-Announce: End of 0.6 support is near

2010-01-18 Thread Florian Haas
On 2010-01-18 12:09, Andrew Beekhof wrote:
> On Mon, Jan 18, 2010 at 11:57 AM, Florian Haas  
> wrote:
>> On 2010-01-18 11:18, Andrew Beekhof wrote:
>>> Biggest caveat is the networking issue that makes pacemaker 1.0
>>> wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x).
>>> So rolling upgrades are out and you'd need to look at one of the other
>>> upgrade strategies.
>> Even though I've bugged you about this repeatedly in the past, I'll
>> reiterate that I think this non-support of rolling upgrades is a bad
>> thing(tm).
> 
> Its not something that was done intentionally, and we have tests in
> place to ensure it doesn't happen again.
> But given that to-date about 4 people have noticed it didn't work (and
> my employer has no interest in older versions especially when they're
> running heartbeat), I have no current inclination to spend time on the
> problem myself.
>
> That doesn't prevent the vocal minority that maintain it's a huge issue
> affecting half the globe from fixing the problem instead of being
> pests.  If you spent half as much time looking into the problem as
> moaning about it, it would probably be done by now.

Calm down. I thought one smiley face was enough to mark the post as at
least partially ironic.

Suggested course of action:

Remove this part:
"This method is currently broken between Pacemaker 0.6.x and 1.0.x
Measures have been put into place to ensure rolling upgrades always work
for versions after 1.0.0 If there is sufficient demand, the work to
repair 0.6 -> 1.0 compatibility will be carried out. Otherwise, please
try one of the other upgrade strategies. Detach/Reattach is a
particularly good option for most people."

from the "rolling upgrades" section in the docs, and declare that you
will only ever guarantee to support rolling upgrades within the same
minor release, and adjacent minor releases when the major release number
got bumped.

Then:
* Rolling upgrades would always be supported between 1.n.x and 1.n.y for
any value of n, x and y;
* Rolling upgrades would be always supported between 1.n.x and 1.n+1.0,
where x is the final bugfix release of the 1.n series;
* Any other upgrade paths would only be supported on a best-effort
basis, with detach/reattach as a readily available fallback option.

Just my two cents.
Florian



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [Linux-HA] Announce: Hawk (HA Web Konsole)

2010-01-18 Thread Andrew Beekhof
I look forward to taking this for a spin!
Do we have a bugzilla component for it yet?

On Sat, Jan 16, 2010 at 2:14 PM, Tim Serong  wrote:
> Greetings All,
>
> This is to announce the development of the Hawk project,
> a web-based GUI for Pacemaker HA clusters.
>
> So, why another management tool, given that we already have
> the crm shell, the Python GUI, and DRBD MC?  In order:
>
> 1) We have the usual rationale for a GUI over (or in addition
>   to) a CLI tool; it is (or should be) easier to use, for
>   a wider audience.
>
> 2) The Python GUI is not always easily installable/runnable
>   (think: sysadmins with Windows desktops and/or people who
>   don't want to, or can't, forward X).
>
> 3) Believe it or not, there are a number of cases where,
>   citing security reasons, site policy prohibits ssh access
>   to servers (which is what DRBD MC uses internally).
>
> There are also some differing goals; Hawk is not intended
> to expose absolutely everything.  There will be point somewhere
> where you have to say "and now you must learn to use a shell".
>
> Likewise, Hawk is not intended to install the base cluster
> stack for you (whereas DRBD MC does a good job of this).
>
> It's early days yet (no downloadable packages), but you can
> get the current source as follows:
>
>  # hg clone http://hg.clusterlabs.org/pacemaker/hawk
>  # cd hawk
>  # hg update tip
>
> This will give you a web-based GUI with a display roughly
> analogous to crm_mon, in terms of the status of cluster resources.
> It will show you running/dead/standby nodes, and the resources
> (clones, groups & primitives) running on those nodes.
>
> It does not yet provide information about failed resources or
> nodes, other than the fact that they are not running.
>
> Display of nodes & resources is collapsible (collapsed by
> default), but if something breaks while you are looking at it,
> the display will expand to show the broken nodes and/or
> resources.
>
> Hawk is intended to run on each node in your cluster.  You
> can then access it by pointing your web browser at the IP
> address of any cluster node, or the address of any IPaddr(2)
> resource you may have configured.
>
> Minimally, to see it in action, you will need the following
> packages and their dependencies (names per openSUSE/SLES):
>
>  - ruby
>  - rubygem-rails-2_3
>  - rubygem-gettext_rails
>
> Once you've got those installed, run the following command:
>
>  # hawk/script/server
>
> Then, point your browser at http://your-server:3000/ to see
> the status of your cluster.
>
> Ultimately, hawk is intended to be installed and run as a
> regular system service via /etc/init.d/hawk.  To do this,
> you will need the following additional packages:
>
>  - lighttpd
>  - lighttpd-mod_magnet
>  - ruby-fcgi
>  - rubygem-rake
>
> Then, try the following, but READ THE MAKEFILE FIRST!
> "make install" (and the rest of the build system for that
> matter) is frightfully primitive at the moment:
>
>  # make
>  # sudo make install
>  # /etc/init.d/hawk start
>
> Then, point your browser at http://your-server:/ to see
> the status of your cluster.
>
> Assuming you've read this far, what next?
>
> - In the very near future (but probably not next week,
>  because I'll be busy at linux.conf.au) you can expect to
>  see further documentation and roadmap info up on the
>  clusterlabs.org wiki.
>
> - Immediate goal is to obtain feature parity with crm_mon
>  (completing status display, adding error/failure messages).
>
> - Various pieces of scaffolding need to be put in place (login
>  page, access via HTTPS, clean up build/packaging, theming,
>  etc.)
>
> - After status display, the following major areas of
>  functionality are:
>  - Basic operator tasks (stop/start/migrate resource,
>    standby/online node, etc.)
>  - Explore failure scenarios (shadow CIB magic to see
>    what would happen if a node/resource failed).
>  - Ability to actually configure resources and nodes.
>
> Please direct comments, feedback, questions, etc. to
> tser...@novell.com and/or the Pacemaker mailing list.
>
> Thank you for your attention.
>
> Regards,
>
> Tim
>
>
> --
> Tim Serong 
> Senior Clustering Engineer, Novell Inc.
>
>
> ___
> Linux-HA mailing list
> linux...@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pre-Announce: End of 0.6 support is near

2010-01-18 Thread Andrew Beekhof
On Mon, Jan 18, 2010 at 11:57 AM, Florian Haas  wrote:
> On 2010-01-18 11:18, Andrew Beekhof wrote:
>> Biggest caveat is the networking issue that makes pacemaker 1.0
>> wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x).
>> So rolling upgrades are out and you'd need to look at one of the other
>> upgrade strategies.
>
> Even though I've bugged you about this repeatedly in the past, I'll
> reiterate that I think this non-support of rolling upgrades is a bad
> thing(tm).

Its not something that was done intentionally, and we have tests in
place to ensure it doesn't happen again.
But given that to-date about 4 people have noticed it didn't work (and
my employer has no interest in older versions especially when they're
running heartbeat), I have no current inclination to spend time on the
problem myself.

That doesn't prevent the vocal minority that maintain it's a huge issue
affecting half the globe from fixing the problem instead of being
pests.  If you spent half as much time looking into the problem as
moaning about it, it would probably be done by now.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pre-Announce: End of 0.6 support is near

2010-01-18 Thread Florian Haas
On 2010-01-18 11:18, Andrew Beekhof wrote:
> Biggest caveat is the networking issue that makes pacemaker 1.0
> wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x).
> So rolling upgrades are out and you'd need to look at one of the other
> upgrade strategies.

Even though I've bugged you about this repeatedly in the past, I'll
reiterate that I think this non-support of rolling upgrades is a bad
thing(tm).

Just so someone puts this on the record. :)

Cheers,
Florian



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Florian Haas
On 2010-01-18 11:41, Colin wrote:
> Hi All,
> 
> we are currently looking at nearly the same issue, in fact I just
> wanted to start a similarly titled thread when I stumbled over these
> messages…
> 
> The setup we are evaluating is actually a 2*N-node-cluster, i.e. two
> slightly separated sites with N nodes each. The main difference to an
> N-node-cluster is that a failure of one of the two groups of nodes
> must be considered a single failure event [against which the cluster
> must protect, e.g. loss of power at one site].

Colin,

the current approach is to utilize 2 Pacemaker clusters, each highly
available in its own right, and employing manual failover. As described
here:

http://www.drbd.org/users-guide/s-pacemaker-floating-peers.html#s-pacemaker-floating-peers-site-fail-over

May be combined with DRBD resource stacking, obviously.

Given the fact that most organizations currently employ a non-automatic
policy to site failover (as in, "must be authorized by J. Random Vice
President"), this is a sane approach that works for most. Automatic
failover is a different matter, not just with regard to clustering
(where neither Corosync nor Pacemaker nor Heartbeat currently support
any concept of "sites"), but also in terms of IP address failover,
dynamic routing, etc.

Cheers,
Florian



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Colin
Hi All,

we are currently looking at nearly the same issue, in fact I just
wanted to start a similarly titled thread when I stumbled over these
messages…

The setup we are evaluating is actually a 2*N-node-cluster, i.e. two
slightly separated sites with N nodes each. The main difference to an
N-node-cluster is that a failure of one of the two groups of nodes
must be considered a single failure event [against which the cluster
must protect, e.g. loss of power at one site].

As far as I gather from this, and other, mail threads, there is
currently no out-of-the-box quorum-something solution for pacemaker.
Before I start digging deeper [into possible solutions], there's one
question I need to ask:

In a pacemaker + corosync setup, who decides whether a partition has
quorum? I.e, would a quorum-device mechanism need to be integrated
with corosync, or with pacemaker, or with both?

Thanks, Colin

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Florian Haas
On 2010-01-18 11:14, Andrew Beekhof wrote:
> On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro
>  wrote:
>> Confused.
>>
>>
>>
>> I *am* running DRBD in dual-master mode
> 
> /me cringes... this sounds to me like an impossibly dangerous idea.
> Can someone from linbit comment on this please?  Am I imagining this?

Dual-Primary DRBD in a split site cluster? Really really bad idea.
Anyone attempting this, please search the drbd-user archives for
multiple discussions about this in the past. Then reconsider.

Hope that makes it clear enough.
Florian









___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] DC election with downed node in 2-way cluster

2010-01-18 Thread Andrew Beekhof
On Thu, Jan 14, 2010 at 4:40 AM, Miki Shapiro  wrote:
>>> And the node really did power down?
> Yes. 100% certain and positive. OFF.
>
>>> But the other node didn't notice?!?
> Its resources (drbd master and the fence clone) did notice.
> Its dc-election-mechanism did NOT notice (and the survivor didn't re-elect)
> Its quorum-election mechanism did NOT notice (and the survivor still thinks 
> it has quorum).
>
> Logs attached.

Hmmm.
Not much to see there. crmd gets the membership event and then just
sort of stops.
Could you try again with debug turned on in openais.conf please?
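
For reference, the debug switch lives in the logging stanza of openais.conf,
analogous to the logging blocks in the corosync.conf samples elsewhere in this
digest; a minimal sketch showing only the relevant key:

    logging {
            debug: on
    }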

>
> Keep in mind I'm relatively new to this. PEBKAC not entirely outside the 
> realm of the possible ;)

Doesn't look like it, but you might want to try something a little
more recent than 1.0.3.

> Thanks!
>
> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Wednesday, 13 January 2010 7:26 PM
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] DC election with downed node in 2-way cluster
>
> On Wed, Jan 13, 2010 at 9:12 AM, Miki Shapiro  
> wrote:
>> Halt = soft off - a natively issued poweroff command that shuts stuff down
>> nicely, then powers the blade off.
>
> And the node really did power down?
> But the other node didn't notice?!? That is insanely bad - looking
> forward to those logs.
>
>> Logs I'll send tomorrow (our timezone is just wrapping up for the day).
>
> Yep, I'm actually an Aussie too... just not living there at the moment :-)
>
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pre-Announce: End of 0.6 support is near

2010-01-18 Thread Andrew Beekhof
On Tue, Jan 12, 2010 at 3:55 PM, Emmanuel Lesouef  wrote:
> On Tue, 12 Jan 2010 14:56:31 +0100,
> Michael Schwartzkopff  wrote:
>
>> On Tuesday, 12 January 2010 14:48:12, Emmanuel Lesouef wrote:
>> > Hi,
>> >
>> > We use a rather old (in fact, very old) combination :
>> >
>> > heartbeat 2.99 + openhpi 2.12
>> >
>> > What do you suggest in order to upgrade to the latest version of
>> > pacemaker ?
>> >
>> > Thanks.
>>
>> http://www.clusterlabs.org/wiki/Upgrade
>>
>
> Thanks for your answer. I already saw :
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-upgrade.html
>
> In fact, my question wasn't about the upgrading process but more about
> polling this list for caveats, advice or best practices when dealing
> with a rather old & uncommon configuration.

Biggest caveat is the networking issue that makes pacemaker 1.0
wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x).
So rolling upgrades are out and you'd need to look at one of the other
upgrade strategies.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pre-Announce: End of 0.6 support is near

2010-01-18 Thread Andrew Beekhof
On Tue, Jan 12, 2010 at 2:48 PM, Emmanuel Lesouef  wrote:
> Hi,
>
> We use a rather old (in fact, very old) combination :
>
> heartbeat 2.99 + openhpi 2.12
>
> What do you suggest in order to upgrade to the latest version of
> pacemaker ?

What version of pacemaker/crm though?
"heartbeat 2.99" doesn't contain any of the crm bits that became pacemaker

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Andrew Beekhof
On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro
 wrote:
> Confused.
>
>
>
> I *am* running DRBD in dual-master mode

/me cringes... this sounds to me like an impossibly dangerous idea.
Can someone from linbit comment on this please?  Am I imagining this?

> (apologies, I should have mentioned
> that earlier), and there will be both WAN clients as well as
> local-to-datacenter-clients writing to both nodes on both ends. It’s safe to
> assume the clients will know not of the split.
>
>
>
> In a WAN split I need to ensure that the node whose idea of drbd volume will
> be kept once resync happens stays up, and node that’ll get blown away and
> re-synced/overwritten becomes dead asap.

Won't you _always_ lose some data in a WAN split though?
AFAICS, what you're doing here is preventing "some" from being "lots".

Is master/master really a requirement?

> NodeX(Successfully) taking on data from clients while in
> quorumless-freeze-still-providing-service, then discarding its hitherto
> collected client data when realizing other node has quorum and discarding
> own data isn’t good.

Agreed - freeze isn't an option if you're doing master/master.

>
> To recap what I understood so far:
>
> 1.   CRM Availability on the multicast channel drives DC election, but
> DC election is irrelevant to us here.
>
> 2.   CRM Availability on the multicast channel (rather than resource
> failure) drive who-is-in-quorum-and-who-is-not decisions [not sure here..
> correct?

correct

> Or does resource failure drive quorum? ]

quorum applies to node availability - resource failures have no impact
(unless they lead to fencing, which then leads to the node leaving the membership)

>
> 3.   Steve to clarify what happens quorum-wise if 1/3 nodes sees both
> others, but the other two only see the first (“broken triangle”), and
> whether this behaviour may differ based on whether the first node (which is
> different as it sees both others) happens to be the DC at the time or not.

Try in a cluster of 3 VMs?
Just use iptables rules to simulate the broken links
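
For illustration, a broken link between two of the three VMs can be simulated
with a pair of iptables rules on one of them (the peer address below is a
placeholder) and removed again by re-running the commands with -D instead
of -A:

    # on node A: drop all traffic to and from node B (10.0.0.2 is hypothetical)
    iptables -A INPUT  -s 10.0.0.2 -j DROP
    iptables -A OUTPUT -d 10.0.0.2 -j DROP

Node C, which still reaches both A and B, then lets you observe how membership
and quorum behave in the "broken triangle" case described above.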

>
> Given that anyone who goes about building a production cluster would want to
> identify all likely failure modes and be able to predict how the cluster
> behaves in each one, is there any user-targeted doco/rtfm material one could
> read regarding how quorum establishment works in such scenarios?

I don't think corosync has such a doc at the moment.

> Setting up a 3-way with intermittent WAN links without getting a clear
> understanding in advance of how the software will behave is … scary :-)

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Cluster group and name confusion

2010-01-18 Thread Andrew Beekhof
On Sun, Jan 17, 2010 at 7:07 PM, Hunny Bunny  wrote:

> Hello folkz,
> I'm confused under which cluster group and name I should run the whole
> cluster environment root/root or hacluster/hauser.
>

hacluster/hauser


>
> I have compiled from most recent sources Corosync/OpenAIS, Cluster Glue,
> Resource Agents, Pacemaker, DRBD and OCFS2-Tools environment.
>
> This site http://www.clusterlabs.org/wiki/Install#From_Source
> suggests to create
>
> groupadd -r hacluster
> useradd -r -g hacluster -d /var/lib/heartbeat/cores/hacluster -s 
> /sbin/nologin -c "cluster user" hauser
>
> However, Corosync/OpenAIS which starts all Pacemaker CRM stuff runs as user
> and group root
>

No it doesn't.
It starts the _parent_ process as root.

Some parts need to run as root so that they can do things like "add an ip
address to the system" or "start apache" - things non-root users can't do.


> in /etc/corosync/corosync.conf
>
> <- snipped -->
>
> service {
> # Load the Pacemaker Cluster Resource Manager
> name:   pacemaker
> ver:0
> }
>
> aisexec {
> user:   root
> group:  root
> }
>
> <- snipped -->
>
> DRBD, O2CB and OCFS2 start an run as user and group root
>
> So, should I now change to run all the cluster components as a root/root or
> hacluster/haclient
>

No.


>
> Could you please clarify this cluster group/user confusion for me.
>

Did you try running it and looking at the "ps axf" output?


>
> Many thanks in advance,
>
> Alex
>
>
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>
___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Andrew Beekhof
On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra
 wrote:
> Hi Guys,
>
> I'm running the following version of pacemaker and corosync
> corosync=1.1.1-1-2
> pacemaker=1.0.9-2-1
>
> Everything had been running fine for quite some time now, but then I
> started seeing the following errors in the corosync logs:
>
>
> =
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
> digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> 
>
> I can perform all the crm shell commands and what not but it's
> troubling that the above is happening.
>
> My crm_mon output looks good.
>
>
> I also checked the authkey and did an md5sum on both; it's the same.
>
> Then I stopped corosync and regenerated the authkey with
> corosync-keygen and copied it to the other machine, but I still get
> the above message in the corosync log.

Are you sure there's not a third node somewhere broadcasting on that
mcast and port combination?

>
> Is there anything other than the authkey that I should look into?
>
>
> corosync.conf
>
> 
>
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> totem {
>        version: 2
>        token: 3000
>        token_retransmits_before_loss_const: 10
>        join: 60
>        consensus: 1500
>        vsftype: none
>        max_messages: 20
>        clear_node_high_bit: yes
>        secauth: on
>        threads: 0
>        rrp_mode: passive
>
>        interface {
>                ringnumber: 0
>                bindnetaddr: 192.168.2.0
>                #mcastaddr: 226.94.1.1
>                broadcast: yes
>                mcastport: 5405
>        }
>        interface {
>                ringnumber: 1
>                bindnetaddr: 172.20.20.0
>                #mcastaddr: 226.94.1.1
>                broadcast: yes
>                mcastport: 5405
>        }
> }
>
>
> logging {
>        fileline: off
>        to_stderr: yes
>        to_logfile: yes
>        to_syslog: yes
>        logfile: /tmp/corosync.log
>        debug: off
>        timestamp: on
>        logger_subsys {
>                subsys: AMF
>                debug: off
>        }
> }
>
> service {
>        name: pacemaker
>        ver: 0
> }
>
> aisexec {
>        user:root
>        group: root
> }
>
> amf {
>        mode: disabled
> }
>
>
> ===
>
>
> Thanks
> Shravan
>
> ___
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker