Re: [Linux-ha-dev] start-delay parameter for monitor operation and Split-Brain

2007-04-12 Thread Andrew Beekhof

On 4/12/07, 池田淳子 [EMAIL PROTECTED] wrote:

Hi all,

I'm a newbie, and I'm trying to understand how and when split-brain happens.


when 1 or more nodes can't communicate with each other

there is no connection between this and an operation's start-delay


what you *might* be seeing is an old bug that was triggered when
start-delay > timeout
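
For reference, a minimal cib.xml sketch of where that parameter lives on the
monitor operation, keeping the delay well below the timeout; the resource name,
the intervals, and the exact attribute spelling (start_delay vs. start-delay,
which varies between releases/DTDs) are assumptions:

<primitive id="Dummy01" class="OCF" provider="heartbeat" type="Dummy">
  <operations>
    <!-- keep the delay well below the timeout; the spelling is release-dependent -->
    <op id="Dummy01_mon" name="monitor" interval="10s" timeout="30s" start_delay="0s"/>
  </operations>
</primitive>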


As a trial, I'm running a Dummy resource for now.
The Heartbeat version is 2.0.8;
cib.xml, ha.cf and ha-log are attached.
Please see the cases below and give me some advice.

*** case 01 ***
My cib.xml was created with hb_gui, so start-delay was 1m.
I first ran Heartbeat on 2 nodes, then disconnected the interconnect LAN
on the DC node to cause a split-brain (# ifdown eth2).
After creating the split-brain, I brought the LAN back up with ifup eth2.
If eth2 is restored after Action Dummy01_monitor_1 (x) confirmed appears on
the former stand-by node, everything seems to work well.
But if it is restored before that confirmation, something goes wrong.
In the failing case, I couldn't reproduce the split-brain again.
I found that pengine/tengine ran on both nodes, but one node kept trying to
become DC and the Dummy resource wouldn't start on that node; additionally,
the failcount was incremented on that strange node.

log message is here...
WARN: do_dc_join_finalize: join-3: We are still in a transition.  Delaying
until the TE completes.

In this case, I couldn't shut down the Heartbeat process without a KILL command...

*** case 02 ***
I changed start-delay from 1m to 0s.
The confirmation of the monitor operation then happens immediately, so even though I did
ifdown/ifup in quick succession, it didn't matter.

Q1: My guess is that if the interconnect LAN goes down and up in quick succession and some
operations aren't confirmed, one node treats the situation as an
ERROR and updates its failcount. This case might appear when start-delay
is long (e.g. 1m), and cause a strange split-brain.
Is this relationship between start-delay and split-brain correct?

*** case 03 ***
In case 02, the interconnect LAN came back up after the split-brain was complete, which
means each node had been elected DC.
For the third test, I tried to take the interconnect LAN down and up before the DC
election had finished.
As a result, I got a split-brain again, but one node stayed OFFLINE after
the interconnect LAN came back up.

Q2: I know I should set up two or more interconnect LANs just in
case, but are there any preferred ways to avoid case 03, e.g. tuning some
parameters or something like that?
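
For reference, a minimal ha.cf sketch with redundant heartbeat paths, so that
losing a single interconnect does not by itself partition the cluster; the
interface names, serial device, node names, and timings are assumptions:

# two independent network paths for cluster communication (assumed NICs)
bcast eth1
bcast eth2
# optional non-IP fallback path
serial /dev/ttyS0
baud 19200
# example timings
keepalive 2
warntime 10
deadtime 30
node node1 node2
crm yes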


Best Regards,
Junko Ikeda

NTT DATA INTELLILINK CORPORATION
Open Source Solutions Business Unit
Open Source Business Division

Toyosu Center Building Annex, 3-3-9, Toyosu,
Koto-ku, Tokyo 135-0061, Japan
TEL : +81-3-3534-4811
FAX : +81-3-3534-4814
mailto:[EMAIL PROTECTED]
http://www.intellilink.co.jp/





___
Linux-HA-Dev: [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] start-delay parameter for monitor operation and Split-Brain

2007-04-12 Thread Alan Robertson
池田淳子 wrote:
 Hi all,
 
 I'm a newbie, and I'm trying to understand how and when split-brain happens.
 As a trial, I'm running a Dummy resource for now.

Split-brain occurs when there is a total communication failure (from the
heartbeat perspective) between at least two different cluster nodes.

 Heartbeat version is 2.0.8,
 cib.xml, ha.cf and ha-log are attached.
 See below cases, please give some advice.
 
 *** case 01 ***
 My cib.xml was created with hb_gui, so start-delay was 1m.
 I first ran Heartbeat on 2 nodes, then disconnected the interconnect LAN
 on the DC node to cause a split-brain (# ifdown eth2).
 After creating the split-brain, I brought the LAN back up with ifup eth2.
 If eth2 is restored after Action Dummy01_monitor_1 (x) confirmed appears on
 the former stand-by node, everything seems to work well.
 But if it is restored before that confirmation, something goes wrong.
 In the failing case, I couldn't reproduce the split-brain again.
 I found that pengine/tengine ran on both nodes, but one node kept trying to
 become DC and the Dummy resource wouldn't start on that node; additionally,
 the failcount was incremented on that strange node.
 
 log message is here... 
 WARN: do_dc_join_finalize: join-3: We are still in a transition.  Delaying
 until the TE completes.
 
 In this case, I couldn't shut down the Heartbeat process without a KILL command...

There are some known bugs regarding not being able to shut down.
They're fixed in the upcoming version.

 *** case 02 ***
 I changed start-delay from 1m to 0s.
 The confirmation of the monitor operation then happens immediately, so even though I did
 ifdown/ifup in quick succession, it didn't matter.
 
 Q1: My guess is that if the interconnect LAN goes down and up in quick succession and some
 operations aren't confirmed, one node treats the situation as an
 ERROR and updates its failcount. This case might appear when start-delay
 is long (e.g. 1m), and cause a strange split-brain.
 Is this relationship between start-delay and split-brain correct?

As Andrew pointed out, there is no connection.

 *** case 03 ***
 In case 02, the interconnect LAN came back up after the split-brain was complete, which
 means each node had been elected DC.
 For the third test, I tried to take the interconnect LAN down and up before the DC
 election had finished.
 As a result, I got a split-brain again, but one node stayed OFFLINE after
 the interconnect LAN came back up.
 
 Q2: I know I should set up two or more interconnect LANs just in
 case, but are there any preferred ways to avoid case 03, e.g. tuning some
 parameters or something like that?

The nodes should go online again after rejoining.  What version is this
with?  Can you supply logs for what happened in this case?

By the way, doing ifup/ifdown on interfaces you're using for heartbeat
communication has been known to make heartbeat sick.  Plugging and
unplugging the connectors, or adding/removing firewall rules, does not do
this.
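
For reference, one way to drop the heartbeat traffic without touching the
interface, along the lines suggested above; the interface name is an assumption
and 694 is heartbeat's default udpport:

# simulate loss of the interconnect (assumed eth2) without ifdown
iptables -A INPUT  -i eth2 -p udp --dport 694 -j DROP
iptables -A OUTPUT -o eth2 -p udp --dport 694 -j DROP
# restore communication later by deleting the same rules
iptables -D INPUT  -i eth2 -p udp --dport 694 -j DROP
iptables -D OUTPUT -o eth2 -p udp --dport 694 -j DROP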

-- 
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions. - William
Wilberforce
___
Linux-HA-Dev: [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-HA] Heartbeat versus Novell Cluster Services

2007-04-12 Thread Sander van Vugt
Hi,

I'd just like to know your opinion about the following. A pure Linux shop
would of course definitely go for Heartbeat as the solution for high
availability. However, in an environment that comes from Novell's
NetWare, Novell Cluster Services (NCS) would be the best choice,
especially if running OESv1, which runs on top of SUSE Linux Enterprise
Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which
runs on top of SLES 10, it appears that customers do have a choice
between Heartbeat from the SLES stack and NCS from the OES stack. Does
anyone have thoughts about that? For example, when clustering something
like Novell GroupWise in a shop that wants to implement OESv2 later this
year, to me personally Heartbeat seems the better choice, since it has
many more features. What I'd like to know is: is there any particular
reason why one would still choose NCS in the upcoming OESv2? (Except for
backward compatibility, of course.)

Thanks,
Sander

___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-12 Thread Andrew Beekhof

On 4/11/07, Lars Marowsky-Bree [EMAIL PROTECTED] wrote:

On 2007-04-10T18:29:10, Andrew Beekhof [EMAIL PROTECTED] wrote:

 Apr 10 17:30:25 ha-test-1 process[26425]: Returnig 7
 Apr 10 17:30:40 ha-test-1 process[26493]: Maintainance =
 Apr 10 17:30:40 ha-test-1 process[26493]: OCF_RESKEY_probe = 1
 Apr 10 17:30:40 ha-test-1 process[26493]: Returnig 7
 Apr 10 17:30:56 ha-test-1 process[26603]: Maintainance =
 Apr 10 17:30:56 ha-test-1 process[26603]: OCF_RESKEY_probe = 1
 Apr 10 17:30:56 ha-test-1 process[26603]: Returnig 7
 Apr 10 17:31:11 ha-test-1 process[26714]: Maintainance =
 Apr 10 17:31:11 ha-test-1 process[26714]: OCF_RESKEY_probe = 1
 Apr 10 17:31:11 ha-test-1 process[26714]: Returnig 7
 [...]
 
 How can it happen that OCF_RESKEY_probe is set not only once to 1, but on each call of the monitor action?
 
 
 So I thought that probe is maybe never unset

 correct

http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=1479


lmb - not the same thing.  the resource is not deleted in between the
two types of monitor calls.

this is the normal preservation of op args
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pingd not failing over

2007-04-12 Thread Andrew Beekhof

i hate to pester, but where are the fail counts kept track of and what maintains them?


they are stored in the status section and are maintained by the
tengine process (which increases it whenever a monitor action fails)

there is also a CLI tool called crm_failcount that can be used to view
and modify the failcount.

there was a bug with the failcount not always being increased, this
was fixed in 2.0.8
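
For reference, a hedged example of the crm_failcount CLI; the resource and node
names are placeholders, and the option letters should be checked against
crm_failcount --help on your release:

# query the failcount of a resource on a node (hypothetical names)
crm_failcount -G -U node1 -r Dummy01
# reset it
crm_failcount -D -U node1 -r Dummy01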



FYI: the resource groups run fine without the pingd additions.


furthermore, here is a snip of what i think does not look right.  can anyone decipher?

...snip from messages

Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command <rsc_op id="11" operation="monitor" operation_key="pingd-child:0_monitor_0" on_node="roxetta" on_node_uuid="5d54293b-b319-4145-a280-a7d7e2dfd33a" transition_key="11:0:65528d84-33ba-4a40-a37c-edec19b7cd0c">
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command <primitive id="pingd-child:0" long-id="pingd:pingd-child:0" class="OCF" provider="heartbeat" type="pingd"/>
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command <attributes crm_feature_set="1.0.7" CRM_meta_timeout="5000" CRM_meta_op_target_rc="7" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1"/>
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command </rsc_op>
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command <rsc_op id="12" operation="monitor" operation_key="pingd-child:1_monitor_0" on_node="roxetta" on_node_uuid="5d54293b-b319-4145-a280-a7d7e2dfd33a" transition_key="12:0:65528d84-33ba-4a40-a37c-edec19b7cd0c">
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command <primitive id="pingd-child:1" long-id="pingd:pingd-child:1" class="OCF" provider="heartbeat" type="pingd"/>
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command <attributes crm_feature_set="1.0.7" CRM_meta_timeout="5000" CRM_meta_op_target_rc="7" CRM_meta_clone="1" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1"/>
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command </rsc_op>
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: do_log: [[FSA]] Input I_FAIL from get_lrm_resource() received in state (S_TRANSITION_ENGINE)
Apr  7 15:05:19 roxetta crmd: [8723]: info: do_state_transition: roxetta: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Apr  7 15:05:19 roxetta crmd: [8723]: info: do_state_transition: All 1 cluster nodes are eligable to run resources.
Apr  7 15:05:19 roxetta crmd: [8723]: info: start_subsystem: Starting sub-system tengine
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: start_subsystem: Client tengine already running as pid 29971
Apr  7 15:05:19 roxetta crmd: [8723]: info: stop_subsystem: Sent -TERM to tengine: [29971]
Apr  7 15:05:19 roxetta crmd: [8723]: WARN: do_log: [[FSA]] Input I_FAIL from get_lrm_resource() received in state (S_POLICY_ENGINE)
Apr  7 15:05:19 roxetta crmd: [8723]: info: do_state_transition: roxetta: State transition S_POLICY_ENGINE -> S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Apr  7 15:05:19 roxetta crmd: [8723]: info: update_dc: Set DC to null (null)
Apr  7 15:05:19 roxetta tengine: [29971]: info: update_abort_priority: Abort priority upgraded to 100
Apr  7 15:05:19 roxetta tengine: [29971]: info: update_abort_priority: Abort action 0 superceeded by 3
Apr  7 15:05:19 roxetta crmd: [8723]: info: do_dc_join_offer_all: join-2: Waiting on 1 outstanding join acks


...end snip




___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat versus Novell Cluster Services

2007-04-12 Thread Yan Fitterer
NCS has better integration with EVMS, and has data-network heartbeat; it
therefore does not require STONITH.

It has had much more testing than HB for large clusters as well. 20+
node clusters are not uncommon.

Yan

Sander van Vugt wrote:
 Hi,
 
 Just like to know your opinion about the following. A pure Linux shop
 would of course definitely go for Heartbeat as the solution for high
 availability. However, in an environment that comes from Novell's
 NetWare, Novell Cluster Services (NCS) would be the best choice,
 especially if running OESv1 that runs on top of SUSE Linux Enterprise
 Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which
 runs on top of SLES 10, it appears that customers do have a choice
 between Heartbeat from the SLES stack, or NCS from the OES stack. Does
 anyone have thoughts about that? For example, when clustering something
 like Novell GroupWise in a shop that wants to implement OESv2 later this
 year, to me personally, Heartbeat seems the better choice, since it has
 much more features. What I'd like to know, is there any particular
 reason why in the upcoming OESv2 one would still choose NCS? (Except for
 backward compatibility of course)
 
 Thanks,
 Sander
 
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat versus Novell Cluster Services

2007-04-12 Thread Yan Fitterer
Oh - and I forgot... the UI was _much_ nicer last time I looked, and nowhere
near as buggy as the HB GUI.

Yan Fitterer wrote:
 NCS has better integration with EVMS, and has data-network heartbeat. It
 does not therefore require STONITH.
 
 It has had much more testing than HB for large clusters as well. 20+
 node clusters are not uncommon.
 
 Yan
 
 Sander van Vugt wrote:
 Hi,

 Just like to know your opinion about the following. A pure Linux shop
 would of course definitely go for Heartbeat as the solution for high
 availability. However, in an environment that comes from Novell's
 NetWare, Novell Cluster Services (NCS) would be the best choice,
 especially if running OESv1 that runs on top of SUSE Linux Enterprise
 Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which
 runs on top of SLES 10, it appears that customers do have a choice
 between Heartbeat from the SLES stack, or NCS from the OES stack. Does
 anyone have thoughts about that? For example, when clustering something
 like Novell GroupWise in a shop that wants to implement OESv2 later this
 year, to me personally, Heartbeat seems the better choice, since it has
 much more features. What I'd like to know, is there any particular
 reason why in the upcoming OESv2 one would still choose NCS? (Except for
 backward compatibility of course)

 Thanks,
 Sander

___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Re: ping/ping_group directives failing WAS: unable to start pingd (cannot start resource groups as a result

2007-04-12 Thread Alan Robertson
Andrew Beekhof wrote:
 On 4/11/07, Terry L. Inzauro [EMAIL PROTECTED] wrote:
 list,

 this is a continuation of another thread that was started a few weeks
 back.  the original thread was
 started in regards
 to the setup of pingd. this thread is in regards to pingd not being
 able to start for whatever
 reason and i suspect my resource
 groups are not starting as a result ;(

 a little background:

 - two resource groups are defined. i want to split the two resource
 groups between nodes when both
 nodes are online. if both
 nodes are not online, then obviously, fail the resource resource group
 to the other available node.
 - pingd configuration was previously verified correct by Alan R.
 - crm_verify passes
 - BasicSanityCheck 'does not pass' (fails on pingd checks)
 
 pingd isn't failing...
 
 Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]:
 2007/04/11_12:44:05 ERROR: glib: Error sending packet: Operation not
 permitted
 Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]:
 2007/04/11_12:44:05 ERROR: write failure on ping 127.0.0.1.: Operation
 not permitted
 
 these messages are from the heartbeat communications layer - and if
 thats not working, then pingd has no hope at all.
 
 i have no idea why pinging localhost should fail - firewall?
 
 - without pingd, the resource groups function as expected
 - heartbeat has been restarted
 - heartbeat hangs on stopping so i do the following ;)
 for i in `ps -ef | grep heart  | awk '{print $2}'`; do kill
 $i; done

 i have noticed log entries in the log file that are obviously related
 to pingd.  this however 'may'
 not be the case.
 would anyone be interested in lending a hand?

 heartbeat version = 2.0.8-r2
 OS = gentoo 2006.1
 kernel = 2.6.18 (i have tested both hardenedwith grsecurity and pax
 as well as generic)



 cibadmin -Q output , ptest output, BasicSanityCheck output and
 messages file are all attached as a
 .tar.bz2.


 believe me when i tell you that i am stumped. any assistance is
 greatly appreciated.

Are you running one of the additional Linux security layers like
SE-Linux?  (there are other ones, like AppArmor).

Firewalls are certainly a thought for this.  But I don't remember ever
seeing a firewall that would cause system call failures.

Can you show us your firewall rules?

-- 
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions. - William
Wilberforce
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Re: ping/ping_group directives failing WAS: unable to start pingd (cannot start resource groups as a result

2007-04-12 Thread Terry L. Inzauro
Andrew Beekhof wrote:
 On 4/11/07, Terry L. Inzauro [EMAIL PROTECTED] wrote:
 list,

 this is a continuation of another thread that was started a few weeks
 back.  the original thread was
 started in regards
 to the setup of pingd. this thread is in regards to pingd not being
 able to start for whatever
 reason and i suspect my resource
 groups are not starting as a result ;(

 a little background:

 - two resource groups are defined. i want to split the two resource
 groups between nodes when both
 nodes are online. if both
 nodes are not online, then obviously, fail the resource resource group
 to the other available node.
 - pingd configuration was previously verified correct by Alan R.
 - crm_verify passes
 - BasicSanityCheck 'does not pass' (fails on pingd checks)
 
 pingd isn't failing...
 
 Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]:
 2007/04/11_12:44:05 ERROR: glib: Error sending packet: Operation not
 permitted
 Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]:
 2007/04/11_12:44:05 ERROR: write failure on ping 127.0.0.1.: Operation
 not permitted
 
 these messages are from the heartbeat communications layer - and if
 thats not working, then pingd has no hope at all.
 
 i have no idea why pinging localhost should fail - firewall?
 
 - without pingd, the resource groups function as expected
 - heartbeat has been restarted
 - heartbeat hangs on stopping so i do the following ;)
 for i in `ps -ef | grep heart  | awk '{print $2}'`; do kill
 $i; done

 i have noticed log entries in the log file that are obviously related
 to pingd.  this however 'may'
 not be the case.
 would anyone be interested in lending a hand?

 heartbeat version = 2.0.8-r2
 OS = gentoo 2006.1
 kernel = 2.6.18 (i have tested both hardenedwith grsecurity and pax
 as well as generic)



 cibadmin -Q output , ptest output, BasicSanityCheck output and
 messages file are all attached as a
 .tar.bz2.


 believe me when i tell you that i am stumped. any assistance is
 greatly appreciated.



 _Terry




 ___
 Linux-HA mailing list
 [EMAIL PROTECTED]
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems



no firewall. i tested with and without iptables. in fact i even unloaded ALL iptables modules just
to be certain.  so then i thought to myself: pax? perhaps grsecurity? no luck there either.  i
rebuilt a kernel without all of the grsec and pax hooks.  no luck.



destiny crm # lsmod
Module  Size  Used by
softdog 4752  0
tun 9184  0
e100   28360  0
sym53c8xx  64820  0
eepro100   25552  0
scsi_transport_spi 18752  1 sym53c8xx

destiny crm # ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.097 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.054 ms

--- 127.0.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.054/0.075/0.097/0.023 ms


so i re-ran BasicSanityCheck... same result.   any ideas?






___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Re: ping/ping_group directives failing WAS: unable to start pingd (cannot start resource groups as a result

2007-04-12 Thread Alan Robertson
Terry L. Inzauro wrote:
 Andrew Beekhof wrote:
 On 4/11/07, Terry L. Inzauro [EMAIL PROTECTED] wrote:
 list,

 this is a continuation of another thread that was started a few weeks
 back.  the original thread was
 started in regards
 to the setup of pingd. this thread is in regards to pingd not being
 able to start for whatever
 reason and i suspect my resource
 groups are not starting as a result ;(

 a little background:

 - two resource groups are defined. i want to split the two resource
 groups between nodes when both
 nodes are online. if both
 nodes are not online, then obviously, fail the resource resource group
 to the other available node.
 - pingd configuration was previously verified correct by Alan R.
 - crm_verify passes
 - BasicSanityCheck 'does not pass' (fails on pingd checks)
 pingd isn't failing...

 Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]:
 2007/04/11_12:44:05 ERROR: glib: Error sending packet: Operation not
 permitted
 Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]:
 2007/04/11_12:44:05 ERROR: write failure on ping 127.0.0.1.: Operation
 not permitted

 these messages are from the heartbeat communications layer - and if
 thats not working, then pingd has no hope at all.

 i have no idea why pinging localhost should fail - firewall?

 - without pingd, the resource groups function as expected
 - heartbeat has been restarted
 - heartbeat hangs on stopping so i do the following ;)
 for i in `ps -ef | grep heart  | awk '{print $2}'`; do kill
 $i; done

 i have noticed log entries in the log file that are obviously related
 to pingd.  this however 'may'
 not be the case.
 would anyone be interested in lending a hand?

 heartbeat version = 2.0.8-r2
 OS = gentoo 2006.1
 kernel = 2.6.18 (i have tested both hardenedwith grsecurity and pax
 as well as generic)



 cibadmin -Q output , ptest output, BasicSanityCheck output and
 messages file are all attached as a
 .tar.bz2.


 believe me when i tell you that i am stumped. any assistance is
 greatly appreciated.



 _Terry




 ___
 Linux-HA mailing list
 [EMAIL PROTECTED]
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems


 
 no firewall. i tested with and without iptables. in fact i even unloaded ALL iptables modules just
 to be certain.  so then i thought to myself: pax? perhaps grsecurity? no luck there either.  i
 rebuilt a kernel without all of the grsec and pax hooks.  no luck.
 
 
 
 destiny crm # lsmod
 Module  Size  Used by
 softdog 4752  0
 tun 9184  0
 e100   28360  0
 sym53c8xx  64820  0
 eepro100   25552  0
 scsi_transport_spi 18752  1 sym53c8xx
 
 destiny crm # ping 127.0.0.1
 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.097 ms
 64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.054 ms
 
 --- 127.0.0.1 ping statistics ---
 2 packets transmitted, 2 received, 0% packet loss, time 1002ms
 rtt min/avg/max/mdev = 0.054/0.075/0.097/0.023 ms
 
 
 so i re-ran BasicSanityCheck... same result.   any ideas?

Here is something to run and check...

ifconfig lo;ip addr show lo; route;ip route show

Here's what it produces on my machine:
lo        Link encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:520006 errors:0 dropped:0 overruns:0 frame:0
  TX packets:520006 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:190990507 (182.1 Mb)  TX bytes:190990507 (182.1 Mb)

1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.10.10.0      *               255.255.255.0   U     0      0        0 eth1
link-local      *               255.255.0.0     U     0      0        0 eth1
loopback        *               255.0.0.0       U     0      0        0 lo
default         gw              0.0.0.0         UG    0      0        0 eth1
10.10.10.0/24 dev eth1  proto kernel  scope link  src 10.10.10.5
169.254.0.0/16 dev eth1  scope link
127.0.0.0/8 dev lo  scope link
default via 10.10.10.254 dev eth1


I don't know what I'm looking for to be different, but it's at least
somewhere to start...


-- 
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions. - William
Wilberforce
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Heartbeat versus Novell Cluster Services

2007-04-12 Thread Alan Robertson
Sander van Vugt wrote:
 Hi,
 
 Just like to know your opinion about the following. A pure Linux shop
 would of course definitely go for Heartbeat as the solution for high
 availability. However, in an environment that comes from Novell's
 NetWare, Novell Cluster Services (NCS) would be the best choice,
 especially if running OESv1 that runs on top of SUSE Linux Enterprise
 Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which
 runs on top of SLES 10, it appears that customers do have a choice
 between Heartbeat from the SLES stack, or NCS from the OES stack. Does
 anyone have thoughts about that? For example, when clustering something
 like Novell GroupWise in a shop that wants to implement OESv2 later this
 year, to me personally, Heartbeat seems the better choice, since it has
 much more features. What I'd like to know, is there any particular
 reason why in the upcoming OESv2 one would still choose NCS? (Except for
 backward compatibility of course)


I can't comment on this, but I can suggest a question for you to ask.

What is the future (5-year) plan for NCS?



-- 
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions. - William
Wilberforce
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] no failback

2007-04-12 Thread Alan Robertson
Bernd Eichenberg wrote:
 Hi all,
 
 I have to warn you: I'm a newbie at HA and I'm German,
 with very limited English skills.
 I hope you understand my problem; thanks for that.
 
 My problem is that there is no failback after the 1st node is valid
 again.
 The 1st node is SuSE 8 with the heartbeat that came with it.
 The 2nd node is Debian 3 with a modern heartbeat.
 
 If the 1st node is not reachable, the 2nd node does its work.
 But if the 1st node is OK and up again, there's no failback.
 
 The options are:
 1st node, SuSE 8:
 nice_failback off
 2nd node, Debian 3: auto_failback on
 Can you tell me what's wrong?
 Thank you very much

There is a bug we recently found out about which affects compatibility
between old versions of heartbeat and recent versions of heartbeat :-(.

Gouchun Shi is looking into exactly which versions are affected in what
ways.

Which version of heartbeat are you running on the two machines?
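
For reference (hedged, since the exact option semantics depend on the release):
auto_failback is the newer replacement for nice_failback, with auto_failback on
corresponding to nice_failback off, so the two ha.cf lines below express the
same intent:

# newer heartbeat (e.g. the Debian node)
auto_failback on
# older heartbeat that predates auto_failback (e.g. the SuSE 8 node)
nice_failback off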



-- 
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions. - William
Wilberforce
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] UDP Checksum error in heartbeat packets

2007-04-12 Thread Alan Robertson
Dominik Klein wrote:
 Hi
 
 I use heartbeat 2.0.7 from openSuSE 10.2
 
 10.250.250.27 is master
 10.250.250.28 is backup
 They are connected with direct ethernet cable
 
 When watching UDP heartbeats with tshark, my master machine says this
 (checksum errors):
 16:27:24.140509 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:24.186876 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694 [UDP CHECKSUM INCORRECT]
 16:27:25.140610 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:25.178983 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694 [UDP CHECKSUM INCORRECT]
 16:27:26.144709 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:26.179085 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694 [UDP CHECKSUM INCORRECT]
 16:27:27.144811 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:27.179183 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694 [UDP CHECKSUM INCORRECT]
 16:27:27.179503 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694 [UDP CHECKSUM INCORRECT]
 
 My backup machine says this (no checksum errors):
 16:27:24.035730 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:24.082279 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694
 16:27:25.035783 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:25.074332 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694
 16:27:26.039835 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:26.074381 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694
 16:27:27.039904 10.250.250.28 -> 10.250.250.27 UDP Source port: 32779  Destination port: 694
 16:27:27.074430 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694
 16:27:27.074680 10.250.250.27 -> 10.250.250.28 UDP Source port: 32776  Destination port: 694
 
 When backup becomes master, this situation does not change, so it is not
 dependent on master/backup status.
 
 master and backup are identical hardware, both use openSuSE 10.2, Intel
 Primergy RX300 and Dell 82541GI/PI Gigabit Ethernet controllers.
 
 I tried using ucast and bcast heartbeats, both show checksum errors from
 master machine.
 
 Is this something to worry about and what can be done about it?

That's an awful lot of bad packets, it seems to me.

My best guess is that it's a hardware problem.

It's difficult to say which side is at fault, except by trial and error.

In theory it could also be a poor quality crossover cable.  Gigabit
needs cat5E or preferably cat6.

I'd suspect active components first - unless your crossover cable is
long, or your environment is electrically noisy.

Of course, since I work for IBM, I could think of ways to solve your
problem with IBM hardware ;-)




-- 
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions. - William
Wilberforce
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-12 Thread Lars Marowsky-Bree
On 2007-04-12T08:58:15, Andrew Beekhof [EMAIL PROTECTED] wrote:

  So I thought that probe is maybe never unset
 
  correct
 
 http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=1479
 
 lmb - not the same thing.  the resource is not deleted in between the
 two types of monitor calls.
 
 this is the normal preservation of op args

CRM always provides the full set of attributes, no? So maybe that logic
in the LRM should go.

(And, if I update the attributes in the CIB, does that really cause the
resource to be _deleted_ between the stop/start? In that case the
attributes would still be around ...)
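
For what it's worth, a sketch of how an agent could distinguish the one-shot
probe from the recurring monitor without relying on a sticky instance
parameter, assuming the interval is exported to the agent as
OCF_RESKEY_CRM_meta_interval (the check_running/deep_health_check helpers are
hypothetical):

#!/bin/sh
# monitor() fragment from a hypothetical OCF agent
monitor() {
    if [ "${OCF_RESKEY_CRM_meta_interval:-0}" -eq 0 ]; then
        # interval 0 == probe: only report whether the resource is running
        check_running && return $OCF_SUCCESS || return $OCF_NOT_RUNNING
    fi
    # recurring monitor: do the full health check
    deep_health_check
}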


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Memory Leaks

2007-04-12 Thread Hariharan Jayaraman

Hi All,

We are using Linux-HA to build an HA solution for a 2-node system. We have
observed memory leaks in the crmd process and have seen it grow to beyond 700MB.

I am currently running 2.0.8 with a patch that fixes a few memory leak
issues in crmd.  I am trying to use valgrind and efence to nail down more
memory leaks.

To this effect I compiled heartbeat with --enable-crm-force-malloc, and now
the system does not come up. Looking at the error log, it seems there is an
inconsistency between the allocation and free functions: even with this flag
set, I see that while reading the CIB, memory is allocated with crm_malloc0
but freed with cl_free, which fails in the GUARD_IS_OK macro.


My questions are:

1 Does --enable-crm-force-malloc work?
2 How can I use valgrind effectively? crmd is started by heartbeat, so how do
I make it work with valgrind? (See the wrapper sketch after this list.)
3 Will using efence cause any problems? I see that the code has varied
allocation logic implemented.
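
For question 2, one generic approach is to put a small wrapper in place of the
crmd binary so that whatever heartbeat spawns runs under valgrind; the paths
below are assumptions and depend on where your build installs crmd:

#!/bin/sh
# installed in place of the real crmd (keep a backup, e.g. crmd.real)
# heartbeat respawns crmd, so remember to restore the original binary afterwards
# (some valgrind versions append the PID to the log file name automatically)
exec /usr/bin/valgrind --tool=memcheck --leak-check=full \
    --log-file=/var/lib/heartbeat/crmd-valgrind.log \
    /usr/lib/heartbeat/crmd.real "$@"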

In case this is not the place for this, should I post it on the dev list?

Any help with the above is much appreciated.

Thanks
-Hariharan
___
Linux-HA mailing list
[EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems