Re: [Linux-ha-dev] start-delay parameter for monitor operation and Split-Brain
On 4/12/07, 池田淳子 [EMAIL PROTECTED] wrote:

> Hi all, I'm a newbie, and trying to understand how or when Split-Brain happens.

When one or more nodes can't communicate with each other.

There is no connection between this and an operation's start-delay. What you *might* be seeing is an old bug that was triggered when the start-delay timeout

> As a trial, I'm running a Dummy resource for now. The Heartbeat version is 2.0.8; cib.xml, ha.cf and ha-log are attached. Please see the cases below and give some advice.
>
> *** case 01 ***
> My cib.xml was created using hb_gui, so start-delay was 1m. I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN on the DC node to induce Split-Brain (# ifdown eth2). After creating the Split-Brain, I brought the LAN up again with ifup eth2. If eth2 is restored after Action Dummy01_monitor_1 (x) is confirmed on the former standby node, everything seems to work well. But if it is restored before the confirmation, something goes wrong. In the broken case, I couldn't generate Split-Brain again. I found that pengine/tengine ran on both nodes, but one node kept trying to be DC and the Dummy resource wouldn't start on that node; additionally, the failcount was incremented on that strange node. The log message is here...
>
> WARN: do_dc_join_finalize: join-3: We are still in a transition. Delaying until the TE completes.
>
> In this case, I couldn't shut down the Heartbeat process without the KILL command...
>
> *** case 02 ***
> I changed start-delay from 1m to 0s. The confirmation process for monitor then works immediately, so even though I did ifdown/ifup in a row, it didn't matter.
>
> Q1: My guess is that if the interconnect LAN goes down/up in a row and some operations aren't confirmed, one node considers this situation an ERROR and updates its failcount. This case might appear when start-delay is long (e.g. 1m), and cause some strange Split-Brain. Is this relationship between start-delay and Split-Brain correct?
>
> *** case 03 ***
> In case 02, the interconnect LAN came up after the Split-Brain was complete, meaning each node had been voted DC. For the third test, I tried to down/up the interconnect LAN before the DC election finished. As a result, I met Split-Brain again, but one node stayed OFFLINE after the interconnect LAN came back up.
>
> Q2: I know I had better set up two or more interconnect LANs just in case, but are there any preferred ways to avoid case 03? e.g. tuning some parameters or something like that.
>
> Best Regards,
> Junko Ikeda
> NTT DATA INTELLILINK CORPORATION
> Open Source Solutions Business Unit, Open Source Business Division
> Toyosu Center Building Annex, 3-3-9, Toyosu, Koto-ku, Tokyo 135-0061, Japan
> TEL : +81-3-3534-4811 FAX : +81-3-3534-4814
> mailto:[EMAIL PROTECTED] http://www.intellilink.co.jp/

___
Linux-HA-Dev: [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
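For context, the start-delay being discussed lives on the resource's monitor operation in cib.xml. A rough sketch of the kind of fragment hb_gui is said to generate is below; the ids and values are illustrative rather than taken from the attached cib.xml, and the exact attribute spelling (start_delay vs. start-delay) should be checked against the CIB DTD of the Heartbeat version in use:

  <primitive id="Dummy01" class="ocf" provider="heartbeat" type="Dummy">
    <operations>
      <!-- illustrative values only: a recurring monitor whose first run
           is delayed by the 1m start-delay discussed in this thread -->
      <op id="Dummy01_mon" name="monitor" interval="10s" timeout="20s" start_delay="1m"/>
    </operations>
  </primitive>

Setting the delay to 0s (case 02 above) simply makes the first monitor run, and therefore its confirmation, happen immediately after the resource starts.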
Re: [Linux-ha-dev] start-delay parameter for monitor operation and Split-Brain
池田淳子 wrote:

> Hi all, I'm a newbie, and trying to understand how or when Split-Brain happens. As a trial, I'm running a Dummy resource for now.

Split-brain occurs when there is a total communication failure (from the heartbeat perspective) between at least two different cluster nodes.

> The Heartbeat version is 2.0.8; cib.xml, ha.cf and ha-log are attached. Please see the cases below and give some advice.
>
> *** case 01 ***
> My cib.xml was created using hb_gui, so start-delay was 1m. I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN on the DC node to induce Split-Brain (# ifdown eth2). After creating the Split-Brain, I brought the LAN up again with ifup eth2. If eth2 is restored after Action Dummy01_monitor_1 (x) is confirmed on the former standby node, everything seems to work well. But if it is restored before the confirmation, something goes wrong. In the broken case, I couldn't generate Split-Brain again. I found that pengine/tengine ran on both nodes, but one node kept trying to be DC and the Dummy resource wouldn't start on that node; additionally, the failcount was incremented on that strange node. The log message is here...
>
> WARN: do_dc_join_finalize: join-3: We are still in a transition. Delaying until the TE completes.
>
> In this case, I couldn't shut down the Heartbeat process without the KILL command...

There are some known bugs regarding not being able to shut down. They're fixed in the upcoming version.

> *** case 02 ***
> I changed start-delay from 1m to 0s. The confirmation process for monitor then works immediately, so even though I did ifdown/ifup in a row, it didn't matter.
>
> Q1: My guess is that if the interconnect LAN goes down/up in a row and some operations aren't confirmed, one node considers this situation an ERROR and updates its failcount. This case might appear when start-delay is long (e.g. 1m), and cause some strange Split-Brain. Is this relationship between start-delay and Split-Brain correct?

As Andrew pointed out, there is no connection.

> *** case 03 ***
> In case 02, the interconnect LAN came up after the Split-Brain was complete, meaning each node had been voted DC. For the third test, I tried to down/up the interconnect LAN before the DC election finished. As a result, I met Split-Brain again, but one node stayed OFFLINE after the interconnect LAN came back up.
>
> Q2: I know I had better set up two or more interconnect LANs just in case, but are there any preferred ways to avoid case 03? e.g. tuning some parameters or something like that.

The nodes should go online again after rejoining. What version is this with? Can you supply logs for what happened in this case?

By the way, doing ifup/ifdown of interfaces you're using for heartbeat connections has been known to make heartbeat sick. Plugging and unplugging the connectors, or adding/removing firewall rules, does not do this.

--
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce

___
Linux-HA-Dev: [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
[Linux-HA] Heartbeat versus Novell Cluster Services
Hi, Just like to know your opinion about the following. A pure Linux shop would of course definitely go for Heartbeat as the solution for high availability. However, in an environment that comes from Novell's NetWare, Novell Cluster Services (NCS) would be the best choice, especially if running OESv1 that runs on top of SUSE Linux Enterprise Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which runs on top of SLES 10, it appears that customers do have a choice between Heartbeat from the SLES stack, or NCS from the OES stack. Does anyone have thoughts about that?

For example, when clustering something like Novell GroupWise in a shop that wants to implement OESv2 later this year, to me personally, Heartbeat seems the better choice, since it has much more features. What I'd like to know, is there any particular reason why in the upcoming OESv2 one would still choose NCS? (Except for backward compatibility of course)

Thanks, Sander

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] OCF_RESKEY_interval
On 4/11/07, Lars Marowsky-Bree [EMAIL PROTECTED] wrote:

> On 2007-04-10T18:29:10, Andrew Beekhof [EMAIL PROTECTED] wrote:
>
>> Apr 10 17:30:25 ha-test-1 process[26425]: Returnig 7
>> Apr 10 17:30:40 ha-test-1 process[26493]: Maintainance =
>> Apr 10 17:30:40 ha-test-1 process[26493]: OCF_RESKEY_probe = 1
>> Apr 10 17:30:40 ha-test-1 process[26493]: Returnig 7
>> Apr 10 17:30:56 ha-test-1 process[26603]: Maintainance =
>> Apr 10 17:30:56 ha-test-1 process[26603]: OCF_RESKEY_probe = 1
>> Apr 10 17:30:56 ha-test-1 process[26603]: Returnig 7
>> Apr 10 17:31:11 ha-test-1 process[26714]: Maintainance =
>> Apr 10 17:31:11 ha-test-1 process[26714]: OCF_RESKEY_probe = 1
>> Apr 10 17:31:11 ha-test-1 process[26714]: Returnig 7
>> [...]
>>
>> How can it happen that OCF_RESKEY_probe is set not only once to 1, but on each call of the monitor action? So I thought that probe is maybe never unset correctly
>
> http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=1479

lmb - not the same thing. The resource is not deleted in between the two types of monitor calls; this is the normal preservation of op args.

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
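The ha-test-1 log lines appear to come from a hand-rolled resource agent that prints its parameters on every invocation. A stripped-down sketch of what such an agent's monitor path might look like is below; the parameter names, logger tag and state file are hypothetical (this is not the poster's actual script), the point is only that every action is started with the full OCF_RESKEY_* environment the CRM recorded for the operation:

  #!/bin/sh
  # hypothetical OCF-style agent fragment, for illustration only
  monitor() {
      # each invocation receives the complete parameter set again,
      # which is why OCF_RESKEY_probe shows up on every monitor call
      logger -i -t process "Maintainance = ${OCF_RESKEY_maintainance}"
      logger -i -t process "OCF_RESKEY_probe = ${OCF_RESKEY_probe}"
      if [ -e "${OCF_RESKEY_statefile:-/var/run/process.state}" ]; then
          return 0    # OCF_SUCCESS: resource is running
      fi
      return 7        # OCF_NOT_RUNNING, matching the "Returnig 7" lines
  }

  case "$1" in
      monitor) monitor ;;
      start|stop|meta-data) exit 0 ;;   # omitted in this sketch
  esac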
Re: [Linux-HA] pingd not failing over
> i hate to pester, but where are the fail counts kept track of and what maintains them?

They are stored in the status section and are maintained by the tengine process (which increases it whenever a monitor action fails). There is also a CLI tool called crm_failcount that can be used to view and modify the failcount.

There was a bug with the failcount not always being increased; this was fixed in 2.0.8.

> FYI: the resource groups run fine without the pingd additions. furthermore, here is a snip of what i think does not look right. can anyone decipher?
>
> ...snip from messages
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command rsc_op id=11 operation=monitor operation_key=pingd-child:0_monitor_0 on_node=roxetta on_node_uuid=5d54293b-b319-4145-a280-a7d7e2dfd33a transition_key=11:0:65528d84-33ba-4a40-a37c-edec19b7cd0c
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command primitive id=pingd-child:0 long-id=pingd:pingd-child:0 class=OCF provider=heartbeat type=pingd/
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command attributes crm_feature_set=1.0.7 CRM_meta_timeout=5000 CRM_meta_op_target_rc=7 CRM_meta_clone=0 CRM_meta_clone_max=2 CRM_meta_clone_node_max=1/
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command /rsc_op
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command rsc_op id=12 operation=monitor operation_key=pingd-child:1_monitor_0 on_node=roxetta on_node_uuid=5d54293b-b319-4145-a280-a7d7e2dfd33a transition_key=12:0:65528d84-33ba-4a40-a37c-edec19b7cd0c
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command primitive id=pingd-child:1 long-id=pingd:pingd-child:1 class=OCF provider=heartbeat type=pingd/
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command attributes crm_feature_set=1.0.7 CRM_meta_timeout=5000 CRM_meta_op_target_rc=7 CRM_meta_clone=1 CRM_meta_clone_max=2 CRM_meta_clone_node_max=1/
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: log_data_element: do_lrm_invoke: Bad command /rsc_op
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: do_log: [[FSA]] Input I_FAIL from get_lrm_resource() received in state (S_TRANSITION_ENGINE)
> Apr 7 15:05:19 roxetta crmd: [8723]: info: do_state_transition: roxetta: State transition S_TRANSITION_ENGINE - S_POLICY_ENGINE [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
> Apr 7 15:05:19 roxetta crmd: [8723]: info: do_state_transition: All 1 cluster nodes are eligable to run resources.
> Apr 7 15:05:19 roxetta crmd: [8723]: info: start_subsystem: Starting sub-system tengine
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: start_subsystem: Client tengine already running as pid 29971
> Apr 7 15:05:19 roxetta crmd: [8723]: info: stop_subsystem: Sent -TERM to tengine: [29971]
> Apr 7 15:05:19 roxetta crmd: [8723]: WARN: do_log: [[FSA]] Input I_FAIL from get_lrm_resource() received in state (S_POLICY_ENGINE)
> Apr 7 15:05:19 roxetta crmd: [8723]: info: do_state_transition: roxetta: State transition S_POLICY_ENGINE - S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
> Apr 7 15:05:19 roxetta crmd: [8723]: info: update_dc: Set DC to null (null)
> Apr 7 15:05:19 roxetta tengine: [29971]: info: update_abort_priority: Abort priority upgraded to 100
> Apr 7 15:05:19 roxetta tengine: [29971]: info: update_abort_priority: Abort action 0 superceeded by 3
> Apr 7 15:05:19 roxetta crmd: [8723]: info: do_dc_join_offer_all: join-2: Waiting on 1 outstanding join acks
> ...end snip

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
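For completeness, crm_failcount in the 2.0.x series is typically driven with a resource name and a node name. The option letters below are from memory and may differ slightly between releases, so treat this as a sketch and check crm_failcount --help:

  # show the current failcount for pingd-child:0 on node roxetta
  crm_failcount -G -U roxetta -r pingd-child:0

  # clear it again once the underlying problem has been resolved
  crm_failcount -D -U roxetta -r pingd-child:0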
Re: [Linux-HA] Heartbeat versus Novell Cluster Services
NCS has better integration with EVMS, and has data-network heartbeat. It does not therefore require STONITH. It has had much more testing than HB for large clusters as well; 20+ node clusters are not uncommon.

Yan

Sander van Vugt wrote:
> Hi, Just like to know your opinion about the following. A pure Linux shop would of course definitely go for Heartbeat as the solution for high availability. However, in an environment that comes from Novell's NetWare, Novell Cluster Services (NCS) would be the best choice, especially if running OESv1 that runs on top of SUSE Linux Enterprise Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which runs on top of SLES 10, it appears that customers do have a choice between Heartbeat from the SLES stack, or NCS from the OES stack. Does anyone have thoughts about that?
>
> For example, when clustering something like Novell GroupWise in a shop that wants to implement OESv2 later this year, to me personally, Heartbeat seems the better choice, since it has much more features. What I'd like to know, is there any particular reason why in the upcoming OESv2 one would still choose NCS? (Except for backward compatibility of course)
>
> Thanks, Sander

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat versus Novell Cluster Services
Oh - and I forgot... UI was _much_ nicer last time I looked. And nowhere near as buggy as the HB GUI.

Yan Fitterer wrote:
> NCS has better integration with EVMS, and has data-network heartbeat. It does not therefore require STONITH. It has had much more testing than HB for large clusters as well. 20+ node clusters are not uncommon.
>
> Yan
>
> Sander van Vugt wrote:
>> Hi, Just like to know your opinion about the following. A pure Linux shop would of course definitely go for Heartbeat as the solution for high availability. However, in an environment that comes from Novell's NetWare, Novell Cluster Services (NCS) would be the best choice, especially if running OESv1 that runs on top of SUSE Linux Enterprise Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which runs on top of SLES 10, it appears that customers do have a choice between Heartbeat from the SLES stack, or NCS from the OES stack. Does anyone have thoughts about that? For example, when clustering something like Novell GroupWise in a shop that wants to implement OESv2 later this year, to me personally, Heartbeat seems the better choice, since it has much more features. What I'd like to know, is there any particular reason why in the upcoming OESv2 one would still choose NCS? (Except for backward compatibility of course)
>>
>> Thanks, Sander

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Re: ping/ping_group directives failing WAS: unable to start pingd (cannot start resource groups as a result
Andrew Beekhof wrote:
> On 4/11/07, Terry L. Inzauro [EMAIL PROTECTED] wrote:
>> list, this is a continuation of another thread that was started a few weeks back. the original thread was started in regards to the setup of pingd. this thread is in regards to pingd not being able to start for whatever reason, and i suspect my resource groups are not starting as a result ;(
>>
>> a little background:
>> - two resource groups are defined. i want to split the two resource groups between nodes when both nodes are online. if both nodes are not online, then obviously, fail the resource group to the other available node.
>> - pingd configuration was previously verified correct by Alan R.
>> - crm_verify passes
>> - BasicSanityCheck 'does not pass' (fails on pingd checks)
>
> pingd isn't failing...
>
> Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]: 2007/04/11_12:44:05 ERROR: glib: Error sending packet: Operation not permitted
> Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]: 2007/04/11_12:44:05 ERROR: write failure on ping 127.0.0.1.: Operation not permitted
>
> these messages are from the heartbeat communications layer - and if that's not working, then pingd has no hope at all. i have no idea why pinging localhost should fail - firewall?
>
>> - without pingd, the resource groups function as expected
>> - heartbeat has been restarted
>> - heartbeat hangs on stopping, so i do the following ;)
>>   for i in `ps -ef | grep heart | awk '{print $2}'`; do kill $i; done
>>
>> i have noticed log entries in the log file that are obviously related to pingd. this however 'may' not be the case. would anyone be interested in lending a hand?
>>
>> heartbeat version = 2.0.8-r2
>> OS = gentoo 2006.1
>> kernel = 2.6.18 (i have tested both hardened with grsecurity and pax as well as generic)
>>
>> cibadmin -Q output, ptest output, BasicSanityCheck output and messages file are all attached as a .tar.bz2. believe me when i tell you that i am stumped. any assistance is greatly appreciated.

Are you running one of the additional Linux security layers like SE-Linux? (There are other ones, like AppArmor.)

Firewalls are certainly a thought for this. But I don't remember ever seeing a firewall that would cause system call failures. Can you show us your firewall rules?

--
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Re: ping/ping_group directives failing WAS: unable to start pingd (cannot start resource groups as a result
Andrew Beekhof wrote:
> On 4/11/07, Terry L. Inzauro [EMAIL PROTECTED] wrote:
>> list, this is a continuation of another thread that was started a few weeks back. the original thread was started in regards to the setup of pingd. this thread is in regards to pingd not being able to start for whatever reason, and i suspect my resource groups are not starting as a result ;(
>>
>> a little background:
>> - two resource groups are defined. i want to split the two resource groups between nodes when both nodes are online. if both nodes are not online, then obviously, fail the resource group to the other available node.
>> - pingd configuration was previously verified correct by Alan R.
>> - crm_verify passes
>> - BasicSanityCheck 'does not pass' (fails on pingd checks)
>
> pingd isn't failing...
>
> Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]: 2007/04/11_12:44:05 ERROR: glib: Error sending packet: Operation not permitted
> Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]: 2007/04/11_12:44:05 ERROR: write failure on ping 127.0.0.1.: Operation not permitted
>
> these messages are from the heartbeat communications layer - and if that's not working, then pingd has no hope at all. i have no idea why pinging localhost should fail - firewall?
>
>> - without pingd, the resource groups function as expected
>> - heartbeat has been restarted
>> - heartbeat hangs on stopping, so i do the following ;)
>>   for i in `ps -ef | grep heart | awk '{print $2}'`; do kill $i; done
>>
>> i have noticed log entries in the log file that are obviously related to pingd. this however 'may' not be the case. would anyone be interested in lending a hand?
>>
>> heartbeat version = 2.0.8-r2
>> OS = gentoo 2006.1
>> kernel = 2.6.18 (i have tested both hardened with grsecurity and pax as well as generic)
>>
>> cibadmin -Q output, ptest output, BasicSanityCheck output and messages file are all attached as a .tar.bz2. believe me when i tell you that i am stumped. any assistance is greatly appreciated.
>>
>> _Terry

no firewall. i tested with and without iptables. in fact i even unloaded ALL iptables modules just to be certain.

so then i thought to myself... pax? perhaps grsecurity? no luck there either. i rebuilt a kernel without all of the grsec and pax hooks. no luck.

destiny crm # lsmod
Module              Size  Used by
softdog             4752  0
tun                 9184  0
e100               28360  0
sym53c8xx          64820  0
eepro100           25552  0
scsi_transport_spi 18752  1 sym53c8xx

destiny crm # ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.097 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.054 ms

--- 127.0.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.054/0.075/0.097/0.023 ms

so i re-ran BasicSanityCheck... same result. any ideas?

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Re: ping/ping_group directives failing WAS: unable to start pingd (cannot start resource groups as a result
Terry L. Inzauro wrote: Andrew Beekhof wrote: On 4/11/07, Terry L. Inzauro [EMAIL PROTECTED] wrote: list, this is a continuation of another thread that was started a few weeks back. the original thread was started in regards to the setup of pingd. this thread is in regards to pingd not being able to start for whatever reason and i suspect my resource groups are not starting as a result ;( a little background: - two resource groups are defined. i want to split the two resource groups between nodes when both nodes are online. if both nodes are not online, then obviously, fail the resource resource group to the other available node. - pingd configuration was previously verified correct by Alan R. - crm_verify passes - BasicSanityCheck 'does not pass' (fails on pingd checks) pingd isn't failing... Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]: 2007/04/11_12:44:05 ERROR: glib: Error sending packet: Operation not permitted Apr 11 12:44:07 roxetta CTS: BadNews: heartbeat[13770]: 2007/04/11_12:44:05 ERROR: write failure on ping 127.0.0.1.: Operation not permitted these messages are from the heartbeat communications layer - and if thats not working, then pingd has no hope at all. i have no idea why pinging localhost should fail - firewall? - without pingd, the resource groups function as expected - heartbeat has been restarted - heartbeat hangs on stopping so i do the following ;) for i in `ps -ef | grep heart | awk '{print $2}'`; do kill $i; done i have noticed log entries in the log file that are obviously related to pingd. this however 'may' not be the case. would anyone be interested in lending a hand? heartbeat version = 2.0.8-r2 OS = gentoo 2006.1 kernel = 2.6.18 (i have tested both hardenedwith grsecurity and pax as well as generic) cibadmin -Q output , ptest output, BasicSanityCheck output and messages file are all attached as a .tar.bz2. believe me when i tell you that i am stumped. any assistance is greatly appreciated. _Terry ___ Linux-HA mailing list [EMAIL PROTECTED] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems no firewall. i tested with and without iptables. in fact i even unloaded ALL iptables modules just to be certain. so then i thought to myself. pax? perhaps grsecurity? no luck there either. i rebuild a kernel without all of the grsec and pax hooks. no luck. destiny crm # lsmod Module Size Used by softdog 4752 0 tun 9184 0 e100 28360 0 sym53c8xx 64820 0 eepro100 25552 0 scsi_transport_spi 18752 1 sym53c8xx destiny crm # ping 127.0.0.1 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.097 ms 64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.054 ms --- 127.0.0.1 ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1002ms rtt min/avg/max/mdev = 0.054/0.075/0.097/0.023 ms so i re-ran BasicSAanityChecksame result. any ideas? Here is something to run and check... 
ifconfig lo; ip addr show lo; route; ip route show

Here's what it produces on my machine:

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:520006 errors:0 dropped:0 overruns:0 frame:0
          TX packets:520006 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:190990507 (182.1 Mb)  TX bytes:190990507 (182.1 Mb)

1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.10.10.0      *               255.255.255.0   U     0      0        0 eth1
link-local      *               255.255.0.0     U     0      0        0 eth1
loopback        *               255.0.0.0       U     0      0        0 lo
default         gw              0.0.0.0         UG    0      0        0 eth1

10.10.10.0/24 dev eth1  proto kernel  scope link  src 10.10.10.5
169.254.0.0/16 dev eth1  scope link
127.0.0.0/8 dev lo  scope link
default via 10.10.10.254 dev eth1

I don't know what I'm looking for to be different, but it's at least somewhere to start...

--
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
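If more data is wanted from the failing box, a couple of extra captures may be worth adding alongside the commands above. One known source of "Operation not permitted" from a send on a socket is a packet filter dropping the traffic on the OUTPUT path, so even though iptables was reportedly ruled out, listing the rules with their counters costs nothing. These are generic commands, nothing heartbeat-specific:

  # list all filter-table rules together with their hit counters (run as root)
  iptables -L -n -v

  # confirm the loopback interface is up and still owns 127.0.0.1
  ip link show lo
  ip addr show lo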
Re: [Linux-HA] Heartbeat versus Novell Cluster Services
Sander van Vugt wrote:
> Hi, Just like to know your opinion about the following. A pure Linux shop would of course definitely go for Heartbeat as the solution for high availability. However, in an environment that comes from Novell's NetWare, Novell Cluster Services (NCS) would be the best choice, especially if running OESv1 that runs on top of SUSE Linux Enterprise Server (SLES) 9. Now in the upcoming Open Enterprise Server 2, which runs on top of SLES 10, it appears that customers do have a choice between Heartbeat from the SLES stack, or NCS from the OES stack. Does anyone have thoughts about that?
>
> For example, when clustering something like Novell GroupWise in a shop that wants to implement OESv2 later this year, to me personally, Heartbeat seems the better choice, since it has much more features. What I'd like to know, is there any particular reason why in the upcoming OESv2 one would still choose NCS? (Except for backward compatibility of course)

I can't comment on this, but I can suggest a question for you to ask. What is the future (5-year) plan for NCS?

--
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] no failback
Bernd Eichenberg wrote:
> Hi all, I have to warn you: I'm a newbie at HA and I'm German with very small English skills. I hope you understand my problem, and thank you for that.
>
> My problem is that there is no failback after the 1st node is valid again. The 1st node is SuSE 8 with the heartbeat that came with it. The 2nd node is Debian 3 with a modern heartbeat. If the 1st node is not reachable, the 2nd node does its work. But when the 1st node is OK and up again, there is no failback.
>
> The options are:
> 1st node (SuSE 8): nice_failback off
> 2nd node (Debian 3): auto_failback on
>
> Can you tell me what's wrong? Thank you very much.

There is a bug we recently found out about which affects compatibility between old versions of heartbeat and recent versions of heartbeat :-(. Guochun Shi is looking into exactly which versions are affected in what ways.

Which version of heartbeat are you running on the two machines?

--
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
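For reference, the two directives express the same policy in different vocabularies: if the mapping is remembered correctly, nice_failback on corresponds to auto_failback off, so the settings quoted above (nice_failback off on one node, auto_failback on on the other) both ask for resources to fail back automatically. A sketch of the matching ha.cf lines is below; the host roles are taken from the post, and this only illustrates the directives rather than addressing the version-compatibility bug Alan mentions:

  # ha.cf on the older node (heartbeat shipped with SuSE 8) - legacy directive
  nice_failback off

  # ha.cf on the newer node (Debian 3) - current directive, same meaning
  auto_failback on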
Re: [Linux-HA] UDP Checksum error in heartbeat packets
Dominik Klein wrote:
> Hi, I use heartbeat 2.0.7 from openSuSE 10.2. 10.250.250.27 is the master, 10.250.250.28 is the backup. They are connected with a direct ethernet cable.
>
> When watching UDP heartbeats with tshark, my master machine says this (checksum errors):
>
> 16:27:24.140509 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:24.186876 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694 [UDP CHECKSUM INCORRECT]
> 16:27:25.140610 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:25.178983 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694 [UDP CHECKSUM INCORRECT]
> 16:27:26.144709 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:26.179085 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694 [UDP CHECKSUM INCORRECT]
> 16:27:27.144811 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:27.179183 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694 [UDP CHECKSUM INCORRECT]
> 16:27:27.179503 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694 [UDP CHECKSUM INCORRECT]
>
> My backup machine says this (no checksum errors):
>
> 16:27:24.035730 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:24.082279 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694
> 16:27:25.035783 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:25.074332 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694
> 16:27:26.039835 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:26.074381 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694
> 16:27:27.039904 10.250.250.28 - 10.250.250.27 UDP Source port: 32779 Destination port: 694
> 16:27:27.074430 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694
> 16:27:27.074680 10.250.250.27 - 10.250.250.28 UDP Source port: 32776 Destination port: 694
>
> When backup becomes master, this situation does not change, so it is not dependent on master/backup status. Master and backup are identical hardware; both use openSuSE 10.2, Intel Primergy RX300 and Dell 82541GI/PI Gigabit Ethernet controllers. I tried using ucast and bcast heartbeats; both show checksum errors from the master machine.
>
> Is this something to worry about and what can be done about it?

That's an awful lot of bad packets, it seems to me. My best guess is that it's a hardware problem. It's difficult to say which side is at fault, except by trial and error. In theory it could also be a poor quality crossover cable. Gigabit needs cat5E or preferably cat6. I'd suspect active components first - unless your crossover cable is long, or your environment is electrically noisy.

Of course, since I work for IBM, I could think of ways to solve your problem with IBM hardware ;-)

--
Alan Robertson [EMAIL PROTECTED]

Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions. - William Wilberforce

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
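To help narrow down which side (if either) is at fault, the NIC's own error counters and offload settings can be compared on both machines. A rough checklist follows; the interface name eth1 is a guess and should be replaced with whichever interface carries the heartbeat link:

  # link-level error counters kept by the driver (support varies by driver)
  ethtool -S eth1 | egrep -i 'err|crc|drop'

  # kernel-level RX/TX error totals for the interface
  ifconfig eth1 | grep -i errors

  # show whether TX checksum offloading is enabled; when it is, a capture
  # taken on the sending host often reports outgoing UDP checksums as
  # incorrect simply because the NIC fills the checksum in after the
  # capture point, so this is worth checking before blaming hardware
  ethtool -k eth1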
Re: [Linux-HA] OCF_RESKEY_interval
On 2007-04-12T08:58:15, Andrew Beekhof [EMAIL PROTECTED] wrote:

>>> So I thought that probe is maybe never unset correctly
>>
>> http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=1479
>
> lmb - not the same thing. the resource is not deleted in between the two types of monitor calls. this is the normal preservation of op args

CRM always provides the full set of attributes, no? So maybe that logic in the LRM should go.

(And, if I update the attributes in the CIB, does that really cause the resource to be _deleted_ between the stop/start? In that case the attributes would still be around ...)

Sincerely,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Memory Leaks
Hi All,

We are using Linux-HA to achieve an HA solution for a 2-node system. We have observed memory leaks in the crmd process and have seen it grow to beyond 700MB. I am currently running 2.0.8 with a patch that fixes a few memory leak issues in crmd.

I am trying to use valgrind and efence to nail down more memory leaks. To this effect I compiled heartbeat with --enable-crm-force-malloc, and now the system does not come up. Looking at the error log, it seems that there is an inconsistency between the allocation and free functions: even with this flag set, I see that while reading the cib, memory allocated with crm_malloc0 is freed using cl_free, which fails in the GUARD_IS_OK macro.

My questions are:
1. Does --enable-crm-force-malloc work?
2. How can I use valgrind effectively? crmd is started by heartbeat, so how do I make it work with valgrind?
3. Will using efence cause any problem? I see that the code has varied allocation logic implemented.

In case this is not the place for this, should I post it to the dev list?

Any help with the above is much appreciated.

Thanks,
-Hariharan

___
Linux-HA mailing list [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
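On question 2, one workaround people use (not an official heartbeat feature) is to temporarily replace the crmd binary with a small wrapper that execs the real binary under valgrind, so heartbeat still believes it is starting crmd directly. A sketch follows; the paths are assumptions (crmd often lives under /usr/lib/heartbeat/) and must be adjusted to the local install:

  #!/bin/sh
  # hypothetical wrapper: first move the real daemon aside, e.g.
  #   mv /usr/lib/heartbeat/crmd /usr/lib/heartbeat/crmd.real
  # then install this script as /usr/lib/heartbeat/crmd with the same
  # owner and mode as the original, so heartbeat launches it instead.
  exec /usr/bin/valgrind --tool=memcheck --leak-check=full \
      --log-file=/var/lib/heartbeat/valgrind-crmd \
      /usr/lib/heartbeat/crmd.real "$@"

Remember to restore the original binary afterwards; running crmd under valgrind slows it down considerably, which can trip operation timeouts in a live cluster.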