Re: [Linux-HA] How to debug corosync?
On 4/29/2011 at 03:36 AM, "Stallmann, Andreas" wrote:
> Hi!
>
> In one of my clusters I disconnect one of the nodes (say app01) from the
> network. App02 takes over the resources as it should. Nice.
>
> When I reconnect app01 to the network, crm_mon on app01 continues to report
> app02 as "offline" and crm_mon on app02 does the same for app01. Still, no
> errors are reported for TOTEM in the logs, and corosync-cfgtool -s reports
> both rings as "active with no faults".
>
> When sniffing for multicast packets, I see packets originating from app01
> but not from app02.

Just on a punt... there's not a (partial) firewall running on app02, is there?

Regards,

Tim

> Pinging the nodes (using IPs or names) works for all interfaces.
>
> I'm at a loss. Any ideas? How can I debug what's happening between the two
> nodes? And how can I bring an "offline" node online again without rebooting
> or restarting corosync?
>
> Thanks in advance,
>
> Andreas - breaking any record in this mailing list in asking questions...
>
> PS: corosync.conf below:
>
> compatibility: whitetank
>
> aisexec {
>         user: root
>         group: root
> }
>
> service {
>         ver: 0
>         name: pacemaker
>         use_mgmtd: yes
>         use_logd: yes
> }
>
> totem {
>         version: 2
>         token: 5000
>         token_retransmits_before_loss_const: 10
>         join: 60
>         consensus: 6000
>         vsftype: none
>         max_messages: 20
>         clear_node_high_bit: yes
>         secauth: off
>         threads: 0
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.10.10.0
>                 mcastaddr: 239.192.200.51
>                 mcastport: 5405
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.1.0
>                 mcastaddr: 239.192.200.52
>                 mcastport: 5405
>         }
>         rrp_mode: active
> }
>
> logging {
>         fileline: off
>         to_stderr: no
>         to_logfile: no
>         to_syslog: yes
>         syslog_facility: daemon
>         debug: off
>         timestamp: off
> }
>
> amf {
>         mode: disabled
> }
>
> CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
> Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
> Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke Höfer

--
Tim Serong
Senior Clustering Engineer, OPS Engineering, Novell Inc.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
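Since app02's missing multicast traffic is the symptom, one quick check is whether each node can send and receive on the cluster's multicast group at all. Below is a minimal self-test sketch (not part of corosync) using the ring0 group and port from the corosync.conf above; it loops a probe packet over the loopback interface, so it only proves the local IP stack handles multicast, not the path between the nodes or any firewall in between:

```python
import socket

GRP, PORT = "239.192.200.51", 5405   # ring0 mcastaddr/mcastport from corosync.conf
LO = "127.0.0.1"

# Receiver: bind the multicast port and join the group on loopback
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", PORT))
rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
              socket.inet_aton(GRP) + socket.inet_aton(LO))
rx.settimeout(5)

# Sender: emit one probe datagram to the group via loopback
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(LO))
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
tx.sendto(b"probe", (GRP, PORT))

data, addr = rx.recvfrom(64)
print(data.decode())
```

If this works on both nodes but tcpdump still shows no multicast from app02, the problem is between the hosts (switch IGMP snooping, or the firewall Tim suspects), not in app02's stack.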
Re: [Linux-HA] Auto Failback despite location constraint
On 4/29/2011 at 03:36 AM, "Stallmann, Andreas" wrote:
> Hi!
>
> I configured my nodes *not* to auto failback after a defective node comes
> back online. This worked nicely for a while, but now it doesn't (and,
> honestly, I do not know what was changed in the meantime).
>
> What we do: We disconnect the two (virtual) interfaces of our node mgmt01
> (running on VMware ESXi) by means of the vSphere client. Node mgmt02 takes
> over the services as it should. When node mgmt01's interfaces are switched
> on again, everything looks alright for a minute or two, but then mgmt01
> takes over the resources again. Which it should not. Here's the relevant
> snippet of the configuration (full config below):
>
> location nag_loc nag_grp 100: ipfuie-mgmt01
> property default-resource-stickiness="100"
>
> I thought that, because the resource-stickiness has the same value as the
> location constraint, the resources would stick to the node they are started
> on. Am I wrong?

If the resource ends up on the non-preferred node, those settings will cause
it to have an equal score on both nodes, so it should stay put. If you want
to verify, try "ptest -Ls" to see what scores each resource has.

Anyway, the problem is this constraint:

location cli-prefer-nag_grp nag_grp \
        rule $id="cli-prefer-rule-nag_grp" inf: #uname eq ipfuie-mgmt01 and #uname eq ipfuie-mgmt01

Because that constraint has a score of "inf", it'll take precedence. Probably
"crm resource move nag_grp ipfuie-mgmt01" was run at some point, to forcibly
move the resource to ipfuie-mgmt01. That constraint will persist until you
run "crm resource unmove nag_grp".

Kind of weird that the hostname is listed twice in that rule, though...

Regards,

Tim

--
Tim Serong
Senior Clustering Engineer, OPS Engineering, Novell Inc.
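Tim's "equal score" reasoning can be sketched numerically. The following is a toy model of the scores that "ptest -Ls" reports, not pacemaker's actual allocation code; node_score is a hypothetical helper:

```python
INF = float("inf")

def node_score(node, running_on, location_scores, stickiness):
    """Toy model: a resource's score on a node is the sum of the location
    constraint scores naming that node, plus stickiness if the resource is
    already running there."""
    score = sum(s for n, s in location_scores if n == node)
    if node == running_on:
        score += stickiness
    return score

# With only nag_loc (100 on mgmt01) and the group running on mgmt02
# after failover, both nodes score 100, so nothing moves:
locs = [("ipfuie-mgmt01", 100)]
print(node_score("ipfuie-mgmt01", "ipfuie-mgmt02", locs, 100))  # 100
print(node_score("ipfuie-mgmt02", "ipfuie-mgmt02", locs, 100))  # 100

# Add the leftover cli-prefer constraint (score inf on mgmt01) and
# mgmt01 wins unconditionally, which is exactly the observed failback:
locs.append(("ipfuie-mgmt01", INF))
print(node_score("ipfuie-mgmt01", "ipfuie-mgmt02", locs, 100))  # inf
```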
[Linux-HA] WARN Gmain_timeout_dispatch
More interestingly, drbd does not generate any log entries even though
everything looks normal. The network cards are connected with a Broadcom
gigabit crossover cable. These log messages are generated once in a while.

I will attempt to set the network-related parameters in sysctl.conf below;
what do you think?

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
[Linux-HA] WARN: Gmain_timeout_dispatch Log
Hello,

I am using drbd (two primaries) + heartbeat (auto_failback on), version
3.0.3-2 heartbeat. Server1 has more hosts connected to it and presents the
following log. I changed the values in /etc/ha.d/ha.cf as below, but the
problem continues:

keepalive 4
deadtime 20
warntime 15

root@inga:~# tail -f /var/log/ha-log
Apr 27 07:37:55 inga heartbeat: [8495]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 100 ms (> 50 ms) (GSource: 0x74e350)
Apr 27 13:11:43 inga heartbeat: [8495]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 60 ms (> 50 ms) (GSource: 0x74e350)
Apr 27 13:12:02 inga heartbeat: [8495]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 70 ms (> 50 ms) (GSource: 0x74bac0)
Apr 27 13:12:03 inga heartbeat: [8495]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 60 ms (> 50 ms) (GSource: 0x74bac0)

This log worries me: a few days ago these messages appeared and the server
was eventually declared dead.

Thanks
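When a node gets spuriously declared dead, a first sanity check is whether the ha.cf timers relate to each other sensibly (deadtime comfortably above keepalive, warntime between them). The sketch below encodes common rules of thumb, not values heartbeat itself enforces, and check_timers is a hypothetical helper:

```python
def check_timers(keepalive, warntime, deadtime):
    """Flag ha.cf timer combinations (all in seconds) that commonly
    contribute to false 'node dead' declarations."""
    problems = []
    if deadtime < 2 * keepalive:
        problems.append("deadtime should be at least twice keepalive")
    if not (keepalive < warntime < deadtime):
        problems.append("warntime should sit between keepalive and deadtime")
    return problems

# The values from the post: keepalive 4, warntime 15, deadtime 20
print(check_timers(4, 15, 20))  # [] -> the timers themselves look sane
```

Since these timers pass, the sub-second dispatch warnings point at scheduling or I/O latency on the host rather than at the ha.cf values.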
[Linux-HA] Auto Failback despite location constraint
Hi!

I configured my nodes *not* to auto failback after a defective node comes
back online. This worked nicely for a while, but now it doesn't (and,
honestly, I do not know what was changed in the meantime).

What we do: We disconnect the two (virtual) interfaces of our node mgmt01
(running on VMware ESXi) by means of the vSphere client. Node mgmt02 takes
over the services as it should. When node mgmt01's interfaces are switched on
again, everything looks alright for a minute or two, but then mgmt01 takes
over the resources again. Which it should not. Here's the relevant snippet of
the configuration (full config below):

location nag_loc nag_grp 100: ipfuie-mgmt01
property default-resource-stickiness="100"

I thought that, because the resource-stickiness has the same value as the
location constraint, the resources would stick to the node they are started
on. Am I wrong? Is there any other way to let resources by default start on
mgmt01 (make mgmt01 the default preferred node), but not allow resources to
migrate back after the cluster is complete again after a split brain?
Thanks for your input,

Andreas

PS: Full config below:

node ipfuie-mgmt01
node ipfuie-mgmt02
primitive ajaxterm lsb:ajaxterm \
        op monitor interval="15s" \
        op start interval="0" timeout="30s" \
        op stop interval="0" timeout="30s"
primitive drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive fs_r0 ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/drbd" fstype="ext4" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive nagios_res lsb:nagios \
        op monitor interval="1min" \
        op start interval="0" timeout="1min" \
        op stop interval="0" timeout="1min"
primitive pingy_res ocf:pacemaker:ping \
        params dampen="5s" multiplier="1000" host_list="10.10.10.205 10.10.10.206 10.10.10.254" \
        op monitor interval="60s" timeout="60s" \
        op start interval="0" timeout="60s"
primitive sharedIP ocf:heartbeat:IPaddr2 \
        params ip="10.10.10.204" cidr_netmask="255.255.252.0" nic="eth0:0"
primitive web_res ocf:heartbeat:apache \
        params configfile="/etc/apache2/httpd.conf" \
        params httpd="/usr/sbin/httpd2-prefork" \
        params testregex="body" statusurl="http://localhost/server-status" \
        op start interval="0" timeout="40s" \
        op stop interval="0" timeout="60s" \
        op monitor interval="1min"
group nag_grp fs_r0 sharedIP web_res nagios_res ajaxterm \
        meta target-role="Started"
ms ms_drbd_r0 drbd_r0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
clone pingy_clone pingy_res \
        meta target-role="Started"
location cli-prefer-nag_grp nag_grp \
        rule $id="cli-prefer-rule-nag_grp" inf: #uname eq ipfuie-mgmt01 and #uname eq ipfuie-mgmt01
location nag_loc nag_grp 100: ipfuie-mgmt01
location only-if-connected nag_grp \
        rule $id="only-if-connected-rule" -inf: not_defined pingd or pingd lte 1500
colocation nag_grp-only-on-master inf: nag_grp ms_drbd_r0:Master
order apache-after-ip inf: sharedIP web_res
order nag_grp-after-drbd inf: ms_drbd_r0:promote nag_grp:start
order nagios-after-apache inf: web_res nagios_res
property $id="cib-bootstrap-options" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        stonith-action="poweroff" \
        default-resource-stickiness="100" \
        dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stop-all-resources="false" \
        last-lrm-refresh="1303825164"

CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke Höfer
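The only-if-connected rule in the config above bans nag_grp from a node once connectivity drops: with multiplier="1000" and three ping targets, the pingd attribute is 1000 per reachable host, and the -inf rule fires when pingd is undefined or lte 1500. A toy evaluation of that rule (hypothetical helper names, not pacemaker's rule engine):

```python
def banned_by_rule(pingd):
    """The -inf rule: 'not_defined pingd or pingd lte 1500'."""
    return pingd is None or pingd <= 1500

def pingd_value(reachable_hosts, multiplier=1000):
    """pingd attribute as ocf:pacemaker:ping computes it."""
    return reachable_hosts * multiplier

for hosts in (3, 2, 1, 0):
    v = pingd_value(hosts)
    print(hosts, v, "banned" if banned_by_rule(v) else "allowed")
```

So the group is allowed to run as long as at least two of the three ping targets are reachable, which is presumably the intent of the 1500 threshold.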
[Linux-HA] How to debug corosync?
Hi!

In one of my clusters I disconnect one of the nodes (say app01) from the
network. App02 takes over the resources as it should. Nice.

When I reconnect app01 to the network, crm_mon on app01 continues to report
app02 as "offline" and crm_mon on app02 does the same for app01. Still, no
errors are reported for TOTEM in the logs, and corosync-cfgtool -s reports
both rings as "active with no faults".

When sniffing for multicast packets, I see packets originating from app01 but
not from app02. Pinging the nodes (using IPs or names) works for all
interfaces.

I'm at a loss. Any ideas? How can I debug what's happening between the two
nodes? And how can I bring an "offline" node online again without rebooting
or restarting corosync?

Thanks in advance,

Andreas - breaking any record in this mailing list in asking questions...

PS: corosync.conf below:

compatibility: whitetank

aisexec {
        user: root
        group: root
}

service {
        ver: 0
        name: pacemaker
        use_mgmtd: yes
        use_logd: yes
}

totem {
        version: 2
        token: 5000
        token_retransmits_before_loss_const: 10
        join: 60
        consensus: 6000
        vsftype: none
        max_messages: 20
        clear_node_high_bit: yes
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.10.0
                mcastaddr: 239.192.200.51
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastaddr: 239.192.200.52
                mcastport: 5405
        }
        rrp_mode: active
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: off
}

amf {
        mode: disabled
}

CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke Höfer
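As a side note on the totem values above: with token: 5000 and consensus: 6000, a rough back-of-envelope for how long corosync takes to declare a node gone is the token timeout plus the consensus phase. This is a simplification (real membership timing also involves token retransmits and the join timeout), offered only to make the configured values concrete:

```python
def membership_change_ms(token_ms, consensus_ms):
    """Rough lower bound on corosync failure-detection latency:
    a lost token must time out, then the remaining nodes must agree
    on the new membership."""
    return token_ms + consensus_ms

print(membership_change_ms(5000, 6000))  # ~11000 ms with the config above
```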
Re: [Linux-HA] Pingd does not react as expected => split brain
On Wed, Apr 27, 2011 at 7:18 PM, Stallmann, Andreas wrote:
> Hi Andrew,
>
>> According to your configuration, it can be up to 60s before we'll detect a
>> change in external connectivity. That's plenty of time for the cluster to
>> start resources. Maybe shortening the monitor interval will help you.
>
> TNX for the suggestion, I'll try that. Any suggestions on recommended
> monitor intervals for pingd?
>
>> Couldn't hurt.
>
> Hm... if I, for example, set the monitor interval to 10s, I'd have to
> adjust the timeout for monitor to 10s as well, right?

Right.

> Ping is quite sluggish, it takes up to 30s to check the three nodes.

Sounds like something is misconfigured.

> If I now adjust the interval to 10s, the next check might be triggered
> before the last one is complete. Will this confuse pacemaker?

No. The next op will happen 10s after the last finishes.

>>> Yes, and there is no proper way to use DRBD in a three node cluster.
>> How is one related to the other? No-one said the third node had to run
>> anything.
>
> Ok, thanks for the info; I thought all members of the cluster had to be
> able to run cluster resources. I would have to keep resources from trying
> to run on the third node then via a location constraint, right?

Or node standby.

> TNX for your input!
>
> Andreas
>
> CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
> Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
> Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke
> Höfer
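Andrew's "up to 60s" figure can be made concrete. With the ping resource configured as interval="60s", timeout="60s" and dampen="5s" (the values from the other thread's config), a worst-case model for how long a connectivity loss can go unnoticed is roughly interval + timeout + dampen. This is a back-of-envelope sketch, not pacemaker's exact scheduling:

```python
def worst_case_detection_s(interval, timeout, dampen):
    """Connectivity can drop just after a monitor starts: wait for the
    next monitor (interval), let it run to its timeout, then wait out
    the dampening window before the pingd attribute changes."""
    return interval + timeout + dampen

print(worst_case_detection_s(60, 60, 5))   # current config: up to 125 s
print(worst_case_detection_s(10, 10, 5))   # 10s interval/timeout: 25 s
```

This is why shortening the monitor interval (and its timeout with it) tightens split-brain detection so much.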