Re: [Linux-HA] How to debug corosync?

2011-04-28 Thread Tim Serong
On 4/29/2011 at 03:36 AM, "Stallmann, Andreas"  wrote: 
> Hi! 
>  
> In one of my clusters I disconnect one of the nodes (say app01) from the 
> network. App02 takes over the resources as it should. Nice. 
> When I reconnect app01 to the network, crm_mon on app01 continues to report  
> app02 as "offline" and crm_mon on app02 does the same for app01. Still, no  
> errors are reported for TOTEM in the logs, and corosync-cfgtool -s reports both 
> rings as "active with no faults". 
>  
> When sniffing for multicast packets, I see packets originating from app01 but 
> not from app02. 

Just on a punt...  There's not a (partial) firewall running on app02 is there?
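
A quick way to check (just a sketch on my side; iptables, eth0 and the ring 0
port taken from the config quoted below are assumptions) would be:

# on app02: look for filter rules that could drop the totem/UDP traffic
iptables -nvL

# on both nodes: watch for totem multicast on the ring 0 port (eth0 assumed)
tcpdump -ni eth0 udp port 5405

# re-check ring status once app01 is plugged back in
corosync-cfgtool -s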

Regards,

Tim

> Pinging the nodes (using IPs or names) works for all interfaces. 
>  
> I'm at a loss. Any ideas? How can I debug what's happening between the two  
> nodes? And how can I bring an "offline" node online again without rebooting  
> or restarting corosync? 
>  
> Thanks in advance, 
>  
> Andreas - breaking every record on this mailing list for asking questions... 
> PS: corosync.conf below: 
>  
> compatibility: whitetank 
> aisexec { 
> user:   root 
> group:  root 
> } 
> service { 
> ver:0 
> name:   pacemaker 
> use_mgmtd:  yes 
> use_logd:   yes 
> } 
> totem { 
> version:2 
> token:  5000 
> token_retransmits_before_loss_const: 10 
> join:   60 
> consensus:  6000 
> vsftype:none 
> max_messages:   20 
> clear_node_high_bit: yes 
> secauth:off 
> threads:0 
> interface { 
> ringnumber: 0 
> bindnetaddr:10.10.10.0 
> mcastaddr:  239.192.200.51 
> mcastport:  5405 
> } 
> interface { 
> ringnumber: 1 
> bindnetaddr:192.168.1.0 
> mcastaddr:  239.192.200.52 
> mcastport:  5405 
> } 
> rrp_mode:   active 
> } 
> logging { 
> fileline:   off 
> to_stderr:  no 
> to_logfile: no 
> to_syslog:  yes 
> syslog_facility: daemon 
> debug:  off 
> timestamp:  off 
> } 
> amf { 
> mode: disabled 
> } 




-- 
Tim Serong 
Senior Clustering Engineer, OPS Engineering, Novell Inc.



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Auto Failback despite location constraint

2011-04-28 Thread Tim Serong
On 4/29/2011 at 03:36 AM, "Stallmann, Andreas"  wrote: 
> Hi! 
>  
> I configured my nodes *not* to auto failback after a defective node comes  
> back online. This worked nicely for a while, but now it doesn't (and,  
> honestly, I do not know what was changed in the meantime). 
>  
> What we do: We disconnect the two (virtual) interfaces of our node mgmt01  
> (running on vmware esxi) by means of the vsphere client. Node mgmt02 takes  
> over the services as it should. When node mgmt01's interfaces are switched on 
> again, everything looks alright for a minute or two, but then mgmt01 takes 
> over the resources again. Which it should not. Here's the relevant snippet of 
> the configuration (full config below): 
>  
> location nag_loc nag_grp 100: ipfuie-mgmt01 
> property default-resource-stickiness="100" 
>  
> I thought that because the resource-stickiness has the same value as the 
> location constraint, the resources would stick to the node they are started 
> on. Am I wrong? 

If the resource ends up on the non-preferred node, those settings will
cause it to have an equal score on both nodes, so it should stay put.
If you want to verify, try "ptest -Ls" to see what scores each resource
has.

Anyway, the problem is this constraint:

location cli-prefer-nag_grp nag_grp \
rule $id="cli-prefer-rule-nag_grp" inf: #uname eq ipfuie-mgmt01 and #uname eq ipfuie-mgmt01

Because that constraint has a score of "inf", it'll take precedence.
Probably "crm resource move nag_grp ipfuie-mgmt01" was run at some point,
to forcibly move the resource to ipfuie-mgmt01.  That constraint will
persist until you run "crm resource unmove nag_grp".
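
For example (a sketch, assuming the crm shell and ptest that ship with your
Pacemaker version):

# see the constraint that "crm resource move" left behind
crm configure show | grep -A1 cli-prefer

# drop it again
crm resource unmove nag_grp

# then check the resulting allocation scores
ptest -Ls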

Kind of weird that the hostname is listed twice in that rule though...

Regards,

Tim


-- 
Tim Serong 
Senior Clustering Engineer, OPS Engineering, Novell Inc.



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] WARN Gmain_timeout_dispatch

2011-04-28 Thread gilmarlinux


What is more interesting is that DRBD does not generate any log entries, even though 
everything appears normal. The network cards are connected with a Gigabit Broadcom 
crossover cable. These logs are generated only once in a while. I will attempt to set 
the following network-related parameters in sysctl.conf; what do you think?

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
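
If you do try them, applying without a reboot would look something like this
(a sketch; it assumes the lines go into /etc/sysctl.conf):

# re-read /etc/sysctl.conf and apply the values
sysctl -p
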
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] WARN: Gmain_timeout_dispatch Log

2011-04-28 Thread gilmarlinux


Hello, I am using DRBD (dual primary) + heartbeat 3.0.3-2 (auto_failback on). 
Server1 has more hosts connected to it, and it presents the log below. I changed 
the values in /etc/ha.d/ha.cf as follows, but the problem continues:

keepalive 4
deadtime 20
warntime 15

root@inga:~# tail -f /var/log/ha-log
Apr 27 07:37:55 inga heartbeat: [8495]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 100 ms (> 50 ms) (GSource: 0x74e350)
Apr 27 13:11:43 inga heartbeat: [8495]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status took too long to execute: 60 ms (> 50 ms) (GSource: 0x74e350)
Apr 27 13:12:02 inga heartbeat: [8495]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 70 ms (> 50 ms) (GSource: 0x74bac0)
Apr 27 13:12:03 inga heartbeat: [8495]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 60 ms (> 50 ms) (GSource: 0x74bac0)

This log worries me; after a few more days of these messages, the server was 
eventually declared dead. Thanks
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Auto Failback despite location constraint

2011-04-28 Thread Stallmann, Andreas
Hi!

I configured my nodes *not* to auto failback after a defective node comes back 
online. This worked nicely for a while, but now it doesn't (and, honestly, I do 
not know what was changed in the meantime).

What we do: We disconnect the two (virtual) interfaces of our node mgmt01 
(running on vmware esxi) by means of the vsphere client. Node mgmt02 takes over 
the services as it should. When node mgmt01's interfaces are switched on again, 
everything looks alright for a minute or two, but then mgmt01 takes over the 
resources again. Which it should not. Here's the relevant snippet of the 
configuration (full config below):

location nag_loc nag_grp 100: ipfuie-mgmt01
property default-resource-stickiness="100"

I thought that because the resource-stickiness has the same value as the 
location constraint, the resources would stick to the node they are started on. 
Am I wrong?
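
My reasoning in numbers (assuming only the location score and the stickiness
count against nag_grp):

# after the failover, nag_grp is running on ipfuie-mgmt02
#   score on ipfuie-mgmt01: location nag_loc      = 100
#   score on ipfuie-mgmt02: resource-stickiness   = 100
# equal scores, so I expected the group to stay where it is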

Is there any other way to have resources start on mgmt01 by default (make mgmt01 
the preferred node), but not allow resources to migrate back once the cluster is 
complete again after a split brain?

Thanks for your input,

Andreas
PS: Full config below:

node ipfuie-mgmt01
node ipfuie-mgmt02
primitive ajaxterm lsb:ajaxterm \
op monitor interval="15s" \
op start interval="0" timeout="30s" \
op stop interval="0" timeout="30s"
primitive drbd_r0 ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="15s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive fs_r0 ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/drbd" fstype="ext4" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s"
primitive nagios_res lsb:nagios \
op monitor interval="1min" \
op start interval="0" timeout="1min" \
op stop interval="0" timeout="1min"
primitive pingy_res ocf:pacemaker:ping \
params dampen="5s" multiplier="1000" host_list="10.10.10.205 10.10.10.206 10.10.10.254" \
op monitor interval="60s" timeout="60s" \
op start interval="0" timeout="60s"
primitive sharedIP ocf:heartbeat:IPaddr2 \
params ip="10.10.10.204" cidr_netmask="255.255.252.0" nic="eth0:0"
primitive web_res ocf:heartbeat:apache \
params configfile="/etc/apache2/httpd.conf" \
params httpd="/usr/sbin/httpd2-prefork" \
params testregex="body" statusurl="http://localhost/server-status" \
op start interval="0" timeout="40s" \
op stop interval="0" timeout="60s" \
op monitor interval="1min"
group nag_grp fs_r0 sharedIP web_res nagios_res ajaxterm \
meta target-role="Started"
ms ms_drbd_r0 drbd_r0 \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
clone pingy_clone pingy_res \
meta target-role="Started"
location cli-prefer-nag_grp nag_grp \
rule $id="cli-prefer-rule-nag_grp" inf: #uname eq ipfuie-mgmt01 and #uname eq ipfuie-mgmt01
location nag_loc nag_grp 100: ipfuie-mgmt01
location only-if-connected nag_grp \
rule $id="only-if-connected-rule" -inf: not_defined pingd or pingd lte 1500
colocation nag_grp-only-on-master inf: nag_grp ms_drbd_r0:Master
order apache-after-ip inf: sharedIP web_res
order nag_grp-after-drbd inf: ms_drbd_r0:promote nag_grp:start
order nagios-after-apache inf: web_res nagios_res
property $id="cib-bootstrap-options" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
stonith-action="poweroff" \
default-resource-stickiness="100" \
dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stop-all-resources="false" \
last-lrm-refresh="1303825164"


CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
Höfer
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] How to debug corosync?

2011-04-28 Thread Stallmann, Andreas
Hi!

In one of my clusters I disconnect one of the nodes (say app01) from the 
network. App02 takes over the resources as it should. Nice.
When I reconnect app01 to the network, crm_mon on app01 continues to report 
app02 as "offline" and crm_mon on app02 does the same for app01. Still, no 
errors are reported for TOTEM in the logs, and corosync-cfgtool -s reports both 
rings as "active with no faults".

When sniffing for multicast packets, I see packets originating from app01 but 
not from app02.

Pinging the nodes (using IPs or names) works for all interfaces.

I'm at a loss. Any ideas? How can I debug what's happening between the two 
nodes? And how can I bring an "offline" node online again without rebooting or 
restarting corosync?

Thanks in advance,

Andreas - breaking every record on this mailing list for asking questions...
PS: corosync.conf below:

compatibility: whitetank
aisexec {
user:   root
group:  root
}
service {
ver:0
name:   pacemaker
use_mgmtd:  yes
use_logd:   yes
}
totem {
version:2
token:  5000
token_retransmits_before_loss_const: 10
join:   60
consensus:  6000
vsftype:none
max_messages:   20
clear_node_high_bit: yes
secauth:off
threads:0
interface {
ringnumber: 0
bindnetaddr:10.10.10.0
mcastaddr:  239.192.200.51
mcastport:  5405
}
interface {
ringnumber: 1
bindnetaddr:192.168.1.0
mcastaddr:  239.192.200.52
mcastport:  5405
}
rrp_mode:   active
}
logging {
fileline:   off
to_stderr:  no
to_logfile: no
to_syslog:  yes
syslog_facility: daemon
debug:  off
timestamp:  off
}
amf {
mode: disabled
}
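
One thing I might try next (a sketch only; the logfile path and eth0 are
assumptions, not part of the config above): turn on corosync's own debug
logging and watch the ring 0 multicast directly:

logging {
to_stderr:      no
to_logfile:     yes
# example path only
logfile:        /var/log/corosync.log
to_syslog:      yes
syslog_facility: daemon
debug:          on
timestamp:      on
}

# on each node, watch the ring 0 multicast traffic (eth0 is an assumption)
tcpdump -ni eth0 host 239.192.200.51 and udp port 5405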


CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
Geschäftsführer/Managing Directors: Jürgen Zender (Sprecher/Chairman), Anke 
Höfer
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd does not react as expected => split brain

2011-04-28 Thread Andrew Beekhof
On Wed, Apr 27, 2011 at 7:18 PM, Stallmann, Andreas  wrote:
> Hi Andrew,
>
>> According to your configuration, it can be up to 60s before we'll detect a 
>> change in external connectivity.
>> Thats plenty of time for the cluster to start resources.
>> Maybe shortening the monitor interval will help you.
>
> TNX for the suggestion, I'll try that. Any suggestions on recommended monitor 
> intervals for pingd?
>
>> Couldn't hurt.
> Hm... if I, for example, set the monitor interval to 10s, I'd have to adjust 
> the monitor timeout to 10s as well, right?

Right.

> Ping is quite sluggish; it takes up to 30s to check the three nodes.

Sounds like something is misconfigured.

> If I now adjust the interval to 10s, the next check might be triggered before 
> the last one is complete. Will this confuse pacemaker?

No. The next op will happen 10s after the last finishes.
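
For example, the pingy_res primitive posted elsewhere in this digest would then
look roughly like this (a sketch; the 10s figures are the ones under discussion,
not a recommendation):

primitive pingy_res ocf:pacemaker:ping \
        params dampen="5s" multiplier="1000" host_list="10.10.10.205 10.10.10.206 10.10.10.254" \
        op monitor interval="10s" timeout="10s" \
        op start interval="0" timeout="60s"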

>
>>> Yes, and there is no proper way to use DRBD in a three-node cluster.
>> How is one related to the other?
>> No-one said the third node had to run anything.
>
> Ok, thanks for the info; I thought all members of the cluster had to be able 
> to run cluster resources. I would then have to keep resources from trying to 
> run on the third node via a location constraint, right?

Or node standby.
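
For example (a sketch; the node name here is hypothetical, standing in for
whatever the third node is called):

# keep the third node in the membership, but ineligible to run resources
crm node standby ipfuie-mgmt03

# and later, if it should host resources again
crm node online ipfuie-mgmt03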

>
> TNX for your input!
>
> Andreas
>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems