Re: [Linux-HA] crm_verify cib.xml verification error

2007-04-10 Thread Andrew Beekhof

pretty sure i commented on this recently

i'll patch it today

On Apr 6, 2007, at 2:40 PM, Alan Robertson wrote:


kisalay wrote:

Hi,

I recently migrated from 2.0.7 to 2.0.8.
When I run my old (2.0.7) cib.xml through crm_verify now, I receive the
following warnings / errors:

element cib: validity error : Element cib content does not follow the DTD, expecting (configuration , status), got (configuration )
element cib: validity error : Element cib does not carry attribute num_updates
element cib: validity error : Element cib does not carry attribute epoch
element cib: validity error : Element cib does not carry attribute admin_epoch

crm_verify[25908]: 2007/04/06_13:18:35 ERROR: validate_with_dtd: CIB does not validate against /usr/lib/heartbeat/crm.dtd
crm_verify[25908]: 2007/04/06_13:18:35 ERROR: main: CIB did not pass DTD validation
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'no_quorum_policy' for cluster option 'no-quorum-policy'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'symmetric_cluster' for cluster option 'symmetric-cluster'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'stonith_enabled' for cluster option 'stonith-enabled'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'stonith_action' for cluster option 'stonith-action'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'default_resource_stickiness' for cluster option 'default-resource-stickiness'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'default_resource_failure_stickiness' for cluster option 'default-resource-failure-stickiness'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'is_managed_default' for cluster option 'is-managed-default'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'transition_idle_timeout' for cluster option 'cluster-delay'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'default_action_timeout' for cluster option 'default-action-timeout'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'stop_orphan_resources' for cluster option 'stop-orphan-resources'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'stop_orphan_actions' for cluster option 'stop-orphan-actions'
crm_verify[25908]: 2007/04/06_13:18:35 WARN: cluster_option: Using deprecated name 'remove_after_stop' for cluster option 'remove-after-stop'

Errors found during check: config not valid


I have removed the warnings about the deprecated names.
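For reference, one way the renames can be scripted (a hedged sketch: the option list comes from the warnings above, the sed pattern assumes the options appear as name="..." attributes, and the sample file here is hypothetical; transition_idle_timeout is a genuine rename, not just a re-spelling, so it is handled separately):

```shell
# Hypothetical excerpt of a 2.0.7-era cib.xml (for illustration only):
cat > cib.xml <<'EOF'
<nvpair id="o1" name="no_quorum_policy" value="stop"/>
<nvpair id="o2" name="transition_idle_timeout" value="60s"/>
EOF
cp cib.xml cib.xml.bak   # keep a backup before editing in place
# Most deprecated names just swap underscores for hyphens.
for opt in no_quorum_policy symmetric_cluster stonith_enabled stonith_action \
           default_resource_stickiness default_resource_failure_stickiness \
           is_managed_default default_action_timeout \
           stop_orphan_resources stop_orphan_actions remove_after_stop; do
    new=$(echo "$opt" | tr '_' '-')
    sed -i "s/name=\"$opt\"/name=\"$new\"/g" cib.xml
done
# transition_idle_timeout was renamed outright to cluster-delay.
sed -i 's/name="transition_idle_timeout"/name="cluster-delay"/g' cib.xml
cat cib.xml
```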

What I am really stuck on is the first error it throws. When I create a
status tag in cib.xml and then run it through crm_verify, I do not get the
error. But when I start heartbeat, it deletes the status section from
cib.xml while updating it, and the error from crm_verify reappears.

I was developing a tool to monitor for CIB corruption in my setup, but I am
stuck because of this error.

Please suggest how I can fix this error, or whether it is an error at all.

I am attaching my cib.xml for reference.
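A hedged sketch of the workaround described above: validate a copy of the CIB with an empty status element re-appended, since heartbeat strips it from the on-disk file (assumptions: one closing </cib> tag per file, crm_verify accepting a file via -x, and the sample CIB below is hypothetical).

```shell
# Hypothetical on-disk CIB as heartbeat leaves it (no status section):
cat > /tmp/cib-check.xml <<'EOF'
<cib><configuration/></cib>
EOF
# Re-insert an empty status element before the closing tag so the DTD's
# (configuration , status) content model is satisfied.
sed -i 's|</cib>|<status/></cib>|' /tmp/cib-check.xml
# Validate the patched copy, never the live file (guarded in case
# crm_verify is not installed on this machine):
if command -v crm_verify >/dev/null 2>&1; then
    crm_verify -x /tmp/cib-check.xml
fi
```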



Something is a little odd, I think - because the CIB I see on my disk
looks the same. At one point in the past, I thought it output a dummy
status section...

I don't know if Andrew will show up on the list today or not (it's a
holiday in many places).  He could answer your questions.

--
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2.0.7 Failover Behavior Question

2007-04-10 Thread Andrew Beekhof

On 3/29/07, Mohler, Eric (EMOHLER) <[EMAIL PROTECTED]> wrote:

Andrew,

Thanks for your reply. Please refer to <--'s below.

The resulting behavior is that the app only restarts on the same node,
never ping-pongs.


**


i assume "ON" and "OFF" refer to the resource state? <--YES YOU ARE
RIGHT

try:
  rsc_location(your_resource, BOX1, 1)
  rsc_location(your_resource, BOX2, 1)
  default_resource_failure_stickiness = 100
  default_resource_stickiness = 10

that should let it ping-pong (due to failures) between your nodes 200
times before we'll give up

<--- YES ping-pong in response to successive app failures is exactly
what I'm after. See mods below:


I CHANGED STICKINESS VALUES:

[nvpair XML stripped by the list archive]
I ADDED rsc_location CONSTRAINTS <-- DID I DO THIS RIGHT?


no, you only specified scores for one node


<--- QUESTION #1
IS IT OK TO APPLY rsc_location WITH rsc_colocation CONSTRAINTS?
<--- QUESTION #2
ANY IDEAS WHY I'M ONLY GETTING RESTART BEHAVIOR AND NOT
PING-PONG BEHAVIOR?  <--- QUESTION #3


[rsc_location constraint XML stripped by the list archive]


HERE'S THE WHOLE CIB.XML FILE:

[cib.xml contents stripped by the list archive]


Re: [Linux-HA] Heartbeat stop hangs

2007-04-10 Thread Andrew Beekhof

On 4/9/07, Kevin Jamieson <[EMAIL PROTECTED]> wrote:

kisalay wrote:

> I have a 2 node 2.0.8 Linux HA setup.
> I have observed that when a stop is issued as soon as the start
> returns, the stop hangs indefinitely, and the only way to stop heartbeat is
> to do a killall.


or wait for the really long timeout

this was fixed last week IIRC



I've noticed similar behaviour on heartbeat 2.0.7, though I haven't gotten
around to filing a bugzilla on it.

In the situation I've observed, it looks like a race between shutting
down of the heartbeat parent process and either the setpgid() or the
SIGTERM signal handler installation in a newly created child process
(the log indicates heartbeat is killing the crmd but the crmd appears to
never receive a SIGTERM).

Mar 15 21:52:22 main heartbeat: [4499]: info: Starting child client
"/usr/lib/heartbeat/crmd" (90,90)
Mar 15 21:52:22 main heartbeat: [4499]: info: killing
/usr/lib/heartbeat/crmd process group 4586 with signal 15
Mar 15 21:52:22 main heartbeat: [4586]: info: Starting
"/usr/lib/heartbeat/crmd" as uid 90 gid 90 (pid 4586)
Mar 15 21:52:22 main crmd: [4586]: info: init_start:main.c Starting crmd
Mar 15 21:52:22 main crmd: [4586]: info: G_main_add_SignalHandler: Added
signal handler for signal 15
Mar 15 21:52:22 main crmd: [4586]: info: G_main_add_TriggerHandler:
Added signal manual handler
Mar 15 21:52:22 main crmd: [4586]: info: G_main_add_SignalHandler: Added
signal handler for signal 17
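The window Kevin describes can be reproduced in miniature with a plain subshell (a toy sketch, not heartbeat's actual code): the parent signals the child before the child's TERM handler is installed, so the default disposition kills it and the handler never runs.

```shell
# The child sleeps *before* installing its trap, mimicking the gap
# between fork() and crmd's G_main_add_SignalHandler call.
( sleep 1; trap 'echo handler-ran; exit 0' TERM; sleep 30 ) &
pid=$!
kill -TERM "$pid"        # delivered during the pre-handler window
wait "$pid"
status=$?
echo "exit status: $status"   # 143 = 128+SIGTERM: default disposition won
```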

Kevin



Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Lars Marowsky-Bree
On 2007-04-05T17:46:54, Bernd Schubert <[EMAIL PROTECTED]> wrote:

> Ok, so we need to correct the docs again.
> 
> 
> Here we add a second monitor action, one that runs once per minute. The 
> interval is passed to the ResourceAgent as OCF_RESKEY_interval and is a 
> period in milliseconds. In theory one could check this value and perform more 
> (or less) superficial internal checks for the resource. (However there is a 
> much better way, see "Per Action Parameters" below.)
> 

Yes, that documentation is incorrect. Where did you find that?

However, it still gets passed in - just as OCF_RESKEY_CRM_meta_interval,
to show the distinction to an instance parameter.


> # on probe (== exclusive) always report process not running
> ql_log warn "OCF_RESKEY_interval = ${OCF_RESKEY_interval}"
> if [ -z "$OCF_RESKEY_interval" ] || [ "$OCF_RESKEY_interval" = 0 ]; then
>     ql_log warn "Returning ${OCF_NOT_RUNNING}"
>     return ${OCF_NOT_RUNNING}
> fi

Ugh. Even probe shouldn't always return "not running", but the actual
state. This seems like a weird work-around for an otherwise broken
monitor action, or am I missing something ...?



Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] Announce: IPAddr2 RA v1.30 alpha

2007-04-10 Thread Lars Marowsky-Bree
On 2007-04-09T20:03:03, Michael Schwartzkopff <[EMAIL PROTECTED]> wrote:

> Kernel crash: See https://bugzilla.novell.com/show_bug.cgi?id=238646
> Oops: I just noticed that you have been responsible for that bug since Jan
> 25th 2007.

Ah, that crash. No, it's not assigned to me, but I commented on it since
then ... I've pinged Jaroslav again. We were all very busy with SLE10
SP1, so apparently the 10.2 bug got less attention.


> @@ -143,6 +154,15 @@
>  
>  
>  
> +
> +
> +Enable load sharing via clusterip target of iptables. Be sure to have
> +iptables with clusterip target compiled in.
> +
> +Enable load sharing
> +
> +
> +

The old script tried to auto-detect that it was run as a clone and then
automatically enabled this, which I think is still preferable. If the
CRM_meta_clone{,_max} show up in the environment, it should switch into
this mode by itself. No need for an additional parameter.
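A minimal sketch of that auto-detection (assumption on my part: the CRM exports OCF_RESKEY_CRM_meta_clone_max into the agent's environment whenever the resource runs as a clone):

```shell
# No extra load_share parameter: the presence of the clone meta
# attribute alone switches the agent into CLUSTERIP mode.
if [ -n "${OCF_RESKEY_CRM_meta_clone_max:-}" ]; then
    LOAD_SHARE=1
else
    LOAD_SHARE=0
fi
echo "LOAD_SHARE=$LOAD_SHARE"
```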

> @@ -285,10 +286,17 @@
>   LVS_SUPPORT=1
>   fi
>   
> - IP_INC_GLOBAL=${OCF_RESKEY_incarnations_max_global:-1}
> - IP_INC_NO=${OCF_RESKEY_incarnation_no:-0}
> - IP_CIP_HASH="$OCF_RESKEY_clusterip_hash"
> - IP_CIP_MARK=${OCF_RESKEY_clusterip_mark:-1}
> + LOAD_SHARE=0
> + if [ x"${OCF_RESKEY_load_share}" = x"true" \
> +-o x"${OCF_RESKEY_load_share}" = x"on" \
> +-o x"${OCF_RESKEY_load_share}" = x"1" \
> + -o x"${OCF_RESKEY_load_share}" = x"yes" ]; then
> +LOAD_SHARE=1
> +fi
> +
> + IP_INC_GLOBAL=${OCF_RESKEY_CRM_meta_clone_max:-1}
> + IP_INC_NO=$((OCF_RESKEY_CRM_meta_clone+1))
> + IP_CIP_HASH="${OCF_RESKEY_clusterip_hash}"

Yes, turns out the instance parameters were renamed since ;-) Thanks for
catching this.

You sure about the "clone+1"? I thought hash buckets, too, were numbered
from 0 on.

> - if [ "$IP_INC_GLOBAL" -gt 1 ]; then
> + if [ "$IP_INC_GLOBAL" -gt 1 -a $LOAD_SHARE -gt 0 ]; then
> + ocf_log info "user wants load share"

I think this change can be skipped for the reasons given above.

> @@ -582,16 +596,7 @@
>   fi
>   
>   if [ -n "$IP_CIP" ] && [ $ip_status = "no" ]; then
> - $MODPROBE ip_tables
>   $MODPROBE ip_conntrack
> - $MODPROBE ipt_CLUSTERIP
> - $IPTABLES -A OUTPUT -s $BASEIP -o $NIC \
> - -m state --state NEW \
> - -j CONNMARK --set-mark $IP_CIP_MARK

You removed the connmark handling completely. Did this become obsolete
upstream?

> @@ -662,16 +667,15 @@
>   fi
>   echo "-$IP_INC_NO" >$IP_CIP_FILE
>   if [ "x$(cat $IP_CIP_FILE)" = "x" ]; then
> - # This was the last incarnation
> - $IPTABLES -D OUTPUT -s $CLUSTERIP -o $NIC \
> - -m state --state NEW \
> - -j CONNMARK --set-mark $IP_CIP_MARK
> + ocf_log info $BASEIP, $IP_CIP_HASH
>   $IPTABLES -D INPUT -d $BASEIP -i $NIC -j CLUSTERIP \
>   --new \
>   --clustermac $IF_MAC \
>   --total-nodes $IP_INC_GLOBAL \
>   --local-node $IP_INC_NO \
>   --hashmode $IP_CIP_HASH
> + $MODPROBE -r ipt_CLUSTERIP
> + $MODPROBE -r ip_conntrack

I'd advise against rmmod. Removing modules isn't entirely supported on
Linux.

> @@ -695,7 +699,9 @@
>   ip_init
>   # TODO: Implement more elaborate monitoring like checking for
>   # interface health maybe via a daemon like FailSafe etc...
> - case `ip_served $BASEIP` in
> +
> + local ip_status=`ip_served`
> + case $ip_status in

Good catch, ip_served indeed doesn't take a parameter (or rather,
ignores it silently). case `ip_served` would have done the job though,
no need for a local variable.

Hey, the changes are a lot smaller than I expected. Seems it at least
wasn't conceptually broken ;-) Thanks for the changes. If we can
clarify the changes above, we should be able to merge it asap.


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Bernd Schubert
On Tuesday 10 April 2007 10:15:03 Lars Marowsky-Bree wrote:
> However, it still gets passed in - just as OCF_RESKEY_CRM_meta_interval,
> to show the distinction to an instance parameter.
>
> > # on probe (== exclusive) always report process not running
> > ql_log warn "OCF_RESKEY_interval = ${OCF_RESKEY_interval}"
> > if [ -z "$OCF_RESKEY_interval" ] || [ "$OCF_RESKEY_interval" = 0 ]; then
> >     ql_log warn "Returning ${OCF_NOT_RUNNING}"
> >     return ${OCF_NOT_RUNNING}
> > fi
>
> Ugh. Even probe shouldn't always return "not running", but the actual
> state. This seems like a weird work-around for an otherwise broken
> monitor action, or am I missing something ...?

Well, once OCF_RESKEY_interval was set, it didn't return "not running", of
course. The variable was/is misused to tell heartbeat that a resource
started *before* the startup of the heartbeat resource groups shall only be
monitored.
So on startup of the resource group, it shall not be killed first, and it
shall not have an effect on the other members of the resource group.

If you know another way to achieve this, I would be glad to hear it.
And thanks a lot; for now we will use OCF_RESKEY_CRM_meta_interval, though I
guess it will also go away in a future version.

Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH


Re: [Linux-HA] Announce: IPAddr2 RA v1.30 alpha

2007-04-10 Thread Michael Schwartzkopff
Am Dienstag, 10. April 2007 10:59 schrieb Lars Marowsky-Bree:
> > @@ -143,6 +154,15 @@
> >  
> >  
> >
> > +
> > +
> > +Enable load sharing via clusterip target of iptables. Be sure to have
> > +iptables with clusterip target compiled in.
> > +
> > +Enable load sharing
> > +
> > +
> > +
>
> The old script tried to auto-detect that it was run as a clone and then
> automatically enabled this, which I think is still preferable. If the
> CRM_meta_clone{,_max} show up in the environment, it should switch into
> this mode by itself. No need for an additional parameter.

At the moment I would like to have that parameter since interop between LVS 
and CLUSTERIP is not tested at all. After these tests we can drop it.

>
> > @@ -285,10 +286,17 @@
> > LVS_SUPPORT=1
> > fi
> >
> > -   IP_INC_GLOBAL=${OCF_RESKEY_incarnations_max_global:-1}
> > -   IP_INC_NO=${OCF_RESKEY_incarnation_no:-0}
> > -   IP_CIP_HASH="$OCF_RESKEY_clusterip_hash"
> > -   IP_CIP_MARK=${OCF_RESKEY_clusterip_mark:-1}
> > +   LOAD_SHARE=0
> > +   if [ x"${OCF_RESKEY_load_share}" = x"true" \
> > +-o x"${OCF_RESKEY_load_share}" = x"on" \
> > +-o x"${OCF_RESKEY_load_share}" = x"1" \
> > +   -o x"${OCF_RESKEY_load_share}" = x"yes" ]; then
> > +LOAD_SHARE=1
> > +fi
> > +
> > +   IP_INC_GLOBAL=${OCF_RESKEY_CRM_meta_clone_max:-1}
> > +   IP_INC_NO=$((OCF_RESKEY_CRM_meta_clone+1))
> > +   IP_CIP_HASH="${OCF_RESKEY_clusterip_hash}"
>
> Yes, turns out the instance parameters were renamed since ;-) Thanks for
> catching this.
>
> You sure about the "clone+1"? I thought hash buckets, too, were numbered
> from 0 on.

Hash buckets are numbered from 1 on.

>
> > -   if [ "$IP_INC_GLOBAL" -gt 1 ]; then
> > +   if [ "$IP_INC_GLOBAL" -gt 1 -a $LOAD_SHARE -gt 0 ]; then
> > +   ocf_log info "user wants load share"
>
> I think this change can be skipped for the reasons given above.

I'd like to keep it; see the reasons above.

>
> > @@ -582,16 +596,7 @@
> > fi
> >
> > if [ -n "$IP_CIP" ] && [ $ip_status = "no" ]; then
> > -   $MODPROBE ip_tables
> > $MODPROBE ip_conntrack
> > -   $MODPROBE ipt_CLUSTERIP
> > -   $IPTABLES -A OUTPUT -s $BASEIP -o $NIC \
> > -   -m state --state NEW \
> > -   -j CONNMARK --set-mark $IP_CIP_MARK
>
> You removed the connmark upstream completely. Did this become obsolete
> upstream?
CLUSTERIP always did work without marking. If you mark the packets, all
hashes will be handled by only one node (hash=1).

>
> > @@ -662,16 +667,15 @@
> > fi
> > echo "-$IP_INC_NO" >$IP_CIP_FILE
> > if [ "x$(cat $IP_CIP_FILE)" = "x" ]; then
> > -   # This was the last incarnation
> > -   $IPTABLES -D OUTPUT -s $CLUSTERIP -o $NIC \
> > -   -m state --state NEW \
> > -   -j CONNMARK --set-mark $IP_CIP_MARK
> > +   ocf_log info $BASEIP, $IP_CIP_HASH
> > $IPTABLES -D INPUT -d $BASEIP -i $NIC -j CLUSTERIP \
> > --new \
> > --clustermac $IF_MAC \
> > --total-nodes $IP_INC_GLOBAL \
> > --local-node $IP_INC_NO \
> > --hashmode $IP_CIP_HASH
> > +   $MODPROBE -r ipt_CLUSTERIP
> > +   $MODPROBE -r ip_conntrack
>
> I'd advise against rmmod. Removing modules isn't entirely supported on
> Linux.

Due to the kernel crash I'd like to remove the module. Anyway, somebody who
wants to use CLUSTERIP will know what he is doing and will have a distro
where module removal is supported.

(...)

> Thanks for the changes. If we can
> clarify the changes above, we should be able to merge it asap.

Please test it extensively. I only tested it on my two virtual machines with
up to 12 clone instances. No LVS cross check, no backward compatibility.

At the moment I am working on a locking mechanism during resource start, so
the restriction to an ordered clone would not be necessary any more. But
this needs some more conceptual changes in ip_start and ip_stop.

-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Addresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: [EMAIL PROTECTED]
web: www.multinet.de

Sitz der Gesellschaft: 85630 Grasbrunn
Registergericht: Amtsgericht München HRB 114375
Geschäftsführer: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Bernd Schubert
On Tuesday 10 April 2007 10:15:03 Lars Marowsky-Bree wrote:
> On 2007-04-05T17:46:54, Bernd Schubert <[EMAIL PROTECTED]> wrote:
> > Ok, so we need to correct the doku again.
> >
> > 
> > Here we add a second monitor action, one that runs once per minute. The
> > interval is passed to the ResourceAgent as OCF_RESKEY_interval and is a
> > period in milliseconds. In theory one could check this value and perform
> > more (or less) superficial internal checks for the resource. (However
> > there is a much better way, see "Per Action Parameters" below.)
> > 
>
> Yes, that documentation is incorrect. Where did you find that?

*Was* incorrect, Alan already corrected it :)  Found it by googling for 
OCF_RESKEY_interval.

http://www.linux-ha.org/ClusterInformationBase/Actions

-- 
Bernd Schubert
Q-Leap Networks GmbH


Re: [Linux-HA] Announce: IPAddr2 RA v1.30 alpha

2007-04-10 Thread Lars Marowsky-Bree
On 2007-04-10T11:12:11, Michael Schwartzkopff <[EMAIL PROTECTED]> wrote:

> > The old script tried to auto-detect that it was run as a clone and then
> > automatically enabled this, which I think is still preferable. If the
> > CRM_meta_clone{,_max} show up in the environment, it should switch into
> > this mode by itself. No need for an additional parameter.
> At the moment I would like to have that parameter since interop between LVS 
> and CLUSTERIP is not tested at all. After these tests we can drop it.

One can't simply drop a parameter once introduced.

LVS vs clusterip is a good point, but that doesn't depend on this
setting, does it? As long as it's detected by the script as above.

> > You sure about the "clone+1"? I thought hash buckets, too, were numbered
> > from 0 on.
> Hash buckets are numbered from 1 on.

Oh, ok. Thanks. Got that one wrong then ;-)

> > I'd advise against rmmod. Removing modules isn't entirely supported
> > on Linux.
> 
> Due to kernel crash I'd like to remove the module. Anyway, somebody
> who likes to have CLUSTERIP will know what he does and will have a
> distro where module removal is supported.

I doubt it. Module removal is not well supported by the Linux kernel.
And kernel crashes are exactly why I'd prefer not to do that ;-)

> Please test it extensively. I only checked it my me two virtual machines with 
> up to 12 clone instances. No LVS cross check, no backward compatibility.

LVS doesn't work with the clusterip anyway, so that's not really a
problem.

> At the moment I am working on a locking mechanism during resource
> start. So the restriction of a ordered clone would not be nescessary
> any more. But this needs some more conceptual changes in ip_start and
> ip_stop.

I really don't think the RA is the right place to implement this
serialization; a per-node ordered setting in the CIB would be the
preferable way.



Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Lars Marowsky-Bree
On 2007-04-10T11:08:56, Bernd Schubert <[EMAIL PROTECTED]> wrote:

> > Ugh. Even probe shouldn't always return "not running", but the actual
> > state. This seems like a weird work-around for an otherwise broken
> > monitor action, or am I missing something ...?
> 
> Well, once OCF_RESKEY_interval was set, it didn't return "not running", of 
> course. The variable was/is mis-used to tell heartbeat that a resource 
> started *before* the startup of the heartbeat-resource-groups shall only be 
> monitored. 
> So on the startup of the resource group, it shall not be killed first and it 
> shall not have an effect on the other members of this resource group.

I'm missing the point here.

But when you return the proper status - running, failed, not running -,
heartbeat should do the "right thing" automatically when it finds the
resource active prior to heartbeat being (re-)started?


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] Announce: IPAddr2 RA v1.30 alpha

2007-04-10 Thread Michael Schwartzkopff
Am Dienstag, 10. April 2007 11:20 schrieb Lars Marowsky-Bree:
> > At the moment I would like to have that parameter since interop between
> > LVS and CLUSTERIP is not tested at all. After these tests we can drop it.
>
> One can't simply drop a parameter once introduced.

Why not? My script is still marked alpha and is not intended for production
use. Once it gets better, the variables will be fixed.

>
> LVS vs clusterip is a good point, but that doesn't depend on this
> setting, does it? As long as it's detected by the script as above.
>
> LVS doesn't work with the clusterip anyway, so that's not really a
> problem.

But I did not test what happens if the user configures both.

> > At the moment I am working on a locking mechanism during resource
> > start. So the restriction of a ordered clone would not be nescessary
> > any more. But this needs some more conceptual changes in ip_start and
> > ip_stop.
>
> I really don't think the RA is the right place to implement this
> serialization, but a per-node ordered setting in the CIB should be the
> preferable way.

Per-node ordering might not be the right way. Consider the following: four
clones on four machines. Machine 3 dies and the first one has to take over.
1) Resource 1 is shut down
2) Resource 1 is started
3) Resource 3 is started

Between 1) and 2) there is no connectivity for resource 1. This can take up
to several seconds, which is not acceptable in my opinion. It would be
better just to start resource 3 on node 1. Conceptual thoughts about this
are welcome!

Basically there is no big difference between per-node ordering and global
ordering. The difference: all resources on all nodes are unavailable for
several seconds.

-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Addresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: [EMAIL PROTECTED]
web: www.multinet.de

Sitz der Gesellschaft: 85630 Grasbrunn
Registergericht: Amtsgericht München HRB 114375
Geschäftsführer: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Peter Kruse

Hello Lars,

Lars Marowsky-Bree wrote:

I'm missing the point here.

But when you return the proper status - running, failed, not running -,
heartbeat should do the "right thing" automatically when it finds the
resource active prior to heartbeat being (re-)started?


The point is that we misuse heartbeat to monitor a process that may
already be running before heartbeat starts, like cron. We don't want
heartbeat to stop it when it's already running, therefore on probe we
return "not running".
We only need a way to distinguish between a probe action and a monitor
action, which we have done by checking OCF_RESKEY_interval.
The question now is: how can we find out whether heartbeat is executing a
"probe" rather than a "monitor"? And what mechanism can we use that
survives more than two releases?
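One convention matching Lars's earlier hint (assumptions: the CRM passes the recurring interval as OCF_RESKEY_CRM_meta_interval in milliseconds, and probes arrive with interval 0 or the variable unset):

```shell
# Distinguish the one-off startup probe from the recurring monitor.
interval="${OCF_RESKEY_CRM_meta_interval:-0}"
if [ "$interval" -eq 0 ]; then
    action=probe      # report actual status; don't blindly claim "not running"
else
    action=monitor    # full recurring health check
fi
echo "$action"
```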

Cheers,

Peter


Re: [Linux-HA] Error in compiling LinxuHA.

2007-04-10 Thread Athrun Zara

dear Alan Robertson,

Thank you very much for the fast reply, and sorry for my late one.

After following your suggestion and adding --disable-fatal-warnings,
the source compiles.

FYI, my configure command is:
env \
CFLAGS="-I/opt/include -Wl,--rpath=/opt/lib" \
LDFLAGS="-L/opt/lib"  \
./configure --prefix=/opt \
--enable-largefile\
--enable-pretty   \
--enable-thread-safe  \
--enable-mgmt \
--enable-lrm --disable-fatal-warnings

All of the prerequisite libraries were compiled with --prefix=/opt.

Once again, thank you for your help. :)


On 3/23/07, Alan Robertson <[EMAIL PROTECTED]> wrote:


Athrun Zara wrote:
> Dear Linux HA experts,
>
> I am trying to build Linux HA from scratch.
> but I have an error when compiling the package :
>
> Making all in libnet_util
> Compiling send_arp.c:
> [ERROR]
>  gcc -DHAVE_CONFIG_H -I. -I. -I../../linux-ha -I../../include
> -I../../include -I../../include -I../../linux-ha -I../../linux-ha
> -I../../libltdl -I../../libltdl -pthread
> -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include
> -I/usr/include/libxml2 -I/opt/include -Wl,--rpath=/opt/lib -Wall
> -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes
> -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings -Wcast-qual
> -Wcast-align -Wbad-function-cast -Winline -Wmissing-format-attribute
> -Wformat=2 -Wformat-security -Wformat-nonliteral -Wno-long-long
> -Wno-strict-aliasing -Werror -ggdb3 -funsigned-char -DVAR_RUN_D=""
> -DVAR_LIB_D="" -DHA_D="" -DHALIB="/opt/lib/heartbeat" -I/opt/include
> -Wl,--rpath=/opt/lib -Wall -Wmissing-prototypes -Wmissing-declarations
> -Wstrict-prototypes -Wdeclaration-after-statement
> In file included from /opt/include/libnet.h:124,
>   from send_arp.c:37:
>  /opt/include/./libnet/libnet-functions.h:1839: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1861: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1868: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1876: warning: function
> declaration isn't a prototype
>  /opt/include/./libnet/libnet-functions.h:1884: warning: function
> declaration isn't a prototype
> gmake[2]: *** [send_arp.o] Error 1
> gmake[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> --
> Linux : Centos 4.3 with recompiled kernel 2.6.19.1
> Linux HA : ver 2.0.8
> LibNet : ver 1.1.2.1
>
> Both LinuxHA and Libnet are configured with --prefix=/opt

void libnet_cq_destroy();

It ought to have a (void) instead of the ().

You can either patch those ()'s into (void)'s or you can configure
heartbeat with --disable-fatal-warnings.
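For the patching route, a hedged sketch (assumptions: libnet_cq_destroy is the declaration Alan quotes, the sample header below is hypothetical, and you would repeat the substitution for each line the compiler flags in the real /opt/include header):

```shell
# Sample of the offending declaration (illustrative stand-in for
# /opt/include/libnet/libnet-functions.h):
cat > /tmp/libnet-functions.h <<'EOF'
void libnet_cq_destroy();
EOF
# Turn the K&R-style empty parameter list into an explicit prototype so
# -Werror stops tripping over "function declaration isn't a prototype".
sed -i 's/libnet_cq_destroy()/libnet_cq_destroy(void)/' /tmp/libnet-functions.h
cat /tmp/libnet-functions.h   # prints: void libnet_cq_destroy(void);
```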

Hmmm... I'm not sure why we don't get that error (?).  Obviously you're
doing something differently ;-)  I don't think the CentOS people got
that error.


--
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


[Linux-HA] HA status (active / passive)

2007-04-10 Thread Mark Frasa
Hello,

I am playing with HA for some production servers.

And I would like to know the easiest way to tell whether the
local server is active or passive.

I guess an ifconfig check *could* do, but is there a builtin, or
something else?


Thanks a lot,
/Mark
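As far as I know there is no single builtin for an haresources-style setup, but since the active node is the one holding the cluster IP, a check like the following works (a hedged sketch: the address 192.168.1.210 is a placeholder for your own cluster IP):

```shell
# Active if this host currently carries the cluster IP alias.
CLUSTER_IP=192.168.1.210
if ip -o addr show 2>/dev/null | grep -q "inet $CLUSTER_IP/"; then
    echo active
else
    echo passive
fi
```

Heartbeat also ships a cl_status utility that may offer a cleaner answer where available.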



[Linux-HA] HA problems

2007-04-10 Thread Angelo Venera
Hi all,

I'm new to this list and to HA. I'm trying to build an HA active/passive
setup for these services:

amavisd clamd.amavisd dhcpd dovecot httpd mysqld named postfix smb spamassassin 
squid

On start, heartbeat runs these services and becomes primary. But when I run
nmap against my cluster IP I get this:

nmap -P0 ha

Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2007-04-10 11:28 CEST
Nmap finished: 1 IP address (0 hosts up) scanned in 0.224 seconds

It seems that the cluster IP is down, with no services, whereas if I run
nmap against the node IP:

nmap -P0 nodo1

Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2007-04-10 12:38 CEST
Interesting ports on nodo1 (192.168.1.200):
Not shown: 1678 closed ports
PORT  STATE SERVICE
22/tcpopen  ssh
1/tcp open  snet-sensor-mgmt

Nmap finished: 1 IP address (1 host up) scanned in 0.091 seconds

all services are down on the node IP, which is expected.

Here are my configuration files:

drbd.conf

#
# drbd.conf
#
resource r0
{
protocol C;
incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; 
halt -f";
startup
{
degr-wfc-timeout 120;# 2 minutes.
}
disk
{
on-io-error   detach;
}
net
{
}
syncer
{
rate 100M;
group 1;
al-extents 257;
}
on nodo1
{
device /dev/drbd0;
disk   /dev/hdc2;
address192.168.1.5:7788;
meta-disk  internal;
}
on nodo2
{
device/dev/drbd0;
disk  /dev/hda2;
address   192.168.1.4:7788;
meta-disk internal;
}
}


ha.cf

debugfile   /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive   2
deadtime30
bcast   eth1
auto_failback   off
nodenodo1 nodo2
crm no
debug   3

haresources

#
# haresources
#
nodo1 IPaddr::192.168.1.210/24/eth0 drbddisk::r0 
Filesystem::/dev/drbd0::/cluster::ext3 amavisd clamd.amavisd dhcpd dovecot 
httpd mysqld named postfix smb spamassassin squid

and my ifconfig is:

eth0  Link encap:Ethernet  HWaddr 00:E0:4C:39:65:E4  
  inet addr:192.168.1.200  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: fe80::2e0:4cff:fe39:65e4/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:182005 errors:0 dropped:0 overruns:0 frame:0
  TX packets:537446 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:36397105 (34.7 MiB)  TX bytes:438135340 (417.8 MiB)
  Interrupt:18 Base address:0x4000 

eth0:0Link encap:Ethernet  HWaddr 00:E0:4C:39:65:E4  
  inet addr:192.168.1.210  Bcast:192.168.1.255  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  Interrupt:18 Base address:0x4000 

eth1  Link encap:Ethernet  HWaddr 00:0E:2E:AE:DF:4F  
  inet addr:192.168.1.5  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: fe80::20e:2eff:feae:df4f/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:485293 errors:0 dropped:0 overruns:0 frame:0
  TX packets:148794 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:38443850 (36.6 MiB)  TX bytes:31554489 (30.0 MiB)
  Interrupt:19 Base address:0x6000 

loLink encap:Local Loopback  
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:20708 errors:0 dropped:0 overruns:0 frame:0
  TX packets:20708 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0 
  RX bytes:1011054 (987.3 KiB)  TX bytes:1011054 (987.3 KiB)


Why is this? And why are my services filtered?

Thanks, all.

P.S. Sorry for my bad English.


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Lars Marowsky-Bree
On 2007-04-10T12:02:30, Peter Kruse <[EMAIL PROTECTED]> wrote:

> >But when you return the proper status - running, failed, not running -,
> >heartbeat should do the "right thing" automatically when it finds the
> >resource active prior to heartbeat being (re-)started?
> The point is that we misuse heartbeat to monitor a process that
> may already be running before heartbeat starts, like cron.  And we
> don't want heartbeat to stop it when it's already running.  Therefore
> on probe we return not running.

But, it will NOT be stopped if it is running! 

Only if your configuration says that it shouldn't be active where it
is.

I wonder if your abstraction really is the best one ...? In case of
cron, maybe you would instead want a clone which means crond running
everywhere; or a specific cronjob moved in and out of /etc/cron.*/?

> The question now is, how can we find out if "probe" instead of "monitor"
> is what heartbeat executes?  And what mechanism can we use that survives
> more than two releases?

I remain unconvinced that we should give that to you ;-) You wish to
make use of an internal detail. The CRM_meta_* stuff isn't guaranteed to
remain perfectly stable.

(Although I think they're unlikely to change again.)
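As a hedged sketch of what such an agent could check — assuming the unstable internal detail above, namely that probes arrive with a CRM_meta interval of 0 exposed to the agent as OCF_RESKEY_CRM_meta_interval:

```shell
#!/bin/sh
monitor_action() {
    # Probes are assumed to run with interval 0; the variable name below is
    # the unstable CRM_meta_* detail discussed above, not a guaranteed API.
    if [ "${OCF_RESKEY_CRM_meta_interval:-0}" -eq 0 ]; then
        echo "probe"
    else
        echo "monitor"
    fi
}

OCF_RESKEY_CRM_meta_interval=0
monitor_action    # prints "probe"
OCF_RESKEY_CRM_meta_interval=20000
monitor_action    # prints "monitor"
```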

You can use per-operation specific parameters in the CIB as well. You
can define a special monitor op with interval="0"; the instance
parameters defined there will be passed to the startup probe.
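For illustration, such an op might look like this in the CIB — a hedged sketch only, with all ids and the "probe" parameter name illustrative rather than prescribed:

```xml
<!-- Sketch: a monitor op with interval="0". The instance parameters
     defined here are assumed to be passed to the startup probe.
     All ids and the "probe" nvpair are illustrative. -->
<op id="rsc1-monitor-0" name="monitor" interval="0" timeout="30s">
  <instance_attributes id="rsc1-monitor-0-attrs">
    <attributes>
      <nvpair id="rsc1-monitor-0-probe" name="probe" value="1"/>
    </attributes>
  </instance_attributes>
</op>
```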


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Annouce: IPAddr2 RA v1.30 alpha

2007-04-10 Thread Lars Marowsky-Bree
On 2007-04-10T11:56:59, Michael Schwartzkopff <[EMAIL PROTECTED]> wrote:

> > > At the moment I would like to have that parameter since interop between
> > > LVS and CLUSTERIP is not tested at all. After these tests we can drop it.
> > One can't simply drop a parameter once introduced.
> Why not? My script is still marked alpha and not intended for production use. 
> If it gets better, variables get fixed.

True, but that means I can't just pull your changes in ;-) For the final
version, this should probably be removed. It seems we agree on this
one.

> > LVS vs clusterip is a good point, but that doesn't depend on this
> > setting, does it? As long as it's detected by the script as above.
> >
> > LVS doesn't work with the clusterip anyway, so that's not really a
> > problem.
> But I did not test what happens if the user configures both.

Just reject it and error out then - lvs_support doesn't work in clone
mode, that is fine.

The load_share parameter doesn't convey any meaning beyond
configuring it as a clone (or not). Cloning an IPaddr2 resource only
makes sense as a clusterip target.

> Per-node ordering might not be the right way. Think about the following. Four 
> clones on four machines. Machine 3 dies and the first one has to take over. 
> 1) Resource 1 is shut down
> 2) Resource 1 is started
> 3) Resource 3 is started.
> 
> Between 1) and 2) there is no connectivity for resource 1.

Uhm. Of course. Machine 3 has just died. Of course there's no
connectivity until it is restarted.

> This might be up to several seconds, which is not acceptable in my
> opinion.

But there also won't be any IP rejects/denials being sent, as none of
the other nodes reply to that specific hash bucket yet. So, it'll look
like a brief delay only.

> It would be better just to start resource 3 on node 1.

That's what happens. If node3 crashed, we obviously don't stop the
resource - that's implied in the node hosting the resource being gone &
fenced.

> Basically there is no big difference between per-node ordering and global 
> ordering.

There's a big difference for your case, though. No concurrent
invocations for the same clone on the same node, so you don't need to do
any locking in your script.

I don't see how your suggestion improves availability.


Thanks,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] Can a RA know if a clone resource is ordered or interleave is true?

2007-04-10 Thread Alan Robertson
Lars Marowsky-Bree wrote:
> On 2007-04-05T08:46:40, Alan Robertson <[EMAIL PROTECTED]> wrote:
> 
>>> My only comment on this is that if having two copies of your resource
>>> agent running at once causes serious problems, you need to _strongly_
>>> consider re-writing you agent to have sufficient locking / atomicity. Or
>>> it will come back to bite you some day...
>> This is the job of the Heartbeat infrastructure - to ensure that this
>> never happens.
> 
> This is not true as such.
> 
> We ensure that the same resource instance never has more than 1
> concurrent execution going, but there can be several instances with the
> same type.
> 
> In short, we don't serialize on resource type+class, but on rsc id.

Thanks for making that clearer.  I thought that's what was being asked
about, but was a little sloppy in my answer.


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Heartbeat stop hangs

2007-04-10 Thread Alan Robertson
kisalay wrote:
> Hi,
> 
> I have a 2 node 2.0.8 Linux HA setup.
> I have observed that when stop is issued on my setup, as soon as the start
> returns, the stop hangs indefinitely, and the only way to stop heartbeat is
> to do killall.
> 
> I dug a little deeper into the problem.
> 
> First, the problem is sporadic. I wrote the following script to reproduce
> it:
>while [ true ]
>do
> /sbin/service heartbeat start
> /sbin/service heartbeat stop
>done
>I observed that after a random number of trials, I could reproduce it.
>Thereafter, once the first stop hangs, any number of stops will hang
> too.
> 
> Second,
>I attached gdb to the heartbeat ( master control process ) and tried
> to see which handler is called on SIGTERM on the setup on which stop had
> hung. I observed that there was no handler being called.
> 
> Third, I decided to see the sigmask of the heartbeat
> I did  `ps -ae -o pid,caught,ignored` on the heartbeat, both on my normally
> functioning setup and on my hung setup.
> On normally functioning setup i got:
> pid  caughtignored
> 2337 000180016a01 00301002
> 
> and on my  hung setup, i got:
> pidcaughtignored
> 29822 000180012a01 00325002
> 
> If we see the hex in binary, the hung-setup heartbeat has "ignored" signal
> 15 ( SIGTERM ), whereas the normal heartbeat has handled it.
> This is the reason why the stop hangs, because the SIGTERM sent to the
> hung-heartbeat is ignored.
> 
> This hints to me that if a stop is issued to heartbeat while it's still
> starting ( and registering the signal handlers ), there is a small time
> window where a SIGTERM issued within it can result in a
> heartbeat which has ignored SIGTERM, and thereby can never be
> subsequently brought down cleanly.
> 
> Please correct me if I am erring somewhere. Please also suggest any
> work-arounds to ensure that I issue stop only after heartbeat has
> installed
> signal-handlers properly

No.  It's not quite like that.  Heartbeat installs signal handlers LONG
before heartbeat start returns.  But, once you send it one signal, it
propagates it to its child processes and waits for them all to die.

As Andrew pointed out, there was a case in some of heartbeat's child
processes where they didn't die if heartbeat tried to kill _them_ too
early.  Should be fixed in 2.1.0 when it comes out.
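The SigIgn check described above can be scripted; a minimal sketch, assuming Linux's /proc/&lt;pid&gt;/status format (SIGTERM is signal 15, i.e. bit 14 of the hex mask):

```shell
#!/bin/sh
# Check whether a process ignores SIGTERM by decoding the SigIgn hex
# bitmask from /proc/<pid>/status (Linux-only). SIGTERM is signal 15,
# which corresponds to bit 14, i.e. 0x4000 = 16384.
sigterm_ignored() {
    mask=$(awk '/^SigIgn:/ {print $2}' "/proc/$1/status")
    if [ $(( 0x$mask & 16384 )) -ne 0 ]; then
        echo "pid $1: SIGTERM ignored"
    else
        echo "pid $1: SIGTERM caught or default"
    fi
}

# In the hung-setup case above you would pass the heartbeat master
# control process PID; here we just inspect this shell itself.
sigterm_ignored $$
```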

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Annouce: IPAddr2 RA v1.30 alpha

2007-04-10 Thread Michael Schwartzkopff
Am Dienstag, 10. April 2007 14:14 schrieb Lars Marowsky-Bree:
> > Per-node ordering might be not the right way. Think about the following.
> > Four clones on four machines. Machine 3 dies and the first one has to
> > take over. 1) Resource 1 is shut down
> > 2) Resource 1 is started
> > 3) Resource 3 is started.
> >
> > Between 1) and 2) there is no connectivity for resource 1.
>
> Uhm. Of course. Machine 3 has just died. Of course there's no
> connectivity until it is restarted.

Not really. It is not only resource 3 that is unavailable during failover, but 
ALSO all the other resources! That is the problem.

To make it clear in the above situation of failure of node 3:
1) resource 4 on node 4 is stopped
2) node 3 is failed
3) resource 2 on node 2 is stopped
4) resource 1 on node 1 is stopped
5) resource 1 on node 1 is started
6) resource 2 on node 2 is started
7) resource 3 on node 1 is started
8) resource 4 on node 4 is started

The time from 1) to 8) might be several seconds (~10s). During this time 
resource 4 is not available, although it is not affected by the failure and no 
additional resource is started on that node.

> But there also won't be any IP rejects/denials being sent, as none of
> the other nodes reply to that specific hash bucket yet. So, it'll look
> like a brief delay only.
>
> > It would be better just to start resource 3 on node 1.
>
> That's what happens. If node3 crashed, we obviously don't stop the
> resource - that's implied in the node hosting the resource being gone &
> fenced.

No. In ordered clones ALL resources are stopped on ALL nodes and then started 
again. Of course, the resource of the failed node is started on a new node. 
This process can take quite long with more than two nodes.

Or perhaps I have a mistake in my setup of the ordered clone resource...

-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Addresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: [EMAIL PROTECTED]
web: www.multinet.de

Sitz der Gesellschaft: 85630 Grasbrunn
Registergericht: Amtsgericht München HRB 114375
Geschäftsführer: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42


Re: [Linux-HA] HA problems

2007-04-10 Thread Alan Robertson
Angelo Venera wrote:
> Hi all,
> 
> I'm new to this list and to HA. I'm trying to build an HA Active/Passive 
> cluster for these services:
> 
> amavisd clamd.amavisd dhcpd dovecot httpd mysqld named postfix smb 
> spamassassin squid
> 
> On startup, heartbeat runs these services and the node becomes primary. But 
> when I run nmap against my cluster IP, I get this:
> 
> nmap -P0 ha
> 
> Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2007-04-10 11:28 CEST
> Nmap finished: 1 IP address (0 hosts up) scanned in 0.224 seconds
> 
> It seems the cluster IP is down with no services, whereas if I run nmap on the 
> node IP:
> 
> nmap -P0 nodo1
> 
> Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2007-04-10 12:38 CEST
> Interesting ports on nodo1 (192.168.1.200):
> Not shown: 1678 closed ports
> PORT  STATE SERVICE
> 22/tcpopen  ssh
> 1/tcp open  snet-sensor-mgmt
> 
> Nmap finished: 1 IP address (1 host up) scanned in 0.091 seconds
> 
> all services are down on the node IP, which is expected.
> 
> Here are my configuration files:
> 
> drbd.conf
> 
> #
> # drbd.conf
> #
> resource r0
> {
> protocol C;
> incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; 
> halt -f";
> startup
> {
> degr-wfc-timeout 120;# 2 minutes.
> }
> disk
> {
> on-io-error   detach;
> }
> net
> {
> }
> syncer
> {
> rate 100M;
> group 1;
> al-extents 257;
> }
> on nodo1
> {
> device /dev/drbd0;
> disk   /dev/hdc2;
> address192.168.1.5:7788;
> meta-disk  internal;
> }
> on nodo2
> {
> device/dev/drbd0;
> disk  /dev/hda2;
> address   192.168.1.4:7788;
> meta-disk internal;
> }
> }
> 
> 
> ha.cf
> 
> debugfile   /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> keepalive   2
> deadtime30
> bcast   eth1
> auto_failback   off
> nodenodo1 nodo2
> crm no
> debug   3
> 
> haresources
> 
> #
> # haresources
> #
> nodo1 IPaddr::192.168.1.210/24/eth0 drbddisk::r0 
> Filesystem::/dev/drbd0::/cluster::ext3 amavisd clamd.amavisd dhcpd dovecot 
> httpd mysqld named postfix smb spamassassin squid
> 
> and my ifconfig is:
> 
> eth0  Link encap:Ethernet  HWaddr 00:E0:4C:39:65:E4  
>   inet addr:192.168.1.200  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::2e0:4cff:fe39:65e4/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:182005 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:537446 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:36397105 (34.7 MiB)  TX bytes:438135340 (417.8 MiB)
>   Interrupt:18 Base address:0x4000 
> 
> eth0:0Link encap:Ethernet  HWaddr 00:E0:4C:39:65:E4  
>   inet addr:192.168.1.210  Bcast:192.168.1.255  Mask:255.255.255.0
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   Interrupt:18 Base address:0x4000 
> 
> eth1  Link encap:Ethernet  HWaddr 00:0E:2E:AE:DF:4F  
>   inet addr:192.168.1.5  Bcast:192.168.1.255  Mask:255.255.255.0
>   inet6 addr: fe80::20e:2eff:feae:df4f/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:485293 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:148794 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:38443850 (36.6 MiB)  TX bytes:31554489 (30.0 MiB)
>   Interrupt:19 Base address:0x6000 
> 
> loLink encap:Local Loopback  
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>   RX packets:20708 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:20708 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0 
>   RX bytes:1011054 (987.3 KiB)  TX bytes:1011054 (987.3 KiB)
> 
> 
> Why is this? And why are my services filtered?
> 
> Thanks, all.
> 
> P.S. Sorry for my bad English.

At the end of every email on this mailing list is

> See also: http://linux-ha.org/ReportingProblems

This is mostly about sending logs.  You didn't send any.  You have a
pretty complicated configuration.  There's no way we can figure anything
out without logs.  This is why every single email to the mailing lists
mentions this ReportingProblems link.


Since I've never used nmap, I don't have any idea what to expect from it.

In one place you called a node "ha" and in another you called a node
"nodo1".

I would use the ip command or the ifconfig command to see what IP

Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Bernd Schubert
On Tuesday 10 April 2007 14:07:54 Lars Marowsky-Bree wrote:
> On 2007-04-10T12:02:30, Peter Kruse <[EMAIL PROTECTED]> wrote:
> > >But when you return the proper status - running, failed, not running -,
> > >heartbeat should do the "right thing" automatically when it finds the
> > >resource active prior to heartbeat being (re-)started?
> >
> > The point is that we misuse heartbeat to monitor a process that
> > may already be running before heartbeat starts, like cron.  And we
> > don't want heartbeat to stop it when it's already running.  Therefore
> > on probe we return not running.
>
> But, it will NOT be stopped if it is running!
>
> Only if your configuration says that it shouldn't be active where it
> is.

I'm also not convinced yet that we really need it anymore, but for now we just 
want to keep the old behaviour of our scripts.

[...]

> You can use per-operation specific parameters in the CIB as well. You
> can define a special monitor op with interval="0"; the instance
> parameters defined there will be passed to the startup probe.

OK, that's fine in principle, but won't that cause two probe actions to run? So 
with our scripts one will return that the resource is *not* running and one 
will return that the resource *is* running?

Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Andrew Beekhof

On 4/10/07, Bernd Schubert <[EMAIL PROTECTED]> wrote:

On Tuesday 10 April 2007 14:07:54 Lars Marowsky-Bree wrote:
> On 2007-04-10T12:02:30, Peter Kruse <[EMAIL PROTECTED]> wrote:
> > >But when you return the proper status - running, failed, not running -,
> > >heartbeat should do the "right thing" automatically when it finds the
> > >resource active prior to heartbeat being (re-)started?
> >
> > The point is that we misuse heartbeat to monitor a process that
> > may already be running before heartbeat starts, like cron.  And we
> > don't want heartbeat to stop it when it's already running.  Therefore
> > on probe we return not running.
>
> But, it will NOT be stopped if it is running!
>
> Only if your configuration says that it shouldn't be active where it
> is.

I'm also not convinced yet that we really need it anymore, but for now we just
want to keep the old behaviour of our scripts.

[...]

> You can use per-operation specific parameters in the CIB as well. You
> can define a special monitor op with interval="0"; the instance
> parameters defined there will be passed to the startup probe.

OK, that's fine in principle, but won't that cause two probe actions to run? So


no (and if it does, feel free to complain very loudly)


with our scripts one will return that the resource is *not* running and one
will return that the resource *is* running?



[Linux-HA] Status (rc.d/status)

2007-04-10 Thread Mark Frasa
Hello,

For an active/passive configuration environment I want to know the
status of heartbeat on the local machine.

I have found a script: 

/etc/ha.d/rc.d/status

But this outputs: 
/etc/ha.d/rc.d/status: line 3: .: filename argument required
.: usage: . filename

The problem is line 3, sourcing $HA_FUNCS:

#!/bin/sh

. $HA_FUNCS

case $HA_st in
  dead) $HA_BIN/mach_down $HA_src;;
esac


Can anyone tell me where I should retrieve this $HA_FUNCS from?
It would make life so much simpler to know whether or not the local
machine is active / passive.
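For running the script by hand, a fallback like the following is one possible sketch; heartbeat is assumed to export HA_FUNCS itself when invoking rc.d scripts, and the /etc/ha.d/shellfuncs path is a guess that may differ per distribution:

```shell
#!/bin/sh
# Fallback for running rc.d scripts by hand: heartbeat itself is assumed
# to export HA_FUNCS; the default path below is a guess and may differ
# by distribution or configure prefix.
: "${HA_FUNCS:=/etc/ha.d/shellfuncs}"
if [ -r "$HA_FUNCS" ]; then
    . "$HA_FUNCS"
else
    echo "warning: $HA_FUNCS not found; run this via heartbeat instead" >&2
fi
echo "HA_FUNCS=$HA_FUNCS"
```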


Thanks in advance!
/Mark.

PS: I sent a mail to this mailing list straight after enabling access to
this list, but it failed to arrive.



Re: [Linux-HA] Getting the status of the node

2007-04-10 Thread Mark Eisenblaetter

Hi,

Sorry, I can't find that script,
only some confusing mails about it.

Do you know where I can find it?

Thanks Mark


On 4/3/07, Alan Robertson <[EMAIL PROTECTED]> wrote:

Mark Eisenblaetter wrote:
> Hello list,
>
> I'm searching for a tool/script that tells me if the node is active or
> passive.

Heartbeat isn't an active/passive solution.

So there is no "active node" or "passive node".

But cl_status rscstatus will tell you what you want to know - though not
very directly.

Read the man page for it.


--
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce




--
Mark Eisenblätter
Geissendoerfer & Leschinsky GmbH
www.gl-sytemhaus.de


Re: [Linux-HA] Status (rc.d/status)

2007-04-10 Thread Alan Robertson
Mark Frasa wrote:
> Hello,
> 
> For an active/passive configuration environment I want to know the
> status of heartbeat on the local machine.

That's not what the status script does.

How about cl_status rscstatus?

Do read http://linux-ha.org/cl_status carefully.

It's not wonderful, but for an R1 system it will do what you need it to.
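A minimal wrapper sketch, assuming cl_status rscstatus reports strings such as "all", "local", "foreign", "none", or "transition" — verify the exact values against your version's man page:

```shell
#!/bin/sh
# Hedged sketch: map the output of `cl_status rscstatus` to active/passive
# for an R1 (haresources) cluster. The status strings below are assumptions
# taken from the cl_status documentation, not verified here.
classify_node() {
    case "$1" in
        all|local)    echo "active"  ;;  # this node holds (some) resources
        foreign|none) echo "passive" ;;  # resources run elsewhere (or nowhere)
        *)            echo "unknown" ;;  # e.g. "transition", or unexpected output
    esac
}

# On a live node: classify_node "$(cl_status rscstatus)"
classify_node all    # prints "active"
```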


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Getting the status of the node

2007-04-10 Thread Alan Robertson
Mark Eisenblaetter wrote:
> Hi,
> 
> sorry, i don't find that script.
> Only some confusing mails about that script.
> 
> do you know were i can find that script?


It's not a script.

What version are you running?

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Getting the status of the node

2007-04-10 Thread Alan Robertson
Alan Robertson wrote:
> Mark Eisenblaetter wrote:
>> Hi,
>>
>> sorry, i don't find that script.
>> Only some confusing mails about that script.

Did you read the web page?

On my machine it's located in /usr/bin/cl_status.  Where it is on yours
depends on how you have things configured.


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Annouce: IPAddr2 RA v1.30 alpha

2007-04-10 Thread Lars Marowsky-Bree
On 2007-04-10T14:39:56, Michael Schwartzkopff <[EMAIL PROTECTED]> wrote:

> > Uhm. Of course. Machine 3 has just died. Of course there's no
> > connectivity until it is restarted.
> Not really. It is not only resource 3 that is unavailable during failover, but
> ALSO all the other resources! That is the problem.

Uh?

> To make it clear in the above situation of failure of node 3:
> 1) resource 4 on node 4 is stopped
> 2) node 3 is failed
> 3) resource 2 on node 2 is stopped
> 4) resource 1 on node 1 is stopped
> 5) resource 1 on node 1 is started
> 6) resource 2 on node 2 is started
> 7) resource 3 on node 1 is started
> 8) resource 4 on node 4 is started

You need to set a resource stickiness > 0 so that the clones don't get
load-balanced. I think with the most recent dev version this is the
default for clones, actually.
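For illustration, the cluster-wide variant of that setting — a hedged sketch using the same nvpair style as the CIB fragments elsewhere in this thread; the id and value are illustrative:

```xml
<!-- Sketch: a positive default stickiness in crm_config so resources
     (including clone instances) are not re-balanced after a failure. -->
<nvpair id="cib-bootstrap-options-default-resource-stickiness"
        name="default-resource-stickiness" value="100"/>
```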

> No. In ordered clones ALL resources are stopped on ALL nodes and then started 
> again.

That is not what ordered is about, sorry. This seems to be a side effect
of having no stickiness set.

Does that help?


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Re: [Linux-HA] pingd not failing over

2007-04-10 Thread Terry L. Inzauro
Terry L. Inzauro wrote:
> Alan Robertson wrote:
>> Terry L. Inzauro wrote:
>>> Alan Robertson wrote:
 Terry L. Inzauro wrote:
> Alan Robertson wrote:
>> Daniel Bray wrote:
>>> Hello List,
>>>
>>> I have been unable to get a 2 node active/passive cluster to
>>> auto-failover using pingd.  I was hoping someone could look over my
>>> configs and tell me what I'm missing.  I can manually fail the cluster
>>> over, and it will even auto-fail over if I stop heartbeat on one of the
>>> nodes.  But, what I would like to have happen, is when I unplug the
>>> network cable from node1, everything auto-fails over to node2 and stays
>>> there until I manually fail it back.
>>>
>>> #/etc/ha.d/ha.cf
>>> udpport 6901
>>> autojoin any
>>> crm true
>>> bcast eth1
>>> node node1
>>> node node2
>>> respawn root /sbin/evmsd
>>> apiauth evms uid=hacluster,root
>>> ping 192.168.1.1
>>> respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s
>>>
>>> #/var/lib/heartbeat/crm/cib.xml
>>>   >> ignore_dtd="false" ccm_transition="14" num_peers="2"
>>> cib_feature_revision="1.3"
>>> dc_uuid="e88ed713-ba7b-4c42-8a38-983eada05adb" epoch="14"
>>> num_updates="330" cib-last-written="Mon Mar 26 10:48:31 2007">
>>>
>>>  
>>>
>>>  
>>>>> value="True"/>
>>>>> id="cib-bootstrap-options-symmetric-cluster" value="True"/>
>>>>> name="default-action-timeout" value="60s"/>
>>>>> id="cib-bootstrap-options-default-resource-failure-stickiness"
>>> name="default-resource-failure-stickiness" value="-500"/>
>>>>> id="cib-bootstrap-options-default-resource-stickiness"
>>> name="default-resource-stickiness" value="INFINITY"/>
>>>>> id="cib-bootstrap-options-last-lrm-refresh" value="1174833528"/>
>>>  
>>>
>>>  
>>>  
>>>>> id="e88ed713-ba7b-4c42-8a38-983eada05adb">
>>>  >> id="nodes-e88ed713-ba7b-4c42-8a38-983eada05adb">
>>>
>>>  >> id="standby-e88ed713-ba7b-4c42-8a38-983eada05adb" value="off"/>
>>>
>>>  
>>>
>>>>> id="f6774ed6-4e03-4eb1-9e4a-8aea20c4ee8e">
>>>  >> id="nodes-f6774ed6-4e03-4eb1-9e4a-8aea20c4ee8e">
>>>
>>>  >> id="standby-f6774ed6-4e03-4eb1-9e4a-8aea20c4ee8e" value="off"/>
>>>
>>>  
>>>
>>>  
>>>  
>>>>> resource_stickiness="INFINITY" id="group_my_cluster">
>>>  >> id="resource_my_cluster-data">
>>>>> id="resource_my_cluster-data_instance_attrs">
>>>  
>>>>> id="resource_my_cluster-data_target_role" value="started"/>
>>>>> name="device" value="/dev/sdb1"/>
>>>>> id="9e0a0246-e5cb-4261-9916-ad967772c80b" value="/data"/>
>>>>> name="fstype" value="ext3"/>
>>>  
>>>
>>>  
>>>  >> type="IPaddr" provider="heartbeat">
>>>>> id="resource_my_cluster-IP_instance_attrs">
>>>  
>>>>> name="target_role" value="started"/>
>>>>> name="ip" value="101.202.43.251"/>
>>>  
>>>
>>>  
>>>  >> id="resource_my_cluster-pingd">
>>>>> id="resource_my_cluster-pingd_instance_attrs">
>>>  
>>>>> id="resource_my_cluster-pingd_target_role" value="started"/>
>>>>> name="host_list" value="node1,node2"/>
>>>  
>>>
>>>
>>>  >> timeout="90" prereq="nothing"/>
>>>  >> name="monitor" interval="20" timeout="40" start_delay="1m"
>>> prereq="nothing"/>
>>>
>>>  
>>>  >> id="resource_my_cluster-stonssh">
>>>>> id="resource_my_cluster-stonssh_instance_attrs">
>>>  
>>>>> id="resource_my_cluster-stonssh_target_role" value="started"/>
>>>>> name="hostlist" value="node1,node2"/>
>>>  
>>>
>>>
>>>  >> timeout="15" prereq="nothing"/>
>>>  >> name="monitor" interval="5" timeout="20" start_delay="15"/>
>>>
>>>  
>>>
>>>  
>>>  
>>>
>>>  
>>>>> id="c9adb725-e0fc-4b9c-95ee-0265d50d8eb9" operation="eq" value="node1"/>
>>>  
>>>
>>>
>>>  
>>>   

Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Bernd Schubert
On Thursday 05 April 2007 20:11:51 Alan Robertson wrote:
> This particular document had a couple of other errors too, which I
> believe I've corrected.  See what you think.


Thanks for improving the documentation, but I think the given XML example does 
not work as-is. Here's a similar fragment:











Now crm_verify complains:


crm_verify[12992]: 2007/04/10_16:16:39 WARN: assign_uuid: Updating object from 
 to 
crm_verify[12992]: 2007/04/10_16:16:39 ERROR: do_id_check: Detected 
 object without an ID. Assigned: 
6cd0933e-cc54-4ddb-8951-63259e41f3bd
crm_verify[12992]: 2007/04/10_16:16:39 ERROR: main: ID Check failed


Setting  will correct that.



Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Bernd Schubert
On Tuesday 10 April 2007 15:15:09 Andrew Beekhof wrote:
> > > You can use per-operation specific parameters in the CIB as well. You
> > > can define a special monitor op with interval="0"; the instance
> > > parameters defined there will be passed to the startup probe.
> >
> > Ok, thats in principle fine, but won't that make two probe actions to
> > run? So
>
> no (and if it does, feel free to complain very loudly)
>

Complaining loudly ;)

Not exactly what I expected to happen, but still not good.

[XML fragment stripped by the list archive]

Now from the log of our resource agent:

Apr 10 17:30:25 ha-test-1 process[26425]: Returnig 7
Apr 10 17:30:40 ha-test-1 process[26493]: Maintainance =
Apr 10 17:30:40 ha-test-1 process[26493]: OCF_RESKEY_probe = 1
Apr 10 17:30:40 ha-test-1 process[26493]: Returnig 7
Apr 10 17:30:56 ha-test-1 process[26603]: Maintainance =
Apr 10 17:30:56 ha-test-1 process[26603]: OCF_RESKEY_probe = 1
Apr 10 17:30:56 ha-test-1 process[26603]: Returnig 7
Apr 10 17:31:11 ha-test-1 process[26714]: Maintainance =
Apr 10 17:31:11 ha-test-1 process[26714]: OCF_RESKEY_probe = 1
Apr 10 17:31:11 ha-test-1 process[26714]: Returnig 7
[...]

How can it happen that OCF_RESKEY_probe is set to 1 not just once, but on
every call of the monitor action?


So I thought that probe is perhaps never unset, and therefore defined
probe=0 for the other monitor action:

[XML fragment stripped by the list archive]

Now:

Apr 10 17:51:26 ha-test-2 process[564]: OCF_RESKEY_probe = 0
Apr 10 17:51:26 ha-test-2 process[564]: Process ntpd *running*
Apr 10 17:51:41 ha-test-2 process[675]: Maintainance =
Apr 10 17:51:41 ha-test-2 process[675]: OCF_RESKEY_probe = 0
Apr 10 17:51:41 ha-test-2 process[675]: Process ntpd *running*
Apr 10 17:51:56 ha-test-2 process[791]: Maintainance =
Apr 10 17:51:56 ha-test-2 process[791]: OCF_RESKEY_probe = 0
[...]

Apparently the value is used for all monitor invocations, and probe=0 simply
overrides probe=1, even for the interval=0 op.
Also, using the same ids for the different instance_attributes and/or nvpair
elements makes the cib.xml invalid. Without having looked into the source,
I guess the values are not properly assigned within a data structure.
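The two monitor ops being discussed would have roughly this shape — the ids, timeouts, and the 15s interval are illustrative reconstructions, since the archive stripped the original fragments:

```xml
<operations>
  <!-- interval="0": these instance parameters are handed to the startup probe -->
  <op id="process_mon_probe" name="monitor" interval="0" timeout="30">
    <instance_attributes id="process_mon_probe_attrs">
      <attributes>
        <nvpair id="process_mon_probe_nv" name="probe" value="1"/>
      </attributes>
    </instance_attributes>
  </op>
  <!-- the recurring monitor, with probe explicitly set back to 0 -->
  <op id="process_mon" name="monitor" interval="15s" timeout="30">
    <instance_attributes id="process_mon_attrs">
      <attributes>
        <nvpair id="process_mon_nv" name="probe" value="0"/>
      </attributes>
    </instance_attributes>
  </op>
</operations>
```

The logs above suggest that whichever probe value is defined last ends up in OCF_RESKEY_probe for every monitor invocation, instead of each op keeping its own value.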

Any idea which source files I should look into?

Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Andrew Beekhof

On 4/10/07, Bernd Schubert <[EMAIL PROTECTED]> wrote:

On Tuesday 10 April 2007 15:15:09 Andrew Beekhof wrote:
> > > You can use per-operation specific parameters in the CIB as well. You
> > > can define a special monitor op with interval="0"; the instance
> > > parameters defined there will be passed to the startup probe.
> >
> > Ok, thats in principle fine, but won't that make two probe actions to
> > run? So
>
> no (and if it does, feel free to complain very loudly)
>

Complaining loudly ;)

Not exactly what I expected to happen, but still not good.

[XML fragment stripped by the list archive]
Now from the log of our resource agent:

Apr 10 17:30:25 ha-test-1 process[26425]: Returnig 7
Apr 10 17:30:40 ha-test-1 process[26493]: Maintainance =
Apr 10 17:30:40 ha-test-1 process[26493]: OCF_RESKEY_probe = 1
Apr 10 17:30:40 ha-test-1 process[26493]: Returnig 7
Apr 10 17:30:56 ha-test-1 process[26603]: Maintainance =
Apr 10 17:30:56 ha-test-1 process[26603]: OCF_RESKEY_probe = 1
Apr 10 17:30:56 ha-test-1 process[26603]: Returnig 7
Apr 10 17:31:11 ha-test-1 process[26714]: Maintainance =
Apr 10 17:31:11 ha-test-1 process[26714]: OCF_RESKEY_probe = 1
Apr 10 17:31:11 ha-test-1 process[26714]: Returnig 7
[...]

How can it happen that OCF_RESKEY_probe is set to 1 not just once, but on
every call of the monitor action?


So I thought that probe is maybe never unset


correct


and therefore defined a probe=0
for the other monitoring action:


right

[XML fragment stripped by the list archive]

Now:

Apr 10 17:51:26 ha-test-2 process[564]: OCF_RESKEY_probe = 0
Apr 10 17:51:26 ha-test-2 process[564]: Process ntpd *running*
Apr 10 17:51:41 ha-test-2 process[675]: Maintainance =
Apr 10 17:51:41 ha-test-2 process[675]: OCF_RESKEY_probe = 0
Apr 10 17:51:41 ha-test-2 process[675]: Process ntpd *running*
Apr 10 17:51:56 ha-test-2 process[791]: Maintainance =
Apr 10 17:51:56 ha-test-2 process[791]: OCF_RESKEY_probe = 0
[...]

Apparently the value is used for all monitor invocations, and probe=0 simply
overrides probe=1, even for the interval=0 op.
Also, using the same ids for the different instance_attributes and/or nvpair
elements makes the cib.xml invalid. Without having looked into the source,
I guess the values are not properly assigned within a data structure.

Any idea which source files I should look into?


nothing in the crm...

if i put your example into a CIB and run it through ptest, you can
see "probe" being set correctly.  so either it's a bug in the LRM or
something (potentially elsewhere) that was fixed after your version
was released.

[Andrew's test CIB and the ptest transition output were stripped by the list archive]



[Linux-HA] bringing up IP address outside default netblock

2007-04-10 Thread Dale Yamamoto
Running 2.0.5 on Debian Sarge, having an issue bringing up an IP
address.

These servers have plenty of IP addresses controlled by heartbeat
where those IPs are in the same netblock as the server's own IP
address.  Our ISP has allocated us a second netblock that's not
contiguous with the first one (but uses the same router, ethernet
interface, etc.).  If I try to bring up an IP address in this second
block without the netmask, I see "ERROR: Cannot use default route w/o
netmask" in the logs.  If I try to bring it up explicitly designating
it as a /27, the logs don't seem to show any errors, but the IP
address never comes up.  In both cases, the crm_mon output shows

ip_180_134  (heartbeat::ocf:IPaddr):Started lvlb1
(unmanaged)

but like I said, the IP address in question never comes up.

Since this IP address would only serve requests, I don't think I would
need a gateway, but IPaddr doesn't seem to like that.

What should I do to get heartbeat to start/control IP addresses in the
new netblock?  Thanks.



[Linux-HA] Attend Gelato ICE, April 16-18

2007-04-10 Thread Nan Holda
Gelato ICE: Itanium Conference & Exhibition
April 16-18, 2007 | Doubletree Hotel | San Jose, California

Dear Linux High-Availability,

The Gelato Federation is proud to announce the technical program for the 
Gelato ICE: Itanium(r) Conference & Expo. International Itanium architecture 
experts are scheduled to deliver over 70 top-notch presentations focusing 
on Linux on the Intel Itanium architecture. The full program is now available 
online at . 

Program tracks include: multi-core programming, IA-64 Linux kernel work, 
virtualization, tools and tuning, topics for enterprise, GCC improvements, 
and cutting-edge research. Linux keynote speakers will be Andrew Morton, 
Maintainer of the Linux 2.6 Kernel, and Wim Coekaerts, Senior Director for 
Linux Engineering at Oracle. You will also not want to miss the presentation 
from Intel's James Fister outlining the latest, yet to be disclosed, Itanium 
processor roadmap.

Your email box is full of invitations to conferences large and small. Figuring
out which event to cover isn't easy. If you want to find out about revolutionary
technology, attending Gelato ICE is a must.

For media pass and interview requests, please send me a note.

Regards,
Nan

Nan Holda
GELATO Federation
Community for 
Linux(r) on Itanium(r) Architecture
1308 W. Main St.
Urbana, IL, USA 61801-2307
tel: 217.265.0947
fax: 217.333.5579
www.gelato.org

Join us at the
Gelato ICE: Itanium(r) Conference & Expo
Date:April 16-18, 2007
Venue: Doubletree Hotel, San Jose, California, USA
Details available at www.ice.gelato.org


[Linux-HA] heartbeat does not start when the stonith device is not available

2007-04-10 Thread Martin
Hello !

Today I have noticed that the heartbeat startup script does not start
when the APC PDU (my stonith device) is configured in ha.cf but not
available. IMHO this creates a single point of failure: all the services
that should be highly available are blocked by a simple problem with a
non-essential device.

Here are the details: the cluster is a two-node Linux system. Heartbeat
version is 2.0.8. The stonith program uses the apcmastersnmp driver.
If the PDU is disconnected from the LAN, the stonith program fails
(with a timeout error) for obvious reasons, but it reports a config
error, which is simply not the case. This failure prevents heartbeat
from starting. See below.

Do you have any comments or ideas related to this ?

Best regards

Martin

- - - - - - - - - - - - - - - -
# service heartbeat start

Starting High-Availability services: 
2007/04/10_22:19:07 INFO:  Resource is stopped
   [FAILED]
heartbeat[7286]: 2007/04/10_22:19:13 ERROR: glib: APC_read: error 
sending/receiving pdu (cliberr: 0 / snmperr: -24 / error: Timeout).
heartbeat[7286]: 2007/04/10_22:19:13 ERROR: glib: apcmastersnmp_set_config: 
cannot read number of outlets.
heartbeat[7286]: 2007/04/10_22:19:13 ERROR: Unknown Stonith config error 
[/etc/ha.d/apcpdu.cf] [2]


Re: [Linux-HA] heartbeat does not start when the stonith device is not available

2007-04-10 Thread Alan Robertson
Martin wrote:
> Hello !
> 
> Today I have noticed that the heartbeat startup script does not start
> when the APC PDU (my stonith device) is configured in ha.cf but not
> available. IMHO it creates single point of failure. All the services that 
> should be highly available are blocked by a simple problem with a 
> non-essential device.
> 
> Here are the details: the cluster is a two-nodes Linux system. Heartbeat
> version is 2.0.8. The stonith program uses the apcmastersnmp driver.
> If the PDU is disconnected from the LAN, the stonith program fails
> (with time-out error) for obvious reason, but it reports config error,
> which is simply not the case. This failure prevents heartbeat from
> starting. See below.
> 
> Do you have any comments or ideas related to this ?

Your current situation is indistinguishable from a configuration error.

R2 configurations won't have this issue.
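In an R2-style (CRM) configuration the PDU is driven by a stonith-class resource in the CIB rather than by a stonith directive in ha.cf, so an unreachable device shows up as a failed resource instead of blocking heartbeat startup. A rough sketch — the parameter names for the apcmastersnmp plugin are assumptions here and should be checked against the plugin's own parameter listing:

```xml
<primitive id="fencing_pdu" class="stonith" type="apcmastersnmp" provider="heartbeat">
  <instance_attributes id="fencing_pdu_attrs">
    <attributes>
      <!-- parameter names below are illustrative; query the plugin for the real ones -->
      <nvpair id="fencing_pdu_ip" name="ipaddr" value="192.0.2.50"/>
      <nvpair id="fencing_pdu_port" name="port" value="161"/>
      <nvpair id="fencing_pdu_comm" name="community" value="private"/>
    </attributes>
  </instance_attributes>
</primitive>
```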


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] bringing up IP address outside default netblock

2007-04-10 Thread Alan Robertson
Dale Yamamoto wrote:
> Running 2.0.5 on Debian Sarge, having an issue bringing up an IP
> address.
> 
> These servers have plenty of IP addresses controlled by heartbeat
> where those IPs are in the same netblock as the server's own IP
> address.  Our ISP has allocated us a second netblock that's not
> contiguous with the first one (but uses the same router, ethernet
> interface, etc.).  If I try to bring up an IP address in this second
> block without the netmask, I see "ERROR: Cannot use default route w/o
> netmask" in the logs.  If I try to bring it up explicitly designating
> it as a /27, the logs don't seem to show any errors, but the IP
> address never comes up.  In both cases, the crm_mon output shows
> 
> ip_180_134  (heartbeat::ocf:IPaddr):Started lvlb1
> (unmanaged)
> 
> but like I said, the IP address in question never comes up.
> 
> Since this IP address would only serve requests, I don't think I would
> need a gateway, but IPaddr doesn't seem to like that.
> 
> What should I do to get heartbeat to start/control IP addresses in the
> new netblock?  Thanks.


I suspect this is due to a known bug in IPaddr in all 2.0.x released
versions (but fixed in Hg).  Try also specifying the netmask AND the
interface.  I think you need both.  In any case, it's broken, and
probably fixed in Hg.
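Alan's suggestion — netmask and interface both given explicitly — might look like this for an ocf IPaddr primitive; the address, ids, and parameter names are illustrative and worth double-checking against the agent's metadata:

```xml
<primitive id="ip_180_134" class="ocf" provider="heartbeat" type="IPaddr">
  <instance_attributes id="ip_180_134_attrs">
    <attributes>
      <nvpair id="ip_180_134_ip" name="ip" value="192.0.2.134"/>
      <!-- explicit /27 netmask for the non-contiguous netblock -->
      <nvpair id="ip_180_134_mask" name="netmask" value="27"/>
      <!-- explicit interface, as suggested above -->
      <nvpair id="ip_180_134_nic" name="nic" value="eth0"/>
    </attributes>
  </instance_attributes>
</primitive>
```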

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] OCF_RESKEY_interval

2007-04-10 Thread Alan Robertson
Hi Bernd,

Thanks for your continuing vigilance!

Bernd Schubert wrote:
> On Thursday 05 April 2007 20:11:51 Alan Robertson wrote:
>> This particular document had a couple of other errors too, which I
>> believe I've corrected.  See what you think.
> 
> 
> Thanks for improving the documentation, but I think the given xml example 
> does 
> not work as it is. Here's a similar fragment
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Now crm_verify complains:
> 
> 
> crm_verify[12992]: 2007/04/10_16:16:39 WARN: assign_uuid: Updating object 
> from 
>  to  id=6cd0933e-cc54-4ddb-8951-63259e41f3bd/>
> crm_verify[12992]: 2007/04/10_16:16:39 ERROR: do_id_check: Detected 
>  object without an ID. Assigned: 
> 6cd0933e-cc54-4ddb-8951-63259e41f3bd
> crm_verify[12992]: 2007/04/10_16:16:39 ERROR: main: ID Check failed
> 
> 
> Setting  will correct that.


Did I change that example in the document?  I only fixed the things I
noticed ;-).

Did you know you can fix this yourself?
http://linux-ha.org/HowToUpdateWebsite

Obviously I can fix this myself, but we really need to get the community
more involved.  There just isn't enough of Andrew and Lars and Dejan and
me to do everything ourselves.

If you don't want to / can't fix it, let me know.  Otherwise, I'll leave
it as a learning experience for you ;-).

Thanks again for pointing this out!

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


[Linux-HA] Interest in Linux-HA at LinuxWorld San Francisco?

2007-04-10 Thread Alan Robertson
Hi,


I'll be speaking at LinuxWorld in San Francisco August 6-9 this year.
So, I'll be there.  Are others from the list coming?

I'll be giving a tutorial at LinuxWorld San Francisco, and I got a note
which offers two things:

1) Birds of a Feather session -- Is there interest in this?

2)  A .org booth for Linux-HA

If we could get 4 others who could help staff the booth, I'm sure I/we
could come up with some materials for such a booth.  I'm also pretty
sure it would be a good bit of work for all concerned for 3 days.

Or if someone on the list knows of another .org group that would let us
share some of their space to pass out brochures, etc., that would be
cool, and a lot less work ;-).


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
--- Begin Message ---


LinuxWorld Conference & Expo 
http://emessaging.vertexcommunication.com/ct/ct.php?t=2429170&c=1013576737&m=m&type=3&h=982E4A25E5D7CCF1243D908311A2C244
August 6 - 9, 2007 
Moscone Center 
San Francisco, CA 

Dear Alan, 

Interested in LinuxWorld's .org Pavilion? Fill out the
Application Form! 

LinuxWorld Conference & Expo is looking for exhibitors for the
.org Pavilion, our free-of-charge exhibit area for free software
and open source projects. We're looking for projects that produce
great software and can host an informative, helpful booth for our
attendees. 

This year, we will be hosting an "un-conference" theater area
adjacent to the pavilion. If you would like a larger venue for
works in progress, Q & A sessions, demonstrations, and media,
it's there for you. 

Projects use LinuxWorld Conference & Expo for many reasons: to
answer questions from possible new users, to distribute copies
of software, to solicit donations, and to sell project
merchandise. 

If you have participated in the .org Pavilion at previous shows,
you still need to re-apply for it this time. 

To submit for the .org Pavilion, please click on the link below.

http://emessaging.vertexcommunication.com/ct/ct.php?t=2429171&c=1013576737&m=m&type=3&h=982E4A25E5D7CCF1243D908311A2C244

Don't forget! Birds-of-a-Feather sessions are still available.
Click on the link below to submit.
http://emessaging.vertexcommunication.com/ct/ct.php?t=2429172&c=1013576737&m=m&type=3&h=982E4A25E5D7CCF1243D908311A2C244

Please contact Alison McCormack with any questions (508)
988-7880.


This message was intended for [EMAIL PROTECTED]
We want to provide you with the most relevant information.

Click below to do the following:
* Change your email preferences
* Update your information
* Opt-out from our email programs

http://emessaging.vertexcommunication.com/phase2/survey1/survey.htm?CID=wjvfou&action=update&[EMAIL PROTECTED]&_mh=ff0ebc1ff53de2c2806b889c75ee15e5
IDG World Expo, 3 Speen Street, Framingham, MA 01701
800-657-1474


Powered by Vertex Communications
http://www.vertexcommunication.com/contact.htm




--- End Message ---

[Linux-HA] Question about ipfail problem with heartbeat 1.2.x

2007-04-10 Thread Hideaki Kondo

I have one question about heartbeat 1.2.x.
# I'm sorry for asking about heartbeat 1.2.x
# while the recent topic is heartbeat 2.0.x.

I heard that there's a problem with ipfail and bcast
in heartbeat 1.2.x, as described at the following URL:

http://www.gossamer-threads.com/lists/linuxha/users/23130#23130

What is the status of this problem?
(Has it been solved?)

Thanks in advance.
Best regards
--
Hideaki Kondo




Re: [Linux-HA] Question about ipfail problem with heartbeat 1.2.x

2007-04-10 Thread Alan Robertson
Hideaki Kondo wrote:
> I have one question about heartbeat1.2.x.
> # I'm sorry for my question about heartbeat 1.2.x
> # while recent topic is heartbeat 2.0.x .
> 
> I heard that there's the problem about ipfail and bcast
> with heartbeat1.2.x as written the following URL.
> 
> http://www.gossamer-threads.com/lists/linuxha/users/23130#23130
> 
> How is the status of this problem ?
> (Is this problem solved ?) 

It's solved - but I don't remember in which version :-(.  I remember
that there had been various problems with ping and ping_group, and
ipfail, but the details escape me.

It's definitely solved in the 2.0.x series, which also runs 1.x style
configurations.


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] Question about ipfail problem with heartbeat 1.2.x

2007-04-10 Thread Hideaki Kondo

Thanks a lot for your quick reply.

> It's definitely solved in the 2.0.x series, which also runs 1.x style
> configurations.

I've understood.
It seems that I had better use the 2.0.x series with 1.x style
configurations in order to solve the ipfail problem.


On Tue, 10 Apr 2007 20:08:37 -0600
Alan Robertson <[EMAIL PROTECTED]> wrote:

> Hideaki Kondo wrote:
> > I have one question about heartbeat1.2.x.
> > # I'm sorry for my question about heartbeat 1.2.x
> > # while recent topic is heartbeat 2.0.x .
> > 
> > I heard that there's the problem about ipfail and bcast
> > with heartbeat1.2.x as written the following URL.
> > 
> > http://www.gossamer-threads.com/lists/linuxha/users/23130#23130
> > 
> > How is the status of this problem ?
> > (Is this problem solved ?) 
> 
> It's solved - but I don't remember in which version :-(.  I remember
> that there had been various problems with ping and ping_group, and
> ipfail, but the details escape me.
> 
> It's definitely solved in the 2.0.x series, which also runs 1.x style
> configurations.
> 
> 
> -- 
> Alan Robertson <[EMAIL PROTECTED]>
> 
> "Openness is the foundation and preservative of friendship...  Let me
> claim from you at all times your undisguised opinions." - William
> Wilberforce

Best regards

--
Hideaki Kondo

