Re: [Pacemaker] Trouble with ordering

2011-09-30 Thread Serge Dubrouski
On Fri, Sep 30, 2011 at 9:20 AM, Gerald Vogt  wrote:

> On 30.09.11 15:03, Serge Dubrouski wrote:
> > Maybe you didn't look carefully, but that script does exactly that: it
> > monitors process and service. Also if you want cluster to control your
> > service, it has to be able to start and stop it. You can configure your
> > service as a clone and it'll be up on several nodes.
> > But if you don't want to use it you don't have to.
>
> You are right. I did not look at the monitor function. I checked the
> status function and assumed the check would be in there if the script did it.
>

That's one of the main differences between LSB and OCF RAs.


> Technically, I don't want the cluster to control the service in the sense
> of starting and stopping it. The cluster controls the IP addresses and
> moves them between nodes. The dns service resource is supposed to check
> that the dns service is working on the node and migrate the service, and
> most importantly the IP address, if it becomes unresponsive.
>
> I haven't looked at the concept of clones yet. Maybe I took a completely
> wrong approach to what I am trying to do.
>

I think that clones are a really good solution for this situation. You can
configure BIND as a cloned service, with a different configuration on each
node: one node will be the master, another the slave. You can also have a
floating VIP that can run on any of the nodes but is colocated with a
running BIND instance. If BIND dies for some reason, Pacemaker will move
your IP to the surviving node. You can also add sending of additional alarms.
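
A minimal sketch of that layout in crm shell syntax (the resource names and
the address are illustrative, and ocf:heartbeat:named is the RA recently
added to resource-agents):

primitive named ocf:heartbeat:named op monitor interval="30s"
clone named-clone named
primitive dns-vip ocf:heartbeat:IPaddr2 params ip="192.168.0.53"
colocation vip-with-named inf: dns-vip named-clone
order named-before-vip inf: named-clone dns-vip

The colocation ties the VIP to a node with a healthy BIND instance, so a
failed monitor on one node pushes the VIP to a surviving one.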


> The cluster until recently only operated the two DNS service IP
> addresses ns1-ip and ns2-ip for our LAN. Three nodes are used to provide
> redundancy in case one node fails. This way our two DNS server IPs are
> active at all times.
>
> Bind is running on all three nodes. Bind is configured to scan for
> interface changes every 60s. The three nodes are configured as slave
> servers, getting notified of zone updates by the master server.
>

Here you have several options. You can either schedule a reload operation
for the named RA in the cluster, or you can try to create an order
constraint like somebody else suggested:

order named-service-clone-after-Cluster_IP inf: Cluster_IP:start Named_Service:reload



> This works in regard to node failures and similar. If a node crashes the
> IP address is moved to another node.
>
> The problem is if the node is still up but the named process becomes
> unresponsive and is hanging. The cluster wouldn't notice this.
>

With the named RA it will.


>
> If I understand your script correctly, it starts and stops the named
> process. If I do this, the node which is not running the dns server
> won't get zone updates, i.e. if it starts it has outdated zone files.
>

As I said earlier, you can configure it as a clone. In that case the
cluster will start it on all nodes and monitor it.


> Now if the master server is accessible and running at the time of start
> the dns server gets updated quickly. The trouble is if the master is
> down, too, the dns server will provide outdated dns information until
> the master is running again.
>


>
> That seems to me to be the problem with starting and stopping the bind
> process on the nodes, and it is what I was trying to avoid. IMHO the named
> process can be running all the time, thus getting zone notifies in the
> usual manner.
>

It depends on how you populate your zones. If you put your zone and config
files on a shared device (DRBD or similar) then you can fail them over
along with the IP and restart BIND on each failover. If you want to use
master/slave BIND replication then you obviously need to have both
instances running at all times, and then you need to use clones.
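
For the shared-device variant, a rough sketch (the DRBD resource itself is
omitted here, and the device, mount point, and names are assumptions):

primitive zones-fs ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/var/named" fstype="ext4"
primitive dns-vip ocf:heartbeat:IPaddr2 params ip="192.168.0.53"
primitive named ocf:heartbeat:named op monitor interval="30s"
group dns-group zones-fs dns-vip named

A group gives you ordering and colocation in one shot: filesystem, then IP,
then BIND, all on the same node.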


> But maybe I am not getting what clones do. So far I haven't quite
> understood from the guides what exactly they do with respect to what I am
> trying to achieve.
>
> Maybe you can give me a hint how I would achieve this with a clone,
> running named on all nodes at all times and moving the service IP
> addresses between nodes in case a node or dns server fails or hangs.
>

> Thanks!
>
> Gerald



-- 
Serge Dubrouski.


[Pacemaker] Ignoring expired failure

2011-09-30 Thread Proskurin Kirill

Hello all.

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with "ver: 1"

I have run into a monitoring failure again and still don't know why it
happens. Details are here:
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09986.html

Some info:
I have twice run into a situation where Pacemaker thinks a resource is
started but it is not. We use a slightly modified version of the "anything"
agent for our scripts, but it is aware of OCF return codes and other such
details.


I run the monitor action of our agent from the console:

# env -i OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_binfile=/usr/local/mpop/bin/my/tranprocessor.pl \
    /usr/lib/ocf/resource.d/mail.ru/generic monitor

# generic[14992]: DEBUG: default monitor : 7
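
For a fuller check, ocf-tester from the resource-agents package can run
through all of the agent's actions; something like this, reusing the paths
above, should work:

# ocf-tester -n tranprocessor \
    -o binfile=/usr/local/mpop/bin/my/tranprocessor.pl \
    /usr/lib/ocf/resource.d/mail.ru/generic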


But this time I see in the logs:

Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice: unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2, magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on mysender34.mail.ru


So Pacemaker knows that the resource may be down but ignores it. Why?
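
For what it's worth, the "expired" wording points at the failure-timeout
meta attribute: once that interval has passed, the policy engine disregards
the old failure when placing resources. A sketch of where that would be set
(the value and the parameters are only illustrative):

primitive tranprocessor ocf:mail.ru:generic \
    params binfile="/usr/local/mpop/bin/my/tranprocessor.pl" \
    op monitor interval="30s" \
    meta failure-timeout="60s"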

--
Best regards,
Proskurin Kirill



Re: [Pacemaker] Trouble with ordering

2011-09-30 Thread Gerald Vogt
On 30.09.11 15:03, Serge Dubrouski wrote:
> May be you didn't look carefully but that script does exactly that, it
> monitors process and service. Also if you want cluster to control your
> service, it has to be able to start and stop it. You can configure your
> service as a clone and it'll be up on several nodes.
> But if you don't want to use it you don't have to.

You are right. I did not look at the monitor function. I checked the
status function and assumed the check would be in there if the script did it.

Technically, I don't want the cluster to control the service in the sense
of starting and stopping it. The cluster controls the IP addresses and
moves them between nodes. The dns service resource is supposed to check
that the dns service is working on the node and migrate the service, and
most importantly the IP address, if it becomes unresponsive.

I haven't looked at the concept of clones yet. Maybe I took a completely
wrong approach to what I am trying to do.

The cluster until recently only operated the two DNS service IP
addresses ns1-ip and ns2-ip for our LAN. Three nodes are used to provide
redundancy in case one node fails. This way our two DNS server IPs are
active at all times.

Bind is running on all three nodes. Bind is configured to scan for
interface changes every 60s. The three nodes are configured as slave
servers, getting notified of zone updates by the master server.

This works in regard to node failures and similar. If a node crashes the
IP address is moved to another node.

The problem is if the node is still up but the named process becomes
unresponsive and is hanging. The cluster wouldn't notice this.

If I understand your script correctly, it starts and stops the named
process. If I do this, the node which is not running the dns server
won't get zone updates, i.e. if it starts it has outdated zone files.

Now if the master server is accessible and running at the time of start
the dns server gets updated quickly. The trouble is if the master is
down, too, the dns server will provide outdated dns information until
the master is running again.

That seems to me to be the problem with starting and stopping the bind
process on the nodes, and it is what I was trying to avoid. IMHO the named
process can be running all the time, thus getting zone notifies in the
usual manner.

But maybe I am not getting what clones do. So far I haven't quite
understood from the guides what exactly they do with respect to what I am
trying to achieve.

Maybe you can give me a hint how I would achieve this with a clone,
running named on all nodes at all times and moving the service IP
addresses between nodes in case a node or dns server fails or hangs.

Thanks!

Gerald



Re: [Pacemaker] [Linux-ha-dev] [ha-wg] CFP: HA Mini-Conference in Prague on Oct 25th

2011-09-30 Thread Digimer
On 09/27/2011 07:58 AM, Lars Marowsky-Bree wrote:
> Hi all,
> 
> it turns out that there was zero feedback about people wanting to
> present, only some about travel budget being too tight to come. So we
> had some discussions about whether to cancel this completely, as this
> made planning rather difficult.
> 
> But just in the last few days, I got a fair share of e-mails asking if
> this still takes place, and who is going to be there. ;-)
> 
> So: we have the room. I will be there, and it seems so will at least a
> few other people, including Andrew. I suggest we do it in an
> "unconference" style and draw up the agenda as we go along; you're
> welcome to stop by and discuss HA/clustering topics that are important
> to you.  It is going to be as successful as we all make it out to be.
> 
> We share the venue with LinuxCon Europe: Clarion Congress Hotel ·
> Prague, Czech Republic, on Oct 25th.
> 
> I suggest we start at 9:30 in the morning and go from there.
> 
> 
> Regards,
> Lars
> 

Is it possible, if this isn't set in stone, to push this back to later in
the day? I don't fly in until the 25th, and I think there is at least one
other person who wants to attend who is in the same boat.

-- 
Digimer
E-Mail:  digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin:   http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"



[Pacemaker] Running two clusters on same node

2011-09-30 Thread Med Hmici
Hi all,

I'm trying to find out a recipe for setting up a node that is part of two
clusters. However, I only want the node to be active in one cluster at a
time. I'd like to do this all programmatically, that is, without yast or
the like (the server doesn't have X installed).

Sorry if this is a dumb question. I'm trying to get my mind around some
concepts that are new to me.

Mo.


Re: [Pacemaker] Trouble with ordering

2011-09-30 Thread Serge Dubrouski
Maybe you didn't look carefully, but that script does exactly that: it
monitors process and service. Also if you want cluster to control your
service, it has to be able to start and stop it. You can configure your
service as a clone and it'll be up on several nodes.
But if you don't want to use it you don't have to.
On Sep 30, 2011 6:02 AM, "Gerald Vogt"  wrote:
> On 30.09.11 13:41, Serge Dubrouski wrote:
> > An OCF script for BIND was recently added to resource-agents on GitHub.
>> Could you please try to use that one?
>
> Which script where?
>
> The one you have posted here:
>
> http://lists.linux-ha.org/pipermail/linux-ha-dev/attachments/20110712/e1a1e792/attachment.obj
>
> doesn't do what I need. I don't want to start or stop the name server.
> The name server (bind process) is supposed to be running all the time to
> get updates from the master.
>
> The script also doesn't check whether the process is working or not. The
> process could be running but not responding.
>
> My script tests whether bind is listening on the resource ip address and
> whether it resolves one of our domains. If it does, it's O.K. If not, it
> fails.
>
> The script is working properly. I just need to tell pacemaker that
> ns1-ip should always be stopped before ns1-dns. That's not the case if
> ns1-dns monitor returns failure.
>
> Thanks!
>
> Gerald
>


Re: [Pacemaker] Trouble with ordering

2011-09-30 Thread Gerald Vogt
On 30.09.11 13:41, Serge Dubrouski wrote:
> An OCF script for BIND was recently added to resource-agents on GitHub.
> Could you please try to use that one?

Which script where?

The one you have posted here:
http://lists.linux-ha.org/pipermail/linux-ha-dev/attachments/20110712/e1a1e792/attachment.obj

doesn't do what I need. I don't want to start or stop the name server.
The name server (bind process) is supposed to be running all the time to
get updates from the master.

The script also doesn't check whether the process is working or not. The
process could be running but not responding.

My script tests whether bind is listening on the resource ip address and
whether it resolves one of our domains. If it does, it's O.K. If not, it
fails.

The script is working properly. I just need to tell pacemaker that
ns1-ip should always be stopped before ns1-dns. That's not the case if
ns1-dns monitor returns failure.

Thanks!

Gerald



Re: [Pacemaker] Trouble with ordering

2011-09-30 Thread Serge Dubrouski
An OCF script for BIND was recently added to resource-agents on GitHub.
Could you please try to use that one?
On Sep 30, 2011 2:09 AM, "Gerald Vogt"  wrote:
> Hi!
>
> I am running a cluster with 3 nodes. These nodes provide dns service.
> The purpose of the cluster is to have our two dns service ip addresses
> online at all times. I use IPaddr2 and that part works.
>
> Now I try to extend our setup to check the dns service itself. So far,
> if a dns server on any node stops or hangs the cluster won't notice.
> Thus, I wrote a custom ocf script to check whether the dns service on
> a node is operational (i.e. if the dns server is listening on the ip
> address and whether it responds to a dns request).
>
> All cluster nodes are slave dns servers, therefore the dns server
> process is running at all times to get zone transfers from the dns
> master.
>
> Obviously, the dns service resource must be colocated with the IP
> address resource. However, as the dns server is running at all times,
> the dns service resource must be started or stopped after the ip
> address. This leads me to something like this:
>
> primitive ns1-ip ocf:heartbeat:IPaddr2 ...
> primitive ns1-dns ocf:custom:dns op monitor interval="30s"
>
> colocation dns-ip1 inf: ns1-dns ns1-ip
> order ns1-ip-dns inf: ns1-ip ns1-dns symmetrical=false
>
> Problem 1: it seems as if the order constraint does not wait for an
> operation on the first resource to finish before it starts the
> operation on the second. When I migrate an IP address to another node
> the stop operation on ns1-dns will fail because the ip address is
> still active on the network interface. I have worked around this by
> checking for the IP address on the interface in the stop part of my
> dns script and sleeping 5 seconds if it is still there before checking
> again and continuing.
>
> Shouldn't the stop on ns1-ip first finish before the node initiates
> the stop on ns1-dns?
>
> Problem 2: if the dns service fails, e.g. hangs, the monitor operation
> fails. Thus, the cluster wants to migrate the ip address and service
> to another node. However, it first initiates a stop on ns1-dns and
> then on ns1-ip.
>
> What I need is ns1-ip to stop before ns1-dns. But this seems
> impossible to configure. The order constraint only says what operation
> is executed on ns1-dns depending on the status of ns1-ip. It says what
> happens after something. It cannot say what happens before something.
> Is that correct? Or am I missing a configuration option?
>
> Thanks,
>
> Gerald
>


Re: [Pacemaker] Trouble with ordering

2011-09-30 Thread Lars Ellenberg
On Fri, Sep 30, 2011 at 10:06:51AM +0200, Gerald Vogt wrote:
> Hi!
> 
> I am running a cluster with 3 nodes. These nodes provide dns service.
> The purpose of the cluster is to have our two dns service ip addresses
> online at all times. I use IPaddr2 and that part works.
> 
> Now I try to extend our setup to check the dns service itself. So far,
> if a dns server on any node stops or hangs the cluster won't notice.
> Thus, I wrote a custom ocf script to check whether the dns service on
> a node is operational (i.e. if the dns server is listening on the ip
> address and whether it responds to a dns request).
> 
> All cluster nodes are slave dns servers, therefore the dns server
> process is running at all times to get zone transfers from the dns
> master.
> 
> Obviously, the dns service resource must be colocated with the IP
> address resource. However, as the dns server is running at all times,
> the dns service resource must be started or stopped after the ip
> address. This leads me to something like this:
> 
> primitive ns1-ip ocf:heartbeat:IPaddr2 ...
> primitive ns1-dns ocf:custom:dns op monitor interval="30s"
> 
> colocation dns-ip1 inf: ns1-dns ns1-ip
> order ns1-ip-dns inf: ns1-ip ns1-dns symmetrical=false

maybe, if this is what you mean, add:
order ns1-ip-dns inf: ns1-ip:stop ns1-dns:stop symmetrical=false
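
Together with the start ordering already in place, the asymmetric pair
would look something like this sketch (note the two constraints need
distinct IDs):

order ns1-ip-dns-start inf: ns1-ip:start ns1-dns:start symmetrical=false
order ns1-ip-dns-stop inf: ns1-ip:stop ns1-dns:stop symmetrical=false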

> 
> Problem 1: it seems as if the order constraint does not wait for an
> operation on the first resource to finish before it starts the
> operation on the second. When I migrate an IP address to another node
> the stop operation on ns1-dns will fail because the ip address is
> still active on the network interface. I have worked around this by
> checking for the IP address on the interface in the stop part of my
> dns script and sleeping 5 seconds if it is still there before checking
> again and continuing.
> 
> Shouldn't the stop on ns1-ip first finish before the node initiates
> the stop on ns1-dns?
> 
> Problem 2: if the dns service fails, e.g. hangs, the monitor operation
> fails. Thus, the cluster wants to migrate the ip address and service
> to another node. However, it first initiates a stop on ns1-dns and
> then on ns1-ip.
> 
> What I need is ns1-ip to stop before ns1-dns. But this seems
> impossible to configure. The order constraint only says what operation
> is executed on ns1-dns depending on the status of ns1-ip. It says what
> happens after something. It cannot say what happens before something.
> Is that correct? Or am I missing a configuration option?
> 
> Thanks,
> 
> Gerald

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



[Pacemaker] Trouble with ordering

2011-09-30 Thread Gerald Vogt
Hi!

I am running a cluster with 3 nodes. These nodes provide dns service.
The purpose of the cluster is to have our two dns service ip addresses
online at all times. I use IPaddr2 and that part works.

Now I try to extend our setup to check the dns service itself. So far,
if a dns server on any node stops or hangs the cluster won't notice.
Thus, I wrote a custom ocf script to check whether the dns service on
a node is operational (i.e. if the dns server is listening on the ip
address and whether it responds to a dns request).

All cluster nodes are slave dns servers, therefore the dns server
process is running at all times to get zone transfers from the dns
master.

Obviously, the dns service resource must be colocated with the IP
address resource. However, as the dns server is running at all times,
the dns service resource must be started or stopped after the ip
address. This leads me to something like this:

primitive ns1-ip ocf:heartbeat:IPaddr2 ...
primitive ns1-dns ocf:custom:dns op monitor interval="30s"

colocation dns-ip1 inf: ns1-dns ns1-ip
order ns1-ip-dns inf: ns1-ip ns1-dns symmetrical=false

Problem 1: it seems as if the order constraint does not wait for an
operation on the first resource to finish before it starts the
operation on the second. When I migrate an IP address to another node
the stop operation on ns1-dns will fail because the ip address is
still active on the network interface. I have worked around this by
checking for the IP address on the interface in the stop part of my
dns script and sleeping 5 seconds if it is still there before checking
again and continuing.

Shouldn't the stop on ns1-ip first finish before the node initiates
the stop on ns1-dns?

Problem 2: if the dns service fails, e.g. hangs, the monitor operation
fails. Thus, the cluster wants to migrate the ip address and service
to another node. However, it first initiates a stop on ns1-dns and
then on ns1-ip.

What I need is ns1-ip to stop before ns1-dns. But this seems
impossible to configure. The order constraint only says what operation
is executed on ns1-dns depending on the status of ns1-ip. It says what
happens after something. It cannot say what happens before something.
Is that correct? Or am I missing a configuration option?

Thanks,

Gerald
