Re: [ClusterLabs] Wait until resource is really ready before moving clusterip

2016-01-26 Thread Joakim Hansson
Thanks for the help, guys.
I ended up patching together my own RA from the Delay and Dummy RAs,
using curl to request the header of solr's ping request handler on
localhost, so the resource start now returns once solr actually responds
instead of after a fixed delay.
However, now I have another problem which I don't think is related to my RA.
For some reason when failing over the nodes, the ClusterIP (vIP below)
seems to avoid the node running the fencing agent:

pcs status

Online: [ node01 node02 ]
OFFLINE: [ node03 ]

Full list of resources:

 VMWare-fence   (stonith:fence_vmware_soap):    Started node02
 Clone Set: dlm-clone [dlm]
     Started: [ node01 node02 ]
     Stopped: [ node03 ]
 Clone Set: GFS2-clone [GFS2] (unique)
     GFS2:0     (ocf::heartbeat:Filesystem):    Started node01
     GFS2:1     (ocf::heartbeat:Filesystem):    Stopped
     GFS2:2     (ocf::heartbeat:Filesystem):    Started node02
 Clone Set: Tomcat-clone [Tomcat]
     Started: [ node02 ]
     Stopped: [ node01 node03 ]
 vIP    (ocf::heartbeat:IPaddr2):       Stopped

Notice how Tomcat-clone is started on node02 but the vIP remains
stopped.
If I start the fence agent on any of the other nodes, the same thing happens
(i.e., the vIP avoids whichever node is running the fence agent).
Any idea why this happens?

Output of 'pcs config show':
https://github.com/apepojken/pacemaker/blob/master/Config

Thanks again!

2016-01-20 1:14 GMT+01:00 Jan Pokorný :

> On 14/01/16 14:46 +0100, Kristoffer Grönlund wrote:
> > Joakim Hansson  writes:
> >> When adding the Delay RA it starts throwing a bunch of errors and the
> >> cluster starts fencing the nodes one by one.
> >>
> >> The errors I get with "pcs status":
> >>
> >> Failed Actions:
> >> * Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
> >>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> >> * Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out, exitreason='none',
> >>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> >> * Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
> >>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms
> >>
> >> and in the /var/log/pacemaker.log:
> >>
> >> https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay
> >>
> >> I added the Delay RA with:
> >>
> >> pcs resource create Delay ocf:heartbeat:Delay \
> >> startdelay="120" meta target-role=Started \
> >> op start timeout="180"
> >>
> >> and my config looks like this:
> >>
> >> https://github.com/apepojken/pacemaker/blob/master/Config
> >>
> >> Am I missing something obvious here?
> >
> > It looks like you have a monitor operation configured for the Delay
> > resource, but you haven't set the mondelay parameter. But either way,
> > there is no reason to monitor the Delay resource, so remove that. Same
> > thing for the stop operation, just remove it.
> >
> > I'm guessing pcs adds these by default.
>
> It's true that pcs adds equivalent of "op monitor interval=60s"
> as an unconditional fallback when defining a new resource.
> Other operations are driven solely by explicit values or by
> defaults for particular resource, and this can be turned off
> via "--no-default-ops" option to pcs.
>
> FWIW, this could be a way to have monitor explicitly deactivated:
>
> pcs resource create   ... op monitor interval=0s
>
> --
> Jan (Poki)
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Wait until resource is really ready before moving clusterip

2016-01-26 Thread Ken Gaillot
On 01/26/2016 05:06 AM, Joakim Hansson wrote:
> Thanks for the help guys.
> I ended up patching together my own RA from the Delay and Dummy RA's and
> using curl to request the header of solr's ping request handler on
> localhost, which made the resource start return a bit more dynamic.
> However, now I have another problem which I don't think is related to my RA.
> For some reason when failing over the nodes, the ClusterIP (vIP below)
> seems to avoid the node running the fencing agent:
> 
> pcs status
> 
> Online: [ node01 node02 ]
> OFFLINE: [ node03 ]
> 
> Full list of resources:
> 
>  VMWare-fence   (stonith:fence_vmware_soap):    Started node02
>  Clone Set: dlm-clone [dlm]
>      Started: [ node01 node02 ]
>      Stopped: [ node03 ]
>  Clone Set: GFS2-clone [GFS2] (unique)
>      GFS2:0     (ocf::heartbeat:Filesystem):    Started node01
>      GFS2:1     (ocf::heartbeat:Filesystem):    Stopped
>      GFS2:2     (ocf::heartbeat:Filesystem):    Started node02
>  Clone Set: Tomcat-clone [Tomcat]
>      Started: [ node02 ]
>      Stopped: [ node01 node03 ]
>  vIP    (ocf::heartbeat:IPaddr2):       Stopped
> 
> Notice how the tomcat-clone is started on node02 but the vIP remains
> stopped.
> If I start the fence agent on any of the other nodes the same thing happens
> (ie, vIP avoiding the fencing node)
> Any idea why this happens?
> 
> Output of 'pcs config show':
> https://github.com/apepojken/pacemaker/blob/master/Config

I notice you have multiple ordering constraints but only one colocation
constraint. That means, for example, that tomcat-clone must be started
after GFS2, but it does not have to be on the same node. I'm pretty sure
you want colocation constraints as well, to make them start on the same
node.

FYI, a group is like a shorthand for ordering and colocation constraints
for multiple resources that need to be kept together and started/stopped
in order.
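
A rough sketch of what adding such colocation constraints could look like
with pcs. The resource names are taken from the status output above; which
pairs really need to be tied together, and the INFINITY scores, are
assumptions to adjust against the actual configuration. The commands are
wrapped in a function so that nothing runs until it is called on a cluster
node.

```shell
# Sketch only: the chosen pairs and the INFINITY scores are assumptions.
add_colocations() {
    # Keep each Tomcat instance on a node that also has GFS2 mounted.
    pcs constraint colocation add Tomcat-clone with GFS2-clone INFINITY || return 1
    # Only place the vIP on a node where Tomcat is running.
    pcs constraint colocation add vIP with Tomcat-clone INFINITY
}
```

Running "pcs constraint" afterwards lists the constraints currently in
effect, which helps to confirm that each ordering constraint has a matching
colocation.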

I also see you have globally-unique=true on GFS2-clone. You probably do
not want this. globally-unique=false (the default) is more common, and
means that all clone instances are interchangeable, and is usually
configured with clone-node-max=1, because only one instance is ever
needed on any one node. globally-unique=true means that each clone
instance handles a different subset of requests, and is usually
configured with clone-node-max > 1 so that multiple clone instances can
run on a single node if needed.
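
If the clone is meant to be anonymous, the change might be sketched with pcs
as follows. This assumes the clone id is GFS2-clone as shown in the status
output, and that the installed pcs provides the "pcs resource meta"
subcommand; "pcs resource --help" should confirm that. The function only
defines the command and does not run it.

```shell
# Sketch only: the clone id is taken from the status output above.
make_gfs2_clone_anonymous() {
    # globally-unique=false makes all instances interchangeable;
    # clone-node-max=1 allows at most one instance per node.
    pcs resource meta GFS2-clone globally-unique=false clone-node-max=1
}
```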

I don't see from that alone why vIP wouldn't start, but take care of the
above issues first, and see what the behavior is then.

> Thanks again!
> 
> 2016-01-20 1:14 GMT+01:00 Jan Pokorný :
> 
>> On 14/01/16 14:46 +0100, Kristoffer Grönlund wrote:
>>> Joakim Hansson  writes:
>>>> When adding the Delay RA it starts throwing a bunch of errors and the
>>>> cluster starts fencing the nodes one by one.
>>>>
>>>> The errors I get with "pcs status":
>>>>
>>>> Failed Actions:
>>>> * Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
>>>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
>>>> * Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out, exitreason='none',
>>>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
>>>> * Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
>>>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms
>>>>
>>>> and in the /var/log/pacemaker.log:
>>>>
>>>> https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay
>>>>
>>>> I added the Delay RA with:
>>>>
>>>> pcs resource create Delay ocf:heartbeat:Delay \
>>>> startdelay="120" meta target-role=Started \
>>>> op start timeout="180"
>>>>
>>>> and my config looks like this:
>>>>
>>>> https://github.com/apepojken/pacemaker/blob/master/Config
>>>>
>>>> Am I missing something obvious here?
>>>
>>> It looks like you have a monitor operation configured for the Delay
>>> resource, but you haven't set the mondelay parameter. But either way,
>>> there is no reason to monitor the Delay resource, so remove that. Same
>>> thing for the stop operation, just remove it.
>>>
>>> I'm guessing pcs adds these by default.
>>
>> It's true that pcs adds equivalent of "op monitor interval=60s"
>> as an unconditional fallback when defining a new resource.
>> Other operations are driven solely by explicit values or by
>> defaults for particular resource, and this can be turned off
>> via "--no-default-ops" option to pcs.
>>
>> FWIW, this could be a way to have monitor explicitly deactivated:
>>
>> pcs resource create   ... op monitor interval=0s
>>
>> --
>> Jan (Poki)



Re: [ClusterLabs] Wait until resource is really ready before moving clusterip

2016-01-19 Thread Jan Pokorný
On 14/01/16 14:46 +0100, Kristoffer Grönlund wrote:
> Joakim Hansson  writes:
>> When adding the Delay RA it starts throwing a bunch of errors and the
>> cluster starts fencing the nodes one by one.
>> 
>> The errors I get with "pcs status":
>> 
>> Failed Actions:
>> * Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
>> * Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out, exitreason='none',
>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
>> * Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms
>> 
>> and in the /var/log/pacemaker.log:
>> 
>> https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay
>> 
>> I added the Delay RA with:
>> 
>> pcs resource create Delay ocf:heartbeat:Delay \
>> startdelay="120" meta target-role=Started \
>> op start timeout="180"
>> 
>> and my config looks like this:
>> 
>> https://github.com/apepojken/pacemaker/blob/master/Config
>> 
>> Am I missing something obvious here?
> 
> It looks like you have a monitor operation configured for the Delay
> resource, but you haven't set the mondelay parameter. But either way,
> there is no reason to monitor the Delay resource, so remove that. Same
> thing for the stop operation, just remove it.
> 
> I'm guessing pcs adds these by default.

It's true that pcs adds equivalent of "op monitor interval=60s"
as an unconditional fallback when defining a new resource.
Other operations are driven solely by explicit values or by
defaults for particular resource, and this can be turned off
via "--no-default-ops" option to pcs.

FWIW, this could be a way to have monitor explicitly deactivated:

pcs resource create   ... op monitor interval=0s
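
Spelled out for the Delay resource from this thread, that might look like the
sketch below. The resource name and parameters are copied from the
"pcs resource create" command quoted above; combining "--no-default-ops" with
an explicit "op monitor interval=0s" in one call is an assumption, and the
function only defines the command without running it.

```shell
# Sketch only: parameters are copied from the command quoted in this thread.
create_delay_without_recurring_monitor() {
    # --no-default-ops drops the operations taken from the agent's defaults;
    # the monitor fallback is separate, so it is overridden explicitly
    # with interval=0s.
    pcs resource create Delay ocf:heartbeat:Delay startdelay=120 \
        op start timeout=180 op monitor interval=0s --no-default-ops
}
```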

-- 
Jan (Poki)




Re: [ClusterLabs] Wait until resource is really ready before moving clusterip

2016-01-14 Thread Joakim Hansson
>
> >> Hi,
> >>
> >> There is the ocf:heartbeat:Delay resource agent, which on one hand is
> >> documented as a test resource, but on the other hand should do what you
> >> need:
> >>
> >> primitive solr ...
> >> primitive two-minute-delay ocf:heartbeat:Delay \
> >>   params startdelay=120 meta target-role=Started \
> >>   op start timeout=180
> >> group solr-then-wait solr two-minute-delay
> >>
> >> Now the group acts basically like the solr resource, except for the
> >> two-minute delay after starting solr before the group itself is
> >> considered started.
> >>
> >> Cheers,
> >> Kristoffer
> >>
> >>>
> >>> / Jocke
> >
> >Another way would be to customize the tomcat resource agent so that
> >start doesn't return success until it's fully ready to accept requests
> >(which would probably be specific to whatever app you're running via
> >tomcat). Of course you'd need a long start timeout.
>

Thanks for the tips, guys!
I'm using the systemd RA for tomcat (I know it's not recommended) and can't
seem to figure out how to go about postponing the success return.
Maybe I'll try the OCF one later.

When adding the Delay RA it starts throwing a bunch of errors and the
cluster starts fencing the nodes one by one.

The errors I get with "pcs status":

Failed Actions:
* Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
    last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
* Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out, exitreason='none',
    last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
* Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
    last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms

and in the /var/log/pacemaker.log:

https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay

I added the Delay RA with:

pcs resource create Delay ocf:heartbeat:Delay \
startdelay="120" meta target-role=Started \
op start timeout="180"

and my config looks like this:

https://github.com/apepojken/pacemaker/blob/master/Config

Am I missing something obvious here?

Thanks again for all the help so far!


Re: [ClusterLabs] Wait until resource is really ready before moving clusterip

2016-01-14 Thread Kristoffer Grönlund
Joakim Hansson  writes:

>
> When adding the Delay RA it starts throwing a bunch of errors and the
> cluster starts fencing the nodes one by one.
>
> The errors I get with "pcs status":
>
> Failed Actions:
> * Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> * Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out, exitreason='none',
>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> * Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms
>
> and in the /var/log/pacemaker.log:
>
> https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay
>
> I added the Delay RA with:
>
> pcs resource create Delay ocf:heartbeat:Delay \
> startdelay="120" meta target-role=Started \
> op start timeout="180"
>
> and my config looks like this:
>
> https://github.com/apepojken/pacemaker/blob/master/Config
>
> Am I missing something obvious here?

Hi,

It looks like you have a monitor operation configured for the Delay
resource, but you haven't set the mondelay parameter. But either way,
there is no reason to monitor the Delay resource, so remove that. Same
thing for the stop operation, just remove it.
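
A note on the numbers, going by the agent's metadata (worth verifying with
"pcs resource describe ocf:heartbeat:Delay"): when mondelay is unset it
falls back to startdelay, so with startdelay=120 every monitor call sleeps
for 120 seconds, which is consistent with the timeouts at exec=30002ms
above. Delay_monitor_0 is the initial probe, which Pacemaker runs whether or
not a recurring monitor is configured, so setting mondelay explicitly is
probably the safer route. A minimal sketch, where mondelay=0 is an assumed
value and the function only defines the command:

```shell
# Sketch only: mondelay=0 is an assumption, not a value from this thread.
create_delay_with_fast_monitor() {
    # With mondelay=0 probes and monitors should return without sleeping,
    # while start still waits startdelay seconds.
    pcs resource create Delay ocf:heartbeat:Delay \
        startdelay=120 mondelay=0 \
        op start timeout=180
}
```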

I'm guessing pcs adds these by default.

Cheers,
Kristoffer

>
> Thanks again for all the help so far!

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] Wait until resource is really ready before moving clusterip

2016-01-12 Thread Kristoffer Grönlund
Joakim Hansson  writes:

> Hi!
> I have a cluster running tomcat, which in turn runs solr.
> I use three nodes with load balancing via IPaddr2.
> The thing is, when tomcat is started on a node it takes about 2 minutes
> before solr is functioning correctly.
>
> Is there a way to make the ipaddr2-clone wait 2 minutes after tomcat is
> started before it moves the ip to the node?
>
> Much appreciated!

Hi,

There is the ocf:heartbeat:Delay resource agent, which on one hand is
documented as a test resource, but on the other hand should do what you
need:

primitive solr ...
primitive two-minute-delay ocf:heartbeat:Delay \
  params startdelay=120 meta target-role=Started \
  op start timeout=180
group solr-then-wait solr two-minute-delay

Now the group acts basically like the solr resource, except for the
two-minute delay after starting solr before the group itself is
considered started.

Cheers,
Kristoffer

>
> / Jocke

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] Wait until resource is really ready before moving clusterip

2016-01-12 Thread Ken Gaillot
On 01/12/2016 07:57 AM, Kristoffer Grönlund wrote:
> Joakim Hansson  writes:
> 
>> Hi!
>> I have a cluster running tomcat, which in turn runs solr.
>> I use three nodes with load balancing via IPaddr2.
>> The thing is, when tomcat is started on a node it takes about 2 minutes
>> before solr is functioning correctly.
>>
>> Is there a way to make the ipaddr2-clone wait 2 minutes after tomcat is
>> started before it moves the ip to the node?
>>
>> Much appreciated!
> 
> Hi,
> 
> There is the ocf:heartbeat:Delay resource agent, which on one hand is
> documented as a test resource, but on the other hand should do what you
> need:
> 
> primitive solr ...
> primitive two-minute-delay ocf:heartbeat:Delay \
>   params startdelay=120 meta target-role=Started \
>   op start timeout=180
> group solr-then-wait solr two-minute-delay
> 
> Now the group acts basically like the solr resource, except for the
> two-minute delay after starting solr before the group itself is
> considered started.
> 
> Cheers,
> Kristoffer
> 
>>
>> / Jocke

Another way would be to customize the tomcat resource agent so that
start doesn't return success until it's fully ready to accept requests
(which would probably be specific to whatever app you're running via
tomcat). Of course you'd need a long start timeout.
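
A minimal sketch of that idea as a shell helper that a custom agent's start
action could call after launching tomcat. The function names, the polling
interval, and the ping URL (host, port and path) are hypothetical; the curl
options used (--silent, --fail, --head, --max-time) are standard ones.

```shell
# wait_until_ready TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds.
# Returns 0 on success, 1 once about TIMEOUT seconds have been spent waiting.
wait_until_ready() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Hypothetical readiness probe: succeeds only when the application answers
# with a non-error HTTP status. The URL is an assumption; adjust it.
solr_ping() {
    curl --silent --fail --head --max-time 5 \
        "http://localhost:8080/solr/admin/ping"
}

# Possible use inside the agent's start action, after tomcat was launched:
#   wait_until_ready 170 5 solr_ping || exit 1   # 1 is OCF_ERR_GENERIC
```

The wait budget (170 seconds in the commented example) should stay below the
start timeout configured in the cluster, so that the agent reports the
failure itself rather than being killed when the operation times out.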
