Re: [ClusterLabs] Apache Active Active Balancer without FileSystem Cluster

2016-06-13 Thread Ken Gaillot
On 06/13/2016 08:06 AM, Klaus Wenninger wrote:
> On 06/13/2016 02:33 PM, alan john wrote:
>> Dear All,
>>
>> I am trying to set up an Apache active-active cluster. I do not wish
>> to have a common file system for both nodes. However, I do not want
>> pcs/corosync to start or stop Apache, only to monitor it, move the VIP
>> to the secondary node on failure, and pull it back on recovery. Would
>> this be practically possible, or do you think it is not achievable?
>>
>>
>> I have the following constraints:
>>
>> 1. The virtual IP must not be where Apache is not running. --- Could not achieve this.
>> 2. Node 1 has priority. -- Works fine.
>> 3. pcs should not start/stop Apache. -- Works fine using an unmanaged resource.

Unmanaged won't let you achieve #1.

It's a lot easier to let the cluster manage apache, but if you really
want to go the other way, you'll need to write a custom OCF agent for
apache.

Start/stop/monitor should use the ha_pseudo_resource function in
ocf-shellfuncs so the agent can distinguish "running" from "not running"
(itself, not apache).

The monitor command should additionally check apache (you can copy the
code from the standard agent), and set a node attribute with apache's
status.

Clone the agent so it's always running everywhere apache might run.

Finally, set a location constraint for your VIP using a rule matching
that node attribute.

So, if apache fails, the new agent detects that and updates the node
attribute, and pacemaker moves the VIP away from that node.
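
A rough, untested sketch of what that agent's monitor action could look
like (the names "apache-watcher" and "apache-status" are placeholders I
made up, and the health check is deliberately simplified):

    #!/bin/sh
    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    watcher_monitor() {
        # track the agent itself via a pseudo resource, so "agent running"
        # is independent of "apache running"
        ha_pseudo_resource apache-watcher monitor
        rc=$?
        [ $rc -ne $OCF_SUCCESS ] && return $rc

        # check apache (copy the real check from the standard apache agent;
        # curl here is only a placeholder) and publish the result as a
        # transient node attribute
        if curl -sf http://localhost/server-status >/dev/null 2>&1; then
            attrd_updater -n apache-status -U ok
        else
            attrd_updater -n apache-status -U down
        fi
        return $OCF_SUCCESS
    }

The VIP then gets a rule-based location constraint along these lines
(exact pcs rule syntax may differ between versions):

    pcs constraint location vip rule score=-INFINITY \
        not_defined apache-status or apache-status ne ok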

>> 4. Send mail when the VIP is switched or the cluster status changes -- I
>> guess this is achievable.
> Check out the new alerts feature in Pacemaker 1.1.15 for that.
>> 5. pcs should monitor the Apache process ID and its responses to satisfy point 1.
>>
>> Regards,
>> Alan



Re: [ClusterLabs] Apache Active Active Balancer without FileSystem Cluster

2016-06-13 Thread Klaus Wenninger
On 06/13/2016 02:33 PM, alan john wrote:
> Dear All,
>
> I am trying to set up an Apache active-active cluster. I do not wish
> to have a common file system for both nodes. However, I do not want
> pcs/corosync to start or stop Apache, only to monitor it, move the VIP
> to the secondary node on failure, and pull it back on recovery. Would
> this be practically possible, or do you think it is not achievable?
>
>
> I have the following constraints:
>
> 1. The virtual IP must not be where Apache is not running. --- Could not achieve this.
> 2. Node 1 has priority. -- Works fine.
> 3. pcs should not start/stop Apache. -- Works fine using an unmanaged resource.
> 4. Send mail when the VIP is switched or the cluster status changes -- I
> guess this is achievable.
Check out the new alerts feature in Pacemaker 1.1.15 for that.
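
To give an idea, a minimal (untested) mail alert agent could look roughly
like this -- pacemaker passes the event details in CRM_alert_* environment
variables, and the recipient address and mail command here are just
placeholders:

    #!/bin/sh
    # minimal mail alert agent for pacemaker >= 1.1.15
    recipient="${CRM_alert_recipient:-root@localhost}"
    case "$CRM_alert_kind" in
        node|resource|fencing)
            echo "Cluster event on $CRM_alert_node: $CRM_alert_desc" |
                mail -s "[pacemaker] $CRM_alert_kind alert" "$recipient"
            ;;
    esac
    exit 0

It would be configured with something like the following (exact pcs syntax
differs between pcs versions); the sample agents shipped under
/usr/share/pacemaker/alerts/ are a good starting point:

    pcs alert create path=/usr/local/bin/alert_mail.sh id=mail_alert
    pcs alert recipient add mail_alert value=admin@example.com
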
> 5. pcs should monitor the Apache process ID and its responses to satisfy point 1.
>
> Regards,
> Alan




Re: [ClusterLabs] Cluster administration from non-root users

2016-06-13 Thread Tomas Jelinek

On 13 Jun 2016 at 13:57, Auer, Jens wrote:

Hi,

I am trying to give non-root users admin rights on my clusters. I
have two users who need to be able to control the cluster. Both are
members of the haclient group, and I have created ACL roles granting
write access. I can query the cluster status, but I am unable to
perform any commands:
id
uid=1000(mdaf) gid=1000(mdaf)
groups=1000(mdaf),10(wheel),189(haclient),801(mdaf),802(mdafkey),803(mdafmaintain)

pcs acl
ACLs are enabled

User: mdaf
   Roles: admin
User: mdafmaintain
   Roles: admin
Role: admin
   Permission: write xpath /cib (admin-write)

pcs cluster status
Cluster Status:
  Last updated: Mon Jun 13 11:46:45 2016
  Last change: Mon Jun 13 11:46:38 2016 by root via cibadmin on MDA2PFP-S02
  Stack: corosync
  Current DC: MDA2PFP-S01 (version 1.1.13-10.el7-44eb2dd) - partition
with quorum
  2 nodes and 9 resources configured
  Online: [ MDA2PFP-S01 MDA2PFP-S02 ]

PCSD Status:
   MDA2PFP-S01: Online
   MDA2PFP-S02: Online

pcs cluster stop
Error: localhost: Permission denied - (HTTP error: 403)

pcs cluster start
Error: localhost: Permission denied - (HTTP error: 403)


Hi Jens,

You have configured permissions to edit the CIB. However, you also need to
assign permissions to use pcsd (only root is allowed to start and stop
services, so the start/stop request goes through pcsd).


This can be done using the pcs web UI:
- open the web UI in your browser at https://<node address>:2224
- log in as the hacluster user
- add the existing cluster
- go to permissions
- set permissions for your cluster
- don't forget to apply the changes (a rough sketch of what this stores
  follows below)
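
For reference, pcsd keeps these permissions in a local settings file on
each node; a rough sketch of what it may look like (this is an assumption
about the on-disk format -- field names can differ between pcsd versions,
and the cluster name is a placeholder):

    # the web UI writes the permissions into pcsd's settings file
    cat /var/lib/pcsd/pcs_settings.conf
    # {
    #   "clusters": [ { "name": "mycluster", "nodes": ["MDA2PFP-S01", "MDA2PFP-S02"] } ],
    #   "permissions": {
    #     "local_cluster": [
    #       { "type": "user", "name": "mdaf", "allow": ["read", "write"] }
    #     ]
    #   }
    # }
    # change it via the web UI rather than by hand, so the change is
    # propagated to all nodes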

Regards,
Tomas



I tried to use sudo instead, but this is also not working:
sudo pcs status
Permission denied
Error: unable to locate command: /usr/sbin/crm_mon

Any help would be greatly appreciated.

Best wishes,
   Jens

--
Jens Auer | CGI | Software Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer@cgi.com
You can find our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB
at de.cgi.com/pflichtangaben.




[ClusterLabs] Cluster administration from non-root users

2016-06-13 Thread Auer, Jens
Hi,

I am trying to give non-root users admin rights on my clusters. I have two
users who need to be able to control the cluster. Both are members of the
haclient group, and I have created ACL roles granting write access. I can query
the cluster status, but I am unable to perform any commands:
id
uid=1000(mdaf) gid=1000(mdaf) 
groups=1000(mdaf),10(wheel),189(haclient),801(mdaf),802(mdafkey),803(mdafmaintain)

pcs acl
ACLs are enabled

User: mdaf
  Roles: admin
User: mdafmaintain
  Roles: admin
Role: admin
  Permission: write xpath /cib (admin-write)

pcs cluster status
Cluster Status:
 Last updated: Mon Jun 13 11:46:45 2016
 Last change: Mon Jun 13 11:46:38 2016 by root via cibadmin on MDA2PFP-S02
 Stack: corosync
 Current DC: MDA2PFP-S01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
 2 nodes and 9 resources configured
 Online: [ MDA2PFP-S01 MDA2PFP-S02 ]

PCSD Status:
  MDA2PFP-S01: Online
  MDA2PFP-S02: Online

pcs cluster stop
Error: localhost: Permission denied - (HTTP error: 403)

pcs cluster start
Error: localhost: Permission denied - (HTTP error: 403)

I tried to use sudo instead, but this is also not working:
sudo pcs status
Permission denied
Error: unable to locate command: /usr/sbin/crm_mon

Any help would be greatly appreciated.

Best wishes,
  Jens

--
Jens Auer | CGI | Software Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer@cgi.com
You can find our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB
at de.cgi.com/pflichtangaben.



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-13 Thread Adam Spiers
Andrew Beekhof wrote:
> On Wed, Jun 8, 2016 at 6:23 PM, Adam Spiers wrote:
> > Andrew Beekhof wrote:
> >> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers wrote:
> >> > Ken Gaillot wrote:
> >> >> On 06/06/2016 05:45 PM, Adam Spiers wrote:
> >> >> > Maybe your point was that if the expected start never happens (so
> >> >> > never even gets a chance to fail), we still want to do a nova
> >> >> > service-disable?
> >> >>
> >> >> That is a good question, which might mean it should be done on every
> >> >> stop -- or could that cause problems (besides delays)?
> >> >
> >> > No, the whole point of adding this feature is to avoid a
> >> > service-disable on every stop, and instead only do it on the final
> >> > stop.  If there are corner cases where we never reach the final stop,
> >> > that's not a disaster because nova will eventually figure it out and
> >> > do the right thing when the server-agent connection times out.
> >> >
> >> >> Another aspect of this is that the proposed feature could only look at a
> >> >> single transition. What if stop is called with start_expected=false, but
> >> >> then Pacemaker is able to start the service on the same node in the next
> >> >> transition immediately afterward? Would having called service-disable
> >> >> cause problems for that start?
> >> >
> >> > We would also need to ensure that service-enable is called on start
> >> > when necessary.  Perhaps we could track the enable/disable state in a
> >> > local temporary file, and if the file indicates that we've previously
> >> > done service-disable, we know to run service-enable on start.  This
> >> > would avoid calling service-enable on every single start.
> >>
> >> feels like an over-optimization
> >> in fact, the whole thing feels like that if i'm honest.
> >
> > Huh ... You didn't seem to think that when we discussed automating
> > service-disable at length in Austin.
> 
> I didn't feel the need to push back because RH uses the systemd agent
> instead so you're only hanging yourself, but more importantly because
> the proposed implementation to facilitate it wasn't leading RA writers
> down a hazardous path :-)

I'm a bit confused by that statement, because the only proposed
implementation we came up with in Austin was adding this new feature
to Pacemaker.  Prior to that, AFAICR, you, Dawid, and I had a long
afternoon discussion in the sun where we tried to figure out a way to
implement it just by tweaking the OCF RAs, but every approach we
discussed turned out to have fundamental issues.  That's why we
eventually turned to the idea of this new feature in Pacemaker.

But anyway, it's water under the bridge now :-)

> > What changed?  Can you suggest a better approach?
> 
> Either always or never disable the service would be my advice.
> "Always" specifically getting my vote.

OK, thanks.  We discussed that at the meeting this morning, and it
looks like we'll give it a try.
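
For the record, "always" would boil down to roughly this in the RA's stop
and start actions (an untested sketch: the function names are placeholders,
ocf-shellfuncs is assumed to be sourced, and the calls use the classic
python-novaclient CLI):

    compute_stop() {
        # ... stop nova-compute itself first (omitted) ...
        # always tell the scheduler to stop placing VMs on this host
        nova service-disable "$(hostname)" nova-compute ||
            ocf_log warn "nova service-disable failed; scheduler may still pick this host"
        return $OCF_SUCCESS
    }

    compute_start() {
        # unconditionally re-enable the host, so no state needs to be
        # tracked between transitions
        nova service-enable "$(hostname)" nova-compute
        # ... then start nova-compute (omitted) ...
    }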

> >> why are we trying to optimise the projected performance impact
> >
> > It's not really "projected"; we know exactly what the impact is.  And
> > it's not really a performance impact either.  If nova-compute (or a
> > dependency) is malfunctioning on a compute node, there will be a
> > window (bounded by nova.conf's rpc_response_timeout value, IIUC) in
> > which nova-scheduler could still schedule VMs onto that compute node,
> > and then of course they'll fail to boot.
> 
> Right, but that window exists regardless of whether the node is or is
> not ever coming back.

Sure, but the window's a *lot* bigger if we don't do service-disable.
Although perhaps your question "why are we trying to optimise the
projected performance impact" was actually "why are we trying to avoid
extra calls to service-disable" rather than "why do we want to call
service-disable" as I initially assumed.  Is that right?

> And as we already discussed, the proposed feature still leaves you
> open to this window because we can't know if the expected restart will
> ever happen.

Yes, but as I already said, the perfect should not become the enemy of
the good.  Just because an approach doesn't solve all cases, it
doesn't necessarily mean it's not suitable for solving some of them.

> In this context, trying to avoid the disable call under certain
> circumstances, to avoid repeated and frequent flip-flopping of the
> state, seems ill-advised.  At the point nova compute is bouncing up
> and down like that, you have a more fundamental issue somewhere in
> your stack and this is only one (and IMHO minor) symptom of it.

That's a fair point.

> > The masakari folks have a lot of operational experience in this space,
> > and they found that this was enough of a problem to justify calling
> > nova service-disable whenever the failure is detected.
> 
> If you really want it whenever the failure is detected, call it from
> the monitor operation that finds it broken.

Hmm, that 

Re: [ClusterLabs] [corosync][Problem] Very long "pause detect ... " was detected.

2016-06-13 Thread renayama19661014
Hi Honza,

Thank you for your comment.


>>  Our user built a cluster with corosync and Pacemaker in the following
>>  environment. The cluster runs between guests.
>> 
>>  * Host/Guest : RHEL6.6 - kernel : 2.6.32-504.el6.x86_64
>>  * libqb 0.17.1
>>  * corosync 2.3.4
>>  * Pacemaker 1.1.12
>> 
>>  The cluster worked well.
>>  When the user stopped the active guest, the following log was output
>>  repeatedly on the standby guest.
> 
> What exactly do you mean by "active guest" and "standby guests"?

The cluster is an active/standby configuration.

The standby guest simply waits until a resource fails on the active guest.


The problem seemed to occur when the resources were being taken over by the
standby guest.


> 
>> 
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515870 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515920 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515971 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516021 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516071 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516121 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516171 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516221 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516271 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516322 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516372 ms, flushing membership messages.
>>  (snip)
>>  May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5526172 ms, flushing membership messages.
>>  May xx xx:26:03 standby-guest corosync[6311]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
>>  May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5526222 ms, flushing membership messages.
>>  (snip)
>> 
> 
> This is weird. Not because of the enormous pause length, but because corosync
> has a "scheduler pause" detector which warns before the "Process pause
> detected ..." error is logged.

I thought so, too.
However, "scheduler pause" does not seem to be taking place.

> 
>>  As a result, the standby guest failed to form an independent cluster
>>  of its own.
>> 
>>  The log reads as if a timer had stopped for 91 minutes.
>>  91 minutes is an abnormally long time.
>> 
>>  Have you ever seen a similar problem?
> 
> Never

Okay!


> 
>> 
>>  I suspect this may be a problem in libqb, the kernel, or something similar.
> 
> What virtualization technology are you using? KVM?
> 
>>  * I suspect that setting the timer failed in reset_pause_timeout().
> 
> You can try to put asserts into this function, but there are really not 
> too many reasons why it should fail (either malloc returns NULL or there is 
> some nasty memory corruption).


I read the source code, too, and I agree with your opinion.

I do not know whether the problem will reappear, but I will build the same
environment on RHEL6.6 and run a load test this week.

If you notice anything, please send me an email.

Best Regards,
Hideo Yamauchi.




Re: [ClusterLabs] [corosync][Problem] Very long "pause detect ... " was detected.

2016-06-13 Thread Jan Friesse

Hideo,


Hi All,

Our user built a cluster with corosync and Pacemaker in the following
environment. The cluster runs between guests.

* Host/Guest : RHEL6.6 - kernel : 2.6.32-504.el6.x86_64
* libqb 0.17.1
* corosync 2.3.4
* Pacemaker 1.1.12

The cluster worked well.
When the user stopped the active guest, the following log was output
repeatedly on the standby guest.


What exactly do you mean by "active guest" and "standby guests"?



May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5515870 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5515920 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5515971 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516021 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516071 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516121 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516171 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516221 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516271 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516322 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5516372 ms, flushing membership messages.
(snip)
May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5526172 ms, flushing membership messages.
May xx xx:26:03 standby-guest corosync[6311]:  [MAIN  ] Totem is unable to form 
a cluster because of an operating system or network fault. The most common 
cause of this message is that the local firewall is configured improperly.
May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected 
for 5526222 ms, flushing membership messages.
(snip)



This is weird. Not because of the enormous pause length, but because corosync
has a "scheduler pause" detector which warns before the "Process pause
detected ..." error is logged.



As a result, the standby guest failed to form an independent cluster
of its own.

The log reads as if a timer had stopped for 91 minutes.
91 minutes is an abnormally long time.

Have you ever seen a similar problem?


Never



I suspect this may be a problem in libqb, the kernel, or something similar.


What virtualization technology are you using? KVM?


* I suspect that setting the timer failed in reset_pause_timeout().


You can try to put asserts into this function, but there are really not 
too many reasons why it should fail (either malloc returns NULL or there is 
some nasty memory corruption).


Regards,
  Honza



Best Regards,
Hideo Yamauchi.

