Re: [Pacemaker] [ha-wg-technical] [RFC] Organizing HA Summit 2015

2014-12-22 Thread Keisuke MORI
Hi all,

Really late response but,
I will be joining the HA summit, with a few colleagues from NTT.

See you guys in Brno,
Thanks,


2014-12-08 22:36 GMT+09:00 Jan Pokorný jpoko...@redhat.com:
 Hello,

 it occurred to me that if you want to use the opportunity and double
 as a tourist while in Brno, now is about the right time to
 consider reservations and ticket purchases.
 At least in some cases it is a must, e.g., Villa Tugendhat:

 http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist

 On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:
 DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.

 My suggestion would be to have a 2 days dedicated HA summit the 4th and
 the 5th of February.

 --
 Jan

 ___
 ha-wg-technical mailing list
 ha-wg-techni...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical




-- 
Keisuke MORI

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] PgSQL_Replicated_Cluster wiki's bug

2014-04-08 Thread Keisuke MORI
Anzai-san,

Thank you very much for pointing it out.

Yes, you are right.
I have updated the wiki page, and it should be fixed now.

Regards,
Keisuke MORI


2014-04-08 11:15 GMT+09:00 Naoya Anzai anzai-na...@mxu.nes.nec.co.jp:
 Hi all,

 I'm reading the following wiki page:
 http://clusterlabs.org/mwiki/index.php?title=PgSQL_Replicated_Cluster&setlang=ja#Pacemaker_.28both_nodes.29

 In the sample configuration for the pcs command, it says:
 ---
 pcs -f pgsql_cfg resource op defaults resource-stickiness=INFINITY
 pcs -f pgsql_cfg resource op defaults migration-threshold=1
 ---

 But that doesn't work correctly, right?

 I think the op keyword does not belong in these commands.

 The correct commands are:
 ---
 pcs -f pgsql_cfg resource defaults resource-stickiness=INFINITY
 pcs -f pgsql_cfg resource defaults migration-threshold=1
 ---
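
As a rough illustration of the distinction above: resource-stickiness and migration-threshold are resource meta attributes, so they go under `resource defaults`, while `resource op defaults` is meant for operation settings such as timeout. The helper below is a hypothetical sketch (the function name and the short list of operation attributes are assumptions, not part of pcs) that prints the pcs command to use for a given name=value pair.

```shell
# Hypothetical helper: choose between "resource defaults" and
# "resource op defaults" based on the attribute name.
pcs_defaults_cmd() {
    # $1 = CIB file passed to "pcs -f", $2 = name=value pair
    case "${2%%=*}" in
        timeout|interval|on-fail)
            # Operation attributes (assumed list, not exhaustive)
            echo "pcs -f $1 resource op defaults $2" ;;
        *)
            # Everything else is treated as a resource meta attribute
            echo "pcs -f $1 resource defaults $2" ;;
    esac
}
```

For example, `pcs_defaults_cmd pgsql_cfg migration-threshold=1` prints the second of the corrected commands shown above.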

 Regards,

 Naoya

 ---
 Naoya Anzai
 Engineering Department
 NEC Solution Innovators, Ltd.
 E-Mail: anzai-na...@mxu.nes.nec.co.jp
 ---






-- 
Keisuke MORI



Re: [Pacemaker] pre_notify_demote is issued twice

2014-03-10 Thread Keisuke MORI
Hi,

2014-02-24 10:49 GMT+09:00 Andrew Beekhof and...@beekhof.net:

 On 21 Feb 2014, at 2:19 pm, Andrew Beekhof and...@beekhof.net wrote:


 On 18 Feb 2014, at 1:23 pm, Andrew Beekhof and...@beekhof.net wrote:


 On 6 Feb 2014, at 7:45 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:

 Hi,

 I observed that pre_notify_demote is issued twice when a master
 resource is migrating.
 I'm wondering if this is the correct behavior.

 Steps to reproduce:

 - Start up a 2-node cluster configured for PostgreSQL streaming
 replication, using the pgsql RA as a master/slave resource.
 - Kill the postgresql process on the master node to induce a fail-over.
 - The fail-over succeeds as expected, but pre_notify_demote is
 executed twice on each node before the master resource is demoted.

 100% reproducible on my cluster.

 Pacemaker version: 1.1.11-rc4 (source build from the repo)
 OS: RHEL6.4

 I have never seen this on a Pacemaker-1.0.* cluster with the same
 configuration.

 The relevant logs and pe-inputs are attached.


 Diagnostics:

 (1) The first transition caused by the process failure (pe-input-160)
 initiates pre_notify_demote on both nodes and cancels the slave monitor
 on the slave node.
 {{{
 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command: Initiating action 9: cancel prmPostgresql_cancel_1 on rhel64-2
 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command: Initiating action 79: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)

 Jan 30 16:08:59 rhel64-1 crmd[8143]:   notice: te_rsc_command: Initiating action 81: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
 }}}

 (2) When the cancellation of the slave monitor completes, the transition
 is aborted with the reason "Resource op removal".
 {{{
 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: match_graph_event: Action prmPostgresql_monitor_1 (9) confirmed on rhel64-2 (rc=0)
 Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, version=0.37.9)
 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, tag=lrm_rsc_op, id=prmPostgresql_monitor_1, magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : Resource op removal
 }}}

 (3) A second transition (pe-input-161) is calculated because of the
 abort, which results in pre_notify_demote being initiated again.

 If the demote didn't complete (or wasn't even attempted), then we must send 
 the pre_notify_demote again unfortunately.
 The real bug may well be that the transition shouldn't have been aborted.

 It looks legitimate:

 Jan 30 16:08:59 rhel64-1 crmd[8143]: info: abort_transition_graph: 
 te_update_diff:258 - Triggered transition abort (complete=0, node=rhel64-2, 
 tag=lrm_rsc_op, id=prmPostgresql_monitor_1, 
 magic=0:0;26:12:0:acf9a2a3-307c-460b-b786-fc20e6b8aad5, cib=0.37.9) : 
 Resource op removal

 It looks like get_cancel_action() was not functioning correctly:

https://github.com/beekhof/pacemaker/commit/9d77c99


Thanks for looking into it.

I have confirmed that the issue is now resolved with the recent
revision in your repo at:
  
https://github.com/beekhof/pacemaker/commit/04ff1bd2d144e7defd6f1f67f6bde6fa95c428e1

Thanks!

-- 
Keisuke MORI




 Jan 30 16:08:59 rhel64-1 cib[8138]: info: cib_process_request: Completed 
 cib_delete operation for section status: OK (rc=0, origin=rhel64-2/crmd/21, 
 version=0.37.9)

 It looks like part of the node status entry being removed for rhel64-2.
 Possibly as a result of:

 Jan 30 16:07:54 rhel64-2 crmd[25070]: info: erase_status_tag: Deleting 
 xpath: //node_state[@uname='rhel64-2']/transient_attributes

 The new cib code, being much faster, might help here too :)


 {{{
 Jan 30 16:09:01 rhel64-1 pengine[8142]:   notice: process_pe_message: Calculated Transition 15: /var/lib/pacemaker/pengine/pe-input-161.bz2
 Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command: Initiating action 78: notify prmPostgresql_pre_notify_demote_0 on rhel64-1 (local)
 Jan 30 16:09:01 rhel64-1 crmd[8143]:   notice: te_rsc_command: Initiating action 80: notify prmPostgresql_pre_notify_demote_0 on rhel64-2
 }}}

 I think that the transition abort at (2) should not happen.
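
One way to confirm the duplicate notification from the logs is to count how often pre_notify_demote is initiated for each node. This is a minimal sketch, assuming every log entry sits on a single line in the crmd format quoted in this thread; the function name and the log path argument are made up for illustration.

```shell
# Hypothetical helper: print "<node> <count>" for every node on which
# a pre_notify_demote action was initiated.
count_pre_notify_demote() {
    # $1 = path to a log file containing crmd te_rsc_command entries
    grep 'te_rsc_command: Initiating action' "$1" |
        sed -n -E 's/.*_pre_notify_demote_0 on ([^ ]+).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

A count of 2 per node within a single fail-over would match the behaviour described in this report.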

 Regards,
 --
 Keisuke MORI
 logs-pre-notify-20140206.tar.bz2



Re: [Pacemaker] Time to get ready for 1.1.11

2014-01-15 Thread Keisuke MORI
: crm_report: Suppress logging errors after the target directory has been
 compressed
 Fix: crm_attribute: Do not swallow hostname lookup failures
 Fix: crmd: Avoid deleting the 'shutdown' attribute
 Log: attrd: Quote attribute names
 Doc: Pacemaker_Explained: Fix formatting


 A new release candidate for pacemaker 1.1.11 is now available, 
 pacemaker-1.1.11-rc4.

 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc4

 The lrmd crash has finally been resolved. That was the last fix we were 
 waiting on before officially releasing Pacemaker v1.1.11.  RC4 is likely 
 going to be the 1.1.11 final release. Please test and report any regressions 
 as soon as possible.  1.1.11 will be released mid next week if no major 
 issues are encountered.

 CHANGES RC3 to RC4
 Fix: stonith_admin: Ensure pointers passed to sscanf() are properly 
 initialized
 Fix: Prevent potential use-of-NULL
 Fix: upstart: Allow compilation with glib versions older than 2.28
 Fix: services: Fixes segfault associated with cancelling in-flight recurring 
 operations.
 Low: crmd: Change the default value of node-action-limit

 Thanks,
 -- Vossel







-- 
Keisuke MORI



Re: [Pacemaker] Time to get ready for 1.1.11

2014-01-15 Thread Keisuke MORI
Hi Andrew,

2014/1/16 Andrew Beekhof and...@beekhof.net:

 On 16 Jan 2014, at 3:00 pm, Keisuke MORI keisuke.mori...@gmail.com wrote:

 Hi,

 Just curious,
 I found that RC4 has been branched out of the master after RC3.

 What will happen to the fixes that are only in the master branch?
 Will they be merged into 1.1.12 someday, skipping 1.1.11?

 Yes.  Unlike last time we're trying to be better about not merging new 
 features and other risky changes during the RC phase :-)
 The branching should have happened earlier but I forgot.

 Or are they separated for the next major version such as v1.2 or v2.0?

 I think our plans for 1.2/2.0 are on hold indefinitely.  It's all 1.1.x 
 releases for the foreseeable future.
 For a small dev team, the benefits didn't outweigh the costs.

Thank you for the answer.
That makes things clear to me.

Regards,




 Thanks,


 2014/1/16 David Vossel dvos...@redhat.com:
 - Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager 
 pacemaker@oss.clusterlabs.org
 Sent: Tuesday, January 7, 2014 4:50:11 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11

 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Thursday, December 19, 2013 2:25:00 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11


 On 20 Dec 2013, at 2:11 am, Andrew Martin amar...@xes-inc.com wrote:

 David/Andrew,

 Once 1.1.11 final is released, is it considered the new stable series of
 Pacemaker,

 yes

 or should 1.1.10 still be used in very stable/critical production
 environments?

 Thanks,

 Andrew

 - Original Message -
 From: David Vossel dvos...@redhat.com
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, December 11, 2013 3:33:46 PM
 Subject: Re: [Pacemaker] Time to get ready for 1.1.11

 - Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Wednesday, November 20, 2013 9:02:40 PM
 Subject: [Pacemaker] Time to get ready for 1.1.11

 With over 400 updates since the release of 1.1.10, it's time to start
 thinking about a new release.

 Today I have tagged release candidate 1[1].
 The most notable fixes include:

 + attrd: Implementation of a truly atomic attrd for use with corosync 2.x
 + cib: Allow values to be added/updated and removed in a single update
 + cib: Support XML comments in diffs
 + Core: Allow blackbox logging to be disabled with SIGUSR2
 + crmd: Do not block on proxied calls from pacemaker_remoted
 + crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
 + crmd: Use the load on our peers to know how many jobs to send them
 + crm_mon: add --hide-headers option to hide all headers
 + crm_report: Collect logs directly from journald if available
 + Fencing: On timeout, clean up the agent's entire process group
 + Fencing: Support agents that need the host to be unfenced at startup
 + ipc: Raise the default buffer size to 128k
 + PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
 + PE: Allow location constraints to take a regex pattern to match against resource IDs
 + pengine: Distinguish between the agent being missing and something the agent needs being missing
 + remote: Properly version the remote connection protocol
 + services: Detect missing agents and permission errors before forking
 + Bug cl#5171 - pengine: Don't prevent clones from running due to dependent resources
 + Bug cl#5179 - Corosync: Attempt to retrieve a peer's node name if it is not already known
 + Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers

 If you are a user of `pacemaker_remoted`, you should take the time to
 read about changes to the online wire protocol[2] that are present in
 this release.

 [1]
 https://github.com/ClusterLabs/pacemaker/releases/Pacemaker-1.1.11-rc1
 [2]
 http://blog.clusterlabs.org/blog/2013/changes-to-the-remote-wire-protocol/

 To build `rpm` packages for testing:

 1. Clone the current sources:

  # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
  # cd pacemaker

 1. If you haven't already, install Pacemaker's dependencies

  [Fedora] # sudo yum install -y yum-utils
  [ALL] # make rpm-dep

 1. Build Pacemaker

  # make rc

 1. Copy the rpms and deploy as needed


 A new release candidate, Pacemaker-1.1.11-rc2, is ready for testing.
 https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.11-rc2

 Assuming no major regressions are encountered during testing, this tag
 will become the final Pacemaker-1.1.11 release a week from today.

 -- Vossel

 Alright, New RC time. Pacemaker-1.1.11-rc3.

 If no regressions are encountered, rc3 will become the 1.1.11 final
 release a week from today

Re: [Pacemaker] [Problem][crmsh]The designation of the 'ordered' attribute becomes the error.

2013-04-30 Thread Keisuke MORI
Hi Dejan, Andreas, Yamauchi-san


2013/4/18 renayama19661...@ybb.ne.jp

 Hi Dejan,
 Hi Andreas,

  The shell in pacemaker v1.0.x is in maintenance mode and shipped
  along with the pacemaker code. The v1.1.x doesn't have the
  ordered and collocated meta attributes.

 I sent a pull request for the patch that Dejan contributed.
  * https://github.com/ClusterLabs/pacemaker-1.0/pull/14


The patch for crmsh is now included in the 1.0.x repository:

https://github.com/ClusterLabs/pacemaker-1.0/commit/9227e89fb748cd52d330f5fca80d56fbd9d3efbf

It will appear in the 1.0.14 maintenance release, which is not scheduled
yet.

Thanks,

Keisuke MORI



 Many Thanks!
 Hideo Yamauchi.
 --- On Tue, 2013/4/2, Dejan Muhamedagic deja...@fastmail.fm wrote:

  Hi,
 
  On Mon, Apr 01, 2013 at 09:19:51PM +0200, Andreas Kurz wrote:
   Hi Dejan,
  
   On 2013-03-06 11:59, Dejan Muhamedagic wrote:
Hi Hideo-san,
   
On Wed, Mar 06, 2013 at 10:37:44AM +0900, renayama19661...@ybb.ne.jp wrote:
Hi Dejan,
Hi Andrew,
   
In the crm shell, the meta attribute check was revised by the
 following patch.

 * http://hg.savannah.gnu.org/hgweb/crmsh/rev/d1174f42f4b3

This patch was backported to Pacemaker 1.0.13.

 *
 https://github.com/ClusterLabs/pacemaker-1.0/commit/fa1a99ab36e0ed015f1bcbbb28f7db962a9d1abc#shell/modules/cibconfig.py

However, the ordered and colocated attributes of a group resource are
 treated as errors when I use a crm shell that includes this patch.
   
--
(snip)
### Group Configuration ###
group master-group \
vip-master \
vip-rep \
meta \
ordered=false
(snip)
   
[root@rh63-heartbeat1 ~]# crm configure load update test2339.crm
INFO: building help index
crm_verify[20028]: 2013/03/06_17:57:18 WARN: unpack_nodes: Blind
 faith: not fencing unseen nodes
WARNING: vip-master: specified timeout 60s for start is smaller
 than the advised 90
WARNING: vip-master: specified timeout 60s for stop is smaller than
 the advised 100
WARNING: vip-rep: specified timeout 60s for start is smaller than
 the advised 90
WARNING: vip-rep: specified timeout 60s for stop is smaller than
 the advised 100
ERROR: master-group: attribute ordered does not exist  - WHY?
Do you still want to commit? y
--
   
If I answer `yes` at the confirmation prompt, the configuration is
 applied, but it is a problem that the error message is displayed.
 * The same error occurs when I specify the colocated attribute.
I also noticed that the online help of Pacemaker has no explanation
 of the ordered and colocated attributes of a group resource.

I think that specifying the ordered or colocated attribute on a
 group resource should not be an error.
In addition, I think that ordered and colocated should be added to
 the online help.
These attributes are not listed in crmsh. Does the attached patch
help?
  
   Dejan, will this patch for the missing ordered and collocated group
   meta-attribute be included in the next crmsh release? ... can't see the
   patch in the current tip.
 
  The shell in pacemaker v1.0.x is in maintenance mode and shipped
  along with the pacemaker code. The v1.1.x doesn't have the
  ordered and collocated meta attributes.
 
  Thanks,
 
  Dejan
 
 
   Thanks & Regards,
   Andreas
  
   
Thanks,
   
Dejan
   
Best Regards,
Hideo Yamauchi.
   
   

Re: [Pacemaker] Compilation problem in centos 6.3

2012-09-03 Thread Keisuke MORI
Hi,

I've seen a similar problem.
It was caused by an unseen escape sequence produced by the crm shell
(readline library in particular) when TERM=xterm.

Try export TERM=vt100 and rebuild it.
Or grab the latest crm shell.
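
A minimal sketch of the workaround, plus a check for stray escape bytes in a generated file. Both function names are hypothetical, and the check only looks for the raw ESC byte (0x1b); it does not know anything about the documentation build itself.

```shell
# Run any command with TERM=vt100 in its environment.
with_plain_term() {
    env TERM=vt100 "$@"
}

# Print the lines (with line numbers) of a file that contain an ESC byte.
find_escape_bytes() {
    # $1 = file to inspect, e.g. a generated XML file
    esc=$(printf '\033')
    grep -n "$esc" "$1"
}
```

For instance, `with_plain_term rpmbuild --rebuild <your .src.rpm>` reruns the build with the plain terminal type, and `find_escape_bytes tmp/en-US/xml_tmp/Ch-Tools.xml` shows whether the file named in the error still contains an escape sequence.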

2012/9/4 Miguel Angel Guerrero miguel.guerr...@itac.com.co

 Hi all

 I'm trying to compile the latest Pacemaker source package from Red Hat
 on CentOS 6.3 with SNMP support, but I get the error below and I don't
 understand it. I checked the dependencies and they are all correct.
 Thanks for your help.

 Processing file tmp/en-US/xml_tmp/Ch-Tools.xml -> tmp/en-US/xml/Ch-Tools.xml

 not well-formed (invalid token) at line 20, column 8, byte 884:
 <para>Take some time to familiarize yourself with what it can do.</para>
 <para># <command>crm --help</command></para>
 <screen>
 ===^
 usage:
 crm [-D display_type] [-f file] [-hF] [args]
  at /usr/lib64/perl5/XML/Parser.pm line 187
 gmake[1]: *** [Clusters_from_Scratch.txt] Error 255
 gmake[1]: Leaving directory 
 `/home/itac/rpmbuild/BUILD/ClusterLabs-pacemaker-148fccf/doc'
 make: *** [all-recursive] Error 1
 error: Bad exit status from /var/tmp/rpm-tmp.yHX51k (%build)


 RPM build errors:
 InstallSourcePackage at: psm.c:244: Header V3 RSA/SHA256 Signature, 
 key ID fd431d51: NOKEY
 Bad exit status from /var/tmp/rpm-tmp.yHX51k (%build)


 --


 www.itac.com.co

 Miguel Angel Guerrero
 Infrastructure Engineer

 ITAC - IT Applications Consulting
 Avenida 19 # 114 – 65 Oficina 215
 Bogota, DC. Colombia
 Phone (+571) 6400338 Ext. 147
 miguel.guerr...@itac.com.co

 

 Our clients are part of the quality we provide, so if you have a
 complaint, claim, or suggestion, please let us know at
 cali...@itac.com.co









--
Keisuke MORI



[Pacemaker] LF#2605 pingd misdetection of failure when kernel.pid_max > 65536

2011-06-16 Thread Keisuke MORI
Hi,

I've filed an issue in Bugzilla. Please find the details and the patch there.

  LF#2605 pingd misdetection of failure when kernel.pid_max > 65536
  http://developerbugs.linux-foundation.org/show_bug.cgi?id=2605
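
To see whether a node falls into the affected range, kernel.pid_max can be compared against the 65536 threshold from the bug title. This is only a sketch, assuming a Linux /proc filesystem and assuming the affected range is pid_max greater than 65536; the function name is hypothetical, and the bug report remains the reference for the actual cause.

```shell
# Succeeds when pid_max is above 65536.
pid_max_above_65536() {
    # $1 = optional value to test; defaults to the running kernel's setting
    max=${1:-$(cat /proc/sys/kernel/pid_max)}
    [ "$max" -gt 65536 ]
}
```

Called without an argument it checks the local node, for example `pid_max_above_65536 && echo "check LF#2605"`.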

Thanks,
-- 
Keisuke MORI

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] HealthSMART RA fix devices option

2011-03-02 Thread Keisuke MORI
Hi,

The HealthSMART RA does not seem to work properly when the 'devices'
option is specified. I suggest the attached patch.

Thanks,
-- 
Keisuke MORI
HealthSMART: fix devices option

diff -r cf4e9febed8e extra/resources/HealthSMART
--- a/extra/resources/HealthSMART	Wed Feb 23 14:52:34 2011 +0100
+++ b/extra/resources/HealthSMART	Wed Mar 02 18:45:22 2011 +0900
@@ -254,13 +254,13 @@ HealthSMART_monitor() {
 # Check drive temperature(s)
 	if [ ${OCF_RESKEY_devices} ]; then
 		for DEVICE in ${OCF_RESKEY_devices}; do
-		check_temperature `$SMARTCTL $DEVICE -A ${DRIVE} | awk '/^194/ { print $10 }'`
+		check_temperature `$SMARTCTL -d $DEVICE -A ${DRIVE} | awk '/^194/ { print $10 }'`
 		if [ $? != 0 ]; then
 			return $OCF_SUCCESS
 		fi
 		done
 	else
-		check_temperature `$SMARTCTL $DEVICE -A ${DRIVE} | awk '/^194/ { print $10 }'`
+		check_temperature `$SMARTCTL -A ${DRIVE} | awk '/^194/ { print $10 }'`
 		if [ $? != 0 ]; then
 		return $OCF_SUCCESS
 		fi
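
For reference, the awk filter in the patch above picks the tenth column of the SMART attribute row whose ID is 194 (the drive temperature). Here is a small sketch of just that step, fed with a made-up sample row rather than real smartctl output; the function name is hypothetical.

```shell
# Read `smartctl -A` style output on stdin and print the raw value
# (column 10) of attribute 194.
extract_temperature() {
    awk '/^194/ { print $10 }'
}

# Made-up sample row, for illustration only; prints 36.
printf '%s\n' '194 Temperature_Celsius 0x0022 036 053 000 Old_age Always - 36' |
    extract_temperature
```

Testing the filter on captured output like this makes it easy to check the parsing separately from the smartctl invocation that the patch changes.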


Re: [Pacemaker] Pacemaker-1.1.4, when?

2010-11-29 Thread Keisuke MORI
Hi Andrew,

2010/11/15 Andrew Beekhof and...@beekhof.net:
 If someone can fix the patch so that the regression tests pass I'll
 apply it, but I won't have any time to work on it for at least a few
 weeks.

I've been trying to write a patch for this, and it almost works fine,
but I found that it is very hard to make it 100% compatible with the
latest glib2 because of the implementation difference of GHashTable
between glib2-2.12(RHEL5) and glib2-2.26.

The attached patch almost works, except that the regression tests
fail on 3 of the utilization test cases.
(the patch and the failed diff are attached)
           
  Test utilization-order1:   Utilization Order - Simple
  * FAILED:  xml-file changed
  Test utilization-order2:   Utilization Order - Complex
  * FAILED:  xml-file changed
  Test utilization-order3:   Utilization Order - Migrate
  * FAILED:  xml-file changed

  * ERROR:   Results of 3 failed tests (out of 293) are in
./.regression.failed.diff
           

The only difference seems to be the order in which the nodes are
processed in pengine/native.c:native_internal_constraints().

The difference arises because GHashTable is implemented differently:
glib2-2.12 uses a linked list for iteration, while glib2-2.26 no
longer uses a linked list and simply walks an array.


I think that the possible options that we can take are:

1) Apply the patch and just ignore the errors on RHEL5
   - as long as they're considered harmless.
2) Sort the node list when creating the graph of the utilization
   - although it may cause another performance penalty.
3) Revert to using GList for the node list
   - if the node lookup is not the major factor of the performance issue.

It's all up to you.

Hope it helps.
Thanks,

Keisuke MORI


 On Mon, Nov 15, 2010 at 2:59 AM, nozawat noza...@gmail.com wrote:
 Hi Andrew and Nikola,

   I ran the regression tests too, and got the same error.

 Regards,
 Tomo


 2010/11/12 Nikola Ciprich extmaill...@linuxbox.cz

 (resent)
 1.1.4 with new glib2: tests pass smoothly
 1.1.4 + patch and older glib2 - all tests are segfaulting...

 ie:
 Program terminated with signal 11, Segmentation fault.
 #0  IA__g_str_hash (v=0x0) at gstring.c:95
 95    guint32 h = *p;
 (gdb) bt
 #0  IA__g_str_hash (v=0x0) at gstring.c:95
 #1  0x7fe087bb6128 in g_hash_table_lookup_node (hash_table=0x1390ec0,
 key=0x0, value=0x13a3b00) at ghash.c:231
 #2  IA__g_hash_table_insert (hash_table=0x1390ec0, key=0x0,
 value=0x13a3b00) at ghash.c:336
 #3  0x7fe089367953 in convert_graph_action (resource=0x13a30a0,
 action=0x139cb80, status=0, rc=7) at unpack.c:308
 #4  0x0040362a in exec_rsc_action (graph=0x1394fa0,
 action=0x139cb80) at crm_inject.c:359
 #5  0x7fe089368642 in initiate_action (graph=0x1394fa0,
 action=0x139cb80) at graph.c:172
 #6  0x7fe08936899d in fire_synapse (graph=0x1394fa0,
 synapse=0x139ba60) at graph.c:204
 #7  0x7fe089368dbd in run_graph (graph=0x1394fa0) at graph.c:262
 #8  0x0040428f in run_simulation (data_set=0x7fff712280a0) at
 crm_inject.c:540
 #9  0x0040632a in main (argc=9, argv=0x7fff71228308) at
 crm_inject.c:1148




-- 
Keisuke MORI
# HG changeset patch
# User Keisuke MORI kskm...@intellilink.co.jp
# Date 1290657182 -32400
# Node ID d0b7749d477fe9048c2edd877c07a411282540e5
# Parent  6407a7137b5748d6375083f0be843c198b3d95d2
[mq]: glib2.patch

diff -r 6407a7137b57 -r d0b7749d477f configure.ac
--- a/configure.ac	Fri Nov 19 18:19:03 2010 +0100
+++ b/configure.ac	Thu Nov 25 12:53:02 2010 +0900
@@ -654,7 +654,7 @@ AC_MSG_RESULT(using $GLIBCONFIG)
 
 AC_CHECK_LIB(glib-2.0, g_hash_table_get_values)
 if test x$ac_cv_lib_glib_2_0_g_hash_table_get_values != xyes; then
-   AC_MSG_ERROR(Your version of Glib is too old, you need at least 2.14)
+   AC_MSG_WARN(Your version of Glib is too old, you should have at least 2.14)
 fi
 
 #
diff -r 6407a7137b57 -r d0b7749d477f include/crm/common/util.h
--- a/include/crm/common/util.h	Fri Nov 19 18:19:03 2010 +0100
+++ b/include/crm/common/util.h	Thu Nov 25 12:53:02 2010 +0900
@@ -298,4 +298,69 @@ extern int node_score_infinity;
 extern xmlNode *create_operation_update(xmlNode *parent, lrm_op_t *op, const char *caller_version, int target_rc, const char *origin, int level);
 extern void free_lrm_op(lrm_op_t *op);
 
+#if HAVE_LIBGLIB_2_0
+
+#else
+
+typedef struct fake_ghi
+{
+GHashTable *hash;
+int nth; /* current index over the iteration */
+int lpc; /* internal loop counter inside g_hash_table_find */
+gpointer key;
+gpointer value;
+} GHashTableIter;
+
+static inline void g_hash_prepend_value(gpointer key, gpointer value, gpointer user_data)
+{
+GList **values = (GList **)user_data;
+*values = g_list_prepend(*values, value);
+}
+
+static inline GList *g_hash_table_get_values(GHashTable *hash_table)
+{
+GList *values

Re: [Pacemaker] Project updates

2010-11-18 Thread Keisuke MORI
2010/11/18 Andrew Beekhof and...@beekhof.net:
 On Tue, Nov 16, 2010 at 9:13 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Tue, Nov 16, 2010 at 8:07 AM, Keisuke MORI keisuke.mori...@gmail.com 
 wrote:
  STABLE_SERIES          = stable-1.0

  RPM_ROOT       = $(shell pwd)
 diff -r 99f5a1e61667 configure.ac
 --- a/configure.ac      Fri Nov 12 09:12:32 2010 +0100
 +++ b/configure.ac      Fri Nov 12 11:47:28 2010 -0500
 @@ -19,7 +19,7 @@
  dnl     checks for library functions
  dnl     checks for system services

 -AC_INIT(pacemaker, 1.0.9, pacemaker@oss.clusterlabs.org)
 +AC_INIT(pacemaker, 1.0.10, pacemaker@oss.clusterlabs.org)

 thats kinda annoying but not crucial.  thanks for pointing it out


 This would be confusing for users to tell which version they're
 actually using when they are going to report a problem because all the
 logs and crm_mon output shows the version as 1.0.9.
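
A quick way to catch this kind of mismatch before packaging is to read the version out of the AC_INIT line and compare it with the intended release. This is a minimal sketch; both function names are illustrative and are not part of the Pacemaker build system.

```shell
# Print the version from the AC_INIT(pacemaker, VERSION, ...) line.
ac_init_version() {
    # $1 = path to configure.ac
    sed -n 's/^AC_INIT(pacemaker, *\([^,]*\),.*/\1/p' "$1"
}

# Succeed only if configure.ac carries the expected version.
check_release_version() {
    # $1 = path to configure.ac, $2 = expected version, e.g. 1.0.10
    [ "$(ac_init_version "$1")" = "$2" ]
}
```

For example, `check_release_version configure.ac 1.0.10 || echo "version not bumped"` would have flagged the 1.0.9 string discussed here.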

 Is there any chance of releasing new RPMs with this fix?

 Oh, I forgot about crm_mon.
 I'll see what I can do.


 Happily a change to the spec file is all that was needed.
 New rpms should be available for all platforms

Great!
That will greatly help me and my customers, too.

Thanks,

-- 
Keisuke MORI



[Pacemaker] [FYI] Required errata for RHEL5

2010-11-18 Thread Keisuke MORI
Hi all,

For the information of Red Hat users:

Based on testing at my company, we recommend applying the following
erratum on RHEL 5.5 or earlier so that Pacemaker works more stably.
http://rhn.redhat.com/errata/RHBA-2010-0764.html

Due to a malfunction in libxml2-2.6.26 which comes with RHEL5.5,
Pacemaker may fail to update the cib information.

If you see a log error like this, you're hitting this issue:
---
attrd: [16708]: ERROR: attrd_cib_callback: Update -40 for
default_ping_set=100 failed: Required data for this CIB API call not
found
---

Or you may get an error from a crm_* command like this:
---
# crm_standby -U node1 -v off
Please choose from one of the matches above and suppy the 'id' with --attr-id
---
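
To check a node for the first symptom, the log can be searched for the attrd error shown above. This is a rough sketch, assuming each log entry sits on a single line (the excerpt above is wrapped by the mail client); the function name and the log path argument are hypothetical.

```shell
# Succeeds if the attrd "Required data ... not found" update failure
# appears in the given log file.
has_cib_update_failure() {
    # $1 = path to the cluster log file
    grep -q 'attrd_cib_callback: Update .* failed: Required data for this CIB API call not found' "$1"
}
```

For example, `has_cib_update_failure /var/log/ha-log && echo "consider the libxml2 erratum"`, where the log location depends on your logging configuration.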

In our testing it happened only in limited situations, but
technically it can happen at random.
According to the git log, the bug in libxml2 seems to have been fixed
in libxml2-2.6.27 and later.


Related discussions and links:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/61182
https://bugzilla.redhat.com/show_bug.cgi?id=613860
http://git.gnome.org/browse/libxml2/commit/?id=6422d916d929cb8653d950d4b424388a7ea7230d


Acknowledgments:
Many thanks to Hideo Yamauchi and RedHat support for resolving  this issue.


I hope this helps everyone.
Thanks,

-- 
Keisuke MORI



Re: [Pacemaker] crm_mon and pingd

2010-11-11 Thread Keisuke MORI
No objections.

I've pushed the changeset below:
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/53132ed532ea

But it would be still preferable to rely on -A, particularly if you
want to use two or more pingd resources or change the attribute name.

I'd also agree with adding the filtering feature as an enhancement.

Thanks,

2010/11/10 Andrew Beekhof and...@beekhof.net:
 Any objections Mori-san?
 Seems like a reasonable change to me.

 On Tue, Nov 9, 2010 at 1:26 PM, Vadym Chepkov vchep...@gmail.com wrote:

 Would it be too much harm to restore the previous behavior at least partially?

 diff -r 7f2e453eedfa -r ab2da8a98b47 tools/crm_mon.c
 --- a/tools/crm_mon.c   Mon Nov 08 23:13:17 2010 +0100
 +++ b/tools/crm_mon.c   Tue Nov 09 07:18:53 2010 -0500
 @@ -748,6 +748,17 @@
    g_list_free(sorted_op_list);
  }

 +static void get_ping_score(node_t *node, pe_working_set_t *data_set)
 +{
 +    const char *attr = "pingd";
 +    const char *value = NULL;
 +    value = g_hash_table_lookup(node->details->attrs, attr);
 +
 +    if(value != NULL) {
 +       print_as(" %s=%s", attr, value);
 +    }
 +}
 +
  static void print_attr_msg(node_t *node, GListPtr rsc_list, const char *attrname, const char *attrvalue)
  {
    slist_iter(rsc, resource_t, rsc_list, lpc2,
 @@ -848,6 +859,9 @@
       }

       print_as("* Node %s: ", crm_element_value(node_state, XML_ATTR_UNAME));
 +       if(!print_nodes_attr) {
 +               get_ping_score(node, data_set);
 +       }
       print_as("\n");

       lrm_rsc = find_xml_node(node_state, XML_CIB_TAG_LRM, FALSE);

 Thanks,
 Vadym





-- 
Keisuke MORI



Re: [Pacemaker] crm_mon and pingd

2010-11-05 Thread Keisuke MORI
Hi Vadym,

Could you provide the output of 'cibadmin -Q' to see what's happening
over there?

Thanks,

2010/11/4 Vadym Chepkov vchep...@gmail.com:
 Hi,

 It seems this patch in pacemaker doesn't work as expected

 changeset:   15672:4d50adc3ccd9
 branch:      stable-1.0
 user:        Andrew Beekhof and...@beekhof.net
 date:        Mon May 10 10:26:50 2010 +0200
 summary:     Medium: tools: crm_mon - Enable 'connectivity' mode for 'ping' resources too

 crm_mon doesn't show pingd attribute value in Migration summary: anymore.

 Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438

 Thanks,
 Vadym






-- 
Keisuke MORI



Re: [Pacemaker] About behavior in Action Lost.

2010-10-12 Thread Keisuke MORI
2010/10/7 Andrew Beekhof and...@beekhof.net:
 On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI keisuke.mori...@gmail.com 
 wrote:
 Andrew,

 2010/9/23 Andrew Beekhof and...@beekhof.net:
 Pushed as:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

 I would like to backport this to 1.0.
 Would you agree with this?

 I would prefer not to, but if it is important to you then I will agree.


Thank you for your ACK. It's now in 1.0.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/146e405c1afa

-- 
Keisuke MORI



Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip

2010-10-12 Thread Keisuke MORI
2010/10/9 Andrew Beekhof and...@beekhof.net:
 On Fri, Oct 8, 2010 at 4:17 PM, Lars Ellenberg
 lars.ellenb...@linbit.com wrote:
 On Wed, Oct 06, 2010 at 06:29:12PM +0900, Keisuke MORI wrote:
 2010/10/6 Andrew Beekhof and...@beekhof.net:
  Are there more changesets
  that need to be backported regarding this issue?
 
  There is now that Andreas brought the problem to my attention :-)
    http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe
 
  If not, I think that Andreas' patch should be applied to 1.0.
  It seems to me that the patch is sane, as it would restore the old
  behavior for the stop operation while keeping the resource attributes as
  the first patch intended.
 
  See the comment in the above patch. Andreas' original patch wouldn't
  have worked if the resource definition changed.

 I see, I will backport this to 1.0 too.

Done.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0d019d9e9c61



 May I take the opportunity to point you to
 http://hg.clusterlabs.org/pacemaker/1.1/rev/3f8df3dfb328

 ACK, no objection to this being backported :-)

Also done, along with a minor compilation fix.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/70438ddd4351
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0a40fd0cb9f2

-- 
Keisuke MORI



Re: [Pacemaker] About behavior in Action Lost.

2010-10-07 Thread Keisuke MORI
Andrew,

2010/9/23 Andrew Beekhof and...@beekhof.net:
 Pushed as:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

I would like to backport this to 1.0.
Would you agree with this?

Without this change, the failed node was not fenced when it should have
been, and the service could not be continued.
I also think it would be good to have the same behavior in 1.0 and 1.1
in such a critical condition, so that both versions can be supported
better.

Thanks,
Keisuke MORI


 On Wed, Sep 22, 2010 at 11:18 AM,  renayama19661...@ybb.ne.jp wrote:
 Hi Andrew,

 Thank you for comment.

 A long time ago in a galaxy far away, some messaging layers used to
 lose quite a few actions, including stops.
 About the same time, we decided that fencing because a stop action was
 lost wasn't a good idea.

 The rationale was that if the operation eventually completed, it would
 end up in the CIB anyway.
 And even if it didn't, the PE would continue to try the operation
 again until the whole node fell over at which point it would get shot
 anyway.

 Sorry...
 I did not know that there had been such a discussion in the old days.


 Now, having said that, things have improved since then and perhaps,
 in the interest of speeding up recovery in these situations, it is time
 to stop treating stop operations differently.
 Would you agree?

 Does that mean you will change it this time so that stonith is carried
 out when a stop action is lost?
 If my understanding is right, I agree too.

 if(timer->action->type != action_type_rsc) {
     send_update = FALSE;
 } else if(safe_str_eq(task, "cancel")) {
     /* we dont need to update the CIB with these */
     send_update = FALSE;
 }
 --- delete else if(safe_str_eq(task, "stop")){..} ?

 if(send_update) {
     /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
     cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
 }

 Best Regards,
 Hideo Yamauchi.

 --- Andrew Beekhof and...@beekhof.net wrote:

 On Tue, Sep 21, 2010 at 8:59 AM,  renayama19661...@ybb.ne.jp wrote:
  Hi,
 
  The node was under very high load, and we checked how the monitor
  operation of Pacemaker behaved.
  An Action Lost occurred for the stop operation after the monitor error
  occurred.
 
  Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
  Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
 
 
  We think the stop operation did not go well because of the load on the node.
  But can't the nodes execute stonith in this case?

 A long time ago in a galaxy far away, some messaging layers used to
 lose quite a few actions, including stops.
 About the same time, we decided that fencing because a stop action was
 lost wasn't a good idea.

 The rationale was that if the operation eventually completed, it would
 end up in the CIB anyway.
 And even if it didn't, the PE would continue to try the operation
 again until the whole node fell over at which point it would get shot
 anyway.

 Now, having said that, things have improved since then and perhaps,
 in the interest of speeding up recovery in these situations, it is time
 to stop treating stop operations differently.
 Would you agree?





-- 
Keisuke MORI



Re: [Pacemaker] Can somebody please explain pengine's urge to move all resources?

2010-10-06 Thread Keisuke MORI
Hi Raoul,

2010/9/28 Andrew Beekhof and...@beekhof.net:
 On Tue, Sep 28, 2010 at 11:48 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at 
 wrote:
 On 09/23/2010 09:28 AM, Andrew Beekhof wrote:
 The good news is that 1.1.3 doesn't have that behavior.
 Let's see how 1.0 goes once all the relevant patches have been backported.

 thanks for your answer! will those patches make it into 1.0.10 or
 do you have another eta for this?

This should have been fixed by this changeset:
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/5fe02f48c47b

The patch has already been backported to the 1.0 repository and will
be included in 1.0.10.
Could you test with the tip of the 1.0 repository if you get a chance?

Thanks,


 MORI-san from NTT is currently working on the backports.
 we'll delay .10 until he has a chance to complete the process :-)
-- 
Keisuke MORI



Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip

2010-10-06 Thread Keisuke MORI
2010/10/6 Andrew Beekhof and...@beekhof.net:
 Are there more changesets
 that need to be backported regarding this issue?

 There is now that Andreas brought the problem to my attention :-)
   http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe

 If not, I think that Andreas' patch should be applied to 1.0.
 It seems to me that the patch is sane, as it would restore the old
 behavior for the stop operation while keeping the resource attributes as
 the first patch intended.

 See the comment in the above patch. Andreas' original patch wouldn't
 have worked if the resource definition changed.

I see, I will backport this to 1.0 too.

Thanks,
-- 
Keisuke MORI



Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip

2010-10-04 Thread Keisuke MORI
2010/10/2 Andreas Hofmeister a...@collax.com:
 Hi,

 it seems to me that patch

  http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/8241f689bf9f

 broke timeouts for stop operations. The observable effect is that the
 timeout for stop operations is always 125s, regardless of what was specified
 in the CIB. Reverting the part of the patch that changes crmd/lrm.c seems to
 fix the problem.

 The attached patch reverts the change to crmd/lrm.c and also

  http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/66df1404fdcb

 which dealt with another side effect of the change in crmd/lrm.c .


Hi Andreas,

You're right and I confirmed that the problem exists
in the 1.0 tip and the problem goes away with your patch.


Andrew,

Doesn't this problem exist in 1.1? Are there more changesets
that need to be backported regarding this issue?

If not, I think that Andreas' patch should be applied to 1.0.
It seems to me that the patch is sane, as it would restore the old
behavior for the stop operation while keeping the resource attributes as
the first patch intended.

Thanks,

Keisuke MORI



[Pacemaker] Memory leaks in pacemaker-1.0.8

2010-03-29 Thread Keisuke MORI
Hi,

Pacemaker-1.0.8 seems to have some memory leaks.

Please see the details in the Bugzilla entry:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2386


Regards,

-- 
Keisuke MORI



[Pacemaker] Pseudo RAs do not work properly on Corosync stack

2010-03-16 Thread Keisuke MORI
Hi,

Sorry for the somewhat long mail.
I'm going to describe the issue in the Subject: line and would like to
suggest some changes to the agents package (and possibly Pacemaker, too).
I would be glad if you could give me your thoughts and comments.



Pseudo RAs that create a state file under HA_RSCTMP
(/var/run/heartbeat/rsctmp), such as Dummy, MailTo, etc., do not
work properly on the Pacemaker+Corosync stack.

When a node crashes and is rebooted, a stale state file
survives the reboot, so the RA misbehaves as if the
resource were already started when the cluster is launched again
for recovery.
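
To illustrate the failure mode, here is a minimal sketch of how such a pseudo RA typically tracks its state. This is not the actual Dummy or MailTo code; the function names and the state file name are made up for illustration. Return codes 0 and 7 are the OCF "success" and "not running" codes.

```shell
#!/bin/sh
# Sketch of a pseudo RA that tracks "running" purely by a state file.
# Function names and the file name are hypothetical, for illustration only.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Path of the state file; HA_RSCTMP falls back to the current default
# location mentioned above if the environment does not set it.
pseudo_state_file() {
    printf '%s/Dummy-example.state\n' "${HA_RSCTMP:-/var/run/heartbeat/rsctmp}"
}

pseudo_start() {
    touch "$(pseudo_state_file)"
}

pseudo_stop() {
    rm -f "$(pseudo_state_file)"
}

pseudo_monitor() {
    # "Running" means nothing more than "the state file exists".
    if [ -f "$(pseudo_state_file)" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}
```

If the node crashes after pseudo_start, pseudo_stop never runs. Unless something clears the directory at boot, the first pseudo_monitor after the reboot still returns success, which is exactly the misbehavior described above.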

This problem does not occur on the Heartbeat stack because
Heartbeat clears HA_RSCTMP at startup,
whereas on the Corosync stack neither Pacemaker nor Corosync removes it.

But having Pacemaker remove the files does not seem correct either:
if they were removed at cluster startup time, then
maintenance mode would no longer work properly.

In my understanding, the correct behavior is:
 - They should NOT be removed at the cluster startup time.
 - They should be removed at the OS bootup time.



My suggestion for addressing this issue is to make the following changes:

 - 1) change the HA_RSCTMP location to /var/run/resource-agents,
  or any other subdirectory directly under /var/run.
 - 2) set the directory permissions to 01777 (with the sticky bit).
 - 3) change the IPaddr/SendArp RAs not to use their own subdirectory
  but to add a prefix to the filename instead.
 - 4) make /var/run/heartbeat/rsctmp obsolete;
  Heartbeat/Pacemaker could preserve the current behavior
  for a while for compatibility.
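
Items 1) to 3) could be sketched roughly as follows. This is only an illustration under my own assumptions: the helper names setup_rsctmp and state_file_path are hypothetical, and how the directory actually gets created (package install, init script, etc.) is left open.

```shell
#!/bin/sh
# Sketch of suggestions 1) to 3): one shared directory with the sticky bit,
# and prefixed file names instead of per-agent subdirectories.
# Helper names are hypothetical.

# Create the shared state directory with mode 01777.
# $1: directory, e.g. /var/run/resource-agents as proposed above.
setup_rsctmp() {
    mkdir -p "$1" && chmod 1777 "$1"
}

# Build a state file path using the agent name as a prefix.
# $1: directory, $2: agent name (e.g. IPaddr), $3: resource-specific part.
state_file_path() {
    printf '%s/%s-%s\n' "$1" "$2" "$3"
}
```

For example, `state_file_path /var/run/resource-agents IPaddr 192.168.0.10` would yield a file directly under the shared directory rather than under an IPaddr subdirectory, so it is covered by whatever cleans the first-level directory at boot.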


The basic idea of these changes is to follow the file removal
procedure defined by the FHS (Filesystem Hierarchy Standard).

http://www.pathname.com/fhs/pub/fhs-2.3.html#VARRUNRUNTIMEVARIABLEDATA

The FHS specifies that any files under a subdirectory of /var/run
should be removed at OS boot time.

Unfortunately, a second-level subdirectory is out of scope, so you
cannot rely on its contents being removed (which is the case for
/var/run/heartbeat/rsctmp).


I believe that the impact on existing RAs is minimal.
If your RA is implemented correctly, you need to do nothing;
just note that the location of the state file changes.

If your RA has /var/run/heartbeat/rsctmp hardcoded, or creates
its own subdirectory, you are encouraged to fix it because it
may not work well with maintenance mode, but you can
continue to use the old rsctmp if you like.


I would like to hear your thoughts and comments.

Regards,
-- 
Keisuke MORI
