Re: [Pacemaker] Enable remote monitoring

2012-12-12 Thread Gao,Yan
On 12/12/12 15:38, Gao,Yan wrote:
 On 12/12/12 11:14, Gao,Yan wrote:
 On 12/12/12 01:53, David Vossel wrote:
 - Original Message -
 From: Yan Gao y...@suse.com
 To: pacemaker@oss.clusterlabs.org
 Sent: Tuesday, December 11, 2012 1:23:03 AM
 Subject: Re: [Pacemaker] Enable remote monitoring

 Hi,
 Here's the latest code:
 https://github.com/gao-yan/pacemaker/commit/4d58026c2171c42385c85162a0656c44b37fa7e8


 Now:
 - container-type:
   * black - ordering, colocating
   * white - ordering
   Neither of them is probed so far.

 I think for the sake of this implementation we should ignore the whitebox 
 use case for now.  There are aspects of the whitebox use case that I'm just 
 not sure about yet, and I don't want to hold you all up trying to define 
 that. I don't mind re-approaching this container concept later on and 
 expanding it to the whitebox use case, building on what you have here.  I'm 
 in favor of removing container-type, letting the blackbox use case be the 
 default for now, and I'll go in and do our whitebox bits later. 
 Hmm, this might be better until we have a clear definition for whitebox.
 Removed container-type for now. Pushed with several regression tests:
Sorry, forgot the link:
https://github.com/gao-yan/pacemaker/commits/container

Regards,
  Gao,Yan
-- 
Gao,Yan y...@suse.com
Software Engineer
China Server Team, SUSE.



Re: [Pacemaker] Suggestion to improve movement of booth

2012-12-12 Thread Jiaju Zhang
Hi Yusuke,

On Fri, 2012-11-30 at 21:30 +0900, yusuke iida wrote:
 Hi, Jiaju
 
 
 
 When communication between some of the proposers and acceptors is lost,
 a proposer temporarily re-acquires the lease.
 
 Since the ticket is temporarily revoked at that point, the service
 stops temporarily.
 
 I think that this is a problem.
 
 I would like the ticket lease to be held.

This is what I wanted to do as well ;) That is to say, the lease should
keep being renewed on the original site unless that site is down.
The current implementation lets the original site renew the ticket
before the ticket lease expires (the ticket is only revoked when the
lease expires). Hence, before the other sites try to acquire the ticket,
the original site has already renewed it, so the ticket is still on that
site.

I don't quite understand your problem here. Is it that the lease is not
being kept on the original site?

Thanks,
Jiaju

 
 
 
 I thought about a plan to prevent the movement caused by the
 re-acquisition of the lease.
 
 While the current proposer keeps updating the lease, I think messages
 from a new proposer should be refused.
 
 To keep the existing behaviour available, I want this to be switchable
 by a setting.
 
 
 
 I wrote a patch for this proposal.
 
 https://github.com/yuusuke/booth/commit/6b82fda7b4220c418ff906a9cf8152fe88032566
  
 
 
 
 What do you think about this proposal?
 
 
 
 Best regards,
 Yuusuke
 -- 
  
 METRO SYSTEMS CO., LTD 
 
 Yuusuke Iida 
 Mail: yusk.i...@gmail.com
  





Re: [Pacemaker] node status does not change even if pacemakerd dies

2012-12-12 Thread Kazunori INOUE

(12.12.06 12:18), Andrew Beekhof wrote:

On Wed, Dec 5, 2012 at 8:32 PM, Kazunori INOUE
inouek...@intellilink.co.jp wrote:

(12.12.05 02:02), David Vossel wrote:




- Original Message -


From: Kazunori INOUE inouek...@intellilink.co.jp
To: The Pacemaker cluster resource manager
pacemaker@oss.clusterlabs.org
Sent: Monday, December 3, 2012 11:41:56 PM
Subject: Re: [Pacemaker] node status does not change even if pacemakerd
dies

(12.12.03 20:24), Andrew Beekhof wrote:


On Mon, Dec 3, 2012 at 8:15 PM, Kazunori INOUE
inouek...@intellilink.co.jp wrote:


(12.11.30 23:52), David Vossel wrote:



- Original Message -



From: Kazunori INOUE inouek...@intellilink.co.jp
To: pacemaker@oss pacemaker@oss.clusterlabs.org
Sent: Friday, November 30, 2012 2:38:50 AM
Subject: [Pacemaker] node status does not change even if
pacemakerd dies

Hi,

I am testing the latest version.
- ClusterLabs/pacemaker  9c13d14640(Nov 27, 2012)
- corosync   92e0f9c7bb(Nov 07, 2012)
- libqb  30a7871646(Nov 29, 2012)


Although I killed pacemakerd, node status did not change.

 [dev1 ~]$ pkill -9 pacemakerd
 [dev1 ~]$ crm_mon
   :
 Stack: corosync
 Current DC: dev2 (2472913088) - partition with quorum
 Version: 1.1.8-9c13d14
 2 Nodes configured, unknown expected votes
 0 Resources configured.


 Online: [ dev1 dev2 ]

 [dev1 ~]$ ps -ef|egrep 'corosync|pacemaker'
 root 11990 1  1 16:05 ?00:00:00 corosync
 496  12010 1  0 16:05 ?00:00:00
 /usr/libexec/pacemaker/cib
 root 12011 1  0 16:05 ?00:00:00
 /usr/libexec/pacemaker/stonithd
 root 12012 1  0 16:05 ?00:00:00
 /usr/libexec/pacemaker/lrmd
 496  12013 1  0 16:05 ?00:00:00
 /usr/libexec/pacemaker/attrd
 496  12014 1  0 16:05 ?00:00:00
 /usr/libexec/pacemaker/pengine
 496  12015 1  0 16:05 ?00:00:00
 /usr/libexec/pacemaker/crmd


We want the node status to change to
OFFLINE(stonith-enabled=false),
UNCLEAN(stonith-enabled=true).
That is, we want the function of this deleted code.


https://github.com/ClusterLabs/pacemaker/commit/dfdfb6c9087e644cb898143e198b240eb9a928b4




How are you launching pacemakerd?  The systemd service script
relaunches
pacemakerd on failure and pacemakerd has the ability to attach to
all the
old processes if they are still around as if nothing happened.

-- Vossel



Hi David,

We are using RHEL6 and will continue to use it for a while.
Therefore, I start it with the following commands.

$ /etc/init.d/pacemakerd start
or
$ service pacemaker start



Ok.
Are you using the pacemaker plugin?

When using cman or corosync 2.0, pacemakerd isn't strictly needed for
normal operation.
It's only there to shut down and/or respawn failed components.


We are using corosync 2.1,
so the service does not stop cleanly after pacemakerd has died.

$ pkill -9 pacemakerd
$ service pacemaker stop
$ echo $?
0
$ ps -ef|egrep 'corosync|pacemaker'
root  3807 1  0 13:10 ?00:00:00 corosync
496   3827 1  0 13:10 ?00:00:00
/usr/libexec/pacemaker/cib
root  3828 1  0 13:10 ?00:00:00
/usr/libexec/pacemaker/stonithd
root  3829 1  0 13:10 ?00:00:00
/usr/libexec/pacemaker/lrmd
496   3830 1  0 13:10 ?00:00:00
/usr/libexec/pacemaker/attrd
496   3831 1  0 13:10 ?00:00:00
/usr/libexec/pacemaker/pengine
496   3832 1  0 13:10 ?00:00:00
/usr/libexec/pacemaker/crmd



Ah yes, that is a problem.

Having pacemaker still running when the init script says it is down...
that is bad.  Perhaps we should just make the init script smart enough to
check that all the pacemaker components are down after pacemakerd is
down.
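
A minimal sketch of the kind of check being suggested, as a shell function
(the daemon names are taken from the ps output earlier in this thread; the
exact init-script integration is an assumption, not the shipped script):

  # hypothetical helper: verify every pacemaker daemon is really gone
  # once "service pacemaker stop" has finished
  check_pacemaker_stopped() {
      for d in cib stonithd lrmd attrd pengine crmd pacemakerd; do
          if pgrep -x "$d" >/dev/null 2>&1; then
              echo "$d is still running" >&2
              return 1
          fi
      done
      return 0
  }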

Whether or not the failure of pacemakerd is something that the cluster
should be alerted to is something I'm not sure about.  With the corosync
2.0 stack, pacemakerd really doesn't do anything except launch and
relaunch processes.  A cluster can be completely functional without a
pacemakerd instance running anywhere.  If any of the actual pacemaker
components on a node fail, the logic that causes that node to get fenced
has nothing to do with pacemakerd.

-- Vossel




Hi,

 I think that pacemakerd's relaunching of failed processes is a very useful
 function, so I want to avoid managing resources on a node where pacemakerd
 is not running.


You do understand that the node will be fenced if any of those
processes fail, right?
It's not like a node could end up in a bad state if pacemakerd isn't
around to respawn things.

The relaunching of processes is there in an attempt to recover before
anyone else notices.
So essentially what you're asking for is to fence the node and
migrate all the resources now, so that in the future IF another process
dies, we MIGHT not have to fence the node ...

Re: [Pacemaker] booth is the state of started on pacemaker before booth write ticket info in cib.

2012-12-12 Thread Jiaju Zhang
On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
 Hi Jiaju,
 
 Currently, booth reaches the Started state in pacemaker before booth
 writes the ticket information into the CIB. So, if old ticket information
 is included in the CIB, a resource related to that ticket may start before
 booth resets the ticket. I think the problem is the point at which booth
 becomes a daemon.

The resource should not be started before the booth daemon is ready. We
suggest configuring an ordering constraint between the booth daemon and
the resources managed by that ticket. That way, if the ticket is in
the CIB but the booth daemon has not been started, the resources will not
be started.
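
For illustration, such an ordering constraint in crm shell syntax could look
like the following (the resource names booth-site and web-server are
hypothetical placeholders):

  # start the ticket-managed resource only after the booth daemon resource
  crm configure order web-after-booth inf: booth-site web-server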

 
 Perhaps this problem didn't happen before the following commit.
 https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

Currently, when all of the initialization (including loading the new
ticket information) has finished, booth should be regarded as ready. So if
you encounter some problem here, I guess we should improve the RA to
better reflect the booth startup status rather than moving the
initialization order, since that may introduce other regressions, as we
have encountered before ;)

Thanks,
Jiaju

 
 Sincerely,
 Yuichi
 
 --
 Yuichi SEINO
 METROSYSTEMS CORPORATION
 E-mail:seino.clust...@gmail.com





Re: [Pacemaker] Enable remote monitoring

2012-12-12 Thread Lars Marowsky-Bree
On 2012-12-11T12:53:39, David Vossel dvos...@redhat.com wrote:

Excellent progress!

Just one aspect caught my eye:

  - on-fail defaults to restart-container for most actions, except for
    the stop op (Not sure what it means if a stop fails. A nagios
    daemon cannot be terminated? Should it always return success?)
 
 A nagios stop action should always return success.  The nagios agent 
 doesn't even need a stop function; the lrmd can know to treat a stop as a 
 (no-op for stop) + (cancel all recurring actions).  In that case, if the 
 nagios agent doesn't stop successfully, it is because of an lrmd failure, 
 which should result in a fencing action, I'd imagine.

That's something that, IMHO, shouldn't be handled by the container
abstraction, but - like you say - by the LRM/class code.

I think on-fail=restart-container makes sense even for stop. If
stop can't technically fail for a given class, even better. But it
could mean that we actually need to stop some monitoring daemon or
whatever.

The other logic might be to set it to ignore, which would also work
for me (even if a bit less obviously).

But really I'd not want to make it "oh, let's just skip stop for contained
resources" here ;-)
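
For illustration, the per-operation setting under discussion would look
roughly like this in crm shell syntax (restart-container is the value being
proposed in this thread, not an already-released on-fail option, and the
resource definition is a hypothetical placeholder):

  # hypothetical sketch of the proposed per-op setting
  crm configure primitive nagios-check ocf:heartbeat:Dummy \
      op monitor interval=10s on-fail=restart-container \
      op stop interval=0 on-fail=restart-container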

  - Failures of resources count against the container's failcount
 What happens if someone wants to clear the container's failcount? Do we need 
 to add some logic to go in and clear all the child resources' failures as 
 well to make this happen correctly?

That appears to make sense.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




[Pacemaker] Moving multi-state resources

2012-12-12 Thread pavan tc
Hi,

My requirement was to do some administration on one of the nodes where a
2-node multi-state resource was running.
To effect a resource instance stoppage on one of the nodes, I added a
resource constraint as below:

crm configure location ms_stop_res_on_node ms_resource rule -inf: \#uname
eq `hostname`

The resource cleanly moved over to the other node. Incidentally, the
resource was the master on this node
and was successfully moved to a master state on the other node too.
Now, I want to bring the resource back onto the original node.

But the above resource constraint seems to have a persistent behaviour.
crm resource unmigrate ms_resource does not seem to undo the effects of
the constraint addition.

I think the location constraint is preventing the resource from starting on
the original node.
How do I delete this location constraint now?

Is there a more standard way of doing such administrative tasks? The
requirement is that I do not want to take the entire node offline while
doing the administration; rather, I want to stop only the resource
instance, do the admin work, and restart the resource instance on the
node.

Thanks,
Pavan


Re: [Pacemaker] pacemaker processes RSS growth

2012-12-12 Thread Vladislav Bogdanov
12.12.2012 05:35, Andrew Beekhof wrote:
 On Tue, Dec 11, 2012 at 5:49 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 11.12.2012 06:52, Vladislav Bogdanov wrote:
 11.12.2012 05:12, Andrew Beekhof wrote:
 On Mon, Dec 10, 2012 at 11:34 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 10.12.2012 09:56, Vladislav Bogdanov wrote:
 10.12.2012 04:29, Andrew Beekhof wrote:
 On Fri, Dec 7, 2012 at 5:37 PM, Vladislav Bogdanov 
 bub...@hoster-ok.com wrote:
 06.12.2012 09:04, Vladislav Bogdanov wrote:
 06.12.2012 06:05, Andrew Beekhof wrote:
 I wonder what the growth looks like with the recent libqb fix.
 That could be an explanation.

 Valid point. I will watch.

 On a almost static cluster the only change in memory state during 24
 hours is +700kb of shared memory to crmd on a DC. Will look after that
 one for more time.

 It still grows. ~650-700k per day. I sampled 'maps' and 'smaps' content
 from crmd's proc and will look at what differs there over time.

 smaps tells me it may be in /dev/shm/qb-pengine-event-1735-1736-4-data.
 1735 is pengine, 1736 is crmd.

 Diff of that part:
 @@ -56,13 +56,13 @@
  MMUPageSize:   4 kB
  7f427fddf000-7f42802df000 rw-s  00:0f 12332
   /dev/shm/qb-pengine-event-1735-1736-4-data
  Size:   5120 kB
 -Rss:4180 kB
 -Pss:2089 kB
 +Rss:4320 kB
 +Pss:2159 kB
  Shared_Clean:  0 kB
 -Shared_Dirty:   4180 kB
 +Shared_Dirty:   4320 kB
  Private_Clean: 0 kB
  Private_Dirty: 0 kB
 -Referenced: 4180 kB
 +Referenced: 4320 kB
  Anonymous: 0 kB
  AnonHugePages: 0 kB
  Swap:  0 kB

 'Rss' and 'Shared_Dirty' will soon reach 'Size' (now 4792 vs 5120); I'll
 look at what happens then. I expect growth to stop and pages to be reused.
 If that is true, then there are no leaks, but rather a controlled fill
 of a buffer of a predefined size.
 
 Great. Please let me know how it turns out.

Now I see

@@ -56,13 +56,13 @@
 MMUPageSize:   4 kB
 7f427fddf000-7f42802df000 rw-s  00:0f 12332
  /dev/shm/qb-pengine-event-1735-1736-4-data
 Size:   5120 kB
-Rss:4180 kB
-Pss:2089 kB
+Rss:5120 kB
+Pss:2559 kB
 Shared_Clean:  0 kB
-Shared_Dirty:   4180 kB
+Shared_Dirty:   5120 kB
 Private_Clean: 0 kB
 Private_Dirty: 0 kB
-Referenced: 4180 kB
+Referenced: 5120 kB
 Anonymous: 0 kB
 AnonHugePages: 0 kB
 Swap:  0 kB
@@ -70,13 +70,13 @@
 MMUPageSize:   4 kB
 7f42802df000-7f42807df000 rw-s  00:0f 12332
  /dev/shm/qb-pengine-event-1735-1736-4-data
 Size:   5120 kB
-Rss:   0 kB
-Pss:   0 kB
+Rss:   4 kB
+Pss:   1 kB
 Shared_Clean:  0 kB
-Shared_Dirty:  0 kB
+Shared_Dirty:  4 kB
 Private_Clean: 0 kB
 Private_Dirty: 0 kB
-Referenced:0 kB
+Referenced:4 kB
 Anonymous: 0 kB
 AnonHugePages: 0 kB
 Swap:  0 kB

So, it is stuck at 5 MB and does not grow anymore.

What's more, all pacemaker processes on the DC now for some reason consume
much less shared memory, according to htop, than the last time I looked at
them. It seems to be due to a decrease of referenced pages within some
anonymous mappings, though I have no idea why that happened.

Ok, the main conclusion I can make is that pacemaker does not have any
memory leaks in code paths used by a static cluster.

I will try to apply some load now.
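
For reference, the per-mapping numbers quoted above can be sampled over time
with something like this (the mapping name and PID lookup follow this thread;
smaps.OLD and smaps.NEW stand in for two saved snapshots):

  # snapshot the libqb shared-memory mapping used by crmd
  pid=$(pidof crmd)
  grep -A 12 'qb-pengine-event' /proc/$pid/smaps > smaps.$(date +%s)
  # later, diff two snapshots to watch Rss/Shared_Dirty growth
  diff smaps.OLD smaps.NEW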




Re: [Pacemaker] Moving multi-state resources

2012-12-12 Thread Dejan Muhamedagic
Hi,

On Wed, Dec 12, 2012 at 03:50:01PM +0530, pavan tc wrote:
 Hi,
 
 My requirement was to do some administration on one of the nodes where a
 2-node multi-state resource was running.
 To effect a resource instance stoppage on one of the nodes, I added a
 resource constraint as below:
 
 crm configure location ms_stop_res_on_node ms_resource rule -inf: \#uname
 eq `hostname`
 
 The resource cleanly moved over to the other node. Incidentally, the
 resource was the master on this node
 and was successfully moved to a master state on the other node too.
 Now, I want to bring the resource back onto the original node.
 
 But the above resource constraint seems to have a persistent behaviour.
 crm resource unmigrate ms_resource does not seem to undo the effects of
 the constraint addition.

You can try to remove your constraint:

crm configure delete ms_stop_res_on_node

migrate/unmigrate generate/remove special constraints.
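
For illustration (the cli-* constraint names below are the ones the tools
usually generate, mentioned here as an assumption; a manually added rule
keeps its own id and must be removed by hand):

  crm resource migrate ms_resource node2    # adds a cli-prefer-/cli-standby- constraint
  crm resource unmigrate ms_resource        # removes only that generated constraint
  crm configure delete ms_stop_res_on_node  # manually added constraints need manual removal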

Thanks,

Dejan

 
 I think the location constraint is preventing the resource from starting on
 the original node.
 How do I delete this location constraint now?
 
 Is there a more standard way of doing such administrative tasks? The
 requirement is that I do not want to offline the
 entire node while doing the administration but rather would want to stop
 only the resource instance, do the admin work
 and restart the resource instance on the node.
 
 Thanks,
 Pavan





Re: [Pacemaker] Moving multi-state resources

2012-12-12 Thread pavan tc
On Wed, Dec 12, 2012 at 6:46 PM, Dejan Muhamedagic deja...@fastmail.fmwrote:

 Hi,

 On Wed, Dec 12, 2012 at 03:50:01PM +0530, pavan tc wrote:
  Hi,
 
  My requirement was to do some administration on one of the nodes where a
  2-node multi-state resource was running.
  To effect a resource instance stoppage on one of the nodes, I added a
  resource constraint as below:
 
  crm configure location ms_stop_res_on_node ms_resource rule -inf:
 \#uname
  eq `hostname`
 
  The resource cleanly moved over to the other node. Incidentally, the
  resource was the master on this node
  and was successfully moved to a master state on the other node too.
  Now, I want to bring the resource back onto the original node.
 
  But the above resource constraint seems to have a persistent behaviour.
  crm resource unmigrate ms_resource does not seem to undo the effects of
  the constraint addition.

 You can try to remove your constraint:

 crm configure delete ms_stop_res_on_node


That did the job. Thanks a ton!

Pavan


 migrate/unmigrate generate/remove special constraints.

 Thanks,

 Dejan

 
  I think the location constraint is preventing the resource from starting
 on
  the original node.
  How do I delete this location constraint now?
 
  Is there a more standard way of doing such administrative tasks? The
  requirement is that I do not want to offline the
  entire node while doing the administration but rather would want to stop
  only the resource instance, do the admin work
  and restart the resource instance on the node.
 
  Thanks,
  Pavan






[Pacemaker] Listing resources by attributes

2012-12-12 Thread pavan tc
Hi,

Is there a way in which resources can be listed based on some attributes?
For example, listing resource running on a certain node, or listing ms
resources.

The crm_resource manpage talks about the -N and -t options that seem to
address the requirements above.
But they do not provide the expected result.
crm_resource --list or crm_resource --list-raw give the same output
regardless of whether -N or -t is provided.

I had to do the following to pull out 'ms' resources, for example:
crm configure show | grep -w ^ms | awk '{print $2}'

Is there a cleaner way to list resources?

Thanks,
Pavan


[Pacemaker] Action from a different CRMD transition results in restarting services

2012-12-12 Thread Latrous, Youssef
Hi,

 

I ran into the following issue and I couldn't find out what it really means:

 

Detected action msgbroker_monitor_1 from a different
transition: 16048 vs. 18014

 

I can see that its impact is to stop/start a service but I'd like to
understand it a bit more.

 

Thank you in advance for any information.

 

 

Logs about this issue:

...

Dec  6 22:55:05 Node1 crmd: [5235]: info: process_graph_event: Detected
action msgbroker_monitor_1 from a different transition: 16048 vs.
18014

Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
process_graph_event:477 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=msgbroker_monitor_1,
magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692, cib=0.971.5)
: Old event

Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
failcount for msgbroker on Node0 after failed monitor: rc=7
(update=value++, time=1354852505)

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2
cluster nodes are eligible to run resources.

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069:
Requesting the current CIB: S_POLICY_ENGINE

Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
id=status-Node0-fail-count-msgbroker, magic=NA, cib=0.971.6) : Transient
attribute: update

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070:
Requesting the current CIB: S_POLICY_ENGINE

Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
id=status-Node0-last-failure-msgbroker, magic=NA, cib=0.971.7) :
Transient attribute: update

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071:
Requesting the current CIB: S_POLICY_ENGINE

Dec  6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating
hash entry for last-failure-msgbroker

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback:
Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12,
quorate=1

Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss of
CCM Quorum: Ignore

Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op: Operation
txpublisher_monitor_0 found resource txpublisher active on Node1

Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
failed op msgbroker_monitor_1 on Node0: not running (7)

...

Dec  6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness:
msgbroker can fail 99 more times on Node0 before being forced off

...

Dec  6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp:  Start
recurring monitor (10s) for msgbroker on Node0

...

Dec  6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover
msgbroker  (Started Node0)

...

Dec  6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating
action 37: stop msgbroker_stop_0 on Node0

 

 

Transition 18014 details:

 

Dec  6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message:
Transition 18014: PEngine Input stored in:
/var/lib/pengine/pe-input-3270.bz2

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]

Dec  6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked
transition 18014: 0 actions in 0 synapses

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing graph
18014 (ref=pe_calc-dc-1354852338-39406) derived from
/var/lib/pengine/pe-input-3270.bz2

Dec  6 22:52:18 Node1 crmd: [5235]: info: run_graph:


Dec  6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition 18014
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-3270.bz2): Complete

Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition
18014 is now complete

Dec  6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition 18014
status: done - null

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: Starting
PEngine Recheck Timer

 

 

Youssef

 

 



Re: [Pacemaker] pacemaker processes RSS growth

2012-12-12 Thread Andrew Beekhof
On Wed, Dec 12, 2012 at 11:17 PM, Vladislav Bogdanov
bub...@hoster-ok.com wrote:
 Ok, the main conclusion I can make is that pacemaker does not have any
 memory leaks in code paths used by a static cluster.

Huzah! :)



Re: [Pacemaker] node status does not change even if pacemakerd dies

2012-12-12 Thread Andrew Beekhof
On Wed, Dec 12, 2012 at 8:02 PM, Kazunori INOUE
inouek...@intellilink.co.jp wrote:

 Hi,

 I recognize that pacemakerd is much less likely to crash.
 However, the possibility of it being killed by the OOM killer etc. is not 0%.

True.  Although we just established in another thread that we don't
have any leaks :)

 So I think that a user will get confused, since the behavior at the time
 of a process death differs even though pacemakerd is running in both cases.

 case A)
  When pacemakerd and the other processes (crmd etc.) are in a parent-child
 relation.


[snip]


  For example, crmd died.
  However, since it is relaunched, the state of the cluster is not affected.

Right.

[snip]


 case B)
  When pacemakerd and the other processes are NOT in a parent-child relation.
  Although pacemakerd was killed, assume the state where it has been respawned
 by Upstart.

   $ service corosync start ; service pacemaker start
   $ pkill -9 pacemakerd
   $ ps -ef|egrep 'corosync|pacemaker|UID'
   UID  PID  PPID  C STIME TTY   TIME CMD
   root   21091 1  1 14:52 ? 00:00:00 corosync
   49621099 1  0 14:52 ? 00:00:00 /usr/libexec/pacemaker/cib
   root   21100 1  0 14:52 ? 00:00:00 /usr/libexec/pacemaker/stonithd
   root   21101 1  0 14:52 ? 00:00:00 /usr/libexec/pacemaker/lrmd
   49621102 1  0 14:52 ? 00:00:00 /usr/libexec/pacemaker/attrd
   49621103 1  0 14:52 ? 00:00:00 /usr/libexec/pacemaker/pengine
   49621104 1  0 14:52 ? 00:00:00 /usr/libexec/pacemaker/crmd
   root   21128 1  1 14:53 ? 00:00:00 /usr/sbin/pacemakerd

Yep, looks right.

  In this case, the node will be set to UNCLEAN if crmd dies.
  That is, the node will be fenced if there is a stonith resource.

Which is exactly what happens if only pacemakerd is killed with your proposal.
Except now you have time to do a graceful pacemaker restart to
re-establish the parent-child relationship.

If you want to compare B with something, it needs to be with the old
"children terminate if pacemakerd dies" strategy.
Which is:

   $ service corosync start ; service pacemaker start
   $ pkill -9 pacemakerd
  ... the node will be set to UNCLEAN

Old way: always downtime, because the children terminate, which triggers fencing
Our way: no downtime unless there is an additional failure (of the cib or crmd)

Given that we're trying for HA, the second seems preferable.


   $ pkill -9 crmd
   $ crm_mon -1
   Last updated: Wed Dec 12 14:53:48 2012
   Last change: Wed Dec 12 14:53:10 2012 via crmd on dev2

   Stack: corosync
   Current DC: dev2 (2472913088) - partition with quorum
   Version: 1.1.8-3035414

   2 Nodes configured, unknown expected votes
   0 Resources configured.

   Node dev1 (2506467520): UNCLEAN (online)
   Online: [ dev2 ]


 How about making behavior selectable with an option?

MORE_DOWNTIME_PLEASE=(true|false) ?


 When pacemakerd dies,
 mode A) behaves in the existing way (default);
 mode B) makes the node UNCLEAN.

 Best Regards,
 Kazunori INOUE



 Making stop work when there is no pacemakerd process is a different
 matter. We can make that work.


 Though the best solution is to relaunch pacemakerd, if that is difficult,
 I think that a shortcut is to make the node unclean.


 And now, I tried Upstart a little bit.

 1) started the corosync and pacemaker.

   $ cat /etc/init/pacemaker.conf
   respawn
   script
   [ -f /etc/sysconfig/pacemaker ]  {
   . /etc/sysconfig/pacemaker
   }
   exec /usr/sbin/pacemakerd
   end script

   $ service co start
   Starting Corosync Cluster Engine (corosync):   [  OK  ]
   $ initctl start pacemaker
   pacemaker start/running, process 4702


   $ ps -ef|egrep 'corosync|pacemaker'
   root   4695 1  0 17:21 ?00:00:00 corosync
   root   4702 1  0 17:21 ?00:00:00 /usr/sbin/pacemakerd
   4964703  4702  0 17:21 ?00:00:00 /usr/libexec/pacemaker/cib
   root   4704  4702  0 17:21 ?00:00:00
 /usr/libexec/pacemaker/stonithd
   root   4705  4702  0 17:21 ?00:00:00 /usr/libexec/pacemaker/lrmd
   4964706  4702  0 17:21 ?00:00:00 /usr/libexec/pacemaker/attrd
   4964707  4702  0 17:21 ?00:00:00 /usr/libexec/pacemaker/pengine
   4964708  4702  0 17:21 ?00:00:00 /usr/libexec/pacemaker/crmd

 2) killed pacemakerd.

   $ pkill -9 pacemakerd

   $ ps -ef|egrep 'corosync|pacemaker'
   root   4695 1  0 17:21 ?00:00:01 corosync
   4964703 1  0 17:21 ?00:00:00 /usr/libexec/pacemaker/cib
   root   4704 1  0 17:21 ?00:00:00
 /usr/libexec/pacemaker/stonithd
   root   4705 1  0 17:21 ?00:00:00 /usr/libexec/pacemaker/lrmd
   4964706 1  0 17:21 ?00:00:00 /usr/libexec/pacemaker/attrd
   4964707 1  0 17:21 ?00:00:00 /usr/libexec/pacemaker/pengine
   4964708 1  0 17:21 ?00:00:00 /usr/libexec/pacemaker/crmd
   root   4760 1  1 17:24 ?00:00:00 /usr/sbin/pacemakerd

 3) Then I stopped pacemakerd; however, some processes did not stop.

   $ 

Re: [Pacemaker] Listing resources by attributes

2012-12-12 Thread Andrew Beekhof
On Thu, Dec 13, 2012 at 1:09 AM, pavan tc pavan...@gmail.com wrote:
 Hi,

 Is there a way in which resources can be listed based on some attributes?
 For example, listing resource running on a certain node, or listing ms
 resources.

 The crm_resource manpage talks about the -N and -t options that seem to
 address the requirements above.

Not really.  They're not designed to work with --list or --locate

 But they do not provide the expected result.
 crm_resource --list or crm_resource --list-raw give the same output
 immaterial of whether it was provided with -N or -t.

 I had to do the following to pull out 'ms' resources, for example:
 crm configure show | grep -w ^ms | awk '{print $2}'

 Is there a cleaner way to list resources?

Not really.
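
Still, a slightly cleaner sketch is to query the CIB directly instead of
parsing crm configure show (the xpath, the id extraction, and the node name
node1 are assumptions, not a supported listing interface):

  # list ms (master/slave) resource ids straight from the CIB
  cibadmin -Q --xpath '//master' | grep '<master ' | sed 's/.*id="\([^"]*\)".*/\1/'
  # resources active on a particular node, from the status output
  crm_mon -1 | grep -w node1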


 Thanks,
 Pavan





Re: [Pacemaker] Action from a different CRMD transition results in restarting services

2012-12-12 Thread Andrew Beekhof
On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef
ylatr...@broadviewnet.com wrote:
 Hi,



 I run into the following issue and I couldn’t find what it really means:



 Detected action msgbroker_monitor_1 from a different transition:
 16048 vs. 18014

18014 is where we're up to now, 16048 is the (old) one that scheduled
the recurring monitor operation.
I suspect you'll find the action failed earlier in the logs, and that's
why it needed to be restarted.

Not the best log message though :(
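
To find that earlier failure, something along these lines usually works (the
log path is the usual syslog location and the pattern is only a sketch):

  # look for the original failed monitor that transition 16048 scheduled
  grep 'msgbroker_monitor_1' /var/log/messages | grep -E 'rc=7|not running|failed'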




 I can see that its impact is to stop/start a service but I’d like to
 understand it a bit more.



 Thank you in advance for any information.





 Logs about this issue:

 …

 Dec  6 22:55:05 Node1 crmd: [5235]: info: process_graph_event: Detected
 action msgbroker_monitor_1 from a different transition: 16048 vs. 18014

 Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
 process_graph_event:477 - Triggered transition abort (complete=1,
 tag=lrm_rsc_op, id=msgbroker_monitor_1,
 magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692, cib=0.971.5) :
 Old event

 Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
 failcount for msgbroker on Node0 after failed monitor: rc=7 (update=value++,
 time=1354852505)

 Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State
 transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
 origin=abort_transition_graph ]

 Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2 cluster
 nodes are eligible to run resources.

 Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069:
 Requesting the current CIB: S_POLICY_ENGINE

 Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
 te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
 id=status-Node0-fail-count-msgbroker, magic=NA, cib=0.971.6) : Transient
 attribute: update

 Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070:
 Requesting the current CIB: S_POLICY_ENGINE

 Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
 te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
 id=status-Node0-last-failure-msgbroker, magic=NA, cib=0.971.7) : Transient
 attribute: update

 Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071:
 Requesting the current CIB: S_POLICY_ENGINE

 Dec  6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating hash
 entry for last-failure-msgbroker

 Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback: Invoking
 the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12, quorate=1

 Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss of CCM
 Quorum: Ignore

 Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op: Operation
 txpublisher_monitor_0 found resource txpublisher active on Node1

 Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
 failed op msgbroker_monitor_1 on Node0: not running (7)

 …

 Dec  6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness:
 msgbroker can fail 99 more times on Node0 before being forced off

 …

 Dec  6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp:  Start recurring
 monitor (10s) for msgbroker on Node0

 …

 Dec  6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover msgbroker
 (Started Node0)

 …

 Dec  6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating action
 37: stop msgbroker_stop_0 on Node0





 Transition 18014 details:



 Dec  6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message:
 Transition 18014: PEngine Input stored in:
 /var/lib/pengine/pe-input-3270.bz2

 Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
 transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
 cause=C_IPC_MESSAGE origin=handle_response ]

 Dec  6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked transition
 18014: 0 actions in 0 synapses

 Dec  6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing graph
 18014 (ref=pe_calc-dc-1354852338-39406) derived from
 /var/lib/pengine/pe-input-3270.bz2

 Dec  6 22:52:18 Node1 crmd: [5235]: info: run_graph:
 

 Dec  6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition 18014
 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
 Source=/var/lib/pengine/pe-input-3270.bz2): Complete

 Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition 18014
 is now complete

 Dec  6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition 18014
 status: done - null

 Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
 transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS
 cause=C_FSA_INTERNAL origin=notify_crmd ]

 Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: Starting
 PEngine Recheck Timer





 Youssef







Re: [Pacemaker] gfs2 / dlm on centos 6.2

2012-12-12 Thread Xavier Lashmar
I see, thanks very much for pointing me in the right direction!

Xavier Lashmar
Université d'Ottawa / University of Ottawa
Analyste de Systèmes | Systems Analyst
Service étudiants, service de l'informatique et des communications |
Student services, computing and communications services.
1 Nicholas Street (810)
Ottawa ON K1N 7B7
Tél. | Tel. 613-562-5800 (2120)

From: Andrew Beekhof [and...@beekhof.net]
Sent: Tuesday, December 11, 2012 9:30 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] gfs2 / dlm on centos 6.2



On Wed, Dec 12, 2012 at 1:29 AM, Xavier Lashmar 
xlash...@uottawa.ca wrote:
Hello,

We are attempting to mount gfs2 partitions on CentOS using DRBD + COROSYNC + 
PACEMAKER.  Unfortunately we consistently get the following error:

You'll need to configure pacemaker to use cman for this.
See:
   
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02.html
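
As a rough sketch of what that chapter walks through on CentOS 6 (the cluster
and node names are illustrative; follow the linked document for the
authoritative steps):

  # outline of a CMAN configuration that redirects fencing to pacemaker
  ccs -f /etc/cluster/cluster.conf --createcluster webcluster
  ccs -f /etc/cluster/cluster.conf --addnode node1
  ccs -f /etc/cluster/cluster.conf --addnode node2
  ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
  ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1
  ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node2
  ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1 pcmk-redirect port=node1
  ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node2 pcmk-redirect port=node2
  service cman start && service pacemaker start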


# mount /dev/vg_data/lv_data /webdata/ -t gfs2 -v
mount /dev/dm-2 /webdata
parse_opts: opts = rw
  clear flag 1 for rw, flags = 0
parse_opts: flags = 0
parse_opts: extra = 
parse_opts: hostdata = 
parse_opts: lockproto = 
parse_opts: locktable = 
gfs_controld join connect error: Connection refused
error mounting lockproto lock_dlm

We are trying to find out where to get the lock_dlm libraries and packages for
CentOS 6.2 and 6.3.

Also, I found that the Fedora 17 version of the document “Pacemaker 1.1 -
Clusters from Scratch” is a bit problematic.  I’m also running a Fedora 17
system and found no package “dlm” as per the instructions in section 8.1.1:

yum install -y gfs2-utils dlm kernel-modules-extra

Any idea if an external repository is needed?  If so, which one? And which
package do we need to install for CentOS 6+?

Thanks very much






Xavier Lashmar
Analyste de Systèmes | Systems Analyst
Service étudiants, service de l'informatique et des communications/Student 
services, computing and communications services.
1 Nicholas Street (810)
Ottawa ON K1N 7B7
Tél. | Tel. 613-562-5800 (2120)










Re: [Pacemaker] booth is the state of started on pacemaker before booth write ticket info in cib.

2012-12-12 Thread Yuichi SEINO
Hi Jiaju,

2012/12/12 Jiaju Zhang jjzh...@suse.de:
 On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
 Hi Jiaju,

 Currently, booth is the state of started on pacemaker before booth
 writes ticket information in cib. So, If the old ticket information is
 included in cib, a resource relating to the ticket may start before
 booth resets the ticket. I think that this problem is when to be
 daemon in booth.

 The resouce should not be started before the booth daemon is ready. We
 suggest to configure an ordering constraint for the booth daemon and the
 managed resources by that ticket. That being said, if the ticket is in
 the CIB but booth daemon has not been started, the resources would not
 be started.


The booth RA finishes booth_start once booth has daemonized from the
foreground process (to be exact, a sleep 1 is included). The current
booth daemonizes before catchup, whereas the previous booth daemonized
after catchup; catchup is what writes the ticket into the CIB.
Even if an ordering constraint is set, as shown below, the related
resource can start as soon as booth reaches the Started state in
pacemaker. At that point, the current booth may still not have finished
catchup.

crm_mon excerpt:
...
booth(ocf::pacemaker:booth-site):Started multi-site-a-1
...


 Perhaps,  this problem didn't happen before the following commit.
 https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

 Currently when all of the initialization (including loading the new
 ticket information) finished, booth should be regarded as ready. So if
 you encounter some problem here, I guess we should improve the RA to
 better reflect the booth startup status, but not moving the
 initialization order, since it may introduce other regression as we have
 encountered before;)


I am still not sure which we should fix, the RA or booth.

 Thanks,
 Jiaju


 Sincerely,
 Yuichi





--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
