Re: [ClusterLabs] interesting blog on Pacemaker-related outage

2017-12-07 Thread Andrei Borzenkov
07.12.2017 15:13, Adam Spiers wrote:
> https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
> 
> 
> It's a great write-up, although a little frustrating that it is still
> not fully understood why a -inf colocation failed whereas a +inf
> succeeded.  (I actually have a vague memory of discovering something
> very similar a while back, but I can't find the details.)
> 

According to the only information we have (I can hardly call it
documentation) about how colocation constraints (are supposed to) work,
colocation on a master/slave resource only affects the order in which
nodes are chosen for promotion, not the promotion decision itself. So
yes, I'd love to see the relevant piece of the Pacemaker configuration.
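
For anyone who has not looked at one, a colocation constraint against a
master/slave resource is a single rsc_colocation element in the CIB. The
sketch below only builds such an element so the pieces are visible. The
resource names "backup-vip" and "pgsql-ms" are hypothetical and are not
taken from the configuration discussed in the blog post; the attribute
names (rsc, with-rsc, with-rsc-role, score) are the ones Pacemaker 1.1
uses.

```python
import xml.etree.ElementTree as ET


def colocation_xml(rsc, with_rsc, score, with_rsc_role=None):
    """Build a rsc_colocation element like those in the CIB constraints section."""
    attrs = {
        "id": f"colocation-{rsc}-{with_rsc}",  # hypothetical id scheme
        "rsc": rsc,
        "with-rsc": with_rsc,
        "score": score,
    }
    # Restrict the constraint to instances in a given role, e.g. "Master".
    if with_rsc_role is not None:
        attrs["with-rsc-role"] = with_rsc_role
    return ET.tostring(ET.Element("rsc_colocation", attrs), encoding="unicode")


# Hypothetical example: keep a VIP away from wherever the master runs.
print(colocation_xml("backup-vip", "pgsql-ms", "-INFINITY", with_rsc_role="Master"))
```

Swapping the score to "INFINITY" gives the +inf variant, so the two cases
from the blog post differ in only that one attribute.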


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] low-cost ways to make Pacemaker more usable?

2017-12-07 Thread Adam Spiers
Ken Gaillot wrote:
> On Thu, 2017-12-07 at 17:15 +, Adam Spiers wrote:
> > For example, making a few of the most crucial existing log messages
> > less cryptic could maybe go a long way.  Or if "dumbing down" log
> > messages would make life harder for developers who are familiar
> > with Pacemaker internals and need to be able to track all the gory
> > details, recognise the fact that the kind of logs which developers
> > and users need to read are vastly different, and consequently
> > provide a way of distinguishing between the two kinds.  Making all
> > developer logs DEBUG level and non-developer other levels might be
> > one way, but there are probably better approaches (e.g. tag all
> > developer logs with a certain string which can be filtered out).
>
> You're late to the party on this one :)
>
> We do try to keep all messages of interest to novice users at the
> critical-to-notice levels (which go to syslog by default), messages of
> interest to more advanced users at the info level and to developers at
> the debug-to-trace levels (which go to pacemaker.log by default).
>
> There was a big push a few releases back to improve the wording of the
> most user-visible log messages. You should have seen them before. ;)
>
> In a 2015 release (libqb + pacemaker), we added support for a single
> message to go into both syslog and pacemaker.log with different levels
> of detail. The syslog message has plain English for users, and the
> pacemaker.log message has added debugging information tacked onto the
> end. For an example, see the pacemakerd "Starting Pacemaker" message in
> each log.

Hrm, I hadn't noticed that - I wonder if it's because I mainly
work with enterprise products on a long lifecycle, so maybe I didn't
experience that release yet ...

> This is definitely ongoing, and it would be really helpful to have
> examples of particular messages of how they are now vs what they should
> say.

OK, I'll bear that in mind.

> > Another simple change would be to adopt a policy that rather than
> > sharing information on this list in response to questions which
> > arise, add the answers to the documentation and then just give a
> > short reply to the list saying "here's the link to the
> > documentation I just updated".  I'm sure that the archives of this
> > list are an absolute gold mine of useful information, but list
> > archives make for really poor documentation ...
>
> Agreed in principle, but again it goes back to time. Better a mailing
> list post than something at the end of the to-do list.

Absolutely; I wasn't suggesting collecting yet more to-do items for
later.

> (Wiki edits don't take much more time than mailing lists, so I could
> see taking more advantage of that.)

Yep, that's the clearer version of what I was trying to say ;-)



Re: [ClusterLabs] low-cost ways to make Pacemaker more usable?

2017-12-07 Thread Ken Gaillot
On Thu, 2017-12-07 at 17:15 +, Adam Spiers wrote:
> Ken Gaillot  wrote:
> > On Thu, 2017-12-07 at 12:13 +, Adam Spiers wrote:
> > > https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
> > >
> > > It's a great write-up, although a little frustrating that it is
> > > still not fully understood why a -inf colocation failed whereas a
> > > +inf succeeded.  (I actually have a vague memory of discovering
> > > something very similar a while back, but I can't find the details.)
> > 
> > That is an excellent post. I'll contact them directly to discuss it
> > further.
> 
> Cool, thanks!
> 
> > > IMHO this serves as a good example of the difficulty Pacemaker
> > > faces, and consequently as valuable feedback for how Pacemaker
> > > needs to improve: it's all too easy to do one tiny
> > > misconfiguration which can potentially bring the whole house of
> > > cards tumbling down, and it's often really hard to understand
> > > what went wrong.
> > >
> > > So FWIW, my personal view is that more than anything else right
> > > now, Pacemaker needs to be made easier to understand.  I know
> > > this is a
> > 
> > Agreed, but there are about a dozen things that are more important
> > than
> > anything else right now ;)
> 
> Heheh yeah, I can relate to that feeling ;-)
> 
> > Personally, my current focus is technical debt: stripping out all
> > the legacy features that were deprecated in 1.1.18, so we can
> > release 2.0.0 with a smaller code base that is easier to maintain
> > going forward. The hope is that this pays off in greater time
> > savings down the road, but it sucks up a lot of time in the near
> > term.
> >
> > There are a large number of outstanding bug reports that bother me,
> > several of them quite serious, and I would like to spend more time
> > on those before new features, but ...
> >
> > There is constant demand for new features from paying customers,
> > and we can't stay relevant without trying to keep up at least to an
> > extent. Several recent projects (bundles, alerts, versioned
> > attributes) could really benefit from some follow-up work, and more
> > major projects are right on the horizon (failure handling
> > configuration overhaul, crm_mon overhaul, containerization of
> > pacemaker/corosync, corosync 3/knet compatibility).
> >
> > And of course usability is, indeed, an incredibly important area to
> > be addressed, spanning log messages, documentation, and tooling.
> 
> Yep, totally understood.
> 
> > Which is to say, volunteers welcome :-)
> 
> ... which is the cue for everyone to run away, leaving tumbleweed
> silence ;-)
> 
> Seriously though, I acknowledge the lack of resources, so maybe just
> aim for a few small steps forward here and there?
> 
> For example, making a few of the most crucial existing log messages
> less cryptic could maybe go a long way.  Or if "dumbing down" log
> messages would make life harder for developers who are familiar with
> Pacemaker internals and need to be able to track all the gory
> details, recognise the fact that the kind of logs which developers
> and users need to read are vastly different, and consequently provide
> a way of distinguishing between the two kinds.  Making all developer
> logs DEBUG level and non-developer other levels might be one way, but
> there are probably better approaches (e.g. tag all developer logs
> with a certain string which can be filtered out).

You're late to the party on this one :)

We do try to keep all messages of interest to novice users at the
critical-to-notice levels (which go to syslog by default), messages of
interest to more advanced users at the info level and to developers at
the debug-to-trace levels (which go to pacemaker.log by default).

There was a big push a few releases back to improve the wording of the
most user-visible log messages. You should have seen them before. ;)

In a 2015 release (libqb + pacemaker), we added support for a single
message to go into both syslog and pacemaker.log with different levels
of detail. The syslog message has plain English for users, and the
pacemaker.log message has added debugging information tacked onto the
end. For an example, see the pacemakerd "Starting Pacemaker" message in
each log.
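
To make the idea concrete, here is a rough Python sketch of one message
going to two destinations with different levels of detail. It is only an
illustration using the standard logging module, not the libqb or
Pacemaker implementation. The NOTICE level number, the "detail" field,
and the sample debug text are all assumptions made up for the example,
and the two StringIO buffers merely stand in for syslog and
pacemaker.log.

```python
import io
import logging

# Python has no NOTICE level; 25 sits between INFO (20) and WARNING (30).
NOTICE = 25
logging.addLevelName(NOTICE, "NOTICE")


class DetailFormatter(logging.Formatter):
    """Append the optional developer detail to the plain-English message."""

    def format(self, record):
        text = super().format(record)
        detail = getattr(record, "detail", None)
        return f"{text} | {detail}" if detail else text


def build_logger(user_stream, detail_stream):
    logger = logging.getLogger("demo")
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    user = logging.StreamHandler(user_stream)  # stands in for syslog
    user.setLevel(NOTICE)  # notice and above only
    user.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

    detail = logging.StreamHandler(detail_stream)  # stands in for pacemaker.log
    detail.setLevel(logging.DEBUG)  # everything, with extra detail appended
    detail.setFormatter(DetailFormatter("%(levelname)s: %(message)s"))

    logger.addHandler(user)
    logger.addHandler(detail)
    return logger


user_log, detail_log = io.StringIO(), io.StringIO()
log = build_logger(user_log, detail_log)
# One call, two renderings: plain text for users, extra fields for developers.
log.log(NOTICE, "Starting Pacemaker", extra={"detail": "hypothetical debug fields"})
log.debug("internal state dump")  # reaches only the detailed log
```

The user-facing buffer ends up with just the plain "Starting Pacemaker"
line, while the detailed buffer has the same line with the extra text
tacked on, plus the debug message.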

This is definitely ongoing, and it would be really helpful to have
examples of particular messages of how they are now vs what they should
say.

> Another simple change would be to adopt a policy that rather than
> sharing information on this list in response to questions which
> arise, add the answers to the documentation and then just give a
> short reply to the list saying "here's the link to the documentation
> I just updated".  I'm sure that the archives of this list are an
> absolute gold mine of useful information, but list archives make for
> really poor documentation ...

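Agreed in principle, but again it goes back to time. Better a mailing
list post than something at the end of the to-do list.

(Wiki edits don't take much more time than mailing lists, so I could
see taking more advantage of that.)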

[ClusterLabs] low-cost ways to make Pacemaker more usable?

2017-12-07 Thread Adam Spiers

Ken Gaillot wrote:
> On Thu, 2017-12-07 at 12:13 +, Adam Spiers wrote:
> > https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
> >
> > It's a great write-up, although a little frustrating that it is still
> > not fully understood why a -inf colocation failed whereas a +inf
> > succeeded.  (I actually have a vague memory of discovering something
> > very similar a while back, but I can't find the details.)
>
> That is an excellent post. I'll contact them directly to discuss it
> further.

Cool, thanks!

> > IMHO this serves as a good example of the difficulty Pacemaker faces,
> > and consequently as valuable feedback for how Pacemaker needs to
> > improve: it's all too easy to do one tiny misconfiguration which can
> > potentially bring the whole house of cards tumbling down, and it's
> > often really hard to understand what went wrong.
> >
> > So FWIW, my personal view is that more than anything else right now,
> > Pacemaker needs to be made easier to understand.  I know this is a
>
> Agreed, but there are about a dozen things that are more important than
> anything else right now ;)

Heheh yeah, I can relate to that feeling ;-)

> Personally, my current focus is technical debt: stripping out all the
> legacy features that were deprecated in 1.1.18, so we can release 2.0.0
> with a smaller code base that is easier to maintain going forward. The
> hope is that this pays off in greater time savings down the road, but
> it sucks up a lot of time in the near term.
>
> There are a large number of outstanding bug reports that bother me,
> several of them quite serious, and I would like to spend more time on
> those before new features, but ...
>
> There is constant demand for new features from paying customers, and we
> can't stay relevant without trying to keep up at least to an extent.
> Several recent projects (bundles, alerts, versioned attributes) could
> really benefit from some follow-up work, and more major projects are
> right on the horizon (failure handling configuration overhaul, crm_mon
> overhaul, containerization of pacemaker/corosync, corosync 3/knet
> compatibility).
>
> And of course usability is, indeed, an incredibly important area to be
> addressed, spanning log messages, documentation, and tooling.

Yep, totally understood.

> Which is to say, volunteers welcome :-)


... which is the cue for everyone to run away, leaving tumbleweed
silence ;-)

Seriously though, I acknowledge the lack of resources, so maybe just
aim for a few small steps forward here and there?

For example, making a few of the most crucial existing log messages
less cryptic could maybe go a long way.  Or if "dumbing down" log
messages would make life harder for developers who are familiar with
Pacemaker internals and need to be able to track all the gory details,
recognise the fact that the kind of logs which developers and users
need to read are vastly different, and consequently provide a way of
distinguishing between the two kinds.  Making all developer logs DEBUG
level and non-developer other levels might be one way, but there are
probably better approaches (e.g. tag all developer logs with a certain
string which can be filtered out).

Another simple change would be to adopt a policy that rather than
sharing information on this list in response to questions which arise,
add the answers to the documentation and then just give a short reply
to the list saying "here's the link to the documentation I just
updated".  I'm sure that the archives of this list are an absolute
gold mine of useful information, but list archives make for really
poor documentation ...

And BTW, lest I come across as a constant whinger ... I think you're
doing an absolutely fantastic job as maintainer! ;-)



Re: [ClusterLabs] interesting blog on Pacemaker-related outage

2017-12-07 Thread Ken Gaillot
On Thu, 2017-12-07 at 12:13 +, Adam Spiers wrote:
> https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
> 
> It's a great write-up, although a little frustrating that it is still
> not fully understood why a -inf colocation failed whereas a +inf
> succeeded.  (I actually have a vague memory of discovering something
> very similar a while back, but I can't find the details.)

That is an excellent post. I'll contact them directly to discuss it
further.

> IMHO this serves as a good example of the difficulty Pacemaker faces,
> and consequently as valuable feedback for how Pacemaker needs to
> improve: it's all too easy to do one tiny misconfiguration which can
> potentially bring the whole house of cards tumbling down, and it's
> often really hard to understand what went wrong.
> 
> So FWIW, my personal view is that more than anything else right now,
> Pacemaker needs to be made easier to understand.  I know this is a 

Agreed, but there are about a dozen things that are more important than
anything else right now ;)

Personally, my current focus is technical debt: stripping out all the
legacy features that were deprecated in 1.1.18, so we can release 2.0.0
with a smaller code base that is easier to maintain going forward. The
hope is that this pays off in greater time savings down the road, but
it sucks up a lot of time in the near term.

There are a large number of outstanding bug reports that bother me,
several of them quite serious, and I would like to spend more time on
those before new features, but ...

There is constant demand for new features from paying customers, and we
can't stay relevant without trying to keep up at least to an extent.
Several recent projects (bundles, alerts, versioned attributes) could
really benefit from some follow-up work, and more major projects are
right on the horizon (failure handling configuration overhaul, crm_mon
overhaul, containerization of pacemaker/corosync, corosync 3/knet
compatibility).

And of course usability is, indeed, an incredibly important area to be
addressed, spanning log messages, documentation, and tooling.

Which is to say, volunteers welcome :-)

> big ask since HA is unavoidably complex, but I'm sure there are
> actionable items which would serve as relatively manageable yet very
> worthwhile steps towards this goal.  I alluded to this during my
> presentation at the Clusterlabs Summit, e.g. see
>
> https://aspiers.github.io/clusterlabs-summit-2017-openstack-ha/#/debugging
> 
> and the following slide.  And in fact I remember some really good
> discussions on this during the summit too, but I'm not sure if they
> led anywhere.
> 
> Hope this feedback is useful!
-- 
Ken Gaillot 



Re: [ClusterLabs] pacemaker drbd resource speed

2017-12-07 Thread Ken Gaillot
On Wed, 2017-12-06 at 15:11 +0200, Vaggelis Papastavros wrote:
> Dear friends,
> we have the following configuration setup:
> 1. Corosync Cluster Engine, version 2.4.0
> 2. Pacemaker 0.9.152 under CentOS 7
> 3. DRBD version 9.0.8
> Everything is configured as a clustered resource and is working fine.
> I need to make some changes to the DRBD resource configuration file;
> the corresponding resource is also a Pacemaker resource
> (i.e., everything is managed by Pacemaker).
> Please advise on the recommended order for making these changes without
> downtime.

It's been a while since I used DRBD, so someone who knows more may
correct me, but for Pacemaker in general, you can make any changes you
want to a cluster by putting it into maintenance mode.

Exact syntax depends on your configuration tool but the result is
setting maintenance-mode=true in cluster properties.

In maintenance mode, pacemaker will still run and report the results of
monitors, but it won't start or stop anything. You can set maintenance
mode, make your changes, wait for all the monitors to come back clean,
then leave maintenance mode.
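
As a rough illustration of the "exact syntax depends on your
configuration tool" point, the sketch below just builds (and prints,
without running) the command for entering and leaving maintenance mode
with three common tools. The helper name maintenance_mode_cmd is made up
for this example, and the command forms are the ones I believe pcs,
crmsh and crm_attribute accept, so check them against the versions you
have installed before relying on them.

```python
import shlex


def maintenance_mode_cmd(enabled, tool="pcs"):
    """Return the argv that sets the maintenance-mode cluster property."""
    value = "true" if enabled else "false"
    commands = {
        "pcs": ["pcs", "property", "set", f"maintenance-mode={value}"],
        "crmsh": ["crm", "configure", "property", f"maintenance-mode={value}"],
        "crm_attribute": [
            "crm_attribute", "--name", "maintenance-mode", "--update", value,
        ],
    }
    return commands[tool]


# Enter maintenance mode, make changes, wait for clean monitors, then leave.
for enabled in (True, False):
    print(shlex.join(maintenance_mode_cmd(enabled)))
```

Whichever tool is used, the end result is the same cluster property
being set to true and later back to false.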

That approach leaves everything else running. If you have dependencies
that will fail if you make DRBD changes, I would set
target-role=Stopped on the DRBD resource in Pacemaker, which will make
Pacemaker stop DRBD and everything that depends on it. Then, you can
make your changes, start DRBD manually to test them, then stop DRBD and
re-enable it in Pacemaker so Pacemaker starts everything again.
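
The sequence above could be sketched roughly as follows. This only
builds the list of commands in order; it does not run anything. The
resource names "drbd-ms" (the Pacemaker resource id) and "r0" (the DRBD
resource) are hypothetical, and re-enabling by deleting the target-role
meta-attribute is an assumption on my part: setting it back to Started
would be another way. Someone with more recent DRBD experience may want
different drbdadm steps for the manual test.

```python
def drbd_change_plan(pcmk_rsc, drbd_res):
    """Ordered commands for the stop / change / test / re-enable approach."""
    return [
        # 1. Have Pacemaker stop DRBD and everything that depends on it.
        ["crm_resource", "--resource", pcmk_rsc, "--meta",
         "--set-parameter", "target-role", "--parameter-value", "Stopped"],
        # (edit the DRBD configuration files here)
        # 2. Bring the DRBD resource up by hand to test the new settings.
        ["drbdadm", "up", drbd_res],
        # 3. Take it down again before handing control back.
        ["drbdadm", "down", drbd_res],
        # 4. Remove target-role so Pacemaker starts everything again.
        ["crm_resource", "--resource", pcmk_rsc, "--meta",
         "--delete-parameter", "target-role"],
    ]


for argv in drbd_change_plan("drbd-ms", "r0"):
    print(" ".join(argv))
```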

> For example, if I change the global_conf of the DRBD configuration,
> what actions do I need to take in Pacemaker in order to reload the
> resource with the updated values?
> 
> Sincerely ,
> Vaggelis Papastavros 
-- 
Ken Gaillot 



[ClusterLabs] interesting blog on Pacemaker-related outage

2017-12-07 Thread Adam Spiers

https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/

It's a great write-up, although a little frustrating that it is still
not fully understood why a -inf colocation failed whereas a +inf
succeeded.  (I actually have a vague memory of discovering something
very similar a while back, but I can't find the details.)

IMHO this serves as a good example of the difficulty Pacemaker faces,
and consequently as valuable feedback for how Pacemaker needs to
improve: it's all too easy to do one tiny misconfiguration which can
potentially bring the whole house of cards tumbling down, and it's
often really hard to understand what went wrong.

So FWIW, my personal view is that more than anything else right now,
Pacemaker needs to be made easier to understand.  I know this is a big
ask since HA is unavoidably complex, but I'm sure there are actionable
items which would serve as relatively manageable yet very worthwhile
steps towards this goal.  I alluded to this during my presentation at
the Clusterlabs Summit, e.g. see

   https://aspiers.github.io/clusterlabs-summit-2017-openstack-ha/#/debugging

and the following slide.  And in fact I remember some really good
discussions on this during the summit too, but I'm not sure if they
led anywhere.

Hope this feedback is useful!
