Re: [ClusterLabs] Antw: [EXT] True time periods/CLOCK_MONOTONIC node vs. cluster wide (Was: Coming in Pacemaker 2.0.4: dependency on monotonic clock for systemd resources)

2020-03-12 Thread Jan Pokorný
On 12/03/20 08:22 +0100, Ulrich Windl wrote:
> Sorry for top-posting, but if you have NTP-synced your nodes,
> CLOCK_MONOTONIC will not have much advantage over CLOCK_REALTIME
> as the clocks will be rather the same,

I guess you mean they will be rather the same amongst the nodes
relative to each other (not MONOTONIC vs. REALTIME, since those
will trivially differ at some point), regardless of which of the
two you use.

How exactly is the same point in time for synchronization going to be
achieved?  Without that, you'll eventually suffer from "clocks not
the same".

How is a failure to NTP-sync going to be detected, and what
consequences will it impose on the cluster?  If you happily continue
regardless, you'll eventually suffer from "clocks not the same" there
as well.

NTP is currently highly recommended as upstream guidance, but
it can easily be out of scope for HA projects ... just consider
the autonomous units Digimer was talking about.  Still, such setups
could at least in theory keep some cluster-private time measurement
sync'd without any exact external calendar anchoring
(i.e., "stop-watching" rather than "calendar-watching").

Also, at the bare-bones level, when REALTIME would never be used
for anything in the cluster (except perhaps logging, for which
"calendar-watching" [see above] may be of practical value), the cluster
would _not_ be interested in calendar-correct time (which is what
NTP is primarily about) --- rather just in measuring periods, hence
in the particular clocks amongst the nodes being reasonably
comparable (ticking roughly synchronously).

> and they won't "jump".
> IMHO the latter is the main reason for using CLOCK_MONOTONIC

yes

> (if the admin

or said NTP client, a conventional change (a "leap second",
for instance), a HW fault (a failing battery, for instance), or anything else

> decides to adjust the real-time clock).  So far the theory. In
> practice the clock jumps, even with NTP, especially if the node had
> been running for a long time, is not updating the RTC, and then
> is fenced. The clock may be off by minutes after boot then, and NTP
> is quite conservative when adjusting the time (first it won't
> believe that the clock is off that far, then after minutes the clock
> will actually jump. That's why some fast pre-ntpd adjustment is
> generally used.) The point is (if you can sync your clocks to
> real-time at all): How long do you want to wait for all your nodes
> to agree on some common time?

Specifically in the NTP plus systemd picture, I think you can order
the startup of some units after an "NTP synchronized" target is reached.
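
(For illustration only -- a sketch assuming a systemd-based node where
systemd-time-wait-sync is available; corosync.service is just an example
unit name, adjust to whatever you actually want to hold back.  Older
systems may ship a similar unit with their NTP daemon instead.)

  # systemd-time-wait-sync.service delays time-sync.target until the kernel
  # reports the clock as synchronized, so units ordered after that target
  # effectively wait for NTP.
  systemctl enable systemd-time-wait-sync.service
  mkdir -p /etc/systemd/system/corosync.service.d
  printf '[Unit]\nWants=time-sync.target\nAfter=time-sync.target\n' \
      > /etc/systemd/system/corosync.service.d/wait-for-ntp.conf
  systemctl daemon-reload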

> Maybe CLOCK_MONOTONIC could help here...

NTP in the calendar-correct sense can be skipped altogether
when the cluster is happy relying just on that (and a MONOTONIC-like
synchronization exists across the nodes, for good measure).

> The other useful application is any sort of timeout or repeat thing
> that should not be affected by adjusting the real-time clock.

That's exactly the point of using it for node-local measurements,
and why I propose that users would need to willingly opt in for
timeout/interval-sensitive cluster stack components to compile at all
where MONOTONIC is not available.

-- 
Jan (Poki)




Re: [ClusterLabs] I want to have some resource monitored and based on that make an acton. Is it possible?

2020-03-12 Thread Ken Gaillot
On Wed, 2020-03-11 at 20:01 +0200, Roman Hershkovich wrote:
> But colocation of dbprobe won't pull trigger of webserver ? Or
> because that it is below in order - it will just restart services ?

I'm not sure what you mean. Resource failures don't cause node fencing
unless you explicitly ask for it with on-fail=fence. Resource failures
don't cause failover to another node unless migration-threshold is set
for the resource (or it fails a million times).

If you want the db monitor failure to cause the web server to restart,
order the web server after the db monitor. Otherwise, the db monitor
failure won't have any effect on the web server.
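
(For example, with hypothetical resource names and pcs syntax roughly as
of pcs 0.9/0.10 -- adjust to your configuration:)

  # Restart the web server whenever the db-monitor resource fails,
  # by ordering the web server after it:
  pcs constraint order dbprobe then webserver

  # Move the web server away after a single failure of the web server itself:
  pcs resource meta webserver migration-threshold=1

  # Keep other resources on the same node as the web server:
  pcs constraint colocation add other_rsc with webserver INFINITY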

> On Wed, Mar 11, 2020, 18:41 Ken Gaillot  wrote:
> > On Wed, 2020-03-11 at 16:08 +0200, Roman Hershkovich wrote:
> > > Great, thank you very much for the explanation. Regarding returning
> > > an error - I did not know that.
> > > So, basically I can have a service that will probe for the master
> > > DB; in case of its transfer, the service will update /etc/hosts and
> > > return an error, which will be caught by pcs, and it will restart
> > > the whole dependent set? Sounds good.
> > > But how can I do 2 "main resources"? I have a webserver AND
> > > db_monitor. In case of failure of the webserver - everything should
> > > start on node b, but in case of a DB change - only the underlying
> > > resources ... Should I make the webserver outside of the set? 
> > 
> > If you want the webserver to move to another node after a single
> > failure (of the webserver itself), set its migration-threshold to
> > 1. If
> > you want other resources to move with it, colocate them with the
> > webserver.
> > 
> > The db monitor won't affect that -- if the db monitor fails,
> > anything
> > ordered after it will restart.
> > 
> > > On Wed, Mar 11, 2020 at 3:57 PM Ken Gaillot  wrote:
> > > > On Wed, 2020-03-11 at 02:27 +0200, Roman Hershkovich wrote:
> > > > > Yes.
> > > > > I have only 1 APP active at same time, and so I want this app to
> > > > > be restarted whenever DB changes. Another one is a "standby" APP,
> > > > > where all resources are shut.
> > > > > So i thought about adding some "service" script, which will probe
> > > > > a DB, and in case if it finds a CHANGE - will trigger pcs to
> > > > > reload a set of resources, where one of resource would be a
> > > > > systemctl file, which will continue to run a script, so in case
> > > > > of next change of DB - it will restart APP set again. Is it
> > > > > sounds reasonable? (i don't care of errors. I mean - i do, i want
> > > > > to log, but i'm ok to see them)
> > > > 
> > > > That sounds fine, but I'd trigger the restart by returning an
> > > > error code from the db-monitoring script, rather than directly
> > > > attempt to restart the resources via pcs. If you order the other
> > > > resources after the db-monitoring script, pacemaker will
> > > > automatically restart them when the db-monitoring script returns
> > > > an error.
> > > > 
> > > > > In addition - i thought maybe bringing PAF here could be useful -
> > > > > but this is even more complex ... 
> > > > 
> > > > If bringing the db into the cluster is a possibility, that would
> > > > probably be more reliable, with a quicker response too.
> > > > 
> > > > In that case you would simply order the dependent resources after
> > > > the database master promotion. pcs example: pcs constraint order
> > > > promote DB-RSC then start DEPENDENT-RSC
> > > > 
> > > > > On Tue, Mar 10, 2020 at 10:28 PM Ken Gaillot <kgail...@redhat.com> wrote:
> > > > > > On Tue, 2020-03-10 at 21:03 +0200, Roman Hershkovich wrote:
> > > > > > > DB servers are not in PCS cluster. Basically you say that i
> > > > > > > need to add them to PCS cluster and then start them? but in
> > > > > > > case if DB1 fails - DB2 autopromoted and not required start
> > > > > > > of service again
> > > > > > > 
> > > > > > > Regarding colocation rule - i'm kind of missing logic how it
> > > > > > > works - how i can "colocate" 1 of 2 APP servers to be around
> > > > > > > a master DB ? 
> > > > > > 
> > > > > > If I understand correctly, what you want is that both apps are
> > > > > > restarted if the master changes?
> > > > > > 
> > > > > > I'm thinking you'll need a custom OCF agent for the app
> > > > > > servers. The monitor action, in addition to checking the app's
> > > > > > status, could also check which db is master, and return an
> > > > > > error if it's changed since the last monitor. (The start action
> > > > > > would have to record the initial master.) Pacemaker will
> > > > > > restart the app to recover from the error.
> > > > > > 
> > > > > > That is a little hacky because you'll have errors in the status
> > > > > > every time the 
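
(A minimal sketch of the approach Ken describes in the quoted reply
above: the monitor remembers which DB was master and reports a failure
when that changes.  Everything here is hypothetical -- it assumes an OCF
shell agent fragment and a check_master helper you would write yourself.)

  #!/bin/sh
  # Hypothetical OCF agent fragment (not a complete agent): start records the
  # current master, monitor re-checks it and returns an error when it changed,
  # so Pacemaker restarts anything ordered after this resource.
  : ${OCF_ROOT:=/usr/lib/ocf}
  . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

  STATEFILE="${HA_RSCTMP}/dbprobe.master"    # HA_RSCTMP comes from ocf-shellfuncs

  dbprobe_start() {
      check_master > "$STATEFILE" || return $OCF_ERR_GENERIC  # check_master: your own probe
      return $OCF_SUCCESS
  }

  dbprobe_monitor() {
      [ -f "$STATEFILE" ] || return $OCF_NOT_RUNNING
      current=$(check_master) || return $OCF_ERR_GENERIC
      if [ "$current" != "$(cat "$STATEFILE")" ]; then
          echo "$current" > "$STATEFILE"     # remember the new master
          return $OCF_ERR_GENERIC            # report "failed" -> dependents restart
      fi
      return $OCF_SUCCESS
  }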

Re: [ClusterLabs] Resource Parameter Change Not Honoring Constraints

2020-03-12 Thread Ken Gaillot
On Wed, 2020-03-11 at 17:24 -0400, Marc Smith wrote:
> Hi,
> 
> I'm using Pacemaker 1.1.20 (yes, I know, a bit dated now). I noticed

I'd still consider that recent :)

> when I modify a resource parameter (eg, update the value), this
> causes
> the resource itself to restart. And that's fine, but when this
> resource is restarted, it doesn't appear to honor the full set of
> constraints for that resource.
> 
> I see the output like this (right after the resource parameter
> change):
> ...
> Mar 11 20:43:25 localhost crmd[1943]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> Mar 11 20:43:25 localhost crmd[1943]:   notice: Current ping state: S_POLICY_ENGINE
> Mar 11 20:43:25 localhost pengine[1942]:   notice: Clearing failure of p_bmd_140c58-1 on 140c58-1 because resource parameters have changed
> Mar 11 20:43:25 localhost pengine[1942]:   notice:  * Restart p_bmd_140c58-1 (   140c58-1 )   due to resource definition change
> Mar 11 20:43:25 localhost pengine[1942]:   notice:  * Restart p_dummy_g_lvm_140c58-1 (   140c58-1 )   due to required g_md_140c58-1 running
> Mar 11 20:43:25 localhost pengine[1942]:   notice:  * Restart p_lvm_140c58_vg_01 (   140c58-1 )   due to required p_dummy_g_lvm_140c58-1 start
> Mar 11 20:43:25 localhost pengine[1942]:   notice: Calculated transition 41, saving inputs in /var/lib/pacemaker/pengine/pe-input-173.bz2
> Mar 11 20:43:25 localhost crmd[1943]:   notice: Initiating stop operation p_lvm_140c58_vg_01_stop_0 on 140c58-1
> Mar 11 20:43:25 localhost crmd[1943]:   notice: Transition aborted by deletion of lrm_rsc_op[@id='p_bmd_140c58-1_last_failure_0']: Resource operation removal
> Mar 11 20:43:25 localhost crmd[1943]:   notice: Current ping state: S_TRANSITION_ENGINE
> ...
> 
> The stop on 'p_lvm_140c58_vg_01' then times out, because the other
> constraint (to stop the service above LVM) is never executed. I can
> see from the messages it never even tries to demote the resource
> above
> that.
> 
> Yet, if I use crmsh at the shell, and do a restart on that same
> resource, it works correctly, and all constraints are honored: crm
> resource restart p_bmd_140c58-1
> 
> I can certainly provide my full cluster config if needed, but hoping
> to keep this email concise for clarity. =)
> 
> I guess my questions are: 1) Is the difference in restart behavior
> expected, and not all constraints are followed when resource
> parameters change (or some other restart event that originated
> internally like this)? 2) Or perhaps this is known bug that was
> already resolved in newer versions of Pacemaker?

No to both. Can you attach that pe-input-173.bz2 file (with any
sensitive info removed)?
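
(Side note for anyone wanting to dig in themselves: such saved inputs can
be replayed locally with crm_simulate, which ships with Pacemaker and reads
the .bz2 directly -- a sketch, using the file name from the log above:)

  # Show what the scheduler decided for that transition, including scores:
  crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-173.bz2 \
               --simulate --show-scores
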
> 
> I searched a bit for #2 but I didn't get many (well any) hits on
> other
> users experiencing this behavior.
> 
> Many thanks in advance.
> 
> --Marc
-- 
Ken Gaillot 



[ClusterLabs] FYI: users being automatically unsubscribed from list and/or not getting messages

2020-03-12 Thread Ken Gaillot
Hi all,

TL;DR if you've been having problems, they hopefully will get better
now


We have gotten some reports of users being automatically unsubscribed
from this list, or not getting all messages from the list.

The issue is that some mail servers have become more strict about what
mail they send and accept. Some servers cryptographically sign outgoing
messages (using "DKIM") so forgeries can be detected. Some servers
reject incoming messages if there's a signature and it doesn't match.

Many mailing lists, including this one, change some of the mail headers
-- most obviously, the subject line gets "[ClusterLabs]" added to the
front, to make list messages easy to filter. Unfortunately this breaks
DKIM signatures. Thus, if someone sends a DKIM-signed message to this
list, some recipients' servers will reject the message. After a certain
number of rejections, this list's server will automatically unsubscribe
the user.

Luckily, most servers that are configured to reject broken DKIM
signatures are also configured to accept the mail anyway if the sending
domain has proper "SPF" records (a DNS-based mechanism for preventing
address spoofing). We have just added SPF records for clusterlabs.org,
so hopefully the situation will improve for users who have been
affected.
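
(For the curious: an SPF policy is just a TXT record on the sending
domain, so it can be checked with any DNS client.  The record shown in
the comment below is a made-up illustration, not the actual
clusterlabs.org policy.)

  # Query the published SPF policy for the list's domain:
  dig +short TXT clusterlabs.org
  # Illustrative output only:
  #   "v=spf1 ip4:192.0.2.10 include:mail.example.net ~all"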

If anyone continues to have problems after this point, please let us
know (either to this list or directly to me).
-- 
Ken Gaillot 



[ClusterLabs] Antw: [EXT] True time periods/CLOCK_MONOTONIC node vs. cluster wide (Was: Coming in Pacemaker 2.0.4: dependency on monotonic clock for systemd resources)

2020-03-12 Thread Ulrich Windl
Hi!

Sorry for top-posting, but if you have NTP-synced your nodes, CLOCK_MONOTONIC
will not have much advantage over CLOCK_REALTIME as the clocks will be rather
the same, and they won't "jump". IMHO the latter is the main reason for using
CLOCK_MONOTONIC (if the admin decides to adjust the real-time clock).
So far the theory. In practice the clock jumps, even with NTP, especially if
the node had been running for a long time, is not updating the RTC, and then
is fenced. The clock may be off by minutes after boot then, and NTP is quite
conservative when adjusting the time (first it won't believe that the clock is
off that far, then after minutes the clock will actually jump. That's why some
fast pre-ntpd adjustment is generally used.)
The point is (if you can sync your clocks to real-time at all): How long do
you want to wait for all your nodes to agree on some common time? Maybe
CLOCK_MONOTONIC could help here...
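
(For reference, the "fast pre-ntpd adjustment" typically looks something
like the following -- which variant applies depends on whether classic
ntpd or chrony is in use, and pool.ntp.org merely stands in for your own
time source:)

  # One-shot step of the clock before the daemon takes over with slow slewing:
  ntpdate -b pool.ntp.org    # classic (deprecated) one-shot step
  # or let the daemon itself make one large initial correction:
  ntpd -g                    # -g: ignore the panic threshold once at startup
  chronyc makestep           # chrony: step the clock now instead of slewing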

The other useful application is any sort of timeout or repeat thing that
should not be affected by adjusting the real-time clock.

Regards,
Ulrich

>>> Jan Pokorný  wrote on 11.03.2020 at 23:43 in message
<17407_1583966641_5e6969b1_17407_22_1_20200311224339.ga...@redhat.com>:
> On 11/03/20 09:04 ‑0500, Ken Gaillot wrote:
>> On Wed, 2020‑03‑11 at 08:20 +0100, Ulrich Windl wrote:
>>> You only have to take care not to compare CLOCK_MONOTONIC
>>> timestamps between nodes or node restarts. 
>> 
>> Definitely :)
>> 
>> They are used only to calculate action queue and run durations
> 
> Both these ... from an isolated perspective of a single node only.
> E.g., run durations related to the one currently responsible to act
> upon the resource in some way (the "atomic" operation is always
> bound to the single host context and when retried or logically
> followed with another operation, it's measured anew on pertaining,
> perhaps different node).
> 
> I feel that's a rather important detail, and just recently this
> surface received some slight scratching on the conceptual level...
> 
> Current inability to synchronize measurements of CLOCK_MONOTONIC
> like notions of time amongst nodes (especially transfer from old,
> possibly failed DC to new DC, likely involving some admitted loss
> of preciseness -- mind you, cluster is never fully synchronous,
> you'd need the help of specialized HW for that) in as lossless
> way as possible is what I believe is the main show stopper for
> being able to accurately express the actual "availability score"
> for given resource or resource group ‑‑‑ yep, that famous number,
> the holy grail of anyone taking HA seriously ‑‑‑ while at the
> same time, something the cluster stack currently cannot readily
> present to users (despite it having all or most of the relevant
> information, just piecewise).
> 
> IOW, this sort of non‑localized measurement is what asks for
> emulation of cluster‑wide CLOCK_MONOTONIC‑like measurement, which is
> not that trivial if you think about it.  Sort of a corollary of what
> Ulrich said, because emulating that pushes you exactly in these waters
> of relating CLOCK_MONOTONIC measurements from different nodes
> together.
> 
>   Not to speak of evaluating whether any node is totally off in its
>   own CLOCK_MONOTONIC measurements and hence shall rather be fenced
>   as "brain damaged", and perhaps even using the measurements of the
>   nodes keeping up together to somehow calculate what's the average
>   rate of measured time progress so as to self‑maintain time‑bound
>   cluster‑wide integrity, which may just as well be important for
>   sbd(!).  (nope, this doesn't get anywhere close to near‑light
>   speed concerns, just imprecise HW and possibly implied/or
>   inter‑VM differences)
> 
> Perhaps the cheapest way out would be to use NTP-level algorithms to
> synchronize two CLOCK_MONOTONIC timers at the point the worker node
> for the resource in question claimed "resource stopped", between this
> worker node and DC, so that the DC can synchronize again like that
> with a new worker node at the point in time when this new claims
> "resource started".  At that point, DC would have a rather accurate
> knowledge of how long this fail‑/move‑over, hence down‑time, lasted,
> hence being able to reflect it to the "availability score" equations.
> 
>   Hmm, no wonder that businesses with deep pockets and serious
>   synchronicity requirements across the globe resort to using atomic
>   clocks, incredibly precise CLOCK_MONOTONIC by default :‑)
> 
>> For most resource types those are optional (for reporting only), but
>> systemd resources require them (multiple status checks are usually
>> necessary to verify a start or stop worked, and we need to check the
>> remaining timeout each time).
> 
> Coincidentally, IIRC systemd alone strictly requires CLOCK_MONOTONIC
> (and we shall get a lot more strict as well to provide reasonable
> expectations to the users as mentioned recently[*]), so said
> requirement is just a logical extension without corner cases.
> 
>