Re: [ClusterLabs] Antw: why is node fenced ?

2019-05-20 Thread Jan Pokorný
On 20/05/19 08:28 +0200, Ulrich Windl wrote:
>> One network interface is gone for a short period. But it's in a
>> bonding device (round-robin), so the connection shouldn't be lost.
>> Both nodes are connected directly, there is no switch in between.
> 
> I think you misunderstood: a round-robin bonding device is not
> fault-safe IMHO, but it depends a lot on your cabling details. Also
> you did not show the logs on the other nodes.

That was sort of my point.  I think that in this case, the
fault tolerance together with TCP's "best effort" makes the case
effectively fault-recoverable (except for some pathological scenarios
perhaps) -- so the whole "value proposition" for that mode can
make false impressions even for when it's not the case ... like
with corosync (since it intentionally opts for unreliable
transport for performance/scalability).

(saying that as someone who has just about a single hands-on
experiment with bonding behind...)

-- 
Jan (Poki)


pgpHjgc9VtFaL.pgp
Description: PGP signature
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Ulrich Windl
What worries me is "Rejecting name for unique".

>>> Kadlecsik József  schrieb am 20.05.2019
um
14:37 in Nachricht :
> On Sun, 19 May 2019, Kadlecsik József wrote:
> 
>> On Sat, 18 May 2019, Kadlecsik József wrote:
>> 
>> > On Sat, 18 May 2019, Kadlecsik József wrote:
>> > 
>> > > On Sat, 18 May 2019, Andrei Borzenkov wrote:
>> > > 
>> > > > 18.05.2019 18:34, Kadlecsik József пишет:
>> > > 
>> > > > > We have a resource agent which creates IP tunnels. In spite of the

>> > > > > configuration setting
>> > > > > 
>> > > > > primitive tunnel-eduroam ocf:local:tunnel \
>> > > > > params 
>> > > > > op start timeout=120s interval=0 \
>> > > > > op stop timeout=300s interval=0 \
>> > > > > op monitor timeout=30s interval=30s depth=0 \
>> > > > > meta target-role=Started
>> > > > > order bifur-eduroam-ipv4-before-tunnel-eduroam \
>> > > > >  Mandatory: bifur-eduroam-ipv4 tunnel-eduroam
>> > > > > colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf: tunnel-eduroam
\
>> > > > >  bifur-eduroam-ipv4:Started
>> > > > > 
>> > > > > the resource is restarted again and again. According to the debug
logs:
>> > > > > 
>> > > > > Parameters to tunnel-eduroam_start_0 on bifur1 changed: was 
>> > > > > 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> f2317cad3d54cec5d7d7aa7d0bf35cf8 
>> > > > > (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-5953b82ebb32
>> > > > 
>> > > > This means that instance attributes changed in this case pacemaker
>> > > > restarts resource to apply new values. Turning on trace level
hopefully
>> > > > will show what exactly is being changed. You can also dump CIB
before
>> > > > and after restart to compare current information.
>> > > 
>> > > The strange thing is that the new value seems never be stored. Just the

>> > > "was-now" part from the log lines:
>> > > 
>> > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> f2317cad3d54cec5d7d7aa7d0bf35cf8
>> > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> f2317cad3d54cec5d7d7aa7d0bf35cf8
>> > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> f2317cad3d54cec5d7d7aa7d0bf35cf8
>> > > ...
>> > > 
>> > > However, after issuing "cibadmin --query --local", the whole flipping 
>> > > stopped! :-) Thanks!
>> > 
>> > No, I was wrong - it still repeats every ~15mins. The diff between two
cib 
>> > xml dumps doesn't say much to me - I'm going to enable tracing.
>> 
>> I have attached the trace file created according to 
>> http://blog.clusterlabs.org/blog/2013/pacemaker-logging.
>> 
>> What looks strange to me is that build_parameter_list() first rejects
>> attributes, then accepts them:
>> 
>> trace   May 18 23:02:49 build_operation_update(787):0: Including additional

> digests for ocf::local:tunnel
>> trace   May 18 23:02:49 build_parameter_list(621):0: Rejecting name for 
> unique
>> trace   May 18 23:02:49 build_parameter_list(614):0: Attr id is unique
>> trace   May 18 23:02:49 build_parameter_list(632):0: Adding attr id=0 to
the 
> xml result
>> trace   May 18 23:02:49 build_parameter_list(621):0: Rejecting src_ip for 
> unique
>> trace   May 18 23:02:49 build_parameter_list(621):0: Rejecting dst_ip for 
> unique
>> ...
>> trace   May 18 23:02:49 calculate_xml_digest_v1(71):0: Sorting xml...
>> trace   May 18 23:02:49 calculate_xml_digest_v1(73):0: Done
>> trace   May 18 23:02:49 crm_md5sum(2102):0: Beginning digest of 22 bytes
>> trace   May 18 23:02:49 crm_md5sum(2110):0: Digest 
> 94afff0ff7cfc62f7cb1d5bf5b4d83aa.
>> 
>> and then:
>> 
>> trace   May 18 23:02:49 calculate_xml_digest_v1(83):0: digest:source   
> 
>> trace   May 18 23:02:49 append_restart_list(693):0: tunnel-eduroam: 
> 94afff0ff7cfc62f7cb1d5bf5b4d83aa,  id 
>> trace   May 18 23:02:49 append_restart_list(694):0: restart digest source  

> 
>> trace   May 18 23:02:49 build_parameter_list(621):0: Rejecting name for 
> private
>> trace   May 18 23:02:49 build_parameter_list(625):0: Inverting name match 
> for private xml
>> trace   May 18 23:02:49 build_parameter_list(632):0: Adding attr 
> name=eduroam IPv4 tunnel to the xml result
>> trace   May 18 23:02:49 build_parameter_list(621):0: Rejecting id for 
> private
>> trace   May 18 23:02:49 build_parameter_list(625):0: Inverting id match for

> private xml
>> trace   May 18 23:02:49 build_parameter_list(632):0: Adding attr id=0 to
the 
> xml result
>> 
>> 
>> By the way, it's debian stretch with pacemaker 1.1.16-1.
> 
> I have double and triple checked the agent and it seems just a normal, 
> working agent.
> 
> The agent accepts the reload operation, it is advertised in the actions 
> section of its metadata, there are parameters with unique set to 0 and 
> still stop/start is called instead of reload. (I could even live with 
> reload instead of start/stop in every 15 mins).
> 
> As a desperate attempt, I deleted the resource and re-added and it of 
> course did not help.
> 
> I also created the attached trace file during creating the resource in the 
> hope that it could help find

Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Ken Gaillot
On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote:
> What worries me is "Rejecting name for unique".

Trace messages are often not user-friendly. The rejecting/accepting is
nothing to be concerned about; it just refers to which parameters are
being used to calculate that particular hash.

Pacemaker calculates up to three hashes.

The first is a hash of all the resource parameters, to detect if
anything changed; this is stored as "op-digest" in the CIB status
entries.

If the resource is reloadable, another hash is calculated with just the
parameters marked as unique=1 (which means they can't be reloaded). Any
change in these parameters requires a full restart. This one is "op-
restart-digest".

Finally, if the resource has sensitive parameters like passwords, a
hash of everything but those parameters is stored as "op-secure-
digest". This one is only used when simulating CIBs grabbed from
cluster reports, which have sensitive info scrubbed.

From what's described here, the op-restart-digest is changing every
time, which means something is going wrong in the hash comparison
(since the definition is not really changing).

The log that stands out to me is:

trace   May 18 23:02:49 calculate_xml_digest_v1(83):0: digest:source   



The id is the resource name, which isn't "0". That leads me to:

trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars: 


which is the likely source of the problem. "id" is a pacemaker
property, not an OCF resource parameter. It shouldn't be in the
resource agent meta-data. Remove that, and I bet it will be OK.

BTW the "every 15 minutes" would be the cluster-recheck-interval
cluster property.

> > > > Kadlecsik József  schrieb am
> > > > 20.05.2019
> 
> um
> 14:37 in Nachricht <
> alpine.deb.2.20.1905201428050.24...@blackhole.kfki.hu>:
> > On Sun, 19 May 2019, Kadlecsik József wrote:
> > 
> > > On Sat, 18 May 2019, Kadlecsik József wrote:
> > > 
> > > > On Sat, 18 May 2019, Kadlecsik József wrote:
> > > > 
> > > > > On Sat, 18 May 2019, Andrei Borzenkov wrote:
> > > > > 
> > > > > > 18.05.2019 18:34, Kadlecsik József пишет:
> > > > > > > We have a resource agent which creates IP tunnels. In
> > > > > > > spite of the
> > > > > > > configuration setting
> > > > > > > 
> > > > > > > primitive tunnel-eduroam ocf:local:tunnel \
> > > > > > > params 
> > > > > > > op start timeout=120s interval=0 \
> > > > > > > op stop timeout=300s interval=0 \
> > > > > > > op monitor timeout=30s interval=30s depth=0 \
> > > > > > > meta target-role=Started
> > > > > > > order bifur-eduroam-ipv4-before-tunnel-eduroam \
> > > > > > >   Mandatory: bifur-eduroam-ipv4 tunnel-eduroam
> > > > > > > colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf:
> > > > > > > tunnel-eduroam
> 
> \
> > > > > > >   bifur-eduroam-ipv4:Started
> > > > > > > 
> > > > > > > the resource is restarted again and again. According to
> > > > > > > the debug
> 
> logs:
> > > > > > > 
> > > > > > > Parameters to tunnel-eduroam_start_0 on bifur1
> > > > > > > changed: was 
> > > > > > > 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8 
> > > > > > > (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-
> > > > > > > 5953b82ebb32
> > > > > > 
> > > > > > This means that instance attributes changed in this case
> > > > > > pacemaker
> > > > > > restarts resource to apply new values. Turning on trace
> > > > > > level
> 
> hopefully
> > > > > > will show what exactly is being changed. You can also dump
> > > > > > CIB
> 
> before
> > > > > > and after restart to compare current information.
> > > > > 
> > > > > The strange thing is that the new value seems never be
> > > > > stored. Just the
> > > > > "was-now" part from the log lines:
> > > > > 
> > > > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8
> > > > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8
> > > > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8
> > > > > ...
> > > > > 
> > > > > However, after issuing "cibadmin --query --local", the whole
> > > > > flipping 
> > > > > stopped! :-) Thanks!
> > > > 
> > > > No, I was wrong - it still repeats every ~15mins. The diff
> > > > between two
> 
> cib 
> > > > xml dumps doesn't say much to me - I'm going to enable tracing.
> > > 
> > > I have attached the trace file created according to 
> > > http://blog.clusterlabs.org/blog/2013/pacemaker-logging.
> > > 
> > > What looks strange to me is that build_parameter_list() first
> > > rejects
> > > attributes, then accepts them:
> > > 
> > > trace   May 18 23:02:49 build_operation_update(787):0: Including
> > > additional
> > digests for ocf::local:tunnel
> > > trace   May 18 23:02:49 build_parameter_list(621):0: Rejecting
> > > name for 
> > 
> > unique
> > > trace   May 18 23:02:49 build_parameter_list(614):0: Attr id is
> > > unique
> > > trace   May 18 23:02:49 bui

Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Kadlecsik József
Hi,

On Mon, 20 May 2019, Ken Gaillot wrote:

> On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote:
> > What worries me is "Rejecting name for unique".
> 
> Trace messages are often not user-friendly. The rejecting/accepting is 
> nothing to be concerned about; it just refers to which parameters are 
> being used to calculate that particular hash.
>
> Pacemaker calculates up to three hashes.
> 
> The first is a hash of all the resource parameters, to detect if
> anything changed; this is stored as "op-digest" in the CIB status
> entries.
> 
> If the resource is reloadable, another hash is calculated with just the
> parameters marked as unique=1 (which means they can't be reloaded). Any
> change in these parameters requires a full restart. This one is "op-
> restart-digest".
> 
> Finally, if the resource has sensitive parameters like passwords, a
> hash of everything but those parameters is stored as "op-secure-
> digest". This one is only used when simulating CIBs grabbed from
> cluster reports, which have sensitive info scrubbed.

Thanks for the explanation! It seemed very cryptic in the trace messages 
that different hashes were calculated with differen parameter lists.
 
> From what's described here, the op-restart-digest is changing every
> time, which means something is going wrong in the hash comparison
> (since the definition is not really changing).
> 
> The log that stands out to me is:
> 
> trace   May 18 23:02:49 calculate_xml_digest_v1(83):0: digest:source   
> 
> 
> The id is the resource name, which isn't "0". That leads me to:
> 
> trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars:  name="id" unique="1" required="1">
> 
> which is the likely source of the problem. "id" is a pacemaker property, 
> not an OCF resource parameter. It shouldn't be in the resource agent 
> meta-data. Remove that, and I bet it will be OK.

I renamed the parameter to "tunnel_id", redefined the resources and 
started them again.
 
> BTW the "every 15 minutes" would be the cluster-recheck-interval
> cluster property.

I have waited more than half an hour and there are no more 
stopping/starting of the resources. :-) I haven't thought that "id" is 
reserved as parameter name.

Thank you!

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Ken Gaillot
On Mon, 2019-05-20 at 23:15 +0200, Kadlecsik József wrote:
> Hi,
> 
> On Mon, 20 May 2019, Ken Gaillot wrote:
> 
> > On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote:
> > > What worries me is "Rejecting name for unique".
> > 
> > Trace messages are often not user-friendly. The rejecting/accepting
> > is 
> > nothing to be concerned about; it just refers to which parameters
> > are 
> > being used to calculate that particular hash.
> > 
> > Pacemaker calculates up to three hashes.
> > 
> > The first is a hash of all the resource parameters, to detect if
> > anything changed; this is stored as "op-digest" in the CIB status
> > entries.
> > 
> > If the resource is reloadable, another hash is calculated with just
> > the
> > parameters marked as unique=1 (which means they can't be reloaded).
> > Any
> > change in these parameters requires a full restart. This one is
> > "op-
> > restart-digest".
> > 
> > Finally, if the resource has sensitive parameters like passwords, a
> > hash of everything but those parameters is stored as "op-secure-
> > digest". This one is only used when simulating CIBs grabbed from
> > cluster reports, which have sensitive info scrubbed.
> 
> Thanks for the explanation! It seemed very cryptic in the trace
> messages 
> that different hashes were calculated with differen parameter lists.
>  
> > From what's described here, the op-restart-digest is changing every
> > time, which means something is going wrong in the hash comparison
> > (since the definition is not really changing).
> > 
> > The log that stands out to me is:
> > 
> > trace   May 18 23:02:49 calculate_xml_digest_v1(83):0:
> > digest:source   
> > 
> > The id is the resource name, which isn't "0". That leads me to:
> > 
> > trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars:
> > 
> > 
> > which is the likely source of the problem. "id" is a pacemaker
> > property, 
> > not an OCF resource parameter. It shouldn't be in the resource
> > agent 
> > meta-data. Remove that, and I bet it will be OK.
> 
> I renamed the parameter to "tunnel_id", redefined the resources and 
> started them again.
>  
> > BTW the "every 15 minutes" would be the cluster-recheck-interval
> > cluster property.
> 
> I have waited more than half an hour and there are no more 
> stopping/starting of the resources. :-) I haven't thought that "id"
> is 
> reserved as parameter name.

It isn't, by the OCF standard. :) This could be considered a pacemaker
bug; pacemaker should be able to distinguish its own "id" from an OCF
parameter "id", but it currently can't.

> 
> Thank you!
> 
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics, Hungarian Academy of
> Sciences
>  H-1525 Budapest 114, POB. 49, Hungary
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/