Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-21 Thread Andrei Borzenkov
21.05.2019 0:46, Ken Gaillot пишет:
>>  
>>> From what's described here, the op-restart-digest is changing every
>>> time, which means something is going wrong in the hash comparison
>>> (since the definition is not really changing).
>>>
>>> The log that stands out to me is:
>>>
>>> trace   May 18 23:02:49 calculate_xml_digest_v1(83):0:
>>> digest:source   
>>>
>>> The id is the resource name, which isn't "0". That leads me to:
>>>
>>> trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars:
>>> 
>>>
>>> which is the likely source of the problem. "id" is a pacemaker
>>> property, 
>>> not an OCF resource parameter. It shouldn't be in the resource
>>> agent 
>>> meta-data. Remove that, and I bet it will be OK.
>>
>> I renamed the parameter to "tunnel_id", redefined the resources and 
>> started them again.
>>  
>>> BTW the "every 15 minutes" would be the cluster-recheck-interval
>>> cluster property.
>>
>> I have waited more than half an hour and there are no more 
>> stopping/starting of the resources. :-) I haven't thought that "id"
>> is 
>> reserved as parameter name.
> 
> It isn't, by the OCF standard. :) This could be considered a pacemaker
> bug; pacemaker should be able to distinguish its own "id" from an OCF
> parameter "id", but it currently can't.
> 


I'm really baffled by this explanation. I tried to create resource with
"id" unique instance property and I do not observe this problem. No
restarts.

As none of traces provided captures of the moment of restart-digest
mismatch I also am not sure where to look. I do not see "id" being
treated anyway specially in the code.

Somewhat interesting is that restart digest source in two traces is
different:

bor@bor-Latitude-E5450:~$ grep -w 'restart digest' /tmp/trace.log*
/tmp/trace.log:trace   May 18 23:02:49 append_restart_list(694):0:
restart digest source   
/tmp/trace.log:trace   May 18 23:02:50 append_restart_list(694):0:
restart digest source   
/tmp/trace.log.2:trace   May 20 13:56:16 append_restart_list(694):0:
restart digest source   
/tmp/trace.log.2:trace   May 20 13:56:17 append_restart_list(694):0:
restart digest source   
/tmp/trace.log.2:trace   May 20 13:56:18 append_restart_list(694):0:
restart digest source   
bor@bor-Latitude-E5450:~$

In one case it does not include "name" parameter. Whether configuration
was changed in between is unknown, we never have seen full RA metadata
in each case nor full resource definition so ...

My hunch is that "id" is red herring and something else has changed when
resource definition was edited. If I'm wrong I appreciate pointer to
code where "id" is mishandled.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Ken Gaillot
On Mon, 2019-05-20 at 23:15 +0200, Kadlecsik József wrote:
> Hi,
> 
> On Mon, 20 May 2019, Ken Gaillot wrote:
> 
> > On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote:
> > > What worries me is "Rejecting name for unique".
> > 
> > Trace messages are often not user-friendly. The rejecting/accepting
> > is 
> > nothing to be concerned about; it just refers to which parameters
> > are 
> > being used to calculate that particular hash.
> > 
> > Pacemaker calculates up to three hashes.
> > 
> > The first is a hash of all the resource parameters, to detect if
> > anything changed; this is stored as "op-digest" in the CIB status
> > entries.
> > 
> > If the resource is reloadable, another hash is calculated with just
> > the
> > parameters marked as unique=1 (which means they can't be reloaded).
> > Any
> > change in these parameters requires a full restart. This one is
> > "op-
> > restart-digest".
> > 
> > Finally, if the resource has sensitive parameters like passwords, a
> > hash of everything but those parameters is stored as "op-secure-
> > digest". This one is only used when simulating CIBs grabbed from
> > cluster reports, which have sensitive info scrubbed.
> 
> Thanks for the explanation! It seemed very cryptic in the trace
> messages 
> that different hashes were calculated with differen parameter lists.
>  
> > From what's described here, the op-restart-digest is changing every
> > time, which means something is going wrong in the hash comparison
> > (since the definition is not really changing).
> > 
> > The log that stands out to me is:
> > 
> > trace   May 18 23:02:49 calculate_xml_digest_v1(83):0:
> > digest:source   
> > 
> > The id is the resource name, which isn't "0". That leads me to:
> > 
> > trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars:
> > 
> > 
> > which is the likely source of the problem. "id" is a pacemaker
> > property, 
> > not an OCF resource parameter. It shouldn't be in the resource
> > agent 
> > meta-data. Remove that, and I bet it will be OK.
> 
> I renamed the parameter to "tunnel_id", redefined the resources and 
> started them again.
>  
> > BTW the "every 15 minutes" would be the cluster-recheck-interval
> > cluster property.
> 
> I have waited more than half an hour and there are no more 
> stopping/starting of the resources. :-) I haven't thought that "id"
> is 
> reserved as parameter name.

It isn't, by the OCF standard. :) This could be considered a pacemaker
bug; pacemaker should be able to distinguish its own "id" from an OCF
parameter "id", but it currently can't.

> 
> Thank you!
> 
> Best regards,
> Jozsef
> --
> E-mail : kadlecsik.joz...@wigner.mta.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: Wigner Research Centre for Physics, Hungarian Academy of
> Sciences
>  H-1525 Budapest 114, POB. 49, Hungary
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Kadlecsik József
Hi,

On Mon, 20 May 2019, Ken Gaillot wrote:

> On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote:
> > What worries me is "Rejecting name for unique".
> 
> Trace messages are often not user-friendly. The rejecting/accepting is 
> nothing to be concerned about; it just refers to which parameters are 
> being used to calculate that particular hash.
>
> Pacemaker calculates up to three hashes.
> 
> The first is a hash of all the resource parameters, to detect if
> anything changed; this is stored as "op-digest" in the CIB status
> entries.
> 
> If the resource is reloadable, another hash is calculated with just the
> parameters marked as unique=1 (which means they can't be reloaded). Any
> change in these parameters requires a full restart. This one is "op-
> restart-digest".
> 
> Finally, if the resource has sensitive parameters like passwords, a
> hash of everything but those parameters is stored as "op-secure-
> digest". This one is only used when simulating CIBs grabbed from
> cluster reports, which have sensitive info scrubbed.

Thanks for the explanation! It seemed very cryptic in the trace messages 
that different hashes were calculated with differen parameter lists.
 
> From what's described here, the op-restart-digest is changing every
> time, which means something is going wrong in the hash comparison
> (since the definition is not really changing).
> 
> The log that stands out to me is:
> 
> trace   May 18 23:02:49 calculate_xml_digest_v1(83):0: digest:source   
> 
> 
> The id is the resource name, which isn't "0". That leads me to:
> 
> trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars:  name="id" unique="1" required="1">
> 
> which is the likely source of the problem. "id" is a pacemaker property, 
> not an OCF resource parameter. It shouldn't be in the resource agent 
> meta-data. Remove that, and I bet it will be OK.

I renamed the parameter to "tunnel_id", redefined the resources and 
started them again.
 
> BTW the "every 15 minutes" would be the cluster-recheck-interval
> cluster property.

I have waited more than half an hour and there are no more 
stopping/starting of the resources. :-) I haven't thought that "id" is 
reserved as parameter name.

Thank you!

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Ken Gaillot
On Mon, 2019-05-20 at 15:29 +0200, Ulrich Windl wrote:
> What worries me is "Rejecting name for unique".

Trace messages are often not user-friendly. The rejecting/accepting is
nothing to be concerned about; it just refers to which parameters are
being used to calculate that particular hash.

Pacemaker calculates up to three hashes.

The first is a hash of all the resource parameters, to detect if
anything changed; this is stored as "op-digest" in the CIB status
entries.

If the resource is reloadable, another hash is calculated with just the
parameters marked as unique=1 (which means they can't be reloaded). Any
change in these parameters requires a full restart. This one is "op-
restart-digest".

Finally, if the resource has sensitive parameters like passwords, a
hash of everything but those parameters is stored as "op-secure-
digest". This one is only used when simulating CIBs grabbed from
cluster reports, which have sensitive info scrubbed.

From what's described here, the op-restart-digest is changing every
time, which means something is going wrong in the hash comparison
(since the definition is not really changing).

The log that stands out to me is:

trace   May 18 23:02:49 calculate_xml_digest_v1(83):0: digest:source   



The id is the resource name, which isn't "0". That leads me to:

trace   May 18 23:02:49 svc_read_output(87):0: Got 499 chars: 


which is the likely source of the problem. "id" is a pacemaker
property, not an OCF resource parameter. It shouldn't be in the
resource agent meta-data. Remove that, and I bet it will be OK.

BTW the "every 15 minutes" would be the cluster-recheck-interval
cluster property.

> > > > Kadlecsik József  schrieb am
> > > > 20.05.2019
> 
> um
> 14:37 in Nachricht <
> alpine.deb.2.20.1905201428050.24...@blackhole.kfki.hu>:
> > On Sun, 19 May 2019, Kadlecsik József wrote:
> > 
> > > On Sat, 18 May 2019, Kadlecsik József wrote:
> > > 
> > > > On Sat, 18 May 2019, Kadlecsik József wrote:
> > > > 
> > > > > On Sat, 18 May 2019, Andrei Borzenkov wrote:
> > > > > 
> > > > > > 18.05.2019 18:34, Kadlecsik József пишет:
> > > > > > > We have a resource agent which creates IP tunnels. In
> > > > > > > spite of the
> > > > > > > configuration setting
> > > > > > > 
> > > > > > > primitive tunnel-eduroam ocf:local:tunnel \
> > > > > > > params 
> > > > > > > op start timeout=120s interval=0 \
> > > > > > > op stop timeout=300s interval=0 \
> > > > > > > op monitor timeout=30s interval=30s depth=0 \
> > > > > > > meta target-role=Started
> > > > > > > order bifur-eduroam-ipv4-before-tunnel-eduroam \
> > > > > > >   Mandatory: bifur-eduroam-ipv4 tunnel-eduroam
> > > > > > > colocation tunnel-eduroam-on-bifur-eduroam-ipv4 inf:
> > > > > > > tunnel-eduroam
> 
> \
> > > > > > >   bifur-eduroam-ipv4:Started
> > > > > > > 
> > > > > > > the resource is restarted again and again. According to
> > > > > > > the debug
> 
> logs:
> > > > > > > 
> > > > > > > Parameters to tunnel-eduroam_start_0 on bifur1
> > > > > > > changed: was 
> > > > > > > 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8 
> > > > > > > (restart:3.0.11) 0:0;48:3:0:73562fd6-1fe2-4930-8c6e-
> > > > > > > 5953b82ebb32
> > > > > > 
> > > > > > This means that instance attributes changed in this case
> > > > > > pacemaker
> > > > > > restarts resource to apply new values. Turning on trace
> > > > > > level
> 
> hopefully
> > > > > > will show what exactly is being changed. You can also dump
> > > > > > CIB
> 
> before
> > > > > > and after restart to compare current information.
> > > > > 
> > > > > The strange thing is that the new value seems never be
> > > > > stored. Just the
> > > > > "was-now" part from the log lines:
> > > > > 
> > > > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8
> > > > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8
> > > > > was 94afff0ff7cfc62f7cb1d5bf5b4d83aa vs. now 
> > 
> > f2317cad3d54cec5d7d7aa7d0bf35cf8
> > > > > ...
> > > > > 
> > > > > However, after issuing "cibadmin --query --local", the whole
> > > > > flipping 
> > > > > stopped! :-) Thanks!
> > > > 
> > > > No, I was wrong - it still repeats every ~15mins. The diff
> > > > between two
> 
> cib 
> > > > xml dumps doesn't say much to me - I'm going to enable tracing.
> > > 
> > > I have attached the trace file created according to 
> > > http://blog.clusterlabs.org/blog/2013/pacemaker-logging.
> > > 
> > > What looks strange to me is that build_parameter_list() first
> > > rejects
> > > attributes, then accepts them:
> > > 
> > > trace   May 18 23:02:49 build_operation_update(787):0: Including
> > > additional
> > digests for ocf::local:tunnel
> > > trace   May 18 23:02:49 build_parameter_list(621):0: Rejecting
> > > name for 
> > 
> > unique
> > > trace   May 18 23:02:49 build_parameter_list(614):0: Attr id is
> > > unique
> > > trace   May 18 23:02:49