Re: [ClusterLabs] VirtualDomain and Resource_is_Too_Active ?? - problem/error

2019-06-03 Thread Jan Pokorný
On 03/06/19 13:39 +0200, Jan Pokorný wrote:
> Yes, there are at least two issues in ocf:heartbeat:VirtualDomain:
> 
> 1/ a value derived from user input is handled in an unchecked manner,
>    while such a value can be an empty string or may contain spaces
>    (for the latter, see also what I raised back then:
>    https://lists.clusterlabs.org/pipermail/users/2015-May/007629.html
>    https://lists.clusterlabs.org/pipermail/developers/2015-May/000620.html
>    )
> 
> 2/ the agent doesn't try to figure out whether it is about to parse
>    a reasonably familiar file; in this case, that means it can end up
>    grep'ing a file spanning up to terabytes of data
> 
> In your case, you mistakenly pointed the agent (via the "config"
> parameter, as highlighted above) not to the expected configuration,
> but rather to the disk image itself -- that's not how to talk to
> libvirt -- it is the guest configuration XML that should point to
> where the disk image is located.  See ocf_heartbeat_VirtualDomain(7)
> or the output you get when invoking the agent with the "meta-data"
> argument.
> 
> Such a configuration issue could be indicated reliably with
> "validate-all" passed as the action for the configured set of agent
> parameters, if 1/ and/or 2/ did not exist in the agent's
> implementation.  Please file issues against the VirtualDomain agent
> to that effect at
> https://github.com/ClusterLabs/fence-agents/issues

sorry, apparently resource-agents, hence

https://github.com/ClusterLabs/resource-agents/issues

(I don't think GitHub allows after-the-fact issue retargeting, sadly)

-- 
Jan (Poki)



[ClusterLabs] announcement: schedule for resource-agents release 4.3.0

2019-06-03 Thread Oyvind Albrigtsen

Hi,

This is a tentative schedule for resource-agents v4.3.0:
4.3.0-rc1: June 14.
4.3.0: June 21.

I've modified the corresponding milestones at
https://github.com/ClusterLabs/resource-agents/milestones

If there's anything you think should be part of the release
please open an issue, a pull request, or a bugzilla, as you see
fit.

If there's anything that hasn't received due attention, please
let us know.

Finally, if you can help with resolving issues consider yourself
invited to do so. There are currently 101 issues and 60 pull
requests still open.


Cheers,
Oyvind Albrigtsen


Re: [ClusterLabs] VirtualDomain and Resource_is_Too_Active ?? - problem/error

2019-06-03 Thread Jan Pokorný
On 29/05/19 09:29 -0500, Ken Gaillot wrote:
> On Wed, 2019-05-29 at 11:42 +0100, lejeczek wrote:
>> I'm doing something which I believe is fairly simple, namely:
>> 
>> $ pcs resource create HA-work9-win10-kvm VirtualDomain \
>>   hypervisor="qemu:///system" \
>>   config="/0-ALL.SYSDATA/QEMU_VMs/HA-work9-win10.qcow2" \
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>   migration_transport=ssh --disable
>> 
>> the virt guest is good, runs fine under libvirt, yet pacemaker fails:
>> 
>> ...
>>   notice: Calculated transition 1864, saving inputs in 
>> /var/lib/pacemaker/pengine/pe-input-2022.bz2
>>   notice: Configuration ERRORs found during PE processing.  Please run 
>> "crm_verify -L" to identify issues.
>>   notice: Initiating monitor operation HA-work9-win10-kvm_monitor_0 locally 
>> on whale.private
>>   notice: Initiating monitor operation HA-work9-win10-kvm_monitor_0 on 
>> swir.private
>>   notice: Initiating monitor operation HA-work9-win10-kvm_monitor_0 on 
>> rider.private
>>  warning: HA-work9-win10-kvm_monitor_0 process (PID 2103512) timed out
>>  warning: HA-work9-win10-kvm_monitor_0:2103512 - timed out after 3ms
>>   notice: HA-work9-win10-kvm_monitor_0:2103512:stderr [ 
>> /usr/lib/ocf/resource.d/heartbeat/VirtualDomain: line 981: [: too
>> many arguments ]
> 
> This looks like a bug in the resource agent, probably due to some
> unexpected configuration value. Double-check your resource
> configuration for what values the various parameters can have. (Or it
> may just be a side effect of the interval issue above, so try fixing
> that first.)

Yes, there are at least two issues in ocf:heartbeat:VirtualDomain:

1/ a value derived from user input is handled in an unchecked manner,
   while such a value can be an empty string or may contain spaces
   (for the latter, see also what I raised back then:
   https://lists.clusterlabs.org/pipermail/users/2015-May/007629.html
   https://lists.clusterlabs.org/pipermail/developers/2015-May/000620.html
   )

2/ the agent doesn't try to figure out whether it is about to parse
   a reasonably familiar file; in this case, that means it can end up
   grep'ing a file spanning up to terabytes of data

In your case, you mistakenly pointed the agent (via the "config"
parameter, as highlighted above) not to the expected configuration,
but rather to the disk image itself -- that's not how to talk to
libvirt -- it is the guest configuration XML that should point to
where the disk image is located.  See ocf_heartbeat_VirtualDomain(7)
or the output you get when invoking the agent with the "meta-data"
argument.
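
For instance, a corrected resource definition could look roughly like
this (a sketch only: the domain name is guessed from the resource name,
the XML path is illustrative, and the file has to be readable on every
node that may run the resource):

  # Export the guest definition from libvirt into a standalone XML file
  virsh dumpxml HA-work9-win10 > /etc/pacemaker/HA-work9-win10.xml

  # Point "config" at that XML, not at the qcow2 disk image
  pcs resource create HA-work9-win10-kvm VirtualDomain \
      hypervisor="qemu:///system" \
      config="/etc/pacemaker/HA-work9-win10.xml" \
      migration_transport=ssh --disable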

Such a configuration issue could be indicated reliably with
"validate-all" passed as the action for the configured set of agent
parameters, if 1/ and/or 2/ did not exist in the agent's
implementation.  Please file issues against the VirtualDomain agent
to that effect at
https://github.com/ClusterLabs/fence-agents/issues
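
For reference, running the agent's "validate-all" action by hand could
look roughly like this (a sketch; OCF agents take their parameters from
OCF_RESKEY_* environment variables, and the path/values below are the
illustrative ones from the sketch above):

  OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_hypervisor="qemu:///system" \
  OCF_RESKEY_config="/etc/pacemaker/HA-work9-win10.xml" \
  /usr/lib/ocf/resource.d/heartbeat/VirtualDomain validate-all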

In addition, there may be some other configuration discrepancies as
pointed out by Ken.  Let us know if any issues persist once all these
are resolved.

-- 
Jan (Poki)



[ClusterLabs] Antw: Inconclusive recap for bonding (balance-rr) vs. HA (Was: why is node fenced ?)

2019-06-03 Thread Ulrich Windl
Hi!

A rather good summary, I think, specifically the notes on link-failure
detection and timeouts. One problem is that corosync reacts very (too?)
fast to communication dropouts, while link-failure monitoring (and
recovery) measures are typically much slower.

We also fell into that pit years ago: "miimon" seemed nice for detecting
link failures quickly, but when the hosts were plugged into switches and
the inter-switch link failed, the hosts did not detect any failure. So no
recovery measures were executed...

When using the ARP method, you must take care to avoid "detecting" a link
failure merely because a remote host goes offline, and also to avoid
polling more frequently than necessary (a sketch follows below)...
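
For illustration only (interface names, the interval and the target
addresses are made up and would have to match the actual topology), ARP
monitoring on a bond can be set up with iproute2 roughly like this:

  # Monitor reachability of stable neighbours instead of link carrier;
  # listing more than one arp_ip_target avoids declaring the link dead
  # just because a single remote host went offline, and a 1000 ms
  # interval avoids polling more often than necessary.
  ip link add bond0 type bond mode active-backup \
      arp_interval 1000 arp_ip_target 192.0.2.1,192.0.2.2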

Regards,
Ulrich

>>> Jan Pokorný wrote on 30.05.2019 at 17:53 in message
<20190530155338.gh6...@redhat.com>:
> On 20/05/19 14:35 +0200, Jan Pokorný wrote:
>> On 20/05/19 08:28 +0200, Ulrich Windl wrote:
>>>> One network interface is gone for a short period. But it's in a
>>>> bonding device (round-robin), so the connection shouldn't be lost.
>>>> Both nodes are connected directly, there is no switch in between.
>>> 
>>> I think you misunderstood: a round-robin bonding device is not
>>> fault-safe IMHO, but it depends a lot on your cabling details. Also
>>> you did not show the logs on the other nodes.
>> 
>> That was sort of my point.  I think that in this case, the
>> fault tolerance together with TCP's "best effort" makes the setup
>> effectively fault-recoverable (except perhaps for some pathological
>> scenarios) -- so the whole "value proposition" of that mode can
>> give a false impression even where it does not hold ... like
>> with corosync (since it intentionally opts for an unreliable
>> transport for performance/scalability).
>> 
>> (saying that as someone with just about a single hands-on
>> experiment with bonding behind me...)
> 
> Ok, I've tried to read up on the subject a bit -- still no more
> hands-on experience, so feel free to correct or amend my conclusions
> below.  This discusses a Linux setup.
> 
> First of all, I think that claiming that an otherwise unspecified
> balance-rr bonding mode solves the SPOF problem is a myth -- and
> relying on that unconditionally is a way to fail hard.
> 
> It needs in-depth considerations, I think:
> 
> 1. some bonding modes (incl. the mentioned balance-rr and active-backup)
>    vitally depend on the ability to actually detect a link failure
> 
> 2. the bonding configuration therefore needs to specify the optimal
>    way to detect such failures for the given selection of network
>    adapters (it seems the configuration of a resulting bond instance
>    is shared across the enslaved devices -- which would mean that
>    preferably the same models should be used, since this optimality
>    is then inherently shared), that is, either
> 
>    - miimon, or
>    - arp_interval and arp_ip_target
> 
>    parameters need to be specified to the kernel bonding module
>    (see the sketch after this list)
> 
>    when this is not done, the presumably non-SPOF interconnect
>    still remains a SPOF (!)
> 
> 3. then, it is commonly pointed out that there's hardly a notion of
>    real-time detection of a link failure -- all such indications are
>    basically polled for, and moreover, drivers for particular adapters
>    can add to the propagation delay, meaning the detection happens on
>    the order of hundreds of milliseconds or more after the fact --
>    which makes me think that such a faulty link behaves essentially
>    as a black hole for that period of time
> 
> 4. finally, to get back to what I meant by the diametral difference
>    between casual TCP vs. corosync (considering UDP unicast) traffic,
>    which may play a role here as well: the mentioned TCP "best effort"
>    with built-in confirmations will not normally give up for tens of
>    seconds or more (please correct me), but in the corosync case, with
>    the default "token" parameter of 1 second multiplied by the
>    retransmit attempts (4 by default, see
>    token_retransmits_before_loss_const), we operate on the order of
>    low seconds (again, please correct me) -- see also the sketch
>    after this list
> 
>    therefore, specifying the above bonding parameters in an excessive
>    manner compared to the corosync configuration (like miimon=1)
>    could, under further circumstances, mean the same SPOF effect,
>    e.g. when
> 
>    - the packets_per_slave parameter is specified in a way that all
>      the possibly repeated attempts of the corosync exchange fall
>      within it (the selected link may, out of bad luck, be the faulty
>      one while the failure hasn't been detected yet)
> 
>    - (unsure whether this can happen) the logical messages corosync
>      exchanges don't fit into a single UDP datagram (low MTU?), and
>      packets_per_slave is 1 (the default), so the complete message is
>      never successfully transmitted, since part of it will always be
>      carried over the faulty link (again, while its failure hasn't
>      been detected yet), IIUIC
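> 
> To make points 2. and 4. concrete, here is a minimal sketch (iproute2
> commands; interface names and values are illustrative) of a bond with
> MII link monitoring, with the timing arithmetic from point 4. spelled
> out in the comments:
> 
>   # MII link monitoring every 100 ms; without miimon or arp_interval
>   # the bond has no way of noticing a dead link at all (point 2.)
>   ip link add bond0 type bond mode balance-rr miimon 100
>   ip link set eth0 down; ip link set eth0 master bond0
>   ip link set eth1 down; ip link set eth1 master bond0
>   ip link set bond0 up
> 
>   # Point 4., with corosync defaults (see corosync.conf(5)):
>   #   token (1000 ms) x token_retransmits_before_loss_const (4)
>   #   => corosync gives up within a few seconds, so link-failure
>   #   detection (polling interval plus driver propagation delay) must
>   #   complete well inside that window, otherwise the bond is still
>   #   effectively a SPOF from corosync's point of view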
> 
> 
> Looks quite tricky, overall.  Myself, I'd likely