[ClusterLabs] Antw: Re: Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-01 Thread Ulrich Windl
Hi!

I don't know the answer, but I wonder what would happen if corosync ran at
normal scheduling priority. My suspicion is that something else is wrong, and
using the highest real-time priority could be the wrong fix for that problem ;-)

Personally I think a process that does disk I/O and is waiting for network
input cannot be the highest-priority real-time job. (Such a candidate would be
a process that has its memory locked and does shared-memory communication
without any I/O)...

Sorry for this off-topic thought.

Regards,
Ulrich

>>> Ferenc Wágner wrote on 01.09.2017 at 00:40 in message
<87inh38ip3@lant.ki.iif.hu>:
> Jan Friesse  writes:
> 
>> wf...@niif.hu writes:
>>
>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>> ramping up):
>>>
>>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400. ms). Consider token timeout increase.
>>
>> ^^^ This is the main problem you have to solve. It usually means that
>> the machine is too overloaded. [...]
> 
> Before I start tracing the scheduler, I'd like to ask something: what
> wakes up the Corosync main process periodically?  The token making a
> full circle?  (Please forgive my simplistic understanding of the TOTEM
> protocol.)  That would explain the recommendation in the log message,
> but does not fit well with the overload assumption: totally idle nodes
> could just as easily produce such warnings if there are no other regular
> wakeup sources.  (I'm looking at timer_function_scheduler_timeout but I
> know too little of libqb to decide.)
> 
>> As a start you can try what the message says: consider a token timeout
>> increase. Currently you have 3 seconds; in theory 6 seconds should be
>> enough.
> 
> It was probably high time I realized that token timeout is scaled
> automatically when one has a nodelist.  When you say Corosync should
> work OK with default settings up to 16 nodes, you assume this scaling is
> in effect, don't you?  On the other hand, I've got no nodelist in the
> config, but token = 3000, which is less than the default 1000+4*650 with
> six nodes, and this will get worse as the cluster grows.
> 
> Comments on the above ramblings welcome!
> 
> I'm grateful for all the valuable input poured into this thread by all
> parties: it's proven really educative in quite unexpected ways beyond
> what I was able to ask in the beginning.
> -- 
> Thanks,
> Feri
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 






[ClusterLabs] Antw: Re: Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Ulrich Windl
>>> Klechomir wrote on 01.09.2017 at 08:48 in message
<9f043557-233d-6c1c-b46d-63f8c2ee5...@gmail.com>:
> Hi Ulrich,
> Have to disagree here.
> 
> I have cases, when for an unknown reason a single monitoring request 
> never returns result.
> So having bigger timeouts doesn't resolve this problem.

But if your monitor hangs instead of giving a result, you also cannot ignore 
the result that isn't there! OTOH: isn't the operation timeout there for monitors 
that hang? If the monitor is killed, it returns an implicit status (it failed).

Can you elaborate?


Regards,
Ulrich

> 
> Best regards,
> Klecho
> 
> On 1.09.2017 09:11, Ulrich Windl wrote:
Klechomir wrote on 31.08.2017 at 17:18 in message
>> <2733498.Wyt05tt8L0@bobo>:
>>
>>> Hi List,
>>> I've been fighting with one problem with many different kinds of resources.
>>>
>>> While the resource is ok, without apparent reason, once in a while there is
>>> no
>>> response to a monitoring request, which leads to a monitoring timeout and
>>> resource restart etc
>>>
>>> Is there any way to ignore one timed out monitoring request and react only
>>> on
>>> two (or more) failed requests in a row?
>> I think you are asking the question in the wrong way: You'll have to choose 
> a timeout value that fails in less than 0.x of all cases when the resource is 
> fine, while it fails in (1 - 0.x) cases when the resource has a problem 
> (assuming the monitor just hangs if there is a problem). It's up to you to 
> select the x (via timeout of monitor) that fits your needs.
>>
>> In a nutshell: short timeouts may produce errors when there are none, long 
> timeouts may cause extra delays when there is a problem.
>>
>> BTW: Having an extra round of monitoring is equivalent to doubling the 
> timeout value; isn't it?
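Ulrich's equivalence can be made concrete with a small probability sketch; p here is a hypothetical per-run probability that a healthy resource's monitor falsely times out (the figure 0.01 is purely illustrative, not from the thread):

```shell
# If a healthy resource's monitor independently times out with probability p
# per run, acting only after two consecutive timeouts reduces the
# false-positive rate to roughly p^2 (assuming independent runs).
p=0.01
awk -v p="$p" 'BEGIN { printf "single: %g  two-in-a-row: %g\n", p, p * p }'
```

The same reduction could instead be bought by a longer single timeout, which is Ulrich's point about the two approaches being roughly equivalent.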
>>
>> Regards,
>> Ulrich
>>
>>
>>
>>


Re: [ClusterLabs] Antw: Re: Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Jehan-Guillaume de Rorthais
On Fri, 01 Sep 2017 09:07:16 +0200
"Ulrich Windl"  wrote:

> >>> Klechomir wrote on 01.09.2017 at 08:48 in message
> <9f043557-233d-6c1c-b46d-63f8c2ee5...@gmail.com>:
> > Hi Ulrich,
> > Have to disagree here.
> > 
> > I have cases, when for an unknown reason a single monitoring request 
> > never returns result.
> > So having bigger timeouts doesn't resolve this problem.  
> 
> But if your monitor hangs instead of giving a result, you also cannot ignore
> the result that isn't there! OTOH: Isn't the operation timeout for monitors
> that hang? If the monitor is killed, it returns an implicit status (it
> failed).

I agree. It seems to me the problem comes from either the resource agent or
the resource itself. Presently, this issue bothers the cluster stack, but sooner
or later, it will break something else. Track down where the issue comes from,
and fix it.

-- 
Jehan-Guillaume de Rorthais
Dalibo



Re: [ClusterLabs] Pacemaker stopped monitoring the resource

2017-09-01 Thread Abhay B
>
> Are you sure the monitor stopped? Pacemaker only logs recurring monitors
> when the status changes. Any successful monitors after this wouldn't be
> logged.


Yes. Since there were no logs which said "RecurringOp:  Start recurring
monitor" on the node after it had failed.
Also there were no logs for any actions pertaining to
The problem was that even though the one node was failing, the resources
were never moved to the other node (the node on which I suspect monitoring
had stopped).

There are a lot of resource action failures, so I'm not sure where the
> issue is, but I'm guessing it has to do with migration-threshold=1 --
> once a resource has failed once on a node, it won't be allowed back on
> that node until the failure is cleaned up. Of course you also have
> failure-timeout=1s, which should clean it up immediately, so I'm not
> sure.


migration-threshold=1
failure-timeout=1s
cluster-recheck-interval=2s

> first, set "two_node: 1" in corosync.conf and let no-quorum-policy default
> in pacemaker


This is already configured.
# cat /etc/corosync/corosync.conf
totem {
version: 2
secauth: off
cluster_name: SVSDEHA
transport: udpu
token: 5000
}

nodelist {
node {
ring0_addr: 2.0.0.10
nodeid: 1
}

node {
ring0_addr: 2.0.0.11
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 1
}

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}

> let no-quorum-policy default in pacemaker; then,
> get stonith configured, tested, and enabled


By not configuring no-quorum-policy, would it ignore quorum for a 2 node
cluster?
For my use case I don't need stonith enabled. My intention is to have a
highly available system all the time.
I will test my RA again as suggested with no-quorum-policy=default.

One more doubt.
Why do we see this in 'pcs property'?
last-lrm-refresh: 1504090367

Never seen this on a healthy cluster.
From RHEL documentation:
last-lrm-refresh
Last refresh of the Local Resource Manager, given in units of seconds since
epoch. Used for diagnostic purposes; not user-configurable.

Doesn't explain much.
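For what it's worth, the value is an ordinary Unix epoch timestamp, so it can be decoded with GNU date(1) to see when the last cleanup/refresh happened:

```shell
# last-lrm-refresh holds seconds since the Unix epoch; decode it in UTC.
# 1504090367 is the value from the 'pcs property' output above.
date -u -d @1504090367   # → Wed Aug 30 10:52:47 UTC 2017
```

That timestamp lining up with a 'pcs resource cleanup' run is what the "diagnostic purposes" remark refers to.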

Also, does avg. CPU load impact resource monitoring?

Regards,
Abhay


On Thu, 31 Aug 2017 at 20:11 Ken Gaillot  wrote:

> On Thu, 2017-08-31 at 06:41 +, Abhay B wrote:
> > Hi,
> >
> >
> > I have a 2 node HA cluster configured on CentOS 7 with pcs command.
> >
> >
> > Below are the properties of the cluster :
> >
> >
> > # pcs property
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: SVSDEHA
> >  cluster-recheck-interval: 2s
> >  dc-deadtime: 5
> >  dc-version: 1.1.15-11.el7_3.5-e174ec8
> >  have-watchdog: false
> >  last-lrm-refresh: 1504090367
> >  no-quorum-policy: ignore
> >  start-failure-is-fatal: false
> >  stonith-enabled: false
> >
> >
> > PFA the cib.
> > Also attached is the corosync.log around the time the below issue
> > happened.
> >
> >
> > After around 10 hrs and multiple failures, pacemaker stops monitoring
> > resource on one of the nodes in the cluster.
> >
> >
> > So even though the resource on other node fails, it is never migrated
> > to the node on which the resource is not monitored.
> >
> >
> > Wanted to know what could have triggered this and how to avoid getting
> > into such scenarios.
> > I am going through the logs and couldn't find why this happened.
> >
> >
> > After this log the monitoring stopped.
> >
> > Aug 29 11:01:44 [16500] TPC-D12-10-002.phaedrus.sandvine.com
> > crmd: info: process_lrm_event:   Result of monitor operation for
> > SVSDEHA on TPC-D12-10-002.phaedrus.sandvine.com: 0 (ok) | call=538
> > key=SVSDEHA_monitor_2000 confirmed=false cib-update=50013
>
> Are you sure the monitor stopped? Pacemaker only logs recurring monitors
> when the status changes. Any successful monitors after this wouldn't be
> logged.
>
> > Below log says the resource is leaving the cluster.
> > Aug 29 11:01:44 [16499] TPC-D12-10-002.phaedrus.sandvine.com
> > pengine: info: LogActions:  Leave   SVSDEHA:0   (Slave
> > TPC-D12-10-002.phaedrus.sandvine.com)
>
> This means that the cluster will leave the resource where it is (i.e. it
> doesn't need a start, stop, move, demote, promote, etc.).
>
> > Let me know if anything more is needed.
> >
> >
> > Regards,
> > Abhay
> >
> >
> > PS:'pcs resource cleanup' brought the cluster back into good state.
>
> There are a lot of resource action failures, so I'm not sure where the
> issue is, but I'm guessing it has to do with migration-threshold=1 --
> once a resource has failed once on a node, it won't be allowed back on
> that node until the failure is cleaned up. Of course you also have
> failure-timeout=1s, which should clean it up immediately, so I'm not
> sure.
>
> My gut feeling is that you're trying to do too many things at once. I'd
> start over from scratch and proceed more slowly: first, set "two_node:
> 1" in corosync.conf and let no-quorum-policy default in

Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-01 Thread Ferenc Wágner
Jan Friesse  writes:

> wf...@niif.hu writes:
>
>> Jan Friesse  writes:
>>
>>> wf...@niif.hu writes:
>>>
 In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
 (in August; in May, it happened 0-2 times a day only, it's slowly
 ramping up):

 vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
 vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
 vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400. ms). Consider token timeout increase.
>>>
>>> ^^^ This is the main problem you have to solve. It usually means that
>>> the machine is too overloaded. [...]
>>
>> Before I start tracing the scheduler, I'd like to ask something: what
>> wakes up the Corosync main process periodically?  The token making a
>> full circle?  (Please forgive my simplistic understanding of the TOTEM
>> protocol.)  That would explain the recommendation in the log message,
>> but does not fit well with the overload assumption: totally idle nodes
>> could just as easily produce such warnings if there are no other regular
>> wakeup sources.  (I'm looking at timer_function_scheduler_timeout but I
>> know too little of libqb to decide.)
>
> Corosync main loop is based on epoll, so corosync is woken up either by
> receiving data (network socket or unix socket for services), or when
> there are data to send and the socket is ready for a non-blocking write,
> or after a timeout. This timeout is exactly what you call the other
> wakeup source.
>
> Timeout is used for scheduling periodical tasks inside corosync.
>
> One of the periodical tasks is the scheduler pause detector. It is
> basically scheduled every (token_timeout / 3) msec and it computes the
> diff between the current and the last time. If the diff is larger than
> (token_timeout * 0.8), it displays the warning.
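The detector Jan describes can be sketched with shell arithmetic (this is a hedged illustration, not corosync source); the numbers below are taken from the poster's config (token = 3000) and the warning in the log:

```shell
# Sketch of the scheduler pause detector: a periodic task runs every
# token_timeout/3 ms and warns when the observed gap between two runs
# exceeds token_timeout * 0.8.
token_timeout=3000                      # ms, totem.token from the config
interval=$((token_timeout / 3))         # detector scheduled every 1000 ms
threshold=$((token_timeout * 8 / 10))   # warning threshold: 2400 ms
observed_gap=4317                       # ms, as reported in the log message
echo "interval=${interval}ms threshold=${threshold}ms"
if [ "$observed_gap" -gt "$threshold" ]; then
  echo "Corosync main process was not scheduled for ${observed_gap} ms"
fi
```

This matches the logged "threshold is 2400. ms" for a 3-second token, and shows why raising the token timeout also raises the pause threshold.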

Thanks, I can work with this.  I'll come back as soon as I find
something (or need further information :).

>>> As a start you can try what the message says: consider a token timeout
>>> increase. Currently you have 3 seconds; in theory 6 seconds should be
>>> enough.
>>
>> It was probably high time I realized that token timeout is scaled
>> automatically when one has a nodelist.  When you say Corosync should
>> work OK with default settings up to 16 nodes, you assume this scaling is
>> in effect, don't you?  On the other hand, I've got no nodelist in the
>> config, but token = 3000, which is less than the default 1000+4*650 with
>> six nodes, and this will get worse as the cluster grows.
>
> This is described in corosync.conf man page (token_coefficient).

Yes, that's how I found out.  It also says: "This value is used only
when nodelist section is specified and contains at least 3 nodes."

> Final timeout is computed using totem.token as a base value. So if you
> set totem.token to 3000 it means that final totem timeout value is not
> 3000 but (3000 + 4 * 650).

But I've got no nodelist section, and according to the warning, my token
timeout is indeed 3 seconds, as you promptly deduced.  So the
documentation seems to be correct.
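The scaling being discussed can be checked with quick shell arithmetic; the figures below assume the defaults documented in corosync.conf(5) (token = 1000 ms, token_coefficient = 650 ms), which is an assumption about the reader's config, not something stated in this thread:

```shell
# With a nodelist of N >= 3 nodes, the runtime token timeout becomes
# token + (N - 2) * token_coefficient; without a nodelist (Feri's case)
# the flat totem.token value (here 3000 ms) is used instead.
token=1000
coefficient=650
nodes=6
echo "scaled token timeout: $((token + (nodes - 2) * coefficient)) ms"  # 3600 ms
```

So a six-node cluster with a nodelist and defaults would already run with 3600 ms, above the configured 3000 ms that triggered the warnings.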
-- 
Thanks,
Feri



Re: [ClusterLabs] Antw: Re: Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Klechomir
What I observe is that a single monitoring request of different resources
with different resource agents is timing out.

For example the LVM resource (the LVM RA) does this sometimes.
Setting ridiculously high timeouts (5 minutes and more) didn't solve the
problem, so I think I'm out of options there.

Same for other I/O related resources/RAs.

Regards,
Klecho

One of the typical cases is LVM (LVM RA) monitoring.

On 1.09.2017 11:07, Jehan-Guillaume de Rorthais wrote:
> On Fri, 01 Sep 2017 09:07:16 +0200
> "Ulrich Windl"  wrote:
>
>>>>> Klechomir wrote on 01.09.2017 at 08:48 in message
>>>>> <9f043557-233d-6c1c-b46d-63f8c2ee5...@gmail.com>:
>>>> Hi Ulrich,
>>>> Have to disagree here.
>>>>
>>>> I have cases, when for an unknown reason a single monitoring request
>>>> never returns result.
>>>> So having bigger timeouts doesn't resolve this problem.
>>> But if your monitor hangs instead of giving a result, you also cannot ignore
>>> the result that isn't there! OTOH: Isn't the operation timeout for monitors
>>> that hang? If the monitor is killed, it returns an implicit status (it
>>> failed).
> I agree. It seems to me the problems comes from either the resource agent or
> the resource itself. Presently, this issue bothers the cluster stack, but soon
> or later, it will blows something else. Track where the issue comes from, and
> fix it.






Re: [ClusterLabs] Antw: Re: Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Kristián Feldsam



> On 1 Sep 2017, at 13:15, Klechomir  wrote:
> 
> What I observe is that single monitoring request of different resources with 
> different resource agents is timing out.
> 
> For example LVM resource (the LVM RA) does this sometimes.
> Setting ridiculously high timeouts (5 minutes and more) didn't solve the 
> problem, so I think I'm  out of options there.
> Same for other I/O related resources/RAs.
> 

hmm, so probably something is bad in the clvm configuration? I use clvm in a 
three node cluster without issues. Which version of CentOS do you use? I 
experience clvm problems only on pre-7.3 versions, due to a bug in libqb.

> Regards,
> Klecho
> 
> One of the typical cases is LVM (LVM RA)monitoring.
> 
> On 1.09.2017 11:07, Jehan-Guillaume de Rorthais wrote:
>> On Fri, 01 Sep 2017 09:07:16 +0200
>> "Ulrich Windl"  wrote:
>> 
>> Klechomir wrote on 01.09.2017 at 08:48 in message
>>> <9f043557-233d-6c1c-b46d-63f8c2ee5...@gmail.com>:
 Hi Ulrich,
 Have to disagree here.
 
 I have cases, when for an unknown reason a single monitoring request
 never returns result.
 So having bigger timeouts doesn't resolve this problem.
>>> But if your monitor hangs instead of giving a result, you also cannot ignore
>>> the result that isn't there! OTOH: Isn't the operation timeout for monitors
>>> that hang? If the monitor is killed, it returns an implicit status (it
>>> failed).
>> I agree. It seems to me the problems comes from either the resource agent or
>> the resource itself. Presently, this issue bothers the cluster stack, but 
>> soon
>> or later, it will blows something else. Track where the issue comes from, and
>> fix it.
>> 
> 
> 


Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-01 Thread Ferenc Wágner
Digimer  writes:

> On 2017-08-29 10:45 AM, Ferenc Wágner wrote:
>
>> Digimer  writes:
>> 
>>> On 2017-08-28 12:07 PM, Ferenc Wágner wrote:
>>>
 [...]
 While dlm_tool status reports (similar on all nodes):

 cluster nodeid 167773705 quorate 1 ring seq 3088 3088
 daemon now 2941405 fence_pid 0 
 node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
 node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
 node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
 node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
 node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
 node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0

 dlm_tool ls shows "kern_stop":

 dlm lockspaces
 name  clvmd
 id0x4104eefa
 flags 0x0004 kern_stop
 changemember 5 joined 0 remove 1 failed 1 seq 8,8
 members   167773705 167773706 167773707 167773708 167773710 
 new changemember 6 joined 1 remove 0 failed 0 seq 9,9
 new statuswait messages 1
 new members   167773705 167773706 167773707 167773708 167773709 167773710 

 on all nodes except for vhbl07 (167773709), where it gives

 dlm lockspaces
 name  clvmd
 id0x4104eefa
 flags 0x 
 changemember 6 joined 1 remove 0 failed 0 seq 11,11
 members   167773705 167773706 167773707 167773708 167773709 167773710 

 instead.

 [...] Is there a way to unblock DLM without rebooting all nodes?
>>>
>>> Looks like the lost node wasn't fenced.
>> 
>> Why dlm status does not report any lost node then?  Or do I misinterpret
>> its output?
>> 
>>> Do you have fencing configured and tested? If not, DLM will block
>>> forever because it won't recover until it has been told that the lost
>>> peer has been fenced, by design.
>> 
>> What command would you recommend for unblocking DLM in this case?
>
> First, fix fencing. Do you have that setup and working?

I really don't want DLM to do fencing.  DLM blocking for a couple of
days is not an issue in this setup (cLVM isn't a "service" of this
cluster, only a rarely needed administration tool).  Fencing is set up
and works fine for Pacemaker, so it's used to recover actual HA
services.  But letting DLM use it resulted in disaster one and a half
year ago (see Message-ID: <87r3g5a969@lant.ki.iif.hu>), which I
failed to understand yet, and I'd rather not go there again until that's
taken care of properly.  So for now, a manual unblock path is all I'm
after.
-- 
Thanks,
Feri



Re: [ClusterLabs] Antw: Re: Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Klechomir
> On 1 Sep 2017, at 13:15, Klechomir wrote:
>> What I observe is that single monitoring request of different
>> resources with different resource agents is timing out.
>>
>> For example LVM resource (the LVM RA) does this sometimes.
>> Setting ridiculously high timeouts (5 minutes and more) didn't solve
>> the problem, so I think I'm out of options there.
>>
>> Same for other I/O related resources/RAs.
>
> hmm, so probably is something bad in clvm configuration? I use clvm in
> three node cluster without issues. Which version of centos u use? I
> experience clvm problems only on pre 7.3 version due to bug in libqb.

I have timeouts with and without clvm. Have it with resources managing
(scst) iscsi targets as well.
Regards,
Klecho







Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-01 Thread Klaus Wenninger
On 08/31/2017 11:58 PM, Ferenc Wágner wrote:
> Klaus Wenninger  writes:
>
>> Just seen that you are hosting VMs which might make you use KSM ...
>> Don't fully remember at the moment but I have some memory of
>> issues with KSM and page-locking.
>> iirc it was some bug in the kernel memory-management that should
>> be fixed a long time ago but ...
> Hi Klaus,
>
> I failed to find anything relevant by a quick internet search.  Can you
> recall something more specific, so that I can ensure I'm running with
> this issue fixed?

puuh ... long time ago ... let's try

think using KSM & page-locking led to more stuff locked
than actually intended/expected (all memory used by ksm
iirc) which subsequently drove the system into real trouble
having to do heavy swapping on the little memory left. In
your case without swapping it is hard to guess what would
actually happen rather than oom which is obviously not
happening.

I would expect this issue anyway to be solved meanwhile
but anyway ...

Sorry for the degradation of my biological data-store.
There should be something on disk somewhere but
unfortunately I have no access atm.

Regards,
Klaus




[ClusterLabs] Antw: Re: Antw: Re: Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Ulrich Windl
>>> Klechomir wrote on 01.09.2017 at 13:15 in message
<258bd7d9-ed89-f1f0-8f9b-bca7420c6...@gmail.com>:
> What I observe is that single monitoring request of different resources 
> with different resource agents is timing out.
> 
> For example LVM resource (the LVM RA) does this sometimes.

We had that too, about 6 years ago. Since then we do not monitor the LV state 
(after having seen what the monitor does). The problem is that in a SAN 
environment with some hundred disks (due to multipath), LVM does not scale 
well: the more disks you have, the slower a thing like vgdisplay is.
Maybe that has changed since then, but we are still happy. As we don't use raw 
access to LVs, we don't really miss a problem, as the upper layers are 
monitored...

> Setting ridiculously high timeouts (5 minutes and more) didn't solve the 
> problem, so I think I'm  out of options there.
> Same for other I/O related resources/RAs.
> 
> Regards,
> Klecho
> 
> One of the typical cases is LVM (LVM RA)monitoring.
> 
> On 1.09.2017 11:07, Jehan-Guillaume de Rorthais wrote:
>> On Fri, 01 Sep 2017 09:07:16 +0200
>> "Ulrich Windl"  wrote:
>>
>> Klechomir wrote on 01.09.2017 at 08:48 in message
>>> <9f043557-233d-6c1c-b46d-63f8c2ee5...@gmail.com>:
 Hi Ulrich,
 Have to disagree here.

 I have cases, when for an unknown reason a single monitoring request
 never returns result.
 So having bigger timeouts doesn't resolve this problem.
>>> But if your monitor hangs instead of giving a result, you also cannot ignore
>>> the result that isn't there! OTOH: Isn't the operation timeout for monitors
>>> that hang? If the monitor is killed, it returns an implicit status (it
>>> failed).
>> I agree. It seems to me the problems comes from either the resource agent or
>> the resource itself. Presently, this issue bothers the cluster stack, but 
> soon
>> or later, it will blows something else. Track where the issue comes from, 
> and
>> fix it.
>>
> 
> 







Re: [ClusterLabs] Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Jan Pokorný
On 01/09/17 09:48 +0300, Klechomir wrote:
> I have cases, when for an unknown reason a single monitoring request
> never returns result.
> So having bigger timeouts doesn't resolve this problem.

If I get you right, the pain point here is a command called by the
resource agent during the monitor operation, while this command under
some circumstances _never_ terminates (a deadlocked wait, an infinite
loop, or whatever other reason) or possibly terminates based on
external/asynchronous triggers (e.g. network connection gets
reestablished).

Stating the obvious, the solution should be:
- work towards fixing such particular command if blocking
  is an unexpected behaviour (clarify this with upstream
  if needed)
- find more reliable way for the agent to monitor the resource
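One pragmatic variant of the second bullet can be sketched as below: the agent wraps its potentially hanging probe with coreutils timeout(1), so a stuck call turns into an explicit error instead of consuming the whole operation timeout. Here 'sleep 10' stands in for a hypothetical hanging probe (e.g. a vgdisplay call); the 1-second limit is illustrative only:

```shell
# Wrap a probe that may hang with a hard timeout; timeout(1) kills it
# after 1 second (exit code 124), letting the monitor report a failure
# promptly instead of hanging until Pacemaker's operation timeout fires.
if timeout 1 sleep 10; then
  echo "monitor ok"
else
  echo "monitor probe timed out or failed"
fi
```

This doesn't fix the underlying hang, but it converts "never returns" into a normal, loggable monitor failure.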

For the planned soft-recovery options Ken talked about, I am not
sure if it would be trivially possible to differentiate exceeded
monitor timeout from a plain monitor failure.

-- 
Jan (Poki)




Re: [ClusterLabs] Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Klechomir

On 1.09.2017 17:21, Jan Pokorný wrote:
> On 01/09/17 09:48 +0300, Klechomir wrote:
>> I have cases, when for an unknown reason a single monitoring request
>> never returns result.
>> So having bigger timeouts doesn't resolve this problem.
>
> If I get you right, the pain point here is a command called by the
> resource agents during monitor operation, while this command under
> some circumstances _never_ terminates (for dead waiting, infinite
> loop, or whatever other reason) or possibly terminates based on
> external/asynchronous triggers (e.g. network connection gets
> reestablished).
>
> Stating obvious, the solution should be:
> - work towards fixing such particular command if blocking
>   is an unexpected behaviour (clarify this with upstream
>   if needed)
> - find more reliable way for the agent to monitor the resource
>
> For the planned soft-recovery options Ken talked about, I am not
> sure if it would be trivially possible to differentiate exceeded
> monitor timeout from a plain monitor failure.


In any case there is currently no differentiation between a failed 
monitoring request and a timeout, so a parameter for ignoring X failures 
in a row would be very welcome for me.


Here is one very fresh example, entirely unrelated to LV&I/O:
Aug 30 10:44:19 [1686093] CLUSTER-1   crmd:error: 
process_lrm_event:LRM operation p_PingD_monitor_0 (1148) Timed Out 
(timeout=2ms)
Aug 30 10:44:56 [1686093] CLUSTER-1   crmd:   notice: 
process_lrm_event:LRM operation p_PingD_stop_0 (call=1234, rc=0, 
cib-update=40, confirmed=true) ok
Aug 30 10:45:26 [1686093] CLUSTER-1   crmd:   notice: 
process_lrm_event:LRM operation p_PingD_start_0 (call=1240, rc=0, 
cib-update=41, confirmed=true) ok
In this case PingD is fencing drbd and causes an unneeded restart of all 
related resources (unneeded, as the next monitoring request is ok).





Re: [ClusterLabs] VirtualDomain live migration error

2017-09-01 Thread Ken Gaillot
On Fri, 2017-09-01 at 00:26 +0200, Oscar Segarra wrote:
> Hi,
> 
> 
> Yes, it is
> 
> 
> The qemu-kvm process is executed by the oneadmin user.
> 
> 
> When the cluster tries the live migration, which users come into play?
> 
> 
> Oneadmin
> Root
> Hacluster
> 
> 
> I have just configured passwordless ssh connection with oneadmin.
> 
> 
> Do I need to configure any other passwordless ssh connection with any
> other user?
> 
> 
> What user executes virsh migrate --live?

The cluster executes resource actions as root.

> Is there any way to check ssh keys? 

I'd just log in once to the host as root from each cluster node, to make
sure it works, and accept the host key when asked.

> 
> Sorry for all these questions. 
> 
> 
> Thanks a lot 
> 
> 
> 
> 
> 
> 
> El 1 sept. 2017 0:12, "Ken Gaillot"  escribió:
> On Thu, 2017-08-31 at 23:45 +0200, Oscar Segarra wrote:
> > Hi Ken,
> >
> >
> > Thanks a lot for you quick answer:
> >
> >
> > Regarding to selinux, it is disabled. The FW is disabled as
> well.
> >
> >
> > [root@vdicnode01 ~]# sestatus
> > SELinux status: disabled
> >
> >
> > [root@vdicnode01 ~]# service firewalld status
> > Redirecting to /bin/systemctl status  firewalld.service
> > ● firewalld.service - firewalld - dynamic firewall daemon
> >Loaded: loaded
> (/usr/lib/systemd/system/firewalld.service;
> > disabled; vendor preset: enabled)
> >Active: inactive (dead)
> >  Docs: man:firewalld(1)
> >
> >
> > On migration, it performs a gracefully shutdown and a start
> on the new
> > node.
> >
> >
> > I attach the logs when trying to migrate from vdicnode02 to
> > vdicnode01:
> >
> >
> > vdicnode02 corosync.log:
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_perform_op: Diff: --- 0.161.2 2
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_perform_op: Diff: +++ 0.162.0 (null)
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_perform_op:
> >
> -- 
> /cib/configuration/constraints/rsc_location[@id='location-vm-vdicdb01-vdicnode01--INFINITY']
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_perform_op: +  /cib:  @epoch=162, @num_updates=0
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_process_request:Completed cib_replace operation for
> section
> > configuration: OK (rc=0, origin=vdicnode01/cibadmin/2,
> > version=0.162.0)
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_file_backup:Archived previous version
> > as /var/lib/pacemaker/cib/cib-65.raw
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_file_write_with_digest: Wrote version 0.162.0 of the
> CIB to
> > disk (digest: 1f87611b60cd7c48b95b6b788b47f65f)
> > Aug 31 23:38:17 [1521] vdicnode02cib: info:
> > cib_file_write_with_digest: Reading cluster
> configuration
> > file /var/lib/pacemaker/cib/cib.jt2KPw
> > (digest: /var/lib/pacemaker/cib/cib.Kwqfpl)
> > Aug 31 23:38:22 [1521] vdicnode02cib: info:
> > cib_process_ping:   Reporting our current digest to
> vdicnode01:
> > dace3a23264934279d439420d5a716cc for 0.162.0 (0x7f96bb26c5c0
> 0)
> > Aug 31 23:38:27 [1521] vdicnode02cib: info:
> > cib_perform_op: Diff: --- 0.162.0 2
> > Aug 31 23:38:27 [1521] vdicnode02cib: info:
> > cib_perform_op: Diff: +++ 0.163.0 (null)
> > Aug 31 23:38:27 [1521] vdicnode02cib: info:
> > cib_perform_op: +  /cib:  @epoch=163
> > Aug 31 23:38:27 [1521] vdicnode02cib: info:
> > cib_perform_op: ++ /cib/configuration/constraints:
>  > id="location-vm-vdicdb01-vdicnode02--INFINITY"
> node="vdicnode02"
> > rsc="vm-vdicdb01" score="-INFINITY"/>
> > Aug 31 23:38:27 [1521] vdicnode02cib: info:
> > cib_process_request:Completed cib_replace operation for
> section
> > configuration: OK (rc=0, origin=vdicnode01/cibadmin/2,
> > version=0.163.0)
> > Aug 31 23:38:27 [1521] vdicnode02cib: info:
> > cib_file_backup:Archived previous version
> > as /var/lib/pacemaker/cib/cib-66.raw
> > Aug 31 23:38:27 [1521] vdicnode02cib: info:
> > cib_file_write_with_digest: Wrote version 0.163.0 of the
> CIB to
> > disk (digest: 47a548b36746de9275d66cc6aeb0fdc4)
> > Aug 31 23:38:27 [152

Re: [ClusterLabs] Pacemaker stopped monitoring the resource

2017-09-01 Thread Ken Gaillot
On Fri, 2017-09-01 at 15:06 +0530, Abhay B wrote:
> Are you sure the monitor stopped? Pacemaker only logs
> recurring monitors
> when the status changes. Any successful monitors after this
> wouldn't be
> logged.  
>  
> Yes. Since there  were no logs which said "RecurringOp:  Start
> recurring monitor" on the node after it had failed.
> Also there were no logs for any actions pertaining to 
> The problem was that even though one node was failing, the
> resources were never moved to the other node (the node on which I
> suspect monitoring had stopped).
> 
> 
> There are a lot of resource action failures, so I'm not sure
> where the
> issue is, but I'm guessing it has to do with
> migration-threshold=1 --
> once a resource has failed once on a node, it won't be allowed
> back on
> that node until the failure is cleaned up. Of course you also
> have
> failure-timeout=1s, which should clean it up immediately, so
> I'm not
> sure.
> 
> 
> migration-threshold=1
> failure-timeout=1s
> 
> cluster-recheck-interval=2s
> 
> 
> first, set "two_node:
> 1" in corosync.conf and let no-quorum-policy default in
> pacemaker
> 
> 
> This is already configured.
> # cat /etc/corosync/corosync.conf
> totem {
> version: 2
> secauth: off
> cluster_name: SVSDEHA
> transport: udpu
> token: 5000
> }
> 
> 
> nodelist {
> node {
> ring0_addr: 2.0.0.10
> nodeid: 1
> }
> 
> 
> node {
> ring0_addr: 2.0.0.11
> nodeid: 2
> }
> }
> 
> 
> quorum {
> provider: corosync_votequorum
> two_node: 1
> }
> 
> 
> logging {
> to_logfile: yes
> logfile: /var/log/cluster/corosync.log
> to_syslog: yes
> }
> 
> 
> let no-quorum-policy default in pacemaker; then,
> get stonith configured, tested, and enabled
> 
> 
> By not configuring no-quorum-policy, would it ignore quorum for a 2
> node cluster? 

With two_node, corosync always provides quorum to pacemaker, so
pacemaker doesn't see any quorum loss. The only significant difference
from ignoring quorum is that corosync won't form a cluster from a cold
start unless both nodes can reach each other (a safety feature).

> For my use case I don't need stonith enabled. My intention is to have
> a highly available system all the time.

Stonith is the only way to recover from certain types of failure, such
as the "split brain" scenario, and a resource that fails to stop.

If your nodes are physical machines with hardware watchdogs, you can set
up sbd for fencing without needing any extra equipment.
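In rough outline (a sketch only: the config path, device name, and
timeout values vary by distribution and are assumptions here, not from
this thread), watchdog-only sbd amounts to:

```shell
# Sketch of /etc/sysconfig/sbd (Debian-based systems: /etc/default/sbd);
# the device, timeouts, and commands below are assumptions.
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
# After writing the file on every node, enable the service and tell
# pacemaker to rely on watchdog self-fencing:
#   systemctl enable sbd
#   pcs property set stonith-enabled=true
#   pcs property set stonith-watchdog-timeout=10s
```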

> I will test my RA again as suggested with no-quorum-policy=default.
> 
> 
> One more doubt. 
> Why do we see this is 'pcs property' ?
> last-lrm-refresh: 1504090367
> 
> 
> 
> Never seen this on a healthy cluster.
> From the RHEL documentation:
> last-lrm-refresh: Last refresh of the Local Resource Manager, given in
> units of seconds since epoch. Used for diagnostic purposes; not
> user-configurable.
> 
> 
> Doesn't explain much.

Whenever a cluster property changes, the cluster rechecks the current
state to see if anything needs to be done. last-lrm-refresh is just a
dummy property that the cluster uses to trigger that. It's set in
certain rare circumstances when a resource cleanup is done. You should
see a line in your logs like "Triggering a refresh after ... deleted ...
from the LRM". That might give some idea of why.
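For instance (log path taken from the corosync.conf quoted earlier in
this thread; adjust if pacemaker logs elsewhere on your nodes), a quick
way to hunt for that line:

```shell
# Sketch: search the cluster log for the refresh trigger line; falls back
# to a message when the log is absent (e.g. when run off-cluster).
find_refresh_trigger() {
    if [ -r "$1" ]; then
        grep -n "Triggering a refresh" "$1" || echo "no refresh triggers logged"
    else
        echo "log file $1 not readable here"
    fi
}
find_refresh_trigger /var/log/cluster/corosync.log
```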

> Also, does avg. CPU load impact resource monitoring? 
> 
> 
> Regards,
> Abhay

Well, it could cause the monitor to take so long that it times out. The
only direct effect of load on pacemaker is that the cluster might lower
the number of agent actions that it can execute simultaneously.


> On Thu, 31 Aug 2017 at 20:11 Ken Gaillot  wrote:
> 
> On Thu, 2017-08-31 at 06:41 +, Abhay B wrote:
> > Hi,
> >
> >
> > I have a 2 node HA cluster configured on CentOS 7 with pcs
> command.
> >
> >
> > Below are the properties of the cluster :
> >
> >
> > # pcs property
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: SVSDEHA
> >  cluster-recheck-interval: 2s
> >  dc-deadtime: 5
> >  dc-version: 1.1.15-11.el7_3.5-e174ec8
> >  have-watchdog: false
> >  last-lrm-refresh: 1504090367
> >  no-quorum-policy: ignore
> >  start-failure-is-fatal: false
> >  stonith-enabled: false
> >
> >
> > PFA the cib.
> > Also attached is the corosync.log around the time the below
> issue
> > happened.
> >
> >
> > After around 10 hrs and multiple failures, pacemaker stops
> monitoring
> > resource on one of the nodes in the cluster.
> >
> >
>

Re: [ClusterLabs] VirtualDomain live migration error

2017-09-01 Thread Oscar Segarra
Hi,

I have updated the known_hosts:

Now, I get the following error:

Sep 02 01:03:41 [1535] vdicnode01cib: info: cib_perform_op: +
 
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='vm-vdicdb01']/lrm_rsc_op[@id='vm-vdicdb01_last_0']:
 @operation_key=vm-vdicdb01_migrate_to_0, @operation=migrate_to,
@crm-debug-origin=cib_action_update,
@transition-key=6:27:0:a7fef266-46c3-429e-ab00-c1a0aab24da5,
@transition-magic=-1:193;6:27:0:a7fef266-46c3-429e-ab00-c1a0aab24da5,
@call-id=-1, @rc-code=193, @op-status=-1, @last-run=1504307021, @last-rc-c
Sep 02 01:03:41 [1535] vdicnode01cib: info:
cib_process_request:Completed cib_modify operation for section status:
OK (rc=0, origin=vdicnode01/crmd/77, version=0.169.1)
VirtualDomain(vm-vdicdb01)[13085]:  2017/09/02_01:03:41 INFO: vdicdb01:
Starting live migration to vdicnode02 (using: virsh
--connect=qemu:///system --quiet migrate --live  vdicdb01
qemu+ssh://vdicnode02/system ).
VirtualDomain(vm-vdicdb01)[13085]:  2017/09/02_01:03:41 ERROR:
vdicdb01: live migration to vdicnode02 failed: 1
Sep 02 01:03:41 [1537] vdicnode01   lrmd:   notice: operation_finished:
vm-vdicdb01_migrate_to_0:13085:stderr [ error: Cannot recv data:
Permission denied, please try again. ]
Sep 02 01:03:41 [1537] vdicnode01   lrmd:   notice: operation_finished:
vm-vdicdb01_migrate_to_0:13085:stderr [ Permission denied, please try
again. ]
Sep 02 01:03:41 [1537] vdicnode01   lrmd:   notice: operation_finished:
vm-vdicdb01_migrate_to_0:13085:stderr [ Permission denied
(publickey,gssapi-keyex,gssapi-with-mic,password).: Connection reset by
peer ]
Sep 02 01:03:41 [1537] vdicnode01   lrmd:   notice: operation_finished:
vm-vdicdb01_migrate_to_0:13085:stderr [ ocf-exit-reason:vdicdb01: live
migration to vdicnode02 failed: 1 ]
Sep 02 01:03:41 [1537] vdicnode01   lrmd: info: log_finished:
finished - rsc:vm-vdicdb01 action:migrate_to call_id:16 pid:13085
exit-code:1 exec-time:119ms queue-time:0ms
Sep 02 01:03:41 [1540] vdicnode01   crmd:   notice: process_lrm_event:
 Result of migrate_to operation for vm-vdicdb01 on vdicnode01: 1
(unknown error) | call=16 key=vm-vdicdb01_migrate_to_0 confirmed=true
cib-update=78
Sep 02 01:03:41 [1540] vdicnode01   crmd:   notice: process_lrm_event:
 vdicnode01-vm-vdicdb01_migrate_to_0:16 [ error: Cannot recv data:
Permission denied, please try again.\r\nPermission denied, please try
again.\r\nPermission denied
(publickey,gssapi-keyex,gssapi-with-mic,password).: Connection reset by
peer\nocf-exit-reason:vdicdb01: live migration to vdicnode02 failed: 1\n ]
Sep 02 01:03:41 [1535] vdicnode01cib: info:
cib_process_request:Forwarding cib_modify operation for section status
to all (origin=local/crmd/78)
Sep 02 01:03:41 [1535] vdicnode01cib: info: cib_perform_op:
Diff: --- 0.169.1 2
Sep 02 01:03:41 [1535] vdicnode01cib: info: cib_perform_op:
Diff: +++ 0.169.2 (null)
Sep 02 01:03:41 [1535] vdicnode01cib: info: cib_perform_op: +
 /cib:  @num_updates=2
Sep 02 01:03:41 [1535] vdicnode01cib: info: cib_perform_op: +
 
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='vm-vdicdb01']/lrm_rsc_op[@id='vm-vdicdb01_last_0']:
 @crm-debug-origin=do_update_resource,
@transition-magic=0:1;6:27:0:a7fef266-46c3-429e-ab00-c1a0aab24da5,
@call-id=16, @rc-code=1, @op-status=0, @exec-time=119,
@exit-reason=vdicdb01: live migration to vdicnode02 failed: 1
Sep 02 01:03:4

as root <-- system prompts the password
[root@vdicnode01 .ssh]# virsh --connect=qemu:///system --quiet migrate
--live  vdicdb01 qemu+ssh://vdicnode02/system
root@vdicnode02's password:

as oneadmin (the user that executes the qemu-kvm) <-- does not prompt the
password
virsh --connect=qemu:///system --quiet migrate --live  vdicdb01
qemu+ssh://vdicnode02/system

Must I configure a passwordless connection for root in order to make live
migration work?

Or is there any way to instruct pacemaker to use my oneadmin user for
migrations instead of root?
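As a side note, here is a quick non-interactive check of what the
cluster will attempt (the host name is the one from this thread;
BatchMode just makes a missing key fail fast instead of prompting):

```shell
# Sketch: since the cluster runs migrate_to as root, root's ssh to the
# peer must work without any prompt. BatchMode=yes makes a missing key
# or unknown host fail immediately instead of asking for a password.
check_root_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true \
        && echo "root ssh to $1 OK" \
        || echo "root ssh to $1 not ready (key or known_hosts missing)"
}
check_root_ssh vdicnode02
```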

Thanks a lot:


2017-09-01 23:14 GMT+02:00 Ken Gaillot :

> On Fri, 2017-09-01 at 00:26 +0200, Oscar Segarra wrote:
> > Hi,
> >
> >
> > Yes, it is
> >
> >
> > The qemu-kvm process is executed by the oneadmin user.
> >
> >
> > When the cluster tries the live migration, which users come into play?
> >
> >
> > Oneadmin
> > Root
> > Hacluster
> >
> >
> > I have just configured passwordless ssh connection with oneadmin.
> >
> >
> > Do I need to configure any other passwordless ssh connection with any
> > other user?
> >
> >
> > What user executes virsh migrate --live?
>
> The cluster executes resource actions as root.
>
> > Is there any way to check ssh keys?
>
> I'd just log in once to the host as root from each cluster node, to make
> sure it works, and accept the host key when asked.
>
> >
> > Sorry for all these questions.
> >
> >
> > Thanks a lot
> >
> >
> >
> >
> >
> >
> > El 1 sept. 2017 0:12,