Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Salvatore D'angelo
Hi,

Let me add here one important detail. I use Docker for my test with 5 
containers deployed on my Mac.
Basically, the team that worked on this project installed the cluster on 
SoftLayer bare metal.
The PostgreSQL cluster was hard to test, and if a misconfiguration occurred, 
recreating the cluster from scratch was not easy.
Testing it was cumbersome, considering that we access the machines through a 
complex system that is hard to describe here.
For this reason I ported the cluster to Docker for test purposes. I am not 
interested in keeping it running for months; I just need a proof of concept. 

Once the migration works I’ll port everything to bare metal, where resources 
are abundant.  

Now I have enough RAM and disk space on my Mac, so if you can tell me what an 
acceptable /dev/shm size would be for several days of running, that is fine with me.
Commands to clean up /dev/shm when required would also help.
I know I can find them on Google, but I would appreciate any pointers. I have 
the OS knowledge to do it, but I would like to avoid days of guesswork and 
trial and error if possible.
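
For reference, a minimal sketch of the kind of commands involved, assuming
/dev/shm is a plain tmpfs mount inside the containers (the 256m figure is only
an example, not a recommendation):

  # check current size and usage
  df -h /dev/shm

  # give the container a larger /dev/shm when it is created
  docker run --shm-size=256m ...

  # remove stale libqb ring buffers left behind by a crashed corosync
  # (only while corosync is stopped)
  rm -f /dev/shm/qb-*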

> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
> 
> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>> Thanks for reply. I scratched my cluster and created it again and
>> then migrated as before. This time I uninstalled pacemaker,
>> corosync, crmsh and resource agents with make uninstall
>> 
>> then I installed new packages. The problem is the same, when
>> I launch:
>> corosync-quorumtool -ps
>> 
>> I got: Cannot initialize QUORUM service
>> 
>> Here the log with debug enabled:
>> 
>> 
>> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
>> /dev/shm/qb-cfg-event-18020-18028-23-data
>> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
>> Resource temporarily unavailable (11)
>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>> /dev/shm/qb-cfg-request-18020-18028-23-header
>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>> /dev/shm/qb-cfg-response-18020-18028-23-header
>> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
>> temporarily unavailable (11)
>> [18019] pg3 corosyncerror   [QB] Error in connection setup 
>> (18020-18028-23): Resource temporarily unavailable (11)
>> 
>> I tried to check /dev/shm and I am not sure these are the right
>> commands, however:
>> 
>> df -h /dev/shm
>> Filesystem  Size  Used Avail Use% Mounted on
>> shm  64M   16M   49M  24% /dev/shm
>> 
>> ls /dev/shm
>> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
>> qb-quorum-request-18020-18095-32-data
>> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
>> qb-quorum-request-18020-18095-32-header
>> 
>> Is 64 MB enough for /dev/shm? If not, why did it work with the previous
>> corosync release?
> 
> For a start, can you try configuring corosync with
> --enable-small-memory-footprint switch?
> 
> Hard to say why the space provisioned to /dev/shm is the direct
> opposite of generous (per today's standards), but may be the result
> of automatic HW adaptation, and if RAM is so scarce in your case,
> the above build-time toggle might help.
> 
> If not, then exponentially increasing size of /dev/shm space is
> likely your best bet (I don't recommend fiddling with mlockall()
> and similar measures in corosync).
> 
> Of course, feel free to raise a regression if you have a reproducible
> comparison between two corosync (plus possibly different libraries
> like libqb) versions, one that works and one that won't, in
> reproducible conditions (like this small /dev/shm, VM image, etc.).
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pcs 0.9.165 released

2018-06-25 Thread Jan Pokorný
On 25/06/18 12:08 +0200, Tomas Jelinek wrote:
> I am happy to announce the latest release of pcs, version 0.9.165.

What a mighty patch/micro version component ;-)

With several pacemaker 2.0 release candidates out, it would perhaps be
welcome to share details about the versioning (branch) policy of pcs
regarding the supported stacks, since this is something I myself
didn't learn about until recently, and only by chance...

Thanks

-- 
Jan (Poki)


pgp9QQs5PQo0Q.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Jan Pokorný
On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
> Thanks for reply. I scratched my cluster and created it again and
> then migrated as before. This time I uninstalled pacemaker,
> corosync, crmsh and resource agents with make uninstall
> 
> then I installed new packages. The problem is the same, when
> I launch:
> corosync-quorumtool -ps
> 
> I got: Cannot initialize QUORUM service
> 
> Here the log with debug enabled:
> 
> 
> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
> /dev/shm/qb-cfg-event-18020-18028-23-data
> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
> Resource temporarily unavailable (11)
> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
> /dev/shm/qb-cfg-request-18020-18028-23-header
> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
> /dev/shm/qb-cfg-response-18020-18028-23-header
> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
> temporarily unavailable (11)
> [18019] pg3 corosyncerror   [QB] Error in connection setup 
> (18020-18028-23): Resource temporarily unavailable (11)
> 
> I tried to check /dev/shm and I am not sure these are the right
> commands, however:
> 
> df -h /dev/shm
> Filesystem  Size  Used Avail Use% Mounted on
> shm  64M   16M   49M  24% /dev/shm
> 
> ls /dev/shm
> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
> qb-quorum-request-18020-18095-32-data
> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
> qb-quorum-request-18020-18095-32-header
> 
> Is 64 MB enough for /dev/shm? If not, why did it work with the previous
> corosync release?

For a start, can you try configuring corosync with
--enable-small-memory-footprint switch?

Hard to say why the space provisioned to /dev/shm is the direct
opposite of generous (per today's standards), but may be the result
of automatic HW adaptation, and if RAM is so scarce in your case,
the above build-time toggle might help.
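
For anyone rebuilding from source, that toggle is just a configure switch; a
minimal sketch, assuming a corosync release tarball or git checkout:

  ./configure --enable-small-memory-footprint
  make
  sudo make install

As far as I know it shrinks corosync's IPC buffers, so it mainly makes sense
on memory-constrained systems like the containers described above.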

If not, then exponentially increasing size of /dev/shm space is
likely your best bet (I don't recommend fiddling with mlockall()
and similar measures in corosync).
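
A sketch of what growing /dev/shm looks like on an ordinary host, assuming it
is a plain tmpfs mount (the 256M figure is again only an example):

  # one-off, on the running system
  mount -o remount,size=256M /dev/shm

  # persistent, via an /etc/fstab entry
  tmpfs  /dev/shm  tmpfs  defaults,size=256M  0 0

In a Docker container the equivalent is the --shm-size option mentioned
earlier in this thread.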

Of course, feel free to raise a regression if you have a reproducible
comparison between two corosync (plus possibly different libraries
like libqb) versions, one that works and one that won't, in
reproducible conditions (like this small /dev/shm, VM image, etc.).

-- 
Jan (Poki)


pgpjALMoPzoef.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Salvatore D'angelo
Hi,
Thanks for reply. I scratched my cluster and created it again and then migrated as before. This time I uninstalled pacemaker, corosync, crmsh and resource agents with
make uninstall
then I installed new packages. The problem is the same, when I launch:
corosync-quorumtool -ps
I got: Cannot initialize QUORUM service
Here the log with debug enabled:

corosync.log
Description: Binary data
[18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
[18019] pg3 corosyncerror   [QB    ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
[18019] pg3 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[18019] pg3 corosyncerror   [QB    ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)

I tried to check /dev/shm and I am not sure these are the right commands, however:

df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M   16M   49M  24% /dev/shm

ls /dev/shm
qb-cmap-request-18020-18036-25-data    qb-corosync-blackbox-data    qb-quorum-request-18020-18095-32-data
qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  qb-quorum-request-18020-18095-32-header

Is 64 MB enough for /dev/shm? If not, why did it work with the previous corosync release?

On 25 Jun 2018, at 09:09, Christine Caulfield  wrote:

On 22/06/18 11:23, Salvatore D'angelo wrote:
Hi,
Here the log:

[17323] pg1 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-17324-17334-23-data
[17323] pg1 corosyncerror   [QB    ] qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-17324-17334-23-header
[17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-17324-17334-23-header
[17323] pg1 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[17323] pg1 corosyncerror   [QB    ] Error in connection setup (17324-17334-23): Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB    ] qb_ipcs_disconnect(17324-17334-23) state:0

is /dev/shm full?

Chrissie

On 22 Jun 2018, at 12:10, Christine Caulfield  wrote:

On 22/06/18 10:39, Salvatore D'angelo wrote:
Hi,
Can you tell me exactly which log you need? I'll provide it as soon as possible.
Regarding some settings, I am not the original author of this cluster. The people who created it left the company I am working with, and I inherited the code, and sometimes I do not know why some settings are used.
The old versions of pacemaker, corosync, crmsh and resource agents were compiled and installed.
I simply downloaded the new versions, compiled and installed them. I didn't get any complaint during ./configure, which usually checks for library compatibility.
To be honest I do not know if this is the right approach. Should I "make uninstall" the old versions before installing the new ones?
Which is the suggested approach?
Thanks in advance for your help.

OK fair enough!
To be honest the best approach is almost always to get the latest packages from the distributor rather than compile from source. That way you can be more sure that upgrades will go more smoothly. Though, to be honest, I'm not sure how good the Ubuntu packages are (they might be great, they might not, I genuinely don't know).
When building from source and if you don't know the provenance of the previous version then I would recommend a 'make uninstall' first - or removal of the packages if that's where they came from.
One thing you should do is make sure that all the cluster nodes are running the same version. If some are running older versions then nodes could drop out for obscure reasons. We try and keep minor versions on-wire compatible but it's always best to be cautious.
The tidying of your corosync.conf can wait for the moment, let's get things mostly working first. If you enable debug logging in corosync.conf:

logging {
  to_syslog: yes
  debug: on
}

Then see what happens and post the syslog file that has all of the corosync messages in it, we'll take it from there.

Chrissie

On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:

On 22/06/18 10:14, Salvatore D'angelo wrote:
Hi Christine,
Thanks for reply. Let me add a few details. When I run the corosync service I see the corosync process running. If I stop it and run:
corosync -f
I see three warnings:
warning [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
warning [MAIN  ] Please migrate config file to nodelist.
warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not permitted (1)
warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
but I see node joined.
Those certainly 

Re: [ClusterLabs] VM failure during shutdown

2018-06-25 Thread Ken Gaillot
On Mon, 2018-06-25 at 09:47 -0500, Ken Gaillot wrote:
> On Mon, 2018-06-25 at 11:33 +0300, Vaggelis Papastavros wrote:
> > Dear friends ,
> > 
> > We have the following configuration :
> > 
> > CentOS7 , pacemaker 0.9.152 and Corosync 2.4.0, storage with DRBD
> > and 
> > stonith enabled with APC PDU devices.
> > 
> > I have a windows VM configured as cluster resource with the
> > following 
> > attributes :
> > 
> > Resource: WindowSentinelOne_res (class=ocf provider=heartbeat 
> > type=VirtualDomain)
> > Attributes: hypervisor=qemu:///system 
> > config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelOne.x
> > ml
> >  
> > migration_transport=ssh
> > Utilization: cpu=8 hv_memory=8192
> > Operations: start interval=0s timeout=120s 
> > (WindowSentinelOne_res-start-interval-0s)
> >          stop interval=0s timeout=120s 
> > (WindowSentinelOne_res-stop-interval-0s)
> >  monitor interval=10s timeout=30s 
> > (WindowSentinelOne_res-monitor-interval-10s)
> > 
> > under some circumstances  (which i try to identify) the VM fails
> > and 
> > disappears under virsh list --all and also pacemaker reports the VM
> > as 
> > stopped .
> > 
> > If run pcs resource cleanup windows_wm everything is OK, but i
> > can't 
> > identify the reason of failure.
> > 
> > For example when shutdown the VM (with windows shutdown)  the
> > cluster 
> > reports the following :
> > 
> > WindowSentinelOne_res    (ocf::heartbeat:VirtualDomain): Started
> > sgw-
> > 02 
> > (failure ignored)
> > 
> > Failed Actions:
> > * WindowSentinelOne_res_monitor_1 on sgw-02 'not running' (7): 
> > call=67, status=complete, exitreason='none',
> >  last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms,
> > exec=0ms.
> > 
> > 
> > My questions are
> > 
> > 1) why the VM shutdown is reported as (FailedAction) from cluster ?
> > Its 
> > a worthy operation during VM life cycle .
> 
> Pacemaker has no way of knowing that the VM was intentionally shut
> down, vs crashed.
> 
> When some resource is managed by the cluster, all starts and stops of
> the resource have to go through the cluster. You can either set
> target-
> role=Stopped in the resource configuration, or if it's a temporary
> issue (e.g. rebooting for some OS updates), you could set is-
> managed=false to take it out of cluster control, do the work, then
> set
> is-managed=true again.

Also, a nice feature is that you can use rules to set a maintenance
window ahead of time (especially helpful if the person who maintains
the cluster isn't the same person who needs to do the VM updates). For
example, you could set a rule that the resource's is-managed option
will be false from 9pm to midnight on Fridays. See:

http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#idm140583511697312

particularly the parts about time/date expressions and using rules to
control resource options.
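
As a rough illustration of that idea, a CIB fragment along the following lines
could be added to the resource's meta attributes (the ids and the
Friday-evening window are made up; the structure follows the rules chapter
linked above):

  <meta_attributes id="WindowSentinelOne_res-maint-window">
    <rule id="maint-window-rule" score="INFINITY">
      <!-- hours 21-23 = 9pm to midnight, weekdays 5 = Friday -->
      <date_expression id="maint-window-expr" operation="date_spec">
        <date_spec id="maint-window-spec" hours="21-23" weekdays="5"/>
      </date_expression>
    </rule>
    <nvpair id="maint-window-unmanaged" name="is-managed" value="false"/>
  </meta_attributes>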

> 
> > 2) why sometimes the resource is marked as stopped (the VM is
> > healthy) 
> > and needs cleanup ?
> 
> That's a problem. If the VM is truly healthy, it sounds like there's
> an
> issue with the resource agent. You'd have to look at the logs to see
> if
> it gave any more information (e.g. if it's a timeout, raising the
> timeout might be sufficient).
> 
> > 3) I can't understand the corosync logs ... during the VM
> > shutdown 
> > corosync logs is the following
> 
> FYI, the system log will have the most important messages.
> corosync.log
> will additionally have info-level messages -- potentially helpful but
> definitely difficult to follow.
> 
> > Jun 25 07:41:37 [5140] sgw-02   crmd: info: 
> > process_lrm_event:    Result of monitor operation for 
> > WindowSentinelOne_res on sgw-02: 7 (not running) | call=67 
> > key=WindowSentinelOne_res_monitor_1 confirmed=false cib-
> > update=36
> 
> This is really the only important message. It says that a recurring
> monitor on the WindowSentinelOne_res resource on node sgw-02 exited
> with status code 7 (which means the resource agent thinks the
> resource
> is not running).
> 
> 'key=WindowSentinelOne_res_monitor_1' is how pacemaker identifies
> resource agent actions. The format is <resource name>_<action name>_<interval in milliseconds>.
> 
> This is the only information Pacemaker will get from the resource
> agent. To investigate more deeply, you'll have to check for log
> messages from the agent itself.
> 
> > Jun 25 07:41:37 [5130] sgw-02    cib: info: 
> > cib_process_request:    Forwarding cib_modify operation for
> > section 
> > status to all (origin=local/crmd/36)
> > Jun 25 07:41:37 [5130] sgw-02    cib: info:
> > cib_perform_op:
> > Diff: --- 0.4704.67 2
> > Jun 25 07:41:37 [5130] sgw-02    cib: info:
> > cib_perform_op:
> > Diff: +++ 0.4704.68 (null)
> > Jun 25 07:41:37 [5130] sgw-02    cib: info:
> > cib_perform_op:
> > +  /cib:  @num_updates=68
> > Jun 25 07:41:37 [5130] sgw-02    cib: info:
> > cib_perform_op:
> > +  

Re: [ClusterLabs] VM failure during shutdown

2018-06-25 Thread Ken Gaillot
On Mon, 2018-06-25 at 11:33 +0300, Vaggelis Papastavros wrote:
> Dear friends ,
> 
> We have the following configuration :
> 
> CentOS7 , pacemaker 0.9.152 and Corosync 2.4.0, storage with DRBD
> and 
> stonith enabled with APC PDU devices.
> 
> I have a windows VM configured as cluster resource with the
> following 
> attributes :
> 
> Resource: WindowSentinelOne_res (class=ocf provider=heartbeat 
> type=VirtualDomain)
> Attributes: hypervisor=qemu:///system 
> config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelOne.xml
>  
> migration_transport=ssh
> Utilization: cpu=8 hv_memory=8192
> Operations: start interval=0s timeout=120s 
> (WindowSentinelOne_res-start-interval-0s)
>          stop interval=0s timeout=120s 
> (WindowSentinelOne_res-stop-interval-0s)
>  monitor interval=10s timeout=30s 
> (WindowSentinelOne_res-monitor-interval-10s)
> 
> under some circumstances  (which i try to identify) the VM fails and 
> disappears under virsh list --all and also pacemaker reports the VM
> as 
> stopped .
> 
> If run pcs resource cleanup windows_wm everything is OK, but i can't 
> identify the reason of failure.
> 
> For example when shutdown the VM (with windows shutdown)  the
> cluster 
> reports the following :
> 
> WindowSentinelOne_res    (ocf::heartbeat:VirtualDomain): Started sgw-
> 02 
> (failure ignored)
> 
> Failed Actions:
> * WindowSentinelOne_res_monitor_1 on sgw-02 'not running' (7): 
> call=67, status=complete, exitreason='none',
>  last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms, exec=0ms.
> 
> 
> My questions are
> 
> 1) why the VM shutdown is reported as (FailedAction) from cluster ?
> Its 
> a worthy operation during VM life cycle .

Pacemaker has no way of knowing that the VM was intentionally shut
down, vs crashed.

When some resource is managed by the cluster, all starts and stops of
the resource have to go through the cluster. You can either set target-
role=Stopped in the resource configuration, or if it's a temporary
issue (e.g. rebooting for some OS updates), you could set is-
managed=false to take it out of cluster control, do the work, then set
is-managed=true again.
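
A rough sketch of those two options with pcs (the resource name is taken from
the configuration above; check pcs resource --help on your version for the
exact syntax):

  # stop the VM through the cluster (sets target-role=Stopped)
  pcs resource disable WindowSentinelOne_res

  # or take it out of cluster control temporarily
  pcs resource unmanage WindowSentinelOne_res
  # ... shut down / update / reboot the VM as needed ...
  pcs resource manage WindowSentinelOne_res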

> 2) why sometimes the resource is marked as stopped (the VM is
> healthy) 
> and needs cleanup ?

That's a problem. If the VM is truly healthy, it sounds like there's an
issue with the resource agent. You'd have to look at the logs to see if
it gave any more information (e.g. if it's a timeout, raising the
timeout might be sufficient).

> 3) I can't understand the corosync logs ... during the VM
> shutdown 
> corosync logs is the following

FYI, the system log will have the most important messages. corosync.log
will additionally have info-level messages -- potentially helpful but
definitely difficult to follow.

> Jun 25 07:41:37 [5140] sgw-02   crmd: info: 
> process_lrm_event:    Result of monitor operation for 
> WindowSentinelOne_res on sgw-02: 7 (not running) | call=67 
> key=WindowSentinelOne_res_monitor_1 confirmed=false cib-update=36

This is really the only important message. It says that a recurring
monitor on the WindowSentinelOne_res resource on node sgw-02 exited
with status code 7 (which means the resource agent thinks the resource
is not running).

'key=WindowSentinelOne_res_monitor_1' is how pacemaker identifies
resource agent actions. The format is <resource name>_<action name>_<interval in milliseconds>.

This is the only information Pacemaker will get from the resource
agent. To investigate more deeply, you'll have to check for log
messages from the agent itself.
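
For example, something along these lines (the log path and time window are
assumptions based on the CentOS 7 setup and the timestamps above):

  # messages logged by the VirtualDomain agent and libvirt around the failure
  grep -iE 'virtualdomain|WindowSentinelOne_res' /var/log/messages
  journalctl --since "2018-06-25 07:40" --until "2018-06-25 07:45" | grep -i virtualdomain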

> Jun 25 07:41:37 [5130] sgw-02    cib: info: 
> cib_process_request:    Forwarding cib_modify operation for section 
> status to all (origin=local/crmd/36)
> Jun 25 07:41:37 [5130] sgw-02    cib: info:
> cib_perform_op:
> Diff: --- 0.4704.67 2
> Jun 25 07:41:37 [5130] sgw-02    cib: info:
> cib_perform_op:
> Diff: +++ 0.4704.68 (null)
> Jun 25 07:41:37 [5130] sgw-02    cib: info:
> cib_perform_op:
> +  /cib:  @num_updates=68
> Jun 25 07:41:37 [5130] sgw-02    cib: info:
> cib_perform_op:
> +  /cib/status/node_state[@id='2']: @crm-debug-
> origin=do_update_resource
> Jun 25 07:41:37 [5130] sgw-02    cib: info:
> cib_perform_op:
> ++ 
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resour
> ce[@id='WindowSentinelOne_res']: 
>  operation_key="WindowSentinelOne_res_monitor_1"
> operation="monitor" 
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" 
> transition-key="84:3:0:f910c793-a714-4e24-80d1-b0ec66275491" 
> transition-magic="0:7;84:3:0:f910c793-a714-4e24-80d1-b0ec66275491" 
> on_node="sgw-02" cal
> Jun 25 07:41:37 [5130] sgw-02    cib: info: 
> cib_process_request:    Completed cib_modify operation for section 
> status: OK (rc=0, origin=sgw-02/crmd/36, version=0.4704.68)

You can usually ignore the 'cib' messages. This just means Pacemaker
recorded the result on disk.

> Jun 25 07:41:37 [5137] sgw-02  

Re: [ClusterLabs] corosync doesn't start any resource

2018-06-25 Thread Stefan Krueger
Hello Andrei,

> Then you need to set symmetrical="false".
yep, now it seems to work, thank you very much!

> I assume this would be "pcs constraint order set ...
> symmetrical=false".
yes almost:
pcs constraint order set nfs-server vm_storage ha-ip action=start setoptions 
symmetrical=false


Thank you very very much!

best regards
Stefan

> Sent: Saturday, 23 June 2018 at 22:13
> From: "Andrei Borzenkov" 
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] corosync doesn't start any resource
>
> 22.06.2018 11:22, Stefan Krueger wrote:
> > Hello Andrei,
> > 
> > thanks for this hint, but I need this "special" order. In another setup it 
> > works.
> > 
> 
> Then you need to set symmetrical="false". Otherwise pacemaker implicitly
> creates reverse order which leads to deadlock. I am not intimately
> familiar with pcs, I assume this would be "pcs constraint order set ...
> symmetrical=false".
> 
> > best regards
> > Stefan
> > 
> >> Sent: Friday, 22 June 2018 at 06:57
> >> From: "Andrei Borzenkov" 
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] corosync doesn't start any resource
> >>
> >> 21.06.2018 16:04, Stefan Krueger wrote:
> >>> Hi Ken,
> >>>
>  Can you attach the pe-input file listed just above here?
> >>> done ;) 
> >>>
> >>> And thank you for your patience!
> >>>
> >>
> >> You deleted all the context, which makes it hard to answer. This is not a web
> >> forum where users can simply scroll up to see the previous reply.
> >>
> >> Both your logs and pe-input show that nfs-server and vm-storage wait for
> >> each other.
> >>
> >> My best guess is that you have incorrect ordering for start and stop
> >> which causes loop in pacemaker decision. Your start order is "nfs-server
> >> vm-storage" and your stop order is "nfs-server vm-storage", while it
> >> should normally be symmetrical. Reversing order in one of sets makes it
> >> work as intended (verified).
> >>
> >> I would actually expect that asymmetrical configuration still should
> >> work, so I leave it to pacemaker developers to comment whether this is a
> >> bug or feature :)
> >>
> >> ___
> >> Users mailing list: Users@clusterlabs.org
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >>
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Questions about SBD behavior

2018-06-25 Thread Klaus Wenninger
On 06/25/2018 12:01 PM, 井上 和徳 wrote:
>> -Original Message-
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
>> Sent: Wednesday, June 13, 2018 6:40 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed; 井上 
>> 和
>> 徳
>> Subject: Re: [ClusterLabs] Questions about SBD behavior
>>
>> On 06/13/2018 10:58 AM, 井上 和徳 wrote:
>>> Thanks for the response.
>>>
>>> As of v1.3.1 and later, I recognized that real quorum is necessary.
>>> I also read this:
>>>
>> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery
>>> As related to this specification, in order to use pacemaker-2.0,
>>> we are confirming the following known issue.
>>>
>>> * When SIGSTOP is sent to the pacemaker process, no failure of the
>>>   resource will be detected.
>>>   https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
>>>   https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html
>>>
>>>   I expected that it was being handled by SBD, but no one detected
>>>   that the following process was frozen. Therefore, no failure of
>>>   the resource was detected either.
>>>   - pacemaker-based
>>>   - pacemaker-execd
>>>   - pacemaker-attrd
>>>   - pacemaker-schedulerd
>>>   - pacemaker-controld
>>>
>>>   I confirmed this, but I couldn't read about the correspondence
>>>   situation.
>>>
>> https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf
>> You are right. The issue was known as when I created these slides.
>> So a plan for improving the observation of the pacemaker-daemons
>> should have gone into that probably.
>>
> It's good news that there is a plan to improve.
> So I registered it as a memorandum in CLBZ:
> https://bugs.clusterlabs.org/show_bug.cgi?id=5356
>
> Best Regards
Wasn't there a bug filed before?

Klaus

>
>> Thanks for bringing this to the table.
>> Guess the issue got a little bit neglected recently.
>>
>>> As a result of our discussion, we want SBD to detect it and reset the
>>> machine.
>> Implementation wise I would go for some kind of a split
>> solution between pacemaker & SBD. Thinking of Pacemaker
>> observing the sub-daemons by itself while there would be
>> some kind of a heartbeat (implicitly via corosync or explicitly)
>> between pacemaker & SBD that assures this internal
>> observation is doing its job properly.
>>
>>> Also, for users who do not have shared disk or qdevice,
>>> we need an option to work even without real quorum.
>>> (fence races are going to avoid with delay attribute:
>>>  https://access.redhat.com/solutions/91653
>>>  https://access.redhat.com/solutions/1293523)
>> I'm not sure if I get your point here.
>> Watchdog-fencing on a 2-node-cluster without
>> additional qdevice or shared disk is like denying
>> the laws of physics in my mind.
>> At the moment I don't see why auto_tie_breaker
>> wouldn't work on a 4-node and up cluster here.
>>
>> Regards,
>> Klaus
>>> Best Regards,
>>> Kazunori INOUE
>>>
 -Original Message-
 From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus 
 Wenninger
 Sent: Friday, May 25, 2018 4:08 PM
 To: users@clusterlabs.org
 Subject: Re: [ClusterLabs] Questions about SBD behavior

 On 05/25/2018 07:31 AM, 井上 和徳 wrote:
> Hi,
>
> I am checking the watchdog function of SBD (without shared block-device).
> In a two-node cluster, if one cluster is stopped, watchdog is triggered 
> on the
 remaining node.
> Is this the designed behavior?
 SBD without a shared block-device doesn't really make sense on
 a two-node cluster.
 The basic idea is - e.g. in a case of a networking problem -
 that a cluster splits up in a quorate and a non-quorate partition.
 The quorate partition stays over while SBD guarantees a
 reliable watchdog-based self-fencing of the non-quorate partition
 within a defined timeout.
 This idea of course doesn't work with just 2 nodes.
 Taking quorum info from the 2-node feature of corosync (automatically
 switching on wait-for-all) doesn't help in this case but instead
 would lead to split-brain.
 What you can do - and what e.g. pcs does automatically - is enable
 the auto-tie-breaker instead of two-node in corosync. But that
 still doesn't give you a higher availability than the one of the
 winner of auto-tie-breaker. (Maybe interesting if you are going
 for a load-balancing-scenario that doesn't affect availability or
 for a transient state while setting up a cluster node-by-node ...)
 What you can do though is using qdevice to still have 'real-quorum'
 info with just 2 full cluster-nodes.
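
As a rough sketch of those two approaches, the quorum section of corosync.conf
would look something like the following; treat the exact option names and
values as assumptions to be checked against the votequorum(5) and
corosync-qdevice(8) man pages, and the qnetd host is a placeholder:

  quorum {
      provider: corosync_votequorum

      # option A: 4 or more nodes, auto_tie_breaker decides on even splits
      auto_tie_breaker: 1
      auto_tie_breaker_node: lowest

      # option B: 2 full nodes plus an external qnetd arbiter
      # device {
      #     model: net
      #     net {
      #         host: qnetd.example.com
      #         algorithm: ffsplit
      #     }
      # }
  }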

 There was quite a lot of discussion round this topic on this
 thread previously if you search the history.

 Regards,
 Klaus
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> 

[ClusterLabs] pcs 0.9.165 released

2018-06-25 Thread Tomas Jelinek

I am happy to announce the latest release of pcs, version 0.9.165.

Source code is available at:
https://github.com/ClusterLabs/pcs/archive/0.9.165.tar.gz
or
https://github.com/ClusterLabs/pcs/archive/0.9.165.zip


Complete change log for this release:
## [0.9.165] - 2018-06-21

### Added
- Pcsd option to reject client initiated SSL/TLS renegotiation
  ([rhbz#1566382])
- Commands for listing and testing watchdog devices ([rhbz#1475318])
- Option for setting netmtu in `pcs cluster setup` command
  ([rhbz#1535967])
- Validation for an unaccessible resource inside a bundle
  ([rhbz#1462248])
- Options to display and filter failures by an operation and its
  interval in `pcs resource failcount reset` and `pcs resource failcount
  show` commands ([rhbz#1427273])
- When starting a cluster, each node is now started with a small delay
  to help prevent a JOIN flood in corosync ([rhbz#1572886])

### Fixed
- `pcs cib-push diff-against=` does not consider an empty diff as an
  error ([ghpull#166]); see the sketch after this list for the workflow
  this option is part of
- `pcs resource update` does not create an empty meta\_attributes
  element any more ([rhbz#1568353])
- `pcs resource debug-*` commands provide debug messages even with
  pacemaker-1.1.18 and newer ([rhbz#1574898])
- `pcs config` no longer crashes when `crm_mon` prints anything to
  stderr ([rhbz#1581150])
- Removing resources using web UI when the operation takes longer than
  expected ([rhbz#1579911])
- Improve `pcs quorum device add` usage and man page ([rhbz#1476862])
- `pcs resource failcount show` works correctly with pacemaker-1.1.18
  and newer ([rhbz#1588667])
- Do not lowercase node addresses in the `pcs cluster auth` command
  ([rhbz#1590533])
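
For context on the `pcs cib-push diff-against=` entry above, a minimal sketch
of the workflow it belongs to (file and resource names are placeholders):

    pcs cluster cib original.xml        # save the current CIB to a file
    cp original.xml updated.xml
    pcs -f updated.xml resource meta my_res target-role=Stopped  # edit the copy offline
    pcs cluster cib-push updated.xml diff-against=original.xml   # push only the difference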

### Changed
- Watchdog devices are validated against a list provided by sbd
  ([rhbz#1475318]).


Thanks / congratulations to everyone who contributed to this release,
including Bruno Travouillon, Ivan Devat, Ondrej Mular and Tomas Jelinek.

Cheers,
Tomas


[ghpull#166]: https://github.com/ClusterLabs/pcs/pull/166
[rhbz#1427273]: https://bugzilla.redhat.com/show_bug.cgi?id=1427273
[rhbz#1462248]: https://bugzilla.redhat.com/show_bug.cgi?id=1462248
[rhbz#1475318]: https://bugzilla.redhat.com/show_bug.cgi?id=1475318
[rhbz#1476862]: https://bugzilla.redhat.com/show_bug.cgi?id=1476862
[rhbz#1535967]: https://bugzilla.redhat.com/show_bug.cgi?id=1535967
[rhbz#1566382]: https://bugzilla.redhat.com/show_bug.cgi?id=1566382
[rhbz#1568353]: https://bugzilla.redhat.com/show_bug.cgi?id=1568353
[rhbz#1572886]: https://bugzilla.redhat.com/show_bug.cgi?id=1572886
[rhbz#1574898]: https://bugzilla.redhat.com/show_bug.cgi?id=1574898
[rhbz#1579911]: https://bugzilla.redhat.com/show_bug.cgi?id=1579911
[rhbz#1581150]: https://bugzilla.redhat.com/show_bug.cgi?id=1581150
[rhbz#1588667]: https://bugzilla.redhat.com/show_bug.cgi?id=1588667
[rhbz#1590533]: https://bugzilla.redhat.com/show_bug.cgi?id=1590533
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Questions about SBD behavior

2018-06-25 Thread 井上 和徳
> -Original Message-
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Wednesday, June 13, 2018 6:40 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed; 井上 和
> 徳
> Subject: Re: [ClusterLabs] Questions about SBD behavior
> 
> On 06/13/2018 10:58 AM, 井上 和徳 wrote:
> > Thanks for the response.
> >
> > As of v1.3.1 and later, I recognized that real quorum is necessary.
> > I also read this:
> >
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery
> >
> > As related to this specification, in order to use pacemaker-2.0,
> > we are confirming the following known issue.
> >
> > * When SIGSTOP is sent to the pacemaker process, no failure of the
> >   resource will be detected.
> >   https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
> >   https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html
> >
> >   I expected that it was being handled by SBD, but no one detected
> >   that the following process was frozen. Therefore, no failure of
> >   the resource was detected either.
> >   - pacemaker-based
> >   - pacemaker-execd
> >   - pacemaker-attrd
> >   - pacemaker-schedulerd
> >   - pacemaker-controld
> >
> >   I confirmed this, but I couldn't read about the correspondence
> >   situation.
> >
> https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf
> You are right. The issue was known as when I created these slides.
> So a plan for improving the observation of the pacemaker-daemons
> should have gone into that probably.
> 

It's good news that there is a plan to improve.
So I registered it as a memorandum in CLBZ:
https://bugs.clusterlabs.org/show_bug.cgi?id=5356

Best Regards

> Thanks for bringing this to the table.
> Guess the issue got a little bit neglected recently.
> 
> >
> > As a result of our discussion, we want SBD to detect it and reset the
> > machine.
> 
> Implementation wise I would go for some kind of a split
> solution between pacemaker & SBD. Thinking of Pacemaker
> observing the sub-daemons by itself while there would be
> some kind of a heartbeat (implicitly via corosync or explicitly)
> between pacemaker & SBD that assures this internal
> observation is doing its job properly.
> 
> >
> > Also, for users who do not have shared disk or qdevice,
> > we need an option to work even without real quorum.
> > (fence races are going to avoid with delay attribute:
> >  https://access.redhat.com/solutions/91653
> >  https://access.redhat.com/solutions/1293523)
> I'm not sure if I get your point here.
> Watchdog-fencing on a 2-node-cluster without
> additional qdevice or shared disk is like denying
> the laws of physics in my mind.
> At the moment I don't see why auto_tie_breaker
> wouldn't work on a 4-node and up cluster here.
> 
> Regards,
> Klaus
> >
> > Best Regards,
> > Kazunori INOUE
> >
> >> -Original Message-
> >> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus 
> >> Wenninger
> >> Sent: Friday, May 25, 2018 4:08 PM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Questions about SBD behavior
> >>
> >> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
> >>> Hi,
> >>>
> >>> I am checking the watchdog function of SBD (without shared block-device).
> >>> In a two-node cluster, if one cluster is stopped, watchdog is triggered 
> >>> on the
> >> remaining node.
> >>> Is this the designed behavior?
> >> SBD without a shared block-device doesn't really make sense on
> >> a two-node cluster.
> >> The basic idea is - e.g. in a case of a networking problem -
> >> that a cluster splits up in a quorate and a non-quorate partition.
> >> The quorate partition stays over while SBD guarantees a
> >> reliable watchdog-based self-fencing of the non-quorate partition
> >> within a defined timeout.
> >> This idea of course doesn't work with just 2 nodes.
> >> Taking quorum info from the 2-node feature of corosync (automatically
> >> switching on wait-for-all) doesn't help in this case but instead
> >> would lead to split-brain.
> >> What you can do - and what e.g. pcs does automatically - is enable
> >> the auto-tie-breaker instead of two-node in corosync. But that
> >> still doesn't give you a higher availability than the one of the
> >> winner of auto-tie-breaker. (Maybe interesting if you are going
> >> for a load-balancing-scenario that doesn't affect availability or
> >> for a transient state while setting up a cluster node-by-node ...)
> >> What you can do though is using qdevice to still have 'real-quorum'
> >> info with just 2 full cluster-nodes.
> >>
> >> There was quite a lot of discussion round this topic on this
> >> thread previously if you search the history.
> >>
> >> Regards,
> >> Klaus
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > 

[ClusterLabs] Fencing 2 Nodes

2018-06-25 Thread Joniel Aguilar
Hi,

I have a 2-node cluster and I need to configure fencing. I want to configure
fencing through VMware ESXi. Which fencing agent do I need to use: fence_virt,
fence_xvm or fence_vmware_soap? I am quite confused about fencing. Could you
also provide a simple configuration for the most suitable fencing agent?

Thanks
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] VM failure during shutdown

2018-06-25 Thread Vaggelis Papastavros

Dear friends ,

We have the following configuration :

CentOS 7, pacemaker 0.9.152 and Corosync 2.4.0, storage with DRBD and 
stonith enabled with APC PDU devices.


I have a Windows VM configured as a cluster resource with the following 
attributes:


Resource: WindowSentinelOne_res (class=ocf provider=heartbeat 
type=VirtualDomain)
Attributes: hypervisor=qemu:///system 
config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelOne.xml 
migration_transport=ssh

Utilization: cpu=8 hv_memory=8192
Operations: start interval=0s timeout=120s 
(WindowSentinelOne_res-start-interval-0s)
        stop interval=0s timeout=120s 
(WindowSentinelOne_res-stop-interval-0s)
    monitor interval=10s timeout=30s 
(WindowSentinelOne_res-monitor-interval-10s)


Under some circumstances (which I am trying to identify) the VM fails and 
disappears from virsh list --all, and pacemaker also reports the VM as 
stopped.


If I run pcs resource cleanup windows_wm, everything is OK, but I can't 
identify the reason for the failure.


For example, when I shut down the VM (with a Windows shutdown), the cluster 
reports the following:


WindowSentinelOne_res    (ocf::heartbeat:VirtualDomain): Started sgw-02 
(failure ignored)


Failed Actions:
* WindowSentinelOne_res_monitor_1 on sgw-02 'not running' (7): 
call=67, status=complete, exitreason='none',

    last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms, exec=0ms.


My questions are

1) Why is the VM shutdown reported as a failed action by the cluster? It is 
a legitimate operation in the VM life cycle.


2) Why is the resource sometimes marked as stopped (while the VM is healthy) 
and in need of a cleanup?


3) I can't understand the corosync logs ... during the VM shutdown the 
corosync log shows the following:



Jun 25 07:41:37 [5140] sgw-02   crmd: info: 
process_lrm_event:    Result of monitor operation for 
WindowSentinelOne_res on sgw-02: 7 (not running) | call=67 
key=WindowSentinelOne_res_monitor_1 confirmed=false cib-update=36
Jun 25 07:41:37 [5130] sgw-02    cib: info: 
cib_process_request:    Forwarding cib_modify operation for section 
status to all (origin=local/crmd/36)
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
Diff: --- 0.4704.67 2
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
Diff: +++ 0.4704.68 (null)
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
+  /cib:  @num_updates=68
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
+  /cib/status/node_state[@id='2']: @crm-debug-origin=do_update_resource
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
++ 
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='WindowSentinelOne_res']: 
operation_key="WindowSentinelOne_res_monitor_1" operation="monitor" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" 
transition-key="84:3:0:f910c793-a714-4e24-80d1-b0ec66275491" 
transition-magic="0:7;84:3:0:f910c793-a714-4e24-80d1-b0ec66275491" 
on_node="sgw-02" cal
Jun 25 07:41:37 [5130] sgw-02    cib: info: 
cib_process_request:    Completed cib_modify operation for section 
status: OK (rc=0, origin=sgw-02/crmd/36, version=0.4704.68)
Jun 25 07:41:37 [5137] sgw-02  attrd: info: 
attrd_peer_update:    Setting fail-count-WindowSentinelOne_res[sgw-02]: 
(null) -> 1 from sgw-01
Jun 25 07:41:37 [5137] sgw-02  attrd: info: write_attribute:    
Sent update 10 with 1 changes for fail-count-WindowSentinelOne_res, 
id=, set=(null)
Jun 25 07:41:37 [5130] sgw-02    cib: info: 
cib_process_request:    Forwarding cib_modify operation for section 
status to all (origin=local/attrd/10)
Jun 25 07:41:37 [5137] sgw-02  attrd: info: 
attrd_peer_update:    Setting 
last-failure-WindowSentinelOne_res[sgw-02]: (null) -> 1529912497 from sgw-01
Jun 25 07:41:37 [5137] sgw-02  attrd: info: write_attribute:    
Sent update 11 with 1 changes for last-failure-WindowSentinelOne_res, 
id=, set=(null)
Jun 25 07:41:37 [5130] sgw-02    cib: info: 
cib_process_request:    Forwarding cib_modify operation for section 
status to all (origin=local/attrd/11)
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
Diff: --- 0.4704.68 2
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
Diff: +++ 0.4704.69 (null)
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
+  /cib:  @num_updates=69
Jun 25 07:41:37 [5130] sgw-02    cib: info: cib_perform_op:    
++ 
/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']: 
name="fail-count-WindowSentinelOne_res" value="1"/>
Jun 25 07:41:37 [5130] sgw-02    cib: info: 
cib_process_request:    Completed cib_modify operation for section 
status: OK (rc=0, origin=sgw-02/attrd/10, version=0.4704.69)
Jun 25 07:41:37 [5137] sgw-02  attrd: info: 
attrd_cib_callback:    Update 10 for 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Christine Caulfield
On 22/06/18 11:23, Salvatore D'angelo wrote:
> Hi,
> Here the log:
> 
> 
> 
[17323] pg1 corosyncerror   [QB] couldn't create circular mmap on
/dev/shm/qb-cfg-event-17324-17334-23-data
[17323] pg1 corosyncerror   [QB]
qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB] Free'ing ringbuffer:
/dev/shm/qb-cfg-request-17324-17334-23-header
[17323] pg1 corosyncdebug   [QB] Free'ing ringbuffer:
/dev/shm/qb-cfg-response-17324-17334-23-header
[17323] pg1 corosyncerror   [QB] shm connection FAILED: Resource
temporarily unavailable (11)
[17323] pg1 corosyncerror   [QB] Error in connection setup
(17324-17334-23): Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB] qb_ipcs_disconnect(17324-17334-23)
state:0



is /dev/shm full?


Chrissie


> 
> 
>> On 22 Jun 2018, at 12:10, Christine Caulfield  wrote:
>>
>> On 22/06/18 10:39, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Can you tell me exactly which log you need? I’ll provide it as soon as 
>>> possible.
>>>
>>> Regarding some settings, I am not the original author of this cluster. 
>>> The people who created it left the company I am working with, and I inherited the 
>>> code, and sometimes I do not know why some settings are used.
>>> The old versions of pacemaker, corosync, crmsh and resource agents were 
>>> compiled and installed.
>>> I simply downloaded the new versions, compiled and installed them. I didn’t 
>>> get any complaint during ./configure, which usually checks for library 
>>> compatibility.
>>>
>>> To be honest I do not know if this is the right approach. Should I “make 
>>> uninstall” the old versions before installing the new ones?
>>> Which is the suggested approach?
>>> Thanks in advance for your help.
>>>
>>
>> OK fair enough!
>>
>> To be honest the best approach is almost always to get the latest
>> packages from the distributor rather than compile from source. That way
>> you can be more sure that upgrades will go more smoothly. Though, to be
>> honest, I'm not sure how good the Ubuntu packages are (they might be
>> great, they might not, I genuinely don't know)
>>
>> When building from source and if you don't know the provenance of the
>> previous version then I would recommend a 'make uninstall' first - or
>> removal of the packages if that's where they came from.
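
A minimal sketch of that clean-up for a source build; the paths and the idea
of rebuilding in place are illustrative only:

  # in the source tree the *old* binaries were built from
  sudo make uninstall

  # then in the new source tree
  ./configure && make && sudo make install

  # afterwards, confirm every node reports the same version
  corosync -v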
>>
>> One thing you should do is make sure that all the cluster nodes are
>> running the same version. If some are running older versions then nodes
>> could drop out for obscure reasons. We try and keep minor versions
>> on-wire compatible but it's always best to be cautious.
>>
>> The tidying of your corosync.conf can wait for the moment, let's get
>> things mostly working first. If you enable debug logging in corosync.conf:
>>
>> logging {
>>to_syslog: yes
>>  debug: on
>> }
>>
>> Then see what happens and post the syslog file that has all of the
>> corosync messages in it, we'll take it from there.
>>
>> Chrissie
>>
 On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:

 On 22/06/18 10:14, Salvatore D'angelo wrote:
> Hi Christine,
>
> Thanks for the reply. Let me add a few details. When I run the corosync
> service I see the corosync process running. If I stop it and run:
>
> corosync -f 
>
> I see three warnings:
> warning [MAIN  ] interface section bindnetaddr is used together with
> nodelist. Nodelist one is going to be used.
> warning [MAIN  ] Please migrate config file to nodelist.
> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
> permitted (1)
> warning [MAIN  ] Could not set priority -2147483648: Permission denied 
> (13)
>
> but I see node joined.
>

 Those certainly need fixing but are probably not the cause. Also why do
 you have these values below set?

 max_network_delay: 100
 retransmits_before_loss_const: 25
 window_size: 150

 I'm not saying they are causing the trouble, but they aren't going to
 help keep a stable cluster.

 Without more logs (full logs are always better than just the bits you
 think are meaningful) I still can't be sure. it could easily be just
 that you've overwritten a packaged version of corosync with your own
 compiled one and they have different configure options or that the
 libraries now don't match.

 Chrissie


> My corosync.conf file is below.
>
> With service corosync up and running I have the following output:
> corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id= 10.0.0.11
> status= ring 0 active with no faults
> RING ID 1
> id= 192.168.0.11
> status= ring 1 active with no faults
>
> corosync-cmapctl | grep members
> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>