Re: [ClusterLabs] Upgrade corosync problem

2018-07-09 Thread Jan Pokorný
On 06/07/18 15:25 +0200, Salvatore D'angelo wrote:
> On 6 Jul 2018, at 14:40, Christine Caulfield  wrote:
>> Yes, you can't randomly swap in and out hand-compiled libqb versions.
>> Find one that works and stick to it. It's an annoying 'feature' of newer
>> linkers that we had to work around in libqb. So if you rebuild libqb
>> 1.0.3 then you will, in all likelihood, need to rebuild corosync to
>> match it.
> 
> The problem is the opposite of what you are saying.
> 
> When I build corosync against the old libqb and verify that the newly
> updated node works properly, and only then install the new hand-compiled
> libqb, it works fine.
> But in a normal upgrade procedure I first build libqb (removing the
> old one first) and then corosync, and when I follow this order it
> does not work.
> This is what drives me crazy. I do not understand this behavior.

I will assume you have all the steps right, like issuing the equivalent
of "make install" once you've built libqb, and ensuring that system-native
(e.g. distribution-packaged) libqb and corosync won't get mixed in here;
in short, that you are being cautious.

>> On 06/07/18 13:24, Salvatore D'angelo wrote:
>>> if I launch corosync -f I got:
>>> *corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite
>>> section is populated, otherwise target's build is at fault, preventing
>>> reliable logging" && __start___verbose != __stop___verbose' failed.*

The extreme and discouraged workaround is to compile corosync with
something like: "make CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION", even though
we'd rather diagnose why this happens in the first place.
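
For reference, a minimal sketch of that last-resort rebuild (assuming a plain
autotools build from an unpacked corosync source tree; the version in the path
is only illustrative):

  # rebuild corosync with libqb's callsite-section machinery compiled out
  cd corosync-2.4.4
  make clean
  ./configure
  make CPPFLAGS=-DQB_KILL_ATTRIBUTE_SECTION
  make install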

That apart, it'd be helpful to know the output of the following commands
once you have a corosync binary (symbolically referred to as $COROSYNC)
in a state in which the above error is reproduced, and you haven't changed
anything about your build environment:

  # version of the linker
  bash -c 'paste <(ld --version) <(ld.bfd --version) | head -n1'

  # how does the ELF section/respective symbols appear in the binary
  readelf -s $COROSYNC | grep ___verbose

Hopefully, it will allow us to advance here.
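
A couple of related checks in the same spirit (a sketch; the "__verbose"
section name is implied by the __start___verbose/__stop___verbose symbols in
the assertion, and the /usr/lib/libqb.so.0 path is the one seen elsewhere in
this thread, so adjust to your install prefix):

  # is the __verbose callsite section present in the binary at all?
  readelf -S $COROSYNC | grep -i verbose

  # does the installed libqb export the expected logging symbols?
  nm -D /usr/lib/libqb.so.0 | grep qb_log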

-- 
Nazdar,
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Salvatore D'angelo
Hi,

Thanks for the reply. The problem is the opposite of what you are saying.

When I build corosync against the old libqb and verify that the newly updated
node works properly, and only then install the new hand-compiled libqb, it works fine.
But in a normal upgrade procedure I first build libqb (removing the old one
first) and then corosync, and when I follow this order it does not work.
This is what drives me crazy.
I do not understand this behavior.

> On 6 Jul 2018, at 14:40, Christine Caulfield  wrote:
> 
> On 06/07/18 13:24, Salvatore D'angelo wrote:
>> Hi All,
>> 
>> The option --ulimit memlock=536870912 worked fine.
>> 
>> I have now another strange issue. The upgrade without updating libqb
>> (leaving the 0.16.0) worked fine.
>> If, after the upgrade, I stop pacemaker and corosync, download the
>> latest libqb version:
>> https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz
>> and build and install it, everything works fine.
>> 
>> If I try to install in sequence (after the installation of old code):
>> 
>> libqb 1.0.3
>> corosync 2.4.4
>> pacemaker 1.1.18
>> crmsh 3.0.1
>> resource agents 4.1.1
>> 
>> when I try to start corosync I got the following error:
>> *Starting Corosync Cluster Engine (corosync): /etc/init.d/corosync: line
>> 99:  8470 Aborted $prog $COROSYNC_OPTIONS > /dev/null 2>&1*
>> *[FAILED]*
> 
> 
> Yes, you can't randomly swap in and out hand-compiled libqb versions.
> Find one that works and stick to it. It's an annoying 'feature' of newer
> linkers that we had to work around in libqb. So if you rebuild libqb
> 1.0.3 then you will, in all likelihood, need to rebuild corosync to
> match it.
> 
> Chrissie
> 
> 
>> 
>> if I launch corosync -f I got:
>> *corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite
>> section is populated, otherwise target's build is at fault, preventing
>> reliable logging" && __start___verbose != __stop___verbose' failed.*
>> 
>> nothing is logged (not even in debug mode).
>> 
>> I do not understand why installing libqb during the normal upgrade
>> process fails, while upgrading it after the
>> crmsh/pacemaker/corosync/resourceagents upgrade works fine. 
>> 
>> On 3 Jul 2018, at 11:42, Christine Caulfield wrote:
>>> 
>>> On 03/07/18 07:53, Jan Pokorný wrote:
 On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
> Today I tested the two suggestions you gave me. Here what I did. 
> In the script where I create my 5 machines cluster (I use three
> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
> that we use for database backup and WAL files).
> 
> FIRST TEST
> ——
> I added the --shm-size=512m to the “docker create” command. I noticed
> that as soon as I start it the shm size is 512m and I didn’t need to
> add the entry in /etc/fstab. However, I did it anyway:
> 
> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
> 
> and then
> mount -o remount /dev/shm
> 
> Then I uninstalled all pieces of software (crmsh, resource agents,
> corosync and pacemaker) and installed the new one.
> Started corosync and pacemaker but same problem occurred.
> 
> SECOND TEST
> ———
> stopped corosync and pacemaker
> uninstalled corosync
> built corosync with --enable-small-memory-footprint and installed it
> started corosync and pacemaker
> 
> IT WORKED.
> 
> I would like to understand now why it didn’t work in the first test
> and why it worked in the second. Which kind of memory is used too much
> here? /dev/shm seems not the problem, I allocated 512m on all three
> docker images (obviously on my single Mac) and enabled the container
> option as you suggested. Am I missing something here?
 
 My suspicion then fully shifts towards "maximum number of bytes of
 memory that may be locked into RAM" per-process resource limit as
 raised in one of the most recent message ...
 
> Now I want to use Docker for the moment only for test purpose so it
> could be ok to use the --enable-small-memory-footprint, but there is
> something I can do to have corosync working even without this
> option?
 
 ... so try running the container the already suggested way:
 
  docker run ... --ulimit memlock=33554432 ...
 
 or possibly higher (as a rule of thumb, keep doubling the accumulated
 value until some unreasonable amount is reached, like the equivalent
 of already used 512 MiB).
 
 Hope this helps.
>>> 
>>> This makes a lot of sense to me. As Poki pointed out earlier, in
>>> corosync 2.4.3 (I think) we fixed a regression that caused corosync
>>> NOT to be locked in RAM after it forked - which was causing potential
>>> performance issues. So if you replace an earlier corosync with 2.4.3 or
>>> later then it will use more locked memory than before.

Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Christine Caulfield
On 06/07/18 13:24, Salvatore D'angelo wrote:
> Hi All,
> 
> The option --ulimit memlock=536870912 worked fine.
> 
> I have now another strange issue. The upgrade without updating libqb
> (leaving the 0.16.0) worked fine.
> If, after the upgrade, I stop pacemaker and corosync, download the
> latest libqb version:
> https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz
> and build and install it, everything works fine.
> 
> If I try to install in sequence (after the installation of old code):
> 
> libqb 1.0.3
> corosync 2.4.4
> pacemaker 1.1.18
> crmsh 3.0.1
> resource agents 4.1.1
> 
> when I try to start corosync I got the following error:
> *Starting Corosync Cluster Engine (corosync): /etc/init.d/corosync: line
> 99:  8470 Aborted                 $prog $COROSYNC_OPTIONS > /dev/null 2>&1*
> *[FAILED]*


Yes, you can't randomly swap in and out hand-compiled libqb versions.
Find one that works and stick to it. It's an annoying 'feature' of newer
linkers that we had to work around in libqb. So if you rebuild libqb
1.0.3 then you will, in all likelihood, need to rebuild corosync to
match it.

Chrissie


> 
> if I launch corosync -f I got:
> *corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite
> section is populated, otherwise target's build is at fault, preventing
> reliable logging" && __start___verbose != __stop___verbose' failed.*
> 
> nothing is logged (not even in debug mode).
> 
> I do not understand why installing libqb during the normal upgrade
> process fails, while upgrading it after the
> crmsh/pacemaker/corosync/resourceagents upgrade works fine. 
> 
> On 3 Jul 2018, at 11:42, Christine Caulfield wrote:
>>
>> On 03/07/18 07:53, Jan Pokorný wrote:
>>> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
 Today I tested the two suggestions you gave me. Here what I did. 
 In the script where I create my 5 machines cluster (I use three
 nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
 that we use for database backup and WAL files).

 FIRST TEST
 ——
 I added the --shm-size=512m to the “docker create” command. I noticed
 that as soon as I start it the shm size is 512m and I didn’t need to
 add the entry in /etc/fstab. However, I did it anyway:

 tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0

 and then
 mount -o remount /dev/shm

 Then I uninstalled all pieces of software (crmsh, resource agents,
 corosync and pacemaker) and installed the new one.
 Started corosync and pacemaker but same problem occurred.

 SECOND TEST
 ———
 stopped corosync and pacemaker
 uninstalled corosync
 built corosync with --enable-small-memory-footprint and installed it
 started corosync and pacemaker

 IT WORKED.

 I would like to understand now why it didn’t work in the first test
 and why it worked in the second. Which kind of memory is used too much
 here? /dev/shm seems not the problem, I allocated 512m on all three
 docker images (obviously on my single Mac) and enabled the container
 option as you suggested. Am I missing something here?
>>>
>>> My suspicion then fully shifts towards "maximum number of bytes of
>>> memory that may be locked into RAM" per-process resource limit as
>>> raised in one of the most recent message ...
>>>
 Now I want to use Docker for the moment only for test purpose so it
 could be ok to use the --enable-small-memory-footprint, but there is
 something I can do to have corosync working even without this
 option?
>>>
>>> ... so try running the container the already suggested way:
>>>
>>>  docker run ... --ulimit memlock=33554432 ...
>>>
>>> or possibly higher (as a rule of thumb, keep doubling the accumulated
>>> value until some unreasonable amount is reached, like the equivalent
>>> of already used 512 MiB).
>>>
>>> Hope this helps.
>>
>> This makes a lot of sense to me. As Poki pointed out earlier, in
>> corosync 2.4.3 (I think) we fixed a regression that caused corosync
>> NOT to be locked in RAM after it forked - which was causing potential
>> performance issues. So if you replace an earlier corosync with 2.4.3 or
>> later then it will use more locked memory than before.
>>
>> Chrissie
>>
>>
>>>
 The reason I am asking this is that, in the future, it could be
 possible that we deploy our cluster in production in a containerised way
 (for the moment it is just an idea). This will save a lot of time in
 developing, maintaining and deploying our patch system. All
 prerequisites and dependencies will be enclosed in the container, and if
 the IT team does some maintenance on bare metal (i.e. installs new
 dependencies) it will not affect our containers. I do not see a lot
 of performance drawbacks in using containers. The point is to
 understand whether a containerised approach could save us a lot of headache
 about maintaining this cluster without affecting performance too much.

Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Salvatore D'angelo
Here is some strace output of corosync:

execve("/usr/sbin/corosync", ["corosync"], [/* 21 vars */]) = 0
brk(0)  = 0x563b1f774000
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26561, ...}) = 0
mmap(NULL, 26561, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0cd4182000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/usr/lib/libtotem_pg.so.5", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260a\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=917346, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f0cd4181000
mmap(NULL, 2267392, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f0cd3d3d000
mprotect(0x7f0cd3d61000, 2093056, PROT_NONE) = 0
mmap(0x7f0cd3f6, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23000) = 0x7f0cd3f6
mmap(0x7f0cd3f62000, 18688, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0cd3f62000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/usr/lib/libcorosync_common.so.4", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\6\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=43858, ...}) = 0
mmap(NULL, 2105360, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f0cd3b3a000
mprotect(0x7f0cd3b3b000, 2097152, PROT_NONE) = 0
mmap(0x7f0cd3d3b000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f0cd3d3b000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\16\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=14664, ...}) = 0
mmap(NULL, 2109744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f0cd3936000
mprotect(0x7f0cd3939000, 2093056, PROT_NONE) = 0
mmap(0x7f0cd3b38000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f0cd3b38000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0po\0\0\0\0\0\0"..., 832) 
= 832
fstat(3, {st_mode=S_IFREG|0755, st_size=141574, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f0cd418
mmap(NULL, 2217264, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f0cd3718000
mprotect(0x7f0cd3731000, 2093056, PROT_NONE) = 0
mmap(0x7f0cd393, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18000) = 0x7f0cd393
mmap(0x7f0cd3932000, 13616, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0cd3932000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) 
= 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1857312, ...}) = 0
mmap(NULL, 3965632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f0cd334f000
mprotect(0x7f0cd350d000, 2097152, PROT_NONE) = 0
mmap(0x7f0cd370d000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1be000) = 0x7f0cd370d000
mmap(0x7f0cd3713000, 17088, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0cd3713000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/usr/lib/libqb.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\243\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=951833, ...}) = 0
mmap(NULL, 6717576, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7f0cd2ce6000
mprotect(0x7f0cd2d0b000, 6557696, PROT_NONE) = 0
mmap(0x7f0cd2f0a000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x24000) = 0x7f0cd2f0a000
mmap(0x7f0cd2f0c000, 264352, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0cd2f0c000
mmap(0x7f0cd334c000, 12288, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x66000) = 0x7f0cd334c000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libnss3.so", O_RDONLY|O_CLOEXEC) = 3
read(3, 
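
Since the trace shows the runtime linker resolving /usr/lib/libqb.so.0, a quick
cross-check (a suggestion, assuming corosync lives at the usual /usr/sbin path)
of which libqb the binary actually picks up:

  # confirm corosync resolves the hand-built libqb, not a leftover copy
  ldd /usr/sbin/corosync | grep libqb
  ls -l /usr/lib/libqb.so.0*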

Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Salvatore D'angelo
Hi All,

The option --ulimit memlock=536870912 worked fine.

I have now another strange issue. The upgrade without updating libqb (leaving 
the 0.16.0) worked fine.
If, after the upgrade, I stop pacemaker and corosync, download the latest libqb
version:
https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz
and build and install it, everything works fine.

If I try to install in sequence (after the installation of old code):

libqb 1.0.3
corosync 2.4.4
pacemaker 1.1.18
crmsh 3.0.1
resource agents 4.1.1

when I try to start corosync I got the following error:
Starting Corosync Cluster Engine (corosync): /etc/init.d/corosync: line 99:  
8470 Aborted $prog $COROSYNC_OPTIONS > /dev/null 2>&1
[FAILED]

if I launch corosync -f I got:
corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite section is 
populated, otherwise target's build is at fault, preventing reliable logging" 
&& __start___verbose != __stop___verbose' failed.

nothing is logged (not even in debug mode).

I do not understand why installing libqb during the normal upgrade process
fails, while upgrading it after the crmsh/pacemaker/corosync/resourceagents
upgrade works fine. 

On 3 Jul 2018, at 11:42, Christine Caulfield  wrote:
> 
> On 03/07/18 07:53, Jan Pokorný wrote:
>> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
>>> Today I tested the two suggestions you gave me. Here what I did. 
>>> In the script where I create my 5 machines cluster (I use three
>>> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
>>> that we use for database backup and WAL files).
>>> 
>>> FIRST TEST
>>> ——
>>> I added the --shm-size=512m to the “docker create” command. I noticed
>>> that as soon as I start it the shm size is 512m and I didn’t need to
>>> add the entry in /etc/fstab. However, I did it anyway:
>>> 
>>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>> 
>>> and then
>>> mount -o remount /dev/shm
>>> 
>>> Then I uninstalled all pieces of software (crmsh, resource agents,
>>> corosync and pacemaker) and installed the new one.
>>> Started corosync and pacemaker but same problem occurred.
>>> 
>>> SECOND TEST
>>> ———
>>> stopped corosync and pacemaker
>>> uninstalled corosync
>>> built corosync with --enable-small-memory-footprint and installed it
>>> started corosync and pacemaker
>>> 
>>> IT WORKED.
>>> 
>>> I would like to understand now why it didn’t work in the first test
>>> and why it worked in the second. Which kind of memory is used too much
>>> here? /dev/shm seems not the problem, I allocated 512m on all three
>>> docker images (obviously on my single Mac) and enabled the container
>>> option as you suggested. Am I missing something here?
>> 
>> My suspicion then fully shifts towards "maximum number of bytes of
>> memory that may be locked into RAM" per-process resource limit as
>> raised in one of the most recent message ...
>> 
>>> Now I want to use Docker for the moment only for test purpose so it
>>> could be ok to use the --enable-small-memory-footprint, but there is
>>> something I can do to have corosync working even without this
>>> option?
>> 
>> ... so try running the container the already suggested way:
>> 
>>  docker run ... --ulimit memlock=33554432 ...
>> 
>> or possibly higher (as a rule of thumb, keep doubling the accumulated
>> value until some unreasonable amount is reached, like the equivalent
>> of already used 512 MiB).
>> 
>> Hope this helps.
> 
> This makes a lot of sense to me. As Poki pointed out earlier, in
> corosync 2.4.3 (I think) we fixed a regression that caused corosync
> NOT to be locked in RAM after it forked - which was causing potential
> performance issues. So if you replace an earlier corosync with 2.4.3 or
> later then it will use more locked memory than before.
> 
> Chrissie
> 
> 
>> 
>>> The reason I am asking this is that, in the future, it could be
>>> possible that we deploy our cluster in production in a containerised way
>>> (for the moment it is just an idea). This will save a lot of time in
>>> developing, maintaining and deploying our patch system. All
>>> prerequisites and dependencies will be enclosed in the container, and if
>>> the IT team does some maintenance on bare metal (i.e. installs new
>>> dependencies) it will not affect our containers. I do not see a lot
>>> of performance drawbacks in using containers. The point is to
>>> understand whether a containerised approach could save us a lot of
>>> headache about maintaining this cluster without affecting performance
>>> too much. I have noticed this approach in a lot of Cloud contexts.
>> 
>> 
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: 

Re: [ClusterLabs] Upgrade corosync problem

2018-07-03 Thread Christine Caulfield
On 03/07/18 07:53, Jan Pokorný wrote:
> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
>> Today I tested the two suggestions you gave me. Here what I did. 
>> In the script where I create my 5 machines cluster (I use three
>> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
>> that we use for database backup and WAL files).
>>
>> FIRST TEST
>> ——
>> I added the --shm-size=512m to the “docker create” command. I noticed
>> that as soon as I start it the shm size is 512m and I didn’t need to
>> add the entry in /etc/fstab. However, I did it anyway:
>>
>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>
>> and then
>> mount -o remount /dev/shm
>>
>> Then I uninstalled all pieces of software (crmsh, resource agents,
>> corosync and pacemaker) and installed the new one.
>> Started corosync and pacemaker but same problem occurred.
>>
>> SECOND TEST
>> ———
>> stopped corosync and pacemaker
>> uninstalled corosync
>> built corosync with --enable-small-memory-footprint and installed it
>> started corosync and pacemaker
>>
>> IT WORKED.
>>
>> I would like to understand now why it didn’t work in the first test
>> and why it worked in the second. Which kind of memory is used too much
>> here? /dev/shm seems not the problem, I allocated 512m on all three
>> docker images (obviously on my single Mac) and enabled the container
>> option as you suggested. Am I missing something here?
> 
> My suspicion then fully shifts towards "maximum number of bytes of
> memory that may be locked into RAM" per-process resource limit as
> raised in one of the most recent message ...
> 
>> Now I want to use Docker for the moment only for test purpose so it
>> could be ok to use the --enable-small-memory-footprint, but there is
>> something I can do to have corosync working even without this
>> option?
> 
> ... so try running the container the already suggested way:
> 
>   docker run ... --ulimit memlock=33554432 ...
> 
> or possibly higher (as a rule of thumb, keep doubling the accumulated
> value until some unreasonable amount is reached, like the equivalent
> of already used 512 MiB).
> 
> Hope this helps.

This makes a lot of sense to me. As Poki pointed out earlier, in
corosync 2.4.3 (I think) we fixed a regression that caused corosync
NOT to be locked in RAM after it forked - which was causing potential
performance issues. So if you replace an earlier corosync with 2.4.3 or
later then it will use more locked memory than before.

Chrissie


> 
>> The reason I am asking this is that, in the future, it could be
>> possible that we deploy our cluster in production in a containerised way
>> (for the moment it is just an idea). This will save a lot of time in
>> developing, maintaining and deploying our patch system. All
>> prerequisites and dependencies will be enclosed in the container, and if
>> the IT team does some maintenance on bare metal (i.e. installs new
>> dependencies) it will not affect our containers. I do not see a lot
>> of performance drawbacks in using containers. The point is to
>> understand whether a containerised approach could save us a lot of
>> headache about maintaining this cluster without affecting performance
>> too much. I have noticed this approach in a lot of Cloud contexts.
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



Re: [ClusterLabs] Upgrade corosync problem

2018-07-03 Thread Jan Pokorný
On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
> Today I tested the two suggestions you gave me. Here what I did. 
> In the script where I create my 5 machines cluster (I use three
> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
> that we use for database backup and WAL files).
> 
> FIRST TEST
> ——
> I added the --shm-size=512m to the “docker create” command. I noticed
> that as soon as I start it the shm size is 512m and I didn’t need to
> add the entry in /etc/fstab. However, I did it anyway:
> 
> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
> 
> and then
> mount -o remount /dev/shm
> 
> Then I uninstalled all pieces of software (crmsh, resource agents,
> corosync and pacemaker) and installed the new one.
> Started corosync and pacemaker but same problem occurred.
> 
> SECOND TEST
> ———
> stopped corosync and pacemaker
> uninstalled corosync
 built corosync with --enable-small-memory-footprint and installed it
 started corosync and pacemaker
> 
> IT WORKED.
> 
 I would like to understand now why it didn’t work in the first test
 and why it worked in the second. Which kind of memory is used too much
> here? /dev/shm seems not the problem, I allocated 512m on all three
> docker images (obviously on my single Mac) and enabled the container
> option as you suggested. Am I missing something here?

My suspicion then fully shifts towards "maximum number of bytes of
memory that may be locked into RAM" per-process resource limit as
raised in one of the most recent message ...

> Now I want to use Docker for the moment only for test purpose so it
> could be ok to use the --enable-small-memory-footprint, but there is
> something I can do to have corosync working even without this
> option?

... so try running the container the already suggested way:

  docker run ... --ulimit memlock=33554432 ...

or possibly higher (as a rule of thumb, keep doubling the accumulated
value until some unreasonable amount is reached, like the equivalent
of already used 512 MiB).

Hope this helps.
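
One way to verify what the process actually ends up with (a sketch, assuming
a single running corosync process inside the container; the grep matches the
"Max locked memory" line of the limits file):

  # effective limit for the current shell, in kiB
  ulimit -l

  # effective limit of the running corosync process
  grep -i 'locked memory' /proc/$(pidof corosync)/limits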

> The reason I am asking this is that, in the future, it could be
> possible that we deploy our cluster in production in a containerised way
> (for the moment it is just an idea). This will save a lot of time in
> developing, maintaining and deploying our patch system. All
> prerequisites and dependencies will be enclosed in the container, and if
> the IT team does some maintenance on bare metal (i.e. installs new
> dependencies) it will not affect our containers. I do not see a lot
> of performance drawbacks in using containers. The point is to
> understand whether a containerised approach could save us a lot of
> headache about maintaining this cluster without affecting performance
> too much. I have noticed this approach in a lot of Cloud contexts.

-- 
Jan (Poki)




Re: [ClusterLabs] Upgrade corosync problem

2018-07-02 Thread Salvatore D'angelo
Hi All,

Today I tested the two suggestions you gave me. Here is what I did in the
script where I create my 5-machine cluster (I use three nodes for the
pacemaker PostgreSQL cluster and two nodes for glusterfs, which we use for
database backups and WAL files).

FIRST TEST
——
I added the --shm-size=512m to the “docker create” command. I noticed that as 
soon as I start it the shm size is 512m and I didn’t need to add the entry in 
/etc/fstab. However, I did it anyway:

tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0

and then
mount -o remount /dev/shm

Then I uninstalled all pieces of software (crmsh, resource agents, corosync and 
pacemaker) and installed the new one.
Started corosync and pacemaker but same problem occurred.

SECOND TEST
———
stopped corosync and pacemaker
uninstalled corosync
built corosync with --enable-small-memory-footprint and installed it
started corosync and pacemaker

IT WORKED.

I would like to understand now why it didn’t work in the first test and why it
worked in the second. Which kind of memory is being used too much here? /dev/shm
does not seem to be the problem: I allocated 512m on all three docker images
(obviously on my single Mac) and enabled the container option as you suggested.
Am I missing something here?

For the moment I want to use Docker only for test purposes, so it could be OK
to use --enable-small-memory-footprint, but is there something I can do to
have corosync working even without this option?


The reason I am asking this is that, in the future, it could be possible that
we deploy our cluster in production in a containerised way (for the moment it
is just an idea). This will save a lot of time in developing, maintaining and
deploying our patch system. All prerequisites and dependencies will be enclosed
in the container, and if the IT team does some maintenance on bare metal (i.e.
installs new dependencies) it will not affect our containers. I do not see a
lot of performance drawbacks in using containers. The point is to understand
whether a containerised approach could save us a lot of headache about
maintaining this cluster without affecting performance too much. I have
noticed this approach in a lot of Cloud contexts.
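
For completeness, a minimal sketch of a container created with both of the
container-level knobs discussed in this thread applied at once (the image name,
sizes and remaining flags are placeholders, not my exact command):

  docker create -it \
      --shm-size=512m \
      --ulimit memlock=33554432 \
      --name pg1 --hostname pg1 \
      my-cluster-image /bin/bash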


> On 2 Jul 2018, at 08:54, Christine Caulfield  wrote:
> 
> On 29/06/18 17:20, Jan Pokorný wrote:
>> On 29/06/18 10:00 +0100, Christine Caulfield wrote:
>>> On 27/06/18 08:35, Salvatore D'angelo wrote:
 One thing that I do not understand is that I tried to compare corosync
 2.3.5 (the old version that worked fine) and 2.4.4 to understand
 differences but I haven’t found anything related to the piece of code
 that affects the issue. The quorum tool.c and cfg.c are almost the same.
 Probably the issue is somewhere else.
 
>>> 
>>> This might be asking a bit much, but would it be possible to try this
>>> using Virtual Machines rather than Docker images? That would at least
>>> eliminate a lot of complex variables.
>> 
>> Salvatore, you can ignore the part below, try following the "--shm"
>> advice in other part of this thread.  Also the previous suggestion
>> to compile corosync with --small-memory-footprint may be of help,
>> but comes with other costs (expect lower throughput).
>> 
>> 
>> Chrissie, I have a plausible explanation and if it's true, then the
>> same will be reproduced wherever /dev/shm is small enough.
>> 
>> If I am right, then the offending commit is
>> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
>> (present since 2.4.3), and while it arranges things for the better
>> in the context of prioritized, low jitter process, it all of
>> a sudden prevents as-you-need memory acquisition from the system,
>> meaning that the memory consumption constraints are checked immediately
>> when the memory is claimed (as it must fit into dedicated physical
>> memory in full).  Hence this impact we likely never realized may
>> be perceived as a sort of a regression.
>> 
>> Since we can calculate the approximate requirements statically, might
>> be worthy to add something like README.requirements, detailing how much
>> space will be occupied for typical configurations at minimum, e.g.:
>> 
>> - standard + --small-memory-footprint configuration
>> - 2 + 3 + X nodes (5?)
>> - without any service on top + teamed with qnetd + teamed with
>>  pacemaker atop (including just IPC channels between pacemaker
>>  daemons and corosync's CPG service, indeed)
>> 
> 
> That is a possible explanation I suppose, yes. It's not something we can
> sensibly revert because it was already fixing another regression!
> 
> 
> I like the idea of documenting the /dev/shm requirements - that would
> certainly help with other people using containers - Salvatore mentioned
> earlier that there was nothing to guide him about the size needed. I'll
> raise an issue in github to cover it. Your input on how to do it for
> containers would also be helpful.
> 
> Chrissie
> 

Re: [ClusterLabs] Upgrade corosync problem

2018-07-02 Thread Christine Caulfield
On 29/06/18 17:20, Jan Pokorný wrote:
> On 29/06/18 10:00 +0100, Christine Caulfield wrote:
>> On 27/06/18 08:35, Salvatore D'angelo wrote:
>>> One thing that I do not understand is that I tried to compare corosync
>>> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
>>> differences but I haven’t found anything related to the piece of code
>>> that affects the issue. The quorum tool.c and cfg.c are almost the same.
>>> Probably the issue is somewhere else.
>>>
>>
>> This might be asking a bit much, but would it be possible to try this
>> using Virtual Machines rather than Docker images? That would at least
>> eliminate a lot of complex variables.
> 
> Salvatore, you can ignore the part below, try following the "--shm"
> advice in other part of this thread.  Also the previous suggestion
> to compile corosync with --small-memory-footprint may be of help,
> but comes with other costs (expect lower throughput).
> 
> 
> Chrissie, I have a plausible explanation and if it's true, then the
> same will be reproduced wherever /dev/shm is small enough.
> 
> If I am right, then the offending commit is
> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
> (present since 2.4.3), and while it arranges things for the better
> in the context of prioritized, low jitter process, it all of
> a sudden prevents as-you-need memory acquisition from the system,
> meaning that the memory consumption constraints are checked immediately
> when the memory is claimed (as it must fit into dedicated physical
> memory in full).  Hence this impact we likely never realized may
> be perceived as a sort of a regression.
> 
> Since we can calculate the approximate requirements statically, might
> be worthy to add something like README.requirements, detailing how much
> space will be occupied for typical configurations at minimum, e.g.:
> 
> - standard + --small-memory-footprint configuration
> - 2 + 3 + X nodes (5?)
> - without any service on top + teamed with qnetd + teamed with
>   pacemaker atop (including just IPC channels between pacemaker
>   daemons and corosync's CPG service, indeed)
> 

That is a possible explanation I suppose, yes. It's not something we can
sensibly revert because it was already fixing another regression!


I like the idea of documenting the /dev/shm requirements - that would
certainly help with other people using containers - Salvatore mentioned
earlier that there was nothing to guide him about the size needed. I'll
raise an issue in github to cover it. Your input on how to do it for
containers would also be helpful.

Chrissie


Re: [ClusterLabs] Upgrade corosync problem

2018-06-30 Thread Salvatore D'angelo
Hi everyone,

Thanks for the suggestions. Yesterday was a city holiday in Rome, so with the
weekend I think I’ll try all your proposals on Monday morning
when I go back to the office. Thanks again for the support, I appreciate it a lot.

> On 29 Jun 2018, at 18:20, Jan Pokorný  wrote:
> 
> On 29/06/18 10:00 +0100, Christine Caulfield wrote:
>> On 27/06/18 08:35, Salvatore D'angelo wrote:
>>> One thing that I do not understand is that I tried to compare corosync
>>> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
>>> differences but I haven’t found anything related to the piece of code
>>> that affects the issue. The quorum tool.c and cfg.c are almost the same.
>>> Probably the issue is somewhere else.
>>> 
>> 
>> This might be asking a bit much, but would it be possible to try this
>> using Virtual Machines rather than Docker images? That would at least
>> eliminate a lot of complex variables.
> 
> Salvatore, you can ignore the part below, try following the "--shm"
> advice in other part of this thread.  Also the previous suggestion
> to compile corosync with --small-memory-footprint may be of help,
> but comes with other costs (expect lower throughput).
> 
> 
> Chrissie, I have a plausible explanation and if it's true, then the
> same will be reproduced wherever /dev/shm is small enough.
> 
> If I am right, then the offending commit is
> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
> (present since 2.4.3), and while it arranges things for the better
> in the context of prioritized, low jitter process, it all of
> a sudden prevents as-you-need memory acquisition from the system,
> meaning that the memory consumption constraints are checked immediately
> when the memory is claimed (as it must fit into dedicated physical
> memory in full).  Hence this impact we likely never realized may
> be perceived as a sort of a regression.
> 
> Since we can calculate the approximate requirements statically, might
> be worthy to add something like README.requirements, detailing how much
> space will be occupied for typical configurations at minimum, e.g.:
> 
> - standard + --small-memory-footprint configuration
> - 2 + 3 + X nodes (5?)
> - without any service on top + teamed with qnetd + teamed with
>  pacemaker atop (including just IPC channels between pacemaker
>  daemons and corosync's CPG service, indeed)
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



Re: [ClusterLabs] Upgrade corosync problem

2018-06-29 Thread Jan Pokorný
On 29/06/18 19:13 +0200, Salvatore D'angelo wrote:
> Good to know. I'll try it. I'll try to work on VM too.

If that won't work, you can also try:

  docker run ... --ulimit memlock=33554432 ...

where 32768 (kiB) may still be not enough (assuming the default
of 16384), hard to say, since proper root user may normally be
bypassing any such limitations.

Good luck.

> On Fri, 29 Jun 2018 at 5:46 PM, Jan Pokorný wrote:
> 
>> On 26/06/18 11:03 +0200, Salvatore D'angelo wrote:
>>> Yes, sorry you’re right I could find it by myself.
>>> However, I did the following:
>>> 
>>> 1. Added the line you suggested to /etc/fstab
>>> 2. mount -o remount /dev/shm
>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> shm 512M   15M  498M   3% /dev/shm
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> tmpfs   128M 0  128M   0% /tmp
>>> 
>>> The errors in the log went away. Consider that I remove the log file
>>> before starting corosync, so it does not contain lines from previous
>>> executions.
>>> 
>>> 
>>> But the command:
>>> corosync-quorumtool -ps
>>> 
>>> still give:
>>> Cannot initialize QUORUM service
>>> 
>>> Consider that few minutes before it gave me the message:
>>> Cannot initialize CFG service
>>> 
>>> I do not know the differences between CFG and QUORUM in this case.
>>> 
>>> If I try to start pacemaker the service is OK but I see only
>>> pacemaker and the Transport does not work if I try to run a crm
>>> command.
>>> Any suggestion?
>> 
>> Frankly, best generic suggestion I can serve with is to learn
>> sufficient portions of the details about the tool you are relying on.
>> 
>> I had a second look and it seems that what drives the actual
>> size of the container's /dev/shm mountpoint with docker
>> (per other response, you don't seem to be using --ipc switch) is
>> its --shm-size option for "run" subcommand (hence it's rather
>> a property of the run-time, as the default of "64m" may be
>> silently overriding your believed-to-be-persistent static changes
>> within the container).
>> 
>> Try using that option and you'll see.  Definitely keep your mind open
>> regarding "container != magic-less system" inequality.

-- 
Jan (Poki)




Re: [ClusterLabs] Upgrade corosync problem

2018-06-29 Thread Salvatore D'angelo
Good to know. I'll try it. I'll try to work on VM too.

On Fri, 29 Jun 2018 at 5:46 PM, Jan Pokorný wrote:

> On 26/06/18 11:03 +0200, Salvatore D'angelo wrote:
> > Yes, sorry you’re right I could find it by myself.
> > However, I did the following:
> >
> > 1. Added the line you suggested to /etc/fstab
> > 2. mount -o remount /dev/shm
> > 3. Now I correctly see /dev/shm of 512M with df -h
> > Filesystem  Size  Used Avail Use% Mounted on
> > overlay  63G   11G   49G  19% /
> > tmpfs64M  4.0K   64M   1% /dev
> > tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> > osxfs   466G  158G  305G  35% /Users
> > /dev/sda163G   11G   49G  19% /etc/hosts
> > shm 512M   15M  498M   3% /dev/shm
> > tmpfs  1000M 0 1000M   0% /sys/firmware
> > tmpfs   128M 0  128M   0% /tmp
> >
> > The errors in the log went away. Consider that I remove the log file
> > before starting corosync, so it does not contain lines from previous
> > executions.
> >
> >
> > But the command:
> > corosync-quorumtool -ps
> >
> > still give:
> > Cannot initialize QUORUM service
> >
> > Consider that few minutes before it gave me the message:
> > Cannot initialize CFG service
> >
> > I do not know the differences between CFG and QUORUM in this case.
> >
> > If I try to start pacemaker the service is OK but I see only
> > pacemaker and the Transport does not work if I try to run a crm
> > command.
> > Any suggestion?
>
> Frankly, best generic suggestion I can serve with is to learn
> sufficient portions of the details about the tool you are relying on.
>
> I had a second look and it seems that what drives the actual
> size of the container's /dev/shm mountpoint with docker
> (per other response, you don't seem to be using --ipc switch) is
> its --shm-size option for "run" subcommand (hence it's rather
> a property of the run-time, as the default of "64m" may be
> silently overriding your believed-to-be-persistent static changes
> within the container).
>
> Try using that option and you'll see.  Definitely keep your mind open
> regarding "container != magic-less system" inequality.
>
> --
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


Re: [ClusterLabs] Upgrade corosync problem

2018-06-29 Thread Jan Pokorný
On 29/06/18 10:00 +0100, Christine Caulfield wrote:
> On 27/06/18 08:35, Salvatore D'angelo wrote:
>> One thing that I do not understand is that I tried to compare corosync
>> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
>> differences but I haven’t found anything related to the piece of code
>> that affects the issue. The quorum tool.c and cfg.c are almost the same.
>> Probably the issue is somewhere else.
>> 
> 
> This might be asking a bit much, but would it be possible to try this
> using Virtual Machines rather than Docker images? That would at least
> eliminate a lot of complex variables.

Salvatore, you can ignore the part below, try following the "--shm"
advice in other part of this thread.  Also the previous suggestion
to compile corosync with --small-memory-footprint may be of help,
but comes with other costs (expect lower throughput).


Chrissie, I have a plausible explanation and if it's true, then the
same will be reproduced wherever /dev/shm is small enough.

If I am right, then the offending commit is
https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
(present since 2.4.3), and while it arranges things for the better
in the context of prioritized, low jitter process, it all of
a sudden prevents as-you-need memory acquisition from the system,
meaning that the memory consumption constraints are checked immediately
when the memory is claimed (as it must fit into dedicated physical
memory in full).  Hence this impact we likely never realized may
be perceived as a sort of a regression.

Since we can calculate the approximate requirements statically, might
be worthy to add something like README.requirements, detailing how much
space will be occupied for typical configurations at minimum, e.g.:

- standard + --small-memory-footprint configuration
- 2 + 3 + X nodes (5?)
- without any service on top + teamed with qnetd + teamed with
  pacemaker atop (including just IPC channels between pacemaker
  daemons and corosync's CPG service, indeed)
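
Until such a document exists, one empirical way to gauge the footprint (my own
suggestion; libqb backs its IPC and blackbox buffers with qb-* files under
/dev/shm) is simply:

  # see what corosync/pacemaker currently occupy in the shared-memory mount
  ls -lh /dev/shm/qb-*
  df -h /dev/shm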

-- 
Jan (Poki)




Re: [ClusterLabs] Upgrade corosync problem

2018-06-29 Thread Jan Pokorný
On 26/06/18 11:03 +0200, Salvatore D'angelo wrote:
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 512M with df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> shm 512M   15M  498M   3% /dev/shm
> tmpfs  1000M 0 1000M   0% /sys/firmware
> tmpfs   128M 0  128M   0% /tmp
> 
> The errors in the log went away. Consider that I remove the log file
> before starting corosync, so it does not contain lines from previous
> executions.
> 
> 
> But the command:
> corosync-quorumtool -ps
> 
> still give:
> Cannot initialize QUORUM service
> 
> Consider that few minutes before it gave me the message:
> Cannot initialize CFG service
> 
> I do not know the differences between CFG and QUORUM in this case.
> 
> If I try to start pacemaker the service is OK but I see only
> pacemaker and the Transport does not work if I try to run a crm
> command.
> Any suggestion?

Frankly, best generic suggestion I can serve with is to learn
sufficient portions of the details about the tool you are relying on.

I had a second look and it seems that what drives the actual
size of the container's /dev/shm mountpoint with docker
(per other response, you don't seem to be using --ipc switch) is
its --shm-size option for "run" subcommand (hence it's rather
a property of the run-time, as the default of "64m" may be
silently overriding your believed-to-be-persistent static changes
within the container).

Try using that option and you'll see.  Definitely keep your mind open
regarding "container != magic-less system" inequality.

-- 
Jan (Poki)




Re: [ClusterLabs] Upgrade corosync problem

2018-06-29 Thread Christine Caulfield
On 27/06/18 08:35, Salvatore D'angelo wrote:
> Hi,
> 
> Thanks for the reply and detailed explanation. I am not using the
> --network=host option.
> I have a docker image based on Ubuntu 14.04 where I only deploy this
> additional software:
> 
> RUN apt-get update && apt-get install -y wget git xz-utils
> openssh-server \
> systemd-services make gcc pkg-config psmisc fuse libpython2.7
> libopenipmi0 \
> libdbus-glib-1-2 libsnmp30 libtimedate-perl libpcap0.8
> 
> configure ssh with key pairs to communicate easily. The containers are
> created with these simple commands:
> 
> docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device
> /dev/loop0 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish
> ${PG1_SSH_PORT}:22 --ip ${PG1_PUBLIC_IP} --name ${PG1_PRIVATE_NAME}
> --hostname ${PG1_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash
> 
> docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device
> /dev/loop1 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish
> ${PG2_SSH_PORT}:22 --ip ${PG2_PUBLIC_IP} --name ${PG2_PRIVATE_NAME}
> --hostname ${PG2_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash
> 
> docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device
> /dev/loop2 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish
> ${PG3_SSH_PORT}:22 --ip ${PG3_PUBLIC_IP} --name ${PG3_PRIVATE_NAME}
> --hostname ${PG3_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash
> 
> /dev/fuse is used to configure glusterfs on the two other nodes, and
> /dev/loopX just to better simulate my bare metal env.
> 
> One thing that I do not understand is that I tried to compare corosync
> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
> differences but I haven’t found anything related to the piece of code
> that affects the issue. The quorum tool.c and cfg.c are almost the same.
> Probably the issue is somewhere else.
> 

This might be asking a bit much, but would it be possible to try this
using Virtual Machines rather than Docker images? That would at least
eliminate a lot of complex variables.

Chrissie


> 
>> On 27 Jun 2018, at 08:34, Jan Pokorný wrote:
>>
>> On 26/06/18 17:56 +0200, Salvatore D'angelo wrote:
>>> I did another test. I modified docker container in order to be able
>>> to run strace.
>>> Running strace corosync-quorumtool -ps I got the following:
>>
>>> [snipped]
>>> connect(5, {sa_family=AF_LOCAL, sun_path=@"cfg"}, 110) = 0
>>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
>>> sendto(5,
>>> "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24,
>>> MSG_NOSIGNAL, NULL, 0) = 24
>>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
>>> recvfrom(5, 0x7ffd73bd7ac0, 12328, 16640, 0, 0) = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5,
>>> revents=POLLIN}])
>>> recvfrom(5,
>>> "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\365\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0"...,
>>> 12328, MSG_WAITALL|MSG_NOSIGNAL, NULL, NULL) = 12328
>>> shutdown(5, SHUT_RDWR)  = 0
>>> close(5)    = 0
>>> write(2, "Cannot initialise CFG service\n", 30Cannot initialise CFG
>>> service) = 30
>>> [snipped]
>>
>> This just demonstrated the effect of already detailed server-side
>> error in the client, which communicates with the server just fine,
>> but as soon as the server hits the mmap-based problem, it bails
>> out the observed way, leaving client unsatisfied.
>>
>> Note one thing, abstract Unix sockets are being used for the
>> communication like this (observe the first line in the strace
>> output excerpt above), and if you happen to run container via
>> a docker command with --network=host, you may also be affected with
>> issues arising from abstract sockets not being isolated but rather
>> sharing the same namespace.  At least that was the case some years
>> back and what asked for a switch in underlying libqb library to
>> use strictly the file-backed sockets, where the isolation
>> semantics matches the intuition:
>>
>> https://lists.clusterlabs.org/pipermail/users/2017-May/013003.html
>>
>> + way to enable (presumably only for container environments, note
>> that there's no per process straightforward granularity):
>>
>> https://clusterlabs.github.io/libqb/1.0.2/doxygen/qb_ipc_overview.html
>> (scroll down to "IPC sockets (Linux only)")
>>
>> You may test that if you are using said --network=host switch.
>>
>>> I tried to understand what happen behind the scene but it is not easy
>>> for me.
>>> Hoping someone on this list can help.
>>
>> Containers are tricky, just as Ansible (as shown earlier on the list)
>> can be, when encumbered with false beliefs and/or misunderstandings.
>> Virtual machines may serve better wrt. insights for the later bare
>> metal deployments.
>>
>> -- 
>> Jan (Poki)
>> ___
>> Users mailing list: Users@clusterlabs.org
>> 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-27 Thread Salvatore D'angelo
Hi,

Thanks for the reply and detailed explanation. I am not using the --network=host
option.
I have a docker image based on Ubuntu 14.04 where I only deploy this additional 
software:

RUN apt-get update && apt-get install -y wget git xz-utils 
openssh-server \
systemd-services make gcc pkg-config psmisc fuse libpython2.7 
libopenipmi0 \
libdbus-glib-1-2 libsnmp30 libtimedate-perl libpcap0.8

configure ssh with key pairs to communicate easily. The containers are created 
with these simple commands:

docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device 
/dev/loop0 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish
 ${PG1_SSH_PORT}:22 --ip ${PG1_PUBLIC_IP} --name ${PG1_PRIVATE_NAME} --hostname 
${PG1_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash

docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device 
/dev/loop1 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish 
${PG2_SSH_PORT}:22 --ip ${PG2_PUBLIC_IP} --name ${PG2_PRIVATE_NAME} --hostname 
${PG2_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash 

docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device 
/dev/loop2 --device /dev/fuse --net ${PUBLIC_NETWORK_NAME} --publish 
${PG3_SSH_PORT}:22 --ip ${PG3_PUBLIC_IP} --name ${PG3_PRIVATE_NAME} --hostname 
${PG3_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash

/dev/fuse is used to configure glusterfs on the two other nodes, and /dev/loopX
just to better simulate my bare metal env.

One thing that I do not understand is that I tried to compare corosync 2.3.5
(the old version that worked fine) and 2.4.4 to understand the differences, but I
haven’t found anything related to the piece of code that affects the issue. The
quorumtool.c and cfg.c are almost the same. Probably the issue is somewhere
else.
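
A quick way to double-check that (a sketch; the exact file paths inside the
source tree are assumptions on my part):

  git clone https://github.com/corosync/corosync.git
  cd corosync
  # compare the two releases, limited to the files in question
  git diff --stat v2.3.5 v2.4.4 -- tools/corosync-quorumtool.c exec/cfg.c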


> On 27 Jun 2018, at 08:34, Jan Pokorný  wrote:
> 
> On 26/06/18 17:56 +0200, Salvatore D'angelo wrote:
>> I did another test. I modified docker container in order to be able to run 
>> strace.
>> Running strace corosync-quorumtool -ps I got the following:
> 
>> [snipped]
>> connect(5, {sa_family=AF_LOCAL, sun_path=@"cfg"}, 110) = 0
>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
>> sendto(5, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, 
>> MSG_NOSIGNAL, NULL, 0) = 24
>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
>> recvfrom(5, 0x7ffd73bd7ac0, 12328, 16640, 0, 0) = -1 EAGAIN (Resource 
>> temporarily unavailable)
>> poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
>> recvfrom(5, 
>> "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\365\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0"...,
>>  12328, MSG_WAITALL|MSG_NOSIGNAL, NULL, NULL) = 12328
>> shutdown(5, SHUT_RDWR)  = 0
>> close(5)= 0
>> write(2, "Cannot initialise CFG service\n", 30Cannot initialise CFG service) 
>> = 30
>> [snipped]
> 
> This just demonstrated the effect of already detailed server-side
> error in the client, which communicates with the server just fine,
> but as soon as the server hits the mmap-based problem, it bails
> out the observed way, leaving client unsatisfied.
> 
> Note one thing, abstract Unix sockets are being used for the
> communication like this (observe the first line in the strace
> output excerpt above), and if you happen to run container via
> a docker command with --network=host, you may also be affected with
> issues arising from abstract sockets not being isolated but rather
> sharing the same namespace.  At least that was the case some years
> back and what asked for a switch in underlying libqb library to
> use strictly the file-backed sockets, where the isolation
> semantics matches the intuition:
> 
> https://lists.clusterlabs.org/pipermail/users/2017-May/013003.html
> 
> + way to enable (presumably only for container environments, note
> that there's no per process straightforward granularity):
> 
> https://clusterlabs.github.io/libqb/1.0.2/doxygen/qb_ipc_overview.html
> (scroll down to "IPC sockets (Linux only)")
> 
> You may test that if you are using said --network=host switch.
> 
>> I tried to understand what happen behind the scene but it is not easy for me.
>> Hoping someone on this list can help.
> 
> Containers are tricky, just as Ansible (as shown earlier on the list)
> can be, when encumbered with false beliefs and/or misunderstandings.
> Virtual machines may serve better wrt. insights for the later bare
> metal deployments.
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org

Re: [ClusterLabs] Upgrade corosync problem

2018-06-27 Thread Jan Pokorný
On 26/06/18 17:56 +0200, Salvatore D'angelo wrote:
> I did another test. I modified docker container in order to be able to run 
> strace.
> Running strace corosync-quorumtool -ps I got the following:

> [snipped]
> connect(5, {sa_family=AF_LOCAL, sun_path=@"cfg"}, 110) = 0
> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
> sendto(5, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, 
> MSG_NOSIGNAL, NULL, 0) = 24
> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
> recvfrom(5, 0x7ffd73bd7ac0, 12328, 16640, 0, 0) = -1 EAGAIN (Resource 
> temporarily unavailable)
> poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
> recvfrom(5, 
> "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\365\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0"...,
>  12328, MSG_WAITALL|MSG_NOSIGNAL, NULL, NULL) = 12328
> shutdown(5, SHUT_RDWR)  = 0
> close(5)= 0
> write(2, "Cannot initialise CFG service\n", 30Cannot initialise CFG service) 
> = 30
> [snipped]

This just demonstrated the effect of already detailed server-side
error in the client, which communicates with the server just fine,
but as soon as the server hits the mmap-based problem, it bails
out the observed way, leaving client unsatisfied.

Note one thing, abstract Unix sockets are being used for the
communication like this (observe the first line in the strace
output excerpt above), and if you happen to run container via
a docker command with --network=host, you may also be affected with
issues arising from abstract sockets not being isolated but rather
sharing the same namespace.  At least that was the case some years
back and what asked for a switch in underlying libqb library to
use strictly the file-backed sockets, where the isolation
semantics matches the intuition:

https://lists.clusterlabs.org/pipermail/users/2017-May/013003.html

+ way to enable (presumably only for container environments, note
that there's no per process straightforward granularity):

https://clusterlabs.github.io/libqb/1.0.2/doxygen/qb_ipc_overview.html
(scroll down to "IPC sockets (Linux only)")

You may test that if you are using said --network=host switch.
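
A quick way to check which flavour of sockets is actually in use (a rough sketch, assuming
ss from iproute2 or lsof is present in the container; run as root so the owning process is
shown):

  # abstract sockets show up with a leading '@' (e.g. @cfg, @quorum),
  # file-backed ones show a regular filesystem path
  ss -xlp | grep corosync

  # alternative with lsof
  lsof -a -U -c corosync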

> I tried to understand what happen behind the scene but it is not easy for me.
> Hoping someone on this list can help.

Containers are tricky, just as Ansible (as shown earlier on the list)
can be, when encumbered with false beliefs and/or misunderstandings.
Virtual machines may serve better wrt. insights for the later bare
metal deployments.

-- 
Jan (Poki)


pgpx32mt8_3EG.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
I noticed that corosync 2.4.4 depends on the following libraries:
https://launchpad.net/ubuntu/+source/corosync/2.4.4-3 


I imagine that all the corosync-* and libcorosync-* libraries are built from 
the corosync build, so I should have them. Am I correct?

libcfg6
libcmap4
libcpg4
libquorum5
libsam4
libtotem-pg5
libvotequorum8

Can you tell me where these libraries come from and if I need them?
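
In the meantime, a sketch of how I could check what the hand-built binaries actually link
against (the exact install paths depend on the configure --prefix, so treat them as
assumptions):

ldd $(which corosync-quorumtool) | grep -E 'libquorum|libcfg|libqb'
ldconfig -p | grep -E 'libcfg|libcmap|libcpg|libquorum|libsam|libtotem_pg|libvotequorum'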

> On 26 Jun 2018, at 14:08, Christine Caulfield  wrote:
> 
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> libqb update to 1.0.3 but same issue.
>> 
>> I know corosync has also these dependencies nspr and nss3. I updated
>> them using apt-get install, here the version installed:
>> 
>>libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>>libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
>> 
>> but same problem.
>> 
>> I am working on Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there new versions for these libraries?
>> Where I can download them? I tried to search on google but results where
>> quite confusing.
>> 
> 
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility that of corosync.  Which versions did
> you have that worked (libqb and corosync) ?
> 
> Chrissie
> 
> 
>> 
>>> On 26 Jun 2018, at 12:27, Christine Caulfield >> > wrote:
>>> 
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
 Hi,
 
 I have tried with:
 0.16.0.real-1ubuntu4
 0.16.0.real-1ubuntu5
 
 which version should I try?
>>> 
>>> 
>>> Hmm both of those are actually quite old! maybe a newer one?
>>> 
>>> Chrissie
>>> 
 
> On 26 Jun 2018, at 12:03, Christine Caulfield  
> > wrote:
> 
> On 26/06/18 11:00, Salvatore D'angelo wrote:
>> Consider that the container is the same when corosync 2.3.5 run.
>> If it is something related to the container probably the 2.4.4
>> introduced a feature that has an impact on container.
>> Should be something related to libqb according to the code.
>> Anyone can help?
>> 
> 
> 
> Have you tried downgrading libqb to the previous version to see if it
> still happens?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:56, Christine Caulfield >> 
>>> 
>>> > wrote:
>>> 
>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
 Sorry after the command:
 
 corosync-quorumtool -ps
 
 the error in log are still visible. Looking at the source code it
 seems
 problem is at this line:
 https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
 
 if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
     fprintf(stderr, "Cannot initialize QUORUM service\n");
     q_handle = 0;
     goto out;
 }
 
 if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
     fprintf(stderr, "Cannot initialise CFG service\n");
     c_handle = 0;
     goto out;
 }
 
 The quorum_initialize function is defined here:
 https://github.com/corosync/corosync/blob/master/lib/quorum.c
 
 It seems interacts with libqb to allocate space on /dev/shm but
 something fails. I tried to update the libqb with apt-get install
 but no
 success.
 
 The same for second function:
 https://github.com/corosync/corosync/blob/master/lib/cfg.c
 
 Now I am not an expert of libqb. I have the
 version 0.16.0.real-1ubuntu5.
 
 The folder /dev/shm has 777 permission like other nodes with older
 corosync and pacemaker that work fine. The only difference is that I
 only see files created by root, no one created by hacluster like
 other
 two nodes (probably because pacemaker didn’t start correctly).
 
 This is the analysis I have done so far.
 Any suggestion?
 
 
>>> 
>>> Hmm. t seems very likely something to do with the way the container is
>>> set up then - and I know nothing about containers. Sorry :/
>>> 
>>> Can anyone else help here?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:03, Salvatore D'angelo
> mailto:sasadang...@gmail.com>
> 
> 
> > wrote:
> 
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
corosync 2.3.5 and libqb 0.16.0

> On 26 Jun 2018, at 14:08, Christine Caulfield  wrote:
> 
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> libqb update to 1.0.3 but same issue.
>> 
>> I know corosync has also these dependencies nspr and nss3. I updated
>> them using apt-get install, here the version installed:
>> 
>>libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>>libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
>> 
>> but same problem.
>> 
>> I am working on Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there new versions for these libraries?
>> Where I can download them? I tried to search on google but results where
>> quite confusing.
>> 
> 
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility that of corosync.  Which versions did
> you have that worked (libqb and corosync) ?
> 
> Chrissie
> 
> 
>> 
>>> On 26 Jun 2018, at 12:27, Christine Caulfield >> 
>>> >> wrote:
>>> 
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
 Hi,
 
 I have tried with:
 0.16.0.real-1ubuntu4
 0.16.0.real-1ubuntu5
 
 which version should I try?
>>> 
>>> 
>>> Hmm both of those are actually quite old! maybe a newer one?
>>> 
>>> Chrissie
>>> 
 
> On 26 Jun 2018, at 12:03, Christine Caulfield  
> >
> >> wrote:
> 
> On 26/06/18 11:00, Salvatore D'angelo wrote:
>> Consider that the container is the same when corosync 2.3.5 run.
>> If it is something related to the container probably the 2.4.4
>> introduced a feature that has an impact on container.
>> Should be something related to libqb according to the code.
>> Anyone can help?
>> 
> 
> 
> Have you tried downgrading libqb to the previous version to see if it
> still happens?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:56, Christine Caulfield >> 
>>> >
>>> >
>>> >> wrote:
>>> 
>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
 Sorry after the command:
 
 corosync-quorumtool -ps
 
 the error in log are still visible. Looking at the source code it
 seems
 problem is at this line:
 https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
  
 
 
 if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
     fprintf(stderr, "Cannot initialize QUORUM service\n");
     q_handle = 0;
     goto out;
 }
 
 if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
     fprintf(stderr, "Cannot initialise CFG service\n");
     c_handle = 0;
     goto out;
 }
 
 The quorum_initialize function is defined here:
 https://github.com/corosync/corosync/blob/master/lib/quorum.c 
 
 
 It seems interacts with libqb to allocate space on /dev/shm but
 something fails. I tried to update the libqb with apt-get install
 but no
 success.
 
 The same for second function:
 https://github.com/corosync/corosync/blob/master/lib/cfg.c 
 
 
 Now I am not an expert of libqb. I have the
 version 0.16.0.real-1ubuntu5.
 
 The folder /dev/shm has 777 permission like other nodes with older
 corosync and pacemaker that work fine. The only difference is that I
 only see files created by root, no one created by hacluster like
 other
 two nodes (probably because pacemaker didn’t start correctly).
 
 This is the analysis I have done so far.
 Any suggestion?
 
 
>>> 
>>> Hmm. t seems very likely something to do with the way the container is
>>> set up then - and I know nothing about containers. Sorry :/
>>> 
>>> Can anyone else help here?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:03, Salvatore D'angelo
> mailto:sasadang...@gmail.com> 
> >
> >
> >
> >> 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 12:16, Salvatore D'angelo wrote:
> libqb update to 1.0.3 but same issue.
> 
> I know corosync has also these dependencies nspr and nss3. I updated
> them using apt-get install, here the version installed:
> 
>    libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>    libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
> 
> but same problem.
> 
> I am working on Ubuntu 14.04 image and I know that packages could be
> quite old here. Are there new versions for these libraries?
> Where I can download them? I tried to search on google but results where
> quite confusing.
> 

It's pretty unlikely to be the crypto libraries. It's almost certainly
in libqb, with a small possibility that of corosync.  Which versions did
you have that worked (libqb and corosync) ?

Chrissie


> 
>> On 26 Jun 2018, at 12:27, Christine Caulfield > > wrote:
>>
>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I have tried with:
>>> 0.16.0.real-1ubuntu4
>>> 0.16.0.real-1ubuntu5
>>>
>>> which version should I try?
>>
>>
>> Hmm both of those are actually quite old! maybe a newer one?
>>
>> Chrissie
>>
>>>
 On 26 Jun 2018, at 12:03, Christine Caulfield >>> 
 > wrote:

 On 26/06/18 11:00, Salvatore D'angelo wrote:
> Consider that the container is the same when corosync 2.3.5 run.
> If it is something related to the container probably the 2.4.4
> introduced a feature that has an impact on container.
> Should be something related to libqb according to the code.
> Anyone can help?
>


 Have you tried downgrading libqb to the previous version to see if it
 still happens?

 Chrissie

>> On 26 Jun 2018, at 11:56, Christine Caulfield > 
>> 
>> > wrote:
>>
>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>> Sorry after the command:
>>>
>>> corosync-quorumtool -ps
>>>
>>> the error in log are still visible. Looking at the source code it
>>> seems
>>> problem is at this line:
>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>
>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>         q_handle = 0;
>>>         goto out;
>>>     }
>>>
>>>     if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>         fprintf(stderr, "Cannot initialise CFG service\n");
>>>         c_handle = 0;
>>>         goto out;
>>>     }
>>>
>>> The quorum_initialize function is defined here:
>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>
>>> It seems interacts with libqb to allocate space on /dev/shm but
>>> something fails. I tried to update the libqb with apt-get install
>>> but no
>>> success.
>>>
>>> The same for second function:
>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>
>>> Now I am not an expert of libqb. I have the
>>> version 0.16.0.real-1ubuntu5.
>>>
>>> The folder /dev/shm has 777 permission like other nodes with older
>>> corosync and pacemaker that work fine. The only difference is that I
>>> only see files created by root, no one created by hacluster like
>>> other
>>> two nodes (probably because pacemaker didn’t start correctly).
>>>
>>> This is the analysis I have done so far.
>>> Any suggestion?
>>>
>>>
>>
>> Hmm. t seems very likely something to do with the way the container is
>> set up then - and I know nothing about containers. Sorry :/
>>
>> Can anyone else help here?
>>
>> Chrissie
>>
 On 26 Jun 2018, at 11:03, Salvatore D'angelo
 mailto:sasadang...@gmail.com>
 
 
 > wrote:

 Yes, sorry you’re right I could find it by myself.
 However, I did the following:

 1. Added the line you suggested to /etc/fstab
 2. mount -o remount /dev/shm
 3. Now I correctly see /dev/shm of 512M with df -h
 Filesystem      Size  Used Avail Use% Mounted on
 overlay          63G   11G   49G  19% /
 tmpfs            64M  4.0K   64M   1% /dev
 tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
 osxfs           466G  158G  305G  35% /Users
 /dev/sda1        63G   11G   49G  19% /etc/hosts
 *shm             512M   15M  498M   3% /dev/shm*
 tmpfs          1000M     0 1000M   0% /sys/firmware
 tmpfs           128M     0  128M   0% /tmp

 The errors in log went away. Consider that I remove the log file
 before start corosync so it does not contains lines of previous
 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
libqb updated to 1.0.3, but same issue.

I know corosync also has these dependencies: nspr and nss3. I updated them using 
apt-get install; here are the versions installed:

   libnspr4, libnspr4-dev   2:4.13.1-0ubuntu0.14.04.1
   libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3

but same problem.

I am working on an Ubuntu 14.04 image and I know that packages could be quite old 
here. Are there newer versions of these libraries?
Where can I download them? I tried to search on Google, but the results were quite 
confusing.


> On 26 Jun 2018, at 12:27, Christine Caulfield  wrote:
> 
> On 26/06/18 11:24, Salvatore D'angelo wrote:
>> Hi,
>> 
>> I have tried with:
>> 0.16.0.real-1ubuntu4
>> 0.16.0.real-1ubuntu5
>> 
>> which version should I try?
> 
> 
> Hmm both of those are actually quite old! maybe a newer one?
> 
> Chrissie
> 
>> 
>>> On 26 Jun 2018, at 12:03, Christine Caulfield >> > wrote:
>>> 
>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
 Consider that the container is the same when corosync 2.3.5 run.
 If it is something related to the container probably the 2.4.4
 introduced a feature that has an impact on container.
 Should be something related to libqb according to the code.
 Anyone can help?
 
>>> 
>>> 
>>> Have you tried downgrading libqb to the previous version to see if it
>>> still happens?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:56, Christine Caulfield  
> > wrote:
> 
> On 26/06/18 10:35, Salvatore D'angelo wrote:
>> Sorry after the command:
>> 
>> corosync-quorumtool -ps
>> 
>> the error in log are still visible. Looking at the source code it seems
>> problem is at this line:
>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>> 
>> if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>     fprintf(stderr, "Cannot initialize QUORUM service\n");
>>     q_handle = 0;
>>     goto out;
>> }
>> 
>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>     fprintf(stderr, "Cannot initialise CFG service\n");
>>     c_handle = 0;
>>     goto out;
>> }
>> 
>> The quorum_initialize function is defined here:
>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>> 
>> It seems interacts with libqb to allocate space on /dev/shm but
>> something fails. I tried to update the libqb with apt-get install
>> but no
>> success.
>> 
>> The same for second function:
>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>> 
>> Now I am not an expert of libqb. I have the
>> version 0.16.0.real-1ubuntu5.
>> 
>> The folder /dev/shm has 777 permission like other nodes with older
>> corosync and pacemaker that work fine. The only difference is that I
>> only see files created by root, no one created by hacluster like other
>> two nodes (probably because pacemaker didn’t start correctly).
>> 
>> This is the analysis I have done so far.
>> Any suggestion?
>> 
>> 
> 
> Hmm. t seems very likely something to do with the way the container is
> set up then - and I know nothing about containers. Sorry :/
> 
> Can anyone else help here?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo
>>> mailto:sasadang...@gmail.com>
>>> 
>>> > wrote:
>>> 
>>> Yes, sorry you’re right I could find it by myself.
>>> However, I did the following:
>>> 
>>> 1. Added the line you suggested to /etc/fstab
>>> 2. mount -o remount /dev/shm
>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> *shm 512M   15M  498M   3% /dev/shm*
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> tmpfs   128M 0  128M   0% /tmp
>>> 
>>> The errors in log went away. Consider that I remove the log file
>>> before start corosync so it does not contains lines of previous
>>> executions.
>>> 
>>> 
>>> But the command:
>>> corosync-quorumtool -ps
>>> 
>>> still give:
>>> Cannot initialize QUORUM service
>>> 
>>> Consider that few minutes before it gave me the message:
>>> Cannot initialize CFG service
>>> 
>>> I do not know the differences between CFG and QUORUM in this case.
>>> 
>>> If I try to start pacemaker the service is OK but I see only pacemaker
>>> and the Transport does not work if I try to run a cam command.

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 11:24, Salvatore D'angelo wrote:
> Hi,
> 
> I have tried with:
> 0.16.0.real-1ubuntu4
> 0.16.0.real-1ubuntu5
> 
> which version should I try?


Hmm both of those are actually quite old! maybe a newer one?

Chrissie

> 
>> On 26 Jun 2018, at 12:03, Christine Caulfield > > wrote:
>>
>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>> Consider that the container is the same when corosync 2.3.5 run.
>>> If it is something related to the container probably the 2.4.4
>>> introduced a feature that has an impact on container.
>>> Should be something related to libqb according to the code.
>>> Anyone can help?
>>>
>>
>>
>> Have you tried downgrading libqb to the previous version to see if it
>> still happens?
>>
>> Chrissie
>>
 On 26 Jun 2018, at 11:56, Christine Caulfield >>> 
 > wrote:

 On 26/06/18 10:35, Salvatore D'angelo wrote:
> Sorry after the command:
>
> corosync-quorumtool -ps
>
> the error in log are still visible. Looking at the source code it seems
> problem is at this line:
> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>
>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>         q_handle = 0;
>         goto out;
>     }
>
>     if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>         fprintf(stderr, "Cannot initialise CFG service\n");
>         c_handle = 0;
>         goto out;
>     }
>
> The quorum_initialize function is defined here:
> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>
> It seems interacts with libqb to allocate space on /dev/shm but
> something fails. I tried to update the libqb with apt-get install
> but no
> success.
>
> The same for second function:
> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>
> Now I am not an expert of libqb. I have the
> version 0.16.0.real-1ubuntu5.
>
> The folder /dev/shm has 777 permission like other nodes with older
> corosync and pacemaker that work fine. The only difference is that I
> only see files created by root, no one created by hacluster like other
> two nodes (probably because pacemaker didn’t start correctly).
>
> This is the analysis I have done so far.
> Any suggestion?
>
>

 Hmm. t seems very likely something to do with the way the container is
 set up then - and I know nothing about containers. Sorry :/

 Can anyone else help here?

 Chrissie

>> On 26 Jun 2018, at 11:03, Salvatore D'angelo
>> mailto:sasadang...@gmail.com>
>> 
>> > wrote:
>>
>> Yes, sorry you’re right I could find it by myself.
>> However, I did the following:
>>
>> 1. Added the line you suggested to /etc/fstab
>> 2. mount -o remount /dev/shm
>> 3. Now I correctly see /dev/shm of 512M with df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> *shm             512M   15M  498M   3% /dev/shm*
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> tmpfs           128M     0  128M   0% /tmp
>>
>> The errors in log went away. Consider that I remove the log file
>> before start corosync so it does not contains lines of previous
>> executions.
>> 
>>
>> But the command:
>> corosync-quorumtool -ps
>>
>> still give:
>> Cannot initialize QUORUM service
>>
>> Consider that few minutes before it gave me the message:
>> Cannot initialize CFG service
>>
>> I do not know the differences between CFG and QUORUM in this case.
>>
>> If I try to start pacemaker the service is OK but I see only pacemaker
>> and the Transport does not work if I try to run a cam command.
>> Any suggestion?
>>
>>
>>> On 26 Jun 2018, at 10:49, Christine Caulfield
>>> mailto:ccaul...@redhat.com>
>>> 
>>> > wrote:
>>>
>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
 Hi,

 Yes,

 I am reproducing only the required part for test. I think the
 original
 system has a larger shm. The problem is that I do not know
 exactly how
 to change it.
 I tried the following steps, but I have the impression I didn’t
 performed the right one:

 1. remove everything under /tmp
 2. Added the following line to /etc/fstab
 tmpfs   /tmp         tmpfs  

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi,

I have tried with:
0.16.0.real-1ubuntu4
0.16.0.real-1ubuntu5

which version should I try?

> On 26 Jun 2018, at 12:03, Christine Caulfield  wrote:
> 
> On 26/06/18 11:00, Salvatore D'angelo wrote:
>> Consider that the container is the same when corosync 2.3.5 run.
>> If it is something related to the container probably the 2.4.4
>> introduced a feature that has an impact on container.
>> Should be something related to libqb according to the code.
>> Anyone can help?
>> 
> 
> 
> Have you tried downgrading libqb to the previous version to see if it
> still happens?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:56, Christine Caulfield >> > wrote:
>>> 
>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
 Sorry after the command:
 
 corosync-quorumtool -ps
 
 the error in log are still visible. Looking at the source code it seems
 problem is at this line:
 https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
 
 if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
     fprintf(stderr, "Cannot initialize QUORUM service\n");
     q_handle = 0;
     goto out;
 }
 
 if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
     fprintf(stderr, "Cannot initialise CFG service\n");
     c_handle = 0;
     goto out;
 }
 
 The quorum_initialize function is defined here:
 https://github.com/corosync/corosync/blob/master/lib/quorum.c
 
 It seems interacts with libqb to allocate space on /dev/shm but
 something fails. I tried to update the libqb with apt-get install but no
 success.
 
 The same for second function:
 https://github.com/corosync/corosync/blob/master/lib/cfg.c
 
 Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
 
 The folder /dev/shm has 777 permission like other nodes with older
 corosync and pacemaker that work fine. The only difference is that I
 only see files created by root, no one created by hacluster like other
 two nodes (probably because pacemaker didn’t start correctly).
 
 This is the analysis I have done so far.
 Any suggestion?
 
 
>>> 
>>> Hmm. t seems very likely something to do with the way the container is
>>> set up then - and I know nothing about containers. Sorry :/
>>> 
>>> Can anyone else help here?
>>> 
>>> Chrissie
>>> 
> On 26 Jun 2018, at 11:03, Salvatore D'angelo  
> > wrote:
> 
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 512M with df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> *shm 512M   15M  498M   3% /dev/shm*
> tmpfs  1000M 0 1000M   0% /sys/firmware
> tmpfs   128M 0  128M   0% /tmp
> 
> The errors in log went away. Consider that I remove the log file
> before start corosync so it does not contains lines of previous
> executions.
> 
> 
> But the command:
> corosync-quorumtool -ps
> 
> still give:
> Cannot initialize QUORUM service
> 
> Consider that few minutes before it gave me the message:
> Cannot initialize CFG service
> 
> I do not know the differences between CFG and QUORUM in this case.
> 
> If I try to start pacemaker the service is OK but I see only pacemaker
> and the Transport does not work if I try to run a cam command.
> Any suggestion?
> 
> 
>> On 26 Jun 2018, at 10:49, Christine Caulfield > 
>> > wrote:
>> 
>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>> Hi,
>>> 
>>> Yes,
>>> 
>>> I am reproducing only the required part for test. I think the original
>>> system has a larger shm. The problem is that I do not know exactly how
>>> to change it.
>>> I tried the following steps, but I have the impression I didn’t
>>> performed the right one:
>>> 
>>> 1. remove everything under /tmp
>>> 2. Added the following line to /etc/fstab
>>> tmpfs   /tmp tmpfs  
>>> defaults,nodev,nosuid,mode=1777,size=128M 
>>> 0  0
>>> 3. mount /tmp
>>> 4. df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda1  

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 11:00, Salvatore D'angelo wrote:
> Consider that the container is the same when corosync 2.3.5 run.
> If it is something related to the container probably the 2.4.4
> introduced a feature that has an impact on container.
> Should be something related to libqb according to the code.
> Anyone can help?
> 


Have you tried downgrading libqb to the previous version to see if it
still happens?

Chrissie

>> On 26 Jun 2018, at 11:56, Christine Caulfield > > wrote:
>>
>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>> Sorry after the command:
>>>
>>> corosync-quorumtool -ps
>>>
>>> the error in log are still visible. Looking at the source code it seems
>>> problem is at this line:
>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>
>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>         q_handle = 0;
>>>         goto out;
>>>     }
>>>
>>>     if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>         fprintf(stderr, "Cannot initialise CFG service\n");
>>>         c_handle = 0;
>>>         goto out;
>>>     }
>>>
>>> The quorum_initialize function is defined here:
>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>
>>> It seems interacts with libqb to allocate space on /dev/shm but
>>> something fails. I tried to update the libqb with apt-get install but no
>>> success.
>>>
>>> The same for second function:
>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>
>>> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
>>>
>>> The folder /dev/shm has 777 permission like other nodes with older
>>> corosync and pacemaker that work fine. The only difference is that I
>>> only see files created by root, no one created by hacluster like other
>>> two nodes (probably because pacemaker didn’t start correctly).
>>>
>>> This is the analysis I have done so far.
>>> Any suggestion?
>>>
>>>
>>
>> Hmm. t seems very likely something to do with the way the container is
>> set up then - and I know nothing about containers. Sorry :/
>>
>> Can anyone else help here?
>>
>> Chrissie
>>
 On 26 Jun 2018, at 11:03, Salvatore D'angelo >>> 
 > wrote:

 Yes, sorry you’re right I could find it by myself.
 However, I did the following:

 1. Added the line you suggested to /etc/fstab
 2. mount -o remount /dev/shm
 3. Now I correctly see /dev/shm of 512M with df -h
 Filesystem      Size  Used Avail Use% Mounted on
 overlay          63G   11G   49G  19% /
 tmpfs            64M  4.0K   64M   1% /dev
 tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
 osxfs           466G  158G  305G  35% /Users
 /dev/sda1        63G   11G   49G  19% /etc/hosts
 *shm             512M   15M  498M   3% /dev/shm*
 tmpfs          1000M     0 1000M   0% /sys/firmware
 tmpfs           128M     0  128M   0% /tmp

 The errors in log went away. Consider that I remove the log file
 before start corosync so it does not contains lines of previous
 executions.
 

 But the command:
 corosync-quorumtool -ps

 still give:
 Cannot initialize QUORUM service

 Consider that few minutes before it gave me the message:
 Cannot initialize CFG service

 I do not know the differences between CFG and QUORUM in this case.

 If I try to start pacemaker the service is OK but I see only pacemaker
 and the Transport does not work if I try to run a cam command.
 Any suggestion?


> On 26 Jun 2018, at 10:49, Christine Caulfield  
> > wrote:
>
> On 26/06/18 09:40, Salvatore D'angelo wrote:
>> Hi,
>>
>> Yes,
>>
>> I am reproducing only the required part for test. I think the original
>> system has a larger shm. The problem is that I do not know exactly how
>> to change it.
>> I tried the following steps, but I have the impression I didn’t
>> performed the right one:
>>
>> 1. remove everything under /tmp
>> 2. Added the following line to /etc/fstab
>> tmpfs   /tmp         tmpfs  
>> defaults,nodev,nosuid,mode=1777,size=128M 
>>         0  0
>> 3. mount /tmp
>> 4. df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> shm              64M   11M   54M  16% /dev/shm
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> *tmpfs           128M     0  128M   0% /tmp*
>>
>> The errors are exactly the same.
>> I have the impression that I changed the wrong parameter. Probably I

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Consider that the container is the same one where corosync 2.3.5 ran.
If it is something related to the container, then probably 2.4.4 introduced a 
feature that has an impact on containers.
It should be something related to libqb, according to the code.
Can anyone help?

> On 26 Jun 2018, at 11:56, Christine Caulfield  wrote:
> 
> On 26/06/18 10:35, Salvatore D'angelo wrote:
>> Sorry after the command:
>> 
>> corosync-quorumtool -ps
>> 
>> the error in log are still visible. Looking at the source code it seems
>> problem is at this line:
>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>> 
>> if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>     fprintf(stderr, "Cannot initialize QUORUM service\n");
>>     q_handle = 0;
>>     goto out;
>> }
>> 
>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>     fprintf(stderr, "Cannot initialise CFG service\n");
>>     c_handle = 0;
>>     goto out;
>> }
>> 
>> The quorum_initialize function is defined here:
>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>> 
>> It seems interacts with libqb to allocate space on /dev/shm but
>> something fails. I tried to update the libqb with apt-get install but no
>> success.
>> 
>> The same for second function:
>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>> 
>> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
>> 
>> The folder /dev/shm has 777 permission like other nodes with older
>> corosync and pacemaker that work fine. The only difference is that I
>> only see files created by root, no one created by hacluster like other
>> two nodes (probably because pacemaker didn’t start correctly).
>> 
>> This is the analysis I have done so far.
>> Any suggestion?
>> 
>> 
> 
> Hmm. t seems very likely something to do with the way the container is
> set up then - and I know nothing about containers. Sorry :/
> 
> Can anyone else help here?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo >> 
>>> >> wrote:
>>> 
>>> Yes, sorry you’re right I could find it by myself.
>>> However, I did the following:
>>> 
>>> 1. Added the line you suggested to /etc/fstab
>>> 2. mount -o remount /dev/shm
>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> *shm 512M   15M  498M   3% /dev/shm*
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> tmpfs   128M 0  128M   0% /tmp
>>> 
>>> The errors in log went away. Consider that I remove the log file
>>> before start corosync so it does not contains lines of previous
>>> executions.
>>> 
>>> 
>>> But the command:
>>> corosync-quorumtool -ps
>>> 
>>> still give:
>>> Cannot initialize QUORUM service
>>> 
>>> Consider that few minutes before it gave me the message:
>>> Cannot initialize CFG service
>>> 
>>> I do not know the differences between CFG and QUORUM in this case.
>>> 
>>> If I try to start pacemaker the service is OK but I see only pacemaker
>>> and the Transport does not work if I try to run a cam command.
>>> Any suggestion?
>>> 
>>> 
 On 26 Jun 2018, at 10:49, Christine Caulfield >>> 
 >> wrote:
 
 On 26/06/18 09:40, Salvatore D'angelo wrote:
> Hi,
> 
> Yes,
> 
> I am reproducing only the required part for test. I think the original
> system has a larger shm. The problem is that I do not know exactly how
> to change it.
> I tried the following steps, but I have the impression I didn’t
> performed the right one:
> 
> 1. remove everything under /tmp
> 2. Added the following line to /etc/fstab
> tmpfs   /tmp tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
> 0  0
> 3. mount /tmp
> 4. df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> shm  64M   11M   54M  16% /dev/shm
> tmpfs  1000M 0 1000M   0% /sys/firmware
> *tmpfs   128M 0  128M   0% /tmp*
> 
> The errors are exactly the same.
> I have the impression that I changed the wrong parameter. Probably I
> have to change:
> shm  64M   11M   54M  16% /dev/shm
> 
> but I do not know how to do that. Any suggestion?
> 
 
 According to google, you just add a new line to /etc/fstab for /dev/shm

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 10:35, Salvatore D'angelo wrote:
> Sorry after the command:
> 
> corosync-quorumtool -ps
> 
> the error in log are still visible. Looking at the source code it seems
> problem is at this line:
> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
> 
>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>         q_handle = 0;
>         goto out;
>     }
> 
>     if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>         fprintf(stderr, "Cannot initialise CFG service\n");
>         c_handle = 0;
>         goto out;
>     }
> 
> The quorum_initialize function is defined here:
> https://github.com/corosync/corosync/blob/master/lib/quorum.c
> 
> It seems interacts with libqb to allocate space on /dev/shm but
> something fails. I tried to update the libqb with apt-get install but no
> success.
> 
> The same for second function:
> https://github.com/corosync/corosync/blob/master/lib/cfg.c
> 
> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
> 
> The folder /dev/shm has 777 permission like other nodes with older
> corosync and pacemaker that work fine. The only difference is that I
> only see files created by root, no one created by hacluster like other
> two nodes (probably because pacemaker didn’t start correctly).
> 
> This is the analysis I have done so far.
> Any suggestion?
> 
> 

Hmm. It seems very likely something to do with the way the container is
set up then - and I know nothing about containers. Sorry :/

Can anyone else help here?

Chrissie

>> On 26 Jun 2018, at 11:03, Salvatore D'angelo > > wrote:
>>
>> Yes, sorry you’re right I could find it by myself.
>> However, I did the following:
>>
>> 1. Added the line you suggested to /etc/fstab
>> 2. mount -o remount /dev/shm
>> 3. Now I correctly see /dev/shm of 512M with df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> *shm             512M   15M  498M   3% /dev/shm*
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> tmpfs           128M     0  128M   0% /tmp
>>
>> The errors in log went away. Consider that I remove the log file
>> before start corosync so it does not contains lines of previous
>> executions.
>> 
>>
>> But the command:
>> corosync-quorumtool -ps
>>
>> still give:
>> Cannot initialize QUORUM service
>>
>> Consider that few minutes before it gave me the message:
>> Cannot initialize CFG service
>>
>> I do not know the differences between CFG and QUORUM in this case.
>>
>> If I try to start pacemaker the service is OK but I see only pacemaker
>> and the Transport does not work if I try to run a cam command.
>> Any suggestion?
>>
>>
>>> On 26 Jun 2018, at 10:49, Christine Caulfield >> > wrote:
>>>
>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
 Hi,

 Yes,

 I am reproducing only the required part for test. I think the original
 system has a larger shm. The problem is that I do not know exactly how
 to change it.
 I tried the following steps, but I have the impression I didn’t
 performed the right one:

 1. remove everything under /tmp
 2. Added the following line to /etc/fstab
 tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
         0  0
 3. mount /tmp
 4. df -h
 Filesystem      Size  Used Avail Use% Mounted on
 overlay          63G   11G   49G  19% /
 tmpfs            64M  4.0K   64M   1% /dev
 tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
 osxfs           466G  158G  305G  35% /Users
 /dev/sda1        63G   11G   49G  19% /etc/hosts
 shm              64M   11M   54M  16% /dev/shm
 tmpfs          1000M     0 1000M   0% /sys/firmware
 *tmpfs           128M     0  128M   0% /tmp*

 The errors are exactly the same.
 I have the impression that I changed the wrong parameter. Probably I
 have to change:
 shm              64M   11M   54M  16% /dev/shm

 but I do not know how to do that. Any suggestion?

>>>
>>> According to google, you just add a new line to /etc/fstab for /dev/shm
>>>
>>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>>
>>> Chrissie
>>>
> On 26 Jun 2018, at 09:48, Christine Caulfield  
> > wrote:
>
> On 25/06/18 20:41, Salvatore D'angelo wrote:
>> Hi,
>>
>> Let me add here one important detail. I use Docker for my test with 5
>> containers deployed on my Mac.
>> Basically the team that worked on this project installed the cluster
>> on soft layer bare metal.
>> The PostgreSQL cluster was hard to test and if a misconfiguration
>> occurred 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Sorry, after the command:

corosync-quorumtool -ps

the errors in the log are still visible. Looking at the source code, it seems the problem 
is at these lines:
https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c 


if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
    fprintf(stderr, "Cannot initialize QUORUM service\n");
    q_handle = 0;
    goto out;
}

if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
    fprintf(stderr, "Cannot initialise CFG service\n");
    c_handle = 0;
    goto out;
}

The quorum_initialize function is defined here:
https://github.com/corosync/corosync/blob/master/lib/quorum.c 


It seems to interact with libqb to allocate space on /dev/shm, but something 
fails. I tried to update libqb with apt-get install, but no success.

The same goes for the second function:
https://github.com/corosync/corosync/blob/master/lib/cfg.c 


Now, I am not an expert on libqb. I have version 0.16.0.real-1ubuntu5.

The folder /dev/shm has 777 permissions, like the other nodes with the older corosync and 
pacemaker that work fine. The only difference is that I only see files created 
by root, none created by hacluster as on the other two nodes (probably because 
pacemaker didn't start correctly).

This is the analysis I have done so far.
Any suggestion?
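
One more data point I could collect (a rough sketch, not part of the analysis above): how
much of /dev/shm the libqb ring buffers actually consume and who owns them:

df -h /dev/shm
ls -l /dev/shm/qb-*        # corosync's buffers belong to root, the pacemaker daemons' to hacluster
du -ch /dev/shm/qb-* | tail -n 1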


> On 26 Jun 2018, at 11:03, Salvatore D'angelo  wrote:
> 
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 512M with df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> shm 512M   15M  498M   3% /dev/shm
> tmpfs  1000M 0 1000M   0% /sys/firmware
> tmpfs   128M 0  128M   0% /tmp
> 
> The errors in log went away. Consider that I remove the log file before start 
> corosync so it does not contains lines of previous executions.
> 
> 
> But the command:
> corosync-quorumtool -ps
> 
> still give:
> Cannot initialize QUORUM service
> 
> Consider that few minutes before it gave me the message:
> Cannot initialize CFG service
> 
> I do not know the differences between CFG and QUORUM in this case.
> 
> If I try to start pacemaker the service is OK but I see only pacemaker and 
> the Transport does not work if I try to run a cam command.
> Any suggestion?
> 
> 
>> On 26 Jun 2018, at 10:49, Christine Caulfield > > wrote:
>> 
>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>> Hi,
>>> 
>>> Yes,
>>> 
>>> I am reproducing only the required part for test. I think the original
>>> system has a larger shm. The problem is that I do not know exactly how
>>> to change it.
>>> I tried the following steps, but I have the impression I didn’t
>>> performed the right one:
>>> 
>>> 1. remove everything under /tmp
>>> 2. Added the following line to /etc/fstab
>>> tmpfs   /tmp tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
>>> 0  0
>>> 3. mount /tmp
>>> 4. df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> shm  64M   11M   54M  16% /dev/shm
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> *tmpfs   128M 0  128M   0% /tmp*
>>> 
>>> The errors are exactly the same.
>>> I have the impression that I changed the wrong parameter. Probably I
>>> have to change:
>>> shm  64M   11M   54M  16% /dev/shm
>>> 
>>> but I do not know how to do that. Any suggestion?
>>> 
>> 
>> According to google, you just add a new line to /etc/fstab for /dev/shm
>> 
>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>> 
>> Chrissie
>> 
 On 26 Jun 2018, at 09:48, Christine Caulfield >>> 
 >> wrote:
 
 On 25/06/18 20:41, Salvatore D'angelo wrote:
> Hi,
> 
> Let me add here one important detail. I use Docker for my test with 5
> containers deployed on my Mac.
> Basically the team that worked on this project installed the cluster
> on soft layer bare metal.
> The PostgreSQL cluster was hard to test and if a misconfiguration
> occurred recreate the cluster from scratch is not 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Yes, sorry, you're right, I could find it by myself.
However, I did the following:

1. Added the line you suggested to /etc/fstab
2. mount -o remount /dev/shm
3. Now I correctly see /dev/shm of 512M with df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          63G   11G   49G  19% /
tmpfs            64M  4.0K   64M   1% /dev
tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
osxfs           466G  158G  305G  35% /Users
/dev/sda1        63G   11G   49G  19% /etc/hosts
shm             512M   15M  498M   3% /dev/shm
tmpfs          1000M     0 1000M   0% /sys/firmware
tmpfs           128M     0  128M   0% /tmp

The errors in the log went away. Consider that I removed the log file before starting corosync, so it does not contain lines from previous executions.

corosync.log
Description: Binary data
But the command:
corosync-quorumtool -ps

still gives:
Cannot initialize QUORUM service

Consider that a few minutes before it gave me the message:
Cannot initialize CFG service

I do not know the differences between CFG and QUORUM in this case.

If I try to start pacemaker the service is OK, but I see only pacemaker and the transport does not work if I try to run a crm command.
Any suggestion?

> On 26 Jun 2018, at 10:49, Christine Caulfield  wrote:
> 
> On 26/06/18 09:40, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Yes,
>> 
>> I am reproducing only the required part for test. I think the original
>> system has a larger shm. The problem is that I do not know exactly how
>> to change it.
>> I tried the following steps, but I have the impression I didn’t
>> performed the right one:
>> 
>> 1. remove everything under /tmp
>> 2. Added the following line to /etc/fstab
>> tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M         0  0
>> 3. mount /tmp
>> 4. df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> shm              64M   11M   54M  16% /dev/shm
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> tmpfs           128M     0  128M   0% /tmp
>> 
>> The errors are exactly the same.
>> I have the impression that I changed the wrong parameter. Probably I
>> have to change:
>> shm              64M   11M   54M  16% /dev/shm
>> 
>> but I do not know how to do that. Any suggestion?
> 
> According to google, you just add a new line to /etc/fstab for /dev/shm
> 
> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
> 
> Chrissie

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 26/06/18 09:40, Salvatore D'angelo wrote:
> Hi,
> 
> Yes,
> 
> I am reproducing only the required part for test. I think the original
> system has a larger shm. The problem is that I do not know exactly how
> to change it.
> I tried the following steps, but I have the impression I didn’t
> performed the right one:
> 
> 1. remove everything under /tmp
> 2. Added the following line to /etc/fstab
> tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
>         0  0
> 3. mount /tmp
> 4. df -h
> Filesystem      Size  Used Avail Use% Mounted on
> overlay          63G   11G   49G  19% /
> tmpfs            64M  4.0K   64M   1% /dev
> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
> osxfs           466G  158G  305G  35% /Users
> /dev/sda1        63G   11G   49G  19% /etc/hosts
> shm              64M   11M   54M  16% /dev/shm
> tmpfs          1000M     0 1000M   0% /sys/firmware
> *tmpfs           128M     0  128M   0% /tmp*
> 
> The errors are exactly the same.
> I have the impression that I changed the wrong parameter. Probably I
> have to change:
> shm              64M   11M   54M  16% /dev/shm
> 
> but I do not know how to do that. Any suggestion?
> 

According to google, you just add a new line to /etc/fstab for /dev/shm

tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
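
For a quick test, the same size can also be passed straight to a remount (a sketch; this is
not persistent across container restarts):

mount -o remount,size=512m /dev/shm
df -h /dev/shm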

Chrissie

>> On 26 Jun 2018, at 09:48, Christine Caulfield > > wrote:
>>
>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Let me add here one important detail. I use Docker for my test with 5
>>> containers deployed on my Mac.
>>> Basically the team that worked on this project installed the cluster
>>> on soft layer bare metal.
>>> The PostgreSQL cluster was hard to test and if a misconfiguration
>>> occurred recreate the cluster from scratch is not easy.
>>> Test it was a cumbersome if you consider that we access to the
>>> machines with a complex system hard to describe here.
>>> For this reason I ported the cluster on Docker for test purpose. I am
>>> not interested to have it working for months, I just need a proof of
>>> concept. 
>>>
>>> When the migration works I’ll port everything on bare metal where the
>>> size of resources are ambundant.  
>>>
>>> Now I have enough RAM and disk space on my Mac so if you tell me what
>>> should be an acceptable size for several days of running it is ok for me.
>>> It is ok also have commands to clean the shm when required.
>>> I know I can find them on Google but if you can suggest me these info
>>> I’ll appreciate. I have OS knowledge to do that but I would like to
>>> avoid days of guesswork and try and error if possible.
>>
>>
>> I would recommend at least 128MB of space on /dev/shm, 256MB if you can
>> spare it. My 'standard' system uses 75MB under normal running allowing
>> for one command-line query to run.
>>
>> If I read this right then you're reproducing a bare-metal system in
>> containers now? so the original systems will have a default /dev/shm
>> size which is probably much larger than your containers?
>>
>> I'm just checking here that we don't have a regression in memory usage
>> as Poki suggested.
>>
>> Chrissie
>>
 On 25 Jun 2018, at 21:18, Jan Pokorný >>> > wrote:

 On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
> Thanks for reply. I scratched my cluster and created it again and
> then migrated as before. This time I uninstalled pacemaker,
> corosync, crmsh and resource agents with make uninstall
>
> then I installed new packages. The problem is the same, when
> I launch:
> corosync-quorumtool -ps
>
> I got: Cannot initialize QUORUM service
>
> Here the log with debug enabled:
>
>
> [18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap
> on /dev/shm/qb-cfg-event-18020-18028-23-data
> [18019] pg3 corosyncerror   [QB    ]
> qb_rb_open:cfg-event-18020-18028-23: Resource temporarily
> unavailable (11)
> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer:
> /dev/shm/qb-cfg-request-18020-18028-23-header
> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer:
> /dev/shm/qb-cfg-response-18020-18028-23-header
> [18019] pg3 corosyncerror   [QB    ] shm connection FAILED:
> Resource temporarily unavailable (11)
> [18019] pg3 corosyncerror   [QB    ] Error in connection setup
> (18020-18028-23): Resource temporarily unavailable (11)
>
> I tried to check /dev/shm and I am not sure these are the right
> commands, however:
>
> df -h /dev/shm
> Filesystem  Size  Used Avail Use% Mounted on
> shm  64M   16M   49M  24% /dev/shm
>
> ls /dev/shm
> qb-cmap-request-18020-18036-25-data    qb-corosync-blackbox-data
>    qb-quorum-request-18020-18095-32-data
> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header
>  qb-quorum-request-18020-18095-32-header
>
> Is 64 Mb 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi,

Yes,

I am reproducing only the required part for testing. I think the original system 
has a larger shm. The problem is that I do not know exactly how to change it.
I tried the following steps, but I have the impression I didn't perform the 
right ones:

1. remove everything under /tmp
2. Added the following line to /etc/fstab
tmpfs   /tmp tmpfs   defaults,nodev,nosuid,mode=1777,size=128M  
0  0
3. mount /tmp
4. df -h
Filesystem  Size  Used Avail Use% Mounted on
overlay  63G   11G   49G  19% /
tmpfs64M  4.0K   64M   1% /dev
tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
osxfs   466G  158G  305G  35% /Users
/dev/sda163G   11G   49G  19% /etc/hosts
shm  64M   11M   54M  16% /dev/shm
tmpfs  1000M 0 1000M   0% /sys/firmware
tmpfs   128M 0  128M   0% /tmp

The errors are exactly the same.
I have the impression that I changed the wrong parameter. Probably I have to 
change:
shm  64M   11M   54M  16% /dev/shm

but I do not know how to do that. Any suggestion?
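
For reference, the mount in question can be inspected (not changed) with, for
example:

  findmnt /dev/shm
  mount | grep /dev/shm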

> On 26 Jun 2018, at 09:48, Christine Caulfield  wrote:
> 
> On 25/06/18 20:41, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Let me add here one important detail. I use Docker for my test with 5 
>> containers deployed on my Mac.
>> Basically the team that worked on this project installed the cluster on soft 
>> layer bare metal.
>> The PostgreSQL cluster was hard to test and if a misconfiguration occurred 
>> recreate the cluster from scratch is not easy.
>> Test it was a cumbersome if you consider that we access to the machines with 
>> a complex system hard to describe here.
>> For this reason I ported the cluster on Docker for test purpose. I am not 
>> interested to have it working for months, I just need a proof of concept. 
>> 
>> When the migration works I’ll port everything on bare metal where the size 
>> of resources are ambundant.  
>> 
>> Now I have enough RAM and disk space on my Mac so if you tell me what should 
>> be an acceptable size for several days of running it is ok for me.
>> It is ok also have commands to clean the shm when required.
>> I know I can find them on Google but if you can suggest me these info I’ll 
>> appreciate. I have OS knowledge to do that but I would like to avoid days of 
>> guesswork and try and error if possible.
> 
> 
> I would recommend at least 128MB of space on /dev/shm, 256MB if you can
> spare it. My 'standard' system uses 75MB under normal running allowing
> for one command-line query to run.
> 
> If I read this right then you're reproducing a bare-metal system in
> containers now? so the original systems will have a default /dev/shm
> size which is probably much larger than your containers?
> 
> I'm just checking here that we don't have a regression in memory usage
> as Poki suggested.
> 
> Chrissie
> 
>>> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
>>> 
>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
 Thanks for reply. I scratched my cluster and created it again and
 then migrated as before. This time I uninstalled pacemaker,
 corosync, crmsh and resource agents with make uninstall
 
 then I installed new packages. The problem is the same, when
 I launch:
 corosync-quorumtool -ps
 
 I got: Cannot initialize QUORUM service
 
 Here the log with debug enabled:
 
 
 [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
 /dev/shm/qb-cfg-event-18020-18028-23-data
 [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
 Resource temporarily unavailable (11)
 [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
 /dev/shm/qb-cfg-request-18020-18028-23-header
 [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
 /dev/shm/qb-cfg-response-18020-18028-23-header
 [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
 temporarily unavailable (11)
 [18019] pg3 corosyncerror   [QB] Error in connection setup 
 (18020-18028-23): Resource temporarily unavailable (11)
 
 I tried to check /dev/shm and I am not sure these are the right
 commands, however:
 
 df -h /dev/shm
 Filesystem  Size  Used Avail Use% Mounted on
 shm  64M   16M   49M  24% /dev/shm
 
 ls /dev/shm
 qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
 qb-quorum-request-18020-18095-32-data
 qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
 qb-quorum-request-18020-18095-32-header
 
 Is 64 Mb enough for /dev/shm. If no, why it worked with previous
 corosync release?
>>> 
>>> For a start, can you try configuring corosync with
>>> --enable-small-memory-footprint switch?
>>> 
>>> Hard to say why the space provisioned to /dev/shm is the direct
>>> opposite of generous (per today's standards), but may be the result
>>> of automatic HW adaptation, and if RAM is so 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Christine Caulfield
On 25/06/18 20:41, Salvatore D'angelo wrote:
> Hi,
> 
> Let me add here one important detail. I use Docker for my test with 5 
> containers deployed on my Mac.
> Basically the team that worked on this project installed the cluster on soft 
> layer bare metal.
> The PostgreSQL cluster was hard to test and if a misconfiguration occurred 
> recreate the cluster from scratch is not easy.
> Test it was a cumbersome if you consider that we access to the machines with 
> a complex system hard to describe here.
> For this reason I ported the cluster on Docker for test purpose. I am not 
> interested to have it working for months, I just need a proof of concept. 
> 
> When the migration works I’ll port everything on bare metal where the size of 
> resources are ambundant.  
> 
> Now I have enough RAM and disk space on my Mac so if you tell me what should 
> be an acceptable size for several days of running it is ok for me.
> It is ok also have commands to clean the shm when required.
> I know I can find them on Google but if you can suggest me these info I’ll 
> appreciate. I have OS knowledge to do that but I would like to avoid days of 
> guesswork and try and error if possible.


I would recommend at least 128MB of space on /dev/shm, 256MB if you can
spare it. My 'standard' system uses 75MB under normal running allowing
for one command-line query to run.

If I read this right then you're reproducing a bare-metal system in
containers now? so the original systems will have a default /dev/shm
size which is probably much larger than your containers?
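
If that's the case, the per-container default (64M in Docker) can be raised
when the containers are created, for example (flag value is only an example):

  docker run --shm-size=256m <other options> <image>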

I'm just checking here that we don't have a regression in memory usage
as Poki suggested.

Chrissie

>> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
>>
>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>> Thanks for reply. I scratched my cluster and created it again and
>>> then migrated as before. This time I uninstalled pacemaker,
>>> corosync, crmsh and resource agents with make uninstall
>>>
>>> then I installed new packages. The problem is the same, when
>>> I launch:
>>> corosync-quorumtool -ps
>>>
>>> I got: Cannot initialize QUORUM service
>>>
>>> Here the log with debug enabled:
>>>
>>>
>>> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
>>> /dev/shm/qb-cfg-event-18020-18028-23-data
>>> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
>>> Resource temporarily unavailable (11)
>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>> /dev/shm/qb-cfg-request-18020-18028-23-header
>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>> /dev/shm/qb-cfg-response-18020-18028-23-header
>>> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
>>> temporarily unavailable (11)
>>> [18019] pg3 corosyncerror   [QB] Error in connection setup 
>>> (18020-18028-23): Resource temporarily unavailable (11)
>>>
>>> I tried to check /dev/shm and I am not sure these are the right
>>> commands, however:
>>>
>>> df -h /dev/shm
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> shm  64M   16M   49M  24% /dev/shm
>>>
>>> ls /dev/shm
>>> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
>>> qb-quorum-request-18020-18095-32-data
>>> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
>>> qb-quorum-request-18020-18095-32-header
>>>
>>> Is 64 Mb enough for /dev/shm. If no, why it worked with previous
>>> corosync release?
>>
>> For a start, can you try configuring corosync with
>> --enable-small-memory-footprint switch?
>>
>> Hard to say why the space provisioned to /dev/shm is the direct
>> opposite of generous (per today's standards), but may be the result
>> of automatic HW adaptation, and if RAM is so scarce in your case,
>> the above build-time toggle might help.
>>
>> If not, then exponentially increasing size of /dev/shm space is
>> likely your best bet (I don't recommended fiddling with mlockall()
>> and similar measures in corosync).
>>
>> Of course, feel free to raise a regression if you have a reproducible
>> comparison between two corosync (plus possibly different libraries
>> like libqb) versions, one that works and one that won't, in
>> reproducible conditions (like this small /dev/shm, VM image, etc.).
>>
>> -- 
>> Jan (Poki)
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org

Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Salvatore D'angelo
Hi,

Let me add one important detail here. I use Docker for my tests, with 5 
containers deployed on my Mac.
Basically, the team that worked on this project installed the cluster on 
SoftLayer bare metal.
The PostgreSQL cluster was hard to test, and if a misconfiguration occurred, 
recreating the cluster from scratch was not easy.
Testing was cumbersome if you consider that we access the machines through a 
complex system that is hard to describe here.
For this reason I ported the cluster to Docker for test purposes. I am not 
interested in having it working for months, I just need a proof of concept. 

When the migration works I'll port everything to bare metal, where resources 
are abundant.  

Now I have enough RAM and disk space on my Mac, so if you tell me what would be 
an acceptable size for several days of running, that is ok for me.
It is also ok to have commands to clean the shm when required.
I know I can find them on Google, but if you can suggest this info I'll 
appreciate it. I have the OS knowledge to do that, but I would like to avoid days of 
guesswork and trial and error if possible.

> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
> 
> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>> Thanks for reply. I scratched my cluster and created it again and
>> then migrated as before. This time I uninstalled pacemaker,
>> corosync, crmsh and resource agents with make uninstall
>> 
>> then I installed new packages. The problem is the same, when
>> I launch:
>> corosync-quorumtool -ps
>> 
>> I got: Cannot initialize QUORUM service
>> 
>> Here the log with debug enabled:
>> 
>> 
>> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
>> /dev/shm/qb-cfg-event-18020-18028-23-data
>> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
>> Resource temporarily unavailable (11)
>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>> /dev/shm/qb-cfg-request-18020-18028-23-header
>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>> /dev/shm/qb-cfg-response-18020-18028-23-header
>> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
>> temporarily unavailable (11)
>> [18019] pg3 corosyncerror   [QB] Error in connection setup 
>> (18020-18028-23): Resource temporarily unavailable (11)
>> 
>> I tried to check /dev/shm and I am not sure these are the right
>> commands, however:
>> 
>> df -h /dev/shm
>> Filesystem  Size  Used Avail Use% Mounted on
>> shm  64M   16M   49M  24% /dev/shm
>> 
>> ls /dev/shm
>> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
>> qb-quorum-request-18020-18095-32-data
>> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
>> qb-quorum-request-18020-18095-32-header
>> 
>> Is 64 Mb enough for /dev/shm. If no, why it worked with previous
>> corosync release?
> 
> For a start, can you try configuring corosync with
> --enable-small-memory-footprint switch?
> 
> Hard to say why the space provisioned to /dev/shm is the direct
> opposite of generous (per today's standards), but may be the result
> of automatic HW adaptation, and if RAM is so scarce in your case,
> the above build-time toggle might help.
> 
> If not, then exponentially increasing size of /dev/shm space is
> likely your best bet (I don't recommended fiddling with mlockall()
> and similar measures in corosync).
> 
> Of course, feel free to raise a regression if you have a reproducible
> comparison between two corosync (plus possibly different libraries
> like libqb) versions, one that works and one that won't, in
> reproducible conditions (like this small /dev/shm, VM image, etc.).
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Jan Pokorný
On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
> Thanks for reply. I scratched my cluster and created it again and
> then migrated as before. This time I uninstalled pacemaker,
> corosync, crmsh and resource agents with make uninstall
> 
> then I installed new packages. The problem is the same, when
> I launch:
> corosync-quorumtool -ps
> 
> I got: Cannot initialize QUORUM service
> 
> Here the log with debug enabled:
> 
> 
> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
> /dev/shm/qb-cfg-event-18020-18028-23-data
> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
> Resource temporarily unavailable (11)
> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
> /dev/shm/qb-cfg-request-18020-18028-23-header
> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
> /dev/shm/qb-cfg-response-18020-18028-23-header
> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
> temporarily unavailable (11)
> [18019] pg3 corosyncerror   [QB] Error in connection setup 
> (18020-18028-23): Resource temporarily unavailable (11)
> 
> I tried to check /dev/shm and I am not sure these are the right
> commands, however:
> 
> df -h /dev/shm
> Filesystem  Size  Used Avail Use% Mounted on
> shm  64M   16M   49M  24% /dev/shm
> 
> ls /dev/shm
> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
> qb-quorum-request-18020-18095-32-data
> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
> qb-quorum-request-18020-18095-32-header
> 
> Is 64 Mb enough for /dev/shm. If no, why it worked with previous
> corosync release?

For a start, can you try configuring corosync with
--enable-small-memory-footprint switch?
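
That is, roughly (run from the corosync source tree; the directory name is
just an example):

  cd corosync-2.4.4
  ./configure --enable-small-memory-footprint
  make && make install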

Hard to say why the space provisioned to /dev/shm is the direct
opposite of generous (per today's standards), but may be the result
of automatic HW adaptation, and if RAM is so scarce in your case,
the above build-time toggle might help.

If not, then exponentially increasing size of /dev/shm space is
likely your best bet (I don't recommend fiddling with mlockall()
and similar measures in corosync).

Of course, feel free to raise a regression if you have a reproducible
comparison between two corosync (plus possibly different libraries
like libqb) versions, one that works and one that won't, in
reproducible conditions (like this small /dev/shm, VM image, etc.).

-- 
Jan (Poki)


pgpjALMoPzoef.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Salvatore D'angelo
Hi,

Thanks for reply. I scratched my cluster and created it again and then migrated as before. This time I uninstalled pacemaker, corosync, crmsh and resource agents with
make uninstall
then I installed new packages. The problem is the same, when I launch:
corosync-quorumtool -ps
I got: Cannot initialize QUORUM service
Here the log with debug enabled:

corosync.log
Description: Binary data
[18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
[18019] pg3 corosyncerror   [QB    ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
[18019] pg3 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[18019] pg3 corosyncerror   [QB    ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)

I tried to check /dev/shm and I am not sure these are the right commands, however:

df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M   16M   49M  24% /dev/shm

ls /dev/shm
qb-cmap-request-18020-18036-25-data    qb-corosync-blackbox-data    qb-quorum-request-18020-18095-32-data
qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  qb-quorum-request-18020-18095-32-header

Is 64 Mb enough for /dev/shm. If no, why it worked with previous corosync release?

> On 25 Jun 2018, at 09:09, Christine Caulfield  wrote:

Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Christine Caulfield
On 22/06/18 11:23, Salvatore D'angelo wrote:
> Hi,
> Here the log:
> 
> 
> 
[17323] pg1 corosyncerror   [QB] couldn't create circular mmap on
/dev/shm/qb-cfg-event-17324-17334-23-data
[17323] pg1 corosyncerror   [QB]
qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB] Free'ing ringbuffer:
/dev/shm/qb-cfg-request-17324-17334-23-header
[17323] pg1 corosyncdebug   [QB] Free'ing ringbuffer:
/dev/shm/qb-cfg-response-17324-17334-23-header
[17323] pg1 corosyncerror   [QB] shm connection FAILED: Resource
temporarily unavailable (11)
[17323] pg1 corosyncerror   [QB] Error in connection setup
(17324-17334-23): Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB] qb_ipcs_disconnect(17324-17334-23)
state:0



is /dev/shm full?
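
A quick way to check, for example:

  df -h /dev/shm
  du -sh /dev/shm
  ls /dev/shm | wc -l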


Chrissie


> 
> 
>> On 22 Jun 2018, at 12:10, Christine Caulfield  wrote:
>>
>> On 22/06/18 10:39, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Can you tell me exactly which log you need. I’ll provide you as soon as 
>>> possible.
>>>
>>> Regarding some settings, I am not the original author of this cluster. 
>>> People created it left the company I am working with and I inerithed the 
>>> code and sometime I do not know why some settings are used.
>>> The old versions of pacemaker, corosync,  crash and resource agents were 
>>> compiled and installed.
>>> I simply downloaded the new versions compiled and installed them. I didn’t 
>>> get any compliant during ./configure that usually checks for library 
>>> compatibility.
>>>
>>> To be honest I do not know if this is the right approach. Should I “make 
>>> unistall" old versions before installing the new one?
>>> Which is the suggested approach?
>>> Thank in advance for your help.
>>>
>>
>> OK fair enough!
>>
>> To be honest the best approach is almost always to get the latest
>> packages from the distributor rather than compile from source. That way
>> you can be more sure that upgrades will be more smoothly. Though, to be
>> honest, I'm not sure how good the Ubuntu packages are (they might be
>> great, they might not, I genuinely don't know)
>>
>> When building from source and if you don't know the provenance of the
>> previous version then I would recommend a 'make uninstall' first - or
>> removal of the packages if that's where they came from.
>>
>> One thing you should do is make sure that all the cluster nodes are
>> running the same version. If some are running older versions then nodes
>> could drop out for obscure reasons. We try and keep minor versions
>> on-wire compatible but it's always best to be cautious.
>>
>> The tidying of your corosync.conf wan wait for the moment, lets get
>> things mostly working first. If you enable debug logging in corosync.conf:
>>
>> logging {
>>to_syslog: yes
>>  debug: on
>> }
>>
>> Then see what happens and post the syslog file that has all of the
>> corosync messages in it, we'll take it from there.
>>
>> Chrissie
>>
 On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:

 On 22/06/18 10:14, Salvatore D'angelo wrote:
> Hi Christine,
>
> Thanks for reply. Let me add few details. When I run the corosync
> service I se the corosync process running. If I stop it and run:
>
> corosync -f 
>
> I see three warnings:
> warning [MAIN  ] interface section bindnetaddr is used together with
> nodelist. Nodelist one is going to be used.
> warning [MAIN  ] Please migrate config file to nodelist.
> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
> permitted (1)
> warning [MAIN  ] Could not set priority -2147483648: Permission denied 
> (13)
>
> but I see node joined.
>

 Those certainly need fixing but are probably not the cause. Also why do
 you have these values below set?

 max_network_delay: 100
 retransmits_before_loss_const: 25
 window_size: 150

 I'm not saying they are causing the trouble, but they aren't going to
 help keep a stable cluster.

 Without more logs (full logs are always better than just the bits you
 think are meaningful) I still can't be sure. it could easily be just
 that you've overwritten a packaged version of corosync with your own
 compiled one and they have different configure options or that the
 libraries now don't match.

 Chrissie


> My corosync.conf file is below.
>
> With service corosync up and running I have the following output:
> *corosync-cfgtool -s*
> Printing ring status.
> Local node ID 1
> RING ID 0
> id= 10.0.0.11
> status= ring 0 active with no faults
> RING ID 1
> id= 192.168.0.11
> status= ring 1 active with no faults
>
> *corosync-cmapctl  | grep members*
> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
> 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Salvatore D'angelo
Hi,
Here the log:


corosync.log
Description: Binary data


> On 22 Jun 2018, at 12:10, Christine Caulfield  wrote:
> 
> On 22/06/18 10:39, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Can you tell me exactly which log you need. I’ll provide you as soon as 
>> possible.
>> 
>> Regarding some settings, I am not the original author of this cluster. 
>> People created it left the company I am working with and I inerithed the 
>> code and sometime I do not know why some settings are used.
>> The old versions of pacemaker, corosync,  crash and resource agents were 
>> compiled and installed.
>> I simply downloaded the new versions compiled and installed them. I didn’t 
>> get any compliant during ./configure that usually checks for library 
>> compatibility.
>> 
>> To be honest I do not know if this is the right approach. Should I “make 
>> unistall" old versions before installing the new one?
>> Which is the suggested approach?
>> Thank in advance for your help.
>> 
> 
> OK fair enough!
> 
> To be honest the best approach is almost always to get the latest
> packages from the distributor rather than compile from source. That way
> you can be more sure that upgrades will be more smoothly. Though, to be
> honest, I'm not sure how good the Ubuntu packages are (they might be
> great, they might not, I genuinely don't know)
> 
> When building from source and if you don't know the provenance of the
> previous version then I would recommend a 'make uninstall' first - or
> removal of the packages if that's where they came from.
> 
> One thing you should do is make sure that all the cluster nodes are
> running the same version. If some are running older versions then nodes
> could drop out for obscure reasons. We try and keep minor versions
> on-wire compatible but it's always best to be cautious.
> 
> The tidying of your corosync.conf wan wait for the moment, lets get
> things mostly working first. If you enable debug logging in corosync.conf:
> 
> logging {
>to_syslog: yes
>   debug: on
> }
> 
> Then see what happens and post the syslog file that has all of the
> corosync messages in it, we'll take it from there.
> 
> Chrissie
> 
>>> On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:
>>> 
>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
 Hi Christine,
 
 Thanks for reply. Let me add few details. When I run the corosync
 service I se the corosync process running. If I stop it and run:
 
 corosync -f 
 
 I see three warnings:
 warning [MAIN  ] interface section bindnetaddr is used together with
 nodelist. Nodelist one is going to be used.
 warning [MAIN  ] Please migrate config file to nodelist.
 warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
 permitted (1)
 warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
 
 but I see node joined.
 
>>> 
>>> Those certainly need fixing but are probably not the cause. Also why do
>>> you have these values below set?
>>> 
>>> max_network_delay: 100
>>> retransmits_before_loss_const: 25
>>> window_size: 150
>>> 
>>> I'm not saying they are causing the trouble, but they aren't going to
>>> help keep a stable cluster.
>>> 
>>> Without more logs (full logs are always better than just the bits you
>>> think are meaningful) I still can't be sure. it could easily be just
>>> that you've overwritten a packaged version of corosync with your own
>>> compiled one and they have different configure options or that the
>>> libraries now don't match.
>>> 
>>> Chrissie
>>> 
>>> 
 My corosync.conf file is below.
 
 With service corosync up and running I have the following output:
 *corosync-cfgtool -s*
 Printing ring status.
 Local node ID 1
 RING ID 0
 id= 10.0.0.11
 status= ring 0 active with no faults
 RING ID 1
 id= 192.168.0.11
 status= ring 1 active with no faults
 
 *corosync-cmapctl  | grep members*
 runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
 runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
 ip(192.168.0.11) 
 runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
 runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
 runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
 runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
 ip(192.168.0.12) 
 runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
 runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
 
 For the moment I have two nodes in my cluster (third node and some
 issues and at the moment I did crm node standby on it).
 
 Here the dependency I have installed for corosync (that works fine with
 pacemaker 1.1.14 and corosync 2.3.5):
 libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
 libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Christine Caulfield
On 22/06/18 10:39, Salvatore D'angelo wrote:
> Hi,
> 
> Can you tell me exactly which log you need. I’ll provide you as soon as 
> possible.
> 
> Regarding some settings, I am not the original author of this cluster. People 
> created it left the company I am working with and I inerithed the code and 
> sometime I do not know why some settings are used.
> The old versions of pacemaker, corosync,  crash and resource agents were 
> compiled and installed.
> I simply downloaded the new versions compiled and installed them. I didn’t 
> get any compliant during ./configure that usually checks for library 
> compatibility.
> 
> To be honest I do not know if this is the right approach. Should I “make 
> unistall" old versions before installing the new one?
> Which is the suggested approach?
> Thank in advance for your help.
> 

OK fair enough!

To be honest the best approach is almost always to get the latest
packages from the distributor rather than compile from source. That way
you can be more sure that upgrades will go more smoothly. Though, to be
honest, I'm not sure how good the Ubuntu packages are (they might be
great, they might not, I genuinely don't know)

When building from source and if you don't know the provenance of the
previous version then I would recommend a 'make uninstall' first - or
removal of the packages if that's where they came from.
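
In practice that means something like this, run from the source trees the old
versions were originally built from (the paths here are only examples):

  (cd /usr/src/libqb-0.16.0     && make uninstall)
  (cd /usr/src/corosync-2.3.5   && make uninstall)
  (cd /usr/src/pacemaker-1.1.14 && make uninstall)

  # or, if the old copies came from Ubuntu packages instead:
  apt-get remove corosync pacemaker libqb0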

One thing you should do is make sure that all the cluster nodes are
running the same version. If some are running older versions then nodes
could drop out for obscure reasons. We try and keep minor versions
on-wire compatible but it's always best to be cautious.

The tidying of your corosync.conf can wait for the moment, let's get
things mostly working first. If you enable debug logging in corosync.conf:

logging {
to_syslog: yes
debug: on
}

Then see what happens and post the syslog file that has all of the
corosync messages in it, we'll take it from there.
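
For example (on Ubuntu the corosync messages normally end up in
/var/log/syslog; the path may differ elsewhere):

  service corosync restart
  grep -i corosync /var/log/syslog > corosync-debug.log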

Chrissie

>> On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:
>>
>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>> Hi Christine,
>>>
>>> Thanks for reply. Let me add few details. When I run the corosync
>>> service I se the corosync process running. If I stop it and run:
>>>
>>> corosync -f 
>>>
>>> I see three warnings:
>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>> nodelist. Nodelist one is going to be used.
>>> warning [MAIN  ] Please migrate config file to nodelist.
>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>> permitted (1)
>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>
>>> but I see node joined.
>>>
>>
>> Those certainly need fixing but are probably not the cause. Also why do
>> you have these values below set?
>>
>> max_network_delay: 100
>> retransmits_before_loss_const: 25
>> window_size: 150
>>
>> I'm not saying they are causing the trouble, but they aren't going to
>> help keep a stable cluster.
>>
>> Without more logs (full logs are always better than just the bits you
>> think are meaningful) I still can't be sure. it could easily be just
>> that you've overwritten a packaged version of corosync with your own
>> compiled one and they have different configure options or that the
>> libraries now don't match.
>>
>> Chrissie
>>
>>
>>> My corosync.conf file is below.
>>>
>>> With service corosync up and running I have the following output:
>>> *corosync-cfgtool -s*
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>> id= 10.0.0.11
>>> status= ring 0 active with no faults
>>> RING ID 1
>>> id= 192.168.0.11
>>> status= ring 1 active with no faults
>>>
>>> *corosync-cmapctl  | grep members*
>>> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>>> ip(192.168.0.11) 
>>> runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
>>> runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
>>> ip(192.168.0.12) 
>>> runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
>>>
>>> For the moment I have two nodes in my cluster (third node and some
>>> issues and at the moment I did crm node standby on it).
>>>
>>> Here the dependency I have installed for corosync (that works fine with
>>> pacemaker 1.1.14 and corosync 2.3.5):
>>>  libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>  libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>  libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>  libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>>  libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>  libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>>  libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>>
>>> *corosync.conf*
>>> -
>>> quorum {

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Salvatore D'angelo
Hi,

Can you tell me exactly which log you need? I'll provide it as soon as 
possible.

Regarding some settings, I am not the original author of this cluster. The people 
who created it left the company I am working with, and I inherited the code; 
sometimes I do not know why some settings are used.
The old versions of pacemaker, corosync, crmsh and resource agents were 
compiled and installed.
I simply downloaded the new versions, compiled and installed them. I didn't get 
any complaint during ./configure, which usually checks for library compatibility.

To be honest I do not know if this is the right approach. Should I "make 
uninstall" the old versions before installing the new ones?
Which is the suggested approach?
Thanks in advance for your help.

> On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:
> 
> On 22/06/18 10:14, Salvatore D'angelo wrote:
>> Hi Christine,
>> 
>> Thanks for reply. Let me add few details. When I run the corosync
>> service I se the corosync process running. If I stop it and run:
>> 
>> corosync -f 
>> 
>> I see three warnings:
>> warning [MAIN  ] interface section bindnetaddr is used together with
>> nodelist. Nodelist one is going to be used.
>> warning [MAIN  ] Please migrate config file to nodelist.
>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>> permitted (1)
>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>> 
>> but I see node joined.
>> 
> 
> Those certainly need fixing but are probably not the cause. Also why do
> you have these values below set?
> 
> max_network_delay: 100
> retransmits_before_loss_const: 25
> window_size: 150
> 
> I'm not saying they are causing the trouble, but they aren't going to
> help keep a stable cluster.
> 
> Without more logs (full logs are always better than just the bits you
> think are meaningful) I still can't be sure. it could easily be just
> that you've overwritten a packaged version of corosync with your own
> compiled one and they have different configure options or that the
> libraries now don't match.
> 
> Chrissie
> 
> 
>> My corosync.conf file is below.
>> 
>> With service corosync up and running I have the following output:
>> *corosync-cfgtool -s*
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>> id= 10.0.0.11
>> status= ring 0 active with no faults
>> RING ID 1
>> id= 192.168.0.11
>> status= ring 1 active with no faults
>> 
>> *corosync-cmapctl  | grep members*
>> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>> ip(192.168.0.11) 
>> runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
>> runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
>> ip(192.168.0.12) 
>> runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
>> 
>> For the moment I have two nodes in my cluster (third node and some
>> issues and at the moment I did crm node standby on it).
>> 
>> Here the dependency I have installed for corosync (that works fine with
>> pacemaker 1.1.14 and corosync 2.3.5):
>>  libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>  libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>  libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>  libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>  libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>  libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>  libqb0_0.16.0.real-1ubuntu4_amd64.deb
>> 
>> *corosync.conf*
>> -
>> quorum {
>> provider: corosync_votequorum
>> expected_votes: 3
>> }
>> totem {
>> version: 2
>> crypto_cipher: none
>> crypto_hash: none
>> rrp_mode: passive
>> interface {
>> ringnumber: 0
>> bindnetaddr: 10.0.0.0
>> mcastport: 5405
>> ttl: 1
>> }
>> interface {
>> ringnumber: 1
>> bindnetaddr: 192.168.0.0
>> mcastport: 5405
>> ttl: 1
>> }
>> transport: udpu
>> max_network_delay: 100
>> retransmits_before_loss_const: 25
>> window_size: 150
>> }
>> nodelist {
>> node {
>> ring0_addr: pg1
>> ring1_addr: pg1p
>> nodeid: 1
>> }
>> node {
>> ring0_addr: pg2
>> ring1_addr: pg2p
>> nodeid: 2
>> }
>> node {
>> ring0_addr: pg3
>> ring1_addr: pg3p
>> nodeid: 3
>> }
>> }
>> logging {
>> to_syslog: yes
>> }
>> 
>> 
>> 
>> 
>>> On 22 Jun 2018, at 09:24, Christine Caulfield >> > wrote:
>>> 
>>> On 21/06/18 16:16, Salvatore 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Christine Caulfield
On 22/06/18 10:14, Salvatore D'angelo wrote:
> Hi Christine,
> 
> Thanks for reply. Let me add few details. When I run the corosync
> service I se the corosync process running. If I stop it and run:
> 
> corosync -f 
> 
> I see three warnings:
> warning [MAIN  ] interface section bindnetaddr is used together with
> nodelist. Nodelist one is going to be used.
> warning [MAIN  ] Please migrate config file to nodelist.
> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
> permitted (1)
> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
> 
> but I see node joined.
> 

Those certainly need fixing but are probably not the cause. Also why do
you have these values below set?

max_network_delay: 100
retransmits_before_loss_const: 25
window_size: 150

I'm not saying they are causing the trouble, but they aren't going to
help keep a stable cluster.

Without more logs (full logs are always better than just the bits you
think are meaningful) I still can't be sure. It could easily be just
that you've overwritten a packaged version of corosync with your own
compiled one and they have different configure options or that the
libraries now don't match.
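
A quick way to check that, for what it's worth, is to look at what the
installed binary actually links against and which version it reports:

  ldd $(which corosync) | grep qb
  corosync -v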

Chrissie


> My corosync.conf file is below.
> 
> With service corosync up and running I have the following output:
> *corosync-cfgtool -s*
> Printing ring status.
> Local node ID 1
> RING ID 0
> id= 10.0.0.11
> status= ring 0 active with no faults
> RING ID 1
> id= 192.168.0.11
> status= ring 1 active with no faults
> 
> *corosync-cmapctl  | grep members*
> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
> ip(192.168.0.11) 
> runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
> runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
> ip(192.168.0.12) 
> runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
> 
> For the moment I have two nodes in my cluster (third node and some
> issues and at the moment I did crm node standby on it).
> 
> Here the dependency I have installed for corosync (that works fine with
> pacemaker 1.1.14 and corosync 2.3.5):
>      libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>      libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>      libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>      libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>      libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>      libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>      libqb0_0.16.0.real-1ubuntu4_amd64.deb
> 
> *corosync.conf*
> -
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 3
> }
> totem {
>         version: 2
>         crypto_cipher: none
>         crypto_hash: none
>         rrp_mode: passive
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.0.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         transport: udpu
>         max_network_delay: 100
>         retransmits_before_loss_const: 25
>         window_size: 150
> }
> nodelist {
>         node {
>                 ring0_addr: pg1
>                 ring1_addr: pg1p
>                 nodeid: 1
>         }
>         node {
>                 ring0_addr: pg2
>                 ring1_addr: pg2p
>                 nodeid: 2
>         }
>         node {
>                 ring0_addr: pg3
>                 ring1_addr: pg3p
>                 nodeid: 3
>         }
> }
> logging {
>         to_syslog: yes
> }
> 
> 
> 
> 
>> On 22 Jun 2018, at 09:24, Christine Caulfield > > wrote:
>>
>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>> Pacemaker 1.1.14 -> 1.1.18
>>> Corosync 2.3.5 -> 2.4.4
>>> Crmsh 2.2.0 -> 3.0.1
>>> Resource agents 3.9.7 -> 4.1.1
>>>
>>> I started on a first node  (I am trying one node at a time upgrade).
>>> On a PostgreSQL slave node  I did:
>>>
>>> *crm node standby *
>>> *service pacemaker stop*
>>> *service corosync stop*
>>>
>>> Then I build the tool above as described on their GitHub.com
>>> 
>>> > page. 
>>>
>>> *./autogen.sh (where required)*
>>> *./configure*
>>> *make (where required)*
>>> *make install*
>>>
>>> Everything went ok. I expect new file overwrite old one. I left the
>>> dependency I had with old software because I noticed the .configure
>>> didn’t complain. 
>>> I started corosync.
>>>
>>> *service corosync start*
>>>
>>> To verify corosync work properly I used the following 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Salvatore D'angelo
Hi Christine,

Thanks for the reply. Let me add a few details. When I run the corosync service I see 
the corosync process running. If I stop it and run:

corosync -f 

I see the following warnings:
warning [MAIN  ] interface section bindnetaddr is used together with nodelist. 
Nodelist one is going to be used.
warning [MAIN  ] Please migrate config file to nodelist.
warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not permitted 
(1)
warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)

but I see node joined.

My corosync.conf file is below.

With service corosync up and running I have the following output:
corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id  = 10.0.0.11
status  = ring 0 active with no faults
RING ID 1
id  = 192.168.0.11
status  = ring 1 active with no faults

corosync-cmapctl  | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) 
ip(192.168.0.11) 
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) 
ip(192.168.0.12) 
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined

For the moment I have two nodes in my cluster (the third node has some issues and 
at the moment I did crm node standby on it).

Here are the dependencies I have installed for corosync (they work fine with 
pacemaker 1.1.14 and corosync 2.3.5):
 libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
 libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
 libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
 libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
 libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
 libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
 libqb0_0.16.0.real-1ubuntu4_amd64.deb

corosync.conf
-
quorum {
provider: corosync_votequorum
expected_votes: 3
}
totem {
version: 2
crypto_cipher: none
crypto_hash: none
rrp_mode: passive
interface {
ringnumber: 0
bindnetaddr: 10.0.0.0
mcastport: 5405
ttl: 1
}
interface {
ringnumber: 1
bindnetaddr: 192.168.0.0
mcastport: 5405
ttl: 1
}
transport: udpu
max_network_delay: 100
retransmits_before_loss_const: 25
window_size: 150
}
nodelist {
node {
ring0_addr: pg1
ring1_addr: pg1p
nodeid: 1
}
node {
ring0_addr: pg2
ring1_addr: pg2p
nodeid: 2
}
node {
ring0_addr: pg3
ring1_addr: pg3p
nodeid: 3
}
}
logging {
to_syslog: yes
}




> On 22 Jun 2018, at 09:24, Christine Caulfield  wrote:
> 
> On 21/06/18 16:16, Salvatore D'angelo wrote:
>> Hi,
>> 
>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>> Pacemaker 1.1.14 -> 1.1.18
>> Corosync 2.3.5 -> 2.4.4
>> Crmsh 2.2.0 -> 3.0.1
>> Resource agents 3.9.7 -> 4.1.1
>> 
>> I started on a first node  (I am trying one node at a time upgrade).
>> On a PostgreSQL slave node  I did:
>> 
>> *crm node standby *
>> *service pacemaker stop*
>> *service corosync stop*
>> 
>> Then I build the tool above as described on their GitHub.com
>> > page. 
>> 
>> *./autogen.sh (where required)*
>> *./configure*
>> *make (where required)*
>> *make install*
>> 
>> Everything went ok. I expect new file overwrite old one. I left the
>> dependency I had with old software because I noticed the .configure
>> didn’t complain. 
>> I started corosync.
>> 
>> *service corosync start*
>> 
>> To verify corosync work properly I used the following commands:
>> *corosync-cfg-tool -s*
>> *corosync-cmapctl | grep members*
>> 
>> Everything seemed ok and I verified my node joined the cluster (at least
>> this is my impression).
>> 
>> Here I verified a problem. Doing the command:
>> corosync-quorumtool -ps
>> 
>> I got the following problem:
>> Cannot initialise CFG service
>> 
> That says that corosync is not running. Have a look in the log files to
> see why it stopped. The pacemaker logs below are showing the same thing,
> but we can't make any more guesses until we see what corosync itself is
> doing. Enabling debug in corosync.conf will also help if more detail is
> needed.
> 
> Also starting corosync with 'corosync -pf' on the command-line is often
> a quick way of checking things are starting OK.
> 
> Chrissie
> 
> 
>> If I try to start pacemaker, I only see pacemaker process running and
>> pacemaker.log containing the following 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Christine Caulfield
On 21/06/18 16:16, Salvatore D'angelo wrote:
> Hi,
> 
> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
> Pacemaker 1.1.14 -> 1.1.18
> Corosync 2.3.5 -> 2.4.4
> Crmsh 2.2.0 -> 3.0.1
> Resource agents 3.9.7 -> 4.1.1
> 
> I started on a first node  (I am trying one node at a time upgrade).
> On a PostgreSQL slave node  I did:
> 
> *crm node standby *
> *service pacemaker stop*
> *service corosync stop*
> 
> Then I build the tool above as described on their GitHub.com
>  page. 
> 
> *./autogen.sh (where required)*
> *./configure*
> *make (where required)*
> *make install*
> 
> Everything went ok. I expect new file overwrite old one. I left the
> dependency I had with old software because I noticed the .configure
> didn’t complain. 
> I started corosync.
> 
> *service corosync start*
> 
> To verify corosync work properly I used the following commands:
> *corosync-cfg-tool -s*
> *corosync-cmapctl | grep members*
> 
> Everything seemed ok and I verified my node joined the cluster (at least
> this is my impression).
> 
> Here I verified a problem. Doing the command:
> corosync-quorumtool -ps
> 
> I got the following problem:
> Cannot initialise CFG service
> 
That says that corosync is not running. Have a look in the log files to
see why it stopped. The pacemaker logs below are showing the same thing,
but we can't make any more guesses until we see what corosync itself is
doing. Enabling debug in corosync.conf will also help if more detail is
needed.

Also starting corosync with 'corosync -pf' on the command-line is often
a quick way of checking things are starting OK.
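
That is, something along these lines:

  # stop the copy started by the init script first
  service corosync stop
  # run corosync in the foreground
  corosync -pf
  # then, from a second shell:
  corosync-cfgtool -s
  corosync-quorumtool -ps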

Chrissie


> If I try to start pacemaker, I only see pacemaker process running and
> pacemaker.log containing the following lines:
> 
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init:Changed
> active directory to /var/lib/pacemaker/cores/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
> get_cluster_type:Detected an active 'corosync' cluster/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
> mcp_read_config:Reading configure for stack: corosync/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main:Starting
> Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc
> lha-fencing nagios  corosync-native atomic-attrd acls/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main:Maximum core
> file size is: 18446744073709551615/
> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
> qb_ipcs_us_publish:server name: pacemakerd/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning:
> corosync_node_name:Could not connect to Cluster Configuration Database
> API, error CS_ERR_TRY_AGAIN/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> corosync_node_name:Unable to get node name for nodeid 1/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could
> not obtain a node name for corosync nodeid 1/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Created
> entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node
> (null)/1 (1 total)/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Node 1
> has uuid 1/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> crm_update_peer_proc:cluster_connect_cpg: Node (null)[1] - corosync-cpg
> is now online/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:    error:
> cluster_connect_quorum:Could not connect to the Quorum API: 2/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> qb_ipcs_us_withdraw:withdrawing server sockets/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main:Exiting pacemakerd/
> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
> crm_xml_cleanup:Cleaning up memory from libxml2/
> 
> *What is wrong in my procedure?*
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Upgrade corosync problem

2018-06-21 Thread Salvatore D'angelo
Hi,

I upgraded my PostgreSQL/Pacemaker cluster with these versions.
Pacemaker 1.1.14 -> 1.1.18
Corosync 2.3.5 -> 2.4.4
Crmsh 2.2.0 -> 3.0.1
Resource agents 3.9.7 -> 4.1.1

I started on a first node (I am trying a one-node-at-a-time upgrade).
On a PostgreSQL slave node I did:

crm node standby 
service pacemaker stop
service corosync stop

Then I built the tools above as described on their GitHub.com pages. 

./autogen.sh (where required)
./configure
make (where required)
make install

Everything went ok. I expect the new files to overwrite the old ones. I left the dependencies 
I had with the old software because I noticed that ./configure didn't complain. 
I started corosync.

service corosync start

To verify corosync work properly I used the following commands:
corosync-cfgtool -s
corosync-cmapctl | grep members

Everything seemed ok and I verified my node joined the cluster (at least this 
is my impression).

Here I verified a problem. Doing the command:
corosync-quorumtool -ps

I got the following problem:
Cannot initialise CFG service

If I try to start pacemaker, I only see pacemaker process running and 
pacemaker.log containing the following lines:

Jun 21 15:09:38 [17115] pg1 pacemakerd: info: crm_log_init: Changed active 
directory to /var/lib/pacemaker/cores
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: get_cluster_type: 
Detected an active 'corosync' cluster
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: mcp_read_config:  Reading 
configure for stack: corosync
Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main: Starting Pacemaker 
1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios  
corosync-native atomic-attrd acls
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: main: Maximum core file size 
is: 18446744073709551615
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: qb_ipcs_us_publish:   server 
name: pacemakerd
Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning: corosync_node_name:   Could 
not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: corosync_node_name:   Unable 
to get node name for nodeid 1
Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could 
not obtain a node name for corosync nodeid 1
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Created entry 
1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Node 1 has uuid 
1
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_update_peer_proc: 
cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
Jun 21 15:09:53 [17115] pg1 pacemakerd:error: cluster_connect_quorum:   
Could not connect to the Quorum API: 2
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: qb_ipcs_us_withdraw:  
withdrawing server sockets
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: main: Exiting pacemakerd
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_xml_cleanup:  
Cleaning up memory from libxml2

What is wrong in my procedure?



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org