Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-12 Thread Salvatore D'angelo
Hi Jan,

My “talent”, as you put it, comes from poor knowledge of the systemd vs. upstart vs. SysV 
mechanisms.
Let me just underline that I compile directly on the target system, not on a 
different machine.
Moreover, all ./configure requirements should be met, because when they were not, 
./configure stopped (at least that is what I expected).
However, I'll pay more attention to the output the next time I run ./configure.

For the moment I'll keep the workaround mentioned in the previous email, because I 
have only developed a proof of concept.
When we start creating production code, I'll try to understand better how 
systemd works and how to use it to fix my issue.
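For reference, this is the check I plan to run before the next build, following Jan's 
hint below about libdbus and systemctl (a sketch; the package name assumes Ubuntu 16.04):

apt-get install -y libdbus-1-dev       # development files needed for the systemd/dbus integration
command -v systemctl                   # a working systemctl must be on the PATH at configure time
./configure 2>&1 | tee configure.log
grep -i systemd configure.log          # assumption: the configure output states whether systemd support was detected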
Thank you for your support.

> On 12 Jul 2018, at 15:47, Jan Pokorný  wrote:
> 
> Hello Salvatore,
> 
> we can cope with that without much trouble, but you seem to have
> a talent to present multiple related issues at once, or perhaps
> to start solving the problems from the too distant point :-) 
> 
> As mentioned, that's also fine, but let's separate them...
> 
> On 11/07/18 18:43 +0200, Salvatore D'angelo wrote:
>>>>> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
>>>>>> [...] question is pacemaker install
>>>>>> /etc/init.d/pacemaker script but its header is not compatible
>>>>>> with newer system that uses LSB.
> 
> Combining LSB and "newer system" in one sentence is sort of ridiculous
> since LSB dates back to 2002 (LSB 1.1 seems to be first to actually
> explain headers like Default-Start [1]), and the concept of system
> initialization in the Linux context was gradually refined with
> projects like upstart and systemd.  
> 
> What may have changed between older and newer Ubuntu, though,
> towards less compatibility (or contrary, more strictness on the
> headers/initscript meta-data) is that the compatibility support
> scripts/translators are written from scratch, leaving relaxed
> approach (the standard is also not that strict) behind ... or not,
> and this is just a new regression subsequently fixed later on:
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588868
> 
> which matches your:
> 
>>> 11.07.2018 21:01, Salvatore D'angelo пишет:
>>>> [...] system find that sysV is installed and try to leverage on
>>>> update-rc.d scripts and the failure occurs:
>>>> 
>>>> root@pg1:~# systemctl enable corosync
>>>> corosync.service is not a native service, redirecting to 
>>>> systemd-sysv-install
>>>> Executing /lib/systemd/systemd-sysv-install enable corosync
>>>> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
> 
> So I'd be inclined to think the existing initscripts can be used just
> fine in bug-free LSB initialization scenarios.  It'd be hence just
> your choice whether to apply the workaround for sysv-rc bug (?) that
> you've presented.
> 
> Anyway, as mentioned:
> 
>>>>>> On 11 Jul 2018, at 18:40, Andrei Borzenkov 
>>>>>> wrote:
>>>>>>> 16.04 is using systemd, you need to create systemd unit. I do
>>>>>>> not know if there is any compatibility layer to interpret
>>>>>>> upstart configuration like the one for sysvinit.
> 
> and
> 
>> On 11 Jul 2018, at 21:07, Andrei Borzenkov  wrote:
>>> Then you built corosync without systemd integration. systemd will prefer
>>> native units.
> 
> You shouldn't be using, not even indirectly, initscripts with
> systemd-enabled deployments, otherwise you ask for various dependency
> mismatches and other complications.
> 
> Which gets us to another problem in pacemaker context:
> 
>>>>> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
>>>>>> when I run “make install” anything is created for systemd env.
> 
> (presuming anything ~ nothing), because as Ken mentioned:
> 
>>>>> On 11 Jul 2018, at 19:29, Ken Gaillot  wrote:
>>>>> With Ubuntu 16, you should use "systemctl enable pacemaker" instead
>>>>> of update-rc.d.
> 
> So, if you haven't tried to build pacemaker on a system differing to
> the target one in some build-time important aspects, then I can guess
> you just hadn't ensured you have all systemd-related requisities
> installed at the time you've run "./configure".  Currently, it boils
> down to libdbus-1 library + development files (looks covered with
> "libdbus-1-dev" package) & working "systemctl" command from systemd
> suite.
> 
> Frankly, when people go the path of custom compilations, it's assumed
> they will get familiar with the build system too

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-12 Thread Salvatore D'angelo
Hi,

I have a cluster on three bare-metal nodes, and I use two GlusterFS nodes to keep the 
WAL files and the backup store on an object store.
I use Docker for test purposes.

Here are the possible upgrade scenarios you can apply:
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ap-upgrade.html

I used the rolling (node-by-node) method because I always have a master up and 
running (there are a few ms of downtime when the switch occurs, and I have a connection 
pool that immediately switches to the new master).

Note that even if a rolling upgrade is only suggested for minor upgrades, 
2.0.0 didn't introduce anything new: it only removed deprecated features and added bug 
fixes on top of the 1.1.18 code.
1.1.19 contains backports of those fixes (but not the removal of the deprecated 
features), so you can apply this upgrade method even if you want to move to 
2.0.0.
We decided to go with 1.1.19; the GA code was released two days ago.

Read all the threads with my name from June and July here:
https://oss.clusterlabs.org/mailman/listinfo/users

you'll find useful info about the issues I had to solve.

On each node I do the following:
1. first download the source code for the old pacemaker, corosync, crmsh and 
resource-agents
2. run "make uninstall" for all of them
3. download the source of the new code (I verified that on Ubuntu 14.04 the libraries 
must not be upgraded; for 16.04 I just downloaded the new versions of the libraries)
4. follow the build instructions for all four projects. For resource-agents the build 
is not documented; I do:
autoreconf -i -v
./configure
make
make install

for the others, simply:
./autogen.sh
./configure
make
make install

5.  then repeat the configuration steps you did when you created the 
cluster with 1.1.14
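
For completeness, the per-node sequence looks roughly like this (a sketch using crmsh; 
it assumes resources are allowed to fail over while the node is in standby):

crm node standby pg1          # drain the node being upgraded
service pacemaker stop
service corosync stop
# uninstall the old code and build/install the new code as in steps 1-4
service corosync start
service pacemaker start
crm node online pg1           # let the node rejoin and host resources again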

I used Docker for testing because it minimises the test time: if I do something 
wrong and the situation is not recoverable, I recreate the cluster in minutes.
You can do the same with virtual machines and Vagrant; either way, I consider having 
such a disposable test environment critical.
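
For reference, the Docker side of my test setup boils down to something like this 
(a sketch; the image, container name and keep-alive command are placeholders, while 
--shm-size and --ulimit memlock come from the corosync upgrade threads mentioned above):

docker create --name pg1 --hostname pg1 \
  --shm-size=512m --ulimit memlock=536870912 \
  ubuntu:16.04 sleep infinity
docker start pg1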

That’s all. 


> On 12 Jul 2018, at 00:31, Casey & Gina  wrote:
> 
>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and corosync 
>> from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to repeat the same scenario 
>> on Ubuntu 16.04.
> 
> Forgive me for interjecting, but how did you upgrade on Ubuntu?  I'm 
> frustrated with limitations in 1.1.14 (particularly in PCS so not sure if 
> it's relevant), and Ubuntu is ignoring my bug reports, so it would be great 
> to upgrade if possible.  I'm using Ubuntu 16.04.
> 
> Best wishes,
> -- 
> Casey
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Sorry, I replied too soon. 
Since things work after removing the update-rc.d command, I assume the build process 
creates the services.
The only problem is that enabling them with systemctl does not work, because it 
relies on the update-rc.d command, which works only if the LSB header contains at 
least one run level.

For the moment, the only fix I see is to modify these init.d scripts myself, 
hoping they will be fixed in pacemaker/corosync.
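
An alternative to patching the LSB headers would be minimal native units, roughly like 
the sketch below (paths, Type and options are assumptions for a hand-compiled build 
installed under /usr/sbin; the real unit files shipped by the projects are preferable 
when configure detects systemd):

# /etc/systemd/system/corosync.service -- minimal sketch
[Unit]
Description=Corosync Cluster Engine
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/sbin/corosync -f
ExecStop=/usr/sbin/corosync-cfgtool -H

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/pacemaker.service -- minimal sketch
[Unit]
Description=Pacemaker High Availability Cluster Manager
After=corosync.service
Requires=corosync.service

[Service]
Type=simple
ExecStart=/usr/sbin/pacemakerd -f

[Install]
WantedBy=multi-user.target

followed by "systemctl daemon-reload" and "systemctl enable corosync pacemaker".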

> On 11 Jul 2018, at 23:18, Salvatore D'angelo  wrote:
> 
> Hi,
> 
> I solved the issue (I am not sure to be honest) simply removing the 
> update-rc.d command.
> I noticed I can start the corosync and pacemaker services with:
> 
> service corosync start
> service pacemaker start
> 
> I am not sure if they have been enabled at book (on Docker is not easy to 
> test).
> I do not know if pacemaker build creates automatically these services and 
> then it is required extra work to make them available at book.
> 
>> On 11 Jul 2018, at 21:07, Andrei Borzenkov > <mailto:arvidj...@gmail.com>> wrote:
>> 
>> 11.07.2018 21:01, Salvatore D'angelo пишет:
>>> Yes, but doing what you suggested the system find that sysV is installed 
>>> and try to leverage on update-rc.d scripts and the failure occurs:
>> 
>> Then you built corosync without systemd integration. systemd will prefer
>> native units.
> 
> How can I build them with system integration?
> 
>> 
>>> 
>>> root@pg1:~# systemctl enable corosync
>>> corosync.service is not a native service, redirecting to 
>>> systemd-sysv-install
>>> Executing /lib/systemd/systemd-sysv-install enable corosync
>>> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
>>> 
>>> the only fix I found was to manipulate manually the header of 
>>> /etc/init.d/corosync adding the rows mentioned below.
>>> But this is not a clean approach to solve the issue.
>>> 
>>> What pacemaker suggest for newer distributions?
>>> 
>>> If you look at corosync code the init/corosync file does not container run 
>>> levels in header.
>>> So I suspect it is a code problem. Am I wrong?
>>> 
>> 
>> Probably not. Description of special comments in LSB standard imply that
>> they must contain at least one value. Also how should service manager
>> know for which run level to enable service without it? It is amusing
>> that this problem was first found on a distribution that does not even
>> use SysV for years …
> 
> What do you suggest?
> 
>> 
>> 
>> 
>>>> On 11 Jul 2018, at 19:29, Ken Gaillot >>> <mailto:kgail...@redhat.com>> wrote:
>>>> 
>>>> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
>>>>> Hi,
>>>>> 
>>>>> Yes that was clear to me, but question is pacemaker install
>>>>> /etc/init.d/pacemaker script but its header is not compatible with
>>>>> newer system that uses LSB.
>>>>> So if pacemaker creates scripts in /etc/init.d it should create them
>>>>> so that they are compatible with OS supported (not sure if Ubuntu is
>>>>> one).
>>>>> when I run “make install” anything is created for systemd env.
>>>> 
>>>> With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
>>>> update-rc.d.
>>>> 
>>>> The pacemaker configure script should have detected that the OS uses
>>>> systemd and installed the appropriate unit file.
>>>> 
>>>>> I am not a SysV vs System expert, hoping I haven’t said anything
>>>>> wrong.
>>>>> 
>>>>>> On 11 Jul 2018, at 18:40, Andrei Borzenkov >>>>> <mailto:arvidj...@gmail.com>>
>>>>>> wrote:
>>>>>> 
>>>>>> 11.07.2018 18:08, Salvatore D'angelo пишет:
>>>>>>> Hi All,
>>>>>>> 
>>>>>>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
>>>>>>> corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
>>>>>>> repeat the same scenario on Ubuntu 16.04.
>>>>>> 
>>>>>> 16.04 is using systemd, you need to create systemd unit. I do not
>>>>>> know
>>>>>> if there is any compatibility layer to interpret upstart
>>>>>> configuration
>>>>>> like the one for sysvinit.
>>>>>> 
>>>>>>> As my previous scenario I am 

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Hi,

I solved the issue (to be honest, I am not sure I really did) simply by removing the 
update-rc.d command.
I noticed I can start the corosync and pacemaker services with:

service corosync start
service pacemaker start

I am not sure if they have been enabled at boot (on Docker it is not easy to test).
I do not know if the pacemaker build automatically creates these services, or whether 
extra work is required to make them available at boot.
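
A quick way to check whether they are actually enabled at boot (a sketch; the first 
line applies to a SysV layout, the second to systemd-based images):

ls /etc/rc2.d/ | grep -E 'corosync|pacemaker'   # S/K links for runlevel 2, if any
systemctl is-enabled corosync pacemaker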

> On 11 Jul 2018, at 21:07, Andrei Borzenkov  wrote:
> 
> 11.07.2018 21:01, Salvatore D'angelo пишет:
>> Yes, but doing what you suggested the system find that sysV is installed and 
>> try to leverage on update-rc.d scripts and the failure occurs:
> 
> Then you built corosync without systemd integration. systemd will prefer
> native units.

How can I build them with systemd integration?

> 
>> 
>> root@pg1:~# systemctl enable corosync
>> corosync.service is not a native service, redirecting to systemd-sysv-install
>> Executing /lib/systemd/systemd-sysv-install enable corosync
>> update-rc.d: error: corosync Default-Start contains no runlevels, aborting.
>> 
>> the only fix I found was to manipulate manually the header of 
>> /etc/init.d/corosync adding the rows mentioned below.
>> But this is not a clean approach to solve the issue.
>> 
>> What pacemaker suggest for newer distributions?
>> 
>> If you look at corosync code the init/corosync file does not container run 
>> levels in header.
>> So I suspect it is a code problem. Am I wrong?
>> 
> 
> Probably not. Description of special comments in LSB standard imply that
> they must contain at least one value. Also how should service manager
> know for which run level to enable service without it? It is amusing
> that this problem was first found on a distribution that does not even
> use SysV for years …

What do you suggest?

> 
> 
> 
>>> On 11 Jul 2018, at 19:29, Ken Gaillot  wrote:
>>> 
>>> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
>>>> Hi,
>>>> 
>>>> Yes that was clear to me, but question is pacemaker install
>>>> /etc/init.d/pacemaker script but its header is not compatible with
>>>> newer system that uses LSB.
>>>> So if pacemaker creates scripts in /etc/init.d it should create them
>>>> so that they are compatible with OS supported (not sure if Ubuntu is
>>>> one).
>>>> when I run “make install” anything is created for systemd env.
>>> 
>>> With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
>>> update-rc.d.
>>> 
>>> The pacemaker configure script should have detected that the OS uses
>>> systemd and installed the appropriate unit file.
>>> 
>>>> I am not a SysV vs System expert, hoping I haven’t said anything
>>>> wrong.
>>>> 
>>>>> On 11 Jul 2018, at 18:40, Andrei Borzenkov 
>>>>> wrote:
>>>>> 
>>>>> 11.07.2018 18:08, Salvatore D'angelo пишет:
>>>>>> Hi All,
>>>>>> 
>>>>>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
>>>>>> corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
>>>>>> repeat the same scenario on Ubuntu 16.04.
>>>>> 
>>>>> 16.04 is using systemd, you need to create systemd unit. I do not
>>>>> know
>>>>> if there is any compatibility layer to interpret upstart
>>>>> configuration
>>>>> like the one for sysvinit.
>>>>> 
>>>>>> As my previous scenario I am using Docker for test purpose before
>>>>>> move to Bare metal.
>>>>>> The scenario worked properly after I downloaded the correct
>>>>>> dependencies versions.
>>>>>> 
>>>>>> The only problem I experienced is that in my procedure install I
>>>>>> set corosync and pacemaker to run at startup updating the init.d
>>>>>> scripts with this commands:
>>>>>> 
>>>>>> update-rc.d corosync defaults
>>>>>> update-rc.d pacemaker defaults 80 80
>>>>>> 
>>>>>> I noticed that links in /etc/rc are not created.
>>>>>> 
>>>>>> I have also the following errors on second update-rc.d command:
>>>>>> insserv: Service corosync has to be enabled to start service
>>>>>> pacemaker
>>>>>> insserv: exiting now!
>>

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Yes, but when I do what you suggested, the system finds that the SysV scripts are 
installed, tries to use the update-rc.d machinery, and the failure occurs:

root@pg1:~# systemctl enable corosync
corosync.service is not a native service, redirecting to systemd-sysv-install
Executing /lib/systemd/systemd-sysv-install enable corosync
update-rc.d: error: corosync Default-Start contains no runlevels, aborting.

The only fix I found was to manually edit the header of 
/etc/init.d/corosync, adding the lines mentioned below.
But this is not a clean way to solve the issue.

What does pacemaker suggest for newer distributions?

If you look at the corosync code, the init/corosync file does not contain run 
levels in its header.
So I suspect it is a code problem. Am I wrong?
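
For reference, the header can be checked directly in the source tree (a sketch; the 
file name follows the corosync 2.4.x layout):

grep -n 'Default-Start' init/corosync*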

> On 11 Jul 2018, at 19:29, Ken Gaillot  wrote:
> 
> On Wed, 2018-07-11 at 18:43 +0200, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Yes that was clear to me, but question is pacemaker install
>> /etc/init.d/pacemaker script but its header is not compatible with
>> newer system that uses LSB.
>> So if pacemaker creates scripts in /etc/init.d it should create them
>> so that they are compatible with OS supported (not sure if Ubuntu is
>> one).
>> when I run “make install” anything is created for systemd env.
> 
> With Ubuntu 16, you should use "systemctl enable pacemaker" instead of
> update-rc.d.
> 
> The pacemaker configure script should have detected that the OS uses
> systemd and installed the appropriate unit file.
> 
>> I am not a SysV vs System expert, hoping I haven’t said anything
>> wrong.
>> 
>>> On 11 Jul 2018, at 18:40, Andrei Borzenkov 
>>> wrote:
>>> 
>>> 11.07.2018 18:08, Salvatore D'angelo пишет:
>>>> Hi All,
>>>> 
>>>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and
>>>> corosync from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to
>>>> repeat the same scenario on Ubuntu 16.04.
>>> 
>>> 16.04 is using systemd, you need to create systemd unit. I do not
>>> know
>>> if there is any compatibility layer to interpret upstart
>>> configuration
>>> like the one for sysvinit.
>>> 
>>>> As my previous scenario I am using Docker for test purpose before
>>>> move to Bare metal.
>>>> The scenario worked properly after I downloaded the correct
>>>> dependencies versions.
>>>> 
>>>> The only problem I experienced is that in my procedure install I
>>>> set corosync and pacemaker to run at startup updating the init.d
>>>> scripts with this commands:
>>>> 
>>>> update-rc.d corosync defaults
>>>> update-rc.d pacemaker defaults 80 80
>>>> 
>>>> I noticed that links in /etc/rc are not created.
>>>> 
>>>> I have also the following errors on second update-rc.d command:
>>>> insserv: Service corosync has to be enabled to start service
>>>> pacemaker
>>>> insserv: exiting now!
>>>> 
>>>> I was able to solve the issue manually replacing these lines in
>>>> /etc/init.d/corosync and /etc/init.d/pacemaker:
>>>> # Default-Start:
>>>> # Default-Stop:
>>>> 
>>>> with this:
>>>> # Default-Start:2 3 4 5
>>>> # Default-Stop: 0 1 6
>>>> 
>>>> I didn’t understand if this is a bug of corosync or pacemaker or
>>>> simply there is a dependency missing on Ubuntu 16.04 that was
>>>> installed by default on 14.04. I found other discussion on this
>>>> forum about this problem but it’s not clear the solution.
>>>> Thanks in advance for support.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ___
>>>> Users mailing list: Users@clusterlabs.org
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scra
>>>> tch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>> 
>>> 
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratc
>>> h.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
>> pdf
>> Bugs: http://bugs.clusterlabs.org
> -- 
> Ken Gaillot 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Thank you. It's clear now.

Il Mer 11 Lug 2018, 7:18 PM Andrei Borzenkov  ha
scritto:

> 11.07.2018 20:12, Salvatore D'angelo пишет:
> > Does this mean that if STONITH resource p_ston_pg1 even if it runs on
> node pg2 if pacemaker send a signal to it pg1 is powered of and not pg2.
> > Am I correct?
>
> Yes. Resource will be used to power off whatever hosts are listed in its
> pcmk_host_list. It is totally unrelated to where it is active currently.
>
> >
> >> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
> >>
> >> 11.07.2018 19:44, Salvatore D'angelo пишет:
> >>> Hi all,
> >>>
> >>> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are
> not correctly located:
> >>
> >> Actual location of stonith resources does not really matter in up to
> >> date pacemaker. It only determines where resource will be monitored;
> >> resource will be used by whatever node will be selected to perform
> stonith.
> >>
> >> The only requirement is that stonith resource is not prohibited from
> >> running on node by constraints.
> >>
> >>> p_ston_pg1  (stonith:external/ipmi):Started pg2
> >>> p_ston_pg2  (stonith:external/ipmi):Started pg1
> >>> p_ston_pg3  (stonith:external/ipmi):Started pg1
> >>>
> >>> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3
> (10.0.0.13). I expected p_ston_pg3 was running on pg3, but I see it on pg1.
> >>>
> >>> Here my configuration:
> >>> primitive p_ston_pg1 stonith:external/ipmi \\
> >>> params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>> primitive p_ston_pg2 stonith:external/ipmi \\
> >>> params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>> primitive p_ston_pg3 stonith:external/ipmi \\
> >>> params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass"
> passwd_method=file interface=lan priv=OPERATOR
> >>>
> >>> location l_ston_pg1 p_ston_pg1 -inf: pg1
> >>> location l_ston_pg2 p_ston_pg2 -inf: pg2
> >>> location l_ston_pg3 p_ston_pg3 -inf: pg3
> >>>
> >>> this seems work fine on bare metal.
> >>> Any suggestion what could be root cause?
> >>>
> >>
> >> Root cause of what? Locations match your constraints.
> >> ___
> >> Users mailing list: Users@clusterlabs.org
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Does this mean that even if the STONITH resource p_ston_pg1 runs on node pg2, when 
pacemaker triggers it, pg1 is powered off and not pg2?
Am I correct?

> On 11 Jul 2018, at 19:10, Andrei Borzenkov  wrote:
> 
> 11.07.2018 19:44, Salvatore D'angelo пишет:
>> Hi all,
>> 
>> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are not 
>> correctly located:
> 
> Actual location of stonith resources does not really matter in up to
> date pacemaker. It only determines where resource will be monitored;
> resource will be used by whatever node will be selected to perform stonith.
> 
> The only requirement is that stonith resource is not prohibited from
> running on node by constraints.
> 
>> p_ston_pg1   (stonith:external/ipmi):Started pg2
>> p_ston_pg2   (stonith:external/ipmi):Started pg1
>> p_ston_pg3   (stonith:external/ipmi):Started pg1
>> 
>> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
>> expected p_ston_pg3 was running on pg3, but I see it on pg1.
>> 
>> Here my configuration:
>> primitive p_ston_pg1 stonith:external/ipmi \\
>>  params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
>> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> primitive p_ston_pg2 stonith:external/ipmi \\
>>  params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
>> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> primitive p_ston_pg3 stonith:external/ipmi \\
>>  params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
>> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
>> passwd_method=file interface=lan priv=OPERATOR
>> 
>> location l_ston_pg1 p_ston_pg1 -inf: pg1
>> location l_ston_pg2 p_ston_pg2 -inf: pg2
>> location l_ston_pg3 p_ston_pg3 -inf: pg3
>> 
>> this seems work fine on bare metal.
>> Any suggestion what could be root cause?
>> 
> 
> Root cause of what? Locations match your constraints.
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



Re: [ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Suppose I do the following:

crm configure delete l_ston_pg1
crm configure delete l_ston_pg2
crm configure delete l_ston_pg3
crm configure location l_ston_pg1 p_ston_pg1 inf: pg1
crm configure location l_ston_pg2 p_ston_pg2 inf: pg2
crm configure location l_ston_pg3 p_ston_pg3 inf: pg3

How long should I wait to see each STONITH resource on the correct node? 
Do I need to do something to adjust things on the fly?
Thanks for your support.
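
To keep an eye on what happens after the change, something like this should do (a sketch):

crm configure show | grep l_ston   # confirm the new constraints are in the CIB
crm_mon -1                         # current resource placement
crm_simulate -sL                   # placement scores computed by the policy engine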

> On 11 Jul 2018, at 18:47, Emmanuel Gelati  wrote:
> 
> You need to use location l_ston_pg3 p_ston_pg3 inf: pg3, because -inf is 
> negative.
> 
> 2018-07-11 18:44 GMT+02:00 Salvatore D'angelo  <mailto:sasadang...@gmail.com>>:
> Hi all,
> 
> in my cluster doing cam_mon -1ARrf I noticed my STONITH resources are not 
> correctly located:
> p_ston_pg1(stonith:external/ipmi):Started pg2
> p_ston_pg2(stonith:external/ipmi):Started pg1
> p_ston_pg3(stonith:external/ipmi):Started pg1
> 
> I have three node: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
> expected p_ston_pg3 was running on pg3, but I see it on pg1.
> 
> Here my configuration:
> primitive p_ston_pg1 stonith:external/ipmi \\
>   params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
> ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg2 stonith:external/ipmi \\
>   params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
> ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> primitive p_ston_pg3 stonith:external/ipmi \\
>   params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
> ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" 
> passwd_method=file interface=lan priv=OPERATOR
> 
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
> 
> this seems work fine on bare metal.
> Any suggestion what could be root cause?
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
> https://lists.clusterlabs.org/mailman/listinfo/users 
> <https://lists.clusterlabs.org/mailman/listinfo/users>
> 
> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
> 
> 
> 
> 
> -- 
>   .~.
>   /V\
>  //  \\
> /(   )\
> ^`~'^
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



[ClusterLabs] STONITH resources on wrong nodes

2018-07-11 Thread Salvatore D'angelo
Hi all,

in my cluster, running crm_mon -1ARrf, I noticed my STONITH resources are not 
located where I expected:
p_ston_pg1  (stonith:external/ipmi):Started pg2
p_ston_pg2  (stonith:external/ipmi):Started pg1
p_ston_pg3  (stonith:external/ipmi):Started pg1

I have three nodes: pg1 (10.0.0.11), pg2 (10.0.0.12), and pg3 (10.0.0.13). I 
expected p_ston_pg3 to run on pg3, but I see it on pg1.

Here is my configuration:
primitive p_ston_pg1 stonith:external/ipmi \\
params hostname=pg1 pcmk_host_list=pg1 pcmk_host_check=static-list 
ipaddr=10.0.0.11 userid=root passwd="/etc/ngha/PG1-ipmipass" passwd_method=file 
interface=lan priv=OPERATOR
primitive p_ston_pg2 stonith:external/ipmi \\
params hostname=pg2 pcmk_host_list=pg2 pcmk_host_check=static-list 
ipaddr=10.0.0.12 userid=root passwd="/etc/ngha/PG2-ipmipass" passwd_method=file 
interface=lan priv=OPERATOR
primitive p_ston_pg3 stonith:external/ipmi \\
params hostname=pg3 pcmk_host_list=pg3 pcmk_host_check=static-list 
ipaddr=10.0.0.13 userid=root passwd="/etc/ngha/PG3-ipmipass" passwd_method=file 
interface=lan priv=OPERATOR

location l_ston_pg1 p_ston_pg1 -inf: pg1
location l_ston_pg2 p_ston_pg2 -inf: pg2
location l_ston_pg3 p_ston_pg3 -inf: pg3

This seems to work fine on bare metal.
Any suggestion as to what the root cause could be?




Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Hi,

Yes, that was clear to me, but my point is that pacemaker installs the 
/etc/init.d/pacemaker script, yet its header is not compatible with newer systems 
that use LSB.
So if pacemaker creates scripts in /etc/init.d, it should create them so that 
they are compatible with the supported OSes (I am not sure whether Ubuntu is one).
When I run “make install”, nothing is created for the systemd environment.

I am not a SysV vs systemd expert; I hope I haven't said anything wrong.
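
One way to verify whether a build actually picked up systemd support (a sketch; the 
unit file path is an assumption and depends on the configure prefix):

pacemakerd --features | grep systemd                      # no match = built without systemd support
ls /lib/systemd/system/ | grep -E 'corosync|pacemaker'    # unit files installed by "make install", if any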

> On 11 Jul 2018, at 18:40, Andrei Borzenkov  wrote:
> 
> 11.07.2018 18:08, Salvatore D'angelo пишет:
>> Hi All,
>> 
>> After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and corosync 
>> from 2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to repeat the same scenario 
>> on Ubuntu 16.04.
> 
> 16.04 is using systemd, you need to create systemd unit. I do not know
> if there is any compatibility layer to interpret upstart configuration
> like the one for sysvinit.
> 
>> As my previous scenario I am using Docker for test purpose before move to 
>> Bare metal.
>> The scenario worked properly after I downloaded the correct dependencies 
>> versions.
>> 
>> The only problem I experienced is that in my procedure install I set 
>> corosync and pacemaker to run at startup updating the init.d scripts with 
>> this commands:
>> 
>> update-rc.d corosync defaults
>> update-rc.d pacemaker defaults 80 80
>> 
>> I noticed that links in /etc/rc are not created.
>> 
>> I have also the following errors on second update-rc.d command:
>> insserv: Service corosync has to be enabled to start service pacemaker
>> insserv: exiting now!
>> 
>> I was able to solve the issue manually replacing these lines in 
>> /etc/init.d/corosync and /etc/init.d/pacemaker:
>> # Default-Start:
>> # Default-Stop:
>> 
>> with this:
>> # Default-Start:2 3 4 5
>> # Default-Stop: 0 1 6
>> 
>> I didn’t understand if this is a bug of corosync or pacemaker or simply 
>> there is a dependency missing on Ubuntu 16.04 that was installed by default 
>> on 14.04. I found other discussion on this forum about this problem but it’s 
>> not clear the solution.
>> Thanks in advance for support.
>> 
>> 
>> 
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> <https://lists.clusterlabs.org/mailman/listinfo/users>
>> 
>> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
>> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
>> 
> 
> ___
> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
> https://lists.clusterlabs.org/mailman/listinfo/users 
> <https://lists.clusterlabs.org/mailman/listinfo/users>
> 
> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>


[ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Salvatore D'angelo
Hi All,

After I successfully upgraded Pacemaker from 1.1.14 to 1.1.18 and corosync from 
2.3.35 to 2.4.4 on Ubuntu 14.04 I am trying to repeat the same scenario on 
Ubuntu 16.04.
As my previous scenario I am using Docker for test purpose before move to Bare 
metal.
The scenario worked properly after I downloaded the correct dependencies 
versions.

The only problem I experienced is that my install procedure sets corosync 
and pacemaker to run at startup by registering the init.d scripts with these commands:

update-rc.d corosync defaults
update-rc.d pacemaker defaults 80 80

I noticed that the links in /etc/rc*.d are not created.

I also get the following error on the second update-rc.d command:
insserv: Service corosync has to be enabled to start service pacemaker
insserv: exiting now!

I was able to solve the issue by manually replacing these lines in 
/etc/init.d/corosync and /etc/init.d/pacemaker:
# Default-Start:
# Default-Stop:

with this:
# Default-Start: 2 3 4 5
# Default-Stop:  0 1 6
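
With that header in place, re-running the registration creates the rc links; this is 
how I verify it (runlevel 2 shown as an example):

update-rc.d corosync defaults
update-rc.d pacemaker defaults 80 80
ls /etc/rc2.d/ | grep -E 'corosync|pacemaker'   # the S/K links should now exist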

I don't understand whether this is a bug in corosync or pacemaker, or simply a missing 
dependency on Ubuntu 16.04 that was installed by default on 14.04. 
I found other discussions of this problem on this forum, but the solution is not clear.
Thanks in advance for your support.



Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Salvatore D'angelo
Hi,

Thanks for the reply. The problem is the opposite of what you are saying.

When I built corosync against the old libqb and verified that the newly updated node 
worked properly, I then updated to the new hand-compiled libqb and it worked fine.
But in a normal upgrade procedure I first build libqb (removing the old one first) 
and then corosync, and when I follow this order it does not work.
This is what drives me crazy.
I do not understand this behaviour.
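
If I understand Chrissie's reply below correctly, corosync has to be (re)built against 
whichever libqb actually ends up installed, i.e. roughly (a sketch, assuming the 
release tarballs are already unpacked):

cd libqb-1.0.3 && ./configure && make && make install && ldconfig && cd ..
# ldconfig refreshes the runtime linker cache so the next build links against the new libqb
cd corosync-2.4.4 && ./configure && make && make install && cd ..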

> On 6 Jul 2018, at 14:40, Christine Caulfield  wrote:
> 
> On 06/07/18 13:24, Salvatore D'angelo wrote:
>> Hi All,
>> 
>> The option --ulimit memlock=536870912 worked fine.
>> 
>> I have now another strange issue. The upgrade without updating libqb
>> (leaving the 0.16.0) worked fine.
>> If after the upgrade I stop pacemaker and corosync, I download the
>> latest libqb version:
>> https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz
>> build and install it everything works fine.
>> 
>> If I try to install in sequence (after the installation of old code):
>> 
>> libqb 1.0.3
>> corosync 2.4.4
>> pacemaker 1.1.18
>> crmsh 3.0.1
>> resource agents 4.1.1
>> 
>> when I try to start corosync I got the following error:
>> *Starting Corosync Cluster Engine (corosync): /etc/init.d/corosync: line
>> 99:  8470 Aborted $prog $COROSYNC_OPTIONS > /dev/null 2>&1*
>> *[FAILED]*
> 
> 
> Yes. you can't randomly swap in and out hand-compiled libqb versions.
> Find one that works and stick to it. It's an annoying 'feature' of newer
> linkers that we had to workaround in libqb. So if you rebuild libqb
> 1.0.3 then you will, in all likelihood, need to rebuild corosync to
> match it.
> 
> Chrissie
> 
> 
>> 
>> if I launch corosync -f I got:
>> *corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite
>> section is populated, otherwise target's build is at fault, preventing
>> reliable logging" && __start___verbose != __stop___verbose' failed.*
>> 
>> anything is logged (even in debug mode).
>> 
>> I do not understand why installing libqb during the normal upgrade
>> process fails while if I upgrade it after the
>> crmsh/pacemaker/corosync/resourceagents upgrade it works fine. 
>> 
>> On 3 Jul 2018, at 11:42, Christine Caulfield > <mailto:ccaul...@redhat.com>
>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>> wrote:
>>> 
>>> On 03/07/18 07:53, Jan Pokorný wrote:
>>>> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
>>>>> Today I tested the two suggestions you gave me. Here what I did. 
>>>>> In the script where I create my 5 machines cluster (I use three
>>>>> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
>>>>> that we use for database backup and WAL files).
>>>>> 
>>>>> FIRST TEST
>>>>> ——
>>>>> I added the —shm-size=512m to the “docker create” command. I noticed
>>>>> that as soon as I start it the shm size is 512m and I didn’t need to
>>>>> add the entry in /etc/fstab. However, I did it anyway:
>>>>> 
>>>>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>>>> 
>>>>> and then
>>>>> mount -o remount /dev/shm
>>>>> 
>>>>> Then I uninstalled all pieces of software (crmsh, resource agents,
>>>>> corosync and pacemaker) and installed the new one.
>>>>> Started corosync and pacemaker but same problem occurred.
>>>>> 
>>>>> SECOND TEST
>>>>> ———
>>>>> stopped corosync and pacemaker
>>>>> uninstalled corosync
>>>>> build corosync with --enable-small-memory-footprint and installed it
>>>>> starte corosync and pacemaker
>>>>> 
>>>>> IT WORKED.
>>>>> 
>>>>> I would like to understand now why it didn’t worked in first test
>>>>> and why it worked in second. Which kind of memory is used too much
>>>>> here? /dev/shm seems not the problem, I allocated 512m on all three
>>>>> docker images (obviously on my single Mac) and enabled the container
>>>>> option as you suggested. Am I missing something here?
>>>> 
>>>> My suspicion then fully shifts towards "maximum number of bytes of
>>>> memory that may be locked into RAM" per-process resource limit as
>>>> raised in one of the most recent message ...
>>&

Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Salvatore D'angelo
= 0
mprotect(0x7f0cd2f0a000, 4096, PROT_READ) = 0
mprotect(0x7f0cd3d3b000, 4096, PROT_READ) = 0
mprotect(0x7f0cd3f6, 4096, PROT_READ) = 0
mprotect(0x563b1e764000, 4096, PROT_READ) = 0
mprotect(0x7f0cd4189000, 4096, PROT_READ) = 0
munmap(0x7f0cd4182000, 26561)   = 0
set_tid_address(0x7f0cd417aa10) = 8532
set_robust_list(0x7f0cd417aa20, 24) = 0
futex(0x7fff877daea0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 
7f0cd417a740) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x7f0cd371e9f0, [], SA_RESTORER|SA_SIGINFO, 
0x7f0cd3728330}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7f0cd371ea80, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 
0x7f0cd3728330}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(0)  = 0x563b1f774000
brk(0x563b1f795000) = 0x563b1f795000
futex(0x7f0cd3b390d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "corosync: main.c:143: logsys_qb_"..., 207corosync: main.c:143: 
logsys_qb_init: Assertion `"implicit callsite section is populated, otherwise 
target's build is at fault, preventing reliable logging" && __start___verbose 
!= __stop___verbose' failed.
) = 207
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f0cd4188000
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(8532, 8532, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=8532, si_uid=0} ---
+++ killed by SIGABRT +++
Aborted

> On 6 Jul 2018, at 14:24, Salvatore D'angelo  wrote:
> 
> Hi All,
> 
> The option --ulimit memlock=536870912 worked fine.
> 
> I have now another strange issue. The upgrade without updating libqb (leaving 
> the 0.16.0) worked fine.
> If after the upgrade I stop pacemaker and corosync, I download the latest 
> libqb version:
> https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz
>  
> <https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz>
> build and install it everything works fine.
> 
> If I try to install in sequence (after the installation of old code):
> 
> libqb 1.0.3
> corosync 2.4.4
> pacemaker 1.1.18
> crmsh 3.0.1
> resource agents 4.1.1
> 
> when I try to start corosync I got the following error:
> Starting Corosync Cluster Engine (corosync): /etc/init.d/corosync: line 99:  
> 8470 Aborted $prog $COROSYNC_OPTIONS > /dev/null 2>&1
> [FAILED]
> 
> if I launch corosync -f I got:
> corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite section 
> is populated, otherwise target's build is at fault, preventing reliable 
> logging" && __start___verbose != __stop___verbose' failed.
> 
> anything is logged (even in debug mode).
> 
> I do not understand why installing libqb during the normal upgrade process 
> fails while if I upgrade it after the crmsh/pacemaker/corosync/resourceagents 
> upgrade it works fine. 
> 
> On 3 Jul 2018, at 11:42, Christine Caulfield  <mailto:ccaul...@redhat.com>> wrote:
>> 
>> On 03/07/18 07:53, Jan Pokorný wrote:
>>> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
>>>> Today I tested the two suggestions you gave me. Here what I did. 
>>>> In the script where I create my 5 machines cluster (I use three
>>>> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
>>>> that we use for database backup and WAL files).
>>>> 
>>>> FIRST TEST
>>>> ——
>>>> I added the —shm-size=512m to the “docker create” command. I noticed
>>>> that as soon as I start it the shm size is 512m and I didn’t need to
>>>> add the entry in /etc/fstab. However, I did it anyway:
>>>> 
>>>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>>> 
>>>> and then
>>>> mount -o remount /dev/shm
>>>> 
>>>> Then I uninstalled all pieces of software (crmsh, resource agents,
>>>> corosync and pacemaker) and installed the new one.
>>>> Started corosync and pacemaker but same problem occurred.
>>>> 
>>>> SECOND TEST
>>>> ———
>>>> stopped corosync and pacemaker
>>>> uninstalled corosync
>>>> build corosync with --enable-small-memory-footprint and installed it
>>>> starte corosync and pacemaker
>>>> 
>>>> IT WORKED.
>>>> 
>>>> I would like to understand now why it didn’t worked in first test
>>>> and why it worked in second. Which kind of memory is used too much
&g

Re: [ClusterLabs] Upgrade corosync problem

2018-07-06 Thread Salvatore D'angelo
Hi All,

The option --ulimit memlock=536870912 worked fine.

I now have another strange issue. The upgrade without updating libqb (leaving 
0.16.0 in place) worked fine.
If, after the upgrade, I stop pacemaker and corosync, download the latest libqb 
version:
https://github.com/ClusterLabs/libqb/releases/download/v1.0.3/libqb-1.0.3.tar.gz
and build and install it, everything works fine.

If instead I try to install the following in sequence (after the installation of the old code):

libqb 1.0.3
corosync 2.4.4
pacemaker 1.1.18
crmsh 3.0.1
resource agents 4.1.1

when I try to start corosync I got the following error:
Starting Corosync Cluster Engine (corosync): /etc/init.d/corosync: line 99:  
8470 Aborted $prog $COROSYNC_OPTIONS > /dev/null 2>&1
[FAILED]

if I launch corosync -f I got:
corosync: main.c:143: logsys_qb_init: Assertion `"implicit callsite section is 
populated, otherwise target's build is at fault, preventing reliable logging" 
&& __start___verbose != __stop___verbose' failed.

Nothing is logged (even in debug mode).

I do not understand why installing libqb during the normal upgrade process 
fails, while upgrading it after the crmsh/pacemaker/corosync/resource-agents 
upgrade works fine. 
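
A quick way to see which libqb the corosync binary is actually linked against 
(a sketch; I assume corosync was installed under /usr/sbin):

ldd /usr/sbin/corosync | grep qb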

On 3 Jul 2018, at 11:42, Christine Caulfield  wrote:
> 
> On 03/07/18 07:53, Jan Pokorný wrote:
>> On 02/07/18 17:19 +0200, Salvatore D'angelo wrote:
>>> Today I tested the two suggestions you gave me. Here what I did. 
>>> In the script where I create my 5 machines cluster (I use three
>>> nodes for pacemaker PostgreSQL cluster and two nodes for glusterfs
>>> that we use for database backup and WAL files).
>>> 
>>> FIRST TEST
>>> ——
>>> I added the —shm-size=512m to the “docker create” command. I noticed
>>> that as soon as I start it the shm size is 512m and I didn’t need to
>>> add the entry in /etc/fstab. However, I did it anyway:
>>> 
>>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>>> 
>>> and then
>>> mount -o remount /dev/shm
>>> 
>>> Then I uninstalled all pieces of software (crmsh, resource agents,
>>> corosync and pacemaker) and installed the new one.
>>> Started corosync and pacemaker but same problem occurred.
>>> 
>>> SECOND TEST
>>> ———
>>> stopped corosync and pacemaker
>>> uninstalled corosync
>>> build corosync with --enable-small-memory-footprint and installed it
>>> starte corosync and pacemaker
>>> 
>>> IT WORKED.
>>> 
>>> I would like to understand now why it didn’t worked in first test
>>> and why it worked in second. Which kind of memory is used too much
>>> here? /dev/shm seems not the problem, I allocated 512m on all three
>>> docker images (obviously on my single Mac) and enabled the container
>>> option as you suggested. Am I missing something here?
>> 
>> My suspicion then fully shifts towards "maximum number of bytes of
>> memory that may be locked into RAM" per-process resource limit as
>> raised in one of the most recent message ...
>> 
>>> Now I want to use Docker for the moment only for test purpose so it
>>> could be ok to use the --enable-small-memory-footprint, but there is
>>> something I can do to have corosync working even without this
>>> option?
>> 
>> ... so try running the container the already suggested way:
>> 
>>  docker run ... --ulimit memlock=33554432 ...
>> 
>> or possibly higher (as a rule of thumb, keep doubling the accumulated
>> value until some unreasonable amount is reached, like the equivalent
>> of already used 512 MiB).
>> 
>> Hope this helps.
> 
> This makes a lot of sense to me. As Poki pointed out earlier, in
> corosync 2.4.3 (I think) we fixed a regression in that caused corosync
> NOT to be locked in RAM after it forked - which was causing potential
> performance issues. So if you replace an earlier corosync with 2.4.3 or
> later then it will use more locked memory than before.
> 
> Chrissie
> 
> 
>> 
>>> The reason I am asking this is that, in the future, it could be
>>> possible we deploy in production our cluster in containerised way
>>> (for the moment is just an idea). This will save a lot of time in
>>> developing, maintaining and deploying our patch system. All
>>> prerequisites and dependencies will be enclosed in container and if
>>> IT team will do some maintenance on bare metal (i.e. install new
>>> dependencies) it will not affects our containers. I do not see a lot
>>> of performance drawbacks in using container. The poi

[ClusterLabs] Found libqb issue that affects pacemaker 1.1.18

2018-07-05 Thread Salvatore D'angelo
Hi,

I tried to build libqb 1.0.3 on a fresh machine and then corosync 2.4.4 and 
pacemaker 1.1.18.
I found the following bug and filed it against libqb on GitHub:
https://github.com/ClusterLabs/libqb/issues/312 


For the moment I fixed it manually in my environment. Has anyone experienced this issue?


Re: [ClusterLabs] crm --version shows "cam dev"

2018-07-05 Thread Salvatore D'angelo
Hi,

Thanks for the reply. Here are the steps:

wget https://github.com/ClusterLabs/crmsh/archive/3.0.1.tar.gz
tar xzvf 3.0.1.tar.gz
cd crmsh-3.0.1
apt-get install python-setuptools
./autogen.sh
./configure
make
make install

The apt-get was required in my container because a Python library was not 
installed (I do not remember which one).

Once installed, run:

crm --version

In 2.2.0 the output was, correctly:
crm 2.2.0

Now it is:
crm dev

This does not allow me to verify that I installed the correct version. To be more 
precise, it is an upgrade from 2.2.0, where I first uninstalled 2.2.0 and then 
installed this one. I am sure the uninstall removed everything related to crmsh, 
because I ran a “find / -name crmsh*” on the container.


> On 4 Jul 2018, at 21:32, Kristoffer Grönlund  wrote:
> 
> On Wed, 2018-07-04 at 17:52 +0200, Salvatore D'angelo wrote:
>> Hi,
>> 
>> With crash 2.2.0 the command:
>> cam —version
>> works fine. I downloaded 3.0.1 and it shows:
>> crm dev
>> 
>> I know this is not a big issue but I just wanted to verify I
>> installed the correct version of crash.
>> 
> 
> It's probably right, but can you describe in more detail from where you
> downloaded and how you installed it?
> 
> Cheers,
> Kristoffer
> 
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
>> pdf
>> Bugs: http://bugs.clusterlabs.org
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



[ClusterLabs] crm --version shows "cam dev"

2018-07-04 Thread Salvatore D'angelo
Hi,

With crmsh 2.2.0 the command:
crm --version
works fine. I downloaded 3.0.1 and it shows:
crm dev

I know this is not a big issue, but I just wanted to verify that I installed the 
correct version of crmsh.


Re: [ClusterLabs] Upgrade corosync problem

2018-07-02 Thread Salvatore D'angelo
Hi All,

Today I tested the two suggestions you gave me. Here is what I did, in the script 
where I create my 5-machine cluster (I use three nodes for the pacemaker PostgreSQL 
cluster and two GlusterFS nodes that we use for database backups and WAL files).

FIRST TEST
——
I added --shm-size=512m to the “docker create” command. I noticed that as 
soon as I start the container the shm size is 512m, so I didn't need to add the entry 
in /etc/fstab. However, I did it anyway:

tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0

and then
mount -o remount /dev/shm

Then I uninstalled all the pieces of software (crmsh, resource-agents, corosync and 
pacemaker) and installed the new ones.
I started corosync and pacemaker, but the same problem occurred.

SECOND TEST
———
stopped corosync and pacemaker
uninstalled corosync
built corosync with --enable-small-memory-footprint and installed it
started corosync and pacemaker

IT WORKED.

I would now like to understand why it didn't work in the first test and why it 
worked in the second. Which kind of memory is being used too much here? /dev/shm 
does not seem to be the problem: I allocated 512m on all three Docker containers 
(obviously on my single Mac) and enabled the container option as you suggested. 
Am I missing something here?
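
In case it is the locked-memory limit rather than /dev/shm, it can be inspected inside 
the container like this (a sketch):

ulimit -l                                             # max locked memory (kbytes) for the current shell
grep 'locked memory' /proc/$(pidof corosync)/limits   # limit applied to the running corosync, if it is up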

For the moment I want to use Docker only for test purposes, so it could be OK to use 
--enable-small-memory-footprint, but is there something I can do to have corosync 
working even without this option?


The reason I am asking is that, in the future, we might deploy our cluster in 
production in a containerised way (for the moment it is just an idea). This would save 
a lot of time in developing, maintaining and deploying our patch system. All 
prerequisites and dependencies would be enclosed in the container, and if the IT team 
does some maintenance on bare metal (e.g. installs new dependencies) it will not affect 
our containers. I do not see many performance drawbacks in using containers. The point 
is to understand whether a containerised approach could save us a lot of maintenance 
headaches for this cluster without affecting performance too much. I have noticed this 
approach in a lot of cloud-environment contexts.


> On 2 Jul 2018, at 08:54, Christine Caulfield  wrote:
> 
> On 29/06/18 17:20, Jan Pokorný wrote:
>> On 29/06/18 10:00 +0100, Christine Caulfield wrote:
>>> On 27/06/18 08:35, Salvatore D'angelo wrote:
>>>> One thing that I do not understand is that I tried to compare corosync
>>>> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
>>>> differences but I haven’t found anything related to the piece of code
>>>> that affects the issue. The quorum tool.c and cfg.c are almost the same.
>>>> Probably the issue is somewhere else.
>>>> 
>>> 
>>> This might be asking a bit much, but would it be possible to try this
>>> using Virtual Machines rather than Docker images? That would at least
>>> eliminate a lot of complex variables.
>> 
>> Salvatore, you can ignore the part below, try following the "--shm"
>> advice in other part of this thread.  Also the previous suggestion
>> to compile corosync with --small-memory-footprint may be of help,
>> but comes with other costs (expect lower throughput).
>> 
>> 
>> Chrissie, I have a plausible explanation and if it's true, then the
>> same will be reproduced wherever /dev/shm is small enough.
>> 
>> If I am right, then the offending commit is
>> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
>> (present since 2.4.3), and while it arranges things for the better
>> in the context of prioritized, low jitter process, it all of
>> a sudden prevents as-you-need memory acquisition from the system,
>> meaning that the memory consumption constraints are checked immediately
>> when the memory is claimed (as it must fit into dedicated physical
>> memory in full).  Hence this impact we likely never realized may
>> be perceived as a sort of a regression.
>> 
>> Since we can calculate the approximate requirements statically, might
>> be worthy to add something like README.requirements, detailing how much
>> space will be occupied for typical configurations at minimum, e.g.:
>> 
>> - standard + --small-memory-footprint configuration
>> - 2 + 3 + X nodes (5?)
>> - without any service on top + teamed with qnetd + teamed with
>>  pacemaker atop (including just IPC channels between pacemaker
>>  daemons and corosync's CPG service, indeed)
>> 
> 
> That is a possible explanation I suppose, yes. It's not something we can
> sensibly revert because it was already fixing another regression!
> 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-30 Thread Salvatore D'angelo
Hi everyone,

Thanks for the suggestions. Yesterday was a city holiday in Rome, and with the weekend 
I think I’ll try all your proposals on Monday morning
when I go back to the office. Thanks again for the support, I appreciate it a lot.

> On 29 Jun 2018, at 18:20, Jan Pokorný  wrote:
> 
> On 29/06/18 10:00 +0100, Christine Caulfield wrote:
>> On 27/06/18 08:35, Salvatore D'angelo wrote:
>>> One thing that I do not understand is that I tried to compare corosync
>>> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
>>> differences but I haven’t found anything related to the piece of code
>>> that affects the issue. The quorum tool.c and cfg.c are almost the same.
>>> Probably the issue is somewhere else.
>>> 
>> 
>> This might be asking a bit much, but would it be possible to try this
>> using Virtual Machines rather than Docker images? That would at least
>> eliminate a lot of complex variables.
> 
> Salvatore, you can ignore the part below, try following the "--shm"
> advice in other part of this thread.  Also the previous suggestion
> to compile corosync with --small-memory-footprint may be of help,
> but comes with other costs (expect lower throughput).
> 
> 
> Chrissie, I have a plausible explanation and if it's true, then the
> same will be reproduced wherever /dev/shm is small enough.
> 
> If I am right, then the offending commit is
> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
> (present since 2.4.3), and while it arranges things for the better
> in the context of prioritized, low jitter process, it all of
> a sudden prevents as-you-need memory acquisition from the system,
> meaning that the memory consumption constraints are checked immediately
> when the memory is claimed (as it must fit into dedicated physical
> memory in full).  Hence this impact we likely never realized may
> be perceived as a sort of a regression.
> 
> Since we can calculate the approximate requirements statically, might
> be worthy to add something like README.requirements, detailing how much
> space will be occupied for typical configurations at minimum, e.g.:
> 
> - standard + --small-memory-footprint configuration
> - 2 + 3 + X nodes (5?)
> - without any service on top + teamed with qnetd + teamed with
>  pacemaker atop (including just IPC channels between pacemaker
>  daemons and corosync's CPG service, indeed)
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-29 Thread Salvatore D'angelo
Good to know. I'll try it. I'll try to work on VM too.

On Fri 29 Jun 2018, 5:46 PM Jan Pokorný  wrote:

> On 26/06/18 11:03 +0200, Salvatore D'angelo wrote:
> > Yes, sorry you’re right I could find it by myself.
> > However, I did the following:
> >
> > 1. Added the line you suggested to /etc/fstab
> > 2. mount -o remount /dev/shm
> > 3. Now I correctly see /dev/shm of 512M with df -h
> > Filesystem  Size  Used Avail Use% Mounted on
> > overlay  63G   11G   49G  19% /
> > tmpfs64M  4.0K   64M   1% /dev
> > tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> > osxfs   466G  158G  305G  35% /Users
> > /dev/sda163G   11G   49G  19% /etc/hosts
> > shm 512M   15M  498M   3% /dev/shm
> > tmpfs  1000M 0 1000M   0% /sys/firmware
> > tmpfs   128M 0  128M   0% /tmp
> >
> > The errors in log went away. Consider that I remove the log file
> > before start corosync so it does not contains lines of previous
> > executions.
> >
> >
> > But the command:
> > corosync-quorumtool -ps
> >
> > still give:
> > Cannot initialize QUORUM service
> >
> > Consider that few minutes before it gave me the message:
> > Cannot initialize CFG service
> >
> > I do not know the differences between CFG and QUORUM in this case.
> >
> > If I try to start pacemaker the service is OK but I see only
> > pacemaker and the Transport does not work if I try to run a crm
> > command.
> > Any suggestion?
>
> Frankly, best generic suggestion I can serve with is to learn
> sufficient portions of the details about the tool you are relying on.
>
> I had a second look and it seems that what drives the actual
> size of the container's /dev/shm mountpoint with docker
> (per other response, you don't seem to be using --ipc switch) is
> it's --shm-size option for "run" subcommand (hence it's rather
> a property of the run-time, as the default of "64m" may be
> silently overriding your believed-to-be-persistent static changes
> within the container).
>
> Try using that option and you'll see.  Definitely keep you mind open
> regarding "container != magic-less system" inequality.
>
> --
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Install fresh pacemaker + corosync fails

2018-06-28 Thread Salvatore D'angelo
Hi All,

I am here again. I am still fighting against upgrade problems, but now I am 
trying to change the approach.
I now want to try a fresh install of a new version of Corosync and Postgres and have 
it working.
For the moment I am not interested in a specific configuration, just three 
nodes where I can run a dummy resource as in this tutorial.

I would prefer to download specific versions of the packages, but I am OK with whatever 
new versions for now.
I followed this procedure:
https://wiki.clusterlabs.org/wiki/SourceInstall 


but the compilation fails with this procedure. If I want to compile from source, it’s 
not clear what the dependencies are. 
I started from a scratch Ubuntu 14.04 (I only configured ssh to connect to the 
machines).

For libqb I had to install the following dependencies with apt-get:
autoconf
libtool 
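
That is, starting from the scratch image, something like this (a sketch; any package 
beyond autoconf and libtool is my assumption):

apt-get update
apt-get install -y autoconf libtool pkg-config make gcc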

Corosync compilation failed at the step 
./autogen.sh && ./configure --prefix=$PREFIX 
with the following error:

checking for knet... no
configure: error: Package requirements (libknet) were not met:

No package 'libknet' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

I tried to install the libraries libknet1 and libknet-dev with apt-get, but they 
were not found. I then tried to download the source code of this library here:
https://github.com/kronosnet/kronosnet 

but the ./autogen.sh && ./configure --prefix=$PREFIX step failed too, with this 
error:

configure: error: Package requirements (liblz4) were not met:
No package 'liblz4' found

I installed liblz4 and liblz4-dev, but the problem still occurs.
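
What I still need to verify is whether pkg-config can actually see the liblz4 and 
libknet .pc files; if they were installed under a non-standard prefix, something like 
this should help before re-running ./configure (a sketch, the path is an assumption):

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --modversion liblz4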

I am going around in circles here. I am asking whether someone has tested the install 
procedure on Ubuntu 14.04 and can give me the exact steps to install a fresh 
pacemaker 1.1.18 (or later) with corosync 2.4.4 (or later).
Thanks in advance for the help.






___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-27 Thread Salvatore D'angelo
Hi,

Thanks for the reply and the detailed explanation. I am not using the --network=host 
option.
I have a docker image based on Ubuntu 14.04 where I only deploy this additional 
software:

RUN apt-get update && apt-get install -y wget git xz-utils openssh-server \
    systemd-services make gcc pkg-config psmisc fuse libpython2.7 libopenipmi0 \
    libdbus-glib-1-2 libsnmp30 libtimedate-perl libpcap0.8

I configure ssh with key pairs to communicate easily. The containers are created 
with these simple commands:

docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device /dev/loop0 --device /dev/fuse \
    --net ${PUBLIC_NETWORK_NAME} --publish ${PG1_SSH_PORT}:22 --ip ${PG1_PUBLIC_IP} \
    --name ${PG1_PRIVATE_NAME} --hostname ${PG1_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash

docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device /dev/loop1 --device /dev/fuse \
    --net ${PUBLIC_NETWORK_NAME} --publish ${PG2_SSH_PORT}:22 --ip ${PG2_PUBLIC_IP} \
    --name ${PG2_PRIVATE_NAME} --hostname ${PG2_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash

docker create -it --cap-add=MKNOD --cap-add SYS_ADMIN --device /dev/loop2 --device /dev/fuse \
    --net ${PUBLIC_NETWORK_NAME} --publish ${PG3_SSH_PORT}:22 --ip ${PG3_PUBLIC_IP} \
    --name ${PG3_PRIVATE_NAME} --hostname ${PG3_PRIVATE_NAME} -v ${MOUNT_FOLDER}:/Users ngha /bin/bash

/dev/fuse is used to configure glusterfs on the two other nodes, and /dev/loopX 
just to better simulate my bare metal environment.

One thing that I do not understand: I tried to compare corosync 2.3.5 (the old version 
that worked fine) and 2.4.4 to understand the differences, but I haven’t found anything 
related to the piece of code that affects the issue. corosync-quorumtool.c and cfg.c 
are almost the same. Probably the issue is somewhere 
else.
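
For what it's worth, the comparison I did was along these lines (a sketch; I am 
assuming the v-prefixed tag names in the corosync repository):

git clone https://github.com/corosync/corosync.git
cd corosync
git diff v2.3.5 v2.4.4 -- tools/corosync-quorumtool.c lib/quorum.c lib/cfg.c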


> On 27 Jun 2018, at 08:34, Jan Pokorný  wrote:
> 
> On 26/06/18 17:56 +0200, Salvatore D'angelo wrote:
>> I did another test. I modified docker container in order to be able to run 
>> strace.
>> Running strace corosync-quorumtool -ps I got the following:
> 
>> [snipped]
>> connect(5, {sa_family=AF_LOCAL, sun_path=@"cfg"}, 110) = 0
>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
>> sendto(5, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, 
>> MSG_NOSIGNAL, NULL, 0) = 24
>> setsockopt(5, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
>> recvfrom(5, 0x7ffd73bd7ac0, 12328, 16640, 0, 0) = -1 EAGAIN (Resource 
>> temporarily unavailable)
>> poll([{fd=5, events=POLLIN}], 1, 4294967295) = 1 ([{fd=5, revents=POLLIN}])
>> recvfrom(5, 
>> "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\365\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0"...,
>>  12328, MSG_WAITALL|MSG_NOSIGNAL, NULL, NULL) = 12328
>> shutdown(5, SHUT_RDWR)  = 0
>> close(5)= 0
>> write(2, "Cannot initialise CFG service\n", 30Cannot initialise CFG service) 
>> = 30
>> [snipped]
> 
> This just demonstrated the effect of already detailed server-side
> error in the client, which communicates with the server just fine,
> but as soon as the server hits the mmap-based problem, it bails
> out the observed way, leaving client unsatisfied.
> 
> Note one thing, abstract Unix sockets are being used for the
> communication like this (observe the first line in the strace
> output excerpt above), and if you happen to run container via
> a docker command with --network=host, you may also be affected with
> issues arising from abstract sockets not being isolated but rather
> sharing the same namespace.  At least that was the case some years
> back and what asked for a switch in underlying libqb library to
> use strictly the file-backed sockets, where the isolation
> semantics matches the intuition:
> 
> https://lists.clusterlabs.org/pipermail/users/2017-May/013003.html
> 
> + way to enable (presumably only for container environments, note
> that there's no per process straightforward granularity):
> 
> https://clusterlabs.github.io/libqb/1.0.2/doxygen/qb_ipc_overview.html
> (scroll down to "IPC sockets (Linux only)")
> 
> You may test that if you are using said --network=host switch.
> 
>> I tried to understand what happen behind the scene but it is not easy for me.
>> Hoping someone on this list can help.
> 
> Containers are tricky, just as Ansible (as shown earlier on the list)
> can be, when encumbered with false believes and/or misunderstandings.
> Virtual machines may serve better wrt. insights for the later bare
> metal deployments.
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mail

Re: [ClusterLabs] Antw: Re: Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi again,

I did another test. I modified the docker container in order to be able to run strace.
Running strace corosync-quorumtool -ps I got the following:

corosync-quorumtool-strace.log
Description: Binary data
I tried to understand what happens behind the scenes but it is not easy for me.
Hoping someone on this list can help.

> On 26 Jun 2018, at 16:06, Ulrich Windl  wrote:
> 
>>> Salvatore D'angelo  wrote on 26.06.2018 at 10:40 in message :
>>> Hi,
>>> Yes,
>>> I am reproducing only the required part for test. I think the original
>>> system has a larger shm. The problem is that I do not know exactly how
>>> to change it.
> 
> If you want to go paranoid, here's a setting from a SLES11 system:
> 
> # grep shm /etc/sysctl.conf
> kernel.shmmax = 9223372036854775807
> kernel.shmall = 1152921504606846720
> [...]
> 
> See SYSCTL(8)
> 
> Regards,
> Ulrich
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
I noticed that corosync 2.4.4 depends on the following libraries:
https://launchpad.net/ubuntu/+source/corosync/2.4.4-3 
<https://launchpad.net/ubuntu/+source/corosync/2.4.4-3>

I imagine that all the corosync-* and libcorosync-* libraries are built from 
the corosync build, so I should have them. Am I correct?

libcfg6
libcmap4
libcpg4
libquorum5
libsam4
libtotem-pg5
libvotequorum8

Can you tell me where these libraries come from and if I need them?
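
In the meantime, a quick check I can do on my side is to look at what the corosync 
build actually installed and what the tools link against, roughly (a sketch; /usr/local 
is my assumed install prefix):

ls /usr/local/lib/libcfg* /usr/local/lib/libcmap* /usr/local/lib/libquorum* /usr/local/lib/libvotequorum*
ldd $(which corosync-quorumtool) | grep -E 'cfg|cmap|quorum|totem'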

> On 26 Jun 2018, at 14:08, Christine Caulfield  wrote:
> 
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> libqb update to 1.0.3 but same issue.
>> 
>> I know corosync has also these dependencies nspr and nss3. I updated
>> them using apt-get install, here the version installed:
>> 
>>libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>>libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
>> 
>> but same problem.
>> 
>> I am working on Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there new versions for these libraries?
>> Where I can download them? I tried to search on google but results where
>> quite confusing.
>> 
> 
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility that of corosync.  Which versions did
> you have that worked (libqb and corosync) ?
> 
> Chrissie
> 
> 
>> 
>>> On 26 Jun 2018, at 12:27, Christine Caulfield >> <mailto:ccaul...@redhat.com>> wrote:
>>> 
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>>> Hi,
>>>> 
>>>> I have tried with:
>>>> 0.16.0.real-1ubuntu4
>>>> 0.16.0.real-1ubuntu5
>>>> 
>>>> which version should I try?
>>> 
>>> 
>>> Hmm both of those are actually quite old! maybe a newer one?
>>> 
>>> Chrissie
>>> 
>>>> 
>>>>> On 26 Jun 2018, at 12:03, Christine Caulfield >>>> <mailto:ccaul...@redhat.com>
>>>>> <mailto:ccaul...@redhat.com>> wrote:
>>>>> 
>>>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>>>>> Consider that the container is the same when corosync 2.3.5 run.
>>>>>> If it is something related to the container probably the 2.4.4
>>>>>> introduced a feature that has an impact on container.
>>>>>> Should be something related to libqb according to the code.
>>>>>> Anyone can help?
>>>>>> 
>>>>> 
>>>>> 
>>>>> Have you tried downgrading libqb to the previous version to see if it
>>>>> still happens?
>>>>> 
>>>>> Chrissie
>>>>> 
>>>>>>> On 26 Jun 2018, at 11:56, Christine Caulfield >>>>>> <mailto:ccaul...@redhat.com>
>>>>>>> <mailto:ccaul...@redhat.com>
>>>>>>> <mailto:ccaul...@redhat.com>> wrote:
>>>>>>> 
>>>>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>>>>> Sorry after the command:
>>>>>>>> 
>>>>>>>> corosync-quorumtool -ps
>>>>>>>> 
>>>>>>>> the error in log are still visible. Looking at the source code it
>>>>>>>> seems
>>>>>>>> problem is at this line:
>>>>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>>>> 
>>>>>>>> if (quorum_initialize(_handle, _callbacks, _type) !=
>>>>>>>> CS_OK) {
>>>>>>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>>> q_handle = 0;
>>>>>>>> goto out;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> if (corosync_cfg_initialize(_handle, _callbacks) != CS_OK) {
>>>>>>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>>> c_handle = 0;
>>>>>>>> goto out;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> The quorum_initialize function is defined here:
>>>>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>>>>>> 
>>>>>>>> It seems interacts with libqb to allocate space on /dev/shm but
>>>>>>>> something fails. I tried to up

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
corosync 2.3.5 and libqb 0.16.0
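
(For completeness, I checked the versions roughly like this; the pkg-config query 
assumes libqb installed its .pc file:)

corosync -v
pkg-config --modversion libqb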

> On 26 Jun 2018, at 14:08, Christine Caulfield  wrote:
> 
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> libqb update to 1.0.3 but same issue.
>> 
>> I know corosync has also these dependencies nspr and nss3. I updated
>> them using apt-get install, here the version installed:
>> 
>>libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>>libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3
>> 
>> but same problem.
>> 
>> I am working on Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there new versions for these libraries?
>> Where I can download them? I tried to search on google but results where
>> quite confusing.
>> 
> 
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility that of corosync.  Which versions did
> you have that worked (libqb and corosync) ?
> 
> Chrissie
> 
> 
>> 
>>> On 26 Jun 2018, at 12:27, Christine Caulfield >> <mailto:ccaul...@redhat.com>
>>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>> wrote:
>>> 
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>>> Hi,
>>>> 
>>>> I have tried with:
>>>> 0.16.0.real-1ubuntu4
>>>> 0.16.0.real-1ubuntu5
>>>> 
>>>> which version should I try?
>>> 
>>> 
>>> Hmm both of those are actually quite old! maybe a newer one?
>>> 
>>> Chrissie
>>> 
>>>> 
>>>>> On 26 Jun 2018, at 12:03, Christine Caulfield >>>> <mailto:ccaul...@redhat.com>
>>>>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>
>>>>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>> wrote:
>>>>> 
>>>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>>>>> Consider that the container is the same when corosync 2.3.5 run.
>>>>>> If it is something related to the container probably the 2.4.4
>>>>>> introduced a feature that has an impact on container.
>>>>>> Should be something related to libqb according to the code.
>>>>>> Anyone can help?
>>>>>> 
>>>>> 
>>>>> 
>>>>> Have you tried downgrading libqb to the previous version to see if it
>>>>> still happens?
>>>>> 
>>>>> Chrissie
>>>>> 
>>>>>>> On 26 Jun 2018, at 11:56, Christine Caulfield >>>>>> <mailto:ccaul...@redhat.com>
>>>>>>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>
>>>>>>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>
>>>>>>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>> wrote:
>>>>>>> 
>>>>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>>>>> Sorry after the command:
>>>>>>>> 
>>>>>>>> corosync-quorumtool -ps
>>>>>>>> 
>>>>>>>> the error in log are still visible. Looking at the source code it
>>>>>>>> seems
>>>>>>>> problem is at this line:
>>>>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>>>>  
>>>>>>>> <https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c>
>>>>>>>> 
>>>>>>>> if (quorum_initialize(_handle, _callbacks, _type) !=
>>>>>>>> CS_OK) {
>>>>>>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>>> q_handle = 0;
>>>>>>>> goto out;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> if (corosync_cfg_initialize(_handle, _callbacks) != CS_OK) {
>>>>>>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>>> c_handle = 0;
>>>>>>>> goto out;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> The quorum_initialize function is defined here:
>>>>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c 
>>>>>>>> <https://github.com/corosync/corosync/blob/master/l

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
I updated libqb to 1.0.3, but the same issue occurs.
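
For the record, the way I got to 1.0.3 was building it from the upstream repository, 
roughly (a sketch; the v1.0.3 tag name is my assumption):

git clone https://github.com/ClusterLabs/libqb.git
cd libqb
git checkout v1.0.3
./autogen.sh && ./configure && make && make install
ldconfig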

I know corosync also has these dependencies: nspr and nss3. I updated them using 
apt-get install; here are the versions installed:

   libnspr4, libnspr4-dev   2:4.13.1-0ubuntu0.14.04.1
   libnss3, libnss3-dev, libnss3-nssb   2:3.28.4-0ubuntu0.14.04.3

but the same problem occurs.

I am working on an Ubuntu 14.04 image and I know that packages could be quite old 
here. Are there newer versions of these libraries?
Where can I download them? I tried to search on Google but the results were quite 
confusing.


> On 26 Jun 2018, at 12:27, Christine Caulfield  wrote:
> 
> On 26/06/18 11:24, Salvatore D'angelo wrote:
>> Hi,
>> 
>> I have tried with:
>> 0.16.0.real-1ubuntu4
>> 0.16.0.real-1ubuntu5
>> 
>> which version should I try?
> 
> 
> Hmm both of those are actually quite old! maybe a newer one?
> 
> Chrissie
> 
>> 
>>> On 26 Jun 2018, at 12:03, Christine Caulfield >> <mailto:ccaul...@redhat.com>> wrote:
>>> 
>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>>> Consider that the container is the same when corosync 2.3.5 run.
>>>> If it is something related to the container probably the 2.4.4
>>>> introduced a feature that has an impact on container.
>>>> Should be something related to libqb according to the code.
>>>> Anyone can help?
>>>> 
>>> 
>>> 
>>> Have you tried downgrading libqb to the previous version to see if it
>>> still happens?
>>> 
>>> Chrissie
>>> 
>>>>> On 26 Jun 2018, at 11:56, Christine Caulfield >>>> <mailto:ccaul...@redhat.com>
>>>>> <mailto:ccaul...@redhat.com>> wrote:
>>>>> 
>>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>>> Sorry after the command:
>>>>>> 
>>>>>> corosync-quorumtool -ps
>>>>>> 
>>>>>> the error in log are still visible. Looking at the source code it seems
>>>>>> problem is at this line:
>>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>> 
>>>>>> if (quorum_initialize(_handle, _callbacks, _type) != CS_OK) {
>>>>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>> q_handle = 0;
>>>>>> goto out;
>>>>>> }
>>>>>> 
>>>>>> if (corosync_cfg_initialize(_handle, _callbacks) != CS_OK) {
>>>>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>> c_handle = 0;
>>>>>> goto out;
>>>>>> }
>>>>>> 
>>>>>> The quorum_initialize function is defined here:
>>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>>>> 
>>>>>> It seems interacts with libqb to allocate space on /dev/shm but
>>>>>> something fails. I tried to update the libqb with apt-get install
>>>>>> but no
>>>>>> success.
>>>>>> 
>>>>>> The same for second function:
>>>>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>>>> 
>>>>>> Now I am not an expert of libqb. I have the
>>>>>> version 0.16.0.real-1ubuntu5.
>>>>>> 
>>>>>> The folder /dev/shm has 777 permission like other nodes with older
>>>>>> corosync and pacemaker that work fine. The only difference is that I
>>>>>> only see files created by root, no one created by hacluster like other
>>>>>> two nodes (probably because pacemaker didn’t start correctly).
>>>>>> 
>>>>>> This is the analysis I have done so far.
>>>>>> Any suggestion?
>>>>>> 
>>>>>> 
>>>>> 
>>>>> Hmm. t seems very likely something to do with the way the container is
>>>>> set up then - and I know nothing about containers. Sorry :/
>>>>> 
>>>>> Can anyone else help here?
>>>>> 
>>>>> Chrissie
>>>>> 
>>>>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo
>>>>>>> mailto:sasadang...@gmail.com>
>>>>>>> <mailto:sasadang...@gmail.com>
>>>>>>> <mailto:sasadang...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> Yes, sorry you’re 

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi,

I have tried with:
0.16.0.real-1ubuntu4
0.16.0.real-1ubuntu5

which version should I try?

> On 26 Jun 2018, at 12:03, Christine Caulfield  wrote:
> 
> On 26/06/18 11:00, Salvatore D'angelo wrote:
>> Consider that the container is the same when corosync 2.3.5 run.
>> If it is something related to the container probably the 2.4.4
>> introduced a feature that has an impact on container.
>> Should be something related to libqb according to the code.
>> Anyone can help?
>> 
> 
> 
> Have you tried downgrading libqb to the previous version to see if it
> still happens?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:56, Christine Caulfield >> <mailto:ccaul...@redhat.com>> wrote:
>>> 
>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>> Sorry after the command:
>>>> 
>>>> corosync-quorumtool -ps
>>>> 
>>>> the error in log are still visible. Looking at the source code it seems
>>>> problem is at this line:
>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>> 
>>>> if (quorum_initialize(_handle, _callbacks, _type) != CS_OK) {
>>>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>> q_handle = 0;
>>>> goto out;
>>>> }
>>>> 
>>>> if (corosync_cfg_initialize(_handle, _callbacks) != CS_OK) {
>>>> fprintf(stderr, "Cannot initialise CFG service\n");
>>>> c_handle = 0;
>>>> goto out;
>>>> }
>>>> 
>>>> The quorum_initialize function is defined here:
>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>> 
>>>> It seems interacts with libqb to allocate space on /dev/shm but
>>>> something fails. I tried to update the libqb with apt-get install but no
>>>> success.
>>>> 
>>>> The same for second function:
>>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>> 
>>>> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
>>>> 
>>>> The folder /dev/shm has 777 permission like other nodes with older
>>>> corosync and pacemaker that work fine. The only difference is that I
>>>> only see files created by root, no one created by hacluster like other
>>>> two nodes (probably because pacemaker didn’t start correctly).
>>>> 
>>>> This is the analysis I have done so far.
>>>> Any suggestion?
>>>> 
>>>> 
>>> 
>>> Hmm. t seems very likely something to do with the way the container is
>>> set up then - and I know nothing about containers. Sorry :/
>>> 
>>> Can anyone else help here?
>>> 
>>> Chrissie
>>> 
>>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo >>>> <mailto:sasadang...@gmail.com>
>>>>> <mailto:sasadang...@gmail.com>> wrote:
>>>>> 
>>>>> Yes, sorry you’re right I could find it by myself.
>>>>> However, I did the following:
>>>>> 
>>>>> 1. Added the line you suggested to /etc/fstab
>>>>> 2. mount -o remount /dev/shm
>>>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>>> overlay  63G   11G   49G  19% /
>>>>> tmpfs64M  4.0K   64M   1% /dev
>>>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>>>> osxfs   466G  158G  305G  35% /Users
>>>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>>>> *shm 512M   15M  498M   3% /dev/shm*
>>>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>>>> tmpfs   128M 0  128M   0% /tmp
>>>>> 
>>>>> The errors in log went away. Consider that I remove the log file
>>>>> before start corosync so it does not contains lines of previous
>>>>> executions.
>>>>> 
>>>>> 
>>>>> But the command:
>>>>> corosync-quorumtool -ps
>>>>> 
>>>>> still give:
>>>>> Cannot initialize QUORUM service
>>>>> 
>>>>> Consider that few minutes before it gave me the message:
>>>>> Cannot initialize CFG service
>>>>> 
>>>>> I do not know the differences between CFG and QUORUM in this case.
>>>

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Consider that the container is the same one where corosync 2.3.5 ran.
If it is something related to the container, then probably 2.4.4 introduced a feature 
that has an impact on containers.
It should be something related to libqb, according to the code.
Can anyone help?

> On 26 Jun 2018, at 11:56, Christine Caulfield  wrote:
> 
> On 26/06/18 10:35, Salvatore D'angelo wrote:
>> Sorry after the command:
>> 
>> corosync-quorumtool -ps
>> 
>> the error in log are still visible. Looking at the source code it seems
>> problem is at this line:
>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>> 
>> if (quorum_initialize(_handle, _callbacks, _type) != CS_OK) {
>> fprintf(stderr, "Cannot initialize QUORUM service\n");
>> q_handle = 0;
>> goto out;
>> }
>> 
>> if (corosync_cfg_initialize(_handle, _callbacks) != CS_OK) {
>> fprintf(stderr, "Cannot initialise CFG service\n");
>> c_handle = 0;
>> goto out;
>> }
>> 
>> The quorum_initialize function is defined here:
>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>> 
>> It seems interacts with libqb to allocate space on /dev/shm but
>> something fails. I tried to update the libqb with apt-get install but no
>> success.
>> 
>> The same for second function:
>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>> 
>> Now I am not an expert of libqb. I have the version 0.16.0.real-1ubuntu5.
>> 
>> The folder /dev/shm has 777 permission like other nodes with older
>> corosync and pacemaker that work fine. The only difference is that I
>> only see files created by root, no one created by hacluster like other
>> two nodes (probably because pacemaker didn’t start correctly).
>> 
>> This is the analysis I have done so far.
>> Any suggestion?
>> 
>> 
> 
> Hmm. t seems very likely something to do with the way the container is
> set up then - and I know nothing about containers. Sorry :/
> 
> Can anyone else help here?
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo >> <mailto:sasadang...@gmail.com>
>>> <mailto:sasadang...@gmail.com <mailto:sasadang...@gmail.com>>> wrote:
>>> 
>>> Yes, sorry you’re right I could find it by myself.
>>> However, I did the following:
>>> 
>>> 1. Added the line you suggested to /etc/fstab
>>> 2. mount -o remount /dev/shm
>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> *shm 512M   15M  498M   3% /dev/shm*
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> tmpfs   128M 0  128M   0% /tmp
>>> 
>>> The errors in log went away. Consider that I remove the log file
>>> before start corosync so it does not contains lines of previous
>>> executions.
>>> 
>>> 
>>> But the command:
>>> corosync-quorumtool -ps
>>> 
>>> still give:
>>> Cannot initialize QUORUM service
>>> 
>>> Consider that few minutes before it gave me the message:
>>> Cannot initialize CFG service
>>> 
>>> I do not know the differences between CFG and QUORUM in this case.
>>> 
>>> If I try to start pacemaker the service is OK but I see only pacemaker
>>> and the Transport does not work if I try to run a cam command.
>>> Any suggestion?
>>> 
>>> 
>>>> On 26 Jun 2018, at 10:49, Christine Caulfield >>> <mailto:ccaul...@redhat.com>
>>>> <mailto:ccaul...@redhat.com <mailto:ccaul...@redhat.com>>> wrote:
>>>> 
>>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>>>> Hi,
>>>>> 
>>>>> Yes,
>>>>> 
>>>>> I am reproducing only the required part for test. I think the original
>>>>> system has a larger shm. The problem is that I do not know exactly how
>>>>> to change it.
>>>>> I tried the following steps, but I have the impression I didn’t
>>>>> performed the right one:
>>>>> 
>>>>> 1. remove everything under /tmp
>>>>> 2. Added the following line to /etc/fstab
>>>

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Sorry, after the command:

corosync-quorumtool -ps

the errors in the log are still visible. Looking at the source code, it seems the 
problem is at these lines:
https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c 
<https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c>

if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
	fprintf(stderr, "Cannot initialize QUORUM service\n");
	q_handle = 0;
	goto out;
}

if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
	fprintf(stderr, "Cannot initialise CFG service\n");
	c_handle = 0;
	goto out;
}

The quorum_initialize function is defined here:
https://github.com/corosync/corosync/blob/master/lib/quorum.c 
<https://github.com/corosync/corosync/blob/master/lib/quorum.c>

It seems it interacts with libqb to allocate space on /dev/shm, but something 
fails. I tried to update libqb with apt-get install, but with no success.

The same for second function:
https://github.com/corosync/corosync/blob/master/lib/cfg.c 
<https://github.com/corosync/corosync/blob/master/lib/cfg.c>

Now, I am not an expert on libqb. I have version 0.16.0.real-1ubuntu5.

The folder /dev/shm has 777 permissions, like the other nodes with older corosync and 
pacemaker that work fine. The only difference is that I only see files created by 
root, none created by hacluster as on the other two nodes (probably because 
pacemaker didn’t start correctly).

This is the analysis I have done so far.
Any suggestion?
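
In case it is useful, I can re-run the failing command and watch only the shm-related 
calls, roughly like this (a sketch):

strace -f -e trace=open,openat,mmap,ftruncate corosync-quorumtool -ps 2>&1 | grep -i -e shm -e cannot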


> On 26 Jun 2018, at 11:03, Salvatore D'angelo  wrote:
> 
> Yes, sorry you’re right I could find it by myself.
> However, I did the following:
> 
> 1. Added the line you suggested to /etc/fstab
> 2. mount -o remount /dev/shm
> 3. Now I correctly see /dev/shm of 512M with df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  63G   11G   49G  19% /
> tmpfs64M  4.0K   64M   1% /dev
> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
> osxfs   466G  158G  305G  35% /Users
> /dev/sda163G   11G   49G  19% /etc/hosts
> shm 512M   15M  498M   3% /dev/shm
> tmpfs  1000M 0 1000M   0% /sys/firmware
> tmpfs   128M 0  128M   0% /tmp
> 
> The errors in log went away. Consider that I remove the log file before start 
> corosync so it does not contains lines of previous executions.
> 
> 
> But the command:
> corosync-quorumtool -ps
> 
> still give:
> Cannot initialize QUORUM service
> 
> Consider that few minutes before it gave me the message:
> Cannot initialize CFG service
> 
> I do not know the differences between CFG and QUORUM in this case.
> 
> If I try to start pacemaker the service is OK but I see only pacemaker and 
> the Transport does not work if I try to run a cam command.
> Any suggestion?
> 
> 
>> On 26 Jun 2018, at 10:49, Christine Caulfield > <mailto:ccaul...@redhat.com>> wrote:
>> 
>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>> Hi,
>>> 
>>> Yes,
>>> 
>>> I am reproducing only the required part for test. I think the original
>>> system has a larger shm. The problem is that I do not know exactly how
>>> to change it.
>>> I tried the following steps, but I have the impression I didn’t
>>> performed the right one:
>>> 
>>> 1. remove everything under /tmp
>>> 2. Added the following line to /etc/fstab
>>> tmpfs   /tmp tmpfs   defaults,nodev,nosuid,mode=1777,size=128M 
>>> 0  0
>>> 3. mount /tmp
>>> 4. df -h
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  63G   11G   49G  19% /
>>> tmpfs64M  4.0K   64M   1% /dev
>>> tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
>>> osxfs   466G  158G  305G  35% /Users
>>> /dev/sda163G   11G   49G  19% /etc/hosts
>>> shm  64M   11M   54M  16% /dev/shm
>>> tmpfs  1000M 0 1000M   0% /sys/firmware
>>> *tmpfs   128M 0  128M   0% /tmp*
>>> 
>>> The errors are exactly the same.
>>> I have the impression that I changed the wrong parameter. Probably I
>>> have to change:
>>> shm      64M   11M   54M  16% /dev/shm
>>> 
>>> but I do not know how to do that. Any suggestion?
>>> 
>> 
>> According to google, you just add a new line to /etc/fstab for /dev/shm
>> 
>> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
>> 
>> Chrissie
>> 
>>>> On 26 Jun 2018, at 09:48, Christine Caulfi

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Yes, sorry you’re right, I could have found it by myself.
However, I did the following:

1. Added the line you suggested to /etc/fstab
2. mount -o remount /dev/shm
3. Now I correctly see /dev/shm of 512M with df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          63G   11G   49G  19% /
tmpfs            64M  4.0K   64M   1% /dev
tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
osxfs           466G  158G  305G  35% /Users
/dev/sda1        63G   11G   49G  19% /etc/hosts
shm             512M   15M  498M   3% /dev/shm
tmpfs          1000M     0 1000M   0% /sys/firmware
tmpfs           128M     0  128M   0% /tmp

The errors in the log went away. Consider that I removed the log file before starting 
corosync, so it does not contain lines from previous executions.

corosync.log
Description: Binary data
But the command:

corosync-quorumtool -ps

still gives:
Cannot initialize QUORUM service

Consider that a few minutes before it gave me the message:
Cannot initialize CFG service

I do not know the differences between CFG and QUORUM in this case.

If I try to start pacemaker the service is OK, but I see only pacemaker and the 
Transport does not work if I try to run a crm command.
Any suggestion?

> On 26 Jun 2018, at 10:49, Christine Caulfield <ccaul...@redhat.com> wrote:
> 
> On 26/06/18 09:40, Salvatore D'angelo wrote:
>> Hi,
>> Yes,
>> I am reproducing only the required part for test. I think the original
>> system has a larger shm. The problem is that I do not know exactly how
>> to change it.
>> I tried the following steps, but I have the impression I didn't
>> performed the right one:
>> 
>> 1. remove everything under /tmp
>> 2. Added the following line to /etc/fstab
>> tmpfs   /tmp         tmpfs   defaults,nodev,nosuid,mode=1777,size=128M         0  0
>> 3. mount /tmp
>> 4. df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          63G   11G   49G  19% /
>> tmpfs            64M  4.0K   64M   1% /dev
>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>> osxfs           466G  158G  305G  35% /Users
>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>> shm              64M   11M   54M  16% /dev/shm
>> tmpfs          1000M     0 1000M   0% /sys/firmware
>> *tmpfs           128M     0  128M   0% /tmp*
>> 
>> The errors are exactly the same.
>> I have the impression that I changed the wrong parameter. Probably I
>> have to change:
>> shm              64M   11M   54M  16% /dev/shm
>> 
>> but I do not know how to do that. Any suggestion?
> 
> According to google, you just add a new line to /etc/fstab for /dev/shm
> 
> tmpfs  /dev/shm  tmpfs   defaults,size=512m   0   0
> 
> Chrissie
> 
>>> On 26 Jun 2018, at 09:48, Christine Caulfield <ccaul...@redhat.com<mailto:ccaul...@redhat.com>> wrote:
>>> 
>>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>>> Hi,
>>>> Let me add here one important detail. I use Docker for my test with 5
>>>> containers deployed on my Mac.
>>>> Basically the team that worked on this project installed the cluster
>>>> on soft layer bare metal.
>>>> The PostgreSQL cluster was hard to test and if a misconfiguration
>>>> occurred recreate the cluster from scratch is not easy.
>>>> Test it was a cumbersome if you consider that we access to the
>>>> machines with a complex system hard to describe here.
>>>> For this reason I ported the cluster on Docker for test purpose. I am
>>>> not interested to have it working for months, I just need a proof of
>>>> concept.
>>>> 
>>>> When the migration works I'll port everything on bare metal where the
>>>> size of resources are ambundant.
>>>> 
>>>> Now I have enough RAM and disk space on my Mac so if you tell me what
>>>> should be an acceptable size for several days of running it is ok for me.
>>>> It is ok also have commands to clean the shm when required.
>>>> I know I can find them on Google but if you can suggest me these info
>>>> I'll appreciate. I have OS knowledge to do that but I would like to
>>>> avoid days of guesswork and try and error if possible.
>>> 
>>> I would recommend at least 128MB of space on /dev/shm, 256MB if you can
>>> spare it. My 'standard' system uses 75MB under normal running allowing
>>> for one command-line query to run.
>>> 
>>> If I read this right then you're reproducing a bare-metal system in
>>> containers now? so the original systems will have a default /dev/shm
>>> size which is probably much larger than your containers?
>>> 
>>> I'm just checking here that we don't have a regression in memory usage
>>> as Poki suggested.
>>> 
>>> Chrissie
>>> 
>>>>> On 25 Jun 2018, at 21:18, Jan Pokorný <jpoko...@redhat.com<mailto:jpoko...@redhat.com>> wrote:
>>>>> 
>>>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>>>>> Thanks for reply. I scratched my cluster and created it again and
>>>>>> then migrated as before. This time I uninstalled pacemaker,
>>>>>> corosync, crmsh and resource agents with make uninstall
>>>>>> then I installed new packages. The problem is the same, when
>>>>>> I launch:
>>>>>> corosync-quorumtool -ps
>>>>>> 
>>>>>> I got: Cannot initialize QUORUM service
>>>>>> 
>>>>>> Here the log with debug enabled:
>>>>>> [18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap
>>>>>> on /dev/shm/qb-cfg-event-18020-18028-23-data
>>>>>> [18019] pg3 corosyncerror   [QB    ]
>>>>>> qb_rb_open:cfg-event-18020-18028-23: Resource temporarily
>>>>>> unavailable (11)
>>>>>> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer:
>>>>>> /dev/shm/qb-cfg-request-18020-18028-23-header
>>>>>> [

Re: [ClusterLabs] Upgrade corosync problem

2018-06-26 Thread Salvatore D'angelo
Hi,

Yes,

I am reproducing only the required part for test. I think the original system 
has a larger shm. The problem is that I do not know exactly how to change it.
I tried the following steps, but I have the impression I didn’t perform the 
right one:

1. remove everything under /tmp
2. Added the following line to /etc/fstab
tmpfs   /tmp tmpfs   defaults,nodev,nosuid,mode=1777,size=128M  
0  0
3. mount /tmp
4. df -h
Filesystem  Size  Used Avail Use% Mounted on
overlay  63G   11G   49G  19% /
tmpfs64M  4.0K   64M   1% /dev
tmpfs  1000M 0 1000M   0% /sys/fs/cgroup
osxfs   466G  158G  305G  35% /Users
/dev/sda163G   11G   49G  19% /etc/hosts
shm  64M   11M   54M  16% /dev/shm
tmpfs  1000M 0 1000M   0% /sys/firmware
tmpfs   128M 0  128M   0% /tmp

The errors are exactly the same.
I have the impression that I changed the wrong parameter. Probably I have to 
change:
shm  64M   11M   54M  16% /dev/shm

but I do not know how to do that. Any suggestion?

> On 26 Jun 2018, at 09:48, Christine Caulfield  wrote:
> 
> On 25/06/18 20:41, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Let me add here one important detail. I use Docker for my test with 5 
>> containers deployed on my Mac.
>> Basically the team that worked on this project installed the cluster on soft 
>> layer bare metal.
>> The PostgreSQL cluster was hard to test and if a misconfiguration occurred 
>> recreate the cluster from scratch is not easy.
>> Test it was a cumbersome if you consider that we access to the machines with 
>> a complex system hard to describe here.
>> For this reason I ported the cluster on Docker for test purpose. I am not 
>> interested to have it working for months, I just need a proof of concept. 
>> 
>> When the migration works I’ll port everything on bare metal where the size 
>> of resources are ambundant.  
>> 
>> Now I have enough RAM and disk space on my Mac so if you tell me what should 
>> be an acceptable size for several days of running it is ok for me.
>> It is ok also have commands to clean the shm when required.
>> I know I can find them on Google but if you can suggest me these info I’ll 
>> appreciate. I have OS knowledge to do that but I would like to avoid days of 
>> guesswork and try and error if possible.
> 
> 
> I would recommend at least 128MB of space on /dev/shm, 256MB if you can
> spare it. My 'standard' system uses 75MB under normal running allowing
> for one command-line query to run.
> 
> If I read this right then you're reproducing a bare-metal system in
> containers now? so the original systems will have a default /dev/shm
> size which is probably much larger than your containers?
> 
> I'm just checking here that we don't have a regression in memory usage
> as Poki suggested.
> 
> Chrissie
> 
>>> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
>>> 
>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>>> Thanks for reply. I scratched my cluster and created it again and
>>>> then migrated as before. This time I uninstalled pacemaker,
>>>> corosync, crmsh and resource agents with make uninstall
>>>> 
>>>> then I installed new packages. The problem is the same, when
>>>> I launch:
>>>> corosync-quorumtool -ps
>>>> 
>>>> I got: Cannot initialize QUORUM service
>>>> 
>>>> Here the log with debug enabled:
>>>> 
>>>> 
>>>> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
>>>> /dev/shm/qb-cfg-event-18020-18028-23-data
>>>> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
>>>> Resource temporarily unavailable (11)
>>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>>> /dev/shm/qb-cfg-request-18020-18028-23-header
>>>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>>>> /dev/shm/qb-cfg-response-18020-18028-23-header
>>>> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
>>>> temporarily unavailable (11)
>>>> [18019] pg3 corosyncerror   [QB] Error in connection setup 
>>>> (18020-18028-23): Resource temporarily unavailable (11)
>>>> 
>>>> I tried to check /dev/shm and I am not sure these are the right
>>>> commands, however:
>>>> 
>>>> df -h /dev/shm
>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>> shm  64M   16M   49M  24% /dev/shm
>>>> 
>>>> ls /dev/shm
>>>

Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Salvatore D'angelo
Hi,

Let me add here one important detail. I use Docker for my test with 5 
containers deployed on my Mac.
Basically the team that worked on this project installed the cluster on soft 
layer bare metal.
The PostgreSQL cluster was hard to test, and if a misconfiguration occurred, 
recreating the cluster from scratch was not easy.
Testing it was cumbersome, if you consider that we access the machines with a 
complex system that is hard to describe here.
For this reason I ported the cluster to Docker for test purposes. I am not 
interested in having it working for months, I just need a proof of concept. 

When the migration works I’ll port everything to bare metal, where resources are 
abundant.  

Now I have enough RAM and disk space on my Mac, so if you tell me what should be 
an acceptable size for several days of running, that is OK for me.
It is also OK to have commands to clean the shm when required.
I know I can find them on Google, but if you can suggest this info I’ll 
appreciate it. I have the OS knowledge to do that, but I would like to avoid days of 
guesswork and trial and error if possible.
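
To be concrete, the kind of cleanup I have in mind is something like the following, 
run only while corosync and pacemaker are stopped on the node (the qb-* pattern is 
simply what libqb creates under /dev/shm here):

ls -lh /dev/shm/
rm -f /dev/shm/qb-*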

> On 25 Jun 2018, at 21:18, Jan Pokorný  wrote:
> 
> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>> Thanks for reply. I scratched my cluster and created it again and
>> then migrated as before. This time I uninstalled pacemaker,
>> corosync, crmsh and resource agents with make uninstall
>> 
>> then I installed new packages. The problem is the same, when
>> I launch:
>> corosync-quorumtool -ps
>> 
>> I got: Cannot initialize QUORUM service
>> 
>> Here the log with debug enabled:
>> 
>> 
>> [18019] pg3 corosyncerror   [QB] couldn't create circular mmap on 
>> /dev/shm/qb-cfg-event-18020-18028-23-data
>> [18019] pg3 corosyncerror   [QB] qb_rb_open:cfg-event-18020-18028-23: 
>> Resource temporarily unavailable (11)
>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>> /dev/shm/qb-cfg-request-18020-18028-23-header
>> [18019] pg3 corosyncdebug   [QB] Free'ing ringbuffer: 
>> /dev/shm/qb-cfg-response-18020-18028-23-header
>> [18019] pg3 corosyncerror   [QB] shm connection FAILED: Resource 
>> temporarily unavailable (11)
>> [18019] pg3 corosyncerror   [QB] Error in connection setup 
>> (18020-18028-23): Resource temporarily unavailable (11)
>> 
>> I tried to check /dev/shm and I am not sure these are the right
>> commands, however:
>> 
>> df -h /dev/shm
>> Filesystem  Size  Used Avail Use% Mounted on
>> shm  64M   16M   49M  24% /dev/shm
>> 
>> ls /dev/shm
>> qb-cmap-request-18020-18036-25-dataqb-corosync-blackbox-data
>> qb-quorum-request-18020-18095-32-data
>> qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  
>> qb-quorum-request-18020-18095-32-header
>> 
>> Is 64 Mb enough for /dev/shm. If no, why it worked with previous
>> corosync release?
> 
> For a start, can you try configuring corosync with
> --enable-small-memory-footprint switch?
> 
> Hard to say why the space provisioned to /dev/shm is the direct
> opposite of generous (per today's standards), but may be the result
> of automatic HW adaptation, and if RAM is so scarce in your case,
> the above build-time toggle might help.
> 
> If not, then exponentially increasing size of /dev/shm space is
> likely your best bet (I don't recommended fiddling with mlockall()
> and similar measures in corosync).
> 
> Of course, feel free to raise a regression if you have a reproducible
> comparison between two corosync (plus possibly different libraries
> like libqb) versions, one that works and one that won't, in
> reproducible conditions (like this small /dev/shm, VM image, etc.).
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Upgrade corosync problem

2018-06-25 Thread Salvatore D'angelo
Hi,

Thanks for the reply. I scratched my cluster and created it again and then migrated as 
before. This time I uninstalled pacemaker, corosync, crmsh and resource agents with

make uninstall

then I installed the new packages. The problem is the same, when I launch:

corosync-quorumtool -ps

I got: Cannot initialize QUORUM service

Here the log with debug enabled:

corosync.log
Description: Binary data
[18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
[18019] pg3 corosyncerror   [QB    ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
[18019] pg3 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[18019] pg3 corosyncerror   [QB    ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)

I tried to check /dev/shm and I am not sure these are the right commands, however:

df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M   16M   49M  24% /dev/shm

ls /dev/shm
qb-cmap-request-18020-18036-25-data    qb-corosync-blackbox-data    qb-quorum-request-18020-18095-32-data
qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  qb-quorum-request-18020-18095-32-header

Is 64 Mb enough for /dev/shm. If no, why it worked with previous corosync release?

> On 25 Jun 2018, at 09:09, Christine Caulfield <ccaul...@redhat.com> wrote:
> 
> On 22/06/18 11:23, Salvatore D'angelo wrote:
>> Hi,
>> Here the log:
>> [17323] pg1 corosyncerror   [QB    ] couldn't create circular mmap on
>> /dev/shm/qb-cfg-event-17324-17334-23-data
>> [17323] pg1 corosyncerror   [QB    ]
>> qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
>> [17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer:
>> /dev/shm/qb-cfg-request-17324-17334-23-header
>> [17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer:
>> /dev/shm/qb-cfg-response-17324-17334-23-header
>> [17323] pg1 corosyncerror   [QB    ] shm connection FAILED: Resource
>> temporarily unavailable (11)
>> [17323] pg1 corosyncerror   [QB    ] Error in connection setup
>> (17324-17334-23): Resource temporarily unavailable (11)
>> [17323] pg1 corosyncdebug   [QB    ] qb_ipcs_disconnect(17324-17334-23)
>> state:0
> 
> is /dev/shm full?
> 
> Chrissie
> 
>>> On 22 Jun 2018, at 12:10, Christine Caulfield <ccaul...@redhat.com> wrote:
>>> 
>>> On 22/06/18 10:39, Salvatore D'angelo wrote:
>>>> Hi,
>>>> 
>>>> Can you tell me exactly which log you need. I'll provide you as soon as possible.
>>>> 
>>>> Regarding some settings, I am not the original author of this cluster. People
>>>> created it left the company I am working with and I inerithed the code and
>>>> sometime I do not know why some settings are used.
>>>> The old versions of pacemaker, corosync,  crash and resource agents were
>>>> compiled and installed.
>>>> I simply downloaded the new versions compiled and installed them. I didn't get
>>>> any compliant during ./configure that usually checks for library compatibility.
>>>> 
>>>> To be honest I do not know if this is the right approach. Should I "make
>>>> unistall" old versions before installing the new one?
>>>> Which is the suggested approach?
>>>> Thank in advance for your help.
>>> 
>>> OK fair enough!
>>> 
>>> To be honest the best approach is almost always to get the latest
>>> packages from the distributor rather than compile from source. That way
>>> you can be more sure that upgrades will be more smoothly. Though, to be
>>> honest, I'm not sure how good the Ubuntu packages are (they might be
>>> great, they might not, I genuinely don't know)
>>> 
>>> When building from source and if you don't know the provenance of the
>>> previous version then I would recommend a 'make uninstall' first - or
>>> removal of the packages if that's where they came from.
>>> 
>>> One thing you should do is make sure that all the cluster nodes are
>>> running the same version. If some are running older versions then nodes
>>> could drop out for obscure reasons. We try and keep minor versions
>>> on-wire compatible but it's always best to be cautious.
>>> 
>>> The tidying of your corosync.conf wan wait for the moment, lets get
>>> things mostly working first. If you enable debug logging in corosync.conf:
>>> 
>>> logging {
>>>     to_syslog: yes
>>>     debug: on
>>> }
>>> 
>>> Then see what happens and post the syslog file that has all of the
>>> corosync messages in it, we'll take it from there.
>>> 
>>> Chrissie
>>> 
>>>>> On 22 Jun 2018, at 11:30, Christine Caulfield <ccaul...@redhat.com> wrote:
>>>>> 
>>>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>>>> Hi Christine,
>>>>>> 
>>>>>> Thanks for reply. Let me add few details. When I run the corosync
>>>>>> service I se the corosync process running. If I stop it and run:
>>>>>> 
>>>>>> corosync -f 
>>>>>> 
>>>>>> I see three warnings:
>>>>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>>>>> nodelist. Nodelist one is going to be used.
>>>>>> warning [MAIN  ] Please migrate config file to nodelist.
>>>>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>>>>> permitted (1)
>>>>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>>>> 
>>>>>> but I see node

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Salvatore D'angelo
Hi,
Here the log:


corosync.log
Description: Binary data


> On 22 Jun 2018, at 12:10, Christine Caulfield  wrote:
> 
> On 22/06/18 10:39, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Can you tell me exactly which log you need. I’ll provide you as soon as 
>> possible.
>> 
>> Regarding some settings, I am not the original author of this cluster. 
>> People created it left the company I am working with and I inerithed the 
>> code and sometime I do not know why some settings are used.
>> The old versions of pacemaker, corosync,  crash and resource agents were 
>> compiled and installed.
>> I simply downloaded the new versions compiled and installed them. I didn’t 
>> get any compliant during ./configure that usually checks for library 
>> compatibility.
>> 
>> To be honest I do not know if this is the right approach. Should I “make 
>> unistall" old versions before installing the new one?
>> Which is the suggested approach?
>> Thank in advance for your help.
>> 
> 
> OK fair enough!
> 
> To be honest the best approach is almost always to get the latest
> packages from the distributor rather than compile from source. That way
> you can be more sure that upgrades will go more smoothly. Though, to be
> honest, I'm not sure how good the Ubuntu packages are (they might be
> great, they might not, I genuinely don't know)
> 
> When building from source and if you don't know the provenance of the
> previous version then I would recommend a 'make uninstall' first - or
> removal of the packages if that's where they came from.
> 
> One thing you should do is make sure that all the cluster nodes are
> running the same version. If some are running older versions then nodes
> could drop out for obscure reasons. We try and keep minor versions
> on-wire compatible but it's always best to be cautious.
> 
> The tidying of your corosync.conf can wait for the moment, let's get
> things mostly working first. If you enable debug logging in corosync.conf:
> 
> logging {
>to_syslog: yes
>   debug: on
> }
> 
> Then see what happens and post the syslog file that has all of the
> corosync messages in it, we'll take it from there.
> 
> Chrissie
> 
>>> On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:
>>> 
>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>> Hi Christine,
>>>> 
>>>> Thanks for the reply. Let me add a few details. When I run the corosync
>>>> service I see the corosync process running. If I stop it and run:
>>>> 
>>>> corosync -f 
>>>> 
>>>> I see three warnings:
>>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>>> nodelist. Nodelist one is going to be used.
>>>> warning [MAIN  ] Please migrate config file to nodelist.
>>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>>> permitted (1)
>>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>> 
>>>> but I see the node joined.
>>>> 
>>> 
>>> Those certainly need fixing but are probably not the cause. Also why do
>>> you have these values below set?
>>> 
>>> max_network_delay: 100
>>> retransmits_before_loss_const: 25
>>> window_size: 150
>>> 
>>> I'm not saying they are causing the trouble, but they aren't going to
>>> help keep a stable cluster.
>>> 
>>> Without more logs (full logs are always better than just the bits you
>>> think are meaningful) I still can't be sure. It could easily be just
>>> that you've overwritten a packaged version of corosync with your own
>>> compiled one and they have different configure options or that the
>>> libraries now don't match.
>>> 
>>> Chrissie
>>> 
>>> 
>>>> My corosync.conf file is below.
>>>> 
>>>> With service corosync up and running I have the following output:
>>>> *corosync-cfgtool -s*
>>>> Printing ring status.
>>>> Local node ID 1
>>>> RING ID 0
>>>> id  = 10.0.0.11
>>>> status  = ring 0 active with no faults
>>>> RING ID 1
>>>> id  = 192.168.0.11
>>>> status  = ring 1 active with no faults
>>>> 
>>>> *corosync-cmapctl  | grep members*
>>>> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
>>>> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>>>> ip(192.168.0.11) 
>>>> runtime.totem

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Salvatore D'angelo
Hi,

Can you tell me exactly which log you need? I’ll provide it as soon as 
possible.

Regarding some settings, I am not the original author of this cluster. The people 
who created it left the company I am working with, and I inherited the code; 
sometimes I do not know why some settings are used.
The old versions of pacemaker, corosync, crmsh and resource agents were 
compiled and installed.
I simply downloaded the new versions, compiled and installed them. I didn’t get 
any complaint from ./configure, which usually checks for library compatibility.

To be honest I do not know if this is the right approach. Should I “make 
uninstall" the old versions before installing the new ones?
Which is the suggested approach?
Thanks in advance for your help.
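
A minimal sketch of the “make uninstall” approach discussed in the replies in
this thread, assuming the source trees of both the old and the new versions are
still available and that the default /usr/local prefix was used; the paths and
version numbers below are examples only:

cd ~/src/corosync-2.3.5   && sudo make uninstall    # remove the files the old build installed
cd ~/src/pacemaker-1.1.14 && sudo make uninstall
cd ~/src/corosync-2.4.4   && ./configure && make && sudo make install   # then install the new version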

> On 22 Jun 2018, at 11:30, Christine Caulfield  wrote:
> 
> On 22/06/18 10:14, Salvatore D'angelo wrote:
>> Hi Christine,
>> 
>> Thanks for the reply. Let me add a few details. When I run the corosync
>> service I see the corosync process running. If I stop it and run:
>> 
>> corosync -f 
>> 
>> I see three warnings:
>> warning [MAIN  ] interface section bindnetaddr is used together with
>> nodelist. Nodelist one is going to be used.
>> warning [MAIN  ] Please migrate config file to nodelist.
>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>> permitted (1)
>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>> 
>> but I see the node joined.
>> 
> 
> Those certainly need fixing but are probably not the cause. Also why do
> you have these values below set?
> 
> max_network_delay: 100
> retransmits_before_loss_const: 25
> window_size: 150
> 
> I'm not saying they are causing the trouble, but they aren't going to
> help keep a stable cluster.
> 
> Without more logs (full logs are always better than just the bits you
> think are meaningful) I still can't be sure. It could easily be just
> that you've overwritten a packaged version of corosync with your own
> compiled one and they have different configure options or that the
> libraries now don't match.
> 
> Chrissie
> 
> 
>> My corosync.conf file is below.
>> 
>> With service corosync up and running I have the following output:
>> *corosync-cfgtool -s*
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>> id  = 10.0.0.11
>> status  = ring 0 active with no faults
>> RING ID 1
>> id  = 192.168.0.11
>> status  = ring 1 active with no faults
>> 
>> *corosync-cmapctl  | grep members*
>> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>> ip(192.168.0.11) 
>> runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
>> runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
>> ip(192.168.0.12) 
>> runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
>> 
>> For the moment I have two nodes in my cluster (the third node has some
>> issues and at the moment I did crm node standby on it).
>> 
>> Here are the dependencies I have installed for corosync (they work fine with
>> pacemaker 1.1.14 and corosync 2.3.5):
>>  libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>  libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>  libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>  libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>  libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>  libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>  libqb0_0.16.0.real-1ubuntu4_amd64.deb
>> 
>> *corosync.conf*
>> -
>> quorum {
>> provider: corosync_votequorum
>> expected_votes: 3
>> }
>> totem {
>> version: 2
>> crypto_cipher: none
>> crypto_hash: none
>> rrp_mode: passive
>> interface {
>> ringnumber: 0
>> bindnetaddr: 10.0.0.0
>> mcastport: 5405
>> ttl: 1
>> }
>> interface {
>> ringnumber: 1
>> bindnetaddr: 192.168.0.0
>> mcastport: 5405
>> ttl: 1
>> }
>> transport: udpu
>> max_network_delay: 100
>> retransmits_before_loss_const: 25
>> window_size: 150
>> }
>> nodelist {
>>

Re: [ClusterLabs] Upgrade corosync problem

2018-06-22 Thread Salvatore D'angelo
Hi Christine,

Thanks for the reply. Let me add a few details. When I run the corosync service I see 
the corosync process running. If I stop it and run:

corosync -f 

I see three warnings:
warning [MAIN  ] interface section bindnetaddr is used together with nodelist. 
Nodelist one is going to be used.
warning [MAIN  ] Please migrate config file to nodelist.
warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not permitted 
(1)
warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)

but I see the node joined.

My corosync.conf file is below.

With service corosync up and running I have the following output:
corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id  = 10.0.0.11
status  = ring 0 active with no faults
RING ID 1
id  = 192.168.0.11
status  = ring 1 active with no faults

corosync-cmapctl  | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) 
ip(192.168.0.11) 
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) 
ip(192.168.0.12) 
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined

For the moment I have two nodes in my cluster (the third node has some issues and 
at the moment I did crm node standby on it).

Here are the dependencies I have installed for corosync (they work fine with 
pacemaker 1.1.14 and corosync 2.3.5):
 libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
 libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
 libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
 libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
 libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
 libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
 libqb0_0.16.0.real-1ubuntu4_amd64.deb
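
As a side note, a hedged way to double-check which corosync binary is actually
being run and which libqb it links against after an in-place source install,
useful when a packaged copy and a /usr/local copy may coexist (nothing below is
taken from this thread):

which -a corosync                     # more than one hit suggests package and source installs coexist
corosync -v                           # version string of the binary found first on PATH
ldd "$(which corosync)" | grep libqb  # the libqb the binary resolves at run time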

corosync.conf
-
quorum {
provider: corosync_votequorum
expected_votes: 3
}
totem {
version: 2
crypto_cipher: none
crypto_hash: none
rrp_mode: passive
interface {
ringnumber: 0
bindnetaddr: 10.0.0.0
mcastport: 5405
ttl: 1
}
interface {
ringnumber: 1
bindnetaddr: 192.168.0.0
mcastport: 5405
ttl: 1
}
transport: udpu
max_network_delay: 100
retransmits_before_loss_const: 25
window_size: 150
}
nodelist {
node {
ring0_addr: pg1
ring1_addr: pg1p
nodeid: 1
}
node {
ring0_addr: pg2
ring1_addr: pg2p
nodeid: 2
}
node {
ring0_addr: pg3
ring1_addr: pg3p
nodeid: 3
}
}
logging {
to_syslog: yes
}




> On 22 Jun 2018, at 09:24, Christine Caulfield  wrote:
> 
> On 21/06/18 16:16, Salvatore D'angelo wrote:
>> Hi,
>> 
>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>> Pacemaker 1.1.14 -> 1.1.18
>> Corosync 2.3.5 -> 2.4.4
>> Crmsh 2.2.0 -> 3.0.1
>> Resource agents 3.9.7 -> 4.1.1
>> 
>> I started on a first node  (I am trying one node at a time upgrade).
>> On a PostgreSQL slave node  I did:
>> 
>> *crm node standby *
>> *service pacemaker stop*
>> *service corosync stop*
>> 
>> Then I built the tools above as described on their GitHub.com pages. 
>> 
>> *./autogen.sh (where required)*
>> *./configure*
>> *make (where required)*
>> *make install*
>> 
>> Everything went ok. I expected the new files to overwrite the old ones. I left the
>> dependencies I had with the old software because I noticed that ./configure
>> didn’t complain. 
>> I started corosync.
>> 
>> *service corosync start*
>> 
>> To verify corosync works properly I used the following commands:
>> *corosync-cfgtool -s*
>> *corosync-cmapctl | grep members*
>> 
>> Everything seemed ok and I verified my node joined the cluster (at least
>> this is my impression).
>> 
>> Here I ran into a problem. Running the command:
>> corosync-quorumtool -ps
>> 
>> I got the following problem:
>> Cannot initialise CFG service
>> 
> That says that corosync is not running. Have a look in the log files to
> see why it stopped. The pacemaker logs below are showing the same thing,
> but we can't make any more guesses until we see what corosync itself is
> doing. Enabling debug in corosync.conf w

[ClusterLabs] Upgrade corosync problem

2018-06-21 Thread Salvatore D'angelo
Hi,

I upgraded my PostgreSQL/Pacemaker cluster with these versions.
Pacemaker 1.1.14 -> 1.1.18
Corosync 2.3.5 -> 2.4.4
Crmsh 2.2.0 -> 3.0.1
Resource agents 3.9.7 -> 4.1.1

I started on a first node  (I am trying one node at a time upgrade).
On a PostgreSQL slave node  I did:

crm node standby 
service pacemaker stop
service corosync stop

Then I built the tools above as described on their GitHub.com pages. 

./autogen.sh (where required)
./configure
make (where required)
make install

Everything went ok. I expected the new files to overwrite the old ones. I left the 
dependencies I had with the old software because I noticed that ./configure didn’t complain. 
I started corosync.

service corosync start

To verify corosync works properly I used the following commands:
corosync-cfgtool -s
corosync-cmapctl | grep members

Everything seemed ok and I verified my node joined the cluster (at least this 
is my impression).

Here I ran into a problem. Running the command:
corosync-quorumtool -ps

I got the following problem:
Cannot initialise CFG service

If I try to start pacemaker, I only see pacemaker process running and 
pacemaker.log containing the following lines:

Jun 21 15:09:38 [17115] pg1 pacemakerd: info: crm_log_init: Changed active 
directory to /var/lib/pacemaker/cores
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: get_cluster_type: 
Detected an active 'corosync' cluster
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: mcp_read_config:  Reading 
configure for stack: corosync
Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main: Starting Pacemaker 
1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios  
corosync-native atomic-attrd acls
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: main: Maximum core file size 
is: 18446744073709551615
Jun 21 15:09:38 [17115] pg1 pacemakerd: info: qb_ipcs_us_publish:   server 
name: pacemakerd
Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning: corosync_node_name:   Could 
not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: corosync_node_name:   Unable 
to get node name for nodeid 1
Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could 
not obtain a node name for corosync nodeid 1
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Created entry 
1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Node 1 has uuid 
1
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_update_peer_proc: 
cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
Jun 21 15:09:53 [17115] pg1 pacemakerd:error: cluster_connect_quorum:   
Could not connect to the Quorum API: 2
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: qb_ipcs_us_withdraw:  
withdrawing server sockets
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: main: Exiting pacemakerd
Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_xml_cleanup:  
Cleaning up memory from libxml2

What is wrong in my procedure?
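
For reference, a hedged sketch of the kind of checks suggested in the replies to
this thread when corosync appears to die right after being started; log locations
vary by distribution and are assumptions here:

corosync-cfgtool -s                                # fails if corosync is not actually running
corosync-quorumtool -ps                            # "Cannot initialise CFG service" usually means the same
grep -i corosync /var/log/syslog | tail -n 100     # recent corosync messages (default Ubuntu syslog path)
journalctl -u corosync --since "10 minutes ago"    # on systemd-based hosts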



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Resource agents differences from 1.1.14 and 1.1.18

2018-06-21 Thread Salvatore D'angelo
Hi, thanks for the reply

> On 21 Jun 2018, at 15:09, Jan Pokorný  wrote:
> 
> Hello Salvatore,
> 
> On 21/06/18 12:44 +0200, Salvatore D'angelo wrote:
>> I am trying to upgrade my PostgreSQL cluster managed by pacemaker
>> to pacemaker 1.1.18 or 2.0.0.  I have some resource agents that I
>> patched to have them working with my cluster.
>> 
>> Can someone tell me if something has changed in the OCF interface
>> between the 1.1.14 release and 1.1.18/2.0.0?
> 
> You can consider the OCF specification/interface stable and no
> breakages are really imminent.

Good to know

>  There are admittedly some parts with
> less than well-defined semantics (if it's defined at all; for instance,
> questions on what's the proper interpretation of "unique" slash
> reloadable parameters was raised in the past [1,2]).
> 
> This stability is moreover enforced with the requirement of cross
> compatibility between various OCF conformant agent vs. resource
> manager implementations (say those maintained in resource-agents
> project vs. pacemaker, plus various versions thereof, without any
> a priori defined ways of how to negotiate any further interface
> specifics, but see [3], for instance).
> 
>> I am using the following resource agents:
>> 
>> /usr/lib/ocf/resource.d/heartbeat/Filesystem
>> /usr/lib/ocf/resource.d/heartbeat/ethmonitor
>> /usr/lib/ocf/resource.d/heartbeat/pgsql (patched)
> 
> ^ this is really contained within resource-agents project, and as
>  mentioned, nothing pushes you to update this piece of software
>  even if you intend to update pacemaker (granted, keeping step
>  with overall evolutionary "time snapshots" is always wise)
> 
>> /usr/lib/ocf/resource.d/pacemaker/HealthCPU (patched)
>> /usr/lib/ocf/resource.d/pacemaker/ping (patched)
>> /usr/lib/ocf/resource.d/pacemaker/SysInfo (patched)
> 
> ^ and these are from pacemaker's realms, so there's naturally
>  a closer coupling possibly beyond what standard mandates, but
>  again, OCF forms a "fixed point", basis upon which the graph
>  connecting the functionality user(s) and providers is formed,
>  so presumably you can mix and match various versions even if
>  the bits come from the very same project
> 
>> I am doing some tests to verify this but I would like to know if
>> there is something at a high level I should be aware of.
> 
> Nothing comes to my mind, though you are always best served with
> your own investigation (since you are modifying the agents anyway).
> 
> As a rule of thumb, I'd start with checking the changelogs of the
> mentioned projects, and deeper concerns can ultimately be resolved
> with the review of cross-version changes on the source code level,
> e.g.:
> 
>  git clone https://github.com/ClusterLabs/resource-agents.git
>  pushd resource-agents
>  # let's say you start with agents from v3.9.7 release
>  git diff v3.9.7 v4.1.1 -- heartbeat/{Filesystem,ethmonitor,pgsql}
>  popd
> 
>  git clone https://github.com/ClusterLabs/pacemaker.git
>  pushd pacemaker
>  git diff Pacemaker-1.1.14 Pacemaker-1.1.18 -- \
>  extra/resources/{HealthCPU,SysInfo,ping}
>  popd
> 
> It's more like showing how to fish than serving you a meal,
> but hopefully this helps regardless (perhaps even more than
> latter would do).
> 

Yes, that’s exactly what I did. I just double checked.
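
In the same spirit, a hedged one-liner for comparing a locally patched agent with
the copy shipped in the freshly cloned resource-agents tree from the commands
above (both paths are assumptions):

diff -u /usr/lib/ocf/resource.d/heartbeat/pgsql resource-agents/heartbeat/pgsql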

> 
> [1] https://lists.clusterlabs.org/pipermail/users/2016-June/010635.html
> [2] https://lists.clusterlabs.org/pipermail/users/2017-September/013743.html
> [3] https://github.com/ClusterLabs/OCF-spec/issues/17
> 
> -- 
> Jan (Poki)
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Resource agents differences from 1.1.14 and 1.1.18

2018-06-21 Thread Salvatore D'angelo
Hi all,

I am trying to upgrade my PostgreSQL cluster managed by pacemaker to pacemaker 
1.1.18 or 2.0.0.
I have some resource agents that I patched to have them working with my cluster.

Can someone tell me if something has changed in the OCF interface between the 
1.1.14 release and 1.1.18/2.0.0?
I am using the following resource agents:

/usr/lib/ocf/resource.d/heartbeat/Filesystem
/usr/lib/ocf/resource.d/heartbeat/ethmonitor
/usr/lib/ocf/resource.d/heartbeat/pgsql (patched)
/usr/lib/ocf/resource.d/pacemaker/HealthCPU (patched)
/usr/lib/ocf/resource.d/pacemaker/ping (patched)
/usr/lib/ocf/resource.d/pacemaker/SysInfo (patched)

I am doing some tests to verify this but I would like to know if there is 
something at a high level I should be aware of.
Thanks in advance for your help.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] How to declare ping primitive with rule

2018-06-08 Thread Salvatore D'angelo
Hi All,

I have a PostgreSQL cluster on three nodes (Master/Sync/Async) with WAL files 
stored on two GlusterFS nodes. In total, 5 machines.
Let's call the first three machines: pg1, pg2, pg3. The other two (without 
pacemaker): pgalog1, pgalog2.

Now, this code works fine on some bare metal machines. I was able to port it 
to Docker (because this simplifies tests and allows us to experiment). So far so 
good.

I wasn’t the original author of the code. Now we have some scripts to create 
this cluster that work fine on bare metal, as said before. In particular, I 
have this piece of code:

cat - 

Re: [ClusterLabs] Pacemaker PostgreSQL cluster

2018-05-30 Thread Salvatore D'angelo
Hi,

Last question. In order to migrate pacemaker with minimum downtime, the options I 
see are Rolling (node by node) and Disconnect and Reattach:
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ap-upgrade.html
 
<http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ap-upgrade.html>

What I want to do is first migrate pacemaker manually and then automate it with 
some scripts.

According to what Ken Gaillot said:

"Rolling upgrades are always supported within the same major number line
(i.e. 1.anything to 1.anything). With the major number change, rolling
upgrades will not always be supported. In the case of 2.0.0, we are
supporting rolling upgrades from 1.1.11 or later on top of corosync 2
or later. You should be fine whichever you choose.”

if I implement a set of scripts based on Rolling upgrade to migrate from 1.1.14 
to 1.1.18/2.0.0, the risk is that a future upgrade with a major number change 
would force me to rewrite my automation scripts to support another type of 
migration (probably Detach and Reattach). My question is: if I want to avoid this 
extra work in the future, is the Detach and Reattach procedure more adaptable to 
any version migration? 
My understanding is that with this procedure PostgreSQL will always be up and 
running and I only need to detach pacemaker on the three nodes, migrate them, and 
then reattach. During this period, what happens if the PostgreSQL master goes down?
Thanks again for the support.
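
Purely as an illustration of the Disconnect and Reattach idea, a rough sketch
using crmsh; this is an assumption-laden outline, not a validated procedure:

crm configure property maintenance-mode=true      # resources keep running, pacemaker stops managing them
service pacemaker stop && service corosync stop   # on each node, then upgrade the software
service corosync start && service pacemaker start
crm_mon -1                                        # check that the running resources were re-detected
crm configure property maintenance-mode=false     # hand control back to the cluster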

> On 30 May 2018, at 04:04, Ken Gaillot  wrote:
> 
> On Tue, 2018-05-29 at 22:25 +0200, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Regarding last question about pacemaker dependencies for Ubuntu I
>> found this for 1.1.18:
>> https://launchpad.net/ubuntu/+source/pacemaker/1.1.18-0ubuntu2/+build 
>> <https://launchpad.net/ubuntu/+source/pacemaker/1.1.18-0ubuntu2/+build>
>> /14818856
>> 
>> It’s not clear to me why pacemaker 1.1.18 is available on
>> launchpad.net <http://launchpad.net/> and not on the official Ubuntu Search 
>> Packages website.
>> However, can I assume 1.1.19 and 2.0.0 have the same dependencies
>> list (considering they have only removed deprecated functions and
>> applied some bug fixes)?
> 
> Yes, the dependencies should be the same (when corosync 2 is used)
> 
>> Thanks again for answers
>> 
>> 
>>> On 29 May 2018, at 17:41, Jehan-Guillaume de Rorthais wrote:
>>> 
>>> On Tue, 29 May 2018 14:23:31 +0200
>>> Salvatore D'angelo  wrote:
>>> ...
>>>> 2. I read some documentation about upgrade and since we want 0 ms
>>>> downtime I
>>>> think the Rolling Upgrade (node by node) is the better approach.
>>> 
>>> The 0ms upgrade is almost impossible. At some point, you will have
>>> to move the
>>> master somewhere else.
>>> 
>>> Unless you have some session management that are able to wait for
>>> the
>>> current sessions to finish, then hold the incoming sessions while
>>> you are
>>> moving the master, you will have downtime and/or xact rollback.
>>> 
>>> Good luck anyway :)
>>> 
>>> -- 
>>> Jehan-Guillaume de Rorthais
>>> Dalibo
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> <https://lists.clusterlabs.org/mailman/listinfo/users>
>> 
>> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
> -- 
> Ken Gaillot <kgail...@redhat.com>
> ___
> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
> https://lists.clusterlabs.org/mailman/listinfo/users 
> <https://lists.clusterlabs.org/mailman/listinfo/users>
> 
> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker PostgreSQL cluster

2018-05-29 Thread Salvatore D'angelo
Hi,

Regarding last question about pacemaker dependencies for Ubuntu I found this 
for 1.1.18:
https://launchpad.net/ubuntu/+source/pacemaker/1.1.18-0ubuntu2/+build/14818856 
<https://launchpad.net/ubuntu/+source/pacemaker/1.1.18-0ubuntu2/+build/14818856>

It’s not clear to me why pacemaker 1.1.18 is available on launchpad.net and not 
on the official Ubuntu Search Packages website.
However, can I assume 1.1.19 and 2.0.0 have the same dependencies list 
(considering they have only removed deprecated functions and applied some bug 
fixes)?
Thanks again for the answers.


> On 29 May 2018, at 17:41, Jehan-Guillaume de Rorthais  wrote:
> 
> On Tue, 29 May 2018 14:23:31 +0200
> Salvatore D'angelo  wrote:
> ...
>> 2. I read some documentation about upgrade and since we want 0 ms downtime I
>> think the Rolling Upgrade (node by node) is the better approach.
> 
> The 0ms upgrade is almost impossible. At some point, you will have to move the
> master somewhere else.
> 
> Unless you have some session management that are able to wait for the
> current sessions to finish, then hold the incoming sessions while you are
> moving the master, you will have downtime and/or xact rollback.
> 
> Good luck anyway :)
> 
> -- 
> Jehan-Guillaume de Rorthais
> Dalibo

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker PostgreSQL cluster

2018-05-29 Thread Salvatore D'angelo
Hi All,

I am new to this list. I am working on a project that uses a cluster composed 
of 3 nodes (with Ubuntu 14.04 trusty) on which we run PostgreSQL managed as 
Master/slaves.
We use Pacemaker/Corosync to manage this cluster. In addition, we have a 
two-node GlusterFS where we store backups and WAL files.
Currently the versions of our components are quite old; we have:
Pacemaker 1.1.14
Corosync 2.3.5

and we want to move to a new version of Pacemaker but I have some doubts.

1. I noticed there is a 2.0.0 candidate release, so it could be convenient for us 
to move to this release. When will the final release be published? Is it better to 
move to 2.0.0 or 1.1.18?
2. I read some documentation about upgrades, and since we want 0 ms downtime I 
think the Rolling Upgrade (node by node) is the better approach: we migrate one 
node while the other two nodes stay active (a sketch of the per-node sequence 
follows this list). The problem is that I do not know if I can have a mix of 
1.1.14 and 1.1.18 (or 2.0.0) nodes. The documentation does not clarify it, or at 
least it was not clear to me. Is this possible?
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ap-upgrade.html
 

https://wiki.clusterlabs.org/wiki/Upgrade 

3. I need to upgrade pacemaker/corosync on Ubuntu 14.04. I noticed for 1.1.18 
there are Ubuntu packages available. What about 2.0.0? Is it possible to create 
Ubuntu packages in some way?
4. Where can I find the list of (Ubuntu) dependencies required by 
pacemaker/corosync for 1.1.18 and 2.0.0?
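
As a sketch of the per-node rolling sequence referred to in point 2; the node
name and service commands are examples matching the rest of this thread, not a
verified recipe:

crm node standby pg1        # move resources off the node being upgraded
service pacemaker stop
service corosync stop
# ... upgrade pacemaker/corosync on this node ...
service corosync start
service pacemaker start
crm node online pg1         # let the node take resources again, then repeat on the next node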

Thanks in advance for your help.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Problems with pacemaker

2017-11-17 Thread Salvatore D'Angelo
Hi All,

I am working on a PostgreSQL cluster managed using Pacemaker and Corosync.
The cluster is composed of three nodes: Master, Sync and Async.
Currently my cluster uses PostgreSQL 9.4.9 and I am going to upgrade my 
environments to 9.6.3.

My upgrade procedure uses the following steps:

1. install 9.6.3 binaries on all the three nodes
2. crm node standby of all the three nodes
3. create new data directory on all the three nodes
4. only on master node do initdb on data directory
5. pg_upgrade from old to the new directory
6. update postgresql.conf file
7. start postgres with pg_ctl command
8. stop postgres with pg_ctl command
9. crm node online 

on the others two nodes:
10. pg_basebackup to align the slave node with the master
11. crm node online 
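
For illustration, steps 4-8 on the master might look roughly like this; the 
binary and data directory paths are assumptions, not taken from the message:

/usr/lib/postgresql/9.6/bin/initdb -D /var/lib/postgresql/9.6/main
/usr/lib/postgresql/9.6/bin/pg_upgrade \
    -b /usr/lib/postgresql/9.4/bin -B /usr/lib/postgresql/9.6/bin \
    -d /var/lib/postgresql/9.4/main -D /var/lib/postgresql/9.6/main
# adjust postgresql.conf, then the one-off start/stop of steps 7 and 8
/usr/lib/postgresql/9.6/bin/pg_ctl -D /var/lib/postgresql/9.6/main start
/usr/lib/postgresql/9.6/bin/pg_ctl -D /var/lib/postgresql/9.6/main stop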

Now I have two questions:
1. In order to have the master working properly I needed steps 7 and 8, to 
let the master recognize the status of the cluster before the upgrade and let it 
recognize the other nodes and go into hot standby mode. Is this start/stop really 
required, or is there another way to have the master start from the status it had 
before the upgrade?

2. On the test environment this procedure works fine. In production, it has 
happened twice that when I bring the nodes online in step 9 the slaves start as 
well. Is this possible? In which cases could this happen?

Salvatore D'Angelo
Advisory Software  Engineer - IBM Cloud
IBM Rome Software Lab
Via Sciangai 53, 00144 Roma
Phone +39-347-432-8059


IBM Italia S.p.A. Sede Legale: Circonvallazione Idroscalo - 20090 Segrate 
(MI) Cap. Soc. euro 347.256.998,80 C. F. e Reg. Imprese MI 01442240030 - 
Partita IVA 10914660153 Societa' con unico azionista Societa' soggetta 
all'attivita' di direzione e coordinamento di International Business 
Machines Corporation (Salvo che sia diversamente indicato sopra / Unless 
stated otherwise above)
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org