On 22/06/18 11:23, Salvatore D'angelo wrote:
> Hi,
> Here is the log:
>
> [17323] pg1 corosyncerror [QB ] couldn't create circular mmap on /dev/shm/qb-cfg-event-17324-17334-23-data
> [17323] pg1 corosyncerror [QB ] qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
> [17323] pg1 corosyncdebug [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-17324-17334-23-header
> [17323] pg1 corosyncdebug [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-17324-17334-23-header
> [17323] pg1 corosyncerror [QB ] shm connection FAILED: Resource temporarily unavailable (11)
> [17323] pg1 corosyncerror [QB ] Error in connection setup (17324-17334-23): Resource temporarily unavailable (11)
> [17323] pg1 corosyncdebug [QB ] qb_ipcs_disconnect(17324-17334-23) state:0

is /dev/shm full?

Chrissie

>
>> On 22 Jun 2018, at 12:10, Christine Caulfield <ccaul...@redhat.com> wrote:
>>
>> On 22/06/18 10:39, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Can you tell me exactly which log you need? I'll provide it as soon as
>>> possible.
>>>
>>> Regarding some settings, I am not the original author of this cluster.
>>> The people who created it left the company I am working with, and I
>>> inherited the code, so sometimes I do not know why some settings are used.
>>> The old versions of pacemaker, corosync, crmsh and resource agents were
>>> compiled and installed.
>>> I simply downloaded the new versions, compiled and installed them. I didn't
>>> get any complaint during ./configure, which usually checks for library
>>> compatibility.
>>>
>>> To be honest I do not know if this is the right approach. Should I "make
>>> uninstall" the old versions before installing the new ones?
>>> Which is the suggested approach?
>>> Thanks in advance for your help.
>>>
>>
>> OK fair enough!
>>
>> To be honest the best approach is almost always to get the latest
>> packages from the distributor rather than compile from source. That way
>> you can be more sure that upgrades will go more smoothly. Though, to be
>> honest, I'm not sure how good the Ubuntu packages are (they might be
>> great, they might not, I genuinely don't know).
>>
>> When building from source, if you don't know the provenance of the
>> previous version then I would recommend a 'make uninstall' first - or
>> removal of the packages if that's where they came from.
>>
>> One thing you should do is make sure that all the cluster nodes are
>> running the same version. If some are running older versions then nodes
>> could drop out for obscure reasons. We try and keep minor versions
>> on-wire compatible but it's always best to be cautious.
>>
>> The tidying of your corosync.conf can wait for the moment, let's get
>> things mostly working first. If you enable debug logging in corosync.conf:
>>
>> logging {
>>     to_syslog: yes
>>     debug: on
>> }
>>
>> Then see what happens and post the syslog file that has all of the
>> corosync messages in it, we'll take it from there.
>>
>> Chrissie
>>
>>>> On 22 Jun 2018, at 11:30, Christine Caulfield <ccaul...@redhat.com> wrote:
>>>>
>>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>>> Hi Christine,
>>>>>
>>>>> Thanks for the reply. Let me add a few details. When I run the corosync
>>>>> service I see the corosync process running. If I stop it and run:
>>>>>
>>>>> corosync -f
>>>>>
>>>>> I see three warnings:
>>>>> warning [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
>>>>> warning [MAIN ] Please migrate config file to nodelist.
>>>>> warning [MAIN ] Could not set SCHED_RR at priority 99: Operation not permitted (1)
>>>>> warning [MAIN ] Could not set priority -2147483648: Permission denied (13)
>>>>>
>>>>> but I see the node joined.
>>>>>
>>>>
>>>> Those certainly need fixing but are probably not the cause. Also why do
>>>> you have these values below set?
>>>>
>>>> max_network_delay: 100
>>>> retransmits_before_loss_const: 25
>>>> window_size: 150
>>>>
>>>> I'm not saying they are causing the trouble, but they aren't going to
>>>> help keep a stable cluster.
>>>>
>>>> Without more logs (full logs are always better than just the bits you
>>>> think are meaningful) I still can't be sure. It could easily be just
>>>> that you've overwritten a packaged version of corosync with your own
>>>> compiled one and they have different configure options or that the
>>>> libraries now don't match.
>>>>
>>>> Chrissie
>>>>
>>>>
>>>>> My corosync.conf file is below.
>>>>>
>>>>> With service corosync up and running I have the following output:
>>>>> corosync-cfgtool -s
>>>>> Printing ring status.
>>>>> Local node ID 1
>>>>> RING ID 0
>>>>>     id= 10.0.0.11
>>>>>     status= ring 0 active with no faults
>>>>> RING ID 1
>>>>>     id= 192.168.0.11
>>>>>     status= ring 1 active with no faults
>>>>>
>>>>> corosync-cmapctl | grep members
>>>>> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
>>>>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
>>>>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
>>>>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>>>>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>>>>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
>>>>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>>>>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>>>>>
>>>>> For the moment I have two nodes in my cluster (the third node had some
>>>>> issues, and at the moment I did crm node standby on it).
>>>>>
>>>>> Here are the dependencies I have installed for corosync (they work fine with
>>>>> pacemaker 1.1.14 and corosync 2.3.5):
>>>>> libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>> libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>> libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>> libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>>>> libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>> libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>>>> libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>>>>
>>>>> corosync.conf
>>>>> ---------------------
>>>>> quorum {
>>>>>     provider: corosync_votequorum
>>>>>     expected_votes: 3
>>>>> }
>>>>> totem {
>>>>>     version: 2
>>>>>     crypto_cipher: none
>>>>>     crypto_hash: none
>>>>>     rrp_mode: passive
>>>>>     interface {
>>>>>         ringnumber: 0
>>>>>         bindnetaddr: 10.0.0.0
>>>>>         mcastport: 5405
>>>>>         ttl: 1
>>>>>     }
>>>>>     interface {
>>>>>         ringnumber: 1
>>>>>         bindnetaddr: 192.168.0.0
>>>>>         mcastport: 5405
>>>>>         ttl: 1
>>>>>     }
>>>>>     transport: udpu
>>>>>     max_network_delay: 100
>>>>>     retransmits_before_loss_const: 25
>>>>>     window_size: 150
>>>>> }
>>>>> nodelist {
>>>>>     node {
>>>>>         ring0_addr: pg1
>>>>>         ring1_addr: pg1p
>>>>>         nodeid: 1
>>>>>     }
>>>>>     node {
>>>>>         ring0_addr: pg2
>>>>>         ring1_addr: pg2p
>>>>>         nodeid: 2
>>>>>     }
>>>>>     node {
>>>>>         ring0_addr: pg3
>>>>>         ring1_addr: pg3p
>>>>>         nodeid: 3
>>>>>     }
>>>>> }
>>>>> logging {
>>>>>     to_syslog: yes
>>>>> }
>>>>>
>>>>>
>>>>>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaul...@redhat.com> wrote:
>>>>>>
>>>>>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>>>>>> Pacemaker 1.1.14 -> 1.1.18
>>>>>>> Corosync 2.3.5 -> 2.4.4
>>>>>>> Crmsh 2.2.0 -> 3.0.1
>>>>>>> Resource agents 3.9.7 -> 4.1.1
>>>>>>>
>>>>>>> I started on a first node (I am trying to upgrade one node at a time).
>>>>>>> On a PostgreSQL slave node I did:
>>>>>>>
>>>>>>> crm node standby <node>
>>>>>>> service pacemaker stop
>>>>>>> service corosync stop
>>>>>>>
>>>>>>> Then I built the tools above as described on their GitHub.com page.
>>>>>>>
>>>>>>> ./autogen.sh (where required)
>>>>>>> ./configure
>>>>>>> make (where required)
>>>>>>> make install
>>>>>>>
>>>>>>> Everything went ok. I expected the new files to overwrite the old ones.
>>>>>>> I left the dependencies I had with the old software because I noticed
>>>>>>> that ./configure didn't complain.
>>>>>>> I started corosync.
>>>>>>>
>>>>>>> service corosync start
>>>>>>>
>>>>>>> To verify corosync works properly I used the following commands:
>>>>>>> corosync-cfgtool -s
>>>>>>> corosync-cmapctl | grep members
>>>>>>>
>>>>>>> Everything seemed ok and I verified my node joined the cluster (at least
>>>>>>> this is my impression).
>>>>>>>
>>>>>>> Here I ran into a problem.
>>>>>>> Running the command:
>>>>>>> corosync-quorumtool -ps
>>>>>>>
>>>>>>> I got the following error:
>>>>>>> Cannot initialise CFG service
>>>>>>>
>>>>>> That says that corosync is not running. Have a look in the log files to
>>>>>> see why it stopped. The pacemaker logs below are showing the same thing,
>>>>>> but we can't make any more guesses until we see what corosync itself is
>>>>>> doing. Enabling debug in corosync.conf will also help if more detail is
>>>>>> needed.
>>>>>>
>>>>>> Also starting corosync with 'corosync -pf' on the command-line is often
>>>>>> a quick way of checking things are starting OK.
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>>
>>>>>>> If I try to start pacemaker, I only see the pacemaker process running and
>>>>>>> pacemaker.log containing the following lines:
>>>>>>>
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: mcp_read_config: Reading configure for stack: corosync
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: notice: main: Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios corosync-native atomic-attrd acls
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: main: Maximum core file size is: 18446744073709551615
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: qb_ipcs_us_publish: server name: pacemakerd
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: warning: corosync_node_name: Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: corosync_node_name: Unable to get node name for nodeid 1
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Node 1 has uuid 1
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: error: cluster_connect_quorum: Could not connect to the Quorum API: 2
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: main: Exiting pacemakerd
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>>>>
>>>>>>> What is wrong with my procedure?
>>>>>>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
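
A minimal sketch for following up on the "is /dev/shm full?" question at the top of the thread, assuming a Linux node where libqb keeps its ring buffers under /dev/shm (as the qb-cfg-* paths in the log suggest). The function names shm_use_pct and count_qb_files are made up for illustration; they only wrap df, awk, find and wc and are not part of corosync or libqb.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not part of corosync or libqb).

# Print how full the filesystem holding a directory is, as a bare number (e.g. 37).
# Defaults to /dev/shm, where the qb-* ring buffer files named in the log live.
shm_use_pct() {
    dir="${1:-/dev/shm}"
    # df -P prints one POSIX-format line per filesystem; field 5 is the capacity, e.g. "37%".
    df -P "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Count libqb ring buffer files (qb-*) directly inside a directory.
# A large count while corosync is stopped may point at stale files from earlier runs.
count_qb_files() {
    dir="${1:-/dev/shm}"
    find "$dir" -maxdepth 1 -name 'qb-*' | wc -l | tr -d ' '
}
```

Called with no arguments, both functions inspect /dev/shm; running df -h /dev/shm and ls -l /dev/shm/qb-* by hand gives the same information. A value at or near 100 would be consistent with the "couldn't create circular mmap" errors above, though on its own it does not prove that is the cause.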