Hello,

Our cluster was working fine on the corosync stack, with corosync 2.3.0
and pacemaker 1.1.8.

After upgrading (full versions and configs below), we began to have
problems with node names.
It's a two-node cluster, with node names "turifel" (the DC) and "selavi".

When selavi joins the cluster, we see this warning in selavi's log:

-----
Jun 27 11:54:29 selavi attrd[11998]:   notice: corosync_node_name:
Unable to get node name for nodeid 168385827
Jun 27 11:54:29 selavi attrd[11998]:   notice: get_node_name: Defaulting
to uname -n for the local corosync node name
-----

This is OK, and also happened with version 1.1.8.
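
For what it's worth, if I decode that nodeid as a packed IPv4 address
(which I believe is what corosync uses by default when no explicit nodeid
is configured), it comes out as selavi's ring0 address, so the id itself
looks sane:

----
$ printf '%d.%d.%d.%d\n' $((168385827 >> 24 & 255)) $((168385827 >> 16 & 255)) \
    $((168385827 >> 8 & 255)) $((168385827 & 255))
10.9.93.35
----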

At the corosync level, everything seems OK:
----
Jun 27 11:51:18 turifel corosync[6725]:   [TOTEM ] A processor joined or
left the membership and a new membership (10.9.93.35:1184) was formed.
Jun 27 11:51:18 turifel corosync[6725]:   [QUORUM] Members[2]: 168385827
168385835
Jun 27 11:51:18 turifel corosync[6725]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Jun 27 11:51:18 turifel crmd[19526]:   notice: crm_update_peer_state:
pcmk_quorum_notification: Node selavi[168385827] - state is now member
(was lost)
-------

But when pacemaker starts on selavi (the joining node), turifel's log
shows this:

----
Jun 27 11:54:28 turifel crmd[19526]:   notice: do_state_transition:
State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN
cause=C_FSA_INTERNAL origin=peer_update_callback ]
Jun 27 11:54:28 turifel crmd[19526]:  warning: crm_get_peer: Node
'selavi' and 'selavi' share the same cluster nodeid: 168385827
Jun 27 11:54:28 turifel crmd[19526]:  warning: crmd_cs_dispatch:
Recieving messages from a node we think is dead: selavi[0]
Jun 27 11:54:29 turifel crmd[19526]:  warning: crm_get_peer: Node
'selavi' and 'selavi' share the same cluster nodeid: 168385827
Jun 27 11:54:29 turifel crmd[19526]:  warning: do_state_transition: Only
1 of 2 cluster nodes are eligible to run resources - continue 0
Jun 27 11:54:29 turifel attrd[19524]:   notice: attrd_local_callback:
Sending full refresh (origin=crmd)
----

And selavi remains in "pending" state. Sometimes turifel (the DC) fences
selavi, but other times it stays pending forever.

On the turifel node, all resources give warnings like this one:
 warning: custom_action: Action p_drbd_ha0:0_monitor_0 on selavi is
unrunnable (pending)
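
If it helps, I can also post how pacemaker sees the node table on the DC;
I assume these are the right commands to check for a stale or duplicate
node entry:

----
# run on the DC (turifel)
crm_node -l            # known cluster members (id, name, state)
cibadmin -Q -o nodes   # <nodes> section of the CIB
----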

On both nodes, uname -n and crm_node -n give the correct node names
(selavi and turifel respectively).
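For reference, this is what we get (shown for selavi; turifel is
analogous with its own name):

----
selavi:~ # uname -n
selavi
selavi:~ # crm_node -n
selavi
----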

Do you think it's a configuration problem?


Below I give information about versions and configurations.

Best regards,
Bernardo.


-----
Versions (compiled from git/hg):

corosync: 2.3.0.66-615d
pacemaker: 1.1.9-61e4b8f
cluster-glue: 1.0.11
libqb:  0.14.4.43-bb4c3
resource-agents: 3.9.5.98-3b051
crmsh: 1.2.5

The cluster also uses drbd, dlm and gfs2, but I think their versions are
irrelevant here.

--------
Pacemaker configure invocation and resulting configuration summary:
./configure --prefix=/opt/ha --without-cman \
    --without-heartbeat --with-corosync \
    --enable-fatal-warnings=no --with-lcrso-dir=/opt/ha/libexec/lcrso

pacemaker configuration:
  Version                  = 1.1.9 (Build: 61e4b8f)
  Features                 = generated-manpages ascii-docs ncurses
libqb-logging libqb-ipc lha-fencing upstart nagios  corosync-native snmp
libesmtp

  Prefix                   = /opt/ha
  Executables              = /opt/ha/sbin
  Man pages                = /opt/ha/share/man
  Libraries                = /opt/ha/lib
  Header files             = /opt/ha/include
  Arch-independent files   = /opt/ha/share
  State information        = /opt/ha/var
  System configuration     = /opt/ha/etc
  Corosync Plugins         = /opt/ha/lib

  Use system LTDL          = yes

  HA group name            = haclient
  HA user name             = hacluster

  CFLAGS                   = -I/opt/ha/include -I/opt/ha/include
-I/opt/ha/include/heartbeat    -I/opt/ha/include   -I/opt/ha/include
-ggdb  -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return
-Wbad-function-cast -Wcast-align -Wdeclaration-after-statement
-Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security
-Wformat-nonliteral -Wmissing-prototypes -Wmissing-declarations
-Wnested-externs -Wno-long-long -Wno-strict-aliasing
-Wunused-but-set-variable -Wpointer-arith -Wstrict-prototypes
-Wwrite-strings
  Libraries                = -lgnutls -lcorosync_common -lplumb -lpils
-lqb -lbz2 -lxslt -lxml2 -lc -luuid -lpam -lrt -ldl  -lglib-2.0   -lltdl
-L/opt/ha/lib -lqb -ldl -lrt -lpthread
  Stack Libraries          =   -L/opt/ha/lib -lqb -ldl -lrt -lpthread
-L/opt/ha/lib -lcpg   -L/opt/ha/lib -lcfg   -L/opt/ha/lib -lcmap
-L/opt/ha/lib -lquorum

----
Corosync config:

totem {
        version: 2
        crypto_cipher: none
        crypto_hash: none
        cluster_name: fiestaha
        interface {
                ringnumber: 0
                ttl: 1
                bindnetaddr: 10.9.93.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: local7
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}
quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        wait_for_all: 0
}
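
One thing I notice is that corosync.conf has no nodelist section, so node
names are not declared anywhere and pacemaker falls back to uname -n for
them. If declaring the names explicitly could help, I guess it would look
something like this (untested; the ring0 addresses are only inferred from
the nodeids, so please correct me if that mapping is wrong):

nodelist {
        node {
                ring0_addr: 10.9.93.35
                name: selavi
                nodeid: 168385827
        }
        node {
                ring0_addr: 10.9.93.43
                name: turifel
                nodeid: 168385835
        }
}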

-- 
APSL
*Bernardo Cabezas Serra*
*Responsable Sistemas*
Camí Vell de Bunyola 37, esc. A, local 7
07009 Polígono de Son Castelló, Palma
Mail: bcabe...@apsl.net
Skype: bernat.cabezas
Tel: 971439771

