Hi!

I cannot really help you, but I found it helpful to keep a wide "tail -f
/var/log/messages" window open on every cluster node while issuing the actual
commands in another window. Maybe you could also watch the cluster with hawk or
crm_mon. My favourite option set for the latter is "-1Arf"...
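
For example, one window per node for the log and another one for the status:

  tail -f /var/log/messages   # follow syslog on each cluster node
  crm_mon -1Arf               # one-shot status with node attributes,
                              # inactive resources and fail counts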

Regards,
Ulrich

>>> Thomas Schulte <tho...@cupracer.de> wrote on 22.01.2014 at 09:55 in
>>> message <09d9d36ad571203a5b9b048da373d...@ser4.de>:
> Hi all,
> 
> I'm experiencing difficulties with my 2-node cluster and I'm running
> out of ideas about how to fix this. I'd be glad if someone here
> could point me in the right direction.
> 
> As said, it's a 2-node cluster, running with openSUSE 13.1 and the 
> HA-Factory packages:
> 
> cluster-glue: 1.0.12-rc1 (b5f1605097857b8b96bd517282ab300e2ad7af99)
> resource-agents: # Build version: f725724964882a407f7f33a97124da07a2b28d5d
> CRM Version: 1.1.10+git20140117.a3cda76-102.1 (1.1.10+git20140117.a3cda76)
> pacemaker 1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> libpacemaker3 1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> corosync 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> libcorosync4 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> resource-agents 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> cluster-glue 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> libglue2 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> ldirectord 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> 
> Both nodes have two NICs; one is connected to the world and the other one
> connects both nodes via a crossover cable. An internal subnet is used
> there, and my /etc/hosts files are fine:
> 
> 127.0.0.1       localhost
> 10.0.0.1        s00201.ser4.de s00201
> 10.0.0.2        s00202.ser4.de s00202
> 
> Corosync is configured with udpu and the firewall does not block any
> traffic between the internal NICs.
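> 
> For reference, the corosync.conf looks roughly like this (trimmed to the
> transport-related bits, other settings omitted):
> 
> ---------------
> totem {
>         version: 2
>         transport: udpu
> }
> 
> nodelist {
>         node {
>                 ring0_addr: 10.0.0.1
>                 nodeid: 1
>         }
>         node {
>                 ring0_addr: 10.0.0.2
>                 nodeid: 2
>         }
> }
> ---------------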
> 
> 
> My cluster is up and running, both nodes are providing some services,
> filesystems are mirrored by drbd and the world is a happy place. :-)
> The cluster uses a valid and available DC. Editing resources and executing
> actions on them usually works fine.
> 
> Sometimes, when I run a command like "crm resource migrate grp_nginx"
> or just a "crm resource cleanup pri_svc_varnish", those commands don't
> return but time out after a while. In this state even a simple
> "crmadmin -D" does not return.
> 
> This has happened many times in the last few days (I migrated to openSUSE
> 13.1 last week), so I tried different things to clear the problem, but
> nothing seems to work. It may happen that the STONITH mechanism is
> executed against one of the nodes. Interestingly, the surviving node does
> not seem to recognize that it is alone then: "crm status" sometimes still
> shows both nodes as "online". In other cases the fenced node comes up
> again after rebooting, but it is not found by the first node and appears
> "offline". The network connection does not seem to have any problems;
> communication between the nodes is still possible and I can see a lot of
> UDP traffic between both nodes.
> 
> Most of the time I resolve this by rebooting the "unresponsive" cluster
> node as well. This leads to other problems because my drbd devices become
> out of sync, services get stopped and so on. On the other hand, the node
> does not seem to heal itself, so no "crm" actions can be executed
> successfully.
> 
> The last time I dared to run a "crm" action was yesterday between 18:00
> and 19:00. I created a full hb_report that should contain all relevant
> information, including the pe-input files. I also enabled debug logging
> for corosync, so extended logs are available, too.
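> 
> (The report was created along these lines; the destination name below is
> just an example:)
> 
> ---------------
> hb_report -f "2014-01-21 18:00" -t "2014-01-21 19:00" /tmp/hb_report-crm-hang
> ---------------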
> 
> I used strace to find out what a simple "crmadmin -D" does. It ends 
> with:
> 
> ---------------
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> futex(0x7fdccdcf2d48, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdccf39f000
> socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
> fcntl(3, F_GETFD)                       = 0
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
> connect(3, {sa_family=AF_LOCAL, sun_path=@"crmd"}, 110) = 0
> setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
> sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
> setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
> recvfrom(3, 0x7fff4bc37d10, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=3, events=POLLIN}], 1, 4294967295Process 3781 detached
>   <detached ...>
> ---------------
> 
> (The full log is available)
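> 
> (The trace was captured with a plain strace run, roughly like this; the
> output file name is just an example:)
> 
> ---------------
> strace -s 128 -o /tmp/crmadmin-D.trace crmadmin -D
> ---------------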
> 
> crmadmin tries to reach crmd, so I also straced the running crmd 
> process.
> There's not much happening here:
> 
> ---------------
> Process 8669 attached
> read(22, Process 8669 detached
>   <detached ...>
> ---------------
> 
> I killed the crmd process and it got restarted automatically (by 
> pacemakerd?).
> After that, strace just shows countless messages like these:
> 
> ---------------
> Process 7856 attached
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> ---------------
> 
> Almost the same output appears when strace'ing pacemakerd, cib and lrmd.
> Unfortunately, I can't remember exactly what happened next, but I think
> that after restarting crmd it took some time until the other node was
> fenced. That node came up again after rebooting, both nodes found each
> other, a DC was elected and everything was fine again.
> 
> Another thing that I could not resolve is this type of message:
> 
> ---------------
> pacemaker.service: Got notification message from PID 20309, but reception only permitted for PID 8663
> ---------------
> 
> The PIDs are:
> 
> ---------------
> ps ax|egrep "20309|8663"
>   8663 ?        Ss     0:05 /usr/sbin/pacemakerd -f
> 20309 ?        Ss     0:29 /usr/sbin/httpd2 -DSTATUS -f /etc/apache2/httpd.conf -c PidFile /var/run//httpd2.pid
> ---------------
> 
> Maybe this is not that important and has nothing to do with the problem
> described above.
> 
> 
> 
> I'd rather not attach the whole hb_report and logging data to this e-mail,
> but if someone would like to have a look at the files, I'll gladly send
> them directly.
> 
> Maybe I failed to look in the right places to figure out what's going
> wrong.
> Any hint would be welcome. :-)
> 
> 
> Thanks for reading!
> 
> Regards,
> Thomas


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
