Hi! I cannot really help you, but I have found it helpful to open a wide "tail -f /var/log/messages" window for every cluster node while issuing the actual commands in another window. Maybe you could also watch the cluster with hawk or crm_mon. My favourite option set is "-1Arf"...
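Concretely, that works out to something like the following (the crm_mon flags are from its man page: -1 print once and exit, -A show node attributes, -r show inactive resources, -f show resource fail counts):

```shell
# One terminal per cluster node, left open while issuing commands elsewhere:
#   tail -f /var/log/messages

# One-shot cluster overview; falls back gracefully if the cluster
# tools are not installed on this host:
crm_mon -1Arf 2>/dev/null || echo "crm_mon not available on this host"
```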
Regards,
Ulrich

>>> Thomas Schulte <tho...@cupracer.de> wrote on 22.01.2014 at 09:55 in
>>> message <09d9d36ad571203a5b9b048da373d...@ser4.de>:
> Hi all,
>
> I'm experiencing difficulties with my 2-node cluster and I'm running
> out of ideas about how to fix this. I'd be glad if someone here
> could point me in the right direction.
>
> As said, it's a 2-node cluster, running with openSUSE 13.1 and the
> HA-Factory packages:
>
> cluster-glue: 1.0.12-rc1 (b5f1605097857b8b96bd517282ab300e2ad7af99)
> resource-agents: # Build version: f725724964882a407f7f33a97124da07a2b28d5d
> CRM Version: 1.1.10+git20140117.a3cda76-102.1 (1.1.10+git20140117.a3cda76)
>
> pacemaker 1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> libpacemaker3 1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> corosync 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> libcorosync4 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> resource-agents 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> cluster-glue 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> libglue2 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
> ldirectord 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
>
> Both nodes have two NICs: one is connected to the world, and the other
> connects both nodes with a crossover cable. An internal subnet is used
> here, and my /etc/hosts files are fine:
>
> 127.0.0.1 localhost
> 10.0.0.1  s00201.ser4.de s00201
> 10.0.0.2  s00202.ser4.de s00202
>
> Corosync is configured with udpu, and the firewall does not block any
> traffic between the internal NICs.
>
> My cluster is up and running, both nodes are providing some services,
> filesystems are mirrored by DRBD, and the world is a happy place. :-)
> The cluster uses a valid and available DC.
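For comparison: a minimal udpu corosync.conf for such a 2-node setup usually looks roughly like the sketch below. The addresses are taken from your /etc/hosts; everything else is an assumption about a stock corosync 2.x configuration, not your actual file:

```
totem {
    version: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0
    }
}

nodelist {
    node { ring0_addr: 10.0.0.1 }
    node { ring0_addr: 10.0.0.2 }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

With corosync 2.x votequorum, "two_node: 1" is what keeps a 2-node cluster quorate when one node is gone; without it, losing a node means losing quorum.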
Editing and executing actions
> on resources usually works fine.
>
> Sometimes, when I run a command like "crm resource migrate grp_nginx"
> or just "crm resource cleanup pri_svc_varnish", it may happen that those
> commands don't return but time out after a while. At that point even
> a "crmadmin -D" does not return.
>
> This has happened many times in the last days (I migrated to openSUSE
> 13.1 last week), so I tried different things to clear the problem, but
> nothing seems to work.
> It may happen that the STONITH mechanism is executed for one of the
> nodes. Interestingly, the other node does not seem to recognize that
> it's alone then: "crm status" sometimes still shows both nodes as
> "online". In other cases the second node comes up again after
> rebooting, but it doesn't get found by the first node and appears
> "offline".
> The network connection does not seem to have any problems.
> Communication is still possible between the nodes, and I can see a lot
> of UDP traffic between them.
>
> Most of the time I solve this by rebooting the "unresponsive" cluster
> node, too. This leads to other problems, because my DRBD devices become
> out of sync, services get stopped, and so on. On the other hand, the
> node does not seem to heal itself, so no "crm" actions can be executed
> successfully.
>
> The last time I dared to run a "crm" action was yesterday between
> 18:00 and 19:00. I created a full hb_report that should contain all
> relevant information, including the pe-input files. I also enabled
> debug logging for corosync, so extended logs are available, too.
>
> I used strace to find out what a simple "crmadmin -D" does.
It ends
> with:
>
> ---------------
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> futex(0x7fdccdcf2d48, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdccf39f000
> socket(PF_LOCAL, SOCK_STREAM, 0) = 3
> fcntl(3, F_GETFD) = 0
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> connect(3, {sa_family=AF_LOCAL, sun_path=@"crmd"}, 110) = 0
> setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
> sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
> setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
> recvfrom(3, 0x7fff4bc37d10, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=3, events=POLLIN}], 1, 4294967295Process 3781 detached
>  <detached ...>
> ---------------
>
> (The full log is available.)
>
> crmadmin tries to reach crmd, so I also straced the running crmd
> process. There's not much happening here:
>
> ---------------
> Process 8669 attached
> read(22, Process 8669 detached
>  <detached ...>
> ---------------
>
> I killed the crmd process and it got restarted automatically (by
> pacemakerd?).
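When stracing the daemons, it can also help to log to a file with timestamps instead of watching the live stream, so you can correlate syscalls with the cluster log afterwards. A sketch, assuming crmd is found via pidof (the output path is arbitrary):

```shell
# Attach to the running crmd, if any, and write decoded syscalls to a file.
# -f follows forked children, -tt adds microsecond timestamps,
# -s 256 widens truncated string arguments.
CRMD_PID=$(pidof crmd 2>/dev/null || true)
if [ -n "$CRMD_PID" ]; then
    strace -f -tt -s 256 -o /tmp/crmd.strace -p "$CRMD_PID"
else
    echo "crmd is not running on this host"
fi
```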
> After that, strace just shows countless messages like these:
>
> ---------------
> Process 7856 attached
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
> ---------------
>
> Almost the same output is shown when stracing pacemakerd, cib, and
> lrmd. Unfortunately, I can't remember what happened then, but I think
> that after restarting crmd it took some time until the other node was
> fenced. It came up again after rebooting, both nodes found each other,
> a DC was elected, and everything was fine again.
>
> Another thing that I could not solve is this type of message:
>
> ---------------
> pacemaker.service: Got notification message from PID 20309, but
> reception only permitted for PID 8663
> ---------------
>
> The PIDs are:
>
> ---------------
> ps ax | egrep "20309|8663"
>  8663 ?  Ss  0:05 /usr/sbin/pacemakerd -f
> 20309 ?  Ss  0:29 /usr/sbin/httpd2 -DSTATUS -f /etc/apache2/httpd.conf -c PidFile /var/run//httpd2.pid
> ---------------
>
> Maybe this is not that important and has nothing to do with the problem
> described above.
>
> I'd rather not attach the whole hb_report and logging data to this
> e-mail, but if someone would like to have a look at the files, I would
> send them directly.
>
> Maybe I haven't looked at the right places to figure out what's going
> wrong. Any hint would be welcome. :-)
>
> Thanks for reading!
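Regarding the "reception only permitted for PID" message: systemd prints this when a process inside the unit's cgroup sends an sd_notify datagram while NotifyAccess= restricts reception to the unit's main PID. If httpd2 is started by a cluster resource agent, it would end up in pacemaker.service's cgroup, which would explain the mismatch. A quick check (generic systemd commands; PID 20309 is just the one from your output):

```shell
# Which notification policy applies to the unit, and in which cgroup
# does the sending process live?
systemctl show -p NotifyAccess pacemaker.service 2>/dev/null || true
cat /proc/20309/cgroup 2>/dev/null || echo "PID 20309 not running here"
```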
>
> Regards,
> Thomas
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems