Hi,

We have a 16-node cluster and have recently been running into a problem with the nodes. Occasionally one of the nodes becomes unreachable over ssh. The node itself is up and running, and the Xen guests on it still respond to ping, but neither the node nor its guests can actually be accessed. Very recently it happened to one of the nodes again. The nodes run RHEL 5.5, kernel 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux. When it happens, that node becomes detached from the cluster. I am not sure whether it is a kernel bug or not. If anybody can give some hints, it will be really appreciated. Here are a few lines of the log file from the last time it happened. Thanks in advance.

____________________________________________________________
Jul 1 17:11:03 server crond[11715]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jul 1 17:11:03 server crond[11716]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jul 1 17:11:01 server crond[11685]: (root) error: Job execution of per-minute job scheduled for 17:10 delayed into subsequent minute 17:11. Skipping job run.
Jul 1 17:11:03 server crond[11685]: CRON (root) ERROR: cannot set security context
Jul 1 17:17:13 server xinetd[6778]: START: pblocald pid=11896 from=xxx.xx.222.4
Jul 1 17:21:01 server crond[11852]: (root) error: Job execution of per-minute job scheduled for 17:15 delayed into subsequent minute 17:21. Skipping job run.
Jul 1 17:21:01 server crond[11852]: CRON (root) ERROR: cannot set security context
Jul 1 17:21:05 server crond[12031]: (root) CMD (python /usr/share/rhn/virtualization/poller.py)
Jul 1 17:23:34 server INFO: task cmahealthd:7492 blocked for more than 120 seconds.
Jul 1 17:23:37 server "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 1 17:23:37 server cmahealthd D 0000000000000180 0 7492 1 7507 7430 (NOTLB)
Jul 1 17:23:37 server ffff880065f29b18 0000000000000282 0000000000000000 0000000000000000
Jul 1 17:23:37 server 0000000000000009 ffff880065c35040 ffff88007f6720c0 000000000001d982
Jul 1 17:23:37 server ffff880065c35228 ffff88007e16b400
Jul 1 17:23:37 server Call Trace:
Jul 1 17:23:37 server [<ffffffff80287795>] __wake_up_common+0x3e/0x68
Jul 1 17:23:37 server [<ffffffff802d81f3>] base_probe+0x0/0x36
Jul 1 17:23:37 server [<ffffffff80262fb3>] wait_for_completion+0x7d/0xaa
Jul 1 17:23:52 server [<ffffffff80288f86>] default_wake_function+0x0/0xe
Jul 1 17:23:52 server [<ffffffff80298e84>] call_usermodehelper_keys+0xe3/0xf8
Jul 1 17:23:52 server [<ffffffff80298e99>] __call_usermodehelper+0x0/0x4f
Jul 1 17:23:52 server [<ffffffff802071b2>] find_get_page+0x4d/0x55
Jul 1 17:23:52 server [<ffffffff80299275>] request_module+0x139/0x14d
Jul 1 17:23:52 server [<ffffffff8022cf67>] mntput_no_expire+0x19/0x89
Jul 1 17:23:52 server [<ffffffff8020edda>] link_path_walk+0xa6/0xb2
Jul 1 17:23:52 server [<ffffffff80263914>] mutex_lock+0xd/0x1d
Jul 1 17:23:52 server [<ffffffff802d8211>] base_probe+0x1e/0x36
Jul 1 17:23:52 server [<ffffffff803af5c9>] kobj_lookup+0x132/0x19b
Jul 1 17:31:24 server xinetd[6778]: START: pblocald pid=12151 from=xxx.xx.222.4
Jul 1 17:28:16 server openais[6172]: [TOTEM] entering GATHER state from 12.
Jul 1 17:28:41 server openais[6172]: [TOTEM] Creating commit token because I am the rep.
Jul 1 17:28:41 server openais[6172]: [TOTEM] Saving state aru 20a high seq received 20a
Jul 1 17:28:41 server openais[6172]: [TOTEM] Storing new sequence id for ring 2e90
Jul 1 17:28:49 server openais[6172]: [TOTEM] entering COMMIT state.
Jul 1 17:31:30 server openais[6172]: [TOTEM] Creating commit token because I am the rep.
Jul 1 17:31:30 server openais[6172]: [TOTEM] Storing new sequence id for ring 2e94
Jul 1 17:31:30 server openais[6172]: [TOTEM] entering COMMIT state.
Jul 1 17:31:30 server openais[6172]: [TOTEM] entering GATHER state from 13.
Jul 1 17:31:30 server openais[6172]: [TOTEM] Creating commit token because I am the rep.
Jul 1 17:33:30 server [<ffffffff8024b204>] chrdev_open+0x53/0x183
Jul 1 17:33:30 server [<ffffffff8024b1b1>] chrdev_open+0x0/0x183
Jul 1 17:33:30 server [<ffffffff8021edc8>] __dentry_open+0xd9/0x1dc
Jul 1 17:33:30 server [<ffffffff80227bca>] do_filp_open+0x2a/0x38
Jul 1 17:33:30 server [<ffffffff8021a270>] do_sys_open+0x44/0xbe
Jul 1 17:33:30 server [<ffffffff8026168d>] ia32_sysret+0x0/0x5
Jul 1 17:33:30 server
Jul 1 17:31:30 server openais[6172]: [TOTEM] Storing new sequence id for ring 2e98
Jul 1 17:31:30 server openais[6172]: [TOTEM] entering COMMIT state.
Jul 1 17:31:30 server openais[6172]: [TOTEM] entering RECOVERY state.
Jul 1 17:31:30 server openais[6172]: [TOTEM] position [0] member 192.168.xxx.9:
Jul 1 17:31:30 server openais[6172]: [TOTEM] previous ring seq 11916 rep 192.168.xxx.9
Jul 1 17:31:30 server openais[6172]: [TOTEM] aru 20a high delivered 20a received flag 1
Jul 1 17:31:30 server openais[6172]: [TOTEM] position [1] member 192.168.xxx.10:
Jul 1 17:31:30 server openais[6172]: [TOTEM] previous ring seq 11924 rep 192.168.xxx.10
Jul 1 17:31:30 server openais[6172]: [TOTEM] aru e2 high delivered e2 received flag 1
Jul 1 17:31:30 server openais[6172]: [TOTEM] position [2] member 192.168.xxx.11:
Jul 1 17:33:30 server INFO: task cmahealthd:7492 blocked for more than 120 seconds.
Jul 1 17:33:30 server "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 1 17:33:30 server cmahealthd D 0000000000000180 0 7492 1 7507 7430 (NOTLB)
Jul 1 17:33:30 server ffff880065f29b18 0000000000000282 0000000000000000 0000000000000000
Jul 1 17:33:30 server 0000000000000009 ffff880065c35040 ffff88007f6720c0 000000000001d982
Jul 1 17:33:30 server ffff880065c35228 ffff88007e16b400
Jul 1 17:33:30 server Call Trace:
Jul 1 17:33:30 server [<ffffffff80287795>] __wake_up_common+0x3e/0x68
Jul 1 17:33:30 server [<ffffffff802d81f3>] base_probe+0x0/0x36
Jul 1 17:33:30 server [<ffffffff80262fb3>] wait_for_completion+0x7d/0xaa
Jul 1 17:33:29 server dlm_controld[6272]: cluster is down, exiting
Jul 1 17:31:30 server openais[6172]: [TOTEM] previous ring seq 11924 rep 192.168.xxx.10
Jul 1 17:31:30 server openais[6172]: [TOTEM] aru e2 high delivered e2 received flag 1
Jul 1 17:31:30 server openais[6172]: [TOTEM] position [3] member 192.168.xxx.12:
Jul 1 17:31:30 server openais[6172]: [TOTEM] previous ring seq 11924 rep 192.168.xxx.10
Jul 1 17:31:30 server openais[6172]: [TOTEM] aru e2 high delivered e2 received flag 1
Jul 1 17:31:30 server openais[6172]: [TOTEM] position [4] member 192.168.xxx.13:
Jul 1 17:31:30 server openais[6172]: [TOTEM] previous ring seq 11924 rep 192.168.xxx.10
Jul 1 17:31:30 server openais[6172]: [TOTEM] aru e2 high delivered e2 received flag 1
Jul 1 17:31:30 server openais[6172]: [TOTEM] position [5] member 192.168.xxx.14:
Jul 1 17:31:30 server openais[6172]: [TOTEM] previous ring seq 11924 rep 192.168.xxx.10
Jul 1 17:33:30 server [<ffffffff80288f86>] default_wake_function+0x0/0xe
Jul 1 17:33:30 server [<ffffffff80298e84>] call_usermodehelper_keys+0xe3/0xf8
Jul 1 17:33:30 server [<ffffffff80298e99>] __call_usermodehelper+0x0/0x4f
Jul 1 17:33:30 server [<ffffffff802071b2>] find_get_page+0x4d/0x55
Jul 1 17:33:30 server [<ffffffff80299275>] request_module+0x139/0x14d
Jul 1 17:33:30 server [<ffffffff8022cf67>] mntput_no_expire+0x19/0x89
Jul 1 17:33:30 server [<ffffffff8020edda>] link_path_walk+0xa6/0xb2
Jul 1 17:33:30 server [<ffffffff80263914>] mutex_lock+0xd/0x1d
Jul 1 17:33:30 server [<ffffffff802d8211>] base_probe+0x1e/0x36
Jul 1 17:33:30 server [<ffffffff803af5c9>] kobj_lookup+0x132/0x19b
Jul 1 17:33:30 server gfs_controld[6280]: cluster is down, exiting
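P.S. If I am reading the traces right, the "blocked for more than 120 seconds" lines come from the kernel hung-task watchdog that the log itself points at. Purely as a sketch of my understanding of that knob (the 120-second figure is only what the message above implies, I have not verified it on these nodes):

  # current hung-task timeout in seconds; the traces above fire when a task
  # stays in uninterruptible sleep (D state) longer than this
  cat /proc/sys/kernel/hung_task_timeout_secs

  # per the message in the log, writing 0 only hides the warning; it does not
  # unblock the stuck cmahealthd task
  # echo 0 > /proc/sys/kernel/hung_task_timeout_secs

Not sure whether the hung cmahealthd is the cause of the node dropping out of the cluster or just another symptom of it.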