Hans,

Some of the OpenSAF threads are indeed run in real-time mode (that is, if you are running OpenSAF with "root" privileges). However, that may not be enough to handle stress on the following:

(1) Memory: I see that the "java" process has a huge size. What do the top few lines of your "top" dump show?

(2) Network traffic: Heartbeats may be getting queued (and delayed) behind ordinary traffic if network traffic is very high. This may (just may :-)) be the case if your CPU load is associated with a lot of network traffic.

It may be useful to check the above two possibilities to help isolate the root cause.

Thanks,
Phani
ECC, Motorola India Pvt. Ltd.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
Sent: Wednesday, December 19, 2007 7:46 PM
To: [email protected]
Subject: [Users] Instable cluster with CPU load

We are trying to run a cluster with a CPU load of approx. 40%. The test application uses CKPT and EVT.

What happens is that the active controller logs that it has missed heartbeats with other nodes in the cluster. When it misses the heartbeat with the standby controller, it orders it to reboot.

The heartbeat settings in BOM.xml are the defaults as delivered in OpenSAF:

<sndHbInt>1000</sndHbInt>
<rcvHbInt>3000</rcvHbInt>

Irrespective of these configuration values, I thought the system was designed with real-time threads for managing critical protocols? And it seems to be:

> SC_2_2# ps -eLfc | grep scap
> root 1371 1 1371 14 TS 27 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1372 14 RR 125 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1373 14 RR 130 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1374 14 RR 126 Dec17 ? 00:02:44 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1375 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1376 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1377 14 TS 29 Dec17 ? 00:00:02 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1378 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1379 14 TS 29 Dec17 ? 00:00:04 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1380 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1381 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1382 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1384 14 RR 126 Dec17 ? 00:01:07 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1390 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21

The process load on the active controller looks something like:

  PID USER  PR  NI   VIRT   RES   SHR  S  %CPU  %MEM    TIME+  COMMAND
 2458 root  23   4  1389m  236m  8916  S    55   5.8  6:40.88  java
 1635 root  20   4  1004m  8072  2228  S    15   0.2  1:54.67  rssServer
 1394 root  26   4  52284  2076  1440  S     3   0.1  0:28.54  ncs_cpd
 1324 root  20   4  54928  3692  1744  S     1   0.1  0:03.60  ncs_dts
 1350 root  20   4  64244   13m  1428  S     0   0.3  0:01.82  ncs_eds
 1398 root  23   4  52468  2716  2020  S     0   0.1  0:04.49  ncs_cpnd

Could there be some error in the AVD-AVD/AVND heartbeat design? In the NCS-AVSV-MIB I see other default values for heartbeats, 300 ms and 2000 ms respectively.

Regards,
Hans

_______________________________________________
Users mailing list
[email protected]
http://list.opensaf.org/maillist/listinfo/users
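
For anyone wanting to repeat the scheduling-class check from the "ps -eLfc" listing above on a larger dump, here is a minimal Python sketch. It is not part of the original exchange; the function name threads_by_class is made up for illustration, and it assumes the usual procps column order for "ps -eLfc" (UID PID PPID LWP NLWP CLS PRI STIME TTY TIME CMD). The sample rows are copied from the listing in this thread.

```python
from collections import defaultdict

# A few rows copied from the ps listing in this thread.
SAMPLE = """\
> SC_2_2# ps -eLfc | grep scap
> root 1371 1 1371 14 TS 27 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1372 14 RR 125 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1374 14 RR 126 Dec17 ? 00:02:44 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1375 14 TS 29 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
"""


def threads_by_class(ps_output):
    """Group (LWP, PRI) pairs by scheduling class (the CLS column).

    Assumed column order: UID PID PPID LWP NLWP CLS PRI STIME TTY TIME CMD.
    """
    groups = defaultdict(list)
    for line in ps_output.splitlines():
        # Drop mail-quote markers ("> ") before splitting into columns.
        fields = line.lstrip("> ").split()
        if len(fields) < 11:
            continue  # prompt lines, blank lines
        try:
            lwp, pri = int(fields[3]), int(fields[6])
        except ValueError:
            continue  # header row or a row with a non-numeric priority
        groups[fields[5]].append((lwp, pri))
    return dict(groups)


if __name__ == "__main__":
    for cls, threads in sorted(threads_by_class(SAMPLE).items()):
        print(cls, len(threads), threads)
```

Threads reported as RR (or FF) are in a real-time class; TS is the ordinary time-sharing class. The sketch only summarizes what ps already reports and says nothing about which OpenSAF service each thread belongs to.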
