Re: [Users] Instable cluster with CPU load

Marella P-G19460 Fri, 21 Dec 2007 01:33:15 -0800

Hans, 

Some of the OpenSAF threads are indeed run in real-time mode (that is,
if you are running OpenSAF with "root" privileges).  However, that may
not be enough to handle stress on the following:


(1)  Memory:   I see that the "java" process has a huge size.  What do
the top few lines of your "top" dump show?

(2)  Network Traffic: Heart-beats may be getting queued (and delayed)
behind ordinary traffic if network traffic is very high. This may (just
may :-)) be the case if your CPU load is associated with a lot of network
traffic.

It may be useful to check the above two possibilities to help isolate
the root cause. 

Thanks, 
Phani
ECC, 
Motorola India Pvt. Ltd.



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
Sent: Wednesday, December 19, 2007 7:46 PM
To: [email protected]
Subject: [Users] Instable cluster with CPU load

We are trying to run a cluster with a CPU load of approximately 40%. The
test application uses CKPT and EVT.

What happens is that the active controller logs that it has missed
heartbeats with other nodes in the cluster. When it misses the heartbeat
with the standby controller, it orders it to reboot.

The heartbeat settings in BOM.xml are default as delivered in OpenSAF:
             <sndHbInt>1000</sndHbInt>
             <rcvHbInt>3000</rcvHbInt>
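
As a rough illustration of what these two values imply, here is a minimal
Python sketch. It assumes (this is not confirmed anywhere in the thread)
that sndHbInt is the interval between sent heartbeats, that rcvHbInt is how
long the receiver waits before declaring a heartbeat missed, and that both
are in milliseconds. The wrapper element <hbSettings> and the function name
heartbeat_margin are made up for the example; they are not part of BOM.xml
or OpenSAF.

```python
import xml.etree.ElementTree as ET

# The two settings quoted above. <hbSettings> is a hypothetical wrapper,
# added only so that the fragment parses as a well-formed XML document.
FRAGMENT = """
<hbSettings>
  <sndHbInt>1000</sndHbInt>
  <rcvHbInt>3000</rcvHbInt>
</hbSettings>
"""


def heartbeat_margin(xml_text):
    """Return (snd, rcv, periods, slack), all assumed to be milliseconds.

    periods: whole send intervals that fit into one receive window.
    slack:   how late a single heartbeat may arrive, assuming the
             previous one arrived on time (rcv - snd).
    """
    root = ET.fromstring(xml_text)
    snd = int(root.findtext("sndHbInt"))
    rcv = int(root.findtext("rcvHbInt"))
    periods = rcv // snd
    slack = rcv - snd
    return snd, rcv, periods, slack


if __name__ == "__main__":
    snd, rcv, periods, slack = heartbeat_margin(FRAGMENT)
    print(f"send every {snd} ms, receive window {rcv} ms")
    print(f"{periods} send intervals per window, {slack} ms of slack")
```

Under those assumptions, a sender thread that is starved or a packet that is
queued for more than about two seconds would be enough to trigger a missed
heartbeat with the defaults.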

Irrespective of these configuration values, I thought the system was
designed with real time threads for managing critical protocols?

And it seems to be:

> SC_2_2# ps -eLfc | grep scap
> root      1371     1  1371   14 TS   27 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1372   14 RR  125 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1373   14 RR  130 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1374   14 RR  126 Dec17 ?        00:02:44 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1375   14 TS   29 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1376   14 TS   29 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1377   14 TS   29 Dec17 ?        00:00:02 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1378   14 TS   29 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1379   14 TS   29 Dec17 ?        00:00:04 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1380   14 TS   29 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1381   14 TS   29 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1382   14 TS   29 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1384   14 RR  126 Dec17 ?        00:01:07 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root      1371     1  1390   14 TS   29 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
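
To make the split between real-time (RR) and time-sharing (TS) threads
easier to see, the listing can be grouped by scheduling class. The sketch
below is only an illustration: it assumes the usual procps column order for
"ps -eLfc" (UID PID PPID LWP NLWP CLS PRI STIME TTY TIME CMD), uses three
rows from the listing as sample input, and the function name
threads_by_class is made up for the example.

```python
# Three rows taken from the listing above, used as sample input.
SAMPLE = """\
root      1371     1  1371   14 TS   27 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
root      1371     1  1372   14 RR  125 Dec17 ?        00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
root      1371     1  1374   14 RR  126 Dec17 ?        00:02:44 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
"""


def threads_by_class(ps_text):
    """Group threads by scheduling class (CLS column).

    Returns {cls: [(lwp, pri, cpu_time), ...]}.
    Assumed columns: UID PID PPID LWP NLWP CLS PRI STIME TTY TIME CMD.
    """
    by_class = {}
    for line in ps_text.splitlines():
        # Drop a leading "> " quote marker, then split into at most 11
        # fields so the command and its arguments stay in the last one.
        fields = line.lstrip("> ").split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line, blank line, or wrapped fragment
        lwp, cls, pri, cpu_time = fields[3], fields[5], fields[6], fields[9]
        by_class.setdefault(cls, []).append((int(lwp), int(pri), cpu_time))
    return by_class


if __name__ == "__main__":
    for cls, threads in sorted(threads_by_class(SAMPLE).items()):
        print(cls, threads)
```

Applied to the full listing, this would show four RR threads (1372, 1373,
1374, 1384) and ten TS threads, with threads 1374 and 1384 having
accumulated the most CPU time (00:02:44 and 00:01:07).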

The process load on the active looks something like:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  2458 root      23   4 1389m 236m 8916 S   55  5.8   6:40.88 java
  1635 root      20   4 1004m 8072 2228 S   15  0.2   1:54.67 rssServer
  1394 root      26   4 52284 2076 1440 S    3  0.1   0:28.54 ncs_cpd
  1324 root      20   4 54928 3692 1744 S    1  0.1   0:03.60 ncs_dts
  1350 root      20   4 64244  13m 1428 S    0  0.3   0:01.82 ncs_eds
  1398 root      23   4 52468 2716 2020 S    0  0.1   0:04.49 ncs_cpnd

Could there be some error in the AVD-AVD/AVND heartbeat design?

In the NCS-AVSV-MIB I see other default values for heartbeats, 300ms
and 2000ms respectively.

Regards,
Hans

_______________________________________________
Users mailing list
[email protected]
http://list.opensaf.org/maillist/listinfo/users
