I tested as below:

With TIPC (changeset: 21a73092ad97):

1. With 550 AMF components on a single node: I performed unlock-in and all components were instantiated.
2. Configured 550 AMF components dynamically on each node and performed unlock-in on all components; all were instantiated on each node (total components in the cluster: 550 * 2 = 1100).
So everything worked fine, but "opensafd stop" hung and all OpenSAF processes were left in a defunct state:

root 24136     1 7 17:42 ? 00:00:05 /usr/local/lib/opensaf/osafdtmd
root 24243     1 6 17:42 ? 00:00:04 /usr/local/lib/opensaf/osafamfd --tracemask=0xffffffff
root 24253     1 7 17:42 ? 00:00:04 /usr/local/lib/opensaf/osafamfnd --tracemask=0xffffffff
root 28880 24253 0 17:43 ? 00:00:00 [osaf-clmd] <defunct>
root 28881 24253 0 17:43 ? 00:00:00 [osaf-ckptd] <defunct>
root 28882 24253 0 17:43 ? 00:00:00 [osaf-evtd] <defunct>
root 28883 24253 0 17:43 ? 00:00:00 [osaf-fmd] <defunct>
root 28884 24253 0 17:43 ? 00:00:00 [osaf-lckd] <defunct>
root 28885 24253 0 17:43 ? 00:00:00 [osaf-logd] <defunct>
root 28886 24253 0 17:43 ? 00:00:00 [osaf-msgd] <defunct>
root 28887 24253 0 17:43 ? 00:00:00 [osaf-ntfd] <defunct>
root 28888 24253 0 17:43 ? 00:00:00 [osaf-rded] <defunct>
root 28889 24253 0 17:43 ? 00:00:00 [osaf-smfd] <defunct>
root 28890 24253 0 17:43 ? 00:00:00 [osaf-immd] <defunct>
root 28891 24253 0 17:43 ? 00:00:00 [osaf-ckptnd] <defunct>
root 28892 24253 0 17:43 ? 00:00:00 [osaf-lcknd] <defunct>
root 28893 24253 0 17:43 ? 00:00:00 [osaf-msgnd] <defunct>
root 28894 24253 0 17:43 ? 00:00:00 [osaf-clmna] <defunct>
root 28895 24253 0 17:43 ? 00:00:00 [osaf-immnd] <defunct>
root 28896 24253 0 17:43 ? 00:00:00 [osaf-smfnd] <defunct>
root 28897 24253 0 17:43 ? 00:00:00 [osaf-amfwd] <defunct>

With TCP (changeset: 21a73092ad97, with #define DTM_INTRANODE_MAX_PROCESSES 650):

1. With 550 AMF components on a single node: I performed unlock-in and all components were instantiated.
2. Configured 550 AMF components dynamically on each node and performed unlock-in on all components; all were instantiated on each node (total components in the cluster: 550 * 2 = 1100).

So everything worked fine here as well, but "opensafd stop" hung and all OpenSAF processes were left in a defunct state, as shown above.

I then took a dump and started SC-1. Once the components were up and running on SC-1, I started SC-2 and everything came up fine. I then restarted SC-2, and SC-2 failed to come up with the following errors:

Oct 22 17:59:47 PM_SC-2 osafamfd[32221]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err :Success
Oct 22 17:59:48 PM_SC-2 osafimmnd[32164]: NO Implementer locally disconnected. Marking it as doomed 15 <9, 2020f> (@safAmfService2020f)
Oct 22 17:59:48 PM_SC-2 osafamfnd[32238]: NO saAmfCtDefQuiescingCompleteTimeout for 'safVersion=4.0.0,safCompType=AmfDemo1' initialized with saAmfCtDefCallbackTimeout
Oct 22 17:59:48 PM_SC-2 osafimmnd[32164]: NO Implementer disconnected 15 <9, 2020f> (@safAmfService2020f)
Oct 22 17:59:50 PM_SC-2 osafamfnd[32238]: ER ncsmds_api for 0 FAILED, dest=0
Oct 22 17:59:50 PM_SC-2 osafamfnd[32238]: NO saAmfCtDefQuiescingCompleteTimeout for 'safVersion=4.0.0,safCompType=AmfDemo1' initialized with saAmfCtDefCallbackTimeout
Oct 22 17:59:50 PM_SC-2 osafamfnd[32238]: ER ncsmds_api for 0 FAILED, dest=0

Registration failed for all components.

Thanks
-Nagu

> -----Original Message-----
> From: Neelakanta Reddy
> Sent: 22 October 2014 12:43
> To: Shu Wang; [email protected]
> Cc: Lisa Ann Lentz-Liddell
> Subject: Re: [users] Max number of SUs/Components in a cluster?
>
> Hi,
>
> There is a limitation, DTM_INTRANODE_MAX_PROCESSES, in the present OpenSAF.
> An enhancement ticket has been opened:
> https://sourceforge.net/p/opensaf/tickets/1187/
> Based on the discussion there, the limit can be increased or made configurable.
>
> /Neel.
>
> On Wednesday 22 October 2014 03:56 AM, Shu Wang wrote:
> > When testing the cluster for max size, what configuration do you use?
> > Are we able to get a copy of that to understand the look of the cluster?
> >
> > What is the maximum number of:
> > - service groups per cluster tested?
> > - service units per cluster tested?
> > - components per cluster tested?
> > - service units on a node tested?
> > - components on a node tested?
> >
> > While continuing to investigate the problem below, we found:
> > - On a controller node, OpenSAF fails when the component count for that node reaches 78.
> > - On a payload node, OpenSAF fails when the component count for that node reaches 90.
> >
> > In addition, there are:
> > - 22 OpenSAF components on a controller node.
> > - 10 OpenSAF components on a payload node.
> >
> > Total components running on a node:
> > - Controller node: 78 + 22 = 100
> > - Payload node: 90 + 10 = 100
> >
> > We looked through the OpenSAF code for defines with a value of 100 and found, in osaf/services/infrastructure/dtms/dtm:
> >
> > dtm_intra.c:#define DTM_INTRANODE_MAX_PROCESSES 100
> >
> > We changed that value to 115 and retried our test. Increasing it to 115 allowed the total component count to go above 100, and of course OpenSAF then failed when we unlocked the SU that pushed the component count over 115.
> >
> > Does anyone know of any ill effect if we change DTM_INTRANODE_MAX_PROCESSES to a larger value, e.g. 250? dtm_intra.c uses it only for the number of items in two arrays, and it is contained within that .c file. (A sketch of this pattern is included at the end of this thread.)
> >
> > We tried this change only with OpenSAF 4.4 and TCP.
> >
> > Thanks.
> >
> > Shu Wang | Senior Analyst | +1(407)708-5117 or x3917 | www.NetCracker.com
> > Proven Partner to Communications Service Providers
> >
> > -----Original Message-----
> > From: Neelakanta Reddy [mailto:[email protected]]
> > Sent: Tuesday, October 21, 2014 2:55 AM
> > To: [email protected]; Shu Wang
> > Cc: Lisa Ann Lentz-Liddell
> > Subject: Re: [users] Max number of SUs/Components in a cluster?
> >
> > Hi,
> >
> > Comments inline.
> >
> > /Neel.
> >
> > On Tuesday 21 October 2014 04:41 AM, Shu Wang wrote:
> >> The IMM documentation states:
> >>
> >> Applications that intend to add their own IMM classes and IMM objects need to be aware that capacity is limited. OpenSAF 4.1 has been system tested with up to 350 000 objects of average size 300 bytes. It is not advisable to generate larger IMM contents than that.
> >>
> >> What is the definition of an object?
> > The 300 bytes is the size of each object (i.e. the accumulated size of the assigned attributes of a class). The size of an object depends on the number of attributes and the types of attributes of its class. (A rough size estimate is included at the end of this thread.)
> >> We have a cluster defined across 6 nodes with a total of 12 SGs, a total of 64 SUs, and a total of 292 components. We can start OpenSAF successfully across the nodes and unlock all SUs with no problems.
> >>
> >> The cluster definition was increased to 6 nodes, 15 SGs, a total of 56 SUs, and a total of 388 components. We are able to start OpenSAF on all nodes successfully, but as soon as a little over 300 components have been unlocked, things start to fall apart. The OpenSAF processes start to die and the cluster is no longer usable.
> > SUs, SGs and components are internally objects for IMM. Increasing to "6 nodes, 15 SGs, a total of 56 SUs, and a total of 388 components" should not have caused any IMM-related problems.
> >> Oct 19 16:29:06 colobus osafamfnd[3649]: NO Assigned 'safSi=amfSDFSISI1.3,safApp=olcApp' ACTIVE to 'safSu=amfSDFSISU1.4,safSg=amfSDFSISG1,safApp=olcApp'
> >> Oct 19 16:29:07 colobus osafamfnd[3649]: NO 'safComp=amfOELCElfComp2.3.1,safSu=amfOELCSU2.3,safSg=amfOELCSG2,safApp=olcApp' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >> Oct 19 16:29:09 colobus osafamfnd[3649]: NO 'safComp=amfOELCElfComp2.3.1,safSu=amfOELCSU2.3,safSg=amfOELCSG2,safApp=olcApp' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >> Oct 19 16:29:09 colobus osafntfd[3587]: ER ntfs_mds_msg_send FAILED
> >> Oct 19 16:29:09 colobus osafntfd[3587]: ER ntfs_mds_msg_send to ntfa failed rc: 2
> >> Oct 19 16:29:09 colobus osafntfd[3587]: ER ntfs_mds_msg_send FAILED
> >> Oct 19 16:29:09 colobus osafntfd[3587]: ER ntfs_mds_msg_send FAILED
> >> ....
> >> Oct 19 16:33:24 colobus ntpd_initres[2608]: host name not found: 0.rhel.pool.ntp.org
> >> ....
> >> Oct 19 16:35:18 colobus osafimmnd[3549]: NO Implementer disconnected 14 <0, 22b0f> (MsgQueueService142095)
> >> Oct 19 16:35:19 colobus osafclmd[3602]: NO proc_initialize_msg: send failed. dest:22b0f00007a77
> >> Oct 19 16:35:19 colobus osafimmnd[3549]: NO Global discard node received for nodeId:22b0f pid:31333
> >> Oct 19 16:35:20 colobus osafamfnd[3649]: NO 'safComp=amfOELCElfComp2.3.1,safSu=amfOELCSU2.3,safSg=amfOELCSG2,safApp=olcApp' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >> Oct 19 16:35:22 colobus osafamfnd[3649]: NO 'safComp=amfOELCElfComp2.3.1,safSu=amfOELCSU2.3,safSg=amfOELCSG2,safApp=olcApp' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >> Oct 19 16:35:24 colobus osafamfnd[3649]: NO 'safComp=amfOELCElfComp2.3.1,safSu=amfOELCSU2.3,safSg=amfOELCSG2,safApp=olcApp' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >> Oct 19 16:35:26 colobus osafimmnd[3549]: NO Implementer connected: 16 (MsgQueueService142095) <12021, 2280f>
> >> Oct 19 16:35:26 colobus osafimmnd[3549]: NO Implementer locally disconnected. Marking it as doomed 16 <12021, 2280f> (MsgQueueService142095)
> >> Oct 19 16:35:26 colobus osafimmnd[3549]: NO Implementer disconnected 16 <12021, 2280f> (MsgQueueService142095)
> >> Oct 19 16:35:26 colobus osafamfd[3631]: NO Node 'bedrazzas.monkey.lab' left the cluster
> >>
> >> Have we reached a maximum number of SUs/Components that can be started within a single OpenSAF cluster?
> > OpenSAF 4.4/4.5 is tested for 70 nodes.
> >> We have tried the above with OpenSAF 4.4 and OpenSAF 4.5, and with both TCP and TIPC; all fail similarly.
> > This is likely an application problem or something that needs timeout adjustments. Please share the syslog messages of all the nodes.
> >> Thank you!
> >>
> >> Shu Wang | Senior Analyst | +1(407)708-5117 or x3917 | www.NetCracker.com
> >> Proven Partner to Communications Service Providers
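A rough size check against the IMM capacity figures quoted above, using the 300-byte average from the IMM documentation and the larger cluster Shu describes (the object count is only an estimate, since SIs, CSIs and other AMF objects are not listed):

  tested IMM content:  350 000 objects * 300 bytes = ~105 MB
  cluster above:       388 components + 56 SUs + 15 SGs (plus their SIs/CSIs) = on the order of a thousand objects

That is far below the tested IMM capacity, which is consistent with the comment above that the failures should not be IMM-related.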
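On the DTM_INTRANODE_MAX_PROCESSES question, a minimal sketch of the pattern is below. It is not the actual dtm_intra.c code; the structure, field and function names are illustrative. It shows how a compile-time define used to size two per-node tables typically looks, and one way the limit could be made run-time configurable along the lines discussed in ticket #1187. Raising the define itself should mostly cost a little extra static memory; whether the rest of DTM/MDS scales to more local processes is the open question.

/* Illustrative sketch only -- not the actual dtm_intra.c code; names and
 * fields are hypothetical.  DTM_INTRANODE_MAX_PROCESSES bounds two
 * per-node tables that hold one slot per local process connected to
 * osafdtmd. */
#include <stdint.h>
#include <stdlib.h>

#define DTM_INTRANODE_MAX_PROCESSES 100   /* current compile-time limit */

struct dtm_proc_slot {          /* hypothetical per-process bookkeeping */
        int      fd;            /* local socket of the connected process */
        uint32_t pid;
        int      in_use;
};

/* Compile-time sizing, as in the current code base: */
static struct dtm_proc_slot proc_tbl[DTM_INTRANODE_MAX_PROCESSES];
static int fd_tbl[DTM_INTRANODE_MAX_PROCESSES + 1];    /* +1 for the listener */

/* One possible run-time configurable variant: the limit is read from
 * configuration (e.g. an environment variable) at start-up and the
 * tables are allocated once, before any local process connects. */
static struct dtm_proc_slot *proc_tbl_dyn;
static int *fd_tbl_dyn;
static int max_local_processes;

int dtm_intranode_tables_init(int configured_max)
{
        max_local_processes = (configured_max > 0)
            ? configured_max : DTM_INTRANODE_MAX_PROCESSES;
        proc_tbl_dyn = calloc((size_t)max_local_processes,
                              sizeof(*proc_tbl_dyn));
        fd_tbl_dyn = calloc((size_t)max_local_processes + 1,
                            sizeof(*fd_tbl_dyn));
        return (proc_tbl_dyn != NULL && fd_tbl_dyn != NULL) ? 0 : -1;
}

With the static version, going from 100 to 250 slots grows the two tables by roughly 150 * (sizeof(struct dtm_proc_slot) + sizeof(int)), i.e. a few kilobytes.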
