Why is the finalizesync message so big? It is a concern if we have a middleware service that broadcasts big blobs without any upper limit.

As a fix for now, instead of failing the broadcast send when the message is greater than 2^16 bytes, MDS could revert to the old multi-unicast.
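Something like this, conceptually (a minimal sketch only; the helper names are placeholders, not the actual MDS functions):

    #include <stddef.h>
    #include <stdint.h>

    #define MDS_DIRECT_BUF_MAXSIZE (1 << 16)  /* 2^16 */

    /* Placeholders for the two existing MDS send paths. */
    extern uint32_t mds_mcast_send(const void *buf, size_t len);
    extern uint32_t mds_multi_unicast_send(const void *buf, size_t len);

    static uint32_t mds_bcast_send(const void *buf, size_t len)
    {
            /* Fits in a single TIPC multicast frame: use the new path. */
            if (len < MDS_DIRECT_BUF_MAXSIZE)
                    return mds_mcast_send(buf, len);

            /* Too big for multicast: instead of failing the send, fall
             * back to the pre-#851 multiple-unicast delivery. */
            return mds_multi_unicast_send(buf, len);
    }

That way a service like IMM could keep sending large events without MDS silently dropping them.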
Thanks,
Hans

> -----Original Message-----
> From: Neelakanta Reddy [mailto:reddy.neelaka...@oracle.com]
> Sent: 13 August 2014 11:34
> To: mahesh.va...@oracle.com; Hans Feldt; Anders Widell; Anders Björnerstedt
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [devel] [PATCH 0 of 3] Review Request for mds: use TIPC
> multicast for MDS broadcast [#851]
>
> Hi Mahesh,
>
> >> - A single multicast message can accommodate max of MDS_DIRECT_BUF_MAXSIZE
> >>   (2^16).
>
> The above limit has to be removed, because IMM can send events greater
> than 2^16 bytes in size.
>
> After running multiple regression tests and doing a failover, the
> following was observed:
>
> The test case is failover; the problem happened when the failover node
> was joining. At the end of syncing, the IMMND coordinator sends the
> finalizesync message, which is sent to the IMMD, and the IMMD
> broadcasts it to all IMMNDs.
> The finalizesync message may have an event size of more than 64000
> bytes. Since MDS drops packets with a size greater than 64000 bytes,
> there is a mismatch in the FEVS count and an out-of-order message is
> observed. This leads to a cluster restart.
>
> syslog at the coordinator node:
>
> Aug 13 10:24:30 SLES-SLOT-2 osafimmnd[17711]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
> Aug 13 10:24:30 SLES-SLOT-2 osafimmd[6462]: NO Successfully announced sync. New ruling epoch:19
> Aug 13 10:24:30 SLES-SLOT-2 osafimmloadd: NO Sync starting
> Aug 13 10:24:53 SLES-SLOT-2 osafimmloadd: IN Synced 421 objects in total
> Aug 13 10:24:53 SLES-SLOT-2 osafimmnd[17711]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 16063
> Aug 13 10:24:53 SLES-SLOT-2 osafimmloadd: NO Sync ending normally
> Aug 13 10:24:53 SLES-SLOT-2 osafimmd[6462]: NO MDTM: Not possible to send size:94807 TIPC multicast to svc_id: 25
> Aug 13 10:24:53 SLES-SLOT-2 osafimmd[6462]: NO MDTM: Not possible to send size:94807 TIPC multicast to svc_id: 25
> Aug 13 10:24:53 SLES-SLOT-2 osafimmd[6462]: NO MDTM: Not possible to send size:94807 TIPC multicast to svc_id: 25
> Aug 13 10:24:53 SLES-SLOT-2 osafimmd[6462]: NO MDTM: Not possible to send size:95147 TIPC multicast to svc_id: 25
> Aug 13 10:26:44 SLES-SLOT-2 syslog-ng[1190]: Log statistics; dropped='pipe(/dev/xconsole)=1825', dropped='pipe(/dev/tty10)=0', processed='center(queued)=54989', processed='center(received)=22910', processed='destination(messages)=22908', processed='destination(mailinfo)=2', processed='destination(mailwarn)=0', processed='destination(localmessages)=21235', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=3868', processed='destination(console)=3487', processed='destination(null)=0', processed='destination(mail)=2', processed='destination(xconsole)=3487', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=22910'
> Aug 13 10:32:29 SLES-SLOT-2 osafimmnd[17711]: WA MESSAGE:78692 OUT OF ORDER my highest processed:78690, exiting
> Aug 13 10:32:29 SLES-SLOT-2 osafimmpbed: WA PBE lost contact with parent IMMND - Exiting
> Aug 13 10:32:29 SLES-SLOT-2 osafamfnd[6533]: NO 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
> Aug 13 10:32:29 SLES-SLOT-2 osafamfnd[6533]: NO Restarting a component of 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
> Aug 13 10:32:29 SLES-SLOT-2 osafamfnd[6533]: NO 'safComp=IMMND,safSu=SC-2,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
> Aug 13 10:32:29 SLES-SLOT-2 osafimmd[6462]: WA IMMND coordinator at 2020f apparently crashed => electing new coord
> Aug 13 10:32:29 SLES-SLOT-2 osafimmd[6462]: ER Failed to find candidate for new IMMND coordinator
> Aug 13 10:32:29 SLES-SLOT-2 osafimmd[6462]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart
> Aug 13 10:32:29 SLES-SLOT-2 osafimmnd[6099]: Started
> Aug 13 10:32:29 SLES-SLOT-2 osafimmnd[6099]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added)
> Aug 13 10:32:29 SLES-SLOT-2 osafimmnd[6099]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
> Aug 13 10:32:30 SLES-SLOT-2 osafimmnd[6099]: NO IMMND received reset order from IMMD, but has just restarted - ignoring
> Aug 13 10:32:30 SLES-SLOT-2 osafimmd[6462]: ER IMM RELOAD => ensure cluster restart by IMMD exit at both SCs, exiting
> Aug 13 10:32:30 SLES-SLOT-2 osafamfnd[6533]: NO 'safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
> Aug 13 10:32:30 SLES-SLOT-2 osafamfnd[6533]: ER safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
> Aug 13 10:32:30 SLES-SLOT-2 osafamfnd[6533]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
>
> /Neel.
>
> On Thursday 07 August 2014 02:05 PM, mahesh.va...@oracle.com wrote:
> > Summary: mds: use TIPC multicast for MDS broadcast [#851]
> > Review request for Trac Ticket(s): #851
> > Peer Reviewer(s): Hans/Anders.widell/Anders.bj
> > Pull request to: <<LIST THE PERSON WITH PUSH ACCESS HERE>>
> > Affected branch(es): default
> > Development branch: default
> >
> > --------------------------------
> >  Impacted area        Impact y/n
> > --------------------------------
> >  Docs                     n
> >  Build system             n
> >  RPM/packaging            n
> >  Configuration files      n
> >  Startup scripts          n
> >  SAF services             n
> >  OpenSAF services         n
> >  Core libraries           y
> >  Samples                  n
> >  Tests                    n
> >  Other                    n
> >
> > Comments (indicate scope for each "y" above):
> > ---------------------------------------------
> >
> > changeset c6de7935bc53f5b96ee0cf604e29d2784c6e4fca
> > Author:      A V Mahesh <mahesh.va...@oracle.com>
> > Date:        Thu, 07 Aug 2014 12:13:04 +0530
> >
> >     mds: use TIPC multicast for MDS broadcast [#851]
> >
> >     Brief note on the multicast enhancement ticket:
> >     -------------------------------------------------------------------------
> >     Until now, the OpenSAF broadcast message was implemented as multiple
> >     unicasts, meaning a broadcast reached the N nodes at times T-1 to T-N
> >     depending on the number of nodes. After the #851 changes, the OpenSAF
> >     broadcast message uses TIPC multicast, so the message reaches all
> >     nodes at time T-1 irrespective of the number of nodes in the cluster.
> >
> >     This enhancement makes sure every receiver gets the broadcast message
> >     at the same instant in time. All nodes in the cluster should
> >     therefore take the same bring-up time, and it also eliminates the
> >     timing issues we were facing because of the multiple unicasts.
> >
> >     It also improves load time and sync time, because unnecessary load on
> >     the sending process is reduced.
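> >     In TIPC terms, the core of the change is the destination address
> >     type passed to sendto(). A minimal sketch (the helper name, socket
> >     setup and service-type value below are illustrative assumptions,
> >     not the actual MDS code):
> >
> >         #include <linux/tipc.h>
> >         #include <stdint.h>
> >         #include <string.h>
> >         #include <sys/socket.h>
> >         #include <sys/types.h>
> >
> >         /* One sendto() delivers the message to every port bound to the
> >          * service name range -- no per-destination unicast loop. */
> >         static ssize_t tipc_mcast_send(int sd, const void *buf, size_t len,
> >                                        uint32_t svc_type)
> >         {
> >                 struct sockaddr_tipc dst;
> >
> >                 memset(&dst, 0, sizeof(dst));
> >                 dst.family = AF_TIPC;
> >                 dst.addrtype = TIPC_ADDR_MCAST;   /* multicast, not unicast */
> >                 dst.addr.nameseq.type = svc_type; /* published name type */
> >                 dst.addr.nameseq.lower = 0;       /* whole instance range */
> >                 dst.addr.nameseq.upper = ~0U;
> >
> >                 return sendto(sd, buf, len, 0,
> >                               (struct sockaddr *)&dst, sizeof(dst));
> >         }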
> >     Code changes :
> >     -------------------------------------------------------------------------
> >
> >     - The code changes only affect the MDS TIPC transport; there are NO
> >       changes in the MDS TCP transport. The change comes with in-service
> >       upgrade support.
> >
> >     - The MDS TIPC transport sendto() for SENDTYPE BCAST & RBCAST now
> >       uses addrtype TIPC_ADDR_MCAST.
> >
> >     - A single multicast message can accommodate at most
> >       MDS_DIRECT_BUF_MAXSIZE (2^16) bytes.
> >
> >     - The MDS_SVC_INFO structure has a new variable that maintains the
> >       count of previous-OpenSAF-version subscribers for this service; it
> >       is updated on every subscribe/unsubscribe by a previous-version
> >       service.
> >
> >     - If the count is ZERO, all nodes in the cluster run the new OpenSAF
> >       version.
> >
> >     - If the count is greater than ZERO, the cluster has nodes with both
> >       old and new OpenSAF versions.
> >
> >     - If the count is ZERO, SENDTYPE BCAST & RBCAST messages are sent as
> >       a TIPC multicast; this is decided at service level.
> >
> >     - If the count is greater than ZERO, SENDTYPE BCAST & RBCAST messages
> >       are sent as the previous multiple unicasts; this is also decided at
> >       service level (see the sketch below).
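> >     In pseudo-C, the per-service decision looks roughly like this (only
> >     MDS_SVC_INFO is the real structure name; the field and helper names
> >     are assumptions for illustration):
> >
> >         /* Per-service delivery decision for SENDTYPE BCAST/RBCAST. */
> >         static uint32_t mds_route_bcast(const MDS_SVC_INFO *svc_info,
> >                                         const void *msg, size_t len)
> >         {
> >                 if (svc_info->prev_ver_subscr_count == 0)
> >                         /* All subscribers run the new OpenSAF version:
> >                          * one TIPC multicast reaches every node. */
> >                         return mdtm_send_tipc_mcast(msg, len);
> >
> >                 /* At least one old-version subscriber in the cluster:
> >                  * keep the legacy multiple-unicast delivery. */
> >                 return mdtm_send_multi_unicast(msg, len);
> >         }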
> >     Opensaf Services Code Tuning :
> >     -------------------------------------------------------------------------
> >     While adopting the new multicast messaging I came across the
> >     following OpenSAF integration issues in very complex test cases;
> >     under normal conditions/use cases these issues may not occur.
> >
> >     I have created tickets for all of them; these tickets need to be
> >     fixed for multicast to work without any issues even under complex
> >     conditions:
> >
> >     1. https://sourceforge.net/p/opensaf/tickets/952/
> >     2. https://sourceforge.net/p/opensaf/tickets/953/
> >     3. https://sourceforge.net/p/opensaf/tickets/954/
> >     4. https://sourceforge.net/p/opensaf/tickets/955/
> >     5. https://sourceforge.net/p/opensaf/tickets/946/
> >
> > changeset 5f86111a818ccff18f6a9cd93d12ac7b15b3a7d3
> > Author:      A V Mahesh <mahesh.va...@oracle.com>
> > Date:        Thu, 07 Aug 2014 12:14:11 +0530
> >
> >     imm: tune imm macros according to mds mcast size [#851]
> >
> >     - No functional changes.
> >
> >     - The IMM IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE is limited to 90% of
> >       MDS_DIRECT_BUF_MAXSIZE (2^16) to accommodate the IMM header data.
> >
> > changeset e446a201760b8e1e2674fa022b622af0aa5ce34f
> > Author:      A V Mahesh <mahesh.va...@oracle.com>
> > Date:        Thu, 07 Aug 2014 13:56:07 +0530
> >
> >     mds: support multicast for n_way configuration [#851]
> >
> >     - The existing OpenSAF broadcast message was implemented such that
> >       broadcast messages are sent only to the active subscribers, as
> >       multiple unicasts, assuming that the subscription list may contain
> >       both n_way_active and n_way subscribers.
> >
> >     - But currently we don't have any n_way configuration in the OpenSAF
> >       middleware, and so far we have not come across an application with
> >       an n_way configuration.
> >
> >     - With multicast messaging, TIPC sends a copy of the message to every
> >       port in the sender's cluster irrespective of the subscriber's role
> >       (assuming the application filters based on active/standby).
> >
> >     - So, to match the existing OpenSAF broadcast behaviour, I am
> >       providing this optional patch as well.
> >
> >     - If the community thinks this is required we will push this optional
> >       patch; otherwise we will attach it to the ticket for future use,
> >       when we support n_way.
> >
> >     Code changes :
> >
> >     Before sending a multicast message, query the subscribers for any
> >     standby (the n_way subscribers list). If a standby exists, send the
> >     old-type broadcast message as multiple unicasts, only to the actives;
> >     otherwise, in the n_way_active configuration case, send a multicast
> >     message (sketched below).
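> >     A minimal sketch of that guard (the helper names are assumptions,
> >     not the actual patch; MDS_SVC_ID is the real MDS type):
> >
> >         static uint32_t mds_bcast_with_n_way_guard(MDS_SVC_ID svc_id,
> >                                                    const void *msg,
> >                                                    size_t len)
> >         {
> >                 if (mds_svc_has_standby_subscriber(svc_id))
> >                         /* n_way: a standby subscriber exists, so keep
> >                          * the old behaviour and unicast to actives only. */
> >                         return mds_unicast_to_actives(svc_id, msg, len);
> >
> >                 /* n_way_active only: multicasting to every port is safe,
> >                  * the receivers filter on active/standby. */
> >                 return mdtm_send_tipc_mcast(msg, len);
> >         }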
> > Complete diffstat:
> > ------------------
> >  osaf/libs/common/immsv/include/immsv_api.h     |    6 ++--
> >  osaf/libs/core/mds/include/mds_core.h          |    5 ++++
> >  osaf/libs/core/mds/mds_c_api.c                 |   19 ++++++++++++--
> >  osaf/libs/core/mds/mds_c_db.c                  |   29 +++++++++++++++++++++++
> >  osaf/libs/core/mds/mds_c_sndrcv.c              |  126 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------
> >  osaf/libs/core/mds/mds_dt_tipc.c               |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  osaf/libs/core/mds/mds_main.c                  |    2 +-
> >  osaf/services/saf/immsv/immloadd/imm_loader.cc |    8 +++++-
> >  8 files changed, 208 insertions(+), 56 deletions(-)
> >
> > Testing Commands:
> > -----------------
> >     This patch has improved performance by ~50%.
> >
> >     I considered IMM concurrently syncing multiple payloads for
> >     measuring the benchmarks of the multicast messaging (this is one of
> >     the cases where multicast/broadcast is used with high frequency).
> >
> >     With the multicast changes, all restarted nodes now join the cluster
> >     within ~42 seconds, which used to take ~74 seconds when one million
> >     IMM objects are involved in the sync process.
> >
> >     Following are the benchmark figures:
> >
> >     OpenSAF 4.5 release (default staging) WITH the #851 multicast patch:
> >     -----------------------------------------------------------------------
> >     Jul 15 15:40:59 SLES-SLOT2 opensafd: Starting OpenSAF Services
> >     Jul 15 15:41:43 SLES-SLOT2 opensafd: OpenSAF(4.5.M0 - ) services successfully started  <=== 42 Seconds
> >
> >     Jul 15 15:40:58 SLES-SLOT3 opensafd: Starting OpenSAF Services
> >     Jul 15 15:41:42 SLES-SLOT3 opensafd: OpenSAF(4.5.M0 - ) services successfully started  <=== 44 Seconds
> >
> >     Jul 15 15:41:30 SLES-SLOT4 opensafd: Starting OpenSAF Services
> >     Jul 15 15:42:12 SLES-SLOT4 opensafd: OpenSAF(4.5.M0 - ) services successfully started  <=== 42 Seconds
> >     -----------------------------------------------------------------------
> >
> >     OpenSAF 4.5 release (default staging) WITHOUT the #851 multicast patch:
> >     ----------------------------------------------------------------------
> >     Jul 15 15:21:21 SLES-SLOT2 opensafd: Starting OpenSAF Services
> >     Jul 15 15:22:36 SLES-SLOT2 opensafd: OpenSAF(4.5.M0 - ) services successfully started  <=== 74 Seconds
> >
> >     Jul 15 15:21:20 SLES-SLOT3 opensafd: Starting OpenSAF Services
> >     Jul 15 15:22:34 SLES-SLOT3 opensafd: OpenSAF(4.5.M0 - ) services successfully started  <=== 74 Seconds
> >
> >     Jul 15 15:21:51 SLES-SLOT4 opensafd: Starting OpenSAF Services
> >     Jul 15 15:23:05 SLES-SLOT4 opensafd: OpenSAF(4.5.M0 - ) services successfully started  <=== 74 Seconds
> >     ------------------------------------------------------------------------
> >
> >     Note: you need to make sure the IMMND of each restarted node is in
> >     SERVER STATE `IMM_SERVER_SYNC_CLIENT` at the same point in time.
> >
> >     You can achieve this programmatically by hacking/commenting out the
> >     code `cb->mLostNodes--;` at line : 580 of the `immnd_proc.c` file.
> >
> > Testing, Expected Results:
> > --------------------------
> > <<PASTE COMMAND OUTPUTS / TEST RESULTS>>
> >
> > Conditions of Submission:
> > -------------------------
> > Ack from Reviewer
> >
> > Arch      Built   Started  Linux distro
> > -------------------------------------------
> > mips        n       n
> > mips64      n       n
> > x86         n       n
> > x86_64      y       y
> > powerpc     n       n
> > powerpc64   n       n
> >
> > Reviewer Checklist:
> > -------------------
> > [Submitters: make sure that your review doesn't trigger any checkmarks!]
> >
> > Your checkin has not passed review because (see checked entries):
> >
> > ___ Your RR template is generally incomplete; it has too many blank
> >     entries that need proper data filled in.
> >
> > ___ You have failed to nominate the proper persons for review and push.
> >
> > ___ Your patches do not have a proper short+long header.
> >
> > ___ You have grammar/spelling in your header that is unacceptable.
> >
> > ___ You have exceeded a sensible line length in your headers/comments/text.
> >
> > ___ You have failed to put a proper Trac Ticket # into your commits.
> >
> > ___ You have incorrectly put/left internal data in your comments/files
> >     (i.e. internal bug tracking tool IDs, product names etc.)
> >
> > ___ You have not given any evidence of testing beyond basic build tests.
> >     Demonstrate some level of runtime or other sanity testing.
> >
> > ___ You have ^M present in some of your files. These have to be removed.
> >
> > ___ You have needlessly changed whitespace or added whitespace crimes
> >     like trailing spaces, or spaces before tabs.
> >
> > ___ You have mixed real technical changes with whitespace and other
> >     cosmetic code cleanup changes. These have to be separate commits.
> >
> > ___ You need to refactor your submission into logical chunks; there is
> >     too much content in a single commit.
> >
> > ___ You have extraneous garbage in your review (merge commits etc.)
> >
> > ___ You have giant attachments which should never have been sent;
> >     instead you should place your content in a public tree to be pulled.
> >
> > ___ You have too many commits attached to an e-mail; resend as threaded
> >     commits, or place them in a public tree for a pull.
> >
> > ___ You have resent this content multiple times without a clear
> >     indication of what has changed between each re-send.
> >
> > ___ You have failed to adequately and individually address all of the
> >     comments and change requests that were proposed in the initial
> >     review.
> >
> > ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc.)
> >
> > ___ Your computer has a badly configured date and time, confusing the
> >     threaded patch review.
> >
> > ___ Your changes affect an IPC mechanism, and you don't present any
> >     results from an in-service upgradability test.
> >
> > ___ Your changes affect the user manual and documentation, but your
> >     patch series does not contain the patch that updates the Doxygen
> >     manual.
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel