Check the current values:

sysctl -a | grep rmem

Then set net.core.rmem_default to 256K or so.

/Hans

> -----Original Message-----
> From: Hans Feldt [mailto:[email protected]]
> Sent: 8 January 2014 14:01
> To: A V Mahesh; Alex Jones
> Cc: [email protected]
> Subject: Re: [devel] checkpoint problems
> 
> The socket receive buffer size used is the system default. It can be too
> small, so pump it up.
> I plan to make some changes in MDS for this (and other stuff).
> /Hans
> 
> > -----Original Message-----
> > From: A V Mahesh [mailto:[email protected]]
> > Sent: 8 January 2014 11:29
> > To: Alex Jones
> > Cc: [email protected]
> > Subject: Re: [devel] checkpoint problems
> >
> > Hi Alex,
> >
> > I suggest you increase the following TIPC value (in the TIPC code)
> > and rebuild `tipc.ko`:
> >
> > net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE      5000
> >
> > You can increase it to 50000 and try again.
> >
> > - AVM.
> >
> > On 1/8/2014 4:16 AM, Alex Jones wrote:
> > > After doing some deep debugging I am seeing the following in the MDS
> > > log on node B.  This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
> > > sent from the active replica on node A to the replica on node B.  The
> > > sync message never gets up to the CPND layer on node B because it is
> > > dropped.
> > >
> > > This is with 10k sections, each section 1k.
> > >
> > > Jan  7 21:32:32.772347 <1789648919> ERR    |MDTM: Frag recd is not
> > > next frag so dropping adest=<0x010010023922604c>
> > > Jan  7 21:32:32.772399 <1789648919> ERR    |MDTM: Message is dropped
> > > as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
> > >
> > > I've turned on MDS debug on node B, and the packet being sent over is
> > > gigantic.  It starts failing at fragment number 2703.  The next
> > > fragment that comes in is 2707, then 2722.  The last fragment that
> > > comes in is 7444.
> > >
> > > I've done a cursory look at the hardware stats, and nothing is being
> > > rate-limited or dropped.
> > >
> > > I'm going to take a deeper look at this, but I'm mentioning it in case
> > > it rings any bells.  I am using TIPC as the transport.
> > >
> > > Alex
> > >
> > > On 01/07/2014 07:24 AM, Alex Jones wrote:
> > >> AVM,
> > >>
> > >>     I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
> > >> timeout value.  Is this not a bug?  The synchronous CheckpointOpen
> > >> call doesn't work at all in this scenario.  It never succeeds.
> > >>
> > >>     I can reproduce the problem with
> > >> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
> > >>
> > >>     You should be able to reproduce the problem with the code I sent
> > >> in the last e-mail.
> > >>
> > >> Alex
> > >>
> > >> On 01/06/2014 10:31 PM, A V Mahesh wrote:
> > >>> Hi Alex,
> > >>>
> > >>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug;
> > >>> it is expected if you pass a small timeout value
> > >>> (`timeout = 1000000000`) to the saCkptCheckpointOpen(....,timeout ...)
> > >>> call when the ckpt has very large data/sections. Just increasing the
> > >>> timeout will avoid the SA_AIS_ERR_TIMEOUT.
> > >>>
> > >>> Let us focus on your original issue/scenario: are you able to
> > >>> reproduce the problem with sectionCreationAttributes.expirationTime
> > >>> set to SA_TIME_ONE_DAY?
> > >>>
> > >>> -AVM
> > >>>
> > >>> On 1/7/2014 1:17 AM, Alex Jones wrote:
> > >>>> AVM,
> > >>>>
> > >>>>     I've been playing around with your test program, and have
> > >>>> gotten it to fail.
> > >>>>
> > >>>>     I made the following changes:
> > >>>>
> > >>>>  1. Change init_dataX to be 1024k bytes, so that you are
> > >>>>     initializing the section to be 1024k.
> > >>>>  2. Also, don't start the program on node B until A has finished
> > >>>>     writing/creating all the sections.
> > >>>>  3. Before hitting the enter key on node B, wait for the OpenAsync
> > >>>>     call to finish.
> > >>>>
> > >>>>     You might notice the CheckpointOpen call failing now with
> > >>>> SA_AIS_ERR_TIMEOUT.  I had to turn this into OpenAsync, and add a
> > >>>> thread to process CkptDispatch messages.  This uncovers another bug
> > >>>> in OpenAsync.  I've attached the mods to your program here.
> > >>>>
> > >>>>    The OpenAsync callback will be called twice, both times with
> > >>>> error == SA_AIS_ERR_TIMEOUT.  If I call OpenAsync again when I get
> > >>>> this error, the next callback returns success, but the callback
> > >>>> gets called twice with success and with two different checkpoint
> > >>>> handles!
> > >>>>
> > >>>> Alex
> > >>>>
> > >>>>
> > >>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
> > >>>>> Hi Alex,
> > >>>>>
> > >>>>> I have created 10K sections (please find the attached test
> > >>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
> > >>>>> with your specified scenario & configuration, and I haven't observed
> > >>>>> any issue with the sections on the other node.
> > >>>>>
> > >>>>> Try to reproduce the problem on your setup & let me know the result.
> > >>>>>
> > >>>>> One more important point: what value did you configure for
> > >>>>> `sectionCreationAttributes.expirationTime`?
> > >>>>> I configured SA_TIME_ONE_DAY.
> > >>>>>
> > >>>>> Steps to run the application:
> > >>>>>
> > >>>>>
> > >>>>> ======================================================================
> > >>>>>
> > >>>>> Compile :
> > >>>>>
> > >>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
> > >>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
> > >>>>>
> > >>>>>
> > >>>>> Run :
> > >>>>>
> > >>>>> 1) saCkptCheckpointOpen On node A
> > >>>>>
> > >>>>> NODE-A# ./checkpoint_A
> > >>>>>
> > >>>>> CPSV:CPA:ONsaCkptSectionCreate  Waiting to Create Sections
> > >>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
> > >>>>> saCkptSectionCreate Press <Enter> key to continue...
> > >>>>>
> > >>>>>
> > >>>>> 2) saCkptCheckpointOpen() same ckpt On node B
> > >>>>>
> > >>>>> NODE-B# ./checkpoint_B
> > >>>>>
> > >>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
> > >>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
> > >>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
> > >>>>> key to continue...
> > >>>>>
> > >>>>>
> > >>>>> 3) saCkptSectionCreate() On node A and read saCkptCheckpointStatusGet()
> > >>>>>
> > >>>>> NODE-A#
> > >>>>>    checkpointStatus.numberOfSections : 10000
> > >>>>>    checkpointStatus.memoryUsed :756000
> > >>>>>     checkpointCreationAttributes.creationFlags;10
> > >>>>>    checkpointCreationAttributes.checkpointSize;10240000
> > >>>>>    checkpointCreationAttributes.retentionDuration;60000000000
> > >>>>>    checkpointCreationAttributes.maxSections;10000
> > >>>>>    checkpointCreationAttributes.maxSectionSize;1024
> > >>>>>    checkpointCreationAttributes.maxSectionIdSize;64
> > >>>>>    ================================
> > >>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
> > >>>>> <Enter> key to continue...
> > >>>>> saCkptCheckpoint Press <Enter> key to continue...
> > >>>>>
> > >>>>>
> > >>>>> 4) saCkptActiveReplicaSet() On node B and read saCkptCheckpointStatusGet()
> > >>>>>
> > >>>>> NODE-B#
> > >>>>>    checkpointStatus.numberOfSections : 10000
> > >>>>>    checkpointStatus.memoryUsed :756000
> > >>>>>     checkpointCreationAttributes.creationFlags;10
> > >>>>>    checkpointCreationAttributes.checkpointSize;10240000
> > >>>>>    checkpointCreationAttributes.retentionDuration;60000000000
> > >>>>>    checkpointCreationAttributes.maxSections;10000
> > >>>>>    checkpointCreationAttributes.maxSectionSize;1024
> > >>>>>    checkpointCreationAttributes.maxSectionIdSize;64
> > >>>>>
> > >>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
> > >>>>> <Enter> key to continue...
> > >>>>> saCkptCheckpoint Press <Enter> key to continue..
> > >>>>>
> > >>>>>
> > >>>>> ======================================================================
> > >>>>>
> > >>>>> -AVM
> > >>>>>
> > >>>>>
> > >>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
> > >>>>>> Hi Alex,
> > >>>>>>
> > >>>>>> We have never tested with 7500 sections; we will test and let you know.
> > >>>>>> Can you please share your test application? That would allow us to
> > >>>>>> respond quickly.
> > >>>>>>
> > >>>>>> -AVM
> > >>>>>>
> > >>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
> > >>>>>>> Hello All,
> > >>>>>>>
> > >>>>>>>       I'm experimenting with the checkpoint service, and some things
> > >>>>>>> don't appear to work.
> > >>>>>>>
> > >>>>>>>       The saCkptActiveReplicaSet and
> > >>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
> > >>>>>>> checkpoint has section numbers greater than around 5500.
> > >>>>>>>
> > >>>>>>>       I've created a checkpoint with 7500 sections, each section being
> > >>>>>>> 1024 bytes.  The checkpoint is co-located and the "active replica"
> > >>>>>>> bit is set.
> > >>>>>>>
> > >>>>>>>       I can create and write all the sections.  And from another node
> > >>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
> > >>>>>>> Everything is there.  I see no errors from any CKPT API calls.
> > >>>>>>>
> > >>>>>>>       The problem comes when I call saCkptActiveReplicaSet from this
> > >>>>>>> other node.  After I do this, saCkptCheckpointStatusGet now returns
> > >>>>>>> all the same information except the number of sections is no longer
> > >>>>>>> 7500 but 0.  If I do this test with 50,000 sections only about 3,000
> > >>>>>>> entries get synced.  And iterating through the sections shows that
> > >>>>>>> there are only 3,000 sections.
> > >>>>>>>
> > >>>>>>>       Calling saCkptCheckpointSynchronize[Async] in this situation has
> > >>>>>>> no effect, either.
> > >>>>>>>
> > >>>>>>>       After looking through the code I see a comment in
> > >>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
> > >>>>>>> with old active if now this fellow is becoming active. */"  So, it
> > >>>>>>> doesn't appear that syncing is being done in the
> > >>>>>>> saCkptActiveReplicaSet, which it should be.
> > >>>>>>>
> > >>>>>>>       Can someone comment?
> > >>>>>>>
> > >>>>>>>       I'm going to fix this and post a patch unless someone else is
> > >>>>>>> already working on it, but I didn't see a bug for it.
> > >>>>>>>
> > >>>>>>> Alex
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> ------------------------------------------------------------------------------
> > >>>>>>>
> > >>>>>>> Rapidly troubleshoot problems before they affect your business. Most IT
> > >>>>>>> organizations don't have a clear picture of how application performance
> > >>>>>>> affects their revenue. With AppDynamics, you get 100% visibility into your
> > >>>>>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
> > >>>>>>> AppDynamics Pro!
> > >>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> Opensaf-devel mailing list
> > >>>>>>> [email protected]
> > >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
> > >>>>
> > >>>
> > >>
> > >
> >
> 
