Check the current values with `sysctl -a | grep rmem`, then set
net.core.rmem_default to 256K or so.
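
If changing the system-wide default is not an option, the receive buffer can also be raised per socket. What follows is only a minimal sketch in C under stated assumptions: `set_rcvbuf` is a hypothetical helper, not an OpenSAF or MDS function, and nothing here reflects how MDS actually opens its TIPC sockets. On Linux the request is capped at net.core.rmem_max, and the value read back is roughly double what was granted because the kernel accounts for its own bookkeeping overhead.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not part of OpenSAF/MDS): ask the kernel for a
 * receive buffer of `bytes` on socket `fd` and return the size the kernel
 * reports afterwards, or -1 on error. */
int set_rcvbuf(int fd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    /* Read the value back: the request may have been capped by rmem_max. */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0) {
        perror("getsockopt(SO_RCVBUF)");
        return -1;
    }
    return granted;
}
```

Calling `set_rcvbuf(fd, 256 * 1024)` right after `socket()` and logging the return value would show whether net.core.rmem_max needs raising as well.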
/Hans

> -----Original Message-----
> From: Hans Feldt [mailto:[email protected]]
> Sent: 8 January 2014 14:01
> To: A V Mahesh; Alex Jones
> Cc: [email protected]
> Subject: Re: [devel] checkpoint problems
>
> The socket receive buffer size used is the system default. It can be too
> small, pump it up.
> I plan to do some change in MDS for this (and other stuff).
> /Hans
>
> > -----Original Message-----
> > From: A V Mahesh [mailto:[email protected]]
> > Sent: 8 January 2014 11:29
> > To: Alex Jones
> > Cc: [email protected]
> > Subject: Re: [devel] checkpoint problems
> >
> > Hi Alex,
> >
> > I suggest you increase the following TIPC value (in the TIPC code),
> > rebuild `tipc.ko`, and try again:
> >
> > net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE 5000
> >
> > You can increase it to 50000 and try again.
> >
> > - AVM.
> >
> > On 1/8/2014 4:16 AM, Alex Jones wrote:
> > > After doing some deep debugging I am seeing the following in the MDS
> > > log on node B. This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
> > > sent from the active replica on node A to the replica on node B. The
> > > sync message never gets up to the CPND layer on node B because it is
> > > dropped.
> > >
> > > This is with 10k sections, each section 1k.
> > >
> > > Jan 7 21:32:32.772347 <1789648919> ERR |MDTM: Frag recd is not
> > > next frag so dropping adest=<0x010010023922604c>
> > > Jan 7 21:32:32.772399 <1789648919> ERR |MDTM: Message is dropped
> > > as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
> > >
> > > I've turned on MDS debug on node B, and the packet being sent over is
> > > gigantic. It starts failing at fragment number 2703. The next
> > > fragment that comes in is 2707, then 2722. The last fragment that
> > > comes in is 7444.
> > >
> > > I've done a cursory look at the hardware stats, and nothing is being
> > > rate-limited or dropped.
> > >
> > > I'm going to take a deeper look at this, but I'm mentioning it in case
> > > it rings any bells.
> > > I am using TIPC as the transport.
> > >
> > > Alex
> > >
> > > On 01/07/2014 07:24 AM, Alex Jones wrote:
> > >> AVM,
> > >>
> > >> I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
> > >> timeout value. Is this not a bug? The synchronous CheckpointOpen
> > >> call doesn't work at all in this scenario. It never succeeds.
> > >>
> > >> I can reproduce the problem with
> > >> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
> > >>
> > >> You should be able to reproduce the problem with the code I sent
> > >> in the last e-mail.
> > >>
> > >> Alex
> > >>
> > >> On 01/06/2014 10:31 PM, A V Mahesh wrote:
> > >>> Hi Alex,
> > >>>
> > >>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a
> > >>> bug; it is expected if you pass a small timeout value
> > >>> (`timeout = 1000000000`) to the saCkptCheckpointOpen(....,timeout ...)
> > >>> call when the ckpt has very large data/sections. Just increasing the
> > >>> timeout will avoid the SA_AIS_ERR_TIMEOUT.
> > >>>
> > >>> Let us focus on your original issue/scenario: are you able to
> > >>> reproduce the problem with sectionCreationAttributes.expirationTime
> > >>> set to SA_TIME_ONE_DAY?
> > >>>
> > >>> -AVM
> > >>>
> > >>> On 1/7/2014 1:17 AM, Alex Jones wrote:
> > >>>> AVM,
> > >>>>
> > >>>> I've been playing around with your test program, and have
> > >>>> gotten it to fail.
> > >>>>
> > >>>> I made the following changes:
> > >>>>
> > >>>> 1. Change init_dataX to be 1024k bytes, so that you are
> > >>>> initializing the section to be 1024k.
> > >>>> 2. Also, don't start the program on node B until A has finished
> > >>>> writing/creating all the sections.
> > >>>> 3. Before hitting the enter key on node B, wait for the OpenAsync
> > >>>> call to finish.
> > >>>>
> > >>>> You might notice the CheckpointOpen call failing now with
> > >>>> SA_AIS_ERR_TIMEOUT. I had to turn this into OpenAsync, and add a
> > >>>> thread to process CkptDispatch messages. This uncovers another bug
> > >>>> in OpenAsync.
> > >>>> I've attached the mods to your program here.
> > >>>>
> > >>>> The OpenAsync callback will be called twice, both times with
> > >>>> error == SA_AIS_ERR_TIMEOUT. If I call OpenAsync again when I get
> > >>>> this error, the next callback returns success, but the callback
> > >>>> gets called twice with success and with two different checkpoint
> > >>>> handles!
> > >>>>
> > >>>> Alex
> > >>>>
> > >>>>
> > >>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
> > >>>>> Hi Alex,
> > >>>>>
> > >>>>> I have created 10K sections (please find the attached test
> > >>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
> > >>>>> with your specified scenario & configuration, and I haven't observed
> > >>>>> any issue with sections on another node.
> > >>>>>
> > >>>>> Try to reproduce the problem on your setup & let me know the result.
> > >>>>>
> > >>>>> One more important point: what value did you configure for
> > >>>>> `sectionCreationAttributes.expirationTime`?
> > >>>>> I configured SA_TIME_ONE_DAY.
> > >>>>>
> > >>>>> Steps to run the application:
> > >>>>>
> > >>>>> ======================================================================
> > >>>>>
> > >>>>> Compile :
> > >>>>>
> > >>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
> > >>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
> > >>>>>
> > >>>>>
> > >>>>> Run :
> > >>>>>
> > >>>>> 1) saCkptCheckpointOpen On node A
> > >>>>>
> > >>>>> NODE-A# ./checkpoint_A
> > >>>>>
> > >>>>> CPSV:CPA:ONsaCkptSectionCreate Waiting to Create Sections
> > >>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
> > >>>>> saCkptSectionCreate Press <Enter> key to continue...
> > >>>>>
> > >>>>> 2) saCkptCheckpointOpen() same ckpt On node B
> > >>>>>
> > >>>>> NODE-B# ./checkpoint_B
> > >>>>>
> > >>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
> > >>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
> > >>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
> > >>>>> key to continue...
> > >>>>>
> > >>>>>
> > >>>>> 3) saCkptSectionCreate() On node A and read
> > >>>>> saCkptCheckpointStatusGet()
> > >>>>>
> > >>>>> NODE-A#
> > >>>>> checkpointStatus.numberOfSections : 10000
> > >>>>> checkpointStatus.memoryUsed :756000
> > >>>>> checkpointCreationAttributes.creationFlags;10
> > >>>>> checkpointCreationAttributes.checkpointSize;10240000
> > >>>>> checkpointCreationAttributes.retentionDuration;60000000000
> > >>>>> checkpointCreationAttributes.maxSections;10000
> > >>>>> checkpointCreationAttributes.maxSectionSize;1024
> > >>>>> checkpointCreationAttributes.maxSectionIdSize;64
> > >>>>> ================================
> > >>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
> > >>>>> <Enter> key to continue...
> > >>>>> saCkptCheckpoint Press <Enter> key to continue...
> > >>>>>
> > >>>>>
> > >>>>> 4) saCkptActiveReplicaSet() & On node B and
> > >>>>> saCkptCheckpointStatusGet()
> > >>>>>
> > >>>>> NODE-B#
> > >>>>> checkpointStatus.numberOfSections : 10000
> > >>>>> checkpointStatus.memoryUsed :756000
> > >>>>> checkpointCreationAttributes.creationFlags;10
> > >>>>> checkpointCreationAttributes.checkpointSize;10240000
> > >>>>> checkpointCreationAttributes.retentionDuration;60000000000
> > >>>>> checkpointCreationAttributes.maxSections;10000
> > >>>>> checkpointCreationAttributes.maxSectionSize;1024
> > >>>>> checkpointCreationAttributes.maxSectionIdSize;64
> > >>>>>
> > >>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize
> > >>>>> Press <Enter> key to continue...
> > >>>>> saCkptCheckpoint Press <Enter> key to continue...
> > >>>>>
> > >>>>> ======================================================================
> > >>>>>
> > >>>>> -AVM
> > >>>>>
> > >>>>>
> > >>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
> > >>>>>> Hi Alex,
> > >>>>>>
> > >>>>>> We never tested 7500 sections; we will test and let you know.
> > >>>>>> Can you please share your test application? That will allow us
> > >>>>>> to respond quickly.
> > >>>>>>
> > >>>>>> -AVM
> > >>>>>>
> > >>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
> > >>>>>>> Hello All,
> > >>>>>>>
> > >>>>>>> I'm experimenting with the checkpoint service, and some things
> > >>>>>>> don't appear to work.
> > >>>>>>>
> > >>>>>>> The saCkptActiveReplicaSet and
> > >>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
> > >>>>>>> checkpoint has section numbers greater than around 5500.
> > >>>>>>>
> > >>>>>>> I've created a checkpoint with 7500 sections, each section
> > >>>>>>> being 1024 bytes. The checkpoint is co-located and the "active
> > >>>>>>> replica" bit is set.
> > >>>>>>>
> > >>>>>>> I can create and write all the sections. And from another
> > >>>>>>> node I run saCkptCheckpointStatusGet, and the information all
> > >>>>>>> looks good. Everything is there. I see no errors from any CKPT
> > >>>>>>> API calls.
> > >>>>>>>
> > >>>>>>> The problem comes when I call saCkptActiveReplicaSet from this
> > >>>>>>> other node. After I do this, saCkptCheckpointStatusGet now returns
> > >>>>>>> all the same information except the number of sections is no longer
> > >>>>>>> 7500 but 0. If I do this test with 50,000 sections only about 3,000
> > >>>>>>> entries get synced. And iterating through the sections shows that
> > >>>>>>> there are only 3,000 sections.
> > >>>>>>>
> > >>>>>>> Calling saCkptCheckpointSynchronize[Async] in this situation
> > >>>>>>> has no effect, either.
> > >>>>>>>
> > >>>>>>> After looking through the code I see a comment in
> > >>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
> > >>>>>>> with old active if now this fellow is becoming active. */" So, it
> > >>>>>>> doesn't appear that syncing is being done in the
> > >>>>>>> saCkptActiveReplicaSet, which it should be.
> > >>>>>>>
> > >>>>>>> Can someone comment?
> > >>>>>>>
> > >>>>>>> I'm going to fix this and post a patch unless someone else is
> > >>>>>>> already working on it, but I didn't see a bug for it.
> > >>>>>>>
> > >>>>>>> Alex
> > >>>>>>>
> > >>>>>>> ------------------------------------------------------------------------------
> > >>>>>>> Rapidly troubleshoot problems before they affect your business.
> > >>>>>>> Most IT organizations don't have a clear picture of how application
> > >>>>>>> performance affects their revenue. With AppDynamics, you get 100%
> > >>>>>>> visibility into your Java,.NET, & PHP application. Start your 15-day
> > >>>>>>> FREE TRIAL of AppDynamics Pro!
> > >>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
> > >>>>>>> _______________________________________________
> > >>>>>>> Opensaf-devel mailing list
> > >>>>>>> [email protected]
> > >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
