Hi Alex,

On 1/8/2014 4:16 AM, Alex Jones wrote:
> I can create and write all the sections. And from another node
> I run saCkptCheckpointStatusGet, and the information all looks good.
> Everything is there. I see no errors from any CKPT API calls.
>
> The problem comes when I call saCkptActiveReplicaSet from this
> other node. After I do this, saCkptCheckpointStatusGet now returns
> all the same information except the number of sections is no longer
> 7500 but 0. If I do this test with 50,000 sections only about 3,000
> entries get synced. And iterating through the sections shows that
> there are only 3,000 sections.

Let us go through these one by one. Let me first address your initial
issue: you open the checkpoint on both nodes first, create and write all
the sections on node A, and then call saCkptActiveReplicaSet from the
other node, B. After that, saCkptCheckpointStatusGet on node B returns
all the same information, except that the number of sections is no
longer 7500 but 0.

Is this problem reproducible with
sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY?

Let me first resolve your initial problem. That fix will address the
synchronization issue between the nodes, so most of the other issues
should be resolved as well.

-AVM

On 1/7/2014 5:54 PM, Alex Jones wrote:
> AVM,
>
> I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
> timeout value. Is this not a bug? The synchronous CheckpointOpen
> call doesn't work at all in this scenario. It never succeeds.
>
> I can reproduce the problem with
> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>
> You should be able to reproduce the problem with the code I sent
> in the last e-mail.
>
> Alex
>
> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>> Hi Alex,
>>
>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug.
>> It is expected if you pass a small timeout value
>> (`timeout = 1000000000`) to the saCkptCheckpointOpen(....,timeout ...)
>> call when the ckpt has very large data/sections. Just increasing the
>> timeout will avoid the SA_AIS_ERR_TIMEOUT.
>>
>> Let us focus on your original issue/scenario. Are you able to
>> reproduce the problem with sectionCreationAttributes.expirationTime
>> set to SA_TIME_ONE_DAY?
>>
>> -AVM
>>
>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>> AVM,
>>>
>>> I've been playing around with your test program, and have gotten
>>> it to fail.
>>>
>>> I made the following changes:
>>>
>>> 1. Change init_dataX to be 1024k bytes, so that you are
>>>    initializing the section to be 1024k.
>>> 2. Also, don't start the program on node B until A has finished
>>>    writing/creating all the sections.
>>> 3. Before hitting the enter key on node B, wait for the OpenAsync
>>>    call to finish.
>>>
>>> You might notice the CheckpointOpen call failing now with
>>> SA_AIS_ERR_TIMEOUT. I had to turn this into OpenAsync, and add a
>>> thread to process CkptDispatch messages. This uncovers another bug
>>> in OpenAsync. I've attached the mods to your program here.
>>>
>>> The OpenAsync callback will be called twice, both times with
>>> error == SA_AIS_ERR_TIMEOUT. If I call OpenAsync again when I get
>>> this error, the next callback returns success, but the callback gets
>>> called twice with success and with two different checkpoint handles!
>>>
>>> Alex
>>>
>>>
>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>> Hi Alex,
>>>>
>>>> I have created 10K sections (please find the attached test
>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
>>>> with your specified scenario & configuration, and I haven't observed
>>>> any issue with sections on the other node.
>>>>
>>>> Try to reproduce the problem on your setup & let me know the result.
>>>>
>>>> One more important point: what did you configure for
>>>> `sectionCreationAttributes.expirationTime`?
>>>> I configured SA_TIME_ONE_DAY.
>>>>
>>>> Steps to run the application:
>>>>
>>>> ====================================================================
>>>>
>>>> Compile :
>>>>
>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>
>>>>
>>>> Run :
>>>>
>>>> 1) saCkptCheckpointOpen On node A
>>>>
>>>> NODE-A# ./checkpoint_A
>>>>
>>>> CPSV:CPA:ONsaCkptSectionCreate Waiting to Create Sections
>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>
>>>> 2) saCkptCheckpointOpen() same ckpt On node B
>>>>
>>>> NODE-B# ./checkpoint_B
>>>>
>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
>>>> key to continue...
>>>>
>>>>
>>>> 3) saCkptSectionCreate() on node A, then read saCkptCheckpointStatusGet()
>>>>
>>>> NODE-A#
>>>> checkpointStatus.numberOfSections : 10000
>>>> checkpointStatus.memoryUsed :756000
>>>> checkpointCreationAttributes.creationFlags;10
>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>> checkpointCreationAttributes.maxSections;10000
>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>> ================================
>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>> <Enter> key to continue...
>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>
>>>>
>>>> 4) saCkptActiveReplicaSet() on node B, then read saCkptCheckpointStatusGet()
>>>>
>>>> NODE-B#
>>>> checkpointStatus.numberOfSections : 10000
>>>> checkpointStatus.memoryUsed :756000
>>>> checkpointCreationAttributes.creationFlags;10
>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>> checkpointCreationAttributes.maxSections;10000
>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>
>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>> <Enter> key to continue...
>>>> saCkptCheckpoint Press <Enter> key to continue..
>>>>
>>>> ====================================================================
>>>>
>>>> -AVM
>>>>
>>>>
>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>> Hi Alex,
>>>>>
>>>>> We have never tested with 7500 sections. We will test and let you
>>>>> know. Can you please share your test application? That will allow
>>>>> us to respond quickly.
>>>>>
>>>>> -AVM
>>>>>
>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>> Hello All,
>>>>>>
>>>>>> I'm experimenting with the checkpoint service, and some things
>>>>>> don't appear to work.
>>>>>>
>>>>>> The saCkptActiveReplicaSet and
>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
>>>>>> checkpoint has section numbers greater than around 5500.
>>>>>>
>>>>>> I've created a checkpoint with 7500 sections, each section being
>>>>>> 1024 bytes. The checkpoint is co-located and the "active replica"
>>>>>> bit is set.
>>>>>>
>>>>>> I can create and write all the sections. And from another node
>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
>>>>>> Everything is there. I see no errors from any CKPT API calls.
>>>>>>
>>>>>> The problem comes when I call saCkptActiveReplicaSet from this
>>>>>> other node. After I do this, saCkptCheckpointStatusGet now returns
>>>>>> all the same information except the number of sections is no longer
>>>>>> 7500 but 0. If I do this test with 50,000 sections only about 3,000
>>>>>> entries get synced. And iterating through the sections shows that
>>>>>> there are only 3,000 sections.
>>>>>>
>>>>>> Calling saCkptCheckpointSynchronize[Async] in this situation has
>>>>>> no effect, either.
>>>>>>
>>>>>> After looking through the code I see a comment in
>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
>>>>>> with old active if now this fellow is becoming active. */" So, it
>>>>>> doesn't appear that syncing is being done in the
>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>
>>>>>> Can someone comment?
>>>>>>
>>>>>> I'm going to fix this and post a patch unless someone else is
>>>>>> already working on it, but I didn't see a bug for it.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> _______________________________________________
>>>>>> Opensaf-devel mailing list
>>>>>> Opensaf-devel@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel