After doing some deep debugging I am seeing the following in the MDS log 
on node B.  This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is sent 
from the active replica on node A to the replica on node B. The sync 
message never gets up to the CPND layer on node B because it is dropped.

This is with 10k sections, each section 1k.
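
For scale, here is a rough sketch of how much raw section data a full sync of this checkpoint has to carry. The helper name `raw_sync_bytes` is hypothetical, and the figure ignores all per-section and per-message overhead, so the real message is presumably somewhat larger.

```c
#include <stdint.h>

/* Hypothetical helper: raw section data carried by a full sync.
 * Ignores section IDs, headers and any encoding overhead, none of
 * which this sketch attempts to model. */
static uint64_t raw_sync_bytes(uint64_t sections, uint64_t section_size)
{
    return sections * section_size;
}
```

Calling `raw_sync_bytes(10000, 1024)` gives 10240000 bytes, which lines up with the `checkpointSize` of 10240000 shown in the quoted test output further down the thread.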

Jan  7 21:32:32.772347 <1789648919> ERR    |MDTM: Frag recd is not next frag so dropping adest=<0x010010023922604c>
Jan  7 21:32:32.772399 <1789648919> ERR    |MDTM: Message is dropped as msg is out of seq TRANSPOR-ID=<0x010010023922604c>

I've turned on MDS debug on node B, and the packet being sent over is 
gigantic.  It starts failing at fragment number 2703.  The next fragment 
that comes in is 2707, then 2722.  The last fragment that comes in is 7444.
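
The log messages suggest the receiver expects each fragment to be exactly the next one in sequence and drops the whole message otherwise. A minimal sketch of that kind of check follows. It is not the MDS code: `first_out_of_seq` is a hypothetical helper, and the rule "next fragment must be previous + 1" is an assumption drawn only from the log text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the MDS implementation.
 * Given fragment numbers in arrival order, return the index of the
 * first fragment that is not exactly the expected next number, or n
 * if every fragment arrived in sequence. `first` is the number the
 * first fragment is expected to carry. */
static size_t first_out_of_seq(const uint32_t *frags, size_t n, uint32_t first)
{
    uint32_t expected = first;

    for (size_t i = 0; i < n; i++) {
        if (frags[i] != expected)
            return i;   /* gap or reordering detected here */
        expected++;
    }
    return n;           /* no gap found */
}
```

With an illustrative arrival order of 2701, 2702, 2703, 2707, 2722 (assuming 2703 itself arrived in order) and `first` set to 2701, the helper returns index 3, the position of fragment 2707. A receiver using such a strict check would discard the message at that point.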

I've done a cursory look at the hardware stats, and nothing is being 
rate-limited or dropped.

I'm going to take a deeper look at this, but I'm mentioning it in case 
it rings any bells.  I am using TIPC as the transport.

Alex

On 01/07/2014 07:24 AM, Alex Jones wrote:
> AVM,
>
>     I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the 
> timeout value.  Is this not a bug?  The synchronous CheckpointOpen 
> call doesn't work at all in this scenario.  It never succeeds.
>
>     I can reproduce the problem with 
> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>
>     You should be able to reproduce the problem with the code I sent 
> in the last e-mail.
>
> Alex
>
> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>> Hi Alex,
>>
>> A CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug. It 
>> is expected if you pass too small a timeout value (`timeout = 1000000000`)
>> to the saCkptCheckpointOpen(...., timeout, ...) call when the ckpt has 
>> very large data/sections. Just increasing the timeout will avoid the 
>> SA_AIS_ERR_TIMEOUT.
>>
>> Let us focus on your original issue/scenario. Are you able to 
>> reproduce the problem with sectionCreationAttributes.expirationTime 
>> set to SA_TIME_ONE_DAY?
>>
>> -AVM
>>
>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>> AVM,
>>>
>>>     I've been playing around with your test program, and have gotten 
>>> it to fail.
>>>
>>>     I made the following changes:
>>>
>>>  1. Change init_dataX to be 1024k bytes, so that you are
>>>     initializing the section to be 1024k.
>>>  2. Also, don't start the program on node B until A has finished
>>>     writing/creating all the sections.
>>>  3. Before hitting the enter key on node B, wait for the OpenAsync
>>>     call to finish.
>>>
>>>     You might notice the CheckpointOpen call failing now with 
>>> SA_AIS_ERR_TIMEOUT.  I had to turn this into OpenAsync, and add a 
>>> thread to process CkptDispatch messages.  This uncovers another bug 
>>> in OpenAsync.  I've attached the mods to your program here.
>>>
>>>    The OpenAsync callback will be called twice, both times with 
>>> error == SA_AIS_ERR_TIMEOUT.  If I call OpenAsync again when I get 
>>> this error, the next callback returns success, but the callback gets 
>>> called twice with success and with two different checkpoint handles!
>>>
>>> Alex
>>>
>>>
>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>> Hi Alex,
>>>>
>>>> I have created 10K sections (please find the attached test
>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
>>>> with your specified scenario & configuration, and I haven't observed any
>>>> issue with the sections on the other node.
>>>>
>>>> Try to reproduce the problem on your setup & let me know the result.
>>>>
>>>> One more important point: what value did you configure for
>>>> `sectionCreationAttributes.expirationTime`?
>>>> I configured SA_TIME_ONE_DAY.
>>>>
>>>> Steps to run the application:
>>>>
>>>> ===================================================================================================================
>>>>
>>>> Compile :
>>>>
>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>
>>>>
>>>> Run :
>>>>
>>>> 1) saCkptCheckpointOpen On node A
>>>>
>>>> NODE-A# ./checkpoint_A
>>>>
>>>> CPSV:CPA:ONsaCkptSectionCreate  Waiting to Create Sections
>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>
>>>>
>>>> 2) saCkptCheckpointOpen() of the same ckpt on node B
>>>>
>>>> NODE-B# ./checkpoint_B
>>>>
>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
>>>> key to continue...
>>>>
>>>>
>>>> 3) saCkptSectionCreate() On node A  and read saCkptCheckpointStatusGet()
>>>>
>>>> NODE-A#
>>>>    checkpointStatus.numberOfSections : 10000
>>>>    checkpointStatus.memoryUsed :756000
>>>>     checkpointCreationAttributes.creationFlags;10
>>>>    checkpointCreationAttributes.checkpointSize;10240000
>>>>    checkpointCreationAttributes.retentionDuration;60000000000
>>>>    checkpointCreationAttributes.maxSections;10000
>>>>    checkpointCreationAttributes.maxSectionSize;1024
>>>>    checkpointCreationAttributes.maxSectionIdSize;64
>>>>    ================================
>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>> <Enter> key to continue...
>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>
>>>>
>>>> 4) saCkptActiveReplicaSet() on node B and read saCkptCheckpointStatusGet()
>>>>
>>>> NODE-B#
>>>>    checkpointStatus.numberOfSections : 10000
>>>>    checkpointStatus.memoryUsed :756000
>>>>     checkpointCreationAttributes.creationFlags;10
>>>>    checkpointCreationAttributes.checkpointSize;10240000
>>>>    checkpointCreationAttributes.retentionDuration;60000000000
>>>>    checkpointCreationAttributes.maxSections;10000
>>>>    checkpointCreationAttributes.maxSectionSize;1024
>>>>    checkpointCreationAttributes.maxSectionIdSize;64
>>>>
>>>>    saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>> <Enter> key to continue...
>>>>    saCkptCheckpoint Press <Enter> key to continue..
>>>>
>>>> ================================================================================================================================
>>>>
>>>> -AVM
>>>>
>>>>
>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>> Hi Alex,
>>>>>
>>>>> We have never tested 7500 sections; we will test and let you know.
>>>>> Can you please share your test application?
>>>>> That will allow us to respond quickly.
>>>>>
>>>>> -AVM
>>>>>
>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>> Hello All,
>>>>>>
>>>>>>       I'm experimenting with the checkpoint service, and some things
>>>>>> don't appear to work.
>>>>>>
>>>>>>       The saCkptActiveReplicaSet and
>>>>>> saCkptCheckpointSynchronize[Async] calls don't appear to work when the
>>>>>> checkpoint has more than about 5500 sections.
>>>>>>
>>>>>>       I've created a checkpoint with 7500 sections, each section being
>>>>>> 1024 bytes.  The checkpoint is co-located and the "active replica"
>>>>>> bit is set.
>>>>>>
>>>>>>       I can create and write all the sections.  And from another node
>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
>>>>>> Everything is there.  I see no errors from any CKPT API calls.
>>>>>>
>>>>>>       The problem comes when I call saCkptActiveReplicaSet from this
>>>>>> other node.  After I do this, saCkptCheckpointStatusGet now returns
>>>>>> all the same information except the number of sections is no longer
>>>>>> 7500 but 0.  If I do this test with 50,000 sections only about 3,000
>>>>>> entries get synced.  And iterating through the sections shows that
>>>>>> there are only 3,000 sections.
>>>>>>
>>>>>>       Calling saCkptCheckpointSynchronize[Async] in this situation has
>>>>>> no effect, either.
>>>>>>
>>>>>>       After looking through the code I see a comment in
>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
>>>>>> with old active if now this fellow is becoming active. */"  So, it
>>>>>> doesn't appear that syncing is being done in the
>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>
>>>>>>       Can someone comment?
>>>>>>
>>>>>>       I'm going to fix this and post a patch unless someone else is
>>>>>> already working on it, but I didn't see a bug for it.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Opensaf-devel mailing list
>>>>>> Opensaf-devel@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>
>>
>
