Hi Alex,

I suggest you increase the following TIPC value (in the TIPC kernel code), 
rebuild `tipc.ko`, and try again:

net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE      5000

You can increase it to 50000 and try again.
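
To illustrate why a single fragment lost in the transport can cost the whole sync message (the "Frag recd is not next frag" and "msg is out of seq" errors quoted below), here is a minimal sketch in C of a strictly in-order reassembler. It is only a model of the behaviour the log suggests, not the actual MDS code; the names `reasm_state`, `reasm_init` and `reasm_accept` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reassembly state for one sender; not the MDS data structure. */
struct reasm_state {
    uint32_t next_frag;   /* fragment number expected next (starts at 1) */
    bool     dropped;     /* set once a gap is seen; message is abandoned */
};

static void reasm_init(struct reasm_state *st)
{
    st->next_frag = 1;
    st->dropped = false;
}

/*
 * Feed one received fragment number to the reassembler.
 * Returns true if the fragment was accepted, false if it was discarded.
 */
static bool reasm_accept(struct reasm_state *st, uint32_t frag_num)
{
    if (st->dropped)
        return false;              /* message already abandoned */
    if (frag_num != st->next_frag) {
        st->dropped = true;        /* gap: give up on the whole message */
        return false;
    }
    st->next_frag++;
    return true;
}
```

In this model, once one fragment in the middle of a large message goes missing, every later fragment is rejected as well, so the upper layer never sees the message. If TIPC is discarding fragments when its receive queue is overloaded, raising the limit may keep the sequence intact.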

- AVM.

On 1/8/2014 4:16 AM, Alex Jones wrote:
> After doing some deep debugging I am seeing the following in the MDS 
> log on node B.  This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is 
> sent from the active replica on node A to the replica on node B.  The 
> sync message never gets up to the CPND layer on node B because it is 
> dropped.
>
> This is with 10k sections, each section 1k.
>
> Jan  7 21:32:32.772347 <1789648919> ERR    |MDTM: Frag recd is not 
> next frag so dropping adest=<0x010010023922604c>
> Jan  7 21:32:32.772399 <1789648919> ERR    |MDTM: Message is dropped 
> as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
>
> I've turned on MDS debug on node B, and the packet being sent over is 
> gigantic.  It starts failing at fragment number 2703.  The next 
> fragment that comes in is 2707, then 2722.  The last fragment that 
> comes in is 7444.
>
> I've done a cursory look at the hardware stats, and nothing is being 
> rate-limited or dropped.
>
> I'm going to take a deeper look at this, but I'm mentioning it in case 
> it rings any bells.  I am using TIPC as the transport.
>
> Alex
>
> On 01/07/2014 07:24 AM, Alex Jones wrote:
>> AVM,
>>
>>     I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the 
>> timeout value.  Is this not a bug?  The synchronous CheckpointOpen 
>> call doesn't work at all in this scenario.  It never succeeds.
>>
>>     I can reproduce the problem with 
>> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>>
>>     You should be able to reproduce the problem with the code I sent 
>> in the last e-mail.
>>
>> Alex
>>
>> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>>> Hi Alex,
>>>
>>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug. It 
>>> is expected if you pass a small timeout value such as `timeout = 1000000000`
>>> (1 second; SaTimeT is in nanoseconds) to the 
>>> saCkptCheckpointOpen(....,timeout ...) call when the ckpt has very 
>>> large data/sections. Just increasing the timeout will avoid the 
>>> SA_AIS_ERR_TIMEOUT.
>>>
>>> Let us focus on your original issue/scenario. Are you able to 
>>> reproduce the problem with sectionCreationAttributes.expirationTime 
>>> set to SA_TIME_ONE_DAY?
>>>
>>> -AVM
>>>
>>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>>> AVM,
>>>>
>>>>     I've been playing around with your test program, and have 
>>>> gotten it to fail.
>>>>
>>>>     I made the following changes:
>>>>
>>>>  1. Change init_dataX to be 1024k bytes, so that you are
>>>>     initializing the section to be 1024k.
>>>>  2. Also, don't start the program on node B until A has finished
>>>>     writing/creating all the sections.
>>>>  3. Before hitting the enter key on node B, wait for the OpenAsync
>>>>     call to finish.
>>>>
>>>>     You might notice the CheckpointOpen call failing now with 
>>>> SA_AIS_ERR_TIMEOUT.  I had to turn this into OpenAsync, and add a 
>>>> thread to process CkptDispatch messages.  This uncovers another bug 
>>>> in OpenAsync.  I've attached the mods to your program here.
>>>>
>>>>    The OpenAsync callback will be called twice, both times with 
>>>> error == SA_AIS_ERR_TIMEOUT.  If I call OpenAsync again when I get 
>>>> this error, the next callback returns success, but the callback 
>>>> gets called twice with success and with two different checkpoint 
>>>> handles!
>>>>
>>>> Alex
>>>>
>>>>
>>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>>> Hi Alex,
>>>>>
>>>>> I have created 10K sections (please find the attached test
>>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
>>>>> with your specified scenario & configuration, and I haven't observed any
>>>>> issue with the sections on another node.
>>>>>
>>>>> Try to reproduce the problem on your setup & let me know the result.
>>>>>
>>>>> One more important point: what value did you configure for
>>>>> `sectionCreationAttributes.expirationTime`?
>>>>> I configured SA_TIME_ONE_DAY.
>>>>>
>>>>> Steps to run the application:
>>>>>
>>>>> ===================================================================================================================
>>>>>
>>>>> Compile :
>>>>>
>>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>>
>>>>>
>>>>> Run :
>>>>>
>>>>> 1) saCkptCheckpointOpen On node A
>>>>>
>>>>> NODE-A# ./checkpoint_A
>>>>>
>>>>> CPSV:CPA:ONsaCkptSectionCreate  Waiting to Create Sections
>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>>
>>>>>
>>>>> 2) saCkptCheckpointOpen() same ckpt On node B
>>>>>
>>>>> NODE-B# ./checkpoint_B
>>>>>
>>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
>>>>> key to continue...
>>>>>
>>>>>
>>>>> 3) saCkptSectionCreate() On node A  and read saCkptCheckpointStatusGet()
>>>>>
>>>>> NODE-A#
>>>>>    checkpointStatus.numberOfSections : 10000
>>>>>    checkpointStatus.memoryUsed :756000
>>>>>     checkpointCreationAttributes.creationFlags;10
>>>>>    checkpointCreationAttributes.checkpointSize;10240000
>>>>>    checkpointCreationAttributes.retentionDuration;60000000000
>>>>>    checkpointCreationAttributes.maxSections;10000
>>>>>    checkpointCreationAttributes.maxSectionSize;1024
>>>>>    checkpointCreationAttributes.maxSectionIdSize;64
>>>>>    ================================
>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>> <Enter> key to continue...
>>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>>
>>>>>
>>>>> 4) saCkptActiveReplicaSet() & On node B  and saCkptCheckpointStatusGet()
>>>>>
>>>>> NODE-B#
>>>>>    checkpointStatus.numberOfSections : 10000
>>>>>    checkpointStatus.memoryUsed :756000
>>>>>     checkpointCreationAttributes.creationFlags;10
>>>>>    checkpointCreationAttributes.checkpointSize;10240000
>>>>>    checkpointCreationAttributes.retentionDuration;60000000000
>>>>>    checkpointCreationAttributes.maxSections;10000
>>>>>    checkpointCreationAttributes.maxSectionSize;1024
>>>>>    checkpointCreationAttributes.maxSectionIdSize;64
>>>>>
>>>>>    saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>> <Enter> key to continue...
>>>>>    saCkptCheckpoint Press <Enter> key to continue..
>>>>>
>>>>> ================================================================================================================================
>>>>>
>>>>> -AVM
>>>>>
>>>>>
>>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> We have never tested 7500 sections. We will test and let you know.
>>>>>> Can you please share your test application?
>>>>>> That would allow us to respond quickly.
>>>>>>
>>>>>> -AVM
>>>>>>
>>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>>> Hello All,
>>>>>>>
>>>>>>>       I'm experimenting with the checkpoint service, and some things
>>>>>>> don't appear to work.
>>>>>>>
>>>>>>>       The saCkptActiveReplicaSet and
>>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
>>>>>>> checkpoint has section numbers greater than around 5500.
>>>>>>>
>>>>>>>       I've created a checkpoint with 7500 sections, each section being
>>>>>>> 1024 bytes.  The checkpoint is co-located and the "active replica"
>>>>>>> bit is set.
>>>>>>>
>>>>>>>       I can create and write all the sections.  And from another node
>>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
>>>>>>> Everything is there.  I see no errors from any CKPT API calls.
>>>>>>>
>>>>>>>       The problem comes when I call saCkptActiveReplicaSet from this
>>>>>>> other node.  After I do this, saCkptCheckpointStatusGet now returns
>>>>>>> all the same information except the number of sections is no longer
>>>>>>> 7500 but 0.  If I do this test with 50,000 sections only about 3,000
>>>>>>> entries get synced.  And iterating through the sections shows that
>>>>>>> there are only 3,000 sections.
>>>>>>>
>>>>>>>       Calling saCkptCheckpointSynchronize[Async] in this situation has
>>>>>>> no effect, either.
>>>>>>>
>>>>>>>       After looking through the code I see a comment in
>>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
>>>>>>> with old active if now this fellow is becoming active. */"  So, it
>>>>>>> doesn't appear that syncing is being done in the
>>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>>
>>>>>>>       Can someone comment?
>>>>>>>
>>>>>>>       I'm going to fix this and post a patch unless someone else is
>>>>>>> already working on it, but I didn't see a bug for it.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Opensaf-devel mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>>
>>>
>>
>
