Hi Guys,
After doing some more testing I'm still seeing some problems.
The patch worked fine for a 2N model, but our real requirements are
a little different.
Here's the setup: 5+1 redundancy, with 5 active blades and 1 standby
blade protecting all the other blades. I am creating a checkpoint on
each active blade, and the standby is opening all 5 checkpoints to do
the backup.
Each checkpoint has 40k sections, with 1k of data in each section.
Every so often I am still seeing MDS problems, but they are
different now with the patch:
Jan 10 16:20:35.776939 <3680821261> ERR |MDS_SND_RCV: Timeout or Error occured
Jan 10 16:20:35.777031 <3680821261> ERR |MDS_SND_RCV: Timeout occured on sndrsp message
Jan 10 16:20:35.777062 <3680821261> ERR |MDS_SND_RCV: Adest=<0x0002040f,3798024214>
Jan 10 16:20:50.098279 <3680821261> ERR |LEN-MISMATCH:recvd_on_sock=8034, size_in_mds_hdr=65034, TIPC-ID=0x010010056ab7600b, ADEST=<0002050f,1790402571>
Jan 10 16:20:50.098326 <3680821261> ERR |DUMP:Changing dump-extent:buff=0x998fa300:max=100, len=8034
Jan 10 16:20:50.098348 <3680821261> ERR |DUMP:buff=0x998fa300:offset= 0 to 7:Bytes = 0xfe 0x0a 0x00 0x00 : 0x0f 0x1e 0x80 0x01
Alex
On 01/09/2014 04:43 AM, A V Mahesh wrote:
> Hi Alex,
>
> Use the patch below as a workaround so you can proceed with your testing.
> This patch just increases the MDS internal fragmentation value to
> roughly TIPC_MAX_USER_MSG_SIZE, which is defined in tipc.h.
>
> I will work with Hans on a final patch that considers both the
> TIPC and TCP transports,
> and the testing involved, as part of ticket `#654 MDS improvements`
> (https://sourceforge.net/p/opensaf/tickets/654/).
>
> I tested this patch with 10K sections; checkpoint memory used was
> 10136000 on the TIPC transport.
>
> ==================================================================================
>
>
> diff --git a/osaf/libs/core/mds/include/mds_dt.h b/osaf/libs/core/mds/include/mds_dt.h
> --- a/osaf/libs/core/mds/include/mds_dt.h
> +++ b/osaf/libs/core/mds/include/mds_dt.h
> @@ -32,6 +32,7 @@
> #include "ncs_main_papi.h"
> #include "ncssysf_mem.h"
> #include "ncspatricia.h"
> +#include <linux/tipc.h>
>
>
> /* This file is private to the MDTM layer. */
> @@ -109,7 +110,7 @@ typedef struct mdtm_reassembly_queue {
>
> #define MDTM_MAX_DIRECT_BUFF_SIZE MDTM_MAX_SEGMENT_SIZE
>
> -#define MDTM_NORMAL_MSG_FRAG_SIZE 1400
> +#define MDTM_NORMAL_MSG_FRAG_SIZE (TIPC_MAX_USER_MSG_SIZE-1000) /* TIPC_MAX_USER_MSG_SIZE = 66000 define <linux/tipc.h> */
>
> #define MDTM_RECV_BUFFER_SIZE ((MDS_DIRECT_BUF_MAXSIZE>MDTM_NORMAL_MSG_FRAG_SIZE)? \
> (MDS_DIRECT_BUF_MAXSIZE+SUM_MDS_HDR_PLUS_MDTM_HDR_PLUS_LEN):(MDTM_NORMAL_MSG_FRAG_SIZE+SUM_MDS_HDR_PLUS_MDTM_HDR_PLUS_LEN))
>
>
> ==================================================================================
>
>
>
> -AVM
>
>
> On 1/8/2014 10:42 PM, Alex Jones wrote:
>> Hi Hans,
>>
>> Changing rmem_default and rmem_max has no effect on the problem.
>> I even tried up to 2M to no avail.
>>
>> However, after looking at the cpnd_transfer_replica function in
>> cpnd_evt.c, I found the following in cpsv_evt.h which controls how
>> large the packets are which are sent through MDS:
>>
>> #define MAX_SYNC_TRANSFER_SIZE (30 * 1024 * 1024)
>>
>> 30M? What is the rationale for this number? This seems way too
>> high. When I change it to (4*1024*1024) (4M) it solves my problem,
>> and doesn't appear to affect performance.
>>
>> Alex
>>
>> On 01/08/2014 08:30 AM, Hans Feldt wrote:
>>> sysctl -a | grep rmem
>>>
>>> set rmem_default to 256K or so
>>>
>>> /Hans
>>>
>>>> -----Original Message-----
>>>> From: Hans Feldt [mailto:[email protected]]
>>>> Sent: den 8 januari 2014 14:01
>>>> To: A V Mahesh; Alex Jones
>>>> Cc: [email protected]
>>>> Subject: Re: [devel] checkpoint problems
>>>>
>>>> The socket receive buffer size used is the system default. It can
>>>> be too small, pump it up.
>>>> I plan to do some changes in MDS for this (and other stuff).
>>>> /Hans
>>>>
>>>>> -----Original Message-----
>>>>> From: A V Mahesh [mailto:[email protected]]
>>>>> Sent: den 8 januari 2014 11:29
>>>>> To: Alex Jones
>>>>> Cc: [email protected]
>>>>> Subject: Re: [devel] checkpoint problems
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> I suggest you try increasing the following TIPC value (in the TIPC
>>>>> code) and rebuilding `tipc.ko`:
>>>>>
>>>>> net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE 5000
>>>>>
>>>>> You can increase it to 50000 and try again.
>>>>>
>>>>> - AVM.
>>>>>
>>>>> On 1/8/2014 4:16 AM, Alex Jones wrote:
>>>>>> After doing some deep debugging I am seeing the following in the MDS
>>>>>> log on node B. This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
>>>>>> sent from the active replica on node A to the replica on node B. The
>>>>>> sync message never gets up to the CPND layer on node B because it is
>>>>>> dropped.
>>>>>>
>>>>>> This is with 10k sections, each section 1k.
>>>>>>
>>>>>> Jan 7 21:32:32.772347 <1789648919> ERR |MDTM: Frag recd is not next frag so dropping adest=<0x010010023922604c>
>>>>>> Jan 7 21:32:32.772399 <1789648919> ERR |MDTM: Message is dropped as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
>>>>>>
>>>>>> I've turned on MDS debug on node B, and the packet being sent over is
>>>>>> gigantic. It starts failing at fragment number 2703. The next
>>>>>> fragment that comes in is 2707, then 2722. The last fragment that
>>>>>> comes in is 7444.
>>>>>>
>>>>>> I've done a cursory look at the hardware stats, and nothing is being
>>>>>> rate-limited or dropped.
>>>>>>
>>>>>> I'm going to take a deeper look at this, but I'm mentioning it in case
>>>>>> it rings any bells. I am using TIPC as the transport.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On 01/07/2014 07:24 AM, Alex Jones wrote:
>>>>>>> AVM,
>>>>>>>
>>>>>>> I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
>>>>>>> timeout value. Is this not a bug? The synchronous CheckpointOpen
>>>>>>> call doesn't work at all in this scenario. It never succeeds.
>>>>>>>
>>>>>>> I can reproduce the problem with
>>>>>>> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>>>>>>>
>>>>>>> You should be able to reproduce the problem with the code I sent
>>>>>>> in the last e-mail.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug. It
>>>>>>>> is expected if you pass a small timeout value such as `timeout =
>>>>>>>> 1000000000`
>>>>>>>> to the saCkptCheckpointOpen(....,timeout ...) call when the ckpt has very
>>>>>>>> large data/sections. Just increasing the timeout will avoid the
>>>>>>>> SA_AIS_ERR_TIMEOUT.
>>>>>>>>
>>>>>>>> Let us focus on your original issue/scenario: are you able to
>>>>>>>> reproduce the problem with
>>>>>>>> sectionCreationAttributes.expirationTime
>>>>>>>> set to SA_TIME_ONE_DAY?
>>>>>>>>
>>>>>>>> -AVM
>>>>>>>>
>>>>>>>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>>>>>>>> AVM,
>>>>>>>>>
>>>>>>>>> I've been playing around with your test program, and have
>>>>>>>>> gotten it to fail.
>>>>>>>>>
>>>>>>>>> I made the following changes:
>>>>>>>>>
>>>>>>>>> 1. Change init_dataX to be 1024k bytes, so that you are
>>>>>>>>> initializing the section to be 1024k.
>>>>>>>>> 2. Also, don't start the program on node B until A has finished
>>>>>>>>> writing/creating all the sections.
>>>>>>>>> 3. Before hitting the enter key on node B, wait for the OpenAsync
>>>>>>>>> call to finish.
>>>>>>>>>
>>>>>>>>> You might notice the CheckpointOpen call failing now with
>>>>>>>>> SA_AIS_ERR_TIMEOUT. I had to turn this into OpenAsync, and add a
>>>>>>>>> thread to process CkptDispatch messages. This uncovers another bug
>>>>>>>>> in OpenAsync. I've attached the mods to your program here.
>>>>>>>>>
>>>>>>>>> The OpenAsync callback will be called twice, both times with
>>>>>>>>> error == SA_AIS_ERR_TIMEOUT. If I call OpenAsync again when I get
>>>>>>>>> this error, the next callback returns success, but the callback
>>>>>>>>> gets called twice with success and with two different checkpoint
>>>>>>>>> handles!
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>>>>>>>> Hi Alex,
>>>>>>>>>>
>>>>>>>>>> I have created 10K sections (please find the attached test
>>>>>>>>>> applications `Alex_test_node_A_app.c` and
>>>>>>>>>> `Alex_test_node_B_app.c`)
>>>>>>>>>> with your specified scenario and configuration, and I haven't
>>>>>>>>>> observed any
>>>>>>>>>> issue with sections on another node.
>>>>>>>>>>
>>>>>>>>>> Try to reproduce the problem on your setup and let me know the
>>>>>>>>>> result.
>>>>>>>>>>
>>>>>>>>>> One more important point: what value did you configure for
>>>>>>>>>> `sectionCreationAttributes.expirationTime`?
>>>>>>>>>> I configured SA_TIME_ONE_DAY.
>>>>>>>>>>
>>>>>>>>>> Steps to run the application:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ======================================================================================================
>>>>>>>>>> Compile :
>>>>>>>>>>
>>>>>>>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>>>>>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Run :
>>>>>>>>>>
>>>>>>>>>> 1) saCkptCheckpointOpen On node A
>>>>>>>>>>
>>>>>>>>>> NODE-A# ./checkpoint_A
>>>>>>>>>>
>>>>>>>>>> CPSV:CPA:ONsaCkptSectionCreate Waiting to Create Sections
>>>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>>>>>>>
>>>>>>>>>> .
>>>>>>>>>> 2) saCkptCheckpointOpen() same ckpt On node B
>>>>>>>>>>
>>>>>>>>>> NODE-B# ./checkpoint_B
>>>>>>>>>>
>>>>>>>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read
>>>>>>>>>> Sections
>>>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press
>>>>>>>>>> <Enter>
>>>>>>>>>> key to continue...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 3) saCkptSectionCreate() on node A, then read
>>>>>>>>>> saCkptCheckpointStatusGet()
>>>>>>>>>>
>>>>>>>>>> NODE-A#
>>>>>>>>>> checkpointStatus.numberOfSections : 10000
>>>>>>>>>> checkpointStatus.memoryUsed :756000
>>>>>>>>>> checkpointCreationAttributes.creationFlags;10
>>>>>>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>>>> checkpointCreationAttributes.maxSections;10000
>>>>>>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>>>> ================================
>>>>>>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose /
>>>>>>>>>> saCkptFinalize Press
>>>>>>>>>> <Enter> key to continue...
>>>>>>>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 4) saCkptActiveReplicaSet() on node B, then
>>>>>>>>>> saCkptCheckpointStatusGet()
>>>>>>>>>>
>>>>>>>>>> NODE-B#
>>>>>>>>>> checkpointStatus.numberOfSections : 10000
>>>>>>>>>> checkpointStatus.memoryUsed :756000
>>>>>>>>>> checkpointCreationAttributes.creationFlags;10
>>>>>>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>>>> checkpointCreationAttributes.maxSections;10000
>>>>>>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>>>>
>>>>>>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose /
>>>>>>>>>> saCkptFinalize Press
>>>>>>>>>> <Enter> key to continue...
>>>>>>>>>> saCkptCheckpoint Press <Enter> key to continue..
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ======================================================================================================
>>>>>>>>>> -AVM
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> We have never tested with 7500 sections. We will test and let you
>>>>>>>>>>> know.
>>>>>>>>>>> Can you please share your test application?
>>>>>>>>>>> That will allow us to respond quickly.
>>>>>>>>>>>
>>>>>>>>>>> -AVM
>>>>>>>>>>>
>>>>>>>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>>>>>>>> Hello All,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm experimenting with the checkpoint service, and some things
>>>>>>>>>>>> don't appear to work.
>>>>>>>>>>>>
>>>>>>>>>>>> The saCkptActiveReplicaSet and saCkptCheckpointSynchronize[Async]
>>>>>>>>>>>> calls don't appear to work when the checkpoint has more than
>>>>>>>>>>>> around 5500 sections.
>>>>>>>>>>>>
>>>>>>>>>>>> I've created a checkpoint with 7500 sections, each section being
>>>>>>>>>>>> 1024 bytes. The checkpoint is co-located and the "active replica"
>>>>>>>>>>>> bit is set.
>>>>>>>>>>>>
>>>>>>>>>>>> I can create and write all the sections. From another node
>>>>>>>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
>>>>>>>>>>>> Everything is there. I see no errors from any CKPT API calls.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem comes when I call saCkptActiveReplicaSet from this
>>>>>>>>>>>> other node. After I do this, saCkptCheckpointStatusGet returns
>>>>>>>>>>>> all the same information, except the number of sections is no longer
>>>>>>>>>>>> 7500 but 0. If I do this test with 50,000 sections, only about 3,000
>>>>>>>>>>>> entries get synced, and iterating through the sections shows that
>>>>>>>>>>>> there are only 3,000 sections.
>>>>>>>>>>>>
>>>>>>>>>>>> Calling saCkptCheckpointSynchronize[Async] in this situation has
>>>>>>>>>>>> no effect, either.
>>>>>>>>>>>>
>>>>>>>>>>>> After looking through the code I see a comment in
>>>>>>>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
>>>>>>>>>>>> with old active if now this fellow is becoming active. */" So it
>>>>>>>>>>>> doesn't appear that syncing is being done in
>>>>>>>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>>>>>>>
>>>>>>>>>>>> Can someone comment?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm going to fix this and post a patch unless someone else is
>>>>>>>>>>>> already working on it, but I didn't see a bug for it.
>>>>>>>>>>>>
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Rapidly troubleshoot problems before they affect your
>>>>>>>>>>>> business. Most IT
>>>>>>>>>>>> organizations don't have a clear picture of how application
>>>>>>>>>>>> performance
>>>>>>>>>>>> affects their revenue. With AppDynamics, you get 100%
>>>>>>>>>>>> visibility into
>>>>>>>>>>>> your
>>>>>>>>>>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>>>>>>>>>>> AppDynamics Pro!
>>>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Opensaf-devel mailing list
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>
>>
>
>
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel