Hi Peter,
See my previous mail.
It should be pretty easy for you to verify my hypothesis. Just add a

if (gap != 0 && (gap % 256) == 0)
        gap = 255;

into the function msg_set_seq_gap() in tipc_link.h. This should
be a functional work-around for the problem.

///jon


Jon Maloy wrote:
> See below.
> I think we are close now.
>  If any of you want the dump, you will have to ask Peter directly,
> since he was reluctant to send it out to tipc-discussion.
>
> Regards
> ///jon
>
> Horvath, Elmer wrote:
>> Hi,
>>
>> This is very interesting.  From the description by Jon (I did not see the 
>> Wireshark trace posted), the gap value is being calculated as 0 when it 
>> should be calculated as a non-zero value.
>>
>> We actually encountered a similar, but different, issue internally and 
>> believed it to be a compiler problem (we were not compiling with GCC).  The 
>> target was an E500 (8560 based PPC system) and was compiled with software 
>> floating point (though no floating point code is in TIPC that I know of).
>>
>> In our case, the calculation was effectively subtracting 1 from 1 and 
>> getting a 1.  The node would then send a NAK falsely asking for 
>> retransmissions on packets it did in fact receive.
>>
>> The incorrect calculation for us was in tipc_link.c in the routine 
>> link_recv_proto_msg() calculating the value of the variable 'rec_gap'.  The 
>> code is:
>>      if (less_eq(mod(l_ptr->next_in_no), msg_next_sent(msg))) {
>>              rec_gap = mod(msg_next_sent(msg) - 
>>                            mod(l_ptr->next_in_no));
>>      }
> I think this is calculated correctly in our case, but the rec_gap passed
> into tipc_link_send_proto_msg() gets overwritten by that routine. This is
> normally correct, since the gap should be adjusted according to what is
> present in the deferred queue, in order to avoid retransmitting more
> packets than necessary.
>
> The code I was referring to is the following, where 'gap' is initially
> set to the 'rec_gap' calculated above.
>
> if (l_ptr->oldest_deferred_in) {
>    u32 rec = msg_seqno(buf_msg(l_ptr->oldest_deferred_in));
>    gap = mod(rec - mod(l_ptr->next_in_no));
> }
>
> msg_set_seq_gap(msg, gap);
> .....
>
> When the protocol gets stuck, 'rec_gap' should be found to be
> (54992 - 53968) = 1024.
> Since the result is non-zero, tipc_link_send_proto_msg() is called.
> Inside that routine three things can happen:
> 1) l_ptr->oldest_deferred_in is NULL. This means that 'gap' will retain
>    its value of 1024. This leads us into case 3) below.
> 2) The calculation of 'gap' overwrites the original value. If this
>    value is always zero, the protocol will bail out. Can this happen?
> 3) msg_set_seq_gap() always writes a zero into the message.
>    Actually, this is fully possible. The field for 'gap' is only 8 bits
>    long, so any gap size which is a multiple of 256 will give a zero.
>    Looking at the dump, this looks very possible: the first packet loss
>    is not 95 packets, as I stated in my first mail, but
>    54483 - 53967 = 516 packets. This is counting only from what we see
>    in Wireshark, which we have reason to suspect doesn't show all
>    packets. So the real value might quite well be 512. And if this
>    happens, we are stuck forever, because the head of the deferred
>    queue will never move.
>
> My question to Peter is: How often does this happen? Every time? Often?
> If it happens often, can it be that the Ethernet driver has the habit
> of throwing away blocks of packets which are exactly a multiple of 256
> or 512? (These computer programmers...)
>
> Anyway, we have clearly found a potential problem which must be
> resolved. With window sizes > 255, scenario 3) is bound to happen now
> and then. Whether this is Peter's problem remains to be seen.
>
>> This resulted in rec_gap being non-zero even though both operands were the 
>> same value.  When rec_gap is non-zero, then tipc_link_send_proto_msg() is 
>> called with a non-zero gap value a bit further down in the same routine.
>>
>> Adding instrumentation sometimes fixed the problem; doing the exact same 
>> calculation again immediately following this code would yield the correct 
>> gap value.  Very bizarre.
>>
>> We attributed this to a compiler issue.
>>
>> I don't know if this is the same issue, but it surely sounds similar enough 
>> to be noted.  And this may be another place to check since it calls 
>> tipc_link_send_proto_msg() after receiving a state message.
>>
>> Elmer
>>
>>
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jon Maloy
>> Sent: Tuesday, March 04, 2008 7:51 PM
>> To: Xpl++; [EMAIL PROTECTED]; [email protected]
>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>
>> Hi Peter,
>>
>> I see two interesting patterns:
>>
>> a: After the packet loss has started at packet
>>    14191, state messages from 1.1.12 always come
>>    in pairs, with the same timestamp. 
>>
>> b: Also, after the problems have started, all 
>>    state messages 1.1.6->1.1.12, even when they are
>>    not probes, are immediately followed by a state 
>>    message in the opposite direction. 
>>    This is a strong indication that the receiver
>>    (1.1.12) actually detects the gap from the state
>>    message contents, and sends out a new state 
>>    message (a NACK), but for some reason the gap 
>>    value never makes it into that message. 
>>    Hence, tipc_link_send_proto_msg(),
>>    where the gap is calculated and added 
>>    (line 2135 in tipc_link.c, tipc-1.7.5), seems 
>>    to be a good place to start 
>>    looking.
>>    I strongly suspect that the gap calculated at
>>    lines 2128-2129 always yields 0, or that
>>    no packets ever make it into the deferred queue
>>    (via tipc_link_defer_pkt()).
>>    That would be consistent with what we see.
>>
>> Regards
>> ///jon
>>
>>
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Xpl++
>> Sent: March 4, 2008 2:17 PM
>> To: [EMAIL PROTECTED]; '[email protected]'
>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>
>> Hi,
>>
>> So .. what about that TODO comment in tipc_link.c regarding the stronger 
>> seq# checking and stuff? :) Since I managed to stabilize my cluster I must 
>> proceed with a software upgrade (deadlines :( ...) and will be able to start 
>> looking into the link code sometime tomorrow evening. In the meantime any 
>> ideas as to where/what to look at would be highly appreciated ;)
>>
>> Regards,
>> Peter.
>>
>> Jon Paul Maloy wrote:
>>> Hi,
>>> Your analysis makes sense, but it still doesn't explain why TIPC 
>>> cannot handle this quite commonplace situation.
>>> Yesterday, I forgot one essential detail: Even State messages contain 
>>> info to help the receiver detect a gap. The "next_sent" sequence 
>>> number tells the receiver if it is out of synch with the sender, and 
>>> gives it a chance to send a NACK (a State message with gap != 0). Since 
>>> State packets clearly are received, otherwise the link would go down, 
>>> there must be some bug in TIPC that causes the gap to be calculated 
>>> wrongly, or not at all. Nor does it look like the receiver is 
>>> sending a State message _immediately_ after a gap has occurred, which it 
>>> should.
>>> So, I think we are looking for some serious bug within tipc that 
>>> completely cripples the retransmission protocol. We should try to 
>>> backtrack and find out in which version it has been introduced.
>>>
>>> ///jon
>>>
>>>
>>> --- Xpl++ <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Some more info about my systems:
>>>> - all nodes that tend to drop packets are quite loaded, though very 
>>>> rarely one can see cpu #0 being 100% busy
>>>> - there are also a few multithreaded tasks that are bound to cpu #0 and 
>>>> running in SCHED_RR. All of them use TIPC. None of them uses the 
>>>> maximum scheduler priority, and they use very little cpu time and do 
>>>> not tend to make any peaks
>>>> - there is one task that runs in SCHED_RR at maximum priority 99/RT 
>>>> (it really does a very very important job), which uses around 1ms of 
>>>> cpu every 4 seconds, and it is explicitly bound to cpu #0
>>>> - all other tasks (mostly apache & php/perl) are free to run on any 
>>>> cpu
>>>> - all of these nodes also have considerable io load
>>>> - the kernel has irq balancing and pretty much all irqs are balanced, 
>>>> except for the nic irqs. They are always serviced by cpu #0
>>>> - to create the packet drop issue I have to mildly stress the node, 
>>>> which would normally mean a moment when apache would try to start some 
>>>> extra children, which would also cause the number of simultaneously 
>>>> running php scripts to rise, while at the same time the incoming 
>>>> network traffic is also rising. The stress is preceded by a few 
>>>> seconds of high input packet rate, which may be causing even more 
>>>> stress on the scheduler and cpu starvation
>>>> - wireshark is dropping packets (surprisingly many, as it seems), tipc 
>>>> is confused .. and all is related to moments of general cpu 
>>>> starvation and an even worse one at cpu #0
>>>>
>>>> Then it all started adding up ..
>>>> I moved all non-SCHED_OTHER tasks to other cpus, as well as a few other 
>>>> services. The result: 30% of the nodes showed between 5 and 200 
>>>> packets dropped for the whole stress routine, which had not affected 
>>>> TIPC operation; nametables were in sync, and all communications seemed 
>>>> to work properly.
>>>> Though this solves my problems, it is still very unclear what may have 
>>>> been happening in the kernel and in the tipc stack that is causing 
>>>> this bizarre behavior.
>>>> SMP systems alone are tricky, and when adding load and 
>>>> pseudo-realtime tasks the situation seems to become really complicated.
>>>> One really cool thing to note is that Opteron-based nodes handle high 
>>>> load and cpu starvation much better than Xeon ones ..
>>>> which only confirms an old observation of mine, that for some reason 
>>>> (that must be the design/architecture?) Opterons appear _much_ more 
>>>> interactive/responsive than Xeons under heavy load ..
>>>> Another note, this one on TIPC - the link window for 100mbit nets should 
>>>> be at least 256 if one wants to do any serious communication between a 
>>>> dozen or more nodes. Also, for a gbit net, link windows above 1024 seem 
>>>> to really confuse the stack when faced with a high output packet rate.
>>>>
>>>> Regards,
>>>> Peter Litov.
>>>>
>>>>
>>>> Martin Peylo wrote:
>>>>> Hi,
>>>>>
>>>>> I'll try to help with the Wireshark side of this problem.
>>>>>
>>>>> On 3/4/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
>>>>>>  Strangely enough, node 1.1.12 continues to ack packets
>>>>>>  which we don't see in wireshark (is it possible that
>>>>>>  wireshark can miss packets?). It goes on acking packets
>>>>>>  up to the one with sequence number 53967 (one of the
>>>>>>  "invisible" packets), but from there on it stops.
>>>>>
>>>>> I've never encountered Wireshark missing packets so far. While it
>>>>> sounds as if it wouldn't be a problem with the TIPC dissector, could
>>>>> you please send me a trace file so I can definitely exclude this
>>>>> cause of defect? I've tried to get it from the link quoted in the
>>>>> mail from Jon, but it seems it was already removed.
>>>>>
>>>>>>  [...]
>>>>>>
>>>>>>  As a sum of this, I start to suspect your Ethernet
>>>>>>  driver. It seems like it sometimes delivers packets
>>>>>>  to TIPC which it does not deliver to Wireshark, and
>>>>>>  vice versa. This seems to happen after a period of
>>>>>>  high traffic, and only with messages beyond a certain
>>>>>>  size, since the State messages always go through.
>>>>>>  Can you see any pattern in the direction the links
>>>>>>  go stale, with reference to which driver you are using?
>>>>>>  (E.g., is there always an e1000 driver involved on the
>>>>>>  receiving end in the stale direction?) Does this happen
>>>>>>  when you only run one type of driver?
>>>>>
>>>>> I've not yet gone that deep into packet capture, so I can't say much
>>>>> about that. Peter, could you send a mail to one of the Wireshark
>>>>> mailing lists describing the problem? Have you tried capturing other
>>>>> kinds of high traffic with less resource-hungry capture frontends?
>>>>>
>>>>> Best regards,
>>>>> Martin
>>> -------------------------------------------------------------------------
>>> This SF.net email is sponsored by: Microsoft. Defy all challenges.
>>> Microsoft(R) Visual Studio 2008.
>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>> _______________________________________________
>>> tipc-discussion mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>
>


