See below.
I think we are close now.
If any of you want the dump, you will have to ask Peter directly,
since he was reluctant to send it out to tipc-discussion.

Regards
///jon

Horvath, Elmer wrote:
> Hi,
>
> This is very interesting.  From the description by Jon (I did not see the 
> Wireshark trace posted), the gap value is being calculated as 0 when it 
> should be calculated as a non-zero value.
>
> We actually encountered a similar, but different, issue internally and 
> believed it to be a compiler problem (we were not compiling with GCC).  The 
> target was an E500 (8560 based PPC system) and was compiled with software 
> floating point (though no floating point code is in TIPC that I know of).
>
> In our case, the calculation was effectively subtracting 1 from 1 and getting 
> a 1.  The node would then send a NAK falsely asking for retransmissions on 
> packets it did in fact receive.
>
> The incorrect calculation for us was in tipc_link.c in the routine 
> link_recv_proto_msg() calculating the value of the variable 'rec_gap'.  The 
> code is:
>       if (less_eq(mod(l_ptr->next_in_no), msg_next_sent(msg))) {
>               rec_gap = mod(msg_next_sent(msg) - 
>                             mod(l_ptr->next_in_no));
>       }
>   
I think this is calculated correctly in our case, but the rec_gap passed
into tipc_link_send_proto_msg() gets overwritten by that routine. This is
normally correct, since the gap should be adjusted according to what is
present in the deferred queue, in order to avoid retransmitting more
packets than necessary.

The code I was referring to is the following, where 'gap' is initially
set to the 'rec_gap' calculated above.

if (l_ptr->oldest_deferred_in) {
   u32 rec = msg_seqno(buf_msg(l_ptr->oldest_deferred_in));
   gap = mod(rec - mod(l_ptr->next_in_no));
}

msg_set_seq_gap(msg, gap);
.....

When the protocol gets stuck, 'rec_gap' should be found to be
(54992 - 53968) = 1024.
Since the result is non-zero, tipc_link_send_proto_msg() is called.
Inside that routine three things can happen:
1) l_ptr->oldest_deferred_in is NULL. This means that 'gap' will retain
    its value of 1024. This leads us into case 3) below.
2) The calculation of 'gap' overwrites the original value. If this
    value is always zero, the protocol will bail out. Can this happen?
3) msg_set_seq_gap() always writes a zero into the message.
    Actually, this is fully possible. The field for 'gap' is only 8 bits
    long, so any gap size which is a multiple of 256 will give a zero.
    Looking at the dump, this looks very possible: the first packet loss
    is not 95 packets, as I stated in my first mail, but
    54483 - 53967 = 516 packets. This is counting only from what we see
    in Wireshark, which we have reason to suspect doesn't show all
    packets. So the real value might quite well be 512. And if this
    happens, we are stuck forever, because the head of the deferred
    queue will never move.

My question to Peter is: How often does this happen? Every time? Often?
If it happens often, can it be that the Ethernet driver has a habit of
throwing away blocks of packets which are exactly a multiple of 256 or
512? (These computer programmers...)

Anyway, we have clearly found a potential problem which must be
resolved. With window sizes > 255, scenario 3) is bound to happen now
and then. Whether this is Peter's problem remains to be seen.

> This resulted in rec_gap being non-zero even though both operands were the 
> same value.  When rec_gap is non-zero, then tipc_link_send_proto_msg() is 
> called with a non-zero gap value a bit further down in the same routine.
>
> Adding instrumentation sometimes fixed the problem; doing the exact same 
> calculation again immediately following this code would yield the correct gap 
> value.  Very bizarre.
>
> We attributed this to a compiler issue.
>
> I don't know if this is the same issue, but it surely sounds similar enough 
> to be noted.  And this may be another place to check since it calls 
> tipc_link_send_proto_msg() after receiving a state message.
>
> Elmer
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jon Maloy
> Sent: Tuesday, March 04, 2008 7:51 PM
> To: Xpl++; [EMAIL PROTECTED]; [email protected]
> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>
> Hi Peter,
>
> I see two interesting patterns:
>
> a: After the packet loss has started at packet
>    14191, state messages from 1.1.12 always come
>    in pairs, with the same timestamp. 
>
> b: Also, after the problems have started, all 
>    state messages 1.1.6->1.1.12, even when they are
>    not probes, are immediately followed by a state 
>    message in the opposite direction. 
>    This is a strong indication that the receiver
>    (1.1.12) actually detects the gap from the state
>    message contents, and sends out a new state 
>    message (a NACK), but for some reason the gap 
>    value never makes it into that message. 
>    Hence, tipc_link_send_proto_msg(),
>    where the gap is calculated and added 
>    (line 2135 in tipc_link.c, tipc-1.7.5), seems 
>    to be a good place to start 
>    looking.
>    I strongly suspect that the gap calculated at
>    lines 2128-2129 always yields 0, or that
>    no packets ever make it into the deferred queue
>    (via tipc_link_defer_pkt()).
>    That would be consistent with what we see.
>
> Regards
> ///jon
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Xpl++
> Sent: March 4, 2008 2:17 PM
> To: [EMAIL PROTECTED]; '[email protected]'
> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>
> Hi,
>
> So .. what about that TODO comment in tipc_link.c regarding the stronger 
> seq# checking and stuff? :) Since I managed to stabilize my cluster I must 
> proceed with a software upgrade (deadlines :( ...) and will be able to 
> start looking into the link code sometime tomorrow evening. In the 
> meantime, any ideas as to where/what to look at would be highly 
> appreciated ;)
>
> Regards,
> Peter.
>
Jon Paul Maloy wrote:
>   
>> Hi,
>> Your analysis makes sense, but it still doesn't explain why TIPC 
>> cannot handle this quite commonplace situation.
>> Yesterday, I forgot one essential detail: Even State messages contain 
>> info to help the receiver detect a gap. The "next_sent" sequence 
>> number tells the receiver if it is out of synch with the sender, and 
>> gives it a chance to send a NACK (a State with gap != 0). Since 
>> State-packets clearly are received, otherwise the link would go down, 
>> there must be some bug in tipc that causes the gap to be calculated 
>> wrong, or not at all. Neither does it look like the receiver is 
>> sending a State _immediately_ after a gap has occurred, which it 
>> should.
>> So, I think we are looking for some serious bug within tipc that 
>> completely cripples the retransmission protocol. We should try to 
>> backtrack and find out in which version it has been introduced.
>>
>> ///jon
>>
>>
>> --- Xpl++ <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> Some more info about my systems:
>>> - all nodes that tend to drop packets are quite loaded, though very 
>>> rarely one can see cpu #0 being 100% busy
>>> - there are also a few multithreaded tasks that are bound to cpu #0 and 
>>> running in SCHED_RR. All of them use tipc. None of them uses the 
>>> maximum scheduler priority, and they use very little cpu time and do 
>>> not tend to make any peaks
>>> - there is one task that runs in SCHED_RR at maximum priority 99/RT 
>>> (it really does a very very important job), which uses around 1ms of 
>>> cpu every 4 seconds, and it is explicitly bound to cpu #0
>>> - all other tasks (mostly apache & php/perl) are free to run on any 
>>> cpu
>>> - all of these nodes also have considerable io load
>>> - the kernel has irq balancing and pretty much all irqs are balanced, 
>>> except for the nic irqs. They are always serviced by cpu #0
>>> - to create the packet drop issue I have to mildly stress the node, 
>>> which would normally mean a moment when apache tries to start some 
>>> extra children, causing the number of simultaneously running php 
>>> scripts to rise, while at the same time the incoming network traffic 
>>> is also rising. The stress is preceded by a few seconds of high input 
>>> packet rate, which may be causing even more stress on the scheduler 
>>> and cpu starvation
>>> - wireshark is dropping packets (surprisingly many, as it seems), tipc 
>>> is confused .. and all of it is related to moments of general cpu 
>>> starvation, and an even worse one on cpu #0
>>>
>>> Then it all started adding up ..
>>> I moved all non-SCHED_OTHER tasks to other cpus, as well as a few 
>>> other services. The result: 30% of the nodes showed between 5 and 200 
>>> packets dropped for the whole stress routine, which did not affect 
>>> TIPC operation; nametables were in sync, and all communications 
>>> seemed to work properly.
>>> Though this solves my problems, it is still very unclear what may have 
>>> been happening in the kernel and in the tipc stack that causes this 
>>> bizarre behavior.
>>> SMP systems alone are tricky, and when adding load and 
>>> pseudo-realtime tasks the situation seems to become really complicated.
>>> One really cool thing to note is that Opteron based nodes handle high 
>>> load and cpu starvation much better than Xeon ones ..
>>> which only confirms an old observation of mine, that for some reason 
>>> (it must be the design/architecture?) Opterons appear _much_ more 
>>> interactive/responsive than Xeons under heavy load ..
>>> Another note, this one on TIPC - the link window for 100mbit nets 
>>> should be at least 256 if one wants to do any serious communication 
>>> between a dozen or more nodes. Also, for a gbit net, link windows 
>>> above 1024 seem to really confuse the stack when faced with a high 
>>> output packet rate.
>>>
>>> Regards,
>>> Peter Litov.
>>>
>>>
>>> Martin Peylo wrote:
>>>> Hi,
>>>>
>>>> I'll try to help with the Wireshark side of this problem.
>>>>
>>>> On 3/4/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
>>>>> Strangely enough, node 1.1.12 continues to ack packets
>>>>> which we don't see in wireshark (is it possible that
>>>>> wireshark can miss packets?). It goes on acking packets
>>>>> up to the one with sequence number 53967 (one of the
>>>>> "invisible" packets), but from there on it stops.
>>>>
>>>> I've never encountered Wireshark missing packets so far. While it
>>>> sounds as if it wouldn't be a problem with the TIPC dissector, could
>>>> you please send me a trace file so I can definitely exclude this
>>>> cause of defect? I've tried to get it from the link quoted in the
>>>> mail from Jon but it seems it was already removed.
>>>>
>>>>> [...]
>>>>>
>>>>> As a sum of this, I start to suspect your Ethernet
>>>>> driver. It seems like it sometimes delivers packets
>>>>> to TIPC which it does not deliver to Wireshark, and
>>>>> vice versa. This seems to happen after a period of
>>>>> high traffic, and only with messages beyond a certain
>>>>> size, since the State messages always go through.
>>>>> Can you see any pattern in the direction the links
>>>>> go stale, with reference to which driver you are using?
>>>>> (E.g., is there always an e1000 driver involved
>>>>> on the receiving end in the stale direction?) Does this
>>>>> happen when you only run one type of driver?
>>>>
>>>> I've not yet gone that deep into packet capture, so I can't say much
>>>> about that. Peter, could you send a mail to one of the Wireshark
>>>> mailing lists describing the problem? Have you tried capturing other
>>>> kinds of high traffic with less resource-hungry capture frontends?
>>>>
>>>> Best regards,
>>>> Martin
>> -------------------------------------------------------------------------
>> This SF.net email is sponsored by: Microsoft
>> Defy all challenges. Microsoft(R) Visual Studio 2008.
>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>> _______________________________________________
>> tipc-discussion mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>     
>


