Hi Peter,
See my previous mail.
It should be pretty easy for you to verify my hypothesis. Just add

    if ((gap % 256) == 0)
            gap = 255;

to the function msg_set_seq_gap() in tipc_link.h. This should
be a functional work-around for the problem.
///jon
Jon Maloy wrote:
> See below.
> I think we are close now.
> If any of you want the dump, you will have to ask Peter directly,
> since he was reluctant to send it out to tipc-discussion.
>
> Regards
> ///jon
>
> Horvath, Elmer wrote:
>
>> Hi,
>>
>> This is very interesting. From the description by Jon (I did not see the
>> Wireshark trace posted), the gap value is being calculated as 0 when it
>> should be calculated as a non-zero value.
>>
>> We actually encountered a similar, but different, issue internally and
>> believed it to be a compiler problem (we were not compiling with GCC). The
>> target was an E500 (8560 based PPC system) and was compiled with software
>> floating point (though no floating point code is in TIPC that I know of).
>>
>> In our case, the calculation was effectively subtracting 1 from 1 and
>> getting a 1. The node would then send a NAK falsely asking for
>> retransmissions on packets it did in fact receive.
>>
>> The incorrect calculation for us was in tipc_link.c in the routine
>> link_recv_proto_msg() calculating the value of the variable 'rec_gap'. The
>> code is:
>> if (less_eq(mod(l_ptr->next_in_no), msg_next_sent(msg))) {
>>         rec_gap = mod(msg_next_sent(msg) -
>>                       mod(l_ptr->next_in_no));
>> }
>>
>>
>>
> I think this is calculated correctly in our case, but the rec_gap
> passed into tipc_link_send_proto_msg() gets overwritten by that
> routine. This is normally correct, since the gap should be adjusted
> according to what is present in the deferred-queue, in order to avoid
> retransmitting more packets than necessary.
>
> The code I was referring to is the following, where 'gap' initially is
> set to the 'rec_gap' calculated above.
>
> if (l_ptr->oldest_deferred_in) {
>         u32 rec = msg_seqno(buf_msg(l_ptr->oldest_deferred_in));
>         gap = mod(rec - mod(l_ptr->next_in_no));
> }
>
> msg_set_seq_gap(msg, gap);
> .....
>
> When the protocol gets stuck, 'rec_gap' should be found to be
> (54992 - 53968) = 1024.
> Since the result is non-zero, tipc_link_send_proto_msg() is called.
> Inside that routine three things can happen:
> 1) l_ptr->oldest_deferred_in is NULL. This means that 'gap' will
> retain its value of 1024. This leads us into case 3) below.
> 2) The calculation of 'gap' over-writes the original value. If this
> value is always zero, the protocol will bail out. Can this happen?
> 3) msg_set_seq_gap() always writes a zero into the message.
> Actually, this is fully possible. The field for 'gap' is only 8 bits
> long, so any gap size which is a multiple of 256 will give a zero.
> Looking at the dump, this looks very possible: the first packet loss
> is not 95 packets, as I stated in my first mail, but
> 54483 - 53967 = 516 packets. This is counting only from what we see
> in Wireshark, which we have reason to suspect doesn't show all
> packets. So the real value might quite well be 512. And if this
> happens, we are stuck forever, because the head of the deferred-queue
> will never move.
>
> My question to Peter is: How often does this happen? Every time? Often?
> If it happens often, can it be that the Ethernet driver has the habit
> of throwing away blocks of packets which are exactly a multiple of 256
> or 512? (These computer programmers...)
>
> Anyway, we have clearly found a potential problem which must be
> resolved. With window sizes > 255, scenario 3) is bound to happen now
> and then. Whether this is Peter's problem remains to be seen.
>
>
>
>> This resulted in rec_gap being non-zero even though both operands were the
>> same value. When rec_gap is non-zero, then tipc_link_send_proto_msg() is
>> called with a non-zero gap value a bit further down in the same routine.
>>
>> Adding instrumentation sometimes fixed the problem; doing the exact same
>> calculation again immediately following this code would yield the correct
>> gap value. Very bizarre.
>>
>> We attributed this to a compiler issue.
>>
>> I don't know if this is the same issue, but it surely sounds similar enough
>> to be noted. And this may be another place to check since it calls
>> tipc_link_send_proto_msg() after receiving a state message.
>>
>> Elmer
>>
>>
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jon Maloy
>> Sent: Tuesday, March 04, 2008 7:51 PM
>> To: Xpl++; [EMAIL PROTECTED]; [email protected]
>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>
>> Hi Peter,
>>
>> I see two interesting patterns:
>>
>> a: After the packet loss has started at packet
>> 14191, state messages from 1.1.12 always come
>> in pairs, with the same timestamp.
>>
>> b: Also, after the problems have started, all
>> state messages 1.1.6->1.1.12, even when they are
>> not probes, are immediately followed by a state
>> message in the opposite direction.
>> This is a strong indication that the receiver
>> (1.1.12) actually detects the gap from the state
>> message contents, and sends out a new state
>> message (a NACK), but for some reason the gap
>> value never makes it into that message.
>> Hence, tipc_link_send_proto_msg(),
>> where the gap is calculated and added
>> (line 2135 in tipc_link.c, tipc-1.7.5), seems
>> to be a good place to start
>> looking.
>> I strongly suspect that the gap calculated at
>> lines 2128-2129 always yields 0, or that
>> no packets ever make it into the deferred queue
>> (via tipc_link_defer_pkt()).
>> That would be consistent with what we see.
>>
>> Regards
>> ///jon
>>
>>
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Xpl++
>> Sent: March 4, 2008 2:17 PM
>> To: [EMAIL PROTECTED]; '[email protected]'
>> Subject: Re: [tipc-discussion] RE : Re: Link related question/issue
>>
>> Hi,
>>
>> So .. what about that TODO comment in tipc_link.c regarding the stronger
>> seq# checking and stuff? :) Since I managed to stabilize my cluster I must
>> proceed with a software upgrade (deadlines :( ...) and will be able to start
>> looking into the link code sometime tomorrow evening. In the mean time any
>> ideas as to where/what to look at would be highly appreciated ;)
>>
>> Regards,
>> Peter.
>>
>> Jon Paul Maloy wrote:
>>
>>
>>> Hi,
>>> Your analysis makes sense, but it still doesn't explain why TIPC
>>> cannot handle this quite commonplace situation.
>>> Yesterday, I forgot one essential detail: Even State messages contain
>>> info to help the receiver detect a gap. The "next_sent" sequence
>>> number tells the receiver if it is out of synch with the sender, and
>>> gives it a chance to send a NACK (a State with gap != 0). Since
>>> State-packets clearly are received, otherwise the link would go down,
>>> there must be some bug in tipc that causes the gap to be calculated
>>> wrong, or not at all. Neither does it look like the receiver is
>>> sending a State _immediately_ after a gap has occurred, which it
>>> should.
>>> So, I think we are looking for some serious bug within tipc that
>>> completely cripples the retransmission protocol. We should try to
>>> backtrack and find out in which version it has been introduced.
>>>
>>> ///jon
>>>
>>>
>>> --- Xpl++ <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>>
>>>
>>>> Hi,
>>>>
>>>> Some more info about my systems:
>>>> - all nodes that tend to drop packets are quite loaded, though very
>>>> rarely one can see cpu #0 being 100% busy
>>>> - there are also a few multithreaded tasks that are bound to cpu#0
>>>> and running in SCHED_RR. All of them use tipc. None of them uses the
>>>> maximum scheduler priority, and they use very little cpu time and do
>>>> not tend to produce any peaks
>>>> - there is one task that runs in SCHED_RR at maximum priority 99/RT
>>>> (it really does a very, very important job), which uses around 1ms of
>>>> cpu every 4 seconds, and it is explicitly bound to cpu#0
>>>> - all other tasks (mostly apache & php/perl) are free to run on any
>>>> cpu
>>>> - all of these nodes also have considerable io load
>>>> - the kernel has irq balancing and pretty much all irqs are balanced,
>>>> except for the nic irqs. They are always serviced by cpu #0
>>>> - to reproduce the packet drop issue I have to mildly stress the
>>>> node, which normally means a moment when apache tries to start some
>>>> extra children; that also causes the number of simultaneously running
>>>> php scripts to rise, while at the same time the incoming network
>>>> traffic is rising as well. The stress is preceded by a few seconds of
>>>> high input packet rate, which may be causing even more stress on the
>>>> scheduler and cpu starvation
>>>> - wireshark is dropping packets (surprisingly many, as it seems),
>>>> tipc is confused .. and all of it is related to moments of general
>>>> cpu starvation, and an even worse one on cpu#0
>>>>
>>>> Then it all started adding up ..
>>>> I moved all non-SCHED_OTHER tasks to other cpus, as well as a few
>>>> other services. The result: 30% of the nodes showed between 5 and 200
>>>> packets dropped for the whole stress routine, which did not affect
>>>> TIPC operation; nametables were in sync, and all communications
>>>> seemed to work properly.
>>>> Though this solves my problems, it is still very unclear what may
>>>> have been happening in the kernel and in the tipc stack to cause this
>>>> bizarre behavior.
>>>> SMP systems alone are tricky, and when adding load and
>>>> pseudo-realtime tasks the situation seems to become really
>>>> complicated.
>>>> One really cool thing to note is that Opteron-based nodes handle high
>>>> load and cpu starvation much better than Xeon ones, which only
>>>> confirms an old observation of mine that, for some reason (it must be
>>>> the design/architecture?), Opterons appear _much_ more
>>>> interactive/responsive than Xeons under heavy load.
>>>> Another note, this one on TIPC: the link window for 100mbit nets
>>>> should be at least 256 if one wants to do any serious communication
>>>> between a dozen or more nodes. Also, on a gbit net, link windows
>>>> above 1024 seem to really confuse the stack when faced with a high
>>>> output packet rate.
>>>>
>>>> Regards,
>>>> Peter Litov.
>>>>
>>>>
>>>> Martin Peylo wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'll try to help with the Wireshark side of this problem.
>>>>>
>>>>> On 3/4/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Strangely enough, node 1.1.12 continues to ack packets
>>>>>> which we don't see in wireshark (is it possible that
>>>>>> wireshark can miss packets?). It goes on acking packets
>>>>>> up to the one with sequence number 53967 (one of the
>>>>>> "invisible" packets), but from there on it stops.
>>>>>
>>>>> I've never encountered Wireshark missing packets so far. While it
>>>>> sounds as if it wouldn't be a problem with the TIPC dissector, could
>>>>> you please send me a trace file so I can definitely exclude this
>>>>> cause of defect? I've tried to get it from the link quoted in the
>>>>> mail from Jon but it seems it was already removed.
>>>>>
>>>>>> [...]
>>>>>>
>>>>>> As a sum of this, I start to suspect your Ethernet
>>>>>> driver. It seems like it sometimes delivers packets
>>>>>> to TIPC which it does not deliver to Wireshark, and
>>>>>> vice versa. This seems to happen after a period of
>>>>>> high traffic, and only with messages beyond a certain
>>>>>> size, since the State messages always go through.
>>>>>> Can you see any pattern in the direction the links
>>>>>> go stale, with reference to which driver you are using? (E.g., is
>>>>>> there always an e1000 driver involved on the receiving end in the
>>>>>> stale direction?) Does this happen when you only run one type of
>>>>>> driver?
>>>>>
>>>>> I've not yet gone that deep into packet capture, so I can't say much
>>>>> about that. Peter, could you send a mail to one of the Wireshark
>>>>> mailing lists describing the problem? Have you tried capturing other
>>>>> kinds of high traffic with less resource-hungry capture frontends?
>>>>>
>>>>> Best regards,
>>>>> Martin
>>>> -------------------------------------------------------------------------
>>>> This SF.net email is sponsored by: Microsoft Defy all challenges.
>>>> Microsoft(R) Visual Studio 2008.
>>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>>> _______________________________________________
>>>> tipc-discussion mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>
>>
>>
>
>
>