--- Jon Paul Maloy <[EMAIL PROTECTED]> wrote:

> Date: Tue, 4 Mar 2008 11:02:40 -0500 (EST)
> From: Jon Paul Maloy <[EMAIL PROTECTED]>
> Subject: RE: Re: [tipc-discussion] Link related question/issue
> To: Xpl++ <[EMAIL PROTECTED]>
>
> Hi,
> Your analysis makes sense, but it still doesn't explain why TIPC
> cannot handle this quite commonplace situation.
> Yesterday I forgot one essential detail: even State messages contain
> info to help the receiver detect a gap. The "next_sent" sequence
> number tells the receiver whether it is out of sync with the sender,
> and gives it a chance to send a NACK (a State message with gap != 0).
> Since State packets clearly are received, otherwise the link would go
> down, there must be some bug in TIPC that causes the gap to be
> calculated wrongly, or not at all. Nor does it look like the receiver
> is sending a State message _immediately_ after a gap has occurred,
> which it should.
> So I think we are looking for a serious bug within TIPC that
> completely cripples the retransmission protocol. We should try to
> backtrack and find out in which version it was introduced.
>
> ///jon
>
>
> --- Xpl++ <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Some more info about my systems:
> > - all nodes that tend to drop packets are quite loaded, though only
> >   very rarely can one see cpu #0 being 100% busy
> > - there are also a few multithreaded tasks that are bound to cpu#0
> >   and running in SCHED_RR. All of them use TIPC. None of them uses
> >   the maximum scheduler priority, and they use very little CPU time
> >   and do not tend to produce any peaks
> > - there is one task that runs in SCHED_RR at maximum priority 99/RT
> >   (it really does a very, very important job), which uses around
> >   1 ms of CPU every 4 seconds, and it is explicitly bound to cpu#0
> > - all other tasks (mostly apache & php/perl) are free to run on any
> >   cpu
> > - all of these nodes also have considerable I/O load.
> > - the kernel has IRQ balancing and pretty much all IRQs are
> >   balanced, except for the NIC IRQs. They are always serviced by
> >   cpu #0
> > - to reproduce the packet drop issue I have to mildly stress the
> >   node, which normally means a moment when apache tries to start
> >   some extra children, which also causes the number of
> >   simultaneously running PHP scripts to rise, while at the same
> >   time the incoming network traffic is also rising. The stress is
> >   preceded by a few seconds of high input packet rate, which may be
> >   causing even more stress on the scheduler and CPU starvation
> > - wireshark is dropping packets (surprisingly many, it seems), TIPC
> >   is confused... and all of it is related to moments of general CPU
> >   starvation, and an even worse one on cpu#0
> >
> > Then it all started adding up...
> > I moved all non-SCHED_OTHER tasks to other CPUs, as well as a few
> > other services. The result: 30% of the nodes showed between 5 and
> > 200 packets dropped over the whole stress routine, which did not
> > affect TIPC operation; nametables were in sync and all
> > communications seemed to work properly.
> > Though this solves my problem, it is still very unclear what may
> > have been happening in the kernel and in the TIPC stack to cause
> > this bizarre behavior.
> > SMP systems alone are tricky, and when load and pseudo-realtime
> > tasks are added the situation seems to become really complicated.
> > One really cool thing to note is that Opteron-based nodes handle
> > high load and CPU starvation much better than Xeon ones... which
> > only confirms an old observation of mine, that for some reason
> > (which must be the design/architecture?) Opterons appear _much_
> > more interactive/responsive than Xeons under heavy load.
> > Another note, this one on TIPC - the link window for 100 Mbit nets
> > should be at least 256 if one wants to do any serious communication
> > between a dozen or more nodes. Also, on a gigabit net, link windows
> > above 1024 seem to really confuse the stack when faced with a high
> > output packet rate.
> >
> > Regards,
> > Peter Litov.
> >
> >
> > Martin Peylo wrote:
> > > Hi,
> > >
> > > I'll try to help with the Wireshark side of this problem.
> > >
> > > On 3/4/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
> > >
> > >> Strangely enough, node 1.1.12 continues to ack packets which we
> > >> don't see in wireshark (is it possible that wireshark can miss
> > >> packets?). It goes on acking packets up to the one with sequence
> > >> number 53967 (one of the "invisible" packets), but from there on
> > >> it stops.
> > >
> > > I've never encountered Wireshark missing packets so far. While it
> > > doesn't sound as if this is a problem with the TIPC dissector,
> > > could you please send me a trace file so I can definitively
> > > exclude this cause of defect? I've tried to get it from the link
> > > quoted in the mail from Jon, but it seems it was already removed.
> > >
> > >> [...]
> > >
> > >> As a sum of this, I start to suspect your Ethernet driver. It
> > >> seems like it sometimes delivers packets to TIPC which it does
> > >> not deliver to Wireshark, and vice versa. This seems to happen
> > >> after a period of high traffic, and only with messages beyond a
> > >> certain size, since the State messages always go through.
> > >> Can you see any pattern in the direction the links go stale,
> > >> with reference to which driver you are using? (E.g., is there
> > >> always an e1000 driver involved on the receiving end in the
> > >> stale direction?)
> > >> Does this happen when you only run one type of driver?
> > >
> > > I've not yet gone that deep into packet capture, so I can't say
> > > much about that. Peter, could you send a mail to one of the
> > > Wireshark mailing lists describing the problem? Have you tried
> > > capturing other kinds of high traffic with less resource-hungry
> > > capture frontends?
> > >
> > > Best regards,
> > > Martin
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
