Re: TCP event tracking via netlink...
Hi.

On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) wrote:
> > Maybe if we want to get really fancy we can have some more-expensive
> > debug mode where detailed specific events get generated via some
> > macros we can scatter all over the place. This won't be useful
> > for general user problem analysis, but it will be excellent for
> > developers.
> >
> > Let me know if you think this is useful enough and I'll work on
> > an implementation we can start playing with.
>
> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

And even more similar to this patch from Samir Bellabes of Mandriva:
http://lwn.net/Articles/202255/

--
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TCP event tracking via netlink...
Evgeniy Polyakov <[EMAIL PROTECTED]> writes:

> Hi.
>
> On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) wrote:
>> > Maybe if we want to get really fancy we can have some more-expensive
>> > debug mode where detailed specific events get generated via some
>> > macros we can scatter all over the place. This won't be useful
>> > for general user problem analysis, but it will be excellent for
>> > developers.
>> >
>> > Let me know if you think this is useful enough and I'll work on
>> > an implementation we can start playing with.
>>
>> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
>> http://caia.swin.edu.au/urp/newtcp/tools.html
>> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf
>
> And even more similar to this patch from Samir Bellabes of Mandriva:
> http://lwn.net/Articles/202255/

Indeed, I was thinking about this idea, but my goal is not to deal with
specific protocols like TCP, just with the LSM hooks. Anyway, the idea
is the same: having a daemon in userspace to catch the information. So
why not an extension? Lately, I'm moving the code from connector to
generic netlink.

regards,
sam
Re: TCP event tracking via netlink...
> it occurred to me that we might want to do something
> like a state change event generator.

This could be a basis for an interesting TCP performance tester.
Re: TCP event tracking via netlink...
On Wed, 05 Dec 2007 08:53:07 -0800
Joe Perches <[EMAIL PROTECTED]> wrote:

> > it occurred to me that we might want to do something
> > like a state change event generator.
>
> This could be a basis for an interesting TCP
> performance tester.

That is what tcpprobe does, but it isn't detailed enough to address SACK
issues.
Re: TCP event tracking via netlink...
On Wed, 5 Dec 2007, Stephen Hemminger wrote:

> On Wed, 05 Dec 2007 08:53:07 -0800
> Joe Perches <[EMAIL PROTECTED]> wrote:
>
> > > it occurred to me that we might want to do something
> > > like a state change event generator.
> >
> > This could be a basis for an interesting TCP
> > performance tester.
>
> That is what tcpprobe does but it isn't detailed enough to address SACK
> issues.

...It would be nice if that could be generalized so that the probe could
be attached to functions other than tcp_rcv_established.

If we convert the remaining functions that don't have sk or tp as their
first argument so that sk is listed first (there shouldn't be many with
the wrong ordering, if any), then maybe a generic handler could be of
type:

  jtcp_entry(struct sock *sk, ...)

or, when available:

  jtcp_entry(struct sock *sk, struct sk_buff *ack, ...)

--
i.
Re: TCP event tracking via netlink...
David Miller wrote:
> Ilpo, I was pondering the kind of debugging one does to find
> congestion control issues and even SACK bugs and it's currently too
> painful because there is no standard way to track state changes.
>
> I assume you're using something like carefully crafted printk's,
> kprobes, or even ad-hoc statistic counters. That's what I used to do
> :-)
>
> With that in mind it occurred to me that we might want to do something
> like a state change event generator.
>
> Basically some application or even a daemon listens on this generic
> netlink socket family we create. The header of each event packet
> indicates what socket the event is for and then there is some state
> information.
>
> Then you can look at a tcpdump and this state dump side by side and
> see what the kernel decided to do.
>
> Now there is the question of granularity.
>
> A very important consideration in this is that we want this thing to
> be enabled in the distributions, therefore it must be cheap. Perhaps
> one test at the end of the packet input processing.
>
> So I say we pick some state to track (perhaps start with tcp_info)
> and just push that at the end of every packet input run. Also, we add
> some minimal filtering capability (match on specific IP address
> and/or port, for example).
>
> Maybe if we want to get really fancy we can have some more-expensive
> debug mode where detailed specific events get generated via some
> macros we can scatter all over the place. This won't be useful for
> general user problem analysis, but it will be excellent for
> developers.
>
> Let me know if you think this is useful enough and I'll work on
> an implementation we can start playing with.

FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
http://caia.swin.edu.au/urp/newtcp/tools.html
http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

  -John
Re: TCP event tracking via netlink...
On Wed, 5 Dec 2007, David Miller wrote:

> Ilpo, I was pondering the kind of debugging one does to find
> congestion control issues and even SACK bugs and it's currently too
> painful because there is no standard way to track state changes.

That's definitely true.

> I assume you're using something like carefully crafted printk's,
> kprobes, or even ad-hoc statistic counters. That's what I used to do
> :-)

No, that's not at all what I do :-). I usually look at time-seq graphs,
except for the cases when I just find things out by reading code (or by
just thinking about it). I'm so used to all the things in the graphs
that I can quite easily spot any inconsistencies & TCP events and then
look at the interesting parts in greater detail; very rarely does
something remain uncertain...

However, instead of going directly to printks, etc., I almost always
read the code first (usually it's not just a couple of lines but tens of
potential TCP execution paths involving more than a handful of functions
to check what the end result would be). This has a nice side-effect that
other things tend to show up as well. Only when things get nasty and I
cannot figure out what it does wrong do I add specially placed ad-hoc
printks.

One trick I also use is to get the vars of the relevant flow from
/proc/net/tcp in a while loop, but that only works in my case because I
use links that are slow (even a small-value sleep in the loop does not
hide much). For other people's reports, I occasionally have to write
validator patches, as you might have noticed, because in a typical
miscount case our BUG_TRAPs fire too late: they trigger only after the
outstanding window becomes zero, which may already be a very distant
point in time from the cause.

Also, I'm planning an experiment with the markers thing to see if they
are of any use when trying to gather some latency data about SACK
processing, because they seem lightweight enough not to be disturbing.

> With that in mind it occurred to me that we might want to do something
> like a state change event generator.
>
> Basically some application or even a daemon listens on this generic
> netlink socket family we create. The header of each event packet
> indicates what socket the event is for and then there is some state
> information.
>
> Then you can look at a tcpdump and this state dump side by side and
> see what the kernel decided to do.

Much of the info is available in tcpdump already; it's just hard to read
without graphing it first because there are so many overlapping things
to track in two-dimensional space.

...But yes, I have to admit that a couple of problems come to mind where
having some variable from tcp_sock would have made the problem more
obvious.

> Now there is the question of granularity.
>
> A very important consideration in this is that we want this thing to
> be enabled in the distributions, therefore it must be cheap. Perhaps
> one test at the end of the packet input processing.

Not sure what the benefit of having it in distributions is, because
those people hardly ever report problems here anyway; they're just too
happy with TCP performance unless we print something to their logs,
which implies that we must set up a *_ON() condition :-(. Yes, an often
neglected problem is that most people are just too happy even with
something like TCP Tahoe or something similarly prehistoric. I've been
surprised how badly TCP can break with nobody complaining, as long as it
doesn't crash (not even any of the devs).

Two key things seem to surface most of the TCP related bugs: research
people really staring at strange packet patterns (or code), and reports
triggered by automatic WARN/BUG_ON checks. The latter reports also
include corner cases which nobody would otherwise ever have noticed (or
at least not before Linus releases 3.0 :-/). IMHO, those invariant
WARN/BUG_ONs are the only alternative that scales to normal users well
enough. The checks are simple enough that they can be always on, and
then we just happen to print something to their log, and that's
offensive enough for somebody to come up with a report... ;-)

> So I say we pick some state to track (perhaps start with tcp_info)
> and just push that at the end of every packet input run. Also,
> we add some minimal filtering capability (match on specific IP
> address and/or port, for example).
>
> Maybe if we want to get really fancy we can have some more-expensive
> debug mode where detailed specific events get generated via some
> macros we can scatter all over the place.
>
> This won't be useful for general user problem analysis, but it will be
> excellent for developers.

I would say that for it to be generic enough, most function entries and
exits would have to be covered because the need varies a lot; the
processing in general is so complex that things would too easily get
shadowed otherwise! In addition we need an expensive mode++ which goes
all the way down to the dirty details of the write queue, they're now
dirtier than
Re: TCP event tracking via netlink...
On Thu, 6 Dec 2007 00:15:49 +0200 (EET)
"Ilpo Järvinen" <[EMAIL PROTECTED]> wrote:

> On Wed, 5 Dec 2007, Stephen Hemminger wrote:
>
> > On Wed, 05 Dec 2007 08:53:07 -0800
> > Joe Perches <[EMAIL PROTECTED]> wrote:
> >
> > > > it occurred to me that we might want to do something
> > > > like a state change event generator.
> > >
> > > This could be a basis for an interesting TCP
> > > performance tester.
> >
> > That is what tcpprobe does but it isn't detailed enough to address SACK
> > issues.
>
> ...It would be nice if that could be generalized so that the probe could
> be attached to some other functions than tcp_rcv_established instead.
>
> If we convert remaining functions that don't have sk or tp as first
> argument so that sk is listed first, then maybe a generic handler could
> be of type:
>
>   jtcp_entry(struct sock *sk, ...)
>
> or when available:
>
>   jtcp_entry(struct sock *sk, struct sk_buff *ack, ...)

An earlier version had hooks in send as well; it is trivial to extend.
As long as the prototypes match, any function arg ordering is okay.
Re: TCP event tracking via netlink...
From: John Heffner <[EMAIL PROTECTED]>
Date: Wed, 05 Dec 2007 09:11:01 -0500

> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

Yes, my proposal is very similar to this SIFTR work.

In their work they tap into the stack using the packet filtering hooks.
In this way they avoid having to make TCP stack modifications; they just
look up the PCB and dump state, whereas we have more liberty to do more
serious surgery :-)
Re: TCP event tracking via netlink...
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Wed, 5 Dec 2007 17:48:43 +0300

> On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) wrote:
> > > Maybe if we want to get really fancy we can have some more-expensive
> > > debug mode where detailed specific events get generated via some
> > > macros we can scatter all over the place. This won't be useful
> > > for general user problem analysis, but it will be excellent for
> > > developers.
> > >
> > > Let me know if you think this is useful enough and I'll work on
> > > an implementation we can start playing with.
> >
> > FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> > http://caia.swin.edu.au/urp/newtcp/tools.html
> > http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf
>
> And even more similar to this patch from Samir Bellabes of Mandriva:
> http://lwn.net/Articles/202255/

I think this work is very different.

When I say "state" I mean something more significant than CLOSE,
ESTABLISHED, etc., which is what Samir's patches are tracking.

I'm talking about all of the sequence numbers, SACK information,
congestion control knobs, etc., whose values are nearly impossible to
track on a packet-to-packet basis in order to diagnose problems.

Web100 provided facilities along these lines as well.
Re: TCP event tracking via netlink...
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Wed, 5 Dec 2007 16:33:38 -0500

> On Wed, 05 Dec 2007 08:53:07 -0800
> Joe Perches <[EMAIL PROTECTED]> wrote:
>
> > > it occurred to me that we might want to do something
> > > like a state change event generator.
> >
> > This could be a basis for an interesting TCP
> > performance tester.
>
> That is what tcpprobe does but it isn't detailed enough to address SACK
> issues.

Indeed, this could be done via the jprobe there.

Silly me, I didn't do this in the implementation I whipped up, which
I'll likely correct.
Re: TCP event tracking via netlink...
From: "Ilpo Järvinen" <[EMAIL PROTECTED]>
Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)

> On Wed, 5 Dec 2007, David Miller wrote:
>
> > I assume you're using something like carefully crafted printk's,
> > kprobes, or even ad-hoc statistic counters. That's what I used to do
> > :-)
>
> No, that's not at all what I do :-). I usually look at time-seq graphs
> except for the cases when I just find things out by reading code (or
> by just thinking of it).

Can you briefly detail what graph tools and command lines you are using?

The last time I did graphing to analyze things, the tools were
hit-or-miss.

> Much of the info is available in tcpdump already, it's just hard to read
> without graphing it first because there are so many overlapping things
> to track in two-dimensional space.
>
> ...But yes, I have to admit that a couple of problems come to my mind
> where having some variable from tcp_sock would have made the problem
> more obvious.

The most important are the cwnd and ssthresh, which you could guess
using graphs, but it is important to know on a packet-to-packet basis
why we might have sent a packet or not, because this has rippling
effects down the rest of the RTT.

> Not sure what is the benefit of having distributions with it because
> those people hardly report problems anyway to here, they're just too
> happy with TCP performance unless we print something to their logs,
> which implies that we must setup a *_ON() condition :-(.

That may be true, but if we could integrate the information with
tcpdumps, we could gather internal state using tools the user already
has available.

Imagine if tcpdump printed out:

02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
	ss_thresh: 129 cwnd: 133 packets_out: 132

or something like that.

> Some problems are simply such that things cannot be accurately verified
> without high processing overhead until it's far too late (eg skb bits vs
> *_out counters). Maybe we should start to build an expensive state
> validator as well which would automatically check invariants of the
> write queue and tcp_sock in a straightforward, unoptimized manner? That
> would definitely do a lot of work for us, just ask people to turn it on
> and it spits out everything that went wrong :-) (unless they really
> depend on very high-speed things and are therefore unhappy if we scan
> thousands of packets unnecessarily per ACK :-)). ...Early enough!
> ...That would work also for distros but there's always human judgement
> needed to decide whether the bug reporter will be happy when his TCP
> processing no longer scales ;-).

I think it's useful as a TCP_DEBUG config option or similar, sure.

But sometimes the algorithms are working as designed; it's just that
they provide poor pipe utilization, and CWND analysis embedded inside of
a tcpdump would be one way to see that, as well as to determine the flaw
in the algorithm.

> ...Hopefully you found any of my comments useful.

Very much so, thanks.

I put together a sample implementation anyway just to show the idea,
against net-2.6.25 below.

It is untested since I didn't write the userland app yet to see that
proper things get logged. Basically you could run a daemon that writes
per-connection traces into files based upon the incoming netlink events.
Later, using the binary pcap file and these traces, you can piece
together traces like the above, using the timestamps etc. to match up
pcap packets with ones from the TCP logger.

The userland tools could do analysis and print pre-cooked state diff
logs, like "this ACK raised CWND by one" or whatever else you wanted to
know.

It's nice that an expert like you can look at graphs and understand, but
we'd like to create more experts, and besides reading code, one way to
become an expert is to be able to extract live real data from the
kernel's working state and try to understand how things got that way.
This information is permanently lost currently.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 56342c3..c0e61d0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -170,6 +170,47 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN]; /* key (binary) */
 };
 
+/* TCP netlink event logger. */
+struct tcp_log_key {
+	union {
+		__be32	a4;
+		__be32	a6[4];
+	} saddr, daddr;
+	__be16		sport;
+	__be16		dport;
+	unsigned short	family;
+	unsigned short	__pad;
+};
+
+struct tcp_log_stamp {
+	__u32	tv_sec;
+	__u32	tv_usec;
+};
+
+struct tcp_log_payload {
+	struct tcp_log_key	key;
+	struct tcp_log_stamp	stamp;
+	struct tcp_info		info;
+};
+
+enum {
+	TCP_LOG_A_UNSPEC = 0,
+	__TCP_LOG_A_MAX,
+};
+#define TCP_LOG_A_MAX	(__TCP_LOG_A_MAX - 1)
+
+#define TCP_LOG_GENL_NAME	"tcp_log"
+#define TCP_LOG_GENL_VERSION	1
+
+enum {
+	TCP_LOG_CMD_UNSPEC = 0,
+	TCP_LOG
Re: TCP event tracking via netlink...
On Wed, Dec 05, 2007 at 09:03:43PM -0800, David Miller ([EMAIL PROTECTED]) wrote:
> I think this work is very different.
>
> When I say "state" I mean something more significant than
> CLOSE, ESTABLISHED, etc. which is what Samir's patches are
> tracking.
>
> I'm talking about all of the sequence numbers, SACK information,
> congestion control knobs, etc. whose values are nearly impossible to
> track on a packet to packet basis in order to diagnose problems.

I pointed to that work as a possible basis for collecting more info if
needed, including sequence numbers, window sizes and so on. It just
requires a useful structure layout in place, so that one would not have
to recreate the same bits again and it could be called from any place
inside the stack.

--
Evgeniy Polyakov
Re: TCP event tracking via netlink...
Em Thu, Dec 06, 2007 at 02:20:58AM -0800, David Miller escreveu:
> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Wed, 5 Dec 2007 16:33:38 -0500
>
> > On Wed, 05 Dec 2007 08:53:07 -0800
> > Joe Perches <[EMAIL PROTECTED]> wrote:
> >
> > > > it occurred to me that we might want to do something
> > > > like a state change event generator.
> > >
> > > This could be a basis for an interesting TCP
> > > performance tester.
> >
> > That is what tcpprobe does but it isn't detailed enough to address SACK
> > issues.
>
> Indeed, this could be done via the jprobe there.
>
> Silly me I didn't do this in the implementation I whipped
> up, which I'll likely correct.

I have some experiments from the past in this area.

This is what is produced by ctracer + the ostra callgrapher when
tracking many sk_buff objects, tracing sk_buff routines as well as all
other structs that have a pointer to an sk_buff, i.e. where the sk_buff
can be obtained from the struct that has a pointer to it. tcp_sock is an
"alias" to struct inet_sock, which is an "alias" to struct sock, etc.,
so when tracing tcp_sock you also trace inet_connection_sock, inet_sock
and sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/many_objects/

With just one object (that is reused, so appears many times):

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/0x8101013130e8/

Following struct sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/many_objects/
http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/

struct socket:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/socket/many_objects/

It works by using the DWARF information to generate a systemtap module
that in turn creates a relayfs channel, where we store the traces, and
an automatically reorganized struct with just the base types (int, char,
long, etc.) and typedefs that end up being base types.

Example of the mini struct sock recreated from the debugging information
and reorganized, using the algorithms in pahole to save space, generated
by this tool; go to the bottom, where you'll find struct
ctracer__mini_sock and the collector, which fills in the mini struct
from a full-sized object:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_collector.struct.sock.c

And the automatically generated systemtap module (tcpprobe on steroids):

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_methods.struct.sock.stp

This requires more work to:

. reduce the overhead
. filter out undesired functions, creating a "project" with the desired
  functions using some GUI editor
. specify lists of fields to put in the internal state to be collected,
  again using a GUI or plain ctracer-edit using vi, instead of getting
  just base types
. be able to say: collect just the fields on the second and fourth
  cacheline
. add collectors for complex objects such as spinlocks, the socket lock,
  mutexes

But since people are wanting to work on tools to watch state
transitions, fields changing, etc., I thought I should dust off the
ostra experiments and the more recent dwarves ctracer work I'm doing in
my copious spare time 8)

In the callgrapher there is some more interesting stuff.

Interface to see where fields changed:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/changes.html

On this page, clicking on a field name, such as:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/sk_forward_alloc.png

you'll get graphs over time.

Code is in the dwarves repo at:

http://master.kernel.org/git/?p=linux/kernel/git/acme/pahole.git;a=summary

Thanks,

- Arnaldo
Re: TCP event tracking via netlink...
On Thu, 06 Dec 2007 02:33:46 -0800 (PST)
David Miller <[EMAIL PROTECTED]> wrote:

> From: "Ilpo Järvinen" <[EMAIL PROTECTED]>
> Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)
>
> > On Wed, 5 Dec 2007, David Miller wrote:
> >
> > > I assume you're using something like carefully crafted printk's,
> > > kprobes, or even ad-hoc statistic counters. That's what I used to do
> > > :-)
> >
> > No, that's not at all what I do :-). I usually look at time-seq graphs
> > except for the cases when I just find things out by reading code (or
> > by just thinking of it).
>
> Can you briefly detail what graph tools and command lines
> you are using?
>
> The last time I did graphing to analyze things, the tools
> were hit-or-miss.
>
> > Much of the info is available in tcpdump already, it's just hard to read
> > without graphing it first because there are so many overlapping things
> > to track in two-dimensional space.
> >
> > ...But yes, I have to admit that a couple of problems come to my mind
> > where having some variable from tcp_sock would have made the problem
> > more obvious.
>
> The most important are the cwnd and ssthresh, which you could guess
> using graphs but it is important to know on a packet to packet
> basis why we might have sent a packet or not because this has
> rippling effects down the rest of the RTT.
>
> > Not sure what is the benefit of having distributions with it because
> > those people hardly report problems anyway to here, they're just too
> > happy with TCP performance unless we print something to their logs,
> > which implies that we must setup a *_ON() condition :-(.
>
> That may be true, but if we could integrate the information with
> tcpdumps, we could gather internal state using tools the user
> already has available.
>
> Imagine if tcpdump printed out:
>
> 02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
> 	ss_thresh: 129 cwnd: 133 packets_out: 132
>
> or something like that.
>
> > Some problems are simply such that things cannot be accurately verified
> > without high processing overhead until it's far too late (eg skb bits vs
> > *_out counters). Maybe we should start to build an expensive state
> > validator as well which would automatically check invariants of the
> > write queue and tcp_sock in a straightforward, unoptimized manner? That
> > would definitely do a lot of work for us, just ask people to turn it on
> > and it spits out everything that went wrong :-) (unless they really
> > depend on very high-speed things and are therefore unhappy if we scan
> > thousands of packets unnecessarily per ACK :-)). ...Early enough!
> > ...That would work also for distros but there's always human judgement
> > needed to decide whether the bug reporter will be happy when his TCP
> > processing no longer scales ;-).
>
> I think it's useful as a TCP_DEBUG config option or similar, sure.
>
> But sometimes the algorithms are working as designed, it's just that
> they provide poor pipe utilization and CWND analysis embedded inside
> of a tcpdump would be one way to see that as well as determine the
> flaw in the algorithm.
>
> > ...Hopefully you found any of my comments useful.
>
> Very much so, thanks.
>
> I put together a sample implementation anyway just to show the idea,
> against net-2.6.25 below.
>
> It is untested since I didn't write the userland app yet to see that
> proper things get logged. Basically you could run a daemon that
> writes per-connection traces into files based upon the incoming
> netlink events. Later, using the binary pcap file and these traces,
> you can piece together traces like the above using the timestamps
> etc. to match up pcap packets to ones from the TCP logger.
>
> The userland tools could do analysis and print pre-cooked state diff
> logs, like "this ACK raised CWND by one" or whatever else you wanted
> to know.
>
> It's nice that an expert like you can look at graphs and understand,
> but we'd like to create more experts and besides reading code one
> way to become an expert is to be able to extract live real data
> from the kernel's working state and try to understand how things
> got that way. This information is permanently lost currently.

Tools and scripts for testing that generate graphs are at:

git://git.kernel.org/pub/scm/tcptest/tcptest
Re: TCP event tracking via netlink...
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Thu, 6 Dec 2007 09:23:12 -0800

> Tools and scripts for testing that generate graphs are at:
> git://git.kernel.org/pub/scm/tcptest/tcptest

I know about this; I'm just curious what exactly Ilpo is using :-)
Re: TCP event tracking via netlink...
On Thu, 6 Dec 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
> Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)
>
> > On Wed, 5 Dec 2007, David Miller wrote:
> >
> > > I assume you're using something like carefully crafted printk's,
> > > kprobes, or even ad-hoc statistic counters. That's what I used to do
> > > :-)
> >
> > No, that's not at all what I do :-). I usually look at time-seq graphs
> > except for the cases when I just find things out by reading code (or
> > by just thinking of it).
>
> Can you briefly detail what graph tools and command lines
> you are using?

I have a tool called Sealion, but it's behind an NDA (making it open
source has been talked about for a long time, but I have no idea why it
hasn't happened yet). It's mostly tcl/tk code, by no means nice or clean
in design or quality (I'll leave the details of why I think it's that way
out of this discussion :-)). It produces SVGs. Usually the things I need
are in the standard sent+ACK+SACKs(+win) graph it produces. The result is
quite similar to what tcptrace+xplot produces, but the xplot UI is really
horrible, IMHO. If I have to deal with tcpdump output only, it takes a
considerable amount of time to do computations with bc to come up with
the same understanding by just reading tcpdumps.

> The last time I did graphing to analyze things, the tools
> were hit-or-miss.

Yeah, this is definitely true. The open source graphing tools I know of
are really not that astonishing :-(. I've tried to look for better tools
as well, but with little success.

> > Much of the info is available in tcpdump already, it's just hard to read
> > without graphing it first because there are so many overlapping things
> > to track in two-dimensional space.
> >
> > ...But yes, I have to admit that a couple of problems come to my mind
> > where having some variable from tcp_sock would have made the problem
> > more obvious.
>
> The most important are the cwnd and ssthresh, which you could guess
> using graphs but it is important to know on a packet to packet
> basis why we might have sent a packet or not because this has
> rippling effects down the rest of the RTT.

A couple of points: in order to evaluate the validity of some action, one
might need more than one packet from the history. The answer to why we
have sent a packet is rather simple (excluding RTOs): cwnd >
packets_in_flight and data was available. No, it's not at all
complicated. Though I might be too biased toward non-application-limited
cases, which make the formula even simpler because everything is
basically ACK clocked. To really tell what caused changes between cwnd
and/or packets_in_flight, one usually needs some history or a more
fine-grained approach; once per packet is way too wide a gap. It tells
just what happened, not why, unless you're really familiar with the
state machine and can make the right guess.

> > Not sure what is the benefit of having distributions with it because
> > those people hardly report problems anyway to here, they're just too
> > happy with TCP performance unless we print something to their logs,
> > which implies that we must setup a *_ON() condition :-(.
>
> That may be true, but if we could integrate the information with
> tcpdumps, we could gather internal state using tools the user
> already has available.

It would definitely help if we could, but that of course depends on
getting the reports in the first place.

> Imagine if tcpdump printed out:
>
> 02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
> 	ss_thresh: 129 cwnd: 133 packets_out: 132
>
> or something like that.

How about this:

02:26:14.865805 IP $SRC > $DEST: . ack 11226 win 108 <...sack 1 {15606:18526}
	17066:18526 0->S sacktag_one l0 s1 r0 f4 pc1 ...
	11226:12686 clean_rtx_queue ...
	11226:12686 0->L mark_head_lost l1 s1 r0 f4 pc1 ...
	12686:14146 0->L mark_head_lost l2 s1 r0 f4 pc1 ...
	11226:12686 L->LRe retransmit_skb l2 s1 r1 f4 pc1 ...

...would make the bug in the SACK processing relatively obvious (yes, it
has an intentional flaw in it; points for finding it :-))... That would
be something I'd like to have right now.

> But sometimes the algorithms are working as designed, it's just that
> they provide poor pipe utilization and CWND analysis embedded inside
> of a tcpdump would be one way to see that as well as determine the
> flaw in the algorithm.

Fair enough.

> It is untested since I didn't write the userland app yet to see that
> proper things get logged. Basically you could run a daemon that
> writes per-connection traces into files based upon the incoming
> netlink events. Later, using the binary pcap file and these traces,
> you can piece together traces like the above using the timestamps
> etc. to match up pcap packets to ones from the TCP logger.
>
> The userland tools could do analysis and print pre-cooked state diff
> logs, like "this ACK raised CWND by one" or whatever else you wanted
> to know.

Obviously a collection of useful userland tools see
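[Editor's note] Ilpo's "why we have sent a packet" rule above can be sketched as a tiny predicate. This is a hedged illustration of the stated formula only, not the kernel's actual send test; the function and parameter names are invented here:

```python
def can_send_packet(cwnd: int, packets_in_flight: int,
                    data_available: bool) -> bool:
    """Simplified model of the rule quoted above: excluding RTOs, a new
    data packet goes out only while the congestion window exceeds the
    number of packets currently in flight and the application has
    queued data to send."""
    return data_available and cwnd > packets_in_flight

# One slot left in the window: sending is allowed.
print(can_send_packet(cwnd=10, packets_in_flight=9, data_available=True))   # True
# Window full: must wait for the ACK clock to tick.
print(can_send_packet(cwnd=10, packets_in_flight=10, data_available=True))  # False
```

In the ACK-clocked, non-application-limited case Ilpo mentions, `data_available` is always true, so the predicate reduces to the single comparison `cwnd > packets_in_flight`.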
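[Editor's note] The matching step Dave describes — pairing pcap packets with TCP-logger events by timestamp — could be sketched in userland roughly as follows. The record formats and the tolerance value are illustrative assumptions, not part of the proposed netlink interface:

```python
import bisect

def match_events(pcap_packets, logger_events, tolerance=0.001):
    """Pair each captured packet with the closest TCP-logger event
    within `tolerance` seconds.  Both inputs are lists of
    (timestamp, description) tuples, sorted by timestamp."""
    event_times = [t for t, _ in logger_events]
    matches = []
    for ts, pkt in pcap_packets:
        i = bisect.bisect_left(event_times, ts)
        # Consider the candidate events on either side of the packet
        # timestamp and keep the nearer one if it is close enough.
        best = None
        for j in (i - 1, i):
            if 0 <= j < len(logger_events):
                dt = abs(logger_events[j][0] - ts)
                if dt <= tolerance and (best is None or dt < best[0]):
                    best = (dt, logger_events[j][1])
        matches.append((pkt, best[1] if best else None))
    return matches

packets = [(1.0000, "ack 11226"), (1.0500, "seg 12686")]
events = [(1.0002, "cwnd=133 ssthresh=129"), (2.0, "cwnd=134")]
print(match_events(packets, events))
# [('ack 11226', 'cwnd=133 ssthresh=129'), ('seg 12686', None)]
```

A real tool would also have to disambiguate by connection 4-tuple and sequence numbers, since two events can share one timestamp; the timestamp is only the first-pass key.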
Re: TCP event tracking via netlink...
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Thu, 6 Dec 2007 09:23:12 -0800

> Tools and scripts for testing that generate graphs are at:
> 	git://git.kernel.org/pub/scm/tcptest/tcptest

Did you move it somewhere else?

[EMAIL PROTECTED]:~/src/GIT$ git clone git://git.kernel.org/pub/scm/tcptest/tcptest
Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/
fatal: The remote end hung up unexpectedly
fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed.
Re: TCP event tracking via netlink...
On Wed, 2 Jan 2008, David Miller wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Thu, 6 Dec 2007 09:23:12 -0800
>
> > Tools and scripts for testing that generate graphs are at:
> > 	git://git.kernel.org/pub/scm/tcptest/tcptest
>
> Did you move it somewhere else?
>
> [EMAIL PROTECTED]:~/src/GIT$ git clone
> git://git.kernel.org/pub/scm/tcptest/tcptest
> Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/
> fatal: The remote end hung up unexpectedly
> fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed.

.../network/ was missing from the path :-).

$ git-remote show origin
* remote origin
  URL: git://git.kernel.org/pub/scm/network/tcptest/tcptest.git
  Remote branch(es) merged with 'git pull' while on branch master
    master
  Tracked remote branches
    master

-- 
 i.
Re: TCP event tracking via netlink...
From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Wed, 2 Jan 2008 13:05:17 +0200 (EET)

> git://git.kernel.org/pub/scm/network/tcptest/tcptest.git

Thanks a lot Ilpo.