Re: TCP event tracking via netlink...

2007-12-05 Thread Evgeniy Polyakov
Hi.

On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) 
wrote:
> >Maybe if we want to get really fancy we can have some more-expensive
> >debug mode where detailed specific events get generated via some
> >macros we can scatter all over the place.  This won't be useful
> >for general user problem analysis, but it will be excellent for
> >developers.
> >
> >Let me know if you think this is useful enough and I'll work on
> >an implementation we can start playing with.
> 
> 
> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

And even more similar to this patch from Samir Bellabes of Mandriva:
http://lwn.net/Articles/202255/

-- 
Evgeniy Polyakov
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP event tracking via netlink...

2007-12-05 Thread Samir Bellabes
Evgeniy Polyakov <[EMAIL PROTECTED]> writes:

> Hi.
>
> On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) 
> wrote:
>> >Maybe if we want to get really fancy we can have some more-expensive
>> >debug mode where detailed specific events get generated via some
>> >macros we can scatter all over the place.  This won't be useful
>> >for general user problem analysis, but it will be excellent for
>> >developers.
>> >
>> >Let me know if you think this is useful enough and I'll work on
>> >an implementation we can start playing with.
>> 
>> 
>> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
>> http://caia.swin.edu.au/urp/newtcp/tools.html
>> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf
>
> And even more similar to this patch from Samir Bellabes of Mandriva:
> http://lwn.net/Articles/202255/

Indeed, I was thinking about this idea. But my goal is not to deal
with specific protocols like TCP; it's just to deal with the LSM hooks.
Anyway, the idea is the same: having a daemon in userspace to catch
the information. So why not an extension?

Lately, I've been moving the code from connector to generic netlink.
regards,
sam


Re: TCP event tracking via netlink...

2007-12-05 Thread Joe Perches
> it occurred to me that we might want to do something
> like a state change event generator.

This could be a basis for an interesting TCP
performance tester.




Re: TCP event tracking via netlink...

2007-12-05 Thread Stephen Hemminger
On Wed, 05 Dec 2007 08:53:07 -0800
Joe Perches <[EMAIL PROTECTED]> wrote:

> > it occurred to me that we might want to do something
> > like a state change event generator.
> 
> This could be a basis for an interesting TCP
> performance tester.

That is what tcpprobe does but it isn't detailed enough to address SACK
issues.


Re: TCP event tracking via netlink...

2007-12-05 Thread Ilpo Järvinen
On Wed, 5 Dec 2007, Stephen Hemminger wrote:

> On Wed, 05 Dec 2007 08:53:07 -0800
> Joe Perches <[EMAIL PROTECTED]> wrote:
> 
> > > it occurred to me that we might want to do something
> > > like a state change event generator.
> > 
> > This could be a basis for an interesting TCP
> > performance tester.
> 
> That is what tcpprobe does but it isn't detailed enough to address SACK
> issues.

...It would be nice if that could be generalized so that the probe could 
be attached to functions other than tcp_rcv_established.

If we convert the remaining functions that don't have sk or tp as the first 
argument so that sk is listed first (there shouldn't be many with the wrong 
ordering, if any), then maybe a generic handler could be of type:

jtcp_entry(struct sock *sk, ...)

or when available:

jtcp_entry(struct sock *sk, struct sk_buff *ack, ...)


-- 
 i.


Re: TCP event tracking via netlink...

2007-12-05 Thread John Heffner

David Miller wrote:

Ilpo, I was pondering the kind of debugging one does to find
congestion control issues and even SACK bugs and it's currently too
painful because there is no standard way to track state changes.

I assume you're using something like carefully crafted printk's,
kprobes, or even ad-hoc statistic counters.  That's what I used to do
:-)

With that in mind it occurred to me that we might want to do something
like a state change event generator.

Basically some application or even a daemon listens on this generic
netlink socket family we create.  The header of each event packet
indicates what socket the event is for and then there is some state
information.

Then you can look at a tcpdump and this state dump side by side and
see what the kernel decided to do.

Now there is the question of granularity.

A very important consideration in this is that we want this thing to
be enabled in the distributions, therefore it must be cheap.  Perhaps
one test at the end of the packet input processing.

So I say we pick some state to track (perhaps start with tcp_info)
and just push that at the end of every packet input run.  Also,
we add some minimal filtering capability (match on specific IP
address and/or port, for example).

Maybe if we want to get really fancy we can have some more-expensive
debug mode where detailed specific events get generated via some
macros we can scatter all over the place.  This won't be useful
for general user problem analysis, but it will be excellent for
developers.

Let me know if you think this is useful enough and I'll work on
an implementation we can start playing with.



FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
http://caia.swin.edu.au/urp/newtcp/tools.html
http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

  -John


Re: TCP event tracking via netlink...

2007-12-05 Thread Ilpo Järvinen
On Wed, 5 Dec 2007, David Miller wrote:

> Ilpo, I was pondering the kind of debugging one does to find
> congestion control issues and even SACK bugs and it's currently too
> painful because there is no standard way to track state changes.

That's definitely true.

> I assume you're using something like carefully crafted printk's,
> kprobes, or even ad-hoc statistic counters.  That's what I used to do
> :-)

No, that's not at all what I do :-). I usually look at time-seq graphs, 
except for the cases when I just find things out by reading the code (or 
by just thinking about it). I'm so used to all the things in the graphs that 
I can quite easily spot any inconsistencies & TCP events and then look at 
the interesting parts in greater detail; very rarely does something remain 
uncertain... However, instead of going directly to printks, etc., I almost 
always read the code first (usually it's not just a couple of lines but tens 
of potential TCP execution paths involving more than a handful of 
functions to check what the end result would be). This has a nice 
side-effect that other things tend to show up as well. Only when things 
get nasty and I cannot figure out what it does wrong do I add 
specially placed ad-hoc printks.

One trick I also use is to get the vars of the relevant flow from 
/proc/net/tcp in a while loop, but it only works for my case because
I use links that are slow (even a small-value sleep in the loop does
not hide much).

For other people's reports, I occasionally have to write validator patches, 
as you might have noticed, because in a typical miscount case our 
BUG_TRAPs fire too late: they trigger only after the outstanding window 
becomes zero, which may already be a very distant point in time from the 
cause.

Also, I'm planning an experiment with the markers feature to see if 
it is of any use when trying to gather some latency data about 
SACK processing, because the markers seem lightweight enough not to be 
disturbing.

> With that in mind it occurred to me that we might want to do something
> like a state change event generator.
> 
> Basically some application or even a daemon listens on this generic
> netlink socket family we create.  The header of each event packet
> indicates what socket the event is for and then there is some state
> information.
> 
> Then you can look at a tcpdump and this state dump side by side and
> see what the kernel decided to do.

Much of the info is available in tcpdump already, it's just hard to read 
without graphing it first because there are so many overlapping things 
to track in two-dimensional space.

...But yes, I have to admit that a couple of problems come to mind
where having some variable from tcp_sock would have made the problem
more obvious.

> Now there is the question of granularity.
> 
> A very important consideration in this is that we want this thing to
> be enabled in the distributions, therefore it must be cheap.  Perhaps
> one test at the end of the packet input processing.

Not sure what the benefit of having it in distributions is, because 
those people hardly ever report problems here anyway; they're just too 
happy with TCP performance unless we print something to their logs,
which implies that we must set up a *_ON() condition :-(.

Yes, an often neglected problem is that most people are just too happy even 
with something like TCP Tahoe or something equally prehistoric. I've been 
surprised how badly TCP can break without anybody complaining as long as it 
doesn't crash (not even any of the devs). Two key things seem to surface most 
of the TCP-related bugs: research people really staring at strange packet 
patterns (or code), and reports triggered by automatic WARN/BUG_ON checks.
The latter reports also include corner cases which nobody would otherwise 
ever have noticed (or at least not before Linus releases 3.0 :-/).

IMHO, those invariant WARN/BUG_ONs are the only alternative that scales to 
normal users well enough. The checks are simple enough that they can be 
always on, and then we just happen to print something to their log, and 
that's offensive enough for somebody to come up with a report... ;-)

> So I say we pick some state to track (perhaps start with tcp_info)
> and just push that at the end of every packet input run.  Also,
> we add some minimal filtering capability (match on specific IP
> address and/or port, for example).
>
> Maybe if we want to get really fancy we can have some more-expensive
> debug mode where detailed specific events get generated via some
> macros we can scatter all over the place.
>
> This won't be useful for general user problem analysis, but it will be 
> excellent for developers.

I would say that for it to be generic enough, most function entries and exits
would have to be covered, because the need varies a lot; the processing in 
general is so complex that things would too easily get shadowed otherwise! 
In addition we need an expensive mode++ which goes all the way down to the 
dirty details of the write queue; they're now dirtier than

Re: TCP event tracking via netlink...

2007-12-05 Thread Stephen Hemminger
On Thu, 6 Dec 2007 00:15:49 +0200 (EET)
"Ilpo Järvinen" <[EMAIL PROTECTED]> wrote:

> On Wed, 5 Dec 2007, Stephen Hemminger wrote:
> 
> > On Wed, 05 Dec 2007 08:53:07 -0800
> > Joe Perches <[EMAIL PROTECTED]> wrote:
> > 
> > > > it occurred to me that we might want to do something
> > > > like a state change event generator.
> > > 
> > > This could be a basis for an interesting TCP
> > > performance tester.
> > 
> > That is what tcpprobe does but it isn't detailed enough to address SACK
> > issues.
> 
> ...It would be nice if that could be generalized so that the probe could 
> be attached to some other functions than tcp_rcv_established instead.
> 
> If we convert remaining functions that don't have sk or tp as first 
> argument so that sk is listed first (should be many with wrong ordering 
> if any), then maybe a generic handler could be of type:
> 
> jtcp_entry(struct sock *sk, ...)
> 
> or when available:
> 
> jtcp_entry(struct sock *sk, struct sk_buff *ack, ...)
> 
> 
> -- 
>  i.

An earlier version had hooks in send as well; it is trivial to extend. As
long as the prototypes match, any function arg ordering is okay.


Re: TCP event tracking via netlink...

2007-12-05 Thread David Miller
From: John Heffner <[EMAIL PROTECTED]>
Date: Wed, 05 Dec 2007 09:11:01 -0500

> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

Yes, my proposal is very similar to this SIFTR work.

In their work they tap into the stack using the packet filtering
hooks.

In this way they avoid having to make TCP stack modifications; they
just look up the PCB and dump state, whereas we have more liberty to
do more serious surgery :-)


Re: TCP event tracking via netlink...

2007-12-05 Thread David Miller
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Wed, 5 Dec 2007 17:48:43 +0300

> On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner ([EMAIL PROTECTED]) 
> wrote:
> > >Maybe if we want to get really fancy we can have some more-expensive
> > >debug mode where detailed specific events get generated via some
> > >macros we can scatter all over the place.  This won't be useful
> > >for general user problem analysis, but it will be excellent for
> > >developers.
> > >
> > >Let me know if you think this is useful enough and I'll work on
> > >an implementation we can start playing with.
> > 
> > 
> > FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> > http://caia.swin.edu.au/urp/newtcp/tools.html
> > http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf
> 
> And even more similar to this patch from Samir Bellabes of Mandriva:
> http://lwn.net/Articles/202255/

I think this work is very different.

When I say "state" I mean something more significant than
CLOSE, ESTABLISHED, etc. which is what Samir's patches are
tracking.

I'm talking about all of the sequence numbers, SACK information,
congestion control knobs, etc. whose values are nearly impossible to
track on a packet to packet basis in order to diagnose problems.

Web100 provided facilities along these lines as well.


Re: TCP event tracking via netlink...

2007-12-06 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Wed, 5 Dec 2007 16:33:38 -0500

> On Wed, 05 Dec 2007 08:53:07 -0800
> Joe Perches <[EMAIL PROTECTED]> wrote:
> 
> > > it occurred to me that we might want to do something
> > > like a state change event generator.
> > 
> > This could be a basis for an interesting TCP
> > performance tester.
> 
> That is what tcpprobe does but it isn't detailed enough to address SACK
> issues.

Indeed, this could be done via the jprobe there.

Silly me I didn't do this in the implementation I whipped
up, which I'll likely correct.


Re: TCP event tracking via netlink...

2007-12-06 Thread David Miller
From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)

> On Wed, 5 Dec 2007, David Miller wrote:
> 
> > I assume you're using something like carefully crafted printk's,
> > kprobes, or even ad-hoc statistic counters.  That's what I used to do
> > :-)
> 
> No, that's not at all what I do :-). I usually look time-seq graphs 
> expect for the cases when I just find things out by reading code (or
> by just thinking of it).

Can you briefly detail what graph tools and command lines
you are using?

The last time I did graphing to analyze things, the tools
were hit-or-miss.

> Much of the info is available in tcpdump already, it's just hard to read 
> without graphing it first because there are some many overlapping things 
> to track in two-dimensional space.
> 
> ...But yes, I have to admit that couple of problems come to my mind
> where having some variable from tcp_sock would have made the problem
> more obvious.

The most important are cwnd and ssthresh, which you could guess
from graphs, but it is important to know on a packet-to-packet
basis why we might or might not have sent a packet, because this has
rippling effects down the rest of the RTT.

> Not sure what is the benefit of having distributions with it because 
> those people hardly report problems anyway to here, they're just too 
> happy with TCP performance unless we print something to their logs,
> which implies that we must setup a *_ON() condition :-(.

That may be true, but if we could integrate the information with
tcpdumps, we could gather internal state using tools the user
already has available.

Imagine if tcpdump printed out:

02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
ss_thresh: 129 cwnd: 133 packets_out: 132

or something like that.

> Some problems are simply such that things cannot be accurately verified 
> without high processing overhead until it's far too late (eg skb bits vs 
> *_out counters). Maybe we should start to build an expensive state 
> validator as well which would automatically check invariants of the write 
> queue and tcp_sock in a straight forward, unoptimized manner? That would 
> definately do a lot of work for us, just ask people to turn it on and it 
> spits out everything that went wrong :-) (unless they really depend on 
> very high-speed things and are therefore unhappy if we scan thousands of 
> packets unnecessarily per ACK :-)). ...Early enough! ...That would work 
> also for distros but there's always human judgement needed to decide 
> whether the bug reporter will be happy when his TCP processing does no 
> longer scale ;-).

I think it's useful as a TCP_DEBUG config option or similar, sure.

But sometimes the algorithms are working as designed; it's just that
they provide poor pipe utilization, and CWND analysis embedded inside
a tcpdump would be one way to see that as well as to determine the
flaw in the algorithm.

> ...Hopefully you found any of my comments useful.

Very much so, thanks.

I put together a sample implementation anyways just to show the idea,
against net-2.6.25 below.

It is untested since I didn't write the userland app yet to see that
proper things get logged.  Basically you could run a daemon that
writes per-connection traces into files based upon the incoming
netlink events.  Later, using the binary pcap file and these traces,
you can piece together traces like the above using the timestamps
etc. to match up pcap packets to ones from the TCP logger.

The userland tools could do analysis and print pre-cooked state diff
logs, like "this ACK raised CWND by one" or whatever else you wanted
to know.

It's nice that an expert like you can look at graphs and understand,
but we'd like to create more experts and besides reading code one
way to become an expert is to be able to extract live real data
from the kernel's working state and try to understand how things
got that way.  This information is permanently lost currently.

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 56342c3..c0e61d0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -170,6 +170,47 @@ struct tcp_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN]; /* key (binary) */
 };
 
+/* TCP netlink event logger.  */
+struct tcp_log_key {
+   union {
+   __be32  a4;
+   __be32  a6[4];
+   } saddr, daddr;
+   __be16  sport;
+   __be16  dport;
+   unsigned short family;
+   unsigned short __pad;
+};
+
+struct tcp_log_stamp {
+   __u32   tv_sec;
+   __u32   tv_usec;
+};
+
+struct tcp_log_payload {
+   struct tcp_log_key  key;
+   struct tcp_log_stampstamp;
+   struct tcp_info info;
+};
+
+enum {
+   TCP_LOG_A_UNSPEC = 0,
+   __TCP_LOG_A_MAX,
+};
+#define TCP_LOG_A_MAX  (__TCP_LOG_A_MAX - 1)
+
+#define TCP_LOG_GENL_NAME  "tcp_log"
+#define TCP_LOG_GENL_VERSION   1
+
+enum {
+   TCP_LOG_CMD_UNSPEC = 0,
+   TCP_LOG

Re: TCP event tracking via netlink...

2007-12-06 Thread Evgeniy Polyakov
On Wed, Dec 05, 2007 at 09:03:43PM -0800, David Miller ([EMAIL PROTECTED]) 
wrote:
> I think this work is very different.
> 
> When I say "state" I mean something more significant than
> CLOSE, ESTABLISHED, etc. which is what Samir's patches are
> tracking.
> 
> I'm talking about all of the sequence numbers, SACK information,
> congestion control knobs, etc. whose values are nearly impossible to
> track on a packet to packet basis in order to diagnose problems.

I pointed at that work as a possible basis for collecting more info if you
need it, including sequence numbers, window sizes and so on.
It just requires a useful structure layout to be in place, so that one would
not have to recreate the same bits again and it could be called from any
place inside the stack.

-- 
Evgeniy Polyakov


Re: TCP event tracking via netlink...

2007-12-06 Thread Arnaldo Carvalho de Melo
Em Thu, Dec 06, 2007 at 02:20:58AM -0800, David Miller escreveu:
> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Wed, 5 Dec 2007 16:33:38 -0500
> 
> > On Wed, 05 Dec 2007 08:53:07 -0800
> > Joe Perches <[EMAIL PROTECTED]> wrote:
> > 
> > > > it occurred to me that we might want to do something
> > > > like a state change event generator.
> > > 
> > > This could be a basis for an interesting TCP
> > > performance tester.
> > 
> > That is what tcpprobe does but it isn't detailed enough to address SACK
> > issues.
> 
> Indeed, this could be done via the jprobe there.
> 
> Silly me I didn't do this in the implementation I whipped
> up, which I'll likely correct.

I have some experiments from the past in this area:

This is what is produced by ctracer + the ostra callgrapher when
tracking many sk_buff objects, tracing sk_buff routines as well as all
other structs that have a pointer to an sk_buff, i.e. where the sk_buff
can be obtained from the struct that points to it. tcp_sock is an
"alias" to struct inet_sock, which is an "alias" to struct sock, etc., so
when tracing tcp_sock you also trace the inet_connection_sock, inet_sock
and sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/many_objects/

With just one object (that is reused, so appears many times):

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/0x8101013130e8/

Following struct sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/many_objects/

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/

struct socket:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/socket/many_objects/

It works by using the DWARF information to generate a systemtap module
that in turn will create a relayfs channel where we store the traces and
an automatically reorganized struct with just the base types (int, char,
long, etc) and typedefs that end up being base types.

For an example of the mini struct recreated from the debugging information
and reorganized using the algorithms in pahole to save space, go to the
bottom of the following file, where you'll find struct ctracer__mini_sock
and the collector that creates the mini struct from a full-sized object:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_collector.struct.sock.c

And the systemtap module (the tcpprobe on steroids) automatically
generated:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_methods.struct.sock.stp

This requires more work to:

. reduce the overhead
. filter out undesired functions, creating a "project" with the desired
  functions using some gui editor
. specify lists of fields to put in the internal state to be collected,
  again using a gui or plain ctracer-edit using vi, instead of getting just
  base types
. be able to say: collect just the fields on the second and fourth cacheline
. add collectors for complex objects such as spinlocks, socket lock, mutexes

But since people want to work on tools to watch state transitions,
fields changing, etc., I thought I should dust off the ostra
experiments and the more recent dwarves ctracer work I'm doing in my
copious spare time 8)

In the callgrapher there are some more interesting stuff:

Interface to see where fields changed:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/changes.html

In this page clicking on a field name, such as:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/sk_forward_alloc.png

You'll get graphs over time.

Code is in the dwarves repo at:

http://master.kernel.org/git/?p=linux/kernel/git/acme/pahole.git;a=summary

Thanks,

- Arnaldo


Re: TCP event tracking via netlink...

2007-12-06 Thread Stephen Hemminger
On Thu, 06 Dec 2007 02:33:46 -0800 (PST)
David Miller <[EMAIL PROTECTED]> wrote:

> From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
> Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)
> 
> > On Wed, 5 Dec 2007, David Miller wrote:
> > 
> > > I assume you're using something like carefully crafted printk's,
> > > kprobes, or even ad-hoc statistic counters.  That's what I used to do
> > > :-)
> > 
> > No, that's not at all what I do :-). I usually look time-seq graphs 
> > expect for the cases when I just find things out by reading code (or
> > by just thinking of it).
> 
> Can you briefly detail what graph tools and command lines
> you are using?
> 
> The last time I did graphing to analyze things, the tools
> were hit-or-miss.
> 
> > Much of the info is available in tcpdump already, it's just hard to read 
> > without graphing it first because there are some many overlapping things 
> > to track in two-dimensional space.
> > 
> > ...But yes, I have to admit that couple of problems come to my mind
> > where having some variable from tcp_sock would have made the problem
> > more obvious.
> 
> The most important are the cwnd and ssthresh, which you could guess
> using graphs but it is important to know on a packet to packet
> basis why we might have sent a packet or not because this has
> rippling effects down the rest of the RTT.
> 
> > Not sure what is the benefit of having distributions with it because 
> > those people hardly report problems anyway to here, they're just too 
> > happy with TCP performance unless we print something to their logs,
> > which implies that we must setup a *_ON() condition :-(.
> 
> That may be true, but if we could integrate the information with
> tcpdumps, we could gather internal state using tools the user
> already has available.
> 
> Imagine if tcpdump printed out:
> 
> 02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
>   ss_thresh: 129 cwnd: 133 packets_out: 132
> 
> or something like that.
> 
> > Some problems are simply such that things cannot be accurately verified 
> > without high processing overhead until it's far too late (eg skb bits vs 
> > *_out counters). Maybe we should start to build an expensive state 
> > validator as well which would automatically check invariants of the write 
> > queue and tcp_sock in a straight forward, unoptimized manner? That would 
> > definately do a lot of work for us, just ask people to turn it on and it 
> > spits out everything that went wrong :-) (unless they really depend on 
> > very high-speed things and are therefore unhappy if we scan thousands of 
> > packets unnecessarily per ACK :-)). ...Early enough! ...That would work 
> > also for distros but there's always human judgement needed to decide 
> > whether the bug reporter will be happy when his TCP processing does no 
> > longer scale ;-).
> 
> I think it's useful as a TCP_DEBUG config option or similar, sure.
> 
> But sometimes the algorithms are working as designed, it's just that
> they provide poor pipe utilization and CWND analysis embedded inside
> of a tcpdump would be one way to see that as well as determine the
> flaw in the algorithm.
> 
> > ...Hopefully you found any of my comments useful.
> 
> Very much so, thanks.
> 
> I put together a sample implementation anyways just to show the idea,
> against net-2.6.25 below.
> 
> It is untested since I didn't write the userland app yet to see that
> proper things get logged.  Basically you could run a daemon that
> writes per-connection traces into files based upon the incoming
> netlink events.  Later, using the binary pcap file and these traces,
> you can piece together traces like the above using the timestamps
> etc. to match up pcap packets to ones from the TCP logger.
> 
> The userland tools could do analysis and print pre-cooked state diff
> logs, like "this ACK raised CWND by one" or whatever else you wanted
> to know.
> 
> It's nice that an expert like you can look at graphs and understand,
> but we'd like to create more experts and besides reading code one
> way to become an expert is to be able to extrace live real data
> from the kernel's working state and try to understand how things
> got that way.  This information is permanently lost currently.


Tools and scripts for testing that generate graphs are at:
git://git.kernel.org/pub/scm/tcptest/tcptest


Re: TCP event tracking via netlink...

2007-12-06 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Thu, 6 Dec 2007 09:23:12 -0800

> Tools and scripts for testing that generate graphs are at:
>   git://git.kernel.org/pub/scm/tcptest/tcptest

I know about this, I'm just curious what exactly Ilpo is
using :-)


Re: TCP event tracking via netlink...

2007-12-07 Thread Ilpo Järvinen
On Thu, 6 Dec 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
> Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)
> 
> > On Wed, 5 Dec 2007, David Miller wrote:
> > 
> > > I assume you're using something like carefully crafted printk's,
> > > kprobes, or even ad-hoc statistic counters.  That's what I used to do
> > > :-)
> > 
> > No, that's not at all what I do :-). I usually look time-seq graphs 
> > expect for the cases when I just find things out by reading code (or
> > by just thinking of it).
> 
> Can you briefly detail what graph tools and command lines
> you are using?

I have a tool called Sealion, but it's behind an NDA (making it open source 
has been discussed for a long time, but I have no idea why it hasn't 
happened yet). It's mostly Tcl/Tk code and is by no means nice or clean 
in design or quality (I'll leave the details of why I think that out of 
this discussion :-)). It produces SVGs. Usually I have the things I need in 
the standard sent+ACK+SACKs(+win) graph it produces. The result is quite 
similar to what tcptrace+xplot produces, but the xplot UI is really 
horrible, IMHO.

If I have to deal with tcpdump output only, it takes a considerable amount 
of time doing computations with bc to come up with the same understanding 
just by reading the tcpdumps.

> The last time I did graphing to analyze things, the tools
> were hit-or-miss.

Yeah, this is definitely true. The open source graphing tools I know of are 
really not that astonishing :-(. I've tried to look for better tools
as well, but with little success.

> > Much of the info is available in tcpdump already, it's just hard to read 
> > without graphing it first because there are some many overlapping things 
> > to track in two-dimensional space.
> > 
> > ...But yes, I have to admit that couple of problems come to my mind
> > where having some variable from tcp_sock would have made the problem
> > more obvious.
> 
> The most important are the cwnd and ssthresh, which you could guess
> using graphs but it is important to know on a packet to packet
> basis why we might have sent a packet or not because this has
> rippling effects down the rest of the RTT.

A couple of points:

In order to evaluate the validity of some action, one might need more than
one packet from the history.

The answer to why we sent a packet is rather simple (excluding RTOs): 
cwnd > packets_in_flight and data was available. No, it's not at all 
complicated. Though I might be too biased toward non-application-limited 
cases, which make the formula even simpler because everything is basically 
ACK clocked.

To really tell what caused changes in cwnd and/or packets_in_flight 
one usually needs some history or a more fine-grained approach; once per 
packet is way too coarse. It tells just what happened, not why, unless 
you're really familiar with the state machine and can make the right 
guess.

> > Not sure what is the benefit of having distributions with it because 
> > those people hardly report problems anyway to here, they're just too 
> > happy with TCP performance unless we print something to their logs,
> > which implies that we must setup a *_ON() condition :-(.
> 
> That may be true, but if we could integrate the information with
> tcpdumps, we could gather internal state using tools the user
> already has available.

It would definitely help if we could, but that of course depends on 
getting the reports in the first place.

> Imagine if tcpdump printed out:
> 
> 02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
>   ss_thresh: 129 cwnd: 133 packets_out: 132
> 
> or something like that.

How about this:

02:26:14.865805 IP $SRC > $DEST: . ack 11226 win 108 <...sack 1 {15606:18526}
17066:18526 0->S sacktag_one l0 s1 r0 f4 pc1 ...
11226:12686  clean_rtx_queue ...
11226:12686 0->L mark_head_lost l1 s1 r0 f4 pc1 ...
12686:14146 0->L mark_head_lost l2 s1 r0 f4 pc1 ...
11226:12686 L->LRe retransmit_skb l2 s1 r1 f4 pc1 ...

...would make the bug in SACK processing relatively obvious (yes, it 
has an intentional flaw in it, points for finding it :-))... That would
be something I'd like to have right now.

> But sometimes the algorithms are working as designed, it's just that
> they provide poor pipe utilization and CWND analysis embedded inside
> of a tcpdump would be one way to see that as well as determine the
> flaw in the algorithm.

Fair enough.


> It is untested since I didn't write the userland app yet to see that
> proper things get logged.  Basically you could run a daemon that
> writes per-connection traces into files based upon the incoming
> netlink events.  Later, using the binary pcap file and these traces,
> you can piece together traces like the above using the timestamps
> etc. to match up pcap packets to ones from the TCP logger.
>
> The userland tools could do analysis and print pre-cooked state diff
> logs, like "this ACK raised CWND by one" or whatever else you wanted
> to know.

Obviously a collection of useful userland tools would see plenty of use.

Re: TCP event tracking via netlink...

2008-01-02 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Thu, 6 Dec 2007 09:23:12 -0800

> Tools and scripts for testing that generate graphs are at:
>   git://git.kernel.org/pub/scm/tcptest/tcptest

Did you move it somewhere else?

[EMAIL PROTECTED]:~/src/GIT$ git clone 
git://git.kernel.org/pub/scm/tcptest/tcptest
Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/
fatal: The remote end hung up unexpectedly
fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed.


Re: TCP event tracking via netlink...

2008-01-02 Thread Ilpo Järvinen
On Wed, 2 Jan 2008, David Miller wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Thu, 6 Dec 2007 09:23:12 -0800
> 
> > Tools and scripts for testing that generate graphs are at:
> > git://git.kernel.org/pub/scm/tcptest/tcptest
> 
> Did you move it somewhere else?
> 
> [EMAIL PROTECTED]:~/src/GIT$ git clone 
> git://git.kernel.org/pub/scm/tcptest/tcptest
> Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/
> fatal: The remote end hung up unexpectedly
> fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed.

.../network/ was missing from the path :-).

$ git-remote show origin
* remote origin
  URL: git://git.kernel.org/pub/scm/network/tcptest/tcptest.git
  Remote branch(es) merged with 'git pull' while on branch master
master
  Tracked remote branches
master


-- 
 i.


Re: TCP event tracking via netlink...

2008-01-03 Thread David Miller
From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Wed, 2 Jan 2008 13:05:17 +0200 (EET)

> git://git.kernel.org/pub/scm/network/tcptest/tcptest.git

Thanks a lot Ilpo.