Re: Netchannles: first stage has been completed. Further ideas.

2006-07-31 Thread David Miller
From: Rusty Russell [EMAIL PROTECTED]
Date: Fri, 28 Jul 2006 15:54:04 +1000

 (1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that
 holds (some subset of?) flows.  A successful lookup immediately after
 packet comes off NIC gives destiny for packet: what route, (optionally)
 what socket, what filtering, what connection tracking (& what NAT), etc?
 I don't know if this should be a general array of fn & data ptrs, or
 specialized fields for each one, or a mix.  Maybe there's a "too hard,
 do slow path" bit, or maybe hard cases just never get put in the cache.
 Perhaps we need a separate one for locally-generated packets, a-la
 ip_route_output().  Anyway, we trade slightly more expensive flow setup
 for faster packet processing within flows.

So, specifically, one of the methods you are thinking about might
be implemented by adding:

void (*input)(struct sk_buff *, void *);
void *input_data;

to struct flow_cache_entry or whatever replaces it?

This way we don't need some kind of type information in
the flow cache entry, since the input handler knows the type.
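
As a rough sketch of how such an entry might fit together (the layout and names below are only illustrative, not the real flow_cache_entry):

struct gufc_entry {
    struct flowi        key;        /* flow tuple this entry matches */
    struct gufc_entry   *next;      /* hash/trie chain */

    /* "Destiny" for matching packets.  The handler knows what
     * input_data points to (a socket, a dst entry, a conntrack
     * entry, ...), so no explicit type field is needed. */
    void                (*input)(struct sk_buff *, void *);
    void                *input_data;
};

/* Fast path: a successful lookup hands the packet straight to its destiny. */
static inline void gufc_deliver(struct gufc_entry *e, struct sk_buff *skb)
{
    e->input(skb, e->input_data);
}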

 One way to do this is to add a have_interest callback into the
 hook_ops, which takes each about-to-be-inserted GUFC entry and adds any
 destinies this hook cares about.  In the case of packet filtering this
 would do a traversal and append a fn/data ptr to the entry for each rule
 which could affect it.

Can you give a concrete example of how the GUFC might make use
of this?  Just some small abstract code snippets will do.
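
For illustration only, one hypothetical shape of the quoted have_interest idea (every name below is invented, none of it is existing netfilter API) is a per-hook callback that runs at GUFC entry creation time:

struct gufc_destiny {
    void    (*fn)(struct sk_buff *, void *);
    void    *data;
};

struct gufc_hook_ops {
    /* Called for each about-to-be-inserted GUFC entry so the hook
     * can append the destinies it cares about for this flow. */
    void    (*have_interest)(struct gufc_entry *e, const struct flowi *key);
};

/* Packet filtering example: walk the ruleset once at flow setup time
 * and attach a destiny for every rule that could affect this flow. */
static void filter_have_interest(struct gufc_entry *e, const struct flowi *key)
{
    struct filter_rule *r;

    for (r = filter_rules; r; r = r->next)
        if (rule_could_match(r, key))
            gufc_add_destiny(e, filter_apply_rule, r);
}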

 The other way is to have the hooks register what they are interested in
 into a general data structure which GUFC entry creation then looks up
 itself.  This general data structure will need to support wildcards
 though.

My gut reaction is that imposing a global data structure on all object
classes is not prudent.  When we take a GUFC miss, it seems better we
call into the subsystems to resolve things.  It can implement whatever
slow path lookup algorithm is most appropriate for its data.

 We also need efficient ways of reflecting rule changes into the GUFC.
 We can be pretty slack with conntrack timeouts, but we either need to
 flush or handle callbacks from GUFC on timed-out entries.  Packet
 filtering changes need to be synchronous, definitely.

This, I will remind, is similar to the problem of doing RCU locking
of the TCP hash tables.

 (3) Smart NICs that do some flowid work themselves can accelerate lookup
 implicitly (same flow goes to same CPU/thread) or explicitly (each
 CPU/thread maintains only part of GUFC which it needs, or even NIC
 returns flow cookie which is pointer to GUFC entry or subtree?).  AFAICT
 this will magnify the payoff from the GUFC.

I want to warn you about HW issues that I mentioned to Alexey the
other week.  If we are not careful, we can run into the same issues
TOE cards run into, performance wise.

Namely, it is important to be careful about how the GUFC table entries
get updated in the card.  If you add them synchronously, your
connection rates will deteriorate dramatically.

I had the idea of a lazy scheme.  When we create a GUFC entry, we
tack it onto a DMA'able linked list the card uses.  We do not
notify the card, we just entail the update onto the list.

Then, if the card misses its on-chip GUFC table on an incoming
packet, it checks the DMA update list by reading it in from memory.
It updates its GUFC table with whatever entries are found on this
list, then it retries the classification of the packet.
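
A sketch of the host side of such a lazy update list might look like this (the descriptor layout and names are purely illustrative; real hardware would dictate the format and doorbell rules):

struct gufc_hw_update {
    __le64  flow_key[2];        /* compressed flow tuple */
    __le64  destiny_cookie;     /* host cookie/pointer for this flow */
    __le64  next_dma;           /* DMA address of next update, 0 = end of list */
};

/* Append an update without notifying the card.  The card only walks
 * this list when it misses its on-chip table, so entry creation stays
 * cheap and asynchronous; deletion would still have to be synchronous. */
static void gufc_hw_queue_update(struct gufc_hw_list *list,
                                 struct gufc_hw_update *upd, dma_addr_t upd_dma)
{
    upd->next_dma = 0;
    wmb();                              /* publish contents before linking */
    list->tail->next_dma = cpu_to_le64(upd_dma);
    list->tail = upd;
    /* deliberately no doorbell or interrupt here */
}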

This seems like a possible good solution until we try to address GUFC
entry deletion, which unfortunately cannot be evaluated in a lazy
fashion.  It must be synchronous.  This is because if, for example, we
just killed off a TCP socket we must make sure we don't hit the GUFC
entry for the TCP identity of that socket any longer.

Just something to think about, when considering how to translate these
ideas into hardware.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-28 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 27 Jul 2006 11:54:19 -0700

 I think we sell our existing stack short.

I agree.

 There are lots of opportunities left to look more closely at actual
 real performance bottlenecks and improve incrementally. But it
 requires tools, time, faster net hardware, and some creative
 insight. I guess it just isn't as cool.

We are in fact suggesting some ideas that address the current
stack issues along the way.  Witness the discussion we had about
the tcp_ack() costs wrt. pruning the retransmit queue and tagging
packets for SACK; I'm working on a new data structure and layout
to cure all that stuff.

But I think we can do better.  Jamal said to me in one email, "If even
only half of Van's numbers are real, this is really exciting."

Rusty and Alexey are looking at the problem from another direction.
Go back to the unified flow cache, implement all the hair to do
that, and then we can look at netchannels because they will be so
much more straightforward at that point.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread David Miller
From: Rusty Russell [EMAIL PROTECTED]
Date: Thu, 27 Jul 2006 15:46:12 +1000

 Yes, my first thought back in January was how netfilter would interact
 with this in a sane way.  One answer is don't: once someone registers
 on any hook we go into slow path.  Another is to run the hooks in socket
 context, which is better, but precludes having the consumer in
 userspace, which still appeals to me 8)

Small steps, small steps.  I have not ruled out userspace TCP just
yet, but we are not prepared to go there right now anyways.  It is
just the same kind of jump to go to kernel level netchannels as it is
to go from kernel level netchannels to userspace netchannel based TCP.

 What would the tuple look like?  Off the top of my head:
 SRCIP/DSTIP/PROTO/SPT/DPT/IN/OUT (where IN and OUT are boolean values
 indicating whether the src/dest is local).
 
 Of course, it means rewriting all the userspace tools, documentation,
 and creating a complete new infrastructure for connection tracking and
 NAT, but if that's what's required, then so be it.

I think we are able to finally talk seriously about revamping
netfilter on this level because we finally have a good incentive to do
so and some kind of model exists to work against.  Robert's trie might
be able to handle your tuple very well, fwiw, perhaps even with
prefixing.

But something occurs to me.  Socket has ID when it is created and
goes to established state.  This means we have this tuple, and thus
we can prelookup the netfilter rule and attach this cached lookup
state on the socket.  Your tuple in this case is defined to be:

SRCIP/DSTIP/TCP/SPT/DPT/0/1

I do not know how practical this is, it is just some suggestion.
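
A sketch of that pre-lookup (the tuple layout and every helper and field here are hypothetical):

struct gufc_tuple {
    __be32  saddr, daddr;
    __be16  sport, dport;
    u8      proto;
    u8      src_is_local:1, dst_is_local:1;
};

/* Run once when the connection reaches established state; the verdict
 * is cached so the per-packet fast path only follows a pointer. */
static void cache_filter_state(struct my_sock *sk)
{
    struct gufc_tuple t = {
        .saddr = sk->saddr,     .daddr = sk->daddr,
        .sport = sk->sport,     .dport = sk->dport,
        .proto = IPPROTO_TCP,
        .src_is_local = 0,      .dst_is_local = 1,
    };

    sk->cached_rules = filter_rules_for(&t);    /* hypothetical helper */
}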

Would there be prefixing in these tuples?  That's where the trouble
starts.  If you add prefixing, troubles and limitations of lookup of
today reappear.  If you disallow prefixing, tables get very large
but lookup becomes simpler and practical.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Alexey Kuznetsov
Hello!

On Thu, Jul 27, 2006 at 03:46:12PM +1000, Rusty Russell wrote:
 Of course, it means rewriting all the userspace tools, documentation,
 and creating a complete new infrastructure for connection tracking and
 NAT, but if that's what's required, then so be it.

That's what I love to hear. Not a joke. :-)

Could I only suggest not to relate this to netchannels? :-)
In the past we used to call this thing (grand-unified) flow cache.


 I don't think they are equivalent.  In channels,

I understand this. Actually, it was what I said in the next paragraph,
which you even cited.

I really do not like to repeat myself, it is nothing but idle talk,
but if the questions are questioned...

First, it was stated that the suggested implementation performs better, and even
much better. I am asking why we see such an improvement.
I am absolutely not satisfied with the statement "It is better. Period."
From all that I see, this particular implementation does not implement
the optimizations suggested by VJ; it implements only the things
which are not supposed to affect performance, or to affect it negatively.

Idle talk? I am sure that if that improvement did not happen due
to a severe protocol violation, we can easily fix the existing stack.


 userspace), no dequeue lock is required.

And that was a part of the second question.

I do not see how single-threaded TCP is possible. In the receive path
it has to ACK within quite strict time bounds, to delack, etc.; in the send path
it has to slow start; and I am not even talking about slow path things:
retransmit, window probing, lingering without process context, etc.
It looks like VJ implies the protocol must be changed. We can't, we mustn't.

After we deidealize this idealization and recognize that some slow path
should exist and some part of this slow path has to be executed
with higher priority than the fast one, where do we arrive?
Is it not exactly what we have right now? A clean fast path, a separate slow path.
Not good enough? Where? Let's find and fix this.

Alexey


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Evgeniy Polyakov
Hello, Alexey.

On Thu, Jul 27, 2006 at 08:33:35PM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) 
wrote:
 First, it was stated that the suggested implementation performs better, and even
 much better. I am asking why we see such an improvement.
 I am absolutely not satisfied with the statement "It is better. Period."
 From all that I see, this particular implementation does not implement
 the optimizations suggested by VJ; it implements only the things
 which are not supposed to affect performance, or to affect it negatively.

Just for clarification: I showed that even using the _existing_ stack
(using sk_backlog_rcv), performance in process context can exceed
two-level processing. And after creating my own TCP implementation
(which does not include two-level related overhead, among other things)
the performance difference was even higher. I can agree that it is possible
that in the second case part of the gain is obtained from the new TCP
implementation and not 100% from process context, but in the first case
the existing socket code was used.

  userspace), no dequeue lock is required.
 
 And that was a part of the second question.
 
 I do not see how single-threaded TCP is possible. In the receive path
 it has to ACK within quite strict time bounds, to delack, etc.; in the send path
 it has to slow start; and I am not even talking about slow path things:
 retransmit, window probing, lingering without process context, etc.
 It looks like VJ implies the protocol must be changed. We can't, we mustn't.
 
 After we deidealize this idealization and recognize that some slow path
 should exist and some part of this slow path has to be executed
 with higher priority than the fast one, where do we arrive?
 Is it not exactly what we have right now? A clean fast path, a separate slow path.
 Not good enough? Where? Let's find and fix this.

The slow path does exist; retransmits and friends are there too in the new stack.
And my initial netchannel implementation used the _existing_ socket code
from process context. Again, there is no need to create two levels
between fast and slow or softirq and process, and it was proven and
shown that it can perform faster.

Why don't you want to see that the existing model is just path enlargement:
there may also be delays between hard and soft irqs, so ACKs will
be delayed and so on... But the stack works without problems even if some
kernel thread takes 100% CPU (with preemption) and there are very big
delays for ACK generation, yet userspace cannot get that
data. With netchannels it is essentially the same (heh, I have said that
a lot of times already).

 Alexey

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Stephen Hemminger
On Wed, 26 Jul 2006 23:00:28 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Rusty Russell [EMAIL PROTECTED]
 Date: Thu, 27 Jul 2006 15:46:12 +1000
 
  Yes, my first thought back in January was how netfilter would interact
  with this in a sane way.  One answer is don't: once someone registers
  on any hook we go into slow path.  Another is to run the hooks in socket
  context, which is better, but precludes having the consumer in
  userspace, which still appeals to me 8)
 
 Small steps, small steps.  I have not ruled out userspace TCP just
 yet, but we are not prepared to go there right now anyways.  It is
 just the same kind of jump to go to kernel level netchannels as it is
 to go from kernel level netchannels to userspace netchannel based TCP.

I think we sell our existing stack short. There are lots of opportunities left
to look more closely at actual real performance bottlenecks and improve
incrementally. But it requires tools, time, faster net hardware, and some
creative insight. I guess it just isn't as cool.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Alexey Kuznetsov
Hello!

 kernel thread takes 100% cpu (with preemption

Preemption, you tell... :-)

I begged you to spend 1 minute of your time to press ^Z. Did you?

Alexey


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Rusty Russell
On Thu, 2006-07-27 at 20:33 +0400, Alexey Kuznetsov wrote:
 Hello!
 
 On Thu, Jul 27, 2006 at 03:46:12PM +1000, Rusty Russell wrote:
  Of course, it means rewriting all the userspace tools, documentation,
  and creating a complete new infrastructure for connection tracking and
  NAT, but if that's what's required, then so be it.
 
 That's what I love to hear. Not a joke. :-)
 
 Could I only suggest not to relate this to netchannels? :-)
 In the past we used to call this thing (grand-unified) flow cache.

Yes.  Thank you for all your explanation, it was very helpful.  I agree,
the grand unified lookup idea returns 8).  The netchannels proposal vs.
netfilter forced me back into thinking about it again, but it is
unrelated.  Any netfilter bypass acceleration will want similar ideas.

I apologize for confusing your discussion of Evgeniy's implementation
with the general channel problem.  My mistake.

  userspace), no dequeue lock is required.
 
 And that was a part of the second question.
 
 I do not see how single-threaded TCP is possible. In the receive path
 it has to ACK within quite strict time bounds, to delack, etc.; in the send path
 it has to slow start; and I am not even talking about slow path things:
 retransmit, window probing, lingering without process context, etc.
 It looks like VJ implies the protocol must be changed. We can't, we mustn't.

All good points.  I can see two kinds of problems here: performance
problems due to wakeup (eg. ack processing for 5MB write), and
correctness problems due to no kernel enforcement.  We need measurements
for the performance issues, so I'll ignore them for the moment.

For correctness: in true end-to-end, where the kernel is just a router for
userspace, we do not worry about such problems 8)  In real life the
kernel must enforce linger and sending-tuple correctness, but I don't
know how much else we must regulate.  Too much, and you are right: we
have a slow and fast path split just like now.

 After we deidealize this idealization and recognize that some slow path
 should exist and some part of this slow path has to be executed
 with higher priority than the fast one, where do we arrive?
 Is it not exactly what we have right now? A clean fast path, a separate slow path.
 Not good enough? Where? Let's find and fix this.

I am still not sure how significant slow path is: if 99% can be in
userspace, it could work very well for RDMA.  I would like to have seen
VJ's implementation so we could compare and steal bits.

Thanks,
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law



Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Evgeniy Polyakov
On Fri, Jul 28, 2006 at 12:56:51AM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) 
wrote:
 Hello!
 
  kernel thread takes 100% cpu (with preemption
 
 Preemption, you tell... :-)
 
 I begged you to spend 1 minute of your time to press ^Z. Did you?

What would you expect from a non-preemptible kernel? Hard lockup, no ACKs,
no softirqs. So this case still does not differ from process-context
processing.

And after several minutes I pressed the hardware reset button...

 Alexey

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread David Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 28 Jul 2006 09:17:25 +0400

 What would you expect from a non-preemptible kernel? Hard lockup, no ACKs,
 no softirqs.

Why does pressing Ctrl-Z on the user process stop kernel soft irq
processing?




Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Evgeniy Polyakov
On Thu, Jul 27, 2006 at 10:34:00PM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
 From: Evgeniy Polyakov [EMAIL PROTECTED]
 Date: Fri, 28 Jul 2006 09:17:25 +0400
 
  What would you expect from a non-preemptible kernel? Hard lockup, no ACKs,
  no softirqs.
 
 Why does pressing Ctrl-Z on the user process stop kernel soft irq
 processing?

I do not know why Alexey decided that Ctrl-Z was ever pressed.
I'm talking about the case when keventd ate 100% of CPU, but the stack worked
with (very) long delays. Obviously userspace was unresponsive and no
data arrived there.
The analogy is that postponed softirq work does not destroy connections,
any more than process-context protocol processing with delays does.
The user does not get its data, so there is no need to send an ACK. And if it is
impossible to get that data at all, the user should not care that the sending
side does not see ACKs. When the user is able to get that data, it starts
to acknowledge.

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Rusty Russell
On Wed, 2006-07-26 at 23:00 -0700, David Miller wrote:
 From: Rusty Russell [EMAIL PROTECTED]
 Date: Thu, 27 Jul 2006 15:46:12 +1000
 
  Yes, my first thought back in January was how netfilter would interact
  with this in a sane way.  One answer is don't: once someone registers
  on any hook we go into slow path.  Another is to run the hooks in socket
  context, which is better, but precludes having the consumer in
  userspace, which still appeals to me 8)
 
 Small steps, small steps.  I have not ruled out userspace TCP just
 yet, but we are not prepared to go there right now anyways.  It is
 just the same kind of jump to go to kernel level netchannels as it is
 to go from kernel level netchannels to userspace netchannel based TCP.

I think I was unclear; the possibility of userspace netchannels adds
weight to the idea that we should rework netfilter hooks sooner rather
than later.

  What would the tuple look like?  Off the top of my head:
  SRCIP/DSTIP/PROTO/SPT/DPT/IN/OUT (where IN and OUT are boolean values
  indicating whether the src/dest is local).
  
  Of course, it means rewriting all the userspace tools, documentation,
  and creating a complete new infrastructure for connection tracking and
  NAT, but if that's what's required, then so be it.
 
 I think we are able to finally talk seriously about revamping
 netfilter on this level because we finally have a good incentive to do
 so and some kind of model exists to work against.  Robert's trie might
 be able to handle your tuple very well, fwiw, perhaps even with
 prefixing.
 
 But something occurs to me.  Socket has ID when it is created and
 goes to established state.  This means we have this tuple, and thus
 we can prelookup the netfilter rule and attach this cached lookup
 state on the socket.  Your tuple in this case is defined to be:
 
   SRCIP/DSTIP/TCP/SPT/DPT/0/1
 
 I do not know how practical this is, it is just some suggestion.
 
 Would there be prefixing in these tuples?  That's where the trouble
 starts.  If you add prefixing, troubles and limitations of lookup of
 today reappear.  If you disallow prefixing, tables get very large
 but lookup becomes simpler and practical.

OK.  AFAICT, there are three ideas in play here (ignoring netchannels).
First, there should be a unified lookup for efficiency (Grand Unified
Cache).  Secondly, that netfilter hook users need to publish information
about what they are actually looking at if they are to use this lookup.
Thirdly, that smart NICs can accelerate lookup.

(1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that
holds (some subset of?) flows.  A successful lookup immediately after
packet comes off NIC gives destiny for packet: what route, (optionally)
what socket, what filtering, what connection tracking (& what NAT), etc?
I don't know if this should be a general array of fn & data ptrs, or
specialized fields for each one, or a mix.  Maybe there's a "too hard,
do slow path" bit, or maybe hard cases just never get put in the cache.
Perhaps we need a separate one for locally-generated packets, a-la
ip_route_output().  Anyway, we trade slightly more expensive flow setup
for faster packet processing within flows.

(2) To make this work sanely in the presence of netfilter hooks, we need
them to register the tuples they are interested in.  Not at the hook
level, but *in addition*.  For example, we need to know what flows each
packet filtering rule cares about.  Connection tracking wants to see the
first packet (and first reply packet), but then probably only want to
see packets with RST/SYN/FIN set.  (Erk, window tracking wants to see
every packet, but maybe we could do something).  NAT definitely needs to
see every packet on a connection which is natted.

One way to do this is to add a have_interest callback into the
hook_ops, which takes each about-to-be-inserted GUFC entry and adds any
destinies this hook cares about.  In the case of packet filtering this
would do a traversal and append a fn/data ptr to the entry for each rule
which could affect it.

The other way is to have the hooks register what they are interested in
into a general data structure which GUFC entry creation then looks up
itself.  This general data structure will need to support wildcards
though.

We also need efficient ways of reflecting rule changes into the GUFC.
We can be pretty slack with conntrack timeouts, but we either need to
flush or handle callbacks from GUFC on timed-out entries.  Packet
filtering changes need to be synchronous, definitely.

(3) Smart NICs that do some flowid work themselves can accelerate lookup
implicitly (same flow goes to same CPU/thread) or explicitly (each
CPU/thread maintains only part of GUFC which it needs, or even NIC
returns flow cookie which is pointer to GUFC entry or subtree?).  AFAICT
this will magnify the payoff from the GUFC.

Sorry for the length,
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-26 Thread Rusty Russell
On Wed, 2006-07-19 at 03:01 +0400, Alexey Kuznetsov wrote:
 Hello!
 
 Can I ask a couple of questions? Just as a person who looked at VJ's
 slides once and was confused. And startled, when I found that it is not
 considered as another joke of genius. :-)

Hi Alexey!

 About locks:
 
is completely lockless (there is one irq lock when skb 
  is queued/dequeued into netchannels queue in hard/soft irq, 
 
 Equivalent of socket spinlock.

I don't think they are equivalent.  In channels, this can be split into
two locks, a queue lock and a dequeue lock, which operate independently.
The socket spinlock cannot.  Moreover, in the case where there is a
guarantee about IRQs being bound to a single CPU (as in Dave's ideas on
MSI), the queue lock is no longer required.  In the case where there is
a single reader of the socket (or, as VJ did, the other end is in
userspace), no dequeue lock is required.

 VJ slides describe a totally different scheme, where softirq part is omitted
 completely, protocol processing is moved to user space as whole.
 It is an amazing toy. But I see nothing, which could promote its status
 to practical. Exokernels used to do this thing for ages, and all the
 performance gains are compensated by overcomplicated classification
 engine, which has to remain in kernel and essentially to do the same
 work which routing/firewalling/socket hash tables do.

My feeling is that modern cards will do partial demux for us; whether we
use netchannels or not, we should use that to accelerate lookup.  Making
card aim MSI at same CPU for same flow is a start (and as Dave said,
much less code).  As the next step, having the card give us a cookie
too, would allow us to explicitly skip first level of lookup.  This
should allow us to identify which flows are simple enough to be directly
accelerated (whether by channels or something else): no bonding, raw
sockets, non-trivial netfilter rules, connection tracking changes, etc.
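
A sketch of how a per-packet cookie from the card might be consumed (the skb field and helpers are invented; real hardware would define how the cookie is validated):

/* If the card handed us a valid flow cookie, skip the first-level
 * lookup entirely; otherwise fall back to the normal hash/trie walk. */
static struct gufc_entry *gufc_lookup(struct sk_buff *skb)
{
    struct gufc_entry *e = skb->hw_flow_cookie;     /* hypothetical field */

    if (e && gufc_entry_matches(e, skb))            /* never trust HW blindly */
        return e;

    return gufc_hash_lookup(skb);
}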

Thoughts?
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law



Re: Netchannles: first stage has been completed. Further ideas.

2006-07-26 Thread David Miller
From: Rusty Russell [EMAIL PROTECTED]
Date: Thu, 27 Jul 2006 12:17:51 +1000

 On Wed, 2006-07-19 at 03:01 +0400, Alexey Kuznetsov wrote:
  About locks:
  
   is completely lockless (there is one irq lock when skb 
   is queued/dequeued into netchannels queue in hard/soft irq, 
  
  Equivalent of socket spinlock.
 
 I don't think they are equivalent.  In channels, this can be split into
 two locks, a queue lock and a dequeue lock, which operate independently.
 The socket spinlock cannot.  Moreover, in the case where there is a
 guarantee about IRQs being bound to a single CPU (as in Dave's ideas on
 MSI), the queue lock is no longer required.  In the case where there is
 a single reader of the socket (or, as VJ did, the other end is in
 userspace), no dequeue lock is required.

Cost is a very interesting question here.  I guess your main point
is that eventually this lock can be made to go away, whereas
Alexey speaks about the state of Evgeniy's specific implementation.

 My feeling is that modern cards will do partial demux for us; whether we
 use netchannels or not, we should use that to accelerate lookup.  Making
 card aim MSI at same CPU for same flow is a start (and as Dave said,
 much less code).  As the next step, having the card give us a cookie
 too, would allow us to explicitly skip first level of lookup.  This
 should allow us to identify which flows are simple enough to be directly
 accelerated (whether by channels or something else): no bonding, raw
 sockets, non-trivial netfilter rules, connection tracking changes, etc.

I read this as "we will be able to get around the problems" but with
no specific answer as to how.  I am an optimist too but I want
to start seeing concrete discussion about the way in which the
problems will be dealt with.

Alexey has some ideas, such as running the netfilter path from the
netchannel consumer socket context.  That is the kind of thing
we need to be talking about.

Robert Olsson is also doing some work involving full flow
classifications using special trie structures in the routing cache
that might be extendable to netchannels.  His trick is to watch for
the FIN shutdown sequence and GC route cache entries for a flow when
this is seen.  This is in order to keep the trie shallow and thus have
a better bound on memory accesses for routing lookups.
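
A sketch of that FIN-triggered GC (the entry fields and helper are invented; the tcphdr flag bits are real):

/* Drop the flow's cache entry once FINs have been seen in both
 * directions (or on RST), keeping the trie shallow. */
static void flow_note_tcp_flags(struct flow_entry *fe,
                                const struct tcphdr *th, int dir)
{
    if (th->fin)
        fe->fin_seen |= 1 << dir;       /* dir: 0 = original, 1 = reply */
    if (th->rst || fe->fin_seen == 3)
        flow_cache_delete(fe);          /* hypothetical helper */
}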

We are not a group of mathematicians discussing the tractability of
some problem.  Our interest is practice not theory. :)





Re: Netchannles: first stage has been completed. Further ideas.

2006-07-26 Thread Rusty Russell
On Wed, 2006-07-26 at 22:17 -0700, David Miller wrote:
 I read this as we will be able to get around the problems but
 no specific answer as to how.  I am an optimist too but I want
 to start seeing concrete discussion about the way in which the
 problems will be dealt with.
 
 Alexey has some ideas, such as running the netfilter path from the
 netchannel consumer socket context.  That is the kind of thing
 we need to be talking about.

Yes, my first thought back in January was how netfilter would interact
with this in a sane way.  One answer is don't: once someone registers
on any hook we go into slow path.  Another is to run the hooks in socket
context, which is better, but precludes having the consumer in
userspace, which still appeals to me 8)

So I don't like either.  The mistake (?) with netfilter was that we are
completely general: "you will see all packets, do what you want".  If,
instead, we had forced all rules to be of the form "show me all packets
matching this tuple", we would be in a position to combine it in a single
lookup with routing etc.

What would the tuple look like?  Off the top of my head:
SRCIP/DSTIP/PROTO/SPT/DPT/IN/OUT (where IN and OUT are boolean values
indicating whether the src/dest is local).

Of course, it means rewriting all the userspace tools, documentation,
and creating a complete new infrastructure for connection tracking and
NAT, but if that's what's required, then so be it.

Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law



Re: Netchannles: first stage has been completed. Further ideas.

2006-07-24 Thread Stephen Hemminger
On Wed, 19 Jul 2006 13:01:50 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 19 Jul 2006 15:52:04 -0400
 
  As a related note, I am looking into fixing inet hash tables to use RCU.
 
 IBM had posted a patch a long time ago, which would be not
 so hard to munge into the current tree.  See if you can
 spot it in the archives :)

Srivatsa Vaddagiri from IBM did a patch: http://lkml.org/lkml/2004/8/31/129

And Ben had a patch: http://lwn.net/Articles/174596/

Srivatsa's was more complete but pre-dates Acme's rearrangement.
Also, there is some code for refcnts in it that looks wrong,
or at minimum is masking underlying design flaws.

/* Ungrab socket and destroy it, if it was the last reference. */
 static inline void sock_put(struct sock *sk)
 {
-	if (atomic_dec_and_test(&sk->sk_refcnt))
-		sk_free(sk);
+sp_loop:
+	if (atomic_dec_and_test(&sk->sk_refcnt)) {
+		/* Restore ref count and schedule callback.
+		 * If we don't restore ref count, then the callback can be
+		 * scheduled by more than one CPU.
+		 */
+		atomic_inc(&sk->sk_refcnt);
+
+		if (atomic_read(&sk->sk_refcnt) == 1)
+			call_rcu(&sk->sk_rcu, sk_free_rcu);
+		else
+			goto sp_loop;
+	}
 }

Ben's still left reader writer locks, and needed IPV6 work. He said he
plans to get back to it.


-- 
Stephen Hemminger [EMAIL PROTECTED]
And in the Packet there writ down that doome


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-24 Thread Alexey Kuznetsov
Hello!

 Also, there is some code for refcnt's in it that looks wrong.

Yes, it is disgusting. RCU does not allow increasing the socket refcnt
in the lookup routine.

Ben's version looks cleaner here; it does not touch the refcnt
in RCU lookups. But it is dubious too:

 do_time_wait:
+   sock_hold(sk);

is obviously in violation of the rule. Probably, the RCU lookup should do something
like:

if (!atomic_inc_not_zero(&sk->sk_refcnt))
pretend_it_is_not_found; 

It is clear Ben did not look into the IBM patch, because one known trouble
spot is missed: when a socket moves from established to timewait, the
timewait bucket must be inserted before the established socket is removed.
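
Spelled out a little further, the lookup side of that rule would look roughly like the sketch below (the bucket and key types are simplified; atomic_inc_not_zero() and the RCU primitives are real):

static struct sock *ehash_lookup_rcu(struct ehash_bucket *head,
                                     const struct lookup_key *key)
{
    struct sock *sk;

    rcu_read_lock();
    for (sk = rcu_dereference(head->chain); sk; sk = rcu_dereference(sk->next)) {
        if (!key_matches(sk, key))
            continue;
        /* The socket may be dying; if its refcount already hit
         * zero we must behave as if it was never found. */
        if (!atomic_inc_not_zero(&sk->sk_refcnt))
            sk = NULL;
        break;
    }
    rcu_read_unlock();
    return sk;          /* caller drops the reference with sock_put() */
}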

Alexey


RE: Netchannles: first stage has been completed. Further ideas.

2006-07-22 Thread Caitlin Bestler
[EMAIL PROTECTED] wrote:
 Evgeniy Polyakov wrote:
 On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear
 ([EMAIL PROTECTED]) wrote:
 
 Out of curiosity, is it possible to have the single producer logic
 if you have two+ ethernet interfaces handling frames for a single
 TCP connection?  (I am assuming some sort of multi-path routing
 logic...)
 
 
 I do not think it is possible with additional logic like what is
 implemented in softirqs, i.e. per cpu queues of data, which in turn
 will be converted into skbs one-by-one.
 
 Couldn't you have two NICs being handled by two separate
 CPUs, with both CPUs trying to write to the same socket queue?
 
 The receive path works with RCU locking from what I
 understand, so a protocol's receive function must be re-entrant.

Wouldn't it be easier simply not to have two NICs feed the
same ring? What packets end up in which ring is fully
controllable. On the rare occasion that a single connection
must be fed by two NICs, a software merge of the two rings
would be far cheaper than having to co-ordinate between
producers all the time.



Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Evgeniy Polyakov
On Thu, Jul 20, 2006 at 09:55:04PM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
 From: Alexey Kuznetsov [EMAIL PROTECTED]
 Date: Fri, 21 Jul 2006 02:59:08 +0400
 
   Moving protocol (no matter if it is TCP or not) closer to user allows
   naturally control the dataflow - when user can read that data(and _this_
   is the main goal), user acks, when it can not - it does not generate
   ack. In theory
  
  To all that I remember, in theory absence of feedback leads
  to loss of control yet. The same is in practice, unfortunately.
  You must say that window is closed, otherwise sender is totally
  confused.
 
 Correct, and too large delay even results in retransmits.  You can say
 that RTT will be adjusted by delay of ACK, but if user context
 switches cleanly at the beginning, resulting in near immediate ACKs,
 and then blocks later you will get spurious retransmits.  Alexey's
 example of blocking on a disk write is a good example.  I really don't
 like when pure NULL data sinks are used for benchmarking these kinds
 of things because real applications 1) touch the data, 2) do something
 with that data, and 3) have some life outside of TCP!

And what will happen with sockets?
Data will arrive and ACKs will be generated, until the queue is filled and
duplicate ACKs start to be sent, thus reducing the window even more.

The results _are_ the same, both will have duplicate ACKs and so on, but
with netchannels there is no complex queue management, no two or more
rings where data is processed (bh, process context and so on), no locks
and ... ugh, I recall I wrote it already several times :)

My userspace applications do memset, and actually writing data into
/dev/null through the stdout pipe does not change the overall picture.

I have read a lot of your criticism about benchmarking, so I'm ready :)

 If you optimize an application that does nothing with the data it
 receives, you have likewise optimized nothing :-)

I've run that test - dump all data into a file through a pipe.

84-byte packet bulk receiving:

netchannels: 8 Mb/sec (down to 6 when VFS cache is filled)
socket: 7 Mb/sec (down to 6 when VFS cache is filled)

So you asked to create a narrow pipe, and the speed becomes equal to the speed
of that pipe. No more, no less.

 All this talk reminds me of one thing, how expensive tcp_ack() is.
 And this expense has nothing to do with TCP really.  The main cost is
 purging and freeing up the skbs which have been ACK'd in the
 retransmit queue.

Yes, allocation always takes first place in all profiles.
I'm working to eliminate that - it is a side effect of the zero-copy
networking design I'm working on right now.

 So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs
 which haven't been touched by the cpu in some time and are thus nearly
 guarenteed to be cold in the cache.
 
 This is the kind of work we could think about batching to user
 sleeping on some socket call.
 
 Also notice that retransmit queue is potentially a good use of an
 array similar VJ netchannel lockless queue data structure. :)

An array has a lot of disadvantages with its resizing; there will be a lot
of trouble with recv/send queue length changes.
But it allows removing several pointers from the skb, which is always a good
start.

 BTW, notice that TSO makes this work touch less skb state.  TSO also
 decreases cpu utilization.  I think these two things are no
 coincidence. :-)

TSO/GSO is definitely a good idea, but it is completely unrelated to
the other problems. If it is implemented with netchannels we will have
even better performance.

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Evgeniy Polyakov
On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:
 Out of curiosity, is it possible to have the single producer logic
 if you have two+ ethernet interfaces handling frames for a single
 TCP connection?  (I am assuming some sort of multi-path routing
 logic...)

I do not think it is possible with additional logic like what is
implemented in softirqs, i.e. per cpu queues of data, which in turn will
be converted into skbs one-by-one.

 Thanks,
 Ben

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Evgeniy Polyakov
On Fri, Jul 21, 2006 at 11:19:00AM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) 
 wrote:
  Out of curiosity, is it possible to have the single producer logic
  if you have two+ ethernet interfaces handling frames for a single
  TCP connection?  (I am assuming some sort of multi-path routing
  logic...)
 
 I do not think it is possible with additional logic like what is

I think it is possible ...


-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Evgeniy Polyakov
On Fri, Jul 21, 2006 at 09:40:32AM +1200, Ian McDonald ([EMAIL PROTECTED]) 
wrote:
 If we consider netchannels as Van Jacobson described them, then a
 mutex is not needed, since it is impossible to have several readers or
 writers. But in the socket case, even if there is only one userspace
 consumer, that lock must be held to protect against bh (or we must introduce
 several queues and complicate their management a lot (ucopy for
 example)).
 
 As I recall Van's talk you don't need a lock with a ring buffer if you
 have a start and end variable pointing to location within ring buffer.
 
 He didn't explain this in great depth as it is computer science 101
 but here is how I would explain it:
 
 Once socket is initialised consumer is the only one that sets start
 variable and network driver reads this only. It is the other way
 around for the end variable. As long as the writes are atomic then you
 are fine. You only need one ring buffer in this scenario and two
 atomic variables.
 
 Having atomic writes does have overhead but far less than locking semantic.

With netchannels and one data producer it does not even need to be atomic.
Problems start to appear when there are several producers or consumers -
then either atomic or locking logic must indeed be implemented.
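
A minimal sketch of that single-producer/single-consumer ring (sizes and names are illustrative):

/* The producer only writes 'end', the consumer only writes 'start',
 * so no lock is needed as long as each index is written atomically
 * and the data is published before the index moves. */
#define RING_SLOTS 256                          /* power of two */

struct spsc_ring {
    void            *slot[RING_SLOTS];
    unsigned int    start;                      /* written by consumer only */
    unsigned int    end;                        /* written by producer only */
};

static int ring_put(struct spsc_ring *r, void *p)      /* producer side */
{
    if (r->end - r->start == RING_SLOTS)
        return -1;                              /* full */
    r->slot[r->end & (RING_SLOTS - 1)] = p;
    smp_wmb();                                  /* publish data before index */
    r->end++;
    return 0;
}

static void *ring_get(struct spsc_ring *r)              /* consumer side */
{
    void *p;

    if (r->start == r->end)
        return NULL;                            /* empty */
    smp_rmb();                                  /* read index before data */
    p = r->slot[r->start & (RING_SLOTS - 1)];
    r->start++;
    return p;
}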

 -- 
 Ian McDonald
 Web: http://wand.net.nz/~iam4
 Blog: http://imcdnzl.blogspot.com
 WAND Network Research Group
 Department of Computer Science
 University of Waikato
 New Zealand

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread David Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 11:10:10 +0400

 On Thu, Jul 20, 2006 at 09:55:04PM -0700, David Miller ([EMAIL PROTECTED]) 
 wrote:
  Correct, and too large delay even results in retransmits.  You can say
  that RTT will be adjusted by delay of ACK, but if user context
  switches cleanly at the beginning, resulting in near immediate ACKs,
  and then blocks later you will get spurious retransmits.  Alexey's
  example of blocking on a disk write is a good example.  I really don't
  like when pure NULL data sinks are used for benchmarking these kinds
  of things because real applications 1) touch the data, 2) do something
  with that data, and 3) have some life outside of TCP!
 
 And what will happen with sockets?
 Data will arrive and ACKs will be generated, until the queue is filled and
 duplicate ACKs start to be sent, thus reducing the window even more.
 
 The results _are_ the same, both will have duplicate ACKs and so on, but
 with netchannels there is no complex queue management, no two or more
 rings where data is processed (bh, process context and so on), no locks
 and ... ugh, I recall I wrote it already several times :)

Packets will be retransmitted spuriously and unnecessarily, and we
cannot over-stress how bad this is.

Sure, your local 1gbit network can absorb this extra cost when
the application is blocked for a long time, but in the real internet
it is a real concern.

Please address the fact that your design makes for retransmits that
are totally unnecessary.  Your TCP stack is flawed if it allows this
to happen.  Proper closing of window and timely ACKs are not some
optional feature of TCP, they are in fact mandatory.

If you want to bypass these things, this is fine, but do not name it
TCP :-)))

As a related example, deeply stretched ACKs can help and are perfect
when there is no packet loss.  But in the event of packet loss a
stretch ACK will kill performance, because it makes packet loss
recovery take at least one extra round trip to occur.

Therefore I disabled stretch ACKs in the input path of TCP last year.

  If you optimize an application that does nothing with the data it
  receives, you have likewise optimized nothing :-)
 
 I've run that test - dump all data into file through pipe.
 
 84byte packet bulk receiving: 
 
 netchannels: 8 Mb/sec (down 6 when VFS cache is filled)
 socket: 7 Mb/sec (down to 6 when VFS cache is filled)
 
 So you asked to create narrow pipe, and speed becomes equal to the speed
 of that pipe. No more, no less.

If you cause unnecessary retransmits, you add unnecessary congestion
to the network for other flows.

  All this talk reminds me of one thing, how expensive tcp_ack() is.
  And this expense has nothing to do with TCP really.  The main cost is
  purging and freeing up the skbs which have been ACK'd in the
  retransmit queue.
 
 Yes, allocation always takes first places in all profiles.
 I'm working to eliminate that - it is a side effect of zero-copy
 networking design I'm working on right now.

When you say these things over and over again, people like Alexey
and myself perceive it as "La la la la, I'm not listening to you
guys."

Our point is not that your work cannot lead you to fixing these
problems.  Our point is that existing TCP stack can have these
problems fixed too!  With advantage that we don't need all the
negative aspects of moving TCP into userspace.

You can eliminate allocation overhead in our existing stack, with
the simple design I outlined.  In fact, I outlined two approaches,
there is such an abundance of ways to do it that you have a choice
of which one you like the best :)

 Array has a lot of disadvantages with it's resizing, there will be a lot
 of troubles with recv/send queue len changes.
 But it allows to remove several pointer from skb, which is always a good
 start.

Yes it is something to consider.  Large pipes with 4000+ packet
windows present considerable problems in this area.

 TSO/GSO is a good idea definitely, but it is completely unrelated to
 other problems. If it will be implemented with netchannels we will have
 even better perfomance.

I like TSO-like ideas because they point to solutions within the existing
stack.

Radical changes are great, when they buy us something that is
impossible with current design.  A lot of things being shown and
discussed here are indeed possible with current design.

You have a nice toy and you should be proud of it, but do not make
it into a panacea.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Evgeniy Polyakov
On Fri, Jul 21, 2006 at 12:47:13AM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
   Correct, and too large delay even results in retransmits.  You can say
   that RTT will be adjusted by delay of ACK, but if user context
   switches cleanly at the beginning, resulting in near immediate ACKs,
   and then blocks later you will get spurious retransmits.  Alexey's
   example of blocking on a disk write is a good example.  I really don't
   like when pure NULL data sinks are used for benchmarking these kinds
   of things because real applications 1) touch the data, 2) do something
   with that data, and 3) have some life outside of TCP!
  
  And what will happen with sockets?
  Data will arrive and ACKs will be generated, until the queue is filled and
  duplicate ACKs start to be sent, thus reducing the window even more.
  
  The results _are_ the same, both will have duplicate ACKs and so on, but
  with netchannels there is no complex queue management, no two or more
  rings where data is processed (bh, process context and so on), no locks
  and ... ugh, I recall I wrote it already several times :)
 
 Packets will be retransmitted spuriously and unnecessarily, and we
 cannot over-stress how bad this is.

"In theory, practice and theory are the same, but in practice they are
different" (c) Larry McVoy, as far as I recall :)
And even in theory Linux behaves the same.

I see only one point about process-context TCP processing, the following
issue:
we start a TCP connection, and ACKs are generated very fast, then
suddenly the receiving userspace is blocked.
In that case BH processing apologists state that the sending side starts to
retransmit.
Let's see how it works.
If the receiving side works for a long time at maximum speed, then the window is
opened enough that it can even exceed the socket buffer size (max 200k; I saw
socket windows of several megabytes in my tests), so the sending side will continue
to send until the window is filled.
The receiving side, no matter if it is a socket or a netchannel, will drop
packets (the socket due to queue overflow; netchannels will not drop, but
will not ACK (their maximum queue length is 1 MB)).

So both approaches behave _exactly_ the same.
Did I miss something?

Btw, here are tests which were run with netchannels:
 * surfing the web (index pages of different remote sites only)
 * 1gb transfers
 * 1gb - 100mb transfers

 Sure, your local 1gbit network can absorb this extra cost when
 the application is blocked for a long time, but in the real internet
 it is a real concern.

Writing into a pipe (or into a 100mb NIC) and into a file is a real-internet
example - data is blocked, ACKs and retransmits happen.

 Please address the fact that your design makes for retransmits that
 are totally unnecessary.  Your TCP stack is flawed if it allows this
 to happen.  Proper closing of window and timely ACKs are not some
 optional feature of TCP, they are in fact mandatory.
 
 If you want to bypass these things, this is fine, but do not name it
 TCP :-)))

Hey, you did not look into atcp.c in my patches :)

 As a related example, deeply stretched ACKs can help and are perfect
 when there is no packet loss.  But in the event of packet loss a
 stretch ACK will kill performance, because it makes packet loss
 recovery take at least one extra round trip to occur.
 
 Therefore I disabled stretch ACKs in the input path of TCP last year.

For slow start it is definitely a must.
If the stretching algorithm is based on timers and round-trip time, then I do not
have that in atcp, but proper delaying based on sequence numbers is used instead.

   If you optimize an application that does nothing with the data it
   receives, you have likewise optimized nothing :-)
  
  I've run that test - dump all data into file through pipe.
  
  84byte packet bulk receiving: 
  
  netchannels: 8 Mb/sec (down 6 when VFS cache is filled)
  socket: 7 Mb/sec (down to 6 when VFS cache is filled)
  
  So you asked to create narrow pipe, and speed becomes equal to the speed
  of that pipe. No more, no less.
 
 If you cause unnecessary retransmits, you add unnecessary congestion
 to the network for other flows.

Please refer to my description above.
Situation is perfectly the same as with socket code or with netchannels.

   All this talk reminds me of one thing, how expensive tcp_ack() is.
   And this expense has nothing to do with TCP really.  The main cost is
   purging and freeing up the skbs which have been ACK'd in the
   retransmit queue.
  
  Yes, allocation always takes first places in all profiles.
  I'm working to eliminate that - it is a side effect of zero-copy
  networking design I'm working on right now.
 
 When you say these things over and over again, people like Alexey
 and myself perceive it as "La la la la, I'm not listening to you
 guys."

Hmm, I've confirmed that allocation is a problem no matter which stack
is used. My fix for the problem has nothing netchannel-specific in it at all.

 Our point is not that your work cannot lead you to fixing these
 problems.  Our point is that existing TCP stack can 

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread David Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 13:06:11 +0400

 The receiving side, no matter if it is a socket or a netchannel, will drop
 packets (the socket due to queue overflow; netchannels will not drop, but
 will not ACK (their maximum queue length is 1 MB)).
 
 So both approaches behave _exactly_ the same.
 Did I miss something?

Socket will not drop the packets on receive because sender will not
violate the window which receiver advertises, therefore there is no
reason to drop the packets.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Evgeniy Polyakov
On Fri, Jul 21, 2006 at 02:19:55AM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
 From: Evgeniy Polyakov [EMAIL PROTECTED]
 Date: Fri, 21 Jul 2006 13:06:11 +0400
 
  Receiving side, nor matter if it is socket or netchannel, will drop
  packets (socket due to queue overfull, netchannels will not drop, but
  will not ack (it's maximum queue len is 1mb)).
  
  So both approaches behave _exactly_ the same.
  Did I miss something?
 
 Socket will not drop the packets on receive because sender will not
 violate the window which receiver advertises, therefore there is no
 reason to drop the packets.

How come?
sk_stream_rmem_schedule(), sk_rmem_alloc and friends...

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread David Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 13:39:09 +0400

 On Fri, Jul 21, 2006 at 02:19:55AM -0700, David Miller ([EMAIL PROTECTED]) 
 wrote:
  From: Evgeniy Polyakov [EMAIL PROTECTED]
  Date: Fri, 21 Jul 2006 13:06:11 +0400
  
   Receiving side, nor matter if it is socket or netchannel, will drop
   packets (socket due to queue overfull, netchannels will not drop, but
   will not ack (it's maximum queue len is 1mb)).
   
   So both approaches behave _exactly_ the same.
   Did I miss something?
  
  Socket will not drop the packets on receive because sender will not
  violate the window which receiver advertises, therefore there is no
  reason to drop the packets.
 
 How come?
 sk_stream_rmem_schedule(), sk_rmem_alloc and friends...

sk_stream_rmem_schedule() allocates bytes from the global memory pool
quota for TCP sockets.  It is not something that will trigger when, for
example, the application blocks on a disk write.

In fact it will rarely trigger once the size of the window is known, since
sk_forward_alloc will grow to fill that size, then stay statically
at a value able to service all allocation requests in the
future.

Only when there is severe global TCP memory pressure will it be
decreased.

And again this isn't something which happens when a user simply
blocks on some non-TCP operation.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Ben Greear

Evgeniy Polyakov wrote:

On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:


Out of curiosity, is it possible to have the single producer logic
if you have two+ ethernet interfaces handling frames for a single
TCP connection?  (I am assuming some sort of multi-path routing
logic...)



I do not think it is possible with additional logic like what is
implemented in softirqs, i.e. per cpu queues of data, which in turn will
be converted into skbs one-by-one.


Couldn't you have two NICs being handled by two separate CPUs, with both
CPUs trying to write to the same socket queue?

The receive path works with RCU locking from what I understand, so
a protocol's receive function must be re-entrant.

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Rick Jones

All this talk reminds me of one thing, how expensive tcp_ack() is.
And this expense has nothing to do with TCP really.  The main cost is
purging and freeing up the skbs which have been ACK'd in the
retransmit queue.

So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs
which haven't been touched by the cpu in some time and are thus nearly
 guaranteed to be cold in the cache.

This is the kind of work we could think about batching to user
sleeping on some socket call.


Ultimately isn't that just trying to squeeze the balloon?

rick jones

nice to see people seeing ACKs as expensive though :)


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Evgeniy Polyakov
On Fri, Jul 21, 2006 at 09:14:39AM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:
 Out of curiosity, is it possible to have the single producer logic
 if you have two+ ethernet interfaces handling frames for a single
 TCP connection?  (I am assuming some sort of multi-path routing
 logic...)
 
 I do not think it is possible with additional logic like what is
 implemented in softirqs, i.e. per cpu queues of data, which in turn will
 be converted into skbs one-by-one.
 
 Couldn't you have two NICs being handled by two separate CPUs, with both
 CPUs trying to write to the same socket queue?
 
 The receive path works with RCU locking from what I understand, so
 a protocol's receive function must be re-entrant.

There will be no socket queue at that stage - only per-cpu queues,
which will then be processed one-by-one by _exactly_ one user.
That user can take skbs in a round-robin manner, put them into the
socket queue and call the protocol's receive function.
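
A small userspace sketch of that arrangement (names and sizes are made
up; the patch itself queues real skbs via skb_queue_tail()/skb_dequeue()):
each CPU fills its own queue under the one short enqueue/dequeue lock,
and a single consumer drains every queue round-robin before feeding the
protocol receive function, so the socket queue has only one writer and
needs no further locking.

#include <pthread.h>
#include <stddef.h>

#define NR_CPUS	4
#define QLEN	256

struct pkt;					/* stands in for an skb  */

struct cpu_queue {
	pthread_mutex_t lock;			/* the only lock taken   */
	struct pkt *ring[QLEN];
	unsigned int head, tail;
};

static struct cpu_queue queues[NR_CPUS];

static void queues_init(void)
{
	for (int i = 0; i < NR_CPUS; i++)
		pthread_mutex_init(&queues[i].lock, NULL);
}

/* Producer side: runs on whichever CPU took the interrupt. */
static int cpu_enqueue(int cpu, struct pkt *p)
{
	struct cpu_queue *q = &queues[cpu];
	int ret = -1;

	pthread_mutex_lock(&q->lock);
	if (q->head - q->tail < QLEN) {
		q->ring[q->head++ % QLEN] = p;
		ret = 0;
	}
	pthread_mutex_unlock(&q->lock);
	return ret;		/* on failure the caller simply does not ack */
}

/* Consumer side: exactly one user drains every per-CPU queue in turn. */
static void drain_round_robin(void (*receive)(struct pkt *))
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct cpu_queue *q = &queues[cpu];

		for (;;) {
			struct pkt *p = NULL;

			pthread_mutex_lock(&q->lock);
			if (q->tail != q->head)
				p = q->ring[q->tail++ % QLEN];
			pthread_mutex_unlock(&q->lock);
			if (!p)
				break;
			receive(p);
		}
	}
}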

 -- 
 Ben Greear [EMAIL PROTECTED]
 Candela Technologies Inc  http://www.candelatech.com

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread David Miller
From: Rick Jones [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 09:26:42 -0700

  All this talk reminds me of one thing, how expensive tcp_ack() is.
  And this expense has nothing to do with TCP really.  The main cost is
  purging and freeing up the skbs which have been ACK'd in the
  retransmit queue.
  
  So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs
  which haven't been touched by the cpu in some time and are thus nearly
  guaranteed to be cold in the cache.
  
  This is the kind of work we could think about batching to user
  sleeping on some socket call.
 
 Ultimately isn't that just trying to squeeze the balloon?

In this case, the goal is not to eliminate the cost, but to
move it to user context so that it:

1) gets charged to the user instead of being lost in the ether of
   anonymous software interrupt execution, and more importantly...

2) it gets moved to the cpu where the user socket
   code is executing instead of the cpu where the ACK packet arrives
   which is basically arbitrary

#2 is in-line with the system level end-to-end principle goals of
netchannels.



Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Evgeniy Polyakov
 Hello!

Hello, Alexey.

[ Sorry for the long delay, there are some problems with mail servers, so I
can not access them remotely and had to create this mail by hand; hopefully
the thread will not be broken. ]

 There is no socket spinlock anymore.
 Above lock is skb_queue lock which is held inside
 skb_dequeue/skb_queue_tail calls.

 Lock is named differently, but it is still here.
 BTW for UDP even the name is the same.

There is no bh processing; that lock is needed for 4 operations when an skb
is enqueued/dequeued.

And if I changed skbs to different structures there would be no locks
at all - it is extremely lightweight, it can not be compared with the socket
lock at all.

No bh/irq processing at all, natural speed management - that is main idea
behind netchannels.

  Equivalent of socket user lock.
 
 No, it is an equivalent for hash lock in socket table.

OK. But you have to introduce socket mutex somewhere in any case.
Even in ATCP.

Actually not - VJ's idea is to have only one consumer and one provider,
so no locks are needed, but I agree, in the general case it is needed, but _only_
to protect against several netchannel userspace consumers.
There is no BH protocol processing at all, so there is no need to
protect against someone who will add data while you are processing your own
chunk.

 Just an example - tcp_established() can be called with bh disabled
 under the socket lock.

 When we have a process context in hands, it is not.

 Did you ask yourself, why do we not put all the packets to backlog/prequeue
 and just wait when user will read the data? It would be 100% equivalent
 to netchannels.

How many hacks just to be a bit closer to userspace processing,
implemented in netchannels!

 The answer is simple: because we cannot wait. If user delays for 200msec,
 wait for connection collapse due to retransmissions. If the segment is
 out of order, immediate attention is required. Any scheme, which tries
 to wait for user unconditionally, at least has to run a watchdog timer,
 which fires before sender senses the gap.

If userspace is scheduled away for too much time, it is bloody wrong to
ack the data, that is impossible to read due to the fact that system is
being busy. It is just postponing the work from one end to another - ack
now and stop when queue is full, or postpone the ack generation when
segment is really being read.

 And this is what we do for ages. Grep for VJ in sources. :-)
 netchannels have nothing to do with it, it is much elder idea.

And it was Van, who decided to move away from BH/irq processing.
It was a slow and somewhat painful way (how many hacks with prequeue, with
direct processing, it is enough just to look how TCP socket lock is locked
in different contexts :)

 In that case one copies the whole data into userspace, so access for
 20 bytes of headers completely does not matter.

 For short packets it matters.

 But I said not this. I said it looks _worse_. A bit, but worse.

At least for 80 bytes it does not matter at all.
And it is very likely that data is misaligned, so half of the
header will be in a cache line. And socket code has the same problem -
skb->cb can be flushed away, and tcp_recvmsg() needs to get it again.
And actually I never understood nanooptimisation behind more serious
problems (i.e. one cache line vs. 50MB/sec speed).

 Hmm, for 80 bytes sized packets win was about 2.5 times. Could you
 please show me lines inside existing code, which should be commented,
 so I got 50Mbyte/sec for that?

 If I knew it would be done. :-)

 Actually, it is the action, which I would expect. This, but
 not dropping all the TCP stack.

I tried to use the existing one, and I had a speed and CPU usage win, but its
magnitude was not what I expected, so I started a userspace network stack
implementation. It succeeded, and there are _very_ major
optimisations over existing code when processing is fully moved into
userspace, but there are also big problems, like one syscall per ack,
so I decided to use that stack as a base for in-kernel process-context
protocol processing, and I succeeded. Probably I will return to the
userspace network stack idea when I complete zero-copy networking support.

 I showed there, that using existing stack it is impossible

 Please, understand, it is such statements that compromise your work.
 If it is impossible then it is not interesting.

Do not mix soft and warm - I just post the facts, that netchannel TCP
implementation works (sometimes much) faster.
It is socket code that probably has some misoptimisations, and if it is
impossible to fix them (well, it least it is very hard), then it is not
interesting.

I definitely do not say, that it must be removed/replaced/anything - it
works perfectly ok, but it is possible to have better performance by
changing architecture, and it was done.

 Alexey

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Evgeniy Polyakov
Hello.

[ Sorry for the long delay, there are some problems with mail servers, so I
can not access them remotely and had to create this mail by hand; hopefully
the thread will not be broken. ]

  Your description makes it sound as if you would take a huge leap,
  changing all in-kernel code _and_ the userspace interface in a
  single
  patch.  Am I wrong?  Or am I right and would it make sense to
  extract
  small incremental steps from your patch similar to those Van did in
  his non-published work?
 
 My first implementation used existing kernel code and showed small
 performance win - there was binding of the socket to netchannel and
 all
 protocol processing was moved into process context.

 Iirc, Van didn't show performance numbers but rather cpu utilization
 numbers.  And those went down significantly without changing the
 userspace interface.

At least the LCA presentation graphs show rather different numbers -
performance without CPU utilization (but not as in his tables).

 Did you look at cpu utilization as well?  If you did and your numbers
 are worse than Van's, he either did something smarter than you or
 forged his numbers (quite unlikely).

Interesting sentence from a political correctness point of view :)

I did both CPU and speed measurements when used socket code [1], 
and both of them showed small gain, but I only tested 1gbit setup, so
they can not be compared with Van's.
But even with 1gb I was not satisfied with them, so I started different
implementation, which I described in my e-mail to Alexey.

1. speed/cpu measurements of one of the netchannels implementation which
used socket code.
http://thread.gmane.org/gmane.linux.network/36609/focus=36614

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Alexey Kuznetsov
Hello!

Small question first:

 userspace, but also there are big problems, like one syscall per ack,

I do not see redundant syscalls. Is it not expected to send ACKs only
after receiving data, as you said? What is the problem?


Now boring things:

 There is no BH protocol processing at all, so there is no need to
 protect against someone who will add data while you are processing your own
 chunk.

Essential part of socket user lock is the same mutex.

Backlog is actually not a protection, but a thing equivalent to netchannel.
The difference is only that it tries to process something immediately,
when it is safe. You can omit this and push everything to backlog(=netchannel),
which is processed only by syscalls, if you do not care about latency.


 How many hacks just to be a bit closer to userspace processing,
 implemented in netchannels!

Moving processing closer to userspace is not a goal, it is a tool.
Which is sometimes useful, but generally quite useless.

F.e. in your tests it should not affect performance at all,
end user is just a sink.

As for prequeueing, it is a bright example. Guess why it is useful?
What does it save? Nothing, like netchannel. The answer is: it is just a tool
to generate coarse ACKs in a controlled manner without essential violation
of the protocol. (Well, and to combine checksumming and copy if you do not like how
your card does this)


 If userspace is scheduled away for too much time, it is bloody wrong to
 ack the data, that is impossible to read due to the fact that system is
 being busy. It is just postponing the work from one end to another - ack
 now and stop when queue is full, or postpone the ack generation when
  segment is really being read.

... when you get all the segments nicely aligned, blah-blah-blah.

If you do not care about losses-congestion-delays-delacks-whatever,
you have a totally different protocol. Sending window feedback
is only a minor part of tcp. But even these boring tcp intrinsics
are not so important, look at ideal lossless network:

Think what happens f.e. during a plain file transfer to your notebook.
You get 110MB/sec for a few seconds, then writeback is fired and
the disk io subsystem discovers that the disk sustains only 50MB/sec.
If you are unlucky and another application starts, the disk is so congested
that it will take lots of seconds to make progress with io.
During this time the other side will retransmit, because the poor thing thought
the rtt is 100 usecs, and you will never return to 50MB/sec.

You have to _CLOSE_ window in the case of long delay, rather than to forget
to ack. See the difference?

It is just because actual end user is still far far away.
And this happens all the time, when you relay the results to another
application via pipe, when... Well, the only case where real end user
is user of netchannel is when you receive to a sink.


 But I said not this. I said it looks _worse_. A bit, but worse.
 
 At least for 80 bytes it does not matter at all.

Hello-o, do you hear me? :-)

I am asking: it looks not much better, but a bit worse,
so what is the real reason for better performance, unless it is
due to castration of the protocol?

Simplify protocol, move all the processing (even memory copies) to softirq,
leave to user space only feeding pages to copy and you will have unbeatable
performance. Been there, done that, not with TCP of course, but if you do not
care about losses and ACK clocking and send an ACK once per window,
I do not see how it can spoil the situation.


 And actually I never understood nanooptimisation behind more serious
 problems (i.e. one cache line vs. 50MB/sec speed).

You deal with 80 byte packets, to all that I understand.
If you lose one cacheline per packet, it is a big problem.

All that we can change is protocol overhead. Handling data part
is invariant anyway. You are scared of complexity of tcp, but
you obviously forget one thing: cpu is fast.
The code can look very complicated: some crazy hash functions,
damn hairy protocol processing, but if you take care about caches etc.,
all this is dominated by the first look into packet in eth_type_trans()
or ip_rcv().

BTW, when you deal with normal data flow, cache can be not dirtied
by data at all, it can be bypassed.


 works perfectly ok, but it is possible to have better performance by
 changing architecture, and it was done.

It is exactly the point of trouble. From all that I see and you said,
the better performance is obtained not due to the change of architecture,
but despite it.

A proof that we can perform better by changing protocol is not required,
it is kinda obvious. The question is how to make existing protocol
to perform better.

I have no idea, why your tcp performs better. It can be everything:
absence of slow start, more coarse ACKs, whatever. I believe you were careful
to check those reasons and to do a fair comparison, but then the only guess
remains that you saved lots of i-cache getting rid of long code path.

And none of those guesses can be attributed to 

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Evgeniy Polyakov
On Thu, Jul 20, 2006 at 08:41:00PM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) 
wrote:
 Hello!

Hello, Alexey.

 Small question first:
 
  userspace, but also there are big problems, like one syscall per ack,
 
 I do not see redundant syscalls. Is not it expected to send ACKs only
 after receiving data as you said? What is the problem?

I mean that each ack is a pure syscall without any data, so overhead is
quite huge compared to the situation when acks are created in
kernelspace.
At least slow start will eat a lot of CPU with them.

 Now boring things:
 
  There is no BH protocol processing at all, so there is no need to
  protect against someone who will add data while you are processing your own
  chunk.
 
 Essential part of socket user lock is the same mutex.
 
 Backlog is actually not a protection, but a thing equivalent to netchannel.
 The difference is only that it tries to process something immediately,
 when it is safe. You can omit this and push everything to 
 backlog(=netchannel),
 which is processed only by syscalls, if you do not care about latency.

If we consider netchannels as how Van Jacobson described them, then
a mutex is not needed, since it is impossible to have several readers or
writers. But in the socket case, even if there is only one userspace
consumer, that lock must be held to protect against bh (or introduce
several queues and complicate their management a lot (ucopy for
example)).
 
  How many hacks just to be a bit closer to userspace processing,
  implemented in netchannels!
 
 Moving processing closer to userspace is not a goal, it is a tool.
 Which sometimes useful, but generally quite useless.
 
 F.e. in your tests it should not affect performance at all,
 end user is just a sink.
 
 As for prequeueing, it is a bright example. Guess why it is useful?
 What does it save? Nothing, like netchannel. The answer is: it is just a tool
 to generate coarse ACKs in a controlled manner without essential violation
 of the protocol. (Well, and to combine checksumming and copy if you do not
 like how your card does this)

I can not agree here.
The main goal of the protocol is data delivery to the user, not its
blind acceptance, and data transmission from the user, not from some
other ring.
As you see, sending is already implemented in the process' context,
but receiving is not directly connected to the user.
The more elements between the user and its data we have, the higher the
probability of problems there. And we already have two queues just
to eliminate one of them.
Moving the protocol (no matter if it is TCP or not) closer to the user allows
one to naturally control the dataflow - when the user can read that data (and _this_
is the main goal), the user acks; when it can not, it does not generate an
ack. In theory that can lead to the complete absence of congestion,
especially if the receiving window can be controlled in both directions.
At least with the current state of routers it does not lead to broken
connections.

  If userspace is scheduled away for too much time, it is bloody wrong to
  ack the data, that is impossible to read due to the fact that system is
  being busy. It is just postponing the work from one end to another - ack
  now and stop when queue is full, or postpone the ack generation when
   segment is really being read.
 
 ... when you get all the segments nicely aligned, blah-blah-blah.
 
 If you do not care about losses-congestion-delays-delacks-whatever,
 you have a totally different protocol. Sending window feedback
 is only a minor part of tcp. But even these boring tcp intrinsics
 are not so important, look at ideal lossless network:
 
 Think what happens f.e. while plain file transfer to your notebook.
 You get 110MB/sec for a few seconds, then writeback is fired and
 disk io subsystems discovers that the disk holds only 50MB/sec.
 If you are unlucky and some another application starts, disk is so congested
 that it will take lots of seconds to make a progress with io.
 For this time another side will retransmit, because poor thing thought
 rtt is 100 usecs and you will never return to 50MB/sec.
 
 You have to _CLOSE_ window in the case of long delay, rather than to forget
 to ack. See the difference?
 
 It is just because actual end user is still far far away.
 And this happens all the time, when you relay the results to another
 application via pipe, when... Well, the only case where real end user
 is user of netchannel is when you receive to a sink.

There is one problem in your logic.
RTT will not be so small, since acks are not sent when user does not
read data.

  But I said not this. I said it looks _worse_. A bit, but worse.
  
  At least for 80 bytes it does not matter at all.
 
 Hello-o, do you hear me? :-)
 
 I am asking: it looks not much better, but a bit worse,
 then what is real reason for better performance, unless it is
 due to castration of protocol?

Well, if speed were measured in lines of code, then atcp would have far fewer than
the existing tcp, but the performance win is only 2.5 times.

 

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Ben Greear

Evgeniy Polyakov wrote:


Backlog is actually not a protection, but a thing equivalent to netchannel.
The difference is only that it tries to process something immediately,
when it is safe. You can omit this and push everything to backlog(=netchannel),
which is processed only by syscalls, if you do not care about latency.



If we consider netchannels as how Van Jacobson described them, then
a mutex is not needed, since it is impossible to have several readers or
writers. But in the socket case, even if there is only one userspace
consumer, that lock must be held to protect against bh (or introduce
several queues and complicate their management a lot (ucopy for
example)).


Out of curiosity, is it possible to have the single producer logic
if you have two+ ethernet interfaces handling frames for a single
TCP connection?  (I am assuming some sort of multi-path routing
logic...)

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Ian McDonald


If we consider netchannels as how Van Jacobson described them, then
a mutex is not needed, since it is impossible to have several readers or
writers. But in the socket case, even if there is only one userspace
consumer, that lock must be held to protect against bh (or introduce
several queues and complicate their management a lot (ucopy for
example)).


As I recall Van's talk you don't need a lock with a ring buffer if you
have a start and end variable pointing to location within ring buffer.

He didn't explain this in great depth as it is computer science 101
but here is how I would explain it:

Once the socket is initialised, the consumer is the only one that sets the start
variable and the network driver only reads it. It is the other way
around for the end variable. As long as the writes are atomic then you
are fine. You only need one ring buffer in this scenario and two
atomic variables.

Having atomic writes does have overhead, but far less than locking semantics.
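
A minimal, self-contained sketch of that scheme (illustrative names
only, with C11 atomics standing in for whatever a real driver and
consumer would use): the producer only ever writes the end index, the
consumer only ever writes the start index, so neither side takes a lock.

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024				/* power of two */

struct ring {
	void *slot[RING_SIZE];
	_Atomic unsigned int start;		/* written only by the consumer */
	_Atomic unsigned int end;		/* written only by the producer */
};

/* Producer (driver side): publish one packet, or report "full". */
static bool ring_put(struct ring *r, void *pkt)
{
	unsigned int end = atomic_load_explicit(&r->end, memory_order_relaxed);
	unsigned int start = atomic_load_explicit(&r->start, memory_order_acquire);

	if (end - start == RING_SIZE)
		return false;			/* full */
	r->slot[end % RING_SIZE] = pkt;
	/* release: the slot write becomes visible before the new 'end' */
	atomic_store_explicit(&r->end, end + 1, memory_order_release);
	return true;
}

/* Consumer (user side): take one packet, or report "empty". */
static void *ring_get(struct ring *r)
{
	unsigned int start = atomic_load_explicit(&r->start, memory_order_relaxed);
	unsigned int end = atomic_load_explicit(&r->end, memory_order_acquire);

	if (start == end)
		return NULL;			/* empty */
	void *pkt = r->slot[start % RING_SIZE];
	atomic_store_explicit(&r->start, start + 1, memory_order_release);
	return pkt;
}

The release/acquire pairing is what makes the "atomic writes" safe: the
slot contents are guaranteed to be visible before the updated index is.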
--
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Alexey Kuznetsov
Hello!

 Moving protocol (no matter if it is TCP or not) closer to user allows
 naturally control the dataflow - when user can read that data(and _this_
 is the main goal), user acks, when it can not - it does not generate
 ack. In theory

To all that I remember, in theory absence of feedback leads
to loss of control yet. The same is in practice, unfortunately.
You must say that window is closed, otherwise sender is totally
confused.


 There is one problem in your logic.
 RTT will not be so small, since acks are not sent when user does not
 read data.

It is arithmetic: rtt = window/rate.

And rto stays rounded up to 200 msec, unless you messed the connection
so hard that it is not alive. Check.
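
To put rough numbers on that (purely illustrative, reusing figures
already mentioned in this thread): with a 1 MB receive queue sitting
unacked and a 50 MB/sec wire rate, the sender measures
rtt = 1 MB / 50 MB/sec = 20 msec rather than 100 usec, while rto still
never drops below the 200 msec floor - only a stall longer than that
makes the sender retransmit.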


  Simplify protocol, move all the processing (even memory copies) to softirq,
  leave to user space only feeding pages to copy and you will have unbeatable
  performance. Been there, done that, not with TCP of course, but if you do 
  not
  care about losses and ACK clocking and send an ACK once per window,
  I do not see how it can spoil the situation.
 
 Do you live in a perfect world, where user does not want what was
 requested?

All the time I am trying to bring to your attention that you read to a sink. :-)
At least, read to disk to move it a little closer to reality.
Or at least do it from terminal and press ^Z sometimes.


  You deal with 80 byte packets, to all that I understand.
  If you lose one cacheline per packet, it is a big problem.
 
 So actual netchannels speed is even better? :)

atcp. If you get rid of netchannels, leave only atcp, the speed will
be at least not worse. No doubts.


 tell me, why we should keep (enabled) that redundant functionality?
 Because it can work better in some other places, and that is correct,
 but why it should be enabled then in majority of the cases?

Did not I tell you something like that? :-) Optimize real thing,
even trying to detect the situations when retransmissions are redundant
and eliminate the code.


 Let's draw the line.
...
 That was my opinion on the topic. It looks like neither you, nor me will
 not change our point of view about that right now :)

I agree. :)

Alexey


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread David Miller
From: Alexey Kuznetsov [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 02:59:08 +0400

  Moving protocol (no matter if it is TCP or not) closer to user allows
  naturally control the dataflow - when user can read that data(and _this_
  is the main goal), user acks, when it can not - it does not generate
  ack. In theory
 
 To all that I remember, in theory absence of feedback leads
 to loss of control yet. The same is in practice, unfortunately.
 You must say that window is closed, otherwise sender is totally
 confused.

Correct, and too large delay even results in retransmits.  You can say
that RTT will be adjusted by delay of ACK, but if user context
switches cleanly at the beginning, resulting in near immediate ACKs,
and then blocks later you will get spurious retransmits.  Alexey's
example of blocking on a disk write is a good example.  I really don't
like when pure NULL data sinks are used for benchmarking these kinds
of things because real applications 1) touch the data, 2) do something
with that data, and 3) have some life outside of TCP!

If you optimize an application that does nothing with the data it
receives, you have likewise optimized nothing :-)

All this talk reminds me of one thing, how expensive tcp_ack() is.
And this expense has nothing to do with TCP really.  The main cost is
purging and freeing up the skbs which have been ACK'd in the
retransmit queue.

So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs
which haven't been touched by the cpu in some time and are thus nearly
guaranteed to be cold in the cache.

This is the kind of work we could think about batching to user
sleeping on some socket call.
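
A hedged sketch of what that batching might look like (the deferred-free
queue and the hook points are invented for illustration; only the skb
queue helpers are existing kernel API): the ACK path merely unlinks the
ACK'd skbs, and the cache-cold freeing happens later on the cpu running
the user's socket call.

/* skb_queue_head_init(&deferred_free) is assumed to run at socket
 * setup; in a real patch this queue would live in struct sock. */
#include <linux/skbuff.h>

static struct sk_buff_head deferred_free;

/* ACK processing path: unlink the ACK'd skb from the retransmit queue,
 * but defer the expensive, cache-cold free. */
static void ack_defer_free(struct sk_buff *skb)
{
	skb_queue_tail(&deferred_free, skb);
}

/* User context, e.g. just before sleeping in a socket call: free the
 * whole batch on the cpu running the user's socket code, so the cost
 * is charged to the user rather than to softirq. */
static void flush_deferred_free(void)
{
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&deferred_free)) != NULL)
		kfree_skb(skb);
}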

Also notice that the retransmit queue is potentially a good use for an
array similar to the VJ netchannel lockless queue data structure. :)

BTW, notice that TSO makes this work touch less skb state.  TSO also
decreases cpu utilization.  I think these two things are no
coincidence. :-)

I have even toyed with the idea of eventually abstracting the
retransmit queue into a pure data representation.  The skb_shinfo()
page vector is very nearly this already.  Or, a less extreme idea
where we have fully retained huge TSO skbs, but we do not chop them up
to create smaller TSO frames.  Instead, we add offset GSO attribute
which is used in the clones.

Calls to tso_fragment() would be replaced with pure clones and
adjustment of skb->len and the new skb->gso_offset in the clone.
Rest of the logic would remain identical except that non-linear data
would start skb->gso_offset bytes into the skb_shinfo() described
area.

In this way we could also set tp->xmit_size_goal to its maximum
possible value, always.  Actually, I was looking at this the other day
and this clamping of xmit_size_goal to 1/2 max_window is extremely
dubious.  In fact it's downright wrong, only MSS needs this limiting
for sender side SWS avoidance.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-19 Thread Jörn Engel
On Tue, 18 July 2006 23:08:01 +0400, Evgeniy Polyakov wrote:
 On Tue, Jul 18, 2006 at 02:15:17PM +0200, Jörn Engel ([EMAIL PROTECTED]) 
 wrote:
  
  Your description makes it sound as if you would take a huge leap,
  changing all in-kernel code _and_ the userspace interface in a single
  patch.  Am I wrong?  Or am I right and would it make sense to extract
  small incremental steps from your patch similar to those Van did in
  his non-published work?
 
 My first implementation used existing kernel code and showed small
 performance win - there was binding of the socket to netchannel and all
 protocol processing was moved into process context.

Iirc, Van didn't show performance numbers but rather cpu utilization
numbers.  And those went down significantly without changing the
userspace interface.

Did you look at cpu utilization as well?  If you did and your numbers
are worse than Van's, he either did something smarter than you or
forged his numbers (quite unlikely).

Jörn

-- 
Sometimes, asking the right question is already the answer.
-- Unknown


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-19 Thread Alexey Kuznetsov
Hello!

 There is no socket spinlock anymore.
 Above lock is skb_queue lock which is held inside
 skb_dequeue/skb_queue_tail calls.

Lock is named differently, but it is still here.
BTW for UDP even the name is the same.

 
  Equivalent of socket user lock.
 
 No, it is an equivalent for hash lock in socket table.

OK. But you have to introduce socket mutex somewhere in any case.
Even in ATCP.


 Just an example - tcp_established() can be called with bh disabled under
 the socket lock.

When we have a process context in hands, it is not.

Did you ask yourself, why do we not put all the packets to backlog/prequeue
and just wait when user will read the data? It would be 100% equivalent
to netchannels.

The answer is simple: because we cannot wait. If user delays for 200msec,
wait for connection collapse due to retransmissions. If the segment is
out of order, immediate attention is required. Any scheme, which tries
to wait for user unconditionally, at least has to run a watchdog timer,
which fires before sender senses the gap.

And this is what we do for ages. Grep for VJ in sources. :-)

netchannels have nothing to do with it, it is much elder idea.



 In that case one copies the whole data into userspace, so access for 20
 bytes of headers completely does not matter.

For short packets it matters.

But I said not this. I said it looks _worse_. A bit, but worse.


 Hmm, for 80 bytes sized packets win was about 2.5 times. Could you
 please show me lines inside existing code, which should be commented, so
 I got 50Mbyte/sec for that?

If I knew it would be done. :-)

Actually, it is the action, which I would expect. This, but
not dropping all the TCP stack.


 I showed there, that using existing stack it is impossible

Please, understand, it is such statements that compromise your work.
If it is impossible then it is not interesting.

Alexey


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-19 Thread Stephen Hemminger
As a related note, I am looking into fixing inet hash tables to use RCU.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-19 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Wed, 19 Jul 2006 15:52:04 -0400

 As a related note, I am looking into fixing inet hash tables to use RCU.

IBM had posted a patch a long time ago, which would be not
so hard to munge into the current tree.  See if you can
spot it in the archives :)


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-19 Thread Stephen Hemminger
On Wed, 19 Jul 2006 13:01:50 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 19 Jul 2006 15:52:04 -0400
 
  As a related note, I am looking into fixing inet hash tables to use RCU.
 
 IBM had posted a patch a long time ago, which would be not
 so hard to munge into the current tree.  See if you can
 spot it in the archives :)

Ben posted a patch in March, and IBM did one a while ago.
I am looking at both.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread David Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Tue, 18 Jul 2006 12:16:26 +0400

 I would ask to push netchannel support into -mm tree, but I expect
 in advance that having two separate TCP stacks (one of which can
 contain some bugs (I mean atcp.c)) is not that good idea, so I
 understand possible negative feedback on that issue, but it is much
 better than silence.

Evgeniy, you are present in my queue of work to review.

Perhaps I am mistaken with my priorities, but I tend to hit all the
easy patches and bug fixes first, before significant new work.

And even in the realm of new work, your things require the most
serious thinking and consideration.  I apologize for the time it takes
me, therefore, to get to reviewing deep work such as yours.

I will make a real effort to properly review your excellent work this
week, and I encourage any other netdev hackers with some spare
cycles to do the same. :)

Thanks!


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Evgeniy Polyakov
On Tue, Jul 18, 2006 at 01:34:37AM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
 Perhaps I am mistaken with my priorities, but I tend to hit all the
 easy patches and bug fixes first, before significant new work.
 
 And even in the realm of new work, your things require the most
 serious thinking and consideration.  I apologize for the time it takes
 me, therefore, to get to reviewing deep work such as your's.
 
 I will make a real effort to properly review your excellent work this
 week, and I encourage any other netdev hackers with some spare
 cycles to do the same. :)

That would be great!

Please don't think that I wash people's minds with weekly get it, get it
zombifying stuff; I completely understand that there are things with much
higher priority than netchannels, so it can wait (for a while :).

Thank you.

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Christian Borntraeger
Hello Evgeniy,

 +asmlinkage long sys_netchannel_control(void __user *arg)
[...]
 + if (copy_from_user(ctl, arg, sizeof(struct unetchannel_control)))
 + return -ERESTARTSYS;
^^^
[...]
 + if (copy_to_user(arg, ctl, sizeof(struct unetchannel_control)))
 + return -ERESTARTSYS;
^^^

I think this should be -EFAULT instead of -ERESTARTSYS, right?
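
A sketch of the corrected error handling (the local variable and the
surrounding body are guesses for illustration; the point is only that a
failed user copy returns -EFAULT):

/* Illustrative only: 'ctl' and the processing step are assumptions,
 * not taken from the actual patch. */
asmlinkage long sys_netchannel_control(void __user *arg)
{
	struct unetchannel_control ctl;

	if (copy_from_user(&ctl, arg, sizeof(ctl)))
		return -EFAULT;		/* bad user pointer, not a restart */

	/* ... process the request, filling 'ctl' with any results ... */

	if (copy_to_user(arg, &ctl, sizeof(ctl)))
		return -EFAULT;

	return 0;
}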

-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux  Virtualization





Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Evgeniy Polyakov
On Tue, Jul 18, 2006 at 01:16:18PM +0200, Christian Borntraeger ([EMAIL 
PROTECTED]) wrote:
 Hello Evgeniy,
 
  +asmlinkage long sys_netchannel_control(void __user *arg)
 [...]
  +   if (copy_from_user(ctl, arg, sizeof(struct unetchannel_control)))
  +   return -ERESTARTSYS;
 ^^^
 [...]
  +   if (copy_to_user(arg, ctl, sizeof(struct unetchannel_control)))
  +   return -ERESTARTSYS;
 ^^^
 
 I think this should be -EFAULT instead of -ERESTARTSYS, right?

I have no strong feeling on what must be returned in that case.
As far as I see, copy*user can fail due to absence of the next
destination page, so -ERESTARTSYS makes sense, but if failure happens due to
process size limitation, -EFAULT is correct.

Let's change it to -EFAULT.

 -- 
 Mit freundlichen Grüßen / Best Regards
 
 Christian Borntraeger
 Linux Software Engineer zSeries Linux  Virtualization

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Jörn Engel
On Tue, 18 July 2006 12:16:26 +0400, Evgeniy Polyakov wrote:
 
 Current tests with the latest netchannel patch show that netchannels 
 outperforms sockets in any type of bulk transfer (big-sized, small-sized, 
 sending, receiving) over 1gb wire. I omit graphs and numbers here, 
 since I posted it already several times. I also plan to proceed
 some negotiations which would allow to test netchannel support in 10gbit
 environment, but it can also happen after second development stage
 completed.

[ I don't have enough time for a deeper look.  So if my questions are
stupid, please just tell me so and don't take it personal. ]

After having seen Van Jacobson's presentation at LCA twice, it
appeared to me that Van could get astonishing speedups with small
incremental steps, only changing kernel code and leaving the
kernel-userspace interface as is.

Changing the userspace interface (or rather adding a new one) was just
the last step, which also gave some performance benefits but is also a
change to the userspace interface and therefore easy to get wrong and
hard to fix later.

Your description makes it sound as if you would take a huge leap,
changing all in-kernel code _and_ the userspace interface in a single
patch.  Am I wrong?  Or am I right and would it make sense to extract
small incremental steps from your patch similar to those Van did in
his non-published work?

Jörn

-- 
When people work hard for you for a pat on the back, you've got
to give them that pat.
-- Robert Heinlein


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Christian Borntraeger
On Tuesday 18 July 2006 13:51, Evgeniy Polyakov wrote:
  I think this should be -EFAULT instead of -ERESTARTSYS, right?

 I have no strong feeling on what must be returned in that case.
 As far as I see, copy*user can fail due to absence of the next
 destination page, so -ERESTARTSYS makes sence, but if failure happens due
 to process size limitation, -EFAULT is correct.

If I am not completely mistaken ERESTARTSYS is wrong. 
include/linux/errno.h says userspace should never see ERESTARTSYS, therefore 
we should only return it if we were interrupted by a signal as do_signal 
takes care of ERESTARTSYS. Furthermore, copy*user transparently faults in 
necessary pages as long as the address is valid in the user context. 

 Let's change it to -EFAULT.

Thanks :-)


-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux  Virtualization





Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Evgeniy Polyakov
On Tue, Jul 18, 2006 at 02:15:17PM +0200, Jörn Engel ([EMAIL PROTECTED]) wrote:
 
 Your description makes it sound as if you would take a huge leap,
 changing all in-kernel code _and_ the userspace interface in a single
 patch.  Am I wrong?  Or am I right and would it make sense to extract
 small incremental steps from your patch similar to those Van did in
 his non-published work?

My first implementation used existing kernel code and showed a small
performance win - there was binding of the socket to a netchannel and all
protocol processing was moved into process context. It is actually the
same as what the IBM folks do, but my investigation showed that the linux
sending side has some issues which would not allow the speed to grow very
noticeably (after creating yet another congestion control algo I now think
that the problem is there, but I'm not 100% sure).
And after looking into Van's presentation (and his words about
_userspace_ protocol processing) I think they used their own stack too.
So I reinvented the wheel and created my own too.

 Jörn
 
 -- 
 When people work hard for you for a pat on the back, you've got
 to give them that pat.
 -- Robert Heinlein

-- 
Evgeniy Polyakov


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread David Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Tue, 18 Jul 2006 23:11:37 +0400

 Actually userspace will not see ERESTARTSYS, when it is returned from
 syscall.

This is true only when a signal is pending.

It is the signal dispatch code that fixes up the return value
either by changing it to -EINTR or by resetting the register
state such that the signal handler returns to re-execute the
system call with the original set of argument register values.

If a signal is not pending, you risk leaking ERESTARTSYS to
userspace.


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Alexey Kuznetsov
Hello!

Can I ask a couple of questions? Just as a person who looked at VJ's
slides once and was confused. And startled, when I found that it is not
considered as another joke of genius. :-)


About locks:

 is completely lockless (there is one irq lock when skb 
 is queued/dequeued into netchannels queue in hard/soft irq, 

Equivalent of socket spinlock.


 one mutex for netchannel's bucket 

Equivalent of socket user lock.


 and some locks on qdisk/NIC driver layer,

The same as in traditional code, right?


From all that I see, this completely lockless code has not less locks
than traditional approach, even when doing no protocol processing.
Where am I wrong? Frankly speaking, when talking about locks,
I do not see anything, which could be saved, only TCP hash table
lookup can be RCUized, but this optimization obviously has nothing to do
with netchannels.

The only improvement in this area suggested in VJ's slides 
is a lock-free producer-consumer ring. It is missing in your patch
and I could guess it is not big loss, it is unlikely
to improve something significantly until the lock is heavily contended,
which never happens without massive network-level parallelism
for a single bucket.


The next question is about locality:

To find netchannel bucket in netif_receive_skb() you have to access
all the headers of packet. Right? Then you wait for processing in user
context, and this information is washed out of cache or even scheduled
on another CPU.

In traditional approach you also fetch all the headers on softirq,
but you do all the required work with them immediately and do not access them
when the rest of processing is done in process context. I do not see
how netchannels (without hardware classification) can improve something
here. At the first sight it makes locality worse.

Honestly, I do not see how this approach could improve performance
even a little. And it looks like your benchmarks confirm that all
the win is not due to architectural changes, but just because
some required bits of code are castrated.


VJ slides describe a totally different scheme, where softirq part is omitted
completely, protocol processing is moved to user space as whole.
It is an amazing toy. But I see nothing, which could promote its status
to practical. Exokernels used to do this thing for ages, and all the
performance gains are compensated by overcomplicated classification
engine, which has to remain in kernel and essentially to do the same
work which routing/firewalling/socket hash tables do.


 advance that having two separate TCP stacks (one of which can contain 
 some bugs (I mean atcp.c)) is not that good idea, so I understand 
 possible negative feedback on that issue, but it is much better than
 silence.

You are absolutely right here. Moreover, I can guess that absence
of feedback is a direct consequence of this thing. I would advise to
get rid of it and never mention it again. :-) If you took VJ suggestion
seriously and moved TCP engine to user space, it could remain unnoticed.
But if TCP stays in kernel (and it obviously has to), you want to work
with normal stack, you can improve, optimize and rewrite it infinitely,
but do not start with a toy. It proves nothing and compromises
the whole approach.

Alexey


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread David Miller
From: Alexey Kuznetsov [EMAIL PROTECTED]
Date: Wed, 19 Jul 2006 03:01:21 +0400

 The only improvement in this area suggested in VJ's slides is a
 lock-free producer-consumer ring. It is missing in your patch and I
 could guess it is not big loss, it is unlikely to improve something
 significantly until the lock is heavily contended, which never
 happens without massive network-level parallelism for a single
 bucket.

And the gains from this ring can be obtained by stateless hardware
classification pointing to unique MSI-X PCI interrupt vectors that get
targetted to specific unique cpus.  It is true zero cost in that case.

I guess my excitement about VJ channels, from a practical viewpoint,
begin to wane even further.  How depressing :)

Devices can move flow work to individual cpus via intelligent
interrupt targeting, and OS should just get out of the way and
continue doing what it does today.  This idea is actually very old,
and PCI MSI-X interrupts just make it practical for commodity devices.

At least, there is less code to write. :-)))


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-18 Thread Evgeniy Polyakov
On Wed, Jul 19, 2006 at 03:01:21AM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) 
wrote:
 Hello!

Hello, Alexey.

 Can I ask couple of questions? Just as a person who looked at VJ's
 slides once and was confused. And startled, when found that it is not
 considered as another joke of genuis. :-)
 
 
 About locks:
 
is completely lockless (there is one irq lock when skb 
  is queued/dequeued into netchannels queue in hard/soft irq, 
 
 Equivalent of socket spinlock.

There is no socket spinlock anymore.
Above lock is skb_queue lock which is held inside
skb_dequeue/skb_queue_tail calls.
 
  one mutex for netchannel's bucket 
 
 Equivalent of socket user lock.

No, it is an equivalent for hash lock in socket table.

  and some locks on qdisk/NIC driver layer,
 
 The same as in traditional code, right?

I use dst_output(), so it is possible to have as many locks inside
low-level NIC driver as you want.

 From all that I see, this completely lockless code has not less locks
 than traditional approach, even when doing no protocol processing.
 Where am I wrong? Frankly speaking, when talking about locks,
 I do not see anything, which could be saved, only TCP hash table
 lookup can be RCUized, but this optimization obviously has nothing to do
 with netchannels.

It looks like you should look at it again :)
Just an example - tcp_established() can be called with bh disabled under
the socket lock. In netchannels there is no need for that.

 The only improvement in this area suggested in VJ's slides 
 is a lock-free producer-consumer ring. It is missing in your patch
 and I could guess it is not big loss, it is unlikely
 to improve something significantly until the lock is heavily contended,
 which never happens without massive network-level parallelism
 for a single bucket.

That's because I decided to use skbs rather than special structures, and
thus I use the same queue as the socket code (and have only one lock,
inside skb_queue_tail()/skb_dequeue()). I will describe below why I did
not change it to more hardware-friendly stuff.

 The next question is about locality:
 
 To find netchannel bucket in netif_receive_skb() you have to access
 all the headers of packet. Right? Then you wait for processing in user
 context, and this information is washed out of cache or even scheduled
 on another CPU.
 
 In traditional approach you also fetch all the headers on softirq,
 but you do all the required work with them immediately and do not access them
 when the rest of processing is done in process context. I do not see
 how netchannels (without hardware classification) can improve something
 here. At the first sight it makes locality worse.

In that case one copies the whole data into userspace, so access for 20
bytes of headers completely does not matter.

 Honestly, I do not see how this approach could improve performance
 even a little. And it looks like your benchmarks confirm that all
 the win is not due to architectural changes, but just because
 some required bits of code are castrated.

Hmm, for 80 bytes sized packets win was about 2.5 times. Could you
please show me lines inside existing code, which should be commented, so
I got 50Mbyte/sec for that?

 VJ slides describe a totally different scheme, where softirq part is omitted
 completely, protocol processing is moved to user space as whole.
 It is an amazing toy. But I see nothing, which could promote its status
 to practical. Exokernels used to do this thing for ages, and all the
 performance gains are compensated by overcomplicated classification
 engine, which has to remain in kernel and essentially to do the same
 work which routing/firewalling/socket hash tables do.

There are several ideas presented in his slides.
In my personal opinion most of the performance win is obtained from
userspace processing and memcpy instead of copy_to_user() (but my
previous work showed that it is not the case for a lot of situations),
so I created the first approach, tested the second and am now moving to a
fully zero-copy design. How skbs or other structures are delivered into the
queue/array does not matter in my design - I can replace it in a moment,
but I do not want to mess with drivers, since it is a huge break, which
must be done after the high-level stuff is proven to work well.

  advance that having two separate TCP stacks (one of which can contain 
  some bugs (I mean atcp.c)) is not that good idea, so I understand 
  possible negative feedback on that issue, but it is much better than
  silence.
 
 You are absolutely right here. Moreover, I can guess that absense
 of feedback is a direct consequence of this thing. I would advise to
 get rid of it and never mention it again. :-) If you took VJ suggestion
 seriously and moved TCP engine to user space, it could remain unnoticed.
 But if TCP stays in kernel (and it obviously has to), you want to work
 with normal stack, you can improve, optimize and rewrite it infinitely,
 but do not start with a toy. It proves nothing and compromises
 the