On Fri, 12 Oct 2007 09:08:58 -0700
Brandeburg, Jesse [EMAIL PROTECTED] wrote:
Andi Kleen wrote:
When the hw TX queue gains space, the driver self-batches packets
from the sw queue to the hw queue.
I don't really see the advantage over the qdisc in that scheme.
It's certainly not
Use RCU? Or write a generic version and get it reviewed. You really
want someone with knowledge of all the possible barrier impacts to
review it.
I guess he was thinking of using cmpxchg; but we don't support this
in portable code.
RCU is not really suitable for this because it assumes
Andi Kleen wrote:
When the hw TX queue gains space, the driver self-batches packets
from the sw queue to the hw queue.
I don't really see the advantage over the qdisc in that scheme.
It's certainly not simpler and probably more code and would likely
also not require fewer locks (e.g. a
related to this comment, does Linux have a lockless (using atomics)
singly linked list element? That would be very useful in a driver hot
path.
No; it doesn't. At least not a portable one.
Besides, they tend not to be faster anyway because e.g. cmpxchg tends
to be as slow as an explicit
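For concreteness, a minimal sketch of the kind of cmpxchg-based lockless push being asked about. The kernel's cmpxchg is not available on every architecture Linux supports, which is exactly the portability objection above, and a matching lock-free pop additionally runs into the ABA problem:

	struct llnode {
		struct llnode *next;
	};

	/* Push via compare-and-exchange: retry until no other CPU has
	 * changed *head between our read and our swap. */
	static void llist_push(struct llnode **head, struct llnode *node)
	{
		struct llnode *old;

		do {
			old = *head;
			node->next = old;
		} while (cmpxchg(head, old, node) != old);
	}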
Hi Dave,
David Miller wrote on 10/10/2007 02:13:31 AM:
Hopefully that new qdisc will just use the TX rings of the hardware
directly. They are typically large enough these days. That might avoid
some locking in this critical path.
Indeed, I also realized last night that for the default
A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
With TSO really?
increase the size much more performance starts to go down due to L2
cache thrashing.
Another possibility would be to consider using cache avoidance
instructions while updating the TX ring (e.g. write
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 11:16:44 +0200
A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
With TSO really?
Yes.
increase the size much more performance starts to go down due to L2
cache thrashing.
Another possibility would be to
On Wed, Oct 10, 2007 at 11:16:44AM +0200, Andi Kleen wrote:
A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
With TSO really?
Hardware queues are generally per-page rather than per-skb so
it'd fill up quicker than a software queue even with TSO.
Cheers,
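To put rough numbers on Herbert's point (assuming 4 KB pages and one descriptor per fragment, which is typical but hardware-specific): a single 64 KB TSO skb spans 16 page fragments plus its linear header, i.e. about 17 descriptors, so a 256-entry TX ring holds only roughly 15 such skbs in flight, whereas a 1000-packet software queue holds 1000 skbs regardless of their size.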
On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
The chip I was working with at the time (UltraSPARC-IIi) compressed
all the linear stores into 64-byte full cacheline transactions via
the store buffer.
That's a pretty old CPU. Conclusions on more modern ones might be different.
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 12:23:31 +0200
On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
The chip I was working with at the time (UltraSPARC-IIi) compressed
all the linear stores into 64-byte full cacheline transactions via
the store buffer.
We've done similar testing with ixgbe to push maximum descriptor counts,
and we lost performance very quickly in the same range you're quoting on
NIU.
Did you try it with WC writes to the ring or CLFLUSH?
-Andi
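For illustration only, here is what the two options Andi names look like with userspace SSE/SSE2 intrinsics; a real driver would instead map the ring with write-combining attributes or use arch-specific kernel helpers, and the 16-byte descriptor layout below is hypothetical:

	#include <emmintrin.h>	/* SSE2 intrinsics (also pulls in _mm_sfence) */

	/* Option 1: non-temporal (WC-style) store; the descriptor write
	 * bypasses the cache, so a large TX ring cannot thrash L2. */
	static void write_desc_nt(__m128i *slot, __m128i desc)
	{
		_mm_stream_si128(slot, desc);
		_mm_sfence();	/* order the WC store before any doorbell write */
	}

	/* Option 2: ordinary store, then explicitly evict the cache line. */
	static void write_desc_clflush(__m128i *slot, __m128i desc)
	{
		*slot = desc;
		_mm_clflush(slot);
	}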
On Tue, 09 Oct 2007, David Miller wrote:
From: jamal [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 17:56:46 -0400
if the h/ware queues are full because of link pressure etc, you drop. We
drop today when the s/ware queues are full. The driver xmit lock takes
the place of the qdisc queue lock
Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: jamal [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 09:08:48 -0400
On Wed, 2007-10-10 at 03:44 -0700, David Miller wrote:
I've always gotten very poor results when increasing the TX queue a
lot, for example with NIU the point of diminishing returns seems to
be in the range of 256-512 TX
From: Bill Fink [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 12:02:15 -0400
On Tue, 09 Oct 2007, David Miller wrote:
We have to keep in mind, however, that the sw queue right now is 1000
packets. I heavily discourage any driver author from trying to use any
single TX queue of that size. Which
Hi Peter,
Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote on
10/09/2007 04:03:42 AM:
true, that needs some resolution. Here's a hand-waving thought:
Assuming all packets of a specific map end up in the same
qdisc queue, it seems feasible to ask the qdisc scheduler to
give us enough
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:28:27 +0530
Isn't it enough that the multiqueue+batching drivers handle skbs
belonging to different queues themselves, instead of qdisc having
to figure that out? This will reduce costs for most skbs that are
neither batched nor
Hi Dave,
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
Isn't it enough that the multiqueue+batching drivers handle skbs
belonging to different queues themselves, instead of qdisc having
to figure that out? This will reduce costs for most skbs that are
neither batched
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.
Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.
thanks,
- KK
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.
Great, this will make life easy. Any idea how long that would take?
It
David Miller wrote:
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.
Great, this will make life easy. Any idea how long that
On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote:
David Miller wrote:
I can just threaten to do them all and that should get the driver
maintainers going :-)
What, like this? :)
Awesome :)
Herbert Xu wrote:
On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote:
David Miller wrote:
I can just threaten to do them all and that should get the driver
maintainers going :-)
What, like this? :)
Awesome :)
Note my patch is just to get the maintainers going. :) I'm not going
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote:
Driver might ask for 10 and we send 10, but an LLTX driver might fail to get
the lock and return TX_LOCKED. I haven't seen your code in greater detail, but
don't you requeue in that case too?
For other drivers that are non-batching and LLTX,
On 09 Oct 2007 18:51:51 +0200
Andi Kleen [EMAIL PROTECTED] wrote:
David Miller [EMAIL PROTECTED] writes:
2) Switch the default qdisc away from pfifo_fast to a new DRR fifo
with load balancing using the code in #1. I think this is kind
of in the territory of what Peter said he is
I wonder about the whole idea of queueing in general at such high speeds.
Given the normal bi-modal distribution of packets, and the predominance
of the 1500-byte MTU, does it make sense to even have any queueing in software
at all?
Yes, that is my point -- it should just pass it through directly
IMO the net driver really should provide a hint as to what it wants.
8139cp and tg3 would probably prefer multiple TX queue
behavior to match silicon behavior -- strict prio.
If I understand what you just said, I disagree. If your hardware is
running strict prio, you don't want to enforce
Waskiewicz Jr, Peter P wrote:
IMO the net driver really should provide a hint as to what it wants.
8139cp and tg3 would probably prefer multiple TX queue
behavior to match silicon behavior -- strict prio.
If I understand what you just said, I disagree. If your hardware is
running strict
A misunderstanding, I think.
To my brain, DaveM's item #2 seemed to assume/require the NIC
hardware to balance fairly across hw TX rings, which seemed
to preclude the
8139cp/tg3 style of strict-prio hardware. That's what I was
responding to.
As long as there is some modular way to
From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 08:44:25 -0400
David Miller wrote:
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
Ignore LLTX, it sucks, it was a big mistake, and
David Miller wrote:
From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 08:44:25 -0400
David Miller wrote:
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
Ignore LLTX, it sucks, it was a big
I'd say we can probably try to get rid of it in 2.6.25, this is
assuming we get driver authors to cooperate and do the conversions
or alternatively some other motivated person.
I can just threaten to do them all and that should get the driver
maintainers going :-)
I can definitely
From: Andi Kleen [EMAIL PROTECTED]
Date: 09 Oct 2007 18:51:51 +0200
Hopefully that new qdisc will just use the TX rings of the hardware
directly. They are typically large enough these days. That might avoid
some locking in this critical path.
Indeed, I also realized last night that for the
From: Roland Dreier [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 13:22:44 -0700
I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to
my TODO list so I don't forget.
In fact if 2.6.23 drags on long enough I may do it for 2.6.24
Before you add new entries to your list, how is
On Tue, 09 Oct 2007 13:43:31 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:
From: Andi Kleen [EMAIL PROTECTED]
Date: 09 Oct 2007 18:51:51 +0200
Hopefully that new qdisc will just use the TX rings of the hardware
directly. They are typically large enough these days. That might avoid
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 13:53:40 -0700
I was thinking why not have a default transmit queue len of 0 like
the virtual devices.
I'm not so sure.
Even if the device has huge queues I still think we need a software
queue for when the hardware one backs up.
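For reference, what Stephen's txqueuelen-0 suggestion amounts to in driver code; virtual devices such as loopback already do exactly this in their setup routines (the driver name below is hypothetical):

	#include <linux/etherdevice.h>

	/* Hypothetical setup: no software queue at all, relying entirely
	 * on the hardware TX ring -- the behavior being debated above. */
	static void mydev_setup(struct net_device *dev)
	{
		ether_setup(dev);
		dev->tx_queue_len = 0;
	}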
From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 16:20:14 -0400
David Miller wrote:
If you unconditionally take those locks in the transmit function,
there is probably an ABBA deadlock elsewhere in the driver now, most
likely in the TX reclaim processing, and you therefore need
Before you add new entries to your list, how is that ibm driver NAPI
conversion coming along? :-)
I still haven't done much. OK, I will try to get my board booting
again this week.
On Tue, 2007-09-10 at 14:22 -0700, David Miller wrote:
Even if the device has huge queues I still think we need a software
queue for when the hardware one backs up.
It should be fine to just pretend the qdisc exists despite it sitting
in the driver and not have s/ware queues at all to avoid
Before you add new entries to your list, how is that ibm driver NAPI
conversion coming along? :-)
OK, thanks for the kick in the pants, I have a couple of patches for
net-2.6.24 coming (including an unrelated trivial warning fix for
IPoIB).
- R.
From: jamal [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 17:56:46 -0400
if the h/ware queues are full because of link pressure etc, you drop. We
drop today when the s/ware queues are full. The driver xmit lock takes
the place of the qdisc queue lock etc. I am assuming there is still need for
that
On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote:
We have to keep in mind, however, that the sw queue right now is 1000
packets. I heavily discourage any driver author from trying to use any
single TX queue of that size.
Why would you discourage them?
If 1000 is ok for a software
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 02:37:16 +0200
On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote:
We have to keep in mind, however, that the sw queue right now is 1000
packets. I heavily discourage any driver author from trying to use any
single TX queue
On Mon, 2007-08-10 at 10:33 +0530, Krishna Kumar2 wrote:
As a side note: Any batching driver should _never_ have to requeue; if
it does, it is buggy. And the non-batching ones, if they ever requeue, will
requeue only a single packet, so not much reordering.
On the contrary, batching LLTX drivers (if
This patch adds the usage of batching within the core.
cheers,
jamal
[NET_BATCH] net core use batching
This patch adds the usage of batching within the core.
Performance results demonstrating improvement are provided separately.
I have #if-0ed some of the old functions so the patch is more
[PATCH 2/3][NET_BATCH] net core use batching
This patch adds the usage of batching within the core.
cheers,
jamal
Hey Jamal,
I still have concerns how this will work with Tx multiqueue.
The way the batching code looks right now, you will probably send a
batch of skbs from multiple bands
On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote:
I still have concerns how this will work with Tx multiqueue.
The way the batching code looks right now, you will probably send a
batch of skbs from multiple bands from PRIO or RR to the driver. For
non-Tx multiqueue
From: jamal [EMAIL PROTECTED]
Date: Mon, 08 Oct 2007 16:48:50 -0400
On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote:
I still have concerns how this will work with Tx multiqueue.
The way the batching code looks right now, you will probably send a
batch of skbs from
true, that needs some resolution. Here's a hand-waving thought:
Assuming all packets of a specific map end up in the same
qdisc queue, it seems feasible to ask the qdisc scheduler to
give us enough packages (I've seen people use that term to
refer to packets) for each hardware ring's
On Mon, 2007-08-10 at 14:26 -0700, David Miller wrote:
Add xmit_win to struct net_device_subqueue, problem solved.
If net_device_subqueue is visible from both driver and core scheduler
area (couldn't tell from looking at what's in there already), then that'll
do it.
cheers,
jamal
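A hypothetical sketch of what that resolution looks like, extending the per-ring struct of that era with the window field Jamal's batching patches call xmit_win:

	/* Sketch only; the real 2.6.23 struct carried just the state word. */
	struct net_device_subqueue {
		unsigned long state;	/* per-ring queue state, visible to the core */
		int xmit_win;		/* descriptors the driver can accept right now */
	};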
If net_device_subqueue is visible from both driver and core
scheduler area (couldn't tell from looking at what's in there
already), then that'll do it.
Yes, I use the net_device_subqueue structs (the state variable in there)
in the prio and rr qdiscs right now. It's an indexed list at the
On Mon, 2007-08-10 at 15:33 -0700, Waskiewicz Jr, Peter P wrote:
Addressing your note/issue with different rings being serviced
concurrently: I'd like to remove the QDISC_RUNNING bit from the global
The challenge to deal with is that netdevices, filters, the queues and
scheduler are closely
jamal wrote:
Ok, so the concurrency aspect is what worries me. What I am saying is
that sooner or later you have to serialize (which is anti-concurrency)
For example, consider CPU0 running a high prio queue and CPU1 running
the low prio queue of the same netdevice.
Assume CPU0 is getting a lot of
jamal wrote:
The challenge to deal with is that netdevices, filters, the queues and
scheduler are closely intertwined. So it is not just the scheduling
region and QDISC_RUNNING. For example, let's pick just the filters
because they are simple to see: You need to attach them to something -
From: Jeff Garzik [EMAIL PROTECTED]
Date: Mon, 08 Oct 2007 21:13:59 -0400
If you assume a scheduler implementation where each prio band is mapped
to a separate CPU, you can certainly see where some CPUs could be
substantially idle while others are overloaded, largely depending on the
data
On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
I also want to point out another issue. Any argument wrt. reordering
is specious at best because right now reordering from qdisc to device
happens anyways.
This is not true.
If your device has a qdisc at all, then you will end up
On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
I also want to point out another issue. Any argument wrt. reordering
is specious at best because right now reordering from qdisc to device
happens anyways.
This is
On Tue, Oct 09, 2007 at 10:03:18AM +0800, Herbert Xu wrote:
On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
I also want to point out another issue. Any argument wrt. reordering
is specious at best because right
David Miller wrote:
1) A library for transmit load balancing functions, with an interface
that can be made visible to userspace. I can write this and test
it on real multiqueue hardware.
The whole purpose of this library is to set skb->queue_mapping
based upon the load balancing
On Mon, 2007-08-10 at 18:41 -0700, David Miller wrote:
I also want to point out another issue. Any argument wrt. reordering
is specious at best because right now reordering from qdisc to device
happens anyways.
And that's because we drop the qdisc lock first, then we grab the
transmit lock
On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote:
Please revert
commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
Author: Jamal Hadi Salim [EMAIL PROTECTED]
Date: Tue Sep 25 19:27:13 2007 -0700
[NET_SCHED]: explict hold dev tx lock
As this change introduces potential reordering
On Mon, Oct 08, 2007 at 10:14:30PM -0400, jamal wrote:
You forgot QDISC_RUNNING, Dave ;- the above can't happen.
Essentially at any one point in time, we are guaranteed that we can have
multiple cpus enqueueing but only one can be dequeueing (the one that
managed to grab QDISC_RUNNING), i.e. multiple
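The guarantee jamal refers to comes from the core's dequeue gate, roughly like this in the 2.6.23-era code (a simplified sketch, not a verbatim quote):

	/* Everyone may enqueue, but only the CPU that wins this
	 * test_and_set_bit may run the dequeue loop; all others back off. */
	static inline void qdisc_run(struct net_device *dev)
	{
		if (!netif_queue_stopped(dev) &&
		    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
			__qdisc_run(dev);	/* clears the bit when it finishes */
	}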
On Mon, Oct 08, 2007 at 10:15:49PM -0400, jamal wrote:
On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote:
Please revert
commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
Author: Jamal Hadi Salim [EMAIL PROTECTED]
Date: Tue Sep 25 19:27:13 2007 -0700
[NET_SCHED]: explict
On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote:
No it doesn't. I'd forgotten about the QDISC_RUNNING bit :)
You should know better, you wrote it and I've been going insane trying to
break it for at least a year now ;-
cheers,
jamal
On Mon, Oct 08, 2007 at 10:19:02PM -0400, jamal wrote:
On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote:
No it doesn't. I'd forgotten about the QDISC_RUNNING bit :)
You should know better, you wrote it and I've been going insane trying to
break it for at least a year now ;-
Well you've
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 10:03:18 +0800
On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
I also want to point out another issue. Any argument wrt. reordering
is specious at best
From: Jeff Garzik [EMAIL PROTECTED]
Date: Mon, 08 Oct 2007 22:12:03 -0400
I'm interested in working on a load balancer function that approximates
skb->queue_mapping = smp_processor_id()
I'd be happy to code and test in that direction, based on your lib.
It's the second algorithm that
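A minimal sketch of the balancer Jeff describes, assuming the multiqueue fields of that era (egress_subqueue_count) and a made-up helper name:

	/* Keep each CPU on its own TX ring so transmit state does not
	 * bounce between CPUs; called from the transmit path, where
	 * preemption is already disabled. */
	static void set_tx_queue(struct net_device *dev, struct sk_buff *skb)
	{
		skb->queue_mapping = smp_processor_id() % dev->egress_subqueue_count;
	}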
On Mon, Oct 08, 2007 at 07:43:43PM -0700, David Miller wrote:
Right, that's Jamal's recent patch. It looked funny to me too.
Hang on Dave. It was too early in the morning for me :)
I'd forgotten about the QDISC_RUNNING bit which did what the
queue lock did without actually holding the queue
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 10:16:20 +0800
On Mon, Oct 08, 2007 at 10:14:30PM -0400, jamal wrote:
You forgot QDISC_RUNNING, Dave ;- the above can't happen.
Essentially at any one point in time, we are guaranteed that we can have
multiple cpus enqueueing but
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 10:04:42 +0800
On Tue, Oct 09, 2007 at 10:03:18AM +0800, Herbert Xu wrote:
On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
I also want to point out
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 06:47:24 PM:
two, there should _never_ be any requeueing, even with LLTX, in the previous
patches when I supported them; if there is, it is a bug. This is because
we don't send more than what the driver asked for via xmit_win. So if it
asked for
This patch adds the usage of batching within the core.
cheers,
jamal
[NET_BATCH] net core use batching
This patch adds the usage of batching within the core.
Performance results demonstrating improvement are provided separately.
I have #if-0ed some of the old functions so the patch is more
jamal wrote:
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
->requeue queues at the head, so this looks like it would reverse
the order of the skbs.
Excellent catch! thanks; i will fix.
As a side note: Any batching driver should _never_ have to
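Since ->requeue() inserts at the head of the qdisc, one possible fix (a sketch, not the actual follow-up patch) is to drain the batch from its tail, so the skbs land back in their original order:

	static inline int
	dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
			 struct Qdisc *q)
	{
		struct sk_buff *skb;

		/* ->requeue() puts each skb at the head, so pulling the batch
		 * off from its tail preserves the original ordering. */
		while ((skb = __skb_dequeue_tail(skbs)) != NULL)
			q->ops->requeue(skb, q);

		return 0;
	}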
On Wed, 2007-03-10 at 01:29 -0400, Bill Fink wrote:
It does sound sensible. My own decidedly non-expert speculation
was that the big 30% performance hit right at 4 KB may be related
to memory allocation issues or having to split the skb across
multiple 4 KB pages.
Plausible. But I also
On Tue, 02 Oct 2007, jamal wrote:
On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:
One reason I ask, is that on an earlier set of alternative batching
xmit patches by Krishna Kumar, his performance testing showed a 30%
performance hit for TCP for a single process and a size of 4 KB,
jamal wrote:
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+		 struct Qdisc *q)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
->requeue queues at the head, so
On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote:
jamal wrote:
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
->requeue queues at the head, so this looks like it would reverse
the order of the skbs.
Excellent catch! thanks; i will fix.
As a
On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:
Have you done performance comparisons for the case of using 9000-byte
jumbo frames?
I haven't, but will try if any of the gige cards I have support it.
As a side note: I have not seen any useful gains or losses as the packet
size approaches
On Mon, 01 Oct 2007, jamal wrote:
On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:
Have you done performance comparisons for the case of using 9000-byte
jumbo frames?
I haven't, but will try if any of the gige cards I have support it.
As a side note: I have not seen any useful
This patch adds the usage of batching within the core.
cheers,
jamal
[NET_BATCH] net core use batching
This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is used, with
the following results on different kernels:
On Sun, 30 Sep 2007, jamal wrote:
This patch adds the usage of batching within the core.
cheers,
jamal
[NET_BATCH] net core use batching
This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is