Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Stephen Hemminger
On Fri, 12 Oct 2007 09:08:58 -0700
Brandeburg, Jesse [EMAIL PROTECTED] wrote:

 Andi Kleen wrote:
  When the hw TX queue gains space, the driver self-batches packets
  from the sw queue to the hw queue.
  
  I don't really see the advantage over the qdisc in that scheme.
  It's certainly not simpler and probably more code and would likely
  also not require less locks (e.g. a currently lockless driver
  would need a new lock for its sw queue). Also it is unclear to me
  it would be really any faster.
 
 related to this comment, does Linux have a lockless (using atomics)
 singly linked list element?  That would be very useful in a driver hot
 path.

Use RCU? or write a generic version and get it reviewed.  You really
want someone with knowledge of all the possible barrier impacts to
review it.

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Andi Kleen
 Use RCU? or write a generic version and get it reviewed.  You really
 want someone with knowledge of all the possible barrier impacts to
 review it.

I guess he was thinking of using cmpxchg; but we don't support this
in portable code.

RCU is not really suitable for this because it assumes writing is
relatively rare, which is definitely not the case for a qdisc.
Also general list management with RCU is quite expensive anyway --
it would require a full copy (that is the 'C' in RCU, which Linux
generally doesn't use at all).
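As a rough illustration of the cmpxchg approach being discussed, a
minimal user-space sketch of a lock-free LIFO push plus a
pop-everything, using GCC's __sync builtins; the names are made up,
and nothing like this existed as a portable kernel API at the time:

#include <stddef.h>

struct llnode {
        struct llnode *next;
};

struct llhead {
        struct llnode *first;
};

/* Push one node; retry if another CPU moved the head underneath us. */
static void ll_push(struct llhead *h, struct llnode *n)
{
        struct llnode *old;

        do {
                old = h->first;
                n->next = old;
        } while (__sync_val_compare_and_swap(&h->first, old, n) != old);
}

/* Detach the whole list in one shot; popping everything at once also
 * sidesteps the ABA problem a single-node pop would have. */
static struct llnode *ll_pop_all(struct llhead *h)
{
        struct llnode *old;

        do {
                old = h->first;
        } while (old &&
                 __sync_val_compare_and_swap(&h->first, old, NULL) != old);

        return old;
}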

-Andi


RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Brandeburg, Jesse
Andi Kleen wrote:
 When the hw TX queue gains space, the driver self-batches packets
 from the sw queue to the hw queue.
 
 I don't really see the advantage over the qdisc in that scheme.
 It's certainly not simpler and probably more code and would likely
 also not require less locks (e.g. a currently lockless driver
 would need a new lock for its sw queue). Also it is unclear to me
 it would be really any faster.

related to this comment, does Linux have a lockless (using atomics)
singly linked list element?  That would be very useful in a driver hot
path.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Andi Kleen
 related to this comment, does Linux have a lockless (using atomics)
 singly linked list element?  That would be very useful in a driver hot
 path.

No; it doesn't. At least not a portable one.
Besides, they tend not to be faster anyway, because e.g. cmpxchg tends
to be as slow as an explicit spinlock.

-Andi



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-11 Thread Krishna Kumar2
Hi Dave,

David Miller wrote on 10/10/2007 02:13:31 AM:

  Hopefully that new qdisc will just use the TX rings of the hardware
  directly. They are typically large enough these days. That might avoid
  some locking in this critical path.

 Indeed, I also realized last night that for the default qdiscs
 we do a lot of stupid useless work.  If the queue is a FIFO
 and the device can take packets, we should send it directly
 and not stick it into the qdisc at all.

Since you are talking of how it should be done in the *current* code,
I feel LLTX drivers will not work nicely with this.

Actually I was trying this change a couple of weeks back, but felt
that doing so would result in out-of-order packets (skbs present in
the queue which were not sent out due to LLTX failure would be sent
out only at the next net_tx_action, while other skbs are sent ahead).

One option is to first call qdisc_run() and then process this skb,
but that is ugly (requeue handling).

However I guess this can be done cleanly once LLTX is removed.

Thanks,

- KK



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen
 A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you

With TSO really? 

 increase the size much more performance starts to go down due to L2
 cache thrashing.

Another possibility would be to consider using cache avoidance
instructions while updating the TX ring (e.g. write combining 
on x86) 
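As a concrete, hedged illustration of the write-combining idea on x86,
using SSE2 non-temporal stores from C; the 16-byte descriptor layout and
the barrier placement are made up, real drivers have their own formats
and ordering requirements:

#include <emmintrin.h>  /* SSE2: _mm_stream_si128(), _mm_sfence() */
#include <stdint.h>

/* A made-up 16-byte TX descriptor, for illustration only. */
struct tx_desc {
        uint64_t addr;
        uint32_t len;
        uint32_t flags;
} __attribute__((aligned(16)));

/* Write one descriptor with a non-temporal store so the line is not
 * allocated in the CPU cache, then fence before ringing the doorbell. */
static void write_desc_nt(struct tx_desc *slot, const struct tx_desc *val)
{
        _mm_stream_si128((__m128i *)slot, *(const __m128i *)val);
        _mm_sfence();
}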

-Andi



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 11:16:44 +0200

  A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
 
 With TSO really? 

Yes.

  increase the size much more performance starts to go down due to L2
  cache thrashing.
 
 Another possibility would be to consider using cache avoidance
 instructions while updating the TX ring (e.g. write combining 
 on x86) 

The chip I was working with at the time (UltraSPARC-IIi) compressed
all the linear stores into 64-byte full cacheline transactions via
the store buffer.

It's true that it would allocate in the L2 cache on a miss, which
is different from your suggestion.

In fact, such a thing might not pan out well, because most of the time
you write a single descriptor or two, and that isn't a full cacheline,
which means a read/modify/write is the only coherent way to make such
a write to RAM.

Sure you could batch, but I'd rather give the chip work to do unless
I unequivocally knew I'd have enough pending to fill a cacheline's
worth of descriptors.  And since you suggest we shouldn't queue in
software... :-)


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Herbert Xu
On Wed, Oct 10, 2007 at 11:16:44AM +0200, Andi Kleen wrote:
  A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
 
 With TSO really? 

Hardware queues are generally per-page rather than per-skb so
it'd fill up quicker than a software queue even with TSO.
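As a rough worked example (numbers illustrative): a 64KB TSO skb backed
by 16 page fragments plus its header needs on the order of 17
descriptors, so a 256-entry TX ring only holds about 15 such packets in
flight, while a 1000-packet software queue holds 1000 of them.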

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen
On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
 The chip I was working with at the time (UltraSPARC-IIi) compressed
 all the linear stores into 64-byte full cacheline transactions via
 the store buffer.

That's a pretty old CPU. Conclusions on more modern ones might be different.

 In fact, such a thing might not pan out well, because most of the time
 you write a single descriptor or two, and that isn't a full cacheline,
 which means a read/modify/write is the only coherent way to make such
 a write to RAM.

x86 WC does R-M-W and is coherent of course. The main difference is 
just that the result is not cached.  When the hardware accesses the cache line
then the cache should also be invalidated.

 Sure you could batch, but I'd rather give the chip work to do unless
 I unequivocably knew I'd have enough pending to fill a cacheline's
 worth of descriptors.  And since you suggest we shouldn't queue in
 software... :-)

Hmm, it probably would need to be coupled with batched submission if 
multiple packets are available, you're right. Probably not worth doing explicit
queueing though.

I suppose it would be an interesting experiment at least.

-Andi


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 12:23:31 +0200

 On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
  The chip I was working with at the time (UltraSPARC-IIi) compressed
  all the linear stores into 64-byte full cacheline transactions via
  the store buffer.
 
 That's a pretty old CPU. Conclusions on more modern ones might be different.

Cache matters, just scale the numbers.

 I suppose it would be an interesting experiment at least.

Absolutely.

I've always gotten very poor results when increasing the TX queue a
lot, for example with NIU the point of diminishing returns seems to
be in the range of 256-512 TX descriptor entries and this was with
1.6GHz CPUs.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen
 We've done similar testing with ixgbe to push maximum descriptor counts,
 and we lost performance very quickly in the same range you're quoting on
 NIU.

Did you try it with WC writes to the ring or CLFLUSH?

-Andi


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Bill Fink
On Tue, 09 Oct 2007, David Miller wrote:

 From: jamal [EMAIL PROTECTED]
 Date: Tue, 09 Oct 2007 17:56:46 -0400
 
  if the h/ware queues are full because of link pressure etc, you drop. We
  drop today when the s/ware queues are full. The driver txmit lock takes
  place of the qdisc queue lock etc. I am assuming there is still need for
  that locking. The filter/classification scheme still works as is and
  select classes which map to rings. tc still works as is etc.
 
 I understand your suggestion.
 
 We have to keep in mind, however, that the sw queue right now is 1000
 packets.  I heavily discourage any driver author to try and use any
 single TX queue of that size.  Which means that just dropping on back
 pressure might not work so well.
 
 Or it might be perfect and signal TCP to backoff, who knows! :-)

I can't remember the details anymore, but for 10-GigE, I have encountered
cases where I was able to significantly increase TCP performance by
increasing the txqueuelen to 1, which is the setting I now use for
any 10-GigE testing.

-Bill


RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: Andi Kleen [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, October 10, 2007 9:02 AM
 To: Waskiewicz Jr, Peter P
 Cc: David Miller; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 netdev@vger.kernel.org; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net 
 core use batching
 
  We've done similar testing with ixgbe to push maximum descriptor 
  counts, and we lost performance very quickly in the same 
 range you're 
  quoting on NIU.
 
 Did you try it with WC writes to the ring or CLFLUSH?
 
 -Andi

Hmm, I think it might be slightly different, but it still shows queue
depth vs. performance.  I was actually referring to how many descriptors
we can represent a packet with before it becomes a problem wrt
performance.  This morning I tried to actually push my ixgbe NIC hard
enough to come close to filling the ring with packets (384-byte
packets), and even on my 8-core Xeon I can't do it.  My system can't
generate enough I/O to fill the hardware queues before CPUs max out.

-PJ Waskiewicz


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 09:08:48 -0400

 On Wed, 2007-10-10 at 03:44 -0700, David Miller wrote:
 
  I've always gotten very poor results when increasing the TX queue a
  lot, for example with NIU the point of diminishing returns seems to
  be in the range of 256-512 TX descriptor entries and this was with
  1.6Ghz cpus.
 
 Is it interrupt per packet? From my experience, you may find interesting
 results varying tx interrupt mitigation parameters in addition to the
 ring parameters.
 Unfortunately when you do that, optimal parameters also depend on
 packet size, so what may work for 64B won't work well for 1400B.

No, it was not interrupt per-packet, I was telling the chip to
interrupt me every 1/4 of the ring.
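For reference, on drivers that expose them these are the usual ethtool
knobs for this kind of tuning (values purely illustrative):

  ethtool -c eth0                            # show interrupt coalescing settings
  ethtool -C eth0 tx-usecs 100 tx-frames 64  # mitigate TX interrupts
  ethtool -G eth0 tx 512                     # resize the TX descriptor ring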


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: Bill Fink [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 12:02:15 -0400

 On Tue, 09 Oct 2007, David Miller wrote:
 
  We have to keep in mind, however, that the sw queue right now is 1000
  packets.  I heavily discourage any driver author to try and use any
  single TX queue of that size.  Which means that just dropping on back
  pressure might not work so well.
  
  Or it might be perfect and signal TCP to backoff, who knows! :-)
 
 I can't remember the details anymore, but for 10-GigE, I have encountered
 cases where I was able to significantly increase TCP performance by
 increasing the txqueuelen to 1, which is the setting I now use for
 any 10-GigE testing.

For some reason this does not surprise me.

We bumped the ethernet default up to 1000 for gigabit.
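For reference, the software queue length being discussed here is the
per-device txqueuelen, set e.g. with (value illustrative):

  ip link set dev eth0 txqueuelen 1000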


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
Hi Peter,

Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote on
10/09/2007 04:03:42 AM:

  true, that needs some resolution. Heres a hand-waving thought:
  Assuming all packets of a specific map end up in the same
  qdiscn queue, it seems feasible to ask the qdisc scheduler to
  give us enough packages (ive seen people use that terms to
  refer to packets) for each hardware ring's available space.
  With the patches i posted, i do that via
  dev->xmit_win that assumes only one view of the driver; essentially a
  single ring.
  If that is doable, then it is up to the driver to say i have
  space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on
  what scheduling scheme the driver implements - the dev->blist
  can stay the same. Its a handwave, so there may be issues
  there and there could be better ways to handle this.
 
  Note: The other issue that needs resolving that i raised
  earlier was in regards to multiqueue running on multiple cpus
  servicing different rings concurently.

 I can see the qdisc being modified to send batches per queue_mapping.
 This shouldn't be too difficult, and if we had the xmit_win per queue
 (in the subqueue struct like Dave pointed out).

I hope my understanding of multiqueue is correct for this mail to make
sense :-)

Isn't it enough that the multiqueue+batching drivers handle skbs
belonging to different queues themselves, instead of the qdisc having
to figure that out? This will reduce costs for most skbs that are
neither batched nor sent to multiqueue devices.

Eg, driver can keep processing skbs and put to the correct tx_queue
as long as mapping remains the same. If the mapping changes, it posts
earlier skbs (with the correct lock) and then iterates for the other
skbs that have the next different mapping, and so on.

(This is required only if driver is supposed to transmit 1 skb in one
call, otherwise it is not an issue)
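A minimal sketch of that driver-side grouping, assuming the batch
arrives as an skb list; the ring_* helpers and the overall shape are
hypothetical, not the posted patches:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Walk the batch and flush a sub-batch to its TX ring whenever the
 * queue_mapping changes, taking each ring's lock only once per run. */
static int xmit_batch(struct net_device *dev, struct sk_buff_head *batch)
{
        struct sk_buff *skb;
        u16 cur_map = 0;
        int started = 0;

        while ((skb = __skb_dequeue(batch)) != NULL) {
                if (!started || skb->queue_mapping != cur_map) {
                        if (started)
                                ring_unlock_and_kick(dev, cur_map);
                        cur_map = skb->queue_mapping;
                        ring_lock(dev, cur_map);
                        started = 1;
                }
                ring_post_skb(dev, cur_map, skb);
        }
        if (started)
                ring_unlock_and_kick(dev, cur_map);

        return NETDEV_TX_OK;
}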

Alternatively, supporting drivers could return a different code on
mapping change, like: NETDEV_TX_MAPPING_CHANGED (for batching only)
so that qdisc_run() could retry. Would that work?

Secondly, having xmit_win per queue: would it help in the multiple-skb
case? Currently there is no way to tell the qdisc to dequeue skbs from
a particular band - it returns skbs from the highest priority band.

thanks,

- KK



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:28:27 +0530

 Isn't it enough that the multiqueue+batching drivers handle skbs
 belonging to different queue's themselves, instead of qdisc having
 to figure that out? This will reduce costs for most skbs that are
 neither batched nor sent to multiqueue devices.
 
 Eg, driver can keep processing skbs and put to the correct tx_queue
 as long as mapping remains the same. If the mapping changes, it posts
 earlier skbs (with the correct lock) and then iterates for the other
 skbs that have the next different mapping, and so on.

The complexity in most of these suggestions is beginning to drive me a
bit crazy :-)

This should be the simplest thing in the world: when the TX queue has
space, give it packets.  Period.

When I hear suggestions like have the driver pick the queue in
->hard_start_xmit() and return some special status if the queue
becomes different, you know, I really begin to wonder :-)

If we have to go back, get into the queueing layer locks, have these
special cases, and whatnot, what's the point?

This code should eventually be able to run lockless all the way to the
TX queue handling code of the driver.  The queueing code should know
what TX queue the packet will be bound for, and always precisely
invoke the driver in a state where the driver can accept the packet.

Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:

  Isn't it enough that the multiqueue+batching drivers handle skbs
  belonging to different queue's themselves, instead of qdisc having
  to figure that out? This will reduce costs for most skbs that are
  neither batched nor sent to multiqueue devices.
 
  Eg, driver can keep processing skbs and put to the correct tx_queue
  as long as mapping remains the same. If the mapping changes, it posts
  earlier skbs (with the correct lock) and then iterates for the other
  skbs that have the next different mapping, and so on.

 The complexity in most of these suggestions is beginning to drive me a
 bit crazy :-)

 This should be the simplest thing in the world, when TX queue has
 space, give it packets.  Period.

 When I hear suggestions like have the driver pick the queue in
 ->hard_start_xmit() and return some special status if the queue
 becomes different.  you know, I really begin to wonder :-)

 If we have to go back, get into the queueing layer locks, have these
 special cases, and whatnot, what's the point?

I understand your point, but the qdisc code itself needs almost no
change, as small as:

qdisc_restart()
{
  ...
  case NETDEV_TX_MAPPING_CHANGED:
  /*
   * Driver sent some skbs from one mapping, and found others
   * are for different queue_mapping. Try again.
   */
  ret = 1; /* guaranteed to have at least 1 skb in batch list */
  break;
  ...
}

Alternatively if the driver does all the dirty work, qdisc needs no
change at all. However, I am not sure if this addresses all the
concerns raised by you, Peter, Jamal, others.

 This code should eventually be able to run lockless all the way to the
 TX queue handling code of the driver.  The queueing code should know
 what TX queue the packet will be bound for, and always precisely
 invoke the driver in a state where the driver can accept the packet.

This sounds like a good idea :)

I need to think more on this, esp as my batching sends multiple skbs of
possibly different mappings to device, and those skbs stay in batch list
if driver couldn't send them out.

thanks,

- KK



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:

 Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
 it.

Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.

thanks,

- KK



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530

 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
 
  Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
  it.
 
 Great, this will make life easy. Any idea how long that would take?
 It seems simple enough to do.

I'd say we can probably try to get rid of it in 2.6.25, this is
assuming we get driver authors to cooperate and do the conversions
or alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

David Miller wrote:

From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530


David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:


Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.

Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.


I'd say we can probably try to get rid of it in 2.6.25, this is
assuming we get driver authors to cooperate and do the conversions
or alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)


What, like this?  :)

Jeff




 drivers/net/atl1/atl1_main.c   |   16 +---
 drivers/net/chelsio/cxgb2.c|1 -
 drivers/net/chelsio/sge.c  |   20 +---
 drivers/net/e1000/e1000_main.c |6 +-
 drivers/net/ixgb/ixgb_main.c   |   24 
 drivers/net/pasemi_mac.c   |2 +-
 drivers/net/rionet.c   |   19 +++
 drivers/net/spider_net.c   |2 +-
 drivers/net/sungem.c   |   17 ++---
 drivers/net/tehuti.c   |   12 +---
 drivers/net/tehuti.h   |3 +--
 11 files changed, 32 insertions(+), 90 deletions(-)

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 4c728f1..03e94fe 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -1665,10 +1665,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 
        len -= skb->data_len;
 
-       if (unlikely(skb->len == 0)) {
-               dev_kfree_skb_any(skb);
-               return NETDEV_TX_OK;
-       }
+       WARN_ON(skb->len == 0);
 
        param.data = 0;
        param.tso.tsopu = 0;
@@ -1703,11 +1700,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
                }
        }
 
-       if (!spin_trylock_irqsave(&adapter->lock, flags)) {
-               /* Can't get lock - tell upper layer to requeue */
-               dev_printk(KERN_DEBUG, &adapter->pdev->dev, "tx locked\n");
-               return NETDEV_TX_LOCKED;
-       }
+       spin_lock_irqsave(&adapter->lock, flags);
 
        if (atl1_tpd_avail(&adapter->tpd_ring) < count) {
                /* not enough descriptors */
@@ -1749,8 +1742,11 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
        atl1_tx_map(adapter, skb, 1 == val);
        atl1_tx_queue(adapter, count, &param);
        netdev->trans_start = jiffies;
+
        spin_unlock_irqrestore(&adapter->lock, flags);
+
        atl1_update_mailbox(adapter);
+
        return NETDEV_TX_OK;
 }
 
@@ -2301,8 +2297,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
         */
        /* netdev->features |= NETIF_F_TSO; */
 
-       netdev->features |= NETIF_F_LLTX;
-
        /*
         * patch for some L1 of old version,
         * the final version of L1 may not need these
diff --git a/drivers/net/chelsio/cxgb2.c b/drivers/net/chelsio/cxgb2.c
index 2dbf8dc..0aba7e7 100644
--- a/drivers/net/chelsio/cxgb2.c
+++ b/drivers/net/chelsio/cxgb2.c
@@ -1084,7 +1084,6 @@ static int __devinit init_one(struct pci_dev *pdev,
        netdev->mem_end = mmio_start + mmio_len - 1;
        netdev->priv = adapter;
        netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;
-       netdev->features |= NETIF_F_LLTX;
 
        adapter->flags |= RX_CSUM_ENABLED | TCP_CSUM_CAPABLE;
        if (pci_using_dac)
diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c
index ffa7e64..84f5869 100644
--- a/drivers/net/chelsio/sge.c
+++ b/drivers/net/chelsio/sge.c
@@ -1739,8 +1739,7 @@ static int t1_sge_tx(struct sk_buff *skb, struct adapter *adapter,
        struct cmdQ *q = &sge->cmdQ[qid];
        unsigned int credits, pidx, genbit, count, use_sched_skb = 0;
 
-       if (!spin_trylock(&q->lock))
-               return NETDEV_TX_LOCKED;
+       spin_lock(&q->lock);
 
        reclaim_completed_tx(sge, q);
 
@@ -1817,12 +1816,12 @@ use_sched:
        }
 
        if (use_sched_skb) {
-               if (spin_trylock(&q->lock)) {
-                       credits = q->size - q->in_use;
-                       skb = NULL;
-                       goto use_sched;
-               }
+               spin_lock(&q->lock);
+               credits = q->size - q->in_use;
+               skb = NULL;
+               goto use_sched;
        }
+
        return NETDEV_TX_OK;
 }
 
@@ -1977,13 +1976,12 @@ static void sge_tx_reclaim_cb(unsigned long data)
        for (i = 0; i < SGE_CMDQ_N; ++i) {
                struct cmdQ *q = &sge->cmdQ[i];
 
-               if (!spin_trylock(&q->lock))
-                       continue;
+               spin_lock(&q->lock);
 
                reclaim_completed_tx(sge, q);
-               if (i == 0 && q->in_use) {      /* flush pending credits */
+               if (i == 0 && q->in_use)        /* flush pending credits */
                        writel(F_CMDQ0_ENABLE, 

Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Herbert Xu
On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote:
 David Miller wrote:
 
 I can just threaten to do them all and that should get the driver
 maintainers going :-)
 
 What, like this?  :)

Awesome :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

Herbert Xu wrote:

On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote:

David Miller wrote:

I can just threaten to do them all and that should get the driver
maintainers going :-)

What, like this?  :)


Awsome :)


Note my patch is just to get the maintainers going.  :)  I'm not going 
to commit that, since I don't have any way to test any of the drivers I 
touched (but I wouldn't scream if it appeared in net-2.6.24 either)


Jeff




Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread jamal
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote:

 Driver might ask for 10 and we send 10, but LLTX driver might fail to get
 lock and return TX_LOCKED. I haven't seen your code in greater detail, but
 don't you requeue in that case too?

For other drivers that are non-batching and LLTX, it is possible - at
the moment in my patch I whine that the driver is buggy. I will fix this
up so it checks for NETIF_F_BTX. Thanks for pointing out the above use case.

cheers,
jamal



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Stephen Hemminger
On 09 Oct 2007 18:51:51 +0200
Andi Kleen [EMAIL PROTECTED] wrote:

 David Miller [EMAIL PROTECTED] writes:
  
  2) Switch the default qdisc away from pfifo_fast to a new DRR fifo
 with load balancing using the code in #1.  I think this is kind
 of in the territory of what Peter said he is working on.
 
 Hopefully that new qdisc will just use the TX rings of the hardware
 directly. They are typically large enough these days. That might avoid
 some locking in this critical path.
 
 I know this is controversial, but realistically I doubt users
 benefit at all from the prioritization that pfifo provides.
 
 I agree. For most interfaces the priority is probably dubious.
 Even for DSL the prioritization will be likely usually done in a router
 these days.
 
 Also for the fast interfaces where we do TSO priority doesn't work
 very well anyways -- with large packets there is not too much 
 to prioritize.
 
  3) Work on discovering a way to make the locking on transmit as
 localized to the current thread of execution as possible.  Things
 like RCU and statistic replication, techniques we use widely
 elsewhere in the stack, begin to come to mind.
 
 If the data is just passed on to the hardware queue, why is any 
 locking needed at all? (except for the driver locking of course)
 
 -Andi

I wonder about the whole idea of queueing in general at such high speeds.
Given the normal bi-modal distribution of packets, and the predominance
of 1500 byte MTU, does it make sense to even have any queueing in software
at all?


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Andi Kleen
 I wonder about the whole idea of queueing in general at such high speeds.
 Given the normal bi-modal distribution of packets, and the predominance
 of 1500 byte MTU; does it make sense to even have any queueing in software
 at all?

Yes that is my point -- it should just pass it through directly
and the driver can then put it into the different per CPU (or per
whatever) queues managed by the hardware.

The only thing the qdisc needs to do is to set some bit that says
it is ok to put this into different queues; don't need strict ordering.

Otherwise if the drivers did that unconditionally they might cause
problems with other qdiscs.

This would also require that the driver exports some hint 
to the upper layer on how large its internal queues are. A device
with a short queue would still require pfifo_fast. Long queue
devices could just pass through. That again could be a single flag.

-Andi


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Waskiewicz Jr, Peter P
 IMO the net driver really should provide a hint as to what it wants.
 
 8139cp and tg3 would probably prefer multiple TX queue 
 behavior to match silicon behavior -- strict prio.

If I understand what you just said, I disagree.  If your hardware is
running strict prio, you don't want to enforce strict prio in the qdisc
layer; performing two layers of QoS is excessive, and may lead to
results you don't want.  The reason I added the DRR qdisc is for the Si
that has its own queueing strategy that is not RR.  For Si that
implements RR (like e1000), you can either use the DRR qdisc, or if you
want to prioritize your flows, use PRIO.

-PJ Waskiewicz
[EMAIL PROTECTED]


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

Waskiewicz Jr, Peter P wrote:

IMO the net driver really should provide a hint as to what it wants.

8139cp and tg3 would probably prefer multiple TX queue 
behavior to match silicon behavior -- strict prio.


If I understand what you just said, I disagree.  If your hardware is
running strict prio, you don't want to enforce strict prio in the qdisc
layer; performing two layers of QoS is excessive, and may lead to
results you don't want.  The reason I added the DRR qdisc is for the Si
that has its own queueing strategy that is not RR.  For Si that
implements RR (like e1000), you can either use the DRR qdisc, or if you
want to prioritize your flows, use PRIO.


A misunderstanding, I think.

To my brain, DaveM's item #2 seemed to assume/require the NIC hardware 
to balance fairly across hw TX rings, which seemed to preclude the 
8139cp/tg3 style of strict-prio hardware.  That's what I was responding to.


As long as there is some modular way to fit 8139cp/tg3 style multi-TX 
into our universe, I'm happy :)


Jeff





RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Waskiewicz Jr, Peter P
 A misunderstanding, I think.
 
 To my brain, DaveM's item #2 seemed to assume/require the NIC 
 hardware to balance fairly across hw TX rings, which seemed 
 to preclude the
 8139cp/tg3 style of strict-prio hardware.  That's what I was 
 responding to.
 
 As long as there is some modular way to fit 8139cp/tg3 style 
 multi-TX into our universe, I'm happy :)

Ah hah.  Yes, a misunderstanding on my part.  Thanks for the
clarification.  Methinks more caffeine is required for today...

-PJ


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 08:44:25 -0400

 David Miller wrote:
  From: Krishna Kumar2 [EMAIL PROTECTED]
  Date: Tue, 9 Oct 2007 16:51:14 +0530
  
  David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
 
  Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
  it.
  Great, this will make life easy. Any idea how long that would take?
  It seems simple enough to do.
  
  I'd say we can probably try to get rid of it in 2.6.25, this is
  assuming we get driver authors to cooperate and do the conversions
  or alternatively some other motivated person.
  
  I can just threaten to do them all and that should get the driver
  maintainers going :-)
 
 What, like this?  :)

Thanks, but it's probably going to need some corrections and/or
an audit.

If you unconditionally take those locks in the transmit function,
there is probably an ABBA deadlock elsewhere in the driver now, most
likely in the TX reclaim processing, and you therefore need to handle
that too.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

David Miller wrote:

From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 08:44:25 -0400


David Miller wrote:

From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530


David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:


Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.

Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.

I'd say we can probably try to get rid of it in 2.6.25, this is
assuming we get driver authors to cooperate and do the conversions
or alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)

What, like this?  :)


Thanks, but it's probably going to need some corrections and/or
an audit.


I would be happy if someone wanted to audit that patch.



If you unconditionally take those locks in the transmit function,
there is probably an ABBA deadlock elsewhere in the driver now, most
likely in the TX reclaim processing, and you therefore need to handle
that too.


And I most certainly checked the relevant transmit paths and other 
locking to make sure lock ordering was correct.


Jeff





Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  I'd say we can probably try to get rid of it in 2.6.25, this is
  assuming we get driver authors to cooperate and do the conversions
  or alternatively some other motivated person.
  
  I can just threaten to do them all and that should get the driver
  maintainers going :-)

I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to
my TODO list so I don't forget.

In fact if 2.6.23 drags on long enough I may do it for 2.6.24


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: 09 Oct 2007 18:51:51 +0200

 Hopefully that new qdisc will just use the TX rings of the hardware
 directly. They are typically large enough these days. That might avoid
 some locking in this critical path.

Indeed, I also realized last night that for the default qdiscs
we do a lot of stupid useless work.  If the queue is a FIFO
and the device can take packets, we should send it directly
and not stick it into the qdisc at all.

 If the data is just passed on to the hardware queue, why is any 
 locking needed at all? (except for the driver locking of course)

Absolutely.

Our packet scheduler subsystem is great, but by default it should just
get out of the way.
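A hedged sketch of what that default-case bypass could look like; the
qdisc_is_default_fifo() predicate is hypothetical, the locking is
approximated, and this is not the actual dev_queue_xmit() of the time:

#include <linux/netdevice.h>
#include <net/sch_generic.h>

/* If the qdisc is the default FIFO with nothing queued and the device
 * can take the packet, hand it straight to the driver; otherwise fall
 * back to the normal enqueue + qdisc_run() path. */
static int queue_xmit_fastpath(struct sk_buff *skb, struct net_device *dev,
                               struct Qdisc *q)
{
        if (qdisc_is_default_fifo(q) && skb_queue_empty(&q->q) &&
            !netif_queue_stopped(dev)) {
                int ret;

                netif_tx_lock(dev);
                ret = dev->hard_start_xmit(skb, dev);
                netif_tx_unlock(dev);
                if (ret == NETDEV_TX_OK)
                        return ret;
                /* The device refused it after all: fall through. */
        }
        return q->enqueue(skb, q);      /* followed by qdisc_run(dev) */
}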


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Roland Dreier [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 13:22:44 -0700

 I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to
 my TODO list so I don't forget.
 
 In fact if 2.6.23 drags on long enough I may do it for 2.6.24

Before you add new entries to your list, how is that ibm driver NAPI
conversion coming along? :-)

Right now that's a more pressing task to complete.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Stephen Hemminger
On Tue, 09 Oct 2007 13:43:31 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Andi Kleen [EMAIL PROTECTED]
 Date: 09 Oct 2007 18:51:51 +0200
 
  Hopefully that new qdisc will just use the TX rings of the hardware
  directly. They are typically large enough these days. That might avoid
  some locking in this critical path.
 
 Indeed, I also realized last night that for the default qdiscs
 we do a lot of stupid useless work.  If the queue is a FIFO
 and the device can take packets, we should send it directly
 and not stick it into the qdisc at all.
 
  If the data is just passed on to the hardware queue, why is any 
  locking needed at all? (except for the driver locking of course)
 
 Absolutely.
 
 Our packet scheduler subsystem is great, but by default it should just
 get out of the way.

I was thinking why not have a default transmit queue len of 0 like
the virtual devices.

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 13:53:40 -0700

 I was thinking why not have a default transmit queue len of 0 like
 the virtual devices.

I'm not so sure.

Even if the device has huge queues I still think we need a software
queue for when the hardware one backs up.

It is even beneficial to stick with reasonably sized TX queues because
it keeps the total resident state accessed by the CPU within the
bounds of the L2 cache.  If you go past that it actually hurts to make
the TX queue larger instead of helps even if it means you never hit
back pressure.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 16:20:14 -0400

 David Miller wrote:
  If you unconditionally take those locks in the transmit function,
  there is probably an ABBA deadlock elsewhere in the driver now, most
  likely in the TX reclaim processing, and you therefore need to handle
  that too.
 
 And I most certainly checked the relevant transmit paths and other 
 locking to make sure lock ordering was correct.

Awesome.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  Before you add new entries to your list, how is that ibm driver NAPI
  conversion coming along? :-)

I still haven't done much.  OK, I will try to get my board booting
again this week.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread jamal

On Tue, 2007-09-10 at 14:22 -0700, David Miller wrote:

 Even if the device has huge queues I still think we need a software
 queue for when the hardware one backs up.

It should be fine to just pretend the qdisc exists despite it sitting
in the driver and not have s/ware queues at all to avoid all the
challenges that qdiscs bring; 
if the h/ware queues are full because of link pressure etc, you drop. We
drop today when the s/ware queues are full. The driver txmit lock takes
place of the qdisc queue lock etc. I am assuming there is still need for
that locking. The filter/classification scheme still works as is and
select classes which map to rings. tc still works as is etc.

cheers,
jamal



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  Before you add new entries to your list, how is that ibm driver NAPI
  conversion coming along? :-)

OK, thanks for the kick in the pants, I have a couple of patches for
net-2.6.24 coming (including an unrelated trivial warning fix for
IPoIB).

 - R.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 17:56:46 -0400

 if the h/ware queues are full because of link pressure etc, you drop. We
 drop today when the s/ware queues are full. The driver txmit lock takes
 place of the qdisc queue lock etc. I am assuming there is still need for
 that locking. The filter/classification scheme still works as is and
 select classes which map to rings. tc still works as is etc.

I understand your suggestion.

We have to keep in mind, however, that the sw queue right now is 1000
packets.  I heavily discourage any driver author to try and use any
single TX queue of that size.  Which means that just dropping on back
pressure might not work so well.

Or it might be perfect and signal TCP to backoff, who knows! :-)

While working out this issue in my mind, it occurred to me that we
can put the sw queue into the driver as well.

The idea is that the network stack, as in the pure hw queue scheme,
unconditionally always submits new packets to the driver.  Therefore
even if the hw TX queue is full, the driver can still queue to an
internal sw queue with some limit (say 1000 for ethernet, as is used
now).

When the hw TX queue gains space, the driver self-batches packets
from the sw queue to the hw queue.
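A rough sketch of that driver-internal scheme; all names (drv_priv,
hw_ring_space(), post_to_hw_ring(), SW_QUEUE_LIMIT) are hypothetical
and the locking is only illustrative:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* ->hard_start_xmit(): never push back to the stack; overflow into a
 * driver-private software queue with its own limit, and drop past it. */
static int drv_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct drv_priv *p = netdev_priv(dev);

        spin_lock(&p->sw_lock);
        if (hw_ring_space(p) > 0 && skb_queue_empty(&p->sw_queue)) {
                post_to_hw_ring(p, skb);
        } else if (skb_queue_len(&p->sw_queue) < SW_QUEUE_LIMIT) {
                __skb_queue_tail(&p->sw_queue, skb);
        } else {
                kfree_skb(skb);         /* back pressure: drop */
                p->stats.tx_dropped++;
        }
        spin_unlock(&p->sw_lock);

        return NETDEV_TX_OK;
}

/* TX-completion path: self-batch from the sw queue into freed slots. */
static void drv_tx_reclaim(struct drv_priv *p)
{
        struct sk_buff *skb;

        spin_lock(&p->sw_lock);
        while (hw_ring_space(p) > 0 &&
               (skb = __skb_dequeue(&p->sw_queue)) != NULL)
                post_to_hw_ring(p, skb);
        spin_unlock(&p->sw_lock);
}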

It sort of obviates the need for mid-level queue batching in the
generic networking.  Compared to letting the driver self-batch,
the mid-level batching approach is pure overhead.

We seem to be sort of all mentioning similar ideas.  For example, you
can get the above kind of scheme today by using a mid-level queue
length of zero, and I believe this idea was mentioned by Stephen
Hemminger earlier.

I may experiment with this in the NIU driver.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Andi Kleen
On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote:
 We have to keep in mind, however, that the sw queue right now is 1000
 packets.  I heavily discourage any driver author to try and use any
 single TX queue of that size.  

Why would you discourage them? 

If 1000 is ok for a software queue why would it not be ok
for a hardware queue?

 Which means that just dropping on back
 pressure might not work so well.
 
 Or it might be perfect and signal TCP to backoff, who knows! :-)

1000 packets is a lot. I don't have hard data, but gut feeling 
is less would also do.

And if the hw queues are not enough, a better scheme might be to just
manage this in the sockets, in sendmsg: e.g. provide a wait queue that
drivers can wake up, and let senders block on it until there is more
queue space.
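A minimal sketch of that wait-queue idea (the tx_space_wq field and the
driver_tx_space() helper are hypothetical):

#include <linux/netdevice.h>
#include <linux/wait.h>

/* sendmsg() side: sleep until the driver reports room again. */
static int wait_for_tx_space(struct net_device *dev)
{
        return wait_event_interruptible(dev->tx_space_wq,
                                        driver_tx_space(dev) > 0);
}

/* Driver TX-completion side: wake blocked senders when slots free up. */
static void tx_space_available(struct net_device *dev, int freed)
{
        if (freed > 0)
                wake_up_interruptible(&dev->tx_space_wq);
}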

 The idea is that the network stack, as in the pure hw queue scheme,
 unconditionally always submits new packets to the driver.  Therefore
 even if the hw TX queue is full, the driver can still queue to an
 internal sw queue with some limit (say 1000 for ethernet, as is used
 now).

 
 When the hw TX queue gains space, the driver self-batches packets
 from the sw queue to the hw queue.

I don't really see the advantage over the qdisc in that scheme.
It's certainly not simpler and probably more code and would likely
also not require less locks (e.g. a currently lockless driver
would need a new lock for its sw queue). Also it is unclear to me
it would be really any faster.

-Andi



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 02:37:16 +0200

 On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote:
  We have to keep in mind, however, that the sw queue right now is 1000
  packets.  I heavily discourage any driver author to try and use any
  single TX queue of that size.  
 
 Why would you discourage them? 
 
 If 1000 is ok for a software queue why would it not be ok
 for a hardware queue?

Because with the software queue, you aren't accessing 1000 slots
shared with the hardware device which does shared-ownership
transactions on those L2 cache lines with the cpu.

Long ago I did a test on gigabit on a cpu with only 256K of
L2 cache.  Using a smaller TX queue made things go faster,
and it's exactly because of these L2 cache effects.
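As a rough back-of-the-envelope example (illustrative numbers): with
64-byte descriptors, a 1000-entry TX ring is ~64KB of descriptor state
alone, a quarter of that 256K L2 before counting the skbs themselves,
while a 256-entry ring keeps it around 16KB.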

 1000 packets is a lot. I don't have hard data, but gut feeling 
 is less would also do.

I'll try to see how backlogged my 10Gb tests get when a strong
sender is sending to a weak receiver.

 And if the hw queues are not enough a better scheme might be to
 just manage this in the sockets in sendmsg. e.g. provide a wait queue that
 drivers can wake up and let them block on more queue.

TCP does this already, but it operates in a lossy manner.

 I don't really see the advantage over the qdisc in that scheme.
 It's certainly not simpler and probably more code and would likely
 also not require less locks (e.g. a currently lockless driver
 would need a new lock for its sw queue). Also it is unclear to me
 it would be really any faster.

You still need a lock to guard hw TX enqueue from hw TX reclaim.

A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
increase the size much more performance starts to go down due to L2
cache thrashing.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 10:33 +0530, Krishna Kumar2 wrote:

  As a side note: Any batching driver should _never_ have to requeue; if
  it does it is buggy. And the non-batching ones if they ever requeue will
  be a single packet, so not much reordering.
 
 On the contrary, batching LLTX drivers (if that is not ruled out) will very
 often requeue resulting in heavy reordering. Fix looks good though.

Two things: 
one, LLTX is deprecated (I think I saw a patch which says no more new
drivers should do LLTX) and I plan, if nobody else does, to kill
LLTX in e1000 RSN. So for that reason I removed all code that existed to
support LLTX.
two, there should _never_ be any requeueing even with LLTX in the previous
patches when I supported them; if there is, it is a bug. This is because
we don't send more than what the driver asked for via xmit_win. So if it
asked for more than it can handle, that is a bug. If its available space
changes while we are sending to it, that too is a bug.

cheers,
jamal



RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: J Hadi Salim [mailto:[EMAIL PROTECTED] On Behalf Of jamal
 Sent: Monday, October 08, 2007 11:27 AM
 To: David Miller
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; 
 [EMAIL PROTECTED]; Waskiewicz Jr, Peter P; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: [PATCH 2/3][NET_BATCH] net core use batching
 
 This patch adds the usage of batching within the core.
 
 cheers,
 jamal

Hey Jamal,
I still have concerns how this will work with Tx multiqueue.
The way the batching code looks right now, you will probably send a
batch of skb's from multiple bands from PRIO or RR to the driver.  For
non-Tx multiqueue drivers, this is fine.  For Tx multiqueue drivers,
this isn't fine, since the Tx ring is selected by the value of
skb->queue_mapping (set by the qdisc on {prio|rr}_classify()).  If the
whole batch comes in with different queue_mappings, this could prove to
be an interesting issue.
Now I see in the driver HOWTO you recently sent that the driver
will be expected to loop over the list and call its ->hard_start_xmit()
for each skb.  I think that should be fine for multiqueue, I just wanted
to see if you had any thoughts on how it should work, any performance
issues you can see (I can't think of any).  Since the batching feature
and Tx multiqueue are very new features, I'd like to make sure we can
think of any possible issues with them coexisting before they are both
mainline.

Looking ahead for multiqueue, I'm still working on the per-queue
lock implementation for multiqueue, which I know will not work with
batching as it's designed today.  I'm still not sure how to handle this,
because it really would require the batch you send to have the same
queue_mapping in each skb, so you're grabbing the correct queue_lock.
Or, we could have the core grab all the queue locks for each
skb->queue_mapping represented in the batch.  That would block another
batch though if it had any of those queues in its next batch before the
first one completed.  Thoughts?

Thanks Jamal,

-PJ Waskiewicz
[EMAIL PROTECTED]


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote:

   I still have concerns how this will work with Tx multiqueue.
 The way the batching code looks right now, you will probably send a
 batch of skb's from multiple bands from PRIO or RR to the driver.  For
 non-Tx multiqueue drivers, this is fine.  For Tx multiqueue drivers,
 this isn't fine, since the Tx ring is selected by the value of
 skb->queue_mapping (set by the qdisc on {prio|rr}_classify()).  If the
 whole batch comes in with different queue_mappings, this could prove to
 be an interesting issue.

true, that needs some resolution. Here's a hand-waving thought:
Assuming all packets of a specific map end up in the same qdisc queue,
it seems feasible to ask the qdisc scheduler to give us enough packages
(I've seen people use that term to refer to packets) for each hardware
ring's available space. With the patches I posted, I do that via
dev->xmit_win that assumes only one view of the driver; essentially a
single ring.
If that is doable, then it is up to the driver to say
"i have space for 5 in ring[0], 10 in ring[1], 0 in ring[2]" based on
what scheduling scheme the driver implements - the dev->blist can stay
the same. It's a handwave, so there may be issues there and there could
be better ways to handle this.
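One possible shape of that per-ring hint, purely as an illustration of
the handwave above (nothing like this exists in the posted patches):

/* Hypothetical: the driver reports remaining space per TX ring and the
 * scheduler dequeues at most win packets per queue_mapping. */
struct tx_ring_win {
        u16 ring;       /* queue_mapping this window applies to */
        u16 win;        /* packets the driver can take right now */
};

/* e.g. the driver fills in { {0, 5}, {1, 10}, {2, 0} } and the qdisc
 * asks the band(s) mapped to ring N for at most win packets. */
int (*get_xmit_win)(struct net_device *dev, struct tx_ring_win *wins,
                    int max_rings);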

Note: The other issue that needs resolving that I raised earlier was in
regard to multiqueue running on multiple CPUs servicing different rings
concurrently.

   Now I see in the driver HOWTO you recently sent that the driver
 will be expected to loop over the list and call it's -hard_start_xmit()
 for each skb.  

It's the core that does that, not the driver; the driver continues to
use ->hard_start_xmit() (albeit a modified one). The idea is not to have
many new interfaces.

 I think that should be fine for multiqueue, I just wanted
 to see if you had any thoughts on how it should work, any performance
 issues you can see (I can't think of any).  Since the batching feature
 and Tx multiqueue are very new features, I'd like to make sure we can
 think of any possible issues with them coexisting before they are both
 mainline.

Isn't multiqueue mainline already?

   Looking ahead for multiqueue, I'm still working on the per-queue
 lock implementation for multiqueue, which I know will not work with
 batching as it's designed today.  

The point behind batching is to reduce the cost of the locks by
amortizing across the locks. Even better, if one can, get rid
of locks altogether. Remind me, why do you need the per-queuemap lock? And is it
needed from the enqueuing side too? Maybe let's start there to help me
understand things?

 I'm still not sure how to handle this,
 because it really would require the batch you send to have the same
 queue_mapping in each skb, so you're grabbing the correct queue_lock.

Sure, that is doable if the driver can set a per queue_mapping xmit_win
and the qdisc can be taught to say give me packets for queue_mapping X

 Or, we could have the core grab all the queue locks for each
 skb-queue_mapping represented in the batch.  That would block another
 batch though if it had any of those queues in it's next batch before the
 first one completed.  Thoughts?

I am not understanding the desire to have locks on a per-queuemap. I
think the single queuelock we have today should suffice. If the intent
is to have concurrent cpus running to each hardware ring, then this is
what i questioned earlier whether it was the right thing to do (very top
of email where i mention it as the other issue).

cheers,
jamal




Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Mon, 08 Oct 2007 16:48:50 -0400

 On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote:
 
  I still have concerns how this will work with Tx multiqueue.
  The way the batching code looks right now, you will probably send a
  batch of skb's from multiple bands from PRIO or RR to the driver.  For
  non-Tx multiqueue drivers, this is fine.  For Tx multiqueue drivers,
  this isn't fine, since the Tx ring is selected by the value of
  skb->queue_mapping (set by the qdisc on {prio|rr}_classify()).  If the
  whole batch comes in with different queue_mappings, this could prove to
  be an interesting issue.
 
 true, that needs some resolution. Heres a hand-waving thought:
 Assuming all packets of a specific map end up in the same qdiscn queue,
 it seems feasible to ask the qdisc scheduler to give us enough packages
 (ive seen people use that terms to refer to packets) for each hardware
 ring's available space. With the patches i posted, i do that via
 dev->xmit_win that assumes only one view of the driver; essentially a
 single ring.  
 If that is doable, then it is up to the driver to say
 i have space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on
 what scheduling scheme the driver implements - the dev->blist can stay
 the same. Its a handwave, so there may be issues there and there could
 be better ways to handle this.

Add xmit_win to struct net_device_subqueue, problem solved.
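
Shape-wise that is a one-field change; the layout below is illustrative only
(not the exact 2.6.23 definition), with the driver refreshing the window from
its TX-completion path:

/* Per-ring state kept at the tail end of struct net_device; today it
 * only carries queue state, the proposal adds a transmit window. */
struct net_device_subqueue {
	unsigned long	state;		/* existing: stopped/started bits */
	int		xmit_win;	/* proposed: free slots in this ring */
};

/* Driver side, e.g. at the end of its TX-completion cleanup: */
static inline void refresh_xmit_win(struct net_device_subqueue *subq,
				    int free_descriptors)
{
	subq->xmit_win = free_descriptors;
}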


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Waskiewicz Jr, Peter P
 true, that needs some resolution. Heres a hand-waving thought:
 Assuming all packets of a specific map end up in the same 
 qdiscn queue, it seems feasible to ask the qdisc scheduler to 
 give us enough packages (ive seen people use that terms to 
 refer to packets) for each hardware ring's available space. 
 With the patches i posted, i do that via
 dev->xmit_win that assumes only one view of the driver; essentially a
 single ring.  
 If that is doable, then it is up to the driver to say i have 
 space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on 
 what scheduling scheme the driver implements - the dev->blist 
 can stay the same. Its a handwave, so there may be issues 
 there and there could be better ways to handle this.
 
 Note: The other issue that needs resolving that i raised 
 earlier was in regards to multiqueue running on multiple cpus 
 servicing different rings concurently. 

I can see the qdisc being modified to send batches per queue_mapping.
This shouldn't be too difficult, especially if we had the xmit_win per
queue (in the subqueue struct like Dave pointed out).

Addressing your note/issue with different rings being services
concurrently: I'd like to remove the QDISC_RUNNING bit from the global
device; with Tx multiqueue, this bit should be set on each queue (if at
all), allowing multiple Tx rings to be loaded simultaneously.  The
biggest issue today with the multiqueue implementation is the global
queue_lock.  I see it being a hot source of contention in my testing; my
setup is an 8-core machine (dual quad-core procs) with a 10GbE NIC, using
8 Tx and 8 Rx queues.  On transmit, when loading all 8 queues, the
enqueue/dequeue are hitting that lock quite a bit for the whole device.
I really think that the queue_lock should join the queue_state, so the
device no longer manages the top-level state (since we're operating
per-queue instead of per-device).

 It's the core that does that, not the driver; the driver 
 continues to use ->hard_start_xmit() (albeit a modified one). 
 The idea is not to have many new interfaces.

I'll look closer at this, since I think I confused myself.

 Isnt multiqueue mainline already?

Well, it's in 2.6.23-rc*.  I imagine it won't see much action though
until 2.6.24, since people will be porting drivers during that time.
Plus, having the native Rx multiqueue w/NAPI code in 2.6.24 makes it
sensible to have Tx multiqueue at that time.

 The point behind batching is to reduce the cost of the locks 
 by amortizing across the locks. Even better if one can, they 
 should get rid of locks. Remind me, why do you need the 
 per-queuemap lock? And is it needed from the enqueuing side 
 too? Maybe lets start there to help me understand things?

The multiqueue implementation today enforces the number of qdisc bands
(RR or PRIO) to be equal to the number of Tx rings your hardware/driver
is supporting.  Therefore, the queue_lock and queue_state in the kernel
directly relate to the qdisc band management.  If the queue stops from
the driver, then the qdisc won't try to dequeue from the band.  What I'm
working on is to move the lock there too, so I can lock the queue when I
enqueue (protect the band from multiple sources modifying the skb
chain), and lock it when I dequeue.  This is purely for concurrency of
adding/popping skb's from the qdisc queues.  Right now, we take the
whole global lock to add and remove skb's.  This is the next logical
step for separating the queues' dependency on each other.  Please let me
know if this doesn't make sense, or if you have any questions at all
about my reasoning.  I agree that this is where we should be on the same
page before moving onto anything else in this discussion.  :)
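
A bare-bones sketch of that per-band locking (names invented; this shows the
idea, not Peter's actual patches): each band owns its lock, so enqueue and
dequeue on different bands never touch a device-wide lock.

#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct band {
	spinlock_t		lock;	/* protects only this band's skb chain */
	struct sk_buff_head	q;
	unsigned long		state;	/* per-ring stopped/running bits */
};

static void band_enqueue(struct band *b, struct sk_buff *skb)
{
	spin_lock(&b->lock);
	__skb_queue_tail(&b->q, skb);
	spin_unlock(&b->lock);
}

static struct sk_buff *band_dequeue(struct band *b)
{
	struct sk_buff *skb;

	spin_lock(&b->lock);
	skb = __skb_dequeue(&b->q);
	spin_unlock(&b->lock);
	return skb;
}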

 Sure, that is doable if the driver can set a per 
 queue_mapping xmit_win and the qdisc can be taught to say 
 give me packets for queue_mapping X

Yes, I like this idea very much.  Do that, modify the qdisc to send in
chunks from a queue, and the problem should be solved.

I will try and find some additional cycles to get my patches completely
working, and send them.  It'd be easier I think to see what's going on
if I did that.  I'll also try to make them work with the ideas of
xmit_win per queue and batched queue qdisc sends.  Stay tuned...

Thanks Jamal,

-PJ Waskiewicz
[EMAIL PROTECTED]


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 14:26 -0700, David Miller wrote:

 Add xmit_win to struct net_device_subqueue, problem solved.

If net_device_subqueue is visible from both driver and core scheduler
area (couldn't tell from looking at what's in there already), then that'll
do it.

cheers,
jamal



RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Waskiewicz Jr, Peter P
 If  net_device_subqueue is visible from both driver and core 
 scheduler area (couldnt tell from looking at whats in there 
 already), then that'll do it.

Yes, I use the net_device_subqueue structs (the state variable in there)
in the prio and rr qdiscs right now.  It's an indexed list at the very
end of struct netdevice.

-PJ Waskiewicz


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 15:33 -0700, Waskiewicz Jr, Peter P wrote:


 Addressing your note/issue with different rings being services
 concurrently: I'd like to remove the QDISC_RUNNING bit from the global

The challenge to deal with is that netdevices, filters, the queues and
scheduler are closely inter-twined. So it is not just the scheduling
region and QDISC_RUNNING. For example, lets pick just the filters
because they are simple to see: You need to attach them to something -
whatever that is, you then need to synchronize against config and
multiple cpus trying to use them. You could: 
a) replicate them across cpus and only lock on config, but you are
wasting RAM then
b) attach them to rings instead of netdevices - but that makes me wonder
if those subqueues are now going to become netdevices. This also means
you change all user space interfaces to know about subqueues. If you
recall this was a major contention in our earlier discussion.

 device; with Tx multiqueue, this bit should be set on each queue (if at
 all), allowing multiple Tx rings to be loaded simultaneously. 

This is the issue i raised - refer to Dave's wording of it. If you run
access to the rings simultaneously you may not be able to guarantee any
ordering or proper qos in contention for wire-resources (think strict
prio in hardware) - as long as you have the qdisc area.  You may
actually get away with it with something like DRR.
You could totally bypass the qdisc region and go to the driver directly
and let it worry about the scheduling, but you'd have to make the qdisc
area a passthrough while providing the illusion to user space that all
is as before.

  The
 biggest issue today with the multiqueue implementation is the global
 queue_lock.  I see it being a hot source of contention in my testing; my
 setup is a 8-core machine (dual quad-core procs) with a 10GbE NIC, using
 8 Tx and 8 Rx queues.  On transmit, when loading all 8 queues, the
 enqueue/dequeue are hitting that lock quite a bit for the whole device.

Yes, the queuelock is expensive; in your case if all 8 hardware threads
are contending for that one device, you will suffer. The txlock on the
other hand is not that expensive since the contention is for a max of 2
cpus (tx and rx softirq). 
I tried to use that fact in the batching to move things that i processed
under queue lock into the area for txlock. I'd be very interested in
some results on such a piece of hardware with the 10G nic to see if
these theories make any sense.

 I really think that the queue_lock should join the queue_state, so the
 device no longer manages the top-level state (since we're operating
 per-queue instead of per-device).

Refer to above.


 
 The multiqueue implementation today enforces the number of qdisc bands
 (RR or PRIO) to be equal to the number of Tx rings your hardware/driver
 is supporting.  Therefore, the queue_lock and queue_state in the kernel
 directly relate to the qdisc band management.  If the queue stops from
 the driver, then the qdisc won't try to dequeue from the band. 

Good start.

  What I'm
 working on is to move the lock there too, so I can lock the queue when I
 enqueue (protect the band from multiple sources modifying the skb
 chain), and lock it when I dequeue.  This is purely for concurrency of
 adding/popping skb's from the qdisc queues.  

Ok, so the concurrency aspect is what worries me. What i am saying is
that sooner or later you have to serialize (which is anti-concurrency).
For example, consider CPU0 running a high prio queue and CPU1 running
the low prio queue of the same netdevice.
Assume CPU0 is getting a lot of interrupts or other work while CPU1
doesn't (so as to create a condition where CPU1 is slower). Then as long
as there are packets and there is space on the driver's rings, CPU1 will
send more packets per unit time than CPU0.
This contradicts the strict prio scheduler, which says higher priority
packets ALWAYS go out first regardless of the presence of low prio
packets.  I am not sure i made sense.

cheers,
jamal



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Jeff Garzik

jamal wrote:

Ok, so the concurrency aspect is what worries me. What i am saying is
that sooner or later you have to serialize (which is anti-concurrency).
For example, consider CPU0 running a high prio queue and CPU1 running
the low prio queue of the same netdevice.
Assume CPU0 is getting a lot of interrupts or other work while CPU1
doesn't (so as to create a condition where CPU1 is slower). Then as long
as there are packets and there is space on the driver's rings, CPU1 will
send more packets per unit time than CPU0.
This contradicts the strict prio scheduler, which says higher priority
packets ALWAYS go out first regardless of the presence of low prio
packets.  I am not sure i made sense.


You made sense.  I think it is important to note simply that the packet 
scheduling algorithm itself will dictate the level of concurrency you 
can achieve.


Strict prio is fundamentally an interface to a big imaginary queue, with 
multiple packet insertion points (the individual bands/rings for each 
prio band).


If you assume a scheduler implementation where each prio band is mapped 
to a separate CPU, you can certainly see where some CPUs could be 
substantially idle while others are overloaded, largely depending on the 
data workload (and priority contained within).


Moreover, you increase L1/L2 cache traffic, not just because of locks, 
but because of data dependencies:


user      prio   packet      NIC TX ring
process   band   scheduler

cpu7      1      cpu1        1
cpu5      1      cpu1        1
cpu2      0      cpu0        0

At that point, it is probably more cache- and lock-friendly to keep the 
current TX softirq scheme.


In contrast, a pure round-robin approach is more friendly to concurrency.

Jeff




Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Jeff Garzik

jamal wrote:

The challenge to deal with is that netdevices, filters, the queues and
scheduler are closely inter-twined. So it is not just the scheduling
region and QDISC_RUNNING. For example, lets pick just the filters
because they are simple to see: You need to attach them to something -
whatever that is, you then need to synchronize against config and
multiple cpus trying to use them. You could: 
a) replicate them across cpus and only lock on config, but you are

wasting RAM then


I think you've pretty much bought into the cost of wasting RAM, when 
doing multiple TX rings.  So logic implies associated costs, like the 
ones you describe, come along for the ride.




b) attach them to rings instead of netdevices - but that makes me wonder
if those subqueues are now going to become netdevices. This also means
you change all user space interfaces to know about subqueues. If you
recall this was a major contention in our earlier discussion.


That's definitely a good question, and I honestly don't see any easy 
solutions.


Multiple net devices makes a -lot- of things easier, with regards to 
existing infrastructure, but it also imposes potentially annoying 
administrative burdens:  Not only must each interface be set up 
individually, but the userland apps must be made aware of this unique 
method of concurrency.


Jeff






Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread David Miller
From: Jeff Garzik [EMAIL PROTECTED]
Date: Mon, 08 Oct 2007 21:13:59 -0400

 If you assume a scheduler implementation where each prio band is mapped 
 to a separate CPU, you can certainly see where some CPUs could be 
 substantially idle while others are overloaded, largely depending on the 
 data workload (and priority contained within).

Right, which is why Peter added the prio DRR scheduler stuff for TX
multiqueue (see net/sched/sch_prio.c:rr_qdisc_ops) because this is
what the chips do.

But this doesn't get us to where we want to be as Peter has been
explaining a bit these past few days.

Ok, we're talking a lot but not pouring much concrete, let's start
doing that.  I propose:

1) A library for transmit load balancing functions, with an interface
   that can be made visible to userspace.  I can write this and test
   it on real multiqueue hardware.

   The whole purpose of this library is to set skb->queue_mapping
   based upon the load balancing function (a minimal sketch follows
   the proposal list below).

   Facilities will be added to handle virtualization port selection
   based upon destination MAC address as one of the load balancing
   methods.

2) Switch the default qdisc away from pfifo_fast to a new DRR fifo
   with load balancing using the code in #1.  I think this is kind
   of in the territory of what Peter said he is working on.

   I know this is controversial, but realistically I doubt users
   benefit at all from the prioritization that pfifo provides.  They
   will, on the other hand, benefit from TX queue load balancing on
   fast interfaces.

3) Work on discovering a way to make the locking on transmit as
   localized to the current thread of execution as possible.  Things
   like RCU and statistic replication, techniques we use widely
   elsewhere in the stack, begin to come to mind.
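
A minimal sketch of the kind of load-balancing helper meant in (1), hashing
the flow onto the available TX rings.  Function and parameter names are
invented; a real library would grow more methods, including the MAC-based
virtualization selection.

#include <linux/ip.h>
#include <linux/jhash.h>
#include <linux/skbuff.h>

static u16 tx_select_queue(const struct sk_buff *skb,
			   unsigned int num_tx_queues)
{
	const struct iphdr *iph = ip_hdr(skb);
	u32 h = jhash_2words((__force u32)iph->saddr,
			     (__force u32)iph->daddr, 0);

	return (u16)(h % num_tx_queues);
}

/* caller, before handing the skb to the qdisc:
 *	skb->queue_mapping = tx_select_queue(skb, nr_tx_rings);
 */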

I also want to point out another issue.  Any argument wrt. reordering
is specious at best because right now reordering from qdisc to device
happens anyways.

And that's because we drop the qdisc lock first, then we grab the
transmit lock on the device and submit the packet.  So, after we
drop the qdisc lock, another cpu can get the qdisc lock, get the
next packet (perhaps a lower priority one) and then sneak in to
get the device transmit lock before the first thread can, and
thus the packets will be submitted out of order.

This, along with other things, makes me believe that ordering really
doesn't matter in practice.  And therefore, in practice, we can treat
everything from the qdisc to the real hardware as a FIFO even if
something else is going on inside the black box which might reorder
packets on the wire.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Herbert Xu
On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:

 I also want to point out another issue.  Any argument wrt. reordering
 is specious at best because right now reordering from qdisc to device
 happens anyways.

This is not true.

If your device has a qdisc at all, then you will end up in the
function qdisc_restart, where we release the queue lock only
after acquiring the TX lock.

So right now this path does not create any reordering.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Herbert Xu
On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
 On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
 
  I also want to point out another issue.  Any argument wrt. reordering
  is specious at best because right now reordering from qdisc to device
  happens anyways.
 
 This is not true.
 
 If your device has a qdisc at all, then you will end up in the
 function qdisc_restart, where we release the queue lock only
 after acquiring the TX lock.
 
 So right now this path does not create any reordering.

Argh! Someone's just broken this.  I think we should restore
the original behaviour.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Herbert Xu
On Tue, Oct 09, 2007 at 10:03:18AM +0800, Herbert Xu wrote:
 On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
  On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
  
   I also want to point out another issue.  Any argument wrt. reordering
   is specious at best because right now reordering from qdisc to device
   happens anyways.
  
  This is not true.
  
  If your device has a qdisc at all, then you will end up in the
  function qdisc_restart, where we release the queue lock only
  after acquiring the TX lock.
  
  So right now this path does not create any reordering.
 
 Argh! Someone's just broken this.  I think we should restore
 the original behaviour.

Please revert

commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
Author: Jamal Hadi Salim [EMAIL PROTECTED]
Date:   Tue Sep 25 19:27:13 2007 -0700

[NET_SCHED]: explict hold dev tx lock

As this change introduces potential reordering and I don't think
we've discussed this aspect sufficiently.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Jeff Garzik

David Miller wrote:

1) A library for transmit load balancing functions, with an interface
   that can be made visible to userspace.  I can write this and test
   it on real multiqueue hardware.

   The whole purpose of this library is to set skb-queue_mapping
   based upon the load balancing function.

   Facilities will be added to handle virtualization port selection
   based upon destination MAC address as one of the load balancing
   methods.


Groovy.

I'm interested in working on a load balancer function that approximates

  skb->queue_mapping = smp_processor_id()

I'd be happy to code and test in that direction, based on your lib.



2) Switch the default qdisc away from pfifo_fast to a new DRR fifo
   with load balancing using the code in #1.  I think this is kind
   of in the territory of what Peter said he is working on.

   I know this is controversial, but realistically I doubt users
   benefit at all from the prioritization that pfifo provides.  They
   will, on the other hand, benefit from TX queue load balancing on
   fast interfaces.


IMO the net driver really should provide a hint as to what it wants.

8139cp and tg3 would probably prefer multiple TX queue behavior to match 
silicon behavior -- strict prio.


And I'll volunteer to write the net driver code for that, if people want 
to see how things would look for that type of hardware packet scheduling.




3) Work on discovering a way to make the locking on transmit as
   localized to the current thread of execution as possible.  Things
   like RCU and statistic replication, techniques we use widely
   elsewhere in the stack, begin to come to mind.


Definitely.

Jeff




Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 18:41 -0700, David Miller wrote:

 I also want to point out another issue.  Any argument wrt. reordering
 is specious at best because right now reordering from qdisc to device
 happens anyways.

 And that's because we drop the qdisc lock first, then we grab the
 transmit lock on the device and submit the packet.  So, after we
 drop the qdisc lock, another cpu can get the qdisc lock, get the
 next packet (perhaps a lower priority one) and then sneak in to
 get the device transmit lock before the first thread can, and
 thus the packets will be submitted out of order.
 

You forgot QDISC_RUNNING Dave ;-> the above can't happen.
Essentially at any one point in time, we are guaranteed that we can have
multiple cpus enqueueing but only one can be dequeueing (the one that
managed to grab QDISC_RUNNING), i.e. multiple producers to the qdisc queue
but only one consumer. Only the dequeuer has access to the txlock.
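
For reference, that guarantee in schematic form (paraphrased, not the literal
2.6.23 source): any CPU may enqueue under queue_lock, but only the CPU that
wins the QDISC_RUNNING bit becomes the dequeuer, and only that path ever
takes the TX lock.

#include <linux/netdevice.h>

static void enqueue_and_maybe_run(struct net_device *dev, struct Qdisc *q,
				  struct sk_buff *skb)
{
	spin_lock(&dev->queue_lock);
	q->enqueue(skb, q);		/* any number of producers get here */

	/* only one CPU wins this bit: the single consumer */
	if (!test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) {
		/* qdisc_restart() drops and retakes queue_lock around the
		 * actual transmit (see the diff quoted later in the thread);
		 * the loop condition here is paraphrased */
		while (!netif_queue_stopped(dev) && qdisc_restart(dev))
			;
		clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
	}
	spin_unlock(&dev->queue_lock);
}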
 
 This, along with other things, makes me believe that ordering really
 doesn't matter in practice.  And therefore, in practice, we can treat
 everything from the qdisc to the real hardware as a FIFO even if
 something else is going on inside the black box which might reorder
 packets on the wire.

I think it is important to get the scheduling right - estimations can be
a last resort. For example, if i have voip competing for the wire with
ftp on two different rings/cpus and i specified that voip should be more
important, i may consider the equipment faulty if it works most of the
time (when ftp is not clogging the wire) and at times i am asked to
repeat what i just said.

cheers,
jamal




Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote:

 Please revert
 
 commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
 Author: Jamal Hadi Salim [EMAIL PROTECTED]
 Date:   Tue Sep 25 19:27:13 2007 -0700
 
 [NET_SCHED]: explict hold dev tx lock
 
 As this change introduces potential reordering and I don't think
 we've discussed this aspect sufficiently.

How does it introduce reordering?

cheers,
jamal



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Herbert Xu
On Mon, Oct 08, 2007 at 10:14:30PM -0400, jamal wrote:

 You forgot QDISC_RUNNING Dave ;-> the above can't happen.
 Essentially at any one point in time, we are guaranteed that we can have
 multiple cpus enqueueing but only one can be dequeueing (the one that
 managed to grab QDISC_RUNNING), i.e. multiple producers to the qdisc queue
 but only one consumer. Only the dequeuer has access to the txlock.

Good point.  You had me worried for a sec :)

Dave, Jamal's patch is fine as it is and doesn't actually create
any packet reordering.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Herbert Xu
On Mon, Oct 08, 2007 at 10:15:49PM -0400, jamal wrote:
 On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote:
 
  Please revert
  
  commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
  Author: Jamal Hadi Salim [EMAIL PROTECTED]
  Date:   Tue Sep 25 19:27:13 2007 -0700
  
  [NET_SCHED]: explict hold dev tx lock
  
  As this change introduces potential reordering and I don't think
  we've discussed this aspect sufficiently.
 
 How does it introduce reordering?

No it doesn't.  I'd forgotten about the QDISC_RUNNING bit :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote:

 
 No it doesn't.  I'd forgotten about the QDISC_RUNNING bit :)

You should know better, you wrote it and i've been going insane trying to
break it for at least a year now ;->

cheers,
jamal



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Herbert Xu
On Mon, Oct 08, 2007 at 10:19:02PM -0400, jamal wrote:
 On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote:
 
  
  No it doesn't.  I'd forgotten about the QDISC_RUNNING bit :)
 
 You should know better, you wrote it and i've been going insane trying to
 break it for at least a year now ;->

Well you've broken me at least :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 10:03:18 +0800

 On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
  On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
  
   I also want to point out another issue.  Any argument wrt. reordering
   is specious at best because right now reordering from qdisc to device
   happens anyways.
  
  This is not true.
  
  If your device has a qdisc at all, then you will end up in the
  function qdisc_restart, where we release the queue lock only
  after acquiring the TX lock.
  
  So right now this path does not create any reordering.
 
 Argh! Someone's just broken this.  I think we should restore
 the original behaviour.

Right, that's Jamal's recent patch.  It looked funny to me too.

I think we can't make this change; the acquisition of the device
transmit lock before we release the qdisc lock is the only thing that
prevents reordering between qdisc and device.

Otherwise all of the prioritization is pretty much for nothing as
I described in another email today.

Jamal, I'm pretty sure we have to revert this, you can't change the
locking in this way.

commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
Author: Jamal Hadi Salim [EMAIL PROTECTED]
Date:   Tue Sep 25 19:27:13 2007 -0700

[NET_SCHED]: explict hold dev tx lock

For N cpus, with full throttle traffic on all N CPUs, funneling traffic
to the same ethernet device, the devices queue lock is contended by all
N CPUs constantly. The TX lock is only contended by a max of 2 CPUS.
In the current mode of operation, after all the work of entering the
dequeue region, we may endup aborting the path if we are unable to get
the tx lock and go back to contend for the queue lock. As N goes up,
this gets worse.

The changes in this patch result in a small increase in performance
with a 4CPU (2xdual-core) with no irq binding. Both e1000 and tg3
showed similar behavior;

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..95ae119 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
 	int ret;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
 		return 0;
 
-	/*
-	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
-	 * locks can be quite expensive. The driver can do a trylock, as
-	 * is being done here; in case of lock contention it should return
-	 * NETDEV_TX_LOCKED and the packet will be requeued.
-	 */
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-		/* Another CPU grabbed the driver tx lock */
-		return handle_dev_cpu_collision(skb, dev, q);
-	}
 
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
+	HARD_TX_LOCK(dev, smp_processor_id());
 	ret = dev_hard_start_xmit(skb, dev);
-
-	if (!lockless)
-		netif_tx_unlock(dev);
+	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread David Miller
From: Jeff Garzik [EMAIL PROTECTED]
Date: Mon, 08 Oct 2007 22:12:03 -0400

 I'm interested in working on a load balancer function that approximates
 
   skb->queue_mapping = smp_processor_id()
 
 I'd be happy to code and test in that direction, based on your lib.

It's the second algorithm that will be available :-)  Just add
a % num_tx_queues to the result.
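
With Dave's tweak folded in, that is about one line (a sketch; in real code
this would want preemption disabled, or raw_smp_processor_id()):

#include <linux/smp.h>
#include <linux/types.h>

static inline u16 tx_select_queue_cpu(unsigned int num_tx_queues)
{
	/* map the submitting CPU onto the available TX rings */
	return (u16)(smp_processor_id() % num_tx_queues);
}

/* e.g.:  skb->queue_mapping = tx_select_queue_cpu(nr_tx_rings); */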

 IMO the net driver really should provide a hint as to what it wants.
 
 8139cp and tg3 would probably prefer multiple TX queue behavior to match 
 silicon behavior -- strict prio.
 
 And I'll volunteer to write the net driver code for that, if people want 
 to see how things would look for that type of hardware packet scheduling.

Ok.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Herbert Xu
On Mon, Oct 08, 2007 at 07:43:43PM -0700, David Miller wrote:
 
 Right, that's Jamal's recent patch.  It looked funny to me too.

Hang on Dave.  It was too early in the morning for me :)

I'd forgotten about the QDISC_RUNNING bit which did what the
queue lock did without actually holding the queue lock.

So there is no reordering with or without Jamal's patch.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 10:16:20 +0800

 On Mon, Oct 08, 2007 at 10:14:30PM -0400, jamal wrote:
 
  You forgot QDISC_RUNNING Dave ;-> the above can't happen.
  Essentially at any one point in time, we are guaranteed that we can have
  multiple cpus enqueueing but only one can be dequeueing (the one that
  managed to grab QDISC_RUNNING), i.e. multiple producers to the qdisc queue
  but only one consumer. Only the dequeuer has access to the txlock.
 
 Good point.  You had me worried for a sec :)
 
 Dave, Jamal's patch is fine as it is and doesn't actually create
 any packet reordering.

Ok, then, I'll un-revert. :-)


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 10:04:42 +0800

 On Tue, Oct 09, 2007 at 10:03:18AM +0800, Herbert Xu wrote:
  On Tue, Oct 09, 2007 at 10:01:15AM +0800, Herbert Xu wrote:
   On Mon, Oct 08, 2007 at 06:41:26PM -0700, David Miller wrote:
   
I also want to point out another issue.  Any argument wrt. reordering
is specious at best because right now reordering from qdisc to device
happens anyways.
   
   This is not true.
   
   If your device has a qdisc at all, then you will end up in the
   function qdisc_restart, where we release the queue lock only
   after acquiring the TX lock.
   
   So right now this path does not create any reordering.
  
  Argh! Someone's just broken this.  I think we should restore
  the original behaviour.
 
 Please revert
 
 commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
 Author: Jamal Hadi Salim [EMAIL PROTECTED]
 Date:   Tue Sep 25 19:27:13 2007 -0700
 
 [NET_SCHED]: explict hold dev tx lock
 
 As this change introduces potential reordering and I don't think
 we've discussed this aspect sufficiently.

Agreed, and done.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Krishna Kumar2
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 06:47:24 PM:

 two, there should _never_ be any requeueing even if LLTX in the previous
 patches when i supported them; if there is, it is a bug. This is because
 we dont send more than what the driver asked for via xmit_win. So if it
 asked for more than it can handle, that is a bug. If its available space
 changes while we are sending to it, that too is a bug.

The driver might ask for 10 and we send 10, but an LLTX driver might fail
to get the lock and return TX_LOCKED. I haven't seen your code in greater
detail, but don't you requeue in that case too?

- KK
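
(For readers following along: the case KK means is roughly the return-value
handling below, paraphrased from how the stock qdisc_restart() of that era
reacts to the driver, not a verbatim quote.  An LLTX driver does its own
trylock inside ->hard_start_xmit() and can come back with NETDEV_TX_LOCKED,
at which point the core has little choice but to requeue; with a batch, that
would cover the whole remaining list.)

	ret = dev_hard_start_xmit(skb, dev);

	switch (ret) {
	case NETDEV_TX_OK:
		break;
	case NETDEV_TX_LOCKED:
		/* LLTX driver's internal trylock failed: put it back */
		ret = handle_dev_cpu_collision(skb, dev, q);
		break;
	default:
		/* e.g. the ring filled up underneath us: requeue as well */
		ret = dev_requeue_skb(skb, dev, q);
		break;
	}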



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-07 Thread Krishna Kumar2
 jamal wrote:

   +   while ((skb = __skb_dequeue(skbs)) != NULL)
  +  q->ops->requeue(skb, q);
 
 
  ->requeue queues at the head, so this looks like it would reverse
  the order of the skbs.

 Excellent catch!  thanks; i will fix.

 As a side note: Any batching driver should _never_ have to requeue; if
 it does it is buggy. And the non-batching ones if they ever requeue will
 be a single packet, so not much reordering.

On the contrary, batching LLTX drivers (if that is not ruled out) will very
often requeue resulting in heavy reordering. Fix looks good though.

- KK



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-03 Thread jamal
On Wed, 2007-03-10 at 01:29 -0400, Bill Fink wrote:

 It does sound sensible.  My own decidedly non-expert speculation
 was that the big 30 % performance hit right at 4 KB may be related
 to memory allocation issues or having to split the skb across
 multiple 4 KB pages.  

plausible. But i also worry it could be 10 other things; example, could
it be the driver used? I noted in my udp test the oddity that turned out
to be tx coal parameter related.
In any case, I will attempt to run those tests later.

 And perhaps it only affected the single
 process case because with multiple processes lock contention may
 be a bigger issue and the xmit batching changes would presumably
 help with that.  I am admittedly a novice when it comes to the
 detailed internals of TCP/skb processing, although I have been
 slowly slogging my way through parts of the TCP kernel code to
 try and get a better understanding, so I don't know if these
 thoughts have any merit.

You do bring up issues that need to be looked into and i will run those
tests.
Note, the effectiveness of batching becomes evident as the number of
flows grows. Actually, scratch that: it becomes evident if you can keep
the tx path busied out, which multiple running users contribute to. If i
can have a user per CPU with lots of traffic to send, i can create that
condition. It's a little boring in the scenario where the bottleneck is
the wire but it needs to be checked.

 BTW does anyone know of a good book they would recommend that has
 substantial coverage of the Linux kernel TCP code, that's fairly
 up-to-date and gives both an overall view of the code and packet
 flow as well as details on individual functions and algorithms,
 and hopefully covers basic issues like locking and synchronization,
 concurrency of different parts of the stack, and memory allocation.
 I have several books already on Linux kernel and networking internals,
 but they seem to only cover the IP (and perhaps UDP) portions of the
 network stack, and none have more than a cursory reference to TCP.  
 The most useful documentation on the Linux TCP stack that I have
 found thus far is some of Dave Miller's excellent web pages and
 a few other web references, but overall it seems fairly skimpy
 for such an important part of the Linux network code.

Reading books or magazines may end up busying you out with some small
gains of knowledge at the end. They tend to be outdated fast. My advice
is to start with a focus on one thing, watch the patches that fly
around in that area and learn that way. Read the code to further
understand things, then ask questions when it's not clear. Other folks may
have different views. The other way to do it is pick yourself some task
to either add or improve something and get your hands dirty that way. 

 It would be good to see some empirical evidence that there aren't
 any unforeseen gotchas for larger packet sizes, that at least the
 same level of performance can be obtained with no greater CPU
 utilization.

Reasonable - I will try with 9K after i move over to the new tree from
Dave and make sure nothing else broke in the previous tests.
And when all looks good, i will move to TCP.


  [1] On average i spend 10x more time performance testing and analysing
  results than writting code.
 
 As you have written previously, and I heartily agree with, this is a
 very good practice for developing performance enhancement patches.

To give you a perspective, the results i posted were each run 10
iterations per packet size per kernel. Each run is 60 seconds long. I
think i am past that stage for resolving or fixing anything for UDP or
pktgen, but i need to keep checking for any new regressions when Dave
updates his tree. Now multiply that by 5 packet sizes (I am going to add
2 more) and multiply that by 3-4 kernels. Then add the time it takes to
sift through the data and collect it, then analyze it and go back to the
drawing board when something doesn't look right.  Essentially, it needs a
weekend ;->

cheers,
jamal



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-02 Thread Bill Fink
On Tue, 02 Oct 2007, jamal wrote:

 On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:
 
  One reason I ask, is that on an earlier set of alternative batching
  xmit patches by Krishna Kumar, his performance testing showed a 30 %
  performance hit for TCP for a single process and a size of 4 KB, and
  a performance hit of 5 % for a single process and a size of 16 KB
  (a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
  time to inquire further about it, but it would be a major potential
  concern for me in my 10-GigE network testing with 9000-byte jumbo
  frames.  Of course the single process and 4 KB or larger size was
  the only case that showed a significant performance hit in Krishna
  Kumar's latest reported test results, so it might be acceptable to
  just have a switch to disable the batching feature for that specific
  usage scenario.  So it would be useful to know if your xmit batching
  changes would have similar issues.
 
 There were many times while testing that i noticed inconsistencies and
 in each case when i analysed[1], i found it to be due to some variable
 other than batching which needed some resolving, always via some
 parametrization or other. I suspect what KK posted is in the same class.
 To give you an example, with UDP, batching was giving worse results at
 around 256B compared to 64B or 512B; investigating i found that the
 receiver just wasnt able to keep up and the udp layer dropped a lot of
 packets so both iperf and netperf reported bad numbers. Fixing the
 receiver ended up with consistency coming back. On why 256B was the one
 that overwhelmed the receiver more than 64B(which sent more pps)? On
 some limited investigation, it seemed to me to be the effect of the
 choice of the tg3 driver's default tx mitigation parameters as well tx
 ring size; which is something i plan to revisit (but neutralizing it
 helps me focus on just batching). In the end i dropped both netperf and
 iperf for similar reasons and wrote my own app. What i am trying to
 achieve is demonstrate if batching is a GoodThing. In experimentation
 like this, it is extremely valuable to reduce the variables. Batching
 may expose other orthogonal issues - those need to be resolved or fixed
 as they are found. I hope that sounds sensible.

It does sound sensible.  My own decidedly non-expert speculation
was that the big 30 % performance hit right at 4 KB may be related
to memory allocation issues or having to split the skb across
multiple 4 KB pages.  And perhaps it only affected the single
process case because with multiple processes lock contention may
be a bigger issue and the xmit batching changes would presumably
help with that.  I am admittedly a novice when it comes to the
detailed internals of TCP/skb processing, although I have been
slowly slogging my way through parts of the TCP kernel code to
try and get a better understanding, so I don't know if these
thoughts have any merit.

BTW does anyone know of a good book they would recommend that has
substantial coverage of the Linux kernel TCP code, that's fairly
up-to-date and gives both an overall view of the code and packet
flow as well as details on individual functions and algorithms,
and hopefully covers basic issues like locking and synchronization,
concurrency of different parts of the stack, and memory allocation.
I have several books already on Linux kernel and networking internals,
but they seem to only cover the IP (and perhaps UDP) portions of the
network stack, and none have more than a cursory reference to TCP.  
The most useful documentation on the Linux TCP stack that I have
found thus far is some of Dave Miller's excellent web pages and
a few other web references, but overall it seems fairly skimpy
for such an important part of the Linux network code.

 Back to the =9K packet size you raise above:
 I dont have a 10Gige card so iam theorizing. Given that theres an
 observed benefit to batching for a saturated link with smaller packets
 (in my results small is anything below 256B which maps to about
 380Kpps anything above that seems to approach wire speed and the link is
 the bottleneck); then i theorize that 10Gige with 9K jumbo frames if
 already achieving wire rate, should continue to do so. And sizes below
 that will see improvements if they were not already hitting wire rate.
 So i would say that with 10G NICS, there will be more observed
 improvements with batching with apps that do bulk transfers (assuming
 those apps are not seeing wire speed already). Note that this hasnt been
 quiet the case even with TSO given the bottlenecks in the Linux
 receivers that J Heffner put nicely in a response to some results you
 posted - but that exposes an issue with Linux receivers rather than TSO.

It would be good to see some empirical evidence that there aren't
any unforeseen gotchas for larger packet sizes, that at least the
same level of performance can be obtained with no greater CPU
utilization.

  Also for your xmit 

Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread Patrick McHardy
jamal wrote:
 +static inline int
 +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
 +		 struct Qdisc *q)
 +{
 +
 +	struct sk_buff *skb;
 +
 +	while ((skb = __skb_dequeue(skbs)) != NULL)
 +		q->ops->requeue(skb, q);


->requeue queues at the head, so this looks like it would reverse
the order of the skbs.
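
One way to fix it while keeping ->requeue() (illustrative only; the actual
fix is jamal's to pick) is to walk the batch from the tail, so that
head-insertion rebuilds the original order:

static inline int
dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
		 struct Qdisc *q)
{
	struct sk_buff *skb;

	/* take from the tail so head-insertion preserves the order */
	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
		q->ops->requeue(skb, q);

	return 0;
}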




Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread jamal
On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote:
 jamal wrote:

  +   while ((skb = __skb_dequeue(skbs)) != NULL)
  +   q->ops->requeue(skb, q);
 
 
  ->requeue queues at the head, so this looks like it would reverse
 the order of the skbs.

Excellent catch!  thanks; i will fix.

As a side note: Any batching driver should _never_ have to requeue; if
it does it is buggy. And the non-batching ones if they ever requeue will
be a single packet, so not much reordering.

Thanks again Patrick.

cheers,
jamal



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread jamal
On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:

 Have you done performance comparisons for the case of using 9000-byte
 jumbo frames?

I havent, but will try if any of the gige cards i have support it.

As a side note: I have not seen any useful gains or losses as the packet
size approaches even 1500B MTU. For example, post about 256B neither the
batching nor the non-batching give much difference in either throughput
or cpu use. Below 256B, theres a noticeable gain for batching.
Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and
so the occupancy of both the qdisc queue(s) and ethernet ring is
constantly high. For example at 512B, the app is 80% idle on all 4 CPUs
and we are hitting in the range of wire speed. We are at 90% idle at
1024B. This is the case with or without batching.  So my suspicion is
that with that trend a 9000B packet will just follow the same pattern.


cheers,
jamal



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread Bill Fink
On Mon, 01 Oct 2007, jamal wrote:

 On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:
 
  Have you done performance comparisons for the case of using 9000-byte
  jumbo frames?
 
 I havent, but will try if any of the gige cards i have support it.
 
 As a side note: I have not seen any useful gains or losses as the packet
 size approaches even 1500B MTU. For example, post about 256B neither the
 batching nor the non-batching give much difference in either throughput
 or cpu use. Below 256B, theres a noticeable gain for batching.
 Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and
 so the occupancy of both the qdisc queue(s) and ethernet ring is
 constantly high. For example at 512B, the app is 80% idle on all 4 CPUs
 and we are hitting in the range of wire speed. We are at 90% idle at
 1024B. This is the case with or without batching.  So my suspicion is
 that with that trend a 9000B packet will just follow the same pattern.

One reason I ask, is that on an earlier set of alternative batching
xmit patches by Krishna Kumar, his performance testing showed a 30 %
performance hit for TCP for a single process and a size of 4 KB, and
a performance hit of 5 % for a single process and a size of 16 KB
(a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
time to inquire further about it, but it would be a major potential
concern for me in my 10-GigE network testing with 9000-byte jumbo
frames.  Of course the single process and 4 KB or larger size was
the only case that showed a significant performance hit in Krishna
Kumar's latest reported test results, so it might be acceptable to
just have a switch to disable the batching feature for that specific
usage scenario.  So it would be useful to know if your xmit batching
changes would have similar issues.

Also for your xmit batching changes, I think it would be good to see
performance comparisons for TCP and IP forwarding in addition to your
UDP pktgen tests, including various packet sizes up to and including
9000-byte jumbo frames.

-Bill


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-09-30 Thread Bill Fink
On Sun, 30 Sep 2007, jamal wrote:

 This patch adds the usage of batching within the core.
 
 cheers,
 jamal



 [sep30-p2of3  text/plain (6.8KB)]
 [NET_BATCH] net core use batching
 
 This patch adds the usage of batching within the core.
 The same test methodology used in introducing txlock is used, with
 the following results on different kernels:
 
 ++--+-+++
 |   64B  |  128B| 256B| 512B   |1024B   |
 ++--+-+++
 Original| 467482 | 463061   | 388267  | 216308 | 114704 |
 ||  | |||
 txlock  | 468922 | 464060   | 388298  | 216316 | 114709 |
 ||  | |||
 tg3nobtx| 468012 | 464079   | 388293  | 216314 | 114704 |
 ||  | |||
 tg3btxdr| 480794 | 475102   | 388298  | 216316 | 114705 |
 ||  | |||
 tg3btxco| 481059 | 475423   | 388285  | 216308 | 114706 |
 ++--+-+++
 
 The first two columns, Original and txlock, were introduced in an earlier
 patch and demonstrate a slight increase in performance with txlock.
 tg3nobtx shows the tg3 driver with no changes to support batching.
 The purpose of this test is to demonstrate the effect of introducing
 the core changes to a driver that doesn't support them.
 Although this patch brings down performance slightly compared to txlock
 for such netdevices, it is still better compared to just the original
 kernel.
 tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with the
 tg3 driver. tg3btxco demonstrates the effect of letting the core do all
 the work. As can be seen the last two are not very different in
 performance. The difference is that ->hard_batch_xmit() introduces a new
 method, which is intrusive.

Have you done performance comparisons for the case of using 9000-byte
jumbo frames?

-Bill