Re: SG_IO with 4k buffer size to iscsi sg device causes Bad page panic

2007-06-08 Thread Herbert Xu
Please don't drop CCs.

Qi, Yanling [EMAIL PROTECTED] wrote:

 Qi, Yanling [EMAIL PROTECTED] wrote:
  @@ -2571,6 +2572,13 @@ sg_page_malloc(int rqSz, int lowDma, int
 resp = (char *) __get_free_pages(page_mask, order);
  /* try half */
 resSz = a_size;
 }
  +   tmppage = virt_to_page(resp);
  +   for (m = PAGE_SIZE; m < resSz; m += PAGE_SIZE)
  +   {
  +   tmppage++;
  +   SetPageReserved(tmppage);
  +   }
  +
 
 [Qi, Yanling]
 If I do a get_page() at sg_page_malloc() time and then do a put_page()
 at sg_page_free() time, I worry about a race condition that the page
 gets re-used before calling free_pages().

Could you explain what is going to cause this page to be reused if it
has a non-zero reference count?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Multicast and hardware checksum

2007-06-08 Thread Herbert Xu
Baruch Even [EMAIL PROTECTED] wrote:
 
 I have a machine on which I have an application that sends multicast 
 through eth interface with hardware tx checksum enabled. On the same 
 machine I have mrouted running that routes the multicast traffic to a 
 set of ppp interfaces. The packets that are received by the client have 
 their checksum fixed on some number which is incorrect. If I disable tx 
 checksum on the eth device the packets arrive with the proper checksum.

Where is the client? On the same machine or behind a PPP link?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: 2.6.22-rc4-mm2 -- ipw2200 -- SIOCSIFADDR: No buffer space available

2007-06-08 Thread Miles Lane

On 6/7/07, Björn Steinbrink [EMAIL PROTECTED] wrote:
[...]

Miles, could you try if this patch helps?

Björn


Stop destroying devices when all of their ifas are gone, as we no longer
recreate them when ifas are added.

Signed-off-by: Björn Steinbrink [EMAIL PROTECTED]
--
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index fa97b96..abf6352 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -327,12 +327,8 @@ static void __inet_del_ifa(struct in_device *in_dev, 
struct in_ifaddr **ifap,
}

}
-   if (destroy) {
+   if (destroy)
inet_free_ifa(ifa1);
-
-   if (!in_dev->ifa_list)
-   inetdev_destroy(in_dev);
-   }
 }

 static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,



Björn,

Thanks.  Your patch worked for me.

Miles


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
 On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
   What about checking if the page belongs to the kmalloc cache (or any
   other cache, via private pointers) and not performing any kind of
   reference counting on them? I will play with this a bit later today.
  
  That might work, but sounds a little dirty... But there's probably no
  way around. Be sure to look at the #splice-net branch if you are playing
  with this, I've updated it a number of times and fixed some bugs in
  there. Notably it now gets the offset right, and handles fragments and
  fraglist as well.
 
 I've pulled splice-net, which indeed fixed some issues, but referencing
 slab pages is still not allowed. There are at least two problems
 (although they are related):
 1. if we do not increment the reference counter for slab pages, they
 eventually get refilled and the slab explodes after it notices that its
 pages are in use (or the user dies when the page is moved out of its
 control by the slab).
 2. get/put page does not work with slab pages, and a simple
 increment/decrement of the reference counters is not allowed either.
 
 Both problems have the same root - the slab does not allow anyone to
 manipulate a page's members. That would have to be broken/changed to let
 splice get its hands into the network in the fastest way.
 I will think about it.

Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent reuse
that way? Then we don't have to care what backing the skb has, as long
as it (and its data) isn't being reused until we drop the reference to
it in sock_pipe_buf_release().
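
As a rough sketch of the idea (illustrative only; it assumes the pipe_buffer
grows a private field to carry the skb pointer, which is what the patches
later in this thread experiment with), the release hook would simply drop an
skb reference taken at splice time:

/* The pipe buffer keeps the skb - and therefore its data - alive; the
 * reference taken when the skb was spliced is dropped only here. */
static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
				  struct pipe_buffer *buf)
{
	struct sk_buff *skb = (struct sk_buff *) buf->private;

	kfree_skb(skb);
}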

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-08 Thread David Miller
From: Jens Axboe [EMAIL PROTECTED]
Date: Fri, 8 Jun 2007 09:48:24 +0200

 Perhaps it's possible to solve this at a different level - can we hang
 on to the skb until the pipe buffer has been consumed, and prevent reuse
 that way? Then we don't have to care what backing the skb has, as long
 as it (and its data) isn't being reused until we drop the reference to
 it in sock_pipe_buf_release().

Depending upon whether the pipe buffer consumption is bounded or not,
this will jam up the TCP sender because the SKB data allocation is
charged against the socket send buffer allocation.


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Fri, 8 Jun 2007 09:48:24 +0200
 
  Perhaps it's possible to solve this at a different level - can we hang
  on to the skb until the pipe buffer has been consumed, and prevent reuse
  that way? Then we don't have to care what backing the skb has, as long
  as it (and its data) isn't being reused until we drop the reference to
  it in sock_pipe_buf_release().
 
 Depending upon whether the pipe buffer consumption is bounded of not,
 this will jam up the TCP sender because the SKB data allocation is
 charged against the socket send buffer allocation.

Forgive my network ignorance, but is that a problem? Since you bring it
up, I guess so :-)

We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Fri, Jun 08 2007, David Miller wrote:
   From: Jens Axboe [EMAIL PROTECTED]
   Date: Fri, 8 Jun 2007 09:48:24 +0200
   
Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent reuse
that way? Then we don't have to care what backing the skb has, as long
as it (and its data) isn't being reused until we drop the reference to
it in sock_pipe_buf_release().
   
   Depending upon whether the pipe buffer consumption is bounded of not,
   this will jam up the TCP sender because the SKB data allocation is
   charged against the socket send buffer allocation.
  
  Forgive my network ignorance, but is that a problem? Since you bring it
  up, I guess so :-)
 
 David means, that socket bufer allocation is limited, and delaying
 freeing can end up with exhausint that limit.

OK, so a delayed empty of the pipe could end up causing packet drops
elsewhere due to allocation exhaustion?

  We can grow the pipe, should we have to. So instead of blocking waiting
  on reader consumption, we can extend the size of the pipe and keep
  going.
 
 I have a code, which roughly works (but I will test it some more), which
 just introduces reference counters for slab pages, so that the would not
 be actually freed via page reclaim, but only after reference counters
 are dropped. That forced changes in mm/slab.c so likely it is
 unacceptible solution, but it is interesting as is.

Hmm, still seems like it's working around the problem. We essentially
just need to ensure that the data doesn't get _reused_, not just freed.
It doesn't help holding a reference to the page, if someone else just
reuses it and fills it with other data before it has been consumed and
released by the pipe buffer operations.

That's why I thought the skb referencing was the better idea, then we
don't have to care about the backing of the skb either. Provided that
preventing the free of the skb before the pipe buffer has been consumed
guarantees that the contents aren't reused.

-- 
Jens Axboe



Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread Evgeniy Polyakov
On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal ([EMAIL PROTECTED]) wrote:
 On Thu, 2007-07-06 at 20:13 +0400, Evgeniy Polyakov wrote:
 
  Actually I wonder where the devil lives, but I do not see how that
  patchset can improve sending situation.
  Let me clarify: there are two possibilities to send data:
  1. via batched sending, which runs via queue of packets and performs
  prepare call (which only setups some private flags, no work with 
  hardware) and then sending call.
 
 I believe both are called with no lock. The idea is to avoid the lock
 entirely when unneeded. That code may end up finding that the packet
 is bogus and throw it out when it deems it useless.
 If you followed the discussions on multi-ring, this call is where
 i suggested to select the tx ring as well.

Hmm...

+   netif_tx_lock_bh(odev);
+   if (!netif_queue_stopped(odev)) {
+
+   idle_start = getCurUs();
+   pkt_dev->tx_entered++;
+   ret = odev->hard_batch_xmit(odev->blist, odev);


+   if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+   /* Collision - tell upper layer to requeue */
+   return NETDEV_TX_LOCKED;
+   }
+
+   while ((skb = __skb_dequeue(list)) != NULL) {
+#ifdef coredoesnoprep
+   ret = netdev->hard_prep_xmit(skb, netdev);
+   if (ret != NETDEV_TX_OK)
+   continue;
+#endif
+
+   /* XXX: This may be an opportunity to not give nit
+    * the packet if the dev is TX BUSY ;-> */
+   dev_do_xmit_nit(skb, netdev);
+   ret = e1000_queue_frame(skb, netdev);

The same applies to *_gso case.

  2. old xmit function (which seems to be unused by kernel now?)
  
 
 You can change that by turning off _BTX feature in the driver.
 For WIP reasons it is on at the moment.
 
  Btw, prep_queue_frame seems to always be called under tx_lock, but the
  old e1000 xmit function calls it without the lock.
 
 I think both call it without lock.

Without lock that would be wrong - it accesses hardware.

  Locked case is correct,
  since it accesses private registers via e1000_transfer_dhcp_info() for
  some adapters.
 
 I am unsure about the value of that lock (refer to email to Auke). There
 is only one CPU that can enter the tx path and the contention is
 minimal.
 
  So, essentially batched sending is 
  lock
  while ((skb = dequeue))
send
  unlock
  
  where queue of skbs are prepared by stack using the same transmit lock.
  
  Where is a gain?
 
 The amortizing of the lock on tx is where the value is.
 Did you see the numbers Evgeniy? ;-
 Heres one i can vouch for, on a dual processor 2GHz box that i tested
 with pktgen:

I only saw the results Krishna posted, and i also do not know what
service demand is :)

 
 1) Original e1000 driver (no batching):
 a) We got an xmit throughput of 362Kpackets/second with the default
 setup (everything falls on cpu#0).
 b) With tying to CPU#1, i saw 401Kpps.
 
 2) Repeated the tests with batching patches (as in this commit)
 And got an outstanding 694Kpps throughput.
 
 5) Repeated #4 with binding to cpu #1.
 And throughput didnt improve that much - was hitting 697Kpps
 I think we are pretty much hitting upper limits here
 ...
 
 
 I am actually testing as we speak on faster hardware - I will post
 results shortly.

The result looks good, but I still do not understand how it comes about;
that is why I'm not that excited about the idea - I just do not know it
in detail.

-- 
Evgeniy Polyakov


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 On Fri, Jun 08 2007, David Miller wrote:
  From: Jens Axboe [EMAIL PROTECTED]
  Date: Fri, 8 Jun 2007 09:48:24 +0200
  
   Perhaps it's possible to solve this at a different level - can we hang
   on to the skb until the pipe buffer has been consumed, and prevent reuse
   that way? Then we don't have to care what backing the skb has, as long
   as it (and its data) isn't being reused until we drop the reference to
   it in sock_pipe_buf_release().
  
  Depending upon whether the pipe buffer consumption is bounded of not,
  this will jam up the TCP sender because the SKB data allocation is
  charged against the socket send buffer allocation.
 
 Forgive my network ignorance, but is that a problem? Since you bring it
 up, I guess so :-)

David means that socket buffer allocation is limited, and delaying
freeing can end up exhausting that limit.

 We can grow the pipe, should we have to. So instead of blocking waiting
 on reader consumption, we can extend the size of the pipe and keep
 going.

I have code which roughly works (but I will test it some more), which
just introduces reference counters for slab pages, so that they would not
actually be freed via page reclaim, but only after the reference counters
are dropped. That forced changes in mm/slab.c, so it is likely an
unacceptable solution, but it is interesting as is.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov


Re: [PATCH] NET: Multiqueue network device support.

2007-06-08 Thread Herbert Xu
On Thu, Jun 07, 2007 at 09:35:36PM -0400, jamal wrote:
 On Thu, 2007-07-06 at 17:31 -0700, Sridhar Samudrala wrote:
 
  If the QDISC_RUNNING flag guarantees that only one CPU can call
   dev->hard_start_xmit(), then why do we need to hold netif_tx_lock
  for non-LLTX drivers?
 
 I havent stared at other drivers, but for e1000 it seems to me that
 even if you got rid of LLTX, netif_tx_lock is unnecessary.
 Herbert?

It would guard against the poll routine which would acquire this lock
when cleaning the TX ring.
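
A minimal sketch of what that serialization would look like (purely
illustrative; the descriptor-reclaim helper below is hypothetical, not
e1000 code):

/* With LLTX gone, the poll/TX-clean path takes the same netif_tx_lock
 * that hard_start_xmit() runs under, so the two cannot race on the ring. */
static void clean_tx_ring_sketch(struct net_device *dev)
{
	netif_tx_lock(dev);
	while (reclaim_done_tx_descriptor(dev))	/* hypothetical helper */
		;
	if (netif_queue_stopped(dev))
		netif_wake_queue(dev);
	netif_tx_unlock(dev);
}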

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Multicast and hardware checksum

2007-06-08 Thread Baruch Even

Herbert Xu wrote:

On Fri, Jun 08, 2007 at 02:02:27PM +0300, Baruch Even wrote:
As far as IGMP and multicast handling everything works, the packets are 
even forwarded over the ppp links but they arrive to the client with a 
bad checksum. I don't have the trace in front of me but I believe it was 
the UDP checksum that failed.


What kind of a ppp device is this?

If you run a tcpdump either side of the ppp link do you see the same
UDP checksum value?


This is a pptp link. I've checked the checksum on the receive side, I 
don't know on the sender side and I'll only be able to try it on Sunday.


Baruch


Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread jamal
KK,
On Fri, 2007-08-06 at 10:36 +0530, Krishna Kumar2 wrote:

 I will try that. Also on the receiver, I am using unmodified 2.6.21 bits.

That should be fine as long as the sender is running the patched
2.6.22-rc4

 My earlier experiments showed that even small buffers were filling the
 E1000 slots and resulting in a stopped queue very often. In any case, I
 will also add 1 or 2 larger packet sizes (1K and 16K, in addition to the
 4K already there).

Thats interesting - it is possible there is transient burstiness which
fills up the ring.
My observation of your results (hence my comments): for example, with
buffer size = 8B and 1 TCP process you achieve less than 70M.  That is
less than 100Kpps on average being sent out. Very, very tiny - so it is
interesting that it is causing a shutdown.
Also note something else: it is kind of strange that something like UDP,
which doesnt back off, will send out fewer packets/second ;-

I could put a little hack in the e1000 driver to find the exact number
of times per run it was shut down.

BTW, another interesting things to do is ensure that several netperfs
are running on different CPUs.


 I was planning to submit my changes on top of this patch, and since it
 includes a configuration option per device, it will be easy to test with
 and without this API.

fantastic.

 When I ran after setting this config option to 0, the results
 were almost identical to the original code. I will try to post that today for
 your review/comments.

no problem.

  Sorry, been many moons since i last played with netperf; what does
 service
  demand mean?
 
 It gives an indication of the amount of CPU cycles to send out a particular
 amount of data. Netperf provides it as us/KB. I don't know the internals of
 netperf enough to say how this is calculated.

I am hoping Rick would comment.

cheers,
jamal



Re: Multicast and hardware checksum

2007-06-08 Thread Herbert Xu
On Fri, Jun 08, 2007 at 02:02:27PM +0300, Baruch Even wrote:
 
 As far as IGMP and multicast handling everything works, the packets are 
 even forwarded over the ppp links but they arrive to the client with a 
 bad checksum. I don't have the trace in front of me but I believe it was 
 the UDP checksum that failed.

What kind of a ppp device is this?

If you run a tcpdump either side of the ppp link do you see the same
UDP checksum value?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: Multicast and hardware checksum

2007-06-08 Thread Baruch Even

Herbert Xu wrote:

Baruch Even [EMAIL PROTECTED] wrote:
I have a machine on which I have an applications that sends multicast 
through eth interface with hardware tx checksum enabled. On the same 
machine I have mrouted running that routes the multicast traffic to a 
set of ppp interfaces. The packets that are received by the client have 
their checksum fixed on some number which is incorrect. If I disable tx 
checksum on the eth device the packets arrive with the proper checksum.


Where is the client? On the same machine or behind a PPP link?


The clients are behind the ppp links.

As far as IGMP and multicast handling everything works, the packets are 
even forwarded over the ppp links but they arrive to the client with a 
bad checksum. I don't have the trace in front of me but I believe it was 
the UDP checksum that failed.


Baruch




Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread Krishna Kumar2
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 06/08/2007 04:44:06 PM:

 That should be fine as long as the sender is running the patched
 2.6.22-rc4

Definitely :)

 Thats interesting - it is possible there is transient burstiness which
 fills up the ring.
 My observation of your results (hence my comments): for example the
 buffer size = 8B, TCP 1 process you achieve less than 70M.  That is less
 than 100Kpps on average being sent out. Very very tiny - so it is
 interesting that it is causing a shutdown.

I thought it comes to 1.147Mpps, or did I calculate wrong
(70*1024*1024/8/8) ?

 Also note something else strange that it is kind of strange that
 something like UDP which doesnt backoff will send out less
 packets/second ;-

Cannot explain that either :)

 BTW, another interesting things to do is ensure that several netperfs
 are running on different CPUs.

My script was doing that earlier, I trimmed all that to make it easier
to understand. Will post the larger version later.

 no problem.

Thanks, please let me know what you think of the patch I sent earlier.

I am running a larger 5-iteration run with buffer sizes: 8, 32, 128, 512,
1K, 4K, 16K.
It is going to run for around 12 hours, and since I am moving house during
the weekend, I will be able to look at the results only on Monday.

Regards,

- KK



Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread jamal
KK,

On Fri, 2007-08-06 at 17:01 +0530, Krishna Kumar2 wrote:


 I thought it comes to 1.147Mpps, or did I calculate wrong
 (70*1024*1024/8/8) ?

I assumed 8B to mean data that is on top of TCP/UDP?
If so then in the case of UDP we have 8B UDP header, 20B IP and 14B
ethernet < the 64B minimal allowed Ethernet packet; so it gets padded and
goes out as 64B.
There are, as you state above, 1.147M (or is it 1.48M?) such packets/sec
in 1Gbps.
So (70Mbps/1000Mbps)*1.147M is the rough number i was referring to.
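
For reference, the standard ceiling works out as follows: a minimum-size
frame occupies 64B + 8B preamble + 12B inter-frame gap = 84B of wire time
per packet, so 1 Gbit/s / (84B * 8 bits/B) is roughly 1.488 Mpps; scaled by
70/1000 that gives roughly 100Kpps, in the same ballpark as the "less than
100Kpps" figure above.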


 My script was doing that earlier, I trimmed all that to make it easier
 to understand. Will post the larger version later.

That will be nice because remember we can have multiple CPU packet
producers but only one CPU consumer.

  no problem.
 
 Thanks, please let me know what you think of the patch I sent earlier.

I havent seen a patch. Can you resend it?

 I am running a larger 5 iteration run with buffer sizes :8,32,128,512,1
 K,4K,16K.
 It is going to run for around 12 hours and since I am moving house during
 the
 weekend, I will be able to look at the results only on Monday.
 

sounds good.

cheers,
jamal



Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread jamal
On Fri, 2007-08-06 at 12:38 +0400, Evgeniy Polyakov wrote:
 On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal ([EMAIL PROTECTED]) wrote:

  I believe both are called with no lock. The idea is to avoid the lock
  entirely when unneeded. That code may end up finding that the packet
[..]
 + netif_tx_lock_bh(odev);
 + if (!netif_queue_stopped(odev)) {
 +
 + idle_start = getCurUs();
 + pkt_dev-tx_entered++;
 + ret = odev-hard_batch_xmit(odev-blist, odev);

[..]
 The same applies to *_gso case.
 

You missed an important piece which is grabbing of
__LINK_STATE_QDISC_RUNNING


 Without lock that would be wrong - it accesses hardware.

We are achieving the goal of only a single CPU entering that path. Are
you saying that is not good enough?

 I only saw results Krishna posted, 

Ok, sorry - i thought you saw the git log or earlier results where
other things are captured.

 and i also do not know, what service demand is :)

From the explanation it seems to be how much cpu was used while sending.
Do you have any suggestions for computing cpu use?
In pktgen i added code to count how many microsecs were used in
transmitting.

 Result looks good, but I still do not understand how it appeared, that
 is why I'm not that excited about idea - I just do not know it in
 details.

To add to KKs explanation in the other email:
Essentially the value is in amortizing the cost of barriers and IO per
packet. For example, the queue lock is held/released only once per X
packets. DMA kicking, which includes both a PCI IO write and mb()s, is
done only once per X packets. There is still a lot of room for
improvement of such IO;
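
As a bare-bones sketch of that amortization (illustrative only; the ring
structure and descriptor helper below are hypothetical, not the actual
e1000 batching code):

struct hypo_tx_ring {                       /* hypothetical, for illustration */
	spinlock_t tx_lock;
	void __iomem *tail_reg;             /* device tail-pointer register */
	unsigned int next_to_use;
};

/* One lock round-trip, one wmb() and one doorbell write per batch of
 * skbs, instead of one of each per packet. */
static int batch_xmit_sketch(struct sk_buff_head *list, struct hypo_tx_ring *ring)
{
	struct sk_buff *skb;

	spin_lock(&ring->tx_lock);                  /* once per batch */
	while ((skb = __skb_dequeue(list)) != NULL)
		fill_tx_descriptor(ring, skb);      /* hypothetical: descriptors only, no MMIO */
	wmb();                                      /* order descriptor writes before the kick */
	writel(ring->next_to_use, ring->tail_reg);  /* single PCI IO write per batch */
	spin_unlock(&ring->tx_lock);
	return NETDEV_TX_OK;
}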

cheers,
jamal



Re: Multicast and hardware checksum

2007-06-08 Thread Baruch Even

Baruch Even wrote:

Herbert Xu wrote:

On Fri, Jun 08, 2007 at 02:02:27PM +0300, Baruch Even wrote:
As far as IGMP and multicast handling everything works, the packets 
are even forwarded over the ppp links but they arrive to the client 
with a bad checksum. I don't have the trace in front of me but I 
believe it was the UDP checksum that failed.


What kind of a ppp device is this?

If you run a tcpdump either side of the ppp link do you see the same
UDP checksum value?


This is a pptp link. I've checked the checksum on the receive side, I 
don't know on the sender side and I'll only be able to try it on Sunday.


For completeness, the clients are Windows XP clients and the server is a 
 Linux machine. The tunnel is mppe encrypted so I believe that what 
goes out on the client is the same as what got in on the server.


Baruch


Re: [PATCH] NET: Multiqueue network device support.

2007-06-08 Thread jamal
On Fri, 2007-08-06 at 20:39 +1000, Herbert Xu wrote:

 It would guard against the poll routine which would acquire this lock
 when cleaning the TX ring.

Ok, then i suppose we can conclude it is a bug on e1000 (holds tx_lock
on tx side and adapter queue lock on rx). Adding that lock will
certainly bring down the performance numbers on a send/recv profile.
The bizarre thing is things run just fine even under the heavy tx/rx
traffic i was testing under. I guess i didnt hit hard enough.

cheers,
jamal



Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 07:31:07AM -0400, jamal ([EMAIL PROTECTED]) wrote:
 On Fri, 2007-08-06 at 12:38 +0400, Evgeniy Polyakov wrote:
  On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal ([EMAIL PROTECTED]) wrote:
 
   I believe both are called with no lock. The idea is to avoid the lock
   entirely when unneeded. That code may end up finding that the packet
 [..]
  +   netif_tx_lock_bh(odev);
  +   if (!netif_queue_stopped(odev)) {
  +
  +   idle_start = getCurUs();
  +   pkt_dev-tx_entered++;
  +   ret = odev-hard_batch_xmit(odev-blist, odev);
 
 [..]
  The same applies to *_gso case.
  
 
 You missed an important piece which is grabbing of
 __LINK_STATE_QDISC_RUNNING

But the lock is still being held - or was there no intention to reduce
lock usage? As far as I read Krishna's mail, lock usage was not an issue,
so that hunk probably should be dropped from the analysis.
 
  Without lock that would be wrong - it accesses hardware.
 
 We are achieving the goal of only a single CPU entering that path. Are
 you saying that is not good enough?

Then why was essentially the same code (the current batch_xmit callback)
previously always called with interrupts disabled? Aren't there
some watchdog/link/poll/whatever issues present?

  and i also do not know, what service demand is :)
 
 From the explanation seems to be how much cpu was used while sending. Do
 you have any suggestions for computing cpu use?
 in pktgen i added code to count how many microsecs were used in
 transmitting.

Something that anyone can understand :)
For example /proc stats; although they are not very accurate, they are a
really usable parameter from a userspace point of view.

  Result looks good, but I still do not understand how it appeared, that
  is why I'm not that excited about idea - I just do not know it in
  details.
 
 To add to KKs explanation on other email:
 Essentially the value is in amortizing the cost of barriers and IO per
 packet. For example the queue lock is held/released only once per X
 packets. DMA kicking which includes both a PCI IO write and mbs is done
 only once per X packets. There are still a lot of room for improvement
 of such IO;

Btw, what is the size of the packets in pktgen in your tests? Likely it
is small, since the result is that good. That can explain a lot.

 cheers,
 jamal

-- 
Evgeniy Polyakov


Re: [PATCH] NET: Multiqueue network device support.

2007-06-08 Thread Herbert Xu
On Fri, Jun 08, 2007 at 07:34:57AM -0400, jamal wrote:
 On Fri, 2007-08-06 at 20:39 +1000, Herbert Xu wrote:
 
  It would guard against the poll routine which would acquire this lock
  when cleaning the TX ring.
 
 Ok, then i suppose we can conclude it is a bug on e1000 (holds tx_lock
 on tx side and adapter queue lock on rx). Adding that lock will
 certainly bring down the performance numbers on a send/recv profile.
 The bizare thing is things run just fine even under the heavy tx/rx
 traffic i was testing under. I guess i didnt hit hard enough.

Hmm I wasn't describing how it works now.  I'm talking about how it
would work if we removed LLTX and replaced the private tx_lock with
netif_tx_lock.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] NET: Multiqueue network device support.

2007-06-08 Thread jamal
On Fri, 2007-08-06 at 22:37 +1000, Herbert Xu wrote:

 Hmm I wasn't describing how it works now.  I'm talking about how it
 would work if we removed LLTX and replaced the private tx_lock with
 netif_tx_lock.

I got that - it is what tg3 does for example.
To mimick that behavior in LLTX, a driver needs to use the same lock on
both tx and receive. e1000 holds a different lock on tx path from rx
path. Maybe theres something clever i am missing; but it seems to be a
bug on e1000.
The point i was making is that it was strange i never had problems
despite taking away the lock on the tx side and using the rx side
concurently.

cheers,
jamal





Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread jamal
On Fri, 2007-08-06 at 16:09 +0400, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 07:31:07AM -0400, jamal ([EMAIL PROTECTED]) wrote:


 But lock is still being hold - or there was no intention to reduce lock
 usage? As far as I read Krishna's mail, lock usage was not an issue, so
 that hunk probably should be dropped from the analysis.

With post-2.6.18 kernels that atomic bit guarantees only one CPU will
enter the tx path. The lock is only necessary to protect resources shared
between tx and rx (which could simultaneously be entered by two CPUs),
such as the tx ring. Refer to some other thread talking about a possible
bug with e1000 in this area. So maybe e1000 is not a good example in this
sense. But look at tg3.
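
Roughly, the guarantee being referred to is the qdisc run bit (simplified
from that era's include/net/pkt_sched.h; the exact form may differ slightly):

/* Only the CPU that wins the QDISC_RUNNING bit walks the queue and calls
 * hard_start_xmit(); every other CPU just enqueues its skb and returns. */
static inline void qdisc_run(struct net_device *dev)
{
	if (!netif_queue_stopped(dev) &&
	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
		__qdisc_run(dev);
}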

   Without lock that would be wrong - it accesses hardware.
  
  We are achieving the goal of only a single CPU entering that path. Are
  you saying that is not good enough?
 
 Then why essentially the same code (current batch_xmit callback)
 previously was always called with disabled interrupts? Aren't there
 some watchdog/link/poll/whatever issues present?

not in the e1000 as it stands today.


 Something, that anyone can understand :)
 For example /proc stats, although it is not very accurate, but it is
 really usable parameter from userspace point ov view.

which /proc stats?


 Btw, what is the size of the packet in pktgen in your tests? Likely it
 is small, since result is that good. That can explain alot.

There is a per-packet cost involved in that code path. So the more
packets/second you can generate the more intensely you can test that
path. I believe you will achieve overall better results with large
packets.

cheers,
jamal




Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 OK, so a delayed empty of the pipe could end up causing packet drops
 elsewhere due to allocation exhaustion?

Yes.

   We can grow the pipe, should we have to. So instead of blocking waiting
   on reader consumption, we can extend the size of the pipe and keep
   going.
  
  I have a code, which roughly works (but I will test it some more), which
  just introduces reference counters for slab pages, so that the would not
  be actually freed via page reclaim, but only after reference counters
  are dropped. That forced changes in mm/slab.c so likely it is
  unacceptible solution, but it is interesting as is.
 
 Hmm, still seems like it's working around the problem. We essentially
 just need to ensure that the data doesn't get _reused_, not just freed.
 It doesn't help holding a reference to the page, if someone else just
 reuses it and fills it with other data before it has been consumed and
 released by the pipe buffer operations.
 
 That's why I thought the skb referencing was the better idea, then we
 don't have to care about the backing of the skb either. Provided that
 preventing the free of the skb before the pipe buffer has been consumed
 guarantees that the contents aren't reused.

It is not only a better idea, it is the only correct one.
Attached is a patch for the interested reader, which does slab page
accounting, but it is broken. It does not blow up with a kernel bug, but
it fills the output file with random garbage from reused and dirtied
pages. And I do not know why, but the received file is always smaller
than the file being sent (with a file of reasonable size like 10mb; with
4-40kb file sizes things seem to be ok).

I've started skb referencing work, let's see where this will end up.

diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..742e1ee 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,18 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 
+extern void slab_change_usage(struct page *p);
+
+static inline void splice_page_release(struct page *p)
+{
+   struct page *head = p->first_page;
+   if (!PageSlab(head))
+   page_cache_release(p);
+   else {
+   slab_change_usage(head);
+   }
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +93,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info 
*pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
 {
-   page_cache_release(buf->page);
+   splice_page_release(buf->page);
	buf->flags &= ~PIPE_BUF_FLAG_LRU;
 }
 
@@ -246,7 +258,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
}
 
	while (page_nr < spd->nr_pages)
-   page_cache_release(spd->pages[page_nr++]);
+   splice_page_release(spd->pages[page_nr++]);
 
return ret;
 }
@@ -322,7 +334,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
error = add_to_page_cache_lru(page, mapping, index,
  GFP_KERNEL);
if (unlikely(error)) {
-   page_cache_release(page);
+   splice_page_release(page);
if (error == -EEXIST)
continue;
break;
@@ -448,7 +460,7 @@ fill_it:
 * we got, 'nr_pages' is how many pages are in the map.
 */
	while (page_nr < nr_pages)
-   page_cache_release(pages[page_nr++]);
+   splice_page_release(pages[page_nr++]);
 
if (spd.nr_pages)
return splice_to_pipe(pipe, spd);
@@ -604,7 +616,7 @@ find_page:
 
if (ret != AOP_TRUNCATED_PAGE)
unlock_page(page);
-   page_cache_release(page);
+   splice_page_release(page);
if (ret == AOP_TRUNCATED_PAGE)
goto find_page;
 
@@ -634,7 +646,7 @@ find_page:
	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
if (ret) {
if (ret == AOP_TRUNCATED_PAGE) {
-   page_cache_release(page);
+   splice_page_release(page);
goto find_page;
}
	if (ret < 0)
@@ -651,7 +663,7 @@ find_page:
 */
mark_page_accessed(page);
 out:
-   page_cache_release(page);
+   splice_page_release(page);
unlock_page(page);
 out_ret:
return ret;
diff --git a/mm/slab.c b/mm/slab.c
index 2e71a32..673383d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1649,8 +1649,12 @@ static void *kmem_getpages(struct kmem_cache *cachep, 
gfp_t flags, int nodeid)
else
add_zone_page_state(page_zone(page),

Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  OK, so a delayed empty of the pipe could end up causing packet drops
  elsewhere due to allocation exhaustion?
 
 Yes.
 
We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.
   
   I have a code, which roughly works (but I will test it some more), which
   just introduces reference counters for slab pages, so that the would not
   be actually freed via page reclaim, but only after reference counters
   are dropped. That forced changes in mm/slab.c so likely it is
   unacceptible solution, but it is interesting as is.
  
  Hmm, still seems like it's working around the problem. We essentially
  just need to ensure that the data doesn't get _reused_, not just freed.
  It doesn't help holding a reference to the page, if someone else just
  reuses it and fills it with other data before it has been consumed and
  released by the pipe buffer operations.
  
  That's why I thought the skb referencing was the better idea, then we
  don't have to care about the backing of the skb either. Provided that
  preventing the free of the skb before the pipe buffer has been consumed
  guarantees that the contents aren't reused.
 
 It is not only better idea, it is the only correct one.
 Attached patch for interested reader, which does slab pages accounting,
 but it is broken. It does not fires up with kernel bug, but it fills
 output file with random garbage from reused and dirtied pages. And I do
 not know why, but received file is always smaller than file being sent
 (when file has resonable size like 10mb, with 4-40kb filesize things
 seems to be ok).
 
 I've started skb referencing work, let's see where this will end up.

Here's a start, for the splice side at least, of storing a buf->private
entity with the ops.

diff --git a/fs/splice.c b/fs/splice.c
index 90588a8..f24e367 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -191,6 +191,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
	buf->page = spd->pages[page_nr];
	buf->offset = spd->partial[page_nr].offset;
	buf->len = spd->partial[page_nr].len;
+	buf->private = spd->partial[page_nr].private;
	buf->ops = spd->ops;
	if (spd->flags & SPLICE_F_GIFT)
		buf->flags |= PIPE_BUF_FLAG_GIFT;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
unsigned int offset, len;
const struct pipe_buf_operations *ops;
unsigned int flags;
+   unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int   skb_store_bits(struct sk_buff 
*skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
  int offset, u8 *to, int len,
  __wsum csum);
-extern int skb_splice_bits(const struct sk_buff *skb,
+extern int skb_splice_bits(struct sk_buff *skb,
unsigned int offset,
struct pipe_inode_info *pipe,
unsigned int len,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index b3f1528..1a1182b 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -41,6 +41,7 @@ struct splice_desc {
 struct partial_page {
unsigned int offset;
unsigned int len;
+   unsigned long private;
 };
 
 /*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d2b2547..7d9ec9e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -78,7 +78,10 @@ static void sock_pipe_buf_release(struct pipe_inode_info 
*pipe,
 #ifdef NET_COPY_SPLICE
	__free_page(buf->page);
 #else
-	put_page(buf->page);
+	struct sk_buff *skb = (struct sk_buff *) buf->private;
+
+	kfree_skb(skb);
+	//put_page(buf->page);
 #endif
 }
 
@@ -1148,7 +1151,8 @@ fault:
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
*page,
-   unsigned int len, unsigned int offset)
+   unsigned int len, unsigned int offset,
+   struct sk_buff *skb)
 {
struct page *p;
 
@@ -1163,12 +1167,14 @@ static inline int spd_fill_page(struct splice_pipe_desc 
*spd, struct page *page,

[PATCH 1/1] make network DMA usable for non-tcp drivers

2007-06-08 Thread Ed L. Cashin
Here is a patch against the netdev-2.6 git tree that makes the net DMA
feature usable for drivers like the ATA over Ethernet block driver,
which can use dma_skb_copy_datagram_iovec when receiving data from the
network.

The change was suggested on kernelnewbies.

  http://article.gmane.org/gmane.linux.kernel.kernelnewbies/21663

Signed-off-by: Ed L. Cashin [EMAIL PROTECTED]
---
 drivers/dma/Kconfig |2 +-
 net/core/user_dma.c |2 ++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 72be6c6..270d23e 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -14,7 +14,7 @@ config DMA_ENGINE
 comment DMA Clients
 
 config NET_DMA
-   bool Network: TCP receive copy offload
+   bool Network: receive copy offload
depends on DMA_ENGINE  NET
default y
---help---
diff --git a/net/core/user_dma.c b/net/core/user_dma.c
index 0ad1cd5..69d0b15 100644
--- a/net/core/user_dma.c
+++ b/net/core/user_dma.c
@@ -130,3 +130,5 @@ end:
 fault:
return -EFAULT;
 }
+
+EXPORT_SYMBOL(dma_skb_copy_datagram_iovec);
-- 
1.5.2.1
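
For illustration, a non-TCP receive path (such as the AoE driver mentioned
above) might call the newly exported helper roughly as below. This is a
sketch only: the DMA channel and pinned iovec list are assumed to have been
set up elsewhere, and the asynchronous copy still has to be completed before
the iovec contents are used.

/* Copy skb payload into a user iovec, offloading to a DMA engine when a
 * channel is available and falling back to the normal CPU copy otherwise. */
static int recv_copy_sketch(struct dma_chan *chan,
			    struct dma_pinned_list *pinned,
			    struct sk_buff *skb, struct iovec *iov, size_t len)
{
	int cookie;

	if (!chan)
		return skb_copy_datagram_iovec(skb, 0, iov, len);

	cookie = dma_skb_copy_datagram_iovec(chan, skb, 0, iov, len, pinned);
	if (cookie < 0)
		return cookie;
	/* caller waits on the returned cookie before touching the data */
	return 0;
}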


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 Here's a start, for the splice side at least of storing a buf-private
 entity with the ops.

:) I tested the same implementation, but I put the skb pointer into
page->private. My approach is not correct, since the same page can hold
several objects, so if there are several splicers, this will scream.
I've tested your patch on top of splice-net branch, here is a result:

[   44.798853] Slab corruption: skbuff_head_cache start=81003b726668, 
len=192
[   44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[   44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[   44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
[   44.824889] Prev obj: start=81003b726590, len=192
[   44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[   44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f)
[   44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
[   44.855772] Next obj: start=81003b726740, len=192
[   44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[   44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[   44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[   44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

I will try some things for the next 30-60 minutes, and then I will leave
for a canoe trip until Tuesday, so I will not be able to work on this idea.

-- 
Evgeniy Polyakov


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Here's a start, for the splice side at least of storing a buf-private
  entity with the ops.
 
 :) I tested the same implementation, but I put skb pointer into
 page-private. My approach is not correct, since the same page can hold
 several objects, so if there are several splicers, this will scream.
 I've tested your patch on top of splice-net branch, here is a result:
 
 [   44.798853] Slab corruption: skbuff_head_cache start=81003b726668, 
 len=192
 [   44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 [   44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
 [   44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
 [   44.824889] Prev obj: start=81003b726590, len=192
 [   44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
 [   44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f)
 [   44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [   44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
 [   44.855772] Next obj: start=81003b726740, len=192
 [   44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 [   44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
 [   44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
 [   44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
 
 I will try some things for the nearest 30-60 minutes, and then will move to
 canoe trip until thuesday, so will not be able to work on this idea.

I'm not surprised, it wasn't tested at all - just provides the basic
framework for storing the skb so we can access it on pipe buffer
release.

Let's talk more next week; I'll likely play with this approach on Monday.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 I will try some things for the nearest 30-60 minutes, and then will move to
 canoe trip until thuesday, so will not be able to work on this idea.

Ok, replacing in fs/splice.c every page_cache_release() with
static void splice_page_release(struct page *p)
{
if (!PageSlab(p))
page_cache_release(p);
}

and putting a cloned skb into the private field instead of the
original one in spd_fill_page() ends up without a kernel hang.

I'm not sure it is correct that a page can be released in fs/splice.c
without calling any callback from the network code while network data is
being processed.

The size of the received file is bigger than the file sent, and the file
sometimes contains repeated blocks of data. Cloned skb usage is likely
too big an overhead, although for receiving the fast clone is unused in
most cases, so there might be some gain.

Attached your patch with above changes.

diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..a75dc56 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,12 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 
+static void splice_page_release(struct page *p)
+{
+   if (!PageSlab(p))
+   page_cache_release(p);
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +87,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info 
*pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
 {
-   page_cache_release(buf->page);
+   splice_page_release(buf->page);
	buf->flags &= ~PIPE_BUF_FLAG_LRU;
 }
 
@@ -191,6 +197,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
	buf->page = spd->pages[page_nr];
	buf->offset = spd->partial[page_nr].offset;
	buf->len = spd->partial[page_nr].len;
+	buf->private = spd->partial[page_nr].private;
	buf->ops = spd->ops;
	if (spd->flags & SPLICE_F_GIFT)
		buf->flags |= PIPE_BUF_FLAG_GIFT;
@@ -246,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
}
 
	while (page_nr < spd->nr_pages)
-   page_cache_release(spd->pages[page_nr++]);
+   splice_page_release(spd->pages[page_nr++]);
 
return ret;
 }
@@ -322,7 +329,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
error = add_to_page_cache_lru(page, mapping, index,
  GFP_KERNEL);
if (unlikely(error)) {
-   page_cache_release(page);
+   splice_page_release(page);
if (error == -EEXIST)
continue;
break;
@@ -448,7 +455,7 @@ fill_it:
 * we got, 'nr_pages' is how many pages are in the map.
 */
	while (page_nr < nr_pages)
-   page_cache_release(pages[page_nr++]);
+   splice_page_release(pages[page_nr++]);
 
if (spd.nr_pages)
return splice_to_pipe(pipe, spd);
@@ -604,7 +611,7 @@ find_page:
 
if (ret != AOP_TRUNCATED_PAGE)
unlock_page(page);
-   page_cache_release(page);
+   splice_page_release(page);
if (ret == AOP_TRUNCATED_PAGE)
goto find_page;
 
@@ -634,7 +641,7 @@ find_page:
	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
if (ret) {
if (ret == AOP_TRUNCATED_PAGE) {
-   page_cache_release(page);
+   splice_page_release(page);
goto find_page;
}
	if (ret < 0)
@@ -651,7 +658,7 @@ find_page:
 */
mark_page_accessed(page);
 out:
-   page_cache_release(page);
+   splice_page_release(page);
unlock_page(page);
 out_ret:
return ret;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
unsigned int offset, len;
const struct pipe_buf_operations *ops;
unsigned int flags;
+   unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int   skb_store_bits(struct sk_buff 
*skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
  int offset, u8 *to, int len,
  

Re: [PATCH] Virtual ethernet tunnel (v.2)

2007-06-08 Thread Pavel Emelianov
Ben Greear wrote:

[snip]

 I would also like some way to identify veth from other device types,
 preferably
 something like a value in sysfs.   However, that should not hold up
 

 We can do this with ethtool. It can get and print the driver name of
 the device.
   
 I think I'd like something in sysfs that we could query for any
 interface.  Possible return
 strings could be:
 VLAN
 VETH
 ETH
 PPP
 BRIDGE
 AP /* wifi access point interface */
 STA /* wifi station */
 
 
 I will cook up a patch for consideration after veth goes in.
 

Ben, could you please tell us what sysfs features you
plan to implement?

Thanks,
Pavel


[PATCH] RFC: have tcp_recvmsg() check kthread_should_stop() and treat it as if it were signalled

2007-06-08 Thread Jeff Layton
Already sent this to several lists, but forgot netdev ;-)...

This one's sort of outside my normal area of expertise so sending this
as an RFC to gather feedback on the idea.

Some background:

The cifs_mount() and cifs_umount() functions currently send a signal to
the cifsd kthread prior to calling kthread_stop on it. The reasoning is
apparently that it's likely that cifsd will have called kernel_recvmsg()
and if it doesn't do this there can be a rather long delay when a
filesystem is unmounted.

The following patch is a first stab at removing this need. It makes it
so that in tcp_recvmsg() we also check kthread_should_stop() at any
point where we currently check to see if the task was signalled. If
that returns true, then it acts as if it were signalled.
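
For context, the kind of kthread receive loop this is meant to unblock looks
roughly like the sketch below (illustrative only, not the actual cifsd code;
the function and buffer names are made up):

/* A receive kthread that blocks in kernel_recvmsg() (and thus in
 * tcp_recvmsg()).  With the change above, kthread_stop() alone is enough
 * to kick it out of the blocking receive; no signal is needed. */
static int recv_thread(void *data)
{
	struct socket *sock = data;
	struct msghdr msg = { .msg_flags = 0 };
	struct kvec iov;
	char buf[4096];
	int len;

	while (!kthread_should_stop()) {
		iov.iov_base = buf;
		iov.iov_len = sizeof(buf);
		len = kernel_recvmsg(sock, &msg, &iov, 1, sizeof(buf), 0);
		if (len == -EINTR || len == -ERESTARTSYS || len == -EAGAIN)
			continue;	/* interrupted; loop re-checks the stop flag */
		if (len <= 0)
			break;
		/* ... process len bytes of received data ... */
	}
	return 0;
}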

I've tested this on a fairly recent kernel with a cifs module that
doesn't send signals on unmount and it seems to work as expected. I'm
just not clear on whether it will have any adverse side-effects.

Obviously if this approach is OK then we'll probably also want to fix
up other recvmsg functions (udp_recvmsg, etc).

Anyone care to comment?

Thanks,

Signed-off-by: Jeff Layton [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd4c295..1ad91fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -258,6 +258,7 @@
 #include <linux/cache.h>
 #include <linux/err.h>
 #include <linux/crypto.h>
+#include <linux/kthread.h>
 
 #include <net/icmp.h>
 #include <net/tcp.h>
@@ -1154,7 +1155,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, 
struct msghdr *msg,
	if (tp->urg_data && tp->urg_seq == *seq) {
if (copied)
break;
-   if (signal_pending(current)) {
+   if (signal_pending(current) || kthread_should_stop()) {
copied = timeo ? sock_intr_errno(timeo) : 
-EAGAIN;
break;
}
@@ -1197,6 +1198,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, 
struct msghdr *msg,
	(sk->sk_shutdown & RCV_SHUTDOWN) ||
!timeo ||
signal_pending(current) ||
+   kthread_should_stop() ||
	(flags & MSG_PEEK))
break;
} else {
@@ -1227,7 +1229,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, 
struct msghdr *msg,
break;
}
 
-   if (signal_pending(current)) {
+   if (signal_pending(current) || kthread_should_stop()) {
copied = sock_intr_errno(timeo);
break;
}


Re: [PATCH] Virtual ethernet tunnel (v.2)

2007-06-08 Thread Ben Greear

Pavel Emelianov wrote:

Ben Greear wrote:

[snip]

  

I would also like some way to identify veth from other device types,
preferably
something like a value in sysfs.   However, that should not hold up



We can do this with ethtool. It can get and print the driver name of
the device.
  
  

I think I'd like something in sysfs that we could query for any
interface.  Possible return
strings could be:
VLAN
VETH
ETH
PPP
BRIDGE
AP /* wifi access point interface */
STA /* wifi station */


I will cook up a patch for consideration after veth goes in.




Ben, could you please tell what sysfs features do you
plan to implement?
  

I think this is the only thing that has a chance of getting into the kernel.
Basically, I have a user-space app and I want to be able to definitively
know the type for all interfaces.  Currently, I have a hodge-podge of logic
to query various ioctls and /proc files and finally, guess by name if
nothing else works.  There must be a better way :P
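
As a rough sketch of the kind of attribute being discussed (written against
the class_device based net-sysfs style of that era; the attribute name and
the set of type strings are hypothetical):

/* Expose a per-netdevice "type" string via sysfs, so userspace does not
 * have to guess from ioctls, /proc files and interface names. */
static ssize_t show_iftype(struct class_device *cd, char *buf)
{
	const struct net_device *dev = to_net_dev(cd);
	const char *type = "ETH";

	if (dev->priv_flags & IFF_802_1Q_VLAN)
		type = "VLAN";
	else if (dev->priv_flags & IFF_EBRIDGE)
		type = "BRIDGE";
	/* a veth driver would set and test its own flag here */
	return sprintf(buf, "%s\n", type);
}
static CLASS_DEVICE_ATTR(iftype, S_IRUGO, show_iftype, NULL);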


I have another sysfs patch that allows setting a default skb->mark for
an interface, so that you can set the skb->mark before it hits the
connection tracking logic, but I've been told this one has very little
chance of getting into the kernel.  The skb->mark patch is only useful
(as far as I can tell) if you also include a patch Patrick McHardy did
for me that allowed the conn-tracking logic to use skb->mark as part of
its tuple.  This allows me to do NAT between virtual routers (routing
tables) on the same machine, using veth-equivalent drivers to connect
the routers.  He thinks this will probably never get into the kernel
either.

I have another sysctl related send-to-self patch that also has little 
chance of getting into the kernel, but
it might be quite useful with veth (it's useful to me..but my needs 
aren't exactly mainstream :))

I'll post this separately for consideration

Thanks,
Ben


--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cbe-oss-dev] [PATCH 0/18] spidernet driver bug fixes

2007-06-08 Thread Linas Vepstas
On Fri, Jun 08, 2007 at 11:12:31AM +1000, Michael Ellerman wrote:
 On Thu, 2007-06-07 at 14:17 -0500, Linas Vepstas wrote:
  Jeff, please apply for the 2.6.23 kernel tree.  The patch series
  consists of two major bugfixes, and several bits of cleanup.
  
  The major bug fixes are: 
  
  1) a rare but fatal bug involving RX ram full messages, 
 which results in a driver deadlock.
  
  2) misconfigured TX interrupts, causing a severe performance
     degradation for small packets.
 
 I realise it's late, but shouldn't major bugfixes be going into 22 ?

Yeah, I suppose, I admit I've lost track of the process. 

I'm not sure how to submit patches for this case. The major fixes
are patches 6/18, 13/18 14/18 and 17/18; (the rest of the patches are 
cruft-fixes). Taken alone, these four will not apply cleanly. 

I could prepare a new set, with just these four; assuming these are
accepted into 2.6.22, then once 22 comes out, Jeff's .23 tree won't 
merge cleanly.  

What's the right way to do this?

--linas
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Cbe-oss-dev] [PATCH 0/18] spidernet driver bug fixes

2007-06-08 Thread Jeff Garzik
On Fri, Jun 08, 2007 at 12:06:08PM -0500, Linas Vepstas wrote:
 On Fri, Jun 08, 2007 at 11:12:31AM +1000, Michael Ellerman wrote:
  On Thu, 2007-06-07 at 14:17 -0500, Linas Vepstas wrote:
   Jeff, please apply for the 2.6.23 kernel tree.  The pach series
   consists of two major bugfixes, and several bits of cleanup.
   
   The major bug fixes are: 
   
   1) a rare but fatal bug involving RX ram full messages, 
  which results in a driver deadlock.
   
   2) misconfigured TX interrupts, causing a sever performance
  degardation for small packets.
  
  I realise it's late, but shouldn't major bugfixes be going into 22 ?
 
 Yeah, I suppose, I admit I've lost track of the process. 
 
 I'm not sure how to submit patches for this case. The major fixes
 are patches 6/18, 13/18 14/18 and 17/18; (the rest of the patches are 
 cruft-fixes). Taken alone, these four will not apply cleanly. 
 
 I could prepare a new set, with just these four; asuming these are
 accepted into 2.6.22, then once 22 comes out, Jeff's .23 tree won't 
 merge cleanly.  

You need to order your bug fixes first in the queue.  I push those
upstream, and simultaneous merge the result into netdev#upstream (2.6.23
queue).

Jeff



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] NetXen: Initialization, link status and other bug fixes

2007-06-08 Thread wendy xiong
On Thu, 2007-06-07 at 04:28 -0700, Mithlesh Thukral wrote:
 Hi All,
 
 I will be sending bug fixes related to initialization, link status and 
 some compile issues of NetXen's 1/10G Ethernet driver in subsequent 
 mails.
 These patches are wrt netdev#upstream-fixes.
 
 Regards,
 Mithlesh Thukral

Jeff, 

Thanks for your review this series patches on 6/3.
Based on your comments, we have re-submitted the patches with your
requirements. Also we have tested these patches on x/pBlade in our lab.

Thanks,
wendy


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread Rick Jones

These results are based on the test script that I sent earlier today. I
removed the results for UDP 32 procs 512 and 4096 buffer cases since
the BW was coming line speed (infact it was showing 1500Mb/s and
4900Mb/s respectively for both the ORG and these bits). 



I expect UDP to overwhelm the receiver. So the receiver needs a lot more
tuning (like increased rcv socket buffer sizes to keep up, IMO).

But yes, the above is an odd result - Rick any insight into this?


Indeed, there is no flow control provided by netperf for the UDP_STREAM 
test and so it is quite common for a receiver to be overwhelmed.  One 
can tweak the SO_RCVBUF size a bit to try to help with transients, but 
if the sender is sustainably faster than the receiver, you have to 
configure netperf with --enable-intervals  and then provide a send burst 
(number of sends) size and an inter burst interval (constrained by HZ 
on the platform) to pace the netperf UDP sender.  You can get finer 
grained control with --enable-spin, but that shoots your netperf-sided 
CPU util to hell.
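
For instance (option letters quoted from memory, so please double-check 
them against your build): after building with --enable-intervals, 
something like

	netperf -H <receiver> -t UDP_STREAM -b 8 -w 10 -- -m 512

would send bursts of 8 sends with roughly 10 ms between bursts, HZ 
permitting.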


And with UDP datagram sizes > MTU there is (in the abstract, not sure 
about current Linux code) the concern about filling a transmit queue 
with some but not all of the fragments of a datagram and the others 
being tossed, so one ends up sending unreassemblable datagram fragments.




Summary : Average BW (whatever meaning that has) improved 0.65%, while
Service Demand deteriorated 11.86%



Sorry, been many moons since i last played with netperf; what does service
demand mean?


Service demand is a measure of efficiency.  It is a 
normalization/reconciliation of the throughput and the CPU utilization 
to arrive at a CPU consumed per unit of work figure.  Lower is better.


Now, when running aggregate tests with netperf2 using the "launch a 
bunch in the background with confidence intervals enabled to get 
iterations" method to minimize skew error :)


http://www.netperf.org/svn/netperf2/tags/netperf-2.4.3/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

you cannot take the netperf service demand directly - each netperf is 
calculating assuming that it is the only thing running on the system. 
It then ass-u-me-s that the CPU util it measured was all for its work. 
This means the service demand figure will be quite higher than it really is.


So, for aggregate tests using netperf2, one has to calculate service 
demand by hand.  Sum the throughput as KB/s, convert the CPU util and 
number of CPUs to a microseconds of CPU consumed per second and divide 
to get microseconds per KB for the aggregate.
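
As a made-up example: four concurrent streams summing to 200,000 KB/s 
on a 4-CPU box at 60% overall utilization consume 0.60 * 4 * 1,000,000 
= 2,400,000 usec of CPU per second, so the aggregate service demand is 
2,400,000 / 200,000 = 12 usec/KB.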


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread Rick Jones

Also note something else that is kind of strange: UDP, which doesn't
back off, will send out fewer packets/second ;->



Cannot explain that either :)


Perhaps delays in restarting after the intra-stack flow control is 
asserted.  One possible thing to do to try to deal with that a little 
would be to increase SO_SNDBUF in netperf with the -s option.  That at 
least is something I did back in 2.4 days
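
Something along the lines of

	netperf -H <receiver> -t UDP_STREAM -l 30 -- -s 262144 -S 262144 -m 512

with -s/-S being the local and remote socket buffer sizes (test-specific 
options, after the "--").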


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] ibmveth: Fix h_free_logical_lan error on pool resize

2007-06-08 Thread Brian King

When attempting to activate additional rx buffer pools on an ibmveth interface 
that
was not yet up, the error below was seen. The patch fixes this by only closing
and opening the interface to activate the resize if the interface is already
opened.

(drivers/net/ibmveth.c:597 ua:3004) ERROR: h_free_logical_lan failed with 
fffc, continuing with close
Unable to handle kernel paging request for data at address 0x0ff8
Faulting instruction address: 0xd02540e0
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=128 NUMA PSERIES LPAR 
Modules linked in: ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle ipta
ble_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables i
p6table_filter ip6_tables x_tables ipv6 apparmor aamatch_pcre loop dm_mod ibmvet
h sg ibmvscsic sd_mod scsi_mod
NIP: D02540E0 LR: D02540D4 CTR: 801AF404
REGS: c0001cd27870 TRAP: 0300   Not tainted  (2.6.16.46-0.4-ppc64)
MSR: 80009032 EE,ME,IR,DR  CR: 24242422  XER: 0007
DAR: 0FF8, DSISR: 4000
TASK = c0001ca7b4e0[1636] 'sh' THREAD: c0001cd24000 CPU: 0
GPR00: D02540D4 C0001CD27AF0 D0265650 C0001C936500 
GPR04: 80009032  0007 0002C2EF 
GPR08:   C0652A10 C0652AE0 
GPR12: 4000 C04A3300 100A  
GPR16: 100B8808 100C0F60  10084878 
GPR20:  100C0CB0 100AF498 0002 
GPR24: 100BA488 C0001C936760 D0258DD0 C0001C936000 
GPR28:  C0001C936500 D0265180 C0001C936000 
NIP [D02540E0] .ibmveth_close+0xc8/0xf4 [ibmveth]
LR [D02540D4] .ibmveth_close+0xbc/0xf4 [ibmveth]
Call Trace:
[C0001CD27AF0] [D02540D4] .ibmveth_close+0xbc/0xf4 [ibmveth] 
(unreliable)
[C0001CD27B80] [D02545FC] .veth_pool_store+0xd0/0x260 [ibmveth]
[C0001CD27C40] [C012E0E8] .sysfs_write_file+0x118/0x198
[C0001CD27CF0] [C00CDAF0] .vfs_write+0x130/0x218
[C0001CD27D90] [C00CE52C] .sys_write+0x4c/0x8c
[C0001CD27E30] [C000871C] syscall_exit+0x0/0x40
Instruction dump:
419affd8 2fa3 419e0020 e93d e89e8040 38a00255 e87e81b0 80c90018 
48001531 e8410028 e93d00e0 7fa3eb78 e8090ff8 f81d0430 4bfffdc9 38210090 

Signed-off-by: Brian King [EMAIL PROTECTED]
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |   53 ++--
 1 file changed, 31 insertions(+), 22 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_large_frames drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_large_frames2007-05-14 
15:03:06.0 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c 2007-05-15 09:18:46.0 
-0500
@@ -1243,16 +1243,19 @@ const char * buf, size_t count)
 
if (attr == veth_active_attr) {
if (value  !pool-active) {
-   if(ibmveth_alloc_buffer_pool(pool)) {
-ibmveth_error_printk(unable to alloc pool\n);
-return -ENOMEM;
-}
-   pool-active = 1;
-   adapter-pool_config = 1;
-   ibmveth_close(netdev);
-   adapter-pool_config = 0;
-   if ((rc = ibmveth_open(netdev)))
-   return rc;
+   if (netif_running(netdev)) {
+   if(ibmveth_alloc_buffer_pool(pool)) {
+   ibmveth_error_printk(unable to alloc 
pool\n);
+   return -ENOMEM;
+   }
+   pool-active = 1;
+   adapter-pool_config = 1;
+   ibmveth_close(netdev);
+   adapter-pool_config = 0;
+   if ((rc = ibmveth_open(netdev)))
+   return rc;
+   } else
+   pool-active = 1;
} else if (!value  pool-active) {
int mtu = netdev-mtu + IBMVETH_BUFF_OH;
int i;
@@ -1281,23 +1284,29 @@ const char * buf, size_t count)
if (value = 0 || value  IBMVETH_MAX_POOL_COUNT)
return -EINVAL;
else {
-   adapter-pool_config = 1;
-   ibmveth_close(netdev);
-   adapter-pool_config = 0;
-   pool-size = value;
-   if ((rc = ibmveth_open(netdev)))
-   return rc;
+   if (netif_running(netdev)) {
+   adapter-pool_config = 1;
+   

[PATCH 2/2] ibmveth: Automatically enable larger rx buffer pools for larger mtu

2007-06-08 Thread Brian King

Currently, ibmveth maintains several rx buffer pools, which can
be modified through sysfs. By default, the larger pools are not
allocated, so jumbo frames cannot be supported without first
activating larger rx buffer pools. This results in failures when attempting
to change the mtu. This patch makes ibmveth automatically allocate
these larger buffer pools when the mtu is changed.

Signed-off-by: Brian King [EMAIL PROTECTED]
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |   27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_large_mtu drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_large_mtu   2007-05-16 
10:47:54.0 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c 2007-05-16 10:47:54.0 
-0500
@@ -915,17 +915,36 @@ static int ibmveth_change_mtu(struct net
 {
struct ibmveth_adapter *adapter = dev-priv;
int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
-   int i;
+   int reinit = 0;
+   int i, rc;
 
if (new_mtu  IBMVETH_MAX_MTU)
return -EINVAL;
 
+   for (i = 0; i  IbmVethNumBufferPools; i++)
+   if (new_mtu_oh  adapter-rx_buff_pool[i].buff_size)
+   break;
+
+   if (i == IbmVethNumBufferPools)
+   return -EINVAL;
+
/* Look for an active buffer pool that can hold the new MTU */
for(i = 0; iIbmVethNumBufferPools; i++) {
-   if (!adapter-rx_buff_pool[i].active)
-   continue;
+   if (!adapter-rx_buff_pool[i].active) {
+   adapter-rx_buff_pool[i].active = 1;
+   reinit = 1;
+   }
+
if (new_mtu_oh  adapter-rx_buff_pool[i].buff_size) {
-   dev-mtu = new_mtu;
+   if (reinit  netif_running(adapter-netdev)) {
+   adapter-pool_config = 1;
+   ibmveth_close(adapter-netdev);
+   adapter-pool_config = 0;
+   dev-mtu = new_mtu;
+   if ((rc = ibmveth_open(adapter-netdev)))
+   return rc;
+   } else
+   dev-mtu = new_mtu;
return 0;
}
}
_
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] make network DMA usable for non-tcp drivers

2007-06-08 Thread Andrew Morton
On Fri, 8 Jun 2007 10:30:53 -0400
Ed L. Cashin [EMAIL PROTECTED] wrote:

 Here is a patch against the netdev-2.6 git tree that makes the net DMA
 feature usable for drivers like the ATA over Ethernet block driver,
 which can use dma_skb_copy_datagram_iovec when receiving data from the
 network.
 
 The change was suggested on kernelnewbies.
 
   http://article.gmane.org/gmane.linux.kernel.kernelnewbies/21663
 
 Signed-off-by: Ed L. Cashin [EMAIL PROTECTED]
 ---
  drivers/dma/Kconfig |2 +-
  net/core/user_dma.c |2 ++
  2 files changed, 3 insertions(+), 1 deletions(-)
 
 diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
 index 72be6c6..270d23e 100644
 --- a/drivers/dma/Kconfig
 +++ b/drivers/dma/Kconfig
 @@ -14,7 +14,7 @@ config DMA_ENGINE
  comment DMA Clients
  
  config NET_DMA
 - bool Network: TCP receive copy offload
 + bool Network: receive copy offload
   depends on DMA_ENGINE  NET
   default y
   ---help---
 diff --git a/net/core/user_dma.c b/net/core/user_dma.c
 index 0ad1cd5..69d0b15 100644
 --- a/net/core/user_dma.c
 +++ b/net/core/user_dma.c
 @@ -130,3 +130,5 @@ end:
  fault:
   return -EFAULT;
  }
 +
 +EXPORT_SYMBOL(dma_skb_copy_datagram_iovec);

We wouldn't want to merge this until code which actually uses the export is
also merged.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel (v.2)

2007-06-08 Thread Carl-Daniel Hailfinger
On 08.06.2007 19:00, Ben Greear wrote:
 I have another sysfs patch that allows setting a default skb-mark for
 an interface so that you can set the skb-mark
 before it hits the connection tracking logic, but I'm been told this one
 has very little chance
 of getting into the kernel.  The skb-mark patch is only useful (as far
 as I can tell) if you
 also include a patch Patrick McHardy did for me that allowed the
 conn-tracking logic to
 use skb-mark as part of it's tuple.  This allows me to do NAT between
 virtual routers
 (routing tables) on the same machine using veth-equivalent drivers to
 connect the
 routers.  He thinks this will probably not ever get into the kernel either.

Are these patches available somewhere? I'm currently doing NAT between
virtual routers by some advanced iproute2/iptables trickery, but I have
no way to handle the occasional tuple conflict.

Regards,
Carl-Daniel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ipvs: Fix state variable on failure to start ipvs threads

2007-06-08 Thread Neil Horman
Hey all-
ip_vs currently fails to reset its ip_vs_sync_state variable if the sync
thread fails to start properly.  The result is that the kernel will report a
running daemon when there actually is none.  If you issue the following commands:
1. ipvsadm --start-daemon master --mcast-interface bla
2. ipvsadm -L --daemon
3. ipvsadm --stop-daemon master

Assuming that bla is not an actual interface, step 2 should return no data, but
instead returns:
$ ipvsadm -L --daemon
master sync daemon (mcast=bla, syncid=0)

The following patch corrects this behavior.  Tested successfully by myself

Thanks  Regards
Neil

Signed-off-by: Neil Horman [EMAIL PROTECTED]


 ip_vs_sync.c |   41 +++--
 1 file changed, 39 insertions(+), 2 deletions(-)



diff --git a/net/ipv4/ipvs/ip_vs_sync.c b/net/ipv4/ipvs/ip_vs_sync.c
index 7ea2d98..ff4df68 100644
--- a/net/ipv4/ipvs/ip_vs_sync.c
+++ b/net/ipv4/ipvs/ip_vs_sync.c
@@ -67,6 +67,11 @@ struct ip_vs_sync_conn_options {
struct ip_vs_seqout_seq;/* outgoing seq. struct */
 };
 
+struct ip_vs_sync_thread_data {
+   struct completion *startup;
+   int state;
+};
+
 #define IP_VS_SYNC_CONN_TIMEOUT (3*60*HZ)
 #define SIMPLE_CONN_SIZE  (sizeof(struct ip_vs_sync_conn))
 #define FULL_CONN_SIZE  \
@@ -751,6 +756,7 @@ static int sync_thread(void *startup)
mm_segment_t oldmm;
int state;
const char *name;
+   struct ip_vs_sync_thread_data *tinfo = startup;
 
/* increase the module use count */
ip_vs_use_count_inc();
@@ -789,7 +795,14 @@ static int sync_thread(void *startup)
add_wait_queue(sync_wait, wait);
 
set_sync_pid(state, current-pid);
-   complete((struct completion *)startup);
+   complete(tinfo-startup);
+
+   /*
+* once we call the completion queue above, we should
+* null out that reference, since its allocated on the
+* stack of the creating kernel thread
+*/
+   tinfo-startup = NULL;
 
/* processing master/backup loop here */
if (state == IP_VS_STATE_MASTER)
@@ -801,6 +814,14 @@ static int sync_thread(void *startup)
remove_wait_queue(sync_wait, wait);
 
/* thread exits */
+
+   /*
+* If we weren't explicitly stopped, then we
+* exited in error, and should undo our state
+*/
+   if ((!stop_master_sync)  (!stop_backup_sync))
+   ip_vs_sync_state -= tinfo-state;
+
set_sync_pid(state, 0);
IP_VS_INFO(sync thread stopped!\n);
 
@@ -812,6 +833,11 @@ static int sync_thread(void *startup)
set_stop_sync(state, 0);
wake_up(stop_sync_wait);
 
+   /*
+* we need to free the structure that was allocated 
+* for us in start_sync_thread
+*/
+   kfree(tinfo);
return 0;
 }
 
@@ -838,11 +864,19 @@ int start_sync_thread(int state, char *mcast_ifn, __u8 
syncid)
 {
DECLARE_COMPLETION_ONSTACK(startup);
pid_t pid;
+   struct ip_vs_sync_thread_data *tinfo;
 
if ((state == IP_VS_STATE_MASTER  sync_master_pid) ||
(state == IP_VS_STATE_BACKUP  sync_backup_pid))
return -EEXIST;
 
+   /*
+* Note that tinfo will be freed in sync_thread on exit
+*/
+   tinfo = kmalloc(sizeof(struct ip_vs_sync_thread_data), GFP_KERNEL);
+   if (!tinfo)
+   return -ENOMEM;
+
IP_VS_DBG(7, %s: pid %d\n, __FUNCTION__, current-pid);
IP_VS_DBG(7, Each ip_vs_sync_conn entry need %Zd bytes\n,
  sizeof(struct ip_vs_sync_conn));
@@ -858,8 +892,11 @@ int start_sync_thread(int state, char *mcast_ifn, __u8 
syncid)
ip_vs_backup_syncid = syncid;
}
 
+   tinfo-state = state;
+   tinfo-startup = startup;
+
   repeat:
-   if ((pid = kernel_thread(fork_sync_thread, startup, 0))  0) {
+   if ((pid = kernel_thread(fork_sync_thread, tinfo, 0))  0) {
IP_VS_ERR(could not create fork_sync_thread due to %d... 
  retrying.\n, pid);
msleep_interruptible(1000);
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] NET: Multiqueue network device support.

2007-06-08 Thread Waskiewicz Jr, Peter P
 I thought the correct use is to get this lock on clean_tx 
 side which can get called on a different cpu on rx (which 
 also cleans up slots for skbs that have finished xmit). Both 
 TX and clean_tx uses the same tx_ring's head/tail ptrs and 
 should be exclusive. But I don't find clean tx using this 
 lock in the code, so I am confused :-)

From e1000_main.c, e1000_clean():

/* e1000_clean is called per-cpu.  This lock protects
 * tx_ring[0] from being cleaned by multiple cpus
 * simultaneously.  A failure obtaining the lock means
 * tx_ring[0] is currently being cleaned anyway. */
if (spin_trylock(&adapter->tx_queue_lock)) {
        tx_cleaned = e1000_clean_tx_irq(adapter,
                                        &adapter->tx_ring[0]);
        spin_unlock(&adapter->tx_queue_lock);
}

In a multi-ring implementation of the driver, this is wrapped with for
(i = 0; i < adapter->num_tx_queues; i++) and adapter->tx_ring[i].  This
lock also prevents the clean routine from stomping on xmit_frame() when
transmitting.  Also in the multi-ring implementation, the tx_lock is
pushed down into the individual tx_ring struct, not at the adapter
level.
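
In other words, the multi-ring clean path ends up looking roughly like
this (a sketch only -- the per-ring lock and field names are
approximations, not mainline e1000):

	int i;
	boolean_t tx_cleaned = FALSE;

	for (i = 0; i < adapter->num_tx_queues; i++) {
		struct e1000_tx_ring *tx_ring = &adapter->tx_ring[i];

		/* skip rings another cpu is already cleaning */
		if (spin_trylock(&tx_ring->tx_queue_lock)) {
			tx_cleaned |= e1000_clean_tx_irq(adapter, tx_ring);
			spin_unlock(&tx_ring->tx_queue_lock);
		}
	}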

Cheers,

-PJ Waskiewicz
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


SKY2 vs SK98LIN performance on 88E8053 MAC

2007-06-08 Thread Philip Romanov
Hello!
  
We are observing severe IPv4 forwarding degradation 
when switching from sk98lin to sky2 driver. Setup:
plain 2.6.21.3 kernel, 88E8053 Marvell Yukon2 MAC,
sk98lin is @revision 8.41.2.3 coming from FC6, SKY2
driver from 2.6.21.3 kernel, both drivers are in NAPI
mode. 
 
Benchmarks are done using bidirectional traffic 
generated by IXIA, sending 256-byte packets. Observed 
packet throughput is almost 30% higher with the sk98lin
driver. 
 
Ethernet flow control is turned off in SKY2 driver 
(hard-coded as off, we know about this problem).  
 
I also have oprofile records of the drivers in case 
anybody is interested.
 
Please share info if you know anything on SKY2
performance bottlenecks.
 
 
 Thanks in advance,
 
Philip R.
 



  



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 09:07:47AM -0400, jamal ([EMAIL PROTECTED]) wrote:
  Something, that anyone can understand :)
  For example /proc stats, although it is not very accurate, but it is
  really usable parameter from userspace point ov view.
 
 which /proc stats?

/proc/$pid/stat; for pktgen it is likely not that interesting, but for
a usual userspace application it is quite an interesting parameter.
At least that is what 'top' shows.
 
-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: networking busted in current -git ???

2007-06-08 Thread Trond Myklebust
On Fri, 2007-06-08 at 23:07 +0200, Arkadiusz Miskiewicz wrote:
 On Friday 08 of June 2007, you wrote:
  Hello,
 
  I am using the current git tree: 85f6038f2170e3335dda09c3dfb0f83110e87019 .
  Git tree from two days ago (with the same config) works fine.
 
  Attempting to acquire an IP address via DHCP fails with:
 
  SIOCSIFADDR: No buffer space available
  Listening on LPF/eth0/00:19:b9:0c:9a:43
  Sending on   LPF/eth0/00:19:b9:0c:9a:43
  Sending on   Socket/fallback
  DHCPREQUEST on eth0 to 255.255.255.255 port 67
  DHCPACK from xxx.xxx.xxx.xxx
  SIOCSIFADDR: No buffer space available
  SIOCSIFNETMASK: Cannot assign requested address
  SIOCSIFBRDADDR: Cannot assign requested address
  SIOCADDRT: Network is unreachable
  bound to xxx.xxx.xxx.xxx -- renewal in 98610 seconds.
 
  This is on a Dell 490 with tg3 network driver running Ubuntu 7.04 .
  .config and dmesg are appended.
 
  florin
 
 Here it requires few retries (stop dhcpcd, start again) to get the IP. git 
 tree from few hours ago. tg3 driver. I also saw SIOCSIFADDR: No buffer space 
 available once.

(added netdev to the Cc list)

It is not dhcp. I'm seeing the same bug with bog-standard ifup with a
static address on an FC-6 machine.

It appears to be something in the latest dump from davem to Linus, but I
haven't yet had time to identify what.

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family

2007-06-08 Thread Auke Kok
A lot of netdevice drivers implement their own variant of printk and
use variations of dev_printk, printk or others that use msg_enable,
which has been an eyesore with countless variations across drivers.

This patch implements a standard ndev_printk and derivatives
such as ndev_err, ndev_info, ndev_warn that allows drivers to
transparently use both the msg_enable and a generic netdevice
message layout. It moves the msg_enable over to the net_device
struct and allows drivers to obsolete ethtool handling code of
the msg_enable value.

The current code has each driver contain a copy of msg_enable and
handle the setting/changing through ethtool that way. Since the
netdev name is stored in the net_device struct, those two are
not coherently available in a uniform way across all drivers (a
single macro or function would not work since all drivers name
their net_device members differently). This makes netdevice
driver writers reinvent the wheel over and over again.

It thus makes sense to move msg_enable to the net_device. This
gives us the opportunity to (1) initialize it by default with a
globally sane value, (2) remove msg_enable handling code w.r.t.
ethtool for drivers that know and use the msg_enable member
of the net_device struct, and (3) let the ethtool code just modify the
net_device msg_enable for drivers that do not have custom
msg_enable get/set handlers, so converted drivers lose some
code for that as well.
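
As an illustration of the intended call sites (not part of this patch
itself), a converted driver then just does something like:

	if (err)
		ndev_err(PROBE, netdev, "failed to reset hardware, err %d\n",
			 err);

with no driver-private printk wrapper and no private msg_enable copy.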

Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 include/linux/netdevice.h |   24 
 net/core/dev.c|   10 ++
 net/core/ethtool.c|   14 +++---
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3a70f55..5551b63 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -540,6 +540,8 @@ struct net_device
struct device   dev;
/* space for optional statistics and wireless sysfs groups */
struct attribute_group  *sysfs_groups[3];
+
+   int msg_enable;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -838,6 +840,28 @@ enum {
NETIF_MSG_WOL   = 0x4000,
 };
 
+#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
+	do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) { \
+		printk(kern_level "%s: " format, \
+			(netdev)->name, ## arg); } } while (0)
+
+#ifdef DEBUG
+#define ndev_dbg(level, netdev, format, arg...) \
+	ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
+#else
+#define ndev_dbg(level, netdev, format, arg...) \
+	do { (void)(netdev); } while (0)
+#endif
+
+#define ndev_err(level, netdev, format, arg...) \
+	ndev_printk(KERN_ERR, level, netdev, format, ## arg)
+#define ndev_info(level, netdev, format, arg...) \
+	ndev_printk(KERN_INFO, level, netdev, format, ## arg)
+#define ndev_warn(level, netdev, format, arg...) \
+	ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
+#define ndev_notice(level, netdev, format, arg...) \
+	ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
+
 #define netif_msg_drv(p)	((p)->msg_enable & NETIF_MSG_DRV)
 #define netif_msg_probe(p)	((p)->msg_enable & NETIF_MSG_PROBE)
 #define netif_msg_link(p)	((p)->msg_enable & NETIF_MSG_LINK)
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a7f20f..e854c09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	dev->priv = netdev_priv(dev);
 
 	dev->get_stats = internal_stats;
+	dev->msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
+#ifdef DEBUG
+	/* put these to good use: */
+	dev->msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
+			   NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
+			   NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
+			   NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
+			   NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
+			   NETIF_MSG_HW | NETIF_MSG_WOL;
+#endif
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 8d5e5a0..ff8d52f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -234,9 +234,9 @@ static int ethtool_get_msglevel(struct net_device *dev, char __user *useraddr)
 	struct ethtool_value edata = { ETHTOOL_GMSGLVL };
 
 	if (!dev->ethtool_ops->get_msglevel)
-		return -EOPNOTSUPP;
-
-	edata.data = dev->ethtool_ops->get_msglevel(dev);
+		edata.data = dev->msg_enable;
+	else
+		edata.data = dev->ethtool_ops->get_msglevel(dev);
 
 	if (copy_to_user(useraddr, &edata, sizeof(edata)))
 		return -EFAULT;
@@ -247,13 +247,13 @@ static int ethtool_set_msglevel(struct net_device *dev, char __user *useraddr)
 {

[PATCH 2/2] [RFC] NET: Convert several drivers to ndev_printk

2007-06-08 Thread Auke Kok
With the generic ndev_printk macros, we can now convert network
drivers to use this generic printk family for netdevices.

Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e100.c|  121 +++--
 drivers/net/e1000/e1000.h |   15 -
 drivers/net/e1000/e1000_ethtool.c |   39 
 drivers/net/e1000/e1000_main.c|  101 +++
 drivers/net/e1000/e1000_param.c   |   67 ++--
 drivers/net/ixgb/ixgb.h   |   14 
 drivers/net/ixgb/ixgb_ethtool.c   |   15 -
 drivers/net/ixgb/ixgb_main.c  |   46 ++
 8 files changed, 166 insertions(+), 252 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 6ca0a08..56e7504 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -172,19 +172,12 @@ MODULE_AUTHOR(DRV_COPYRIGHT);
 MODULE_LICENSE(GPL);
 MODULE_VERSION(DRV_VERSION);
 
-static int debug = 3;
 static int eeprom_bad_csum_allow = 0;
 static int use_io = 0;
-module_param(debug, int, 0);
 module_param(eeprom_bad_csum_allow, int, 0);
 module_param(use_io, int, 0);
-MODULE_PARM_DESC(debug, Debug level (0=none,...,16=all));
 MODULE_PARM_DESC(eeprom_bad_csum_allow, Allow bad eeprom checksums);
 MODULE_PARM_DESC(use_io, Force use of i/o access mode);
-#define DPRINTK(nlevel, klevel, fmt, args...) \
-   (void)((NETIF_MSG_##nlevel  nic-msg_enable)  \
-   printk(KERN_##klevel PFX %s: %s:  fmt, nic-netdev-name, \
-   __FUNCTION__ , ## args))
 
 #define INTEL_8255X_ETHERNET_DEVICE(device_id, ich) {\
PCI_VENDOR_ID_INTEL, device_id, PCI_ANY_ID, PCI_ANY_ID, \
@@ -644,12 +637,12 @@ static int e100_self_test(struct nic *nic)
 
/* Check results of self-test */
if(nic-mem-selftest.result != 0) {
-   DPRINTK(HW, ERR, Self-test failed: result=0x%08X\n,
+   ndev_err(HW, nic-netdev, Self-test failed: result=0x%08X\n,
nic-mem-selftest.result);
return -ETIMEDOUT;
}
if(nic-mem-selftest.signature == 0) {
-   DPRINTK(HW, ERR, Self-test failed: timed out\n);
+   ndev_err(HW, nic-netdev, Self-test failed: timed out\n);
return -ETIMEDOUT;
}
 
@@ -753,7 +746,7 @@ static int e100_eeprom_load(struct nic *nic)
 * the sum of words should be 0xBABA */
checksum = le16_to_cpu(0xBABA - checksum);
if(checksum != nic-eeprom[nic-eeprom_wc - 1]) {
-   DPRINTK(PROBE, ERR, EEPROM corrupted\n);
+   ndev_err(PROBE, nic-netdev, EEPROM corrupted\n);
if (!eeprom_bad_csum_allow)
return -EAGAIN;
}
@@ -908,7 +901,7 @@ static u16 mdio_ctrl(struct nic *nic, u32 addr, u32 dir, 
u32 reg, u16 data)
break;
}
spin_unlock_irqrestore(nic-mdio_lock, flags);
-   DPRINTK(HW, DEBUG,
+   ndev_dbg(HW, nic-netdev,
%s:addr=%d, reg=%d, data_in=0x%04X, data_out=0x%04X\n,
dir == mdi_read ? READ : WRITE, addr, reg, data, data_out);
return (u16)data_out;
@@ -960,8 +953,8 @@ static void e100_get_defaults(struct nic *nic)
 static void e100_configure(struct nic *nic, struct cb *cb, struct sk_buff *skb)
 {
struct config *config = cb-u.config;
-   u8 *c = (u8 *)config;
-
+   u8 *c;
+   
cb-command = cpu_to_le16(cb_config);
 
memset(config, 0, sizeof(struct config));
@@ -1021,12 +1014,16 @@ static void e100_configure(struct nic *nic, struct cb 
*cb, struct sk_buff *skb)
config-standard_stat_counter = 0x0;
}
 
-   DPRINTK(HW, DEBUG, [00-07]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n,
-   c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
-   DPRINTK(HW, DEBUG, [08-15]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n,
-   c[8], c[9], c[10], c[11], c[12], c[13], c[14], c[15]);
-   DPRINTK(HW, DEBUG, [16-23]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n,
-   c[16], c[17], c[18], c[19], c[20], c[21], c[22], c[23]);
+   c = (u8 *)config;
+   ndev_dbg(HW, nic-netdev,
+[00-07]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n,
+c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
+   ndev_dbg(HW, nic-netdev,
+[08-15]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n,
+c[8], c[9], c[10], c[11], c[12], c[13], c[14], c[15]);
+   ndev_dbg(HW, nic-netdev,
+[16-23]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n,
+c[16], c[17], c[18], c[19], c[20], c[21], c[22], c[23]);
 }
 
 //
@@ -1296,7 +1293,7 @@ static inline int e100_exec_cb_wait(struct nic *nic, 
struct sk_buff *skb,
struct cb *cb = nic-cb_to_clean;
 
if ((err = e100_exec_cb(nic, NULL, e100_setup_ucode)))
-   DPRINTK(PROBE,ERR, ucode cmd failed with error %d\n, err);
+ 

Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family

2007-06-08 Thread Stephen Hemminger

  
 +#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
 + do { if ((netdev)-msg_enable  NETIF_MSG_##netif_level) { \
 + printk(kern_level %s:  format, \
 + (netdev)-name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?

 +#ifdef DEBUG
 +#define ndev_dbg(level, netdev, format, arg...) \
 + ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
 +#else
 +#define ndev_dbg(level, netdev, format, arg...) \
 + do { (void)(netdev); } while (0)
 +#endif
 +
 +#define ndev_err(level, netdev, format, arg...) \
 + ndev_printk(KERN_ERR, level, netdev, format, ## arg)
 +#define ndev_info(level, netdev, format, arg...) \
 + ndev_printk(KERN_INFO, level, netdev, format, ## arg)
 +#define ndev_warn(level, netdev, format, arg...) \
 + ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
 +#define ndev_notice(level, netdev, format, arg...) \
 + ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
 +
  #define netif_msg_drv(p) ((p)-msg_enable  NETIF_MSG_DRV)
  #define netif_msg_probe(p)   ((p)-msg_enable  NETIF_MSG_PROBE)
  #define netif_msg_link(p)((p)-msg_enable  NETIF_MSG_LINK)
 diff --git a/net/core/dev.c b/net/core/dev.c
 index 5a7f20f..e854c09 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, const 
 char *name,
   dev-priv = netdev_priv(dev);
  
   dev-get_stats = internal_stats;
 + dev-msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
 +#ifdef DEBUG
 + /* put these to good use: */
 + dev-msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
 +NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
 +NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
 +NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
 +NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
 +NETIF_MSG_HW | NETIF_MSG_WOL;
 +#endif

Let driver writer choose message enable bits please.



-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SKY2 vs SK98LIN performance on 88E8053 MAC

2007-06-08 Thread Stephen Hemminger
On Fri, 8 Jun 2007 13:41:55 -0700 (PDT)
Philip Romanov [EMAIL PROTECTED] wrote:

 Hello!
   
 We are observing severe IPv4 forwarding degradation 
 when switching from sk98lin to sky2 driver. Setup:
 plain 2.6.21.3 kernel, 88E8053 Marvell Yukon2 MAC,
 sk98lin is @revision 8.41.2.3 coming from FC6, SKY2
 driver from 2.6.21.3 kernel, both drivers are in NAPI
 mode. 
  
 Benchmarks are done using bidirectional traffic 
 generated by IXIA, sending 256-byte packets. Observed 
 packet throughput is almost 30% higher with sklin98
 driver. 
  
 Ethernet flow control is turned off in SKY2 driver 
 (hard-coded as off, we know about this problem).  
  
 I also have oprofile records of the drivers in case 
 anybody is interested.
  
 Please share info if you know anything on SKY2
 performance bottlenecks.
  

I'm surprised? The vendor driver has bogus extra locking and other
crap. Please send profile data.  Flow control should work on sky2 (now).

Are you routing or doing real TCP transfers?

-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family

2007-06-08 Thread Kok, Auke

Stephen Hemminger wrote:
 
+#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \

+   do { if ((netdev)-msg_enable  NETIF_MSG_##netif_level) { \
+   printk(kern_level %s:  format, \
+   (netdev)-name, ## arg); } } while (0)


Could you make a version that doesn't evaluate the arguments twice?


hmm you lost me there a bit; Do you want me to duplicate this code for all the 
ndev_err/ndev_info functions instead so that ndev_err doesn't direct back to 
ndev_printk?



+#ifdef DEBUG
+#define ndev_dbg(level, netdev, format, arg...) \
+   ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
+#else
+#define ndev_dbg(level, netdev, format, arg...) \
+   do { (void)(netdev); } while (0)
+#endif
+
+#define ndev_err(level, netdev, format, arg...) \
+   ndev_printk(KERN_ERR, level, netdev, format, ## arg)
+#define ndev_info(level, netdev, format, arg...) \
+   ndev_printk(KERN_INFO, level, netdev, format, ## arg)
+#define ndev_warn(level, netdev, format, arg...) \
+   ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
+#define ndev_notice(level, netdev, format, arg...) \
+   ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
+
 #define netif_msg_drv(p)   ((p)-msg_enable  NETIF_MSG_DRV)
 #define netif_msg_probe(p) ((p)-msg_enable  NETIF_MSG_PROBE)
 #define netif_msg_link(p)  ((p)-msg_enable  NETIF_MSG_LINK)
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a7f20f..e854c09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, const 
char *name,
dev-priv = netdev_priv(dev);
 
 	dev-get_stats = internal_stats;

+   dev-msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
+#ifdef DEBUG
+   /* put these to good use: */
+   dev-msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
+  NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
+  NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
+  NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
+  NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
+  NETIF_MSG_HW | NETIF_MSG_WOL;
+#endif


Let driver writer choose message enable bits please.


The driver can: since these bits are set in alloc_netdev, nothing prevents a
driver from setting the mask immediately afterwards. Putting in a sane default
seems like a good idea and good practice.
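
For instance, a driver that wants a different set can simply do
(the exact bits below are just an example):

	/* in the driver's probe path, right after alloc_etherdev() */
	netdev->msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE |
			     NETIF_MSG_LINK | NETIF_MSG_TX_ERR |
			     NETIF_MSG_RX_ERR;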


Maybe I went a bit far by going all out on the DEBUG flags tho... perhaps those 
can be removed or only NETIF_MSG_RX_ERR and NETIF_MSG_TX_ERR set with DEBUG.


Auke
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel (v.2)

2007-06-08 Thread Ben Greear

Carl-Daniel Hailfinger wrote:

On 08.06.2007 19:00, Ben Greear wrote:

I have another sysfs patch that allows setting a default skb-mark for
an interface so that you can set the skb-mark
before it hits the connection tracking logic, but I'm been told this one
has very little chance
of getting into the kernel.  The skb-mark patch is only useful (as far
as I can tell) if you
also include a patch Patrick McHardy did for me that allowed the
conn-tracking logic to
use skb-mark as part of it's tuple.  This allows me to do NAT between
virtual routers
(routing tables) on the same machine using veth-equivalent drivers to
connect the
routers.  He thinks this will probably not ever get into the kernel either.


Are these patches available somewhere? I'm currently doing NAT between
virtual routers by some advanced iproute2/iptables trickery, but I have
no way to handle the occasional tuple conflict.


A consolidated patch against 2.6.20.12 is here.  It has a lot more than
just the patches mentioned above, but it shouldn't hurt anything to have
the whole patch applied:

http://www.candelatech.com/oss/candela_2.6.20.patch

The original patch for using skb-mark as a tuple was
written by Patrick McHardy, and is here:

http://www.candelatech.com/oss/skb_mark_conntrack.patch

His patch merged with my patch to sysfs to set skb->mark on ingress is here:
http://www.candelatech.com/oss/conntrack_mark_with_ssyctl.patch


Thanks,
Ben




Regards,
Carl-Daniel



--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RFC: Support send-to-self over external interfaces (and veths).

2007-06-08 Thread Ben Greear

This should also be useful with the pending 'veth' driver, as it
emulates two ethernet ports connected with a cross-over cable.

To make this work, you have to enable the sysctl (look Dave,
no IOCTLS, there might be hope for me yet!! :)), and in your
application you will need to use SO_BINDTODEVICE (and probably bind to
the local IP as well).  Some applications such as traceroute already
support this binding..others such as ping do not.
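
The application side then looks something like this (a sketch -- the
device name and address are placeholders, and SO_BINDTODEVICE needs
root/CAP_NET_RAW):

	#include <string.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <net/if.h>

	static int bound_socket(void)
	{
		int fd = socket(AF_INET, SOCK_DGRAM, 0);
		struct ifreq ifr;
		struct sockaddr_in local;

		/* tie the socket to one interface */
		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, "eth1", IFNAMSIZ - 1);
		setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, &ifr, sizeof(ifr));

		/* and bind to that interface's local IP as well */
		memset(&local, 0, sizeof(local));
		local.sin_family = AF_INET;
		inet_aton("192.168.1.2", &local.sin_addr);
		bind(fd, (struct sockaddr *)&local, sizeof(local));
		return fd;
	}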

You most likely will also have to set up routing tables using
source IPs as a rule to direct these connections to a particular
routing table.

Comments welcome.

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index c0f7aec..88f78b6 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -31,6 +31,7 @@ struct ipv4_devconf
 	int	no_policy;
 	int	force_igmp_version;
 	int	promote_secondaries;
+	int	accept_sts;
 	void	*sysctl;
 };
 
@@ -84,6 +85,7 @@ struct in_device
 #define IN_DEV_ARPFILTER(in_dev)	(ipv4_devconf.arp_filter || (in_dev)-cnf.arp_filter)
 #define IN_DEV_ARP_ANNOUNCE(in_dev)	(max(ipv4_devconf.arp_announce, (in_dev)-cnf.arp_announce))
 #define IN_DEV_ARP_IGNORE(in_dev)	(max(ipv4_devconf.arp_ignore, (in_dev)-cnf.arp_ignore))
+#define IN_DEV_ACCEPT_STS(in_dev)  (max(ipv4_devconf.accept_sts, (in_dev)-cnf.accept_sts))
 
 struct in_ifaddr
 {
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 47f1c53..6c00bf4 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -496,6 +496,7 @@ enum
 	NET_IPV4_CONF_ARP_IGNORE=19,
 	NET_IPV4_CONF_PROMOTE_SECONDARIES=20,
 	NET_IPV4_CONF_ARP_ACCEPT=21,
+	NET_IPV4_CONF_ACCEPT_STS=22,
 	__NET_IPV4_CONF_MAX
 };
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 7110779..9866f1b 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -419,6 +419,26 @@ static int arp_ignore(struct in_device *in_dev, struct net_device *dev,
 	return !inet_confirm_addr(dev, sip, tip, scope);
 }
 
+static int is_ip_on_dev(struct net_device* dev, __u32 ip) {
+  int rv = 0;
+  struct in_device* in_dev = in_dev_get(dev);
+  if (in_dev) {
+  struct in_ifaddr *ifa;
+
+  rcu_read_lock();
+  for (ifa = in_dev-ifa_list; ifa; ifa = ifa-ifa_next) {
+  if (ifa-ifa_address == ip) {
+  /* match */
+  rv = 1;
+  break;
+  }
+  }
+  rcu_read_unlock();
+  in_dev_put(in_dev);
+  }
+  return rv;
+}
+
 static int arp_filter(__be32 sip, __be32 tip, struct net_device *dev)
 {
 	struct flowi fl = { .nl_u = { .ip4_u = { .daddr = sip,
@@ -430,8 +450,38 @@ static int arp_filter(__be32 sip, __be32 tip, struct net_device *dev)
 	if (ip_route_output_key(rt, fl)  0)
 		return 1;
 	if (rt-u.dst.dev != dev) {
-		NET_INC_STATS_BH(LINUX_MIB_ARPFILTER);
-		flag = 1;
+		struct in_device *in_dev = in_dev_get(dev);
+		if (in_dev  IN_DEV_ACCEPT_STS(in_dev) 
+		(rt-u.dst.dev == loopback_dev))  {
+			/* Accept these IFF target-ip == dev's IP */
+			/* TODO:  Need to force the ARP response back out the interface
+			 * instead of letting it route locally.
+			 */
+			
+			if (is_ip_on_dev(dev, tip)) {
+/* OK, we'll let this special case slide, so that we can
+ * arp from one local interface to another.  This seems
+ * to work, but could use some review. --Ben
+ */
+/*printk(arp_filter, sip: %x tip: %x  dev: %s, STS override (ip on dev)\n,
+  sip, tip, dev-name);*/
+			}
+			else {
+/*printk(arp_filter, sip: %x tip: %x  dev: %s, IP is NOT on dev\n,
+  sip, tip, dev-name);*/
+NET_INC_STATS_BH(LINUX_MIB_ARPFILTER);
+flag = 1;
+			}
+		}
+		else {
+			/*printk(arp_filter, not lpbk  sip: %x tip: %x  dev: %s  flgs: %hx  dst.dev: %p  lbk: %p\n,
+			  sip, tip, dev-name, dev-priv_flags, rt-u.dst.dev, loopback_dev);*/
+			NET_INC_STATS_BH(LINUX_MIB_ARPFILTER);
+			flag = 1;
+		}
+		if (in_dev) {
+			in_dev_put(in_dev);
+		}
 	}
 	ip_rt_put(rt);
 	return flag;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 7f95e6e..33ac2ed 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1513,6 +1513,15 @@ static struct devinet_sysctl_table {
 			.proc_handler	= ipv4_doint_and_flush,
 			.strategy	= ipv4_doint_and_flush_strategy,
 		},
+		{
+			.ctl_name   = NET_IPV4_CONF_ACCEPT_STS,
+			.procname   = accept_sts,
+			.data   = ipv4_devconf.accept_sts,
+			.maxlen = sizeof(int),
+			.mode   = 0644,
+			.proc_handler   = proc_dointvec,
+		},
+
 	},
 	.devinet_dev = {
 		{
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 837f295..9b57bf5 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -206,8 +206,16 @@ int fib_validate_source(__be32 src, __be32 dst, u8 

Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family

2007-06-08 Thread Stephen Hemminger
On Fri, 08 Jun 2007 16:42:31 -0700
Kok, Auke [EMAIL PROTECTED] wrote:

 Stephen Hemminger wrote:
   
  +#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
  +  do { if ((netdev)-msg_enable  NETIF_MSG_##netif_level) { \
  +  printk(kern_level %s:  format, \
  +  (netdev)-name, ## arg); } } while (0)
  
  Could you make a version that doesn't evaluate the arguments twice?
 
 hmm you lost me there a bit; Do you want me to duplicate this code for all 
 the 
 ndev_err/ndev_info functions instead so that ndev_err doesn't direct back to 
 ndev_printk?

It is good practice in a macro to avoid potential problems with usage
by only touching the arguments once. Otherwise, something (bogus) like
	ndev_printk(KERN_DEBUG, NETIF_MSG_PKTDATA, "got %d\n",
		    dev++, skb->len)
would increment dev twice.
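
Something along these lines only touches it once (untested sketch):

	#define ndev_printk(kern_level, netif_level, netdev, format, arg...)	\
	do {									\
		struct net_device *_nd = (netdev);	/* evaluate once */	\
		if (_nd->msg_enable & NETIF_MSG_##netif_level)			\
			printk(kern_level "%s: " format, _nd->name, ## arg);	\
	} while (0)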

My preference would be something more like dev_printk or even use that?
You want to show both device name, and physical attachment in the message.

  +#ifdef DEBUG
  +#define ndev_dbg(level, netdev, format, arg...) \
  +  ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
  +#else
  +#define ndev_dbg(level, netdev, format, arg...) \
  +  do { (void)(netdev); } while (0)
  +#endif
  +
  +#define ndev_err(level, netdev, format, arg...) \
  +  ndev_printk(KERN_ERR, level, netdev, format, ## arg)
  +#define ndev_info(level, netdev, format, arg...) \
  +  ndev_printk(KERN_INFO, level, netdev, format, ## arg)
  +#define ndev_warn(level, netdev, format, arg...) \
  +  ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
  +#define ndev_notice(level, netdev, format, arg...) \
  +  ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
  +
   #define netif_msg_drv(p)  ((p)-msg_enable  NETIF_MSG_DRV)
   #define netif_msg_probe(p)((p)-msg_enable  NETIF_MSG_PROBE)
   #define netif_msg_link(p) ((p)-msg_enable  NETIF_MSG_LINK)
  diff --git a/net/core/dev.c b/net/core/dev.c
  index 5a7f20f..e854c09 100644
  --- a/net/core/dev.c
  +++ b/net/core/dev.c
  @@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, 
  const char *name,
 dev-priv = netdev_priv(dev);
   
 dev-get_stats = internal_stats;
  +  dev-msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
  +#ifdef DEBUG
  +  /* put these to good use: */
  +  dev-msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
  + NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
  + NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
  + NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
  + NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
  + NETIF_MSG_HW | NETIF_MSG_WOL;
  +#endif
  
  Let driver writer choose message enable bits please.
 
 the driver can, since these bits are set in alloc_netdev, nothing prevents a 
 driver from setting the mask immediately afterwards. Putting in a sane 
 default 
 seems a good idea and good practice.
 
 Maybe I went a bit far by going all out on the DEBUG flags tho... perhaps 
 those 
 can be removed or only NETIF_MSG_RX_ERR and NETIF_MSG_TX_ERR set with DEBUG.
 
 Auke


-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread jamal
On Fri, 2007-08-06 at 10:27 -0700, Rick Jones wrote:

[..]

 you cannot take the netperf service demand directly - each netperf is 
 calculating assuming that it is the only thing running on the system. 
 It then ass-u-me-s that the CPU util it measured was all for its work. 
 This means the service demand figure will be quite higher than it really is.
 
 So, for aggregate tests using netperf2, one has to calculate service 
 demand by hand.  Sum the throughput as KB/s, convert the CPU util and 
 number of CPUs to a microseconds of CPU consumed per second and divide 
 to get microseconds per KB for the aggregate.

From what you are saying above, it seems to me that for more than one proc
it is safe to just run netperf4 instead of netperf2?

It also seems reasonable to set up large socket buffers on the receiver.

cheers,
jamal



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] NET: Multiqueue network device support.

2007-06-08 Thread jamal
On Fri, 2007-08-06 at 12:55 -0700, Waskiewicz Jr, Peter P wrote:
  I thought the correct use is to get this lock on clean_tx 
  side which can get called on a different cpu on rx (which 
  also cleans up slots for skbs that have finished xmit). Both 
  TX and clean_tx uses the same tx_ring's head/tail ptrs and 
  should be exclusive. But I don't find clean tx using this 
  lock in the code, so I am confused :-)
 
 From e1000_main.c, e1000_clean():
 
 /* e1000_clean is called per-cpu.  This lock protects
  * tx_ring[0] from being cleaned by multiple cpus
  * simultaneously.  A failure obtaining the lock means
  * tx_ring[0] is currently being cleaned anyway. */
 if (spin_trylock(adapter-tx_queue_lock)) {
 tx_cleaned = e1000_clean_tx_irq(adapter,
 adapter-tx_ring[0]);
 spin_unlock(adapter-tx_queue_lock);
 }

Are you saying there's no problem because the adapter->tx_queue_lock is
being held?

 In a multi-ring implementation of the driver, this is wrapped with for
 (i = 0; i  adapter-num_tx_queues; i++) and adapter-tx_ring[i].  This
 lock also prevents the clean routine from stomping on xmit_frame() when
 transmitting.  Also in the multi-ring implementation, the tx_lock is
 pushed down into the individual tx_ring struct, not at the adapter
 level.

That sounds right - but the adapter lock is not related to tx_lock in
current e1000, correct?

cheers,
jamal


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [WIP][PATCHES] Network xmit batching

2007-06-08 Thread Rick Jones

jamal wrote:

On Fri, 2007-08-06 at 10:27 -0700, Rick Jones wrote:

[..]


you cannot take the netperf service demand directly - each netperf is 
calculating assuming that it is the only thing running on the system. 
It then ass-u-me-s that the CPU util it measured was all for its work. 
This means the service demand figure will be quite higher than it really is.


So, for aggregate tests using netperf2, one has to calculate service 
demand by hand.  Sum the throughput as KB/s, convert the CPU util and 
number of CPUs to a microseconds of CPU consumed per second and divide 
to get microseconds per KB for the aggregate.



From what you are saying above seems to me that for more than one proc
it is safe to just run netperf4 instead of netperf2?


Well, it is easier to be safe on aggregates with netperf4 than netperf2 
although at present it is more difficult to run netperf4 than netperf2




It also seems reasonable to set up large socket buffers on the receiver.


For bulk transfers I often do.

rick
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] [IrDA] Updates for net-2.6

2007-06-08 Thread samuel
Hi Dave,

These 2 patches are bug fixes and should thus be considered for net-2.6
inclusion.

Cheers,
Samuel.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] [IrDA] Fix Rx/Tx path race

2007-06-08 Thread samuel
From: G. Liakhovetski [EMAIL PROTECTED]

We need to switch to NRM _before_ sending the final packet otherwise we might 
hit a race condition where we get the first packet from the peer while we're 
still in LAP_XMIT_P.

Cc: G. Liakhovetski [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]
---
 include/net/irda/irlap.h |   17 +
 net/irda/irlap_event.c   |   18 --
 net/irda/irlap_frame.c   |3 +++
 3 files changed, 20 insertions(+), 18 deletions(-)

Index: net-2.6-quilt/include/net/irda/irlap.h
===
--- net-2.6-quilt.orig/include/net/irda/irlap.h 2007-05-10 19:23:04.0 
+0300
+++ net-2.6-quilt/include/net/irda/irlap.h  2007-05-10 19:24:57.0 
+0300
@@ -289,4 +289,21 @@
self-disconnect_pending = FALSE;
 }
 
+/*
+ * Function irlap_next_state (self, state)
+ *
+ *Switches state and provides debug information
+ *
+ */
+static inline void irlap_next_state(struct irlap_cb *self, IRLAP_STATE state)
+{
+   /*
+   if (!self || self-magic != LAP_MAGIC)
+   return;
+
+   IRDA_DEBUG(4, next LAP state = %s\n, irlap_state[state]);
+   */
+   self-state = state;
+}
+
 #endif
Index: net-2.6-quilt/net/irda/irlap_event.c
===
--- net-2.6-quilt.orig/net/irda/irlap_event.c   2007-05-10 19:23:04.0 
+0300
+++ net-2.6-quilt/net/irda/irlap_event.c2007-05-10 19:23:09.0 
+0300
@@ -317,23 +317,6 @@
 }
 
 /*
- * Function irlap_next_state (self, state)
- *
- *Switches state and provides debug information
- *
- */
-static inline void irlap_next_state(struct irlap_cb *self, IRLAP_STATE state)
-{
-   /*
-   if (!self || self-magic != LAP_MAGIC)
-   return;
-
-   IRDA_DEBUG(4, next LAP state = %s\n, irlap_state[state]);
-   */
-   self-state = state;
-}
-
-/*
  * Function irlap_state_ndm (event, skb, frame)
  *
  *NDM (Normal Disconnected Mode) state
@@ -1086,7 +1069,6 @@
} else {
/* Final packet of window */
irlap_send_data_primary_poll(self, skb);
-   irlap_next_state(self, LAP_NRM_P);
 
/*
 * Make sure state machine does not try to send
Index: net-2.6-quilt/net/irda/irlap_frame.c
===
--- net-2.6-quilt.orig/net/irda/irlap_frame.c   2007-05-10 19:23:04.0 
+0300
+++ net-2.6-quilt/net/irda/irlap_frame.c2007-05-10 19:25:59.0 
+0300
@@ -798,16 +798,19 @@
self-vs = (self-vs + 1) % 8;
self-ack_required = FALSE;
 
+   irlap_next_state(self, LAP_NRM_P);
irlap_send_i_frame(self, tx_skb, CMD_FRAME);
} else {
IRDA_DEBUG(4, %s(), sending unreliable frame\n, __FUNCTION__);
 
if (self-ack_required) {
irlap_send_ui_frame(self, skb_get(skb), self-caddr, 
CMD_FRAME);
+   irlap_next_state(self, LAP_NRM_P);
irlap_send_rr_frame(self, CMD_FRAME);
self->ack_required = FALSE;
} else {
skb->data[1] |= PF_BIT;
+   irlap_next_state(self, LAP_NRM_P);
irlap_send_ui_frame(self, skb_get(skb), self->caddr, 
CMD_FRAME);
}
}



[PATCH 2/2] [IrDA] f-timer reloading when sending rejected frames

2007-06-08 Thread samuel
Jean II was right: you have to re-charge the final timer when resending 
rejected frames. Otherwise it triggers at a wrong time and can break the 
currently running communication. Reproducible under rt-preempt.

Signed-off-by: G. Liakhovetski [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]

Index: net-2.6-quilt/net/irda/irlap_event.c
===
--- net-2.6-quilt.orig/net/irda/irlap_event.c   2007-05-29 09:36:09.0 +0300
+++ net-2.6-quilt/net/irda/irlap_event.c    2007-05-29 09:38:19.0 +0300
@@ -1418,14 +1418,14 @@
 */
self->remote_busy = FALSE;
 
+   /* Stop final timer */
+   del_timer(&self->final_timer);
+
/*
 *  Nr as expected?
 */
ret = irlap_validate_nr_received(self, info->nr);
if (ret == NR_EXPECTED) {
-   /* Stop final timer */
-   del_timer(&self->final_timer);
-
/* Update Nr received */
irlap_update_nr_received(self, info->nr);
 
@@ -1457,14 +1457,12 @@
 
/* Resend rejected frames */
irlap_resend_rejected_frames(self, CMD_FRAME);
-
-   /* Final timer ??? Jean II */
+   irlap_start_final_timer(self, self->final_timeout * 2);
 
irlap_next_state(self, LAP_NRM_P);
} else if (ret == NR_INVALID) {
IRDA_DEBUG(1, "%s(), Received RR with "
   "invalid nr !\n", __FUNCTION__);
-   del_timer(&self->final_timer);
 
irlap_next_state(self, LAP_RESET_WAIT);
 



Re: networking busted in current -git ???

2007-06-08 Thread Herbert Xu
Trond Myklebust [EMAIL PROTECTED] wrote:
 
 It appears to be something in the latest dump from davem to Linus, but I
 haven't yet had time to identify what.

You want this patch which should hit the tree soon.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
[IPV4]: Do not remove idev when addresses are cleared

Now that we create idev before addresses are added, it no longer makes
sense to remove them when addresses are all deleted.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 354e800..0cf813f 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -327,12 +327,8 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
}
 
}
-   if (destroy) {
+   if (destroy)
inet_free_ifa(ifa1);
-
-   if (!in_dev->ifa_list)
-   inetdev_destroy(in_dev);
-   }
 }
 
 static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,


Re: SKY2 vs SK98LIN performance on 88E8053 MAC

2007-06-08 Thread Philip Romanov
Hi, Stephen

We are doing pure IPv4 forwarding between two Ethernet
interfaces:

 IXIA port A---System Under Test---IXIA Port B

Traffic has two IP destinations for each direction and
L4 protocol is UDP. There are two static ARP entries
and only interface routes. Two tests are identical
except that we switch from one driver to another. 

Ethernet ports on the SUT are oversubscribed -- I'm
sending 60% of line rate (of 256-byte packets) and
measuring percentage of pass-through traffic which
makes to the IXIA port on the other side. Traffic is
bidirectional and system load is close to 100%.

I attach vmlinux and driver profiles I have taken with
oprofile 0.8.2. I can easily take more
measurements/experiments if need be.

Regards,

   Philip


  We are observing severe IPv4 forwarding degradation
  when switching from sk98lin to sky2 driver. Setup:
  plain 2.6.21.3 kernel, 88E8053 Marvell Yukon2 MAC,
  sk98lin is @revision 8.41.2.3 coming from FC6, SKY2
  driver from 2.6.21.3 kernel, both drivers are in NAPI
  mode.
  
  Benchmarks are done using bidirectional traffic
  generated by IXIA, sending 256-byte packets. Observed
  packet throughput is almost 30% higher with sk98lin
  driver.
  
  Ethernet flow control is turned off in SKY2 driver
  (hard-coded as off, we know about this problem).
  
  I also have oprofile records of the drivers in case
  anybody is interested.
  
  Please share info if you know anything on SKY2
  performance bottlenecks.
 
 I'm surprised? The vendor driver has bogus extra
 locking and other crap. Please send profile data.
 Flow control should work on sky2 (now).
 
 Are you routing or doing real TCP transfers?
 



  

vmlinux-sk98lin-2.6.21.3-report
Description: 4224780258-vmlinux-sk98lin-2.6.21.3-report


vmlinux-sky2-2.6.21.3-report
Description: 1503384622-vmlinux-sky2-2.6.21.3-report


sk98lin-2.6.21.3-report
Description: 3831520705-sk98lin-2.6.21.3-report


sky2-2.6.21.3.report
Description: 1004031548-sky2-2.6.21.3.report


Re: [patch 23/32] IPV4: Correct rp_filter help text.

2007-06-08 Thread Herbert Xu
Chris Wright [EMAIL PROTECTED] wrote:
 
 --- linux-2.6.20.13.orig/net/ipv4/Kconfig
 +++ linux-2.6.20.13/net/ipv4/Kconfig
 @@ -43,11 +43,11 @@ config IP_ADVANCED_ROUTER
  asymmetric routing (packets from you to a host take a different path
  than packets from that host to you) or if you operate a non-routing
  host which has several IP addresses on different interfaces. To turn
 - rp_filter off use:
 + rp_filter on use:
 
 - echo 0 > /proc/sys/net/ipv4/conf/<device>/rp_filter
 + echo 1 > /proc/sys/net/ipv4/conf/<device>/rp_filter
  or
 - echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
 + echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

BTW, this documentation is actually wrong.  You can't enable rp_filter
on all interfaces with

echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

You must do that in conjunction with

echo 1 > /proc/sys/net/ipv4/conf/<device>/rp_filter

for it to work for <device>.

This is really counter-intuitive but it's apparently how it's always
worked.
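
The reason, from memory, is that the kernel combines the "all" value and the
per-device value with a logical AND; the check in that era's
include/linux/inetdevice.h is roughly (quoted from memory, so treat it as
approximate):

#define IN_DEV_RPFILTER(in_dev)	(ipv4_devconf.rp_filter && (in_dev)->cnf.rp_filter)

so neither value on its own is enough to enable the filter for a device.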

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [patch 23/32] IPV4: Correct rp_filter help text.

2007-06-08 Thread Herbert Xu
On Sat, Jun 09, 2007 at 11:20:43AM +1000, Herbert Xu wrote:
 Chris Wright [EMAIL PROTECTED] wrote:
  
  --- linux-2.6.20.13.orig/net/ipv4/Kconfig
  +++ linux-2.6.20.13/net/ipv4/Kconfig
  @@ -43,11 +43,11 @@ config IP_ADVANCED_ROUTER
   asymmetric routing (packets from you to a host take a different path
   than packets from that host to you) or if you operate a non-routing
   host which has several IP addresses on different interfaces. To turn
  - rp_filter off use:
  + rp_filter on use:
  
  - echo 0 > /proc/sys/net/ipv4/conf/<device>/rp_filter
  + echo 1 > /proc/sys/net/ipv4/conf/<device>/rp_filter
   or
  - echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
  + echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter
 
 BTW, this documentation is actually wrong.  You can't enable rp_filter

So to fix the documentation, we should change the word "or" to "and".

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH] NET: Multiqueue network device support.

2007-06-08 Thread Ramkrishna Vepa
Peter,

Where is your git tree located? 

Ram
 -Original Message-
 From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]
 On Behalf Of Waskiewicz Jr, Peter P
 Sent: Thursday, June 07, 2007 3:56 PM
 To: David Miller; [EMAIL PROTECTED]
 Cc: Kok, Auke-jan H; [EMAIL PROTECTED]; [EMAIL PROTECTED];
 netdev@vger.kernel.org; Brandeburg, Jesse
 Subject: RE: [PATCH] NET: Multiqueue network device support.
 
   I empathize but take a closer look; seems mostly useless.
 
  I thought E1000 still uses LLTX, and if so then multiple cpus
  can most definitely get into the ->hard_start_xmit() in parallel.
 
 Not with how the qdisc status protects it today:
 
 include/net/pkt_sched.h:
 
 static inline void qdisc_run(struct net_device *dev)
 {
 if (!netif_queue_stopped(dev) &&
     !test_and_set_bit(__LINK_STATE_QDISC_RUNNING,
                       &dev->state))
 __qdisc_run(dev);
 }
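
For reference, the LLTX convention being discussed looks roughly like the
sketch below (a hypothetical driver, not e1000's actual code; assumes
linux/netdevice.h and linux/spinlock.h): because the core does not take
dev->xmit_lock for NETIF_F_LLTX drivers, the driver provides its own locking
inside ->hard_start_xmit(), which several CPUs may enter at once.

struct my_priv {			/* hypothetical private struct */
	spinlock_t tx_lock;
};

static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);
	unsigned long flags;

	if (!spin_trylock_irqsave(&priv->tx_lock, flags))
		return NETDEV_TX_LOCKED;	/* core requeues and retries */

	/* ... post skb to the hardware ring ... */

	spin_unlock_irqrestore(&priv->tx_lock, flags);
	return NETDEV_TX_OK;
}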


Re: [PATCH] RFC: have tcp_recvmsg() check kthread_should_stop() and treat it as if it were signalled

2007-06-08 Thread Herbert Xu
Please cc networking patches to [EMAIL PROTECTED]

Jeff Layton [EMAIL PROTECTED] wrote:
 
 The following patch is a first stab at removing this need. It makes it
 so that in tcp_recvmsg() we also check kthread_should_stop() at any
 point where we currently check to see if the task was signalled. If
 that returns true, then it acts as if it were signalled and returns to
 the calling function.

This just doesn't seem to fit.  Why should networking care about kthreads?

Perhaps you can get kthread_stop to send a signal instead?
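
A minimal sketch of that idea, assuming the kthread opts in to the signal
itself (the function names and the choice of SIGKILL are illustrative only,
not taken from any existing driver; assumes linux/kthread.h and
linux/signal.h):

/* stopper side: knock the thread out of its blocking recvmsg with a
 * signal, then do the normal kthread_stop() handshake */
static void my_stop_kthread(struct task_struct *task)
{
	send_sig(SIGKILL, task, 1);
	kthread_stop(task);
}

/* kthread side */
static int my_kthread(void *data)
{
	allow_signal(SIGKILL);	/* kthreads ignore signals by default */

	while (!kthread_should_stop()) {
		/* ... a blocking kernel_recvmsg()/tcp_recvmsg() returns
		 * -EINTR or -ERESTARTSYS once the signal is pending, and
		 * the loop condition then notices kthread_should_stop() */
	}
	return 0;
}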

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: networking busted in current -git ???

2007-06-08 Thread David Miller
From: Trond Myklebust [EMAIL PROTECTED]
Date: Fri, 08 Jun 2007 17:43:27 -0400

 It is not dhcp. I'm seeing the same bug with bog-standard ifup with a
 static address on an FC-6 machine.
 
 It appears to be something in the latest dump from davem to Linus, but I
 haven't yet had time to identify what.

Linus's current tree should have this fixed.

Let us know if this is not the case.


Re: [PATCH 0/2] [IrDA] Updates for net-2.6

2007-06-08 Thread David Miller
From: [EMAIL PROTECTED]
Date: Sat, 09 Jun 2007 04:08:15 +0300

 These 2 patches are bug fixes and should thus be considered for net-2.6
 inclusion.

Both patches applied, thanks Sam.


Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family

2007-06-08 Thread Kok, Auke

Stephen Hemminger wrote:

On Fri, 08 Jun 2007 16:42:31 -0700
Kok, Auke [EMAIL PROTECTED] wrote:


Stephen Hemminger wrote:
 
+#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
+   do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) { \
+   printk(kern_level "%s: " format, \
+   (netdev)->name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?
hmm you lost me there a bit; Do you want me to duplicate this code for all the 
ndev_err/ndev_info functions instead so that ndev_err doesn't direct back to 
ndev_printk?


It is good practice in a macro to avoid potential problems with usage
by only touching the arguments once. Otherwise, something (bogus) like
ndev_printk(KERN_DEBUG, NETIF_MSG_PKTDATA, "got %d\n",
dev++, skb->len)
would increment dev twice.
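
For what it's worth, one way to touch the arguments only once is to capture
netdev in a local variable inside the macro -- just a sketch, the _ndev name
is illustrative:

#define ndev_printk(kern_level, netif_level, netdev, format, arg...)	\
do {									\
	struct net_device *_ndev = (netdev);				\
	if (_ndev->msg_enable & NETIF_MSG_##netif_level)		\
		printk(kern_level "%s: " format, _ndev->name, ## arg);	\
} while (0)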


agreed, but


My preference would be something more like dev_printk or even use that?
You want to show both device name, and physical attachment in the message.


actually these ndev_* macros are almost an exact copy of dev_printk, which is 
how I modeled them in the first place!


See for yourself - here's the relevant snippet from linux/device.h:

#define dev_printk(level, dev, format, arg...)  \
        printk(level "%s %s: " format , dev_driver_string(dev) , (dev)->bus_id , ## arg)

#ifdef DEBUG
#define dev_dbg(dev, format, arg...)            \
        dev_printk(KERN_DEBUG , dev , format , ## arg)
#else
#define dev_dbg(dev, format, arg...) do { (void)(dev); } while (0)
#endif

#define dev_err(dev, format, arg...)            \
        dev_printk(KERN_ERR , dev , format , ## arg)
#define dev_info(dev, format, arg...)           \
        dev_printk(KERN_INFO , dev , format , ## arg)
#define dev_warn(dev, format, arg...)           \
        dev_printk(KERN_WARNING , dev , format , ## arg)
#define dev_notice(dev, format, arg...)         \
        dev_printk(KERN_NOTICE , dev , format , ## arg)

Using dev_printk, however, ignores msg_enable completely and also omits 
netdev->name, which may even change, so for netdevices it's much less suitable, 
maybe only at init time.


We can fix the dev_printk macro family as well, that's all right, but the need 
for a netdev-centric printk should be obvious: almost every netdevice driver has 
its own variant :)


Auke


Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family

2007-06-08 Thread Kok, Auke

Stephen Hemminger wrote:

On Fri, 08 Jun 2007 16:42:31 -0700
Kok, Auke [EMAIL PROTECTED] wrote:


Stephen Hemminger wrote:
 
+#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
+   do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) { \
+   printk(kern_level "%s: " format, \
+   (netdev)->name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?
hmm you lost me there a bit; Do you want me to duplicate this code for all the 
ndev_err/ndev_info functions instead so that ndev_err doesn't direct back to 
ndev_printk?


It is good practice in a macro to avoid potential problems with usage
by only touching the arguments once. Otherwise, something (bogus) like
ndev_printk(KERN_DEBUG, NETIF_MSG_PKTDATA, "got %d\n",
dev++, skb->len)
would increment dev twice.

My preference would be something more like dev_printk or even use that?


... see other reply


You want to show both device name, and physical attachment in the message.


OK, that does make sense, and here it gets interesting and we can get creative, 
since for NETIF_MSG_HW and NETIF_MSG_PROBE messages we could add the printout of 
netdev-dev-bus_id. I have modeled and toyed around with the message format and 
did this (add bus_id to all messages) but it got messy (for LINK messages it's 
totally not needed).


However, that is going to make the macros a bit more complex, and it is unlikely that 
I can make it fit without double-pass evaluation without making it a monster...


unless everyone agrees to just printing everything: both netdev->name and 
netdev->dev->bus_id for every message
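
If it came to that, a sketch of the print-everything variant could simply
reuse dev_printk (assuming the struct device embedded in net_device, i.e.
&netdev->dev; again only a sketch, _ndev is illustrative):

#define ndev_printk(kern_level, netif_level, netdev, format, arg...)	\
do {									\
	struct net_device *_ndev = (netdev);				\
	if (_ndev->msg_enable & NETIF_MSG_##netif_level)		\
		dev_printk(kern_level, &_ndev->dev,			\
			   "%s: " format, _ndev->name, ## arg);		\
} while (0)

which gets the driver string and bus_id from dev_printk plus the interface
name, at the cost of longer lines even for LINK messages.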


Auke