Re: SG_IO with 4k buffer size to iscsi sg device causes Bad page panic
Please don't drop CCs.

Qi, Yanling [EMAIL PROTECTED] wrote:
>
> @@ -2571,6 +2572,13 @@ sg_page_malloc(int rqSz, int lowDma, int
>  		resp = (char *) __get_free_pages(page_mask, order); /* try half */
>  		resSz = a_size;
>  	}
> +	tmppage = virt_to_page(resp);
> +	for (m = PAGE_SIZE; m < resSz; m += PAGE_SIZE) {
> +		tmppage++;
> +		SetPageReserved(tmppage);
> +	}
> +
>
> [Qi, Yanling] If I do a get_page() at sg_page_malloc() time and then do
> a put_page() at sg_page_free() time, I worry about a race condition
> that the page gets re-used before calling free_pages().

Could you explain what is going to cause this page to be reused if it
has a non-zero reference count?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: Multicast and hardware checksum
Baruch Even [EMAIL PROTECTED] wrote:
>
> I have a machine on which I have an application that sends multicast
> through an eth interface with hardware tx checksum enabled. On the
> same machine I have mrouted running that routes the multicast traffic
> to a set of ppp interfaces.
>
> The packets that are received by the client have their checksum fixed
> on some number which is incorrect. If I disable tx checksum on the eth
> device the packets arrive with the proper checksum.

Where is the client? On the same machine or behind a PPP link?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: 2.6.22-rc4-mm2 -- ipw2200 -- SIOCSIFADDR: No buffer space available
On 6/7/07, Björn Steinbrink [EMAIL PROTECTED] wrote:
[...]

Miles, could you try if this patch helps?

Björn

Stop destroying devices when all of their ifas are gone, as we no
longer recreate them when ifas are added.

Signed-off-by: Björn Steinbrink [EMAIL PROTECTED]
--
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index fa97b96..abf6352 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -327,12 +327,8 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
 		}
 	}

-	if (destroy) {
+	if (destroy)
 		inet_free_ifa(ifa1);
-
-		if (!in_dev->ifa_list)
-			inetdev_destroy(in_dev);
-	}
 }

 static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,

Björn,

Thanks. Your patch worked for me.

Miles
Re: [PATCH][RFC] network splice receive
On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
> On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
> > > What about checking if a page belongs to the kmalloc cache (or any
> > > other cache via private pointers) and not performing any kind of
> > > reference counting on them? I will play with this a bit later
> > > today.
> >
> > That might work, but sounds a little dirty... But there's probably
> > no way around it. Be sure to look at the #splice-net branch if you
> > are playing with this, I've updated it a number of times and fixed
> > some bugs in there. Notably it now gets the offset right, and
> > handles fragments and fraglist as well.
>
> I've pulled splice-net, which indeed fixed some issues, but
> referencing slab pages is still not allowed. There are at least two
> problems (although they are related):
>
> 1. If we do not increment the reference counter for slab pages, they
>    eventually get refilled and the slab explodes after it notices that
>    its pages are in use (or the user dies when a page is moved out of
>    his control in the slab).
> 2. get/put_page does not work with slab pages, and simple
>    increment/decrement of the reference counters is not allowed
>    either.
>
> Both problems have the same root - the slab does not allow anyone to
> manipulate the page's members. That should be broken/changed to allow
> splice to put its hands into the network stack the fastest way. I will
> think about it.

Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent
reuse that way? Then we don't have to care what backing the skb has, as
long as it (and its data) isn't being reused until we drop the
reference to it in sock_pipe_buf_release().

-- 
Jens Axboe
Re: [PATCH][RFC] network splice receive
From: Jens Axboe [EMAIL PROTECTED]
Date: Fri, 8 Jun 2007 09:48:24 +0200

> Perhaps it's possible to solve this at a different level - can we hang
> on to the skb until the pipe buffer has been consumed, and prevent
> reuse that way? Then we don't have to care what backing the skb has,
> as long as it (and its data) isn't being reused until we drop the
> reference to it in sock_pipe_buf_release().

Depending upon whether the pipe buffer consumption is bounded or not,
this will jam up the TCP sender because the SKB data allocation is
charged against the socket send buffer allocation.
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, David Miller wrote:
> From: Jens Axboe [EMAIL PROTECTED]
> Date: Fri, 8 Jun 2007 09:48:24 +0200
>
> > Perhaps it's possible to solve this at a different level - can we
> > hang on to the skb until the pipe buffer has been consumed, and
> > prevent reuse that way? Then we don't have to care what backing the
> > skb has, as long as it (and its data) isn't being reused until we
> > drop the reference to it in sock_pipe_buf_release().
>
> Depending upon whether the pipe buffer consumption is bounded or not,
> this will jam up the TCP sender because the SKB data allocation is
> charged against the socket send buffer allocation.

Forgive my network ignorance, but is that a problem? Since you bring it
up, I guess so :-)

We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.

-- 
Jens Axboe
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
> > Forgive my network ignorance, but is that a problem? Since you bring
> > it up, I guess so :-)
>
> David means that socket buffer allocation is limited, and delaying the
> freeing can end up exhausting that limit.

OK, so a delayed empty of the pipe could end up causing packet drops
elsewhere due to allocation exhaustion?

> > We can grow the pipe, should we have to. So instead of blocking
> > waiting on reader consumption, we can extend the size of the pipe
> > and keep going.
>
> I have code which roughly works (but I will test it some more), which
> just introduces reference counters for slab pages, so that they would
> not actually be freed via page reclaim, but only after the reference
> counters are dropped. That forced changes in mm/slab.c, so it is
> likely an unacceptable solution, but it is interesting as is.

Hmm, still seems like it's working around the problem. We essentially
just need to ensure that the data doesn't get _reused_, not just freed.
It doesn't help holding a reference to the page, if someone else just
reuses it and fills it with other data before it has been consumed and
released by the pipe buffer operations. That's why I thought the skb
referencing was the better idea, then we don't have to care about the
backing of the skb either. Provided that preventing the free of the skb
before the pipe buffer has been consumed guarantees that the contents
aren't reused.

-- 
Jens Axboe
Re: [WIP][PATCHES] Network xmit batching
On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal ([EMAIL PROTECTED]) wrote:
> On Thu, 2007-07-06 at 20:13 +0400, Evgeniy Polyakov wrote:
> > Actually I wonder where the devil lives, but I do not see how that
> > patchset can improve the sending situation. Let me clarify: there
> > are two possibilities to send data:
> >
> > 1. via batched sending, which runs via a queue of packets and
> > performs a prepare call (which only sets up some private flags, no
> > work with hardware) and then a sending call.
>
> I believe both are called with no lock. The idea is to avoid the lock
> entirely when unneeded. That code may end up finding that the packet
> is bogus and throw it out when it deems it useless. If you followed
> the discussions on multi-ring, this call is where i suggested to
> select the tx ring as well.

Hmm...

+	netif_tx_lock_bh(odev);
+	if (!netif_queue_stopped(odev)) {
+
+		idle_start = getCurUs();
+		pkt_dev->tx_entered++;
+		ret = odev->hard_batch_xmit(odev->blist, odev);

+	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+		/* Collision - tell upper layer to requeue */
+		return NETDEV_TX_LOCKED;
+	}
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+#ifdef coredoesnoprep
+		ret = netdev->hard_prep_xmit(skb, netdev);
+		if (ret != NETDEV_TX_OK)
+			continue;
+#endif
+
+		/* XXX: This may be an opportunity to not give nit
+		 * the packet if the dev is TX BUSY ;-> */
+		dev_do_xmit_nit(skb, netdev);
+		ret = e1000_queue_frame(skb, netdev);

The same applies to the *_gso case.

> > 2. old xmit function (which seems to be unused by the kernel now?)
>
> You can change that by turning off the _BTX feature in the driver. For
> WIP reasons it is on at the moment.

Btw, prep_queue_frame seems to always be called under tx_lock, but the
old e1000 xmit function calls it without the lock.

> I think both call it without lock.

Without the lock that would be wrong - it accesses hardware. The locked
case is correct, since it accesses private registers via
e1000_transfer_dhcp_info() for some adapters.

> I am unsure about the value of that lock (refer to email to Auke).
> There is only one CPU that can enter the tx path and the contention is
> minimal.

So, essentially batched sending is

	lock
	while ((skb = dequeue))
		send
	unlock

where the queue of skbs is prepared by the stack using the same
transmit lock. Where is the gain?

> The amortizing of the lock on tx is where the value is. Did you see
> the numbers, Evgeniy? ;->
> Heres one i can vouch for on a dual processor 2GHz that i tested with
> pktgen:
>
> 1) Original e1000 driver (no batching):
>    a) We got a xmit throughput of 362Kpackets/second with the default
>       setup (everything falls on cpu#0).
>    b) With tying to CPU#1, i saw 401Kpps.
> 2) Repeated the tests with batching patches (as in this commit) and
>    got an outstanding 694Kpps throughput.
> 5) Repeated #4 with binding to cpu #1, and throughput didnt improve
>    that much - was hitting 697Kpps. I think we are pretty much hitting
>    upper limits here ... I am actually testing as we speak on faster
>    hardware - I will post results shortly.

I only saw the results Krishna posted, and i also do not know what
service demand is :)

The result looks good, but I still do not understand how it appeared,
that is why I'm not that excited about the idea - I just do not know it
in details.

-- 
	Evgeniy Polyakov
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
> On Fri, Jun 08 2007, David Miller wrote:
> > Depending upon whether the pipe buffer consumption is bounded or
> > not, this will jam up the TCP sender because the SKB data allocation
> > is charged against the socket send buffer allocation.
>
> Forgive my network ignorance, but is that a problem? Since you bring
> it up, I guess so :-)

David means that socket buffer allocation is limited, and delaying the
freeing can end up exhausting that limit.

> We can grow the pipe, should we have to. So instead of blocking
> waiting on reader consumption, we can extend the size of the pipe and
> keep going.

I have code which roughly works (but I will test it some more), which
just introduces reference counters for slab pages, so that they would
not actually be freed via page reclaim, but only after the reference
counters are dropped. That forced changes in mm/slab.c, so it is likely
an unacceptable solution, but it is interesting as is.

-- 
	Evgeniy Polyakov
Re: [PATCH] NET: Multiqueue network device support.
On Thu, Jun 07, 2007 at 09:35:36PM -0400, jamal wrote:
> On Thu, 2007-07-06 at 17:31 -0700, Sridhar Samudrala wrote:
> > If the QDISC_RUNNING flag guarantees that only one CPU can call
> > dev->hard_start_xmit(), then why do we need to hold netif_tx_lock
> > for non-LLTX drivers?
>
> I havent stared at other drivers, but for e1000 seems to me even if
> you got rid of LLTX that netif_tx_lock is unnecessary. Herbert?

It would guard against the poll routine, which would acquire this lock
when cleaning the TX ring.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Multicast and hardware checksum
Herbert Xu wrote:
> On Fri, Jun 08, 2007 at 02:02:27PM +0300, Baruch Even wrote:
> > As far as IGMP and multicast handling everything works, the packets
> > are even forwarded over the ppp links but they arrive to the client
> > with a bad checksum. I don't have the trace in front of me but I
> > believe it was the UDP checksum that failed.
>
> What kind of a ppp device is this? If you run a tcpdump either side of
> the ppp link do you see the same UDP checksum value?

This is a pptp link. I've checked the checksum on the receive side, I
don't know on the sender side and I'll only be able to try it on
Sunday.

Baruch
Re: [WIP][PATCHES] Network xmit batching
KK,

On Fri, 2007-08-06 at 10:36 +0530, Krishna Kumar2 wrote:
> I will try that. Also on the receiver, I am using unmodified 2.6.21
> bits.

That should be fine as long as the sender is running the patched
2.6.22-rc4.

> My earlier experiments showed that even small buffers were filling the
> E1000 slots and resulting in stop queue very often. In any case, I
> will also add 1 or 2 larger packet sizes (1K, 16K in addition to the
> 4K already there).

Thats interesting - it is possible there is transient burstiness which
fills up the ring. My observation of your results (hence my comments):
for example with buffer size = 8B, TCP, 1 process, you achieve less
than 70M. That is less than 100Kpps on average being sent out. Very
very tiny - so it is interesting that it is causing a shutdown. Also
note something else: it is kind of strange that something like UDP,
which doesnt backoff, will send out less packets/second ;->
I could put a little hack in the e1000 driver to find the exact number
of times per run it was shut down. BTW, another interesting thing to do
is ensure that several netperfs are running on different CPUs.

> I was planning to submit my changes on top of this patch, and since it
> includes a configuration option per device, it will be easy to test
> with and without this API.

fantastic.

> When I ran after setting this config option to 0, the results were
> almost identical to the original code. I will try to post that today
> for your review/comments.

no problem.

> > Sorry, been many moons since i last played with netperf; what does
> > service demand mean?
>
> It gives an indication of the amount of CPU cycles used to send out a
> particular amount of data. Netperf provides it as us/KB. I don't know
> the internals of netperf enough to say how this is calculated. I am
> hoping Rick would comment.

cheers,
jamal
Re: Multicast and hardware checksum
On Fri, Jun 08, 2007 at 02:02:27PM +0300, Baruch Even wrote:
> As far as IGMP and multicast handling everything works, the packets
> are even forwarded over the ppp links but they arrive to the client
> with a bad checksum. I don't have the trace in front of me but I
> believe it was the UDP checksum that failed.

What kind of a ppp device is this? If you run a tcpdump either side of
the ppp link do you see the same UDP checksum value?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: Multicast and hardware checksum
Herbert Xu wrote:
> Baruch Even [EMAIL PROTECTED] wrote:
> > I have a machine on which I have an application that sends multicast
> > through an eth interface with hardware tx checksum enabled. On the
> > same machine I have mrouted running that routes the multicast
> > traffic to a set of ppp interfaces. The packets that are received by
> > the client have their checksum fixed on some number which is
> > incorrect. If I disable tx checksum on the eth device the packets
> > arrive with the proper checksum.
>
> Where is the client? On the same machine or behind a PPP link?

The clients are behind the ppp links. As far as IGMP and multicast
handling everything works, the packets are even forwarded over the ppp
links but they arrive to the client with a bad checksum. I don't have
the trace in front of me but I believe it was the UDP checksum that
failed.

Baruch
Re: [WIP][PATCHES] Network xmit batching
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 06/08/2007 04:44:06 PM:
> That should be fine as long as the sender is running the patched
> 2.6.22-rc4.

Definitely :)

> Thats interesting - it is possible there is transient burstiness which
> fills up the ring. My observation of your results (hence my comments):
> for example with buffer size = 8B, TCP, 1 process, you achieve less
> than 70M. That is less than 100Kpps on average being sent out. Very
> very tiny - so it is interesting that it is causing a shutdown.

I thought it comes to 1.147Mpps, or did I calculate wrong
(70*1024*1024/8/8)?

> Also note something else: it is kind of strange that something like
> UDP, which doesnt backoff, will send out less packets/second ;->

Cannot explain that either :)

> BTW, another interesting thing to do is ensure that several netperfs
> are running on different CPUs.

My script was doing that earlier, I trimmed all that to make it easier
to understand. Will post the larger version later.

> no problem.

Thanks, please let me know what you think of the patch I sent earlier.

I am running a larger 5-iteration run with buffer sizes: 8, 32, 128,
512, 1K, 4K, 16K. It is going to run for around 12 hours and since I am
moving house during the weekend, I will be able to look at the results
only on Monday.

Regards,

- KK
Re: [WIP][PATCHES] Network xmit batching
KK,

On Fri, 2007-08-06 at 17:01 +0530, Krishna Kumar2 wrote:
> I thought it comes to 1.147Mpps, or did I calculate wrong
> (70*1024*1024/8/8)?

I assumed 8B to mean data that is on top of TCP/UDP. If so, then in the
case of UDP we have an 8B UDP header, 20B IP and 14B ethernet < the 64B
minimal allowed Ethernet packet; so it gets padded and goes out as 64B.
There are, as you state above, 1.147M (or is it 1.48M?) such
packets/sec possible in 1Gbps. So (70Mbps/1000Mbps)*1.147 is the rough
number i was referring to.

> My script was doing that earlier, I trimmed all that to make it easier
> to understand. Will post the larger version later.

That will be nice because remember we can have multiple CPU packet
producers but only one CPU consumer.

> Thanks, please let me know what you think of the patch I sent earlier.

I havent seen a patch. Can you resend it?

> I am running a larger 5-iteration run with buffer sizes: 8, 32, 128,
> 512, 1K, 4K, 16K. It is going to run for around 12 hours and since I
> am moving house during the weekend, I will be able to look at the
> results only on Monday.

sounds good.

cheers,
jamal
Re: [WIP][PATCHES] Network xmit batching
On Fri, 2007-08-06 at 12:38 +0400, Evgeniy Polyakov wrote:
> On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal ([EMAIL PROTECTED]) wrote:
> > I believe both are called with no lock. The idea is to avoid the
> > lock entirely when unneeded. That code may end up finding that the
> > packet [..]
>
> +	netif_tx_lock_bh(odev);
> +	if (!netif_queue_stopped(odev)) {
> +
> +		idle_start = getCurUs();
> +		pkt_dev->tx_entered++;
> +		ret = odev->hard_batch_xmit(odev->blist, odev);
> [..]
>
> The same applies to the *_gso case.

You missed an important piece, which is the grabbing of
__LINK_STATE_QDISC_RUNNING.

> Without lock that would be wrong - it accesses hardware.

We are achieving the goal of only a single CPU entering that path. Are
you saying that is not good enough?

> I only saw results Krishna posted,

Ok, sorry - i thought you saw the git log or earlier results where
other things are captured.

> and i also do not know, what service demand is :)

From the explanation it seems to be how much cpu was used while
sending. Do you have any suggestions for computing cpu use? in pktgen i
added code to count how many microsecs were used in transmitting.

> Result looks good, but I still do not understand how it appeared, that
> is why I'm not that excited about idea - I just do not know it in
> details.

To add to KKs explanation in the other email: essentially the value is
in amortizing the cost of barriers and IO per packet. For example, the
queue lock is held/released only once per X packets. DMA kicking, which
includes both a PCI IO write and mbs, is done only once per X packets.
There is still a lot of room for improvement of such IO.

cheers,
jamal
Re: Multicast and hardware checksum
Baruch Even wrote:
> Herbert Xu wrote:
> > What kind of a ppp device is this? If you run a tcpdump either side
> > of the ppp link do you see the same UDP checksum value?
>
> This is a pptp link. I've checked the checksum on the receive side, I
> don't know on the sender side and I'll only be able to try it on
> Sunday.

For completeness, the clients are Windows XP clients and the server is
a Linux machine. The tunnel is mppe encrypted so I believe that what
goes out on the client is the same as what got in on the server.

Baruch
Re: [PATCH] NET: Multiqueue network device support.
On Fri, 2007-08-06 at 20:39 +1000, Herbert Xu wrote:
> It would guard against the poll routine, which would acquire this lock
> when cleaning the TX ring.

Ok, then i suppose we can conclude it is a bug in e1000 (holds tx_lock
on the tx side and the adapter queue lock on rx). Adding that lock will
certainly bring down the performance numbers on a send/recv profile.
The bizarre thing is things ran just fine even under the heavy tx/rx
traffic i was testing under. I guess i didnt hit hard enough.

cheers,
jamal
Re: [WIP][PATCHES] Network xmit batching
On Fri, Jun 08, 2007 at 07:31:07AM -0400, jamal ([EMAIL PROTECTED]) wrote:
> On Fri, 2007-08-06 at 12:38 +0400, Evgeniy Polyakov wrote:
> > +	netif_tx_lock_bh(odev);
> > +	if (!netif_queue_stopped(odev)) {
> > +
> > +		idle_start = getCurUs();
> > +		pkt_dev->tx_entered++;
> > +		ret = odev->hard_batch_xmit(odev->blist, odev);
> > [..]
> >
> > The same applies to the *_gso case.
>
> You missed an important piece, which is the grabbing of
> __LINK_STATE_QDISC_RUNNING.

But the lock is still being held - or was there no intention to reduce
lock usage? As far as I read Krishna's mail, lock usage was not an
issue, so that hunk probably should be dropped from the analysis.

> > Without lock that would be wrong - it accesses hardware.
>
> We are achieving the goal of only a single CPU entering that path. Are
> you saying that is not good enough?

Then why was essentially the same code (the current batch_xmit
callback) previously always called with disabled interrupts? Aren't
there some watchdog/link/poll/whatever issues present?

> From the explanation it seems to be how much cpu was used while
> sending. Do you have any suggestions for computing cpu use? in pktgen
> i added code to count how many microsecs were used in transmitting.

Something that anyone can understand :) For example /proc stats -
although they are not very accurate, they are a really usable parameter
from the userspace point of view.

> To add to KKs explanation in the other email: essentially the value is
> in amortizing the cost of barriers and IO per packet. For example, the
> queue lock is held/released only once per X packets. DMA kicking,
> which includes both a PCI IO write and mbs, is done only once per X
> packets. There is still a lot of room for improvement of such IO.

Btw, what is the size of the packet in pktgen in your tests? Likely it
is small, since the result is that good. That can explain a lot.

-- 
	Evgeniy Polyakov
Re: [PATCH] NET: Multiqueue network device support.
On Fri, Jun 08, 2007 at 07:34:57AM -0400, jamal wrote:
> Ok, then i suppose we can conclude it is a bug in e1000 (holds tx_lock
> on the tx side and the adapter queue lock on rx). Adding that lock
> will certainly bring down the performance numbers on a send/recv
> profile. The bizarre thing is things ran just fine even under the
> heavy tx/rx traffic i was testing under. I guess i didnt hit hard
> enough.

Hmm, I wasn't describing how it works now. I'm talking about how it
would work if we removed LLTX and replaced the private tx_lock with
netif_tx_lock.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] NET: Multiqueue network device support.
On Fri, 2007-08-06 at 22:37 +1000, Herbert Xu wrote:
> Hmm, I wasn't describing how it works now. I'm talking about how it
> would work if we removed LLTX and replaced the private tx_lock with
> netif_tx_lock.

I got that - it is what tg3 does for example. To mimic that behavior in
LLTX, a driver needs to use the same lock on both tx and receive. e1000
holds a different lock on the tx path from the rx path. Maybe theres
something clever i am missing; but it seems to be a bug in e1000. The
point i was making is that it was strange i never had problems despite
taking away the lock on the tx side and using the rx side concurrently.

cheers,
jamal
Re: [WIP][PATCHES] Network xmit batching
On Fri, 2007-08-06 at 16:09 +0400, Evgeniy Polyakov wrote:
> But the lock is still being held - or was there no intention to reduce
> lock usage? As far as I read Krishna's mail, lock usage was not an
> issue, so that hunk probably should be dropped from the analysis.

Post-2.6.18, that atomic bit guarantees only one CPU will enter the tx
path. The lock is only necessary to protect resources shared between tx
and rx (which could simultaneously be entered by two CPUs), such as the
tx ring. Refer to the other thread talking about a possible bug with
e1000 in this area. So maybe e1000 is not a good example in this sense.
But look at tg3.

> Then why was essentially the same code (the current batch_xmit
> callback) previously always called with disabled interrupts? Aren't
> there some watchdog/link/poll/whatever issues present?

not in the e1000 as it stands today.

> Something that anyone can understand :) For example /proc stats -
> although they are not very accurate, they are a really usable
> parameter from the userspace point of view.

which /proc stats?

> Btw, what is the size of the packet in pktgen in your tests? Likely it
> is small, since the result is that good. That can explain a lot.

There is a per-packet cost involved in that code path. So the more
packets/second you can generate, the more intensely you can test that
path. I believe you will achieve overall better results with large
packets.

cheers,
jamal
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: OK, so a delayed empty of the pipe could end up causing packet drops elsewhere due to allocation exhaustion? Yes. We can grow the pipe, should we have to. So instead of blocking waiting on reader consumption, we can extend the size of the pipe and keep going. I have code, which roughly works (but I will test it some more), which just introduces reference counters for slab pages, so that they would not actually be freed via page reclaim, but only after the reference counters are dropped. That forced changes in mm/slab.c, so it is likely an unacceptable solution, but it is interesting as is. Hmm, still seems like it's working around the problem. We essentially just need to ensure that the data doesn't get _reused_, not just freed. It doesn't help holding a reference to the page, if someone else just reuses it and fills it with other data before it has been consumed and released by the pipe buffer operations. That's why I thought the skb referencing was the better idea, then we don't have to care about the backing of the skb either. Provided that preventing the free of the skb before the pipe buffer has been consumed guarantees that the contents aren't reused. It is not only the better idea, it is the only correct one. Attached is a patch for the interested reader, which does slab page accounting, but it is broken. It does not blow up with a kernel bug, but it fills the output file with random garbage from reused and dirtied pages. And I do not know why, but the received file is always smaller than the file being sent (when the file has a reasonable size like 10mb; with 4-40kb file sizes things seem to be ok). I've started the skb referencing work, let's see where this will end up.
diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..742e1ee 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,18 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 
+extern void slab_change_usage(struct page *p);
+
+static inline void splice_page_release(struct page *p)
+{
+	struct page *head = p->first_page;
+	if (!PageSlab(head))
+		page_cache_release(p);
+	else {
+		slab_change_usage(head);
+	}
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +93,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	page_cache_release(buf->page);
+	splice_page_release(buf->page);
 	buf->flags &= ~PIPE_BUF_FLAG_LRU;
 }
@@ -246,7 +258,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr < spd->nr_pages)
-		page_cache_release(spd->pages[page_nr++]);
+		splice_page_release(spd->pages[page_nr++]);
 
 	return ret;
 }
@@ -322,7 +334,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
 		if (unlikely(error)) {
-			page_cache_release(page);
+			splice_page_release(page);
 			if (error == -EEXIST)
 				continue;
 			break;
@@ -448,7 +460,7 @@ fill_it:
 	 * we got, 'nr_pages' is how many pages are in the map.
 	 */
 	while (page_nr < nr_pages)
-		page_cache_release(pages[page_nr++]);
+		splice_page_release(pages[page_nr++]);
 
 	if (spd.nr_pages)
 		return splice_to_pipe(pipe, spd);
@@ -604,7 +616,7 @@ find_page:
 		if (ret != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
-		page_cache_release(page);
+		splice_page_release(page);
 		if (ret == AOP_TRUNCATED_PAGE)
 			goto find_page;
@@ -634,7 +646,7 @@ find_page:
 	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
 	if (ret) {
 		if (ret == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
+			splice_page_release(page);
 			goto find_page;
 		}
 		if (ret < 0)
@@ -651,7 +663,7 @@ find_page:
 	 */
 	mark_page_accessed(page);
out:
-	page_cache_release(page);
+	splice_page_release(page);
 	unlock_page(page);
out_ret:
 	return ret;
diff --git a/mm/slab.c b/mm/slab.c
index 2e71a32..673383d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1649,8 +1649,12 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	else
 		add_zone_page_state(page_zone(page),
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: OK, so a delayed empty of the pipe could end up causing packet drops elsewhere due to allocation exhaustion? Yes. We can grow the pipe, should we have to. So instead of blocking waiting on reader consumption, we can extend the size of the pipe and keep going. I have code, which roughly works (but I will test it some more), which just introduces reference counters for slab pages, so that they would not actually be freed via page reclaim, but only after the reference counters are dropped. That forced changes in mm/slab.c, so it is likely an unacceptable solution, but it is interesting as is. Hmm, still seems like it's working around the problem. We essentially just need to ensure that the data doesn't get _reused_, not just freed. It doesn't help holding a reference to the page, if someone else just reuses it and fills it with other data before it has been consumed and released by the pipe buffer operations. That's why I thought the skb referencing was the better idea, then we don't have to care about the backing of the skb either. Provided that preventing the free of the skb before the pipe buffer has been consumed guarantees that the contents aren't reused. It is not only the better idea, it is the only correct one. Attached is a patch for the interested reader, which does slab page accounting, but it is broken. It does not blow up with a kernel bug, but it fills the output file with random garbage from reused and dirtied pages. And I do not know why, but the received file is always smaller than the file being sent (when the file has a reasonable size like 10mb; with 4-40kb file sizes things seem to be ok). I've started the skb referencing work, let's see where this will end up. Here's a start, for the splice side at least, of storing a buf->private entity with the ops.
diff --git a/fs/splice.c b/fs/splice.c
index 90588a8..f24e367 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -191,6 +191,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		buf->page = spd->pages[page_nr];
 		buf->offset = spd->partial[page_nr].offset;
 		buf->len = spd->partial[page_nr].len;
+		buf->private = spd->partial[page_nr].private;
 		buf->ops = spd->ops;
 		if (spd->flags & SPLICE_F_GIFT)
 			buf->flags |= PIPE_BUF_FLAG_GIFT;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
 	unsigned int offset, len;
 	const struct pipe_buf_operations *ops;
 	unsigned int flags;
+	unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
 				int offset, u8 *to, int len,
 				__wsum csum);
-extern int skb_splice_bits(const struct sk_buff *skb,
+extern int skb_splice_bits(struct sk_buff *skb,
 				unsigned int offset,
 				struct pipe_inode_info *pipe,
 				unsigned int len,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index b3f1528..1a1182b 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -41,6 +41,7 @@ struct splice_desc {
 struct partial_page {
 	unsigned int offset;
 	unsigned int len;
+	unsigned long private;
 };
 
 /*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d2b2547..7d9ec9e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -78,7 +78,10 @@ static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 #ifdef NET_COPY_SPLICE
 	__free_page(buf->page);
 #else
-	put_page(buf->page);
+	struct sk_buff *skb = (struct sk_buff *) buf->private;
+
+	kfree_skb(skb);
+	//put_page(buf->page);
 #endif
 }
@@ -1148,7 +1151,8 @@ fault:
 /*
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
-				unsigned int len, unsigned int offset)
+				unsigned int len, unsigned int offset,
+				struct sk_buff *skb)
 {
 	struct page *p;
@@ -1163,12 +1167,14 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
[PATCH 1/1] make network DMA usable for non-tcp drivers
Here is a patch against the netdev-2.6 git tree that makes the net DMA feature usable for drivers like the ATA over Ethernet block driver, which can use dma_skb_copy_datagram_iovec when receiving data from the network. The change was suggested on kernelnewbies. http://article.gmane.org/gmane.linux.kernel.kernelnewbies/21663 Signed-off-by: Ed L. Cashin [EMAIL PROTECTED]
---
 drivers/dma/Kconfig |    2 +-
 net/core/user_dma.c |    2 ++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 72be6c6..270d23e 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -14,7 +14,7 @@ config DMA_ENGINE
 comment "DMA Clients"
 
 config NET_DMA
-	bool "Network: TCP receive copy offload"
+	bool "Network: receive copy offload"
 	depends on DMA_ENGINE && NET
 	default y
 	---help---
diff --git a/net/core/user_dma.c b/net/core/user_dma.c
index 0ad1cd5..69d0b15 100644
--- a/net/core/user_dma.c
+++ b/net/core/user_dma.c
@@ -130,3 +130,5 @@ end:
 fault:
 	return -EFAULT;
 }
+
+EXPORT_SYMBOL(dma_skb_copy_datagram_iovec);
-- 
1.5.2.1
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Here's a start, for the splice side at least, of storing a buf->private entity with the ops. :) I tested the same implementation, but I put the skb pointer into page->private. My approach is not correct, since the same page can hold several objects, so if there are several splicers, this will scream. I've tested your patch on top of the splice-net branch; here is the result:
[ 44.798853] Slab corruption: skbuff_head_cache start=81003b726668, len=192
[ 44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[ 44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
[ 44.824889] Prev obj: start=81003b726590, len=192
[ 44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[ 44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f)
[ 44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
[ 44.855772] Next obj: start=81003b726740, len=192
[ 44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[ 44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
I will try some things for the next 30-60 minutes, and then will leave for a canoe trip until Tuesday, so I will not be able to work on this idea. -- Evgeniy Polyakov
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Here's a start, for the splice side at least, of storing a buf->private entity with the ops. :) I tested the same implementation, but I put the skb pointer into page->private. My approach is not correct, since the same page can hold several objects, so if there are several splicers, this will scream. I've tested your patch on top of the splice-net branch; here is the result:
[ 44.798853] Slab corruption: skbuff_head_cache start=81003b726668, len=192
[ 44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[ 44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
[ 44.824889] Prev obj: start=81003b726590, len=192
[ 44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[ 44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f)
[ 44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
[ 44.855772] Next obj: start=81003b726740, len=192
[ 44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[ 44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[ 44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
I will try some things for the next 30-60 minutes, and then will leave for a canoe trip until Tuesday, so I will not be able to work on this idea. I'm not surprised, it wasn't tested at all - it just provides the basic framework for storing the skb so we can access it on pipe buffer release. Let's talk more next week, I'll likely play with this approach on Monday. -- Jens Axboe
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will try some things for the next 30-60 minutes, and then will leave for a canoe trip until Tuesday, so I will not be able to work on this idea. Ok, replacing every page_cache_release() in fs/splice.c with
static void splice_page_release(struct page *p)
{
	if (!PageSlab(p))
		page_cache_release(p);
}
and putting a cloned skb into the private field instead of the original one in spd_fill_page() ends up without a kernel hang. I'm not sure it is correct that the page can be released in fs/splice.c without calling any callback from network code, while network data is being processed. The size of the received file is bigger than the file sent, and the file sometimes contains repeated blocks of data. Cloned skb usage is likely too big an overhead, although for receiving the fast clone is unused in most cases, so there might be some gain. Attached is your patch with the above changes.
diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..a75dc56 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,12 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 
+static void splice_page_release(struct page *p)
+{
+	if (!PageSlab(p))
+		page_cache_release(p);
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +87,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	page_cache_release(buf->page);
+	splice_page_release(buf->page);
 	buf->flags &= ~PIPE_BUF_FLAG_LRU;
 }
@@ -191,6 +197,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		buf->page = spd->pages[page_nr];
 		buf->offset = spd->partial[page_nr].offset;
 		buf->len = spd->partial[page_nr].len;
+		buf->private = spd->partial[page_nr].private;
 		buf->ops = spd->ops;
 		if (spd->flags & SPLICE_F_GIFT)
 			buf->flags |= PIPE_BUF_FLAG_GIFT;
@@ -246,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr < spd->nr_pages)
-		page_cache_release(spd->pages[page_nr++]);
+		splice_page_release(spd->pages[page_nr++]);
 
 	return ret;
 }
@@ -322,7 +329,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
 		if (unlikely(error)) {
-			page_cache_release(page);
+			splice_page_release(page);
 			if (error == -EEXIST)
 				continue;
 			break;
@@ -448,7 +455,7 @@ fill_it:
 	 * we got, 'nr_pages' is how many pages are in the map.
 	 */
 	while (page_nr < nr_pages)
-		page_cache_release(pages[page_nr++]);
+		splice_page_release(pages[page_nr++]);
 
 	if (spd.nr_pages)
 		return splice_to_pipe(pipe, spd);
@@ -604,7 +611,7 @@ find_page:
 		if (ret != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
-		page_cache_release(page);
+		splice_page_release(page);
 		if (ret == AOP_TRUNCATED_PAGE)
 			goto find_page;
@@ -634,7 +641,7 @@ find_page:
 	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
 	if (ret) {
 		if (ret == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
+			splice_page_release(page);
 			goto find_page;
 		}
 		if (ret < 0)
@@ -651,7 +658,7 @@ find_page:
 	 */
 	mark_page_accessed(page);
out:
-	page_cache_release(page);
+	splice_page_release(page);
 	unlock_page(page);
out_ret:
 	return ret;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
 	unsigned int offset, len;
 	const struct pipe_buf_operations *ops;
 	unsigned int flags;
+	unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
 				int offset, u8 *to, int len,
Re: [PATCH] Virtual ethernet tunnel (v.2)
Ben Greear wrote: [snip] I would also like some way to identify veth from other device types, preferably something like a value in sysfs. However, that should not hold up We can do this with ethtool. It can get and print the driver name of the device. I think I'd like something in sysfs that we could query for any interface. Possible return strings could be: VLAN VETH ETH PPP BRIDGE AP /* wifi access point interface */ STA /* wifi station */ I will cook up a patch for consideration after veth goes in. Ben, could you please tell me what sysfs features you plan to implement? Thanks, Pavel
[PATCH] RFC: have tcp_recvmsg() check kthread_should_stop() and treat it as if it were signalled
Already sent this to several lists, but forgot netdev ;-)... This one's sort of outside my normal area of expertise, so I'm sending this as an RFC to gather feedback on the idea. Some background: the cifs_mount() and cifs_umount() functions currently send a signal to the cifsd kthread prior to calling kthread_stop on it. The reasoning is apparently that it's likely that cifsd will have called kernel_recvmsg(), and if it doesn't do this there can be a rather long delay when a filesystem is unmounted. The following patch is a first stab at removing this need. It makes it so that in tcp_recvmsg() we also check kthread_should_stop() at any point where we currently check to see if the task was signalled. If that returns true, then it acts as if it were signalled. I've tested this on a fairly recent kernel with a cifs module that doesn't send signals on unmount, and it seems to work as expected. I'm just not clear on whether it will have any adverse side effects. Obviously, if this approach is OK, then we'll probably also want to fix up other recvmsg functions (udp_recvmsg, etc). Anyone care to comment? Thanks, Signed-off-by: Jeff Layton [EMAIL PROTECTED]
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd4c295..1ad91fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -258,6 +258,7 @@
 #include <linux/cache.h>
 #include <linux/err.h>
 #include <linux/crypto.h>
+#include <linux/kthread.h>
 
 #include <net/icmp.h>
 #include <net/tcp.h>
@@ -1154,7 +1155,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		if (tp->urg_data && tp->urg_seq == *seq) {
 			if (copied)
 				break;
-			if (signal_pending(current)) {
+			if (signal_pending(current) || kthread_should_stop()) {
 				copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
 				break;
 			}
@@ -1197,6 +1198,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 			    !timeo ||
 			    signal_pending(current) ||
+			    kthread_should_stop() ||
 			    (flags & MSG_PEEK))
 				break;
 		} else {
@@ -1227,7 +1229,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				break;
 			}
-			if (signal_pending(current)) {
+			if (signal_pending(current) || kthread_should_stop()) {
 				copied = sock_intr_errno(timeo);
 				break;
 			}
Re: [PATCH] Virtual ethernet tunnel (v.2)
Pavel Emelianov wrote: Ben Greear wrote: [snip] I would also like some way to identify veth from other device types, preferably something like a value in sysfs. However, that should not hold up We can do this with ethtool. It can get and print the driver name of the device. I think I'd like something in sysfs that we could query for any interface. Possible return strings could be: VLAN VETH ETH PPP BRIDGE AP /* wifi access point interface */ STA /* wifi station */ I will cook up a patch for consideration after veth goes in. Ben, could you please tell me what sysfs features you plan to implement? I think this is the only thing that has a chance of getting into the kernel. Basically, I have a user-space app and I want to be able to definitively know the type of all interfaces. Currently, I have a hodge-podge of logic to query various ioctls and /proc files and, finally, guess by name if nothing else works. There must be a better way :P I have another sysfs patch that allows setting a default skb->mark for an interface so that you can set the skb->mark before it hits the connection tracking logic, but I've been told this one has very little chance of getting into the kernel. The skb->mark patch is only useful (as far as I can tell) if you also include a patch Patrick McHardy did for me that allowed the conn-tracking logic to use skb->mark as part of its tuple. This allows me to do NAT between virtual routers (routing tables) on the same machine using veth-equivalent drivers to connect the routers. He thinks this will probably never get into the kernel either.
I have another sysctl-related send-to-self patch that also has little chance of getting into the kernel, but it might be quite useful with veth (it's useful to me, but my needs aren't exactly mainstream :)). I'll post this separately for consideration. Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
Re: [Cbe-oss-dev] [PATCH 0/18] spidernet driver bug fixes
On Fri, Jun 08, 2007 at 11:12:31AM +1000, Michael Ellerman wrote: On Thu, 2007-06-07 at 14:17 -0500, Linas Vepstas wrote: Jeff, please apply for the 2.6.23 kernel tree. The patch series consists of two major bugfixes, and several bits of cleanup. The major bug fixes are: 1) a rare but fatal bug involving RX ram full messages, which results in a driver deadlock. 2) misconfigured TX interrupts, causing a severe performance degradation for small packets. I realise it's late, but shouldn't major bugfixes be going into 22? Yeah, I suppose; I admit I've lost track of the process. I'm not sure how to submit patches for this case. The major fixes are patches 6/18, 13/18, 14/18 and 17/18 (the rest of the patches are cruft-fixes). Taken alone, these four will not apply cleanly. I could prepare a new set with just these four; assuming these are accepted into 2.6.22, then once 22 comes out, Jeff's .23 tree won't merge cleanly. What's the right way to do this? --linas
Re: [Cbe-oss-dev] [PATCH 0/18] spidernet driver bug fixes
On Fri, Jun 08, 2007 at 12:06:08PM -0500, Linas Vepstas wrote: On Fri, Jun 08, 2007 at 11:12:31AM +1000, Michael Ellerman wrote: On Thu, 2007-06-07 at 14:17 -0500, Linas Vepstas wrote: Jeff, please apply for the 2.6.23 kernel tree. The patch series consists of two major bugfixes, and several bits of cleanup. The major bug fixes are: 1) a rare but fatal bug involving RX ram full messages, which results in a driver deadlock. 2) misconfigured TX interrupts, causing a severe performance degradation for small packets. I realise it's late, but shouldn't major bugfixes be going into 22? Yeah, I suppose; I admit I've lost track of the process. I'm not sure how to submit patches for this case. The major fixes are patches 6/18, 13/18, 14/18 and 17/18 (the rest of the patches are cruft-fixes). Taken alone, these four will not apply cleanly. I could prepare a new set with just these four; assuming these are accepted into 2.6.22, then once 22 comes out, Jeff's .23 tree won't merge cleanly. You need to order your bug fixes first in the queue. I push those upstream, and simultaneously merge the result into netdev#upstream (the 2.6.23 queue). Jeff
Re: [PATCH 0/4] NetXen: Initialization, link status and other bug fixes
On Thu, 2007-06-07 at 04:28 -0700, Mithlesh Thukral wrote: Hi All, I will be sending bug fixes related to initialization, link status and some compile issues of NetXen's 1/10G Ethernet driver in subsequent mails. These patches are wrt netdev#upstream-fixes. Regards, Mithlesh Thukral Jeff, Thanks for your review of this patch series on 6/3. Based on your comments, we have re-submitted the patches with the requested changes. We have also tested these patches on x/pBlade in our lab. Thanks, wendy
Re: [WIP][PATCHES] Network xmit batching
These results are based on the test script that I sent earlier today. I removed the results for the UDP 32 procs 512 and 4096 buffer cases since the BW was coming in at line speed (in fact it was showing 1500Mb/s and 4900Mb/s respectively for both the ORG and these bits). I expect UDP to overwhelm the receiver. So the receiver needs a lot more tuning (like increased rcv socket buffer sizes to keep up, IMO). But yes, the above is an odd result - Rick, any insight into this? Indeed, there is no flow control provided by netperf for the UDP_STREAM test, and so it is quite common for a receiver to be overwhelmed. One can tweak the SO_RCVBUF size a bit to try to help with transients, but if the sender is sustainably faster than the receiver, you have to configure netperf with --enable-intervals and then provide a send burst (number of sends) size and an inter-burst interval (constrained by HZ on the platform) to pace the netperf UDP sender. You can get finer-grained control with --enable-spin, but that shoots your netperf-side CPU util to hell. And with UDP datagram sizes > MTU there is (in the abstract, not sure about current Linux code) the concern about filling a transmit queue with some but not all of the fragments of a datagram and the others being tossed, so one ends up sending unreassemblable datagram fragments. Summary: Average BW (whatever meaning that has) improved 0.65%, while Service Demand deteriorated 11.86%. Sorry, it's been many moons since I last played with netperf; what does service demand mean? Service demand is a measure of efficiency. It is a normalization/reconciliation of the throughput and the CPU utilization to arrive at a CPU-consumed-per-unit-of-work figure. Lower is better.
Now, when running aggregate tests with netperf2 using the launch-a-bunch-in-the-background method, with confidence intervals enabled to get iterations to minimize skew error :) http://www.netperf.org/svn/netperf2/tags/netperf-2.4.3/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance you cannot take the netperf service demand directly - each netperf calculates it assuming that it is the only thing running on the system. It then ass-u-me-s that the CPU util it measured was all for its work. This means the service demand figure will be quite a bit higher than it really is. So, for aggregate tests using netperf2, one has to calculate service demand by hand. Sum the throughput as KB/s, convert the CPU util and number of CPUs to microseconds of CPU consumed per second, and divide to get microseconds per KB for the aggregate. rick jones
Re: [WIP][PATCHES] Network xmit batching
Also note something else that is kind of strange: something like UDP, which doesn't back off, will send out fewer packets/second ;- Cannot explain that either :) Perhaps delays in restarting after the intra-stack flow control is asserted. One possible thing to do to try to deal with that a little would be to increase SO_SNDBUF in netperf with the -s option. That at least is something I did back in 2.4 days. rick jones
[PATCH 1/2] ibmveth: Fix h_free_logical_lan error on pool resize
When attempting to activate additional rx buffer pools on an ibmveth interface that was not yet up, the error below was seen. The patch fixes this by only closing and opening the interface to activate the resize if the interface is already opened. (drivers/net/ibmveth.c:597 ua:3004) ERROR: h_free_logical_lan failed with fffc, continuing with close Unable to handle kernel paging request for data at address 0x0ff8 Faulting instruction address: 0xd02540e0 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=128 NUMA PSERIES LPAR Modules linked in: ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle ipta ble_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables i p6table_filter ip6_tables x_tables ipv6 apparmor aamatch_pcre loop dm_mod ibmvet h sg ibmvscsic sd_mod scsi_mod NIP: D02540E0 LR: D02540D4 CTR: 801AF404 REGS: c0001cd27870 TRAP: 0300 Not tainted (2.6.16.46-0.4-ppc64) MSR: 80009032 EE,ME,IR,DR CR: 24242422 XER: 0007 DAR: 0FF8, DSISR: 4000 TASK = c0001ca7b4e0[1636] 'sh' THREAD: c0001cd24000 CPU: 0 GPR00: D02540D4 C0001CD27AF0 D0265650 C0001C936500 GPR04: 80009032 0007 0002C2EF GPR08: C0652A10 C0652AE0 GPR12: 4000 C04A3300 100A GPR16: 100B8808 100C0F60 10084878 GPR20: 100C0CB0 100AF498 0002 GPR24: 100BA488 C0001C936760 D0258DD0 C0001C936000 GPR28: C0001C936500 D0265180 C0001C936000 NIP [D02540E0] .ibmveth_close+0xc8/0xf4 [ibmveth] LR [D02540D4] .ibmveth_close+0xbc/0xf4 [ibmveth] Call Trace: [C0001CD27AF0] [D02540D4] .ibmveth_close+0xbc/0xf4 [ibmveth] (unreliable) [C0001CD27B80] [D02545FC] .veth_pool_store+0xd0/0x260 [ibmveth] [C0001CD27C40] [C012E0E8] .sysfs_write_file+0x118/0x198 [C0001CD27CF0] [C00CDAF0] .vfs_write+0x130/0x218 [C0001CD27D90] [C00CE52C] .sys_write+0x4c/0x8c [C0001CD27E30] [C000871C] syscall_exit+0x0/0x40 Instruction dump: 419affd8 2fa3 419e0020 e93d e89e8040 38a00255 e87e81b0 80c90018 48001531 e8410028 e93d00e0 7fa3eb78 e8090ff8 f81d0430 4bfffdc9 38210090 Signed-off-by: Brian King [EMAIL PROTECTED] --- 
 linux-2.6-bjking1/drivers/net/ibmveth.c | 53 ++--
 1 file changed, 31 insertions(+), 22 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_large_frames drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_large_frames	2007-05-14 15:03:06.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-05-15 09:18:46.000000000 -0500
@@ -1243,16 +1243,19 @@ const char * buf, size_t count)
 	if (attr == veth_active_attr) {
 		if (value && !pool->active) {
-			if (ibmveth_alloc_buffer_pool(pool)) {
-				ibmveth_error_printk("unable to alloc pool\n");
-				return -ENOMEM;
-			}
-			pool->active = 1;
-			adapter->pool_config = 1;
-			ibmveth_close(netdev);
-			adapter->pool_config = 0;
-			if ((rc = ibmveth_open(netdev)))
-				return rc;
+			if (netif_running(netdev)) {
+				if (ibmveth_alloc_buffer_pool(pool)) {
+					ibmveth_error_printk("unable to alloc pool\n");
+					return -ENOMEM;
+				}
+				pool->active = 1;
+				adapter->pool_config = 1;
+				ibmveth_close(netdev);
+				adapter->pool_config = 0;
+				if ((rc = ibmveth_open(netdev)))
+					return rc;
+			} else
+				pool->active = 1;
 		} else if (!value && pool->active) {
 			int mtu = netdev->mtu + IBMVETH_BUFF_OH;
 			int i;
@@ -1281,23 +1284,29 @@ const char * buf, size_t count)
 		if (value <= 0 || value > IBMVETH_MAX_POOL_COUNT)
 			return -EINVAL;
 		else {
-			adapter->pool_config = 1;
-			ibmveth_close(netdev);
-			adapter->pool_config = 0;
-			pool->size = value;
-			if ((rc = ibmveth_open(netdev)))
-				return rc;
+			if (netif_running(netdev)) {
+				adapter->pool_config = 1;
+
[PATCH 2/2] ibmveth: Automatically enable larger rx buffer pools for larger mtu
Currently, ibmveth maintains several rx buffer pools, which can be modified through sysfs. By default, the larger pools are not allocated, so jumbo frames cannot be supported without first activating larger rx buffer pools; attempts to change the mtu therefore fail. This patch makes ibmveth automatically allocate these larger buffer pools when the mtu is changed.

Signed-off-by: Brian King [EMAIL PROTECTED]
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |   27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_large_mtu drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_large_mtu	2007-05-16 10:47:54.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-05-16 10:47:54.000000000 -0500
@@ -915,17 +915,36 @@ static int ibmveth_change_mtu(struct net
 {
 	struct ibmveth_adapter *adapter = dev->priv;
 	int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
-	int i;
+	int reinit = 0;
+	int i, rc;

 	if (new_mtu < IBMVETH_MAX_MTU)
 		return -EINVAL;

+	for (i = 0; i < IbmVethNumBufferPools; i++)
+		if (new_mtu_oh < adapter->rx_buff_pool[i].buff_size)
+			break;
+
+	if (i == IbmVethNumBufferPools)
+		return -EINVAL;
+
 	/* Look for an active buffer pool that can hold the new MTU */
 	for(i = 0; i < IbmVethNumBufferPools; i++) {
-		if (!adapter->rx_buff_pool[i].active)
-			continue;
+		if (!adapter->rx_buff_pool[i].active) {
+			adapter->rx_buff_pool[i].active = 1;
+			reinit = 1;
+		}
+
 		if (new_mtu_oh < adapter->rx_buff_pool[i].buff_size) {
-			dev->mtu = new_mtu;
+			if (reinit && netif_running(adapter->netdev)) {
+				adapter->pool_config = 1;
+				ibmveth_close(adapter->netdev);
+				adapter->pool_config = 0;
+				dev->mtu = new_mtu;
+				if ((rc = ibmveth_open(adapter->netdev)))
+					return rc;
+			} else
+				dev->mtu = new_mtu;
 			return 0;
 		}
 	}
_
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] make network DMA usable for non-tcp drivers
On Fri, 8 Jun 2007 10:30:53 -0400 "Ed L. Cashin" [EMAIL PROTECTED] wrote:

Here is a patch against the netdev-2.6 git tree that makes the net DMA feature usable for drivers like the ATA over Ethernet block driver, which can use dma_skb_copy_datagram_iovec when receiving data from the network. The change was suggested on kernelnewbies.

http://article.gmane.org/gmane.linux.kernel.kernelnewbies/21663

Signed-off-by: Ed L. Cashin [EMAIL PROTECTED]
---
 drivers/dma/Kconfig |    2 +-
 net/core/user_dma.c |    2 ++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 72be6c6..270d23e 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -14,7 +14,7 @@ config DMA_ENGINE
 comment "DMA Clients"

 config NET_DMA
-	bool "Network: TCP receive copy offload"
+	bool "Network: receive copy offload"
 	depends on DMA_ENGINE && NET
 	default y
 	---help---
diff --git a/net/core/user_dma.c b/net/core/user_dma.c
index 0ad1cd5..69d0b15 100644
--- a/net/core/user_dma.c
+++ b/net/core/user_dma.c
@@ -130,3 +130,5 @@ end:
 fault:
 	return -EFAULT;
 }
+
+EXPORT_SYMBOL(dma_skb_copy_datagram_iovec);

We wouldn't want to merge this until code which actually uses the export is also merged.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel (v.2)
On 08.06.2007 19:00, Ben Greear wrote:

I have another sysfs patch that allows setting a default skb->mark for an interface so that you can set the skb->mark before it hits the connection tracking logic, but I've been told this one has very little chance of getting into the kernel. The skb->mark patch is only useful (as far as I can tell) if you also include a patch Patrick McHardy did for me that allowed the conn-tracking logic to use skb->mark as part of its tuple. This allows me to do NAT between virtual routers (routing tables) on the same machine using veth-equivalent drivers to connect the routers. He thinks this will probably not ever get into the kernel either.

Are these patches available somewhere? I'm currently doing NAT between virtual routers by some advanced iproute2/iptables trickery, but I have no way to handle the occasional tuple conflict.

Regards,
Carl-Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ipvs: Fix state variable on failure to start ipvs threads
Hey all-

ip_vs currently fails to reset its ip_vs_sync_state variable if the sync thread fails to start properly. The result is that the kernel will report a running daemon when there actually is none. If you issue the following commands:

1. ipvsadm --start-daemon master --mcast-interface bla
2. ipvsadm -L --daemon
3. ipvsadm --stop-daemon master

Assuming that "bla" is not an actual interface, step 2 should return no data, but instead returns:

$ ipvsadm -L --daemon
master sync daemon (mcast=bla, syncid=0)

The following patch corrects this behavior. Tested successfully by myself.

Thanks & Regards
Neil

Signed-off-by: Neil Horman [EMAIL PROTECTED]

 ip_vs_sync.c |   41 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipvs/ip_vs_sync.c b/net/ipv4/ipvs/ip_vs_sync.c
index 7ea2d98..ff4df68 100644
--- a/net/ipv4/ipvs/ip_vs_sync.c
+++ b/net/ipv4/ipvs/ip_vs_sync.c
@@ -67,6 +67,11 @@ struct ip_vs_sync_conn_options {
 	struct ip_vs_seq	out_seq;	/* outgoing seq. struct */
 };

+struct ip_vs_sync_thread_data {
+	struct completion *startup;
+	int state;
+};
+
 #define IP_VS_SYNC_CONN_TIMEOUT (3*60*HZ)
 #define SIMPLE_CONN_SIZE  (sizeof(struct ip_vs_sync_conn))
 #define FULL_CONN_SIZE  \
@@ -751,6 +756,7 @@ static int sync_thread(void *startup)
 	mm_segment_t oldmm;
 	int state;
 	const char *name;
+	struct ip_vs_sync_thread_data *tinfo = startup;

 	/* increase the module use count */
 	ip_vs_use_count_inc();
@@ -789,7 +795,14 @@ static int sync_thread(void *startup)
 	add_wait_queue(&sync_wait, &wait);

 	set_sync_pid(state, current->pid);
-	complete((struct completion *)startup);
+	complete(tinfo->startup);
+
+	/*
+	 * once we call the completion queue above, we should
+	 * null out that reference, since its allocated on the
+	 * stack of the creating kernel thread
+	 */
+	tinfo->startup = NULL;

 	/* processing master/backup loop here */
 	if (state == IP_VS_STATE_MASTER)
@@ -801,6 +814,14 @@ static int sync_thread(void *startup)
 	remove_wait_queue(&sync_wait, &wait);

 	/* thread exits */
+
+	/*
+	 * If we weren't explicitly stopped, then we
+	 * exited in error, and should undo our state
+	 */
+	if ((!stop_master_sync) && (!stop_backup_sync))
+		ip_vs_sync_state -= tinfo->state;
+
 	set_sync_pid(state, 0);
 	IP_VS_INFO("sync thread stopped!\n");
@@ -812,6 +833,11 @@ static int sync_thread(void *startup)
 	set_stop_sync(state, 0);
 	wake_up(&stop_sync_wait);

+	/*
+	 * we need to free the structure that was allocated
+	 * for us in start_sync_thread
+	 */
+	kfree(tinfo);
 	return 0;
 }

@@ -838,11 +864,19 @@ int start_sync_thread(int state, char *mcast_ifn, __u8 syncid)
 {
 	DECLARE_COMPLETION_ONSTACK(startup);
 	pid_t pid;
+	struct ip_vs_sync_thread_data *tinfo;

 	if ((state == IP_VS_STATE_MASTER && sync_master_pid) ||
 	    (state == IP_VS_STATE_BACKUP && sync_backup_pid))
 		return -EEXIST;

+	/*
+	 * Note that tinfo will be freed in sync_thread on exit
+	 */
+	tinfo = kmalloc(sizeof(struct ip_vs_sync_thread_data), GFP_KERNEL);
+	if (!tinfo)
+		return -ENOMEM;
+
 	IP_VS_DBG(7, "%s: pid %d\n", __FUNCTION__, current->pid);
 	IP_VS_DBG(7, "Each ip_vs_sync_conn entry need %Zd bytes\n",
 		  sizeof(struct ip_vs_sync_conn));
@@ -858,8 +892,11 @@ int start_sync_thread(int state, char *mcast_ifn, __u8 syncid)
 		ip_vs_backup_syncid = syncid;
 	}

+	tinfo->state = state;
+	tinfo->startup = &startup;
+
 repeat:
-	if ((pid = kernel_thread(fork_sync_thread, &startup, 0)) < 0) {
+	if ((pid = kernel_thread(fork_sync_thread, tinfo, 0)) < 0) {
 		IP_VS_ERR("could not create fork_sync_thread due to %d... retrying.\n",
 			  pid);
 		msleep_interruptible(1000);
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] NET: Multiqueue network device support.
I thought the correct use is to get this lock on the clean_tx side, which can get called on a different cpu on rx (which also cleans up slots for skbs that have finished xmit). Both TX and clean_tx use the same tx_ring's head/tail ptrs and should be exclusive. But I don't find clean_tx using this lock in the code, so I am confused :-)

From e1000_main.c, e1000_clean():

	/* e1000_clean is called per-cpu.  This lock protects
	 * tx_ring[0] from being cleaned by multiple cpus
	 * simultaneously.  A failure obtaining the lock means
	 * tx_ring[0] is currently being cleaned anyway. */
	if (spin_trylock(&adapter->tx_queue_lock)) {
		tx_cleaned = e1000_clean_tx_irq(adapter,
						&adapter->tx_ring[0]);
		spin_unlock(&adapter->tx_queue_lock);
	}

In a multi-ring implementation of the driver, this is wrapped with for (i = 0; i < adapter->num_tx_queues; i++) and adapter->tx_ring[i]. This lock also prevents the clean routine from stomping on xmit_frame() when transmitting. Also in the multi-ring implementation, the tx_lock is pushed down into the individual tx_ring struct, not at the adapter level.

Cheers,
-PJ Waskiewicz [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
SKY2 vs SK98LIN performance on 88E8053 MAC
Hello! We are observing severe IPv4 forwarding degradation when switching from the sk98lin to the sky2 driver.

Setup: plain 2.6.21.3 kernel, 88E8053 Marvell Yukon2 MAC; sk98lin is at revision 8.41.2.3 coming from FC6; the sky2 driver is from the 2.6.21.3 kernel; both drivers are in NAPI mode. Benchmarks are done using bidirectional traffic generated by IXIA, sending 256-byte packets. Observed packet throughput is almost 30% higher with the sk98lin driver. Ethernet flow control is turned off in the sky2 driver (hard-coded as off, we know about this problem).

I also have oprofile records of the drivers in case anybody is interested. Please share info if you know anything about sky2 performance bottlenecks.

Thanks in advance,
Philip R.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP][PATCHES] Network xmit batching
On Fri, Jun 08, 2007 at 09:07:47AM -0400, jamal ([EMAIL PROTECTED]) wrote:

Something that anyone can understand :) For example /proc stats — although it is not very accurate, it is a really usable parameter from a userspace point of view.

which /proc stats?

/proc/$pid/stat. For pktgen it is likely not that interesting, but for a usual userspace application it is quite an interesting parameter. At least that is what 'top' shows.

--
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: networking busted in current -git ???
On Fri, 2007-06-08 at 23:07 +0200, Arkadiusz Miskiewicz wrote:

On Friday 08 of June 2007, you wrote:

Hello, I am using the current git tree: 85f6038f2170e3335dda09c3dfb0f83110e87019. The git tree from two days ago (with the same config) works fine. Attempting to acquire an IP address via DHCP fails with:

SIOCSIFADDR: No buffer space available
Listening on LPF/eth0/00:19:b9:0c:9a:43
Sending on   LPF/eth0/00:19:b9:0c:9a:43
Sending on   Socket/fallback
DHCPREQUEST on eth0 to 255.255.255.255 port 67
DHCPACK from xxx.xxx.xxx.xxx
SIOCSIFADDR: No buffer space available
SIOCSIFNETMASK: Cannot assign requested address
SIOCSIFBRDADDR: Cannot assign requested address
SIOCADDRT: Network is unreachable
bound to xxx.xxx.xxx.xxx -- renewal in 98610 seconds.

This is on a Dell 490 with the tg3 network driver running Ubuntu 7.04. .config and dmesg are appended.

florin

Here it requires a few retries (stop dhcpcd, start again) to get the IP. git tree from a few hours ago. tg3 driver. I also saw "SIOCSIFADDR: No buffer space available" once.

(added netdev to the Cc list)

It is not dhcp. I'm seeing the same bug with bog-standard ifup with a static address on an FC-6 machine. It appears to be something in the latest dump from davem to Linus, but I haven't yet had time to identify what.

Cheers
Trond
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family
A lot of netdevice drivers implement their own variant of printk, using variations of dev_printk, printk, or other helpers gated on msg_enable; this has been an eyesore, with countless variations across drivers. This patch implements a standard ndev_printk and derivatives such as ndev_err, ndev_info, and ndev_warn that allow drivers to transparently use both the msg_enable mask and a generic netdevice message layout. It moves msg_enable over to the net_device struct and lets drivers drop their ethtool handling code for the msg_enable value.

The current code has each driver contain a copy of msg_enable and handle setting/changing it through ethtool. Since the netdev name is stored in the net_device struct, those two values are not coherently available in a uniform way across all drivers (a single macro or function would not work, since all drivers name their net_device members differently). This makes netdevice driver writers reinvent the wheel over and over again. It thus makes sense to move msg_enable to the net_device. This gives us the opportunity to (1) initialize it by default with a globally sane value, (2) remove the msg_enable handling code w.r.t. ethtool for drivers that know and use the msg_enable member of the net_device struct, and (3) let ethtool code just modify the net_device msg_enable for drivers that do not have custom msg_enable get/set handlers, so converted drivers lose some code for that as well.
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---
 include/linux/netdevice.h |   24 ++++++++++++++++++++++++
 net/core/dev.c            |   10 ++++++++++
 net/core/ethtool.c        |   14 +++++++-------
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3a70f55..5551b63 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -540,6 +540,8 @@ struct net_device
 	struct device		dev;
 	/* space for optional statistics and wireless sysfs groups */
 	struct attribute_group	*sysfs_groups[3];
+
+	int			msg_enable;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)

@@ -838,6 +840,28 @@ enum {
 	NETIF_MSG_WOL		= 0x4000,
 };

+#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
+	do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) { \
+		printk(kern_level "%s: " format, \
+		       (netdev)->name, ## arg); } } while (0)
+
+#ifdef DEBUG
+#define ndev_dbg(level, netdev, format, arg...) \
+	ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
+#else
+#define ndev_dbg(level, netdev, format, arg...) \
+	do { (void)(netdev); } while (0)
+#endif
+
+#define ndev_err(level, netdev, format, arg...) \
+	ndev_printk(KERN_ERR, level, netdev, format, ## arg)
+#define ndev_info(level, netdev, format, arg...) \
+	ndev_printk(KERN_INFO, level, netdev, format, ## arg)
+#define ndev_warn(level, netdev, format, arg...) \
+	ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
+#define ndev_notice(level, netdev, format, arg...) \
+	ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
+
 #define netif_msg_drv(p)	((p)->msg_enable & NETIF_MSG_DRV)
 #define netif_msg_probe(p)	((p)->msg_enable & NETIF_MSG_PROBE)
 #define netif_msg_link(p)	((p)->msg_enable & NETIF_MSG_LINK)
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a7f20f..e854c09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	dev->priv = netdev_priv(dev);
 	dev->get_stats = internal_stats;
+	dev->msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
+#ifdef DEBUG
+	/* put these to good use: */
+	dev->msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
+			   NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
+			   NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
+			   NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
+			   NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
+			   NETIF_MSG_HW | NETIF_MSG_WOL;
+#endif
 	setup(dev);
 	strcpy(dev->name, name);
 	return dev;
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 8d5e5a0..ff8d52f 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -234,9 +234,9 @@ static int ethtool_get_msglevel(struct net_device *dev, char __user *useraddr)
 	struct ethtool_value edata = { ETHTOOL_GMSGLVL };

 	if (!dev->ethtool_ops->get_msglevel)
-		return -EOPNOTSUPP;
-
-	edata.data = dev->ethtool_ops->get_msglevel(dev);
+		edata.data = dev->msg_enable;
+	else
+		edata.data = dev->ethtool_ops->get_msglevel(dev);

 	if (copy_to_user(useraddr, &edata, sizeof(edata)))
 		return -EFAULT;
@@ -247,13 +247,13 @@ static int ethtool_set_msglevel(struct net_device *dev, char __user *useraddr)
 {
[PATCH 2/2] [RFC] NET: Convert several drivers to ndev_printk
With the generic ndev_printk macros, we can now convert network drivers to use this generic printk family for netdevices.

Signed-off-by: Auke Kok [EMAIL PROTECTED]
---
 drivers/net/e100.c                |  121 +++--
 drivers/net/e1000/e1000.h         |   15 -
 drivers/net/e1000/e1000_ethtool.c |   39
 drivers/net/e1000/e1000_main.c    |  101 +++
 drivers/net/e1000/e1000_param.c   |   67 ++--
 drivers/net/ixgb/ixgb.h           |   14
 drivers/net/ixgb/ixgb_ethtool.c   |   15 -
 drivers/net/ixgb/ixgb_main.c      |   46 ++
 8 files changed, 166 insertions(+), 252 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 6ca0a08..56e7504 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -172,19 +172,12 @@ MODULE_AUTHOR(DRV_COPYRIGHT);
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);

-static int debug = 3;
 static int eeprom_bad_csum_allow = 0;
 static int use_io = 0;
-module_param(debug, int, 0);
 module_param(eeprom_bad_csum_allow, int, 0);
 module_param(use_io, int, 0);
-MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 MODULE_PARM_DESC(eeprom_bad_csum_allow, "Allow bad eeprom checksums");
 MODULE_PARM_DESC(use_io, "Force use of i/o access mode");
-#define DPRINTK(nlevel, klevel, fmt, args...) \
-	(void)((NETIF_MSG_##nlevel & nic->msg_enable) && \
-	printk(KERN_##klevel PFX "%s: %s: " fmt, nic->netdev->name, \
-		__FUNCTION__ , ## args))

 #define INTEL_8255X_ETHERNET_DEVICE(device_id, ich) {\
 	PCI_VENDOR_ID_INTEL, device_id, PCI_ANY_ID, PCI_ANY_ID, \
@@ -644,12 +637,12 @@ static int e100_self_test(struct nic *nic)

 	/* Check results of self-test */
 	if(nic->mem->selftest.result != 0) {
-		DPRINTK(HW, ERR, "Self-test failed: result=0x%08X\n",
+		ndev_err(HW, nic->netdev, "Self-test failed: result=0x%08X\n",
 			nic->mem->selftest.result);
 		return -ETIMEDOUT;
 	}
 	if(nic->mem->selftest.signature == 0) {
-		DPRINTK(HW, ERR, "Self-test failed: timed out\n");
+		ndev_err(HW, nic->netdev, "Self-test failed: timed out\n");
 		return -ETIMEDOUT;
 	}

@@ -753,7 +746,7 @@ static int e100_eeprom_load(struct nic *nic)
 	 * the sum of words should be 0xBABA */
 	checksum = le16_to_cpu(0xBABA - checksum);
 	if(checksum != nic->eeprom[nic->eeprom_wc - 1]) {
-		DPRINTK(PROBE, ERR, "EEPROM corrupted\n");
+		ndev_err(PROBE, nic->netdev, "EEPROM corrupted\n");
 		if (!eeprom_bad_csum_allow)
 			return -EAGAIN;
 	}
@@ -908,7 +901,7 @@ static u16 mdio_ctrl(struct nic *nic, u32 addr, u32 dir, u32 reg, u16 data)
 		break;
 	}
 	spin_unlock_irqrestore(&nic->mdio_lock, flags);
-	DPRINTK(HW, DEBUG,
+	ndev_dbg(HW, nic->netdev,
 		"%s:addr=%d, reg=%d, data_in=0x%04X, data_out=0x%04X\n",
 		dir == mdi_read ? "READ" : "WRITE", addr, reg, data, data_out);
 	return (u16)data_out;
@@ -960,8 +953,8 @@ static void e100_get_defaults(struct nic *nic)
 static void e100_configure(struct nic *nic, struct cb *cb, struct sk_buff *skb)
 {
 	struct config *config = &cb->u.config;
-	u8 *c = (u8 *)config;
-
+	u8 *c;
+
 	cb->command = cpu_to_le16(cb_config);

 	memset(config, 0, sizeof(struct config));
@@ -1021,12 +1014,16 @@ static void e100_configure(struct nic *nic, struct cb *cb, struct sk_buff *skb)
 		config->standard_stat_counter = 0x0;
 	}

-	DPRINTK(HW, DEBUG, "[00-07]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n",
-		c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
-	DPRINTK(HW, DEBUG, "[08-15]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n",
-		c[8], c[9], c[10], c[11], c[12], c[13], c[14], c[15]);
-	DPRINTK(HW, DEBUG, "[16-23]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n",
-		c[16], c[17], c[18], c[19], c[20], c[21], c[22], c[23]);
+	c = (u8 *)config;
+	ndev_dbg(HW, nic->netdev,
+		 "[00-07]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n",
+		 c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
+	ndev_dbg(HW, nic->netdev,
+		 "[08-15]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n",
+		 c[8], c[9], c[10], c[11], c[12], c[13], c[14], c[15]);
+	ndev_dbg(HW, nic->netdev,
+		 "[16-23]=%02X:%02X:%02X:%02X:%02X:%02X:%02X:%02X\n",
+		 c[16], c[17], c[18], c[19], c[20], c[21], c[22], c[23]);
 }

@@ -1296,7 +1293,7 @@ static inline int e100_exec_cb_wait(struct nic *nic, struct sk_buff *skb,
 	struct cb *cb = nic->cb_to_clean;

 	if ((err = e100_exec_cb(nic, NULL, e100_setup_ucode)))
-		DPRINTK(PROBE,ERR, "ucode cmd failed with error %d\n", err);
+
Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family
+#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
+	do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) { \
+		printk(kern_level "%s: " format, \
+		       (netdev)->name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?

+#ifdef DEBUG
+#define ndev_dbg(level, netdev, format, arg...) \
+	ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
+#else
+#define ndev_dbg(level, netdev, format, arg...) \
+	do { (void)(netdev); } while (0)
+#endif
+
+#define ndev_err(level, netdev, format, arg...) \
+	ndev_printk(KERN_ERR, level, netdev, format, ## arg)
+#define ndev_info(level, netdev, format, arg...) \
+	ndev_printk(KERN_INFO, level, netdev, format, ## arg)
+#define ndev_warn(level, netdev, format, arg...) \
+	ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
+#define ndev_notice(level, netdev, format, arg...) \
+	ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
+
 #define netif_msg_drv(p)	((p)->msg_enable & NETIF_MSG_DRV)
 #define netif_msg_probe(p)	((p)->msg_enable & NETIF_MSG_PROBE)
 #define netif_msg_link(p)	((p)->msg_enable & NETIF_MSG_LINK)

diff --git a/net/core/dev.c b/net/core/dev.c
index 5a7f20f..e854c09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	dev->priv = netdev_priv(dev);
 	dev->get_stats = internal_stats;
+	dev->msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
+#ifdef DEBUG
+	/* put these to good use: */
+	dev->msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
+			   NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
+			   NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
+			   NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
+			   NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
+			   NETIF_MSG_HW | NETIF_MSG_WOL;
+#endif

Let driver writer choose message enable bits please.

--
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SKY2 vs SK98LIN performance on 88E8053 MAC
On Fri, 8 Jun 2007 13:41:55 -0700 (PDT) Philip Romanov [EMAIL PROTECTED] wrote:

Hello! We are observing severe IPv4 forwarding degradation when switching from the sk98lin to the sky2 driver. Setup: plain 2.6.21.3 kernel, 88E8053 Marvell Yukon2 MAC; sk98lin is at revision 8.41.2.3 coming from FC6; the sky2 driver is from the 2.6.21.3 kernel; both drivers are in NAPI mode. Benchmarks are done using bidirectional traffic generated by IXIA, sending 256-byte packets. Observed packet throughput is almost 30% higher with the sk98lin driver. Ethernet flow control is turned off in the sky2 driver (hard-coded as off, we know about this problem). I also have oprofile records of the drivers in case anybody is interested. Please share info if you know anything about sky2 performance bottlenecks.

I'm surprised? The vendor driver has bogus extra locking and other crap. Please send profile data. Flow control should work on sky2 (now). Are you routing or doing real TCP transfers?

--
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family
Stephen Hemminger wrote:

+#define ndev_printk(kern_level, netif_level, netdev, format, arg...) \
+	do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) { \
+		printk(kern_level "%s: " format, \
+		       (netdev)->name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?

hmm, you lost me there a bit; do you want me to duplicate this code for all the ndev_err/ndev_info functions instead, so that ndev_err doesn't direct back to ndev_printk?

+#ifdef DEBUG
+#define ndev_dbg(level, netdev, format, arg...) \
+	ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
+#else
+#define ndev_dbg(level, netdev, format, arg...) \
+	do { (void)(netdev); } while (0)
+#endif
+
+#define ndev_err(level, netdev, format, arg...) \
+	ndev_printk(KERN_ERR, level, netdev, format, ## arg)
+#define ndev_info(level, netdev, format, arg...) \
+	ndev_printk(KERN_INFO, level, netdev, format, ## arg)
+#define ndev_warn(level, netdev, format, arg...) \
+	ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
+#define ndev_notice(level, netdev, format, arg...) \
+	ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
+
 #define netif_msg_drv(p)	((p)->msg_enable & NETIF_MSG_DRV)
 #define netif_msg_probe(p)	((p)->msg_enable & NETIF_MSG_PROBE)
 #define netif_msg_link(p)	((p)->msg_enable & NETIF_MSG_LINK)

diff --git a/net/core/dev.c b/net/core/dev.c
index 5a7f20f..e854c09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	dev->priv = netdev_priv(dev);
 	dev->get_stats = internal_stats;
+	dev->msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
+#ifdef DEBUG
+	/* put these to good use: */
+	dev->msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
+			   NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
+			   NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
+			   NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
+			   NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
+			   NETIF_MSG_HW | NETIF_MSG_WOL;
+#endif

Let driver writer choose message enable bits please.

the driver can, since these bits are set in alloc_netdev; nothing prevents a driver from setting the mask immediately afterwards. Putting in a sane default seems a good idea and good practice. Maybe I went a bit far by going all out on the DEBUG flags tho... perhaps those can be removed, or only NETIF_MSG_RX_ERR and NETIF_MSG_TX_ERR set with DEBUG.

Auke
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel (v.2)
Carl-Daniel Hailfinger wrote:

On 08.06.2007 19:00, Ben Greear wrote:

I have another sysfs patch that allows setting a default skb->mark for an interface so that you can set the skb->mark before it hits the connection tracking logic, but I've been told this one has very little chance of getting into the kernel. The skb->mark patch is only useful (as far as I can tell) if you also include a patch Patrick McHardy did for me that allowed the conn-tracking logic to use skb->mark as part of its tuple. This allows me to do NAT between virtual routers (routing tables) on the same machine using veth-equivalent drivers to connect the routers. He thinks this will probably not ever get into the kernel either.

Are these patches available somewhere? I'm currently doing NAT between virtual routers by some advanced iproute2/iptables trickery, but I have no way to handle the occasional tuple conflict.

A consolidated patch against 2.6.20.12 is here. It has a lot more than just the patches mentioned above, but it shouldn't hurt anything to have the whole patch applied:
http://www.candelatech.com/oss/candela_2.6.20.patch

The original patch for using skb->mark as a tuple was written by Patrick McHardy, and is here:
http://www.candelatech.com/oss/skb_mark_conntrack.patch

His patch, merged with my patch to sysfs to set skb->mark on ingress, is here:
http://www.candelatech.com/oss/conntrack_mark_with_ssyctl.patch

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
RFC: Support send-to-self over external interfaces (and veths).
This should also be useful with the pending 'veth' driver, as it emulates two ethernet ports connected with a cross-over cable. To make this work, you have to enable the sysctl (look Dave, no IOCTLS, there might be hope for me yet!! :)), and in your application you will need to use SO_BINDTODEVICE (and probably bind to the local IP as well). Some applications such as traceroute already support this binding..others such as ping do not. You most likely will also have to set up routing tables using source IPs as a rule to direct these connections to a particular routing table. Comments welcome. Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h index c0f7aec..88f78b6 100644 --- a/include/linux/inetdevice.h +++ b/include/linux/inetdevice.h @@ -31,6 +31,7 @@ struct ipv4_devconf int no_policy; int force_igmp_version; int promote_secondaries; + int accept_sts; void *sysctl; }; @@ -84,6 +85,7 @@ struct in_device #define IN_DEV_ARPFILTER(in_dev) (ipv4_devconf.arp_filter || (in_dev)-cnf.arp_filter) #define IN_DEV_ARP_ANNOUNCE(in_dev) (max(ipv4_devconf.arp_announce, (in_dev)-cnf.arp_announce)) #define IN_DEV_ARP_IGNORE(in_dev) (max(ipv4_devconf.arp_ignore, (in_dev)-cnf.arp_ignore)) +#define IN_DEV_ACCEPT_STS(in_dev) (max(ipv4_devconf.accept_sts, (in_dev)-cnf.accept_sts)) struct in_ifaddr { diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 47f1c53..6c00bf4 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -496,6 +496,7 @@ enum NET_IPV4_CONF_ARP_IGNORE=19, NET_IPV4_CONF_PROMOTE_SECONDARIES=20, NET_IPV4_CONF_ARP_ACCEPT=21, + NET_IPV4_CONF_ACCEPT_STS=22, __NET_IPV4_CONF_MAX }; diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 7110779..9866f1b 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -419,6 +419,26 @@ static int arp_ignore(struct in_device *in_dev, struct net_device *dev, return !inet_confirm_addr(dev, sip, tip, scope); } 
+static int is_ip_on_dev(struct net_device* dev, __u32 ip) { + int rv = 0; + struct in_device* in_dev = in_dev_get(dev); + if (in_dev) { + struct in_ifaddr *ifa; + + rcu_read_lock(); + for (ifa = in_dev-ifa_list; ifa; ifa = ifa-ifa_next) { + if (ifa-ifa_address == ip) { + /* match */ + rv = 1; + break; + } + } + rcu_read_unlock(); + in_dev_put(in_dev); + } + return rv; +} + static int arp_filter(__be32 sip, __be32 tip, struct net_device *dev) { struct flowi fl = { .nl_u = { .ip4_u = { .daddr = sip, @@ -430,8 +450,38 @@ static int arp_filter(__be32 sip, __be32 tip, struct net_device *dev) if (ip_route_output_key(rt, fl) 0) return 1; if (rt-u.dst.dev != dev) { - NET_INC_STATS_BH(LINUX_MIB_ARPFILTER); - flag = 1; + struct in_device *in_dev = in_dev_get(dev); + if (in_dev IN_DEV_ACCEPT_STS(in_dev) + (rt-u.dst.dev == loopback_dev)) { + /* Accept these IFF target-ip == dev's IP */ + /* TODO: Need to force the ARP response back out the interface + * instead of letting it route locally. + */ + + if (is_ip_on_dev(dev, tip)) { +/* OK, we'll let this special case slide, so that we can + * arp from one local interface to another. This seems + * to work, but could use some review. 
				 * --Ben
				 */
+				/*printk("arp_filter, sip: %x tip: %x dev: %s, STS override (ip on dev)\n",
+					sip, tip, dev->name);*/
+			}
+			else {
+				/*printk("arp_filter, sip: %x tip: %x dev: %s, IP is NOT on dev\n",
+					sip, tip, dev->name);*/
+				NET_INC_STATS_BH(LINUX_MIB_ARPFILTER);
+				flag = 1;
+			}
+		}
+		else {
+			/*printk("arp_filter, not lpbk sip: %x tip: %x dev: %s flgs: %hx dst.dev: %p lbk: %p\n",
+				sip, tip, dev->name, dev->priv_flags, rt->u.dst.dev, loopback_dev);*/
+			NET_INC_STATS_BH(LINUX_MIB_ARPFILTER);
+			flag = 1;
+		}
+		if (in_dev) {
+			in_dev_put(in_dev);
+		}
 	}
 	ip_rt_put(rt);
 	return flag;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 7f95e6e..33ac2ed 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1513,6 +1513,15 @@ static struct devinet_sysctl_table {
 			.proc_handler	= ipv4_doint_and_flush,
 			.strategy	= ipv4_doint_and_flush_strategy,
 		},
+		{
+			.ctl_name	= NET_IPV4_CONF_ACCEPT_STS,
+			.procname	= "accept_sts",
+			.data		= &ipv4_devconf.accept_sts,
+			.maxlen		= sizeof(int),
+			.mode		= 0644,
+			.proc_handler	= proc_dointvec,
+		},
+
 	},
 	.devinet_dev = {
 		{
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 837f295..9b57bf5 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -206,8 +206,16 @@ int fib_validate_source(__be32 src, __be32 dst, u8
Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family
On Fri, 08 Jun 2007 16:42:31 -0700 Kok, Auke [EMAIL PROTECTED] wrote:

Stephen Hemminger wrote:

+#define ndev_printk(kern_level, netif_level, netdev, format, arg...)	\
+	do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) {	\
+		printk(kern_level "%s: " format,			\
+		       (netdev)->name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?

hmm, you lost me there a bit; do you want me to duplicate this code for all the ndev_err/ndev_info functions instead, so that ndev_err doesn't direct back to ndev_printk?

It is good practice in a macro to avoid potential problems with usage by only touching the arguments once. Otherwise, something (bogus) like ndev_printk(KERN_DEBUG, NETIF_MSG_PKTDATA, "got %d\n", dev++, skb->len) would increment dev twice.

My preference would be something more like dev_printk, or even use that? You want to show both device name and physical attachment in the message.

+#ifdef DEBUG
+#define ndev_dbg(level, netdev, format, arg...)		\
+	ndev_printk(KERN_DEBUG, level, netdev, format, ## arg)
+#else
+#define ndev_dbg(level, netdev, format, arg...)		\
+	do { (void)(netdev); } while (0)
+#endif
+
+#define ndev_err(level, netdev, format, arg...)		\
+	ndev_printk(KERN_ERR, level, netdev, format, ## arg)
+#define ndev_info(level, netdev, format, arg...)	\
+	ndev_printk(KERN_INFO, level, netdev, format, ## arg)
+#define ndev_warn(level, netdev, format, arg...)	\
+	ndev_printk(KERN_WARNING, level, netdev, format, ## arg)
+#define ndev_notice(level, netdev, format, arg...)	\
+	ndev_printk(KERN_NOTICE, level, netdev, format, ## arg)
+
 #define netif_msg_drv(p)	((p)->msg_enable & NETIF_MSG_DRV)
 #define netif_msg_probe(p)	((p)->msg_enable & NETIF_MSG_PROBE)
 #define netif_msg_link(p)	((p)->msg_enable & NETIF_MSG_LINK)
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a7f20f..e854c09 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3376,6 +3376,16 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
 	dev->priv = netdev_priv(dev);
 	dev->get_stats = internal_stats;
+	dev->msg_enable = NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK;
+#ifdef DEBUG
+	/* put these to good use: */
+	dev->msg_enable |= NETIF_MSG_TIMER | NETIF_MSG_IFDOWN |
+			   NETIF_MSG_IFUP | NETIF_MSG_RX_ERR |
+			   NETIF_MSG_TX_ERR | NETIF_MSG_TX_QUEUED |
+			   NETIF_MSG_INTR | NETIF_MSG_TX_DONE |
+			   NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
+			   NETIF_MSG_HW | NETIF_MSG_WOL;
+#endif

Let driver writer choose message enable bits please.

the driver can, since these bits are set in alloc_netdev; nothing prevents a driver from setting the mask immediately afterwards. Putting in a sane default seems a good idea and good practice. Maybe I went a bit far by going all out on the DEBUG flags, tho... perhaps those can be removed, or only NETIF_MSG_RX_ERR and NETIF_MSG_TX_ERR set with DEBUG.

Auke

-- Stephen Hemminger [EMAIL PROTECTED]
- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [WIP][PATCHES] Network xmit batching
On Fri, 2007-08-06 at 10:27 -0700, Rick Jones wrote:

[..] you cannot take the netperf service demand directly - each netperf is calculating assuming that it is the only thing running on the system. It then ass-u-me-s that the CPU util it measured was all for its work. This means the service demand figure will be quite higher than it really is. So, for aggregate tests using netperf2, one has to calculate service demand by hand. Sum the throughput as KB/s, convert the CPU util and number of CPUs to microseconds of CPU consumed per second, and divide to get microseconds per KB for the aggregate.

From what you are saying above, it seems to me that for more than one proc it is safe to just run netperf4 instead of netperf2? It also seems reasonable to set up large socket buffers on the receiver.

cheers, jamal
RE: [PATCH] NET: Multiqueue network device support.
On Fri, 2007-08-06 at 12:55 -0700, Waskiewicz Jr, Peter P wrote:

I thought the correct use is to get this lock on the clean_tx side, which can get called on a different cpu on rx (which also cleans up slots for skbs that have finished xmit). Both TX and clean_tx use the same tx_ring's head/tail ptrs and should be exclusive. But I don't find clean_tx using this lock in the code, so I am confused :-)

From e1000_main.c, e1000_clean():

	/* e1000_clean is called per-cpu. This lock protects
	 * tx_ring[0] from being cleaned by multiple cpus
	 * simultaneously. A failure obtaining the lock means
	 * tx_ring[0] is currently being cleaned anyway. */
	if (spin_trylock(&adapter->tx_queue_lock)) {
		tx_cleaned = e1000_clean_tx_irq(adapter,
						&adapter->tx_ring[0]);
		spin_unlock(&adapter->tx_queue_lock);
	}

Are you saying there's no problem because the adapter->tx_queue_lock is being held?

In a multi-ring implementation of the driver, this is wrapped with for (i = 0; i < adapter->num_tx_queues; i++) and &adapter->tx_ring[i]. This lock also prevents the clean routine from stomping on xmit_frame() when transmitting. Also in the multi-ring implementation, the tx_lock is pushed down into the individual tx_ring struct, not at the adapter level.

That sounds right - but the adapter lock is not related to tx_lock in the current e1000, correct?

cheers, jamal
Re: [WIP][PATCHES] Network xmit batching
jamal wrote: On Fri, 2007-08-06 at 10:27 -0700, Rick Jones wrote: [..] you cannot take the netperf service demand directly - each netperf is calculating assuming that it is the only thing running on the system. It then ass-u-me-s that the CPU util it measured was all for its work. This means the service demand figure will be quite higher than it really is. So, for aggregate tests using netperf2, one has to calculate service demand by hand. Sum the throughput as KB/s, convert the CPU util and number of CPUs to microseconds of CPU consumed per second, and divide to get microseconds per KB for the aggregate.

From what you are saying above, it seems to me that for more than one proc it is safe to just run netperf4 instead of netperf2?

Well, it is easier to be safe on aggregates with netperf4 than netperf2, although at present it is more difficult to run netperf4 than netperf2.

It also seems reasonable to set up large socket buffers on the receiver.

For bulk transfers I often do.

rick
[PATCH 0/2] [IrDA] Updates for net-2.6
Hi Dave, These 2 patches are bug fixes and should thus be considered for net-2.6 inclusion. Cheers, Samuel.
[PATCH 1/2] [IrDA] Fix Rx/Tx path race
From: G. Liakhovetski [EMAIL PROTECTED]

We need to switch to NRM _before_ sending the final packet, otherwise we might hit a race condition where we get the first packet from the peer while we're still in LAP_XMIT_P.

Cc: G. Liakhovetski [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]
---
 include/net/irda/irlap.h | 17 +
 net/irda/irlap_event.c   | 18 --
 net/irda/irlap_frame.c   |  3 +++
 3 files changed, 20 insertions(+), 18 deletions(-)

Index: net-2.6-quilt/include/net/irda/irlap.h
===
--- net-2.6-quilt.orig/include/net/irda/irlap.h	2007-05-10 19:23:04.0 +0300
+++ net-2.6-quilt/include/net/irda/irlap.h	2007-05-10 19:24:57.0 +0300
@@ -289,4 +289,21 @@
 	self->disconnect_pending = FALSE;
 }

+/*
+ * Function irlap_next_state (self, state)
+ *
+ *    Switches state and provides debug information
+ *
+ */
+static inline void irlap_next_state(struct irlap_cb *self, IRLAP_STATE state)
+{
+	/*
+	if (!self || self->magic != LAP_MAGIC)
+		return;
+
+	IRDA_DEBUG(4, "next LAP state = %s\n", irlap_state[state]);
+	*/
+	self->state = state;
+}
+
 #endif

Index: net-2.6-quilt/net/irda/irlap_event.c
===
--- net-2.6-quilt.orig/net/irda/irlap_event.c	2007-05-10 19:23:04.0 +0300
+++ net-2.6-quilt/net/irda/irlap_event.c	2007-05-10 19:23:09.0 +0300
@@ -317,23 +317,6 @@
 }

 /*
- * Function irlap_next_state (self, state)
- *
- *    Switches state and provides debug information
- *
- */
-static inline void irlap_next_state(struct irlap_cb *self, IRLAP_STATE state)
-{
-	/*
-	if (!self || self->magic != LAP_MAGIC)
-		return;
-
-	IRDA_DEBUG(4, "next LAP state = %s\n", irlap_state[state]);
-	*/
-	self->state = state;
-}
-
 /*
  * Function irlap_state_ndm (event, skb, frame)
  *
  *    NDM (Normal Disconnected Mode) state
@@ -1086,7 +1069,6 @@
 	} else {
 		/* Final packet of window */
 		irlap_send_data_primary_poll(self, skb);
-		irlap_next_state(self, LAP_NRM_P);

 		/*
 		 * Make sure state machine does not try to send

Index: net-2.6-quilt/net/irda/irlap_frame.c
===
--- net-2.6-quilt.orig/net/irda/irlap_frame.c	2007-05-10 19:23:04.0 +0300
+++ net-2.6-quilt/net/irda/irlap_frame.c	2007-05-10 19:25:59.0 +0300
@@ -798,16 +798,19 @@
 		self->vs = (self->vs + 1) % 8;
 		self->ack_required = FALSE;

+		irlap_next_state(self, LAP_NRM_P);
 		irlap_send_i_frame(self, tx_skb, CMD_FRAME);
 	} else {
 		IRDA_DEBUG(4, "%s(), sending unreliable frame\n", __FUNCTION__);

 		if (self->ack_required) {
 			irlap_send_ui_frame(self, skb_get(skb), self->caddr, CMD_FRAME);
+			irlap_next_state(self, LAP_NRM_P);
 			irlap_send_rr_frame(self, CMD_FRAME);
 			self->ack_required = FALSE;
 		} else {
 			skb->data[1] |= PF_BIT;
+			irlap_next_state(self, LAP_NRM_P);
 			irlap_send_ui_frame(self, skb_get(skb), self->caddr, CMD_FRAME);
 		}
 	}
--
[PATCH 2/2] [IrDA] f-timer reloading when sending rejected frames
Jean II was right: you have to re-charge the final timer when resending rejected frames. Otherwise it triggers at a wrong time and can break the currently running communication. Reproducible under rt-preempt.

Signed-off-by: G. Liakhovetski [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]

Index: net-2.6-quilt/net/irda/irlap_event.c
===
--- net-2.6-quilt.orig/net/irda/irlap_event.c	2007-05-29 09:36:09.0 +0300
+++ net-2.6-quilt/net/irda/irlap_event.c	2007-05-29 09:38:19.0 +0300
@@ -1418,14 +1418,14 @@
 	 */
 	self->remote_busy = FALSE;

+	/* Stop final timer */
+	del_timer(&self->final_timer);
+
 	/*
 	 * Nr as expected?
 	 */
 	ret = irlap_validate_nr_received(self, info->nr);
 	if (ret == NR_EXPECTED) {
-		/* Stop final timer */
-		del_timer(&self->final_timer);
-
 		/* Update Nr received */
 		irlap_update_nr_received(self, info->nr);
@@ -1457,14 +1457,12 @@
 		/* Resend rejected frames */
 		irlap_resend_rejected_frames(self, CMD_FRAME);
-
-		/* Final timer ??? Jean II */
+		irlap_start_final_timer(self, self->final_timeout * 2);

 		irlap_next_state(self, LAP_NRM_P);
 	} else if (ret == NR_INVALID) {
 		IRDA_DEBUG(1, "%s(), Received RR with invalid nr !\n",
			   __FUNCTION__);
-		del_timer(&self->final_timer);

 		irlap_next_state(self, LAP_RESET_WAIT);
--
Re: networking busted in current -git ???
Trond Myklebust [EMAIL PROTECTED] wrote: It appears to be something in the latest dump from davem to Linus, but I haven't yet had time to identify what.

You want this patch which should hit the tree soon.

Cheers,
-- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
[IPV4]: Do not remove idev when addresses are cleared

Now that we create idev before addresses are added, it no longer makes sense to remove them when addresses are all deleted.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 354e800..0cf813f 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -327,12 +327,8 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
 		}
 	}

-	if (destroy) {
+	if (destroy)
 		inet_free_ifa(ifa1);
-
-		if (!in_dev->ifa_list)
-			inetdev_destroy(in_dev);
-	}
 }

 static void inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
Re: SKY2 vs SK98LIN performance on 88E8053 MAC
Hi, Stephen,

We are doing pure IPv4 forwarding between two Ethernet interfaces: IXIA port A --- System Under Test --- IXIA port B. Traffic has two IP destinations for each direction and the L4 protocol is UDP. There are two static ARP entries and only interface routes. The two tests are identical except that we switch from one driver to the other. Ethernet ports on the SUT are oversubscribed -- I'm sending 60% of line rate (of 256-byte packets) and measuring the percentage of pass-through traffic which makes it to the IXIA port on the other side. Traffic is bidirectional and system load is close to 100%. I attach vmlinux and driver profiles I have taken with oprofile 0.8.2. I can easily take more measurements/experiments if need be.

Regards, Philip

We are observing severe IPv4 forwarding degradation when switching from the sk98lin to the sky2 driver. Setup: plain 2.6.21.3 kernel, 88E8053 Marvell Yukon2 MAC, sk98lin at revision 8.41.2.3 coming from FC6, sky2 driver from the 2.6.21.3 kernel, both drivers in NAPI mode. Benchmarks are done using bidirectional traffic generated by IXIA, sending 256-byte packets. Observed packet throughput is almost 30% higher with the sk98lin driver. Ethernet flow control is turned off in the sky2 driver (hard-coded as off, we know about this problem). I also have oprofile records of the drivers in case anybody is interested. Please share info if you know anything about sky2 performance bottlenecks.

I'm surprised? The vendor driver has bogus extra locking and other crap. Please send profile data. Flow control should work on sky2 (now). Are you routing or doing real TCP transfers?

[Attachments: vmlinux-sk98lin-2.6.21.3-report, vmlinux-sky2-2.6.21.3-report, sk98lin-2.6.21.3-report, sky2-2.6.21.3.report]
Re: [patch 23/32] IPV4: Correct rp_filter help text.
Chris Wright [EMAIL PROTECTED] wrote:

--- linux-2.6.20.13.orig/net/ipv4/Kconfig
+++ linux-2.6.20.13/net/ipv4/Kconfig
@@ -43,11 +43,11 @@ config IP_ADVANCED_ROUTER
 	  asymmetric routing (packets from you to a host take a different path
 	  than packets from that host to you) or if you operate a non-routing
 	  host which has several IP addresses on different interfaces. To turn
-	  rp_filter off use:
+	  rp_filter on use:

-	  echo 0 > /proc/sys/net/ipv4/conf/<device>/rp_filter
+	  echo 1 > /proc/sys/net/ipv4/conf/<device>/rp_filter
 	  or
-	  echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
+	  echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

BTW, this documentation is actually wrong. You can't enable rp_filter on all interfaces with

echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

You must do that in conjunction with

echo 1 > /proc/sys/net/ipv4/conf/<device>/rp_filter

for it to work for <device>. This is really counter-intuitive, but it's apparently how it's always worked.

Cheers,
Re: [patch 23/32] IPV4: Correct rp_filter help text.
On Sat, Jun 09, 2007 at 11:20:43AM +1000, Herbert Xu wrote:

Chris Wright [EMAIL PROTECTED] wrote:

--- linux-2.6.20.13.orig/net/ipv4/Kconfig
+++ linux-2.6.20.13/net/ipv4/Kconfig
@@ -43,11 +43,11 @@ config IP_ADVANCED_ROUTER
 	  asymmetric routing (packets from you to a host take a different path
 	  than packets from that host to you) or if you operate a non-routing
 	  host which has several IP addresses on different interfaces. To turn
-	  rp_filter off use:
+	  rp_filter on use:

-	  echo 0 > /proc/sys/net/ipv4/conf/<device>/rp_filter
+	  echo 1 > /proc/sys/net/ipv4/conf/<device>/rp_filter
 	  or
-	  echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
+	  echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

BTW, this documentation is actually wrong. You can't enable rp_filter

So to fix the documentation, we should change the word "or" to "and".

Cheers,
RE: [PATCH] NET: Multiqueue network device support.
Peter, Where is your git tree located?

Ram

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Waskiewicz Jr, Peter P Sent: Thursday, June 07, 2007 3:56 PM To: David Miller; [EMAIL PROTECTED] Cc: Kok, Auke-jan H; [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; Brandeburg, Jesse Subject: RE: [PATCH] NET: Multiqueue network device support.

I empathize but take a closer look; seems mostly useless. I thought E1000 still uses LLTX, and if so then multiple cpus can most definitely get into the ->hard_start_xmit() in parallel.

Not with how the qdisc status protects it today. From include/net/pkt_sched.h:

static inline void qdisc_run(struct net_device *dev)
{
	if (!netif_queue_stopped(dev) &&
	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
		__qdisc_run(dev);
}
Re: [PATCH] RFC: have tcp_recvmsg() check kthread_should_stop() and treat it as if it were signalled
Please cc networking patches to [EMAIL PROTECTED]

Jeff Layton [EMAIL PROTECTED] wrote: The following patch is a first stab at removing this need. It makes it so that in tcp_recvmsg() we also check kthread_should_stop() at any point where we currently check to see if the task was signalled. If that returns true, then it acts as if it were signalled and returns to the calling function.

This just doesn't seem to fit. Why should networking care about kthreads? Perhaps you can get kthread_stop to send a signal instead?

Cheers,
Re: networking busted in current -git ???
From: Trond Myklebust [EMAIL PROTECTED] Date: Fri, 08 Jun 2007 17:43:27 -0400

It is not dhcp. I'm seeing the same bug with bog-standard ifup with a static address on an FC-6 machine. It appears to be something in the latest dump from davem to Linus, but I haven't yet had time to identify what.

Linus's current tree should have this fixed. Let us know if this is not the case.
Re: [PATCH 0/2] [IrDA] Updates for net-2.6
From: [EMAIL PROTECTED] Date: Sat, 09 Jun 2007 04:08:15 +0300

These 2 patches are bug fixes and should thus be considered for net-2.6 inclusion.

Both patches applied, thanks Sam.
Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family
Stephen Hemminger wrote: On Fri, 08 Jun 2007 16:42:31 -0700 Kok, Auke [EMAIL PROTECTED] wrote: Stephen Hemminger wrote:

+#define ndev_printk(kern_level, netif_level, netdev, format, arg...)	\
+	do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) {	\
+		printk(kern_level "%s: " format,			\
+		       (netdev)->name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?

hmm, you lost me there a bit; do you want me to duplicate this code for all the ndev_err/ndev_info functions instead, so that ndev_err doesn't direct back to ndev_printk?

It is good practice in a macro to avoid potential problems with usage by only touching the arguments once. Otherwise, something (bogus) like ndev_printk(KERN_DEBUG, NETIF_MSG_PKTDATA, "got %d\n", dev++, skb->len) would increment dev twice.

agreed, but

My preference would be something more like dev_printk, or even use that? You want to show both device name and physical attachment in the message.

actually these ndev_* macros are almost an exact copy of dev_printk, which is how I modeled them in the first place! See for yourself - here's the relevant snippet from linux/device.h:

#define dev_printk(level, dev, format, arg...)	\
	printk(level "%s %s: " format , dev_driver_string(dev) , (dev)->bus_id , ## arg)

#ifdef DEBUG
#define dev_dbg(dev, format, arg...)	\
	dev_printk(KERN_DEBUG , dev , format , ## arg)
#else
#define dev_dbg(dev, format, arg...) do { (void)(dev); } while (0)
#endif

#define dev_err(dev, format, arg...)	\
	dev_printk(KERN_ERR , dev , format , ## arg)
#define dev_info(dev, format, arg...)	\
	dev_printk(KERN_INFO , dev , format , ## arg)
#define dev_warn(dev, format, arg...)	\
	dev_printk(KERN_WARNING , dev , format , ## arg)
#define dev_notice(dev, format, arg...)	\
	dev_printk(KERN_NOTICE , dev , format , ## arg)

using dev_printk however ignores msg_enable completely and also omits netdev->name, which may even change, so for netdevices it's much less suitable, maybe only at init time. We can fix the dev_printk macro family as well, that's all right, but the need for a netdev-centric printk should be obvious: almost every netdevice driver has its own variant :)

Auke
Re: [PATCH 1/2] [RFC] NET: Implement a standard ndev_printk family
Stephen Hemminger wrote: On Fri, 08 Jun 2007 16:42:31 -0700 Kok, Auke [EMAIL PROTECTED] wrote: Stephen Hemminger wrote:

+#define ndev_printk(kern_level, netif_level, netdev, format, arg...)	\
+	do { if ((netdev)->msg_enable & NETIF_MSG_##netif_level) {	\
+		printk(kern_level "%s: " format,			\
+		       (netdev)->name, ## arg); } } while (0)

Could you make a version that doesn't evaluate the arguments twice?

hmm, you lost me there a bit; do you want me to duplicate this code for all the ndev_err/ndev_info functions instead, so that ndev_err doesn't direct back to ndev_printk?

It is good practice in a macro to avoid potential problems with usage by only touching the arguments once. Otherwise, something (bogus) like ndev_printk(KERN_DEBUG, NETIF_MSG_PKTDATA, "got %d\n", dev++, skb->len) would increment dev twice.

My preference would be something more like dev_printk, or even use that?

... see other reply

You want to show both device name and physical attachment in the message.

OK, that does make sense, and here it gets interesting and we can get creative, since for NETIF_MSG_HW and NETIF_MSG_PROBE messages we could add the printout of netdev->dev->bus_id. I have modeled and toyed around with the message format and did this (add bus_id to all messages), but it got messy (for LINK messages it's totally not needed).

However, that is going to make the macros a bit more complex, and it's unlikely that I can make them fit without double-pass evaluation without making them a monster... unless everyone agrees to just printing everything: both netdev->name and netdev->dev->bus_id for every message.

Auke