Re: Congestion Avoidance Monitoring Tools
On Friday 21 April 2006 07:59, Tom Young wrote:
> On Thu, 2006-04-20 at 22:26 -0700, Piet Delaney wrote:
> > I'm upgrading our 2.6.12 kernel to 2.6.13, which includes significant
> > congestion avoidance code additions and changes. I was wondering if
> > there are any tools folks can recommend for testing the kernel to make
> > sure the congestion avoidance code is operating correctly. For
> > example the displaying of the congestion window as a function of time
> > while undergoing convergence. For causing congestion I could modify
> > a kernel to discard packets once in a while on a lab gateway and hit
> > it with iperf. HP's netperf looks interesting.
> >
> > Any suggestions?
> >
> > -piet
>
> Hi,
>
> Try having a look at the output of 'ss -i' (you may need to update to
> the latest iproute2 tools). You could either try and parse the text
> output of that or use the same inet_diag interface that ss uses to poll
> for the data at regular intervals.

Another way is to use tcptrace on a tcpdump file
(http://jarok.cs.ohiou.edu/software/tcptrace/). It finds a lot of
statistics about the dumped TCP connections.

Newer ethereal also has some TCP plotting functions that are useful.
They don't display the congestion window directly, but you can see
it indirectly.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Congestion Avoidance Monitoring Tools
On Thu, 2006-04-20 at 22:26 -0700, Piet Delaney wrote:
> I'm upgrading our 2.6.12 kernel to 2.6.13, which includes significant
> congestion avoidance code additions and changes. I was wondering if
> there are any tools folks can recommend for testing the kernel to make
> sure the congestion avoidance code is operating correctly. For
> example the displaying of the congestion window as a function of time
> while undergoing convergence. For causing congestion I could modify
> a kernel to discard packets once in a while on a lab gateway and hit
> it with iperf. HP's netperf looks interesting.
>
> Any suggestions?
>
> -piet

Hi,

Try having a look at the output of 'ss -i' (you may need to update to
the latest iproute2 tools). You could either try and parse the text
output of that or use the same inet_diag interface that ss uses to poll
for the data at regular intervals.

-- 
Thomas Young http://cubinlab.ee.unimelb.edu.au/~tyo/
Research Assistant
CUBIN Research Centre - University of Melbourne
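A minimal sketch of the text-parsing route suggested above, assuming the
per-socket info line printed by 'ss -i' carries a "cwnd:<n>" field. The
sample text is hypothetical and only shaped like ss output; the other
fields vary with the iproute2 version.

```python
import re

# Matches the "cwnd:<n>" field in the per-socket TCP info line of `ss -i`.
CWND_RE = re.compile(r"\bcwnd:(\d+)")


def parse_cwnd(ss_output):
    """Return every cwnd value found in a block of `ss -i` text, in order."""
    return [int(m.group(1)) for m in CWND_RE.finditer(ss_output)]


# Hypothetical sample, not captured from a real machine.
SAMPLE = """\
ESTAB 0 0 10.0.0.1:5001 10.0.0.2:33412
\t reno wscale:7,7 rto:204 rtt:1.5/0.75 mss:1448 cwnd:10 ssthresh:7
ESTAB 0 0 10.0.0.1:5001 10.0.0.3:40122
\t reno rto:208 rtt:4/2 mss:1448 cwnd:23
"""

print(parse_cwnd(SAMPLE))
```

Running 'ss -i' at a fixed interval and feeding each snapshot through
parse_cwnd() would give cwnd as a function of time; the inet_diag
interface avoids the text parsing but needs a netlink client.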
Fw: [Bugme-new] [Bug 6420] New: iptables is complaining with bogus unknown error 18446744073709551615
Begin forwarded message: Date: Thu, 20 Apr 2006 23:17:58 -0700 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [Bugme-new] [Bug 6420] New: iptables is complaining with bogus unknown error 18446744073709551615 http://bugzilla.kernel.org/show_bug.cgi?id=6420 Summary: iptables is complaining with bogus unknown error 18446744073709551615 Kernel Version: 2.6.17-rc2 Status: NEW Severity: normal Owner: [EMAIL PROTECTED] Submitter: [EMAIL PROTECTED] At least since 2.6.1.16.1, many calls to iptables no longer function at least under 64-bit x86, presumably due to a bug in the netfilter kernel code. The problem is still present in 2.6.17-rc2. The error from iptables is iptables: unknown error 18446744073709551615 Examples of rules that give the error are 1) iptables -A INPUT -i bond0 -s 129.98.90.0/24 -p tcp --dport 548 -j ACCEPT 2) iptables -A INPUT -i bond0 -s 129.98.90.101/32 -p tcp --dport 497 -j ACCEPT 3) iptables -A INPUT -i bond0 -s 129.98.90.227/32 -p tcp --dport 22 -j ACCEPT Example of a rule that does not give the error: 1) iptables -A INPUT -i bond0 -p ICMP --icmp-type echo-request -s 129.98.90.13/32 -j ACCEPT The computer is using IPv4 and not IPv6, which has not been compiled into the kernel. iptables is version 1.3.5. 
Kernel configuration related to iptables follows: CONFIG_IP_NF_CONNTRACK=m CONFIG_IP_NF_CT_ACCT=y CONFIG_IP_NF_CONNTRACK_MARK=y CONFIG_IP_NF_CONNTRACK_EVENTS=y CONFIG_IP_NF_CONNTRACK_NETLINK=m # CONFIG_IP_NF_CT_PROTO_SCTP is not set CONFIG_IP_NF_FTP=m # CONFIG_IP_NF_IRC is not set # CONFIG_IP_NF_NETBIOS_NS is not set # CONFIG_IP_NF_TFTP is not set # CONFIG_IP_NF_AMANDA is not set # CONFIG_IP_NF_PPTP is not set # CONFIG_IP_NF_H323 is not set # CONFIG_IP_NF_QUEUE is not set CONFIG_IP_NF_IPTABLES=m CONFIG_IP_NF_MATCH_IPRANGE=m CONFIG_IP_NF_MATCH_TOS=m CONFIG_IP_NF_MATCH_RECENT=m CONFIG_IP_NF_MATCH_ECN=m CONFIG_IP_NF_MATCH_DSCP=m CONFIG_IP_NF_MATCH_AH=m CONFIG_IP_NF_MATCH_TTL=m CONFIG_IP_NF_MATCH_OWNER=m CONFIG_IP_NF_MATCH_ADDRTYPE=m CONFIG_IP_NF_MATCH_HASHLIMIT=m CONFIG_IP_NF_FILTER=m # CONFIG_IP_NF_TARGET_REJECT is not set CONFIG_IP_NF_TARGET_LOG=m CONFIG_IP_NF_TARGET_ULOG=m CONFIG_IP_NF_TARGET_TCPMSS=m # CONFIG_IP_NF_NAT is not set CONFIG_IP_NF_MANGLE=m # CONFIG_IP_NF_TARGET_TOS is not set # CONFIG_IP_NF_TARGET_ECN is not set # CONFIG_IP_NF_TARGET_DSCP is not set # CONFIG_IP_NF_TARGET_TTL is not set # CONFIG_IP_NF_TARGET_CLUSTERIP is not set CONFIG_IP_NF_RAW=m CONFIG_IP_NF_ARPTABLES=m CONFIG_IP_NF_ARPFILTER=m CONFIG_IP_NF_ARP_MANGLE=m CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m # CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set CONFIG_NETFILTER_XT_TARGET_MARK=m CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m # CONFIG_NETFILTER_XT_TARGET_NOTRACK is not set CONFIG_NETFILTER_XT_MATCH_COMMENT=m CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m CONFIG_NETFILTER_XT_MATCH_CONNMARK=m CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m CONFIG_NETFILTER_XT_MATCH_DCCP=m CONFIG_NETFILTER_XT_MATCH_ESP=m CONFIG_NETFILTER_XT_MATCH_HELPER=m CONFIG_NETFILTER_XT_MATCH_LENGTH=m CONFIG_NETFILTER_XT_MATCH_LIMIT=m CONFIG_NETFILTER_XT_MATCH_MAC=m CONFIG_NETFILTER_XT_MATCH_MARK=m CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m CONFIG_NETFILTER_XT_MATCH_REALM=m CONFIG_NETFILTER_XT_MATCH_SCTP=m 
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m

lsmod shows

xt_state          4928  0
ipt_LOG           8960  0
ip_conntrack_ftp  1 0
ip_conntrack      57880 2 xt_state,ip_conntrack_ftp
nfnetlink         8520  1 ip_conntrack
iptable_filter    5440  0
ip_tables         22168 1 iptable_filter
x_tables          17800 3 xt_state,ipt_LOG,ip_tables

This issue has been posted to netfilter bugzilla as
https://bugzilla.netfilter.org/bugzilla/show_bug.cgi?id=467
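A side note on the number in the error message: 18446744073709551615 is
2^64 - 1, which is what -1 looks like when reinterpreted as an unsigned
64-bit value. That fits the report being specific to 64-bit x86, but it
is only an observation about the number, not a diagnosis of the
netfilter bug. The helper names below are made up for illustration.

```python
def as_unsigned64(value):
    """Reinterpret a signed integer as an unsigned 64-bit quantity."""
    return value & 0xFFFFFFFFFFFFFFFF


def as_unsigned32(value):
    """Same reinterpretation at 32 bits, for comparison."""
    return value & 0xFFFFFFFF


print(as_unsigned64(-1))  # 18446744073709551615
print(as_unsigned32(-1))  # 4294967295
```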
Congestion Avoidance Monitoring Tools
I'm upgrading our 2.6.12 kernel to 2.6.13, which includes significant
congestion avoidance code additions and changes. I was wondering if
there are any tools folks can recommend for testing the kernel to make
sure the congestion avoidance code is operating correctly. For
example the displaying of the congestion window as a function of time
while undergoing convergence. For causing congestion I could modify
a kernel to discard packets once in a while on a lab gateway and hit
it with iperf. HP's netperf looks interesting.

Any suggestions?

-piet
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
On Thu, Apr 20, 2006 at 08:42:00PM -0700, David S. Miller wrote:

> This is basically why none of the performance gains add up to me. I
> am thus very concerned that the current non-cache-warming
> implementation may fall flat performance wise.

Ok, I buy your arguments. It does seem unlikely that a DMA offload
without cache warmth will be a net gain. More performance data is
definitely required.

After digging through PDFs, it seems that the Freescale 85xx (at least,
probably earlier models as well) can warm L2 for the DMA destination
data. However, I don't have any hardware with it to play around with
for benchmarking to see what cache warming might bring (back),
performance-wise.

I think there is still use for a common multi-function DMA framework
across platforms and client components, even if net receive doesn't end
up being {a,the first} user.

-Olof
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
From: Olof Johansson <[EMAIL PROTECTED]>
Date: Thu, 20 Apr 2006 22:04:26 -0500

> On Thu, Apr 20, 2006 at 05:27:42PM -0700, David S. Miller wrote:
> > Besides the control overhead of the DMA engines, the biggest thing
> > lost in my opinion is the perfect cache warming that a cpu based copy
> > does from the kernel socket buffer into userspace.
>
> It's definitely the easiest way to always make sure the right caches
> are warm for the app, that I agree with.
>
> But, when warming those caches by copying, the data is pulled in through
> a potentially cold cache in the first place. So the cache misses are
> just moved from the copy loop to userspace with dma offload. Or am I
> missing something?

Yes, and it means that the memory bandwidth costs are equivalent
between I/O AT and cpu copy.

In the cpu copy case you eat the read cache miss, but on the write
side you'll prewarm the cache properly. In the I/O AT case you eat
the same read cost, but the cache will not be prewarmed, so you'll eat
the read cache miss in the application. It's moving the same exact
cost from one place to another.

The time it takes to get the app to make forward progress (meaning
returned from the recvmsg() system call and back in userspace) must by
definition take at least as long with I/O AT as it does with cpu
copies. Yet in the I/O AT case, the application must wait that long
and also then take in the delays of the cache misses when it tries to
read the data that the I/O AT engine copied. Instead of eating the
cache miss cost in the kernel, we eat it in the app because in the I/O
AT case the cpu won't have the user data fresh and loaded into the cpu
cache.

And I say I/O AT must take "at least as long" as cpu copies because
the same memory copy cost is there, and on top of that I/O AT has to
program the DMA controller and touch a _lot_ of other state to get
things going and then wake the task up. We're talking non-trivial
overheads like grabbing the page mappings out of the page tables using
get_user_pages(). Evgeniy has posted some very nice performance
graphs showing how poorly that function scales.

This is basically why none of the performance gains add up to me. I
am thus very concerned that the current non-cache-warming
implementation may fall flat performance wise.
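The latency argument above can be written down as a toy model. This is
only a sketch of the reasoning in this message, in made-up cost units:
the function names and numbers are hypothetical, and the model
deliberately ignores whether the CPU can do other useful work while the
DMA engine copies.

```python
def cpu_copy(copy_cost):
    """CPU copy: the kernel eats the read miss inside the copy, and the
    destination is left warm, so the app's first touch is modelled as free."""
    return {"until_recvmsg_returns": copy_cost, "app_first_touch": 0}


def ioat_copy(copy_cost, setup_cost, app_miss_cost):
    """I/O AT: same memory copy cost plus DMA setup (descriptor programming,
    get_user_pages(), wakeup); the app then takes the cache miss itself."""
    return {
        "until_recvmsg_returns": copy_cost + setup_cost,
        "app_first_touch": app_miss_cost,
    }


def total(costs):
    """Time until the application has actually used the data."""
    return sum(costs.values())


# Hypothetical units, chosen only to show the shape of the argument.
cpu = cpu_copy(copy_cost=100)
dma = ioat_copy(copy_cost=100, setup_cost=15, app_miss_cost=40)
print(total(cpu), total(dma))
```

Under these assumptions the I/O AT path can never return from recvmsg()
sooner than the cpu copy path, and the gap grows with the setup cost.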
[PATCH] Unregister network device before releasing PCMCIA resources
From: Pavel Roskin <[EMAIL PROTECTED]>

This is the right thing to do and it prevents kernel BUG on unload.

Some PCMCIA network drivers use link->dev_node as a flag indicating that
the network device has been successfully registered. Recent code changes
cause this flag to be 0 after PCMCIA resources have been released.

Signed-off-by: Pavel Roskin <[EMAIL PROTECTED]>
---

 drivers/net/wireless/netwave_cs.c  |    4 ++--
 drivers/net/wireless/orinoco_cs.c  |    5 +++--
 drivers/net/wireless/ray_cs.c      |    4 +++-
 drivers/net/wireless/spectrum_cs.c |    5 +++--
 drivers/net/wireless/wavelan_cs.c  |    9 +
 5 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/net/wireless/netwave_cs.c b/drivers/net/wireless/netwave_cs.c
index 9343d97..5d80db2 100644
--- a/drivers/net/wireless/netwave_cs.c
+++ b/drivers/net/wireless/netwave_cs.c
@@ -445,11 +445,11 @@ static void netwave_detach(struct pcmcia
 	DEBUG(0, "netwave_detach(0x%p)\n", link);
-	netwave_release(link);
-
 	if (link->dev_node)
 		unregister_netdev(dev);
+	netwave_release(link);
+
 	free_netdev(dev);
 } /* netwave_detach */
diff --git a/drivers/net/wireless/orinoco_cs.c b/drivers/net/wireless/orinoco_cs.c
index 434f7d7..5988305 100644
--- a/drivers/net/wireless/orinoco_cs.c
+++ b/drivers/net/wireless/orinoco_cs.c
@@ -147,14 +147,15 @@ static void orinoco_cs_detach(struct pcm
 {
 	struct net_device *dev = link->priv;
-	orinoco_cs_release(link);
-
 	DEBUG(0, PFX "detach: link=%p link->dev_node=%p\n", link, link->dev_node);
 	if (link->dev_node) {
 		DEBUG(0, PFX "About to unregister net device %p\n", dev);
 		unregister_netdev(dev);
 	}
+
+	orinoco_cs_release(link);
+
 	free_orinocodev(dev);
 } /* orinoco_cs_detach */
diff --git a/drivers/net/wireless/ray_cs.c b/drivers/net/wireless/ray_cs.c
index 879eb42..fac4f1b 100644
--- a/drivers/net/wireless/ray_cs.c
+++ b/drivers/net/wireless/ray_cs.c
@@ -388,13 +388,15 @@ static void ray_detach(struct pcmcia_dev
 	this_device = NULL;
 	dev = link->priv;
+	if (link->dev_node)
+		unregister_netdev(dev);
+
 	ray_release(link);
 	local = (ray_dev_t *)dev->priv;
 	del_timer(&local->timer);
 	if (link->priv) {
-		if (link->dev_node) unregister_netdev(dev);
 		free_netdev(dev);
 	}
 	DEBUG(2,"ray_cs ray_detach ending\n");
diff --git a/drivers/net/wireless/spectrum_cs.c b/drivers/net/wireless/spectrum_cs.c
index f7b77ce..2551938 100644
--- a/drivers/net/wireless/spectrum_cs.c
+++ b/drivers/net/wireless/spectrum_cs.c
@@ -626,14 +626,15 @@ static void spectrum_cs_detach(struct pc
 {
 	struct net_device *dev = link->priv;
-	spectrum_cs_release(link);
-
 	DEBUG(0, PFX "detach: link=%p link->dev_node=%p\n", link, link->dev_node);
 	if (link->dev_node) {
 		DEBUG(0, PFX "About to unregister net device %p\n", dev);
 		unregister_netdev(dev);
 	}
+
+	spectrum_cs_release(link);
+
 	free_orinocodev(dev);
 } /* spectrum_cs_detach */
diff --git a/drivers/net/wireless/wavelan_cs.c b/drivers/net/wireless/wavelan_cs.c
index f7724eb..03c2e16 100644
--- a/drivers/net/wireless/wavelan_cs.c
+++ b/drivers/net/wireless/wavelan_cs.c
@@ -4681,6 +4681,11 @@
 #ifdef DEBUG_CALLBACK_TRACE
 	printk(KERN_DEBUG "-> wavelan_detach(0x%p)\n", link);
 #endif
+	/* Remove ourselves from the kernel list of ethernet devices */
+	/* Warning : can't be called from interrupt, timer or wavelan_close() */
+	if (link->dev_node)
+		unregister_netdev(dev);
+
 	/* Some others haven't done their job : give them another chance */
 	wv_pcmcia_release(link);
@@ -4689,10 +4694,6 @@
 #endif
 	{
 		struct net_device * dev = (struct net_device *) link->priv;
-		/* Remove ourselves from the kernel list of ethernet devices */
-		/* Warning : can't be called from interrupt, timer or wavelan_close() */
-		if (link->dev_node)
-			unregister_netdev(dev);
 		link->dev_node = NULL;
 		((net_local *)netdev_priv(dev))->link = NULL;
 		((net_local *)netdev_priv(dev))->dev = NULL;
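Why the ordering matters can be shown with a toy model of the flag
described in the changelog. This is a sketch, not driver code: the Link
class and helper names are hypothetical, and the only behaviour assumed
is the one stated above, that releasing PCMCIA resources now clears
dev_node.

```python
class Link:
    """Stand-in for the PCMCIA link: dev_node doubles as a 'registered' flag."""

    def __init__(self):
        self.dev_node = "eth0"      # non-None means the net device was registered
        self.netdev_registered = True


def release(link):
    # Assumed behaviour from the changelog: release clears the flag.
    link.dev_node = None


def unregister_if_registered(link):
    # Mirrors "if (link->dev_node) unregister_netdev(dev);"
    if link.dev_node is not None:
        link.netdev_registered = False


def detach_old(link):
    release(link)                   # flag is cleared first...
    unregister_if_registered(link)  # ...so this is silently skipped
    return link.netdev_registered   # device freed while still registered


def detach_new(link):
    unregister_if_registered(link)  # flag still valid here
    release(link)
    return link.netdev_registered


print(detach_old(Link()), detach_new(Link()))
```

In the old order the device is still registered when it is freed, which
matches the BUG on unload that the patch sets out to prevent.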
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Thu, 2006-04-20 at 19:42 -0700, Shaw Vrana wrote:

> I'll bite! Here's a patch to add a call to flush_scheduled_work() in
> e1000_down. It's against 2.6.16.9.

You're not following our discussion. It is not safe to call
flush_scheduled_work() in a driver's close() because it is holding the
rtnl and can deadlock with linkwatch_event() if it happens to be on the
workqueue.
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
On Thu, Apr 20, 2006 at 05:44:38PM -0700, David S. Miller wrote:
> From: Olof Johansson <[EMAIL PROTECTED]>
> Date: Thu, 20 Apr 2006 18:33:43 -0500
>
> > On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote:
> > > In
> > > addition, there may be workloads (file serving? backup?) where we
> > > could do a skb->page-in-page-cache copy and avoid cache pollution?
> >
> > Yes, NFS is probably a prime example of where most of the data isn't
> > looked at; just written to disk. I'm not sure how well-optimized the
> > receive path is there already w.r.t. avoiding copying though. I don't
> > remember seeing memcpy and friends being high on the profile when I
> > looked at SPECsfs last.
>
> If that makes sense then the cpu copy can be made to use non-temporal
> stores.

I'm not sure that would buy anything. I didn't mean caching was
necessarily bad, just that lack of it might not hurt as much under that
specific type of workload. NFS has to look at RPC/NFS headers anyway, so
it will benefit from the cache being warm.

-Olof
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
I've replied to this once before, but haven't seen my last two emails on
the list, so I'm sending again with different settings. Sorry for the
noise.

On Thursday 20 April 2006 17:10, Michael Chan wrote:
> In tg3_remove_one(), we call flush_scheduled_work() in case the
> reset_task is still pending. Here, it is safe to call
> flush_scheduled_work() because we're not holding the rtnl. Again, when
> it runs, nothing bad will happen because it will see netif_running() ==
> 0.

I'll bite! Here's a patch to add a call to flush_scheduled_work() in
e1000_down. It's against 2.6.16.9.

Thanks,
Shaw

diff -u -uprN -X linux-2.6.16.9/Documentation/dontdiff linux-2.6.16.9/drivers/net/e1000/e1000_main.c linux-2.6.16.9-patch/drivers/net/e1000/e1000_main.c
--- linux-2.6.16.9/drivers/net/e1000/e1000_main.c	2006-04-18 23:10:14.0 -0700
+++ linux-2.6.16.9-patch/drivers/net/e1000/e1000_main.c	2006-04-20 19:36:55.0 -0700
@@ -538,6 +538,7 @@ e1000_down(struct e1000_adapter *adapter
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	del_timer_sync(&adapter->phy_info_timer);
+	flush_scheduled_work();
 #ifdef CONFIG_E1000_NAPI
 	netif_poll_disable(netdev);
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
On Thu, Apr 20, 2006 at 05:27:42PM -0700, David S. Miller wrote:
> From: Olof Johansson <[EMAIL PROTECTED]>
> Date: Thu, 20 Apr 2006 16:33:05 -0500
>
> > From the wiki:
> >
> > >3. Data copied by I/OAT is not cached
> >
> > This is an I/OAT device limitation and not a global statement of the
> > DMA infrastructure. Other platforms might be able to prime caches
> > with the DMA traffic. Hint flags should be added on either the channel
> > allocation calls, or per-operation calls, depending on where it makes
> > sense driver/client wise.
>
> This sidesteps the whole question of _which_ cache to warm. And if
> you choose wrongly, then what?
>
> Besides the control overhead of the DMA engines, the biggest thing
> lost in my opinion is the perfect cache warming that a cpu based copy
> does from the kernel socket buffer into userspace.

It's definitely the easiest way to always make sure the right caches
are warm for the app, that I agree with.

But, when warming those caches by copying, the data is pulled in through
a potentially cold cache in the first place. So the cache misses are
just moved from the copy loop to userspace with dma offload. Or am I
missing something?

> The first thing an application is going to do is touch that data. So
> I think it's very important to prewarm the caches and the only
> straightforward way I know of to always warm up the correct cpu's
> caches is copy_to_user().

The other way (assuming the hardware supports cache warming) would be to
pass down affinities (or look them up during receive processing, I'm not
sure that's practical the way things work now), and dispatch on a DMA
channel with the right cache affinity. I've got a feeling that
"straightforward" is not a term to use for describing that solution
though.

> Unfortunately, many benchmarks just do raw bandwidth tests sending to
> a receiver that just doesn't even look at the data. They just return
> from recvmsg() and loop back into it. This is not what applications
> using networking actually do, so it's important to make sure we look
> intelligently at any benchmarks done and do not fall into the trap of
> saying "even without cache warming it made things faster" when in fact
> the tested receiver did not touch the data at all so was a false test.

Yes, some real-life-like benchmarking is definitely needed.
Unfortunately I'm not in a position where I can do much (and share
numbers) at the moment myself.

-Olof
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Fri, 2006-04-21 at 12:40 +1000, Herbert Xu wrote:

> One simple solution is to establish a separate queue for RTNL-holding
> users or vice versa for non-RTNL holding networking users. That
> would allow the drivers to safely flush the non-RTNL queue while
> holding the RTNL.

You mean a separate workqueue for net drivers to use instead of the
keventd_wq? Yeah, I think that'll work. Each driver can also create its
own workqueue but that may be a bit more wasteful.
Re: Cannot receive multicast packets
Andrew,

> I did not
> think the source IP was relevant to the matching code in linux, since
> there are no source squelching socket options.
>
> There are no firewall rules active on this machine, and the packets are
> definitely visible at the interface (see tcpdump output in my email).

The source address is not relevant (other than potentially for firewall
rules), and I understand from your original mail that they are arriving
at the machine. The IP TTL is what I wanted to know there; but
"netstat -s" will normally tell you why a packet was dropped, if it's
arriving but not making it through the UDP/IP stack (as is your case).

> I am going to try upgrading the kernel, and turning off the multicast
> router kernel options as a next step. But if you have any other ideas
> at all, I'm all ears.

"netstat -s" would be a good start. :-) tcpdump receiving a copy of the
packet does not mean UDP or IP won't drop it, but those drops are
counted.

+-DLS
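On Linux the counters that "netstat -s" reports come from
/proc/net/snmp, which pairs a header line with a value line for each
protocol. A minimal sketch of pulling out the UDP drop counters,
assuming that two-line layout; the sample below is hypothetical and
truncated, since a real file has more fields per protocol.

```python
def parse_snmp(text, proto):
    """Return {counter: value} for one protocol from /proc/net/snmp-style text."""
    prefix = proto + ":"
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith(prefix)]
    if len(rows) < 2:
        return {}
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


# Hypothetical, truncated sample in the /proc/net/snmp layout.
SAMPLE = """\
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors InDiscards
Ip: 1 64 1200 0 3 0
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 15 42 0 9
"""

udp = parse_snmp(SAMPLE, "Udp")
print(udp["NoPorts"], udp["InErrors"])
```

Sampling the counters before and after sending a burst of multicast
traffic shows which one moves, which narrows down where the stack is
dropping the packets.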
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Fri, Apr 21, 2006 at 12:37:36PM +1000, Herbert Xu wrote:
>
> Rather than dealing with this individually in each driver perhaps we should
> come up with a more centralised solution?

One simple solution is to establish a separate queue for RTNL-holding
users or vice versa for non-RTNL holding networking users. That
would allow the drivers to safely flush the non-RTNL queue while
holding the RTNL.
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
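The deadlock discussed in this thread, and why a split queue avoids it,
can be sketched as a small model. This is an illustration only: the
dictionaries and the function name are hypothetical, and marking
reset_task as not needing RTNL is an assumption made for the example.

```python
def flush_would_deadlock(queue, caller_holds_rtnl):
    """A flush waits for every queued item to finish. An item that needs
    RTNL can never run while the flusher itself is holding RTNL."""
    return caller_holds_rtnl and any(item["needs_rtnl"] for item in queue)


# Shared queue (keventd-style): driver work and linkwatch end up together.
shared_queue = [
    {"name": "reset_task", "needs_rtnl": False},       # assumption for the model
    {"name": "linkwatch_event", "needs_rtnl": True},
]

# Proposed split: a queue that only ever carries non-RTNL work.
non_rtnl_queue = [w for w in shared_queue if not w["needs_rtnl"]]

print(flush_would_deadlock(shared_queue, caller_holds_rtnl=True))
print(flush_would_deadlock(non_rtnl_queue, caller_holds_rtnl=True))
```

Flushing the shared queue from close(), which runs under RTNL, can
hang; flushing the non-RTNL queue from the same place cannot, as long as
nothing on that queue ever takes RTNL.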
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Thursday 20 April 2006 17:10, Michael Chan wrote:
> In tg3_remove_one(), we call flush_scheduled_work() in case the
> reset_task is still pending. Here, it is safe to call
> flush_scheduled_work() because we're not holding the rtnl. Again, when
> it runs, nothing bad will happen because it will see netif_running() ==
> 0.

I'll bite! Here's a patch to add a call to flush_scheduled_work() in
e1000_down. It's against 2.6.16.9.

Shaw

diff -u -uprN -X linux-2.6.16.9/Documentation/dontdiff linux-2.6.16.9/drivers/net/e1000/e1000_main.c linux-2.6.16.9-patch/drivers/net/e1000/e1000_main.c
--- linux-2.6.16.9/drivers/net/e1000/e1000_main.c	2006-04-18 23:10:14.0 -0700
+++ linux-2.6.16.9-patch/drivers/net/e1000/e1000_main.c	2006-04-20 19:36:55.0 -0700
@@ -538,6 +538,7 @@ e1000_down(struct e1000_adapter *adapter
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	del_timer_sync(&adapter->phy_info_timer);
+	flush_scheduled_work();
 #ifdef CONFIG_E1000_NAPI
 	netif_poll_disable(netdev);
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
Michael Chan <[EMAIL PROTECTED]> wrote:
>
> In tg3_remove_one(), we call flush_scheduled_work() in case the
> reset_task is still pending. Here, it is safe to call

Great.

> flush_scheduled_work() because we're not holding the rtnl. Again, when

Hmm doing a quick grep seems to indicate that quite a number of drivers
do this in netdev->close or other callbacks under RTNL. This means that
they're all vulnerable to the linkwatch deadlock that you alluded to.

Rather than dealing with this individually in each driver perhaps we should
come up with a more centralised solution?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
David S. Miller <[EMAIL PROTECTED]> wrote:
>
> For I/O AT you'd really want to get the DMA engine going as soon
> as you had those packets, but I do not see a clean and reliable way
> to determine the target pages before the app gets back to recvmsg().

The vmsplice() system call proposed by Linus might be a good fit.

http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/0854.html
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Fri, 2006-04-21 at 11:33 +1000, Herbert Xu wrote:

> Actually, what if the tg3_close is followed by a tg3_open? That could
> produce a spurious reset which I suppose isn't that bad.

Yes, an extra reset. And yes, it isn't too bad.

> Also if the
> module is unloaded bad things will happen as well.

In tg3_remove_one(), we call flush_scheduled_work() in case the
reset_task is still pending. Here, it is safe to call
flush_scheduled_work() because we're not holding the rtnl. Again, when
it runs, nothing bad will happen because it will see netif_running() ==
0.
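The two cases discussed here (a late reset after close, and a late
reset after close followed by open) can be replayed with a toy model of
a deferred task that checks a running flag. The Nic class is
hypothetical; only the "return early if not running" check is taken
from the description of tg3 above.

```python
class Nic:
    """Toy device with a deferred reset task that checks a running flag."""

    def __init__(self):
        self.running = False
        self.resets = 0
        self.pending = []

    def open(self):
        self.running = True

    def close(self):
        # Deliberately does not flush pending work (no flush under RTNL).
        self.running = False

    def schedule_reset(self):
        self.pending.append(self.reset_task)

    def reset_task(self):
        if not self.running:   # stands in for the netif_running() check
            return
        self.resets += 1

    def run_pending(self):
        while self.pending:
            self.pending.pop(0)()


def late_reset_after_close():
    nic = Nic()
    nic.open(); nic.schedule_reset(); nic.close()
    nic.run_pending()
    return nic.resets


def late_reset_after_reopen():
    nic = Nic()
    nic.open(); nic.schedule_reset(); nic.close(); nic.open()
    nic.run_pending()
    return nic.resets


print(late_reset_after_close(), late_reset_after_reopen())
```

The first case does nothing; the second performs one reset that nobody
asked for after the reopen, which is the spurious reset both posters
consider tolerable.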
Re: Open ethernet hardware specs
On Thu, Apr 20, 2006 at 06:55:58PM -0400, Jeff Garzik wrote:

> Also, janitors, there are more NIC specs at
> http://gkernel.sourceforge.net/specs/ than are listed on the wiki. What
> I posted is just a starter list. If someone were to comb through each
> PDF in the /specs/ sub-directories, and make sure it is linked on the
> wiki, I would be grateful.

Almost done.

P.S.: http://gkernel.sourceforge.net/specs/via/501designguide.pdf.bz2
is broken.
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Fri, Apr 21, 2006 at 11:27:01AM +1000, herbert wrote:
> On Thu, Apr 20, 2006 at 03:36:57PM -0700, Michael Chan wrote:
> >
> > If we're in tg3_close() and the reset task isn't running yet, tg3_close
> > () will proceed. However, when the reset task finally runs, it will see
> > that netif_running() is zero and will just return.
>
> Yes you're absolutely right.

Actually, what if the tg3_close is followed by a tg3_open? That could
produce a spurious reset which I suppose isn't that bad. Also if the
module is unloaded bad things will happen as well.

So I still don't feel too comfortable about leaving it scheduled after
a close.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: SIOCGIWSCAN wireless event behaviour
Jean Tourrilhes wrote:

> The original behaviour was that the event was sent only when a user
> did request a scan. At that time, cards did not do background scanning,
> so new scan results would be produced only as a result of a user scan.
>
> After a short discussion with Dan, we agreed to change that: the
> driver should send the event whenever a new scan result is available,
> regardless of how it happens (background scan or user scan). This
> allows smart applications to synchronise on background scans and avoid
> generating useless user scans. Minimising the number of user scans is
> actually good.

Thanks for all the responses.

I am not sure if the 'extra' SIOCGIWSCAN event is what is causing
wpa_supplicant's confusion, but the kind of behaviour I am seeing is
wpa_supplicant associating to the network, immediately disassociating,
and then associating again before the connection stabilises. This is
with wpa_supplicant 0.5.2 connecting to an unencrypted network.

I am also seeing that softmac reassociates with a network after
wpa_supplicant exits.

Johannes posted a softmac patch earlier which may help (related to
softmac's handling of SIOCGIWAP). I will do some further investigation
and provide a more complete report if that doesn't fix it.

Thanks,
Daniel
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Thu, Apr 20, 2006 at 03:36:57PM -0700, Michael Chan wrote:
>
> If we're in tg3_close() and the reset task isn't running yet, tg3_close
> () will proceed. However, when the reset task finally runs, it will see
> that netif_running() is zero and will just return.

Yes you're absolutely right.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
From: Rick Jones <[EMAIL PROTECTED]>
Date: Thu, 20 Apr 2006 18:00:37 -0700

> Actually, that brings-up a question - presently, and for reasons that
> are lost to me in the mists of time - netperf will "access" the buffer
> before it calls recv(). I'm wondering if that should be changed to an
> access of the buffer after it calls recv()?

Yes, that's what it should do, as this is what a real application would
do.
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
David S. Miller wrote: From: "Andrew Grover" <[EMAIL PROTECTED]> Date: Thu, 20 Apr 2006 15:14:15 -0700 First obviously it's a technology for RX CPU improvement so there's no benefit on TX workloads. Second it depends on there being buffers to copy the data into *before* the data arrives. This happens to be the case for benchmarks like netperf and Chariot, but real apps using poll/select wouldn't see a benefit, Just laying the cards out here. BUT we are seeing very good CPU savings on some workloads, so for those apps (and if select/poll apps could make use of a yet-to-be-implemented async net interface) it would be a win. I don't know what the breakdown is of apps doing blocking reads vs. waiting, does anyone know? All the bandwidth benchmarks tend to block, real world servers (and most clients to some extent) tend to use non-blocking reads and poll/select except in some very limited cases and designs doing something like 1 thread per connection. Another netperf2 option :) (not exported via configure though) if a certain define is set - look at recv_tcp_stream() in nettest_bsd.c - then netperf will call select() before it calls recv(). rick jones
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
Unfortunately, many benchmarks just do raw bandwidth tests sending to a receiver that just doesn't even look at the data. They just return from recvmsg() and loop back into it. This is not what applications using networking actually do, so it's important to make sure we look intelligently at any benchmarks done and do not fall into the trap of saying "even without cache warming it made things faster" when in fact the tested receiver did not touch the data at all so was a false test.

FWIW, netperf can be configured to access the buffers it gives to send() or gets from recv(). A ./configure --enable-dirty in TOT: http://www.netperf.org/svn/netperf2/trunk will enable two global options:

-k dirty,clean # bytes to dirty, bytes to read clean on netperf side
-K dirty,clean # as above, on netserver side.

And in such a netperf the test banner will include the string "dirty data" (alas the default output will not say how much :) In say a TCP_STREAM test -k will affect what is done with a buffer before send() is called, and -K will affect what is done with a buffer _before_ recv() is called with that buffer.

-k N will cause the first N bytes of the buffer to be dirtied, and the next N bytes to be read clean
-k N, will cause the first N bytes of the buffer to be dirtied
-k ,N will cause the first N bytes of the buffer to be read clean
-k M,N will cause the first M bytes to be dirtied, the next N bytes to be read clean

Actually, that brings-up a question - presently, and for reasons that are lost to me in the mists of time - netperf will "access" the buffer before it calls recv(). I'm wondering if that should be changed to an access of the buffer after it calls recv()? And I suspect related to all this is whether or not one should alter the size of the buffer ring being used by netperf, which by default is the SO_*BUF size divided by the send_size (or recv_size) plus one buffers - the -W option can control that.

rick jones
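[Editor's sketch] The four -k forms described above can be read as a small grammar. The Python below is only an illustration of that reading, not netperf's own option parser; the function names parse_dirty_clean and touch_buffer are hypothetical and do not exist in netperf.

```python
from typing import Tuple


def parse_dirty_clean(spec: str) -> Tuple[int, int]:
    """Return (bytes_to_dirty, bytes_to_read_clean) for a -k/-K style value.

    Hypothetical helper that follows the four forms listed in the message:
    "N", "N,", ",N" and "M,N".
    """
    if "," not in spec:
        n = int(spec)
        return n, n  # "-k N": dirty N bytes, then read the next N clean
    dirty, clean = spec.split(",", 1)
    # An empty field on either side of the comma means "zero bytes".
    return (int(dirty) if dirty else 0, int(clean) if clean else 0)


def touch_buffer(buf: bytearray, dirty: int, clean: int) -> int:
    """Write to the first `dirty` bytes, then read the next `clean` bytes.

    Returns the sum of the bytes that were read clean, so the reads cannot
    be optimised away by a caller that ignores them.
    """
    dirty = min(dirty, len(buf))
    for i in range(dirty):
        buf[i] = (buf[i] + 1) & 0xFF  # dirtying: modify the byte in place
    end = min(dirty + clean, len(buf))
    return sum(buf[dirty:end])  # clean access: read only
```

Whether the access happens before or after recv() is exactly the question raised in the message; the sketch does not settle it, it only models what "dirty" and "clean" mean.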
Re: Cannot receive multicast packets
David: Thank you for taking the time to respond. The packets are arriving via a switched network composed of Cisco devices in PIM dense mode. The packets pass through several switch hops, but no routing hops that have been documented to me. I did not think the source IP was relevant to the matching code in linux, since there are no source squelching socket options. There are no firewall rules active on this machine, and the packets are definitely visible at the interface (see tcpdump output in my email). I am going to try upgrading the kernel, and turning off the multicast router kernel options as a next step. But if you have any other ideas at all, I'm all ears. This seems too much like Mr. Murphy's in the room. A. David Stevens wrote: I've run your test program and it receives fine for me. I note that the source address is not on the same subnet as (any of) the receiver's addresses. Are the packets being routed? The default multicasting TTL is 1, though I don't know if it'll be checked or dropped on the receiver, seeing as we aren't forwarding it. Also, you might want to run "netstat -s" to see if any of the drop counters are being incremented (e.g., checksum error). Finally, I'm assuming you don't have any firewall rules that are matching, right? +-DLS
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
From: Olof Johansson <[EMAIL PROTECTED]> Date: Thu, 20 Apr 2006 18:33:43 -0500 > On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote: > > In > > addition, there may be workloads (file serving? backup?) where we > > could do a skb->page-in-page-cache copy and avoid cache pollution? > > Yes, NFS is probably a prime example of where most of the data isn't > looked at; just written to disk. I'm not sure how well-optimized the > receive path is there already w.r.t. avoiding copying though. I don't > remember seeing memcpy and friends being high on the profile when I > looked at SPECsfs last. If that makes sense then the cpu copy can be made to use non-temporal stores.
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
From: "Andrew Grover" <[EMAIL PROTECTED]> Date: Thu, 20 Apr 2006 15:14:15 -0700 > First obviously it's a technology for RX CPU improvement so there's no > benefit on TX workloads. Second it depends on there being buffers to > copy the data into *before* the data arrives. This happens to be the > case for benchmarks like netperf and Chariot, but real apps using > poll/select wouldn't see a benefit, Just laying the cards out here. > BUT we are seeing very good CPU savings on some workloads, so for > those apps (and if select/poll apps could make use of a > yet-to-be-implemented async net interface) it would be a win. > > I don't know what the breakdown is of apps doing blocking reads vs. > waiting, does anyone know? All the bandwidth benchmarks tend to block, real world servers (and most clients to some extent) tend to use non-blocking reads and poll/select except in some very limited cases and designs doing something like 1 thread per connection. This is an issue for the TCP prequeue and as a consequence VJ's net channel ideas. We need something to wakeup some context in order to push channel data. All the net channel stuff really wants is an execution context to run the TCP stack outside of software interrupts. I/O AT wants something similar. For net channels the probably best thing to do is to just queue to the socket's netchannel, and mark poll state appropriately and just wait for the thread to get back into recvmsg() to run the queue. So I think net channels can be handled in all cases and application I/O models. For I/O AT you'd really want to get the DMA engine going as soon as you had those packets, but I do not see a clean and reliable way to determine the target pages before the app gets back to recvmsg(). I/O AT really expects a lot of things to be in place in order for it to function at all. And sadly, that set of requirements isn't actually very common outside of benchmarking tools and a few uncommonly designed servers. 
Even a web browser does non-blocking reads and poll().
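[Editor's sketch] The "non-blocking reads and poll/select" model mentioned above can be illustrated in user space as follows. This is a minimal Python sketch under stated assumptions, not code from any application discussed in the thread; drain_when_readable is a hypothetical name. Note that the destination buffer only comes into play after the readiness notification, which is the property that makes it hard to know the target pages before the application is back in its receive call.

```python
import selectors
import socket


def drain_when_readable(sock: socket.socket, expected: int,
                        timeout: float = 1.0) -> bytes:
    """Read up to `expected` bytes using readiness notification.

    The socket is non-blocking; recv() is only called after the selector
    reports the socket readable, i.e. after the data has already arrived.
    """
    sock.setblocking(False)
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    chunks = []
    received = 0
    try:
        while received < expected:
            if not sel.select(timeout):
                break  # nothing became readable within the timeout
            try:
                data = sock.recv(4096)
            except BlockingIOError:
                continue  # spurious wakeup, wait for readiness again
            if not data:
                break  # peer closed the connection
            chunks.append(data)
            received += len(data)
    finally:
        sel.unregister(sock)
        sel.close()
    return b"".join(chunks)
```

A blocking receiver, by contrast, hands its buffer to the kernel before the data arrives, which is the case the benchmarks exercise.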
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
From: Olof Johansson <[EMAIL PROTECTED]> Date: Thu, 20 Apr 2006 16:33:05 -0500 > From the wiki: > > >3. Data copied by I/OAT is not cached > > This is a I/OAT device limitation and not a global statement of the > DMA infrastructure. Other platforms might be able to prime caches > with the DMA traffic. Hint flags should be added on either the channel > allocation calls, or per-operation calls, depending on where it makes > sense driver/client wise. This sidesteps the whole question of _which_ cache to warm. And if you choose wrongly, then what? Besides the control overhead of the DMA engines, the biggest thing lost in my opinion is the perfect cache warming that a cpu based copy does from the kernel socket buffer into userspace. The first thing an application is going to do is touch that data. So I think it's very important to prewarm the caches and the only straightforward way I know of to always warm up the correct cpu's caches is copy_to_user(). Unfortunately, many benchmarks just do raw bandwidth tests sending to a receiver that just doesn't even look at the data. They just return from recvmsg() and loop back into it. This is not what applications using networking actually do, so it's important to make sure we look intelligently at any benchmarks done and do not fall into the trap of saying "even without cache warming it made things faster" when in fact the tested receiver did not touch the data at all so was a false test.
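[Editor's sketch] The difference between a receiver that "doesn't even look at the data" and one that touches it can be shown with a small user-space loop. This Python sketch is only an illustration of the two receiver styles; it is not a benchmark, it says nothing about cache behaviour by itself, and the name receive is hypothetical.

```python
import socket
from typing import Tuple


def receive(sock: socket.socket, nbytes: int, touch: bool) -> Tuple[int, int]:
    """Receive `nbytes` from a blocking socket into one reused buffer.

    With touch=False the loop returns from recv_into() and goes straight
    back into it, like the raw bandwidth tests criticised above. With
    touch=True every received byte is read after the call returns, which
    is closer to what a real application does with its data.
    """
    buf = bytearray(4096)
    view = memoryview(buf)
    total = 0
    checksum = 0
    while total < nbytes:
        n = sock.recv_into(view)
        if n == 0:
            break  # peer closed the connection
        if touch:
            # Reading the payload is the step the "false test" skips.
            checksum = (checksum + sum(view[:n])) & 0xFFFFFFFF
        total += n
    return total, checksum
```

Any comparison of copy offload against a CPU copy would presumably need the touch=True style of receiver to be meaningful, per the argument in the message.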
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Fri, 2006-04-21 at 09:51 +1000, Herbert Xu wrote: > > Actually TG3 is buggy too. If the reset task is scheduled but > isn't running yet there is no synchronisation here to prevent the > reset task from running after tg3_close releases the tp lock. > If we're in tg3_close() and the reset task isn't running yet, tg3_close() will proceed. However, when the reset task finally runs, it will see that netif_running() is zero and will just return.
Re: Cannot receive multicast packets
I've run your test program and it receives fine for me. I note that the source address is not on the same subnet as (any of) the receiver's addresses. Are the packets being routed? The default multicasting TTL is 1, though I don't know if it'll be checked or dropped on the receiver, seeing as we aren't forwarding it. Also, you might want to run "netstat -s" to see if any of the drop counters are being incremented (e.g., checksum error). Finally, I'm assuming you don't have any firewall rules that are matching, right? +-DLS
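[Editor's sketch] The original test program is not part of this archive, so the Python below is only a generic illustration of the pieces discussed: joining a group on the receiver and raising the sender-side multicast TTL above its default of 1 when packets must cross a router. All function names, group addresses and ports are hypothetical.

```python
import socket


def is_multicast(addr: str) -> bool:
    """True if `addr` is an IPv4 multicast address (224.0.0.0/4)."""
    return 224 <= socket.inet_aton(addr)[0] <= 239


def build_mreq(group: str, interface: str = "0.0.0.0") -> bytes:
    """Pack the 8-byte ip_mreq value: group address, then local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def open_multicast_receiver(group: str, port: int,
                            interface: str = "0.0.0.0") -> socket.socket:
    """Bind a UDP socket and join `group` on the given local interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, interface))
    return sock


def set_multicast_ttl(sock: socket.socket, ttl: int) -> None:
    """Sender side: allow multicast datagrams to survive `ttl` hops."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
```

On a multi-homed receiver, passing the address of the interface where tcpdump shows the packets instead of 0.0.0.0 removes one variable, since the join then does not depend on the routing table's choice of interface.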
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Fri, Apr 21, 2006 at 09:36:31AM +1000, herbert wrote: > > Yes that's definitely buggy. There needs to be some form of > synchronisation as the TG3 driver does. However, to be frank > I'm not too fond of what the TG3 driver does either. Is there > no better way than an msleep loop? Actually TG3 is buggy too. If the reset task is scheduled but isn't running yet there is no synchronisation here to prevent the reset task from running after tg3_close releases the tp lock. It needs to kill the reset task and make sure it doesn't get rescheduled by someone else. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [RFC] Netlink and user-space buffer pointers
On Thu, 20 Apr 2006, James Smart wrote: > Note: We've transitioned off topic. If what this means is "there isn't a > good > way except by ioctls (which still isn't easily portable) or system calls", > then that's ok. Then at least we know the limits and can look at other > implementation alternatives. This topic has been brought up many times in the past, most recently: http://thread.gmane.org/gmane.linux.drivers.openib/19525/focus=19525 http://thread.gmane.org/gmane.linux.kernel/387375/focus=387455 where it was suggested to the PathScale folks to use some blend of sysfs, netlink sockets and debugfs: http://kerneltrap.org/node/4394 > >>Mike Christie wrote: > >Instead of netlink for scsi commands and transport requests > > > >For scsi commands could we just use sg io, or is there something special > >about the command you want to send? If you can use sg io for scsi > >commands, maybe for transport level requests (in my example iscsi pdu) > >we could modify something like sg/bsg/block layer scsi_ioctl.c to send > >down transport requests to the classes and encapsulate them in some new > >struct transport_requests or use the existing struct request but do that > >thing people keep talking about using the request/request_queue for > >message passing. > > Well - there's 2 parts to this answer: > > First : IOCTL's are considered dangerous/bad practice and therefore it would > be nice to find a replacement mechanism that eliminates them. If that > mechanism has some of the cool features that netlink does, even better. > Using sg io, in the manner you indicate, wouldn't remove the ioctl use. > Note: I have OEMs/users that are very confused about the community's > statement > about ioctls. They've heard they are bad, should never be allowed, will no > longer be supported, but yet they are at the heart of DM and sg io and > other > subsystems. Other than a "grandfathered" explanation, they don't > understand > why the rules bend for one piece of code but not for another. 
To them, all > the features are just as critical regardless of who's providing them. I believe it to be the same for most hardware vendors' customers... > Second: transport level i/o could be done like you suggest, and we've > prototyped some of this as well. However, there's something very wrong > about putting "block device" wrappers and settings around something that > is not a block device. Eeww... no wrappers. Your netlink prototypes certainly get FC-transport further along, but it would also be nice if there could be some subsystem consensus on *the* interface. I honestly don't know which interface is *best*, but from an HBA vendor's perspective managing per-request locally allocated memory is undesirable. Thanks, av
RE: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms
Andi, The driver will be polling (listening) on netlink for any configuration requests. We could release the user tools, but we are not sure where (in the tree) they would reside. Thanks, Ravi -----Original Message----- From: Andi Kleen [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 19, 2006 5:51 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org Subject: Re: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms On Thursday 20 April 2006 00:45, Ravinandan Arakali wrote: > Andi, > We would like to explain that this patch is tier-1 of a two > tiered approach. It implements all the steering > functionality at driver-only level, and it is fairly Neterion-specific. That's fine for experiments, but probably not something that should be in tree. > > The second upcoming submission will add a generic netlink-based > interface for channel data flow and configuration (including receive steering > parameters) on per-channel basis, that will utilize the lower level > implementation from the current patch. Will the driver itself be listening to netlink? My feeling is that teaching the stack to use this would require efficient interfaces, and netlink isn't particularly efficient. But if it's just a glue module outside the driver that would be reasonable as a first step I guess. Do you also plan to release user tools to use it? -Andi
Re: e1000_down and tx_timeout worker race cleaning the transmit buffers
On Thu, Apr 20, 2006 at 05:35:00PM +, [EMAIL PROTECTED] wrote: > > If the e1000_tx_timeout_task were running concurrently with e1000_down, it > seems that they could both attempt to kfree_skb concurrently when running > e1000_unmap_and_free_tx_resource. I googled around to find mention of this > anywhere with no luck. Has this been discussed already? Yes that's definitely buggy. There needs to be some form of synchronisation as the TG3 driver does. However, to be frank I'm not too fond of what the TG3 driver does either. Is there no better way than an msleep loop? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
On Thu, Apr 20, 2006 at 03:14:15PM -0700, Andrew Grover wrote: > Hah, I was just writing an email covering those. I'll incorporate that > into this reponse. > > On 4/20/06, Olof Johansson <[EMAIL PROTECTED]> wrote: > > I guess the overall question is, how much of this needs to be addressed > > in the implementation before merge, and how much should be done when > > more drivers (with more features) are merged down the road. It might not > > make sense to implement all of it now if the only available public > > driver lacks the abilities. But I'm bringing up the points anyway. > > Yeah. But I would think maybe this is a reason to merge at least the > DMA subsystem code, so people with other HW (ARM? I'm still not > exactly sure) can start trying to write a DMA driver and see where the > architecture needs to be generalized further. The interfaces need to evolve as people implement drivers, yes. If it should be before or after merging can be discussed, but as long as everyone is on the same page w.r.t. the interfaces being volatile for a while, merge should be OK. Having a roadmap of known-todo improvements could be beneficial for everyone involved, especially if several people start looking at drivers in parallel. However, so far, (public) activity seems to have been fairly low. > > I would also prefer to see the series clearly split between the DMA > > framework and first clients (networking) and the I/OAT driver. Right now > > "I/OAT" and "DMA" is used interchangeably, especially when describing > > the later patches. It might help you in the perception that this is > > something unique to the Intel chipsets as well. :-) > > I think we have this reasonably well split-out in the patches, but yes > you're right about how we've been using the terms. The patches are well split up already, it was mostly that the network stack changes were marked as I/OAT changes instead of DMA dito. > > >1. 
Performance improvement may be on too narrow a set of workloads > > Maybe from I/OAT and the current client, but the introduction of the > > DMA infrastructure opens up for other uses that are not yet possible in > > the API. For example, DMA with functions is a very natural extension, > > and something that's very common on various platforms (XOR for RAID use, > > checksums, encryption). > > Yes. Does this hardware exist in shipping platforms, so we could use > actual hw to start evaluating the DMA interfaces? Freescale has it on several processors that are shipping, as far as I know. Other embedded families likely have them as well (MIPS, ARM), but I don't know details. The platform I am working on is not yet shipping; I've just started looking at drivers. > > For people who might want to play with it, a reference software-based > > implementation might be useful. > > Yeah I'll ask if I can post the one we have. Or it would be trivial to write. I was going to look at it myself, but if you have one to post that's even more trivial. :-) > > >3. Data copied by I/OAT is not cached > > > > This is a I/OAT device limitation and not a global statement of the > > DMA infrastructure. Other platforms might be able to prime caches > > with the DMA traffic. Hint flags should be added on either the channel > > allocation calls, or per-operation calls, depending on where it makes > > sense driver/client wise. > > Furthermore in our implementation's defense I would say I think the > smart prefetching that modern CPUs do is helping here. Yes. It's also not obvious that warming the cache at copy time is always a gain; it will depend on the receiver and what it does with the data. > In any case, we > are seeing performance gains (see benchmarks), which seems to indicate > this is not an immediate deal-breaker for the technology.. There's always the good old benefit-vs-added-complexity tradeoff, which I guess is the sore spot right now. 
> In > addition, there may be workloads (file serving? backup?) where we > could do a skb->page-in-page-cache copy and avoid cache pollution? Yes, NFS is probably a prime example of where most of the data isn't looked at; just written to disk. I'm not sure how well-optimized the receive path is there already w.r.t. avoiding copying though. I don't remember seeing memcpy and friends being high on the profile when I looked at SPECsfs last. > > >4. Intrusiveness of net stack modifications > > >5. Compatibility with upcoming VJ net channel architecture > > Both of these are outside my scope, so I won't comment on them at this > > time. > > Yeah I don't have much to say about these except we made the patch as > unintrusive as we could, and we think there may be ways to use async > DMA to > help VJ channels, whenever they arrive. Not that I know all the tricks they are using, but it seems to me that it would be hard to both be efficient w.r.t memory use (i.e. more than one IP packet per page) AND avoid copying once. At least without device-level flow classification and per
Open ethernet hardware specs
I started a specs section on the linux-net wiki: http://linux-net.osdl.org/index.php?title=Network-Adapters#Hardware_specifications If you add to this list, please be SURE of the specification's origin. We do not want to link to any "fell off the back of a truck" specs of questionable origin. Also, janitors, there are more NIC specs at http://gkernel.sourceforge.net/specs/ than are listed on the wiki. What I posted is just a starter list. If someone were to comb through each PDF in the /specs/ sub-directories, and make sure it is linked on the wiki, I would be grateful. Jeff
Re: Please pull upstream-fixes branch of wireless-2.6
On Thu, 20 Apr 2006, Andrew Morton wrote: > "John W. Linville" <[EMAIL PROTECTED]> wrote: > > > > At present, all the branches in wireless-2.6 only pull from linux-2.6. > > I am still pushing (i.e. requesting Jeff's pull) to netdev-2.6, > > if that matters. > > > > Maybe the current wireless-2.6 tree fits into your system better? > > Works well, thanks. I have some patches for you ;) Well, since Jeff pushed it on to me, if you have patches that fix obvious problems and should go in before 2.6.17, you can now push those directly to me too ;) Linus
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
Hah, I was just writing an email covering those. I'll incorporate that into this reponse. On 4/20/06, Olof Johansson <[EMAIL PROTECTED]> wrote: > I guess the overall question is, how much of this needs to be addressed > in the implementation before merge, and how much should be done when > more drivers (with more features) are merged down the road. It might not > make sense to implement all of it now if the only available public > driver lacks the abilities. But I'm bringing up the points anyway. Yeah. But I would think maybe this is a reason to merge at least the DMA subsystem code, so people with other HW (ARM? I'm still not exactly sure) can start trying to write a DMA driver and see where the architecture needs to be generalized further. > Maybe it could make sense to add a software-based driver for reference, > and for others to play around with. We wrote one, but just for testing. I think we've been focused on the performance story, so it didn't seem a priority. > I would also prefer to see the series clearly split between the DMA > framework and first clients (networking) and the I/OAT driver. Right now > "I/OAT" and "DMA" is used interchangeably, especially when describing > the later patches. It might help you in the perception that this is > something unique to the Intel chipsets as well. :-) I think we have this reasonably well split-out in the patches, but yes you're right about how we've been using the terms. > (I have also proposed DMA offload discussions as a topic for the Kernel > Summit. I have kept Chris Leech Cc:d on most of the emails in question. It > should be a good place to get input from other subsystems regarding what > functionality they would like to see provided, etc.) I think that would be a good topic for the KS - like you say not necessarily I/OAT but general DMA offload. > >1. 
Performance improvement may be on too narrow a set of workloads > Maybe from I/OAT and the current client, but the introduction of the > DMA infrastructure opens up for other uses that are not yet possible in > the API. For example, DMA with functions is a very natural extension, > and something that's very common on various platforms (XOR for RAID use, > checksums, encryption). Yes. Does this hardware exist in shipping platforms, so we could use actual hw to start evaluating the DMA interfaces? While you may not care (:-) I'd like to address the network performance aspect above, for other netdev readers: First obviously it's a technology for RX CPU improvement so there's no benefit on TX workloads. Second it depends on there being buffers to copy the data into *before* the data arrives. This happens to be the case for benchmarks like netperf and Chariot, but real apps using poll/select wouldn't see a benefit, Just laying the cards out here. BUT we are seeing very good CPU savings on some workloads, so for those apps (and if select/poll apps could make use of a yet-to-be-implemented async net interface) it would be a win. I don't know what the breakdown is of apps doing blocking reads vs. waiting, does anyone know? > >2. Limited availability of hardware supporting I/OAT > > DMA engines are fairly common, even though I/OAT might not be yet. They > just haven't had a common infrastructure until now. We've engaged early that's a good thing :) I think we'd like to see some netdev people do some independent performance analysis of it. If anyone is willing to do so and has time to do so, email us and let's see what we can work out. > For people who might want to play with it, a reference software-based > implementation might be useful. Yeah I'll ask if I can post the one we have. Or it would be trivial to write. > >3. Data copied by I/OAT is not cached > > This is a I/OAT device limitation and not a global statement of the > DMA infrastructure. 
Other platforms might be able to prime caches > with the DMA traffic. Hint flags should be added on either the channel > allocation calls, or per-operation calls, depending on where it makes > sense driver/client wise. Furthermore in our implementation's defense I would say I think the smart prefetching that modern CPUs do is helping here. In any case, we are seeing performance gains (see benchmarks), which seems to indicate this is not an immediate deal-breaker for the technology.. In addition, there may be workloads (file serving? backup?) where we could do a skb->page-in-page-cache copy and avoid cache pollution? > >4. Intrusiveness of net stack modifications > >5. Compatibility with upcoming VJ net channel architecture > Both of these are outside my scope, so I won't comment on them at this > time. Yeah I don't have much to say about these except we made the patch as unintrusive as we could, and we think there may be ways to use async DMA to help VJ channels, whenever they arrive. > I would like to add, for longer term: >* Userspace interfaces: > Are there any plans yet on how to export some of this to userspace?
Re: e1000 breakage in git-netdev-all
Andrew Morton wrote: A bunch of e1000 changes just hit Jeff's tree. Hopefully things are now fixed in git-netdev-all... Jeff
[git patches] net driver fixes
Please pull from 'upstream-linus' branch of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git to receive the following updates:

 drivers/net/ne.c | 2
 drivers/net/wireless/Kconfig | 2
 drivers/net/wireless/airo.c | 46 +++---
 drivers/net/wireless/atmel.c | 11 ++
 drivers/net/wireless/bcm43xx/Kconfig | 3
 drivers/net/wireless/bcm43xx/bcm43xx.h | 17 +++
 drivers/net/wireless/bcm43xx/bcm43xx_debugfs.c | 8 -
 drivers/net/wireless/bcm43xx/bcm43xx_dma.c | 13 +-
 drivers/net/wireless/bcm43xx/bcm43xx_main.c | 2
 drivers/net/wireless/bcm43xx/bcm43xx_phy.c | 1
 drivers/net/wireless/bcm43xx/bcm43xx_power.c | 115 ++---
 drivers/net/wireless/bcm43xx/bcm43xx_power.h | 9 +
 drivers/net/wireless/bcm43xx/bcm43xx_sysfs.c | 115 ++---
 drivers/net/wireless/bcm43xx/bcm43xx_sysfs.h | 16 ---
 drivers/net/wireless/bcm43xx/bcm43xx_wx.c | 8 -
 drivers/net/wireless/orinoco.c | 2
 include/net/ieee80211softmac.h | 3
 net/core/dev.c | 3
 net/core/wireless.c | 8 +
 net/ieee80211/softmac/Kconfig | 1
 net/ieee80211/softmac/ieee80211softmac_assoc.c | 5 -
 net/ieee80211/softmac/ieee80211softmac_event.c | 40 +++-
 net/ieee80211/softmac/ieee80211softmac_io.c | 18 +++
 net/ieee80211/softmac/ieee80211softmac_scan.c | 2
 net/ieee80211/softmac/ieee80211softmac_wx.c | 10 ++
 25 files changed, 289 insertions(+), 171 deletions(-)

Adrian Bunk:
      bcm43xx: fix dyn tssi2dbm memleak

Dan Williams:
      wireless/airo: clean up WEXT association and scan events
      wireless/atmel: send WEXT scan completion events

Erik Mouw:
      bcm43xx: iw_priv_args names should be <16 characters

Jean Tourrilhes:
      wext: Fix IWENCODEEXT security permissions
      Revert NET_RADIO Kconfig title change
      wext: Fix RtNetlink ENCODE security permissions

Johannes Berg:
      softmac: fix event sending
      softmac: report when scanning has finished

[EMAIL PROTECTED]:
      softmac: return -EAGAIN from getscan while scanning
      softmac: dont send out packets while scanning
      softmac: handle iw_mode properly

Michael Buesch:
      softmac: fix spinlock recursion on reassoc
      bcm43xx: set trans_start on TX to prevent bogus timeouts
      bcm43xx: fix pctl slowclock limit calculation
      bcm43xx: sysfs code cleanup

Pavel Roskin:
      orinoco: fix truncating commsquality RID with the latest Symbol firmware

Randy Dunlap:
      softmac uses Wiress Ext.
      bcm43xx wireless: fix printk format warnings
      bcm43xx: fix config menu alignment

Sergei Shtylyov:
      NEx000: fix RTL8019AS base address for RBTX4938

diff --git a/drivers/net/ne.c b/drivers/net/ne.c
index 08b218c..93c494b 100644
--- a/drivers/net/ne.c
+++ b/drivers/net/ne.c
@@ -226,7 +226,7 @@ struct net_device * __init ne_probe(int
     netdev_boot_setup_check(dev);
 #ifdef CONFIG_TOSHIBA_RBTX4938
-    dev->base_addr = 0x07f20280;
+    dev->base_addr = RBTX4938_RTL_8019_BASE;
     dev->irq = RBTX4938_RTL_8019_IRQ;
 #endif
     err = do_ne_probe(dev);
diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
index bad09eb..e0874cb 100644
--- a/drivers/net/wireless/Kconfig
+++ b/drivers/net/wireless/Kconfig
@@ -6,7 +6,7 @@ menu "Wireless LAN (non-hamradio)"
     depends on NETDEVICES
 config NET_RADIO
-    bool "Wireless LAN drivers (non-hamradio)"
+    bool "Wireless LAN drivers (non-hamradio) & Wireless Extensions"
     select WIRELESS_EXT
     ---help---
       Support for wireless LANs and everything having to do with radio,
diff --git a/drivers/net/wireless/airo.c b/drivers/net/wireless/airo.c
index 108d9fe..00764dd 100644
--- a/drivers/net/wireless/airo.c
+++ b/drivers/net/wireless/airo.c
@@ -3139,6 +3139,7 @@ static irqreturn_t airo_interrupt ( int
         }
         if ( status & EV_LINK ) {
             union iwreq_data wrqu;
+            int scan_forceloss = 0;
             /* The link status has changed, if you want to put a
                monitor hook in, do it here. (Remember that
                interrupts are still disabled!)
@@ -3157,7 +3158,8 @@ static irqreturn_t airo_interrupt ( int
                code) */
 #define AUTHFAIL 0x0300 /* Authentication failure (low byte is reason
                code) */
-#define ASSOCIATED 0x0400 /* Assocatied */
+#define ASSOCIATED 0x0400 /* Associated */
+#define REASSOCIATED 0x0600 /* Reassociated? Only on firmware >= 5.30.17 */
 #define RC_RESERVED 0 /* Reserved return code */
 #define RC_NOREASON 1 /* Unspecified reason */
 #define RC_AUTHINV 2 /* Previous authentication invalid */
@@ -3174,44 +3176,30
[PATCH] bridge: allow full size vlan tagged packets to be bridged
The Ethernet bridge code silently drops a packet when forwarding if it is too large for the destination interface (as per 802.1d). But it should allow for the extra header on VLAN tagged frames. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- bridge.orig/net/bridge/br_forward.c 2006-04-10 16:17:51.0 -0700 +++ bridge/net/bridge/br_forward.c 2006-04-19 13:50:42.0 -0700 @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "br_private.h" @@ -29,10 +30,15 @@ return 1; } +static inline unsigned packet_length(const struct sk_buff *skb) +{ + return skb->len - (skb->protocol == htons(ETH_P_8021Q) ? VLAN_HLEN : 0); +} + int br_dev_queue_push_xmit(struct sk_buff *skb) { /* drop mtu oversized packets except tso */ - if (skb->len > skb->dev->mtu && !skb_shinfo(skb)->tso_size) + if (packet_length(skb) > skb->dev->mtu && !skb_shinfo(skb)->tso_size) kfree_skb(skb); else { #ifdef CONFIG_BRIDGE_NETFILTER
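For anyone who wants to poke at the length check outside the kernel, here is a minimal user-space sketch of the same rule. It is not the bridge code: `struct frame` and `should_drop()` are hypothetical stand-ins for the sk_buff fields and the drop branch, the protocol field is kept in host byte order for simplicity, and the two constants are redefined locally with the usual 802.1Q values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_8021Q 0x8100 /* 802.1Q VLAN ethertype */
#define VLAN_HLEN   4      /* bytes added by one VLAN tag */

/* Hypothetical stand-in for the sk_buff fields the check looks at. */
struct frame {
    unsigned int len;      /* frame length as seen by the bridge */
    uint16_t protocol;     /* ethertype, host byte order in this sketch */
    unsigned int tso_size; /* non-zero for TSO frames */
};

/* Length compared against the MTU: a VLAN tag is not counted. */
static unsigned int packet_length(const struct frame *f)
{
    assert(f != NULL);
    return f->len - (f->protocol == ETH_P_8021Q ? VLAN_HLEN : 0);
}

/* Mirrors the patched condition: oversized and not TSO means drop. */
static bool should_drop(const struct frame *f, unsigned int mtu)
{
    return packet_length(f) > mtu && f->tso_size == 0;
}
```

With an MTU of 1500, a 1504-byte tagged frame passes while a 1504-byte untagged frame is still dropped, which is the behaviour the patch description asks for.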
Re: [PATCH 0/10] [IOAT] I/OAT patches repost
On Thu, Apr 20, 2006 at 01:49:16PM -0700, Andrew Grover wrote: > Hi I'm reposting these, originally posted by Chris Leech a few weeks ago. > However, there is an extra part since I broke up one patch that was too > big for netdev last time into two (patches 2 and 3). > > Of course we're always looking for more style improvement comments, but > more importantly we're posting these to talk about the larger issues > around I/OAT and this code making it in upstream at some point. > > These are also available on the wiki, > http://linux-net.osdl.org/index.php/I/OAT . Hi, Since you didn't provide the current issues in this email, I will copy and paste them from the wiki page. I guess the overall question is, how much of this needs to be addressed in the implementation before merge, and how much should be done when more drivers (with more features) are merged down the road. It might not make sense to implement all of it now if the only available public driver lacks the abilities. But I'm bringing up the points anyway. Maybe it could make sense to add a software-based driver for reference, and for others to play around with. I would also prefer to see the series clearly split between the DMA framework and first clients (networking) and the I/OAT driver. Right now "I/OAT" and "DMA" are used interchangeably, especially when describing the later patches. A clear split might also help you counter the perception that this is something unique to the Intel chipsets. :-) (I have also proposed DMA offload discussions as a topic for the Kernel Summit. I have kept Chris Leech Cc:d on most of the emails in question. It should be a good place to get input from other subsystems regarding what functionality they would like to see provided, etc.) From the wiki: > Current issues of concern: > >1. 
Performance improvement may be on too narrow a set of workloads Maybe from I/OAT and the current client, but the introduction of the DMA infrastructure opens up for other uses that are not yet possible in the API. For example, DMA with functions is a very natural extension, and something that's very common on various platforms (XOR for RAID use, checksums, encryption). The API needs to be expanded to cover this by adding function types and adding them to the channel allocation interface and logic. >2. Limited availability of hardware supporting I/OAT DMA engines are fairly common, even though I/OAT might not be yet. They just haven't had a common infrastructure until now. For people who might want to play with it, a reference software-based implementation might be useful. >3. Data copied by I/OAT is not cached This is an I/OAT device limitation and not a global statement of the DMA infrastructure. Other platforms might be able to prime caches with the DMA traffic. Hint flags should be added on either the channel allocation calls, or per-operation calls, depending on where it makes sense driver/client wise. >4. Intrusiveness of net stack modifications >5. Compatibility with upcoming VJ net channel architecture Both of these are outside my scope, so I won't comment on them at this time. I would like to add, for longer term: * Userspace interfaces: Are there any plans yet on how to export some of this to userspace? It might not make full sense for just memcpy due to overheads, but it makes sense for more advanced dma/offload engines. -Olof
Re: [PATCH 9/10] [IOAT] Add sysctl to tuning IOAT offloaded IO threshold
Hi, On Thu, Apr 20, 2006 at 01:50:40PM -0700, Andrew Grover wrote: > > Any socket recv of less than this amount will not be offloaded [...] > --- a/net/core/user_dma.c > +++ b/net/core/user_dma.c > @@ -33,6 +33,10 @@ > > #ifdef CONFIG_NET_DMA > > +#define NET_DMA_DEFAULT_COPYBREAK 1024 > + > +int sysctl_tcp_dma_copybreak = NET_DMA_DEFAULT_COPYBREAK; > + The breakpoint is highly likely to be at different points on various architectures and platforms depending on what they look like, where in the system the DMA engine is, how efficient regular memcpy is, etc. I would like to see it as a config option instead, so it will at least be possible to tune per-arch (via default config, etc). -Olof
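One way the suggestion above could look, sketched in plain C so it can be compiled outside the kernel. `CONFIG_NET_DMA_COPYBREAK` is a hypothetical symbol, not something in the posted series, and `above_copybreak()` is an illustrative helper; the idea is only that the build-time default comes from the configuration while the sysctl stays as the runtime knob.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical config-provided default. An architecture's default config
 * could set it; 1024 is the value used in the posted patch.
 */
#ifndef CONFIG_NET_DMA_COPYBREAK
#define CONFIG_NET_DMA_COPYBREAK 1024
#endif

/* Runtime knob, initialised from the build-time default. */
static int sysctl_tcp_dma_copybreak = CONFIG_NET_DMA_COPYBREAK;

/* Illustrative helper: is a receive of len bytes large enough to offload? */
static int above_copybreak(size_t len)
{
    assert(sysctl_tcp_dma_copybreak >= 0);
    return len > (size_t)sysctl_tcp_dma_copybreak;
}
```

Building with something like `-DCONFIG_NET_DMA_COPYBREAK=4096` would move the default without touching the code, and the sysctl could still override it at run time.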
Re: [PATCH] Fix locking in gianfar
Andy Fleming wrote: This patch fixes several bugs in the gianfar driver, including a major one where spinlocks were horribly broken: * Split gianfar locks into two types: TX and RX * Made it so gfar_start() now clears RHALT * Fixed a bug where calling gfar_start_xmit() with interrupts off would corrupt the interrupt state * Fixed a bug where a frame could potentially arrive, and never be handled (if no more frames arrived) * Fixed a bug where the rx_work_limit would never be observed by the rx completion code * Fixed a bug where the interrupt handlers were not actually protected by their spinlocks Signed-off-by: Andy Fleming <[EMAIL PROTECTED]> ACK but failed: [EMAIL PROTECTED] netdev-2.6]$ git-applymbox /g/tmp/mbox ~/info/signoff.txt 1 patch(es) to process. Applying 'Fix locking in gianfar' fatal: corrupt patch at line 19
Re: Please pull upstream-fixes branch of wireless-2.6
John W. Linville wrote: The following changes since commit 0efd9323f32c137b5cf48bc6582cd08556e7cdfc: Linus Torvalds: Merge branch 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block are found in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-fixes pulled
Re: [PATCH 0/2] e1000: fix two mispatches
Kok, Auke wrote: Hi, This patch series implements two e1000 fixes for an old and new patch mishap. [1] fix mispatch for media type detect. [2] fix mismerge skb_put. These changes are available through git. git://63.64.152.142/~ahkok/git/netdev-2.6 e1000-7.0.38-k2-fixes applied
Re: [PATCH RESEND 1/2] s390: remove tty support from ctc network device driver [1/2]
Frank Pavlic wrote: Hi jeff, after the first shot I sent to you did not apply I resend two new patches I've made today to remove tty from ctc network driver. Please apply Thank you ... applied 1-2 to #upstream (queued for 2.6.18, since 2.6.17 is in -rc)
[2.6 patch] net/802/tr.c: remove an unused export
This patch removes the unused EXPORT_SYMBOL(tr_source_route). (No, the usage in net/llc/llc_output.c can't be modular.) Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> --- linux-2.6.17-rc1-mm3-full/net/802/tr.c.old 2006-04-20 22:45:07.0 +0200 +++ linux-2.6.17-rc1-mm3-full/net/802/tr.c 2006-04-20 22:45:18.0 +0200 @@ -643,6 +643,5 @@ module_init(rif_init); -EXPORT_SYMBOL(tr_source_route); EXPORT_SYMBOL(tr_type_trans); EXPORT_SYMBOL(alloc_trdev);
[PATCH 2b/2] [IOAT] Driver for the I/OAT DMA engine
Second half of the ioatdma.c diff, split up to make it past netdev size block -- Andy Adds a new ioatdma driver, ioatdma.c Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- drivers/dma/ioatdma.c | 805 +++ diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c new file mode 100644 index 000..ffe47dd --- /dev/null +++ b/drivers/dma/ioatdma.c [see previous post for first half of file. sorry] +/** + * ioat_dma_memcpy_issue_pending - push potentially unrecognoized appended descriptors to hw + * @chan: DMA channel handle + */ + +static void ioat_dma_memcpy_issue_pending(struct dma_chan *chan) +{ + struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan); + + if (ioat_chan->pending != 0) { + ioat_chan->pending = 0; + ioatdma_chan_write8(ioat_chan, + IOAT_CHANCMD_OFFSET, + IOAT_CHANCMD_APPEND); + } +} + +static void ioat_dma_memcpy_cleanup(struct ioat_dma_chan *chan) +{ + unsigned long phys_complete; + struct ioat_desc_sw *desc, *_desc; + dma_cookie_t cookie = 0; + + prefetch(chan->completion_virt); + + if (!spin_trylock(&chan->cleanup_lock)) + return; + + /* The completion writeback can happen at any time, + so reads by the driver need to be atomic operations + The descriptor physical addresses are limited to 32-bits + when the CPU can only do a 32-bit mov */ + +#if (BITS_PER_LONG == 64) + phys_complete = chan->completion_virt->full & IOAT_CHANSTS_COMPLETED_DESCRIPTOR_ADDR; +#else + phys_complete = chan->completion_virt->low & IOAT_LOW_COMPLETION_MASK; +#endif + + if ((chan->completion_virt->full & IOAT_CHANSTS_DMA_TRANSFER_STATUS) == + IOAT_CHANSTS_DMA_TRANSFER_STATUS_HALTED) { + printk("IOAT: Channel halted, chanerr = %x\n", + ioatdma_chan_read32(chan, IOAT_CHANERR_OFFSET)); + + /* TODO do something to salvage the situation */ + } + + if (phys_complete == chan->last_completion) { + spin_unlock(&chan->cleanup_lock); + return; + } + + spin_lock_bh(&chan->desc_lock); + list_for_each_entry_safe(desc, _desc, &chan->used_desc, node) { + + /* +* Incoming DMA requests may use 
multiple descriptors, due to +* exceeding xfercap, perhaps. If so, only the last one will +* have a cookie, and require unmapping. +*/ + if (desc->cookie) { + cookie = desc->cookie; + + /* yes we are unmapping both _page and _single alloc'd + regions with unmap_page. Is this *really* that bad? + */ + pci_unmap_page(chan->device->pdev, + pci_unmap_addr(desc, dst), + pci_unmap_len(desc, dst_len), + PCI_DMA_FROMDEVICE); + pci_unmap_page(chan->device->pdev, + pci_unmap_addr(desc, src), + pci_unmap_len(desc, src_len), + PCI_DMA_TODEVICE); + } + + if (desc->phys != phys_complete) { + /* a completed entry, but not the last, so cleanup */ + list_del(&desc->node); + list_add_tail(&desc->node, &chan->free_desc); + } else { + /* last used desc. Do not remove, so we can append from + it, but don't look at it next time, either */ + desc->cookie = 0; + + /* TODO check status bits? */ + break; + } + } + + spin_unlock_bh(&chan->desc_lock); + + chan->last_completion = phys_complete; + if (cookie != 0) + chan->completed_cookie = cookie; + + spin_unlock(&chan->cleanup_lock); +} + +/** + * ioat_dma_is_complete - poll the status of a IOAT DMA transaction + * @chan: IOAT DMA channel handle + * @cookie: DMA transaction identifier + */ + +static enum dma_status ioat_dma_is_complete(struct dma_chan *chan, dma_cookie_t cookie, dma_cookie_t *done, dma_cookie_t *used) +{ + struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan); + dma_cookie_t last_used; + dma_cookie_t last_complete; + enum dma_status ret; + + last_used = chan->cookie; + last_complete = ioat_chan->completed_cookie; + + if (done) + *done= last_complete; + if (used) + *used = last_used; + + ret = dma_async_is_complete(cookie, last_complete, last_used); + if (ret == DMA_SUCCESS) + return ret; + + ioat
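The body of dma_async_is_complete() is not part of this post, so the following is only a sketch of how a cookie can be checked against the (last_complete, last_used) pair that ioat_dma_is_complete() passes in. It assumes cookies are positive integers handed out in increasing order that eventually wrap around; `cookie_status()` is a hypothetical name, and the typedef and enum are redefined locally so the fragment compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t dma_cookie_t;

enum dma_status { DMA_SUCCESS, DMA_IN_PROGRESS };

/*
 * Hypothetical helper: a cookie is still in flight when it lies in the
 * half-open window (last_complete, last_used]. The second branch handles
 * the case where the cookie counter has wrapped, so last_used is
 * numerically smaller than last_complete.
 */
static enum dma_status cookie_status(dma_cookie_t cookie,
                                     dma_cookie_t last_complete,
                                     dma_cookie_t last_used)
{
    assert(cookie > 0); /* assumption: zero and negative values are not valid cookies */

    if (last_complete <= last_used) {
        /* No wrap: everything outside the window has completed. */
        if (cookie <= last_complete || cookie > last_used)
            return DMA_SUCCESS;
    } else {
        /* Wrapped: completed cookies sit between last_used and last_complete. */
        if (cookie <= last_complete && cookie > last_used)
            return DMA_SUCCESS;
    }
    return DMA_IN_PROGRESS;
}
```

For example, with last_complete = 7 and last_used = 10, cookie 5 reports DMA_SUCCESS while cookies 8 through 10 report DMA_IN_PROGRESS.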
[PATCH 2a/2] [IOAT] Driver for the I/OAT engine part 2a
patch 2 got blocked due to size, here is the diff in 2 parts. -- Andy Adds a new ioatdma driver, ioatdma.c Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- drivers/dma/ioatdma.c | 805 +++ diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c new file mode 100644 index 000..ffe47dd --- /dev/null +++ b/drivers/dma/ioatdma.c @@ -0,0 +1,805 @@ +/* + * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ + +/* + * This driver supports an Intel I/OAT DMA engine, which does asynchronous + * copy operations. 
+ */ + +#include +#include +#include +#include +#include +#include +#include "ioatdma.h" +#include "ioatdma_io.h" +#include "ioatdma_registers.h" +#include "ioatdma_hw.h" + +#define to_ioat_chan(chan) container_of(chan, struct ioat_dma_chan, common) +#define to_ioat_device(dev) container_of(dev, struct ioat_device, common) +#define to_ioat_desc(lh) container_of(lh, struct ioat_desc_sw, node) + +/* internal functions */ +static int __devinit ioat_probe(struct pci_dev *pdev, const struct pci_device_id *ent); +static void __devexit ioat_remove(struct pci_dev *pdev); + +static int enumerate_dma_channels(struct ioat_device *device) +{ + u8 xfercap_scale; + u32 xfercap; + int i; + struct ioat_dma_chan *ioat_chan; + + device->common.chancnt = ioatdma_read8(device, IOAT_CHANCNT_OFFSET); + xfercap_scale = ioatdma_read8(device, IOAT_XFERCAP_OFFSET); + xfercap = (xfercap_scale == 0 ? -1 : (1UL << xfercap_scale)); + + for (i = 0; i < device->common.chancnt; i++) { + ioat_chan = kzalloc(sizeof(*ioat_chan), GFP_KERNEL); + if (!ioat_chan) { + device->common.chancnt = i; + break; + } + + ioat_chan->device = device; + ioat_chan->reg_base = device->reg_base + (0x80 * (i + 1)); + ioat_chan->xfercap = xfercap; + spin_lock_init(&ioat_chan->cleanup_lock); + spin_lock_init(&ioat_chan->desc_lock); + INIT_LIST_HEAD(&ioat_chan->free_desc); + INIT_LIST_HEAD(&ioat_chan->used_desc); + /* This should be made common somewhere in dmaengine.c */ + ioat_chan->common.device = &device->common; + ioat_chan->common.client = NULL; + list_add_tail(&ioat_chan->common.device_node, + &device->common.channels); + } + return device->common.chancnt; +} + +static struct ioat_desc_sw *ioat_dma_alloc_descriptor(struct ioat_dma_chan *ioat_chan, int flags) +{ + struct ioat_dma_descriptor *desc; + struct ioat_desc_sw *desc_sw; + struct ioat_device *ioat_device; + dma_addr_t phys; + + ioat_device = to_ioat_device(ioat_chan->common.device); + desc = pci_pool_alloc(ioat_device->dma_pool, flags, &phys); + if 
(unlikely(!desc)) + return NULL; + + desc_sw = kzalloc(sizeof(*desc_sw), flags); + if (unlikely(!desc_sw)) { + pci_pool_free(ioat_device->dma_pool, desc, phys); + return NULL; + } + + memset(desc, 0, sizeof(*desc)); + desc_sw->hw = desc; + desc_sw->phys = phys; + + return desc_sw; +} + +#define INITIAL_IOAT_DESC_COUNT 128 + +static void ioat_start_null_desc(struct ioat_dma_chan *ioat_chan); + +/* returns the actual number of allocated descriptors */ +static int ioat_dma_alloc_chan_resources(struct dma_chan *chan) +{ + struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan); + struct ioat_desc_sw *desc = NULL; + u16 chanctrl; + u32 chanerr; + int i; + + /* +* In-use bit automatically set by reading chanctrl +* If 0, we got it, if 1, someone else did +*/ + chanctrl = ioatdma_chan_read16(ioat_chan, IOAT_CHANCTRL_OFFSET); + if (chanctrl & IOAT_CHANCTRL_CHANNEL_IN_USE) + return -EBUSY; + +/* Setup register to interrupt and write completion status on error */ + chanctrl = IOAT_CHANCTRL_CHANNEL_IN_USE | +
[PATCH 6/10] [IOAT] Struct changes for TCP recv offload to IOAT
Adds an async_wait_queue and some additional fields to tcp_sock, and a dma_cookie_t to sk_buff. Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- include/linux/skbuff.h |4 include/linux/tcp.h|8 include/net/sock.h |2 ++ include/net/tcp.h |7 +++ net/core/sock.c|6 ++ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 613b951..76861a8 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -29,6 +29,7 @@ #include #include #include +#include #define HAVE_ALLOC_SKB /* For the drivers to know */ #define HAVE_ALIGNABLE_SKB /* Ditto 8)*/ @@ -285,6 +286,9 @@ struct sk_buff { __u16 tc_verd;/* traffic control verdict */ #endif #endif +#ifdef CONFIG_NET_DMA + dma_cookie_tdma_cookie; +#endif /* These elements must be at the end, see alloc_skb() for details. */ diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 542d395..c90daa5 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -18,6 +18,7 @@ #define _LINUX_TCP_H #include +#include #include struct tcphdr { @@ -233,6 +234,13 @@ struct tcp_sock { struct iovec*iov; int memory; int len; +#ifdef CONFIG_NET_DMA + /* members for async copy */ + struct dma_chan *dma_chan; + int wakeup; + struct dma_pinned_list *pinned_list; + dma_cookie_tdma_cookie; +#endif } ucopy; __u32 snd_wl1;/* Sequence for window update */ diff --git a/include/net/sock.h b/include/net/sock.h index af2b054..190809c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -132,6 +132,7 @@ struct sock_common { *@sk_receive_queue: incoming packets *@sk_wmem_alloc: transmit queue bytes committed *@sk_write_queue: Packet sending queue + *@sk_async_wait_queue: DMA copied packets *@sk_omem_alloc: "o" is "option" or "other" *@sk_wmem_queued: persistent queue size *@sk_forward_alloc: space allocated forward @@ -205,6 +206,7 @@ struct sock { atomic_tsk_omem_alloc; struct sk_buff_head sk_receive_queue; struct sk_buff_head sk_write_queue; + struct sk_buff_head sk_async_wait_queue; int sk_wmem_queued; int sk_forward_alloc; 
gfp_t sk_allocation; diff --git a/include/net/tcp.h b/include/net/tcp.h index 9418f4d..54e4367 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -820,6 +821,12 @@ static inline void tcp_prequeue_init(str tp->ucopy.len = 0; tp->ucopy.memory = 0; skb_queue_head_init(&tp->ucopy.prequeue); +#ifdef CONFIG_NET_DMA + tp->ucopy.dma_chan = NULL; + tp->ucopy.wakeup = 0; + tp->ucopy.pinned_list = NULL; + tp->ucopy.dma_cookie = 0; +#endif } /* Packet is added to VJ-style prequeue for processing in process diff --git a/net/core/sock.c b/net/core/sock.c index a96ea7d..d2acd35 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -818,6 +818,9 @@ struct sock *sk_clone(const struct sock atomic_set(&newsk->sk_omem_alloc, 0); skb_queue_head_init(&newsk->sk_receive_queue); skb_queue_head_init(&newsk->sk_write_queue); +#ifdef CONFIG_NET_DMA + skb_queue_head_init(&newsk->sk_async_wait_queue); +#endif rwlock_init(&newsk->sk_dst_lock); rwlock_init(&newsk->sk_callback_lock); @@ -1369,6 +1372,9 @@ void sock_init_data(struct socket *sock, skb_queue_head_init(&sk->sk_receive_queue); skb_queue_head_init(&sk->sk_write_queue); skb_queue_head_init(&sk->sk_error_queue); +#ifdef CONFIG_NET_DMA + skb_queue_head_init(&sk->sk_async_wait_queue); +#endif sk->sk_send_head= NULL; -- 1.2.6 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/10] [IOAT] Util funcs for offloading sk_buff to iovec copies
Provides for pinning user space pages in memory, copying to iovecs, and copying from sk_buffs including fragmented and chained sk_buffs. Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- drivers/dma/Makefile |3 drivers/dma/iovlock.c | 301 + include/linux/dmaengine.h | 22 +++ include/net/netdma.h |6 + net/core/Makefile |1 net/core/user_dma.c | 141 + diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile index c8a5f56..bdcfdbd 100644 --- a/drivers/dma/Makefile +++ b/drivers/dma/Makefile @@ -1,2 +1,3 @@ -obj-y += dmaengine.o +obj-$(CONFIG_DMA_ENGINE) += dmaengine.o +obj-$(CONFIG_NET_DMA) += iovlock.o obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c new file mode 100644 index 000..5ed327e --- /dev/null +++ b/drivers/dma/iovlock.c @@ -0,0 +1,301 @@ +/* + * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved. + * Portions based on net/core/datagram.c and copyrighted by their authors. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ + +/* + * This code allows the net stack to make use of a DMA engine for + * skb to iovec copies. 
+ */ + +#include +#include +#include /* for memcpy_toiovec */ +#include +#include + +int num_pages_spanned(struct iovec *iov) +{ + return + ((PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) - + ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT); +} + +/* + * Pin down all the iovec pages needed for len bytes. + * Return a struct dma_pinned_list to keep track of pages pinned down. + * + * We are allocating a single chunk of memory, and then carving it up into + * 3 sections, the latter 2 whose size depends on the number of iovecs and the + * total number of pages, respectively. + */ +struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len) +{ + struct dma_pinned_list *local_list; + struct page **pages; + int i; + int ret; + int nr_iovecs = 0; + int iovec_len_used = 0; + int iovec_pages_used = 0; + long err; + + /* don't pin down non-user-based iovecs */ + if (segment_eq(get_fs(), KERNEL_DS)) + return NULL; + + /* determine how many iovecs/pages there are, up front */ + do { + iovec_len_used += iov[nr_iovecs].iov_len; + iovec_pages_used += num_pages_spanned(&iov[nr_iovecs]); + nr_iovecs++; + } while (iovec_len_used < len); + + /* single kmalloc for pinned list, page_list[], and the page arrays */ + local_list = kmalloc(sizeof(*local_list) + + (nr_iovecs * sizeof (struct dma_page_list)) + + (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL); + if (!local_list) { + err = -ENOMEM; + goto out; + } + + /* list of pages starts right after the page list array */ + pages = (struct page **) &local_list->page_list[nr_iovecs]; + + for (i = 0; i < nr_iovecs; i++) { + struct dma_page_list *page_list = &local_list->page_list[i]; + + len -= iov[i].iov_len; + + if (!access_ok(VERIFY_WRITE, iov[i].iov_base, iov[i].iov_len)) { + err = -EFAULT; + goto unpin; + } + + page_list->nr_pages = num_pages_spanned(&iov[i]); + page_list->base_address = iov[i].iov_base; + + page_list->pages = pages; + pages += page_list->nr_pages; + + /* pin pages down */ + 
down_read(&current->mm->mmap_sem); + ret = get_user_pages( + current, + current->mm, + (unsigned long) iov[i].iov_base, + page_list->nr_pages, + 1, /* write */ + 0, /* force */ + page_list->pages, + NULL); + up_read(&current->mm->mmap_sem); + + if (ret != page_list->nr_pages) { + err = -ENOMEM; +
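The page-count arithmetic in num_pages_spanned() is easy to get wrong by one, so here is a small user-space sketch of the same calculation. It assumes 4 KiB pages; the `SK_` macros and `pages_spanned()` are local, hypothetical names that mimic PAGE_SHIFT, PAGE_MASK and PAGE_ALIGN rather than the kernel definitions themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))
#define SK_PAGE_ALIGN(addr) (((addr) + SK_PAGE_SIZE - 1) & SK_PAGE_MASK)

/*
 * Number of pages touched by the byte range [base, base + len):
 * round the end up to a page boundary, round the start down,
 * and divide the distance by the page size.
 */
static unsigned long pages_spanned(unsigned long base, size_t len)
{
    assert(base + len >= base); /* the range must not wrap the address space */
    return (SK_PAGE_ALIGN(base + len) - (base & SK_PAGE_MASK)) >> SK_PAGE_SHIFT;
}
```

A 32-byte buffer starting at 0x1ff0 straddles a page boundary and counts as two pages, while a full page starting at 0x1000 counts as one.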
[PATCH 0/10] [IOAT] I/OAT patches repost
Hi I'm reposting these, originally posted by Chris Leech a few weeks ago. However, there is an extra part since I broke up one patch that was too big for netdev last time into two (patches 2 and 3). Of course we're always looking for more style improvement comments, but more importantly we're posting these to talk about the larger issues around I/OAT and this code making it in upstream at some point. These are also available on the wiki, http://linux-net.osdl.org/index.php/I/OAT . Thanks -- Andy
[PATCH 9/10] [IOAT] Add sysctl to tuning IOAT offloaded IO threshold
Any socket recv of less than this amount will not be offloaded Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- include/linux/sysctl.h |1 + include/net/tcp.h |1 + net/core/user_dma.c|4 net/ipv4/sysctl_net_ipv4.c | 10 ++ diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 76eaeff..cd9e7c0 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -403,6 +403,7 @@ enum NET_TCP_MTU_PROBING=113, NET_TCP_BASE_MSS=114, NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115, + NET_TCP_DMA_COPYBREAK=116, }; enum { diff --git a/include/net/tcp.h b/include/net/tcp.h index ca5bdaf..2e6fdef 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -219,6 +219,7 @@ extern int sysctl_tcp_adv_win_scale; extern int sysctl_tcp_tw_reuse; extern int sysctl_tcp_frto; extern int sysctl_tcp_low_latency; +extern int sysctl_tcp_dma_copybreak; extern int sysctl_tcp_nometrics_save; extern int sysctl_tcp_moderate_rcvbuf; extern int sysctl_tcp_tso_win_divisor; diff --git a/net/core/user_dma.c b/net/core/user_dma.c index ec177ef..642a3f3 100644 --- a/net/core/user_dma.c +++ b/net/core/user_dma.c @@ -33,6 +33,10 @@ #ifdef CONFIG_NET_DMA +#define NET_DMA_DEFAULT_COPYBREAK 1024 + +int sysctl_tcp_dma_copybreak = NET_DMA_DEFAULT_COPYBREAK; + /** * dma_skb_copy_datagram_iovec - Copy a datagram to an iovec. 
* @skb - buffer to copy diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 6b6c3ad..6a6aa53 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -688,6 +688,16 @@ ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = &proc_dointvec }, +#ifdef CONFIG_NET_DMA + { + .ctl_name = NET_TCP_DMA_COPYBREAK, + .procname = "tcp_dma_copybreak", + .data = &sysctl_tcp_dma_copybreak, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec + }, +#endif { .ctl_name = 0 } }; -- 1.2.6
[PATCH 10/10] [IOAT] Actual changes to the net stack to use IOAT
Locks down user pages and sets up for DMA in tcp_recvmsg, then calls dma_async_try_early_copy in tcp_v4_do_rcv Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- net/ipv4/tcp.c | 101 -- net/ipv4/tcp_input.c | 74 + net/ipv4/tcp_ipv4.c | 18 - net/ipv6/tcp_ipv6.c | 12 +- diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 2346539..8be8d69 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -263,7 +263,7 @@ #include #include #include - +#include #include #include @@ -1110,6 +1110,7 @@ int tcp_recvmsg(struct kiocb *iocb, stru int target; /* Read at least this many bytes */ long timeo; struct task_struct *user_recv = NULL; + int copied_early = 0; lock_sock(sk); @@ -1133,6 +1134,15 @@ int tcp_recvmsg(struct kiocb *iocb, stru target = sock_rcvlowat(sk, flags & MSG_WAITALL, len); +#ifdef CONFIG_NET_DMA + tp->ucopy.dma_chan = NULL; + preempt_disable(); + if ((len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) && + !sysctl_tcp_low_latency && __get_cpu_var(softnet_data.net_dma)) + tp->ucopy.pinned_list = dma_pin_iovec_pages(msg->msg_iov, len); + preempt_enable_no_resched(); +#endif + do { struct sk_buff *skb; u32 offset; @@ -1274,6 +1284,10 @@ int tcp_recvmsg(struct kiocb *iocb, stru } else sk_wait_data(sk, &timeo); +#ifdef CONFIG_NET_DMA + tp->ucopy.wakeup = 0; +#endif + if (user_recv) { int chunk; @@ -1329,13 +1343,39 @@ do_prequeue: } if (!(flags & MSG_TRUNC)) { - err = skb_copy_datagram_iovec(skb, offset, - msg->msg_iov, used); - if (err) { - /* Exception. Bailout! */ - if (!copied) - copied = -EFAULT; - break; +#ifdef CONFIG_NET_DMA + if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list) + tp->ucopy.dma_chan = get_softnet_dma(); + + if (tp->ucopy.dma_chan) { + tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec( + tp->ucopy.dma_chan, skb, offset, + msg->msg_iov, used, + tp->ucopy.pinned_list); + + if (tp->ucopy.dma_cookie < 0) { + + printk(KERN_ALERT "dma_cookie < 0\n"); + + /* Exception. Bailout! 
*/ + if (!copied) + copied = -EFAULT; + break; + } + if ((offset + used) == skb->len) + copied_early = 1; + + } else +#endif + { + err = skb_copy_datagram_iovec(skb, offset, + msg->msg_iov, used); + if (err) { + /* Exception. Bailout! */ + if (!copied) + copied = -EFAULT; + break; + } } } @@ -1355,15 +1395,19 @@ skip_copy: if (skb->h.th->fin) goto found_fin_ok; - if (!(flags & MSG_PEEK)) - sk_eat_skb(sk, skb, 0); + if (!(flags & MSG_PEEK)) { + sk_eat_skb(sk, skb, copied_early); + copied_early = 0; + } continue; found_fin_ok: /* Process the FIN. */ ++*seq; - if (!(flags & MSG_PEEK)) - sk_eat_skb(sk, skb, 0); + if (!(flags & MSG_PEEK)) { + sk_eat_skb(sk, skb, copied_early); + copied_early = 0; + } break; } while (len > 0); @@ -1386,6 +1430,36 @@ skip_copy: tp->ucopy.len = 0; } +#ifdef CONFIG_NET_DMA + if (tp->ucopy.dma_chan) { + struct sk_buff *skb; + dma_cookie_t done, used; + + dma_async_memcpy_issue_pending(tp->ucopy.dma_chan); + + while (dma_async_memcpy_complete(tp->ucopy.dma_chan, +tp->ucopy.dma_cookie, &done,
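For readers following the tcp_recvmsg hunk above, the condition that decides whether user pages get pinned for DMA can be modelled in plain userspace C. This is only a sketch: struct rx_policy and should_offload_copy() are hypothetical names, and the 4096-byte copybreak in the usage note is an illustrative value, not a default taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel state the patch consults. */
struct rx_policy {
	size_t copybreak;      /* models sysctl_tcp_dma_copybreak */
	bool low_latency;      /* models sysctl_tcp_low_latency */
	bool cpu_has_channel;  /* models softnet_data.net_dma != NULL */
};

/*
 * Returns true when a read of 'len' bytes would take the DMA path:
 * the read is larger than the copybreak, it is not a MSG_PEEK read,
 * low-latency mode is off, and this CPU has a DMA channel.
 */
static bool should_offload_copy(const struct rx_policy *p, size_t len,
				bool msg_peek)
{
	assert(p != NULL);
	return len > p->copybreak && !msg_peek &&
	       !p->low_latency && p->cpu_has_channel;
}
```

With a copybreak of 4096, an 8192-byte read qualifies, while a 4096-byte read, a MSG_PEEK read, or any read with low latency enabled falls back to the ordinary CPU copy.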
[PATCH 1/10] [IOAT] DMA memcpy subsystem
Provides an API for offloading memory copies to DMA devices Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- drivers/Kconfig |2 drivers/Makefile |1 drivers/dma/Kconfig | 13 + drivers/dma/Makefile |1 drivers/dma/dmaengine.c | 405 + include/linux/dmaengine.h | 337 + diff --git a/drivers/Kconfig b/drivers/Kconfig index 9f5c0da..f89ac05 100644 --- a/drivers/Kconfig +++ b/drivers/Kconfig @@ -72,4 +72,6 @@ source "drivers/edac/Kconfig" source "drivers/rtc/Kconfig" +source "drivers/dma/Kconfig" + endmenu diff --git a/drivers/Makefile b/drivers/Makefile index 4249552..9b808a6 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -74,3 +74,4 @@ obj-$(CONFIG_SGI_SN) += sn/ obj-y += firmware/ obj-$(CONFIG_CRYPTO) += crypto/ obj-$(CONFIG_SUPERH) += sh/ +obj-$(CONFIG_DMA_ENGINE) += dma/ diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig new file mode 100644 index 000..f9ac4bc --- /dev/null +++ b/drivers/dma/Kconfig @@ -0,0 +1,13 @@ +# +# DMA engine configuration +# + +menu "DMA Engine support" + +config DMA_ENGINE + bool "Support for DMA engines" + ---help--- + DMA engines offload copy operations from the CPU to dedicated + hardware, allowing the copies to happen asynchronously. + +endmenu diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile new file mode 100644 index 000..10b7391 --- /dev/null +++ b/drivers/dma/Makefile @@ -0,0 +1 @@ +obj-y += dmaengine.o diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c new file mode 100644 index 000..683456a --- /dev/null +++ b/drivers/dma/dmaengine.c @@ -0,0 +1,405 @@ +/* + * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. 
+ * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ + +/* + * This code implements the DMA subsystem. It provides a HW-neutral interface + * for other kernel code to use asynchronous memory copy capabilities, + * if present, and allows different HW DMA drivers to register as providing + * this capability. + * + * Due to the fact we are accelerating what is already a relatively fast + * operation, the code goes to great lengths to avoid additional overhead, + * such as locking. + * + * LOCKING: + * + * The subsystem keeps two global lists, dma_device_list and dma_client_list. + * Both of these are protected by a spinlock, dma_list_lock. + * + * Each device has a channels list, which runs unlocked but is never modified + * once the device is registered, it's just setup by the driver. + * + * Each client has a channels list, it's only modified under the client->lock + * and in an RCU callback, so it's safe to read under rcu_read_lock(). + * + * Each device has a kref, which is initialized to 1 when the device is + * registered. A kref_put is done for each class_device registered. When the + * class_device is released, the coresponding kref_put is done in the release + * method. Every time one of the device's channels is allocated to a client, + * a kref_get occurs. When the channel is freed, the coresponding kref_put + * happens. 
The device's release function does a completion, so + * unregister_device does a remove event, class_device_unregister, a kref_put + * for the first reference, then waits on the completion for all other + * references to finish. + * + * Each channel has an open-coded implementation of Rusty Russell's "bigref," + * with a kref and a per_cpu local_t. A single reference is set when on an + * ADDED event, and removed with a REMOVE event. Net DMA client takes an + * extra reference per outstanding transaction. The relase function does a + * kref_put on the device. -ChrisL + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +static DEFINE_SPINLOCK(dma_list_lock); +static LIST_HEAD(dma_device_list); +static LIST_HEAD(dma_client_list); + +/* --- sysfs implementation --- */ + +static ssize_t show_memcpy_count(struct class_device *cd, char *buf) +{ + struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev); + un
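The device lifetime rules in the LOCKING comment above can be illustrated with a toy reference counter. This is a single-threaded userspace sketch, not the kernel's kref (which is atomic); struct toy_ref and its helpers are hypothetical names.

```c
#include <assert.h>

/* Toy, non-atomic model of the device kref described in the comment. */
struct toy_ref {
	int count;
	int released;  /* set once the release step has run */
};

/* Registration starts the device with one reference. */
static void toy_ref_init(struct toy_ref *r)
{
	r->count = 1;
	r->released = 0;
}

/* Taken each time one of the device's channels goes to a client. */
static void toy_ref_get(struct toy_ref *r)
{
	assert(r->count > 0);
	r->count++;
}

/* Returns 1 if this put dropped the last reference. */
static int toy_ref_put(struct toy_ref *r)
{
	assert(r->count > 0);
	if (--r->count == 0) {
		r->released = 1;  /* the real code signals a completion here */
		return 1;
	}
	return 0;
}
```

In this model, unregistering drops the initial reference, but the release step only runs after every channel reference has also been put, which is the point at which the real unregister path would stop waiting on its completion.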
[PATCH 8/10] [IOAT] Make sk_eat_skb() IOAT-aware
Add an extra argument to sk_eat_skb, and make it move early copied
packets to the async_wait_queue instead of freeing them.

Signed-off-by: Chris Leech <[EMAIL PROTECTED]>
---

 include/net/sock.h | 13 -
 net/dccp/proto.c |4 ++--
 net/ipv4/tcp.c |8
 net/llc/af_llc.c |2 +-

diff --git a/include/net/sock.h b/include/net/sock.h
index 190809c..e3723b6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1272,11 +1272,22 @@ sock_recv_timestamp(struct msghdr *msg,
  * This routine must be called with interrupts disabled or with the socket
  * locked so that the sk_buff queue operation is ok.
  */
-static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
+#ifdef CONFIG_NET_DMA
+static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
+{
+	__skb_unlink(skb, &sk->sk_receive_queue);
+	if (!copied_early)
+		__kfree_skb(skb);
+	else
+		__skb_queue_tail(&sk->sk_async_wait_queue, skb);
+}
+#else
+static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, int copied_early)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
 	__kfree_skb(skb);
 }
+#endif
 extern void sock_enable_timestamp(struct sock *sk);
 extern int sock_get_timestamp(struct sock *, struct timeval __user *);
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 1ff7328..35d7dfd 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -719,7 +719,7 @@ int dccp_recvmsg(struct kiocb *iocb, str
 		}
 		dccp_pr_debug("packet_type=%s\n",
 			      dccp_packet_name(dh->dccph_type));
-		sk_eat_skb(sk, skb);
+		sk_eat_skb(sk, skb, 0);
 verify_sock_status:
 		if (sock_flag(sk, SOCK_DONE)) {
 			len = 0;
@@ -773,7 +773,7 @@ verify_sock_status:
 		}
 	found_fin_ok:
 		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 		break;
 	} while (1);
 out:
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b10f78c..2346539 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1072,11 +1072,11 @@ int tcp_read_sock(struct sock *sk, read_
 			break;
 		}
 		if (skb->h.th->fin) {
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb);
+		sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
 			break;
 	}
@@ -1356,14 +1356,14 @@ skip_copy:
 		if (skb->h.th->fin)
 			goto found_fin_ok;
 		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 		continue;
 	found_fin_ok:
 		/* Process the FIN. */
 		++*seq;
 		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 		break;
 	} while (len > 0);
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 5a04db7..7465170 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -789,7 +789,7 @@ static int llc_ui_recvmsg(struct kiocb *
 			continue;
 		if (!(flags & MSG_PEEK)) {
-			sk_eat_skb(sk, skb);
+			sk_eat_skb(sk, skb, 0);
 			*seq = 0;
 		}
 	} while (len > 0);
-- 
1.2.6
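To make the effect of the new copied_early argument concrete, here is a small userspace model of the two outcomes. It is a sketch under simplifying assumptions: all toy_* names are hypothetical, the buffer eaten is always the head of the receive queue, and setting a flag stands in for __kfree_skb().

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb {
	struct toy_skb *next;
	int freed;  /* stands in for the buffer having been freed */
};

struct toy_queue {
	struct toy_skb *head, *tail;
	int len;
};

struct toy_sock {
	struct toy_queue receive_queue;
	struct toy_queue async_wait_queue;
};

/* Append a buffer to the tail of a queue. */
static void toy_queue_tail(struct toy_queue *q, struct toy_skb *skb)
{
	skb->next = NULL;
	if (q->tail)
		q->tail->next = skb;
	else
		q->head = skb;
	q->tail = skb;
	q->len++;
}

/* Unlink and return the head buffer, or NULL if the queue is empty. */
static struct toy_skb *toy_dequeue(struct toy_queue *q)
{
	struct toy_skb *skb = q->head;

	if (!skb)
		return NULL;
	q->head = skb->next;
	if (!q->head)
		q->tail = NULL;
	skb->next = NULL;
	q->len--;
	return skb;
}

/*
 * Model of sk_eat_skb(): a normally copied buffer is freed at once, an
 * early-copied one is parked until its asynchronous copy has completed.
 */
static void toy_eat_head(struct toy_sock *sk, int copied_early)
{
	struct toy_skb *skb = toy_dequeue(&sk->receive_queue);

	assert(skb != NULL);
	if (!copied_early)
		skb->freed = 1;
	else
		toy_queue_tail(&sk->async_wait_queue, skb);
}
```

The point of the second queue is that a buffer whose data is still being copied by the DMA engine must stay alive until that copy is known to be done.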
[PATCH 4/10] [IOAT] Setup the net subsystem as DMA client
Attempts to allocate per-CPU DMA channels Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- drivers/dma/Kconfig | 12 + include/linux/netdevice.h |4 ++ include/net/netdma.h | 38 net/core/dev.c| 104 + diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig index 0f15e76..30d021d 100644 --- a/drivers/dma/Kconfig +++ b/drivers/dma/Kconfig @@ -10,6 +10,18 @@ config DMA_ENGINE DMA engines offload copy operations from the CPU to dedicated hardware, allowing the copies to happen asynchronously. +comment "DMA Clients" + +config NET_DMA + bool "Network: TCP receive copy offload" + depends on DMA_ENGINE && NET + default y + ---help--- + This enables the use of DMA engines in the network stack to + offload receive copy-to-user operations, freeing CPU cycles. + Since this is the main user of the DMA engine, it should be enabled; + say Y here. + comment "DMA Devices" config INTEL_IOATDMA diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 950dc55..7fda35f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -37,6 +37,7 @@ #include #include #include +#include struct divert_blk; struct vlan_group; @@ -592,6 +593,9 @@ struct softnet_data struct sk_buff *completion_queue; struct net_device backlog_dev;/* Sorry. 8) */ +#ifdef CONFIG_NET_DMA + struct dma_chan *net_dma; +#endif }; DECLARE_PER_CPU(struct softnet_data,softnet_data); diff --git a/include/net/netdma.h b/include/net/netdma.h new file mode 100644 index 000..cbfe89d --- /dev/null +++ b/include/net/netdma.h @@ -0,0 +1,38 @@ +/* + * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. 
+ * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ +#ifndef NETDMA_H +#define NETDMA_H +#include +#ifdef CONFIG_NET_DMA +#include + +static inline struct dma_chan *get_softnet_dma(void) +{ + struct dma_chan *chan; + rcu_read_lock(); + chan = rcu_dereference(__get_cpu_var(softnet_data.net_dma)); + if (chan) + dma_chan_get(chan); + rcu_read_unlock(); + return chan; +} +#endif /* CONFIG_NET_DMA */ +#endif /* NETDMA_H */ diff --git a/net/core/dev.c b/net/core/dev.c index a3ab11f..ffd3d6d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -115,6 +115,7 @@ #include #include #include +#include /* * The list of packet types we will receive (as opposed to discard) @@ -148,6 +149,12 @@ static DEFINE_SPINLOCK(ptype_lock); static struct list_head ptype_base[16];/* 16 way hashed list */ static struct list_head ptype_all; /* Taps */ +#ifdef CONFIG_NET_DMA +static struct dma_client *net_dma_client; +static unsigned int net_dma_count; +static spinlock_t net_dma_event_lock; +#endif + /* * The @dev_base list is protected by @dev_base_lock and the rtln * semaphore. 
@@ -1780,6 +1787,19 @@ static void net_rx_action(struct softirq } } out: +#ifdef CONFIG_NET_DMA + /* +* There may not be any more sk_buffs coming right now, so push +* any pending DMA copies to hardware +*/ + if (net_dma_client) { + struct dma_chan *chan; + rcu_read_lock(); + list_for_each_entry_rcu(chan, &net_dma_client->channels, client_node) + dma_async_memcpy_issue_pending(chan); + rcu_read_unlock(); + } +#endif local_irq_enable(); return; @@ -3243,6 +3263,88 @@ static int dev_cpu_callback(struct notif } #endif /* CONFIG_HOTPLUG_CPU */ +#ifdef CONFIG_NET_DMA +/** + * net_dma_rebalance - + * This is called when the number of channels allocated to the net_dma_client + * changes. The net_dma_client tries to have one DMA channel per CPU. + */ +static void net_dma_rebalance(void) +{ + unsigned int cpu, i, n; + struct dma_chan *chan; + + lock_cpu_hotplug(); + + if (net_dma_count == 0) { + for_each_online_cpu(cpu) + rcu_assign_pointer(per_cpu(
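The body of net_dma_rebalance() is cut off above, so the following is only a guess at one way "one DMA channel per CPU" could be approximated when the channel count and CPU count differ. It is a userspace sketch with hypothetical names (toy_rebalance, TOY_NO_CHAN) and should not be read as the patch's actual code.

```c
#include <assert.h>

#define TOY_NO_CHAN (-1)

/*
 * Fill chan_of_cpu[0..ncpus-1] with a channel index, or TOY_NO_CHAN when
 * there are no channels. Channels are handed out in contiguous blocks of
 * CPUs; the first (ncpus % nchans) channels serve one extra CPU.
 */
static void toy_rebalance(int *chan_of_cpu, int ncpus, int nchans)
{
	int cpu = 0;

	assert(ncpus >= 0 && nchans >= 0);

	if (nchans == 0) {
		for (cpu = 0; cpu < ncpus; cpu++)
			chan_of_cpu[cpu] = TOY_NO_CHAN;
		return;
	}

	for (int chan = 0; chan < nchans; chan++) {
		int n = ncpus / nchans + (chan < ncpus % nchans ? 1 : 0);

		while (n-- > 0)
			chan_of_cpu[cpu++] = chan;
	}
}
```

With four CPUs and three channels this gives CPUs 0 and 1 the first channel and CPUs 2 and 3 one channel each; with more channels than CPUs the surplus channels are simply left unused.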
[PATCH 3/10] [IOAT] Driver for the I/OAT DMA engine part 2
Adds a new ioatdma driver, ioatdma.c Signed-off-by: Chris Leech <[EMAIL PROTECTED]> --- drivers/dma/ioatdma.c | 805 +++ diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c new file mode 100644 index 000..ffe47dd --- /dev/null +++ b/drivers/dma/ioatdma.c @@ -0,0 +1,805 @@ +/* + * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ + +/* + * This driver supports an Intel I/OAT DMA engine, which does asynchronous + * copy operations. 
+ */ + +#include +#include +#include +#include +#include +#include +#include "ioatdma.h" +#include "ioatdma_io.h" +#include "ioatdma_registers.h" +#include "ioatdma_hw.h" + +#define to_ioat_chan(chan) container_of(chan, struct ioat_dma_chan, common) +#define to_ioat_device(dev) container_of(dev, struct ioat_device, common) +#define to_ioat_desc(lh) container_of(lh, struct ioat_desc_sw, node) + +/* internal functions */ +static int __devinit ioat_probe(struct pci_dev *pdev, const struct pci_device_id *ent); +static void __devexit ioat_remove(struct pci_dev *pdev); + +static int enumerate_dma_channels(struct ioat_device *device) +{ + u8 xfercap_scale; + u32 xfercap; + int i; + struct ioat_dma_chan *ioat_chan; + + device->common.chancnt = ioatdma_read8(device, IOAT_CHANCNT_OFFSET); + xfercap_scale = ioatdma_read8(device, IOAT_XFERCAP_OFFSET); + xfercap = (xfercap_scale == 0 ? -1 : (1UL << xfercap_scale)); + + for (i = 0; i < device->common.chancnt; i++) { + ioat_chan = kzalloc(sizeof(*ioat_chan), GFP_KERNEL); + if (!ioat_chan) { + device->common.chancnt = i; + break; + } + + ioat_chan->device = device; + ioat_chan->reg_base = device->reg_base + (0x80 * (i + 1)); + ioat_chan->xfercap = xfercap; + spin_lock_init(&ioat_chan->cleanup_lock); + spin_lock_init(&ioat_chan->desc_lock); + INIT_LIST_HEAD(&ioat_chan->free_desc); + INIT_LIST_HEAD(&ioat_chan->used_desc); + /* This should be made common somewhere in dmaengine.c */ + ioat_chan->common.device = &device->common; + ioat_chan->common.client = NULL; + list_add_tail(&ioat_chan->common.device_node, + &device->common.channels); + } + return device->common.chancnt; +} + +static struct ioat_desc_sw *ioat_dma_alloc_descriptor(struct ioat_dma_chan *ioat_chan, int flags) +{ + struct ioat_dma_descriptor *desc; + struct ioat_desc_sw *desc_sw; + struct ioat_device *ioat_device; + dma_addr_t phys; + + ioat_device = to_ioat_device(ioat_chan->common.device); + desc = pci_pool_alloc(ioat_device->dma_pool, flags, &phys); + if 
(unlikely(!desc)) + return NULL; + + desc_sw = kzalloc(sizeof(*desc_sw), flags); + if (unlikely(!desc_sw)) { + pci_pool_free(ioat_device->dma_pool, desc, phys); + return NULL; + } + + memset(desc, 0, sizeof(*desc)); + desc_sw->hw = desc; + desc_sw->phys = phys; + + return desc_sw; +} + +#define INITIAL_IOAT_DESC_COUNT 128 + +static void ioat_start_null_desc(struct ioat_dma_chan *ioat_chan); + +/* returns the actual number of allocated descriptors */ +static int ioat_dma_alloc_chan_resources(struct dma_chan *chan) +{ + struct ioat_dma_chan *ioat_chan = to_ioat_chan(chan); + struct ioat_desc_sw *desc = NULL; + u16 chanctrl; + u32 chanerr; + int i; + + /* +* In-use bit automatically set by reading chanctrl +* If 0, we got it, if 1, someone else did +*/ + chanctrl = ioatdma_chan_read16(ioat_chan, IOAT_CHANCTRL_OFFSET); + if (chanctrl & IOAT_CHANCTRL_CHANNEL_IN_USE) + return -EBUSY; + +/* Setup register to interrupt and write completion status on error */ + chanctrl = IOAT_CHANCTRL_CHANNEL_IN_USE | + IOAT_CHANCTRL_ERR_INT_EN | + IOAT_CHANCTRL_ANY_ERR_A
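Two small calculations in enumerate_dma_channels() above are easy to check in isolation: the transfer cap derived from the scale register, and the per-channel register offset. The helpers below are a userspace sketch with hypothetical names; the bound check on the scale is an added guard, not something the driver does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * A scale of 0 means "no limit" (all ones in a u32, which is what
 * assigning -1 yields in the driver); otherwise the cap is 2^scale bytes.
 */
static uint32_t toy_xfercap(uint8_t xfercap_scale)
{
	assert(xfercap_scale < 32);  /* guard added for this sketch only */
	return xfercap_scale == 0 ? UINT32_MAX
				  : (uint32_t)1 << xfercap_scale;
}

/* Channel i's registers start 0x80 * (i + 1) bytes past the device base. */
static uint32_t toy_chan_reg_offset(unsigned int i)
{
	return 0x80u * (i + 1);
}
```

So a scale of 12 caps a single descriptor at 4096 bytes, and channel 0's registers sit at offset 0x80, leaving the first 0x80 bytes for the device-wide registers.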
[PATCH 7/10] [IOAT] cleanup_rbuf -> tcp_cleanup_rbuf and make non-static
Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT

Signed-off-by: Chris Leech <[EMAIL PROTECTED]>
---

 include/net/tcp.h |2 ++
 net/ipv4/tcp.c| 10 +-

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 54e4367..ca5bdaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -294,6 +294,8 @@ extern int tcp_rcv_established(struct
 extern void tcp_rcv_space_adjust(struct sock *sk);
+extern void tcp_cleanup_rbuf(struct sock *sk, int copied);
+
 extern int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 87f68e7..b10f78c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -937,7 +937,7 @@ static int tcp_recv_urg(struct sock *sk,
  * calculation of whether or not we must ACK for the sake of
  * a window update.
  */
-static void cleanup_rbuf(struct sock *sk, int copied)
+void tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int time_to_ack = 0;
@@ -1086,7 +1086,7 @@ int tcp_read_sock(struct sock *sk, read_
 	/* Clean up data we have read: This will do ACK frames. */
 	if (copied)
-		cleanup_rbuf(sk, copied);
+		tcp_cleanup_rbuf(sk, copied);
 	return copied;
 }
@@ -1220,7 +1220,7 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 			}
 		}
-		cleanup_rbuf(sk, copied);
+		tcp_cleanup_rbuf(sk, copied);
 		if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
 			/* Install new reader */
@@ -1391,7 +1391,7 @@ skip_copy:
 	 */
 	/* Clean up data we have read: This will do ACK frames. */
-	cleanup_rbuf(sk, copied);
+	tcp_cleanup_rbuf(sk, copied);
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
@@ -1853,7 +1853,7 @@ static int do_tcp_setsockopt(struct sock
 		    (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
 		    inet_csk_ack_scheduled(sk)) {
 			icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
-			cleanup_rbuf(sk, 1);
+			tcp_cleanup_rbuf(sk, 1);
 			if (!(val & 1))
 				icsk->icsk_ack.pingpong = 1;
 		}
-- 
1.2.6
Re: [RFC] Netlink and user-space buffer pointers
Mike Christie wrote: > James Smart wrote: >> Note: We've transitioned off topic. If what this means is "there isn't a >> good >> way except by ioctls (which still isn't easily portable) or system calls", >> then that's ok. Then at least we know the limits and can look at other >> implementation alternatives. >> >> Mike Christie wrote: >>> James Smart wrote: Mike Christie wrote: > For the tasks you want to do for the fc class is performance critical? No, it should not be. > If not, you could do what the iscsi class (for the netdev people > this is > drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple > copies. For iscsi we do this in userspace to send down a login pdu: > > /* > * xmitbuf is a buffer that is large enough for the iscsi_event, > * iscsi pdu (hdr_size) and iscsi pdu data (data_size) > */ Well, the real difference is that the payload of the "message" is actually the payload of the SCSI command or ELS/CT Request. Thus, the payload may >>> I am not sure I follow. For iscsi, everything after the iscsi_event >>> struct can be the iscsi request that is to be transmitted. The payload >>> will not normally be Mbytes but it is not a couple if bytes. >> True... For a large read/write - it will eventually total what the i/o >> request size was, and you did have to push it through the socekt. >> What this discussion really comes down to is the difference between >> initiator >> offload and what a target does. >> >> The initiator offloads the "full" i/o from the users - e.g. send command, >> get response. In the initiator case, the user isn't aware of each and >> every IU that makes up the i/o. As it's on an i/o basis, the LLDD doing >> the offload needs the full buffer sitting and ready. DMA is preferred so >> the buffer doesn't have to be consuming socket/kernel/driver buffers while >> it's pending - plus speed. 
>>
>> In the target case, the target controls each IU and its size, thus it
>> only has to have access to as much buffer space as it wants to push the
>> next IU. The i/o can be "paced" by the target. Unfortunately, this is an
>> entirely different use model than users of a scsi initiator expect, and
>> it won't map well into replacing things like our sg_io ioctls.
>
> I am not talking about the target here. For the open-iscsi initiator
> that is in mainline that I referenced in the example we send pdus from
> userspace to the LLD. In the future, for initiators that offload some
> iscsi processing and will log in from userspace or have userspace monitor
> the transport by doing iscsi pings, we need to be able to send these pdus.
> And the iscsi pdu cannot be broken up at the iscsi level (they can at
> the interconnect level though). From the iscsi host level they have to go
> out like a scsi command would in that the LLD cannot decide to send out
> multiple pdus for the pdu that userspace sends down.
>
> I do agree with you that targets can break down a scsi command into
> multiple transport level packets as it sees fit.
>

Oh yeah, is FC IU == iscsi tcp packet or FC IU == iscsi pdu?
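The quoted xmitbuf comment describes a single buffer holding an event header, then the pdu header, then the pdu data. A userspace sketch of that layout might look like the following. struct toy_event is a hypothetical stand-in with made-up fields; the real iscsi event structure in scsi_transport_iscsi differs, and this only illustrates the "suffer a couple copies" approach.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical event header; not the real iscsi event structure. */
struct toy_event {
	unsigned int type;
	unsigned int hdr_size;
	unsigned int data_size;
};

/*
 * Build one contiguous buffer: [event][pdu header][pdu data].
 * Returns a malloc'ed buffer (caller frees) and stores its length in
 * *total, or returns NULL if allocation fails.
 */
static unsigned char *toy_build_xmitbuf(const void *pdu_hdr,
					unsigned int hdr_size,
					const void *pdu_data,
					unsigned int data_size,
					size_t *total)
{
	struct toy_event ev = { 1, hdr_size, data_size };
	size_t len = sizeof(ev) + hdr_size + data_size;
	unsigned char *buf;

	assert(total != NULL);
	buf = malloc(len);
	if (!buf)
		return NULL;

	memcpy(buf, &ev, sizeof(ev));                             /* copy 1 */
	memcpy(buf + sizeof(ev), pdu_hdr, hdr_size);              /* copy 2 */
	memcpy(buf + sizeof(ev) + hdr_size, pdu_data, data_size); /* copy 3 */
	*total = len;
	return buf;
}
```

Every byte of payload is copied into the message before it is sent, which is acceptable for a login pdu but is exactly the cost being objected to for payloads that may reach megabytes.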
Re: [RFC] Netlink and user-space buffer pointers
James Smart wrote: > Note: We've transitioned off topic. If what this means is "there isn't a > good > way except by ioctls (which still isn't easily portable) or system calls", > then that's ok. Then at least we know the limits and can look at other > implementation alternatives. > > Mike Christie wrote: >> James Smart wrote: >>> Mike Christie wrote: For the tasks you want to do for the fc class is performance critical? >>> No, it should not be. >>> If not, you could do what the iscsi class (for the netdev people this is drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple copies. For iscsi we do this in userspace to send down a login pdu: /* * xmitbuf is a buffer that is large enough for the iscsi_event, * iscsi pdu (hdr_size) and iscsi pdu data (data_size) */ >>> Well, the real difference is that the payload of the "message" is >>> actually >>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may >> >> I am not sure I follow. For iscsi, everything after the iscsi_event >> struct can be the iscsi request that is to be transmitted. The payload >> will not normally be Mbytes but it is not a couple if bytes. > > True... For a large read/write - it will eventually total what the i/o > request size was, and you did have to push it through the socekt. > What this discussion really comes down to is the difference between > initiator > offload and what a target does. > > The initiator offloads the "full" i/o from the users - e.g. send command, > get response. In the initiator case, the user isn't aware of each and > every IU that makes up the i/o. As it's on an i/o basis, the LLDD doing > the offload needs the full buffer sitting and ready. DMA is preferred so > the buffer doesn't have to be consuming socket/kernel/driver buffers while > it's pending - plus speed. > > In the target case, the target controls each IU and it's size, thus it > only has to have access to as much buffer space as it wants to push the > next > IU. 
> The i/o can be "paced" by the target. Unfortunately, this is an
> entirely different use model than users of a scsi initiator expect, and
> it won't map well into replacing things like our sg_io ioctls.

I am not talking about the target here. For the open-iscsi initiator
that is in mainline that I referenced in the example we send pdus from
userspace to the LLD. In the future, for initiators that offload some
iscsi processing and will log in from userspace or have userspace monitor
the transport by doing iscsi pings, we need to be able to send these pdus.
And the iscsi pdu cannot be broken up at the iscsi level (they can at
the interconnect level though). From the iscsi host level they have to go
out like a scsi command would in that the LLD cannot decide to send out
multiple pdus for the pdu that userspace sends down.

I do agree with you that targets can break down a scsi command into
multiple transport level packets as it sees fit.

>
>> Instead of netlink for scsi commands and transport requests
>>
>> For scsi commands could we just use sg io, or is there something special
>> about the command you want to send? If you can use sg io for scsi
>> commands, maybe for transport level requests (in my example iscsi pdu)
>> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
>> down transport requests to the classes and encapsulate them in some new
>> struct transport_requests or use the existing struct request but do that
>> thing people keep talking about using the request/request_queue for
>> message passing.
>
> Well - there's 2 parts to this answer:
>
> First : IOCTL's are considered dangerous/bad practice and therefore it
> would

Yeah, I am not trying to kill ioctls. I go where the community goes.
What I am trying to do is just reuse the sg io mapping code so that we
do not end up with sg, st, target, blk scsi_ioctl.c and bsg all doing
similar things.

> be nice to find a replacement mechanism that eliminates them.
If that > mechanism has some of the cool features that netlink does, even better. > Using sg io, in the manner you indicate, wouldn't remove the ioctl use. > Note: I have OEMs/users that are very confused about the community's > statement > about ioctls. They've heard they are bad, should never be allowed, > will no > be longer supported, but yet they are at the heart of DM and sg io and > other > subsystems. Other than a "grandfathered" explanation, they don't > understand > why the rules bend for one piece of code but not for another. To them, > all > the features are just as critical regardless of whose providing them. > > Second: transport level i/o could be done like you suggest, and we've > prototyped some of this as well. However, there's something very wrong > about putting "block device" wrappers and settings around something that > is not a block device. In general, it's a heck of a lot of overhead and > still does
Re: [RFC] Netlink and user-space buffer pointers
Mike Christie wrote:
> James Smart wrote:
>
>>Mike Christie wrote:
>>
>>>For the tasks you want to do for the fc class is performance critical?
>>
>>No, it should not be.
>>
>>
>>>If not, you could do what the iscsi class (for the netdev people this is
>>>drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>>copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>>/*
>>> * xmitbuf is a buffer that is large enough for the iscsi_event,
>>> * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>> */
>>
>>Well, the real difference is that the payload of the "message" is actually
>>the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple of bytes.
>
>
>>range in size from a few hundred bytes to several kbytes (> 1 page) to
>>Mbyte's in size. Rather than buffer all of this, and push it over the
>>socket, thus the extra copies - it would be best to have the LLDD simply
>>DMA the payload like on a typical SCSI command. Additionally, there will
>>be response data that can be several kbytes in length.
>>
>
>
> Once you have got the buffer to the class, the class can create a
> scatterlist to DMA from for the LLD. I thought. iscsi does not do this
> just because it is software right now. For qla4xxx we do not need
> something like what you are talking about (see below for what I was
> thinking about for the initiators). If you are saying the extra step of
> the copy is plain dumb, I agree, but this happens (you have to suffer
> some copy and cannot do dio) for sg io as well in some cases. I think
> for the sg driver the copy_*_user is the default.

Mike,
Indirect IO is the default in the sg driver because:
  - it has always been thus
  - the sg driver is less constrained (e.g. max number
    of scatg elements is a bigger issue with dio)
  - the only alignment to worry about is byte
    alignment (some folks would like bit alignment
    but you can't please everybody)
  - there is no need for the sg driver to pin user
    pages in memory (as there is with direct IO and
    mmap-ed IO)

> Instead of netlink for scsi commands and transport requests

With a netlink based pass through one might:
  - improve on the SG_IO ioctl and add things like
    tags that are currently missing
  - introduce a proper SCSI task management function
    pass through (no request queue please)
  - make other pass throughs for SAS: SMP and STP
  - have an alternative to sysfs for various control
    functions in a HBA (e.g. in SAS: link and hard reset)
    and fetching performance data from a HBA

Apart from how to get data efficiently between the HBA
and the user space, another major issue is the flexibility
of the bind() in s_netlink (storage netlink??).

> For scsi commands could we just use sg io, or is there something special
> about the command you want to send? If you can use sg io for scsi
> commands, maybe for transport level requests (in my example iscsi pdu)
> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
> down transport requests to the classes and encapsulate them in some new
> struct transport_requests or use the existing struct request but do that
> thing people keep talking about using the request/request_queue for
> message passing.

Some SG_IO ioctl users want up to 32 MB in one transaction
and others want their data fast. Many pass through users
view the kernel as an impediment (not so much as "the way"
as "in the way").

Doug Gilbert
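For anyone on netdev who has not used the interface under discussion, this is roughly how an SG_IO request is described to the kernel, including the flag that asks for direct IO instead of the default indirect (copying) path. It is a sketch for a Linux system with <scsi/sg.h>; toy_prepare_inquiry() is a hypothetical helper, and the 5 second timeout is an arbitrary choice.

```c
#include <assert.h>
#include <string.h>
#include <scsi/sg.h>

/*
 * Fill an sg_io_hdr_t for a 6-byte INQUIRY that reads up to data_len
 * bytes into 'data'. Nothing is sent here; the caller would pass the
 * header to ioctl(fd, SG_IO, hdr) on an open sg device node.
 */
static void toy_prepare_inquiry(sg_io_hdr_t *hdr, unsigned char cdb[6],
				unsigned char *data, unsigned char data_len,
				unsigned char *sense, unsigned char sense_len,
				int want_direct_io)
{
	assert(hdr != NULL && cdb != NULL);

	memset(hdr, 0, sizeof(*hdr));
	memset(cdb, 0, 6);
	cdb[0] = 0x12;      /* INQUIRY opcode */
	cdb[4] = data_len;  /* allocation length */

	hdr->interface_id = 'S';
	hdr->dxfer_direction = SG_DXFER_FROM_DEV;
	hdr->cmdp = cdb;
	hdr->cmd_len = 6;
	hdr->dxferp = data;
	hdr->dxfer_len = data_len;
	hdr->sbp = sense;
	hdr->mx_sb_len = sense_len;
	hdr->timeout = 5000;  /* milliseconds */

	/* Without this flag the driver copies through kernel buffers. */
	if (want_direct_io)
		hdr->flags |= SG_FLAG_DIRECT_IO;
}
```

Note that the header carries raw user-space pointers (dxferp, cmdp, sbp), which is the part that does not translate directly to a netlink message.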
Re: [RFC] Netlink and user-space buffer pointers
Note: We've transitioned off topic. If what this means is "there isn't a good
way except by ioctls (which still isn't easily portable) or system calls",
then that's ok. Then at least we know the limits and can look at other
implementation alternatives.

Mike Christie wrote:
> James Smart wrote:
>> Mike Christie wrote:
>>> For the tasks you want to do for the fc class, is performance critical?
>>
>> No, it should not be.
>>
>>> If not, you could do what the iscsi class (for the netdev people this is
>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>> /*
>>>  * xmitbuf is a buffer that is large enough for the iscsi_event,
>>>  * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>>  */
>>
>> Well, the real difference is that the payload of the "message" is actually
>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple of bytes.

True... For a large read/write - it will eventually total what the i/o
request size was, and you did have to push it through the socket.

What this discussion really comes down to is the difference between initiator
offload and what a target does.

The initiator offloads the "full" i/o from the users - e.g. send command,
get response. In the initiator case, the user isn't aware of each and every
IU that makes up the i/o. As it's on an i/o basis, the LLDD doing the offload
needs the full buffer sitting and ready. DMA is preferred so the buffer doesn't
have to be consuming socket/kernel/driver buffers while it's pending - plus
speed.

In the target case, the target controls each IU and its size, thus it only
has to have access to as much buffer space as it wants to push the next IU.
The i/o can be "paced" by the target. Unfortunately, this is an entirely
different use model than users of a scsi initiator expect, and it won't map
well into replacing things like our sg_io ioctls.

> Instead of netlink for scsi commands and transport requests
>
> For scsi commands could we just use sg io, or is there something special
> about the command you want to send? If you can use sg io for scsi
> commands, maybe for transport level requests (in my example iscsi pdu)
> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
> down transport requests to the classes and encapsulate them in some new
> struct transport_requests or use the existing struct request but do that
> thing people keep talking about using the request/request_queue for
> message passing.

Well - there's 2 parts to this answer:

First: IOCTLs are considered dangerous/bad practice and therefore it would
be nice to find a replacement mechanism that eliminates them. If that
mechanism has some of the cool features that netlink does, even better.
Using sg io, in the manner you indicate, wouldn't remove the ioctl use.
Note: I have OEMs/users that are very confused about the community's statement
about ioctls. They've heard they are bad, should never be allowed, will no
longer be supported, but yet they are at the heart of DM and sg io and other
subsystems. Other than a "grandfathered" explanation, they don't understand
why the rules bend for one piece of code but not for another. To them, all
the features are just as critical regardless of who's providing them.

Second: transport level i/o could be done like you suggest, and we've
prototyped some of this as well. However, there's something very wrong
about putting "block device" wrappers and settings around something that
is not a block device. In general, it's a heck of a lot of overhead and
still doesn't solve the real issue - how to portably pass that user buffer
in to/out of the kernel.
-- james s
Re: Please pull upstream-fixes branch of wireless-2.6
"John W. Linville" <[EMAIL PROTECTED]> wrote: > > At present, all the branches in wireless-2.6 only pull from linux-2.6. > I am still pushing (i.e. requesting Jeff's pull) to netdev-2.6, > if that matters. > > Maybe the current wireless-2.6 tree fits into your system better? Works well, thanks. I have some patches for you ;) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: sendpage and high mem pages
From: Mike Christie <[EMAIL PROTECTED]>
Date: Thu, 20 Apr 2006 14:29:06 -0500

> I was wondering if it is ok to pass sendpage high mem pages. If a piece
> of code does this:
>
> struct socket *sock;
>
> sock->ops->sendpage(pg...)
>
> and pg is a highmem page will the network layer do the right thing or
> should the caller check the page type and call sock_no_sendpage() for
> highmem? It looks like net/sunrpc/xprtsock.c does a check but
> drivers/scsi/iscsi_tcp.c and some others do not.

TCP and others handle this just fine; if something doesn't, then it needs
to be fixed. Any page in the page cache can be sent over this interface.
Re: [XFRM Doc]: aevent description
From: jamal <[EMAIL PROTECTED]>
Date: Thu, 20 Apr 2006 06:58:45 -0400

> On Fri, 2006-14-04 at 15:05 -0700, David S. Miller wrote:
> > From: jamal <[EMAIL PROTECTED]>
> > Date: Thu, 13 Apr 2006 09:00:08 -0400
> >
> > > There is dependency on the previous patch i sent since the issue that
> > > patch fixes is assumed in this text description. It would be a good
> > > idea to apply at the same time as the other.
> >
> > Applied, after fixing 28 lines containing trailing whitespace :-)
>
> yikes ;->
> Ok, so how do i avoid this in the future? Note, this was a _brand new_
> file, so it is a little bizarre.

This command:

	git apply --check --whitespace=error-all $1

will spit out errors if your patch adds trailing whitespace or
will not apply cleanly to the current GIT tree.
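For anyone wondering what "adds trailing whitespace" means at the patch level, here is a rough sketch of the check in C. It is not how git implements it; adds_trailing_whitespace() and count_trailing_whitespace() are made-up names, the 1024-byte line limit is arbitrary, and an added line whose own content starts with "++ " would be mistaken for a file header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `line` (without its newline) is an added diff line
 * ("+..." but not the "+++ " file header) ending in a space or tab. */
static int adds_trailing_whitespace(const char *line)
{
    size_t len = strlen(line);

    if (len < 2 || line[0] != '+')
        return 0;               /* not an added line, or an empty "+" */
    if (strncmp(line, "+++ ", 4) == 0)
        return 0;               /* file header, not content */
    return line[len - 1] == ' ' || line[len - 1] == '\t';
}

/* Counts offending lines in a patch held in memory. The buffer is
 * scanned line by line and is not modified. */
static int count_trailing_whitespace(const char *patch)
{
    int bad = 0;

    while (*patch) {
        const char *eol = strchr(patch, '\n');
        size_t len = eol ? (size_t)(eol - patch) : strlen(patch);
        char buf[1024];

        assert(len < sizeof(buf));  /* sketch: fixed line-length limit */
        memcpy(buf, patch, len);
        buf[len] = '\0';
        bad += adds_trailing_whitespace(buf);
        patch += len + (eol ? 1 : 0);
    }
    return bad;
}
```

Only '+' lines are inspected, which is why a brand new file is the worst case: every line of it is an added line.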
Re: Van Jacobson's net channels and real-time
[ Maybe ask questions like this on "netdev" where the networking
  developers hang out? Added to CC: ]

Van fell off the face of the planet after giving his presentation and
never published his code, only his slides.

I've started to make a slow attempt at implementing his ideas, nothing
but pure infrastructure so far, but you can look at what I have here:

	kernel.org:/pub/scm/linux/kernel/git/davem/vj-2.6.git

Don't expect major progress and don't expect anything beyond a simple
channel to softint packet processing on receive any time soon.

Going all the way to the socket is a large endeavor and will require a
lot of restructuring to do it right, so expect this to take on the
order of months.
[patch 0/3] softmac: more fixes
This patchset fixes more things in softmac: the first patch implements
the SIOCSIWMLME wext, the second fixes the SIOCSIWAP wext and the third
cleans up the event code.

The second is a fairly important fix for wpa_supplicant and should
probably still go to 2.6.17. The others can go in too of course, but
aren't that important I think.

johannes
[patch 1/3] softmac: add SIOCSIWMLME
This patch adds the SIOCSIWMLME wext to softmac, this functionality
appears to be used by wpa_supplicant and is softmac-specific.

Signed-off-by: Johannes Berg <[EMAIL PROTECTED]>
Cc: Jouni Malinen <[EMAIL PROTECTED]>

--- wireless-2.6.orig/net/ieee80211/softmac/ieee80211softmac_priv.h 2006-04-19 18:44:51.710074158 +0200
+++ wireless-2.6/net/ieee80211/softmac/ieee80211softmac_priv.h 2006-04-20 00:50:54.930882874 +0200
@@ -150,6 +150,7 @@ int ieee80211softmac_handle_disassoc(str
 int ieee80211softmac_handle_reassoc_req(struct net_device * dev,
 	struct ieee80211_reassoc_request * reassoc);
 void ieee80211softmac_assoc_timeout(void *d);
+void ieee80211softmac_disassoc(struct ieee80211softmac_device *mac, u16 reason);
 
 /* some helper functions */
 static inline int ieee80211softmac_scan_handlers_check_self(struct ieee80211softmac_device *sm)
--- wireless-2.6.orig/net/ieee80211/softmac/ieee80211softmac_wx.c 2006-04-19 18:44:51.710074158 +0200
+++ wireless-2.6/net/ieee80211/softmac/ieee80211softmac_wx.c 2006-04-19 18:48:52.200074158 +0200
@@ -424,3 +424,35 @@ ieee80211softmac_wx_get_genie(struct net
 }
 EXPORT_SYMBOL_GPL(ieee80211softmac_wx_get_genie);
 
+int
+ieee80211softmac_wx_set_mlme(struct net_device *dev,
+			     struct iw_request_info *info,
+			     union iwreq_data *wrqu,
+			     char *extra)
+{
+	struct ieee80211softmac_device *mac = ieee80211_priv(dev);
+	struct iw_mlme *mlme = (struct iw_mlme *)extra;
+	u16 reason = cpu_to_le16(mlme->reason_code);
+	struct ieee80211softmac_network *net;
+
+	if (memcmp(mac->associnfo.bssid, mlme->addr.sa_data, ETH_ALEN)) {
+		printk(KERN_DEBUG PFX "wx_set_mlme: requested operation on net we don't use\n");
+		return -EINVAL;
+	}
+
+	switch (mlme->cmd) {
+	case IW_MLME_DEAUTH:
+		net = ieee80211softmac_get_network_by_bssid_locked(mac, mlme->addr.sa_data);
+		if (!net) {
+			printk(KERN_DEBUG PFX "wx_set_mlme: we should know the net here...\n");
+			return -EINVAL;
+		}
+		return ieee80211softmac_deauth_req(mac, net, reason);
+	case IW_MLME_DISASSOC:
+		ieee80211softmac_disassoc(mac, reason);
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL_GPL(ieee80211softmac_wx_set_mlme);
--- wireless-2.6.orig/include/net/ieee80211softmac_wx.h 2006-03-28 16:23:31.0 +0200
+++ wireless-2.6/include/net/ieee80211softmac_wx.h 2006-04-19 18:48:30.640074158 +0200
@@ -91,4 +91,9 @@ ieee80211softmac_wx_get_genie(struct net
 			      struct iw_request_info *info,
 			      union iwreq_data *wrqu,
 			      char *extra);
+extern int
+ieee80211softmac_wx_set_mlme(struct net_device *dev,
+			     struct iw_request_info *info,
+			     union iwreq_data *wrqu,
+			     char *extra);
 #endif /* _IEEE80211SOFTMAC_WX */
--- wireless-2.6.orig/net/ieee80211/softmac/ieee80211softmac_assoc.c 2006-04-19 18:46:29.0 +0200
+++ wireless-2.6/net/ieee80211/softmac/ieee80211softmac_assoc.c 2006-04-19 18:46:47.300074158 +0200
@@ -82,7 +82,7 @@ ieee80211softmac_assoc_timeout(void *d)
 }
 
 /* Sends out a disassociation request to the desired AP */
-static void
+void
 ieee80211softmac_disassoc(struct ieee80211softmac_device *mac, u16 reason)
 {
 	unsigned long flags;
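The control flow of the new handler can be tried out in user space with something like the sketch below. Nothing here is softmac code: struct mac_state, struct mlme_req, set_mlme() and the MLME_* values are hypothetical stand-ins, and the two "sent" flags merely record which request the real driver would have issued.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Stand-ins for the wireless-extensions commands; the real IW_MLME_*
 * constants live in the kernel headers. */
enum mlme_cmd { MLME_DEAUTH, MLME_DISASSOC, MLME_OTHER };

struct mlme_req {
    enum mlme_cmd cmd;
    uint16_t reason_code;
    uint8_t addr[ETH_ALEN];     /* BSSID the request refers to */
};

struct mac_state {
    uint8_t bssid[ETH_ALEN];    /* BSSID we are associated with */
    int known_network;          /* non-zero if bssid is in our network list */
    int deauth_sent;
    int disassoc_sent;
};

static int set_mlme(struct mac_state *mac, const struct mlme_req *req)
{
    /* Refuse requests that name a network we are not using. */
    if (memcmp(mac->bssid, req->addr, ETH_ALEN))
        return -EINVAL;

    switch (req->cmd) {
    case MLME_DEAUTH:
        if (!mac->known_network)
            return -EINVAL;     /* "we should know the net here" */
        mac->deauth_sent = 1;
        return 0;
    case MLME_DISASSOC:
        mac->disassoc_sent = 1;
        return 0;
    default:
        return -EOPNOTSUPP;
    }
}
```

The point of the BSSID check up front is that both commands only make sense against the currently associated network, so everything else is rejected before the switch.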
[patch 2/3] softmac: fix SIOCSIWAP
There are some bugs in the current implementation of the SIOCSIWAP wext,
for example that when you do it twice and it fails, it may still try
another access point for some reason. This patch fixes this by introducing
a new flag that tells the association code that the bssid that is in use
was fixed by the user and shouldn't be deviated from.

Signed-off-by: Johannes Berg <[EMAIL PROTECTED]>

--- wireless-2.6.orig/include/net/ieee80211softmac.h 2006-04-13 15:48:12.0 +0200
+++ wireless-2.6/include/net/ieee80211softmac.h 2006-04-20 01:10:32.770882874 +0200
@@ -96,10 +96,13 @@ struct ieee80211softmac_assoc_info {
 	 *
 	 * bssvalid is true if we found a matching network
 	 * and saved it's BSSID into the bssid above.
+	 *
+	 * bssfixed is used for SIOCSIWAP.
 	 */
 	u8 static_essid:1,
 	   associating:1,
-	   bssvalid:1;
+	   bssvalid:1,
+	   bssfixed:1;
 
 	/* Scan retries remaining */
 	int scan_retry;
--- wireless-2.6.orig/net/ieee80211/softmac/ieee80211softmac_assoc.c 2006-04-19 18:46:47.0 +0200
+++ wireless-2.6/net/ieee80211/softmac/ieee80211softmac_assoc.c 2006-04-20 01:30:59.090882874 +0200
@@ -144,6 +144,12 @@ network_matches_request(struct ieee80211
 	if (!we_support_all_basic_rates(mac, net->rates_ex, net->rates_ex_len))
 		return 0;
 
+	/* assume that users know what they're doing ...
+	 * (note we don't let them select a net we're incompatible with) */
+	if (mac->associnfo.bssfixed) {
+		return !memcmp(mac->associnfo.bssid, net->bssid, ETH_ALEN);
+	}
+
 	/* if 'ANY' network requested, take any that doesn't have privacy enabled */
 	if (mac->associnfo.req_essid.len == 0
 	    && !(net->capability & WLAN_CAPABILITY_PRIVACY))
@@ -176,7 +182,7 @@ ieee80211softmac_assoc_work(void *d)
 	ieee80211softmac_disassoc(mac, WLAN_REASON_DISASSOC_STA_HAS_LEFT);
 
 	/* try to find the requested network in our list, if we found one already */
-	if (mac->associnfo.bssvalid)
+	if (mac->associnfo.bssvalid || mac->associnfo.bssfixed)
 		found = ieee80211softmac_get_network_by_bssid(mac, mac->associnfo.bssid);
 
 	/* Search the ieee80211 networks for this network if we didn't find it by bssid,
@@ -241,19 +247,25 @@ ieee80211softmac_assoc_work(void *d)
 			if (ieee80211softmac_start_scan(mac))
 				dprintk(KERN_INFO PFX "Associate: failed to initiate scan. Is device up?\n");
 			return;
-		}
-		else {
+		} else {
 			spin_lock_irqsave(&mac->lock, flags);
 			mac->associnfo.associating = 0;
 			mac->associated = 0;
 			spin_unlock_irqrestore(&mac->lock, flags);
 
 			dprintk(KERN_INFO PFX "Unable to find matching network after scan!\n");
+			/* reset the retry counter for the next user request since we
+			 * break out and don't reschedule ourselves after this point. */
+			mac->associnfo.scan_retry = IEEE80211SOFTMAC_ASSOC_SCAN_RETRY_LIMIT;
 			ieee80211softmac_call_events(mac, IEEE80211SOFTMAC_EVENT_ASSOCIATE_NET_NOT_FOUND, NULL);
 			return;
 		}
 	}
-	
+
+	/* reset the retry counter for the next user request since we
+	 * now found a net and will try to associate to it, but not
+	 * schedule this function again. */
+	mac->associnfo.scan_retry = IEEE80211SOFTMAC_ASSOC_SCAN_RETRY_LIMIT;
 	mac->associnfo.bssvalid = 1;
 	memcpy(mac->associnfo.bssid, found->bssid, ETH_ALEN);
 	/* copy the ESSID for displaying it */
--- wireless-2.6.orig/net/ieee80211/softmac/ieee80211softmac_wx.c 2006-04-19 18:48:52.0 +0200
+++ wireless-2.6/net/ieee80211/softmac/ieee80211softmac_wx.c 2006-04-20 15:27:26.122486954 +0200
@@ -27,7 +27,8 @@
 #include "ieee80211softmac_priv.h"
 
 #include
-
+/* for is_broadcast_ether_addr and is_zero_ether_addr */
+#include
 
 int
 ieee80211softmac_wx_trigger_scan(struct net_device *net_dev,
@@ -83,7 +84,6 @@ ieee80211softmac_wx_set_essid(struct net
 			sm->associnfo.static_essid = 1;
 		}
 	}
-	sm->associnfo.scan_retry = IEEE80211SOFTMAC_ASSOC_SCAN_RETRY_LIMIT;
 
 	/* set our requested ESSID length.
 	 * If applicable, we have already copied the data in */
@@ -310,8 +310,6 @@ ieee80211softmac_wx_set_wap(struct net_d
 			    char *extra)
 {
 	struct ieee80211softmac_device *mac = ieee80211_priv(net_dev);
-	static const unsigned char any[] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
-	static const unsigned char off[] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
 	unsigned long flags;
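To see what the bssfixed flag changes in network selection, the matching rule can be modelled in user space roughly as below. This is a simplified sketch, not the softmac function: struct assoc_info, struct network and network_matches() are hypothetical stand-ins, the rate-compatibility checks that precede this logic in the driver are left out, and the final ESSID comparison is an assumption since the hunk does not show that part.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN  6
#define ESSID_MAX 32

/* Simplified stand-ins for the softmac structures; only the fields the
 * matching logic needs are modelled. */
struct assoc_info {
    uint8_t bssid[ETH_ALEN];
    char req_essid[ESSID_MAX + 1];   /* empty string means 'ANY' */
    unsigned int bssvalid:1,
                 bssfixed:1;         /* BSSID was pinned via SIOCSIWAP */
};

struct network {
    uint8_t bssid[ETH_ALEN];
    char essid[ESSID_MAX + 1];
    int privacy;                     /* non-zero if the net uses privacy */
};

static int network_matches(const struct assoc_info *a, const struct network *n)
{
    /* A user-fixed BSSID overrides every other selection rule. */
    if (a->bssfixed)
        return !memcmp(a->bssid, n->bssid, ETH_ALEN);

    /* 'ANY' requested: accept any network without privacy. */
    if (a->req_essid[0] == '\0')
        return !n->privacy;

    /* Otherwise match on ESSID (assumed, see above). */
    return strcmp(a->req_essid, n->essid) == 0;
}
```

With bssfixed set, a second access point advertising the same ESSID no longer matches, which is the "may still try another access point" behaviour the patch description wants to get rid of.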
Re: [RFC] Netlink and user-space buffer pointers
Mike Christie wrote:
> James Smart wrote:
>> Mike Christie wrote:
>>> For the tasks you want to do for the fc class, is performance critical?
>> No, it should not be.
>>
>>> If not, you could do what the iscsi class (for the netdev people this is
>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>> /*
>>>  * xmitbuf is a buffer that is large enough for the iscsi_event,
>>>  * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>>  */
>> Well, the real difference is that the payload of the "message" is actually
>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple of bytes.
>
>> range in size from a few hundred bytes to several kbytes (> 1 page) to
>> Mbyte's in size. Rather than buffer all of this, and push it over the
>> socket, thus the extra copies - it would be best to have the LLDD simply
>> DMA the payload like on a typical SCSI command. Additionally, there will
>> be response data that can be several kbytes in length.
>>
>
> Once you have got the buffer to the class, the class can create a
> scatterlist to DMA from for the LLD. I thought. iscsi does not do this
> just because it is software right now. For qla4xxx we do not need
> something like what you are talking about (see below for what I was
> thinking about for the initiators). If you are saying the extra step of
> the copy is plain dumb, I agree, but this happens (you have to suffer
> some copy and cannot do dio) for sg io as well in some cases. I think
> for the sg driver the copy_*_user is the default.
>
> Instead of netlink for scsi commands and transport requests
>
> For scsi commands could we just use sg io, or is there something special
> about the command you want to send? If you can use sg io for scsi
> commands, maybe for transport level requests (in my example iscsi pdu)
> we could modify something like sg/bsg/block layer scsi_ioctl.c to send
> down transport requests to the classes and encapsulate them in some new
> struct transport_requests or use the existing struct request but do that
> thing people keep talking about using the request/request_queue for
> message passing.

And just to be complete, the problem with this is that it is tied to the
request queue and so you cannot just send a transport level request
unless it is tied to the device. But for the target stuff we added a
request queue to the host so we could inject requests (the idea was to
send down those magic message requests) at a higher level. To be able to
use that for sg io though it would require some more code and magic as
you know.
Re: [RFC] Netlink and user-space buffer pointers
Mike Christie wrote:
> James Smart wrote:
>> Mike Christie wrote:
>>> For the tasks you want to do for the fc class, is performance critical?
>> No, it should not be.
>>
>>> If not, you could do what the iscsi class (for the netdev people this is
>>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>>> copies. For iscsi we do this in userspace to send down a login pdu:
>>>
>>> /*
>>>  * xmitbuf is a buffer that is large enough for the iscsi_event,
>>>  * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>>  */
>> Well, the real difference is that the payload of the "message" is actually
>> the payload of the SCSI command or ELS/CT Request. Thus, the payload may
>
> I am not sure I follow. For iscsi, everything after the iscsi_event
> struct can be the iscsi request that is to be transmitted. The payload
> will not normally be Mbytes but it is not a couple of bytes.
>
>> range in size from a few hundred bytes to several kbytes (> 1 page) to
>> Mbyte's in size. Rather than buffer all of this, and push it over the
>> socket, thus the extra copies - it would be best to have the LLDD simply
>> DMA the payload like on a typical SCSI command. Additionally, there will
>> be response data that can be several kbytes in length.
>>
>
> Once you have got the buffer to the class, the class can create a
> scatterlist to DMA from for the LLD. I thought. iscsi does not do this
> just because it is software right now. For qla4xxx we do not need

That should be, we do need.
Re: [RFC] Netlink and user-space buffer pointers
James Smart wrote:
>
> Mike Christie wrote:
>> For the tasks you want to do for the fc class, is performance critical?
>
> No, it should not be.
>
>> If not, you could do what the iscsi class (for the netdev people this is
>> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
>> copies. For iscsi we do this in userspace to send down a login pdu:
>>
>> /*
>>  * xmitbuf is a buffer that is large enough for the iscsi_event,
>>  * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>>  */
>
> Well, the real difference is that the payload of the "message" is actually
> the payload of the SCSI command or ELS/CT Request. Thus, the payload may

I am not sure I follow. For iscsi, everything after the iscsi_event
struct can be the iscsi request that is to be transmitted. The payload
will not normally be Mbytes but it is not a couple of bytes.

> range in size from a few hundred bytes to several kbytes (> 1 page) to
> Mbyte's in size. Rather than buffer all of this, and push it over the
> socket, thus the extra copies - it would be best to have the LLDD simply
> DMA the payload like on a typical SCSI command. Additionally, there will
> be response data that can be several kbytes in length.
>

Once you have got the buffer to the class, the class can create a
scatterlist to DMA from for the LLD. I thought. iscsi does not do this
just because it is software right now. For qla4xxx we do not need
something like what you are talking about (see below for what I was
thinking about for the initiators). If you are saying the extra step of
the copy is plain dumb, I agree, but this happens (you have to suffer
some copy and cannot do dio) for sg io as well in some cases. I think
for the sg driver the copy_*_user is the default.

Instead of netlink for scsi commands and transport requests

For scsi commands could we just use sg io, or is there something special
about the command you want to send? If you can use sg io for scsi
commands, maybe for transport level requests (in my example iscsi pdu)
we could modify something like sg/bsg/block layer scsi_ioctl.c to send
down transport requests to the classes and encapsulate them in some new
struct transport_requests or use the existing struct request but do that
thing people keep talking about using the request/request_queue for
message passing.
Re: SIOCGIWSCAN wireless event behaviour
On Thu, Apr 20, 2006 at 09:43:54AM -0700, Jean Tourrilhes wrote:

> After we changed the behaviour of ipw, various users reported
> that wpa_supplicant was confused. I particularly trust the report of
> Bill Moss, who has been hacking ipw for a long time :
>
> http://sourceforge.net/mailarchive/forum.php?thread_id=10091113&forum_id=38938

Hmm.. Can someone please describe what was changed? Just sending
SIOCGIWSCAN events more frequently? I have not seen any problems with
this in my tests (though, mainly with madwifi-ng). Is the broken case
available in one of the kernel trees? 2.6.16? wireless-2.6? (i.e., where
can I get the exact version of ipw2200 driver that is expected to show
incorrect behavior)?

> Jouni was notified, but did not really answer to that bug report.
> Then, the ipw maintainers committed the following patch to ipw
> that fixes or works around that issue :
>
> http://marc.theaimsgroup.com/?l=linux-netdev&m=114492056522667&w=2

Hmm.. I don't remember having seen that report from Bill Moss.. How was
I notified? ;-)

The patch here seems to be moving the ipw_disassociate() call, so it is
not obviously clear from that what the impact on behavior is. I can try
to reproduce this, but I would like to know what version to test with in
order to avoid any possible workarounds from hiding the issue.

--
Jouni Malinen                                            PGP id EFC895FA
Re: SIOCGIWSCAN wireless event behaviour
On Thu, Apr 20, 2006 at 10:37:32AM -0400, Dan Williams wrote:
> On Thu, 2006-04-20 at 15:15 +0100, Daniel Drake wrote:
> > Hi Jean,
> >
> > A query regarding wireless events: under which circumstances should a
> > driver/stack send a SIOCGIWSCAN event to userspace?
> >
> > Should it be sent whenever a driver has new scan results available, or
> > only when the user requested a scan a short time beforehand (via
> > SIOCSIWSCAN)?

The original behaviour was that the event was sent only when a user
did request a scan. At that time, cards did not do background scanning,
so new scan results would be produced only as a result of a user scan.

After a short discussion with Dan, we agreed to change that: the driver
should send the event whenever a new scan result is available,
regardless of how it happens (background scan or user scan). This allows
smart applications to synchronise on background scans and avoid
generating useless user scans. Minimising the number of user scans is
actually good.

> Similar situation: when wpa_supplicant requests a scan, the driver
> scans and pushes the GIWSCAN at completion. _Every_ process (like
> NetworkManager) listening for netlink WE messages gets the GIWSCAN event
> even though only wpa_supplicant requested the original scan.
>
> So what I'm saying is that applications that process GIWSCAN netlink
> messages today should _already_ be able to handle random GIWSCAN events
> at any time even when they have not explicitly requested a scan with
> SIWSCAN. The events are broadcast and the driver shouldn't really care
> which user app initiated any particular request. Multiple apps can
> theoretically request scans at any time, though this isn't so good in
> practice.

100% correct.

> > I ask this because softmac is sending the SIOCGIWSCAN event even when
> > the user did not explicitly ask for it.
>
> Given the above, I think this behavior is fine and even desirable.

Yes.

> > I think the 'extra' SIOCGIWSCAN event may be confusing wpa_supplicant
> > (but have not confirmed that yet).
>
> If this is the case, wpa_supplicant should not be getting confused by
> GIWSCAN events happening at random times, and should be fixed. However,
> in my experience with 0.4.8, this isn't a problem and wpa_supplicant
> handles random scan events correctly. Not sure about the 0.5.x branch
> though.

After we changed the behaviour of ipw, various users reported
that wpa_supplicant was confused. I particularly trust the report of
Bill Moss, who has been hacking ipw for a long time :

http://sourceforge.net/mailarchive/forum.php?thread_id=10091113&forum_id=38938

Jouni was notified, but did not really answer to that bug report.
Then, the ipw maintainers committed the following patch to ipw
that fixes or works around that issue :

http://marc.theaimsgroup.com/?l=linux-netdev&m=114492056522667&w=2

I would still like Jouni to have a look at the issue to tell us
where the problem is. Two drivers having the issue is not a coincidence.
I would hate drivers starting to implement various workarounds if the
problem is really in wpa_supplicant.

Have fun...

Jean
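From the application side, the behaviour agreed on above might look like the sketch below: treat every scan event as "results are fresh", whoever triggered the scan, and only issue a user scan when nothing fresh is available or pending. All names here (struct scan_client, maybe_request_scan(), handle_event(), the EV_* values) are invented for illustration; a real client would decode SIOCGIWSCAN and SIOCGIWAP from the netlink stream instead.

```c
/* Stand-ins for the wireless events discussed here. */
enum wireless_event { EV_SCAN_RESULTS, EV_ASSOCIATED };

struct scan_client {
    int scan_pending;    /* we issued a scan and await its completion */
    int results_stale;   /* cached scan results need re-reading */
    int scans_issued;    /* number of user scans we triggered */
};

/* Trigger a user scan only if fresh results are neither available
 * nor already on their way. */
static void maybe_request_scan(struct scan_client *c)
{
    if (c->scan_pending || !c->results_stale)
        return;
    c->scan_pending = 1;
    c->scans_issued++;
}

static void handle_event(struct scan_client *c, enum wireless_event ev)
{
    switch (ev) {
    case EV_SCAN_RESULTS:
        /* New results exist no matter who asked for the scan. */
        c->results_stale = 0;
        c->scan_pending = 0;
        break;
    case EV_ASSOCIATED:
        /* Not scan related; nothing to update in this sketch. */
        break;
    }
}
```

An unsolicited event then simply saves the client one user scan, rather than confusing it.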
Re: SIOCGIWSCAN wireless event behaviour
On Thu, Apr 20, 2006 at 03:15:59PM +0100, Daniel Drake wrote:

> I think the 'extra' SIOCGIWSCAN event may be confusing wpa_supplicant
> (but have not confirmed that yet).

No, they don't. madwifi-ng is already doing this with background
scanning and as was pointed out, there can be multiple programs asking
for scans, so user space must be prepared for multiple events anyway.

--
Jouni Malinen                                            PGP id EFC895FA
Re: I/OAT: Call for discussion
On 4/20/06, Jack Vogel <[EMAIL PROTECTED]> wrote:
> On 4/19/06, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> > On Wed, Apr 19, 2006 at 10:28:41AM -0700, John Ronciak wrote:
> > > The hardware is going to generally available in June. There are also
> > > lots of OEMs, OSVs and hardware vendors that have the system to test
> > > on today. The early rollout of hardware has been very large.
> >
> > As a start to get people actually interested you should stop talking
> > like a jerk and kill all these silly three-letter acronyms from your
> > language.
>
> ??? For a community absolutely FILLED with everyday use of acronyms
> it boggles the mind why you would call someone names for using them.
>
> So if they were 4 letter ones it would make him a savant instead??

hch is not complaining about TLA usage, he is complaining about _silly_
TLA usage, as in using them to justify accepting a new feature in mainline.

- Arnaldo
Re: I/OAT: Call for discussion
On 4/19/06, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> On Wed, Apr 19, 2006 at 10:28:41AM -0700, John Ronciak wrote:
> > The hardware is going to generally available in June. There are also
> > lots of OEMs, OSVs and hardware vendors that have the system to test
> > on today. The early rollout of hardware has been very large.
>
> As a start to get people actually interested you should stop talking
> like a jerk and kill all these silly three-letter acronyms from your
> language.

??? For a community absolutely FILLED with everyday use of acronyms
it boggles the mind why you would call someone names for using them.

So if they were 4 letter ones it would make him a savant instead??

Jack
Re: SIOCGIWSCAN wireless event behaviour
On Thu, 2006-04-20 at 15:15 +0100, Daniel Drake wrote:
> Hi Jean,
>
> A query regarding wireless events: under which circumstances should a
> driver/stack send a SIOCGIWSCAN event to userspace?
>
> Should it be sent whenever a driver has new scan results available, or
> only when the user requested a scan a short time beforehand (via
> SIOCSIWSCAN)?

Similar situation: when wpa_supplicant requests a scan, the driver
scans and pushes the GIWSCAN at completion. _Every_ process (like
NetworkManager) listening for netlink WE messages gets the GIWSCAN event
even though only wpa_supplicant requested the original scan.

So what I'm saying is that applications that process GIWSCAN netlink
messages today should _already_ be able to handle random GIWSCAN events
at any time even when they have not explicitly requested a scan with
SIWSCAN. The events are broadcast and the driver shouldn't really care
which user app initiated any particular request. Multiple apps can
theoretically request scans at any time, though this isn't so good in
practice.

> I ask this because softmac is sending the SIOCGIWSCAN event even when
> the user did not explicitly ask for it.

Given the above, I think this behavior is fine and even desirable.

> I think the 'extra' SIOCGIWSCAN event may be confusing wpa_supplicant
> (but have not confirmed that yet).

If this is the case, wpa_supplicant should not be getting confused by
GIWSCAN events happening at random times, and should be fixed. However,
in my experience with 0.4.8, this isn't a problem and wpa_supplicant
handles random scan events correctly. Not sure about the 0.5.x branch
though.

Dan
Re: [RFC] Netlink and user-space buffer pointers
Mike Christie wrote:
> For the tasks you want to do for the fc class, is performance critical?

No, it should not be.

> If not, you could do what the iscsi class (for the netdev people this is
> drivers/scsi/scsi_transport_iscsi.c) does and just suffer a couple
> copies. For iscsi we do this in userspace to send down a login pdu:
>
> /*
>  * xmitbuf is a buffer that is large enough for the iscsi_event,
>  * iscsi pdu (hdr_size) and iscsi pdu data (data_size)
>  */

Well, the real difference is that the payload of the "message" is actually
the payload of the SCSI command or ELS/CT Request. Thus, the payload may
range in size from a few hundred bytes to several kbytes (> 1 page) to
Mbyte's in size. Rather than buffer all of this, and push it over the socket,
thus the extra copies - it would be best to have the LLDD simply DMA the
payload like on a typical SCSI command. Additionally, there will be
response data that can be several kbytes in length.

...

> I think there may be issues with packing structs or 32 bit userspace and
> 64 bit kernels and other fun things like this so the iscsi pdu and iscsi
> event have to be defined correctly and I guess we are back to some of
> the problems with ioctls :(

Agreed. In this use of netlink, there's not a lot of wins for netlink over
ioctls. It all comes down to 2 things:
a) proper portable message definition; and
b) what do you do with that non-portable user space buffer pointer ?

-- james s
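On point (a), one common way to get a message definition that looks the same to 32 bit userspace and a 64 bit kernel is to use only fixed-width fields, carry buffer addresses as 64 bit integers, and make all padding explicit. The sketch below shows the idea; struct xport_req and both helper functions are hypothetical and not an existing kernel ABI. It deliberately says nothing about point (b): the address is still only meaningful in the sending process, and someone still has to map or pin that buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pass-through request header. Every field has a fixed
 * width and every 64 bit field sits on an 8 byte boundary, so the
 * layout does not depend on the compiler's pointer size or padding. */
struct xport_req {
    uint32_t opcode;
    uint32_t payload_len;
    uint64_t payload_addr;    /* user buffer address, widened to 64 bits */
    uint32_t response_len;
    uint32_t reserved;        /* explicit padding, must be zero */
    uint64_t response_addr;
};

/* Fail the build if the layout ever drifts between ABIs. */
_Static_assert(sizeof(struct xport_req) == 32, "unexpected size");
_Static_assert(offsetof(struct xport_req, payload_addr) == 8, "unexpected offset");
_Static_assert(offsetof(struct xport_req, response_addr) == 24, "unexpected offset");

/* Conversions between a native pointer and the wire representation. */
static uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}

static void *u64_to_ptr(uint64_t v)
{
    return (void *)(uintptr_t)v;
}
```

The explicit reserved field is what keeps the second address at offset 24 on ABIs that would otherwise align 64 bit members to only 4 bytes.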
SIOCGIWSCAN wireless event behaviour
Hi Jean,

A query regarding wireless events: under which circumstances should a
driver/stack send a SIOCGIWSCAN event to userspace?

Should it be sent whenever a driver has new scan results available, or
only when the user requested a scan a short time beforehand (via
SIOCSIWSCAN)?

I ask this because softmac is sending the SIOCGIWSCAN event even when
the user did not explicitly ask for it.

For example, the user sets an essid. softmac starts a scan in order to
find the requested network. The network is found, the scan completes,
and softmac sends SIOCGIWSCAN. softmac then authenticates to that
network, associates, and then sends SIOCGIWAP.

I think the 'extra' SIOCGIWSCAN event may be confusing wpa_supplicant
(but have not confirmed that yet).

Thanks,
Daniel
Re: [patch] ipv4: initialize arp_tbl rw lock
> > As spinlock debugging still does not work with the qeth driver I
> > want to pick up the discussion.
>
> Does something like the patch below work?
>
> But this all begs the question, what happens if you want to
> dig into the internals of a protocol which is built modular and
> hasn't been loaded yet?
>
> diff --git a/include/linux/init.h b/include/linux/init.h
> index 93dcbe1..8169f25 100644
> --- a/include/linux/init.h
> +++ b/include/linux/init.h
> @@ -95,8 +95,9 @@ #define postcore_initcall(fn) __define_
>  #define arch_initcall(fn) __define_initcall("3",fn)
>  #define subsys_initcall(fn) __define_initcall("4",fn)
>  #define fs_initcall(fn) __define_initcall("5",fn)
> -#define device_initcall(fn) __define_initcall("6",fn)
> -#define late_initcall(fn) __define_initcall("7",fn)
> +#define net_initcall(fn) __define_initcall("6",fn)
> +#define device_initcall(fn) __define_initcall("7",fn)
> +#define late_initcall(fn) __define_initcall("8",fn)
>
>  #define __initcall(fn) device_initcall(fn)
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index dc206f1..9803a57 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1257,7 +1257,7 @@ out_unregister_udp_proto:
>  goto out;
>  }
>
> -module_init(inet_init);
> +net_initcall(inet_init);

That's exactly the same thing that I tried too. It didn't work for me,
since I "sometimes" saw the described rcu_update latencies. Today I was
able to boot the machine 30 times and saw it just once... Not very helpful
for debugging this :(

Btw.: I guess the linker scripts need an update too, so that the new
.initcall8.init section doesn't get discarded.
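For anyone following along, here is a small user-space model of what the renumbering is meant to achieve. It is only a sketch: the kernel orders initcalls through linker sections, not a table, and every demo_* name here is made up. The assumption is that a lower level runs earlier, and that a level the "linker script" does not know about (above max_level) never runs at all.

```c
#include <stddef.h>

/* User-space model only; all names here are hypothetical. */
struct demo_initcall {
    int level;        /* smaller level runs earlier */
    const char *name; /* label standing in for the init function */
};

/*
 * Write the names into out[] in run order: ascending level, registration
 * order within a level. Entries above max_level are dropped, which models
 * an initcall section that is not collected. Returns the number of calls
 * that would run.
 */
static size_t demo_run_order(const struct demo_initcall *calls, size_t n,
                             int max_level, const char **out)
{
    size_t k = 0;

    for (int level = 0; level <= max_level; level++)
        for (size_t i = 0; i < n; i++)
            if (calls[i].level == level)
                out[k++] = calls[i].name;
    return k;
}
```

With inet_init registered at level 6 and a driver at level 7, the model runs inet_init first. With max_level left at 7, anything registered at level 8 is silently skipped, which is the worry about the discarded section.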
Re: Please pull upstream-fixes branch of wireless-2.6
On Thu, Apr 20, 2006 at 01:57:52AM -0700, Andrew Morton wrote:
> And I really need to find a way of getting git-wireless into -mm. Problem
> is, it's based off git-netdev-all and when John's tree is synced to a later
> version of Linus's tree than Jeff's tree, all hell breaks loose at my end.

FWIW, I think this issue should be gone (hopefully never to return).

For a while I was pulling from Jeff's netdev tree as a way to fix-up a git
administration error I had inflicted upon myself... That need has
disappeared since 2.6.17 opened and Jeff pushed his upstream branch to
Linus. At present, all the branches in wireless-2.6 only pull from
linux-2.6. I am still pushing (i.e. requesting Jeff's pull) to netdev-2.6,
if that matters.

Maybe the current wireless-2.6 tree fits into your system better?

Thanks,

John
--
John W. Linville
[EMAIL PROTECTED]
Re: [XFRM Doc]: aevent description
On Fri, 2006-14-04 at 15:05 -0700, David S. Miller wrote:
> From: jamal <[EMAIL PROTECTED]>
> Date: Thu, 13 Apr 2006 09:00:08 -0400
>
> > There is dependency on the previous patch i sent since the issue that
> > patch fixes is assumed in this text description. It would be a good
> > idea to apply at the same time as the other.
>
> Applied, after fixing 28 lines containing trailing whitespace :-)

yikes ;->
Ok, so how do i avoid this in the future? Note, this was a _brand new_
file, so it is a little bizarre.

cheers,
jamal
Re: Please pull upstream-fixes branch of wireless-2.6
On Thursday 20 April 2006 10:57, you wrote:
> Michael Buesch <[EMAIL PROTECTED]> wrote:
> >
> > On Thursday 20 April 2006 03:12, John W. Linville wrote:
> > > bcm43xx: fix dyn tssi2dbm memleak
> > > bcm43xx: fix pctl slowclock limit calculation
> > > bcm43xx: sysfs code cleanup
> >
> > These are already in -mm and on their way into Linus's tree.
>
> I don't send netdev patches to Linus except under unusual circumstances.
> I'd expect these patches to go upstream via John or Jeff.
>
> > Is it possible to cause problems?
>
> Nope, I'll just drop them when they appear in a git tree.
>
> And I really need to find a way of getting git-wireless into -mm. Problem
> is, it's based off git-netdev-all and when John's tree is synced to a later
> version of Linus's tree than Jeff's tree, all hell breaks loose at my end.
> Junio and I weren't able to work out a way of extracting the jeff->john
> diffs so I gave up.
>
> Probably, I'll need to actually do a git merge, generate the diff then
> throw away the resulting git tree. Or something. I've avoided doing git
> merges because I'm dealing with 58 trees and I suspect I'd go insane.
>
> > If not, fine. If yes, we need some clearly defined rules where
> > to put patches and a clearly defined statement of how often
> > patches are pushed upstream.
>
> Because I don't carry git-wireless I don't have visibility of when John has
> merged something. Ordinarily you'd have seen me drop the patches again
> when they popped up in John's tree.

Ok, that is perfectly fine and it will work.
Thanks for the clarification.

--
Greetings Michael.
Re: Please pull upstream-fixes branch of wireless-2.6
Michael Buesch wrote:
> On Thursday 20 April 2006 03:12, John W. Linville wrote:
> > bcm43xx: fix dyn tssi2dbm memleak
> > bcm43xx: fix pctl slowclock limit calculation
> > bcm43xx: sysfs code cleanup
>
> These are already in -mm and on their way into Linus's tree.
>
> Is it possible to cause problems?
> If not, fine. If yes, we need some clearly defined rules where
> to put patches and a clearly defined statement of how often
> patches are pushed upstream.

Ideally, patches should be sent to John, who will send them to me -> Linus.
If they are bug fixes, the turnaround can be same once I get them from John
(and Linus is taking patches).

That's always been the standard route: wireless patches -> wireless
maintainer.

Jeff
Re: Please pull upstream-fixes branch of wireless-2.6
Michael Buesch <[EMAIL PROTECTED]> wrote:
>
> On Thursday 20 April 2006 03:12, John W. Linville wrote:
> > bcm43xx: fix dyn tssi2dbm memleak
> > bcm43xx: fix pctl slowclock limit calculation
> > bcm43xx: sysfs code cleanup
>
> These are already in -mm and on their way into Linus's tree.

I don't send netdev patches to Linus except under unusual circumstances.
I'd expect these patches to go upstream via John or Jeff.

> Is it possible to cause problems?

Nope, I'll just drop them when they appear in a git tree.

And I really need to find a way of getting git-wireless into -mm. Problem
is, it's based off git-netdev-all and when John's tree is synced to a later
version of Linus's tree than Jeff's tree, all hell breaks loose at my end.
Junio and I weren't able to work out a way of extracting the jeff->john
diffs so I gave up.

Probably, I'll need to actually do a git merge, generate the diff then
throw away the resulting git tree. Or something. I've avoided doing git
merges because I'm dealing with 58 trees and I suspect I'd go insane.

> If not, fine. If yes, we need some clearly defined rules where
> to put patches and a clearly defined statement of how often
> patches are pushed upstream.

Because I don't carry git-wireless I don't have visibility of when John has
merged something. Ordinarily you'd have seen me drop the patches again
when they popped up in John's tree.
Re: Please pull upstream-fixes branch of wireless-2.6
On Thursday 20 April 2006 03:12, John W. Linville wrote:
> bcm43xx: fix dyn tssi2dbm memleak
> bcm43xx: fix pctl slowclock limit calculation
> bcm43xx: sysfs code cleanup

These are already in -mm and on their way into Linus's tree.

Is it possible to cause problems?
If not, fine. If yes, we need some clearly defined rules where
to put patches and a clearly defined statement of how often
patches are pushed upstream.

--
Greetings Michael.
Re: [RESEND][PATCH] ebtables: clean up vmalloc usage in net/bridge/netfilter/ebtables.c
On Wed, Apr 19, 2006 at 04:13:24PM -0700, Andrew Morton wrote:
> "David S. Miller" <[EMAIL PROTECTED]> wrote:
> >
> > From: Andrew Morton <[EMAIL PROTECTED]>
> > Date: Wed, 19 Apr 2006 15:59:25 -0700
> >
> > > "David S. Miller" <[EMAIL PROTECTED]> wrote:
> > > >
> > > > An earlier variant of your patch was applied already, included below.
> > > > You'll need to submit the newer parts relative to the current tree.
> > >
> > > This is a similar-but-different patch. It applies OK.
> > >
> > > I reviewed it (mostly - it's somewhat non-trivial to do this) and queued it
> > > up and was planning on sending it to you for post-2.6.17.
> >
> > It's at least fixing a few bugs, and the parts which are cleanups
> > undoubtedly should prevent bugs in the future, so I think we
> > should consider it for 2.6.17 right?
>
> afaict it's just a cleanup, but whatever - I'll send it over now.

The first patch (which is already applied) was a bug fix. This one is just
a cleanup: it makes the same cleanup that Andrew did to the original patch,
in other places in the same file. This is not at all critical, so it can be
moved post-2.6.17 without any problem.

Regards,
JC.