Re: [PATCH 2.6.25 0/9]: SCTP: Update ADD-IP implementation to conform to spec
From: Vlad Yasevich [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 15:53:47 -0500

> Not sure if you got the PATCH 7/9 resend, but it looks like netdev ate that too. I made this patch set available here:
> master.kernel.org:/pub/scm/linux/kernel/git/vxy/lksctp-dev.git addip

I got the patch, there is probably some keyword in there that is making it get consumed by the majordomo regexp filters we have in place.

--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/29] Swap over NFS -v15
On Wed, 2007-12-19 at 17:22 -0500, Bill Davidsen wrote:

> Peter Zijlstra wrote:
> > Hi,
> > Another posting of the full swap over NFS series.
> > Andrew/Linus, could we start thinking of sticking this in -mm?
>
> Two questions:
> 1 - what is the memory use impact on systems which don't do swap over NFS, such as embedded systems, and

It should have little to no impact if not used.

> 2 - what is the advantage of this code over the two existing network swap approaches, swapping to NFS mounted file and

This is not actually possible with a recent kernel, current swapfile support requires a blockdevice.

> swap to NBD device? I've used the NFS file when a program was running out of memory and that seemed to work, people in UNYUUG have reported that the nbd swap works, so what's better here?

Swap over NBD works sometimes, but it's rather easy to deadlock, and it's impossible to recover from a broken connection.
Re: [PATCH] One more XFRM audit fix
From: Paul Moore [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 14:29:31 -0500

> The following patch is backed against David's net-2.6 tree and is pretty trivial. I know we're late in the 2.6.24 cycle but I think this is worth merging, if you guys don't feel that way let me know and I'll resubmit it for 2.6.25.

Where is that patch? Or do you mean the fix you emailed separately today (which I will apply, thanks)?

> As a side note, I'm unable to actually test the patch because I can't get the kernel to compile (M=net/xfrm works just fine). The problem I keep seeing is below:
>
> make[3]: *** No rule to make target \
>   `/blah/kernels/net-2.6_xfrm-auid-secid-fix/include/linux/ticable.h', \
>   needed by \
>   `/blah/kernels/net-2.6_xfrm-auid-secid-fix/usr/include/linux/ticable.h'. \
>   Stop.

Remove ticable.h from include/linux/Kbuild

This is already cured in Linus's tree.
Re: [PATCH] XFRM: Audit function arguments misordered
From: Paul Moore [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 14:29:38 -0500

> In several places the arguments to the xfrm_audit_start() function are in the wrong order resulting in incorrect user information being reported. This patch corrects this by placing the arguments in the correct order.
>
> Signed-off-by: Paul Moore [EMAIL PROTECTED]

Applied, thanks for fixing this bug.
Re: [PATCH][IPV4] ip_gre: set mac_header correctly in receive path
From: Timo Teräs [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 20:10:41 +0200

> From: Timo Teras [EMAIL PROTECTED]
>
> mac_header update in ipgre_recv() was incorrectly changed to skb_reset_mac_header() when it was introduced.
>
> Signed-off-by: Timo Teras [EMAIL PROTECTED]

Patch applied, thanks.

> ---
> This replaces my earlier patch titled "ip_gre: use skb->{mac, network}_header consistently". Apparently I hadn't done my homework on how to use *_header correctly. And I should have done a bit more testing to figure out that the previous patch does not work. But the main problem was the receive path in the first place, and this patch fixes it.
>
> The bug was introduced in commit 459a98ed881802dee55897441bc7f77af614368e. There might be other similar incorrect replaces.

That commit has two other identical bad conversions, I'll fix them up.
Re: [PATCH net-2.6.25 1/3] Uninline the __inet_hash function
From: Eric Dumazet [EMAIL PROTECTED]
Date: Wed, 19 Dec 2007 18:15:20 +0100

> Pavel Emelyanov wrote:
> > That's not true, if I get you right. The __inet_hash() is called with 0 from all the places except for the inet_hash() one.
>
> OK, but in the cases with 0, sk->sk_state is != TCP_LISTEN, unless I am mistaken.

This is true.
Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly
In article [EMAIL PROTECTED] (at Thu, 20 Dec 2007 12:31:27 +0900), Satoru SATOH [EMAIL PROTECTED] says:

> diff --git a/ip/iproute.c b/ip/iproute.c
> index f4200ae..fa722c6 100644
> --- a/ip/iproute.c
> +++ b/ip/iproute.c
> @@ -510,16 +510,16 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
>  			fprintf(fp, "%u", *(unsigned*)RTA_DATA(mxrta[i]));
>  		else {
>  			unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
> +			unsigned hz1 = hz / 1000;
>
> -			val *= 1000;
>  			if (i == RTAX_RTT)

I think this is incorrect; hz might not be 1000; e.g. 250 etc.

--yoshfuji
Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly
On 20-12-2007 04:31, Satoru SATOH wrote:

> ip route show does not print correct value when larger rto_min is set (e.g. 3sec).
>
> This problem is because of overflow in print_route() and the patch below is a workaround fix for that.
...
> --- a/ip/iproute.c
> +++ b/ip/iproute.c
> @@ -510,16 +510,16 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
>  			fprintf(fp, "%u", *(unsigned*)RTA_DATA(mxrta[i]));
>  		else {
>  			unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
> +			unsigned hz1 = hz / 1000;
...
> +			if (val >= hz1)
> +				fprintf(fp, "%ums", val/hz1);
...

Probably I miss something or my iproute sources are too old, but: does this work with hz < 1000?

Regards,
Jarek P.
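The overflow and the hz concern raised in this thread can be looked at in isolation. What follows is only a sketch, not the iproute2 code: the function names are hypothetical, a 32-bit unsigned is assumed, and the tick rate is passed in as a plain parameter. It contrasts a 32-bit multiply-then-divide, which wraps for large values, with a conversion that widens to 64 bits first and therefore works for any non-zero hz, including rates below 1000 where hz / 1000 would be 0.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a tick count to milliseconds using only
 * 32-bit arithmetic. The product ticks * 1000 wraps once ticks exceeds
 * UINT32_MAX / 1000, which is the kind of overflow described above.
 */
uint32_t ticks_to_ms_naive(uint32_t ticks, uint32_t hz)
{
	return ticks * 1000u / hz;
}

/*
 * Hypothetical helper: widen to 64 bits before multiplying. No
 * intermediate "hz / 1000" divisor is needed, so hz = 250 is fine.
 */
uint32_t ticks_to_ms_wide(uint32_t ticks, uint32_t hz)
{
	return (uint32_t)((uint64_t)ticks * 1000u / hz);
}
```

With a tick rate of 1000000 and 5000000 ticks (5 seconds), the naive version wraps while the widened one returns 5000; with hz = 250 and 750 ticks the widened version returns 3000.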
A short question about net git tree and patches
Hello,

I have a short question regarding the net git tree and patches: I want to write and send patches against the most recent and bleeding edge kernel networking code. I see in:
http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A
that there are 3 git trees which can be candidates for git-clone and making patches against; these are: netdev-2.6.git, net-2.6.25.git and net-2.6.git.

It seems to me that net-2.6.git is the most suitable one to work against; am I right? What is the difference, in short, between the three repositories?

Regards,
DS
Re: DM9000_IRQ_FLAGS
On Tue, Dec 11, 2007 at 08:18:23PM +0100, Daniel Mack wrote:

> Hi,
>
> on Toradex' Colibri, a PXA270 based board with a DM9000 ethernet controller, this driver won't work due to unsuitable DM9000_IRQ_FLAGS. If I understood the code behind request_irq() correctly, it's not recommended to register an IRQ without any of the IRQT_* flags set.
>
> Is there any concerns about applying the patch below?

Yes, that will possibly break all systems using level-triggered interrupts. Probably the best solution is to pass the data via the platform information being fed to the device.

--
Ben ([EMAIL PROTECTED], http://www.fluff.org/)
'a smiley only costs 4 bytes'
Re: DM9000_IRQ_FLAGS
On Wed, Dec 12, 2007 at 02:41:53PM +0100, Daniel Mack wrote:

> Hi Remy,
>
> On Tue, Dec 11, 2007 at 09:31:03PM +0100, Remy Bohmer wrote:
> > This controller is also used on many other boards, e.g. the Atmel AT91sam9261-ek board. On that board an interrupt is generated on both the rising _and_ falling edge.
>
> However, request_irq() is called with IRQF_SHARED only, so neither IRQT_RISING nor IRQT_FALLING is set and the value defaults to IRQT_NOEDGE. How can you get IRQs?
>
> > I can test tomorrow if this patch leaves this board intact, but should the board-specific code not add this flag if it is required? By modifying this driver you will interfere with the behavior of other boards, and I do not know if there are any level triggered types used.
>
> Actually, the best way to go is to let the platform resources flags decide about that with something like
>   resource->flags = IORESOURCE_IRQ | IRQT_RISING;
> but the dm9000 does not care about them at all. Changing that would also imply modifications to all board support code.

I did have a go at trying to get people to pass the information this way, but it seems to have been ignored last time I sent it. I can dig out the code that converts resource->flags to IRQT_ flags.

--
Ben ([EMAIL PROTECTED], http://www.fluff.org/)
'a smiley only costs 4 bytes'
[PATCH net-2.6.25 (resend) 1/3] Uninline the __inet_hash function
This one is used in quite many places in the networking code and seems too big to be inline.

After the patch net/ipv4/built-in.o loses ~650 bytes:

add/remove: 2/0 grow/shrink: 0/5 up/down: 461/-1114 (-653)
function                                     old     new   delta
__inet_hash_nolisten                           -     282    +282
__inet_hash                                    -     179    +179
tcp_sacktag_write_queue                     2255    2254      -1
__inet_lookup_listener                       284     274     -10
tcp_v4_syn_recv_sock                         755     493    -262
tcp_v4_hash                                  389      35    -354
inet_hash_connect                           1086     599    -487

This version addresses the issue pointed out by Eric, that while being inline this function was optimized by gcc with respect to the 'listen_possible' argument.

(Patches 2 and 3 in this series are still applied after this)

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index fef4442..65ddb25 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -264,37 +264,14 @@ static inline void inet_listen_unlock(struct inet_hashinfo *hashinfo)
 		wake_up(&hashinfo->lhash_wait);
 }
 
-static inline void __inet_hash(struct inet_hashinfo *hashinfo,
-			       struct sock *sk, const int listen_possible)
-{
-	struct hlist_head *list;
-	rwlock_t *lock;
-
-	BUG_TRAP(sk_unhashed(sk));
-	if (listen_possible && sk->sk_state == TCP_LISTEN) {
-		list = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];
-		lock = &hashinfo->lhash_lock;
-		inet_listen_wlock(hashinfo);
-	} else {
-		struct inet_ehash_bucket *head;
-		sk->sk_hash = inet_sk_ehashfn(sk);
-		head = inet_ehash_bucket(hashinfo, sk->sk_hash);
-		list = &head->chain;
-		lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
-		write_lock(lock);
-	}
-	__sk_add_node(sk, list);
-	sock_prot_inc_use(sk->sk_prot);
-	write_unlock(lock);
-	if (listen_possible && sk->sk_state == TCP_LISTEN)
-		wake_up(&hashinfo->lhash_wait);
-}
+extern void __inet_hash(struct inet_hashinfo *hashinfo, struct sock *sk);
+extern void __inet_hash_nolisten(struct inet_hashinfo *hinfo, struct sock *sk);
 
 static inline void inet_hash(struct inet_hashinfo *hashinfo, struct sock *sk)
 {
 	if (sk->sk_state != TCP_CLOSE) {
 		local_bh_disable();
-		__inet_hash(hashinfo, sk, 1);
+		__inet_hash(hashinfo, sk);
 		local_bh_enable();
 	}
 }
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 02fc91c..f450df2 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -408,7 +408,7 @@ struct sock *dccp_v4_request_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	dccp_sync_mss(newsk, dst_mtu(dst));
 
-	__inet_hash(&dccp_hashinfo, newsk, 0);
+	__inet_hash_nolisten(&dccp_hashinfo, newsk);
 	__inet_inherit_port(&dccp_hashinfo, sk, newsk);
 
 	return newsk;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index b07e2d3..2e5814a 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -305,6 +305,48 @@ static inline u32 inet_sk_port_offset(const struct sock *sk)
 					  inet->dport);
 }
 
+void __inet_hash_nolisten(struct inet_hashinfo *hashinfo, struct sock *sk)
+{
+	struct hlist_head *list;
+	rwlock_t *lock;
+	struct inet_ehash_bucket *head;
+
+	BUG_TRAP(sk_unhashed(sk));
+
+	sk->sk_hash = inet_sk_ehashfn(sk);
+	head = inet_ehash_bucket(hashinfo, sk->sk_hash);
+	list = &head->chain;
+	lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
+
+	write_lock(lock);
+	__sk_add_node(sk, list);
+	sock_prot_inc_use(sk->sk_prot);
+	write_unlock(lock);
+}
+EXPORT_SYMBOL_GPL(__inet_hash_nolisten);
+
+void __inet_hash(struct inet_hashinfo *hashinfo, struct sock *sk)
+{
+	struct hlist_head *list;
+	rwlock_t *lock;
+
+	if (sk->sk_state != TCP_LISTEN) {
+		__inet_hash_nolisten(hashinfo, sk);
+		return;
+	}
+
+	BUG_TRAP(sk_unhashed(sk));
+	list = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];
+	lock = &hashinfo->lhash_lock;
+
+	inet_listen_wlock(hashinfo);
+	__sk_add_node(sk, list);
+	sock_prot_inc_use(sk->sk_prot);
+	write_unlock(lock);
+	wake_up(&hashinfo->lhash_wait);
+}
+EXPORT_SYMBOL_GPL(__inet_hash);
+
 /*
  * Bind a port for a connect operation and hash it.
  */
@@ -372,7 +414,7 @@ ok:
 		inet_bind_hash(sk, tb, port);
 		if (sk_unhashed(sk)) {
 			inet_sk(sk)->sport = htons(port);
-			__inet_hash(hinfo, sk, 0);
+
[PATCH net-2.6.25][NEIGH] Make neigh_add_timer symmetrical to neigh_del_timer
The neigh_del_timer() looks sane - it removes the timer and (conditionally) puts the neighbor. I expected that the neigh_add_timer() would be symmetrical to the del one - i.e. it holds the neighbor and arms the timer - but it turned out that it was not so.

I think that making them look symmetrical makes the code more readable.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 4b6dd1e..9a283fc 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -165,6 +165,16 @@ static int neigh_forced_gc(struct neigh_table *tbl)
 	return shrunk;
 }
 
+static void neigh_add_timer(struct neighbour *n, unsigned long when)
+{
+	neigh_hold(n);
+	if (unlikely(mod_timer(&n->timer, when))) {
+		printk("NEIGH: BUG, double timer add, state is %x\n",
+		       n->nud_state);
+		dump_stack();
+	}
+}
+
 static int neigh_del_timer(struct neighbour *n)
 {
 	if ((n->nud_state & NUD_IN_TIMER) &&
@@ -716,15 +726,6 @@ static __inline__ int neigh_max_probes(struct neighbour *n)
 		p->ucast_probes + p->app_probes + p->mcast_probes);
 }
 
-static inline void neigh_add_timer(struct neighbour *n, unsigned long when)
-{
-	if (unlikely(mod_timer(&n->timer, when))) {
-		printk("NEIGH: BUG, double timer add, state is %x\n",
-		       n->nud_state);
-		dump_stack();
-	}
-}
-
 /* Called when a timer expires for a neighbour entry. */
 
 static void neigh_timer_handler(unsigned long arg)
@@ -856,7 +857,6 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 			atomic_set(&neigh->probes, neigh->parms->ucast_probes);
 			neigh->nud_state = NUD_INCOMPLETE;
 			neigh->updated = jiffies;
-			neigh_hold(neigh);
 			neigh_add_timer(neigh, now + 1);
 		} else {
 			neigh->nud_state = NUD_FAILED;
@@ -869,7 +869,6 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 		}
 	} else if (neigh->nud_state & NUD_STALE) {
 		NEIGH_PRINTK2("neigh %p is delayed.\n", neigh);
-		neigh_hold(neigh);
 		neigh->nud_state = NUD_DELAY;
 		neigh->updated = jiffies;
 		neigh_add_timer(neigh,
@@ -1013,13 +1012,11 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
 
 	if (new != old) {
 		neigh_del_timer(neigh);
-		if (new & NUD_IN_TIMER) {
-			neigh_hold(neigh);
+		if (new & NUD_IN_TIMER)
 			neigh_add_timer(neigh, (jiffies +
 						((new & NUD_REACHABLE) ?
 						 neigh->parms->reachable_time :
 						 0)));
-		}
 		neigh->nud_state = new;
 	}
Re: [RFC] net: napi fix
David Miller writes:

> > Is the netif_running() check even required?
>
> No, it is not.
>
> When a device is brought down, one of the first things that happens is that we wait for all pending NAPI polls to complete, then block any new polls from starting.

Hello!

Yes, but the reason was not to wait for all pending polls to complete, so a server/router could be rebooted even under high load and DoS. We've experienced some nasty problems with this.

Cheers.
--ro
Re: A short question about net git tree and patches
On Thu, Dec 20, 2007 at 11:20:26AM +0200, David Shwatrz wrote:

> Hello,
> I have a short question regarding the net git tree and patches: I want to write and send patches against the most recent and bleeding edge kernel networking code. I see in:
> http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A
> that there are 3 git trees which can be candidates for git-clone and making patches against; these are: netdev-2.6.git, net-2.6.25.git and net-2.6.git.
>
> It seems to me that net-2.6.git is the most suitable one to work against; am I right ? what is the difference, in short, between the three repositories?

IIRC the usage is:

netdev-2.6.git = old stuff, 4 weeks since last update. Not in use
net-2.6.25.git = patches for current kernel release (only fixes)
net-2.6.git = patches for next kernel release and planned to be applied in next merge window

So net-2.6.git is the correct choice for bleeding edge.

Sam
Re: DM9000_IRQ_FLAGS
Hello Ben,

> > Actually, the best way to go is to let the platform resources flags decide about that with something like
> >   resource->flags = IORESOURCE_IRQ | IRQT_RISING;
> > but the dm9000 does not care about them at all. Changing that would also imply modifications to all board support code.
>
> I did have a go at trying to get people to pass the information this way, but it seem to be ignored last time I sent it. I can dig out the code that converts resource->flags to IRQT_ flags.

I thought this issue was already solved by using set_irq_type() in the BSP, just like all the other boards do...

Kind Regards,
Remy
Re: A short question about net git tree and patches
On Thu, 20 Dec 2007, Sam Ravnborg wrote:

> On Thu, Dec 20, 2007 at 11:20:26AM +0200, David Shwatrz wrote:
> > Hello,
> > I have a short question regarding the net git tree and patches: I want to write and send patches against the most recent and bleeding edge kernel networking code. I see in:
> > http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A
> > that there are 3 git trees which can be candidates for git-clone and making patches against; these are: netdev-2.6.git, net-2.6.25.git and net-2.6.git.
> >
> > It seems to me that net-2.6.git is the most suitable one to work against; am I right ? what is the difference, in short, between the three repositories?
>
> IIRC the usage is:
> netdev-2.6.git = old stuff, 4 weeks since last update. Not in use
> net-2.6.25.git = patches for current kernel release (only fixes)

Nope, we don't even have 2.6.24 yet. :-)

> net-2.6.git = patches for next kernel release and planned to be applied in next merge window
>
> So net-2.6.git is the correct choice for bleeding edge.

net-2.6 is for fixes only. net-2.6.25 will become net-2.6 once 2.6.24 gets released; eventually net-2.6.26 gets opened (not necessarily at the same time as the merge window is closed but a bit later) and the cycle repeats with similar transitions when 2.6.25 gets released.

The netdev trees are for network drivers and are usually managed by Jeff Garzik, but there was recently some arrangement between Dave and Jeff due to vacations so that netdev was also managed by Dave.

--
 i.
Re: Badness at net/core/dev.c:2199
I already sent out a correct patch last week. It should pre-increment.

Any hope of getting it upstream?

--
Meelis Roos ([EMAIL PROTECTED]) http://www.cs.ut.ee/~mroos/
Re: [RFC] net: napi fix
From: Robert Olsson [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 10:52:17 +0100

> Yes but the reason was not to wait for all pending polls to complete so a server/router could be rebooted even under high load and DoS. We've experienced some nasty problems with this.

I know, see the rest of the thread where I agree that we need to deal with this somehow.

The device is marked down first, and somehow we need to tip off of that to break out of the NAPI loop. This "how" is what hasn't been resolved yet.
Re: A short question about net git tree and patches
From: Sam Ravnborg [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 10:55:10 +0100

> net-2.6.25.git = patches for current kernel release (only fixes)
> net-2.6.git = patches for next kernel release and planned to be applied in next merge window
>
> So net-2.6.git is the correct choice for bleeding edge.

You reversed them, net-2.6.25.git is for bleeding edge stuff, net-2.6.git is for bug fixes only.
Re: [PATCH 1/4] [UDP]: fix send buffer check
From: Hideo AOKI [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 21:38:03 -0500

> This patch introduces sndbuf size check before memory allocation for send buffer.
>
> signed-off-by: Satoshi Oshima [EMAIL PROTECTED]
> signed-off-by: Hideo Aoki [EMAIL PROTECTED]
...
> diff -pruN net-2.6/net/ipv4/ip_output.c net-2.6-udp-take11a1-p1/net/ipv4/ip_output.c
> --- net-2.6/net/ipv4/ip_output.c	2007-12-11 10:54:55.0 -0500
> +++ net-2.6-udp-take11a1-p1/net/ipv4/ip_output.c	2007-12-17 14:42:31.0 -0500
> @@ -1004,6 +1004,11 @@ alloc_new_skb:
>  					frag = &skb_shinfo(skb)->frags[i];
>  				}
>  			} else if (i < MAX_SKB_FRAGS) {
> +				if (atomic_read(&sk->sk_wmem_alloc) + PAGE_SIZE
> +				    > 2 * sk->sk_sndbuf) {
> +					err = -ENOBUFS;
> +					goto error;
> +				}
>  				if (copy > PAGE_SIZE)
>  					copy = PAGE_SIZE;
>  				page = alloc_pages(sk->sk_allocation, 0);

If we are going to do this, we need to add the same check to skb_append_datato_frags() which is invoked via ip_ufo_append_data().

We also have to be very careful in this area. One problem we had a long time ago was that we would socket account when fragmenting an outgoing frame. This was bogus because even if the socket had enough space for one full sized frame, the packet send would fail because it could not fit the space for both the original frame and the fragmented copy of it.

This situation was cured by simply not enforcing accounting for the fragmented copy. It is valid because after we fragment, we keep the fragmented copy but free the original.

This doesn't apply directly to this specific patch, but it is something to keep in mind when doing these changes.
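The send-buffer test in the quoted hunk can be read as "refuse to allocate another page once the bytes already charged to the socket, plus one page, would exceed twice the configured send buffer". A minimal user-space sketch of that predicate follows. It is an illustration only: the struct and function names are hypothetical stand-ins, the two fields mirror sk_wmem_alloc and sk_sndbuf as plain integers, and the comparison direction is an assumption taken from the surrounding description.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two socket fields the check reads. */
struct sndbuf_state {
	int wmem_alloc;	/* bytes currently charged to the send side */
	int sndbuf;	/* configured send buffer size in bytes */
};

/*
 * Returns true when one more page may be allocated, i.e. when
 * wmem_alloc + page_size does not exceed 2 * sndbuf. The caller
 * would fail with -ENOBUFS in the other case.
 */
bool page_alloc_allowed(const struct sndbuf_state *sk, int page_size)
{
	return sk->wmem_alloc + page_size <= 2 * sk->sndbuf;
}
```

The same predicate would have to be applied on every path that attaches pages to the skb, which is the point made above about skb_append_datato_frags().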
Re: TSO trimming question
On Wed, 19 Dec 2007, David Miller wrote:

> From: Ilpo Järvinen [EMAIL PROTECTED]
> Date: Wed, 19 Dec 2007 23:46:33 +0200 (EET)
>
> > I'm not fully sure what's purpose of this code in tcp_write_xmit:
> >
> >	if (skb->len < limit) {
> >		unsigned int trim = skb->len % mss_now;
> >		if (trim)
> >			limit = skb->len - trim;
> >	}
> >
> > Is it used to make sure we send only multiples of mss_now here and leave the left-over into another skb?

Yeah, I now understand that this part is correct. While trying to figure this out I somehow got the impression that it ends up being dead code, but that was wrong on my side. However, it caught my attention and after some thinking I'd say there's more to handle here (covered by the second question). Also note that the patch I sent earlier is not right either but needs some refining to do the right thing.

> > Or does it try to make sure that tso_fragment result honors multiple of mss_now boundaries when snd_wnd is the limiting factor? For latter IMHO this would be necessary:
> >
> >	if (skb->len > limit)
> >		limit -= limit % mss_now;
>
> The purpose of the test is to make sure we process tail sub-mss chunks correctly wrt. Nagle, which most closely matches the first purpose you've listed.
>
> So I think the calculation really does belong where it is.
>
> Because of the way that the sendmsg() super-skb formation logic works, we always will tack on more data and grow the tail SKB before creating a new one. So any sub-mss chunk at the end of a TSO frame really is at the end of the write queue and really should get nagle processing.

Yes, I now agree this is fully correct for this task.

> Actually, there is an exception, which is when we run out of skb_frag_list slots. In that case we'll potentially have breaks at odd boundaries in the middle of the queue. But this can only happen in exceptional cases (user does tons of 1-byte sendfile()'s over random non-consecutive locations of a file) or outright bugs (MAX_SKB_FRAGS is defined incorrectly, for example) and thus this situation is not worth coding for.

That's not the only case. IMHO if there's an odd boundary due to the snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't consider it as odd but break the skb at an arbitrary point, resulting in two small segments going to the network. What's worse, the later skb resulting from the first split matches the skb->len < limit check as well, causing an unnecessary small skb to be created for nagle purposes too. Solving it fully requires some thought in case mss_now != mss_cache, even if non-odd boundaries are honored in the middle of the skb.

Though whether we get there depends on what tcp_tso_should_defer() decided. Hmm, there seems to be an unrelated bug in it as well :-/. A patch below. Please consider the fact that enabling TSO deferring may have some unpleasant effect on TCP dynamics; considering that, I don't find stable mandatory for this to avoid breaking, besides things have been working quite well without it too... Only compile tested.

--
 i.

--
[PATCH] [TCP]: Fix TSO deferring

I'd say that most of what tcp_tso_should_defer had in between there was dead code because of this.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_output.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8dafda9..693b9f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 		goto send_now;
 
 	/* Defer for less than two clock ticks. */
-	if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
+	if (tp->tso_deferred &&
+	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
 		goto send_now;
 
 	in_flight = tcp_packets_in_flight(tp);
--
1.5.0.6
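The tail-trimming test discussed in this message can be tried out with plain integers. The sketch below is not the kernel function: the name tso_send_limit and its parameters are hypothetical, the window-derived limit is passed in directly, and the comparison is assumed to be "skb length smaller than the limit". It shows that trimming only happens when the whole skb fits in the window, and that a window-limited value is passed through unrounded, which is the odd-boundary case raised above.

```c
/*
 * Hypothetical model of the limit calculation: window_limit stands in
 * for what tcp_window_allows() would return, skb_len for skb->len.
 * mss_now must be non-zero.
 */
unsigned int tso_send_limit(unsigned int skb_len,
			    unsigned int window_limit,
			    unsigned int mss_now)
{
	unsigned int limit = window_limit;

	if (skb_len < limit) {
		/* Whole skb fits: hold back the sub-MSS tail, if any. */
		unsigned int trim = skb_len % mss_now;

		if (trim)
			limit = skb_len - trim;
	}
	/* Otherwise the window decides, and it may not be an MSS multiple. */
	return limit;
}
```

For example, a 4000-byte skb with mss_now = 1460 and a 10000-byte window yields 2920 (two full segments), while the same skb against a 3000-byte window yields 3000, which is not a multiple of the MSS.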
Re: [PATCH 2/4] [CORE]: datagram: basic memory accounting functions
From: Hideo AOKI [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 21:38:17 -0500

Why do we need separate stream and datagram accounting functions? Is it just to facilitate things like the following test?

> +static inline int sk_wmem_schedule(struct sock *sk, int size)
> +{
> +	if (sk->sk_type == SOCK_DGRAM)
> +		return sk_datagram_wmem_schedule(sk, size);
> +	else
> +		return 1;
> +}

If so, this can be greatly improved.

All of these other functions are identical copies of the stream counterparts, they should all be consolidated.

I still see a lot of special casing, instead of large pieces of common code. There should be one core set of functions that handle the memory accounting, regardless of socket type. Maybe there is one spot where something like sk->prot->doing_memory_accounting is tested, but that's it.

I am still very dissatisfied with these changes. They are full of special cases, because they mix generic facilities (the socket memory accounting) with an unrelated issue (we only support memory accounting for datagram sockets which are actually UDP).

Also, the memory accounting is done at different parts in the socket code paths for stream vs. datagram. This is why everything is inconsistent, and a mess.

What's funny is that I absolutely do not care if these changes are perfect and pass every possible regression test. Rather, I'm more concerned that this thing is designed correctly and will allow us to have one core set of memory accounting functions regardless of socket type.

As it is coded now, we have two sets of code paths to fix, two ways of doing the socket accounting, and therefore twice as much code to maintain and debug.

The whole thing needs to be consistent and without special cases. The "protocol supports memory accounting" test can be performed, as you did in this patch, by simply checking if sk->sk_prot->memory_allocated is non-NULL.
Re: dn_neigh_table vs pneigh_lookup/pneigh_delete
Hi,

On Wed, Dec 19, 2007 at 05:11:34PM +0300, Pavel Emelyanov wrote:

> Hi
>
> The pneigh_lookup/delete silently assume that the key_len of the table is more than 4 bytes. Look:
>
>	u32 hash_val = *(u32 *)(pkey + key_len - 4);
>
> The hash_val for the proxy neighbor entry is the last four bytes of the pkey. But the dn_neigh_table's key_len is sizeof(__le16), that is 2, so setting (via netlink) the proxy neighbor entry for decnet will cause this entry to reside in an arbitrary hash chain.
>
> Is this too bad for decnet?
>
> Thanks,
> Pavel

The pneigh code is never used in DECnet, we only use the normal part of the neigh code where the hash function was changed so that it can be defined for each protocol (and thus doesn't suffer from this problem).

Steve.
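To make the key_len concern concrete: with the expression pkey + key_len - 4, a 2-byte key makes the 4-byte read start two bytes before the key itself. The sketch below is a user-space illustration, not kernel code; both function names are hypothetical. The first helper only computes where the read would begin, and the second shows one possible way to fold a short key without reading outside it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset, relative to the start of the key, where the 4-byte read begins. */
ptrdiff_t pneigh_read_offset(size_t key_len)
{
	return (ptrdiff_t)key_len - 4;
}

/*
 * Hypothetical bounded variant: copy at most the last 4 bytes of the
 * key into a zero-initialised word, so keys shorter than 4 bytes are
 * never read out of bounds. The numeric result depends on byte order.
 */
uint32_t key_tail_hash(const void *pkey, size_t key_len)
{
	uint32_t v = 0;
	size_t n = key_len < 4 ? key_len : 4;

	memcpy(&v, (const unsigned char *)pkey + key_len - n, n);
	return v;
}
```

For a 4-byte key the offset is 0 and nothing is wrong; for a 2-byte DECnet-sized key it is -2, which is why the entry would land in an arbitrary hash chain.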
Re: [PATCH 4/4] [UDP]: memory accounting in IPv4
From: Hideo AOKI [EMAIL PROTECTED]
Date: Mon, 17 Dec 2007 21:38:47 -0500

> This patch adds UDP memory usage accounting in IPv4.
>
> Send buffer accounting is performed by IP layer, because skbuff is allocated in the layer.
>
> Receive buffer is charged, when the buffer successfully received. Destructor of the buffer does uncharging and reclaiming, when the buffer is freed. To set destructor at proper place, we use __udp_queue_rcv_skb() instead of sock_queue_rcv_skb(). To maintain consistency of memory accounting, socket lock is used to free receive buffer in udp_recvmsg(). New packet will be add to backlog when the socket is used by user.
>
> Cc: Satoshi Oshima [EMAIL PROTECTED]
> signed-off-by: Takahiro Yasui [EMAIL PROTECTED]
> signed-off-by: Masami Hiramatsu [EMAIL PROTECTED]
> signed-off-by: Hideo Aoki [EMAIL PROTECTED]

We can't accept these changes, even once the other issues are fixed, until IPV6 is supported as well.

It's pointless to support proper UDP memory accounting only in IPV4 and not in IPV6 as well.
Re: TSO trimming question
From: Ilpo_Järvinen [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

> [PATCH] [TCP]: Fix TSO deferring
>
> I'd say that most of what tcp_tso_should_defer had in between there was dead code because of this.
>
> Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Yikes! John, we've been living a lie for more than a year. :-/

On the bright side this explains a lot of small TSO frames I've been seeing in traces over the past year but never got a chance to investigate.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 8dafda9..693b9f6 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>  		goto send_now;
>
>  	/* Defer for less than two clock ticks. */
> -	if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
> +	if (tp->tso_deferred &&
> +	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
>  		goto send_now;
>
>  	in_flight = tcp_packets_in_flight(tp);
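To see why the old test left the code after it effectively dead, here is a small userspace model of the "defer for less than two clock ticks" check. It is a sketch only: the function names are hypothetical, and it assumes tso_deferred is 0 when no deferral is in progress and otherwise holds (start_tick << 1) | 1, which is an assumption about the surrounding code rather than something shown in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumed encoding: tick shifted left by one with the low bit set, so never 0. */
static unsigned long mark_deferred(unsigned long now)
{
	assert(now <= (ULONG_MAX >> 1));	/* the shift drops the top bit */
	return (now << 1) | 1;
}

/* Old test: can only fire when no deferral is in progress. */
static bool send_now_old(unsigned long now, unsigned long tso_deferred)
{
	return !tso_deferred &&
	       ((now << 1) >> 1) - (tso_deferred >> 1) > 1;
}

/* Patched test: fires once a deferral has lasted more than one tick. */
static bool send_now_new(unsigned long now, unsigned long tso_deferred)
{
	return tso_deferred &&
	       ((now << 1) >> 1) - (tso_deferred >> 1) > 1;
}
```

In this model the old test returns true whenever tso_deferred is 0 and the tick counter is above 1, so a deferral never gets started; the patched test only returns true after an existing deferral has aged past one tick.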
Re: TSO trimming question
From: Ilpo_Järvinen [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)

> That's not the only case. IMHO, if there's an odd boundary due to the snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't consider it odd but break the skb at an arbitrary point, resulting in two small segments going to the network. What's worse, the later skb resulting from the first split can match the skb->len limit check as well, causing an unnecessary small skb to be created for nagle purposes too. Solving it fully requires some thought for the case where mss_now != mss_cache, even if non-odd boundaries are honored in the middle of the skb.

In the most ideal sense, tcp_window_allows() should probably be changed to only return MSS multiples.

Unfortunately this would add an expensive modulo operation, however I think it would eliminate this problem case.
Re: [PATCH 08/11] drivers/net/sunvnet.c: Use print_mac
From: Joe Perches [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 14:34:09 -0800

> Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied to net-2.6.25
Re: [PATCH 09/11] drivers/net/tg3.c: Use print_mac
From: Joe Perches [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 14:34:10 -0800

> Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied to net-2.6.25
Re: [PATCH 05/11] drivers/net/niu.c: Use print_mac
From: Joe Perches [EMAIL PROTECTED]
Date: Fri, 14 Dec 2007 14:34:06 -0800

> Signed-off-by: Joe Perches [EMAIL PROTECTED]

Applied to net-2.6.25
Re: [PATCH 3/3] [UDP6]: Counter increment on BH mode
From: Herbert Xu [EMAIL PROTECTED]
Date: Sat, 15 Dec 2007 21:58:52 +0800

> [SNMP]: Fix SNMP counters with PREEMPT
>
> The SNMP macros use raw_smp_processor_id() in process context which is illegal because the process may be preempted and then migrated to another CPU.
>
> This patch makes it use get_cpu/put_cpu to disable preemption.
>
> Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied to net-2.6.25, thanks.
Re: [SNMP]: Fix SNMP counters with PREEMPT
From: Herbert Xu [EMAIL PROTECTED]
Date: Sun, 16 Dec 2007 10:30:25 +0800

> On Sat, Dec 15, 2007 at 06:03:19PM +0100, Eric Dumazet wrote:
> >
> > How come you change SNMP_INC_STATS_USER() but not SNMP_INC_STATS() ?
>
> Heh, my brain must have blocked me from seeing it because it's too hard :)
>
> Let's fix it the stupid way first and I'll do a local_t conversion later.
>
> [SNMP]: Fix SNMP counters with PREEMPT
>
> The SNMP macros use raw_smp_processor_id() in process context which is illegal because the process may be preempted and then migrated to another CPU.
>
> This patch makes it use get_cpu/put_cpu to disable preemption.
>
> Signed-off-by: Herbert Xu [EMAIL PROTECTED]

I just noticed this and replaced the other SNMP fix patch with this one.
Re: [PATCH] One more XFRM audit fix
On Thursday 20 December 2007 3:00:09 am David Miller wrote:
> From: Paul Moore [EMAIL PROTECTED]
> Date: Wed, 19 Dec 2007 14:29:31 -0500
>
> > The following patch is backed against David's net-2.6 tree and is pretty trivial. I know we're late in the 2.6.24 cycle but I think this is worth merging, if you guys don't feel that way let me know and I'll resubmit it for 2.6.25.
>
> Where is that patch? Or do you mean the fix you emailed separately today (which I will apply, thanks)?

Yes, it was the patch you applied, "XFRM: Audit function arguments misordered". I was using stacked-git to post the patch and it apparently doesn't annotate the cover email's subject line with "0/1" when you only send one patch. Sorry about that.

> > As a side note, I'm unable to actually test the patch because I can't get the kernel to compile (M=net/xfrm works just fine). The problem I keep seeing is below:
> >
> > make[3]: *** No rule to make target \
> >  `/blah/kernels/net-2.6_xfrm-auid-secid-fix/include/linux/ticable.h', \
> >  needed by \
> >  `/blah/kernels/net-2.6_xfrm-auid-secid-fix/usr/include/linux/ticable.h'. \
> >  Stop.
>
> Remove ticable.h from include/linux/Kbuild
>
> This is already cured in Linus's tree.

Noted, thanks.

--
paul moore
linux security @ hp
Re: TSO trimming question
On Thu, 20 Dec 2007, David Miller wrote:

> From: Ilpo_Järvinen [EMAIL PROTECTED]
> Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)
>
> > That's not the only case. IMHO, if there's an odd boundary due to the snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't consider it odd but break the skb at an arbitrary point, resulting in two small segments going to the network. What's worse, the later skb resulting from the first split can match the skb->len limit check as well, causing an unnecessary small skb to be created for nagle purposes too. Solving it fully requires some thought for the case where mss_now != mss_cache, even if non-odd boundaries are honored in the middle of the skb.
>
> In the most ideal sense, tcp_window_allows() should probably be changed to only return MSS multiples.

That's what Herbert suggested already, I'll send a patch later on... :-)

> Unfortunately this would add an expensive modulo operation, however I think it would eliminate this problem case.

Yes. Should we still call tcp_minshall_update() if a split in the middle of wq results in a smaller than MSS tail (occurs only if mss_now != mss_cache)?

--
 i.
neigh: timer & !nud_in_timer
Hello,

I noticed the following message in my kernel log.

kernel: neigh: timer & !nud_in_timer

(Might be due to a race condition.)

I'm running a UP Linux version 2.6.22.1-rt9 ( http://rt.wiki.kernel.org/index.php )

The following /proc entries might be relevant.

/proc/sys/net/ipv4/conf/all/arp_accept    0
/proc/sys/net/ipv4/conf/all/arp_announce  2
/proc/sys/net/ipv4/conf/all/arp_filter    0
/proc/sys/net/ipv4/conf/all/arp_ignore    1

I also lowered the priority of softirq-timer/0 to 10, which means it can be interrupted by other IRQ handlers.

Regards.
Re: A short question about net git tree and patches
On Thu, Dec 20, 2007 at 03:22:58AM -0800, David Miller wrote:
> From: Sam Ravnborg [EMAIL PROTECTED]
> Date: Thu, 20 Dec 2007 10:55:10 +0100
>
> > net-2.6.25.git = patches for current kernel release (only fixes)
> > net-2.6.git = patches for next kernel release and planned to be applied in next merge window
> >
> > So net-2.6.git is the correct choice for bleeding edge.
>
> You reversed them, net-2.6.25.git is for bleeding edge stuff, net-2.6.git is for bug fixes only.

Sorry - thanks for clarifying it.

	Sam - who should refrain from thinking too much in current crappy condition
[PATCH] Re: Nested VLAN causes recursive locking error
On 19-12-2007 00:03, Chuck Ebbert wrote:
> From: https://bugzilla.redhat.com/show_bug.cgi?id=426164
>
> kernel version is 2.6.24-0.107.rc5.git3.fc9
>
> From boot log on serial console: (full log attached)
>
> Added VLAN with VID == 2 to IF -:eth0.1568:-
>
> [ INFO: possible recursive locking detected ]
> 2.6.24-0.107.rc5.git3.fc9 #1
>
> ifconfig/15011 is trying to acquire lock:
>  (vlan_netdev_xmit_lock_key){-+..}, at: [c05d9450] dev_mc_sync+0x1c/0x102
>
> but task is already holding lock:
>  (vlan_netdev_xmit_lock_key){-+..}, at: [c05d51bd] dev_set_rx_mode+0x14/0x3c
>
> other info that might help us debug this:
> 2 locks held by ifconfig/15011:
>  #0: (rtnl_mutex){--..}, at: [c05de4f7] rtnl_lock+0xf/0x11
>  #1: (vlan_netdev_xmit_lock_key){-+..}, at: [c05d51bd] dev_set_rx_mode+0x14/0x3c
...

Subject: [PATCH] nested VLAN: fix lockdep's recursive locking warning

Allow vlans nesting other vlans without lockdep's warnings (max. 8 levels).

Reported-by: Benny Amorsen
Tested-by: Benny Amorsen(?) NEEDS TESTING!
Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

---

diff -Nurp linux-2.6.24-rc5-/net/8021q/vlan.c linux-2.6.24-rc5+/net/8021q/vlan.c
--- linux-2.6.24-rc5-/net/8021q/vlan.c	2007-12-17 13:29:19.0 +0100
+++ linux-2.6.24-rc5+/net/8021q/vlan.c	2007-12-20 14:21:02.0 +0100
@@ -307,12 +307,15 @@ int unregister_vlan_device(struct net_de
 	return ret;
 }
 
+#ifdef CONFIG_LOCKDEP
 /*
  * vlan network devices have devices nesting below it, and are a special
  * super class of normal network devices; split their locks off into a
  * separate class since they always nest.
  */
 static struct lock_class_key vlan_netdev_xmit_lock_key;
+static int subclass; /* vlan nesting vlan */
+#endif
 
 static const struct header_ops vlan_header_ops = {
 	.create = vlan_dev_hard_header,
@@ -349,7 +352,14 @@ static int vlan_dev_init(struct net_devi
 		dev->hard_start_xmit = vlan_dev_hard_start_xmit;
 	}
 
-	lockdep_set_class(&dev->_xmit_lock, &vlan_netdev_xmit_lock_key);
+#ifdef CONFIG_LOCKDEP
+	if ((real_dev->priv_flags & IFF_802_1Q_VLAN) &&
+	    subclass < MAX_LOCKDEP_SUBCLASSES - 1)
+		subclass++;
+
+	lockdep_set_class_and_subclass(&dev->_xmit_lock,
+				       &vlan_netdev_xmit_lock_key, subclass);
+#endif
 
 	return 0;
 }
[PATCH/.24] [NET] fs_enet: check for phydev existence in the ethtool handlers
Otherwise an oops will happen if the ethernet device has not been opened:

Unable to handle kernel paging request for data at address 0x014c
Faulting instruction address: 0xc016f7f0
Oops: Kernel access of bad area, sig: 11 [#1]
MPC85xx
NIP: c016f7f0 LR: c01722a0 CTR:
REGS: c79ddc70 TRAP: 0300 Not tainted (2.6.24-rc3-g820a386b)
MSR: 00029000 EE,ME CR: 20004428 XER: 2000
DEAR: 014c, ESR:
TASK = c789f5e0[999] 'snmpd' THREAD: c79dc000
GPR00: c01aceb8 c79ddd20 c789f5e0 c79ddd3c c79ddd64
GPR08: c7845b60 c79dde3c c01ace80 20004422 200249fc 02a0 100da728
GPR16: 100c 20022078 0009 200220e0 bfc85558
GPR24: c79ddd3c c02e0e70 c022fc64 c7845800 bfc85498
NIP [c016f7f0] phy_ethtool_gset+0x0/0x4c
LR [c01722a0] fs_get_settings+0x18/0x28
Call Trace:
[c79ddd20] [c79dde38] 0xc79dde38 (unreliable)
[c79ddd30] [c01aceb8] dev_ethtool+0x294/0x11ec
[c79dde30] [c01aaa44] dev_ioctl+0x454/0x6a8
[c79ddeb0] [c019b9d4] sock_ioctl+0x84/0x230
[c79dded0] [c007ded8] do_ioctl+0x34/0x8c
[c79ddee0] [c007dfbc] vfs_ioctl+0x8c/0x41c
[c79ddf10] [c007e38c] sys_ioctl+0x40/0x74
[c79ddf40] [c000d4c0] ret_from_syscall+0x0/0x3c
Instruction dump:
8163 800b0030 2f80 419e0010 7c0803a6 4e800021 7c691b78 80010014
7d234b78 38210010 7c0803a6 4e800020 8003014c 7c6b1b78 3860 90040004

Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]
---
 drivers/net/fs_enet/fs_enet-main.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index f2a4d39..23fddc3 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -897,14 +897,21 @@ static void fs_get_regs(struct net_device *dev, struct ethtool_regs *regs,
 static int fs_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
 	struct fs_enet_private *fep = netdev_priv(dev);
+
+	if (!fep->phydev)
+		return -ENODEV;
+
 	return phy_ethtool_gset(fep->phydev, cmd);
 }
 
 static int fs_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
 {
 	struct fs_enet_private *fep = netdev_priv(dev);
-	phy_ethtool_sset(fep->phydev, cmd);
-	return 0;
+
+	if (!fep->phydev)
+		return -ENODEV;
+
+	return phy_ethtool_sset(fep->phydev, cmd);
 }
 
 static int fs_nway_reset(struct net_device *dev)
--
1.5.2.2
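The guard added by the patch above can be modelled outside the kernel. The sketch below uses hypothetical stand-in types (phy_model, enet_model) and a hypothetical model_get_speed() helper; none of them are the driver's real structures. It only illustrates the pattern: return -ENODEV instead of dereferencing a PHY pointer that is still NULL because the device has not been opened.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the kernel types. */
struct phy_model {
	int speed;
};

struct enet_model {
	struct phy_model *phydev;	/* NULL until the device is opened */
};

/* Report the PHY speed, or -ENODEV when no PHY is attached yet. */
static int model_get_speed(const struct enet_model *fep, int *speed)
{
	assert(fep != NULL && speed != NULL);

	if (!fep->phydev)
		return -ENODEV;

	*speed = fep->phydev->speed;
	return 0;
}
```

A caller that queries a never-opened device then gets an error code back and its output variable is left untouched, rather than faulting on the NULL pointer.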
Re: TSO trimming question
On Thu, Dec 20, 2007 at 04:00:37AM -0800, David Miller wrote:
>
> In the most ideal sense, tcp_window_allows() should probably be changed to only return MSS multiples.
>
> Unfortunately this would add an expensive modulo operation, however I think it would eliminate this problem case.

Well you only have to divide in the unlikely case of us being limited by the receiver window. In that case speed is probably not of the essence anyway.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
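The idea discussed in this thread, returning only MSS multiples and paying for the modulo only when the receiver window is the limit, could look roughly like the following. This is a sketch under stated assumptions: window_allows_mss_multiple() is a hypothetical helper with hypothetical parameters, not the kernel's tcp_window_allows(), and it ignores everything else the real function has to consider.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the kernel's tcp_window_allows(): returns how
 * many bytes of a len-byte skb may be sent given `window` bytes of
 * receiver window left and an MSS of `mss` bytes.
 */
static unsigned int window_allows_mss_multiple(unsigned int window,
					       unsigned int len,
					       unsigned int mss)
{
	assert(mss > 0);

	/* Common case: the window is not the limit, so no division at all. */
	if (len <= window)
		return len;

	/* Window-limited: trim to a whole number of segments. */
	return window - window % mss;
}
```

Note that in this sketch a window smaller than one MSS yields 0, meaning nothing is sent; whether that is the right policy is a separate question that the thread does not settle.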
Re: After many hours all outbound connections get stuck in SYN_SENT
[speculation by network engineer -- not kernel hacker -- follows]

> The router could be sooo crappy that it drops all packets from TCP streams that have SACK enabled and the client has opened 200+ SACK connections previously... something like that?

As far as any third party is concerned, the existing TCP connections continue to have negotiated SACK Permitted. Only new connections will not negotiate this. So router crappiness promptly disappearing doesn't seem too likely (one way I could see this happening is if the Linux box sends an Ack for each connection and this clears out Sack data structures on the third party).

But I'd be very surprised if the router is acting as anything more than a network-layer device. It might perhaps have some soft connection state being used for generating accounting records. Being Cisco it's probably a switch-router, so it might carry some per-port hard state for validating source IP addresses and ARPs on each port.

The firewall is much more likely to be carrying per-flow Sack state. The Cisco PIX had a bug with SACK handling (CSCse14419, fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141) but perhaps it has regressed). A simple trace either side of the firewall will show the inconsistency between the TCP sequence number (which gets randomised) and the Sack sequence number (which didn't). You could disable the TCP Sequence Number Randomisation feature and see if the fault reoccurs.

You should probably also investigate the Linux kernel, especially the size and locks of the components of the Sack data structures and what happens to those data structures after Sack is disabled (presumably the Sack data structure is in some unhappy circumstance, and disabling Sack allows the data to be discarded, magically unclogging the box).

In the absence of the reporter wanting to dump the kernel's core, how about a patch to print the Sack data structure when the command to disable Sack is received by the kernel? Maybe just print the last 16b of the IP address?

Best wishes, Glen
Re: [PATCH/.24] [NET] fs_enet: check for phydev existence in the ethtool handlers
On Thu, 20 Dec 2007 16:59:23 +0300 Anton Vorontsov wrote: Otherwise oops will happen if ethernet device has not been opened: Unable to handle kernel paging request for data at address 0x014c Faulting instruction address: 0xc016f7f0 Oops: Kernel access of bad area, sig: 11 [#1] MPC85xx NIP: c016f7f0 LR: c01722a0 CTR: REGS: c79ddc70 TRAP: 0300 Not tainted (2.6.24-rc3-g820a386b) MSR: 00029000 EE,ME CR: 20004428 XER: 2000 DEAR: 014c, ESR: TASK = c789f5e0[999] 'snmpd' THREAD: c79dc000 GPR00: c01aceb8 c79ddd20 c789f5e0 c79ddd3c c79ddd64 GPR08: c7845b60 c79dde3c c01ace80 20004422 200249fc 02a0 100da728 GPR16: 100c 20022078 0009 200220e0 bfc85558 GPR24: c79ddd3c c02e0e70 c022fc64 c7845800 bfc85498 NIP [c016f7f0] phy_ethtool_gset+0x0/0x4c LR [c01722a0] fs_get_settings+0x18/0x28 Call Trace: [c79ddd20] [c79dde38] 0xc79dde38 (unreliable) [c79ddd30] [c01aceb8] dev_ethtool+0x294/0x11ec [c79dde30] [c01aaa44] dev_ioctl+0x454/0x6a8 [c79ddeb0] [c019b9d4] sock_ioctl+0x84/0x230 [c79dded0] [c007ded8] do_ioctl+0x34/0x8c [c79ddee0] [c007dfbc] vfs_ioctl+0x8c/0x41c [c79ddf10] [c007e38c] sys_ioctl+0x40/0x74 [c79ddf40] [c000d4c0] ret_from_syscall+0x0/0x3c Instruction dump: 8163 800b0030 2f80 419e0010 7c0803a6 4e800021 7c691b78 80010014 7d234b78 38210010 7c0803a6 4e800020 8003014c 7c6b1b78 3860 90040004 Signed-off-by: Anton Vorontsov [EMAIL PROTECTED] Acked-by: Vitaly Bordug [EMAIL PROTECTED] Jeff: this fix is important and should be merged if possible. 
--- drivers/net/fs_enet/fs_enet-main.c | 11 +-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c index f2a4d39..23fddc3 100644 --- a/drivers/net/fs_enet/fs_enet-main.c +++ b/drivers/net/fs_enet/fs_enet-main.c @@ -897,14 +897,21 @@ static void fs_get_regs(struct net_device *dev, struct ethtool_regs *regs, static int fs_get_settings(struct net_device *dev, struct ethtool_cmd *cmd) { struct fs_enet_private *fep = netdev_priv(dev); + + if (!fep-phydev) + return -ENODEV; + return phy_ethtool_gset(fep-phydev, cmd); } static int fs_set_settings(struct net_device *dev, struct ethtool_cmd *cmd) { struct fs_enet_private *fep = netdev_priv(dev); - phy_ethtool_sset(fep-phydev, cmd); - return 0; + + if (!fep-phydev) + return -ENODEV; + + return phy_ethtool_sset(fep-phydev, cmd); } static int fs_nway_reset(struct net_device *dev) -- Sincerely, Vitaly -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TSO trimming question
David Miller wrote:
> From: Ilpo_Järvinen [EMAIL PROTECTED]
> Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)
>
> > [PATCH] [TCP]: Fix TSO deferring
> >
> > I'd say that most of what tcp_tso_should_defer had in between there was dead code because of this.
> >
> > Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
>
> Yikes! John, we've been living a lie for more than a year. :-/
>
> On the bright side this explains a lot of small TSO frames I've been seeing in traces over the past year but never got a chance to investigate.

Ouch. This fix may improve some benchmarks. Re-checking this function was on my list of things to do because I had also noticed some TSO frames that seemed a bit small. This clearly explains it.

  -John
Re: After many hours all outbound connections get stuck in SYN_SENT
> I still don't understand.
>
> tcpdump -p -n -s 1600 -c 1 doesn't reveal user data at all.
>
> Without any exact data from you, I am afraid nobody can help.

Oh, I didn't see that you specified specific options. I'll still have to anonymize 2000+ IP addresses, but I think there is an open source tool that will do this for you.

> > > 2) Are you sure you are not using connection tracking, and hit a limit on it ?
> >
> > I'm using ip_conntrack, but the limit I have for max entries is 65K. The most I've seen in there are a couple thousand- that was one of the first things I monitored very closely.
>
> Now please try without conn tracking module. I saw many failures in the past that were triggered by conntrack.
>
> Do you have some firewall rules, using some netfilter modules like hashlimit ?

I will have to look into this.
Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly
i see. HZ can be < 1000.. i should be wrong.

however, i got the following,

[root iproute2.org]# ./ip/ip route change 192.168.140.0/24 dev eth1 rto_min 4s
[root iproute2.org]# gdb -q ./ip/ip
Using host libthread_db library /lib/libthread_db.so.1.
(gdb) br iproute.c:512
Breakpoint 1 at 0x804fc8d: file iproute.c, line 512.
(gdb) r route show dev eth1
Starting program: /root/iproute2.org/ip/ip route show dev eth1

Breakpoint 1, print_route (who=0xbfb9854c, n=0xbfb94528, arg=0x6404c0) at iproute.c:512
512	unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
(gdb) l 512,522
512	unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
513
514	val *= 1000;
515	if (i == RTAX_RTT)
516		val /= 8;
517	else if (i == RTAX_RTTVAR)
518		val /= 4;
519	if (val >= hz)
520		fprintf(fp, "%ums", val/hz);
521	else
522		fprintf(fp, "%.2fms", (float)val/hz);
(gdb) p hz
$1 = 10
(gdb) n
514	val *= 1000;
(gdb) p val
$2 = 40
(gdb) p val/ (hz / 1000)
$3 = 4000
(gdb) n
515	if (i == RTAX_RTT)
(gdb) p val
$4 = 1385447424
(gdb) c
Continuing.
192.168.140.0/24 scope link rto_min lock 1ms

Program exited normally.
(gdb)

Thanks,
Satoru SATOH

2007/12/20, Jarek Poplawski [EMAIL PROTECTED]:
> On 20-12-2007 04:31, Satoru SATOH wrote:
> > ip route show does not print correct value when larger rto_min is set (e.g. 3sec).
> >
> > This problem is because of overflow in print_route() and the patch below is a workaround fix for that.
> >
> ...
> > --- a/ip/iproute.c
> > +++ b/ip/iproute.c
> > @@ -510,16 +510,16 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
> >  	fprintf(fp, "%u", *(unsigned*)RTA_DATA(mxrta[i]));
> >  else {
> >  	unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
> > +	unsigned hz1 = hz / 1000;
> ...
> > +	if (val >= hz1)
> > +		fprintf(fp, "%ums", val/hz1);
> ...
>
> Probably I miss something or my iproute sources are too old, but: does this work with hz < 1000?
>
> Regards,
> Jarek P.
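The overflow shown in the gdb session can be reproduced with plain 32-bit arithmetic. This is a sketch with assumed inputs: the printed values of hz and val look truncated in the archived trace, so the test values 4000000000 and 1000000000 are an assumption, chosen because they are consistent with the 1385447424 and 4000 that the session does show. Both function names are hypothetical, and the 64-bit variant is one possible alternative, not the patch that was posted.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the arithmetic in the listing: multiply first, in 32 bits. */
static uint32_t ticks_to_ms_32(uint32_t val, uint32_t hz)
{
	assert(hz > 0);

	val *= 1000;		/* wraps modulo 2^32 for large val */
	return val / hz;
}

/* Alternative (an assumption, not the posted patch): widen before multiplying. */
static uint64_t ticks_to_ms_64(uint32_t val, uint32_t hz)
{
	assert(hz > 0);

	return (uint64_t)val * 1000 / hz;
}
```

Widening to 64 bits avoids both the wraparound and the division by zero that dividing by hz / 1000 would hit for any hz below 1000, which is the case the quoted question asks about.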
Please pull 'fixes-jgarzik' branch of wireless-2.6
Jeff, Here are a few more for 2.6.24...please let me know if there are any problems! Thanks, John P.S. The rtl8187 USB ID is already in your upstream branch -- I'm sure it would seem like a fix if it was the ID for your wireless stick. :-) --- Individual patches are available here: http://www.kernel.org//pub/linux/kernel/people/linville/wireless-2.6/fixes-jgarzik --- The following changes since commit 82d29bf6dc7317aeb0a3a13c2348ca8591965875: Linus Torvalds (1): Linux 2.6.24-rc5 are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git fixes-jgarzik Matthias Mueller (1): rtl8187: Add USB ID for Sitecom WL-168 v1 001 Michael Wu (1): p54: add Kconfig description Reinette Chatre (1): ipw2200: prevent alloc of unspecified size on stack Zhu Yi (1): iwlwifi: fix possible priv-mutex deadlock during suspend drivers/net/wireless/Kconfig| 51 +++ drivers/net/wireless/ipw2200.c | 13 ++- drivers/net/wireless/iwlwifi/iwl3945-base.c | 18 +++--- drivers/net/wireless/iwlwifi/iwl4965-base.c | 18 +++--- drivers/net/wireless/rtl8187_dev.c |2 + 5 files changed, 75 insertions(+), 27 deletions(-) diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig index 2b733c5..7bdf9da 100644 --- a/drivers/net/wireless/Kconfig +++ b/drivers/net/wireless/Kconfig @@ -586,15 +586,66 @@ config ADM8211 config P54_COMMON tristate Softmac Prism54 support depends on MAC80211 WLAN_80211 FW_LOADER EXPERIMENTAL + ---help--- + This is common code for isl38xx based cards. + This module does nothing by itself - the USB/PCI frontends + also need to be enabled in order to support any devices. + + These devices require softmac firmware which can be found at + http://prism54.org/ + + If you choose to build a module, it'll be called p54common. config P54_USB tristate Prism54 USB support depends on P54_COMMON USB select CRC32 + ---help--- + This driver is for USB isl38xx based wireless cards. 
+ These are USB based adapters found in devices such as: + + 3COM 3CRWE254G72 + SMC 2862W-G + Accton 802.11g WN4501 USB + Siemens Gigaset USB + Netgear WG121 + Netgear WG111 + Medion 40900, Roper Europe + Shuttle PN15, Airvast WM168g, IOGear GWU513 + Linksys WUSB54G + Linksys WUSB54G Portable + DLink DWL-G120 Spinnaker + DLink DWL-G122 + Belkin F5D7050 ver 1000 + Cohiba Proto board + SMC 2862W-G version 2 + U.S. Robotics U5 802.11g Adapter + FUJITSU E-5400 USB D1700 + Sagem XG703A + DLink DWL-G120 Cohiba + Spinnaker Proto board + Linksys WUSB54AG + Inventel UR054G + Spinnaker DUT + + These devices require softmac firmware which can be found at + http://prism54.org/ + + If you choose to build a module, it'll be called p54usb. config P54_PCI tristate Prism54 PCI support depends on P54_COMMON PCI + ---help--- + This driver is for PCI isl38xx based wireless cards. + This driver supports most devices that are supported by the + fullmac prism54 driver plus many devices which are not + supported by the fullmac driver/firmware. + + This driver requires softmac firmware which can be found at + http://prism54.org/ + + If you choose to build a module, it'll be called p54pci. source drivers/net/wireless/iwlwifi/Kconfig source drivers/net/wireless/hostap/Kconfig diff --git a/drivers/net/wireless/ipw2200.c b/drivers/net/wireless/ipw2200.c index 54f44e5..38ce8ee 100644 --- a/drivers/net/wireless/ipw2200.c +++ b/drivers/net/wireless/ipw2200.c @@ -1233,9 +1233,19 @@ static ssize_t show_event_log(struct device *d, { struct ipw_priv *priv = dev_get_drvdata(d); u32 log_len = ipw_get_event_log_len(priv); - struct ipw_event log[log_len]; + u32 log_size; + struct ipw_event *log; u32 len = 0, i; + /* not using min() because of its strict type checking */ + log_size = PAGE_SIZE / sizeof(*log) log_len ? 
+ sizeof(*log) * log_len : PAGE_SIZE; + log = kzalloc(log_size, GFP_KERNEL); + if (!log) { + IPW_ERROR(Unable to allocate memory for log\n); + return 0; + } + log_len = log_size / sizeof(*log); ipw_capture_event_log(priv, log_len, log); len += snprintf(buf + len, PAGE_SIZE - len, %08X, log_len); @@ -1244,6 +1254,7 @@ static ssize_t show_event_log(struct device *d, \n%08X%08X%08X, log[i].time, log[i].event, log[i].data); len += snprintf(buf + len,
Please pull 'fixes-davem' branch of wireless-2.6
Dave, A few more stragglers for 2.6.24...let me know if there are any problems! Thanks, John --- Individual patches available here: http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/fixes-davem --- The following changes since commit 82d29bf6dc7317aeb0a3a13c2348ca8591965875: Linus Torvalds (1): Linux 2.6.24-rc5 are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git fixes-davem Johannes Berg (2): mac80211: round station cleanup timer mac80211: warn when receiving frames with unaligned data net/mac80211/rx.c | 13 + net/mac80211/sta_info.c |7 +-- 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c index 00f908d..a7263fc 100644 --- a/net/mac80211/rx.c +++ b/net/mac80211/rx.c @@ -1443,6 +1443,7 @@ void __ieee80211_rx(struct ieee80211_hw *hw, struct sk_buff *skb, struct ieee80211_sub_if_data *prev = NULL; struct sk_buff *skb_new; u8 *bssid; + int hdrlen; /* * key references and virtual interfaces are protected using RCU @@ -1472,6 +1473,18 @@ void __ieee80211_rx(struct ieee80211_hw *hw, struct sk_buff *skb, rx.fc = le16_to_cpu(hdr-frame_control); type = rx.fc IEEE80211_FCTL_FTYPE; + /* +* Drivers are required to align the payload data to a four-byte +* boundary, so the last two bits of the address where it starts +* may not be set. The header is required to be directly before +* the payload data, padding like atheros hardware adds which is +* inbetween the 802.11 header and the payload is not supported, +* the driver is required to move the 802.11 header further back +* in that case. 
+*/ + hdrlen = ieee80211_get_hdrlen(rx.fc); + WARN_ON_ONCE(((unsigned long)(skb-data + hdrlen)) 3); + if (type == IEEE80211_FTYPE_DATA || type == IEEE80211_FTYPE_MGMT) local-dot11ReceivedFragmentCount++; diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index e849155..cfd8ee9 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -14,6 +14,7 @@ #include linux/slab.h #include linux/skbuff.h #include linux/if_arp.h +#include linux/timer.h #include net/mac80211.h #include ieee80211_i.h @@ -306,7 +307,8 @@ static void sta_info_cleanup(unsigned long data) } read_unlock_bh(local-sta_lock); - local-sta_cleanup.expires = jiffies + STA_INFO_CLEANUP_INTERVAL; + local-sta_cleanup.expires = + round_jiffies(jiffies + STA_INFO_CLEANUP_INTERVAL); add_timer(local-sta_cleanup); } @@ -345,7 +347,8 @@ void sta_info_init(struct ieee80211_local *local) INIT_LIST_HEAD(local-sta_list); init_timer(local-sta_cleanup); - local-sta_cleanup.expires = jiffies + STA_INFO_CLEANUP_INTERVAL; + local-sta_cleanup.expires = + round_jiffies(jiffies + STA_INFO_CLEANUP_INTERVAL); local-sta_cleanup.data = (unsigned long) local; local-sta_cleanup.function = sta_info_cleanup; -- John W. Linville [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
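The alignment rule described in the comment of the rx.c hunk above (the payload after the 802.11 header must start on a four-byte boundary, so the two low address bits must be clear) can be checked with a few lines of userspace C. payload_is_aligned() is a hypothetical helper for illustration only; the patch itself performs the test inline in a WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace check: true when the payload that follows an
 * hdrlen-byte header starting at `data` begins on a four-byte boundary,
 * i.e. the two low bits of its address are clear.
 */
static bool payload_is_aligned(const void *data, size_t hdrlen)
{
	assert(data != NULL);

	return ((uintptr_t)((const unsigned char *)data + hdrlen) & 3) == 0;
}
```

With a 24-byte header the frame itself must start on a four-byte boundary; with a 26-byte header the frame has to start two bytes off a boundary so that the payload ends up aligned.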
Please pull 'upstream-davem' branch of wireless-2.6
Dave,

These are destined for 2.6.25. The patches fall mostly into two categories: a new rate control algorithm for mac80211, and some cfg80211 enhancements (including mac80211 patches to use them). Also there are some small hits in the iwlwifi drivers related to rate control. I'll CC Jeff since his tree has a lot of iwlwifi symbol renames and those patches will conflict (or break the build, or both) when your tree and his finally come together.

Let me know if there are any problems!

John

P.S. I have a few more related to the cfg80211 changes, but the patches are cross-dependent on both your tree and Jeff's. I will probably send those to akpm in the meantime, and push them after Linus has pulled both your tree and Jeff's in the 2.6.25 merge window.

---

Individual patches are available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-davem

---

The following changes since commit adc292d3280278282d7b0e0813ccda711e739b5f:
  Herbert Xu (1):
        [IPSEC]: Do xfrm_state_check_space before encapsulation

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-davem

Johannes Berg (13):
      mac80211: clean up eapol frame handling/port control
      mac80211: clean up eapol handling in TX path
      mac80211: make ieee80211_rx_mgmt_action static
      mac80211: allow easier multicast/broadcast buffering in hardware
      cfg80211/nl80211: introduce key handling
      mac80211: support adding/removing keys via cfg80211
      mac80211: support getting key sequence counters via cfg80211
      cfg80211/nl80211: add beacon settings
      cfg80211/nl80211: station handling
      cfg80211/nl80211: implement station attribute retrieval
      mac80211: implement station stats retrieval
      mac80211: move tx crypto decision
      mac80211: don't read ERP information from (re)association response

Mattias Nissler (4):
      mac80211: clean up rate selection
      mac80211: add PID controller based rate control algorithm
      rc80211-pid: add debugging
      rc80211-pid: export tuning parameters through debugfs

Ron Rindjunsky (1):
      mac80211: pass in PS_POLL frames

Stefano Brivio (4):
      mac80211: make PID rate control algorithm the default
      rc80211-pid: add rate behaviour learning algorithm
      rc80211-pid: add sharpening factor
      doc: fix typo in feature-removal-schedule

 Documentation/feature-removal-schedule.txt |   10 +-
 drivers/net/wireless/iwlwifi/iwl-3945-rs.c |   44 +--
 drivers/net/wireless/iwlwifi/iwl-4965-rs.c |   46 +--
 include/linux/nl80211.h                    |  154 ++
 include/net/cfg80211.h                     |  167 +++
 include/net/mac80211.h                     |   17 +-
 net/mac80211/Kconfig                       |   63 +++-
 net/mac80211/Makefile                      |   16 +-
 net/mac80211/cfg.c                         |  202 -
 net/mac80211/debugfs_netdev.c              |   27 +-
 net/mac80211/ieee80211.c                   |   21 +-
 net/mac80211/ieee80211_i.h                 |   24 +-
 net/mac80211/ieee80211_iface.c             |    1 -
 net/mac80211/ieee80211_rate.c              |   59 +++-
 net/mac80211/ieee80211_rate.h              |   76 ++--
 net/mac80211/ieee80211_sta.c               |   35 +-
 net/mac80211/rc80211_pid.h                 |  261 ++
 net/mac80211/rc80211_pid_algo.c            |  510 +++
 net/mac80211/rc80211_pid_debugfs.c         |  223 +
 net/mac80211/rc80211_simple.c              |   64 +--
 net/mac80211/rx.c                          |  144 +++---
 net/mac80211/tx.c                          |  171 ---
 net/mac80211/util.c                        |   24 +-
 net/mac80211/wep.c                         |   10 -
 net/mac80211/wpa.c                         |   14 -
 net/wireless/core.c                        |    3 +
 net/wireless/nl80211.c                     |  737
 27 files changed, 2692 insertions(+), 431 deletions(-)
 create mode 100644 net/mac80211/rc80211_pid.h
 create mode 100644 net/mac80211/rc80211_pid_algo.c
 create mode 100644 net/mac80211/rc80211_pid_debugfs.c

Omnibus patch attached as 'upstream-davem.patch.bz2' due to size concerns.

--
John W. Linville
[EMAIL PROTECTED]

upstream-davem.patch.bz2
Description: BZip2 compressed data
Please pull 'upstream-jgarzik' branch of wireless-2.6
Jeff,

More for 2.6.25...Mr. Woodhouse continues his savage assault on libertas, the b43legacy version of the rfkill led patch is here (b43legacy rfkill stuff is not in 2.6.24), and there are a couple of iwlwifi patches as well.

Let me know if there are problems!

Thanks,

John

---

Individual patches are available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-jgarzik

---

The following changes since commit b503d38b01bf313e4f1250c4ded89fc10a1d3da0:
  Ramkrishna Vepa (1):
        S2io: Fixes to enable multiple transmit fifos

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-jgarzik

David Woodhouse (38):
      libertas: don't exit worker thread until kthread_stop() is called
      libertas: stop attempting to reset devices on unload
      libertas: clean up if_usb driver
      libertas: kill whitespace at end of lines
      libertas: kill unused wait_option field in struct cmd_ctrl_node
      libertas: rename and clean up DownloadcommandToStation
      libertas: don't use __lbs_cmd() with empty callback in if_usb.c
      libertas: remove some pointless checks for cmdnode buffer being present
      libertas: introduce and use lbs_complete_command() for command completion
      libertas: don't re-initialise cmdnode when taking it off the free queue
      libertas: kill cleanup_cmdnode()
      libertas: let __lbs_cmd() free its own cmdnode
      libertas: kill pdata_buf member of struct cmd_ctrl_node
      libertas: store command result in cmdnode instead of priv->cur_cmd_retcode
      libertas: add __lbs_cmd_async() for asynchronous command submission
      libertas: ensure response buffer size is always set for lbs_cmd_with_response
      libertas: handle command timeout in main thread instead of directly in timer
      libertas: kill 'addtail' argument to lbs_queue_cmd() and make it static
      libertas: fix return from lbs_update_channel()
      libertas: add SLEEP_PERIOD and FW_WAKE_METHOD command definitions
      libertas: fix buffer handling of PS_MODE commands and responses
      libertas: don't clear priv->dnld_sent after sending sleep confirm
      libertas: handle HOST_AWAKE event by sending WAKEUP_CONFIRM command
      libertas: allow for PS mode to be disabled when firmware doesn't support it
      libertas: Check for PS mode support on USB devices
      libertas: reduce explicit references to priv->cur_cmd->cmdbuf
      libertas: use priv->upld_buf for command responses
      libertas: discard DEFER responses to commands; let the timeout trigger
      libertas: make lbs_submit_command always 'succeed' and set command timer
      libertas: submit RSSI command on tx timeout, to check whether module is dead
      libertas: convert RADIO_CONTROL to a direct command
      libertas: convert INACTIVITY_TIMEOUT to a direct command
      libertas: convert SLEEP_PARAMS to a direct command
      libertas: convert SET_WEP to a direct command
      libertas: convert ENABLE_RSN to a direct command
      libertas: change inference about buffer size in lbs_cmd()
      libertas: convert SUBSCRIBE_EVENT to a direct command
      libertas: remove check for driver_lock in lbs_interrupt()

Larry Finger (1):
      b43legacy: Fix rfkill radio LED

Zhu Yi (2):
      iwlwifi: proper monitor support
      iwlwifi: skip mac80211 conf during a hardware scan and replay it afterwards

 drivers/net/wireless/b43legacy/leds.c       |    4 +
 drivers/net/wireless/b43legacy/main.c       |   20 +-
 drivers/net/wireless/b43legacy/rfkill.c     |  133 ---
 drivers/net/wireless/iwlwifi/iwl-3945.c     |  120 +-
 drivers/net/wireless/iwlwifi/iwl-3945.h     |   38 +--
 drivers/net/wireless/iwlwifi/iwl-4965.c     |  120 ++-
 drivers/net/wireless/iwlwifi/iwl-4965.h     |   26 +--
 drivers/net/wireless/iwlwifi/iwl3945-base.c |  139 +--
 drivers/net/wireless/iwlwifi/iwl4965-base.c |  122 +--
 drivers/net/wireless/libertas/assoc.c       |   61 ++--
 drivers/net/wireless/libertas/cmd.c         |  565 +++
 drivers/net/wireless/libertas/cmd.h         |   29 ++-
 drivers/net/wireless/libertas/cmdresp.c     |  162 +++-
 drivers/net/wireless/libertas/debugfs.c     |  350 -
 drivers/net/wireless/libertas/decl.h        |    9 +-
 drivers/net/wireless/libertas/dev.h         |   19 +-
 drivers/net/wireless/libertas/host.h        |    8 +
 drivers/net/wireless/libertas/hostcmd.h     |   47 ++-
 drivers/net/wireless/libertas/if_cs.c       |   10 +-
 drivers/net/wireless/libertas/if_sdio.c     |   10 +-
 drivers/net/wireless/libertas/if_usb.c      |  470 ++-
 drivers/net/wireless/libertas/if_usb.h      |   95 ++---
 drivers/net/wireless/libertas/main.c        |   92 +++--
 drivers/net/wireless/libertas/tx.c          |    4 +-
 drivers/net/wireless/libertas/wext.c        |    7 +
 25 files changed, 1200 insertions(+), 1460 deletions(-)

Omnibus
Re: After many hours all outbound connections get stuck in SYN_SENT
But I'd be very surprised if the router is acting as anything more than a network-layer device. It might perhaps have some soft connection state being used for generating accounting records. Being Cisco it's probably a switch-router, so it might carry some per-port hard state for validating source IP addresses and ARPs on each port.

The firewall is much more likely to be carrying per-flow Sack state. The Cisco PIX had a bug with SACK handling (CSCse14419, fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141), but perhaps it has regressed). A simple trace either side of the firewall will show the inconsistency between the TCP sequence number (which gets randomised) and the Sack sequence number (which didn't). You could disable the TCP Sequence Number Randomisation feature and see if the fault reoccurs.

I do have TCP Sequence # Randomization enabled on my router. However, if this was causing an issue, wouldn't it always occur and cause connection issues, not just after 38 hours of correct operation? I can look into turning this off, but I'll likely have to jump through several hoops, which will be challenging if I don't have a very clear, definitive reason why this is causing the issue. Plus, I've had this problem with at least 2 other sets of network switches over the past 4 years.

I'm actually running 7.0(6), which doesn't have the fix you mentioned. If it really is possible that this issue wouldn't always cause problems, but only after hours of successful operation, then I could probably motivate the upgrade. I can try to set up a trace, but this is a lot of work for other people in my organization, so it will take quite some time.

You should probably also investigate the Linux kernel, especially the size and locks of the components of the Sack data structures and what happens to those data structures after Sack is disabled (presumably the Sack data structure is in some unhappy circumstance, and disabling Sack allows the data to be discarded, magically unclogging the box). In the absence of the reporter wanting to dump the kernel's core, how about a patch to print the Sack data structure when the command to disable Sack is received by the kernel? Maybe just print the last 16b of the IP address?

Given the fact that I've had this problem for so long, over a variety of networking hardware vendors and colo facilities, this really sounds good to me. It will be challenging for me to justify a kernel core dump, but a simple patch to dump the Sack data would be doable.
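To make the suspected failure mode concrete, here is a minimal userspace sketch of a middlebox that shifts TCP sequence numbers by a per-flow offset but forgets to apply the same shift to SACK block edges. Everything in it (the `sketch_` names, the offset value, the window numbers) is made up for illustration; it is not PIX or Linux code, only the arithmetic behind the "inconsistency" a trace on either side of the firewall would show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-flow state: the offset the middlebox adds to every
 * sequence number sent by the inside host. */
struct sketch_flow {
	uint32_t seq_offset;
};

/* Inside -> outside: sequence numbers are shifted (wraps modulo 2^32). */
static uint32_t sketch_to_outside(const struct sketch_flow *f, uint32_t seq)
{
	return seq + f->seq_offset;
}

/* Outside -> inside: ACK numbers and SACK edges must be shifted back. */
static uint32_t sketch_to_inside(const struct sketch_flow *f, uint32_t seq)
{
	return seq - f->seq_offset;
}

/* A SACK edge only makes sense if it lies within [snd_una, snd_nxt],
 * compared in wrapping 32-bit sequence space. */
static int sketch_sack_edge_plausible(uint32_t snd_una, uint32_t snd_nxt,
				      uint32_t edge)
{
	return (uint32_t)(edge - snd_una) <= (uint32_t)(snd_nxt - snd_una);
}

/* Worked example: the sender has bytes 1000..5000 in flight and the peer
 * SACKs up to what the sender calls 3000. */
static void sketch_sack_demo(void)
{
	struct sketch_flow f = { .seq_offset = 0x10000000u };
	uint32_t snd_una = 1000, snd_nxt = 5000;
	uint32_t edge_outside = sketch_to_outside(&f, 3000);

	/* Correct middlebox: SACK edge is translated back, sender accepts it. */
	assert(sketch_sack_edge_plausible(snd_una, snd_nxt,
					  sketch_to_inside(&f, edge_outside)));
	/* Buggy middlebox: SACK edge passes through untranslated and lands
	 * far outside the sender's window. */
	assert(!sketch_sack_edge_plausible(snd_una, snd_nxt, edge_outside));
}
```

Calling `sketch_sack_demo()` exercises both cases. The sketch says nothing about why a fault would only show up after many hours; it only illustrates why SACK blocks and sequence numbers disagree when one is rewritten and the other is not.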
Re: [PATCH 2/2] net: neighbor timer power saving
On Wed, 19 Dec 2007 08:23:43 +0100 Eric Dumazet [EMAIL PROTECTED] wrote:

Stephen Hemminger wrote:

The neighbor GC timer runs once a second, but it doesn't need to wake up the machine.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- a/net/core/neighbour.c 2007-12-18 07:46:07.0 -0800
+++ b/net/core/neighbour.c 2007-12-18 07:47:36.0 -0800
@@ -270,7 +270,7 @@ static struct neighbour *neigh_alloc(str
 	n->nud_state = NUD_NONE;
 	n->output = neigh_blackhole;
 	n->parms = neigh_parms_clone(&tbl->parms);
-	init_timer(&n->timer);
+	init_timer_deferrable(&n->timer);
 	n->timer.function = neigh_timer_handler;
 	n->timer.data = (unsigned long)n;

@@ -740,7 +740,7 @@ static void neigh_timer_handler(unsigned
 	state = neigh->nud_state;
 	now = jiffies;
-	next = now + HZ;
+	next = round_jiffies(now + HZ);

 	if (!(state & NUD_IN_TIMER)) {
 #ifndef CONFIG_SMP
@@ -1372,7 +1372,7 @@ void neigh_table_init_no_netlink(struct
 	get_random_bytes(&tbl->hash_rnd, sizeof(tbl->hash_rnd));

 	rwlock_init(&tbl->lock);
-	init_timer(&tbl->gc_timer);
+	init_timer_deferrable(&tbl->gc_timer);
 	tbl->gc_timer.data = (unsigned long)tbl;
 	tbl->gc_timer.function = neigh_periodic_timer;
 	tbl->gc_timer.expires = now + 1;

I wonder if this deferrable timer thing is the right way to go. (like the read_mostly thing if you want :) ) We are going to convert 99% of timers to deferrable. Maybe the right move would be to have the reverse attribute, to mark a timer as non-deferrable...

Also, why use round_jiffies() on a deferrable timer? That sounds unnecessary?

Thinking about it more, this looks like a case for just using round_jiffies(). The GC timer needs to run to clean up under DoS attack, and deferring it probably isn't a good idea.

--
Stephen Hemminger [EMAIL PROTECTED]
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Tue, 18 Dec 2007 20:13:28 -0500 (EST) Parag Warudkar [EMAIL PROTECTED] wrote:

sky2 can use deferrable timer for watchdog - reduces wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]

--- linux-2.6/drivers/net/sky2.c 2007-12-07 10:04:39.0 -0500
+++ linux-2.6-work/drivers/net/sky2.c 2007-12-18 20:07:58.0 -0500
@@ -4230,7 +4230,10 @@
 		sky2_show_addr(dev1);
 	}

-	setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
+	hw->watchdog_timer.function = sky2_watchdog;
+	hw->watchdog_timer.data = (unsigned long) hw;
+	init_timer_deferrable(&hw->watchdog_timer);
+
 	INIT_WORK(&hw->restart_work, sky2_restart);

 	pci_set_drvdata(pdev, hw);

Does it really reduce the wakeups or only change who gets charged by powertop? The system is going to wake up once a second anyway. Looks to me that if the timer is using round_jiffies(), setting deferrable just changes the accounting.

My interpretation of the api is:
 * round_jiffies() - timer wants to wake up but isn't precise about when, so schedule it on the next second when the system will wake up anyway; e.g. why meetings are usually scheduled on the hour
 * deferrable - timer doesn't really have to wake up but wants to happen near a particular time; e.g. I'll meet you at the pub around 8pm

Therefore doing deferrable is unnecessary for timers using round_jiffies() unless the system is so good at doing timers that it is going to skip doing timers once per second.

--
Stephen Hemminger [EMAIL PROTECTED]
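The "meetings on the hour" idea can be sketched in a few lines of userspace C. This is a simplified stand-in, not the kernel's round_jiffies(): the tick rate `SKETCH_HZ`, the quarter-second threshold and the explicit `now` parameter are assumptions made for the example, and per-CPU details are left out entirely.

```c
#include <assert.h>

/* Assumed tick rate for this sketch; real kernels choose HZ at build time. */
#define SKETCH_HZ 250UL

/*
 * Move an expiry onto a whole-second boundary so that unrelated timers
 * tend to expire in the same wakeup. Expiries just past a boundary are
 * pulled back to it, everything else is pushed to the next one. If the
 * rounded value would already be in the past, the original expiry is kept.
 */
static unsigned long sketch_round_jiffies(unsigned long expires,
					  unsigned long now)
{
	unsigned long rem = expires % SKETCH_HZ;
	unsigned long rounded;

	if (rem < SKETCH_HZ / 4)
		rounded = expires - rem;		/* round down */
	else
		rounded = expires - rem + SKETCH_HZ;	/* round up */

	return rounded > now ? rounded : expires;
}

/* Two timers armed at different moments end up on the same tick. */
static void sketch_round_demo(void)
{
	unsigned long now = 1000;

	assert(sketch_round_jiffies(now + 400, now) == 1500);
	assert(sketch_round_jiffies(now + 470, now) == 1500);
}
```

Note that rounding only changes when a timer asks to fire; the timer still forces a wakeup at that time, which is the distinction from deferrable timers that the rest of the thread argues about.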
Re: [PATCH] e1000e: Use deferrable timer for watchdog
Parag Warudkar wrote:

Reduce wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]

--- linux-2.6/drivers/net/e1000e/netdev.c 2007-12-07 10:04:39.0 -0500
+++ linux-2.6-work/drivers/net/e1000e/netdev.c 2007-12-18 20:45:59.0 -0500
@@ -3899,7 +3899,7 @@
 		goto err_eeprom;
 	}

-	init_timer(&adapter->watchdog_timer);
+	init_timer_deferrable(&adapter->watchdog_timer);
 	adapter->watchdog_timer.function = e1000_watchdog;
 	adapter->watchdog_timer.data = (unsigned long) adapter;

I can't even apply this patch and the e1000 one... not only is it whitespace damaged, it is also not properly formatted as a patch at all.

If you want me to take these patches seriously, then please fix the formatting issues.

Auke
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Thu, 20 Dec 2007 17:29:23 +

-Original Message-
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 09:16:03
To:[EMAIL PROTECTED]
Cc:netdev@vger.kernel.org, [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: [PATCH] sky2: Use deferrable timer for watchdog

On Tue, 18 Dec 2007 20:13:28 -0500 (EST) Parag Warudkar [EMAIL PROTECTED] wrote:

sky2 can use deferrable timer for watchdog - reduces wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]

--- linux-2.6/drivers/net/sky2.c 2007-12-07 10:04:39.0 -0500
+++ linux-2.6-work/drivers/net/sky2.c 2007-12-18 20:07:58.0 -0500
@@ -4230,7 +4230,10 @@
 		sky2_show_addr(dev1);
 	}

-	setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
+	hw->watchdog_timer.function = sky2_watchdog;
+	hw->watchdog_timer.data = (unsigned long) hw;
+	init_timer_deferrable(&hw->watchdog_timer);
+
 	INIT_WORK(&hw->restart_work, sky2_restart);

 	pci_set_drvdata(pdev, hw);

Does it really reduce the wakeup's or only change who gets charged by powertop? The system is going to wakeup once a second anyway. Looks to me that if the timer is using round_jiffies(), that setting deferrable just changes the accounting.

My interpretation of the api is:
 * round_jiffies() - timer wants to wakeup but isn't precise about when so schedule on next second when system will wake up anyway; e.g why meetings are usually scheduled on the hour
 * deferrable - timer doesn't have to really wakeup but wants to happen near a particular time. e.g. I'll meet you at the pub around 8pm

Therefore doing deferrable is unnecessary for timers using round_jiffies unless system is so good at doing timers that it is going to skip doing timer once per second.

[EMAIL PROTECTED] wrote:

NO_HZ kernels don't do timers every second - if you do round_jiffies() the kernel will wakeup and run the timer at that time no matter what.
The reason deferrable was introduced is to avoid waking up the kernel just for this one timer, which can instead be called when the CPU is not idle for some reason other than this timer. In other words, let's say there were two timers - one non-deferrable expiring in 3 seconds and the other deferrable, expiring in 1.5 seconds. The kernel will not wake up twice - once for the 1.5 second timer and again for the 3 second timer - it will wake up once at expiry of the 3 second timer and execute both the 1.5 second and 3 second timers.

And this is not just a powertop accounting thing - like I said, the total number of wakeups per second goes down with this patch.

Parag

Sent via BlackBerry from T-Mobile

Quit top-posting!

If this is the case then the whole usage of round_jiffies() is bogus. All users of round_jiffies() should just be converted to deferrable??

I am a bit concerned that if deferrable gets used everywhere then a strange situation would occur where all timers were waiting for some other timer to finally happen, kind of a weird timelock situation. Like the old chip/dale cartoon: you first, no you first, after you mister chip, no after you mister dale,...

--
Stephen Hemminger [EMAIL PROTECTED]
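The two-timer example above can be written down as a toy model. This is a userspace sketch of the idea only, assuming an idle CPU that picks its next wakeup purely from non-deferrable timers; the `sketch_` names and the millisecond units are invented for the example and do not correspond to kernel data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy timer: when it wants to fire, and whether it may be deferred. */
struct sketch_timer {
	unsigned long expires_ms;	/* requested expiry, ms from now (> 0) */
	bool deferrable;
};

/*
 * Pick the next wakeup for an idle CPU. Deferrable timers are ignored,
 * so they never cause a wakeup on their own. Returns 0 when nothing
 * forces a wakeup at all.
 */
static unsigned long sketch_next_wakeup(const struct sketch_timer *t, size_t n)
{
	unsigned long next = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (t[i].deferrable)
			continue;
		if (next == 0 || t[i].expires_ms < next)
			next = t[i].expires_ms;
	}
	return next;
}

/* Once the CPU is awake at at_ms, every expired timer runs, deferrable or not. */
static size_t sketch_timers_run_at(const struct sketch_timer *t, size_t n,
				   unsigned long at_ms)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (t[i].expires_ms <= at_ms)
			count++;
	return count;
}

/* The example from the mail: deferrable at 1.5 s, non-deferrable at 3 s. */
static void sketch_deferrable_demo(void)
{
	const struct sketch_timer timers[] = {
		{ .expires_ms = 1500, .deferrable = true },
		{ .expires_ms = 3000, .deferrable = false },
	};

	assert(sketch_next_wakeup(timers, 2) == 3000);		/* one wakeup */
	assert(sketch_timers_run_at(timers, 2, 3000) == 2);	/* both run */
}
```

The model also shows the "you first, no you first" worry: if every timer in the array is deferrable, `sketch_next_wakeup()` returns 0, meaning no timer in this model ever wakes the CPU by itself.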
Re: [PATCH] e1000e: Use deferrable timer for watchdog
On Dec 20, 2007 12:05 PM, Kok, Auke [EMAIL PROTECTED] wrote:

I can't even apply this patch and the e1000 one... not only is it whitespace damaged it is also not properly formatted as patch at all. If you want me to take these patches seriously, then please fix the formatting issues.

Sigh - I use Pine, follow Documentation/email-clients.txt for the recommended settings, and obviously the patches are not generated with whitespace damage at my end, as I test those before sending out. So although I hate to see this happen, there is nothing at this moment that I can do - except for attaching the patch instead of inlining it. Since they have already been reviewed inline, please see if the attached patches work for you.

[EMAIL PROTECTED] linux-2.6]$ scripts/checkpatch.pl --no-tree ../../Patches/e1000_main.c.patch
total: 0 errors, 0 warnings, 8 lines checked

Your patch has no obvious style problems and is ready for submission.
[EMAIL PROTECTED] linux-2.6]$
[EMAIL PROTECTED] linux-2.6]$ vim drivers/net/e1000/e1000_main.c
[EMAIL PROTECTED] linux-2.6]$ patch -p1 < ../../Patches/e1000_main.c.patch
patching file drivers/net/e1000/e1000_main.c

[EMAIL PROTECTED] linux-2.6]$ scripts/checkpatch.pl --no-tree ../../Patches/e1000e-netdev.c.patch
total: 0 errors, 0 warnings, 8 lines checked

Your patch has no obvious style problems and is ready for submission.
[EMAIL PROTECTED] linux-2.6]$ patch -p1 < ../../Patches/e1000e-netdev.c.patch
patching file drivers/net/e1000e/netdev.c

Thanks

Parag

e1000_main.c.patch
Description: Binary data

e1000e-netdev.c.patch
Description: Binary data
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Dec 20, 2007 12:51 PM, Stephen Hemminger [EMAIL PROTECTED] wrote:

Quit top-posting!

If this is the case then the whole usage of round_jiffies() is bogus. All users of round_jiffies() should just be converted to deferrable??

I am a bit concerned that if deferrable gets used everywhere then a strange situation would occur where all timers were waiting for some other timer to finally happen, kind of a weird timelock situation. Like the old chip/dale cartoon: you first, no you first, after you mister chip, no after you mister dale,...

Haha - I thought about this too. I think there should be a mechanism where the machine does not idle infinitely even if there are no non-deferrable timers. Something like an affordable QoS for non deferrable timers - the kernel wakes up after that interval and runs all deferrable timers even if nothing non-deferrable is set to run. So we still get the advantage of not having to wake individually for each timer, and the non-deferrable timers do all get run in a reasonable amount of time.

Who knows, maybe Thomas/Ingo already built in something of that nature or effect?!

Parag
Re: [PATCH 2/2] net: neighbor timer power saving
On Dec 20, 2007 12:10 PM, Stephen Hemminger [EMAIL PROTECTED] wrote:

Thinking about it more, this looks like a case for just using round_jiffies(). The GC timer needs to run to clean up under DoS attack, and deferring it probably isn't a good idea.

But what are the chances that a DoSed machine will be idling, which would prevent the GC timer from running? I would think there would be a lot of other activity going on (including non-deferrable timers running) that would avoid this situation?

Parag
[PATCH 2/2] e1000: Use deferrable timer for watchdog
From: Parag Warudkar [EMAIL PROTECTED]

Reduces wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000/e1000_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 599153d..6af86fa 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1066,7 +1066,7 @@ e1000_probe(struct pci_dev *pdev,
 	adapter->tx_fifo_stall_timer.function = e1000_82547_tx_fifo_stall;
 	adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;

-	init_timer(&adapter->watchdog_timer);
+	init_timer_deferrable(&adapter->watchdog_timer);
 	adapter->watchdog_timer.function = e1000_watchdog;
 	adapter->watchdog_timer.data = (unsigned long) adapter;
Re: [PATCH net-2.6.25 3/3] Uninline the inet_twsk_put function
Pavel Emelyanov wrote:

This one is not that big, but is widely used: saves 1200 bytes from net/ipv4/built-in.o

+void inet_twsk_put(struct inet_timewait_sock *tw)
+{
+	if (atomic_dec_and_test(&tw->tw_refcnt)) {
+		struct module *owner = tw->tw_prot->owner;
+		twsk_destructor((struct sock *)tw);
+#ifdef SOCK_REFCNT_DEBUG
+		printk(KERN_DEBUG "%s timewait_sock %p released\n",
+		       tw->tw_prot->name, tw);
+#endif
+		kmem_cache_free(tw->tw_prot->twsk_prot->twsk_slab, tw);
+		module_put(owner);
+	}
+}
+EXPORT_SYMBOL_GPL(inet_twsk_put);

A more correct fix seems to be conversion to kref.

Just create an out-of-line inet_twsk_release() containing something similar to the code inside these braces and modify inet_twsk_put() to something like this:

static inline void inet_twsk_put(struct inet_timewait_sock *tw)
{
	kref_put(&tw->kref, inet_twsk_release);
}

David, can you see any reason (e.g. some crazy lock stuff) NOT to do this?

Best Regards

Ingo Oeser
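For readers who have not used the pattern, here is a userspace sketch of the "embedded refcount plus out-of-line release callback" shape being proposed. It uses C11 atomics rather than the kernel's kref, and every `sketch_` name (including the `released` flag standing in for the real destructor, slab free and module_put) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal stand-in for a refcount embedded in a larger object. */
struct sketch_ref {
	atomic_int count;
};

/* Hypothetical object owning the refcount. */
struct sketch_twsk {
	struct sketch_ref ref;
	int released;	/* set by the release callback, for illustration */
};

static void sketch_ref_init(struct sketch_ref *r)
{
	atomic_init(&r->count, 1);
}

static void sketch_ref_get(struct sketch_ref *r)
{
	atomic_fetch_add(&r->count, 1);
}

/* Drop one reference; call release() when the last one goes away.
 * Returns 1 if release() was called, 0 otherwise. */
static int sketch_ref_put(struct sketch_ref *r,
			  void (*release)(struct sketch_ref *))
{
	if (atomic_fetch_sub(&r->count, 1) == 1) {
		release(r);
		return 1;
	}
	return 0;
}

/* Out-of-line release: recover the containing object from its member. */
static void sketch_twsk_release(struct sketch_ref *r)
{
	struct sketch_twsk *tw =
		(struct sketch_twsk *)((char *)r - offsetof(struct sketch_twsk, ref));

	tw->released = 1;	/* real code would run destructors and free here */
}

/* The small inline wrapper callers would use. */
static inline void sketch_twsk_put(struct sketch_twsk *tw)
{
	sketch_ref_put(&tw->ref, sketch_twsk_release);
}

static void sketch_ref_demo(void)
{
	struct sketch_twsk tw = { .released = 0 };

	sketch_ref_init(&tw.ref);
	sketch_ref_get(&tw.ref);	/* two holders */
	sketch_twsk_put(&tw);
	assert(!tw.released);		/* one holder left */
	sketch_twsk_put(&tw);
	assert(tw.released);		/* last put triggered release */
}
```

The design point is the same as in the suggestion above: the put path that gets inlined at every call site stays tiny, while the heavier teardown lives in one out-of-line function.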
Re: [PATCH] sky2: Use deferrable timer for watchdog
Stephen Hemminger wrote:

On Thu, 20 Dec 2007 17:29:23 +

-Original Message-
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 20 Dec 2007 09:16:03
To:[EMAIL PROTECTED]
Cc:netdev@vger.kernel.org, [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: [PATCH] sky2: Use deferrable timer for watchdog

On Tue, 18 Dec 2007 20:13:28 -0500 (EST) Parag Warudkar [EMAIL PROTECTED] wrote:

sky2 can use deferrable timer for watchdog - reduces wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]

--- linux-2.6/drivers/net/sky2.c 2007-12-07 10:04:39.0 -0500
+++ linux-2.6-work/drivers/net/sky2.c 2007-12-18 20:07:58.0 -0500
@@ -4230,7 +4230,10 @@
 		sky2_show_addr(dev1);
 	}

-	setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
+	hw->watchdog_timer.function = sky2_watchdog;
+	hw->watchdog_timer.data = (unsigned long) hw;
+	init_timer_deferrable(&hw->watchdog_timer);
+
 	INIT_WORK(&hw->restart_work, sky2_restart);

 	pci_set_drvdata(pdev, hw);

Does it really reduce the wakeup's or only change who gets charged by powertop? The system is going to wakeup once a second anyway. Looks to me that if the timer is using round_jiffies(), that setting deferrable just changes the accounting.

My interpretation of the api is:
 * round_jiffies() - timer wants to wakeup but isn't precise about when so schedule on next second when system will wake up anyway; e.g why meetings are usually scheduled on the hour
 * deferrable - timer doesn't have to really wakeup but wants to happen near a particular time. e.g. I'll meet you at the pub around 8pm

Therefore doing deferrable is unnecessary for timers using round_jiffies unless system is so good at doing timers that it is going to skip doing timer once per second.

[EMAIL PROTECTED] wrote:

NO_HZ kernels don't do timers every second - if you do round_jiffies() the kernel will wakeup and run the timer at that time no matter what.
The reason deferrable was introduced is to avoid waking up the kernel just for this one timer, which can instead be called when the CPU is not idle for some reason other than this timer. In other words, let's say there were two timers - one non-deferrable expiring in 3 seconds and the other deferrable, expiring in 1.5 seconds. The kernel will not wake up twice - once for the 1.5 second timer and again for the 3 second timer - it will wake up once at expiry of the 3 second timer and execute both the 1.5 second and 3 second timers.

And this is not just a powertop accounting thing - like I said, the total number of wakeups per second goes down with this patch.

Parag

Sent via BlackBerry from T-Mobile

Quit top-posting!

If this is the case then the whole usage of round_jiffies() is bogus. All users of round_jiffies() should just be converted to deferrable??

I am a bit concerned that if deferrable gets used everywhere then a strange situation would occur where all timers were waiting for some other timer to finally happen, kind of a weird timelock situation. Like the old chip/dale cartoon: you first, no you first, after you mister chip, no after you mister dale,...

That's a dangerous situation indeed, and I'd really like to know what the limits are for deferring deferrable timers.

Arjan, do you know? Anyone?

I don't see a danger just yet on normal systems - I get something like 10 wakeups per second from just the kernel (acpi, ahci, usb) on most of my systems, which guarantees that the watchdog runs often enough, but for embedded systems and critical timers in other drivers this may quickly become an issue.

Auke
Re: [PATCH] e1000e: Use deferrable timer for watchdog
Parag Warudkar wrote:

On Dec 20, 2007 12:05 PM, Kok, Auke [EMAIL PROTECTED] wrote:

I can't even apply this patch and the e1000 one... not only is it whitespace damaged it is also not properly formatted as patch at all. If you want me to take these patches seriously, then please fix the formatting issues.

Sigh - I use Pine, follow Documentation/email-clients.txt for the recommended settings, and obviously the patches are not generated with whitespace damage at my end, as I test those before sending out. So although I hate to see this happen, there is nothing at this moment that I can do - except for attaching the patch instead of inlining it. Since they have already been reviewed inline, please see if the attached patches work for you.

Here's what the files in my Maildir spool look like in vim (my vim displays a '»' char for tabs and a ¶ for EOL):

76 --- linux-2.6/drivers/net/e1000e/netdev.c» 2007-12-07 10:04:39.
77 +++ linux-2.6-work/drivers/net/e1000e/netdev.c» 2007-12-18 20:45:59.
78 @@ -3899,7 +3899,7 @@¶
79 » » goto err_eeprom;¶
80 » }¶
81 ¶
82 -» init_timer(&adapter->watchdog_timer);¶
83 +» init_timer_deferrable(&adapter->watchdog_timer);¶
84 » adapter->watchdog_timer.function = e1000_watchdog;¶
85 » adapter->watchdog_timer.data = (unsigned long) adapter;¶
86 ¶
87 --¶

Notice that there are two spaces instead of 1. Also there's no line heading the diff with 'diff a/foo b/foo', which is what throws off stg. And the -p option is missing.

As for content, the patch looks OK to me. I ran the numbers, and although there was a slight average delay in the link up detection time, it is negligible (less than 0.2 sec difference over a bunch of measurements), and I confirmed your powertop numbers are correct. As for the timer interval, the watchdog may already be delayed up to 3 seconds safely; this doesn't change that.

I'll forward the patch.

Care to make one for e100? Plenty of laptops with those still around! The embedded guys would love it, I think.

Thanks,

Auke
[PATCH 1/2] e1000e: Use deferrable timer for watchdog
From: Parag Warudkar [EMAIL PROTECTED]

Reduce wakeups from idle per second.

Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
Signed-off-by: Auke Kok [EMAIL PROTECTED]
---

 drivers/net/e1000e/netdev.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 2422d16..59960d2 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -3931,7 +3931,7 @@ static int __devinit e1000_probe(struct pci_dev *pdev,
 		goto err_eeprom;
 	}

-	init_timer(&adapter->watchdog_timer);
+	init_timer_deferrable(&adapter->watchdog_timer);
 	adapter->watchdog_timer.function = e1000_watchdog;
 	adapter->watchdog_timer.data = (unsigned long) adapter;
Re: [PATCH] sky2: Use deferrable timer for watchdog
My interpretation of the api is:
 * round_jiffies() - timer wants to wakeup but isn't precise about when so schedule on next second when system will wake up anyway; e.g why meetings are usually scheduled on the hour
 * deferrable - timer doesn't have to really wakeup but wants to happen near a particular time. e.g. I'll meet you at the pub around 8pm

This is not correct. Deferrable means: if you're busy, wake me up at this time. But if not, don't bother waking up for me, get to it later.

The "later" can be a LONG time later, several seconds easily, if not more. (Timers are on a per-cpu basis, and you may end up with a several-core system where the common timers are all on another cpu than this one.)

If this is the case then the whole usage of round_jiffies() is bogus. All users of round_jiffies() should just be converted to deferrable??

I am a bit concerned that if deferrable gets used everywhere then a strange situation would occur where all timers were waiting for some other timer to finally happen, kind of a weird timelock situation. Like the old chip/dale cartoon: you first, no you first, after you mister chip, no after you mister dale,...

that's a dangerous situation indeed and I'd really like to know what the limits are for deferring deferrable timers

Arjan, do you know? Anyone?

There is NO limit to deferring a timer. Do NOT use a deferrable timer if you can't afford the timer to not happen within.. 10 to 100 seconds! (or more) They are really meant for things where you CAN afford for it to not happen when you're idle.

I don't see a danger just yet on normal systems - I get something like 10 wakeups per second from just the kernel (acpi, ahci, usb) on most my systems which guarantees that the watchdog runs often enough, but for embedded systems and critical timers in other drivers this may be an issue quickly

On my work desktop test box the average time between cpu wakeups is 1.4 seconds (and that's single core). It would be higher if it wasn't for some hpet limit issues.
Re: [PATCH] sky2: Use deferrable timer for watchdog
Arjan van de Ven wrote: My interpretation of the api is: * round_jiffies() - timer wants to wakeup but isn't precise about when so schedule on next second when system will wake up anyway; e.g why meetings are usually scheduled on the hour * deferrable - timer doesn't have to really wakeup but wants to happen near a particular time. e.g. I'll meet you at the pub around 8pm this is not correct. deferrable means if you're busy wake me up at this time. But if not, don't bother waking up for me, get to it later. The later can be a LONG time later, several seconds easily, if not more. (timers are on a per cpu bases, and you may end up with a several-core system where the common timers are all on another cpu than this one) If this is the case then the whole usage of round_jiffies() is bogus. All users of round_jiffies() should just be converted to deferrable?? I am a bit concerned that if deferrable gets used everywhere then a strange situation would occur where all timers were waiting for some other timer to finally happen, kind of a wierd timelock situation. Like the old chip/dale cartoon: you first, no you first, after you mister chip, no after you mister dale,... that's a dangerous situation indeed and I'd really like to know what the limits are for deferring deferrable timers Arjan, do you know? Anyone? there is NO limit to deferring a timer. Do NOT use a deferrable timer if you can't afford the timer to not happen within.. 10 to 100 seconds! (or more) They are really meant for things where you CAN afford for it to not happen when you're idle ok, that's just bad and if there's no user-defineable limit to the deferral I definately don't like this change. Can I safely assume that any irq will cause all deferred timers to run? If this is the case then for e1000 this patch is still OK since the watchdog needs to run (1) after a link up/down interrupt or (2) to update statistics. Those statistics won't increase if there is no traffic of course... 
Auke
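The difference discussed in this message can be sketched with a small user-space model. This is only an illustration of the semantics described in the thread, not kernel code: `struct model_timer`, `next_idle_wakeup()` and `round_up_to_second()` are hypothetical names, the rounding always goes up to the next whole second (the real round_jiffies() may behave differently), and tick wraparound is ignored.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_timer {
	unsigned long expires;	/* expiry time in ticks */
	bool deferrable;	/* true: never wakes an idle CPU on its own */
};

#define NO_WAKEUP ULONG_MAX

/* Tick at which an idle CPU must wake: only non-deferrable timers count. */
static unsigned long next_idle_wakeup(const struct model_timer *t, size_t n)
{
	unsigned long next = NO_WAKEUP;

	for (size_t i = 0; i < n; i++)
		if (!t[i].deferrable && t[i].expires < next)
			next = t[i].expires;
	return next;
}

/* Simplified rounding: push an expiry up to the next whole second. */
static unsigned long round_up_to_second(unsigned long ticks, unsigned long hz)
{
	unsigned long rem = ticks % hz;

	return rem ? ticks + (hz - rem) : ticks;
}
```

In this model a rounded timer still has a definite expiry that wakes the CPU, while a CPU holding only deferrable timers gets NO_WAKEUP, which is the unbounded deferral the thread is worried about.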
[PATCH 0/9][BNX2]: Add MSIX support.
David, this patchset lays the foundation for supporting multiple MSIX IRQs. Only 1 additional MSIX is added to handle TX separately from RX at the moment. Multiple TX and RX rings will be added in the future. Please review for 2.6.25. Thanks.
[PATCH 1/9][BNX2]: Add function to fetch hardware tx index.
[BNX2]: Add function to fetch hardware tx index.

This makes the code cleaner and easier to support different tx rings.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 469d259..f19a1e9 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2323,17 +2323,25 @@ bnx2_phy_int(struct bnx2 *bp)
 }
 
+static inline u16
+bnx2_get_hw_tx_cons(struct bnx2 *bp)
+{
+	u16 cons;
+
+	cons = bp->status_blk->status_tx_quick_consumer_index0;
+
+	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
+		cons++;
+	return cons;
+}
+
 static void
 bnx2_tx_int(struct bnx2 *bp)
 {
-	struct status_block *sblk = bp->status_blk;
 	u16 hw_cons, sw_cons, sw_ring_cons;
 	int tx_free_bd = 0;
 
-	hw_cons = bp->hw_tx_cons = sblk->status_tx_quick_consumer_index0;
-	if ((hw_cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT) {
-		hw_cons++;
-	}
+	hw_cons = bnx2_get_hw_tx_cons(bp);
 	sw_cons = bp->tx_cons;
 
 	while (sw_cons != hw_cons) {
@@ -2385,14 +2393,10 @@ bnx2_tx_int(struct bnx2 *bp)
 
 		dev_kfree_skb(skb);
 
-		hw_cons = bp->hw_tx_cons =
-			sblk->status_tx_quick_consumer_index0;
-
-		if ((hw_cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT) {
-			hw_cons++;
-		}
+		hw_cons = bnx2_get_hw_tx_cons(bp);
 	}
 
+	bp->hw_tx_cons = hw_cons;
 	bp->tx_cons = sw_cons;
 	/* Need to make the tx_cons update visible to bnx2_start_xmit()
 	 * before checking for netif_queue_stopped().  Without the
@@ -2822,7 +2826,7 @@ bnx2_has_work(struct bnx2 *bp)
 	struct status_block *sblk = bp->status_blk;
 
 	if ((bnx2_get_hw_rx_cons(bp) != bp->rx_cons) ||
-	    (sblk->status_tx_quick_consumer_index0 != bp->hw_tx_cons))
+	    (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons))
 		return 1;
 
 	if ((sblk->status_attn_bits & STATUS_ATTN_EVENTS) !=
@@ -2851,7 +2855,7 @@ static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
 		REG_RD(bp, BNX2_HC_COMMAND);
 	}
 
-	if (sblk->status_tx_quick_consumer_index0 != bp->hw_tx_cons)
+	if (bnx2_get_hw_tx_cons(bp) != bp->hw_tx_cons)
 		bnx2_tx_int(bp);
 
 	if (bnx2_get_hw_rx_cons(bp) != bp->rx_cons)
@@ -4917,7 +4921,7 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	REG_RD(bp, BNX2_HC_COMMAND);
 
 	udelay(5);
-	rx_start_idx = bp->status_blk->status_rx_quick_consumer_index0;
+	rx_start_idx = bnx2_get_hw_rx_cons(bp);
 
 	num_pkts = 0;
 
@@ -4947,11 +4951,10 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	pci_unmap_single(bp->pdev, map, pkt_size, PCI_DMA_TODEVICE);
 	dev_kfree_skb(skb);
 
-	if (bp->status_blk->status_tx_quick_consumer_index0 != bp->tx_prod) {
+	if (bnx2_get_hw_tx_cons(bp) != bp->tx_prod)
 		goto loopback_test_done;
-	}
 
-	rx_idx = bp->status_blk->status_rx_quick_consumer_index0;
+	rx_idx = bnx2_get_hw_rx_cons(bp);
 	if (rx_idx != rx_start_idx + num_pkts) {
 		goto loopback_test_done;
 	}
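The index adjustment that this patch factors into bnx2_get_hw_tx_cons() can be tried out in user space. The following is a minimal sketch, not driver code: `model_hw_tx_cons()` is a hypothetical name, and the value 255 for the mask is an assumption based on the driver comment that the ring uses 256 indices for 255 entries.

```c
#include <stdint.h>

/* Assumed value: 256 indices per ring page, one of which is skipped. */
#define MODEL_MAX_TX_DESC_CNT 255u

/* Same test as the helper in the patch: step over the skipped slot. */
static uint16_t model_hw_tx_cons(uint16_t raw_index)
{
	uint16_t cons = raw_index;

	if ((cons & MODEL_MAX_TX_DESC_CNT) == MODEL_MAX_TX_DESC_CNT)
		cons++;
	return cons;
}
```

Any raw index whose low eight bits are all ones is moved to the next index, and because the value is 16 bits wide the adjustment wraps from 65535 to 0.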
[PATCH 2/9][BNX2]: Restructure IRQ datastructures.
[BNX2]: Restructure IRQ datastructures.

Add a table to keep track of multiple IRQs and restructure the IRQ request and free functions so that they can be easily expanded to handle multiple IRQs.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index f19a1e9..83cdbde 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -5234,18 +5234,15 @@ static int
 bnx2_request_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
-	int rc = 0;
-
-	if (bp->flags & USING_MSI_FLAG) {
-		irq_handler_t fn = bnx2_msi;
-
-		if (bp->flags & ONE_SHOT_MSI_FLAG)
-			fn = bnx2_msi_1shot;
+	unsigned long flags;
+	struct bnx2_irq *irq = &bp->irq_tbl[0];
+	int rc;
 
-		rc = request_irq(bp->pdev->irq, fn, 0, dev->name, dev);
-	} else
-		rc = request_irq(bp->pdev->irq, bnx2_interrupt,
-				IRQF_SHARED, dev->name, dev);
+	if (bp->flags & USING_MSI_FLAG)
+		flags = 0;
+	else
+		flags = IRQF_SHARED;
+	rc = request_irq(irq->vector, irq->handler, flags, dev->name, dev);
 	return rc;
 }
 
@@ -5254,12 +5251,31 @@ bnx2_free_irq(struct bnx2 *bp)
 {
 	struct net_device *dev = bp->dev;
 
+	free_irq(bp->irq_tbl[0].vector, dev);
 	if (bp->flags & USING_MSI_FLAG) {
-		free_irq(bp->pdev->irq, dev);
 		pci_disable_msi(bp->pdev);
 		bp->flags &= ~(USING_MSI_FLAG | ONE_SHOT_MSI_FLAG);
-	} else
-		free_irq(bp->pdev->irq, dev);
+	}
+}
+
+static void
+bnx2_setup_int_mode(struct bnx2 *bp, int dis_msi)
+{
+	bp->irq_tbl[0].handler = bnx2_interrupt;
+	strcpy(bp->irq_tbl[0].name, bp->dev->name);
+
+	if ((bp->flags & MSI_CAP_FLAG) && !dis_msi) {
+		if (pci_enable_msi(bp->pdev) == 0) {
+			bp->flags |= USING_MSI_FLAG;
+			if (CHIP_NUM(bp) == CHIP_NUM_5709) {
+				bp->flags |= ONE_SHOT_MSI_FLAG;
+				bp->irq_tbl[0].handler = bnx2_msi_1shot;
+			} else
+				bp->irq_tbl[0].handler = bnx2_msi;
+		}
+	}
+
+	bp->irq_tbl[0].vector = bp->pdev->irq;
 }
 
 /* Called with rtnl_lock */
@@ -5278,15 +5294,8 @@ bnx2_open(struct net_device *dev)
 	if (rc)
 		return rc;
 
+	bnx2_setup_int_mode(bp, disable_msi);
 	napi_enable(&bp->napi);
-
-	if ((bp->flags & MSI_CAP_FLAG) && !disable_msi) {
-		if (pci_enable_msi(bp->pdev) == 0) {
-			bp->flags |= USING_MSI_FLAG;
-			if (CHIP_NUM(bp) == CHIP_NUM_5709)
-				bp->flags |= ONE_SHOT_MSI_FLAG;
-		}
-	}
 	rc = bnx2_request_irq(bp);
 
 	if (rc) {
@@ -5325,6 +5334,8 @@ bnx2_open(struct net_device *dev)
 			bnx2_disable_int(bp);
 			bnx2_free_irq(bp);
 
+			bnx2_setup_int_mode(bp, 1);
+
 			rc = bnx2_init_nic(bp);
 
 			if (!rc)
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 1f244fa..1accf00 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6494,6 +6494,15 @@ struct flash_spec {
 	u8	*name;
 };
 
+#define BNX2_MAX_MSIX_HW_VEC	9
+#define BNX2_MAX_MSIX_VEC	1
+
+struct bnx2_irq {
+	irq_handler_t	handler;
+	u16		vector;
+	char		name[16];
+};
+
 struct bnx2 {
 	/* Fields used in the tx and intr/napi performance paths are grouped */
 	/* together in the beginning of the structure. */
@@ -6721,6 +6730,9 @@ struct bnx2 {
 	u32			flash_size;
 
 	int			status_stats_size;
+
+	struct bnx2_irq		irq_tbl[BNX2_MAX_MSIX_VEC];
+	int			irq_nvecs;
 };
 
 static u32 bnx2_reg_rd_ind(struct bnx2 *bp, u32 offset);
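The patch above only fills in entry 0 of the IRQ table. One way such a table could later be walked for several vectors is sketched below as a user-space model. Everything here is an assumption made for illustration: `struct model_irq`, `model_request()`, `model_free()` and `request_all()` are hypothetical names, the stubs stand in for request_irq() and free_irq(), and the roll-back on failure is a design choice of this sketch rather than something the patch does.

```c
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-vector table entry. */
struct model_irq {
	int (*handler)(int irq, void *dev);
	unsigned int vector;
	int requested;		/* 1 once the vector has been claimed */
};

/* Stub for request_irq(): in this model, vector 0 always fails. */
static int model_request(struct model_irq *irq)
{
	if (irq->vector == 0)
		return -1;
	irq->requested = 1;
	return 0;
}

/* Stub for free_irq(). */
static void model_free(struct model_irq *irq)
{
	irq->requested = 0;
}

/* Claim every vector in the table, releasing earlier ones on failure. */
static int request_all(struct model_irq *tbl, size_t nvecs)
{
	for (size_t i = 0; i < nvecs; i++) {
		int rc = model_request(&tbl[i]);

		if (rc) {
			while (i-- > 0)
				model_free(&tbl[i]);
			return rc;
		}
	}
	return 0;
}
```

Keeping handler and vector together in one entry is what lets a single loop replace the per-mode branches that the patch removes from bnx2_request_irq().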
[PATCH 3/9][BNX2]: Introduce new bnx2_napi structure.
[BNX2]: Introduce new bnx2_napi structure.

Introduce a bnx2_napi structure that will hold a napi_struct and other fields to handle NAPI polling for the napi_struct. Various tx and rx indexes and status block pointers will be moved from the main bnx2 structure to this bnx2_napi structure.

Most NAPI path functions are modified to be passed this bnx2_napi struct pointer.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 83cdbde..3f754e6 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -407,12 +407,14 @@ bnx2_disable_int(struct bnx2 *bp)
 static void
 bnx2_enable_int(struct bnx2 *bp)
 {
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bp->last_status_idx);
+	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bnapi->last_status_idx);
 
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bp->last_status_idx);
+	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bnapi->last_status_idx);
 
 	REG_WR(bp, BNX2_HC_COMMAND, bp->hc_cmd | BNX2_HC_COMMAND_COAL_NOW);
 }
@@ -426,11 +428,23 @@ bnx2_disable_int_sync(struct bnx2 *bp)
 }
 
 static void
+bnx2_napi_disable(struct bnx2 *bp)
+{
+	napi_disable(&bp->bnx2_napi.napi);
+}
+
+static void
+bnx2_napi_enable(struct bnx2 *bp)
+{
+	napi_enable(&bp->bnx2_napi.napi);
+}
+
+static void
 bnx2_netif_stop(struct bnx2 *bp)
 {
 	bnx2_disable_int_sync(bp);
 	if (netif_running(bp->dev)) {
-		napi_disable(&bp->napi);
+		bnx2_napi_disable(bp);
 		netif_tx_disable(bp->dev);
 		bp->dev->trans_start = jiffies;	/* prevent tx timeout */
 	}
@@ -442,7 +456,7 @@ bnx2_netif_start(struct bnx2 *bp)
 	if (atomic_dec_and_test(&bp->intr_sem)) {
 		if (netif_running(bp->dev)) {
 			netif_wake_queue(bp->dev);
-			napi_enable(&bp->napi);
+			bnx2_napi_enable(bp);
 			bnx2_enable_int(bp);
 		}
 	}
@@ -555,6 +569,8 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	memset(bp->status_blk, 0, bp->status_stats_size);
 
+	bp->bnx2_napi.status_blk = bp->status_blk;
+
 	bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
 				  status_blk_size);
 
@@ -2291,9 +2307,9 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
 }
 
 static int
-bnx2_phy_event_is_set(struct bnx2 *bp, u32 event)
+bnx2_phy_event_is_set(struct bnx2 *bp, struct bnx2_napi *bnapi, u32 event)
 {
-	struct status_block *sblk = bp->status_blk;
+	struct status_block *sblk = bnapi->status_blk;
 	u32 new_link_state, old_link_state;
 	int is_set = 1;
 
@@ -2311,24 +2327,24 @@ bnx2_phy_event_is_set(struct bnx2 *bp, u32 event)
 }
 
 static void
-bnx2_phy_int(struct bnx2 *bp)
+bnx2_phy_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
-	if (bnx2_phy_event_is_set(bp, STATUS_ATTN_BITS_LINK_STATE)) {
+	if (bnx2_phy_event_is_set(bp, bnapi, STATUS_ATTN_BITS_LINK_STATE)) {
 		spin_lock(&bp->phy_lock);
 		bnx2_set_link(bp);
 		spin_unlock(&bp->phy_lock);
 	}
-	if (bnx2_phy_event_is_set(bp, STATUS_ATTN_BITS_TIMER_ABORT))
+	if (bnx2_phy_event_is_set(bp, bnapi, STATUS_ATTN_BITS_TIMER_ABORT))
 		bnx2_set_remote_link(bp);
 }
 
 static inline u16
-bnx2_get_hw_tx_cons(struct bnx2 *bp)
+bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 {
 	u16 cons;
 
-	cons = bp->status_blk->status_tx_quick_consumer_index0;
+	cons = bnapi->status_blk->status_tx_quick_consumer_index0;
 
 	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
 		cons++;
@@ -2336,12 +2352,12 @@ bnx2_get_hw_tx_cons(struct bnx2 *bp)
 }
 
 static void
-bnx2_tx_int(struct bnx2 *bp)
+bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons;
 	int tx_free_bd = 0;
 
-	hw_cons = bnx2_get_hw_tx_cons(bp);
+	hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	sw_cons = bp->tx_cons;
 
 	while (sw_cons != hw_cons) {
@@ -2393,7 +2409,7 @@ bnx2_tx_int(struct bnx2 *bp)
 
 		dev_kfree_skb(skb);
 
-		hw_cons = bnx2_get_hw_tx_cons(bp);
+		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
 
 	bp->hw_tx_cons = hw_cons;
@@ -2584,9 +2600,9 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 }
 
 static inline u16
-bnx2_get_hw_rx_cons(struct bnx2 *bp)
+bnx2_get_hw_rx_cons(struct bnx2_napi *bnapi)
 {
-	u16 cons = bp->status_blk->status_rx_quick_consumer_index0;
+	u16 cons = bnapi->status_blk->status_rx_quick_consumer_index0;
 
 	if (unlikely((cons & MAX_RX_DESC_CNT) == MAX_RX_DESC_CNT))
 		cons++;
@@ -2594,13 +2610,13 @@ bnx2_get_hw_rx_cons(struct bnx2 *bp)
 }
 
 static int
[PATCH 4/9][BNX2]: Move tx indexes into bnx2_napi struct.
[BNX2]: Move tx indexes into bnx2_napi struct.

Tx related fields used in NAPI polling are moved from the main bnx2 struct to the bnx2_napi struct.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 3f754e6..0300a75 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -226,7 +226,7 @@ static struct flash_spec flash_5709 = {
 
 MODULE_DEVICE_TABLE(pci, bnx2_pci_tbl);
 
-static inline u32 bnx2_tx_avail(struct bnx2 *bp)
+static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u32 diff;
 
@@ -235,7 +235,7 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp)
 	/* The ring uses 256 indices for 255 entries, one of them
 	 * needs to be skipped.
 	 */
-	diff = bp->tx_prod - bp->tx_cons;
+	diff = bp->tx_prod - bnapi->tx_cons;
 	if (unlikely(diff >= TX_DESC_CNT)) {
 		diff &= 0xffff;
 		if (diff == TX_DESC_CNT)
@@ -2358,7 +2358,7 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 	int tx_free_bd = 0;
 
 	hw_cons = bnx2_get_hw_tx_cons(bnapi);
-	sw_cons = bp->tx_cons;
+	sw_cons = bnapi->tx_cons;
 
 	while (sw_cons != hw_cons) {
 		struct sw_bd *tx_buf;
@@ -2412,8 +2412,8 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
 
-	bp->hw_tx_cons = hw_cons;
-	bp->tx_cons = sw_cons;
+	bnapi->hw_tx_cons = hw_cons;
+	bnapi->tx_cons = sw_cons;
 	/* Need to make the tx_cons update visible to bnx2_start_xmit()
 	 * before checking for netif_queue_stopped().  Without the
 	 * memory barrier, there is a small possibility that bnx2_start_xmit()
@@ -2422,10 +2422,10 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 	smp_mb();
 
 	if (unlikely(netif_queue_stopped(bp->dev)) &&
-		     (bnx2_tx_avail(bp) > bp->tx_wake_thresh)) {
+		     (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh)) {
 		netif_tx_lock(bp->dev);
 		if ((netif_queue_stopped(bp->dev)) &&
-		    (bnx2_tx_avail(bp) > bp->tx_wake_thresh))
+		    (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh))
 			netif_wake_queue(bp->dev);
 		netif_tx_unlock(bp->dev);
 	}
@@ -2846,7 +2846,7 @@ bnx2_has_work(struct bnx2_napi *bnapi)
 	struct status_block *sblk = bp->status_blk;
 
 	if ((bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons) ||
-	    (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons))
+	    (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons))
 		return 1;
 
 	if ((sblk->status_attn_bits & STATUS_ATTN_EVENTS) !=
@@ -2876,7 +2876,7 @@ static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 		REG_RD(bp, BNX2_HC_COMMAND);
 	}
 
-	if (bnx2_get_hw_tx_cons(bnapi) != bp->hw_tx_cons)
+	if (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons)
 		bnx2_tx_int(bp, bnapi);
 
 	if (bnx2_get_hw_rx_cons(bnapi) != bp->rx_cons)
@@ -4381,6 +4381,7 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 {
 	struct tx_bd *txbd;
 	u32 cid;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
 	bp->tx_wake_thresh = bp->tx_ring_size / 2;
 
@@ -4390,8 +4391,8 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 	txbd->tx_bd_haddr_lo = (u64) bp->tx_desc_mapping & 0xffffffff;
 
 	bp->tx_prod = 0;
-	bp->tx_cons = 0;
-	bp->hw_tx_cons = 0;
+	bnapi->tx_cons = 0;
+	bnapi->hw_tx_cons = 0;
 	bp->tx_prod_bseq = 0;
 
 	cid = TX_CID;
@@ -5440,8 +5441,10 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	u32 len, vlan_tag_flags, last_frag, mss;
 	u16 prod, ring_prod;
 	int i;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi;
 
-	if (unlikely(bnx2_tx_avail(bp) < (skb_shinfo(skb)->nr_frags + 1))) {
+	if (unlikely(bnx2_tx_avail(bp, bnapi) <
+	    (skb_shinfo(skb)->nr_frags + 1))) {
 		netif_stop_queue(dev);
 		printk(KERN_ERR PFX "%s: BUG! Tx ring full when queue awake!\n",
 			dev->name);
@@ -5556,9 +5559,9 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	bp->tx_prod = prod;
 	dev->trans_start = jiffies;
 
-	if (unlikely(bnx2_tx_avail(bp) <= MAX_SKB_FRAGS)) {
+	if (unlikely(bnx2_tx_avail(bp, bnapi) <= MAX_SKB_FRAGS)) {
 		netif_stop_queue(dev);
-		if (bnx2_tx_avail(bp) > bp->tx_wake_thresh)
+		if (bnx2_tx_avail(bp, bnapi) > bp->tx_wake_thresh)
 			netif_wake_queue(dev);
 	}
 
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 345b6db..958fdda 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -6509,6 +6509,9 @@ struct bnx2_napi {
 	struct status_block	*status_blk;
 	u32			last_status_idx;
 	u32			int_num;
+
+	u16			tx_cons;
[PATCH 5/9][BNX2]: Move rx indexes into bnx2_napi struct.
[BNX2]: Move rx indexes into bnx2_napi struct.

Rx related fields used in NAPI polling are moved from the main bnx2 struct to the bnx2_napi struct.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 0300a75..ecfaad1 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2276,7 +2276,7 @@ bnx2_free_rx_page(struct bnx2 *bp, u16 index)
 }
 
 static inline int
-bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
+bnx2_alloc_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, u16 index)
 {
 	struct sk_buff *skb;
 	struct sw_bd *rx_buf = &bp->rx_buf_ring[index];
@@ -2301,7 +2301,7 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 index)
 	rxbd->rx_bd_haddr_hi = (u64) mapping >> 32;
 	rxbd->rx_bd_haddr_lo = (u64) mapping & 0xffffffff;
 
-	bp->rx_prod_bseq += bp->rx_buf_use_size;
+	bnapi->rx_prod_bseq += bp->rx_buf_use_size;
 
 	return 0;
 }
@@ -2432,14 +2432,15 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 }
 
 static void
-bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct sk_buff *skb, int count)
+bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct bnx2_napi *bnapi,
+			struct sk_buff *skb, int count)
 {
 	struct sw_pg *cons_rx_pg, *prod_rx_pg;
 	struct rx_bd *cons_bd, *prod_bd;
 	dma_addr_t mapping;
 	int i;
-	u16 hw_prod = bp->rx_pg_prod, prod;
-	u16 cons = bp->rx_pg_cons;
+	u16 hw_prod = bnapi->rx_pg_prod, prod;
+	u16 cons = bnapi->rx_pg_cons;
 
 	for (i = 0; i < count; i++) {
 		prod = RX_PG_RING_IDX(hw_prod);
@@ -2476,12 +2477,12 @@ bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct sk_buff *skb, int count)
 		cons = RX_PG_RING_IDX(NEXT_RX_BD(cons));
 		hw_prod = NEXT_RX_BD(hw_prod);
 	}
-	bp->rx_pg_prod = hw_prod;
-	bp->rx_pg_cons = cons;
+	bnapi->rx_pg_prod = hw_prod;
+	bnapi->rx_pg_cons = cons;
 }
 
 static inline void
-bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
+bnx2_reuse_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, struct sk_buff *skb,
 	u16 cons, u16 prod)
 {
 	struct sw_bd *cons_rx_buf, *prod_rx_buf;
@@ -2494,7 +2495,7 @@ bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
 		pci_unmap_addr(cons_rx_buf, mapping),
 		bp->rx_offset + RX_COPY_THRESH, PCI_DMA_FROMDEVICE);
 
-	bp->rx_prod_bseq += bp->rx_buf_use_size;
+	bnapi->rx_prod_bseq += bp->rx_buf_use_size;
 
 	prod_rx_buf->skb = skb;
 
@@ -2511,20 +2512,21 @@ bnx2_reuse_rx_skb(struct bnx2 *bp, struct sk_buff *skb,
 }
 
 static int
-bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
-	    unsigned int hdr_len, dma_addr_t dma_addr, u32 ring_idx)
+bnx2_rx_skb(struct bnx2 *bp, struct bnx2_napi *bnapi, struct sk_buff *skb,
+	    unsigned int len, unsigned int hdr_len, dma_addr_t dma_addr,
+	    u32 ring_idx)
 {
 	int err;
 	u16 prod = ring_idx & 0xffff;
 
-	err = bnx2_alloc_rx_skb(bp, prod);
+	err = bnx2_alloc_rx_skb(bp, bnapi, prod);
 	if (unlikely(err)) {
-		bnx2_reuse_rx_skb(bp, skb, (u16) (ring_idx >> 16), prod);
+		bnx2_reuse_rx_skb(bp, bnapi, skb, (u16) (ring_idx >> 16), prod);
 		if (hdr_len) {
 			unsigned int raw_len = len + 4;
 			int pages = PAGE_ALIGN(raw_len - hdr_len) >> PAGE_SHIFT;
 
-			bnx2_reuse_rx_skb_pages(bp, NULL, pages);
+			bnx2_reuse_rx_skb_pages(bp, bnapi, NULL, pages);
 		}
 		return err;
 	}
@@ -2539,8 +2541,8 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 	} else {
 		unsigned int i, frag_len, frag_size, pages;
 		struct sw_pg *rx_pg;
-		u16 pg_cons = bp->rx_pg_cons;
-		u16 pg_prod = bp->rx_pg_prod;
+		u16 pg_cons = bnapi->rx_pg_cons;
+		u16 pg_prod = bnapi->rx_pg_prod;
 
 		frag_size = len + 4 - hdr_len;
 		pages = PAGE_ALIGN(frag_size) >> PAGE_SHIFT;
@@ -2551,9 +2553,10 @@ bnx2_rx_skb(struct bnx2 *bp, struct sk_buff *skb, unsigned int len,
 			if (unlikely(frag_len <= 4)) {
 				unsigned int tail = 4 - frag_len;
 
-				bp->rx_pg_cons = pg_cons;
-				bp->rx_pg_prod = pg_prod;
-				bnx2_reuse_rx_skb_pages(bp, NULL, pages - i);
+				bnapi->rx_pg_cons = pg_cons;
+				bnapi->rx_pg_prod = pg_prod;
+				bnx2_reuse_rx_skb_pages(bp, bnapi, NULL,
+							pages - i);
 				skb->len -= tail;
 				if (i == 0) {
 					skb->tail -= tail;
@@ -2579,9 +2582,10 @@ bnx2_rx_skb(struct bnx2 *bp,
[PATCH 6/9][BNX2]: Support multiple MSIX IRQs.
[BNX2]: Support multiple MSIX IRQs.

Change bnx2_napi struct into an array and add code to manage multiple IRQs. MSIX hardware structures and new registers are also added.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index ecfaad1..196d053 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -399,44 +399,65 @@ bnx2_write_phy(struct bnx2 *bp, u32 reg, u32 val)
 static void
 bnx2_disable_int(struct bnx2 *bp)
 {
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
+	int i;
+	struct bnx2_napi *bnapi;
+
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		bnapi = &bp->bnx2_napi[i];
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT);
+	}
 	REG_RD(bp, BNX2_PCICFG_INT_ACK_CMD);
 }
 
 static void
 bnx2_enable_int(struct bnx2 *bp)
 {
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	int i;
+	struct bnx2_napi *bnapi;
 
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-	       BNX2_PCICFG_INT_ACK_CMD_MASK_INT | bnapi->last_status_idx);
+	for (i = 0; i < bp->irq_nvecs; i++) {
+		bnapi = &bp->bnx2_napi[i];
 
-	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID | bnapi->last_status_idx);
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
+		       bnapi->last_status_idx);
 
+		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+		       bnapi->last_status_idx);
+	}
 	REG_WR(bp, BNX2_HC_COMMAND, bp->hc_cmd | BNX2_HC_COMMAND_COAL_NOW);
 }
 
 static void
 bnx2_disable_int_sync(struct bnx2 *bp)
 {
+	int i;
+
 	atomic_inc(&bp->intr_sem);
 	bnx2_disable_int(bp);
-	synchronize_irq(bp->pdev->irq);
+	for (i = 0; i < bp->irq_nvecs; i++)
+		synchronize_irq(bp->irq_tbl[i].vector);
 }
 
 static void
 bnx2_napi_disable(struct bnx2 *bp)
 {
-	napi_disable(&bp->bnx2_napi.napi);
+	int i;
+
+	for (i = 0; i < bp->irq_nvecs; i++)
+		napi_disable(&bp->bnx2_napi[i].napi);
 }
 
 static void
 bnx2_napi_enable(struct bnx2 *bp)
 {
-	napi_enable(&bp->bnx2_napi.napi);
+	int i;
+
+	for (i = 0; i < bp->irq_nvecs; i++)
+		napi_enable(&bp->bnx2_napi[i].napi);
 }
 
 static void
@@ -559,6 +580,9 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	/* Combine status and statistics blocks into one allocation. */
 	status_blk_size = L1_CACHE_ALIGN(sizeof(struct status_block));
+	if (bp->flags & MSIX_CAP_FLAG)
+		status_blk_size = L1_CACHE_ALIGN(BNX2_MAX_MSIX_HW_VEC *
+						 BNX2_SBLK_MSIX_ALIGN_SIZE);
 	bp->status_stats_size = status_blk_size +
 				sizeof(struct statistics_block);
 
@@ -569,7 +593,17 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	memset(bp->status_blk, 0, bp->status_stats_size);
 
-	bp->bnx2_napi.status_blk = bp->status_blk;
+	bp->bnx2_napi[0].status_blk = bp->status_blk;
+	if (bp->flags & MSIX_CAP_FLAG) {
+		for (i = 1; i < BNX2_MAX_MSIX_VEC; i++) {
+			struct bnx2_napi *bnapi = &bp->bnx2_napi[i];
+
+			bnapi->status_blk = (void *)
+				((unsigned long) bp->status_blk +
+				 BNX2_SBLK_MSIX_ALIGN_SIZE * i);
+			bnapi->int_num = i << 24;
+		}
+	}
 
 	bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
 				  status_blk_size);
@@ -2767,7 +2801,7 @@ bnx2_msi(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	prefetch(bnapi->status_blk);
 	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
@@ -2788,7 +2822,7 @@ bnx2_msi_1shot(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
 	prefetch(bnapi->status_blk);
 
@@ -2806,7 +2840,7 @@ bnx2_interrupt(int irq, void *dev_instance)
 {
 	struct net_device *dev = dev_instance;
 	struct bnx2 *bp = netdev_priv(dev);
-	struct bnx2_napi *bnapi = &bp->bnx2_napi;
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 	struct status_block *sblk = bnapi->status_blk;
 
 	/* When using INTx, it is possible for the interrupt to arrive
@@ -2911,7 +2945,7 @@ static int bnx2_poll(struct napi_struct *napi, int budget)
[PATCH 7/9][BNX2]: Add support for a new tx ring.
[BNX2]: Add support for a new tx ring.

To separate TX IRQs into a different MSIX vector, we need to support a new tx ring. The original tx ring will still be used when not using MSIX.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 196d053..a4ed6ca 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2378,7 +2378,10 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 {
 	u16 cons;
 
-	cons = bnapi->status_blk->status_tx_quick_consumer_index0;
+	if (bnapi->int_num == 0)
+		cons = bnapi->status_blk->status_tx_quick_consumer_index0;
+	else
+		cons = bnapi->status_blk_msix->status_tx_quick_consumer_index;
 
 	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
 		cons++;
@@ -2389,7 +2392,6 @@ static void
 bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons;
-	int tx_free_bd = 0;
 
 	hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	sw_cons = bnapi->tx_cons;
@@ -2439,8 +2441,6 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 
 		sw_cons = NEXT_TX_BD(sw_cons);
 
-		tx_free_bd += last + 1;
-
 		dev_kfree_skb(skb);
 
 		hw_cons = bnx2_get_hw_tx_cons(bnapi);
@@ -4369,6 +4369,24 @@ bnx2_init_chip(struct bnx2 *bp)
 		      BNX2_HC_CONFIG_COLLECT_STATS;
 	}
 
+	if (bp->flags & USING_MSIX_FLAG) {
+		REG_WR(bp, BNX2_HC_MSIX_BIT_VECTOR,
+		       BNX2_HC_MSIX_BIT_VECTOR_VAL);
+
+		REG_WR(bp, BNX2_HC_SB_CONFIG_1,
+			BNX2_HC_SB_CONFIG_1_TX_TMR_MODE |
+			BNX2_HC_SB_CONFIG_1_ONE_SHOT);
+
+		REG_WR(bp, BNX2_HC_TX_QUICK_CONS_TRIP_1,
+			(bp->tx_quick_cons_trip_int << 16) |
+			 bp->tx_quick_cons_trip);
+
+		REG_WR(bp, BNX2_HC_TX_TICKS_1,
+			(bp->tx_ticks_int << 16) | bp->tx_ticks);
+
+		val |= BNX2_HC_CONFIG_SB_ADDR_INC_128B;
+	}
+
 	if (bp->flags & ONE_SHOT_MSI_FLAG)
 		val |= BNX2_HC_CONFIG_ONE_SHOT;
 
@@ -4401,6 +4419,25 @@ bnx2_init_chip(struct bnx2 *bp)
 }
 
 static void
+bnx2_clear_ring_states(struct bnx2 *bp)
+{
+	struct bnx2_napi *bnapi;
+	int i;
+
+	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+		bnapi = &bp->bnx2_napi[i];
+
+		bnapi->tx_cons = 0;
+		bnapi->hw_tx_cons = 0;
+		bnapi->rx_prod_bseq = 0;
+		bnapi->rx_prod = 0;
+		bnapi->rx_cons = 0;
+		bnapi->rx_pg_prod = 0;
+		bnapi->rx_pg_cons = 0;
+	}
+}
+
+static void
 bnx2_init_tx_context(struct bnx2 *bp, u32 cid)
 {
 	u32 val, offset0, offset1, offset2, offset3;
@@ -4433,8 +4470,17 @@ static void
 bnx2_init_tx_ring(struct bnx2 *bp)
 {
 	struct tx_bd *txbd;
-	u32 cid;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
+	u32 cid = TX_CID;
+	struct bnx2_napi *bnapi;
+
+	bp->tx_vec = 0;
+	if (bp->flags & USING_MSIX_FLAG) {
+		cid = TX_TSS_CID;
+		bp->tx_vec = BNX2_TX_VEC;
+		REG_WR(bp, BNX2_TSCH_TSS_CFG, BNX2_TX_INT_NUM |
+		       (TX_TSS_CID << 7));
+	}
+	bnapi = &bp->bnx2_napi[bp->tx_vec];
 
 	bp->tx_wake_thresh = bp->tx_ring_size / 2;
 
@@ -4444,11 +4490,8 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 	txbd->tx_bd_haddr_lo = (u64) bp->tx_desc_mapping & 0xffffffff;
 
 	bp->tx_prod = 0;
-	bnapi->tx_cons = 0;
-	bnapi->hw_tx_cons = 0;
 	bp->tx_prod_bseq = 0;
 
-	cid = TX_CID;
 	bp->tx_bidx_addr = MB_GET_CID_ADDR(cid) + BNX2_L2CTX_TX_HOST_BIDX;
 	bp->tx_bseq_addr = MB_GET_CID_ADDR(cid) + BNX2_L2CTX_TX_HOST_BSEQ;
 
@@ -4487,12 +4530,6 @@ bnx2_init_rx_ring(struct bnx2 *bp)
 	u32 val, rx_cid_addr = GET_CID_ADDR(RX_CID);
 	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
 
-	bnapi->rx_prod = 0;
-	bnapi->rx_cons = 0;
-	bnapi->rx_prod_bseq = 0;
-	bnapi->rx_pg_prod = 0;
-	bnapi->rx_pg_cons = 0;
-
 	bnx2_init_rxbd_rings(bp->rx_desc_ring, bp->rx_desc_mapping,
 			     bp->rx_buf_use_size, bp->rx_max_ring);
 
@@ -4694,6 +4731,7 @@ bnx2_reset_nic(struct bnx2 *bp, u32 reset_code)
 	if ((rc = bnx2_init_chip(bp)) != 0)
 		return rc;
 
+	bnx2_clear_ring_states(bp);
 	bnx2_init_tx_ring(bp);
 	bnx2_init_rx_ring(bp);
 	return 0;
@@ -4965,7 +5003,11 @@ bnx2_run_loopback(struct bnx2 *bp, int loopback_mode)
 	struct sw_bd *rx_buf;
 	struct l2_fhdr *rx_hdr;
 	int ret = -ENODEV;
-	struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[0], *tx_napi;
+
+	tx_napi = bnapi;
+	if (bp->flags & USING_MSIX_FLAG)
+		tx_napi = &bp->bnx2_napi[BNX2_TX_VEC];
 
 	if (loopback_mode == BNX2_MAC_LOOPBACK) {
[PATCH 8/9][BNX2]: Enable new tx ring.
[BNX2]: Enable new tx ring.

Enable new tx ring and add new MSIX handler and NAPI poll function for the new tx ring. Enable MSIX when the hardware supports it.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index a4ed6ca..3745fc8 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -598,7 +598,7 @@ bnx2_alloc_mem(struct bnx2 *bp)
 		for (i = 1; i < BNX2_MAX_MSIX_VEC; i++) {
 			struct bnx2_napi *bnapi = &bp->bnx2_napi[i];
 
-			bnapi->status_blk = (void *)
+			bnapi->status_blk_msix = (void *)
 				((unsigned long) bp->status_blk +
 				 BNX2_SBLK_MSIX_ALIGN_SIZE * i);
 			bnapi->int_num = i << 24;
@@ -2388,10 +2388,11 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 	return cons;
 }
 
-static void
-bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
+static int
+bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int budget)
 {
 	u16 hw_cons, sw_cons, sw_ring_cons;
+	int tx_pkt = 0;
 
 	hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	sw_cons = bnapi->tx_cons;
@@ -2442,6 +2443,9 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 		sw_cons = NEXT_TX_BD(sw_cons);
 
 		dev_kfree_skb(skb);
+		tx_pkt++;
+		if (tx_pkt == budget)
+			break;
 
 		hw_cons = bnx2_get_hw_tx_cons(bnapi);
 	}
@@ -2463,6 +2467,7 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi)
 			netif_wake_queue(bp->dev);
 		netif_tx_unlock(bp->dev);
 	}
+	return tx_pkt;
 }
 
 static void
@@ -2875,6 +2880,23 @@ bnx2_interrupt(int irq, void *dev_instance)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t
+bnx2_tx_msix(int irq, void *dev_instance)
+{
+	struct net_device *dev = dev_instance;
+	struct bnx2 *bp = netdev_priv(dev);
+	struct bnx2_napi *bnapi = &bp->bnx2_napi[BNX2_TX_VEC];
+
+	prefetch(bnapi->status_blk_msix);
+
+	/* Return here if interrupt is disabled. */
+	if (unlikely(atomic_read(&bp->intr_sem) != 0))
+		return IRQ_HANDLED;
+
+	netif_rx_schedule(dev, &bnapi->napi);
+	return IRQ_HANDLED;
+}
+
 #define STATUS_ATTN_EVENTS	(STATUS_ATTN_BITS_LINK_STATE | \
 				 STATUS_ATTN_BITS_TIMER_ABORT)
 
@@ -2895,6 +2917,29 @@ bnx2_has_work(struct bnx2_napi *bnapi)
 	return 0;
 }
 
+static int bnx2_tx_poll(struct napi_struct *napi, int budget)
+{
+	struct bnx2_napi *bnapi = container_of(napi, struct bnx2_napi, napi);
+	struct bnx2 *bp = bnapi->bp;
+	int work_done = 0;
+	struct status_block_msix *sblk = bnapi->status_blk_msix;
+
+	do {
+		work_done += bnx2_tx_int(bp, bnapi, budget - work_done);
+		if (unlikely(work_done >= budget))
+			return work_done;
+
+		bnapi->last_status_idx = sblk->status_idx;
+		rmb();
+	} while (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons);
+
+	netif_rx_complete(bp->dev, napi);
+	REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD, bnapi->int_num |
+	       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+	       bnapi->last_status_idx);
+	return work_done;
+}
+
 static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 			  int work_done, int budget)
 {
@@ -2916,7 +2961,7 @@ static int bnx2_poll_work(struct bnx2 *bp, struct bnx2_napi *bnapi,
 	}
 
 	if (bnx2_get_hw_tx_cons(bnapi) != bnapi->hw_tx_cons)
-		bnx2_tx_int(bp, bnapi);
+		bnx2_tx_int(bp, bnapi, 0);
 
 	if (bnx2_get_hw_rx_cons(bnapi) != bnapi->rx_cons)
 		work_done += bnx2_rx_int(bp, bnapi, budget - work_done);
@@ -5399,10 +5444,35 @@ bnx2_free_irq(struct bnx2 *bp)
 static void
 bnx2_enable_msix(struct bnx2 *bp)
 {
+	int i, rc;
+	struct msix_entry msix_ent[BNX2_MAX_MSIX_VEC];
+
 	bnx2_setup_msix_tbl(bp);
 	REG_WR(bp, BNX2_PCI_MSIX_CONTROL, BNX2_MAX_MSIX_HW_VEC - 1);
 	REG_WR(bp, BNX2_PCI_MSIX_TBL_OFF_BIR, BNX2_PCI_GRC_WINDOW2_BASE);
 	REG_WR(bp, BNX2_PCI_MSIX_PBA_OFF_BIT, BNX2_PCI_GRC_WINDOW3_BASE);
+
+	for (i = 0; i < BNX2_MAX_MSIX_VEC; i++) {
+		msix_ent[i].entry = i;
+		msix_ent[i].vector = 0;
+	}
+
+	rc = pci_enable_msix(bp->pdev, msix_ent, BNX2_MAX_MSIX_VEC);
+	if (rc != 0)
+		return;
+
+	bp->irq_tbl[BNX2_BASE_VEC].handler = bnx2_msi_1shot;
+	bp->irq_tbl[BNX2_TX_VEC].handler = bnx2_tx_msix;
+
+	strcpy(bp->irq_tbl[BNX2_BASE_VEC].name, bp->dev->name);
+	strcat(bp->irq_tbl[BNX2_BASE_VEC].name, "-base");
+	strcpy(bp->irq_tbl[BNX2_TX_VEC].name, bp->dev->name);
+	strcat(bp->irq_tbl[BNX2_TX_VEC].name, "-tx");
+
+	bp->irq_nvecs = BNX2_MAX_MSIX_VEC;
+	bp->flags |= USING_MSIX_FLAG |
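The budgeted reclaim loop that this patch adds to bnx2_tx_int() can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the driver code: `reclaim_tx()` is a hypothetical name, each loop step stands for one completed packet, and the reserved-index skipping done by NEXT_TX_BD() is left out.

```c
#include <stdint.h>

/*
 * Advance the software consumer index toward the hardware index,
 * counting one packet per step and stopping once `budget` packets
 * are done. The counter is incremented before the comparison, so a
 * budget of 0 never matches and the loop runs until it catches up.
 */
static int reclaim_tx(uint16_t *sw_cons, uint16_t hw_cons, int budget)
{
	int done = 0;

	while (*sw_cons != hw_cons) {
		(*sw_cons)++;		/* wraps from 65535 to 0 */
		done++;
		if (done == budget)
			break;
	}
	return done;
}
```

This matches the two call sites in the patch as far as the model goes: the new tx poll function passes its remaining NAPI budget, while bnx2_poll_work() passes 0 and so reclaims everything that is pending.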
[PATCH 9/9][BNX2]: Update version to 1.7.1.
[BNX2]: Update version to 1.7.1.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 08b0349..69a3ce3 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -56,8 +56,8 @@
 
 #define DRV_MODULE_NAME		"bnx2"
 #define PFX DRV_MODULE_NAME	": "
 
-#define DRV_MODULE_VERSION	"1.7.0"
-#define DRV_MODULE_RELDATE	"December 11, 2007"
+#define DRV_MODULE_VERSION	"1.7.1"
+#define DRV_MODULE_RELDATE	"December 19, 2007"
 
 #define RUN_AT(x) (jiffies + (x))
 
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:

ok, that's just bad and if there's no user-definable limit to the deferral I definitely don't like this change. Can I safely assume that any irq will cause all deferred timers to run?

I think even other causes for wakeup like process related ones will cause the CPU to go busy and run the timers. This, coupled with the fact that no one is yet able to reach 0 wakeups per second, makes it pretty unlikely that deferrable timers will be deferred indefinitely.

If this is the case then for e1000 this patch is still OK since the watchdog needs to run (1) after a link up/down interrupt or (2) to update statistics. Those statistics won't increase if there is no traffic of course...

I think it is reasonable for Network driver watchdogs to use a deferrable timer - if the machine is 100% IDLE there is no one needing the network to be up. If there is something running even on the other CPU - that is going to cause an IPI, reschedule, TLB invalidation etc. which will make it very likely in practice that each CPU will be interrupted in a reasonable amount of time.

Of course there are theoretical cases where we could land in a situation where a CPU in a multiprocessor machine is IDLE indefinitely and that causes the watchdog that happens to be bound to run on the same CPU to not run. To take care of these unlikely cases I think the timer mechanism should have a reasonable limit on how long a CPU can go IDLE if there are deferrable timers.

Parag
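The limit suggested at the end of this message could look roughly like the following user-space model. It is only one possible reading of the suggestion: `struct bd_timer`, `bounded_idle_wakeup()` and the `max_defer` parameter are hypothetical, no such interface is described in the thread, and tick wraparound is handled only by saturating at ULONG_MAX.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; not a kernel data structure. */
struct bd_timer {
	unsigned long expires;	/* requested expiry in ticks */
	bool deferrable;	/* may slip while the CPU is idle */
};

/*
 * Latest tick an idle CPU may sleep until when a deferrable timer is
 * allowed to slip by at most max_defer ticks past its expiry.
 */
static unsigned long bounded_idle_wakeup(const struct bd_timer *t, size_t n,
					 unsigned long max_defer)
{
	unsigned long next = ULONG_MAX;

	for (size_t i = 0; i < n; i++) {
		unsigned long deadline = t[i].expires;

		if (t[i].deferrable) {
			/* Saturate instead of overflowing. */
			if (deadline > ULONG_MAX - max_defer)
				deadline = ULONG_MAX;
			else
				deadline += max_defer;
		}
		if (deadline < next)
			next = deadline;
	}
	return next;
}
```

With such a cap, a CPU that holds only deferrable timers still gets a finite wakeup time, which is the property asked for above.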
Re: [PATCH] sky2: Use deferrable timer for watchdog
Kok, Auke wrote:
> ok, that's just bad and if there's no user-definable limit to the deferral I definitely don't like this change.
>
> Can I safely assume that any irq will cause all deferred timers to run?

*on that cpu*. Timers are per cpu, as are interrupts. Just not per se the same one ...
Re: [PATCH] sky2: Use deferrable timer for watchdog
Parag Warudkar wrote:
> On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:
>> ok, that's just bad and if there's no user-definable limit to the deferral I definitely don't like this change.
>>
>> Can I safely assume that any irq will cause all deferred timers to run?
>
> I think even other causes for wakeup like process related ones will cause the CPU to go busy and run the timers. This, coupled with the fact that no one is yet able to reach 0 wakeups per second makes it pretty unlikely that deferrable timers will be deferred indefinitely.

0.8 is easy on single core today. multicore just increases how idle you can be for a given core.

>> If this is the case then for e1000 this patch is still OK since the watchdog needs to run (1) after a link up/down interrupt or (2) to update statistics. Those statistics won't increase if there is no traffic of course...
>
> I think it is reasonable for Network driver watchdogs to use a deferrable timer - if the machine is 100% IDLE there is no one needing the network to be up. If there is something running even on the other CPU - that is going to cause an IPI, reschedule, TLB invalidation etc. which will make it very likely in practice that each CPU will be interrupted in reasonable amount of time.

this is not correct; many machines are idle waiting for network data. Think of webservers...

> Of course there are theoretical cases where we could land into a situation where a CPU in a multiprocessor machine is IDLE infinitely and that causes the watchdog that happens to be bound to run on the same CPU to not run. To take care of these unlikely cases I think the timer mechanism should have a reasonable limit on how long a CPU can go IDLE if there are deferrable timers.

how about something else instead: a timer mechanism that takes a range instead.. that at least has defined semantics; the deferrable semantics really are indefinite. Let's keep at least the semantics clear and clean.
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Thu, 20 Dec 2007, Parag Warudkar wrote:
> On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:
>> ok, that's just bad and if there's no user-definable limit to the deferral I definitely don't like this change.
>>
>> Can I safely assume that any irq will cause all deferred timers to run?
>
> I think even other causes for wakeup like process related ones will cause the CPU to go busy and run the timers. This, coupled with the fact that no one is yet able to reach 0 wakeups per second makes it pretty unlikely that deferrable timers will be deferred indefinitely.
>
>> If this is the case then for e1000 this patch is still OK since the watchdog needs to run (1) after a link up/down interrupt or (2) to update statistics. Those statistics won't increase if there is no traffic of course...
>
> I think it is reasonable for Network driver watchdogs to use a deferrable timer - if the machine is 100% IDLE there is no one needing the network to be up.

Please note that being connected to a network does not only mean to send but also to receive.

Best regards,
Krzysztof Oledzki
Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly
Satoru SATOH wrote, On 12/20/2007 05:21 PM:
> i see. HZ can be 1000.. i should be wrong.
>
> however, i got the following,
>
> [root iproute2.org]# ./ip/ip route change 192.168.140.0/24 dev eth1 rto_min 4s
> [root iproute2.org]# gdb -q ./ip/ip
> ...
> (gdb) p hz
> $1 = 10

That's why I had some doubts! I didn't study this enough, but my (older) version definitely showed hz == 100. Maybe I'm wrong, but looking into lib/util.c it seems this could be set differently depending on system's configuration (or even kernel version).

So, probably this patch could sometimes work even for HZ 1000, but since it's your patch, I hope you do some additional checking if it's always like this...

Cheers,
Jarek P.
Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly
Jarek Poplawski wrote, On 12/20/2007 09:24 PM:
> ... but since it's your patch, I hope you do some additional checking if it's always like this...

...or maybe only changing this all a little bit will make it look safer!

Jarek P.
[PATCH v2 2/4] [POWERPC][NET] ucc_geth_mii and users: get rid of device_type
device_type property is bogus, thus use proper compatible.

Also change compatible property to "fsl,ucc-mdio".

Per http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048388.html

Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]
---
 arch/powerpc/boot/dts/mpc832x_mds.dts |    3 +--
 arch/powerpc/boot/dts/mpc832x_rdb.dts |    3 +--
 arch/powerpc/boot/dts/mpc836x_mds.dts |    3 +--
 arch/powerpc/boot/dts/mpc8568mds.dts  |    2 +-
 drivers/net/ucc_geth_mii.c            |    3 +++
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc832x_mds.dts b/arch/powerpc/boot/dts/mpc832x_mds.dts
index 588d658..8844d30 100644
--- a/arch/powerpc/boot/dts/mpc832x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc832x_mds.dts
@@ -255,8 +255,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <2320 18>;
-			device_type = "mdio";
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			phy3: [EMAIL PROTECTED] {
 				interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc832x_rdb.dts b/arch/powerpc/boot/dts/mpc832x_rdb.dts
index 719f375..a7a2e45 100644
--- a/arch/powerpc/boot/dts/mpc832x_rdb.dts
+++ b/arch/powerpc/boot/dts/mpc832x_rdb.dts
@@ -236,8 +236,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <3120 18>;
-			device_type = "mdio";
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			phy00:[EMAIL PROTECTED] {
 				interrupt-parent = <&pic>;
diff --git a/arch/powerpc/boot/dts/mpc836x_mds.dts b/arch/powerpc/boot/dts/mpc836x_mds.dts
index 8d7124e..5f0b427 100644
--- a/arch/powerpc/boot/dts/mpc836x_mds.dts
+++ b/arch/powerpc/boot/dts/mpc836x_mds.dts
@@ -288,8 +288,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <2120 18>;
-			device_type = "mdio";
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			phy0: [EMAIL PROTECTED] {
 				interrupt-parent = < &ipic >;
diff --git a/arch/powerpc/boot/dts/mpc8568mds.dts b/arch/powerpc/boot/dts/mpc8568mds.dts
index 89add8d..ea70010 100644
--- a/arch/powerpc/boot/dts/mpc8568mds.dts
+++ b/arch/powerpc/boot/dts/mpc8568mds.dts
@@ -356,7 +356,7 @@
 			#address-cells = <1>;
 			#size-cells = <0>;
 			reg = <2120 18>;
-			compatible = "ucc_geth_phy";
+			compatible = "fsl,ucc-mdio";
 
 			/* These are the same PHYs as on
 			 * gianfar's MDIO bus */
diff --git a/drivers/net/ucc_geth_mii.c b/drivers/net/ucc_geth_mii.c
index df884f0..e3ba14a 100644
--- a/drivers/net/ucc_geth_mii.c
+++ b/drivers/net/ucc_geth_mii.c
@@ -256,6 +256,9 @@ static struct of_device_id uec_mdio_match[] = {
 		.type = "mdio",
 		.compatible = "ucc_geth_phy",
 	},
+	{
+		.compatible = "fsl,ucc-mdio",
+	},
 	{},
 };
 
--
1.5.2.2
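For readers unfamiliar with match tables, here is a rough userspace sketch of why adding an entry with only .compatible set lets the driver bind to nodes that no longer carry device_type. It is not the kernel's OF matching code (which also handles compatible string lists, among other things); struct match_entry, struct node, field_ok() and match_node() are made-up names for illustration, and the only assumption carried over from the patch is that an unset field means "don't care".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct of_device_id: only the two fields
 * touched by the patch are modelled. NULL means "don't care". */
struct match_entry {
	const char *type;
	const char *compatible;
};

/* Hypothetical stand-in for a device tree node's relevant properties. */
struct node {
	const char *device_type;	/* NULL if the property is absent */
	const char *compatible;
};

/* A field matches if the table does not constrain it, or if the node
 * has the property and the strings are equal. */
static int field_ok(const char *want, const char *have)
{
	return want == NULL || (have != NULL && strcmp(want, have) == 0);
}

/* Return the index of the first matching entry, or -1 if none matches.
 * The table is terminated by an entry with both fields NULL. */
static int match_node(const struct match_entry *table, const struct node *n)
{
	for (int i = 0; table[i].type || table[i].compatible; i++)
		if (field_ok(table[i].type, n->device_type) &&
		    field_ok(table[i].compatible, n->compatible))
			return i;
	return -1;
}

/* Mirrors the shape of the table after the patch: old entry kept for
 * existing device trees, new entry keyed on compatible alone. */
static const struct match_entry uec_mdio_match_sketch[] = {
	{ .type = "mdio", .compatible = "ucc_geth_phy" },
	{ .compatible = "fsl,ucc-mdio" },
	{ NULL, NULL },
};
```

Under these assumptions a node with device_type "mdio" and compatible "ucc_geth_phy" hits the first entry, while a node with only compatible "fsl,ucc-mdio" hits the second, which is what keeps old and new device trees working side by side.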
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Dec 20, 2007 3:04 PM, Arjan van de Ven [EMAIL PROTECTED] wrote:
>> I think it is reasonable for Network driver watchdogs to use a deferrable timer - if the machine is 100% IDLE there is no one needing the network to be up. If there is something running even on the other CPU - that is going to cause an IPI, reschedule, TLB invalidation etc. which will make it very likely in practice that each CPU will be interrupted in reasonable amount of time.
>
> this is not correct; many machines are idle waiting for network data. Think of webservers...

Yes, I forgot the receive case. So if a server was 100% IDLE and a web server was listening for network data and we reach 0 wakeups per second on the CPU where the network watchdog timer is scheduled to run deferred _and_ the network link went down, it would cause the watchdog to not run and redo the link until someone else wakes up that CPU later.

So as long as we make sure we don't convert every timer to deferrable we should be ok - maybe this can be resolved easily by having a non-deferrable dont-allow-deferring-for-too-long timer on each CPU that just causes at least one wake up in some reasonable time delta from the previous wakeup (whoever caused that one.) It is still beneficial in that all deferrable timers would run at once without needing to have separate wakeup for each.

>> Of course there are theoretical cases where we could land into a situation where a CPU in a multiprocessor machine is IDLE infinitely and that causes the watchdog that happens to be bound to run on the same CPU to not run. To take care of these unlikely cases I think the timer mechanism should have a reasonable limit on how long a CPU can go IDLE if there are deferrable timers.
>
> how about something else instead: a timer mechanism that takes a range instead.. that at least has defined semantics; the deferrable semantics really are indefinite. Let's keep at least the semantics clear and clean.

Would not the simpler solution of installing a non-deferrable timer per cpu which will not allow the CPU to go IDLE for more than x units of time at once (or something to that effect) work? Range would complicate the thing and I am not sure how many cases will know reasonably correct range for their normal operation. In this instance of the e1000 watchdog what range could it give and be successful at what it wants to do - bring up the link in reasonable amount of time, while also realizing the power savings?

Perhaps depending on Server/Laptop/Desktop machine (maybe based on Preemption) we could have normal or deferrable timers but that'll exclude Servers from power savings and I am not sure Data center folks will like that :) .

Parag
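To make the two proposals in this thread concrete, here is a small sketch of the semantics being debated, written as plain userspace C rather than kernel timer code. It assumes a timer described by an [earliest, latest] window: it piggybacks on the first wakeup that falls inside the window and otherwise forces a wakeup at "latest". A per-CPU cap on deferral is the special case latest = earliest + max_defer. The function name and the idea of passing a sorted list of known wakeups are illustrative assumptions, not an existing kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Decide when a range timer would run.
 *
 * earliest, latest: the window in which the callback may run.
 * wakeups:          times at which the CPU wakes up anyway (interrupts,
 *                   other timers), sorted in ascending order.
 * n:                number of entries in wakeups.
 *
 * The timer rides along with the first wakeup inside the window; if
 * there is none, it forces its own wakeup at "latest", so the delay is
 * bounded instead of indefinite.
 */
static uint64_t range_timer_fire_time(uint64_t earliest, uint64_t latest,
				      const uint64_t *wakeups, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (wakeups[i] >= earliest && wakeups[i] <= latest)
			return wakeups[i];
	return latest;
}
```

With wakeups at 100, 2500 and 9000 and a window of [2000, 4000] the timer would run at 2500 without costing an extra wakeup; with no wakeup in the window it would run at 4000, which is the bounded behaviour a plain deferrable timer does not give.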
Re: After many hours all outbound connections get stuck in SYN_SENT
On Thu, 20 Dec 2007, James Nichols wrote:
>> I still don't understand. tcpdump -p -n -s 1600 -c 1 doesn't reveal User data at all.
>>
>> Without any exact data from you, I am afraid nobody can help.
>
> Oh, I didn't see that you specified specific options. I'll still have to anonymize 2000+ IP addresses, but I think there is an open source tool that will do this for you.

Even a simple for loop in shell can do that. It's not that hard and there's very little need for manual work! Ingredients: for, cut, grep and sed.

--
 i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Thu, 20 Dec 2007, James Nichols wrote:
>> You probably should also investigate the Linux kernel, especially the size and locks of the components of the Sack data structures and what happens to those data structures after Sack is disabled (presumably the Sack data structure is in some unhappy circumstance, and disabling Sack allows the data to be discarded, magically unclogging the box).

...Not sure if you want now to invent such structure. Yes, we have per skb ->sacked but again in SYN_SENT there are very few things that touch it at all, and they just set it to zero (though it would not even be mandatory for tcp_transmit_skb, IIRC, checked that just a couple of days ago due to other things). Another thing is the rx_opt.sack_ok which is just a couple of flag bits that tell the TCP variant in use (and it's mostly used only after SYN handshake completes). The rest (the actual SACK blocks) is in the ack_skb but again it has very little meaning in SYN_SENT state unless somebody is crazy enough to add SACK blocks to SYN-ACKs :-).

>> In the absence of the reporter wanting to dump the kernel's core, how about a patch to print the Sack datastructure when the command to disable Sack is received by the kernel? Maybe just print the last 16b of the IP address?
>
> Given the fact that I've had this problem for so long, over a variety of networking hardware vendors and colo-facilities, this really sounds good to me. It will be challenging for me to justify a kernel core dump, but a simple patch to dump the Sack data would be do-able.

If your symptoms really are: SYNs leaving (if they show up in tcpdump, for sure they've left TCP code already) and SYN-ACK not showing up even in something as early as in tcpdump (for sure TCP side code didn't execute at that point yet), there's very little chance that Linux' TCP code has some bug in it; the only things that do something in such a scenario are the SYN generation and retransmitting SYNs (and those are trivially verifiable from tcpdump).

--
 i.
Re: [PATCH] sky2: Use deferrable timer for watchdog
On Thu, 20 Dec 2007 15:36:13 -0500 Parag Warudkar [EMAIL PROTECTED] wrote:
> On Dec 20, 2007 3:04 PM, Arjan van de Ven [EMAIL PROTECTED] wrote:
>>> I think it is reasonable for Network driver watchdogs to use a deferrable timer - if the machine is 100% IDLE there is no one needing the network to be up. If there is something running even on the other CPU - that is going to cause an IPI, reschedule, TLB invalidation etc. which will make it very likely in practice that each CPU will be interrupted in reasonable amount of time.
>>
>> this is not correct; many machines are idle waiting for network data. Think of webservers...
>
> Yes, I forgot the receive case. So if a server was 100% IDLE and a web server was listening for network data and we reach 0 wakeups per second on the CPU where the network watchdog timer is scheduled to run deferred _and_ the network link went down, it would cause the watchdog to not run and redo the link until someone else wakes up that CPU later.
>
> So as long as we make sure we don't convert every timer to deferrable we should be ok - maybe this can be resolved easily by having a non-deferrable dont-allow-deferring-for-too-long timer on each CPU that just causes at least one wake up in some reasonable time delta from the previous wakeup (whoever caused that one.) It is still beneficial in that all deferrable timers would run at once without needing to have separate wakeup for each.
>
>>> Of course there are theoretical cases where we could land into a situation where a CPU in a multiprocessor machine is IDLE infinitely and that causes the watchdog that happens to be bound to run on the same CPU to not run. To take care of these unlikely cases I think the timer mechanism should have a reasonable limit on how long a CPU can go IDLE if there are deferrable timers.
>>
>> how about something else instead: a timer mechanism that takes a range instead.. that at least has defined semantics; the deferrable semantics really are indefinite. Let's keep at least the semantics clear and clean.
>
> Would not the simpler solution of installing a non-deferrable timer per cpu which will not allow the CPU to go IDLE for more than x units of time at once (or something to that effect) work? Range would complicate the thing and I am not sure how many cases will know reasonably correct range for their normal operation. In this instance of the e1000 watchdog what range could it give and be successful at what it wants to do - bring up the link in reasonable amount of time, while also realizing the power savings?
>
> Perhaps depending on Server/Laptop/Desktop machine (maybe based on Preemption) we could have normal or deferrable timers but that'll exclude Servers from power savings and I am not sure Data center folks will like that :) .
>
> Parag

The problem is that on a server the receiver will go deaf if the chip bug that the watchdog is looking for triggers. Yes, no packets in and it happily will just sit there.

So for now, I am not going to apply your simple patch, and will instead work on a two stage timer per arjan's suggestion for a later release.

--
Stephen Hemminger [EMAIL PROTECTED]
Re: After many hours all outbound connections get stuck in SYN_SENT
James Nichols wrote:
>> I still don't understand. tcpdump -p -n -s 1600 -c 1 doesn't reveal User data at all.
>>
>> Without any exact data from you, I am afraid nobody can help.
>
> Oh, I didn't see that you specified specific options. I'll still have to anonymize 2000+ IP addresses, but I think there is an open source tool that will do this for you.

tcpdump -p -n -s 1600 -c 1 | perl -pe 's/(\d+\.\d+\.\d+\.\d+)/HIDE.THIS.IP.ADDR/g'

-justinb

--
Justin Banks
BakBone Software
[EMAIL PROTECTED]
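For anyone who would rather not depend on perl, the same substitution can be sketched in a few lines of C. This is only an illustration of what the one-liner above does (replace every run of four dot-separated digit groups with a fixed token); digits(), dotted_quad_len() and hide_ips() are made-up helper names, and like the regexp it makes no attempt to check that each group is in the 0-255 range.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Number of consecutive decimal digits at the start of s. */
static size_t digits(const char *s)
{
	size_t n = 0;

	while (isdigit((unsigned char)s[n]))
		n++;
	return n;
}

/* Length of a "d+.d+.d+.d+" pattern starting at s, or 0 if s does not
 * start with one. */
static size_t dotted_quad_len(const char *s)
{
	size_t len = 0;

	for (int part = 0; part < 4; part++) {
		size_t n = digits(s + len);

		if (n == 0)
			return 0;
		len += n;
		if (part < 3) {
			if (s[len] != '.')
				return 0;
			len++;
		}
	}
	return len;
}

/* Copy "in" to "out" (capacity "cap" bytes, including the terminating
 * NUL), replacing each dotted quad with "mask".
 * Returns 0 on success, -1 if "out" is too small. */
static int hide_ips(const char *in, char *out, size_t cap, const char *mask)
{
	size_t o = 0, mlen = strlen(mask);

	if (cap == 0)
		return -1;
	while (*in) {
		size_t q = dotted_quad_len(in);

		if (q) {
			if (o + mlen >= cap)
				return -1;
			memcpy(out + o, mask, mlen);
			o += mlen;
			in += q;
		} else {
			if (o + 1 >= cap)
				return -1;
			out[o++] = *in++;
		}
	}
	out[o] = '\0';
	return 0;
}
```

Note that tcpdump prints the port as a fifth dotted field; as with the regexp, only the first four groups are replaced and the port suffix is left in place.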
Re: [PATCH] sky2: Use deferrable timer for watchdog
Stephen Hemminger wrote:
> On Thu, 20 Dec 2007 15:36:13 -0500 Parag Warudkar [EMAIL PROTECTED] wrote:
>> On Dec 20, 2007 3:04 PM, Arjan van de Ven [EMAIL PROTECTED] wrote:
>>>> I think it is reasonable for Network driver watchdogs to use a deferrable timer - if the machine is 100% IDLE there is no one needing the network to be up. If there is something running even on the other CPU - that is going to cause an IPI, reschedule, TLB invalidation etc. which will make it very likely in practice that each CPU will be interrupted in reasonable amount of time.
>>>
>>> this is not correct; many machines are idle waiting for network data. Think of webservers...
>>
>> Yes, I forgot the receive case. So if a server was 100% IDLE and a web server was listening for network data and we reach 0 wakeups per second on the CPU where the network watchdog timer is scheduled to run deferred _and_ the network link went down, it would cause the watchdog to not run and redo the link until someone else wakes up that CPU later.
>>
>> So as long as we make sure we don't convert every timer to deferrable we should be ok - maybe this can be resolved easily by having a non-deferrable dont-allow-deferring-for-too-long timer on each CPU that just causes at least one wake up in some reasonable time delta from the previous wakeup (whoever caused that one.) It is still beneficial in that all deferrable timers would run at once without needing to have separate wakeup for each.
>>
>>>> Of course there are theoretical cases where we could land into a situation where a CPU in a multiprocessor machine is IDLE infinitely and that causes the watchdog that happens to be bound to run on the same CPU to not run. To take care of these unlikely cases I think the timer mechanism should have a reasonable limit on how long a CPU can go IDLE if there are deferrable timers.
>>>
>>> how about something else instead: a timer mechanism that takes a range instead.. that at least has defined semantics; the deferrable semantics really are indefinite. Let's keep at least the semantics clear and clean.
>>
>> Would not the simpler solution of installing a non-deferrable timer per cpu which will not allow the CPU to go IDLE for more than x units of time at once (or something to that effect) work? Range would complicate the thing and I am not sure how many cases will know reasonably correct range for their normal operation. In this instance of the e1000 watchdog what range could it give and be successful at what it wants to do - bring up the link in reasonable amount of time, while also realizing the power savings?
>>
>> Perhaps depending on Server/Laptop/Desktop machine (maybe based on Preemption) we could have normal or deferrable timers but that'll exclude Servers from power savings and I am not sure Data center folks will like that :) .
>>
>> Parag
>
> The problem is that on a server the receiver will go deaf if the chip bug that the watchdog is looking for triggers. Yes, no packets in and it happily will just sit there.
>
> So for now, I am not going to apply your simple patch and work on a two stage timer per arjan's suggestion for a later release.

I also think that's the right way to go for now. I'll ask jeff to hold off on the two patches for now.

Auke
Re: [PATCH 1/2] e1000e: Use deferrable timer for watchdog
Auke Kok wrote:
> From: Parag Warudkar [EMAIL PROTECTED]
>
> Reduce wakeups from idle per second.
>
> Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
> Signed-off-by: Auke Kok [EMAIL PROTECTED]
> ---

Jeff,

given the discussion with Stephen I'd like to skip merging this patch and the e1000 one for now. The unforeseen implications of this are just not controlled enough and we need to guarantee some limit of deferral first.

Auke
[PATCH 0/3] XFRM audit fixes/additions for net-2.6.25
Three patches backed against net-2.6.25 from today. Some of the audit messages are a little difficult to test by their nature but I've verified that I'm still able to send/receive IPsec protected traffic with the patches applied.

The first patch was posted before but David decided it best to split the patch so some parts could be pulled into 2.6.24; the patch was split and the 2.6.24 bits were accepted (the SPI byteorder fix) so patch #1 in the series is what is left for 2.6.25.

The second patch was posted before as an RFC patch without anyone complaining too loudly. Eric Paris made some suggestions about better handling of the op= audit field and I've tried to take that into account with this patch.

The final patch is the audit replay counter overflow issue fix that has been talked about on netdev. This sounded like the best course of action from the discussion but if I'm wrong, just drop this patch and I'll cook up something else to solve the problem.

Thanks.

--
paul moore
linux security @ hp
[PATCH 3/3] XFRM: Drop packets when replay counter would overflow
According to RFC4303, section 3.3.3 we need to drop outgoing packets which cause the replay counter to overflow:

   3.3.3.  Sequence Number Generation

   The sender's counter is initialized to 0 when an SA is established. The sender increments the sequence number (or ESN) counter for this SA and inserts the low-order 32 bits of the value into the Sequence Number field. Thus, the first packet sent using a given SA will contain a sequence number of 1.

   If anti-replay is enabled (the default), the sender checks to ensure that the counter has not cycled before inserting the new value in the Sequence Number field. In other words, the sender MUST NOT send a packet on an SA if doing so would cause the sequence number to cycle. An attempt to transmit a packet that would result in sequence number overflow is an auditable event. The audit log entry for this event SHOULD include the SPI value, current date/time, Source Address, Destination Address, and (in IPv6) the cleartext Flow ID.

Signed-off-by: Paul Moore [EMAIL PROTECTED]
---
 net/xfrm/xfrm_output.c |    5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index ebb..284eeef 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -57,8 +57,11 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 
 		if (x->type->flags & XFRM_TYPE_REPLAY_PROT) {
 			XFRM_SKB_CB(skb)->seq = ++x->replay.oseq;
-			if (unlikely(x->replay.oseq == 0))
+			if (unlikely(x->replay.oseq == 0)) {
+				x->replay.oseq--;
 				xfrm_audit_state_replay_overflow(x, skb);
+				goto error;
+			}
 			if (xfrm_aevent_is_on())
 				xfrm_replay_notify(x, XFRM_REPLAY_UPDATE);
 		}
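The rule in the quoted RFC text can be sketched outside the kernel as a counter that refuses to wrap. The snippet below is an illustration only: struct replay_sketch and next_oseq() are made-up names, and it tests for the wrap before storing the new value, whereas the patch above increments first and then undoes the increment. Both approaches leave the counter parked at its maximum so that every later send attempt on the same SA fails as well.

```c
#include <stdint.h>

/* Hypothetical, minimal stand-in for the outbound replay state. */
struct replay_sketch {
	uint32_t oseq;	/* last sequence number handed out; 0 = none yet */
};

/* Hand out the next outbound sequence number.
 *
 * Returns 0 and stores the number in *seq on success. Returns -1 if
 * sending would make the 32-bit counter cycle back to 0; in that case
 * the state is left untouched, so the caller should drop the packet
 * (and, per the RFC, log an audit event).
 */
static int next_oseq(struct replay_sketch *r, uint32_t *seq)
{
	uint32_t next = r->oseq + 1;	/* unsigned arithmetic wraps to 0 */

	if (next == 0)
		return -1;
	r->oseq = next;
	*seq = next;
	return 0;
}
```

The first call on a fresh state yields sequence number 1, matching the RFC's "the first packet sent using a given SA will contain a sequence number of 1".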
[PATCH 1/3] XFRM: Assorted IPsec fixups
This patch fixes a number of small but potentially troublesome things in the XFRM/IPsec code:

 * Use the 'audit_enabled' variable already in include/linux/audit.h
   Removed the need for extern declarations local to each XFRM audit function

 * Convert 'sid' to 'secid' everywhere we can
   The 'sid' name is specific to SELinux, 'secid' is the common naming convention used by the kernel when referring to tokenized LSM labels, unfortunately we have to leave 'ctx_sid' in 'struct xfrm_sec_ctx' otherwise we risk breaking userspace

 * Convert address display to use standard NIP* macros
   Similar to what was recently done with the SPD audit code, this also includes the removal of some unnecessary memcpy() calls

 * Move common code to xfrm_audit_common_stateinfo()
   Code consolidation from the "less is more" book on software development

 * Proper spacing around commas in function arguments
   Minor style tweak since I was already touching the code

Signed-off-by: Paul Moore [EMAIL PROTECTED]
---
 include/net/xfrm.h     | 14 ++---
 net/xfrm/xfrm_policy.c | 15 ++
 net/xfrm/xfrm_state.c  | 53
 3 files changed, 36 insertions(+), 46 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 32b99e2..ac6cf09 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -548,7 +548,7 @@ struct xfrm_audit
 };
 
 #ifdef CONFIG_AUDITSYSCALL
-static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 sid)
+static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 secid)
 {
 	struct audit_buffer *audit_buf = NULL;
 	char *secctx;
@@ -561,8 +561,8 @@ static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 sid)
 
 	audit_log_format(audit_buf, "auid=%u", auid);
 
-	if (sid != 0 &&
-	    security_secid_to_secctx(sid, &secctx, &secctx_len) == 0) {
+	if (secid != 0 &&
+	    security_secid_to_secctx(secid, &secctx, &secctx_len) == 0) {
 		audit_log_format(audit_buf, "subj=%s", secctx);
 		security_release_secctx(secctx, secctx_len);
 	} else
@@ -571,13 +571,13 @@ static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 sid)
 }
 
 extern void xfrm_audit_policy_add(struct xfrm_policy *xp, int result,
-				  u32 auid, u32 sid);
+				  u32 auid, u32 secid);
 extern void xfrm_audit_policy_delete(struct xfrm_policy *xp, int result,
-				  u32 auid, u32 sid);
+				  u32 auid, u32 secid);
 extern void xfrm_audit_state_add(struct xfrm_state *x, int result,
-				 u32 auid, u32 sid);
+				 u32 auid, u32 secid);
 extern void xfrm_audit_state_delete(struct xfrm_state *x, int result,
-				    u32 auid, u32 sid);
+				    u32 auid, u32 secid);
 #else
 #define xfrm_audit_policy_add(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_policy_delete(x, r, a, s)	do { ; } while (0)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d2084b1..c8f0656 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -24,6 +24,7 @@
 #include <linux/netfilter.h>
 #include <linux/module.h>
 #include <linux/cache.h>
+#include <linux/audit.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
 #include <net/ip.h>
@@ -2317,15 +2318,14 @@ static inline void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
 	}
 }
 
-void
-xfrm_audit_policy_add(struct xfrm_policy *xp, int result, u32 auid, u32 sid)
+void xfrm_audit_policy_add(struct xfrm_policy *xp, int result,
+			   u32 auid, u32 secid)
 {
 	struct audit_buffer *audit_buf;
-	extern int audit_enabled;
 
 	if (audit_enabled == 0)
 		return;
-	audit_buf = xfrm_audit_start(sid, auid);
+	audit_buf = xfrm_audit_start(auid, secid);
 	if (audit_buf == NULL)
 		return;
 	audit_log_format(audit_buf, "op=SPD-add res=%u", result);
@@ -2334,15 +2334,14 @@ xfrm_audit_policy_add(struct xfrm_policy *xp, int result, u32 auid, u32 sid)
 }
 EXPORT_SYMBOL_GPL(xfrm_audit_policy_add);
 
-void
-xfrm_audit_policy_delete(struct xfrm_policy *xp, int result, u32 auid, u32 sid)
+void xfrm_audit_policy_delete(struct xfrm_policy *xp, int result,
+			      u32 auid, u32 secid)
 {
 	struct audit_buffer *audit_buf;
-	extern int audit_enabled;
 
 	if (audit_enabled == 0)
 		return;
-	audit_buf = xfrm_audit_start(sid, auid);
+	audit_buf = xfrm_audit_start(auid, secid);
 	if (audit_buf == NULL)
 		return;
 	audit_log_format(audit_buf, "op=SPD-delete res=%u", result);
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 95df01c..dd38e6f 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -19,6 +19,7 @@
 #include
[PATCH 2/3] XFRM: RFC4303 compliant auditing
This patch adds a number of new IPsec audit events to meet the auditing
requirements of RFC4303. This includes audit hooks for the following events:

 * Could not find a valid SA [sections 2.1, 3.4.2]
   . xfrm_audit_state_notfound()
   . xfrm_audit_state_notfound_simple()

 * Sequence number overflow [section 3.3.3]
   . xfrm_audit_state_replay_overflow()

 * Replayed packet [section 3.4.3]
   . xfrm_audit_state_replay()

 * Integrity check failure [sections 3.4.4.1, 3.4.4.2]
   . xfrm_audit_state_icvfail()

While RFC4303 deals only with ESP, most of the changes in this patch apply to
IPsec in general, i.e. both AH and ESP. In the one case, integrity check
failure, where ESP specific code had to be modified, the same was done to the
AH code for the sake of consistency.

Signed-off-by: Paul Moore [EMAIL PROTECTED]
---
 include/net/xfrm.h     |  33
 net/ipv4/ah4.c         |   4
 net/ipv4/esp4.c        |   1
 net/ipv6/ah6.c         |   2
 net/ipv6/esp6.c        |   1
 net/ipv6/xfrm6_input.c |   4
 net/xfrm/xfrm_input.c  |   6
 net/xfrm/xfrm_output.c |   2
 net/xfrm/xfrm_policy.c |  14
 net/xfrm/xfrm_state.c  | 153
 10 files changed, 184 insertions(+), 36 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index ac6cf09..941d5cd 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -548,26 +548,33 @@ struct xfrm_audit
 };
 
 #ifdef CONFIG_AUDITSYSCALL
-static inline struct audit_buffer *xfrm_audit_start(u32 auid, u32 secid)
+static inline struct audit_buffer *xfrm_audit_start(const char *op)
 {
 	struct audit_buffer *audit_buf = NULL;
-	char *secctx;
-	u32 secctx_len;
 
+	if (audit_enabled == 0)
+		return NULL;
 	audit_buf = audit_log_start(current->audit_context, GFP_ATOMIC,
-			      AUDIT_MAC_IPSEC_EVENT);
+				    AUDIT_MAC_IPSEC_EVENT);
 	if (audit_buf == NULL)
 		return NULL;
+	audit_log_format(audit_buf, "op=%s", op);
+	return audit_buf;
+}
 
-	audit_log_format(audit_buf, "auid=%u", auid);
+static inline void xfrm_audit_helper_usrinfo(u32 auid, u32 secid,
+					     struct audit_buffer *audit_buf)
+{
+	char *secctx;
+	u32 secctx_len;
 
+	audit_log_format(audit_buf, " auid=%u", auid);
 	if (secid != 0 &&
 	    security_secid_to_secctx(secid, &secctx, &secctx_len) == 0) {
 		audit_log_format(audit_buf, " subj=%s", secctx);
 		security_release_secctx(secctx, secctx_len);
 	} else
 		audit_log_task_context(audit_buf);
-	return audit_buf;
 }
 
 extern void xfrm_audit_policy_add(struct xfrm_policy *xp, int result,
@@ -578,11 +585,22 @@ extern void xfrm_audit_state_add(struct xfrm_state *x, int result,
 				 u32 auid, u32 secid);
 extern void xfrm_audit_state_delete(struct xfrm_state *x, int result,
 				    u32 auid, u32 secid);
+extern void xfrm_audit_state_replay_overflow(struct xfrm_state *x,
+					     struct sk_buff *skb);
+extern void xfrm_audit_state_notfound_simple(struct sk_buff *skb, u16 family);
+extern void xfrm_audit_state_notfound(struct sk_buff *skb, u16 family,
+				      __be32 net_spi, __be32 net_seq);
+extern void xfrm_audit_state_icvfail(struct xfrm_state *x,
+				     struct sk_buff *skb, u8 proto);
 #else
 #define xfrm_audit_policy_add(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_policy_delete(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_state_add(x, r, a, s)	do { ; } while (0)
 #define xfrm_audit_state_delete(x, r, a, s)	do { ; } while (0)
+#define xfrm_audit_state_replay_overflow(x, s)	do { ; } while (0)
+#define xfrm_audit_state_notfound_simple(s, f)	do { ; } while (0)
+#define xfrm_audit_state_notfound(s, f, sp, sq)	do { ; } while (0)
+#define xfrm_audit_state_icvfail(x, s, p)	do { ; } while (0)
 #endif /* CONFIG_AUDITSYSCALL */
 
 static inline void xfrm_pol_hold(struct xfrm_policy *policy)
@@ -1193,7 +1211,8 @@ extern int xfrm_state_delete(struct xfrm_state *x);
 extern int xfrm_state_flush(u8 proto, struct xfrm_audit *audit_info);
 extern void xfrm_sad_getinfo(struct xfrmk_sadinfo *si);
 extern void xfrm_spd_getinfo(struct xfrmk_spdinfo *si);
-extern int xfrm_replay_check(struct xfrm_state *x, __be32 seq);
+extern int xfrm_replay_check(struct xfrm_state *x,
+			     struct sk_buff *skb, __be32 seq);
 extern void xfrm_replay_advance(struct xfrm_state *x, __be32 seq);
 extern void xfrm_replay_notify(struct xfrm_state *x, int event);
 extern int xfrm_state_mtu(struct xfrm_state *x, int mtu);
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index d76803a..ec8de0a 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -179,8 +179,10 @@ static int ah_input(struct
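For readers following the xfrm.h hunk above: the patch splits the old xfrm_audit_start() so that the record always begins with the operation name, while the auid/subj fields are appended by a separate helper only where a user context exists. What follows is a rough userspace sketch of that split, not kernel code. Everything prefixed demo_ is a hypothetical stand-in, the fixed-size buffer replaces struct audit_buffer, and the op string used in the usage note is a placeholder rather than something taken from the patch text shown here. The kernel helper falls back to audit_log_task_context() when no secid is available; the sketch simply omits subj in that case.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's struct audit_buffer. */
struct demo_audit_buf {
	char text[256];
	size_t len;
};

/* Stand-in for the kernel's audit_enabled flag. */
static int demo_audit_enabled = 1;

/* Append formatted text; a simplified analogue of audit_log_format(). */
static void demo_log(struct demo_audit_buf *ab, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	assert(ab != NULL);
	if (ab->len >= sizeof(ab->text) - 1)
		return;			/* buffer already full */
	room = sizeof(ab->text) - ab->len;

	va_start(ap, fmt);
	n = vsnprintf(ab->text + ab->len, room, fmt, ap);
	va_end(ap);

	if (n < 0)
		return;			/* formatting error, keep old text */
	if ((size_t)n >= room)
		ab->len = sizeof(ab->text) - 1;	/* output was truncated */
	else
		ab->len += (size_t)n;
}

/* Analogue of the reworked xfrm_audit_start(): only the operation name. */
static struct demo_audit_buf *demo_audit_start(struct demo_audit_buf *ab,
					       const char *op)
{
	if (demo_audit_enabled == 0)
		return NULL;		/* auditing off: no record at all */
	ab->len = 0;
	ab->text[0] = '\0';
	demo_log(ab, "op=%s", op);
	return ab;
}

/*
 * Analogue of xfrm_audit_helper_usrinfo(): user fields are appended
 * separately. Simplification: a NULL subj just skips the field.
 */
static void demo_audit_usrinfo(struct demo_audit_buf *ab, unsigned int auid,
			       const char *subj)
{
	demo_log(ab, " auid=%u", auid);
	if (subj != NULL)
		demo_log(ab, " subj=%s", subj);
}
```

A caller on the configuration path would call demo_audit_start(&ab, "SA-notfound") and then demo_audit_usrinfo(&ab, auid, subj); a packet-path caller, which has no user to attribute the event to, would presumably stop after demo_audit_start() and append its own packet details instead.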
Re: [IPSEC]: Rename tunnel-mode functions to avoid collisions with tunnels
From: Herbert Xu [EMAIL PROTECTED] Date: Wed, 19 Dec 2007 14:38:33 +0800 [IPSEC]: Rename tunnel-mode functions to avoid collisions with tunnels It appears that I've managed to create two different functions both called xfrm6_tunnel_output. This is because we have the plain tunnel encapsulation named xfrmX_tunnel as well as the tunnel-mode encapsulation which lives in the files xfrmX_mode_tunnel.c. This patch renames functions from the latter to use the xfrmX_mode_tunnel prefix to avoid name-space conflicts. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied, thanks Herbert.
Re: [PATCH] include/net/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:25 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied.
Re: [PATCH] net/dccp/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:30 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied.
Re: [PATCH] net/irda/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:33 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied.
Re: [PATCH] net/ipv6/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:32 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied.
Re: [PATCH] net/core/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:29 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied.
Re: [PATCH] net/sched/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:36 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied, thanks.
Re: [PATCH] net/netlabel/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:35 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied, thanks.
Re: [PATCH] net/sctp/: Spelling fixes
From: Joe Perches [EMAIL PROTECTED] Date: Mon, 17 Dec 2007 11:40:37 -0800 Signed-off-by: Joe Perches [EMAIL PROTECTED] Applied.