d80211: minor review item: generic_lock
generic_lock does not appear to be used at all. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: d80211 merge plans
Mohamed Abbas wrote:

David Miller wrote:

I think this is a non-starter until the SMP problems are worked out.

Is it still SMP challenged? I have been using the d80211 stack for about a month and have not encountered any SMP issues. We are currently involving validation engineers to do more stress tests and will see if any SMP issues come up.

Well, tests are interesting, but I would rather see a real _analysis_ of the locking. Locking is provable, you know...

Jeff
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
On Wed, 2006-08-09 at 16:54 -0700, David Miller wrote:

From: Peter Zijlstra [EMAIL PROTECTED]
Date: Wed, 09 Aug 2006 15:32:33 +0200

The idea is to drop all !NFS packets (or even more specific only keep those NFS packets that belong to the critical mount), and everybody doing critical IO over layered networks like IPSec or other tunnel constructs asks for trouble - Just DON'T do that.

People are doing I/O over IP exactly for its ubiquity and flexibility. It seems a major limitation of the design if you cancel out major components of this flexibility.

We're not, that was a bit of my own frustration leaking out; I think this whole push to IP based storage is a bit silly. I'm just not going to help the admin whose server just hangs because his VPN key expired.

Running critical resources remotely like this is tricky, and every hop/layer you put in between increases the risk of something going bad. The only setup I think even remotely sane is a dedicated network in the very same room - not unlike FC but cheaper (which I think is the whole push behind this, eth is cheap).
Re: [PATCH 4/6] ehea: header files
Hi Jan-Bernd, I haven't read all of this, but a few things caught my eye ... cheers On Wed, 2006-08-09 at 10:39 +0200, Jan-Bernd Themann wrote: Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED] drivers/net/ehea/ehea.h| 452 + drivers/net/ehea/ehea_hw.h | 319 +++ 2 files changed, 771 insertions(+) --- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea.h 1969-12-31 16:00:00.0 -0800 +++ kernel/drivers/net/ehea/ehea.h2006-08-08 23:59:39.927452928 -0700 @@ -0,0 +1,452 @@ +/* + * linux/drivers/net/ehea/ehea.h + * + * eHEA ethernet device driver for IBM eServer System p + * + * (C) Copyright IBM Corp. 2006 + * + * Authors: + * Christoph Raisch [EMAIL PROTECTED] + * Jan-Bernd Themann [EMAIL PROTECTED] + * Heiko-Joerg Schick [EMAIL PROTECTED] + * Thomas Klein [EMAIL PROTECTED] + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ */ + +#ifndef __EHEA_H__ +#define __EHEA_H__ + +#include linux/version.h +#include linux/module.h +#include linux/moduleparam.h +#include linux/kernel.h +#include linux/vmalloc.h +#include linux/mm.h +#include linux/slab.h +#include linux/sched.h +#include linux/err.h +#include linux/list.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include linux/kthread.h +#include linux/ethtool.h +#include linux/if_vlan.h +#include asm/ibmebus.h +#include asm/of_device.h +#include asm/abs_addr.h +#include asm/semaphore.h +#include asm/current.h +#include asm/io.h + +#define EHEA_DRIVER_NAME IBM eHEA +#define EHEA_DRIVER_VERSION EHEA_0015 + +#define NET_IP_ALIGN 0 +#define EHEA_NUM_TX_QP 1 +#ifdef EHEA_SMALL_QUEUES +#define EHEA_MAX_CQE_COUNT 1020 +#define EHEA_MAX_ENTRIES_SQ1020 +#define EHEA_MAX_ENTRIES_RQ1 4080 +#define EHEA_MAX_ENTRIES_RQ2 1020 +#define EHEA_MAX_ENTRIES_RQ3 1020 +#define EHEA_SWQE_REFILL_TH 100 +#else +#define EHEA_MAX_CQE_COUNT32000 +#define EHEA_MAX_ENTRIES_SQ 16000 +#define EHEA_MAX_ENTRIES_RQ1 32080 +#define EHEA_MAX_ENTRIES_RQ2 4020 +#define EHEA_MAX_ENTRIES_RQ3 4020 +#define EHEA_SWQE_REFILL_TH1000 +#endif + +#define EHEA_MAX_ENTRIES_EQ 20 + +#define EHEA_SG_SQ 2 +#define EHEA_SG_RQ1 1 +#define EHEA_SG_RQ2 0 +#define EHEA_SG_RQ3 0 + +#define EHEA_MAX_PACKET_SIZE9022 /* for jumbo frame */ +#define EHEA_RQ2_PKT_SIZE 1522 +#define EHEA_LL_PKT_SIZE 256 + +/* Send completion signaling */ +#define EHEA_SIG_IV 1000 +#define EHEA_SIG_IV_LONG 4 + +/* Protection Domain Identifier */ +#define EHEA_PD_ID0xaabcdeff + +#define EHEA_RQ2_THRESHOLD 1 +/* use RQ3 threshold of 1522 bytes */ +#define EHEA_RQ3_THRESHOLD 9 + +#define EHEA_SPEED_10G 1 +#define EHEA_SPEED_1G 1000 +#define EHEA_SPEED_100M 100 +#define EHEA_SPEED_10M10 + +/* Broadcast/Multicast registration types */ +#define EHEA_BCMC_SCOPE_ALL 0x08 +#define EHEA_BCMC_SCOPE_SINGLE 0x00 +#define EHEA_BCMC_MULTICAST 0x04 +#define EHEA_BCMC_BROADCAST 0x00 +#define EHEA_BCMC_UNTAGGED 0x02 +#define 
EHEA_BCMC_TAGGED 0x00 +#define EHEA_BCMC_VLANID_ALL 0x01 +#define EHEA_BCMC_VLANID_SINGLE 0x00 + +/* Use this define to kmallocate PHYP control blocks */ +#define H_CB_ALIGNMENT 4096 + +#define EHEA_PAGESHIFT 12 +#define EHEA_PAGESIZE 4096UL +#define EHEA_CACHE_LINE 128 This looks like a very bad idea, what happens if you're running on a machine with 64K pages? + +#define EHEA_ENABLE 1 +#define EHEA_DISABLE 0 Do you really need hash defines for 0 and 1 ? They're fairly well understood in C as meaning true and false. + +/* Memory Regions */ +#define EHEA_MR_MAX_TX_PAGES 20 +#define EHEA_MR_TX_DATA_PN 3 +#define EHEA_MR_ACC_CTRL 0x0080 +#define EHEA_RWQES_PER_MR_RQ2 10 +#define EHEA_RWQES_PER_MR_RQ3 10 + + +void ehea_set_ethtool_ops(struct net_device *netdev); + +#ifndef KEEP_EDEBS_BELOW +#define KEEP_EDEBS_BELOW 8 +#endif + +extern int ehea_trace_level; + +#ifdef EHEA_NO_EDEB +#define EDEB_P_GENERIC(level, idstring, format, args...) \ + while (0 == 1) { \ + if(unlikely (level = ehea_trace_level)) { \ +
Re: [RFC][PATCH 2/9] deadlock prevention core
On Wed, 2006-08-09 at 16:58 -0700, David Miller wrote:

From: Peter Zijlstra [EMAIL PROTECTED]
Date: Wed, 09 Aug 2006 16:07:20 +0200

Hmm, what does sk_buff::input_dev do? That seems to store the initial device?

You can run grep on the tree just as easily as I can which is what I did to answer this question. It only takes a few seconds of your time to grep the source tree for things like skb->input_dev, so would you please do that before asking more questions like this?

That is exactly what I did, but I wanted a bit of confirmation. Sorry if it offends you, but I'm a bit new to this network thing.

It does store the initial device, but as Thomas tried so hard to explain to you guys these device pointers in the skb are transient and you cannot refer to them outside of packet receive processing.

Yes, I understood that after Thomas' last mail.

The reason is that there is no refcounting performed on these devices when they are attached to the skb, for performance reasons, and thus the device can be downed, the module for it removed, etc. long before the skb is freed up.

I understood that, thanks.
Re: [take6 1/3] kevent: Core files.
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 10:14:33 +0400

On Wed, Aug 09, 2006 at 03:21:27PM -0700, Andrew Morton ([EMAIL PROTECTED]) wrote:

On big-endian machines, this pointer will appear to be word-swapped as far as a 64-bit kernel is concerned. Or something. IOW: What's going on here??

It is user data - I put there a union just to simplify userspace, so it should not require some typecasting.

And this is consistent with the similar mechanism we use for netlink socket dumping, so that we don't have compat layer crap just because we provide a place for the user to store his pointer or whatever there.

+ k->kevent_entry.next = LIST_POISON1;
+ k->storage_entry.prev = LIST_POISON2;
+ k->ready_entry.next = LIST_POISON1;

Nope ;)

I use pointer checks to determine if entry is in the list or not, why it is frowned upon here?

As Andrew mentioned in another posting, these poison macros are likely to simply go away some day, so you should not use them.

If you want pointer encoded tags you use internally, define your own.
Re: [take6 1/3] kevent: Core files.
On Wed, Aug 09, 2006 at 11:42:35PM -0700, David Miller ([EMAIL PROTECTED]) wrote:

+ k->kevent_entry.next = LIST_POISON1;
+ k->storage_entry.prev = LIST_POISON2;
+ k->ready_entry.next = LIST_POISON1;

Nope ;)

I use pointer checks to determine if entry is in the list or not, why it is frowned upon here?

As Andrew mentioned in another posting, these poison macros are likely to simply go away some day, so you should not use them.

They have existed for ages and can suddenly go away?..

If you want pointer encoded tags you use internally, define your own.

I think if I add code like this

list_del(&k->entry);
k->entry.prev = KEVENT_POISON1;
k->entry.next = KEVENT_POISON2;

I will be advised to give myself a lobotomy. I have enough space in flags in each kevent, so I will use some bits there.

-- 
Evgeniy Polyakov
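One alternative raised in this thread is to detach an entry with list_del_init() and then test it with list_empty(), rather than comparing pointers against poison values. The following is only a minimal userspace sketch of that idea: the list helpers are simplified re-implementations rather than the kernel's code, struct demo_event and demo_event_is_queued() are hypothetical names, and the sketch ignores the RCU constraints mentioned elsewhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* True when the node points at itself, i.e. it is not linked anywhere. */
static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert node right after head. */
static void list_add(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Unlink the node and re-initialise it so list_empty(node) becomes true. */
static void list_del_init(struct list_head *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	init_list_head(node);
}

/* Hypothetical event with a single list linkage, for illustration only. */
struct demo_event {
	struct list_head entry;
};

/* Membership test that needs neither poison values nor extra flag bits. */
static bool demo_event_is_queued(const struct demo_event *e)
{
	return !list_empty(&e->entry);
}
```

The entry has to be initialised before first use and always removed with list_del_init(); a plain unlink would leave stale pointers and make the membership test meaningless.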
Re: [take6 1/3] kevent: Core files.
On Thu, 10 Aug 2006 10:14:33 +0400 Evgeniy Polyakov [EMAIL PROTECTED] wrote: + union { + __u32 user[2];/* User's data. It is not used, just copied to/from user. */ + void*ptr; + }; +}; What is this union for? `ptr' needs a __user tag, does it not? Not, it is never touched by kernel. hrm, if you say so. +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes -enqueue(), -dequeue() and -callback() callbacks. + * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent-event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + spin_lock_init(k-ulock); + k-kevent_entry.next = LIST_POISON1; + k-storage_entry.prev = LIST_POISON2; + k-ready_entry.next = LIST_POISON1; Nope ;) I use pointer checks to determine if entry is in the list or not, why it is frowned upon here? Please do not say about poisoning which takes a lot of cpu cycles to get new cachelines and so on - everything in that entry is in the cache, since entry was added/deleted/accessed through list walk macro. poisoning which takes a lot of cpu cycles. So there ;) I assure you, that poisoning code might disappear at any time. If you want to be able to determine whether a list_head has been detached you can detach it with list_del_init() and then use list_empty() on it. +} + +late_initcall(kevent_sys_init); Why is it late_initcall? (A comment is needed) Why not? Why? There must have been some reason for having made this a late_initcall() and that reason is 100% concealed from the reader of this code. IOW, it needs a comment. +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num) +{ + unsigned int *idx; + + idx = (unsigned int *)u-pring[0]; This is a bit ugly. I specially use first 4 bytes in the first page to store index there, since it must be accessed from userspace and kernelspace. 
Sure, but the C language is the preferred way in which we communicate and calcuate pointer offsets. + idx[0] = num; +} + +/* + * Note that kevents does not exactly fill the page (each ukevent is 40 bytes), + * so we reuse 4 bytes at the begining of the first page to store index. + * Take that into account if you want to change size of struct ukevent. + */ +#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent)) How about doing struct ukevent_ring { unsigned int index; struct ukevent[0]; } and removing all those nasty typeasting and offsetting games? In fact you can even do struct ukevent_ring { struct ukevent[(PAGE_SIZE - sizeof(unsigned int)) / sizeof(struct ukevent)]; unsigned int index; }; if you're careful ;) Ring takes more than one page, so it will be struct ukevent_ring_0 and struct ukevent_ring_other. Does it really needed? Not a big problem, if you do thing it worse it. Well, I've given a couple of prototype-style suggestions. Please take a look, see if all this open-coded offsetting magic can be done by the compiler in some reliable and readable fashion. It might not work out, but I suspect it will. + u-pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL); + if (!u-pring) + return -ENOMEM; + + for (i=0; ipnum; ++i) { + u-pring[i] = __get_free_page(GFP_KERNEL); + if (!u-pring) bug: this is testing the wrong thing. HOw come? Take a closer look ;) __get_free_page() can return 0 if page was not allocated. And that 0 is copied to u-pring[0], not to u-pring. The function name is mistyped. Did you miss an OK? It needs s/kevnet_user_mmap/kevent_user_mmap/g This code doesn't have many comments, does it? What are we mapping here, and why would an application want to map it? That code waits comments from people who requested it. It is ring of the ready events, which can be read by userspace instead of calling syscall, so syscall just becomes wait until there is a place or something like that. hm. 
Well, please fully comment code prior to sending it out for review. I do go on about this, but trust me, it makes the review much more effective. Afaict this mmap function gives a user a free way of getting pinned memory. What is the upper bound on the amount of memory which a user can thus obtain? +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) wonders what this function does Let me guess... It modifies kevent? :) I will add comments. +{ + struct kevent *k; + unsigned int hash = kevent_user_hash(uk); + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(u-kevent_lock, flags);
Re: [PATCH 1/6] ehea: interface to network stack
On Thu, 2006-08-10 at 16:15 +1000, Michael Ellerman wrote:

+ struct hcp_query_ehea_port_cb_2 *cb2 = NULL;
+ struct net_device_stats *stats = &port->stats;
+
+ EDEB_EN(7, "net_device=%p", dev);
+
+ cb2 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+ if (!cb2) {
+ EDEB_ERR(4, "No memory for cb2");
+ goto get_stat_exit;

You leak cb2 here.

+ }
+
+ hret = ehea_h_query_ehea_port(adapter->handle,
+ port->logical_port_id,
+ H_PORT_CB2,
+ H_PORT_CB2_ALL,
+ cb2);
+
+ if (hret != H_SUCCESS) {
+ EDEB_ERR(4, "query_ehea_port failed for cb2");
+ goto get_stat_exit;

Sorry, here.

cheers

-- 
Michael Ellerman
IBM OzLabs
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
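The leak pointed out above is the usual "error path after a successful allocation" problem. A minimal userspace sketch of one common way to avoid it, using two exit labels, might look like the following. Everything here is an assumption for illustration: demo_query_port() stands in for the hypervisor call, calloc()/free() stand in for kzalloc()/kfree(), and the counter exists only so the behaviour can be checked. It is not the ehea driver's code.

```c
#include <stdlib.h>

#define DEMO_CB_SIZE 4096 /* stand-in for H_CB_ALIGNMENT */

/* Counts control blocks that have been allocated but not yet freed. */
static int demo_live_allocations;

/* Hypothetical stand-in for the hypervisor query; returns 0 on success. */
static int demo_query_port(void *cb, int should_fail)
{
	(void)cb;
	return should_fail ? -1 : 0;
}

static int demo_get_stats(int query_should_fail)
{
	int ret = -1;
	void *cb2 = calloc(1, DEMO_CB_SIZE);

	if (!cb2)
		goto out; /* nothing was allocated, so nothing to free */
	demo_live_allocations++;

	if (demo_query_port(cb2, query_should_fail) != 0)
		goto out_free; /* failure after allocation: must still free */

	ret = 0; /* success falls through to the same cleanup */

out_free:
	free(cb2);
	demo_live_allocations--;
out:
	return ret;
}
```

With both the success path and the late failure path passing through out_free, the buffer is released exactly once regardless of how the function exits.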
Re: [take6 1/3] kevent: Core files.
On Thu, Aug 10, 2006 at 12:18:44AM -0700, Andrew Morton ([EMAIL PROTECTED]) wrote: + spin_lock_init(k-ulock); + k-kevent_entry.next = LIST_POISON1; + k-storage_entry.prev = LIST_POISON2; + k-ready_entry.next = LIST_POISON1; Nope ;) I use pointer checks to determine if entry is in the list or not, why it is frowned upon here? Please do not say about poisoning which takes a lot of cpu cycles to get new cachelines and so on - everything in that entry is in the cache, since entry was added/deleted/accessed through list walk macro. poisoning which takes a lot of cpu cycles. So there ;) I assure you, that poisoning code might disappear at any time. If you want to be able to determine whether a list_head has been detached you can detach it with list_del_init() and then use list_empty() on it. I can't due to RCU rules. +} + +late_initcall(kevent_sys_init); Why is it late_initcall? (A comment is needed) Why not? Why? There must have been some reason for having made this a late_initcall() and that reason is 100% concealed from the reader of this code. kevent must be initialized before use, and it must happen before userspace started, so I use late_initcall(), as I said it can be anything other which is called before userspace. IOW, it needs a comment. Sure. I'm working right now on fixing all issues mentioned in this thread, and comments are not on the last place. +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num) +{ + unsigned int *idx; + + idx = (unsigned int *)u-pring[0]; This is a bit ugly. I specially use first 4 bytes in the first page to store index there, since it must be accessed from userspace and kernelspace. Sure, but the C language is the preferred way in which we communicate and calcuate pointer offsets. + idx[0] = num; +} + +/* + * Note that kevents does not exactly fill the page (each ukevent is 40 bytes), + * so we reuse 4 bytes at the begining of the first page to store index. 
+ * Take that into account if you want to change size of struct ukevent. + */ +#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent)) How about doing struct ukevent_ring { unsigned int index; struct ukevent[0]; } and removing all those nasty typeasting and offsetting games? In fact you can even do struct ukevent_ring { struct ukevent[(PAGE_SIZE - sizeof(unsigned int)) / sizeof(struct ukevent)]; unsigned int index; }; if you're careful ;) Ring takes more than one page, so it will be struct ukevent_ring_0 and struct ukevent_ring_other. Does it really needed? Not a big problem, if you do thing it worse it. Well, I've given a couple of prototype-style suggestions. Please take a look, see if all this open-coded offsetting magic can be done by the compiler in some reliable and readable fashion. It might not work out, but I suspect it will. I think I will use structure with index on each page, since kevents are unaligned to exaclty fit page, and it can be some kind of (later) optimisation to use not global counter, but per-page one. + u-pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL); + if (!u-pring) + return -ENOMEM; + + for (i=0; ipnum; ++i) { + u-pring[i] = __get_free_page(GFP_KERNEL); + if (!u-pring) bug: this is testing the wrong thing. HOw come? Take a closer look ;) [i] My fault :) __get_free_page() can return 0 if page was not allocated. And that 0 is copied to u-pring[0], not to u-pring. The function name is mistyped. Did you miss an OK? It needs s/kevnet_user_mmap/kevent_user_mmap/g It is already fixed :) This code doesn't have many comments, does it? What are we mapping here, and why would an application want to map it? That code waits comments from people who requested it. It is ring of the ready events, which can be read by userspace instead of calling syscall, so syscall just becomes wait until there is a place or something like that. hm. Well, please fully comment code prior to sending it out for review. 
I do go on about this, but trust me, it makes the review much more effective. Afaict this mmap function gives a user a free way of getting pinned memory. What is the upper bound on the amount of memory which a user can thus obtain? it is limited by maximum queue length which is 4k entries right now, so maximum number of paged here is 4k*40/page_size, i.e. about 40 pages on x86. +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) wonders what this function does Let me guess... It modifies kevent? :) I will add
Re: [take6 1/3] kevent: Core files.
On Thu, 10 Aug 2006 11:50:47 +0400 Evgeniy Polyakov [EMAIL PROTECTED] wrote:

Afaict this mmap function gives a user a free way of getting pinned memory. What is the upper bound on the amount of memory which a user can thus obtain?

it is limited by maximum queue length which is 4k entries right now, so maximum number of pages here is 4k*40/page_size, i.e. about 40 pages on x86.

Is that per user or per fd? If the latter that is, with the usual RLIMIT_NOFILE, 160MBytes. 2GB with 64k pagesize. Problem ;)
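The 160 MByte figure for 4 KiB pages can be reproduced with a small calculation. The sketch below assumes the numbers quoted in the thread (a queue of 4096 entries, 40 bytes per ukevent) plus a descriptor limit of 1024, which is an assumption about what "the usual RLIMIT_NOFILE" means; the function names are hypothetical.

```c
#include <stdint.h>

/* Pages needed to hold `entries` records of `entry_size` bytes, rounded up. */
static uint64_t demo_ring_pages(uint64_t entries, uint64_t entry_size,
				uint64_t page_size)
{
	return (entries * entry_size + page_size - 1) / page_size;
}

/* Worst case if every one of `nfds` descriptors maps a full ring. */
static uint64_t demo_pinned_bytes(uint64_t entries, uint64_t entry_size,
				  uint64_t page_size, uint64_t nfds)
{
	return demo_ring_pages(entries, entry_size, page_size) * page_size * nfds;
}
```

For example, demo_ring_pages(4096, 40, 4096) gives the 40 pages mentioned above, and demo_pinned_bytes(4096, 40, 4096, 1024) gives 40 * 4096 * 1024 bytes, i.e. 160 MiB for a single process at that descriptor limit.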
Re: [PATCH 2/5] socket: code style cleanup
On Wed, 09 Aug 2006 11:31:40 -0700 Stephen Hemminger [EMAIL PROTECTED] wrote:

Make socket.c conform to current style:
* run through Lindent
* get rid of unneeded casts
* split assignment and comparison where possible

stares at a stream of rejects. Sighs

-static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf,
- size_t size, loff_t pos);
-static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf,
- size_t size, loff_t pos);
-static int sock_mmap(struct file *file, struct vm_area_struct * vma);
+static ssize_t sock_aio_read(struct kiocb *iocb, char __user * buf,
+ size_t size, loff_t pos);
+static ssize_t sock_aio_write(struct kiocb *iocb, const char __user * buf,
+ size_t size, loff_t pos);
+static int sock_mmap(struct file *file, struct vm_area_struct *vma);

The s/ *buf/ * buf/ is inconsistent, illogical and, IMO, wrong.

goes off to fix the rejects
Re: [RFC] [GIT PATCH] IPv6 Routing / Ndisc Fixes
Hello.

In article [EMAIL PROTECTED] (at Thu, 10 Aug 2006 00:37:14 +0300), Ville Nuorvala [EMAIL PROTECTED] says:

commit e0ad64d5b44179ea1296d737dec23279c72c9636
Author: YOSHIFUJI Hideaki [EMAIL PROTECTED]
Date: Wed Aug 9 17:08:33 2006 +0900

[IPV6] NDISC: Allow redirects from other interfaces if it is not strict.

Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED]

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4650787..1698fec 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1322,7 +1322,7 @@ restart:
 continue;
 if (!(rt->rt6i_flags & RTF_GATEWAY))
 continue;
- if (fl->oif != rt->rt6i_dev->ifindex)
+ if ((flags & RT6_F_STRICT) && fl->oif != rt->rt6i_dev->ifindex)
 continue;
 if (!ipv6_addr_equal(&rdfl->gateway, &rt->rt6i_gateway))
 continue;

Is this absolutely safe? Doesn't this enable a malicious node on another link to make a bogus redirect if it uses the same link-local source address as the real router on the other link? Keep in mind that the RT6_F_STRICT flag is set based on the destination of the original redirected packet and doesn't in any way depend on the router or source address.
:

Ah, you're right. I'll drop this.

As a result of the original lookup (with a possibly ambiguous output interface), one interface for the original output is selected. Which means, we have a route for the (original) destination through that interface. Redirects shall come from that interface. So, it is enough to look up routes on that interface.

Thanks.

--yoshfuji
Re: [PATCH 2/5] socket: code style cleanup
From: Andrew Morton [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 01:19:50 -0700

goes off to fix the rejects

Just pull from net-2.6.19, Stephen's stuff is all merged in there.

I'll fix the * var stuff, I hate that too :-)
Re: [take6 1/3] kevent: Core files.
On Thu, Aug 10, 2006 at 01:02:54AM -0700, Andrew Morton ([EMAIL PROTECTED]) wrote:

Afaict this mmap function gives a user a free way of getting pinned memory. What is the upper bound on the amount of memory which a user can thus obtain?

it is limited by maximum queue length which is 4k entries right now, so maximum number of pages here is 4k*40/page_size, i.e. about 40 pages on x86.

Is that per user or per fd? If the latter that is, with the usual RLIMIT_NOFILE, 160MBytes. 2GB with 64k pagesize. Problem ;)

Per kevent fd. I have some ideas about a better mmap ring implementation, which would dynamically grow its buffer when events are added and reuse the same place for next events, but there are some nitpicks unresolved yet. Let's leave it out of the next releases (no merge of course), until a better solution is ready. I will change that area when other things are ready.

-- 
Evgeniy Polyakov
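The ring layout suggested earlier in this thread (an index followed by an array of events, declared as one structure so the compiler computes the offsets) could be sketched roughly as below. This is a sketch under stated assumptions only: struct demo_ukevent is a hypothetical 40-byte placeholder because the real struct ukevent layout is not shown here, the page size is fixed at 4096 for the example, and all demo_ names are made up.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical 40-byte event record standing in for struct ukevent. */
struct demo_ukevent {
	unsigned char payload[40];
};

/* As many events as fit in a page after the leading index field. */
#define DEMO_EVENTS_PER_PAGE \
	((DEMO_PAGE_SIZE - sizeof(unsigned int)) / sizeof(struct demo_ukevent))

/* One page of the ring: the index first, then the events. */
struct demo_ukevent_ring {
	unsigned int index;
	struct demo_ukevent event[DEMO_EVENTS_PER_PAGE];
};

/* The compiler checks the fit instead of hand-written offset arithmetic. */
_Static_assert(sizeof(struct demo_ukevent_ring) <= DEMO_PAGE_SIZE,
	       "ring must fit in one page");

/* Setting the index needs no casts of the page address. */
static void demo_ring_set_index(struct demo_ukevent_ring *ring, unsigned int num)
{
	ring->index = num;
}
```

A per-page structure like this also matches the idea of keeping an index on each page, since the same type can describe every page of a multi-page ring.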
Re: [RFC] [GIT PATCH] IPv6 Routing / Ndisc Fixes
YOSHIFUJI Hideaki wrote:

As a result of the original lookup (with a possibly ambiguous output interface), one interface for the original output is selected. Which means, we have a route for the (original) destination through that interface. Redirects shall come from that interface. So, it is enough to look up routes on that interface.

Yes, exactly.

Regards,
Ville
Doubt about locking in ixgb driver
hi all,

The transmit functions of ethernet drivers (dev->hard_start_xmit) are protected to prevent multiple transmits from executing in parallel. The general scheme used by most drivers is:

1. Reset the NETIF_F_LLTX flag in dev->features and then use the kernel locking provided through HARD_TX_LOCK (net/core/dev.c:3417)
OR
2. Use an internal driver lock, generally kept in the adapter, to prevent multiple accesses.

In the ixgb driver (drivers/net/ixgb/), there is a lock in the driver's adapter (adapter->tx_lock). But this lock is released before the ixgb_xmit_frame() function returns. adapter->tx_ring.next_to_use, which I suppose is the index of the next element to use from tx_ring, is accessed outside the area where the lock is held.

What prevents a race condition when accessing adapter->tx_ring.next_to_use? Are multiple instances of xmit prevented from running, or is it fine for multiple instances of xmit to run?

Regards,
Mithlesh Thukral
Re: Hello, We had some patch need to submit for sundance.c
Jesse Huang wrote:

Dear All: We had some patch need to submit. Would you tell me where to get current sundance.c for myself to generate those patch files.

Sorry, I only got this link: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;h=f13b2a195c708fe32d8c53d05988875a51bd52e1;hb=1668b19f75cb949f930814a23b74201ad6f76a53;f=drivers/net/sundance.c

You need to install the git software package, and then check out the upstream branch of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git

Then provide patches against the drivers/net/sundance.c driver found there.

git software download: http://www.kernel.org/pub/software/scm/git/
git overview: http://git.or.cz/
git tutorial: http://www.kernel.org/pub/software/scm/git/docs/tutorial.html
git man pages: http://www.kernel.org/pub/software/scm/git/docs

Thanks,
Jeff
Re: Hello, We had some patch need to submit for sundance.c
Hi Jeff:

I will use sundance.c in this tree to generate patch files. Thanks for this information.

Jesse

- Original Message -
From: Jeff Garzik [EMAIL PROTECTED]
To: Jesse Huang [EMAIL PROTECTED]
Cc: Francois Romieu [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Andrew Morton [EMAIL PROTECTED]
Sent: Thursday, August 10, 2006 7:23 PM
Subject: Re: Hello, We had some patch need to submit for sundance.c
Possible leak of multicast source filter structure
Hi all!

It seems to me that there is a leak of struct ip_sf_socklist in the ip_mc_drop_socket function (in net/ipv4/igmp.c), which is called on socket close. This patch corrects it:

diff -Naur linux-2.6.17.8.orig/net/ipv4/igmp.c linux-2.6.17.8/net/ipv4/igmp.c
--- linux-2.6.17.8.orig/net/ipv4/igmp.c 2006-08-07 06:18:54.0 +0200
+++ linux-2.6.17.8/net/ipv4/igmp.c 2006-08-10 10:38:04.0 +0200
@@ -2206,9 +2206,10 @@
 (void) ip_mc_leave_src(sk, iml, in_dev);
 ip_mc_dec_group(in_dev, iml->multi.imr_multiaddr.s_addr);
 in_dev_put(in_dev);
- }
- sock_kfree_s(sk, iml, sizeof(*iml));
+ } else if (iml->sflist != NULL)
+ sock_kfree_s(sk, iml->sflist, IP_SFLSIZE(iml->sflist->sl_max));
+ sock_kfree_s(sk, iml, sizeof(*iml));
 }
 rtnl_unlock();
 }

The leak only happens if there are some multicast source filters set on a socket which are bound to an interface that does not exist any more, as in the following scenario:

1. create a temporary interface (say GRE tunnel)
3. join a multicast group and set a source filter on the temporary interface via MCAST_JOIN_SOURCE_GROUP setsockopt call
4. destroy the temporary interface
5. close the socket

This sequence of things eventually leads to a call of the ip_mc_drop_socket function, which fails to free the source filter structure ip_sf_socklist pointed to from members of the socket's multicast addresses list. This structure is normally freed in the ip_mc_leave_src function, but this function is not called in this scenario because the interface that the multicast group is joined on does not exist any more.

Thanks
Michal Ruzicka

linux-2.6.17.8-mc_sf_leak.patch Description: Binary data
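The shape of the leak and of the proposed fix can be modelled in userspace. The sketch below is an illustration under assumptions, not the igmp.c code: demo_membership and demo_filter are hypothetical stand-ins for the per-socket membership entry and its source filter list, has_device models whether the interface still exists, and the allocation counter is there only to make the leak observable.

```c
#include <stdlib.h>

/* Outstanding allocations made through the helpers below. */
static int demo_live;

static void *demo_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		demo_live++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		free(p);
		demo_live--;
	}
}

/* Hypothetical stand-in for the source filter list. */
struct demo_filter {
	int max;
};

/* Hypothetical stand-in for one multicast membership of a socket. */
struct demo_membership {
	struct demo_filter *sflist;
	int has_device; /* non-zero while the interface still exists */
};

/* Normal path: leaving the source list frees the filter. */
static void demo_leave_src(struct demo_membership *m)
{
	demo_free(m->sflist);
	m->sflist = NULL;
}

static void demo_drop_membership(struct demo_membership *m)
{
	if (m->has_device) {
		demo_leave_src(m);
	} else if (m->sflist != NULL) {
		/* Interface is gone: free the filter here or it leaks. */
		demo_free(m->sflist);
	}
	demo_free(m);
}
```

Without the else branch, dropping a membership whose interface has disappeared would free only the membership itself and leave the filter allocation outstanding.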
Re: Possible leak of multicast source filter structure
From: Michal Ruzicka [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 14:07:06 +0200

This patch corrects it:

Correct or not this patch is corrupted by your email client, turning tabs into spaces among other things. This makes your patch unusable.

Please configure your email client to not mangle the text of the patch in any way and resubmit with your original surrounding description so that it can be properly reviewed.

If in doubt, always email the patch to yourself as a test and try to apply that patch as if you were the person who might be integrating your work.

Thanks a lot.
Re: Possible leak of multicast source filter sctructure
From: David Miller [EMAIL PROTECTED] Date: Thu, 10 Aug 2006 05:12:41 -0700 (PDT) From: Michal Ruzicka [EMAIL PROTECTED] Date: Thu, 10 Aug 2006 14:07:06 +0200 This patch corrects it: Correct or not this patch is corrupted by your email client, turning tabs into spaces among other things. This makes your patch unusable. And yes I do realize you created an attachment before you bark that back. :-)
Re: [PATCH 4/9] [TULIP] Clean tulip.h so it can be used by winbond-840.c
Grant Grundler wrote: On Wed, Aug 09, 2006 at 01:33:18AM -0400, Jeff Garzik wrote: 2) nobody (but parisc folks?) knows what CBMA and CBIO mean. Just use MMIO and PIO CBIO is what's in the public documentation. I just want to make it easy for anyone who bothers to read the documentation to be sure they are reading about the right register. Thanks for clarifying. Nonetheless, I still prefer 'mmio' and 'pio' because it's more universal. Jeff
Re: [PATCH] sky2: phy power problems on 88e805X chips
Stephen Hemminger wrote:
On the 88E805X chipsets (used in laptops), the PHY was not getting powered out of shutdown properly. The variable reg1 was getting reused incorrectly. This is probably the cause of the bug. http://bugzilla.kernel.org/show_bug.cgi?id=6471

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- netdev-2.6.orig/drivers/net/sky2.c 2006-08-09 14:13:36.0 -0700
+++ netdev-2.6/drivers/net/sky2.c 2006-08-09 14:14:07.0 -0700
@@ -233,6 +233,8 @@
         if (hw->ports > 1)
             reg1 |= PCI_Y2_PHY2_COMA;
     }
+    sky2_pci_write32(hw, PCI_DEV_REG1, reg1);
+    udelay(100);
     if (hw->chip_id == CHIP_ID_YUKON_EC_U) {

applied to #upstream-fixes, though I note that the obvious PCI posting bug remains. You cannot be assured that the udelay(100) is truly effective without a flushing readl().

Jeff
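The posting concern can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct toy_dev and the toy_* helpers are hypothetical and are not sky2 or PCI APIs; the model simply treats a write as sitting in a buffer until a read from the same device forces it through, which is the behaviour a flushing read relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy device: one register plus a one-entry posted-write buffer. */
struct toy_dev {
    uint32_t reg;       /* value the device has actually latched */
    uint32_t pending;   /* write still sitting in a bridge buffer */
    int has_pending;
};

/* A posted write returns at once; the device may not have seen it yet. */
static void toy_write32(struct toy_dev *d, uint32_t v)
{
    assert(d != NULL);
    d->pending = v;
    d->has_pending = 1;
}

/* A read cannot complete before earlier posted writes are delivered. */
static uint32_t toy_read32(struct toy_dev *d)
{
    assert(d != NULL);
    if (d->has_pending) {
        d->reg = d->pending;
        d->has_pending = 0;
    }
    return d->reg;
}

/* Write, then read back, so that a following delay starts after delivery. */
static void toy_write32_flush(struct toy_dev *d, uint32_t v)
{
    toy_write32(d, v);
    (void)toy_read32(d);
}
```

In this model a delay placed right after toy_write32() may start before the device has latched the value, whereas a delay placed after toy_write32_flush() cannot.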
Re: bonding questions: replaying call to set_multicast_list and sending IGMP doing Fail-Over
Jay Vosburgh wrote: I haven't studied the effects of having large amounts of multicast traffic coming in under this situation. However, I would suspect that the MAC filters found on sufficiently modern network adapters would drop the incoming multicast traffic on the backup slaves, as only the active slave in active-backup mode has its multicast list set. That information is sent to a slave when it becomes the active slave; see the call to bond_mc_swap() made by bond_change_active_slave(). OK, I agree the MAC filter would drop the incoming traffic on the backup slaves because bond_mc_swap() calls bond_mc_delete() on the slave which becomes a backup one. But as you have noted there might be some impact on the switch. Or.
Re: [RFC] gre: transparent ethernet bridging
On Mon, Aug 07, 2006 at 11:55:14AM +1000, Philip Craig wrote: I have one machine at home that appears to be on my employer's network via such a tunnel. I don't use bridging, because I don't need any other machine at home to access this tunnel. I do want bridging, and not proxy ARP, because it allows me to run arpwatch, and doesn't require me to reconfigure something at the remote end if I, for example, want to add another IP address to my home box. Okay. If this is using Linux, do you have a patch that does this already? I use vtun: http://vtun.sourceforge.net/ But I would prefer using some in-kernel ethernet tunneling method with ipsec instead.
[take7 0/1] kevent: generic event handling mechanism.
Hello.

Generic event handling mechanism.

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning for detection of the fact, that entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes
 * fully removed AIO stuff from patchset

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the same function, so it was learned to differentiate between various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in mmaped buffer implementation which would end up in wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure, instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added optimisation, which is aimed to help when a lot of kevents are being copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

-- 
Evgeniy Polyakov
[take7 1/1] kevent: core files and timer/poll notifications.
This patch includes core kevent files: - userspace controlling - kernelspace interfaces - initialization - notification state machines - timer and poll/select notifications With this patchset rate of requests per second has achieved 2500 req/sec while with epoll/kqueue and similar techniques it is about 1600-1800 requests per second on my test hardware and trivial web server. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index dd63d47..091ff42 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -317,3 +317,5 @@ ENTRY(sys_call_table) .long sys_tee /* 315 */ .long sys_vmsplice .long sys_move_pages + .long sys_kevent_get_events + .long sys_kevent_ctl diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 5d4a7d1..b2af4a8 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -713,4 +713,6 @@ #endif .quad sys_tee .quad compat_sys_vmsplice .quad compat_sys_move_pages + .quad sys_kevent_get_events + .quad sys_kevent_ctl ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index fc1c8dd..c9dde13 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -323,10 +323,12 @@ #define __NR_sync_file_range 314 #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages317 +#define __NR_kevent_get_events 318 +#define __NR_kevent_ctl319 #ifdef __KERNEL__ -#define NR_syscalls 318 +#define NR_syscalls 320 /* * user-visible error numbers are in the range -1 - -128: see diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 94387c9..61363e0 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,14 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, 
sys_kevent_get_events) +#define __NR_kevent_ctl281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_ctl #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 000..d3ff0cd --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,302 @@ +/* + * kevent.h + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H + +/* + * Kevent request flags. + */ + +#define KEVENT_REQ_ONESHOT 0x1 /* Process this event only once and then dequeue. */ + +/* + * Kevent return flags. + */ +#define KEVENT_RET_BROKEN 0x1 /* Kevent is broken. */ +#define KEVENT_RET_DONE0x2 /* Kevent processing was finished successfully. */ + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL3 +#define KEVENT_NAIO4 +#define KEVENT_AIO 5 +#defineKEVENT_MAX 6 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#defineKEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. 
+ */ +#defineKEVENT_SOCKET_RECV 0x1 +#defineKEVENT_SOCKET_ACCEPT0x2 +#defineKEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. + */ +#defineKEVENT_INODE_CREATE 0x1 +#defineKEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#defineKEVENT_POLL_POLLIN 0x0001 +#defineKEVENT_POLL_POLLPRI 0x0002 +#defineKEVENT_POLL_POLLOUT 0x0004 +#defineKEVENT_POLL_POLLERR 0x0008 +#defineKEVENT_POLL_POLLHUP 0x0010 +#defineKEVENT_POLL_POLLNVAL0x0020 + +#defineKEVENT_POLL_POLLRDNORM 0x0040 +#define
Re: [take7 1/1] kevent: core files and timer/poll notifications.
On Thu, Aug 10, 2006 at 04:16:38PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: With this patchset rate of requests per second has achieved 2500 req/sec while with epoll/kqueue and similar techniques it is about 1600-1800 requests per second on my test hardware and trivial web server. Nope, it is old record from archives... Current one is 2600+ -- Evgeniy Polyakov
Re: [PATCH 0/6] htb: cleanup
* David Miller [EMAIL PROTECTED] 2006-08-02 15:18 From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 2 Aug 2006 12:56:36 -0700 The HTB scheduler code is a mess, this patch set does some basic house cleaning. The first four should cause no code change, but the last two need more testing. These patches look fine to me. Once everyone thinks they are ready just let me know and I'll push them into net-2.6.19 I think they are ready.
[RFC][PATCH] VM deadlock prevention core -v3
Hi, So I try again, please tell me if I'm still on crack and should go detox. However if you do so, I kindly request some words on the how and why of it. so I try to map to net_device in several fashions: 1) netdev_alloc_skb() - has an argument with the actual incoming device 2) free_skb_pages() - uses: skb-input_dev ?: skb-dev this device is pinned by virtue of netdev_wait_memalloc(), which will wait until all skb-memalloc skbuffs are destroyed. This will delay module unload under severe memory pressure, I think this is acceptable as one has other problems at that point. 3) inet_sock_destruct(), sk_set_memalloc() - both use: ip_dev_find(inet_sk(sk)-rcv_saddr)) if the later two methods do not yield the same net_device the first has, weird and wonderfull stuff will happen. Why, currently the sole purpose is to be able to limit the number of memalloc skbs per device (and for me to learn some). I suspect something as simple as a bridge device will destroy this, with that I suspect (3) will return the bride device instead of the actual input device. So, if I'm still busted, is there any hope for this approach? If not, I'll have to go do global skb memalloc accounting (largesmp fanboys need not worry, this will only happen under severe load, at that point the box will have other issues) and fudge the per deviceness. --- The core of the VM deadlock avoidance framework. From the 'user' side of things it provides a function to mark a 'struct sock' as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on the receive side. From the net_device side of things, the extra 'struct net_device *' argument to {,__}netdev_alloc_skb() is used to attribute/account the memalloc usage. When netdev_alloc_skb() finds it cannot allocate a struct sk_buff the regular way it will grab some memory from the memalloc reserve. Drivers that have been converted to the netdev_alloc_skb() family will automatically receive this feature. 
Network paths will drop !SOCK_MEMALLOC packets ASAP when reserve is being used. Memalloc sk_buff allocations are not done from the SLAB but are done using alloc_pages(). sk_buff::memalloc records this exception so that kfree_skbmem() can do the right thing. NOTE this does not play very nice with skb_clone() Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] Signed-off-by: Daniel Phillips [EMAIL PROTECTED] --- include/linux/gfp.h |3 include/linux/mmzone.h|1 include/linux/netdevice.h |7 ++ include/linux/skbuff.h|3 include/net/sock.h|8 ++ mm/page_alloc.c | 46 - net/core/dev.c| 97 net/core/skbuff.c | 155 +++--- net/core/sock.c | 25 +++ net/ethernet/eth.c|1 net/ipv4/af_inet.c|8 ++ net/ipv4/icmp.c |3 net/ipv4/tcp_ipv4.c |3 net/ipv4/udp.c|8 ++ 14 files changed, 355 insertions(+), 13 deletions(-) Index: linux-2.6/include/linux/gfp.h === --- linux-2.6.orig/include/linux/gfp.h +++ linux-2.6/include/linux/gfp.h @@ -46,6 +46,7 @@ struct vm_area_struct; #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */ #define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency reserves */ #define __GFP_HARDWALL ((__force gfp_t)0x2u) /* Enforce hardwall cpuset memory allocs */ +#define __GFP_MEMALLOC ((__force gfp_t)0x4u) /* Use emergency reserves */ #define __GFP_BITS_SHIFT 20/* Room for 20 __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 __GFP_BITS_SHIFT) - 1)) @@ -54,7 +55,7 @@ struct vm_area_struct; #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \ __GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \ __GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \ - __GFP_NOMEMALLOC|__GFP_HARDWALL) + __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_MEMALLOC) /* This equals 0, but use constants in case they ever change */ #define GFP_NOWAIT (GFP_ATOMIC ~__GFP_HIGH) Index: linux-2.6/include/linux/mmzone.h === --- linux-2.6.orig/include/linux/mmzone.h +++ linux-2.6/include/linux/mmzone.h @@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand void __user *, 
size_t *, loff_t *); int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); +int adjust_memalloc_reserve(int bytes); #include linux/topology.h /* Returns the number of the current Node. */ Index: linux-2.6/include/linux/netdevice.h === ---
Re: [RFC][PATCH] VM deadlock prevention core -v3
On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: Hi, Hello, Peter. So I try again, please tell me if I'm still on crack and should go detox. However if you do so, I kindly request some words on the how and why of it. I think you should talk with doctor in that case, but not with kernel hackers :) I have some comments about implementation, not overall design, since we have slightly diametral points of view there. ... --- linux-2.6.orig/net/core/skbuff.c +++ linux-2.6/net/core/skbuff.c @@ -43,6 +43,7 @@ #include linux/kernel.h #include linux/sched.h #include linux/mm.h +#include linux/pagemap.h #include linux/interrupt.h #include linux/in.h #include linux/inet.h @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug); * */ +#define ceiling_log2(x) fls((x) - 1) + /** * __alloc_skb - allocate a network buffer * @size: size to allocate @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int struct sk_buff *skb; u8 *data; + size = SKB_DATA_ALIGN(size); + + if (gfp_mask __GFP_MEMALLOC) { + /* + * Fallback allocation for memalloc reserves. + * + * the page is populated like so: + * + * struct sk_buff + * [ struct sk_buff ] + * [ atomic_t ] + * unsigned int + * struct skb_shared_info + * char [] + * + * We have to do higher order allocations for icky jumbo + * frame drivers :-(. They really should be migrated to + * scather/gather DMA and use skb fragments. + */ + unsigned int data_offset = + sizeof(struct sk_buff) + sizeof(unsigned int); + unsigned long length = size + data_offset + + sizeof(struct skb_shared_info); + unsigned int pages; + unsigned int order; + struct page *page; + void *kaddr; + + /* + * Force fclone alloc in order to fudge a lacking in skb_clone(). 
+ */ + fclone = 1; + if (fclone) { + data_offset += sizeof(struct sk_buff) + sizeof(atomic_t); + length += sizeof(struct sk_buff) + sizeof(atomic_t); + } + pages = (length + PAGE_SIZE - 1) PAGE_SHIFT; + order = ceiling_log2(pages); + skb = NULL; + if (!(page = alloc_pages(gfp_mask ~__GFP_HIGHMEM, order))) + goto out; + + kaddr = pfn_to_kaddr(page_to_pfn(page)); + skb = (struct sk_buff *)kaddr; + + *((unsigned int *)(kaddr + data_offset - + sizeof(unsigned int))) = order; + data = (u8 *)(kaddr + data_offset); + Tricky, but since you are using own allocator here, you could change it to be not so aggressive - i.e. do not round size to number of pages. + goto allocated; + } + cache = fclone ? skbuff_fclone_cache : skbuff_head_cache; /* Get the HEAD */ @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int goto out; /* Get the DATA. Size must match skb_add_mtu(). */ - size = SKB_DATA_ALIGN(size); Bad sign. data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask); if (!data) goto nodata; +struct sk_buff *__netdev_alloc_skb(struct net_device *dev, + unsigned length, gfp_t gfp_mask) +{ + struct sk_buff *skb; + + WARN_ON(gfp_mask (__GFP_NOMEMALLOC | __GFP_MEMALLOC)); + gfp_mask = ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC); + + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC); + if (skb) + goto done; + + if (atomic_read(dev-rx_reserve_used) = + dev-rx_reserve * dev-memalloc_socks) + goto out; + + /* + * pre-inc guards against a race with netdev_wait_memalloc() + */ + atomic_inc(dev-rx_reserve_used); + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC); + if (unlikely(!skb)) { + atomic_dec(dev-rx_reserve_used); + goto out; + } Since you have added atomic operation in that path, you can use device's reference counter instead and do not care that it can dissapear. 
+done: + skb-dev = dev; +out: + return skb; +} + static void skb_drop_list(struct sk_buff **listp) { struct sk_buff *list = *listp; @@ -313,10 +417,35 @@ static void skb_release_data(struct sk_b if (skb_shinfo(skb)-frag_list) skb_drop_fraglist(skb); - kfree(skb-head); + if (!skb-memalloc) + kfree(skb-head); +
Re: [RFC][PATCH] VM deadlock prevention core -v3
On Thu, 2006-08-10 at 18:02 +0400, Evgeniy Polyakov wrote: On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: Hi, Hello, Peter. So I try again, please tell me if I'm still on crack and should go detox. However if you do so, I kindly request some words on the how and why of it. I think you should talk with doctor in that case, but not with kernel hackers :) I have some comments about implementation, not overall design, since we have slightly diametral points of view there. --- linux-2.6.orig/net/core/skbuff.c +++ linux-2.6/net/core/skbuff.c @@ -43,6 +43,7 @@ #include linux/kernel.h #include linux/sched.h #include linux/mm.h +#include linux/pagemap.h #include linux/interrupt.h #include linux/in.h #include linux/inet.h @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug); * */ +#define ceiling_log2(x)fls((x) - 1) + /** * __alloc_skb - allocate a network buffer * @size: size to allocate @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int struct sk_buff *skb; u8 *data; + size = SKB_DATA_ALIGN(size); I moved it here. + + if (gfp_mask __GFP_MEMALLOC) { + /* +* Fallback allocation for memalloc reserves. + * This allocator is build on alloc_pages() so that freed * skbuffs return to the memalloc reserve imediately. SLAB * memory might not ever be returned. This was missing,... +* the page is populated like so: +* +* struct sk_buff +* [ struct sk_buff ] +* [ atomic_t ] +* unsigned int +* struct skb_shared_info +* char [] +* +* We have to do higher order allocations for icky jumbo +* frame drivers :-(. They really should be migrated to +* scather/gather DMA and use skb fragments. +*/ + unsigned int data_offset = + sizeof(struct sk_buff) + sizeof(unsigned int); + unsigned long length = size + data_offset + + sizeof(struct skb_shared_info); + unsigned int pages; + unsigned int order; + struct page *page; + void *kaddr; + + /* +* Force fclone alloc in order to fudge a lacking in skb_clone(). 
+*/ + fclone = 1; + if (fclone) { + data_offset += sizeof(struct sk_buff) + sizeof(atomic_t); + length += sizeof(struct sk_buff) + sizeof(atomic_t); + } + pages = (length + PAGE_SIZE - 1) PAGE_SHIFT; + order = ceiling_log2(pages); + skb = NULL; + if (!(page = alloc_pages(gfp_mask ~__GFP_HIGHMEM, order))) + goto out; + + kaddr = pfn_to_kaddr(page_to_pfn(page)); + skb = (struct sk_buff *)kaddr; + + *((unsigned int *)(kaddr + data_offset - + sizeof(unsigned int))) = order; + data = (u8 *)(kaddr + data_offset); + Tricky, but since you are using own allocator here, you could change it to be not so aggressive - i.e. do not round size to number of pages. I'm not sure I follow you, I'm explicitly using alloc_pages()/free_page(), if I were to go smart here, I'd loose the whole reason for doing so. + goto allocated; + } + cache = fclone ? skbuff_fclone_cache : skbuff_head_cache; /* Get the HEAD */ @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int goto out; /* Get the DATA. Size must match skb_add_mtu(). */ - size = SKB_DATA_ALIGN(size); Bad sign. See above. data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask); if (!data) goto nodata; +struct sk_buff *__netdev_alloc_skb(struct net_device *dev, + unsigned length, gfp_t gfp_mask) +{ + struct sk_buff *skb; + + WARN_ON(gfp_mask (__GFP_NOMEMALLOC | __GFP_MEMALLOC)); + gfp_mask = ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC); + + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC); + if (skb) + goto done; + + if (atomic_read(dev-rx_reserve_used) = + dev-rx_reserve * dev-memalloc_socks) + goto out; + + /* +* pre-inc guards against a race with netdev_wait_memalloc() +*/ + atomic_inc(dev-rx_reserve_used); + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC); + if (unlikely(!skb)) { + atomic_dec(dev-rx_reserve_used); + goto out; + } Since you have added atomic operation in that path, you can use device's reference counter instead and do not care that it can dissapear.
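The sizing arithmetic in the quoted fallback path (round the length up to whole pages, then round the page count up to a power-of-two order) can be checked in user space. This is a sketch only: toy_fls() is a hypothetical stand-in for the kernel's fls(), the 4 KiB page size is an assumption, and the shift operator in the pages computation is assumed to have been dropped by the archive.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12                       /* assumed 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int toy_fls(unsigned int x)
{
    int r = 0;

    while (x) {
        r++;
        x >>= 1;
    }
    return r;
}

/* Same definition as in the patch: smallest n with (1 << n) >= x, x >= 1. */
static int ceiling_log2(unsigned int x)
{
    assert(x >= 1);
    return toy_fls(x - 1);
}

/* Pages needed for 'length' bytes, then the allocation order for them. */
static unsigned int toy_alloc_order(unsigned long length)
{
    unsigned int pages =
        (unsigned int)((length + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT);

    return (unsigned int)ceiling_log2(pages);
}
```

Under these assumptions a 9000-byte request needs 3 pages and is given an order-2 (4-page) allocation, which may be the kind of rounding the review comment has in mind.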
Re: [PATCH 5/6] ehea: makefile
Sam Ravnborg wrote:
On Wed, Aug 09, 2006 at 10:40:20AM +0200, Jan-Bernd Themann wrote:
Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]

drivers/net/ehea/Makefile | 7 +++
1 file changed, 7 insertions(+)

--- linux-2.6.18-rc4-orig/drivers/net/ehea/Makefile 1969-12-31 16:00:00.0 -0800
+++ kernel/drivers/net/ehea/Makefile 2006-08-08 23:59:38.083467216 -0700
@@ -0,0 +1,7 @@
+#
+# Makefile for the eHEA ethernet device driver for IBM eServer System p
+#
+
+ehea_mod-objs = ehea_main.o ehea_phyp.o ehea_qmr.o ehea_ethtool.o ehea_phyp.o
+obj-$(CONFIG_EHEA) += ehea_mod.o
+

Using -objs is deprecated, please use ehea_mod-y. This needs to be documented and later warned upon which I will do soon.

Sam

Done. Will be included in next patch.
Re: [PATCH 1/6] ehea: interface to network stack
Hi Michael, thanks for your very helpful comments so far, we'll provide a patch with these and other fixes very soon. See comments below. Jan-Bernd Michael Ellerman wrote: Hi Jan-Bernd, Comments below the code they refer to. On Wed, 2006-08-09 at 10:38 +0200, Jan-Bernd Themann wrote: Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED] drivers/net/ehea/ehea_main.c | 2738 +++ 1 file changed, 2738 insertions(+) --- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea_main.c 1969-12-31 16:00:00.0 -0800 +++ kernel/drivers/net/ehea/ehea_main.c2006-08-08 23:59:39.683357016 -0700 @@ -0,0 +1,2738 @@ +/* + * linux/drivers/net/ehea/ehea_main.c Putting the file name in the file is fairly redundant IMHO. + * eHEA ethernet device driver for IBM eServer System p What's the actual hardware that this is for? System p covers a whole range of machines, do they really all support this driver? + * (C) Copyright IBM Corp. 2006 + * + * Authors: + * Christoph Raisch [EMAIL PROTECTED] + * Jan-Bernd Themann [EMAIL PROTECTED] + * Heiko-Joerg Schick [EMAIL PROTECTED] + * Thomas Klein [EMAIL PROTECTED] + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ */ + +#define DEB_PREFIX main + +#include linux/in.h +#include linux/ip.h +#include linux/tcp.h +#include linux/udp.h +#include linux/if.h +#include linux/list.h +#include net/ip.h + +#include ehea.h +#include ehea_qmr.h +#include ehea_phyp.h + + +MODULE_LICENSE(GPL); +MODULE_AUTHOR(Christoph Raisch [EMAIL PROTECTED]); +MODULE_DESCRIPTION(IBM eServer HEA Driver); +MODULE_VERSION(EHEA_DRIVER_VERSION); + +static int __devinit ehea_probe(struct ibmebus_dev *dev, + const struct of_device_id *id); +static int __devexit ehea_remove(struct ibmebus_dev *dev); +static int ehea_sense_port_attr(struct ehea_adapter *adapter, int portnum); I haven't looked closely, but can you rearrange the functions so you don't need these forward declarations? yes, rearrangement is possible. Done. +int ehea_trace_level = 5; + +static struct net_device_stats *ehea_get_stats(struct net_device *dev) +{ + int i; + u64 hret = H_HARDWARE; You unconditionally assign to hret below. + u64 rx_packets = 0; Why not just update stats-rx_packets directly? + struct ehea_port *port = (struct ehea_port*)dev-priv; + struct ehea_adapter *adapter = port-adapter; I don't think you need adapter, you only use it in one place, just access it through port-adapter-handle (below). done + struct hcp_query_ehea_port_cb_2 *cb2 = NULL; + struct net_device_stats *stats = port-stats; + + EDEB_EN(7, net_device=%p, dev); + + cb2 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + if (!cb2) { + EDEB_ERR(4, No memory for cb2); + goto get_stat_exit; You leak cb2 here. 
done + } + + hret = ehea_h_query_ehea_port(adapter-handle, +port-logical_port_id, +H_PORT_CB2, +H_PORT_CB2_ALL, +cb2); + + if (hret != H_SUCCESS) { + EDEB_ERR(4, query_ehea_port failed for cb2); + goto get_stat_exit; + } + + EDEB_DMP(7, (u8*)cb2, + sizeof(struct hcp_query_ehea_port_cb_2), After HCALL); + + for (i = 0; i port-num_def_qps; i++) { + rx_packets += port-port_res[i].rx_packets; + } + + stats-tx_packets = cb2-txucp + cb2-txmcp + cb2-txbcp; + stats-multicast = cb2-rxmcp; + stats-rx_errors = cb2-rxuerr; + stats-rx_bytes = cb2-rxo; + stats-tx_bytes = cb2-txo; + stats-rx_packets = rx_packets; + +get_stat_exit: + EDEB_EX(7, ); + return stats; +} + +static inline u32 ehea_get_send_lkey(struct ehea_port_res *pr) +{ + return pr-send_mr.lkey; +} Get rid of this, it's only used once. done +static inline u32 ehea_get_recv_lkey(struct ehea_port_res *pr) +{ + return pr-recv_mr.lkey; +} And this one only twice? Is it really useful? done + +#define EHEA_OD_ADDR(address, segment) (((address) (PAGE_SIZE - 1)) \ +
Re: [PATCH 2/6] ehea: pHYP interface
Arnd Bergmann wrote:
On Wednesday 09 August 2006 10:38, Jan-Bernd Themann wrote:
--- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea_hcall.h 1969-12-31 16:00:00.0 -0800
+++ kernel/drivers/net/ehea/ehea_hcall.h 2006-08-08 23:59:38.111462960 -0700
@@ -0,0 +1,52 @@
+
+/**
+ * This file contains HCALL defines that are to be included in the appropriate
+ * kernel files later
+ */
+
+#define H_ALLOC_HEA_RESOURCE   0x278
+#define H_MODIFY_HEA_QP        0x250
+#define H_QUERY_HEA_QP         0x254
+#define H_QUERY_HEA            0x258
+#define H_QUERY_HEA_PORT       0x25C
+#define H_MODIFY_HEA_PORT      0x260
+#define H_REG_BCMC             0x264
+#define H_DEREG_BCMC           0x268
+#define H_REGISTER_HEA_RPAGES  0x26C
+#define H_DISABLE_AND_GET_HEA  0x270
+#define H_GET_HEA_INFO         0x274
+#define H_ADD_CONN             0x284
+#define H_DEL_CONN             0x288

I guess these should go to include/asm-powerpc/hvcall.h instead.

Arnd

We posted a separate patch for hvcall.h (http://ozlabs.org/pipermail/linuxppc-dev/2006-August/025000.html). As soon as this patch is accepted we'll remove the ehea_hcall.h header file.
Re: [PATCH 1/6] ehea: interface to network stack
Hi,

thanks for your comments! We'll post a modified patch very soon.

Jan-Bernd

Alexey Dobriyan wrote:
> On Wed, Aug 09, 2006 at 10:38:20AM +0200, Jan-Bernd Themann wrote:
>> --- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea_main.c
>> +++ kernel/drivers/net/ehea/ehea_main.c
>
>> +static inline u64 get_swqe_addr(u64 tmp_addr, int addr_seg)
>> +{
>> +	u64 addr;
>> +	addr = tmp_addr;
>> +	return addr;
>> +}
>> +
>> +static inline u64 get_rwqe_addr(u64 tmp_addr)
>> +{
>> +	return tmp_addr;
>> +}
>
> The point of this exercise?

has been removed

>> +static inline int ehea_refill_rq3_def(struct ehea_port_res *pr, int nr_of_wqes)
>
> Way too big to be inline function.
>
>> +{
>> +	int i;
>> +	int ret = 0;
>> +	struct ehea_qp *qp;
>> +	struct ehea_rwqe *rwqe;
>> +	int skb_arr_rq3_len = pr->skb_arr_rq3_len;
>> +	struct sk_buff **skb_arr_rq3 = pr->skb_arr_rq3;
>> +	EDEB_EN(8, "pr=%p, nr_of_wqes=%d", pr, nr_of_wqes);
>> +	if (nr_of_wqes == 0)
>> +		return -EINVAL;
>> +	qp = pr->qp;
>> +	for (i = 0; i < nr_of_wqes; i++) {
>> +		int index = pr->skb_rq3_index++;
>> +		struct sk_buff *skb = dev_alloc_skb(EHEA_MAX_PACKET_SIZE
>> +						    + NET_IP_ALIGN);
>> +
>> +		if (!skb) {
>> +			EDEB_ERR(4, "No memory for skb. Only %d rwqe filled.",
>> +				 i);
>> +			ret = -ENOMEM;
>> +			break;
>> +		}
>> +		skb_reserve(skb, NET_IP_ALIGN);
>> +
>> +		rwqe = ehea_get_next_rwqe(qp, 3);
>> +		pr->skb_rq3_index %= skb_arr_rq3_len;
>> +		skb_arr_rq3[index] = skb;
>> +		rwqe->wr_id = EHEA_BMASK_SET(EHEA_WR_ID_TYPE, EHEA_RWQE3_TYPE)
>> +		    | EHEA_BMASK_SET(EHEA_WR_ID_INDEX, index);
>> +		rwqe->sg_list[0].l_key = ehea_get_recv_lkey(pr);
>> +		rwqe->sg_list[0].vaddr = get_rwqe_addr((u64)skb->data);
>> +		rwqe->sg_list[0].len = EHEA_MAX_PACKET_SIZE;
>> +		rwqe->data_segments = 1;
>> +	}
>> +
>> +	/* Ring doorbell */
>> +	iosync();
>> +	ehea_update_rq3a(qp, i);
>> +	EDEB_EX(8, "");
>> +	return ret;
>> +}
>> +
>> +
>> +static inline int ehea_refill_rq3(struct ehea_port_res *pr, int nr_of_wqes)
>> +{
>> +	return ehea_refill_rq3_def(pr, nr_of_wqes);
>> +}
>
> ehea_refill_rq3[123] appears to be 1:1 wrappers around
> ehea_refill_rq3[123]_def. Any idea behind them?

introduced for near future features

>> +	init_attr = (struct ehea_qp_init_attr*)
>> +	    kzalloc(sizeof(struct ehea_qp_init_attr), GFP_KERNEL);
>
> Useless cast.

removed

>> +	pr->skb_arr_sq = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
>> +						   * (max_rq_entries + 1));
>
> Useless cast

removed

>> +	pr->skb_arr_rq1 = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
>> +						    * (max_rq_entries + 1));
>> +	pr->skb_arr_rq2 = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
>> +						    * (max_rq_entries + 1));
>> +	pr->skb_arr_rq3 = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
>> +						    * (max_rq_entries + 1));
>
>> +static int ehea_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>> +{
>> +	EDEB_ERR(4, "ioctl not supported: dev=%s cmd=%d", dev->name, cmd);
>
> Then copy NULL into ->do_ioctl!

done

>> +	return -EOPNOTSUPP;
>> +}
>
>> +	ehea_port_cb_0 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
>> +
>> +	if (!ehea_port_cb_0) {
>> +		EDEB_ERR(4, "No memory for ehea_port control block");
>> +		ret = -ENOMEM;
>> +		goto kzalloc_failed;
>> +	}
>> +
>> +	memcpy((u8*)(&(ehea_port_cb_0->port_mac_addr)),
>> +	       (u8*)&(mac_addr->sa_data[0]), 6);
>
> No casts on memcpy arguments.

done

>> +	memcpy((u8*)&ehea_mcl_entry->macaddr, mc_mac_addr, ETH_ALEN);
>
>> +static inline void ehea_xmit2(struct sk_buff *skb,
>> +			      struct net_device *dev, struct ehea_swqe *swqe,
>> +			      u32 lkey)
>> +{
>> +	int nfrags;
>> +	unsigned short skb_protocol = skb->protocol;
>
> Useless variable. And it should be __be16, FYI.

changed

>> +	nfrags = skb_shinfo(skb)->nr_frags;
>> +	EDEB_EN(7, "skb->nfrags=%d (0x%X)", nfrags, nfrags);
>> +
>> +	if (skb_protocol == ETH_P_IP) {
>
> ITYM, htons(ETH_P_IP).

good point, thx

>> +static inline void ehea_xmit3(struct sk_buff *skb,
>> +			      struct net_device *dev, struct ehea_swqe *swqe)
>> +{
>> +	int i;
>> +	skb_frag_t *frag;
>> +	int nfrags = skb_shinfo(skb)->nr_frags;
>> +	u8 *imm_data = &swqe->u.immdata_nodesc.immediate_data[0];
>> +	u64 skb_protocol = skb->protocol;
>
> Useless var.

removed

>> +
>> +	EDEB_EN(7, "");
>> +	if (likely(skb_protocol == ETH_P_IP)) {
>
> htons(ETH_P_IP)
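For readers following the htons(ETH_P_IP) remarks in this review: skb->protocol carries the EtherType in network byte order, so it has to be compared against a constant that has been converted with htons(). Below is a minimal userspace sketch of that comparison. The names SKETCH_ETH_P_IP, is_ipv4_frame() and ethertype_demo() are made up for illustration and are not part of the ehea driver; 0x0800 is the IPv4 EtherType.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in for the kernel's ETH_P_IP, in host byte order. */
#define SKETCH_ETH_P_IP 0x0800

/* proto_be is expected in network byte order, like skb->protocol. */
int is_ipv4_frame(uint16_t proto_be)
{
	return proto_be == htons(SKETCH_ETH_P_IP);
}

/* Small self-check: build the on-the-wire value, then classify it. */
void ethertype_demo(void)
{
	uint16_t wire = htons(SKETCH_ETH_P_IP);

	assert(is_ipv4_frame(wire));
	assert(ntohs(wire) == SKETCH_ETH_P_IP);
}
```

On a big-endian machine the unconverted comparison happens to give the same result, which is probably why this kind of bug can go unnoticed on some platforms; writing htons() on the constant side keeps the test correct on either byte order.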
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> This patch handles NLM_F_ECHO in netlink_rcv_skb() to handle it in a central point. Most subsystems currently interpret NLM_F_ECHO as to just unicast events to the originator of the change while the real meaning of the flag is to echo the request.

Do not you think it is useless to echo something back to originator, who just sent it?

Actually, the sense of NLM_F_ECHO was to tell user what happened due to his request. The answer is not original request, which can contain some incomplete fields etc., but full information about object deleted/added/changed. Moreover, the feedback can contain several messages (though accurately it is done only in net/sched/), f.e. when the request triggered deletion of one object and addition of another. Obviously, it cannot be done in a central place.

Normally, it is not needed, ip route add does not tell user, what actually was done, so that it suppresses echo. But for multistage operation it is absolutely necessary: the answer contains f.e. auto-allocated handles, which should be given in subsequent requests.

Alexey
Possible leak of multicast source filter structure #2
Took some time but this time the inlined patch should be OK.

Hi all!

It seems to me that there is a leak of struct ip_sf_socklist in the ip_mc_drop_socket function (in net/ipv4/igmp.c), which is called on socket close. This patch corrects it:

diff -Naur linux-2.6.17.8.orig/net/ipv4/igmp.c linux-2.6.17.8/net/ipv4/igmp.c
--- linux-2.6.17.8.orig/net/ipv4/igmp.c 2006-08-07 06:18:54.0 +0200
+++ linux-2.6.17.8/net/ipv4/igmp.c 2006-08-10 10:38:04.0 +0200
@@ -2206,9 +2206,10 @@
 			(void) ip_mc_leave_src(sk, iml, in_dev);
 			ip_mc_dec_group(in_dev, iml->multi.imr_multiaddr.s_addr);
 			in_dev_put(in_dev);
-		}
-		sock_kfree_s(sk, iml, sizeof(*iml));
+		} else if (iml->sflist != NULL)
+			sock_kfree_s(sk, iml->sflist, IP_SFLSIZE(iml->sflist->sl_max));
+		sock_kfree_s(sk, iml, sizeof(*iml));
 	}
 	rtnl_unlock();
 }

The leak only happens if there are some multicast source filters set on a socket which are bound to an interface that does not exist any more, as in the following scenario:

1. create a temporary interface (say GRE tunnel)
2. create a socket
3. join a multicast group and set a source filter on the temporary interface via MCAST_JOIN_SOURCE_GROUP setsockopt call
4. destroy the temporary interface
5. close the socket

This sequence of things eventually leads to a call of the ip_mc_drop_socket function, which fails to free the source filter structure ip_sf_socklist pointed to from members of the socket's multicast addresses list. This structure is normally freed in the ip_mc_leave_src function, but this function is not called in this scenario because the interface that the multicast group is joined on does not exist any more.

Thanks
Michal Ruzicka
ipvs localhost client patch for 2.6?
I found this patch for 2.4 that allows the host running ipvs to act as its own client via loopback connection. Does anyone have a similar patch for 2.6?

--- ip_vs_core.c.orig 2003-11-28 19:26:21.0 +0100
+++ ip_vs_core.c.list 2004-07-02 11:13:51.0 +0200
@@ -1036,7 +1036,7 @@
 	 *	Big tappo: only PACKET_HOST (nor loopback neither mcasts)
 	 *	... don't know why 1st test DOES NOT include 2nd (?)
 	 */
-	if (skb->pkt_type != PACKET_HOST || skb->dev == &loopback_dev) {
+	if (skb->pkt_type != PACKET_HOST) { /* || skb->dev == &loopback_dev) { */
 		IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
 			  skb->pkt_type,
 			  iph->protocol,
@@ -1059,6 +1059,13 @@
 	iph = skb->nh.iph;
 	h.raw = (char*) iph + ihl;
 
+	cp = ip_vs_conn_out_get(iph->protocol, iph->saddr, h.portp[0],
+				iph->daddr, h.portp[1]);
+	if (cp) {
+		__ip_vs_conn_put(cp);
+		return (ip_vs_out(hooknum,skb_p,in,out,okfn));
+	}
+
 	/*
 	 *	Check if the packet belongs to an existing connection entry
 	 */
Re: [PATCH 1/5] sock_create bad error return
On Wed, 09 Aug 2006 20:47:45 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:
> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Wed, 09 Aug 2006 11:31:39 -0700
>
>> If socket create call races with module unload, it correctly fails the socket call but doesn't return an error.
>>
>> This race is theoretical because the sock->ops are always the same and non-modular.
>>
>> Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
>
> I think the intention of the code is to return -EAFNOSUPPORT which is set explicitly some lines above, and this makes sense because if we can't grab onto the module reference count it means the module is in the process of being unloaded.

It is the module reference count of the socket file ops, not the protocol family reference count. The protocol family code is already handled a few lines above. Since the socket code can't be built as a module, it is a dead end.

I think in-olden-times the idea was that networking could be built as a module so that inode ops would have to be ref counted.
Re: Possible leak of multicast source filter structure
Michal,

This looks correct, but I think a better way to do it is:

	in_dev = inetdev_by_index(...)
	(void) ip_mc_leave_src()
	if (in_dev) {
		ip_mc_dec_group()
		in_dev_put()
	}

That way, sflist internal details aren't visible at this level, and ip_mc_leave_src() collapses to the sock_kfree_s() when in_dev is NULL.

Also, ip_mc_leave_group() has the same issue; looks like it just needs the if (in_dev) removed before the call to ip_mc_leave_src().

+-DLS
Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications
Thomas Graf wrote:
> Setting a socket error on all sockets subscribed to a group if an event notification of said group fails due to memory pressure only confuses applications and is of no use. This patch removes it altogether.

I disagree with this patch, how else are applications supposed to know when they missed an update and are not in sync anymore? I actually have a half-finished patch to add this in some spots where it's missing (and uses better error codes).
Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Thu, 3 Aug 2006, Krzysztof Oledzki wrote:
> On Wed, 2 Aug 2006, David Miller wrote:
>
> CUT
>
>> Finally, I'm still a little stumped about why this change is necessary still, to be honest.
>
> If I understand it correctly this patch fixes the "[PATCH] bonding: suppress duplicate packets" patch:
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8f903c708fcc2b579ebf16542bf6109bad593a1d;hp=ebe19a4ed78d4a11a7e01cdeda25f91b7f2fcb5a
>
> It seems that the original patch does not work properly in vlan accelerated environment, which I reported 31 Mar 2006:
> http://marc.theaimsgroup.com/?l=bonding-devel&m=114381240718113&w=2
>
> Anyway, I didn't test this patch yet but I'm going to do it ASAP.

OK, this patch really solves the bug from my report. Are there any chances for similar fix in the net-2.6.19.git?

Best regards,
Krzysztof Olędzki
[PATCH] neighbor: use ALIGN() macro
Rather than open-coding the mask, it looks better to use the ALIGN() macro from kernel.h.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- net-2.6.19.orig/include/net/neighbour.h
+++ net-2.6.19/include/net/neighbour.h
@@ -101,7 +101,7 @@ struct neighbour
 	__u8			dead;
 	atomic_t		probes;
 	rwlock_t		lock;
-	unsigned char		ha[(MAX_ADDR_LEN+sizeof(unsigned long)-1)&~(sizeof(unsigned long)-1)];
+	unsigned char		ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
 	struct hh_cache		*hh;
 	atomic_t		refcnt;
 	int			(*output)(struct sk_buff *skb);
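To illustrate why the two spellings in this patch are interchangeable, here is a small userspace sketch. SKETCH_ALIGN() is only a stand-in written for this example (it assumes the alignment is a power of two, as sizeof(unsigned long) is); it is not copied from kernel.h, and round_up_open_coded() and align_demo() are likewise illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: userspace stand-in for the kernel's ALIGN(); 'a' must be a power of two. */
#define SKETCH_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* The open-coded form that the patch replaces, written as a function. */
static size_t round_up_open_coded(size_t len, size_t a)
{
	return (len + a - 1) & ~(a - 1);
}

/* Checks that both spellings agree for a range of lengths. */
void align_demo(void)
{
	size_t a = sizeof(unsigned long);
	size_t len;

	for (len = 0; len <= 64; len++)
		assert(SKETCH_ALIGN(len, a) == round_up_open_coded(len, a));
}
```

Both forms round the length up to the next multiple of the alignment, so the size of the ha[] array should be unchanged by the patch; only the readability differs.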
Re: [PATCH 4/5] net: socket family using RCU
On Wed, Aug 09, 2006 at 11:31:42AM -0700, Stephen Hemminger wrote: Replace the gross custom locking done in socket code for net_family[] with simple RCU usage. Some reordering necessary to avoid sleep issues with sock_alloc. Definitely a good use of RCU from a read-intensive standpoint -- does anyone other than Linux-kernel networking developers change the elements of the net_family[] array except at boot and shutdown? ;-) Some comments included below. Looks good, but one question about things like atalk_create() being able to sleep and a place or two where a comment would be good. Thanx, Paul Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- net/socket.c | 171 +-- 1 file changed, 74 insertions(+), 97 deletions(-) --- net-2.6.orig/net/socket.c 2006-08-09 11:19:08.0 -0700 +++ net-2.6/net/socket.c 2006-08-09 11:19:22.0 -0700 @@ -59,11 +59,11 @@ */ #include linux/mm.h -#include linux/smp_lock.h #include linux/socket.h #include linux/file.h #include linux/net.h #include linux/interrupt.h +#include linux/rcupdate.h #include linux/netdevice.h #include linux/proc_fs.h #include linux/seq_file.h @@ -146,51 +146,8 @@ * The protocol list. Each protocol is registered in here. */ -static struct net_proto_family *net_families[NPROTO]; - -#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT) -static atomic_t net_family_lockct = ATOMIC_INIT(0); static DEFINE_SPINLOCK(net_family_lock); - -/* The strategy is: modifications net_family vector are short, do not - sleep and veeery rare, but read access should be free of any exclusive - locks. 
- */ - -static void net_family_write_lock(void) -{ - spin_lock(net_family_lock); - while (atomic_read(net_family_lockct) != 0) { - spin_unlock(net_family_lock); - - yield(); - - spin_lock(net_family_lock); - } -} - -static __inline__ void net_family_write_unlock(void) -{ - spin_unlock(net_family_lock); -} - -static __inline__ void net_family_read_lock(void) -{ - atomic_inc(net_family_lockct); - spin_unlock_wait(net_family_lock); -} - -static __inline__ void net_family_read_unlock(void) -{ - atomic_dec(net_family_lockct); -} - -#else -#define net_family_write_lock() do { } while(0) -#define net_family_write_unlock() do { } while(0) -#define net_family_read_lock() do { } while(0) -#define net_family_read_unlock() do { } while(0) -#endif +static const struct net_proto_family *net_families[NPROTO]; /* * Statistics counters of the socket lists @@ -1131,6 +1088,7 @@ { int err; struct socket *sock; + const struct net_proto_family *pf; /* * Check protocol is in range @@ -1159,6 +1117,20 @@ if (err) return err; + /* + * Allocate the socket and allow the family to set things up. if + * the protocol is 0, the family is instructed to select an appropriate + * default. + */ + sock = sock_alloc(); + if (!sock) { + printk(KERN_WARNING socket: no more sockets\n); + return -ENFILE; /* Not exactly a match, but its the +closest posix thing */ + } + + sock-type = type; + #if defined(CONFIG_KMOD) /* Attempt to load a protocol module if the find failed. * @@ -1166,70 +1138,59 @@ * requested real, full-featured networking support upon configuration. * Otherwise module support will break! */ - if (net_families[family] == NULL) { + if (net_families[family] == NULL) request_module(net-pf-%d, family); OK, I'll bite... What happens if the module is not present? Or is this what the Otherwise module support will break comment is getting at? Also, this reference to net_families[family] is done without rcu_dereference() and without any clear update-side lock. 
This just happens to be OK, since we are only testing for NULL, but should at least have a comment. - } #endif - net_family_read_lock(); - if (net_families[family] == NULL) { - err = -EAFNOSUPPORT; - goto out; - } - -/* - * Allocate the socket and allow the family to set things up. if - * the protocol is 0, the family is instructed to select an appropriate - * default. - */ - - if (!(sock = sock_alloc())) { - printk(KERN_WARNING socket: no more sockets\n); - err = -ENFILE; /* Not exactly a match, but its the -closest posix thing */ - goto out; - } - - sock-type = type; + rcu_read_lock(); + pf = rcu_dereference(net_families[family]); OK, so the elements of the net_families array are protected by RCU. All references should
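As a rough illustration of the publish/lookup pattern discussed in this review, here is a userspace sketch using C11 atomics. It is only an analogy: the release store plays the role of rcu_assign_pointer() and the acquire load plays the role of rcu_dereference(). It does not model grace periods or unregistration, and it uses a compare-and-swap where the patch keeps net_family_lock on the update side. All sketch_* names are invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NPROTO 8

struct sketch_family {
	int family;
};

/* Slots start out NULL; readers may load them at any time. */
static _Atomic(const struct sketch_family *) sketch_families[SKETCH_NPROTO];

/* Writer side: publish a fully initialised entry with release ordering. */
int sketch_register(const struct sketch_family *pf)
{
	const struct sketch_family *expected = NULL;

	if (pf->family < 0 || pf->family >= SKETCH_NPROTO)
		return -1;
	if (!atomic_compare_exchange_strong_explicit(&sketch_families[pf->family],
						     &expected, pf,
						     memory_order_release,
						     memory_order_relaxed))
		return -2;	/* slot already taken */
	return 0;
}

/* Reader side: the acquire load pairs with the release in sketch_register(). */
const struct sketch_family *sketch_lookup(int family)
{
	if (family < 0 || family >= SKETCH_NPROTO)
		return NULL;
	return atomic_load_explicit(&sketch_families[family],
				    memory_order_acquire);
}

/* Small self-check: a slot is empty until something is published into it. */
void sketch_publish_demo(void)
{
	static const struct sketch_family fam = { 1 };

	assert(sketch_lookup(1) == NULL);
	assert(sketch_register(&fam) == 0);
	assert(sketch_lookup(1) == &fam);
}
```

The point of the pairing is that a reader which sees a non-NULL pointer also sees the contents the writer stored before publishing it. A bare NULL test, as in the request_module() check above, does not follow the pointer and so does not depend on that ordering.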
Re: [PATCH 0/5] net socket family patches
On Thu, 10 Aug 2006 05:36:13 + (UTC) Alexey Toptygin wrote:
> On Wed, 9 Aug 2006, David Miller wrote:
>> From: Stephen Hemminger [EMAIL PROTECTED]
>> Date: Wed, 09 Aug 2006 11:31:38 -0700
>>
>>> These patches cleanup the net socket family interface and convert it to RCU. This is new stuff that should go into 2.6.19 (if it is ready). Andrew could you put it in -mm as well?
>>
>> Andrew pulls net-2.6.19 so there is no need to ask him to put networking patches explicitly into -mm
>
> I've been wondering - are the relationships of which of the various kernel trees pull patches from which other ones documented anywhere? If so, I'd love to read about it.

Not really documented AFAIK, except what Andrew pulls into his -mm tree for testing. His announcements [used to] list which (git or other) trees he has merged, along with non-tree patches. Now that is just in the patch-list file, e.g., see
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/patch-list
and then search for git- to see which git trees it contains.

If you go down the maintainer's hierarchy, it gets more fuzzy. :)
Jeff Garzik pulls the wireless tree from John Linville and several net driver trees from Francois Romieu, e.g. And Jeff pulls SATA patches from Tejun Heo. DaveM pulls net patches from Yoshifuji etc.

James Bottomley usually maintains 2 SCSI git trees: one for 2.6.current-rc fixes and one for 2.6.next merges. He recently documented that in email to [EMAIL PROTECTED]

Most kernel git trees can be seen at www.kernel.org/git/. Most kernel patch trees (git or other) are now listed in the MAINTAINERS file.

HTH.
---
~Randy
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
* Alexey Kuznetsov [EMAIL PROTECTED] 2006-08-10 19:51
>> This patch handles NLM_F_ECHO in netlink_rcv_skb() to handle it in a central point. Most subsystems currently interpret NLM_F_ECHO as to just unicast events to the originator of the change while the real meaning of the flag is to echo the request.
>
> Do not you think it is useless to echo something back to originator, who just sent it?
>
> Actually, the sense of NLM_F_ECHO was to tell user what happened due to his request. The answer is not original request, which can contain some incomplete fields etc., but full information about object deleted/added/changed. Moreover, the feedback can contain several messages (though accurately it is done only in net/sched/), f.e. when the request triggered deletion of one object and addition of another. Obviously, it cannot be done in a central place.
>
> Normally, it is not needed, ip route add does not tell user, what actually was done, so that it suppresses echo. But for multistage operation it is absolutely necessary: the answer contains f.e. auto-allocated handles, which should be given in subsequent requests.

What's wrong with listening to the notification for that purpose?
Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications
* Patrick McHardy [EMAIL PROTECTED] 2006-08-10 20:09
> Thomas Graf wrote:
>> Setting a socket error on all sockets subscribed to a group if an event notification of said group fails due to memory pressure only confuses applications and is of no use. This patch removes it altogether.
>
> I disagree with this patch, how else are applications supposed to know when they missed an update and are not in sync anymore? I actually have a half-finished patch to add this in some spots where it's missing (and uses better error codes).

The application has no idea what went wrong nor does it know for which group, so it will have to resync all group subscriptions, and as it only happens due to memory pressure that will fail anyway.
Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications
Thomas Graf wrote:
> * Patrick McHardy [EMAIL PROTECTED] 2006-08-10 20:09
>> I disagree with this patch, how else are applications supposed to know when they missed an update and are not in sync anymore? I actually have a half-finished patch to add this in some spots where it's missing (and uses better error codes).
>
> The application has no idea what went wrong nor does it know for which group, so it will have to resync all group subscriptions, and as it only happens due to memory pressure that will fail anyway.

The error code (-ENOMEM) gives it a pretty good idea what went wrong. It's true that it doesn't know which group was affected (that could be fixed), but at least it knows that something went wrong and it needs to resync. If that fails due to memory shortage as well it can schedule a delayed resync or something, but without getting notified it has no chance of doing anything useful.

This makes notification essentially useless. If I can't rely on getting either a notification or an error, I can't rely on them at all. Please put this back in.
Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications
* Patrick McHardy [EMAIL PROTECTED] 2006-08-10 21:08
> The error code (-ENOMEM) gives it a pretty good idea what went wrong. It's true that it doesn't know which group was affected (that could be fixed), but at least it knows that something went wrong and it needs to resync. If that fails due to memory shortage as well it can schedule a delayed resync or something, but without getting notified it has no chance of doing anything useful.
>
> This makes notification essentially useless. If I can't rely on getting either a notification or an error, I can't rely on them at all. Please put this back in.

Alright, I think it's pretty much theoretical but it doesn't really matter to me. Dave, please revert the whole patchset.
[NET 02/06]: Introduce RTA_TABLE/FRA_TABLE attributes
[NET]: Introduce RTA_TABLE/FRA_TABLE attributes Introduce RTA_TABLE route attribute and FRA_TABLE routing rule attribute to hold 32 bit routing table IDs. Usespace compatibility is provided by continuing to accept and send the rtm_table field, but because of its limited size it can only carry the low 8 bits of the table ID. This implies that if larger IDs are used, _all_ userspace programs using them need to use RTA_TABLE. Signed-off-by: Patrick McHardy [EMAIL PROTECTED] --- commit a9fe50e925cdc0471b88bcf6f3cc18278b63c984 tree 08d8bfa20011b5afa940126a8bb0c153729584c3 parent 29a0f4a779543907ddf8fbca55b6f1d0e0017f64 author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:50:19 +0200 committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:50:19 +0200 include/linux/fib_rules.h |4 include/linux/rtnetlink.h |8 include/net/fib_rules.h |7 +++ net/core/fib_rules.c |5 +++-- net/decnet/dn_fib.c |7 --- net/decnet/dn_route.c |1 + net/decnet/dn_table.c |1 + net/ipv4/fib_frontend.c |7 --- net/ipv4/fib_rules.c |1 + net/ipv4/fib_semantics.c |1 + net/ipv4/route.c |1 + net/ipv6/fib6_rules.c |1 + net/ipv6/route.c | 13 + 13 files changed, 45 insertions(+), 12 deletions(-) diff --git a/include/linux/fib_rules.h b/include/linux/fib_rules.h index 5e503f0..19a82b6 100644 --- a/include/linux/fib_rules.h +++ b/include/linux/fib_rules.h @@ -36,6 +36,10 @@ enum FRA_UNUSED5, FRA_FWMARK, /* netfilter mark (IPv4) */ FRA_FLOW, /* flow/class id */ + FRA_UNUSED6, + FRA_UNUSED7, + FRA_UNUSED8, + FRA_TABLE, /* Extended table id */ __FRA_MAX }; diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index 5deca87..b01bc8b 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -264,6 +264,7 @@ enum rtattr_type_t RTA_CACHEINFO, RTA_SESSION, RTA_MP_ALGO, + RTA_TABLE, __RTA_MAX }; @@ -716,6 +717,13 @@ #define BUG_TRAP(x) do { \ } \ } while(0) +static inline u32 rtm_get_table(struct rtattr **rta, u8 table) +{ + return RTA_GET_U32(rta[RTA_TABLE-1]); +rtattr_failure: 
+ return table; +} + #endif /* __KERNEL__ */ diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h index 61375d9..8e2f473 100644 --- a/include/net/fib_rules.h +++ b/include/net/fib_rules.h @@ -74,6 +74,13 @@ static inline void fib_rule_put(struct f call_rcu(rule-rcu, fib_rule_put_rcu); } +static inline u32 frh_get_table(struct fib_rule_hdr *frh, struct nlattr **nla) +{ + if (nla[FRA_TABLE]) + return nla_get_u32(nla[FRA_TABLE]); + return frh-table; +} + extern int fib_rules_register(struct fib_rules_ops *); extern int fib_rules_unregister(struct fib_rules_ops *); diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 2e7ed5d..97b196f 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -187,7 +187,7 @@ int fib_nl_newrule(struct sk_buff *skb, rule-action = frh-action; rule-flags = frh-flags; - rule-table = frh-table; + rule-table = frh_get_table(frh, tb); if (!rule-pref ops-default_pref) rule-pref = ops-default_pref(); @@ -245,7 +245,7 @@ int fib_nl_delrule(struct sk_buff *skb, if (frh-action (frh-action != rule-action)) continue; - if (frh-table (frh-table != rule-table)) + if (frh-table (frh_get_table(frh, tb) != rule-table)) continue; if (tb[FRA_PRIORITY] @@ -291,6 +291,7 @@ static int fib_nl_fill_rule(struct sk_bu frh = nlmsg_data(nlh); frh-table = rule-table; + NLA_PUT_U32(skb, FRA_TABLE, rule-table); frh-res1 = 0; frh-res2 = 0; frh-action = rule-action; diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c index 7b3bf5c..fb59637 100644 --- a/net/decnet/dn_fib.c +++ b/net/decnet/dn_fib.c @@ -491,7 +491,8 @@ static int dn_fib_check_attr(struct rtms if (attr) { if (RTA_PAYLOAD(attr) 4 RTA_PAYLOAD(attr) != 2) return -EINVAL; - if (i != RTA_MULTIPATH i != RTA_METRICS) + if (i != RTA_MULTIPATH i != RTA_METRICS + i != RTA_TABLE) rta[i-1] = (struct rtattr *)RTA_DATA(attr); } } @@ -508,7 +509,7 @@ int dn_fib_rtm_delroute(struct sk_buff * if (dn_fib_check_attr(r, rta)) return -EINVAL; - tb = dn_fib_get_table(r-rtm_table, 0); + tb = 
dn_fib_get_table(rtm_get_table(rta, r-rtm_table), 0); if (tb) return tb-delete(tb, r, (struct dn_kern_rta *)rta, nlh,
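The compatibility rule described in this patch (use the 32-bit attribute when it is present, otherwise fall back to the 8-bit header field) can be sketched in plain C as follows. The sketch_* types and functions are simplified stand-ins invented for this example; they only mirror the shape of frh_get_table() and are not the kernel's netlink structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the legacy header with its 8-bit table field. */
struct sketch_rule_hdr {
	uint8_t table;
};

/* Hypothetical stand-in for an optional 32-bit table attribute. */
struct sketch_attr_u32 {
	uint32_t value;
};

/* Prefer the 32-bit attribute; fall back to the 8-bit header field. */
uint32_t sketch_get_table(const struct sketch_rule_hdr *hdr,
			  const struct sketch_attr_u32 *table_attr)
{
	if (table_attr)
		return table_attr->value;
	return hdr->table;
}

/* What the 8-bit header field can carry: the low 8 bits of the ID. */
uint8_t sketch_legacy_table(uint32_t table_id)
{
	return (uint8_t)(table_id & 0xff);
}

/* Small self-check of both paths. */
void sketch_table_demo(void)
{
	struct sketch_rule_hdr hdr = { 254 };
	struct sketch_attr_u32 attr = { 100000 };

	assert(sketch_get_table(&hdr, NULL) == 254);
	assert(sketch_get_table(&hdr, &attr) == 100000);
}
```

Once table IDs above 255 are in use, a program that only reads the header field sees just the low 8 bits and can confuse two different tables, which is why the description says that all userspace programs using such IDs need to use the new attribute.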
[NET 01/06]: Use u32 for routing table IDs
[NET]: Use u32 for routing table IDs Use u32 for routing table IDs in net/ipv4 and net/decnet in preparation of support for a larger number of routing tables. net/ipv6 already uses u32 everywhere and needs no further changes. No functional changes are made by this patch. Signed-off-by: Patrick McHardy [EMAIL PROTECTED] --- commit 29a0f4a779543907ddf8fbca55b6f1d0e0017f64 tree c559ca79c2d6ab28ceb4a4c1d5ecd5ea81264f0d parent 1b471cd32acdff18786bc06542c686d52decbc5a author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:45:22 +0200 committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:45:22 +0200 include/net/dn_fib.h |4 ++-- include/net/ip_fib.h | 14 +++--- net/decnet/dn_fib.c |6 +++--- net/decnet/dn_table.c| 10 +- net/ipv4/fib_frontend.c |8 net/ipv4/fib_hash.c |4 ++-- net/ipv4/fib_lookup.h|4 ++-- net/ipv4/fib_rules.c |2 +- net/ipv4/fib_semantics.c |4 ++-- net/ipv4/fib_trie.c |6 +++--- 10 files changed, 31 insertions(+), 31 deletions(-) diff --git a/include/net/dn_fib.h b/include/net/dn_fib.h index 32bc8ce..cd9c378 100644 --- a/include/net/dn_fib.h +++ b/include/net/dn_fib.h @@ -94,7 +94,7 @@ #define DN_FIB_INFO(f) ((f)-fn_info) struct dn_fib_table { - int n; + u32 n; int (*insert)(struct dn_fib_table *t, struct rtmsg *r, struct dn_kern_rta *rta, struct nlmsghdr *n, @@ -137,7 +137,7 @@ extern int dn_fib_sync_up(struct net_dev /* * dn_tables.c */ -extern struct dn_fib_table *dn_fib_get_table(int n, int creat); +extern struct dn_fib_table *dn_fib_get_table(u32 n, int creat); extern struct dn_fib_table *dn_fib_empty_table(void); extern void dn_fib_table_init(void); extern void dn_fib_table_cleanup(void); diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index adf7358..0dcbf16 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -150,7 +150,7 @@ #define FIB_RES_NETMASK(res)(0) #endif /* CONFIG_IP_ROUTE_MULTIPATH_WRANDOM */ struct fib_table { - unsigned char tb_id; + u32 tb_id; unsignedtb_stamp; int (*tb_lookup)(struct fib_table *tb, const 
struct flowi *flp, struct fib_result *res); int (*tb_insert)(struct fib_table *table, struct rtmsg *r, @@ -173,14 +173,14 @@ #ifndef CONFIG_IP_MULTIPLE_TABLES extern struct fib_table *ip_fib_local_table; extern struct fib_table *ip_fib_main_table; -static inline struct fib_table *fib_get_table(int id) +static inline struct fib_table *fib_get_table(u32 id) { if (id != RT_TABLE_LOCAL) return ip_fib_main_table; return ip_fib_local_table; } -static inline struct fib_table *fib_new_table(int id) +static inline struct fib_table *fib_new_table(u32 id) { return fib_get_table(id); } @@ -205,9 +205,9 @@ #define ip_fib_main_table (fib_tables[RT extern struct fib_table * fib_tables[RT_TABLE_MAX+1]; extern int fib_lookup(struct flowi *flp, struct fib_result *res); -extern struct fib_table *__fib_new_table(int id); +extern struct fib_table *__fib_new_table(u32 id); -static inline struct fib_table *fib_get_table(int id) +static inline struct fib_table *fib_get_table(u32 id) { if (id == 0) id = RT_TABLE_MAIN; @@ -215,7 +215,7 @@ static inline struct fib_table *fib_get_ return fib_tables[id]; } -static inline struct fib_table *fib_new_table(int id) +static inline struct fib_table *fib_new_table(u32 id) { if (id == 0) id = RT_TABLE_MAIN; @@ -248,7 +248,7 @@ extern int fib_convert_rtentry(int cmd, extern u32 __fib_res_prefsrc(struct fib_result *res); /* Exported by fib_hash.c */ -extern struct fib_table *fib_hash_init(int id); +extern struct fib_table *fib_hash_init(u32 id); #ifdef CONFIG_IP_MULTIPLE_TABLES extern int fib4_rules_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c index ed5fb5c..7b3bf5c 100644 --- a/net/decnet/dn_fib.c +++ b/net/decnet/dn_fib.c @@ -534,8 +534,8 @@ int dn_fib_rtm_newroute(struct sk_buff * int dn_fib_dump(struct sk_buff *skb, struct netlink_callback *cb) { - int t; - int s_t; + u32 t; + u32 s_t; struct dn_fib_table *tb; if (NLMSG_PAYLOAD(cb-nlh, 0) = sizeof(struct rtmsg) @@ -765,7 +765,7 @@ void 
dn_fib_flush(void) { int flushed = 0; struct dn_fib_table *tb; -int id; +u32 id; for(id = RT_TABLE_MAX; id 0; id--) { if ((tb = dn_fib_get_table(id, 0)) == NULL) diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c index c6a2e41..b7c6c06 100644 --- a/net/decnet/dn_table.c +++ b/net/decnet/dn_table.c @@ -264,7 +264,7 @@ static int dn_fib_nh_match(struct rtmsg } static int dn_fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event, -u8
[IPV4 03/06]: Increase number of possible routing tables to 2^32
[IPV4]: Increase number of possible routing tables to 2^32

Increase the number of possible routing tables to 2^32 by replacing the
fixed sized array of pointers by a hash table and replacing iterations
over all possible table IDs by hash table walking.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 148d1ca7c199005b5a92f8154a7caf3f78529672
tree ee025abdbab6fe6a4eac916791b8a06f0622d71e
parent a9fe50e925cdc0471b88bcf6f3cc18278b63c984
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:52:30 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:52:30 +0200

 include/net/ip_fib.h    |   25 ++--
 net/ipv4/fib_frontend.c |  102 +++
 net/ipv4/fib_hash.c     |   26 ++--
 net/ipv4/fib_rules.c    |    4 +-
 net/ipv4/fib_trie.c     |   26 ++--
 5 files changed, 101 insertions(+), 82 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 0dcbf16..8e9ba56 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -150,6 +150,7 @@ #define FIB_RES_NETMASK(res)	(0)
 #endif /* CONFIG_IP_ROUTE_MULTIPATH_WRANDOM */

 struct fib_table {
+	struct hlist_node tb_hlist;
 	u32		tb_id;
 	unsigned	tb_stamp;
 	int		(*tb_lookup)(struct fib_table *tb, const struct flowi *flp, struct fib_result *res);
@@ -200,29 +201,13 @@ static inline void fib_select_default(co
 }

 #else /* CONFIG_IP_MULTIPLE_TABLES */
-#define ip_fib_local_table (fib_tables[RT_TABLE_LOCAL])
-#define ip_fib_main_table (fib_tables[RT_TABLE_MAIN])
+#define ip_fib_local_table fib_get_table(RT_TABLE_LOCAL)
+#define ip_fib_main_table fib_get_table(RT_TABLE_MAIN)

-extern struct fib_table * fib_tables[RT_TABLE_MAX+1];
 extern int fib_lookup(struct flowi *flp, struct fib_result *res);
-extern struct fib_table *__fib_new_table(u32 id);
-
-static inline struct fib_table *fib_get_table(u32 id)
-{
-	if (id == 0)
-		id = RT_TABLE_MAIN;
-
-	return fib_tables[id];
-}
-
-static inline struct fib_table *fib_new_table(u32 id)
-{
-	if (id == 0)
-		id = RT_TABLE_MAIN;
-
-	return fib_tables[id] ? : __fib_new_table(id);
-}

+extern struct fib_table *fib_new_table(u32 id);
+extern struct fib_table *fib_get_table(u32 id);
 extern void fib_select_default(const struct flowi *flp, struct fib_result *res);

 #endif /* CONFIG_IP_MULTIPLE_TABLES */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2696ede..ad4c14f 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -37,6 +37,7 @@ #include <linux/if_arp.h>
 #include <linux/skbuff.h>
 #include <linux/netlink.h>
 #include <linux/init.h>
+#include <linux/list.h>

 #include <net/ip.h>
 #include <net/protocol.h>
@@ -51,48 +52,67 @@ #define FFprint(a...) printk(KERN_DEBUG

 #ifndef CONFIG_IP_MULTIPLE_TABLES

-#define RT_TABLE_MIN RT_TABLE_MAIN
-
 struct fib_table *ip_fib_local_table;
 struct fib_table *ip_fib_main_table;

-#else
+#define FIB_TABLE_HASHSZ 1
+static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];

-#define RT_TABLE_MIN 1
+#else

-struct fib_table *fib_tables[RT_TABLE_MAX+1];
+#define FIB_TABLE_HASHSZ 256
+static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];

-struct fib_table *__fib_new_table(u32 id)
+struct fib_table *fib_new_table(u32 id)
 {
 	struct fib_table *tb;
+	unsigned int h;

+	if (id == 0)
+		id = RT_TABLE_MAIN;
+	tb = fib_get_table(id);
+	if (tb)
+		return tb;
 	tb = fib_hash_init(id);
 	if (!tb)
 		return NULL;
-	fib_tables[id] = tb;
+	h = id & (FIB_TABLE_HASHSZ - 1);
+	hlist_add_head_rcu(&tb->tb_hlist, &fib_table_hash[h]);
 	return tb;
 }

+struct fib_table *fib_get_table(u32 id)
+{
+	struct fib_table *tb;
+	struct hlist_node *node;
+	unsigned int h;

+	if (id == 0)
+		id = RT_TABLE_MAIN;
+	h = id & (FIB_TABLE_HASHSZ - 1);
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(tb, node, &fib_table_hash[h], tb_hlist) {
+		if (tb->tb_id == id) {
+			rcu_read_unlock();
+			return tb;
+		}
+	}
+	rcu_read_unlock();
+	return NULL;
+}
 #endif /* CONFIG_IP_MULTIPLE_TABLES */

-
 static void fib_flush(void)
 {
 	int flushed = 0;
-#ifdef CONFIG_IP_MULTIPLE_TABLES
 	struct fib_table *tb;
-	u32 id;
+	struct hlist_node *node;
+	unsigned int h;

-	for (id = RT_TABLE_MAX; id>0; id--) {
-		if ((tb = fib_get_table(id))==NULL)
-			continue;
-		flushed += tb->tb_flush(tb);
+	for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
+		hlist_for_each_entry(tb, node, &fib_table_hash[h], tb_hlist)
+			flushed += tb->tb_flush(tb);
 	}
-#else /* CONFIG_IP_MULTIPLE_TABLES */
-	flushed +=
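The core idea of the patch above, replacing an array indexed by table ID with a small hash keyed on the ID, can be sketched in plain userspace C. This is only an illustration under stated assumptions: it uses an ordinary singly linked chain instead of the kernel's hlist, has no RCU or locking, and the names `struct table`, `table_bucket`, `table_get` and `table_new` are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_HASHSZ 256   /* bucket count; must be a power of two for the mask */
#define TABLE_MAIN   254u  /* mirrors the value of RT_TABLE_MAIN */

/* Hypothetical stand-in for struct fib_table: only the ID and a chain link. */
struct table {
	struct table *next;
	uint32_t id;
};

static struct table *table_hash[TABLE_HASHSZ];

/* Map a 32-bit table ID to a bucket, like id & (FIB_TABLE_HASHSZ - 1). */
static unsigned int table_bucket(uint32_t id)
{
	assert((TABLE_HASHSZ & (TABLE_HASHSZ - 1)) == 0);
	return id & (TABLE_HASHSZ - 1);
}

/* Look up a table by ID; ID 0 is treated as the main table, as in the patch. */
static struct table *table_get(uint32_t id)
{
	struct table *tb;

	if (id == 0)
		id = TABLE_MAIN;
	for (tb = table_hash[table_bucket(id)]; tb != NULL; tb = tb->next)
		if (tb->id == id)
			return tb;
	return NULL;
}

/* Return the existing table for this ID, or allocate one and link it in. */
static struct table *table_new(uint32_t id)
{
	struct table *tb;
	unsigned int h;

	if (id == 0)
		id = TABLE_MAIN;
	tb = table_get(id);
	if (tb != NULL)
		return tb;
	tb = calloc(1, sizeof(*tb));
	if (tb == NULL)
		return NULL;
	tb->id = id;
	h = table_bucket(id);
	tb->next = table_hash[h];   /* push onto the head of the bucket chain */
	table_hash[h] = tb;
	return tb;
}
```

With this layout, memory use depends on the number of tables actually created rather than on the largest possible ID, and walking all tables means walking the buckets instead of counting through every ID.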
[DECNET 05/06]: Increase number of possible routing tables to 2^32
[DECNET]: Increase number of possible routing tables to 2^32

Increase the number of possible routing tables to 2^32 by replacing the
fixed sized array of pointers by a hash table and replacing iterations
over all possible table IDs by hash table walking.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 9203e4cdab89d96c474c6a903ef9a1f47c7eee07
tree e0c6a2c5e3a691919863b4eb871fc3a25ebd5d44
parent cad398a8f3ef363abba9e6450dded94a022c96fa
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:19 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:19 +0200

 include/net/dn_fib.h  |    3 -
 net/decnet/dn_fib.c   |   49 ---
 net/decnet/dn_rules.c |    2 -
 net/decnet/dn_table.c |  125 -
 4 files changed, 93 insertions(+), 86 deletions(-)

diff --git a/include/net/dn_fib.h b/include/net/dn_fib.h
index cd9c378..d97aa10 100644
--- a/include/net/dn_fib.h
+++ b/include/net/dn_fib.h
@@ -94,6 +94,7 @@ #define DN_FIB_INFO(f) ((f)->fn_info)

 struct dn_fib_table {
+	struct hlist_node hlist;
 	u32 n;

 	int (*insert)(struct dn_fib_table *t, struct rtmsg *r,
@@ -177,8 +178,6 @@ static inline void dn_fib_res_put(struct
 	fib_rule_put(res->r);
 }

-extern struct dn_fib_table *dn_fib_tables[];
-
 #else /* Endnode */

 #define dn_fib_init() do { } while(0)
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index fb59637..5ccca3e 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -532,39 +532,6 @@ int dn_fib_rtm_newroute(struct sk_buff *
 	return -ENOBUFS;
 }

-
-int dn_fib_dump(struct sk_buff *skb, struct netlink_callback *cb)
-{
-	u32 t;
-	u32 s_t;
-	struct dn_fib_table *tb;
-
-	if (NLMSG_PAYLOAD(cb->nlh, 0) >= sizeof(struct rtmsg) &&
-		((struct rtmsg *)NLMSG_DATA(cb->nlh))->rtm_flags&RTM_F_CLONED)
-		return dn_cache_dump(skb, cb);
-
-	s_t = cb->args[0];
-	if (s_t == 0)
-		s_t = cb->args[0] = RT_MIN_TABLE;
-
-	for(t = s_t; t <= RT_TABLE_MAX; t++) {
-		if (t < s_t)
-			continue;
-		if (t > s_t)
-			memset(&cb->args[1], 0,
-				sizeof(cb->args) - sizeof(cb->args[0]));
-		tb = dn_fib_get_table(t, 0);
-		if (tb == NULL)
-			continue;
-		if (tb->dump(tb, skb, cb) < 0)
-			break;
-	}
-
-	cb->args[0] = t;
-
-	return skb->len;
-}
-
 static void fib_magic(int cmd, int type, __le16 dst, int dst_len, struct dn_ifaddr *ifa)
 {
 	struct dn_fib_table *tb;
@@ -762,22 +729,6 @@ int dn_fib_sync_up(struct net_device *de
 	return ret;
 }

-void dn_fib_flush(void)
-{
-	int flushed = 0;
-	struct dn_fib_table *tb;
-	u32 id;
-
-	for(id = RT_TABLE_MAX; id > 0; id--) {
-		if ((tb = dn_fib_get_table(id, 0)) == NULL)
-			continue;
-		flushed += tb->flush(tb);
-	}
-
-	if (flushed)
-		dn_rt_cache_flush(-1);
-}
-
 static struct notifier_block dn_fib_dnaddr_notifier = {
 	.notifier_call = dn_fib_dnaddr_event,
 };
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 096f127..878312f 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -210,7 +210,7 @@ unsigned dnet_addr_type(__le16 addr)
 	struct flowi fl = { .nl_u = { .dn_u = { .daddr = addr } } };
 	struct dn_fib_res res;
 	unsigned ret = RTN_UNICAST;
-	struct dn_fib_table *tb = dn_fib_tables[RT_TABLE_LOCAL];
+	struct dn_fib_table *tb = dn_fib_get_table(RT_TABLE_LOCAL, 0);

 	res.r = NULL;

diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index d2ad791..5701a3f 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -75,9 +75,9 @@ #define DN_FIB_SCAN_KEY(f, fp, key) \
 for( ; ((f) = *(fp)) != NULL && dn_key_eq((f)->fn_key, (key)); (fp) = &(f)->fn_next)

 #define RT_TABLE_MIN 1
-
+#define DN_FIB_TABLE_HASHSZ 256
+static struct hlist_head dn_fib_table_hash[DN_FIB_TABLE_HASHSZ];
 static DEFINE_RWLOCK(dn_fib_tables_lock);
-struct dn_fib_table *dn_fib_tables[RT_TABLE_MAX + 1];

 static kmem_cache_t *dn_hash_kmem __read_mostly;
 static int dn_fib_hash_zombies;
@@ -357,7 +357,7 @@ static __inline__ int dn_hash_dump_bucke
 {
 	int i, s_i;

-	s_i = cb->args[3];
+	s_i = cb->args[4];
 	for(i = 0; f; i++, f = f->fn_next) {
 		if (i < s_i)
 			continue;
@@ -370,11 +370,11 @@ static __inline__ int dn_hash_dump_bucke
 				(f->fn_state & DN_S_ZOMBIE) ? 0 : f->fn_type,
 				f->fn_scope, &f->fn_key, dz->dz_order,
 				f->fn_info, NLM_F_MULTI) < 0) {
-			cb->args[3] = i;
+			cb->args[4] = i;
 			return -1;
[NET 06/06]: Increase RT_TABLE_MAX to 2^32
[NET]: Increase RT_TABLE_MAX to 2^32

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit f20cbb83204cd7e2ffa9cf4e8ee8b6353628d5d3
tree 8f0eaa4219506715449e7118037040f396875c99
parent 9203e4cdab89d96c474c6a903ef9a1f47c7eee07
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:49 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:49 +0200

 include/linux/rtnetlink.h |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index b01bc8b..a616c68 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -239,10 +239,8 @@ enum rt_class_t
 	RT_TABLE_DEFAULT=253,
 	RT_TABLE_MAIN=254,
 	RT_TABLE_LOCAL=255,
-	__RT_TABLE_MAX
+	RT_TABLE_MAX=0xFFFFFFFF
 };
-#define RT_TABLE_MAX (__RT_TABLE_MAX - 1)
-


 /* Routing message attributes */
[IPV6 04/06]: Increase number of possible routing tables to 2^32
[IPV6]: Increase number of possible routing tables to 2^32

Increase number of possible routing tables to 2^32 by replacing iterations
over all possible table IDs by hash table walking.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit cad398a8f3ef363abba9e6450dded94a022c96fa
tree 4fea9c50650ab65d942dca9c2545d1810b227839
parent 148d1ca7c199005b5a92f8154a7caf3f78529672
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:53:33 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:53:33 +0200

 include/net/ip6_route.h |    7 ++
 net/ipv6/ip6_fib.c      |  171 ++-
 net/ipv6/route.c        |  128 ---
 3 files changed, 159 insertions(+), 147 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 9bfa3cc..01bfe40 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -137,6 +137,13 @@ extern int inet6_rtm_newroute(struct sk_
 extern int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
 extern int inet6_rtm_getroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);

+struct rt6_rtnl_dump_arg
+{
+	struct sk_buff *skb;
+	struct netlink_callback *cb;
+};
+
+extern int rt6_dump_route(struct rt6_info *rt, void *p_arg);
 extern void rt6_ifdown(struct net_device *dev);
 extern void rt6_mtu_change(struct net_device *dev, unsigned mtu);

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 1f23161..bececbe 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -158,7 +158,26 @@ static struct fib6_table fib6_main_tbl =
 };

 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
+#define FIB_TABLE_HASHSZ 256
+#else
+#define FIB_TABLE_HASHSZ 1
+#endif
+static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
+
+static void fib6_link_table(struct fib6_table *tb)
+{
+	unsigned int h;
+
+	h = tb->tb6_id & (FIB_TABLE_HASHSZ - 1);

+	/*
+	 * No protection necessary, this is the only list mutatation
+	 * operation, tables never disappear once they exist.
+	 */
+	hlist_add_head_rcu(&tb->tb6_hlist, &fib_table_hash[h]);
+}
+
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
 static struct fib6_table fib6_local_tbl = {
 	.tb6_id = RT6_TABLE_LOCAL,
 	.tb6_lock = RW_LOCK_UNLOCKED,
@@ -168,9 +187,6 @@ static struct fib6_table fib6_local_tbl
 	},
 };

-#define FIB_TABLE_HASHSZ 256
-static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
-
 static struct fib6_table *fib6_alloc_table(u32 id)
 {
 	struct fib6_table *table;
@@ -186,19 +202,6 @@ static struct fib6_table *fib6_alloc_tab
 	return table;
 }

-static void fib6_link_table(struct fib6_table *tb)
-{
-	unsigned int h;
-
-	h = tb->tb6_id & (FIB_TABLE_HASHSZ - 1);
-
-	/*
-	 * No protection necessary, this is the only list mutatation
-	 * operation, tables never disappear once they exist.
-	 */
-	hlist_add_head_rcu(&tb->tb6_hlist, &fib_table_hash[h]);
-}
-
 struct fib6_table *fib6_new_table(u32 id)
 {
 	struct fib6_table *tb;
@@ -263,10 +266,135 @@ struct dst_entry *fib6_rule_lookup(struc

 static void __init fib6_tables_init(void)
 {
+	fib6_link_table(&fib6_main_tbl);
 }

 #endif

+static int fib6_dump_node(struct fib6_walker_t *w)
+{
+	int res;
+	struct rt6_info *rt;
+
+	for (rt = w->leaf; rt; rt = rt->u.next) {
+		res = rt6_dump_route(rt, w->args);
+		if (res < 0) {
+			/* Frame is full, suspend walking */
+			w->leaf = rt;
+			return 1;
+		}
+		BUG_TRAP(res!=0);
+	}
+	w->leaf = NULL;
+	return 0;
+}
+
+static void fib6_dump_end(struct netlink_callback *cb)
+{
+	struct fib6_walker_t *w = (void*)cb->args[2];
+
+	if (w) {
+		cb->args[2] = 0;
+		kfree(w);
+	}
+	cb->done = (void*)cb->args[3];
+	cb->args[1] = 3;
+}
+
+static int fib6_dump_done(struct netlink_callback *cb)
+{
+	fib6_dump_end(cb);
+	return cb->done ? cb->done(cb) : 0;
+}
+
+static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
+			   struct netlink_callback *cb)
+{
+	struct fib6_walker_t *w;
+	int res;
+
+	w = (void *)cb->args[2];
+	w->root = &table->tb6_root;
+
+	if (cb->args[4] == 0) {
+		read_lock_bh(&table->tb6_lock);
+		res = fib6_walk(w);
+		read_unlock_bh(&table->tb6_lock);
+		if (res > 0)
+			cb->args[4] = 1;
+	} else {
+		read_lock_bh(&table->tb6_lock);
+		res = fib6_walk_continue(w);
+		read_unlock_bh(&table->tb6_lock);
+		if (res != 0) {
+			if (res < 0)
+				fib6_walker_unlink(w);
+			goto end;
+		}
+		fib6_walker_unlink(w);
+
Re: [RFC][PATCH] VM deadlock prevention core -v3
On Thu, Aug 10, 2006 at 04:46:31PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote:
> On Thu, 2006-08-10 at 18:02 +0400, Evgeniy Polyakov wrote:
> > On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote:
> > > Hi,
> >
> > Hello, Peter.
> >
> > > So I try again, please tell me if I'm still on crack and should go detox.
> > > However if you do so, I kindly request some words on the how and why of it.
> >
> > I think you should talk with doctor in that case, but not with kernel
> > hackers :)
> >
> > I have some comments about implementation, not overall design, since we
> > have slightly diametral points of view there.
> >
> > > --- linux-2.6.orig/net/core/skbuff.c
> > > +++ linux-2.6/net/core/skbuff.c
> > > @@ -43,6 +43,7 @@
> > >  #include <linux/kernel.h>
> > >  #include <linux/sched.h>
> > >  #include <linux/mm.h>
> > > +#include <linux/pagemap.h>
> > >  #include <linux/interrupt.h>
> > >  #include <linux/in.h>
> > >  #include <linux/inet.h>
> > > @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
> > >   *
> > >   */
> > >
> > > +#define ceiling_log2(x)	fls((x) - 1)
> > > +
> > >  /**
> > >   *	__alloc_skb	-	allocate a network buffer
> > >   *	@size: size to allocate
> > > @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int
> > >  	struct sk_buff *skb;
> > >  	u8 *data;
> > >
> > > +	size = SKB_DATA_ALIGN(size);
>
> I moved it here.

Yep.

> > > +
> > > +	if (gfp_mask & __GFP_MEMALLOC) {
> > > +		/*
> > > +		 * Fallback allocation for memalloc reserves.
> > > +		 * This allocator is build on alloc_pages() so that freed
>		 * skbuffs return to the memalloc reserve imediately. SLAB
>		 * memory might not ever be returned.
>
> This was missing,...
>
> > > +		 * the page is populated like so:
> > > +		 *
> > > +		 *   struct sk_buff
> > > +		 *   [ struct sk_buff ]
> > > +		 *   [ atomic_t ]
> > > +		 *   unsigned int
> > > +		 *   struct skb_shared_info
> > > +		 *   char []
> > > +		 *
> > > +		 * We have to do higher order allocations for icky jumbo
> > > +		 * frame drivers :-(. They really should be migrated to
> > > +		 * scather/gather DMA and use skb fragments.
> > > +		 */
> > > +		unsigned int data_offset =
> > > +			sizeof(struct sk_buff) + sizeof(unsigned int);
> > > +		unsigned long length = size + data_offset +
> > > +			sizeof(struct skb_shared_info);
> > > +		unsigned int pages;
> > > +		unsigned int order;
> > > +		struct page *page;
> > > +		void *kaddr;
> > > +
> > > +		/*
> > > +		 * Force fclone alloc in order to fudge a lacking in skb_clone().
> > > +		 */
> > > +		fclone = 1;
> > > +		if (fclone) {
> > > +			data_offset += sizeof(struct sk_buff) + sizeof(atomic_t);
> > > +			length += sizeof(struct sk_buff) + sizeof(atomic_t);
> > > +		}
> > > +		pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +		order = ceiling_log2(pages);
> > > +		skb = NULL;
> > > +		if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
> > > +			goto out;
> > > +
> > > +		kaddr = pfn_to_kaddr(page_to_pfn(page));
> > > +		skb = (struct sk_buff *)kaddr;
> > > +
> > > +		*((unsigned int *)(kaddr + data_offset -
> > > +					sizeof(unsigned int))) = order;
> > > +		data = (u8 *)(kaddr + data_offset);
> > > +
> >
> > Tricky, but since you are using own allocator here, you could change it to
> > be not so aggressive - i.e. do not round size to number of pages.
>
> I'm not sure I follow you, I'm explicitly using alloc_pages()/free_page(),
> if I were to go smart here, I'd lose the whole reason for doing so.

You can use page to put there several skbs for example or at least add
there a fclone (fast clone).

> > > +		goto allocated;
> > > +	}
> > > +
> > >  	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
> > >
> > >  	/* Get the HEAD */
> > > @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int
> > >  		goto out;
> > >
> > >  	/* Get the DATA. Size must match skb_add_mtu(). */
> > > -	size = SKB_DATA_ALIGN(size);
> >
> > Bad sign.
>
> See above.

Yep, I've found.

> > >  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> > >  	if (!data)
> > >  		goto nodata;

> > > +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
> > > +		unsigned length, gfp_t gfp_mask)
> > > +{
> > > +	struct sk_buff *skb;
> > > +
> > > +	WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
> > > +	gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
> > > +
> > > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
> > > +	if (skb)
> > > +		goto done;
> > > +
> > > +	if (atomic_read(&dev->rx_reserve_used) >=
> > > +			dev->rx_reserve * dev->memalloc_socks)
> > > +		goto out;
> > > +
> > > +	/*
> > > +	 * pre-inc guards against a race with netdev_wait_memalloc()
> > > +	 */
> > > +	atomic_inc(&dev->rx_reserve_used);
> > > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
> > > +	if
Re: [RFC][PATCH] VM deadlock prevention core -v3
> > > Tricky, but since you are using own allocator here, you could change it to
> > > be not so aggressive - i.e. do not round size to number of pages.
> >
> > I'm not sure I follow you, I'm explicitly using alloc_pages()/free_page(),
> > if I were to go smart here, I'd lose the whole reason for doing so.
>
> You can use page to put there several skbs for example or at least add
> there a fclone (fast clone).

fclone support is there.

> > > > +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
> > > > +		unsigned length, gfp_t gfp_mask)
> > > > +{
> > > > +	struct sk_buff *skb;
> > > > +
> > > > +	WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
> > > > +	gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
> > > > +
> > > > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
> > > > +	if (skb)
> > > > +		goto done;
> > > > +
> > > > +	if (atomic_read(&dev->rx_reserve_used) >=
> > > > +			dev->rx_reserve * dev->memalloc_socks)
> > > > +		goto out;
> > > > +
> > > > +	/*
> > > > +	 * pre-inc guards against a race with netdev_wait_memalloc()
> > > > +	 */
> > > > +	atomic_inc(&dev->rx_reserve_used);
> > > > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
> > > > +	if (unlikely(!skb)) {
> > > > +		atomic_dec(&dev->rx_reserve_used);
> > > > +		goto out;
> > > > +	}
> > >
> > > Since you have added atomic operation in that path, you can use
> > > device's reference counter instead and do not care that it can
> > > disappear.
> >
> > Is that the sole reason taking a reference on the device is bad?
>
> Taking a reference is bad due to performance reasons, since atomic
> increment is not that cheap. If you do it for one variable for the
> purpose of reference counting you can use device's refcnt instead, which
> will solve some races.

Yes, I understand you. However I'm not sure if performance is the only
reason not to take a refcount on the device.

Anyway, I think I have just been convinced to abandon the per device
thing and go global.

> > > > @@ -434,6 +567,12 @@ struct sk_buff *skb_clone(struct sk_buff
> > > >  		n->fclone = SKB_FCLONE_CLONE;
> > > >  		atomic_inc(fclone_ref);
> > > >  	} else {
> > > > +		/*
> > > > +		 * should we special-case skb->memalloc cloning?
> > > > +		 * for now fudge it by forcing fast-clone alloc.
> > > > +		 */
> > > > +		BUG_ON(skb->memalloc);
> > > > +
> > > >  		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> > > >  		if (!n)
> > > >  			return NULL;
> > >
> > > Ugh... cloning is one of the shoulders of giants on which the Linux
> > > network stack is standing...
> >
> > Yes, I'm aware of that, I have a plan to fix this, however I haven't had
> > time to implement it. My immediate concern is the point wrt. the
> > net_device mapping.
> >
> > My idea was: instead of the order, store the size, and allocate clone
> > skbuffs in the available room at the end of the page(s), allocating extra
> > pages if needed.
>
> You can check if requested skb with fclone fits allocated pages, and if
> so use fclone magic, otherwise postpone clone allocation until it is
> required.

Yes the fclone magic works, however that will only let you have one
clone. I'm just not confident no receive path will ever exceed that.

> Sockets can live without network devices at all, I expect it is enough
> to clean up in socket destructor, since packets can come from different
> devices into the same socket.

You are right if the reserve wasn't device bound - which I will abandon
because you are right that with multi-path routing, bridge device and
other advanced goodies this scheme is broken in that there is no
unambiguous mapping from sockets to devices.
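The concern about rounding the allocation up to a whole number of pages can be made concrete with a little arithmetic. The sketch below reuses the patch's `ceiling_log2(x) = fls(x - 1)` definition; everything else is an assumption for illustration: a 4 KiB page size, a portable `sketch_fls()` standing in for the kernel's `fls()`, and the hypothetical helpers `order_for_length()` and `slack_for_length()`.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12                          /* assumed: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Stand-in for the kernel's fls(): 1-based index of the highest set bit,
 * 0 when x == 0. */
static int sketch_fls(unsigned int x)
{
	int bit = 0;

	while (x != 0) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* ceiling_log2(x) as defined in the patch: fls(x - 1). Valid for x >= 1. */
static int ceiling_log2(unsigned int x)
{
	assert(x >= 1);
	return sketch_fls(x - 1);
}

/* Page allocation order needed to hold `length` bytes (length >= 1). */
static int order_for_length(unsigned long length)
{
	unsigned int pages =
		(unsigned int)((length + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT);

	return ceiling_log2(pages);
}

/* Bytes left unused after rounding up to 2^order pages. */
static unsigned long slack_for_length(unsigned long length)
{
	return (SKETCH_PAGE_SIZE << order_for_length(length)) - length;
}
```

Under these assumptions a 9000-byte request needs 3 pages but an order-2 (4 page) allocation, so `slack_for_length(9000)` is 7384 bytes. That unused tail is the room the thread discusses filling with clone skbuffs.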
Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....
On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity
quite badly. It basically totally refuses to answer an IPv6 Neighbor Solicit
packet or IPv6 Echo Request packet. I run a 'tcpdump -n ipv6', and I see the
requests come in, and no packets leaving. Interestingly enough, if I try to
ping6 *out* of the box, it's totally willing to send a Neighbor Solicit outbound
(although it appears to totally ignore the Neighbor Advert packet that comes
back). Of course, things don't work very well at all with busticated
Neighbor Solicit.

A kernel built with IPV6_MULTIPLE_TABLES=n works just fine.

The relevant ifconfig (eth3 is a 100mbit port, eth5 is a wireless card):

eth3      Link encap:Ethernet  HWaddr 00:06:5B:EA:8E:4E
          inet addr:128.173.14.107  Bcast:128.173.15.255  Mask:255.255.252.0
          inet6 addr: 2001:468:c80:2103:206:5bff:feea:8e4e/64 Scope:Global
          inet6 addr: fe80::206:5bff:feea:8e4e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15529 errors:0 dropped:0 overruns:1 frame:0
          TX packets:2073 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2333290 (2.2 MiB)  TX bytes:228862 (223.4 KiB)
          Interrupt:11 Base address:0x6800

eth5      Link encap:Ethernet  HWaddr 00:02:2D:5C:11:48
          inet addr:198.82.168.129  Bcast:198.82.168.255  Mask:255.255.255.0
          inet6 addr: 2001:468:c80:2181:202:2dff:fe5c:1148/64 Scope:Global
          inet6 addr: fe80::202:2dff:fe5c:1148/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2096 errors:0 dropped:0 overruns:0 frame:0
          TX packets:144 errors:1 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:280919 (274.3 KiB)  TX bytes:22184 (21.6 KiB)
          Interrupt:11 Base address:0xe100

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1583 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1583 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:642598 (627.5 KiB)  TX bytes:642598 (627.5 KiB)

A working routing table:

netstat -r -n -A inet6
Kernel IPv6 routing table
Destination                               Next Hop                   Flags Metric Ref Use Iface
::1/128                                   ::                         U 0 12 1 lo
2001:468:c80:2103:206:5bff:feea:8e4e/128  ::                         U 0 41 lo
2001:468:c80:2103::/64                    ::                         UA256113 0 eth3
2001:468:c80:2181:202:2dff:fe5c:1148/128  ::                         U 0 01 lo
2001:468:c80:2181::/64                    ::                         UA25611 0 eth5
fe80::202:2dff:fe5c:1148/128              ::                         U 0 01 lo
fe80::206:5bff:feea:8e4e/128              ::                         U 0 21 lo
fe80::/64                                 ::                         U 25600 eth3
fe80::/64                                 ::                         U 25600 eth5
ff02::1/128                               ff02::1                    UC0 113 0 eth3
ff02::1/128                               ff02::1                    UC0 10 eth5
ff00::/8                                  ::                         U 25600 eth3
ff00::/8                                  ::                         U 25600 eth5
::/0                                      fe80::20f:35ff:fe3e:d41a   UGDA 1024 10 eth3
::/0                                      fe80::20f:35ff:fe3e:d41a   UGDA 1024 10 eth5
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> What's wrong with listening to the notification for that purpose?

Nothing! NLM_F_ECHO _is_ listening for notifications, without subscribing
to multicast groups and without the need to figure out which messages are
yours. But beyond this, NLM_F_ECHO is totally a subset of this.

Which still makes much more sense than echoing of a known thing, does it not?

Alexey
Re: [IPROUTE]: Support IPv6 routing table filter
On Thu, 10 Aug 2006 22:42:58 +0200
Patrick McHardy [EMAIL PROTECTED] wrote:

> Support IPv6 routing table filter in presence of multiple tables,
> f.e. "ip -6 route list table 123". Compatibility is preserved for
> kernels not supporting multiple IPv6 tables.

Applied, thanks.
Re: skb_shared_info()
On Tue, Aug 08, 2006 at 04:39:15PM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> I'm beginning to think that where we store the
> skb_shared_info() is a weakness of the SKB design.

Food for thought: unix sockets can use PAGE_SIZEd chunks of memory
(and they do it almost always), which end up as 2*PAGE_SIZE allocations
due to alignment issues with skb_shared_info, so unix sockets waste
3.5 KB of memory on each skb.

I think it is time to resurrect the idea of placing shared_info inside
the skb and allowing it to be allocated from its own cache for special
cases. What do you think?

--
	Evgeniy Polyakov
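The waste figure quoted above follows from simple rounding, which can be sketched as below. The model is an assumption, not a description of the actual allocator: it treats the data area and the trailing shared info as one allocation rounded up to the next power of two. `round_up_pow2()` and `wasted_bytes()` are hypothetical helpers, and the 512-byte info size in the usage note is picked only because it reproduces the quoted 3.5 KB; the real `sizeof(struct skb_shared_info)` depends on architecture and configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two, modelling an allocator that only
 * hands out power-of-two sized chunks (an assumption of this sketch). */
static size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Bytes wasted when `data` bytes of payload and `info` bytes of trailing
 * metadata have to share a single allocation. */
static size_t wasted_bytes(size_t data, size_t info)
{
	return round_up_pow2(data + info) - (data + info);
}
```

With a 4096-byte payload and an assumed 512-byte info block, `wasted_bytes(4096, 512)` is 3584 bytes, i.e. 3.5 KB. If the info block lived elsewhere, `wasted_bytes(4096, 0)` would be 0, which is the motivation for allocating it from its own cache.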
[IPROUTE]: Support IPv6 routing table filter
Support IPv6 routing table filter in presence of multiple tables,
f.e. "ip -6 route list table 123". Compatibility is preserved for
kernels not supporting multiple IPv6 tables.

[IPROUTE]: Support IPv6 routing table filter

The current behaviour for IPv6 routing table filters is to derive the
table from the route type. This doesn't really work anymore now that
IPv6 supports multiple tables. Add detection for IPv6 multiple table
support (relying on the fact that the first routes dumped belong to
the local table and have rtm_table == RT_TABLE_LOCAL with multiple
tables) and handle it like other protocols.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 14d210c56edd67973439acd67d916de84a6e0384
tree 5678d9dba5c1b8a0b25133a89bce5d4e473a1160
parent e81c1a22cd2408a8b490ce39bf6ece2d19919a3b
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 22:39:21 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 22:39:21 +0200

 ip/iproute.c |    6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 8f4a55d..1645f0b 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -138,6 +138,7 @@ int print_route(const struct sockaddr_nl
 	inet_prefix prefsrc;
 	inet_prefix via;
 	int host_len = -1;
+	static int ip6_multiple_tables;
 	SPRINT_BUF(b1);


@@ -163,7 +164,10 @@ int print_route(const struct sockaddr_nl
 	else if (r->rtm_family == AF_IPX)
 		host_len = 80;

-	if (r->rtm_family == AF_INET6) {
+	if (r->rtm_family == AF_INET6 && r->rtm_table != RT_TABLE_MAIN)
+		ip6_multiple_tables = 1;
+
+	if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) {
 		if (filter.tb) {
 			if (filter.tb < 0) {
 				if (!(r->rtm_flags&RTM_F_CLONED))
Re: [PATCH 4/5] net: socket family using RCU
> On Wed, Aug 09, 2006 at 11:31:42AM -0700, Stephen Hemminger wrote:
> > Replace the gross custom locking done in socket code for net_family[]
> > with simple RCU usage. Some reordering necessary to avoid sleep issues
> > with sock_alloc.
>
> Definitely a good use of RCU from a read-intensive standpoint -- does
> anyone other than Linux-kernel networking developers change the elements
> of the net_family[] array except at boot and shutdown? ;-)
>
> Some comments included below. Looks good, but one question about things
> like atalk_create() being able to sleep and a place or two where a
> comment would be good.
>
> > ...
> > +/*
> > + *	Allocate the socket and allow the family to set things up. if
> > + *	the protocol is 0, the family is instructed to select an appropriate
> > + *	default.
> > + */
> > +	sock = sock_alloc();
> > +	if (!sock) {
> > +		printk(KERN_WARNING "socket: no more sockets\n");
> > +		return -ENFILE;	/* Not exactly a match, but its the
> > +				   closest posix thing */
> > +	}
> > +
> > +	sock->type = type;
> > +
> >  #if defined(CONFIG_KMOD)
> >  	/* Attempt to load a protocol module if the find failed.
> >  	 *
> > @@ -1166,70 +1138,59 @@
> >  	 * requested real, full-featured networking support upon configuration.
> >  	 * Otherwise module support will break!
> >  	 */
> > -	if (net_families[family] == NULL) {
> > +	if (net_families[family] == NULL)
> >  		request_module("net-pf-%d", family);
>
> OK, I'll bite... What happens if the module is not present? Or is this
> what the "Otherwise module support will break" comment is getting at?

request_module loads the module (and blocks). One would expect that the
module loaded would set net_families[] via sock_register, so later
reference would succeed. Comment is historical since intention was to
make base socket code itself modular which never was done, and is
probably a bad idea to even consider. If module is not present, then
net_families[] will still be NULL.

> Also, this reference to net_families[family] is done without
> rcu_dereference() and without any clear update-side lock. This just
> happens to be OK, since we are only testing for NULL, but should at
> least have a comment.
>
> > -	}
> >  #endif
> >
> > -	net_family_read_lock();
> > -	if (net_families[family] == NULL) {
> > -		err = -EAFNOSUPPORT;
> > -		goto out;
> > -	}
> > -
> > -/*
> > - *	Allocate the socket and allow the family to set things up. if
> > - *	the protocol is 0, the family is instructed to select an appropriate
> > - *	default.
> > - */
> > -
> > -	if (!(sock = sock_alloc())) {
> > -		printk(KERN_WARNING "socket: no more sockets\n");
> > -		err = -ENFILE;		/* Not exactly a match, but its the
> > -					   closest posix thing */
> > -		goto out;
> > -	}
> > -
> > -	sock->type = type;
> > +	rcu_read_lock();
> > +	pf = rcu_dereference(net_families[family]);
>
> OK, so the elements of the net_families array are protected by RCU.
> All references should either be under rcu_read_lock() and accessed
> via rcu_dereference() or under the update-side lock, whatever that
> might be.

Yes, the net_family_lock

> > -/*
> > +/**
> > + *	sock_unregister - remove a protocol handler
> > + *	@family: protocol family to remove
> > + *
> >   *	This function is called by a protocol handler that wants to
> >   *	remove its address family, and have it unlinked from the
> > - *	SOCKET module.
> > + *	new socket creation.
> > + *
> > + *	If protocol handler is a module, then it can use module reference
> > + *	counts to protect against new references. If protocol handler is not
> > + *	a module then it needs to provide its own protection in
> > + *	the ops->create routine.
> >   */
> > -
> >  int sock_unregister(int family)
> >  {
> >  	if (family < 0 || family >= NPROTO)
> > -		return -1;
> > +		return -EINVAL;
> >
> > -	net_family_write_lock();
> > +	spin_lock(&net_family_lock);
> >  	net_families[family] = NULL;
>
> And this one is covered by net_families_lock, so we are set, since this
> is the last one.
>
> > -	net_family_write_unlock();
> > +	spin_unlock(&net_family_lock);
> > +
> > +	synchronize_rcu();
>
> OK, and the caller is presumably going to free up whatever needs to be
> freed. Or, if nothing need be freed, beyond this point, we know that all
> non-sleeping code paths through any of the net_protocol_family functions
> have completed. (So, are all of the functions non-sleeping, or do we
> care? The definition of net_protocol_family in include/linux/net.h
> doesn't say that they need to be non-sleeping...)
>
> atalk_create() can potentially sleep in the following line of code:
>
>	sk = sk_alloc(PF_APPLETALK, GFP_KERNEL, &ddp_proto, 1);

The module reference counts are used to prevent that. Since appletalk
module can't be unloaded until there are no more appletalk sockets open
(ie ref count of appletalk module is zero). To prevent new references
there is a call to try_module_get() before the
net_families[family]->create() call. This happens inside rcu_read_lock.

> What prevents
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello

* Alexey Kuznetsov [EMAIL PROTECTED] 2006-08-11 00:32
> Nothing! NLM_F_ECHO _is_ listening for notifications, without subscribing
> to multicast groups and without the need to figure out which messages are
> yours. But beyond this, NLM_F_ECHO is totally a subset of this.
>
> Which still makes much more sense than echoing of a known thing, does it not?

I get your point and I see the value. Unfortunately, probably due to
lack of documentation, this feature isn't used by any applications
I know of. We even put in the hacks to make identification of
self-caused notifications easier by storing the netlink pid of the
originator in the notification message.

I will put this back in (document it! :) and hide it behind
nlmsg_notify() so we do it for all notifications for consistency.

I use echoing of the original request for debugging purposes; it
allows verifying what is actually being parsed in the netlink
family specific parsing function. Using libnl, a flag enables
NLM_F_ECHO in all messages, and it gets simple to verify what
exactly is being seen in the kernel side parser by looking at the
message log.

I agree, there is no functional value besides the possibility to
implement a netlink ping with NLMSG_NOOP.
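For readers who have not used the flag, requesting an echo amounts to setting one bit in the netlink message header. The sketch below is a hedged illustration on a Linux system, using only the header definitions from `<linux/netlink.h>`. `init_request()` is a hypothetical helper, no socket is opened and nothing is sent, and whether a given netlink family honours NLM_F_ECHO is exactly what the thread is debating.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Fill in a header-only netlink request. When want_echo is nonzero the
 * NLM_F_ECHO flag is set, asking the kernel to echo the resulting
 * notification back to the sender. Hypothetical helper for illustration.
 */
static void init_request(struct nlmsghdr *nlh, unsigned short type,
			 unsigned int seq, int want_echo)
{
	memset(nlh, 0, sizeof(*nlh));
	nlh->nlmsg_len = NLMSG_LENGTH(0);   /* header only, no payload */
	nlh->nlmsg_type = type;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	if (want_echo)
		nlh->nlmsg_flags |= NLM_F_ECHO;
	nlh->nlmsg_seq = seq;               /* lets the sender match replies */
	nlh->nlmsg_pid = 0;                 /* 0: leave sender identification to the kernel */
}
```

A "netlink ping" in the sense mentioned above would use `NLMSG_NOOP` as the type, e.g. `init_request(&nlh, NLMSG_NOOP, 1, 1)`, before handing the buffer to `sendmsg()` on an `AF_NETLINK` socket.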
[IPROUTE 01/03]: Preparation for 32 bit table IDs
[IPROUTE]: Preparation for 32 bit table IDs

The route table filter uses an integer for the table number and the
value -1 to represent cloned routes. For 32 bit table IDs it needs
to become an unsigned, so this won't work anymore. Introduce a
new filter flag "cloned" and use it instead of filter.tb = -1.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 00d896184c5f8737269ac05264446c58133ec414
tree 3eb3760b7b5b8b5811cadeaaec1b949533fb5ffd
parent 14d210c56edd67973439acd67d916de84a6e0384
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:19:31 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:19:31 +0200

 ip/iproute.c |   42 +-
 1 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 1645f0b..cb674d7 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -89,6 +89,7 @@ static void usage(void)
 static struct
 {
 	int tb;
+	int cloned;
 	int flushed;
 	char *flushb;
 	int flushp;
@@ -168,22 +169,21 @@ int print_route(const struct sockaddr_nl
 		ip6_multiple_tables = 1;

 	if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) {
+		if (filter.cloned) {
+			if (!(r->rtm_flags&RTM_F_CLONED))
+				return 0;
+		}
 		if (filter.tb) {
-			if (filter.tb < 0) {
-				if (!(r->rtm_flags&RTM_F_CLONED))
-					return 0;
-			} else {
-				if (r->rtm_flags&RTM_F_CLONED)
+			if (r->rtm_flags&RTM_F_CLONED)
+				return 0;
+			if (filter.tb == RT_TABLE_LOCAL) {
+				if (r->rtm_type != RTN_LOCAL)
 					return 0;
-				if (filter.tb == RT_TABLE_LOCAL) {
-					if (r->rtm_type != RTN_LOCAL)
-						return 0;
-				} else if (filter.tb == RT_TABLE_MAIN) {
-					if (r->rtm_type == RTN_LOCAL)
-						return 0;
-				} else {
+			} else if (filter.tb == RT_TABLE_MAIN) {
+				if (r->rtm_type == RTN_LOCAL)
 					return 0;
-				}
+			} else {
+				return 0;
 			}
 		}
 	} else {
@@ -1045,19 +1045,19 @@ static int iproute_list_or_flush(int arg
 			NEXT_ARG();
 			if (rtnl_rttable_a2n(&tid, *argv)) {
 				if (strcmp(*argv, "all") == 0) {
-					tid = 0;
+					filter.tb = 0;
 				} else if (strcmp(*argv, "cache") == 0) {
-					tid = -1;
+					filter.cloned = 1;
 				} else if (strcmp(*argv, "help") == 0) {
 					usage();
 				} else {
 					invarg("table id value is invalid\n", *argv);
 				}
-			}
-			filter.tb = tid;
+			} else
+				filter.tb = tid;
 		} else if (matches(*argv, "cached") == 0 ||
 			   matches(*argv, "cloned") == 0) {
-			filter.tb = -1;
+			filter.cloned = 1;
 		} else if (strcmp(*argv, "tos") == 0 ||
 			   matches(*argv, "dsfield") == 0) {
 			__u32 tos;
@@ -1189,7 +1189,7 @@ static int iproute_list_or_flush(int arg
 		char flushb[4096-512];
 		time_t start = time(0);

-		if (filter.tb == -1) {
+		if (filter.cloned) {
 			if (do_ipv6 != AF_INET6) {
 				iproute_flush_cache();
 				if (show_stats)
@@ -1215,7 +1215,7 @@ static int iproute_list_or_flush(int arg
 			}
 			if (filter.flushed == 0) {
 				if (round == 0) {
-					if (filter.tb != -1 || do_ipv6 == AF_INET6)
+					if (!filter.cloned || do_ipv6 == AF_INET6)
 						fprintf(stderr, "Nothing to flush.\n");
 				} else if (show_stats)
 					printf("*** Flush is complete after %d round%s ***\n", round, round>1?"s":"");
@@ -1239,7 +1239,7 @@ static int
[IPROUTE 03/03]: Add support for larger number of routing tables
[IPROUTE]: Add support for larger number of routing tables

Add support for 2^32 routing tables by using the new RTA_TABLE attribute
for specifying tables > 255 and interpreting it if it is sent by the
kernel. When tables > 255 are used on a kernel not supporting it an error
will occur because of the unknown netlink attribute.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 7980d6ceea890359173344e71c1139b252fd9894
tree 19a33af25df28c002569e85b34a8c90ca517d875
parent ccd621fbb5faa91a98479e9492baee525c6f10c0
author Patrick McHardy [EMAIL PROTECTED] Fri, 11 Aug 2006 00:03:32 +0200
committer Patrick McHardy [EMAIL PROTECTED] Fri, 11 Aug 2006 00:03:32 +0200

 include/linux/rtnetlink.h |    4 ++--
 include/rt_names.h        |    2 +-
 ip/ip_common.h            |    8
 ip/iproute.c              |   21 ++---
 ip/iprule.c               |   14 +++---
 lib/rt_names.c            |    4 ++--
 6 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 5e33a20..d63578c 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -238,9 +238,8 @@ enum rt_class_t
 	RT_TABLE_DEFAULT=253,
 	RT_TABLE_MAIN=254,
 	RT_TABLE_LOCAL=255,
-	__RT_TABLE_MAX
+	RT_TABLE_MAX=0x,
 };
-#define RT_TABLE_MAX (__RT_TABLE_MAX - 1)
 
 
 
@@ -263,6 +262,7 @@ enum rtattr_type_t
 	RTA_CACHEINFO,
 	RTA_SESSION,
 	RTA_MP_ALGO,
+	RTA_TABLE,
 	__RTA_MAX
 };
 
diff --git a/include/rt_names.h b/include/rt_names.h
index 2d9ef10..07a10e0 100644
--- a/include/rt_names.h
+++ b/include/rt_names.h
@@ -5,7 +5,7 @@ #include <asm/types.h>
 
 char* rtnl_rtprot_n2a(int id, char *buf, int len);
 char* rtnl_rtscope_n2a(int id, char *buf, int len);
-char* rtnl_rttable_n2a(int id, char *buf, int len);
+char* rtnl_rttable_n2a(__u32 id, char *buf, int len);
 char* rtnl_rtrealm_n2a(int id, char *buf, int len);
 char* rtnl_dsfield_n2a(int id, char *buf, int len);
 int rtnl_rtprot_a2n(__u32 *id, char *arg);
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 1fe4a69..8b286b0 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -32,4 +32,12 @@ extern int do_multiaddr(int argc, char *
 extern int do_multiroute(int argc, char **argv);
 extern int do_xfrm(int argc, char **argv);
 
+static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
+{
+	__u32 table = r->rtm_table;
+	if (tb[RTA_TABLE])
+		table = *(__u32*) RTA_DATA(tb[RTA_TABLE]);
+	return table;
+}
+
 extern struct rtnl_handle rth;
diff --git a/ip/iproute.c b/ip/iproute.c
index cb674d7..24e7a86 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -140,6 +140,7 @@ int print_route(const struct sockaddr_nl
 	inet_prefix via;
 	int host_len = -1;
 	static int ip6_multiple_tables;
+	__u32 table;
 	SPRINT_BUF(b1);
 
 
@@ -165,7 +166,10 @@ int print_route(const struct sockaddr_nl
 	else if (r->rtm_family == AF_IPX)
 		host_len = 80;
 
-	if (r->rtm_family == AF_INET6 && r->rtm_table != RT_TABLE_MAIN)
+	parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
+	table = rtm_get_table(r, tb);
+
+	if (r->rtm_family == AF_INET6 && table != RT_TABLE_MAIN)
 		ip6_multiple_tables = 1;
 
 	if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) {
@@ -187,7 +191,7 @@ int print_route(const struct sockaddr_nl
 			}
 		}
 	} else {
-		if (filter.tb > 0 && filter.tb != r->rtm_table)
+		if (filter.tb > 0 && filter.tb != table)
 			return 0;
 	}
 	if ((filter.protocol^r->rtm_protocol)&filter.protocolmask)
@@ -217,8 +221,6 @@ int print_route(const struct sockaddr_nl
 	if (filter.rprefsrc.family && r->rtm_family != filter.rprefsrc.family)
 		return 0;
 
-	parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
-
 	memset(&dst, 0, sizeof(dst));
 	dst.family = r->rtm_family;
 	if (tb[RTA_DST])
@@ -371,8 +373,8 @@ int print_route(const struct sockaddr_nl
 		fprintf(fp, "dev %s ", ll_index_to_name(*(int*)RTA_DATA(tb[RTA_OIF])));
 
 	if (!(r->rtm_flags&RTM_F_CLONED)) {
-		if (r->rtm_table != RT_TABLE_MAIN && !filter.tb)
-			fprintf(fp, "table %s ", rtnl_rttable_n2a(r->rtm_table, b1, sizeof(b1)));
+		if (table != RT_TABLE_MAIN && !filter.tb)
+			fprintf(fp, "table %s ", rtnl_rttable_n2a(table, b1, sizeof(b1)));
 		if (r->rtm_protocol != RTPROT_BOOT && filter.protocolmask != -1)
 			fprintf(fp, "proto %s ", rtnl_rtprot_n2a(r->rtm_protocol, b1, sizeof(b1)));
 		if (r->rtm_scope != RT_SCOPE_UNIVERSE && filter.scopemask != -1)
@@ -875,7 +877,12 @@ #endif
 			NEXT_ARG();
 			if (rtnl_rttable_a2n(&tid, *argv))
 				invarg("\"table\" value is invalid\n", *argv);
-
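For readers skimming the thread: the rtm_get_table() helper in the patch
above prefers the 32-bit RTA_TABLE attribute and falls back to the 8-bit
rtm_table field of the fixed header when the attribute is absent. Below is
a minimal stand-alone sketch of that fallback. The demo_* types and names
are made up for illustration and are not the real netlink structures. The
sketch returns an unsigned 32-bit value, whereas the helper in the patch is
declared as returning int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the netlink types; not the real headers. */
enum { DEMO_RTA_TABLE = 1, DEMO_RTA_MAX = 1 };

struct demo_rtmsg {
	uint8_t rtm_table;	/* legacy 8-bit table ID in the fixed header */
};

struct demo_attr {
	uint32_t value;		/* payload of a 32-bit attribute */
};

/* Prefer the 32-bit attribute when present, else fall back to the header. */
static uint32_t demo_get_table(const struct demo_rtmsg *r,
			       struct demo_attr *const tb[])
{
	assert(r != NULL && tb != NULL);
	if (tb[DEMO_RTA_TABLE])
		return tb[DEMO_RTA_TABLE]->value;
	return r->rtm_table;
}
```

With an empty attribute array the header value (for example 254) is
returned; once the attribute slot is filled, its value wins even if it does
not fit in 8 bits.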
[IPROUTE 02/03]: Use hash for routing table name cache
[IPROUTE]: Use hash for routing table name cache Use a hash for routing table name cache instead of the fixed size array. Signed-off-by: Patrick McHardy [EMAIL PROTECTED] --- commit ccd621fbb5faa91a98479e9492baee525c6f10c0 tree e4e1416406b5ed252b3b1a91efc3d8caadbf1bd0 parent 00d896184c5f8737269ac05264446c58133ec414 author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:27:59 +0200 committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:27:59 +0200 lib/rt_names.c | 96 +++- 1 files changed, 74 insertions(+), 22 deletions(-) diff --git a/lib/rt_names.c b/lib/rt_names.c index 05046c2..b77ad4a 100644 --- a/lib/rt_names.c +++ b/lib/rt_names.c @@ -23,6 +23,51 @@ #include linux/rtnetlink.h #include rt_names.h +struct rtnl_hash_entry { + struct rtnl_hash_entry *next; + char * name; + unsigned intid; +}; + +static void +rtnl_hash_initialize(char *file, struct rtnl_hash_entry **hash, int size) +{ + struct rtnl_hash_entry *entry; + char buf[512]; + FILE *fp; + + fp = fopen(file, r); + if (!fp) + return; + while (fgets(buf, sizeof(buf), fp)) { + char *p = buf; + int id; + char namebuf[512]; + + while (*p == ' ' || *p == '\t') + p++; + if (*p == '#' || *p == '\n' || *p == 0) + continue; + if (sscanf(p, 0x%x %s\n, id, namebuf) != 2 + sscanf(p, 0x%x %s #, id, namebuf) != 2 + sscanf(p, %d %s\n, id, namebuf) != 2 + sscanf(p, %d %s #, id, namebuf) != 2) { + fprintf(stderr, Database %s is corrupted at %s\n, + file, p); + return; + } + + if (id0) + continue; + entry = malloc(sizeof(*entry)); + entry-id = id; + entry-name = strdup(namebuf); + entry-next = hash[id (size - 1)]; + hash[id (size - 1)] = entry; + } + fclose(fp); +} + static void rtnl_tab_initialize(char *file, char **tab, int size) { char buf[512]; @@ -57,7 +102,6 @@ static void rtnl_tab_initialize(char *fi fclose(fp); } - static char * rtnl_rtprot_tab[256] = { [RTPROT_UNSPEC] = none, [RTPROT_REDIRECT] =redirect, @@ -266,9 +310,14 @@ int rtnl_rtrealm_a2n(__u32 *id, char *ar } +static struct rtnl_hash_entry 
dflt_table_entry = { .id = 253, .name = default }; +static struct rtnl_hash_entry main_table_entry = { .id = 254, .name = main }; +static struct rtnl_hash_entry local_table_entry = { .id = 255, .name = local }; -static char * rtnl_rttable_tab[256] = { - unspec, +static struct rtnl_hash_entry * rtnl_rttable_hash[256] = { + [253] = dflt_table_entry, + [254] = main_table_entry, + [255] = local_table_entry, }; static int rtnl_rttable_init; @@ -276,26 +325,26 @@ static int rtnl_rttable_init; static void rtnl_rttable_initialize(void) { rtnl_rttable_init = 1; - rtnl_rttable_tab[255] = local; - rtnl_rttable_tab[254] = main; - rtnl_rttable_tab[253] = default; - rtnl_tab_initialize(/etc/iproute2/rt_tables, - rtnl_rttable_tab, 256); + rtnl_hash_initialize(/etc/iproute2/rt_tables, +rtnl_rttable_hash, 256); } char * rtnl_rttable_n2a(int id, char *buf, int len) { - if (id0 || id=256) { - snprintf(buf, len, %d, id); + struct rtnl_hash_entry *entry; + + if (id = RT_TABLE_MAX) { + snprintf(buf, len, %u, id); return buf; } - if (!rtnl_rttable_tab[id]) { - if (!rtnl_rttable_init) - rtnl_rttable_initialize(); - } - if (rtnl_rttable_tab[id]) - return rtnl_rttable_tab[id]; - snprintf(buf, len, %d, id); + if (!rtnl_rttable_init) + rtnl_rttable_initialize(); + entry = rtnl_rttable_hash[id 255]; + while (entry entry-id != id) + entry = entry-next; + if (entry) + return entry-name; + snprintf(buf, len, %u, id); return buf; } @@ -303,6 +352,7 @@ int rtnl_rttable_a2n(__u32 *id, char *ar { static char *cache = NULL; static unsigned long res; + struct rtnl_hash_entry *entry; char *end; int i; @@ -315,17 +365,19 @@ int rtnl_rttable_a2n(__u32 *id, char *ar rtnl_rttable_initialize(); for (i=0; i256; i++) { - if (rtnl_rttable_tab[i] - strcmp(rtnl_rttable_tab[i], arg) == 0) { - cache = rtnl_rttable_tab[i]; - res = i; + entry = rtnl_rttable_hash[i]; + while (entry
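The idea of the name cache in this patch, a fixed number of buckets indexed
by "id & (size - 1)" with colliding IDs chained through a next pointer, can
be sketched outside iproute as follows. This is an illustration only: the
demo_* names are invented, the table is filled by hand rather than from
/etc/iproute2/rt_tables, and unlike the patch the sketch checks the
allocation results.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_HASH_SIZE 256	/* must be a power of two for the mask below */

struct demo_entry {
	struct demo_entry *next;
	char *name;
	uint32_t id;
};

/* Insert at the head of the bucket chain; returns 0 on success, -1 on OOM. */
static int demo_hash_add(struct demo_entry **hash, uint32_t id, const char *name)
{
	struct demo_entry *e;
	size_t n;

	assert(hash != NULL && name != NULL);
	n = strlen(name) + 1;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->name = malloc(n);
	if (!e->name) {
		free(e);
		return -1;
	}
	memcpy(e->name, name, n);
	e->id = id;
	e->next = hash[id & (DEMO_HASH_SIZE - 1)];
	hash[id & (DEMO_HASH_SIZE - 1)] = e;
	return 0;
}

/* Walk the chain of the bucket that id maps to; NULL if id is unknown. */
static const char *demo_hash_find(struct demo_entry *const *hash, uint32_t id)
{
	const struct demo_entry *e = hash[id & (DEMO_HASH_SIZE - 1)];

	while (e && e->id != id)
		e = e->next;
	return e ? e->name : NULL;
}
```

IDs 254 and 510 land in the same bucket (510 & 255 == 254), so looking up
either one walks a two-element chain. That is the case the fixed 256-entry
array could not represent.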
[IPROUTE 00/03]: Increase number of possible routing tables
These patches add support for a larger number of routing tables to iproute
and are needed for the patches doing the same for the kernel I just sent.
They apply on top of the "[IPROUTE]: Support IPv6 routing table filter"
patch.

Please apply, thanks.

 include/linux/rtnetlink.h |    4
 include/rt_names.h        |    2
 ip/ip_common.h            |    8
 ip/iproute.c              |   63
 ip/iprule.c               |   14
 lib/rt_names.c            |  100
 6 files changed, 133 insertions(+), 58 deletions(-)

Patrick McHardy:
      [IPROUTE]: Preparation for 32 bit table IDs
      [IPROUTE]: Use hash for routing table name cache
      [IPROUTE]: Add support for larger number of routing tables
Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()
On 05/08/06, David Miller [EMAIL PROTECTED] wrote:
> From: Jesper Juhl [EMAIL PROTECTED]
> Date: Sat, 5 Aug 2006 01:30:49 +0200
>
> > On 31/07/06, David Miller [EMAIL PROTECTED] wrote:
> > > From: Jesper Juhl [EMAIL PROTECTED]
> > > Date: Sun, 30 Jul 2006 23:51:20 +0200
> > >
> > > > Looks ok to me.
> > >
> > > I've applied James's version of the fix, thanks everyone.
> >
> > Hmm, if you are referring to commit
> > 118075b3cdc90e0815362365f3fc64d672ace0d6 -
> > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=118075b3cdc90e0815362365f3fc64d672ace0d6
> > then I think a mistake has crept in. That commit only initializes
> > 'cnt' to 0 - I don't see how that would fix the leak - looks like you
> > forgot the business end of the patch...
>
> See the commit right before that, the initialize of cnt to zero
> is just to fix a compiler warning that resulted from James's
> version of the fix.

Hmm, perhaps I'm going blind, but I don't see it.

The commit right before the one I linked to above is completely unrelated:

[ATALK]: Make CONFIG_DEV_APPLETALK a tristate.
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9cac2c35e26cc44978df654306bb92d7cfe7e2de

And if I download 2.6.18-rc4 the tcpprobe_read() function (still) looks
like this:

static ssize_t tcpprobe_read(struct file *file, char __user *buf,
			     size_t len, loff_t *ppos)
{
	int error = 0, cnt = 0;
	unsigned char *tbuf;

	if (!buf || len < 0)
		return -EINVAL;

	if (len == 0)
		return 0;

	tbuf = vmalloc(len);
	if (!tbuf)
		return -ENOMEM;

	error = wait_event_interruptible(tcpw.wait,
					 __kfifo_len(tcpw.fifo) != 0);
	if (error)
		return error;

	cnt = kfifo_get(tcpw.fifo, tbuf, len);
	error = copy_to_user(buf, tbuf, cnt);

	vfree(tbuf);

	return error ? error : cnt;
}

That function still contains the 'tbuf' leak: the early "return error"
after an interrupted wait skips the vfree().

I also couldn't find the fix in your git trees at
http://www.kernel.org/git/?p=linux/kernel/git/davem/net-2.6.19.git;a=summary
http://www.kernel.org/git/?p=linux/kernel/git/davem/net-2.6.git;a=summary

So either I'm going blind or a mistake has been made getting the fix into
mainline...

-- 
Jesper Juhl [EMAIL PROTECTED]
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....
[EMAIL PROTECTED] wrote:
> On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity
> quite badly. It basically totally refuses to answer an IPv6 Neighbor
> Solicit packet or IPv6 Echo Request packet. I run a 'tcpdump -n ipv6',
> and I see the requests come in, and no packets leaving.
>
> Interestingly enough, if I try to ping6 *out* of the box, it's totally
> willing to send a Neighbor Solicit outbound (although it appears to
> totally ignore the Neighbor Advert packet that comes back). Of course,
> things don't work very well at all with busticated Neighbor Solicit.
>
> A kernel built with IPV6_MULTIPLE_TABLES=n works just fine.

It should be fixed by this patch (already contained in net-2.6.19).

[IPV6]: Fix policy routing lookup

When the lookup in a table returns ip6_null_entry the policy routing lookup
returns it instead of continuing in the next table, which effectively means
it only searches the local table.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]

---
commit 2b885e76c2b2c74d2dfe86a8140f0b41149f327c
tree 767711f03ea3e990ce02b3720718b77490027793
parent 5bd721a145d02a89a9b69adf3ede9d0b3647ae8b
author Patrick McHardy [EMAIL PROTECTED] Sun, 06 Aug 2006 22:24:08 -0700
committer David S. Miller [EMAIL PROTECTED] Sun, 06 Aug 2006 22:24:08 -0700

 net/ipv6/fib6_rules.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index c3c8195..94a46ec 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -94,8 +94,10 @@ int fib6_rule_action(struct fib_rule *ru
 
 	if (rt != &ip6_null_entry)
 		goto out;
-
 	dst_release(&rt->u.dst);
+	rt = NULL;
+	goto out;
+
 discard_pkt:
 	dst_hold(&rt->u.dst);
 out:
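To make the failure mode concrete: with policy routing, a "no route"
result from one table has to mean "try the next table", not "stop". The
sketch below models that with a shared sentinel entry and a list of lookup
functions. All demo_* names are hypothetical and the structure is greatly
simplified compared to fib6_rule_action(); it only illustrates the control
flow the commit message describes.

```c
#include <assert.h>
#include <stddef.h>

struct demo_route { const char *name; };

/* Shared sentinel meaning "this table has no matching route". */
static struct demo_route demo_null_entry = { "null" };
static struct demo_route demo_main_route = { "main" };

typedef struct demo_route *(*demo_lookup_fn)(int dst);

/* The "local" table in this toy example never matches. */
static struct demo_route *demo_local_table(int dst)
{
	(void)dst;
	return &demo_null_entry;
}

/* The "main" table matches destination 1 only. */
static struct demo_route *demo_main_table(int dst)
{
	return dst == 1 ? &demo_main_route : &demo_null_entry;
}

/* Walk the tables in rule order; a sentinel result means "try the next". */
static struct demo_route *demo_policy_lookup(const demo_lookup_fn *tables,
					     size_t ntables, int dst)
{
	assert(tables != NULL);
	for (size_t i = 0; i < ntables; i++) {
		struct demo_route *rt = tables[i](dst);

		if (rt != &demo_null_entry)
			return rt;	/* real match: stop here */
	}
	return &demo_null_entry;	/* nothing matched in any table */
}
```

If the loop returned the sentinel from the first table instead of
continuing, only the local table would ever be consulted, which is the
symptom reported above.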
Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....
On Thu, 10 Aug 2006 22:02:03 +0200, Patrick McHardy said:
> [EMAIL PROTECTED] wrote:
> > On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> > Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity
>
> It should be fixed by this patch (already contained in net-2.6.19).

Confirmed fixed, thanks...
Re: [IPROUTE 00/03]: Increase number of possible routing tables
On Fri, 11 Aug 2006 00:14:47 +0200 (MEST) Patrick McHardy [EMAIL PROTECTED] wrote:

> These patches add support for a larger number of routing tables to iproute
> and are needed for the patches doing the same for the kernel I just sent.
> They apply on top of the "[IPROUTE]: Support IPv6 routing table filter"
> patch.
>
> Please apply, thanks.
>
>  include/linux/rtnetlink.h |    4
>  include/rt_names.h        |    2
>  ip/ip_common.h            |    8
>  ip/iproute.c              |   63
>  ip/iprule.c               |   14
>  lib/rt_names.c            |  100
>  6 files changed, 133 insertions(+), 58 deletions(-)
>
> Patrick McHardy:
>       [IPROUTE]: Preparation for 32 bit table IDs
>       [IPROUTE]: Use hash for routing table name cache
>       [IPROUTE]: Add support for larger number of routing tables

Applied thanks. Let me know when you're done, it has been too long since
a real release of iproute2. I'll roll one as soon as the flow subsides.
Re: [IPROUTE 00/03]: Increase number of possible routing tables
Stephen Hemminger wrote:
> Applied thanks. Let me know when you're done, it has been too long since
> a real release of iproute2. I'll roll one as soon as the flow subsides.

I only have one more patchset I would like to submit soon, the time
cleanups. But they are only meant to make auditing for integer overflows
easier, so we can one day switch to a higher clock resolution. iproute
seems to be mostly fine, but the kernel will probably take a bit longer,
so I wouldn't mind missing this release. I'll submit them in the next
days anyway, but feel free to release without them.
Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()
From: Jesper Juhl [EMAIL PROTECTED]
Date: Fri, 11 Aug 2006 00:18:44 +0200

> Hmm, perhaps I'm going blind, but I don't see it.

I definitely screwed that changeset up somehow. Thanks for catching this.

Somehow James's fix got clobbered into just my subsequent warning fix,
like this:

commit 118075b3cdc90e0815362365f3fc64d672ace0d6
Author: James Morris [EMAIL PROTECTED]
Date:   Sun Jul 30 20:21:45 2006 -0700

    [TCP]: fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()

    Based upon a patch by Jesper Juhl.

    Signed-off-by: James Morris [EMAIL PROTECTED]
    Acked-by: Stephen Hemminger [EMAIL PROTECTED]
    Acked-by: Jesper Juhl [EMAIL PROTECTED]
    Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index d7d517a..b343532 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -114,7 +114,7 @@ static int tcpprobe_open(struct inode *
 static ssize_t tcpprobe_read(struct file *file, char __user *buf,
 			     size_t len, loff_t *ppos)
 {
-	int error = 0, cnt;
+	int error = 0, cnt = 0;
 	unsigned char *tbuf;
 
 	if (!buf || len < 0)

Anyways, I'll fix this up, thanks again.
Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 16:52:16 -0700

> Dave, here is my version...
>
> Don't leak memory on interrupted read. And only allocate as
> much memory as needed.
>
> Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

I think I'm going to go with James's safe original fix for now, thanks.

commit a7fc5b24a4921a6582ce47c0faf3a31858a80468
Author: David S. Miller [EMAIL PROTECTED]
Date:   Thu Aug 10 16:53:33 2006 -0700

    [TCP]: Fix botched memory leak fix to tcpprobe_read().

    Somehow I clobbered James's original fix and only my
    subsequent compiler warning change went in for that
    changeset.

    Get the real fix in there.

    Noticed by Jesper Juhl.

    Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index b343532..dab37d2 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -130,11 +130,12 @@ static ssize_t tcpprobe_read(struct file
 	error = wait_event_interruptible(tcpw.wait,
 					 __kfifo_len(tcpw.fifo) != 0);
 	if (error)
-		return error;
+		goto out_free;
 
 	cnt = kfifo_get(tcpw.fifo, tbuf, len);
 	error = copy_to_user(buf, tbuf, cnt);
 
+out_free:
 	vfree(tbuf);
 
 	return error ? error : cnt;
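The shape of this fix, routing every exit taken after the allocation
through a single label that frees the buffer, can be tried out in user
space. The following is only a sketch under stated assumptions: malloc()
and free() stand in for vmalloc() and vfree(), the wait_result parameter
stands in for the return value of wait_event_interruptible(), and the
demo_* names and the live-buffer counter are invented so the behaviour can
be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int demo_live_buffers;	/* allocations not yet freed */

/*
 * Hypothetical read helper. wait_result is 0 when data became available
 * and a negative errno value when the wait was interrupted.
 */
static long demo_read(char *dst, size_t len, int wait_result)
{
	long ret;
	char *tbuf;

	assert(dst != NULL);
	if (len == 0)
		return 0;

	tbuf = malloc(len);
	if (!tbuf)
		return -ENOMEM;
	demo_live_buffers++;

	if (wait_result) {
		ret = wait_result;
		goto out_free;	/* the early exit still releases tbuf */
	}

	memset(tbuf, 'x', len);	/* pretend this came from the fifo */
	memcpy(dst, tbuf, len);
	ret = (long)len;

out_free:
	free(tbuf);
	demo_live_buffers--;
	return ret;
}
```

Calling demo_read() once with wait_result == 0 and once with -EINTR leaves
demo_live_buffers at zero in both cases. With a plain "return wait_result"
in place of the goto, the counter would stay at one after the interrupted
call.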
Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()
Dave, here is my version...

Don't leak memory on interrupted read. And only allocate as
much memory as needed.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- linux-2.6.orig/net/ipv4/tcp_probe.c	2006-08-10 16:32:36.0 -0700
+++ linux-2.6/net/ipv4/tcp_probe.c	2006-08-10 16:45:30.0 -0700
@@ -114,7 +114,7 @@
 static ssize_t tcpprobe_read(struct file *file, char __user *buf,
 			     size_t len, loff_t *ppos)
 {
-	int error = 0, cnt = 0;
+	int error, cnt;
 	unsigned char *tbuf;
 
 	if (!buf || len < 0)
@@ -123,15 +123,16 @@
 	if (len == 0)
 		return 0;
 
-	tbuf = vmalloc(len);
-	if (!tbuf)
-		return -ENOMEM;
-
 	error = wait_event_interruptible(tcpw.wait,
 					 __kfifo_len(tcpw.fifo) != 0);
 	if (error)
 		return error;
 
+	len = min(len, kfifo_len(tcpw.fifo));
+	tbuf = vmalloc(len);
+	if (!tbuf)
+		return -ENOMEM;
+
 	cnt = kfifo_get(tcpw.fifo, tbuf, len);
 	error = copy_to_user(buf, tbuf, cnt);
 
[PATCH 1/2] [BNX2]: Fix tx race condition
[BNX2]: Fix tx race condition.

Fix a subtle race condition between bnx2_start_xmit() and bnx2_tx_int()
similar to the one in tg3 discovered by Herbert Xu:

CPU0					CPU1
bnx2_start_xmit()
	if (tx_ring_full) {
		tx_lock
					bnx2_tx_int()
						if (!netif_queue_stopped)
		netif_stop_queue()
		if (!tx_ring_full)
						update_tx_ring
			netif_wake_queue()
		tx_unlock
	}

Even though tx_ring is updated before the if statement in bnx2_tx_int() in
program order, it can be re-ordered by the CPU as shown above. This
scenario can cause the tx queue to be stopped forever if bnx2_tx_int() has
just freed up the entire tx_ring. The possibility of this happening
should be very rare though.

The following changes are made, very much identical to the tg3 fix:

1. Add memory barrier to fix the above race condition.

2. Eliminate the private tx_lock altogether and rely solely on
   netif_tx_lock. This eliminates one spinlock in bnx2_start_xmit()
   when the ring is full.

3. Because of 2, use netif_tx_lock in bnx2_tx_int() before calling
   netif_wake_queue().

4. Add memory barrier to bnx2_tx_avail().

5. Add bp->tx_wake_thresh which is set to half the tx ring size.

6. Check for the full wake queue condition before getting
   netif_tx_lock in bnx2_tx_int(). This reduces the number of
   unnecessary spinlocks when the tx ring is full in a steady-state
   condition.
Signed-off-by: Michael Chan [EMAIL PROTECTED] diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c index db73de0..2099edb 100644 --- a/drivers/net/bnx2.c +++ b/drivers/net/bnx2.c @@ -209,8 +209,10 @@ MODULE_DEVICE_TABLE(pci, bnx2_pci_tbl); static inline u32 bnx2_tx_avail(struct bnx2 *bp) { - u32 diff = TX_RING_IDX(bp-tx_prod) - TX_RING_IDX(bp-tx_cons); + u32 diff; + smp_mb(); + diff = TX_RING_IDX(bp-tx_prod) - TX_RING_IDX(bp-tx_cons); if (diff MAX_TX_DESC_CNT) diff = (diff MAX_TX_DESC_CNT) - 1; return (bp-tx_ring_size - diff); @@ -1686,15 +1688,20 @@ bnx2_tx_int(struct bnx2 *bp) } bp-tx_cons = sw_cons; + /* Need to make the tx_cons update visible to bnx2_start_xmit() +* before checking for netif_queue_stopped(). Without the +* memory barrier, there is a small possibility that bnx2_start_xmit() +* will miss it and cause the queue to be stopped forever. +*/ + smp_mb(); - if (unlikely(netif_queue_stopped(bp-dev))) { - spin_lock(bp-tx_lock); + if (unlikely(netif_queue_stopped(bp-dev)) +(bnx2_tx_avail(bp) bp-tx_wake_thresh)) { + netif_tx_lock(bp-dev); if ((netif_queue_stopped(bp-dev)) - (bnx2_tx_avail(bp) MAX_SKB_FRAGS)) { - + (bnx2_tx_avail(bp) bp-tx_wake_thresh)) netif_wake_queue(bp-dev); - } - spin_unlock(bp-tx_lock); + netif_tx_unlock(bp-dev); } } @@ -3503,6 +3510,8 @@ bnx2_init_tx_ring(struct bnx2 *bp) struct tx_bd *txbd; u32 val; + bp-tx_wake_thresh = bp-tx_ring_size / 2; + txbd = bp-tx_desc_ring[MAX_TX_DESC_CNT]; txbd-tx_bd_haddr_hi = (u64) bp-tx_desc_mapping 32; @@ -4390,10 +4399,8 @@ bnx2_vlan_rx_kill_vid(struct net_device #endif /* Called with netif_tx_lock. - * hard_start_xmit is pseudo-lockless - a lock is only required when - * the tx queue is full. This way, we get the benefit of lockless - * operations most of the time without the complexities to handle - * netif_stop_queue/wake_queue race conditions. + * bnx2_tx_int() runs without netif_tx_lock unless it needs to call + * netif_wake_queue(). 
*/ static int bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev) @@ -4512,12 +4519,9 @@ bnx2_start_xmit(struct sk_buff *skb, str dev-trans_start = jiffies; if (unlikely(bnx2_tx_avail(bp) = MAX_SKB_FRAGS)) { - spin_lock(bp-tx_lock); netif_stop_queue(dev); - - if (bnx2_tx_avail(bp) MAX_SKB_FRAGS) + if (bnx2_tx_avail(bp) bp-tx_wake_thresh) netif_wake_queue(dev); - spin_unlock(bp-tx_lock); } return NETDEV_TX_OK; @@ -5628,7 +5632,6 @@ bnx2_init_board(struct pci_dev *pdev, st bp-pdev = pdev; spin_lock_init(bp-phy_lock); - spin_lock_init(bp-tx_lock); INIT_WORK(bp-reset_task, bnx2_reset_task, bp); dev-base_addr = dev-mem_start = pci_resource_start(pdev, 0); diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h index 658c5ee..fe80476 100644 --- a/drivers/net/bnx2.h +++ b/drivers/net/bnx2.h @@ -3890,10 +3890,6 @@ struct bnx2 { u32 tx_prod_bseq
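The stop/wake protocol described in the changelog can be modelled in user
space with C11 atomics. What follows is a rough sketch, not the driver
code: the demo_* names, the ring size of 8 and the MAX_SKB_FRAGS stand-in
of 2 are all made up, and the sketch leaves out netif_tx_lock, which the
patch takes in bnx2_tx_int() around the final re-check before waking the
queue. The explicit fences mark where the patch adds smp_mb(); the default
atomic operations used here are already sequentially consistent, so the
fences are illustrative rather than strictly required.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_RING_SIZE	8u
#define DEMO_MAX_FRAGS	2u	/* hypothetical stand-in for MAX_SKB_FRAGS */

struct demo_ring {
	atomic_uint prod;	/* advanced by the transmit path */
	atomic_uint cons;	/* advanced by the completion path */
	atomic_bool stopped;	/* stands in for the netdev queue state */
	unsigned int wake_thresh;
};

/* Free descriptors; the fence mirrors the smp_mb() added to bnx2_tx_avail(). */
static unsigned int demo_avail(struct demo_ring *r)
{
	atomic_thread_fence(memory_order_seq_cst);
	return DEMO_RING_SIZE -
	       (atomic_load(&r->prod) - atomic_load(&r->cons));
}

/* Transmit side: queue one descriptor, stop when nearly full, re-check. */
static void demo_xmit(struct demo_ring *r)
{
	atomic_fetch_add(&r->prod, 1);
	if (demo_avail(r) <= DEMO_MAX_FRAGS) {
		atomic_store(&r->stopped, true);
		if (demo_avail(r) > r->wake_thresh)
			atomic_store(&r->stopped, false);
	}
}

/* Completion side: publish the consumer index before testing "stopped". */
static void demo_complete(struct demo_ring *r, unsigned int n)
{
	assert(n <= atomic_load(&r->prod) - atomic_load(&r->cons));
	atomic_fetch_add(&r->cons, n);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load(&r->stopped) && demo_avail(r) > r->wake_thresh)
		atomic_store(&r->stopped, false);
}
```

With wake_thresh set to half the ring, six transmits stop the queue,
completing one descriptor leaves it stopped (3 free is not above the
threshold of 4), and completing two more wakes it. The threshold keeps the
queue from flapping between stopped and running on every completion.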
[PATCH 2/2] [BNX2]: Convert to netdev_alloc_skb()
[BNX2]: Convert to netdev_alloc_skb()

Convert dev_alloc_skb() to netdev_alloc_skb() and increase default
rx ring size to 255. The old ring size of 100 was too small.

Update version to 1.4.44.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 2099edb..652eb05 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -56,8 +56,8 @@
 
 #define DRV_MODULE_NAME		"bnx2"
 #define PFX DRV_MODULE_NAME	": "
-#define DRV_MODULE_VERSION	"1.4.43"
-#define DRV_MODULE_RELDATE	"June 28, 2006"
+#define DRV_MODULE_VERSION	"1.4.44"
+#define DRV_MODULE_RELDATE	"August 10, 2006"
 
 #define RUN_AT(x) (jiffies + (x))
 
@@ -1571,7 +1571,7 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 i
 	struct rx_bd *rxbd = &bp->rx_desc_ring[RX_RING(index)][RX_IDX(index)];
 	unsigned long align;
 
-	skb = dev_alloc_skb(bp->rx_buf_size);
+	skb = netdev_alloc_skb(bp->dev, bp->rx_buf_size);
 	if (skb == NULL) {
 		return -ENOMEM;
 	}
@@ -1580,7 +1580,6 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 i
 		skb_reserve(skb, 8 - align);
 	}
 
-	skb->dev = bp->dev;
 	mapping = pci_map_single(bp->pdev, skb->data, bp->rx_buf_use_size,
 		PCI_DMA_FROMDEVICE);
 
@@ -1793,7 +1792,7 @@ bnx2_rx_int(struct bnx2 *bp, int budget)
 		if ((bp->dev->mtu > 1500) && (len <= RX_COPY_THRESH)) {
 			struct sk_buff *new_skb;
 
-			new_skb = dev_alloc_skb(len + 2);
+			new_skb = netdev_alloc_skb(bp->dev, len + 2);
 			if (new_skb == NULL)
 				goto reuse_rx;
 
@@ -1804,7 +1803,6 @@ bnx2_rx_int(struct bnx2 *bp, int budget)
 
 			skb_reserve(new_skb, 2);
 			skb_put(new_skb, len);
-			new_skb->dev = bp->dev;
 
 			bnx2_reuse_rx_skb(bp, skb,
 				sw_ring_cons, sw_ring_prod);
@@ -3961,7 +3959,7 @@ bnx2_run_loopback(struct bnx2 *bp, int l
 		return -EINVAL;
 
 	pkt_size = 1514;
-	skb = dev_alloc_skb(pkt_size);
+	skb = netdev_alloc_skb(bp->dev, pkt_size);
 	if (!skb)
 		return -ENOMEM;
 	packet = skb_put(skb, pkt_size);
@@ -5754,7 +5752,7 @@ bnx2_init_board(struct pci_dev *pdev, st
 	bp->mac_addr[5] = (u8) reg;
 
 	bp->tx_ring_size = MAX_TX_DESC_CNT;
-	bnx2_set_rx_ring_size(bp, 100);
+	bnx2_set_rx_ring_size(bp, 255);
 
 	bp->rx_csum = 1;
 
[IPROUTE]: Fix struct alignment with cris architecture
[IPROUTE]: Fix struct alignment with cris architecture

gcc for the cris arch does not pad structures to the next multiple of 4
bytes, as the i386 gcc does. This causes errors like this when displaying
xfrm policies:

# ip x p
!!!Deficit 3, rta_len=300
src 192.168.251.32/29 dst 192.168.251.32/29
	dir in priority 0
!!!Deficit 3, rta_len=180
src 0.0.0.0/0 dst 192.168.251.32/29
	dir in priority 2208

Similar errors are seen from ip x s.

This patch fixes the errors when printing. I'm not sure whether we should
worry about other uses of the affected structs, I've not seen any other
bad effects from this though, so hopefully this is enough.

(Thanks to Herbert Xu for pointing out that NLMSG_SPACE is the correct
macro to use here.)

Tested against 2.6.17.6 kernel on i386, and 2.6.16.1 kernel on cris.

Signed-off-by: Andy Gay [EMAIL PROTECTED]

---

diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c
index 433b513..340e7df 100644
--- a/ip/xfrm_policy.c
+++ b/ip/xfrm_policy.c
@@ -354,15 +354,15 @@ int xfrm_policy_print(const struct socka
 
 	if (n->nlmsg_type == XFRM_MSG_DELPOLICY) {
 		xpid = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xpid));
+		len -= NLMSG_SPACE(sizeof(*xpid));
 	} else if (n->nlmsg_type == XFRM_MSG_POLEXPIRE) {
 		xpexp = NLMSG_DATA(n);
 		xpinfo = &xpexp->pol;
-		len -= NLMSG_LENGTH(sizeof(*xpexp));
+		len -= NLMSG_SPACE(sizeof(*xpexp));
 	} else {
 		xpexp = NULL;
 		xpinfo = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xpinfo));
+		len -= NLMSG_SPACE(sizeof(*xpinfo));
 	}
 
 	if (len < 0) {
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 3eefaff..1d61685 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -575,15 +575,15 @@ int xfrm_state_print(const struct sockad
 	if (n->nlmsg_type == XFRM_MSG_DELSA) {
 		/* Dont blame me for this .. Herbert made me do it */
 		xsid = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xsid));
+		len -= NLMSG_SPACE(sizeof(*xsid));
 	} else if (n->nlmsg_type == XFRM_MSG_EXPIRE) {
 		xexp = NLMSG_DATA(n);
 		xsinfo = &xexp->state;
-		len -= NLMSG_LENGTH(sizeof(*xexp));
+		len -= NLMSG_SPACE(sizeof(*xexp));
 	} else {
 		xexp = NULL;
 		xsinfo = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xsinfo));
+		len -= NLMSG_SPACE(sizeof(*xsinfo));
 	}
 
 	if (len < 0) {
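The difference between the two macros is only the final rounding to a
4-byte boundary, which is why it goes unnoticed whenever sizeof() of the
payload struct is already a multiple of 4. The arithmetic can be checked
with the small sketch below. The DEMO_* macros mirror the usual netlink
definitions but are local copies made for this example, the 16-byte header
length is written out as a constant, and the payload size of 165 bytes is
an arbitrary illustrative value, not the real size of any xfrm struct on
cris.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ALIGNTO		4u
#define DEMO_ALIGN(len)		(((len) + DEMO_ALIGNTO - 1) & ~(DEMO_ALIGNTO - 1))
#define DEMO_HDRLEN		16u	/* assumed aligned netlink header size */
#define DEMO_LENGTH(len)	((len) + DEMO_HDRLEN)		/* no padding */
#define DEMO_SPACE(len)		DEMO_ALIGN(DEMO_LENGTH(len))	/* padded */

/* Attribute bytes left after the fixed payload, computed as in the fix. */
static int demo_attr_len(unsigned int msg_len, unsigned int payload)
{
	assert(msg_len >= DEMO_SPACE(payload));
	return (int)msg_len - (int)DEMO_SPACE(payload);
}

/* The same computation with the unpadded length, as before the fix. */
static int demo_attr_len_unpadded(unsigned int msg_len, unsigned int payload)
{
	assert(msg_len >= DEMO_LENGTH(payload));
	return (int)msg_len - (int)DEMO_LENGTH(payload);
}
```

For a 165-byte payload followed by 300 bytes of attributes, the padded
computation yields 300 while the unpadded one yields 303, three bytes too
many. For a 164-byte payload both agree, which matches the observation
that the problem does not show up where the compiler pads the struct.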
Re: [PATCH] llc: multicast receive device match
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 3 Aug 2006 10:05:58 -0700

> Fix from [EMAIL PROTECTED], STP packets are incorrectly received on all
> LLC datagram sockets, whichever interface they are bound to. The llc_sap
> datagram receive logic sends packets with a unicast destination MAC to
> one socket bound to that SAP and MAC, and multicast packets to all
> sockets bound to that SAP. STP packets are multicast, and we do need to
> know on which interface they were received.
>
> Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

Looks correct, I will apply this. Thanks a lot.
Re: [PATCH 4/5] net: socket family using RCU
On Thu, Aug 10, 2006 at 01:28:27PM -0700, Stephen Hemminger wrote:
> > On Wed, Aug 09, 2006 at 11:31:42AM -0700, Stephen Hemminger wrote:
> > > Replace the gross custom locking done in socket code for
> > > net_family[] with simple RCU usage. Some reordering necessary to
> > > avoid sleep issues with sock_alloc.
> >
> > Definitely a good use of RCU from a read-intensive standpoint -- does
> > anyone other than Linux-kernel networking developers change the
> > elements of the net_family[] array except at boot and shutdown? ;-)
> >
> > Some comments included below. Looks good, but one question about
> > things like atalk_create() being able to sleep and a place or two
> > where a comment would be good.
> >
> > ...
> >
> > > +        /*
> > > +         * Allocate the socket and allow the family to set things up. if
> > > +         * the protocol is 0, the family is instructed to select an appropriate
> > > +         * default.
> > > +         */
> > > +        sock = sock_alloc();
> > > +        if (!sock) {
> > > +                printk(KERN_WARNING "socket: no more sockets\n");
> > > +                return -ENFILE; /* Not exactly a match, but its the
> > > +                                   closest posix thing */
> > > +        }
> > > +
> > > +        sock->type = type;
> > > +
> > >  #if defined(CONFIG_KMOD)
> > >          /* Attempt to load a protocol module if the find failed.
> > >           *
> > > @@ -1166,70 +1138,59 @@
> > >           * requested real, full-featured networking support upon configuration.
> > >           * Otherwise module support will break!
> > >           */
> > > -        if (net_families[family] == NULL) {
> > > +        if (net_families[family] == NULL)
> > >                  request_module("net-pf-%d", family);
> >
> > OK, I'll bite... What happens if the module is not present?
> > Or is this what the "Otherwise module support will break" comment
> > is getting at?
>
> request_module loads the module (and blocks). One would expect that
> the module loaded would set net_families[] via sock_register, so later
> reference would succeed. Comment is historical since intention was to
> make base socket code itself modular which never was done, and is
> probably a bad idea to even consider.
>
> If module is not present, then net_families[] will still be NULL.

Got it!

> > Also, this reference to net_families[family] is done without
> > rcu_dereference() and without any clear update-side lock.
> > This just happens to be OK, since we are only testing for NULL, but
> > should at least have a comment.
> >
> > > -        }
> > >  #endif
> > >
> > > -        net_family_read_lock();
> > > -        if (net_families[family] == NULL) {
> > > -                err = -EAFNOSUPPORT;
> > > -                goto out;
> > > -        }
> > > -
> > > -/*
> > > - *      Allocate the socket and allow the family to set things up. if
> > > - *      the protocol is 0, the family is instructed to select an appropriate
> > > - *      default.
> > > - */
> > > -
> > > -        if (!(sock = sock_alloc())) {
> > > -                printk(KERN_WARNING "socket: no more sockets\n");
> > > -                err = -ENFILE;          /* Not exactly a match, but its the
> > > -                                           closest posix thing */
> > > -                goto out;
> > > -        }
> > > -
> > > -        sock->type = type;
> > > +        rcu_read_lock();
> > > +        pf = rcu_dereference(net_families[family]);
> >
> > OK, so the elements of the net_families array are protected by RCU.
> > All references should either be under rcu_read_lock() and accessed
> > via rcu_dereference() or under the update-side lock, whatever that
> > might be.
>
> Yes, the net_family_lock

Good.

> > > -/*
> > > +/**
> > > + *      sock_unregister - remove a protocol handler
> > > + *      @family: protocol family to remove
> > > + *
> > >   *      This function is called by a protocol handler that wants to
> > >   *      remove its address family, and have it unlinked from the
> > > - *      SOCKET module.
> > > + *      new socket creation.
> > > + *
> > > + *      If protocol handler is a module, then it can use module reference
> > > + *      counts to protect against new references. If protocol handler is not
> > > + *      a module then it needs to provide its own protection in
> > > + *      the ops->create routine.
> > >   */
> > > -
> > >  int sock_unregister(int family)
> > >  {
> > >          if (family < 0 || family >= NPROTO)
> > > -                return -1;
> > > +                return -EINVAL;
> > >
> > > -        net_family_write_lock();
> > > +        spin_lock(&net_family_lock);
> > >          net_families[family] = NULL;
> >
> > And this one is covered by net_families_lock, so we are set, since
> > this is the last one.
> >
> > > -        net_family_write_unlock();
> > > +        spin_unlock(&net_family_lock);
> > > +
> > > +        synchronize_rcu();
> >
> > OK, and the caller is presumably going to free up whatever needs to
> > be freed. Or, if nothing need be freed, beyond this point, we know
> > that all non-sleeping code paths through any of the
> > net_protocol_family functions have completed.
> >
> > (So, are all of the functions non-sleeping, or do we care? The
> > definition of net_protocol_family in include/linux/net.h doesn't say
> > that they need to be non-sleeping...)
> >
> > atalk_create() can potentially sleep in the following line of code:
> >
> >         sk = sk_alloc(PF_APPLETALK, GFP_KERNEL, &ddp_proto, 1);
>
> The module reference counts are used to prevent that. Since appletalk
> module can't be unloaded until there are no more appletalk
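The scheme discussed in this message (look the family up inside a read-side section, pin it with a reference count, then call the possibly sleeping create routine outside the read-side section) can be sketched in userspace as follows. This is only a toy under stated assumptions: every name ending in _sketch, the toy_* helpers and demo_family are hypothetical, C11 atomics stand in for rcu_read_lock()/rcu_dereference()/synchronize_rcu(), and the "grace period" simply spins until no reader is active, which real RCU does not do.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define NPROTO_SKETCH 32

/* Hypothetical stand-in for struct net_proto_family. */
struct family_sketch {
    int family;
    atomic_int refcnt;            /* plays the role of the module refcount */
    int (*create)(int protocol);  /* may block; called outside the read section */
};

static _Atomic(struct family_sketch *) families[NPROTO_SKETCH];
static atomic_flag family_lock = ATOMIC_FLAG_INIT;  /* update-side lock */
static atomic_int readers;                          /* toy read-side tracking */

static void toy_read_lock(void)   { atomic_fetch_add(&readers, 1); }
static void toy_read_unlock(void) { atomic_fetch_sub(&readers, 1); }

/* Toy grace period: spin until no reader is inside a read section. */
static void toy_synchronize(void)
{
    while (atomic_load(&readers) > 0)
        ;
}

static int register_sketch(struct family_sketch *pf)
{
    int err = 0;

    if (pf->family < 0 || pf->family >= NPROTO_SKETCH)
        return -EINVAL;
    while (atomic_flag_test_and_set(&family_lock))
        ;
    if (atomic_load(&families[pf->family]))
        err = -EEXIST;
    else
        atomic_store(&families[pf->family], pf);    /* publish the entry */
    atomic_flag_clear(&family_lock);
    return err;
}

static int unregister_sketch(int family)
{
    if (family < 0 || family >= NPROTO_SKETCH)
        return -EINVAL;
    while (atomic_flag_test_and_set(&family_lock))
        ;
    atomic_store(&families[family], (struct family_sketch *)NULL);
    atomic_flag_clear(&family_lock);
    toy_synchronize();  /* after this, no lookup can still see the old entry */
    return 0;
}

static int create_sketch(int family, int protocol)
{
    struct family_sketch *pf;
    int err;

    if (family < 0 || family >= NPROTO_SKETCH)
        return -EAFNOSUPPORT;

    toy_read_lock();
    pf = atomic_load_explicit(&families[family], memory_order_acquire);
    if (pf)
        atomic_fetch_add(&pf->refcnt, 1);  /* pin before leaving the read section */
    toy_read_unlock();

    if (!pf)
        return -EAFNOSUPPORT;
    err = pf->create(protocol);            /* free to block here */
    atomic_fetch_sub(&pf->refcnt, 1);
    return err;
}

/* Demo family used only to exercise the sketch. */
static int demo_create(int protocol) { return protocol >= 0 ? 0 : -EINVAL; }
static struct family_sketch demo_family = { .family = 5, .create = demo_create };
```

In this sketch the owner of demo_family would wait for refcnt to drop to zero after unregister_sketch() returns before tearing anything down, which is the role the module reference count plays in the reply above.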
Re: [PATCH 3/6] ehea: queue management
Please add comments to make the code more readable, especially at the
start of functions/structures to describe what they do. A large readme
at the start of ehea_main.c which gave an overview of the driver design
would be really useful.

Comments inline below...

> +static void *ipz_qpageit_get_inc(struct ipz_queue *queue)
> +{
> +        void *retvalue = ipz_qeit_get(queue);
> +        queue->current_q_offset += queue->pagesize;
> +        if (queue->current_q_offset > queue->queue_length) {
> +                queue->current_q_offset -= queue->pagesize;
> +                retvalue = NULL;
> +        }
> +        else if ((((u64) retvalue) & (EHEA_PAGESIZE-1)) != 0) {
> +                EDEB(4, "ERROR!! not at PAGE-Boundary");
> +                return NULL;
> +        }
> +        EDEB(7, "queue=%p retvalue=%p", queue, retvalue);

Don't redefine these debugging macros, but if so, what is EDEB?

> +        return retvalue;
> +}
> +
> +static int ipz_queue_ctor(struct ipz_queue *queue,
> +                          const u32 nr_of_pages,
> +                          const u32 pagesize, const u32 qe_size,
> +                          const u32 nr_of_sg)
> +{
> +        int f;
> +        EDEB_EN(7, "nr_of_pages=%x pagesize=%x qe_size=%x",
> +                nr_of_pages, pagesize, qe_size);
> +        queue->queue_length = nr_of_pages * pagesize;
> +        queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *));
> +        if (!queue->queue_pages) {
> +                EDEB(4, "ERROR!! didn't get the memory");
> +                return 0;
> +        }
> +        memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *));
> +
> +        for (f = 0; f < nr_of_pages; f++) {
> +                (queue->queue_pages)[f] =
> +                    (struct ipz_page *)get_zeroed_page(GFP_KERNEL);
> +                if (!(queue->queue_pages)[f]) {
> +                        break;
> +                }
> +        }
> +        if (f < nr_of_pages) {
> +                int g;
> +                EDEB_ERR(4, "couldn't get 0ed pages queue=%p f=%x "
> +                         "nr_of_pages=%x", queue, f, nr_of_pages);
> +                for (g = 0; g < f; g++) {
> +                        free_page((unsigned long)(queue->queue_pages)[g]);
> +                }
> +                return 0;

If you return here when calling from ehea_create_eq, I think you are
leaking the queue->queue_pages allocation (the pages they point to are
freed correctly).

Either way these allocations/deallocations seem too complicated. Can you
write the dtor so it can be called to free the structure in any state?

> +        }
> +        queue->current_q_offset = 0;
> +        queue->qe_size = qe_size;
> +        queue->act_nr_of_sg = nr_of_sg;
> +        queue->pagesize = pagesize;
> +        queue->toggle_state = 1;
> +        EDEB_EX(7, "queue_length=%x queue_pages=%p qe_size=%x "
> +                "act_nr_of_sg=%x", queue->queue_length, queue->queue_pages,
> +                queue->qe_size, queue->act_nr_of_sg);
> +        return 1;
> +}
> +
> +static int ipz_queue_dtor(struct ipz_queue *queue)
> +{
> +        int g;
> +        EDEB_EN(7, "ipz_queue pointer=%p", queue);
> +        if (!queue) {
> +                return 0;
> +        }
> +        if (!queue->queue_pages) {
> +                return 0;
> +        }

        if (!queue || !queue->queue_pages)
                return 0;

> +        EDEB(7, "destructing a queue with the following properties:\n"
> +             "queue_length=%x act_nr_of_sg=%x pagesize=%x qe_size=%x",
> +             queue->queue_length, queue->act_nr_of_sg, queue->pagesize,
> +             queue->qe_size);
> +        for (g = 0; g < (queue->queue_length / queue->pagesize); g++) {
> +                free_page((unsigned long)(queue->queue_pages)[g]);
> +        }
> +        vfree(queue->queue_pages);
> +
> +        EDEB_EX(7, "queue freed!");
> +        return 1;
> +}
> +
> +struct ehea_cq *ehea_cq_new(void)
> +{
> +        struct ehea_cq *cq = vmalloc(sizeof(*cq));
> +        if (cq)
> +                memset(cq, 0, sizeof(*cq));
> +        return cq;
> +}
> +
> +void ehea_cq_delete(struct ehea_cq *cq)
> +{
> +        vfree(cq);
> +}

This is used in only two places. Do we need it? If we do... can we
static inline?

> +struct ehea_cq *ehea_create_cq(struct ehea_adapter *adapter,
> +                               int nr_of_cqe, u64 eq_handle, u32 cq_token)
> +{
> +        struct ehea_cq *cq = NULL;
> +        struct h_galpa gal;
> +
> +        u64 *cq_handle_ref;
> +        u32 act_nr_of_entries;
> +        u32 act_pages;
> +        u64 hret = H_HARDWARE;
> +        int ipz_rc;
> +        u32 counter;
> +        void *vpage = NULL;
> +        u64 rpage = 0;
> +
> +        EDEB_EN(7, "adapter=%p nr_of_cqe=%x , eq_handle: %016lX",
> +                adapter, nr_of_cqe, eq_handle);
> +
> +        cq = ehea_cq_new();
> +        if (!cq) {
> +                cq = NULL;
> +                EDEB_ERR(4, "ehea_create_cq ret=%p (-ENOMEM)", cq);
> +                goto create_cq_exit0;
> +        }
> +
> +        cq->attr.max_nr_of_cqes = nr_of_cqe;
> +        cq->attr.cq_token = cq_token;
> +        cq->attr.eq_handle = eq_handle;
> +
> +        cq->adapter = adapter;
> +
> +        cq_handle_ref = &cq->ipz_cq_handle;
> +        act_nr_of_entries = 0;
> +        act_pages = 0;
> +
> +        hret = ehea_h_alloc_resource_cq(adapter->handle,
> +                                        cq,
> +                                        &cq->attr,
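The request above for a destructor that can be called in any state could look roughly like the following userspace sketch. It is not the ehea code: queue_sketch, queue_ctor_sketch and queue_dtor_sketch are hypothetical names, calloc()/free() stand in for vmalloc()/get_zeroed_page() and their counterparts, the fail_at parameter exists only to simulate an allocation failure, and the return convention (0 or -ENOMEM) differs from the 1/0 convention in the quoted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct queue_sketch {
    void **pages;        /* array of page pointers, NULL if never allocated */
    unsigned nr_pages;   /* number of slots in pages[] */
    size_t pagesize;
};

/* Safe on a zeroed, partially built, fully built or already destroyed queue. */
static void queue_dtor_sketch(struct queue_sketch *q)
{
    unsigned i;

    if (!q || !q->pages)
        return;
    for (i = 0; i < q->nr_pages; i++)
        free(q->pages[i]);          /* free(NULL) is a no-op */
    free(q->pages);
    memset(q, 0, sizeof(*q));       /* makes a second call harmless */
}

/* fail_at simulates an allocation failure at that page index (-1: never). */
static int queue_ctor_sketch(struct queue_sketch *q, unsigned nr_pages,
                             size_t pagesize, int fail_at)
{
    unsigned i;

    memset(q, 0, sizeof(*q));
    q->pages = calloc(nr_pages, sizeof(*q->pages));
    if (!q->pages)
        return -ENOMEM;
    q->nr_pages = nr_pages;
    q->pagesize = pagesize;

    for (i = 0; i < nr_pages; i++) {
        q->pages[i] = ((int)i == fail_at) ? NULL : calloc(1, pagesize);
        if (!q->pages[i]) {
            queue_dtor_sketch(q);   /* single cleanup path, nothing leaks */
            return -ENOMEM;
        }
    }
    return 0;
}
```

Because the page-pointer array is zeroed before any page is allocated, the destructor can walk every slot regardless of how far construction got, so the error path in the constructor needs no cleanup loop of its own.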
Re: [PATCH 3/6] ehea: queue management
> > +static inline u32 map_swqe_size(u8 swqe_enc_size)
> > +{
> > +        return 128 << swqe_enc_size;
> > +}
> > +
> > +static inline u32 map_rwqe_size(u8 rwqe_enc_size)
> > +{
> > +        return 128 << rwqe_enc_size;
> > +}
>
> Snap! These are identical...

No, they aren't: map_swqe_size vs. map_rwqe_size.
                     ^                 ^

Name the function after what it does, not after the functions you
expect to call it.
Re: [PATCH 3/6] ehea: queue management
> > > +static inline u32 map_swqe_size(u8 swqe_enc_size)
> > > +{
> > > +        return 128 << swqe_enc_size;
> > > +}
> > > +
> > > +static inline u32 map_rwqe_size(u8 rwqe_enc_size)
> > > +{
> > > +        return 128 << rwqe_enc_size;
> > > +}
> >
> > Snap! These are identical...
>
> No, they aren't: map_swqe_size vs. map_rwqe_size.
>                      ^                 ^

Functionally identical.

Mikey
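Taking the advice earlier in this thread to name the function after what it does, the two helpers could collapse into one. A minimal sketch, assuming the hypothetical name map_wqe_size and using stdint types in place of the kernel's u8/u32 so it builds in userspace:

```c
#include <assert.h>
#include <stdint.h>

/* Decode an encoded WQE size: 128 bytes shifted left by the encoding. */
static inline uint32_t map_wqe_size(uint8_t wqe_enc_size)
{
    return (uint32_t)128 << wqe_enc_size;
}
```

Both the send and the receive path would then call the same helper.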
Re: [RFC][PATCH] VM deadlock prevention core -v3
On Thu, August 10, 2006 18:50, Peter Zijlstra said:
> You are right if the reserve wasn't device bound - which I will abandon
> because you are right that with multi-path routing, bridge device and
> other advanced goodies this scheme is broken in that there is no
> unambiguous mapping from sockets to devices.
>
> The natural thing seems to make reserves socket bound, but that has
> overhead too and the simplicity of a global reserve is very tempting.

What about adding a flag to sk_set_memalloc() which says if memalloc is
on or off on the socket? (Or add sk_unset_memalloc). That way it's
possible to switch it off again, which doesn't seem like that a bad
idea, because then it can be turned on only when the socket can be used
to reduce total memory usage. Also if it is turned off again when no
more memory can be freed by using this socket, it will solve the
starvation problem as a starved socket now has a new chance to do its
thing.

Greetings,

Indan
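The on/off flag suggested above might be shaped like this. It is a sketch only: sock_sketch, sk_set_memalloc_sketch and the memalloc_socks counter are hypothetical and say nothing about how the real sk_set_memalloc() in the patch set is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct sock. */
struct sock_sketch {
    bool memalloc;
};

/* How many sockets are currently allowed to dip into the reserve. */
static int memalloc_socks;

/* Turn memalloc on or off for one socket; repeating a call changes nothing. */
static void sk_set_memalloc_sketch(struct sock_sketch *sk, bool on)
{
    if (sk->memalloc == on)
        return;
    sk->memalloc = on;
    memalloc_socks += on ? 1 : -1;
}
```

Keeping the call idempotent means a caller can switch the flag off whenever the socket can no longer help free memory without tracking its previous state.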
Re: [take6 1/3] kevent: Core files.
On Thu, 10 Aug 2006 12:22:35 +0400 Evgeniy Polyakov [EMAIL PROTECTED] wrote:
> On Thu, Aug 10, 2006 at 01:02:54AM -0700, Andrew Morton ([EMAIL PROTECTED]) wrote:
> > > > Afaict this mmap function gives a user a free way of getting
> > > > pinned memory. What is the upper bound on the amount of memory
> > > > which a user can thus obtain?
> > >
> > > it is limited by maximum queue length which is 4k entries right
> > > now, so maximum number of pages here is 4k*40/page_size, i.e.
> > > about 40 pages on x86.
> >
> > Is that per user or per fd? If the latter that is, with the usual
> > RLIMIT_NOFILE, 160MBytes. 2GB with 64k pagesize. Problem ;)
>
> Per kevent fd.
> I have some ideas about better mmap ring implementation, which would
> dynamically grow its buffer when events are added and reuse the same
> place for next events, but there are some nitpicks unresolved yet.
> Let's not see there in next releases (no merge of course), until
> better solution is ready. I will change that area when other things
> are ready.

This is not a problem with the mmap interface per-se.

If the proposed event code permits each user to pin 160MB of kernel
memory then that would be a serious problem.
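The 160MB figure can be checked with a few lines of arithmetic. The sketch below uses the numbers quoted in this thread (4k entries of 40 bytes each); the function and macro names are hypothetical, and 1024 as the "usual" RLIMIT_NOFILE is an assumption, not something the thread states.

```c
#include <assert.h>
#include <stdint.h>

/* Figures from the thread: 4k queue entries, 40 bytes per entry. */
#define KEVENT_MAX_ENTRIES 4096u
#define KEVENT_ENTRY_BYTES 40u

/* Pages pinned by one kevent fd, rounded up to whole pages. */
static uint64_t pinned_pages_per_fd(uint64_t page_size)
{
    uint64_t bytes = (uint64_t)KEVENT_MAX_ENTRIES * KEVENT_ENTRY_BYTES;

    return (bytes + page_size - 1) / page_size;
}

/* Bytes pinned by one user who opens nofile kevent fds. */
static uint64_t pinned_bytes_per_user(uint64_t page_size, uint64_t nofile)
{
    return pinned_pages_per_fd(page_size) * page_size * nofile;
}
```

With 4096-byte pages this gives 40 pages (160 KiB) per fd and 160 MiB across 1024 fds. The 2GB figure for a 64k page size presumably assumes the ring stays at 40 pages, since 40 pages of 64 KiB across 1024 fds comes to about 2.5 GiB; the helper above instead recomputes the page count from the byte size.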
remove unnecessary config.h includes from drivers/net/
On Wed, Aug 09, 2006 at 09:04:38PM -0700, David Miller wrote:
 > From: Dave Jones [EMAIL PROTECTED]
 > Date: Wed, 9 Aug 2006 22:21:16 -0400
 >
 > > config.h is automatically included by kbuild these days.
 > >
 > > Signed-off-by: Dave Jones [EMAIL PROTECTED]
 >
 > Applied to net-2.6.19, thanks Dave.

Here's a similar patch that does the same removals for drivers/net/

Signed-off-by: Dave Jones [EMAIL PROTECTED]

--- linux-2.6.17.noarch/drivers/net/irda/mcs7780.c~ 2006-08-10 21:35:23.0 -0400
+++ linux-2.6.17.noarch/drivers/net/irda/mcs7780.c 2006-08-10 21:35:25.0 -0400
@@ -45,7 +45,6 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/errno.h>
--- linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c~ 2006-08-10 21:35:28.0 -0400
+++ linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c 2006-08-10 21:35:30.0 -0400
@@ -40,7 +40,6 @@
 /
 #include <linux/module.h>
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/skbuff.h>
--- linux-2.6.17.noarch/drivers/net/smc911x.c~ 2006-08-10 21:35:34.0 -0400
+++ linux-2.6.17.noarch/drivers/net/smc911x.c 2006-08-10 21:35:37.0 -0400
@@ -55,8 +55,6 @@ static const char version[] =
 )
 #endif
-
-#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
--- linux-2.6.17.noarch/drivers/net/netx-eth.c~ 2006-08-10 21:35:41.0 -0400
+++ linux-2.6.17.noarch/drivers/net/netx-eth.c 2006-08-10 21:35:42.0 -0400
@@ -17,7 +17,6 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  */
-#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
--- linux-2.6.17.noarch/drivers/net/wan/cycx_main.c~ 2006-08-10 21:35:45.0 -0400
+++ linux-2.6.17.noarch/drivers/net/wan/cycx_main.c 2006-08-10 21:35:48.0 -0400
@@ -40,7 +40,6 @@
  * 1998/08/08 acme  Initial version.
  */
-#include <linux/config.h> /* OS configuration options */
 #include <linux/stddef.h> /* offsetof(), etc. */
 #include <linux/errno.h> /* return codes */
 #include <linux/string.h> /* inline memset(), etc. */
--- linux-2.6.17.noarch/drivers/net/wan/sdla.c~ 2006-08-10 21:35:51.0 -0400
+++ linux-2.6.17.noarch/drivers/net/wan/sdla.c 2006-08-10 21:35:53.0 -0400
@@ -32,7 +32,6 @@
  * 2 of the License, or (at your option) any later version.
  */
-#include <linux/config.h> /* for CONFIG_DLCI_MAX */
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
--- linux-2.6.17.noarch/drivers/net/wan/dlci.c~ 2006-08-10 21:35:57.0 -0400
+++ linux-2.6.17.noarch/drivers/net/wan/dlci.c 2006-08-10 21:35:59.0 -0400
@@ -28,7 +28,6 @@
  * 2 of the License, or (at your option) any later version.
  */
-#include <linux/config.h> /* for CONFIG_DLCI_COUNT */
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
--- linux-2.6.17.noarch/drivers/net/phy/vitesse.c~ 2006-08-10 21:36:02.0 -0400
+++ linux-2.6.17.noarch/drivers/net/phy/vitesse.c 2006-08-10 21:36:04.0 -0400
@@ -12,7 +12,6 @@
  *
  */
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mii.h>
--- linux-2.6.17.noarch/drivers/net/phy/smsc.c~ 2006-08-10 21:36:07.0 -0400
+++ linux-2.6.17.noarch/drivers/net/phy/smsc.c 2006-08-10 21:36:08.0 -0400
@@ -14,7 +14,6 @@
  *
  */
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mii.h>
--- linux-2.6.17.noarch/drivers/net/hp100.c~ 2006-08-10 21:36:12.0 -0400
+++ linux-2.6.17.noarch/drivers/net/hp100.c 2006-08-10 21:36:14.0 -0400
@@ -111,7 +111,6 @@
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
 #include <linux/types.h>
-#include <linux/config.h> /* for CONFIG_PCI */
 #include <linux/delay.h>
 #include <linux/init.h>
 #include <linux/bitops.h>
--- linux-2.6.17.noarch/drivers/net/3c501.c~ 2006-08-10 21:36:18.0 -0400
+++ linux-2.6.17.noarch/drivers/net/3c501.c 2006-08-10 21:36:20.0 -0400
@@ -120,7 +120,6 @@ static const char version[] =
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/errno.h>
-#include <linux/config.h> /* for CONFIG_IP_MULTICAST */
 #include <linux/spinlock.h>
 #include <linux/ethtool.h>
 #include <linux/delay.h>

-- 
http://www.codemonkey.org.uk
2.6.18-rc3-mm2 - BUG in rt6_lookup() from ipv6_del_addr()
On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

After applying the patch that Patrick McHardy pointed me at, it lived
longer. However, I'm now seeing problems at system shutdown (or anytime
you try to 'ifdown ethX' where ethX has an IPv6 address attached to it):

[ 196.346000] BUG: unable to handle kernel NULL pointer dereference at virtual address 0014
[ 196.347000] printing eip:
[ 196.348000] c032c436
[ 196.348000] *pde =
[ 196.349000] Oops: [#1]
[ 196.349000] 4K_STACKS PREEMPT
[ 196.349000] last sysfs file: /class/net/eth1/address
[ 196.349000] Modules linked in: thermal sony_acpi processor fan button battery ac nfnetlink i8k floppy nvram orinoco_cs orinoco hermes pcmcia firmware_class ohci1394 ieee1394 intel_agp agpgart iTCO_wdt yenta_socket rsrc_nonstatic pcmcia_core rtc
[ 196.349000] CPU:0
[ 196.349000] EIP:0060:[c032c436]Not tainted VLI
[ 196.349000] EFLAGS: 00010246 (2.6.18-rc3-mm2 #4)
[ 196.349000] EIP is at rt6_lookup+0x47/0x83
[ 196.349000] eax: ebx: ecx: 0005 edx:
[ 196.349000] esi: e8b25c98 edi: e8b25c20 ebp: e8b25c78 esp: e8b25c20
[ 196.349000] ds: 007b es: 007b ss: 0068
[ 196.349000] Process ip (pid: 2511, ti=e8b25000 task=effb0aa0 task.ti=e8b25000)
[ 196.349000] Stack: 0005 80fe
[ 196.349000]
[ 196.349000] 0008 eb6e98c8 e8b25ca8 e8b25cb4 c0327c04
[ 196.349000] Call Trace:
[ 196.349000] [c0327c04] ipv6_del_addr+0x2ef/0x3a7
[ 196.349000] [c0327d3f] inet6_addr_del+0x83/0xbb
[ 196.349000] [c0327dd6] inet6_rtm_deladdr+0x5f/0x6b
[ 196.349000] [c02da097] rtnetlink_rcv_msg+0x1b3/0x1d6
[ 196.349000] [c02e011c] netlink_run_queue+0x5a/0xc6
[ 196.349000] [c02d9e9d] rtnetlink_rcv+0x29/0x42
[ 196.349000] [c02e0576] netlink_data_ready+0x12/0x49
[ 196.349000] [c02df518] netlink_sendskb+0x1c/0x4d
[ 196.349000] [c02dfea0] netlink_unicast+0x1c4/0x1d0
[ 196.349000] [c02e0557] netlink_sendmsg+0x274/0x281
[ 196.349000] [c02ca57e] sock_sendmsg+0xeb/0x106
[ 196.349000] [c02cad99] sys_sendto+0xbe/0xdc
[ 196.349000] [c02cb522] sys_socketcall+0xfb/0x186
[ 196.349000] [c0102849] sysenter_past_esp+0x56/0x79
[ 196.349000] DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x79
[ 196.349000] Leftover inexact backtrace:
[ 196.349000] [c01036c7] show_stack_log_lvl+0x8c/0x97
[ 196.349000] [c010381f] show_registers+0x14d/0x1de
[ 196.349000] [c0103a5b] die+0x1ab/0x26d
[ 196.349000] [c0352205] do_page_fault+0x3f8/0x4c5
[ 196.349000] [c0351271] error_code+0x39/0x40
[ 196.349000] [c0327c04] ipv6_del_addr+0x2ef/0x3a7
[ 196.349000] [c0327d3f] inet6_addr_del+0x83/0xbb
[ 196.349000] [c0327dd6] inet6_rtm_deladdr+0x5f/0x6b
[ 196.349000] [c02da097] rtnetlink_rcv_msg+0x1b3/0x1d6
[ 196.349000] [c02e011c] netlink_run_queue+0x5a/0xc6
[ 196.349000] [c02d9e9d] rtnetlink_rcv+0x29/0x42
[ 196.349000] [c02e0576] netlink_data_ready+0x12/0x49
[ 196.349000] [c02df518] netlink_sendskb+0x1c/0x4d
[ 196.349000] [c02dfea0] netlink_unicast+0x1c4/0x1d0
[ 196.349000] [c02e0557] netlink_sendmsg+0x274/0x281
[ 196.349000] [c02ca57e] sock_sendmsg+0xeb/0x106
[ 196.349000] [c02cad99] sys_sendto+0xbe/0xdc
[ 196.349000] [c02cb522] sys_socketcall+0xfb/0x186
[ 196.349000] [c0102849] sysenter_past_esp+0x56/0x79
[ 196.349000] Code: eb ff 89 5d a8 8d 45 b0 b9 10 00 00 00 89 f2 e8 c9 e0 eb ff 31 d2 83 7d 08 00 0f 95 c2 b9 ad cc 32 c0 89 f8 e8 47 7c 01 00 89 c3 66 83 7b 14 00 74 2d 8b 43 04 85 c0 7f 21 68 c4 19 37 c0 68 99
[ 196.349000] EIP: [c032c436] rt6_lookup+0x47/0x83 SS:ESP 0068:e8b25c20

The unlucky 'ip' process then gets a SIGSEGV and dies while holding a
lock of some sort, so later 'ip' processes get hung in 'D' state.

Checking the lkml and netdev archives didn't find any useful hits for
'ipv6_addr_rel'...
Re: [RFC][PATCH 2/9] deadlock prevention core
Thomas Graf wrote:
> skb->dev is not guaranteed to still point to the allocating device
> once the skb is freed again so reserve/unreserve isn't symmetric.
> You'd need skb->alloc_dev or something.

There's another consequence of this property of the network stack.

Every network interface must be able to fall back to these MEMALLOC
allocations, because the memory critical socket could be on another
network interface. Hence, we cannot know which network interfaces
should (not) be marked MEMALLOC.

-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it. - Brian W. Kernighan
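The asymmetry Thomas Graf describes, and the alloc_dev idea he floats, can be illustrated with a small userspace sketch. All types and functions here (dev_sketch, skb_sketch, skb_alloc_sketch, skb_free_sketch) are hypothetical; the point is only that the charge is undone against the device recorded at allocation time, not against whatever dev points to when the buffer is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device reserve accounting. */
struct dev_sketch {
    int reserved;                  /* outstanding reserve charges */
};

struct skb_sketch {
    struct dev_sketch *dev;        /* may be rewritten as the packet moves */
    struct dev_sketch *alloc_dev;  /* fixed at allocation time */
};

static void skb_alloc_sketch(struct skb_sketch *skb, struct dev_sketch *dev)
{
    skb->dev = dev;
    skb->alloc_dev = dev;
    dev->reserved++;               /* charge the allocating device */
}

static void skb_free_sketch(struct skb_sketch *skb)
{
    skb->alloc_dev->reserved--;    /* uncharge the same device, whatever dev says */
    skb->dev = NULL;
    skb->alloc_dev = NULL;
}
```

If the free path used dev instead, a packet handed from one device to another would leave the first device overcharged and the second undercharged.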
Re: 2.6.18-rc3-mm2 - BUG in rt6_lookup() from ipv6_del_addr()
From: [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 22:15:26 -0400

> On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> After applying the patch that Patrick McHardy pointed me at, it lived
> longer. However, I'm now seeing problems at system shutdown (or anytime
> you try to 'ifdown ethX' where ethX has an IPv6 address attached to it):

This is cured by yet another fix already in the net-2.6.19 tree:

From 7a3a5e6b0e6847749c756cbe4bf554eda063a577 Mon Sep 17 00:00:00 2001
From: Ville Nuorvala [EMAIL PROTECTED]
Date: Tue, 8 Aug 2006 16:44:17 -0700
Subject: [PATCH] [IPV6]: Make sure fib6_rule_lookup doesn't return NULL

The callers of fib6_rule_lookup don't expect it to return NULL,
therefore it must return ip6_null_entry whenever fib_rule_lookup fails.

Signed-off-by: Ville Nuorvala [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]
---
 net/ipv6/fib6_rules.c | 6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index bf9bba8..22a2fdb 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -63,7 +63,11 @@ struct dst_entry *fib6_rule_lookup(struc
         if (arg.rule)
                 fib_rule_put(arg.rule);

-        return (struct dst_entry *) arg.result;
+        if (arg.result)
+                return (struct dst_entry *) arg.result;
+
+        dst_hold(&ip6_null_entry.u.dst);
+        return &ip6_null_entry.u.dst;
 }

 static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
-- 
1.4.2.rc2.g3e042
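The pattern in this fix (never hand NULL to callers that do not check for it; return a shared sentinel with a reference held instead) can be shown in isolation. The sketch below is a userspace illustration with hypothetical names (dst_sketch, null_entry_sketch, dst_hold_sketch, rule_lookup_sketch); the -ENETUNREACH error value on the sentinel is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dst_entry. */
struct dst_sketch {
    int refcnt;
    int error;
};

/* Shared "no route" sentinel, playing the role of ip6_null_entry. */
static struct dst_sketch null_entry_sketch = {
    .refcnt = 1,
    .error = -ENETUNREACH,
};

static void dst_hold_sketch(struct dst_sketch *dst)
{
    dst->refcnt++;
}

/* rule_result stands in for arg.result and may be NULL when no rule matched. */
static struct dst_sketch *rule_lookup_sketch(struct dst_sketch *rule_result)
{
    if (rule_result)
        return rule_result;

    /* Take a reference so the caller can drop it like any other entry. */
    dst_hold_sketch(&null_entry_sketch);
    return &null_entry_sketch;
}
```

Callers can then dereference the result unconditionally and inspect its error field, which is the situation rt6_lookup() was in when it crashed on the NULL return.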