Re: Network virtualization/isolation
On Thursday 26 October 2006 19:56, Stephen Hemminger wrote:
> On Thu, 26 Oct 2006 11:44:55 +0200
> Daniel Lezcano [EMAIL PROTECTED] wrote:
>
> > Stephen Hemminger wrote:
> > > On Wed, 25 Oct 2006 17:51:28 +0200
> > > Daniel Lezcano [EMAIL PROTECTED] wrote:
> > >
> > > > Hi Stephen,
> > > >
> > > > the work to enable containers in the kernel is currently making good progress. The ipc, pid, utsname and filesystem system resources are isolated/virtualized using the namespace concept.
> > > >
> > > > What is still missing is network virtualization/isolation. Two approaches are proposed: isolation at layer 2 and isolation at layer 3.
> > > >
> > > > The first one instantiates a network device per namespace and adds a peer network device in the root namespace; all the routing resources are relative to the namespace. This work is done by Andrey Savochkin from the OpenVZ project.
> > > >
> > > > The second relies on routes and associates a network namespace pointer with each route. When traffic is incoming, the packet follows an input route and retrieves the associated network namespace. When traffic is outgoing, the packet, identified by the network namespace it comes from, follows only the routes matching that same namespace. This work is done by me.
> > > >
> > > > IMHO, we need both approaches: layer 2 to provide *very* strong isolation for system containers at a performance cost, and layer 3 to provide good isolation for lightweight or application containers when performance matters more.
> > > >
> > > > Do you have any suggestions? What is your point of view on that?
> > > >
> > > > Thanks in advance.
> > > >
> > > >    -- Daniel
> > >
> > > Any solution should allow both and it should build on the existing netfilter infrastructure.
> >
> > The problem is that netfilter cannot give good isolation, e.g. how would the netstat command be handled? Or how do we avoid seeing IP addresses assigned to another container when running ifconfig?
> >
> > Furthermore, one of the biggest benefits of network isolation is to bring mobility to a container, and that can only be done if the network resources inside the kernel can be identified per container in order to checkpoint/restart them.
> >
> > The all-in-namespace solution, i.e. at layer 2, is very good in terms of isolation but it adds a non-negligible overhead. Layer 3 isolation has insignificant overhead and good isolation, well suited to application containers.
> >
> > Unfortunately, from an implementation point of view, layer 3 cannot be a subset of layer 2 isolation when using all-in-namespace, and layer 2 isolation cannot be an extension of layer 3 isolation.
> >
> > I think the layer 2 and layer 3 implementations can coexist. You can, for example, create a system container with layer 2 isolation and add layer 3 isolation inside it.
> >
> > Does that make sense?
> >
> >    -- Daniel
>
> Assuming you are talking about pseudo-virtualized environments, there are several different discussions.
>
> 1. How should the namespace be isolated for the virtualized containered applications?
>
> 2. How should traffic be restricted into/out of those containers. This is where existing netfilter, classification, etc, should be used. The network code is overly rich as it is, we don't need another abstraction.
>
> 3. Can the virtualized containers be secure? No. We really can't keep hostile root in a container from killing system without going to a hypervisor.

Stephen,

A virtualized container can be secure if it is complete system virtualization, not just an application container. OpenVZ implements this and it is used heavily around the world. And of course, we care a lot about keeping a hostile root from killing the whole system.

OpenVZ uses virtualization on IP level (implemented by Andrey Savochkin, http://marc.theaimsgroup.com/?l=linux-netdev&m=115572448503723), with all necessary network objects isolated/virtualized, such as sockets, devices, routes, netfilters, etc.

-- 
Thanks,
Dmitry.
- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
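
The layer-3 approach described in the thread above (a namespace tag on each route, consulted on input to find the owning namespace and on output to restrict which routes a sender may use) can be sketched in userspace C. This is only a toy model under stated assumptions: `struct ns_route`, `route_lookup()`, `ns_for_incoming()`, `route_for_outgoing()` and the integer namespace ids are hypothetical names made up for illustration, not structures or functions from Daniel's patches or from the kernel routing code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified route entry; NOT a kernel structure. */
struct ns_route {
    uint32_t prefix;  /* network prefix, host byte order */
    uint32_t mask;    /* contiguous netmask, host byte order */
    int      ns_id;   /* illustrative id of the owning namespace */
};

#define ANY_NS (-1)   /* wildcard: do not filter on namespace */

/* Longest-prefix match, optionally restricted to one namespace. */
const struct ns_route *route_lookup(const struct ns_route *tbl, size_t n,
                                    uint32_t dst, int ns_id)
{
    const struct ns_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if ((dst & tbl[i].mask) != tbl[i].prefix)
            continue;                     /* destination not covered */
        if (ns_id != ANY_NS && tbl[i].ns_id != ns_id)
            continue;                     /* route owned by another namespace */
        if (best == NULL || tbl[i].mask > best->mask)
            best = &tbl[i];               /* keep the most specific match */
    }
    return best;
}

/* Input path: the matching route tells us which namespace owns the packet. */
int ns_for_incoming(const struct ns_route *tbl, size_t n, uint32_t dst)
{
    const struct ns_route *r = route_lookup(tbl, n, dst, ANY_NS);
    return r ? r->ns_id : -1;
}

/* Output path: only routes tagged with the sender's namespace are usable. */
const struct ns_route *route_for_outgoing(const struct ns_route *tbl, size_t n,
                                          uint32_t dst, int ns_id)
{
    return route_lookup(tbl, n, dst, ns_id);
}
```

In this model a sender in namespace 1 gets no route at all towards a prefix that only namespace 2 owns, which is the isolation property the layer-3 proposal relies on; the layer-2 proposal instead gives each namespace its own devices and routing tables.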
Re: Network virtualization/isolation
[ ... ]

Dmitry Mishin wrote:
> Stephen,
>
> A virtualized container can be secure if it is complete system virtualization, not just an application container. OpenVZ implements this and it is used heavily around the world. And of course, we care a lot about keeping a hostile root from killing the whole system.

OpenVZ power !!

> OpenVZ uses virtualization on IP level (implemented by Andrey Savochkin, http://marc.theaimsgroup.com/?l=linux-netdev&m=115572448503723), with all necessary network objects isolated/virtualized, such as sockets, devices, routes, netfilters, etc.

No, it uses virtualization at layer 2, and I had already mentioned it before (see the first email of the thread), but thank you for the pointer to the email thread.

The point of the discussion is not to convince Stephen that layer 2 or layer 3 is best, but to present the pros and cons of each solution and to get the point of view of a network guru.

Regards.

   -- Daniel
Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
* David Miller | 2006-10-26 17:02:21 [-0700]:

> Your email client turned the tabs into spaces in the patch
> making it useless.

Sorry, my mistake! I am traveling and pasted the patch into my editor, which ate all the tabs. Once more: sorry!


Check if user has CAP_NET_ADMIN capability to change congestion control algorithm.

Signed-off-by: Hagen Paul Pfeifer [EMAIL PROTECTED]
---
 net/ipv4/tcp_cong.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index af0aca1..c1ae2e9 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -10,6 +10,7 @@ #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/capability.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -151,6 +152,9 @@ int tcp_set_congestion_control(struct so
 	struct tcp_congestion_ops *ca;
 	int err = 0;
 
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
 	if (ca == icsk->icsk_ca_ops)
-- 
1.4.1.1
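
For context, the per-socket option that the patch above would restrict is set from userspace with setsockopt() and TCP_CONGESTION. A minimal sketch follows; the wrapper name `set_congestion_control()` is made up for illustration, and the algorithm name passed in must be one that is actually available on the running kernel.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Ask the kernel to use the named congestion control algorithm on one
 * TCP socket.  Returns 0 on success, -1 with errno set on failure.
 * (Hypothetical helper name; the setsockopt() call is the real API.)
 */
int set_congestion_control(int fd, const char *name)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                      name, (socklen_t)strlen(name));
}
```

With the capable(CAP_NET_ADMIN) check from the patch in place, a call such as `set_congestion_control(fd, "reno")` from an unprivileged process would fail with EPERM, which is the usability cost debated in the replies.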
RE: [PATCH] s2io: add PCI error recovery support
Looking at all scenarios, I feel the first patch is OK. Can you add the watchdog timer fix to the first patch and resubmit?

-----Original Message-----
From: Linas Vepstas [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 26, 2006 3:52 PM
To: Ananda Raju
Cc: Wen Xiong; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik; Andrew Morton
Subject: Re: [PATCH] s2io: add PCI error recovery support

Hi.

On Thu, Oct 26, 2006 at 05:56:34AM -0400, Ananda Raju wrote:
> Hi,
> Can you try attached patch.
>
> The attached patch is simple. We set card state as down in error_detected() so that all entry points return error and don't proceed further. In slot_reset() we do s2io_card_down(), which will reset the adapter. In io_resume() we bring up the driver.

Simplicity is always better. However, some questions/comments:

> @@ -4175,6 +4186,10 @@ static irqreturn_t s2io_isr(int irq, voi
>  	mac_info_t *mac_control;
>  	struct config_param *config;
>
> +	if (atomic_read(&sp->card_state) == CARD_DOWN) {
> +		return IRQ_NONE;
> +	}

I used

	if (sp->pdev->error_state != pci_channel_io_normal)

here for a reason: pdev->error_state is set even in interrupt context, that is, it gets set even if interrupts are disabled, and so it represents the actual state immediately. By contrast, the error callbacks do not get called until possibly much later, and so sp->card_state = CARD_DOWN might not get set for a while.

If, for any reason, e.g. some obscure corner case, the s2io generates zillions of interrupts, this could result in a soft-lockup. I actually saw this in the symbios device driver, which will regenerate an interrupt until it's acknowledged -- and so it sat there, spinning. :-(

I was returning IRQ_HANDLED instead of IRQ_NONE, so as to avoid falling into handle_bad_irq() or report_bad_irq().

I haven't seen this happen on s2io, but thought it would still be wise. If this can't happen, then there's no problem here.

> +/**
> + * s2io_io_slot_reset - called after the pci bus has been reset.
> + * @pdev: Pointer to PCI device
> + *
> + * Restart the card from scratch, as if from a cold-boot.
> + */
> +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
> +{

At this point, the card has just experienced a hardware reset (the #RST wire was held low for 250 millisecs, followed by a settle time of 2 seconds, followed by whatever BIOS thinks it needed to do, followed by a restore of the pci config space to what it was after a cold boot). So the card is in a fresh state; in theory it's identical to a cold boot. So ... are you sure you want to down at this point?

> +	s2io_card_down(sp);
> +	sp->device_close_flag = TRUE;	/* Device is shut down. */

One problem I'm having is that the watchdog timer sometimes pops and tries to reset the card before s2io_card_down() has a chance to run. I fixed this ...

==

So -- just for grins, I thought to myself, "Maybe I can make s2io be the first adapter ever to fully recover without a hard reset of the card." The idea is simple:
1) enable MMIO,
2) call s2io_card_down()
3) enable DMA
4) call s2io_card_up()

I have a patch that does this, but then hit a few more snags. I haven't yet nailed down all the trouble spots, maybe tomorrow.

--linas
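
Linas's argument about the interrupt handler can be illustrated with a small userspace mock. This is only a sketch of the reasoning, not driver code: every type, field and constant below (`struct mock_nic`, `mock_isr()`, the `MOCK_*` values) is a hypothetical stand-in for the kernel's `pci_dev->error_state`, the driver's `card_state` and the `irqreturn_t` values. It models the window in which the PCI layer has already marked the channel as frozen but the driver's error callback has not yet set its own "card down" flag.

```c
/* Hypothetical stand-ins for kernel types; for illustration only. */
enum mock_irqreturn { MOCK_IRQ_NONE = 0, MOCK_IRQ_HANDLED = 1 };
enum mock_channel_state { MOCK_IO_NORMAL = 1, MOCK_IO_FROZEN = 2 };

struct mock_nic {
    enum mock_channel_state error_state; /* set immediately by the PCI layer */
    int card_down;                       /* set later, by the error callback */
    int serviced;                        /* how often the hardware was touched */
};

enum mock_irqreturn mock_isr(struct mock_nic *nic)
{
    /*
     * Check the state the PCI layer maintains first: it is valid even
     * before the driver's error callback has run.  Claim the interrupt
     * so the core does not treat it as a spurious/bad IRQ.
     */
    if (nic->error_state != MOCK_IO_NORMAL)
        return MOCK_IRQ_HANDLED;

    /* Driver-private flag, as in the simpler patch quoted above. */
    if (nic->card_down)
        return MOCK_IRQ_NONE;

    nic->serviced++;                     /* normal interrupt processing */
    return MOCK_IRQ_HANDLED;
}
```

With `error_state` frozen and `card_down` still 0, the handler returns early without touching the device, which is the case a check on the driver-private flag alone would miss.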
Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
On Fri, 27 Oct 2006 12:43:11 +0200
Hagen Paul Pfeifer [EMAIL PROTECTED] wrote:

> * David Miller | 2006-10-26 17:02:21 [-0700]:
>
> > Your email client turned the tabs into spaces in the patch
> > making it useless.
>
> Sorry, my mistake! I am traveling and pasted the patch into my editor, which ate all the tabs. Once more: sorry!
>
> Check if user has CAP_NET_ADMIN capability to change congestion control algorithm.
>
> Signed-off-by: Hagen Paul Pfeifer [EMAIL PROTECTED]

Please no, it makes the socket option useless. If you want to tag some bad apples that's okay, but it would need some more infrastructure.
Re: [PATCH] Rewrite e100_phys_id
Matthew Wilcox wrote:
> On Thu, Oct 26, 2006 at 01:04:32PM -0700, Auke Kok wrote:
> > no objections, so I'll ACK it with the notion that I'm going to let our labs do some more testing on it with all the latest changes to it.
>
> Thanks, Auke.  Here's the equivalent patch for e1000.  I don't have a convenient machine to test it on, but it reduces the size of the driver by 1.5k.

this is a bit (!) more complex than e100, so I'm going to take a bit of time to review this patch.

thanks,

Auke

> diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
> index 7ecce43..1e22da6 100644
> --- a/drivers/net/e1000/e1000.h
> +++ b/drivers/net/e1000/e1000.h
> @@ -257,9 +257,6 @@ #endif
>  	struct work_struct reset_task;
>  	uint8_t fc_autoneg;
>  
> -	struct timer_list blink_timer;
> -	unsigned long led_status;
> -
>  	/* TX */
>  	struct e1000_tx_ring *tx_ring;      /* One per active queue */
>  	unsigned long tx_queue_len;
> diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c
> index 773821e..620afa5 100644
> --- a/drivers/net/e1000/e1000_ethtool.c
> +++ b/drivers/net/e1000/e1000_ethtool.c
> @@ -1819,61 +1819,15 @@ e1000_set_wol(struct net_device *netdev,
>  	return 0;
>  }
>  
> -/* toggle LED 4 times per second = 2 "blinks" per second */
> -#define E1000_ID_INTERVAL	(HZ/4)
> -
> -/* bit defines for adapter->led_status */
> -#define E1000_LED_ON		0
> -
> -static void
> -e1000_led_blink_callback(unsigned long data)
> -{
> -	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
> -
> -	if (test_and_change_bit(E1000_LED_ON, &adapter->led_status))
> -		e1000_led_off(&adapter->hw);
> -	else
> -		e1000_led_on(&adapter->hw);
> -
> -	mod_timer(&adapter->blink_timer, jiffies + E1000_ID_INTERVAL);
> -}
> -
>  static int
>  e1000_phys_id(struct net_device *netdev, uint32_t data)
>  {
>  	struct e1000_adapter *adapter = netdev_priv(netdev);
>  
> -	if (!data || data > (uint32_t)(MAX_SCHEDULE_TIMEOUT / HZ))
> -		data = (uint32_t)(MAX_SCHEDULE_TIMEOUT / HZ);
> -
> -	if (adapter->hw.mac_type < e1000_82571) {
> -		if (!adapter->blink_timer.function) {
> -			init_timer(&adapter->blink_timer);
> -			adapter->blink_timer.function = e1000_led_blink_callback;
> -			adapter->blink_timer.data = (unsigned long) adapter;
> -		}
> -		e1000_setup_led(&adapter->hw);
> -		mod_timer(&adapter->blink_timer, jiffies);
> -		msleep_interruptible(data * 1000);
> -		del_timer_sync(&adapter->blink_timer);
> -	} else if (adapter->hw.phy_type == e1000_phy_ife) {
> -		if (!adapter->blink_timer.function) {
> -			init_timer(&adapter->blink_timer);
> -			adapter->blink_timer.function = e1000_led_blink_callback;
> -			adapter->blink_timer.data = (unsigned long) adapter;
> -		}
> -		mod_timer(&adapter->blink_timer, jiffies);
> -		msleep_interruptible(data * 1000);
> -		del_timer_sync(&adapter->blink_timer);
> -		e1000_write_phy_reg(&(adapter->hw), IFE_PHY_SPECIAL_CONTROL_LED, 0);
> -	} else {
> -		e1000_blink_led_start(&adapter->hw);
> -		msleep_interruptible(data * 1000);
> -	}
> +	if (data == 0)
> +		data = 2;
>  
> -	e1000_led_off(&adapter->hw);
> -	clear_bit(E1000_LED_ON, &adapter->led_status);
> -	e1000_cleanup_led(&adapter->hw);
> +	e1000_blink_led(&adapter->hw, data);
>  
>  	return 0;
>  }
> diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
> index 65077f3..db5e999 100644
> --- a/drivers/net/e1000/e1000_hw.c
> +++ b/drivers/net/e1000/e1000_hw.c
> @@ -6071,7 +6071,7 @@ e1000_id_led_init(struct e1000_hw * hw)
>   *
>   * hw - Struct containing variables accessed by shared code
>   */
> -int32_t
> +static int32_t
>  e1000_setup_led(struct e1000_hw *hw)
>  {
>      uint32_t ledctl;
> @@ -6123,50 +6123,11 @@ e1000_setup_led(struct e1000_hw *hw)
>  
>  /**
> - * Used on 82571 and later Si that has LED blink bits.
> - * Callers must use their own timer and should have already called
> - * e1000_id_led_init()
> - * Call e1000_cleanup led() to stop blinking
> - *
> - * hw - Struct containing variables accessed by shared code
> - */
> -int32_t
> -e1000_blink_led_start(struct e1000_hw *hw)
> -{
> -    int16_t  i;
> -    uint32_t ledctl_blink = 0;
> -
> -    DEBUGFUNC("e1000_id_led_blink_on");
> -
> -    if (hw->mac_type < e1000_82571) {
> -        /* Nothing to do */
> -        return E1000_SUCCESS;
> -    }
> -    if (hw->media_type == e1000_media_type_fiber) {
> -        /* always blink LED0 for PCI-E fiber */
> -        ledctl_blink = E1000_LEDCTL_LED0_BLINK |
> -                     (E1000_LEDCTL_MODE_LED_ON << E1000_LEDCTL_LED0_MODE_SHIFT);
> -    } else {
> -
Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
* Stephen Hemminger | 2006-10-27 07:41:02 [-0700]:

> Please no, it makes the socket option useless.

Technically no; in the sense of usability for everybody, yes.

You are right, Stephen; as a programmer I understand you completely! But on the other hand, we know for sure that it IS a problem if we allow everybody to favor his own socket. In my opinion we should put fairness before usability!

As John Heffner proposed, we could introduce a ranking system for congestion control algorithms, but this solution seems a bit oversized and maybe cannot be completely guaranteed (complex interactions between the protocols in different environments and so on, you know).

HGN

-- 
/°\ --- JOIN NOW!!! ---
\ /  ASCII ribbon campaign
 X   against HTML
/ \  in mail and news
Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
Hagen Paul Pfeifer wrote:
> * Stephen Hemminger | 2006-10-27 07:41:02 [-0700]:
>
> > Please no, it makes the socket option useless.
>
> Technically no; in the sense of usability for everybody, yes.
>
> You are right, Stephen; as a programmer I understand you completely! But on the other hand, we know for sure that it IS a problem if we allow everybody to favor his own socket. In my opinion we should put fairness before usability!
>
> As John Heffner proposed, we could introduce a ranking system for congestion control algorithms, but this solution seems a bit oversized and maybe cannot be completely guaranteed (complex interactions between the protocols in different environments and so on, you know).
>
> HGN

If there is a dangerous choice, then it should be removed. Otherwise I can't see the problem. It is a bigger risk to have to escalate the privileges of an application just to allow it to use something.
[IPROUTE] manpage for rtmon
Hello,

another manpage, this time for rtmon. Would be great if it could be applied to the next release too.

regards,
-mika-
-- 
 ,'`.         http://www.michael-prokop.at/
(  grml.org -» Linux Live-CD for texttool-users and sysadmins
 `._,'        http://www.grml.org/

.TH RTMON 8
.SH NAME
rtmon \- listens to and monitors RTnetlink
.SH SYNOPSIS
.B rtmon
.RI [ options ] file FILE [ all | LISTofOBJECTS ]
.SH DESCRIPTION
This manual page documents briefly the
.B rtmon
command.
.PP
\fBrtmon\fP is a RTnetlink listener. RTnetlink allows the kernel's routing tables to be read and altered.

rtmon should be started before the first network configuration command is issued. For example if you insert:

	rtmon file /var/log/rtmon.log

in a startup script, you will be able to view the full history later.

Certainly, it is possible to start rtmon at any time. It prepends the history with the state snapshot dumped at the moment of starting.

.SH OPTIONS
rtmon supports the following options:
.TP
.B \-Version
Print version and exit.
.TP
.B help
Show summary of options.
.TP
.B file FILE [ all | LISTofOBJECTS ]
Log output to FILE. LISTofOBJECTS is the list of object types that we want to monitor. It may contain 'link', 'address', 'route' and 'all'. 'link' specifies the network device, 'address' the protocol (IP or IPv6) address on a device, 'route' the routing table entry and 'all' does what the name says.
.TP
.B \-family [ inet | inet6 | link | help ]
Specify protocol family. 'inet' is IPv4, 'inet6' is IPv6, 'link' means that no networking protocol is involved and 'help' prints usage information.
.TP
.B \-4
Use IPv4. Shortcut for -family inet.
.TP
.B \-6
Use IPv6. Shortcut for -family inet6.
.TP
.B \-0
Use a special family identifier meaning that no networking protocol is involved. Shortcut for -family link.
.SH USAGE EXAMPLES
.TP
.B # rtmon file /var/log/rtmon.log
Log to file /var/log/rtmon.log, then run:
.TP
.B # ip monitor file /var/log/rtmon.log
to display logged output from file.
.SH SEE ALSO
.BR ip (8)
.SH AUTHOR
rtmon was written by Alexey Kuznetsov [EMAIL PROTECTED].
.PP
This manual page was written by Michael Prokop [EMAIL PROTECTED], for the Debian project (but may be used by others).
[PATCH] sky2: not experimental
The sky2 driver is no longer in experimental state.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- sky2.orig/drivers/net/Kconfig	2006-10-27 10:16:44.0 -0700
+++ sky2/drivers/net/Kconfig	2006-10-27 10:20:20.0 -0700
@@ -2112,7 +2112,7 @@
 
 config SKY2
 	tristate "SysKonnect Yukon2 support (EXPERIMENTAL)"
-	depends on PCI && EXPERIMENTAL
+	depends on PCI
 	select CRC32
 	---help---
 	  This driver supports Gigabit Ethernet adapters based on the
@@ -2120,8 +2120,8 @@
 	  Marvell 88E8021/88E8022/88E8035/88E8036/88E8038/88E8050/88E8052/
 	  88E8053/88E8055/88E8061/88E8062, SysKonnect SK-9E21D/SK-9S21
 
-	  This driver does not support the original Yukon chipset: a seperate
-	  driver, skge, is provided for Yukon-based adapters.
+	  There is companion driver for the older Marvell Yukon and
+	  Genesis based adapters: skge.
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called sky2.  This is recommended.
[take21 1/4] kevent: Core files.
Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some bits of documentation can be found on project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a9560eb 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,6 @@ ENTRY(sys_call_table)
 	.long sys_vmsplice
 	.long sys_move_pages
 	.long sys_getcpu
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl		/* 320 */
+	.long sys_kevent_wait
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..cf18955 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,11 @@ #endif
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
 	.quad sys_getcpu
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl		/* 320 */
+	.quad sys_kevent_wait
 ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..f009677 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,13 @@ #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
+#define __NR_kevent_get_events	319
+#define __NR_kevent_ctl		320
+#define __NR_kevent_wait	321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 322
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..c53d156 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 000..125414c
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,205 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's tree. */
+	struct rb_node		kevent_node;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct
[take21 4/4] kevent: Timer notifications.
Timer notifications.

Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 000..04acc46
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,113 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct hrtimer		ktimer;
+	struct kevent_storage	ktimer_storage;
+	struct kevent		*ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+	struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+	struct kevent *k = t->ktimer_event;
+
+	kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+	hrtimer_forward(timer, timer->base->softirq_time,
+			ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+	return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+	t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+	t->ktimer.function = kevent_timer_func;
+	t->ktimer_event = k;
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+
+	printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer);
+	hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+	return 0;
+
+err_out_st_fini:
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	hrtimer_cancel(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback,
+		.enqueue = &kevent_timer_enqueue,
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+
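
The description above says a timer's period is passed in the event id, with id.raw[0] holding seconds and id.raw[1] holding nanoseconds. A userspace caller therefore has to split its interval into that pair before submitting the event. The sketch below shows only that split; `struct timer_id` is a hypothetical stand-in that assumes two 32-bit slots, since the real id layout lives in the patchset's ukevent.h header, which is not shown in this mail, and `timer_id_from_ms()` is a made-up helper name.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the id field of struct ukevent.
 * Assumption: two 32-bit slots, raw[0] = seconds, raw[1] = nanoseconds.
 */
struct timer_id {
    uint32_t raw[2];
};

/* Split an interval given in milliseconds into seconds + nanoseconds. */
struct timer_id timer_id_from_ms(uint64_t ms)
{
    struct timer_id id;

    id.raw[0] = (uint32_t)(ms / 1000);               /* whole seconds */
    id.raw[1] = (uint32_t)((ms % 1000) * 1000000u);  /* remainder in ns */
    return id;
}
```

For example, a 1500 ms period becomes raw[0] = 1 and raw[1] = 500000000, which the kernel side then hands to ktime_set() as shown in kevent_timer_enqueue().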
[take21 0/4] kevent: Generic event handling mechanism.
Generic event handling mechanism.

Consider for inclusion.

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents
With this release and fixed userspace web server it was possible to achieve 3960+ req/s with client connection rate of 4000 con/s over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which is too close to wire speed if we take into account headers and the like.

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table.
	At least for a web server, frequency of addition/deletion of new kevent is comparable with number of search access, i.e. most of the time events are added, accessed only couple of times and then removed, so it justifies RB tree usage over AVL tree, since the latter does have much slower deletion time (max O(log(N)) compared to 3 ops), although faster search time (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use RB tree for now and later, when my AVL tree implementation is ready, it will be possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380 req/second compared to 2200, sometimes 2500 req/second for epoll() for trivial web-server and httperf client on the same hardware. It is possible that above kevent limit is due to maximum allowed kevents in a time limit, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages) calculation
 * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers, this forces timer API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has been updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
	This syscall waits until either timeout expires or at least one event becomes ready. It also commits that @num events from @start are processed by userspace and thus can be removed or rearmed (depending on its flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not get lock around user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added shared between userspace and kernelspace header instead of embedding them in one
 * core restructuring to remove forward declarations
 * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use nopage() method to dynamically substitute pages
	- allocate new page for events only when new added kevent requires it
	- do not use ugly index dereferencing, use structure instead
	- reduced amount of data in the ring (id and
[take21 3/4] kevent: Socket notifications.
Socket notifications.

This patch includes socket send/recv/accept notifications.
With a trivial web server based on kevent and this feature instead of
epoll, performance increased more than noticeably. More details about
various benchmarks and the server itself (evserver_kevent.c) can be
found on the project's homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ #endif
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 			wake_up_interruptible(sk->sk_sleep);
+			kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 			if (!inet_csk_ack_scheduled(sk))
 				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 						          (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 000..c865b3e
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,129 @@
+/*
+ * 	kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int
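The SOCKET_I()/SOCK_INODE() helpers that this patch moves earlier in include/net/sock.h rely on the container_of() idiom: both objects live in one struct socket_alloc, so either one can be recovered from a pointer to the other. Below is a minimal userspace sketch of that idiom. The struct socket and struct inode definitions are simplified stand-ins, not the kernel ones, and container_of() is re-implemented with offsetof() purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct socket { int state; };
struct inode  { unsigned long i_ino; };

/* Same layout idea as in the patch: socket and inode share one allocation. */
struct socket_alloc {
	struct socket socket;
	struct inode vfs_inode;
};

/* Recover the socket from its embedded inode. */
static inline struct socket *SOCKET_I(struct inode *inode)
{
	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

/* Recover the inode from its embedded socket. */
static inline struct inode *SOCK_INODE(struct socket *socket)
{
	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
}
```

Because the conversion is pure pointer arithmetic, it costs nothing at run time, which is presumably why the kevent hooks can use it on the socket fast path.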
Re: [openib-general] [PATCH 1/9] NetEffect 10Gb RNIC Driver: kernel Kconfig and makefiles
On Thu, 26 Oct 2006, Glenn Grundstrom wrote:

> diff -ruNp old/drivers/infiniband/hw/nes/Makefile new/drivers/infiniband/hw/nes/Makefile
> --- old/drivers/infiniband/hw/nes/Makefile	1969-12-31 18:00:00.0 -0600
> +++ new/drivers/infiniband/hw/nes/Makefile	2006-10-25 11:10:26.0 -0500
> @@ -0,0 +1,27 @@
> +EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/hw/nes/nes_tcpip/include
> +
> +ifdef CONFIG_INFINIBAND_NES_DEBUG
> +EXTRA_CFLAGS += -DNES_DEBUG
> +endif

The NES_DEBUG flag is unnecessary. You can check for
CONFIG_INFINIBAND_NES_DEBUG in the code. See
CONFIG_INFINIBAND_MTHCA_DEBUG for an example.

> +
> +ifneq ($(KERNELRELEASE),)
> +	obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o
> +
> +	iw_nes-objs := \
> +		nes.o \
> +		nes_hw.o \
> +		nes_nic.o \
> +		nes_cm.o \
> +		nes_utils.o \
> +		nes_verbs.o
> +else
> +	KERNELDIR ?= /usr/src/linux
> +	PWD := $(shell pwd)
> +
> +default:
> +	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules
> +
> +clean:
> +	$(MAKE) -C $(KERNELDIR) M=$(PWD) clean
> +
> +endif

In-tree drivers don't provide support for out-of-tree builds. See
drivers/infiniband/hw/mthca/Makefile for an example of how to simplify
this.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
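The first review comment (test the Kconfig symbol directly instead of translating it into a second -DNES_DEBUG flag) can be sketched roughly as follows. This is a userspace illustration under stated assumptions: nes_dbg() and nes_debug_enabled() are hypothetical names, fprintf() stands in for printk(), and in a real kernel build CONFIG_INFINIBAND_NES_DEBUG would come from the generated configuration header rather than from the compiler command line.

```c
#include <assert.h>
#include <stdio.h>

/* The Kconfig symbol is tested directly; no extra -DNES_DEBUG is needed.
 * In this sketch it is simply undefined unless passed with -D. */
#ifdef CONFIG_INFINIBAND_NES_DEBUG
#define NES_DEBUG_ENABLED 1
#else
#define NES_DEBUG_ENABLED 0
#endif

/* Hypothetical debug macro. The constant condition lets the compiler
 * drop the call entirely when debugging is configured out, while the
 * format string and arguments are still type-checked. */
#define nes_dbg(fmt, ...)						\
	do {								\
		if (NES_DEBUG_ENABLED)					\
			fprintf(stderr, "iw_nes: " fmt, ##__VA_ARGS__);	\
	} while (0)

/* Small helper so callers (and tests) can query the build-time setting. */
static int nes_debug_enabled(void)
{
	return NES_DEBUG_ENABLED;
}
```

With this shape the Makefile only has to list the objects, which is the simplification the second review comment asks for.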
Re: [openib-general] [PATCH 3/9] NetEffect 10Gb RNIC Driver: openfabrics connection manager c file
[...snip...] +extern void set_interface( +UINT32ip_addr, These should probably be the standard linux types u32, or uint32 +UINT32mask, +UINT32bcastaddr, +UINT32type + ); [...snip...] + struct NES_sockaddr_in inet_addr; + struct sockaddr_in kinet_addr; Is there some reason why you need your own sockaddr and sockaddr_in structures? [...snip...] + +/** + * nes_disconnect + * + * @param cm_id + * @param abrupt + * + * @return int + */ +int nes_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct ib_qp_attr attr; + struct ib_qp *ibqp; + struct nes_qp *nesqp; + struct nes_dev *nesdev = to_nesdev(cm_id-device); + int err = 0; + u8 u8temp; + + dprintk(%s:%s:%u\n, __FILE__, __FUNCTION__, __LINE__); + dprintk(%s: netdev refcnt = %u.\n, __FUNCTION__, atomic_read(nesdev-netdev-refcnt)); + + /* If the qp was already destroyed, then there's no QP */ + if (cm_id-provider_data == 0) + return 0; + + nesqp = (struct nes_qp *)cm_id-provider_data; + ibqp = nesqp-ibqp; + + /* Disassociate the QP from this cm_id */ + cm_id-provider_data = 0; + cm_id-rem_ref(cm_id); + nesqp-cm_id = 0; + + stack_ops_p-decelerate_socket(nesqp-socket, +(struct nes_uploaded_qp_context *) +nesqp-nesqp_context); + + if (nesqp-active_conn) { + u8temp = 1 (ntohs(cm_id-local_addr.sin_port)7); + nesdev-apbv_table[ntohs(cm_id-local_addr.sin_port)3] = ~(u8temp); + } else { + dev_put(nesdev-netdev); +/* Need to free the Last Streaming Mode Message */ +pci_free_consistent(nesdev-pcidev, + nesqp-private_data_len+sizeof(*nesqp-ietf_frame), +nesqp-ietf_frame, +nesqp-ietf_frame_pbase); This is mailer perversion. You need to turn off wrapping in your mailer. It makes it hard to review the patch never mind apply it. +} + + if (nesqp-ksock) sock_release(nesqp-ksock); + stack_ops_p-sock_ops_p-close( nesqp-socket ); + nesqp-ksock = 0; + nesqp-socket = 0; + if (nesqp-wq) { + destroy_workqueue(nesqp-wq); This will deadlock if this function is called from a workqueue thread and CONFIG_HOTPLUG_CPU is enabled. 
+ nesqp-wq = NULL; + } + + memset(attr, 0, sizeof(struct ib_qp_attr)); + if (abrupt) + attr.qp_state = IB_QPS_ERR; + else + attr.qp_state = IB_QPS_SQD; + + return err; +} + + +/** + * nes_accept + * + * @param cm_id + * @param conn_param + * + * @return int + */ +int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + struct nes_qp *nesqp; + struct nes_dev *nesdev; + struct nes_adapter *nesadapter; + struct ib_qp *ibqp; +struct nes_hw_qp_wqe *wqe; + struct nes_v4_quad nes_quad; + struct ib_qp_attr attr; +struct iw_cm_event cm_event; + + dprintk(%s:%s:%u: data len = %u\n, + __FILE__, __FUNCTION__, __LINE__, conn_param-private_data_len); + + ibqp = nes_get_qp(cm_id-device, conn_param-qpn); + if (!ibqp) + return -EINVAL; + nesqp = to_nesqp(ibqp); + nesdev = to_nesdev(nesqp-ibqp.device); + nesadapter = nesdev-nesadapter; + dprintk(%s: netdev refcnt = %u.\n, __FUNCTION__, atomic_read(nesdev-netdev-refcnt)); + +nesqp-ietf_frame = pci_alloc_consistent(nesdev-pcidev, + sizeof(*nesqp-ietf_frame)+conn_param-private_data_len, + nesqp-ietf_frame_pbase); +if (!nesqp-ietf_frame) { +dprintk(KERN_ERR PFX %s: Unable to allocate memory for private data\n, __FUNCTION__); +return -ENOMEM; +} +dprintk(PFX %s: PCI consistent memory for +private data located @ %p (pa = 0x%08lX.) 
size = %u.\n, +__FUNCTION__, nesqp-ietf_frame, (unsigned long)nesqp-ietf_frame_pbase, +conn_param-private_data_len+sizeof(*nesqp-ietf_frame)); +nesqp-private_data_len = conn_param-private_data_len; + +strcpy(nesqp-ietf_frame-key[0], IEFT_MPA_KEY_REP); +memcpy(nesqp-ietf_frame-private_data, conn_param-private_data, conn_param-private_data_len); +nesqp-ietf_frame-private_data_size = cpu_to_be16(conn_param-private_data_len); +nesqp-ietf_frame-rev = mpa_version; +nesqp-ietf_frame-flags = IETF_MPA_FLAGS_CRC; + +wqe = nesqp-hwqp.sq_vbase[0]; +*((struct nes_qp **)wqe-wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX]) = nesqp; + *((u64 *)wqe-wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX]) |= NES_SW_CONTEXT_ALIGN1; +wqe-wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] =
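The port-bitmap update in the quoted nes_disconnect() lost its operators in this archive copy. A plausible reading, and it is only an assumption, is a one-bit-per-port table indexed by `port >> 3` with the bit selected by `port & 7`. The sketch below shows that technique in plain C; the table name is borrowed from the patch, while the helper functions are hypothetical and take the port in host byte order (that is, after ntohs()).

```c
#include <assert.h>
#include <stdint.h>

/* One bit per TCP port: 65536 ports / 8 bits per byte. */
static uint8_t apbv_table[65536 / 8];

/* Mark a port as accelerated. */
static void apbv_set(uint16_t port)
{
	apbv_table[port >> 3] |= (uint8_t)(1u << (port & 7));
}

/* Clear the mark again, as the disconnect path appears to do. */
static void apbv_clear(uint16_t port)
{
	apbv_table[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* Return 1 if the port is currently marked, 0 otherwise. */
static int apbv_test(uint16_t port)
{
	return (apbv_table[port >> 3] >> (port & 7)) & 1;
}
```

Note that clearing needs `&=` with the inverted mask; a plain assignment of `~mask` would set every other bit in that byte.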
[PATCH] tcp: don't allow unfair congestion control to be built without warning
My proposed method of restricting TCP choices to fair algorithms.
This is a net-wide, not a system-wide, issue; it should not be done by
a kernel policy choice (capability), but by a build choice.

--- sky2.orig/net/ipv4/Kconfig	2006-10-27 10:10:47.0 -0700
+++ sky2/net/ipv4/Kconfig	2006-10-27 10:15:56.0 -0700
@@ -470,6 +470,16 @@
 
 if TCP_CONG_ADVANCED
 
+config TCP_CONG_UNFAIR
+	bool "Allow unfair congestion control algorithms"
+	depends on EXPERIMENTAL
+	---help---
+	  Some of the congestion control algorithms are for testing
+	  and research purposes and should not be deployed on public
+	  networks because of the possibility of unfair behavior.
+	  These algorithms may be useful for future development
+	  or comparison purposes.
+
 config TCP_CONG_BIC
 	tristate "Binary Increase Congestion (BIC) control"
 	default m
@@ -551,7 +561,7 @@
 
 config TCP_CONG_SCALABLE
 	tristate "Scalable TCP"
-	depends on EXPERIMENTAL
+	depends on TCP_CONG_UNFAIR
 	default n
 	---help---
 	Scalable TCP is a sender-side only change to TCP which uses a
Re: [IPROUTE] manpage for rtmon
On Fri, 27 Oct 2006 19:22:11 +0200 Michael Prokop [EMAIL PROTECTED] wrote:

> User-Agent: mutt-ng devel-r316 (Debian)
>
> Hello,

added.

-- 
Stephen Hemminger [EMAIL PROTECTED]
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
I think "unfair" is a difficult word. Unfair to what? It's true that
Scalable TCP is unfair to itself in that flows with unequal shares do
not converge, but it's not clear what its interactions are with other
congestion control algorithms. It's not clear to me that it's
significantly more unfair with respect to Reno than BIC, etc. "Known to
be broken" might be more correct language. :)

One thought would be to use a module parameter that sets one bit of
state: allow unprivileged use. Each module could have a sensible
default value.

  -John

Stephen Hemminger wrote:
> My proposed method of restricting TCP choices to fair algorithms.
> This is a net-wide, not a system-wide, issue; it should not be done by
> a kernel policy choice (capability), but by a build choice.
>
> --- sky2.orig/net/ipv4/Kconfig	2006-10-27 10:10:47.0 -0700
> +++ sky2/net/ipv4/Kconfig	2006-10-27 10:15:56.0 -0700
> @@ -470,6 +470,16 @@
>  
>  if TCP_CONG_ADVANCED
>  
> +config TCP_CONG_UNFAIR
> +	bool "Allow unfair congestion control algorithms"
> +	depends on EXPERIMENTAL
> +	---help---
> +	  Some of the congestion control algorithms are for testing
> +	  and research purposes and should not be deployed on public
> +	  networks because of the possibility of unfair behavior.
> +	  These algorithms may be useful for future development
> +	  or comparison purposes.
> +
>  config TCP_CONG_BIC
>  	tristate "Binary Increase Congestion (BIC) control"
>  	default m
> @@ -551,7 +561,7 @@
>  
>  config TCP_CONG_SCALABLE
>  	tristate "Scalable TCP"
> -	depends on EXPERIMENTAL
> +	depends on TCP_CONG_UNFAIR
>  	default n
>  	---help---
>  	Scalable TCP is a sender-side only change to TCP which uses a
[PATCH] tcp: setsockopt congestion control autoload
If an application asks for a congestion control type with setsockopt(),
the algorithm may be available as a module that is not yet loaded into
the kernel. If the application has permission to load modules, then the
TCP congestion control module should be autoloaded if needed. This is
already done when the default selection is changed with sysctl, but not
when an application requests an algorithm via setsockopt(). Add the
same capability check to the sysctl path as well.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/ipv4/tcp_cong.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/net/ipv4/tcp_cong.c	2006-10-27 10:56:36.0 -0700
+++ b/net/ipv4/tcp_cong.c	2006-10-27 11:09:36.0 -0700
@@ -114,7 +114,7 @@
 	spin_lock(&tcp_cong_list_lock);
 	ca = tcp_ca_find(name);
 #ifdef CONFIG_KMOD
-	if (!ca) {
+	if (!ca && capable(CAP_SYS_MODULE)) {
 		spin_unlock(&tcp_cong_list_lock);
 
 		request_module("tcp_%s", name);
@@ -154,9 +154,19 @@
 
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
+	/* no change asking for existing value */
 	if (ca == icsk->icsk_ca_ops)
 		goto out;
 
+#ifdef CONFIG_KMOD
+	/* not found attempt to autoload module */
+	if (!ca && capable(CAP_SYS_MODULE)) {
+		rcu_read_unlock();
+		request_module("tcp_%s", name);
+		rcu_read_lock();
+		ca = tcp_ca_find(name);
+	}
+#endif
 	if (!ca)
 		err = -ENOENT;
 

Stephen Hemminger [EMAIL PROTECTED]
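For context, the userspace side of the path this patch touches looks roughly like the sketch below: an application passes the algorithm name to setsockopt() with the TCP_CONGESTION option. The wrapper names set_congestion() and get_congestion() are hypothetical; the socket option itself is the standard Linux one. Which names succeed depends on the running kernel and its policy, so the sketch only reports the errno value.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the kernel to use congestion control algorithm `name` on `fd`.
 * Returns 0 on success or a negative errno value (for example -ENOENT
 * when the algorithm is neither built in nor loadable). */
static int set_congestion(int fd, const char *name)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
		       name, (socklen_t)strlen(name)) < 0)
		return -errno;
	return 0;
}

/* Read back the algorithm currently attached to `fd` into `buf`.
 * The caller should pass a zero-initialized buffer. */
static int get_congestion(int fd, char *buf, socklen_t len)
{
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) < 0)
		return -errno;
	return 0;
}
```

With the patch applied, a set_congestion() call naming a modular algorithm would trigger request_module() only when the caller has CAP_SYS_MODULE; otherwise the lookup fails with -ENOENT as before.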
Re: [PATCH 9/13] [SCTP] Merge IPv4 and IPv6 versions of get_saddr() with their corresponding get_dst().
On Tue, 2006-10-17 at 03:19 +0300, Ville Nuorvala wrote:
> As the IPv6 route lookup now also returns the selected source address
> there is no need for a separate source address lookup. In fact, the
> source address selection needs to be moved to get_dst() because the
> selected IPv6 source address isn't always stored in the route.
> Sometimes this makes it impossible to guess the correct address later on.

Ville,

Overall the patch looks pretty good. I found only 1 issue in
sctp_v6_get_dst(). See below.

<snip>

> +/* Returns the dst cache entry for the given source and destination ip
> + * addresses.
> + */
> +static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
> +					 union sctp_addr *daddr,
> +					 union sctp_addr *saddr)
> +{
> +	struct dst_entry *dst;
> +	struct flowi fl;
> +	struct sctp_bind_addr *bp;
> +	rwlock_t *addr_lock;
> +	struct sctp_sockaddr_entry *laddr;
> +	struct list_head *pos;
> +	struct rt6_info *rt;
> +	union sctp_addr baddr;
> +	sctp_scope_t scope;
> +	__u8 matchlen = 0;
> +	__u8 bmatchlen;
> +
> +	memset(&fl, 0, sizeof(fl));
> +	ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
> +	if (ipv6_addr_type(&daddr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
> +		fl.oif = daddr->v6.sin6_scope_id;
> +
> +	ipv6_addr_copy(&fl.fl6_src, &saddr->v6.sin6_addr);
> +	SCTP_DEBUG_PRINTK("%s: DST=" NIP6_FMT " SRC=" NIP6_FMT " ",
> +			  __FUNCTION__, NIP6(fl.fl6_dst), NIP6(fl.fl6_src));
> +
> +	dst = ip6_route_output(NULL, &fl);
> +	if (dst->error) {
> +		dst_release(dst);
> +		dst = NULL;
> +	}
> +	if (!ipv6_addr_any(&saddr->v6.sin6_addr))
> +		goto out;
> +	if (!asoc) {
> +		if (dst)
> +			ipv6_addr_copy(&saddr->v6.sin6_addr, &fl.fl6_src);
> +		goto out;
> +	}
> +	bp = &asoc->base.bind_addr;
> +	addr_lock = &asoc->base.addr_lock;
> +
> +	if (dst) {
> +		/* Walk through the bind address list and look for a bind
> +		 * address that matches the source address of the returned rt.
> +		 */
> +		sctp_v6_fl_saddr(&baddr, &fl, bp->port);

Here we are checking whether the source address returned in the dst
matches one of the addresses in the bind address list of the
association, not the source address that is passed to this routine
(it could be INADDR_ANY). So this should be changed back to
sctp_v6_dst_saddr().

Thanks
Sridhar

> +		sctp_read_lock(addr_lock);
> +		list_for_each(pos, &bp->address_list) {
> +			laddr = list_entry(pos, struct sctp_sockaddr_entry,
> +					   list);
> +			if (!laddr->use_as_src)
> +				continue;
> +			if (sctp_v6_cmp_addr(&baddr, &laddr->a))
> +				goto init_saddr;
> +		}
> +		sctp_read_unlock(addr_lock);
> +
> +		/* Invalid rt or none of the bound addresses match the source
> +		 * address. So release it.
> +		 */
> +		dst_release(dst);
> +		dst = NULL;
> +	}
> +
> +	/* Go through the bind address list and find the best source address
> +	 * that matches the scope of the destination address.
> +	 */
> +	memset(&baddr, 0, sizeof(union sctp_addr));
> +	scope = sctp_scope(daddr);
> +	sctp_read_lock(addr_lock);
> +	list_for_each(pos, &bp->address_list) {
> +		laddr = list_entry(pos, struct sctp_sockaddr_entry, list);
> +
> +		if (!laddr->use_as_src ||
> +		    laddr->a.sa.sa_family != AF_INET6 ||
> +		    scope > sctp_scope(&laddr->a) ||
> +		    (ipv6_addr_type(&laddr->a.v6.sin6_addr) &
> +		     IPV6_ADDR_LINKLOCAL &&
> +		     laddr->a.v6.sin6_scope_id != fl.oif))
> +			continue;
> +
> +		bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
> +		if (!dst || (matchlen < bmatchlen)) {
> +			struct dst_entry *dst2;
> +			ipv6_addr_copy(&fl.fl6_src, &laddr->a.v6.sin6_addr);
> +			dst2 = ip6_route_output(NULL, &fl);
> +			if (dst2->error) {
> +				dst_release(dst2);
> +				dst2 = NULL;
> +				continue;
> +			}
> +			dst_release(dst);
> +			dst = dst2;
> +			memcpy(&baddr, &laddr->a, sizeof(union sctp_addr));
> +			matchlen = bmatchlen;
> +		}
> +	}
> +	if (dst)
> +		goto init_saddr;
> +out_unlock:
> +	sctp_read_unlock(addr_lock);
> +out:
> +	if (dst) {
> +		rt = (struct rt6_info *) dst;
> +		SCTP_DEBUG_PRINTK("SRC=" NIP6_FMT
> +				  "rt6_dst=" NIP6_FMT
> +				  "rt6_src=" NIP6_FMT "\n",
> +
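The second loop in the quoted function picks, among the eligible bound addresses, the one that shares the longest prefix with the destination. The sketch below isolates just that selection step in userspace C. It assumes that sctp_v6_addr_match_len() returns the number of leading bits two addresses have in common; struct addr6, addr_match_len() and pick_source() are hypothetical names, and the scope, family, use_as_src and route-lookup filters of the real loop are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a 128-bit IPv6 address. */
struct addr6 {
	uint8_t b[16];
};

/* Number of leading bits the two addresses have in common (0..128). */
static unsigned int addr_match_len(const struct addr6 *a,
				   const struct addr6 *b)
{
	unsigned int bits = 0;
	size_t i;

	for (i = 0; i < 16; i++) {
		uint8_t diff = a->b[i] ^ b->b[i];

		if (diff == 0) {
			bits += 8;
			continue;
		}
		/* Count the equal high-order bits of the first differing byte. */
		while (!(diff & 0x80)) {
			diff = (uint8_t)(diff << 1);
			bits++;
		}
		break;
	}
	return bits;
}

/* Index of the candidate with the longest common prefix with daddr,
 * or -1 if there are no candidates. Ties keep the earlier candidate. */
static int pick_source(const struct addr6 *daddr,
		       const struct addr6 *cand, size_t n)
{
	int best = -1;
	unsigned int best_len = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned int len = addr_match_len(daddr, &cand[i]);

		if (best < 0 || len > best_len) {
			best = (int)i;
			best_len = len;
		}
	}
	return best;
}
```

The `best < 0 ||` term plays the same role as the `!dst ||` term in the patch: the first usable candidate is accepted even when its match length is zero.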
Re: [PATCH] s2io: add PCI error recovery support
On Fri, Oct 27, 2006 at 07:35:18AM -0400, Ananda Raju wrote:
> Looking at all scenarios I feel the first patch is OK. Can you add the
> watchdog timer fix to first initial patch and resubmit.

Appended below.

So -- just for grins, I thought to myself, "Maybe I can make s2io be
the first adapter ever to fully recover without a hard reset of the
card." ... I couldn't quite make this work. Since the patch below
already works, I didn't see much point in exerting myself further.

--linas

This patch adds PCI error recovery support to the s2io 10-Gigabit
ethernet device driver. Third revision, blocks interrupts and the
watchdog. Tested, seems to work well.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Cc: Raghavendra Koushik [EMAIL PROTECTED]
Cc: Ananda Raju [EMAIL PROTECTED]
Cc: Wen Xiong [EMAIL PROTECTED]

 drivers/net/s2io.c |  121 +
 drivers/net/s2io.h |    5 ++
 2 files changed, 126 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c	2006-10-27 10:49:07.0 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c	2006-10-27 13:55:01.0 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _
 
 MODULE_DEVICE_TABLE(pci, s2io_tbl);
 
+static struct pci_error_handlers s2io_err_handler = {
+	.error_detected = s2io_io_error_detected,
+	.slot_reset = s2io_io_slot_reset,
+	.resume = s2io_io_resume,
+};
+
 static struct pci_driver s2io_driver = {
 	.name = "S2IO",
 	.id_table = s2io_tbl,
 	.probe = s2io_init_nic,
 	.remove = __devexit_p(s2io_rem_nic),
+	.err_handler = &s2io_err_handler,
 };
 
 /* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -3159,6 +3166,11 @@ static void alarm_intr_handler(struct s2
 	register u64 val64 = 0, err_reg = 0;
 	u64 cnt;
 	int i;
+
+	if ((nic->pdev->error_state != pci_channel_io_normal) &&
+	    (nic->pdev->error_state != 0))
+		return;
+
 	nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0;
 	/* Handling the XPAK counters update */
 	if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) {
@@ -4171,6 +4183,11 @@ static irqreturn_t s2io_isr(int irq, voi
 	mac_info_t *mac_control;
 	struct config_param *config;
 
+	/* Pretend we handled any irq's from a disconnected card */
+	if ((sp->pdev->error_state != pci_channel_io_normal) &&
+	    (sp->pdev->error_state != 0))
+		return IRQ_HANDLED;
+
 	atomic_inc(&sp->isr_cnt);
 	mac_control = &sp->mac_control;
 	config = &sp->config;
@@ -7564,3 +7581,107 @@ static void lro_append_pkt(nic_t *sp, lr
 	sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
 	return;
 }
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+					       pci_channel_state_t state)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	nic_t *sp = netdev->priv;
+
+	netif_device_detach(netdev);
+
+	if (netif_running(netdev)) {
+		unsigned long flags;
+
+		/* The following is an abbreviated subset of the
+		 * steps taken by s2io_card_down(), avoiding
+		 * steps that touch the card itself.
+		 */
+		del_timer_sync(&sp->alarm_timer);
+		atomic_set(&sp->card_state, CARD_DOWN);
+
+		/* Kill tasklet. */
+		tasklet_kill(&sp->task);
+
+		/* Free all Tx buffers */
+		spin_lock_irqsave(&sp->tx_lock, flags);
+		free_tx_buffers(sp);
+		spin_unlock_irqrestore(&sp->tx_lock, flags);
+
+		/* Free all Rx buffers */
+		spin_lock_irqsave(&sp->rx_lock, flags);
+		free_rx_buffers(sp);
+		spin_unlock_irqrestore(&sp->rx_lock, flags);
+
+		clear_bit(0, &(sp->link_state));
+		sp->device_close_flag = TRUE; /* Device is shut down. */
+	}
+	pci_disable_device(pdev);
+
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ * At this point, the card has experienced a hard reset,
+ * followed by fixups by BIOS, and has its config space
+ * set up identically to what it was at cold boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	nic_t *sp = netdev->priv;
+
+	if
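The guard that the patch adds to the interrupt and alarm handlers can be modelled in a few lines of userspace C. Everything below is a stand-in: the enum values, struct fake_nic and fake_isr() are hypothetical and only mirror the shape of the check, which treats the channel as offline when the state is neither "normal" nor the never-initialized value 0.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's channel states. */
enum channel_state {
	CH_UNSET = 0,		/* field never initialized */
	CH_NORMAL = 1,		/* I/O channel is working */
	CH_FROZEN = 2,		/* I/O to the device is blocked */
	CH_PERM_FAILURE = 3	/* device is considered dead */
};

enum irq_result { FAKE_IRQ_NONE, FAKE_IRQ_HANDLED };

struct fake_nic {
	enum channel_state error_state;
	int isr_cnt;		/* counts interrupts that did real work */
};

/* Same shape as the test in the patch: offline means neither normal
 * nor the uninitialized value. */
static int channel_offline(const struct fake_nic *nic)
{
	return nic->error_state != CH_NORMAL && nic->error_state != CH_UNSET;
}

/* While the channel is offline, claim the interrupt but do not touch
 * the (unreachable) hardware. */
static enum irq_result fake_isr(struct fake_nic *nic)
{
	if (channel_offline(nic))
		return FAKE_IRQ_HANDLED;

	nic->isr_cnt++;		/* stands in for the real register work */
	return FAKE_IRQ_HANDLED;
}
```

Returning "handled" rather than "none" matters on a shared interrupt line: reporting unhandled interrupts repeatedly could lead the kernel to disable the line for every device on it.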
Re: [openib-general] [PATCH 1/5] NetEffect 10Gb RNIC Userspace Library: userspace config generation
> I don't think the userspace stuff belongs on netdev. Someone please
> correct me if I'm wrong.

Yeah, it's not a bad thing to get wider review, but your userspace
library is pretty much your business. If you screw it up it doesn't
hurt anyone else, so I'm happy to let you write it however you want.

 - R.
Re: [openib-general] [PATCH 1/5] NetEffect 10Gb RNIC Userspace Library: userspace config generation
On Fri, 27 Oct 2006 10:56:45 -0700 Roland Dreier [EMAIL PROTECTED] wrote:

> > I don't think the userspace stuff belongs on netdev. Someone please
> > correct me if I'm wrong.
>
> Yeah, it's not a bad thing to get wider review, but your userspace
> library is pretty much your business. If you screw it up it doesn't
> hurt anyone else, so I'm happy to let you write it however you want.
>
>  - R.

I prefer a pointer to the project download source. Seeing the
userspace stuff helps answer questions where the administration
process is confusing (or could/should be done differently).

-- 
Stephen Hemminger [EMAIL PROTECTED]
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 10:30:16 -0700

> My proposed method of restricting TCP choices to fair algorithms.
> This is a net-wide, not a system-wide, issue; it should not be done by
> a kernel policy choice (capability), but by a build choice.

I think this sucks even worse than the current situation.

How difficult is it to understand that an administrator might like to
be able to build in and experiment with some congestion control
algorithms, yet still be able to keep his normal users from using them?
Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 07:41:02 -0700

> Please no, it makes the socket option useless. If you want to tag some
> bad apples that's okay, but would need some more infrastructure.

The behavior of the TCP stack is a system-wide decision.

If anything, everything besides the default and Reno should be
off-limits to unprivileged users, with an administrative method to
override that.
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
On Fri, 27 Oct 2006 14:17:49 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:

> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Fri, 27 Oct 2006 10:30:16 -0700
>
> > My proposed method of restricting TCP choices to fair algorithms.
> > This is a net-wide, not a system-wide, issue; it should not be done by
> > a kernel policy choice (capability), but by a build choice.
>
> I think this sucks even worse than the current situation.
>
> How difficult is it to understand that an administrator might like to
> be able to build in and experiment with some congestion control
> algorithms, yet still be able to keep his normal users from using them?

Only some (very few) have any bad consequences. So the typical
distribution should be able to switch with most available for
everyone, and only a few needing special privileges.

-- 
Stephen Hemminger [EMAIL PROTECTED]
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 14:24:02 -0700

> Only some (very few) have any bad consequences. So the typical
> distribution should be able to switch with most available for
> everyone, and only a few needing special privileges.

I would strongly disagree as we've had several OOPS'er class bugs in
the less frequently used algorithms.

I stand by my position that an administrator's wish to do this is
quite valid.

It's bad enough that people are all over us for the default algorithm
we have chosen, so it'd be extremely irresponsible and even worse if
we allowed users to select any of the other research algorithms for
their TCP connections by default just because those modules happened
to be configured into the kernel.

This userspace convenience argument holds zero water. Provide a way
for the administrator to control the situation fully, and choose a
sane default which errs on the side of caution for the sake of
internet stability.
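The policy argued for here (everything except the default and Reno is off-limits to unprivileged users unless the administrator says otherwise) can be sketched as a small lookup table with a per-algorithm flag. This is an illustration of the idea only, not code from any of the patches in this thread: the table contents, the choice of "bic" as the default, and the names ca_find(), ca_set_allowed() and ca_select() are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct ca_entry {
	const char *name;
	int non_restricted;	/* 1: unprivileged users may select it */
};

/* Hypothetical table: only Reno and the (assumed) default start out open. */
static struct ca_entry ca_table[] = {
	{ "reno",      1 },
	{ "bic",       1 },	/* treated as the default for illustration */
	{ "scalable",  0 },
	{ "highspeed", 0 },
};

static struct ca_entry *ca_find(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(ca_table) / sizeof(ca_table[0]); i++)
		if (strcmp(ca_table[i].name, name) == 0)
			return &ca_table[i];
	return NULL;
}

/* Administrator override: open up or lock down a single algorithm. */
static int ca_set_allowed(const char *name, int allowed)
{
	struct ca_entry *ca = ca_find(name);

	if (!ca)
		return -ENOENT;
	ca->non_restricted = allowed ? 1 : 0;
	return 0;
}

/* Decide whether a caller may select `name`. `privileged` stands in
 * for a capability check such as CAP_NET_ADMIN. */
static int ca_select(const char *name, int privileged)
{
	struct ca_entry *ca = ca_find(name);

	if (!ca)
		return -ENOENT;
	if (!ca->non_restricted && !privileged)
		return -EPERM;
	return 0;
}
```

The default is restrictive and the administrator widens it explicitly, which is the "errs on the side of caution" ordering requested above; the opposite position in the thread would start with most flags set and clear only a few.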
2.4/2.6 share in linux routers ?
Hello,

I'd like to find/gather estimates about the 2.4 vs. 2.6 share in
[small] Linux routers in 2006. Can anyone offer estimates and/or
references?

My own estimate is that a definite majority is 2.4 (I'd say 75% for
2.4) in small Linux routers in 2006. Can anyone offer support or
correction?

Which factors make 2.4 or 2.6 more attractive for a small Linux router
(128-256 MB RAM)?

Yakov Lerner
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
On Fri, 27 Oct 2006 14:37:01 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:

> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Fri, 27 Oct 2006 14:24:02 -0700
>
> > Only some (very few) have any bad consequences. So the typical
> > distribution should be able to switch with most available for
> > everyone, and only a few needing special privileges.
>
> I would strongly disagree as we've had several OOPS'er class bugs in
> the less frequently used algorithms.

Then tag those as restricted. Why should we keep apps away from the
simple ones?
Re: 2.4/2.6 share in linux routers ?
Please stop all of this cross posting. I've just seen you post this
same exact email on the netfilter lists too.
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 14:59:13 -0700

> On Fri, 27 Oct 2006 14:37:01 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:
>
> > From: Stephen Hemminger [EMAIL PROTECTED]
> > Date: Fri, 27 Oct 2006 14:24:02 -0700
> >
> > > Only some (very few) have any bad consequences. So the typical
> > > distribution should be able to switch with most available for
> > > everyone, and only a few needing special privileges.
> >
> > I would strongly disagree as we've had several OOPS'er class bugs in
> > the less frequently used algorithms.
>
> Then tag those as restricted. Why should we keep apps away from the
> simple ones?

You can't predict bugs, but what you can do is know that the
lesser-used algorithms are by definition less tested and therefore
more likely to have bugs. Everything except the default and Reno is
lesser used.

Safe by default, there is no other choice.

You fail to respond to THAT part of my email. That's the important
point. Let me reiterate:

  It's bad enough that people are all over us for the default algorithm
  we have chosen, so it'd be extremely irresponsible and even worse if
  we allowed users to select any of the other research algorithms for
  their TCP connections by default just because those modules happened
  to be configured into the kernel.

  This userspace convenience argument holds zero water. Provide a way
  for the administrator to control the situation fully, and choose a
  sane default which errs on the side of caution for the sake of
  internet stability.

Please reread this and consider why it's important.
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
On Fri, 27 Oct 2006 15:12:38 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:

> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Fri, 27 Oct 2006 14:59:13 -0700
>
> > On Fri, 27 Oct 2006 14:37:01 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:
> >
> > > From: Stephen Hemminger [EMAIL PROTECTED]
> > > Date: Fri, 27 Oct 2006 14:24:02 -0700
> > >
> > > > Only some (very few) have any bad consequences. So the typical distribution should be able to switch with most available for everyone, and only a few needing special privileges.
> > >
> > > I would strongly disagree as we've had several OOPS'er class bugs in the less frequently used algorithms.
> >
> > Then tag those as restricted. Why should we keep apps away from the simple ones?
>
> You can't predict bugs, but what you can do is know that the lesser used algorithms are by definition less tested and therefore more likely to have bugs.
>
> Everything except the default and Reno are lesser used.

If they aren't usable they should be marked BROKEN or something like that. The stability argument doesn't really work; we don't like to let root kill the system either.

> > > Safe by default, there is no other choice.
>
> You fail to respond to THAT part of my email. That's the important point.
>
> Let me reiterate:
>
> It's bad enough that people are all over us for the default algorithm we have chosen, so it'd be extremely irresponsible and even worse if we allowed users to select any of the other research algorithms for their TCP connections by default just because those modules happened to be configured into the kernel.

Make it hard for them to configure then. I don't want your distro to ship with the risky ones turned on. But we should allow use of reno, bic, cubic, lp, htcp, and westwood (maybe) by regular users if admin allows.

> This userspace convenience argument holds zero water. Provide a way for the administrator to control the situation fully, and choose a sane default which errs on the side of caution for the sake of internet stability.
>
> Please reread this and consider why it's important.

The current situation is fine. You have to ask for them in the configuration, and root has to either load the module or set it as default. The restricted flag patch, which you have ignored, would be a way to allow them to be configured but tag the bad apples for root-only usage.
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 15:21:49 -0700

> The restricted flag patch, which you have ignored, would be a way to allow them to be configured but tag the bad apples for root-only usage.

I haven't ignored it; it's in my backlog, below more important things like Appletalk OOPS'ers.
Re: 2.4/2.6 share in linux routers ?
On 10/28/06, David Miller [EMAIL PROTECTED] wrote:

> Please stop all of this cross posting. I've just seen you post this same exact email on the netfilter lists too.

Sorry
Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning
How about another way of controlling this via sysctl?

First, add a read-only file, /proc/sys/net/ipv4/tcp_available_congestion_control (or a shorter name), that shows everything compiled in (even if not loaded yet). This is similar to /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.

Second, add a flag (allowed) to the tcp_congestion structure [the inverse of the earlier restricted flag].

Third, add a read-write /proc/sys/net/ipv4/tcp_allowed_congestion_control to show and set/clear the allowed flag. The default value would be "reno xxx", where xxx is whatever the default from the kernel config is (currently cubic).

I would use sysfs for this, but it makes sense not to spread TCP stuff across both sysctl and sysfs.
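The second and third steps could be modelled in userspace roughly as follows. This is only a sketch of the idea, not kernel code: `struct ca_entry`, `ca_find`, `ca_set_allowed` and `ca_may_use` are hypothetical names, the table stands in for the kernel's list of registered algorithms, and the rule that root may use any registered algorithm is an assumption taken from the "root-only usage" wording of the earlier restricted-flag proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CA_NAME_MAX 16

/* Hypothetical stand-in for one registered congestion control algorithm. */
struct ca_entry {
    char name[CA_NAME_MAX];
    bool allowed;               /* may unprivileged users select it? */
};

/* Return the entry called `name`, or NULL if it is not registered. */
struct ca_entry *ca_find(struct ca_entry *tbl, size_t n, const char *name)
{
    assert(tbl != NULL && name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(tbl[i].name, name) == 0)
            return &tbl[i];
    return NULL;
}

/* Walk a space-separated list; optionally set the flag on each entry. */
static int ca_walk_list(struct ca_entry *tbl, size_t n,
                        const char *list, bool commit)
{
    const char *p = list;

    for (;;) {
        char name[CA_NAME_MAX];
        struct ca_entry *ca;
        size_t len;

        p += strspn(p, " \n");          /* skip separators */
        if (*p == '\0')
            return 0;
        len = strcspn(p, " \n");        /* length of the next name */
        if (len >= sizeof(name))
            return -1;
        memcpy(name, p, len);
        name[len] = '\0';
        ca = ca_find(tbl, n, name);
        if (ca == NULL)
            return -1;                  /* unknown algorithm */
        if (commit)
            ca->allowed = true;
        p += len;
    }
}

/*
 * Model a write to the proposed tcp_allowed_congestion_control file.
 * The list is validated first, so a bad name leaves the flags untouched.
 */
int ca_set_allowed(struct ca_entry *tbl, size_t n, const char *list)
{
    if (ca_walk_list(tbl, n, list, false) != 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        tbl[i].allowed = false;
    return ca_walk_list(tbl, n, list, true);
}

/* Assumed policy: root may pick anything registered, others need the flag. */
bool ca_may_use(struct ca_entry *tbl, size_t n, const char *name, bool is_root)
{
    const struct ca_entry *ca = ca_find(tbl, n, name);

    return ca != NULL && (is_root || ca->allowed);
}
```

With a table holding reno, cubic, bic and vegas, calling `ca_set_allowed(tbl, 4, "reno cubic")` would correspond to the proposed default: ordinary users could then select reno or cubic, while vegas would remain available to root only.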
Re: [PATCH 2.6.19-rc3 v2 1/2] amso1100 - Use dma_alloc_coherent instead of kmalloc/dma_map_single.
tsk, tsk:

fatal: 7 lines add trailing whitespaces.

Applied to for-2.6.19 anyway, thanks.
Re: [PATCH 2.6.19-rc3 v2 2/2] amso1100 - Fix incorrect pr_debug().
Applied, thanks.
[RFC] tcp: available congestion control
Nice way to see what congestion control modules are loaded. It does impose a soft limit of 32 possibilities.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 include/linux/sysctl.h     |    1 +
 include/net/tcp.h          |    3 +++
 net/ipv4/sysctl_net_ipv4.c |   25 ++++++++++++++++++++++++-
 net/ipv4/tcp_cong.c        |   14 ++++++++++++++
 4 files changed, 42 insertions(+), 1 deletion(-)

--- skge.orig/include/linux/sysctl.h
+++ skge/include/linux/sysctl.h
@@ -418,6 +418,7 @@ enum
 	NET_CIPSOV4_CACHE_BUCKET_SIZE=119,
 	NET_CIPSOV4_RBM_OPTFMT=120,
 	NET_CIPSOV4_RBM_STRICTVALID=121,
+	NET_TCP_AVAIL_CONG_CONTROL=122,
 };

 enum {
--- skge.orig/include/net/tcp.h
+++ skge/include/net/tcp.h
@@ -621,6 +621,8 @@ enum tcp_ca_event {
  * Interface for adding new TCP congestion control handlers
  */
 #define TCP_CA_NAME_MAX	16
+#define TCP_CA_MAX	32
+
 struct tcp_congestion_ops {
 	struct list_head	list;

@@ -659,6 +661,7 @@ extern void tcp_unregister_congestion_co
 extern void tcp_init_congestion_control(struct sock *sk);
 extern void tcp_cleanup_congestion_control(struct sock *sk);
 extern int tcp_set_default_congestion_control(const char *name);
+extern void tcp_get_available_congestion_control(char *name, int maxlen);
 extern void tcp_get_default_congestion_control(char *name);
 extern int tcp_set_congestion_control(struct sock *sk, const char *name);
 extern void tcp_slow_start(struct tcp_sock *tp);
--- skge.orig/net/ipv4/sysctl_net_ipv4.c
+++ skge/net/ipv4/sysctl_net_ipv4.c
@@ -108,6 +108,22 @@ static int proc_tcp_congestion_control(c
 	return ret;
 }

+static int proc_tcp_available_congestion_control(ctl_table *ctl,
+						 int write, struct file * filp,
+						 void __user *buffer, size_t *lenp,
+						 loff_t *ppos)
+{
+	char val[TCP_CA_MAX*(TCP_CA_NAME_MAX+1)];
+	ctl_table tbl = {
+		.data = val,
+		.maxlen = TCP_CA_MAX*(TCP_CA_NAME_MAX+1),
+	};
+
+	tcp_get_available_congestion_control(val, tbl.maxlen);
+
+	return proc_dostring(&tbl, write, filp, buffer, lenp, ppos);
+}
+
 static int sysctl_tcp_congestion_control(ctl_table *table, int __user *name,
 					 int nlen, void __user *oldval,
 					 size_t __user *oldlenp,
@@ -133,9 +149,9 @@ static int __init tcp_congestion_default
 {
 	return tcp_set_default_congestion_control(CONFIG_DEFAULT_TCP_CONG);
 }
-
 late_initcall(tcp_congestion_default);
+

 ctl_table ipv4_table[] = {
 	{
 		.ctl_name = NET_IPV4_TCP_TIMESTAMPS,
@@ -738,6 +754,13 @@ ctl_table ipv4_table[] = {
 		.proc_handler = proc_dointvec,
 	},
 #endif /* CONFIG_NETLABEL */
+	{
+		.ctl_name = NET_TCP_AVAIL_CONG_CONTROL,
+		.procname = "tcp_available_congestion_control",
+		.mode = 0444,
+		.maxlen = TCP_CA_MAX*(TCP_CA_NAME_MAX+1),
+		.proc_handler = proc_tcp_available_congestion_control,
+	},
 	{ .ctl_name = 0 }
 };

--- skge.orig/net/ipv4/tcp_cong.c
+++ skge/net/ipv4/tcp_cong.c
@@ -144,6 +144,20 @@ void tcp_get_default_congestion_control(
 	rcu_read_unlock();
 }

+/* Build string with list of available congestion control values */
+void tcp_get_available_congestion_control(char *name, int maxlen)
+{
+	struct tcp_congestion_ops *ca;
+	int offs = 0;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ca, &tcp_cong_list, list) {
+		offs += snprintf(name + offs, maxlen - offs, "%s%s",
+				 offs == 0 ? "" : " ", ca->name);
+	}
+	rcu_read_unlock();
+}
+
 /* Change congestion control for socket */
 int tcp_set_congestion_control(struct sock *sk, const char *name)
 {
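The soft limit of 32 follows from the buffer size, TCP_CA_MAX*(TCP_CA_NAME_MAX+1). As a rough userspace illustration of the same join loop, the sketch below builds the space-separated list with snprintf. Because snprintf returns the length the output would have had, the sketch stops at the first name that does not fit rather than advancing the offset past the end of the buffer. `join_names` and its parameters are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Join `count` names into `buf`, separated by single spaces.
 * Stops at the first name that does not fit completely, so `buf`
 * always holds whole names and stays NUL-terminated.
 * Returns the number of names written.
 */
size_t join_names(char *buf, size_t maxlen,
                  const char *const *names, size_t count)
{
    size_t offs = 0, written = 0;

    assert(buf != NULL && maxlen > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        /* snprintf reports the untruncated length, not the bytes stored */
        int n = snprintf(buf + offs, maxlen - offs, "%s%s",
                         offs == 0 ? "" : " ", names[i]);

        if (n < 0 || (size_t)n >= maxlen - offs) {
            buf[offs] = '\0';   /* drop the partially written name */
            break;
        }
        offs += (size_t)n;
        written++;
    }
    return written;
}
```

Joining reno, cubic and bic into an 11-byte buffer, for example, would keep "reno cubic" and report two names written, since " bic" no longer fits.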