Re: Network virtualization/isolation

2006-10-27 Thread Dmitry Mishin
On Thursday 26 October 2006 19:56, Stephen Hemminger wrote:
> On Thu, 26 Oct 2006 11:44:55 +0200
>
> Daniel Lezcano <[EMAIL PROTECTED]> wrote:
> > Stephen Hemminger wrote:
> > > On Wed, 25 Oct 2006 17:51:28 +0200
> > >
> > > Daniel Lezcano <[EMAIL PROTECTED]> wrote:
> > >>Hi Stephen,
> > >>
> > >>currently the work to make the container enablement into the kernel is
> > >>doing good progress. The ipc, pid, utsname and filesystem system
> > >>ressources are isolated/virtualized relying on the namespaces concept.
> > >>
> > >>But, there is missing the network virtualization/isolation. Two
> > >>approaches are proposed: doing the isolation at the layer 2 and at the
> > >>layer 3.
> > >>
> > >>The first one instanciate a network device by namespace and add a peer
> > >>network device into the "root namespace", all the routing ressources
> > >> are relative to the namespace. This work is done by Andrey Savochkin
> > >> from the openvz project.
> > >>
> > >>The second relies on the routes and associates the network namespace
> > >>pointer with each route. When the traffic is incoming, the packet
> > >>follows an input route and retrieve the associated network namespace.
> > >>When the traffic is outgoing, the packet, identified from the network
> > >>namespace is coming from, follows only the routes matching the same
> > >>network namespace. This work is made by me.
> > >>
> > >>IMHO, we need the two approach, the layer-2 to be able to bring *very*
> > >>strong isolation for system container with a performance cost and a
> > >>layer-3 to be able to have good isolation for lightweight container or
> > >>application container when performances are more important.
> > >>
> > >>Do you have some suggestions ? What is your point of view on that ?
> > >>
> > >>Thanks in advance.
> > >>
> > >>   -- Daniel
> > >
> > > Any solution should allow both and it should build on the existing
> > > netfilter infrastructure.
> >
> > The problem is netfilter can not give a good isolation, eg. how can be
> > handled netstat command ? or avoid to see IP addresses assigned to
> > another container when doing ifconfig ? Furthermore, one of the biggest
> > interest of the network isolation is to bring mobility with a container
> > and that can only be done if the network ressources inside the kernel
> > can be identified by container in order to checkpoint/restart them.
> >
> > The all-in-namespace solution, ie. at layer 2, is very good in terms of
> > isolation but it adds an non-negligeable overhead. The layer 3 isolation
> >   has an insignifiant overhead, a good isolation perfectly adapted for
> > applications containers.
> >
> > Unfortunatly, from the point of view of implementation, layer 3 can not
> > be a subset of layer 2 isolation when using "all-in-namespace" and layer
> > 2 isolation can not be a extension of the layer 3 isolation.
> >
> > I think the layer 2 and the layer 3 implementations can coexists. You
> > can for example create a system container with a layer 2 isolation and
> > inside it add a layer 3 isolation.
> >
> > Does that make sense ?
> >
> > -- Daniel
>
> Assuming you are talking about pseudo-virtualized environments,
> there are several different discussions.
>
> 1. How should the namespace be isolated for the virtualized containered
>applications?
>
> 2. How should traffic be restricted into/out of those containers. This
>is where existing netfilter, classification, etc, should be used.
>The network code is overly rich as it is, we don't need another
>abstraction.
>
> 3. Can the virtualized containers be secure? No. we really can't keep
>hostile root in a container from killing system without going to
>a hypervisor.
Stephen,

A virtualized container can be secure if it is a complete system
virtualization, not just an application container. OpenVZ implements this and
is used heavily around the world. And of course, we take great care to keep a
hostile root from killing the whole system.

OpenVZ uses virtualization at the IP level (implemented by Andrey Savochkin,
http://marc.theaimsgroup.com/?l=linux-netdev&m=115572448503723), with all
necessary network objects isolated/virtualized, such as sockets, devices,
routes, netfilters, etc.

-- 
Thanks,
Dmitry.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-10-27 Thread Daniel Lezcano


[ ... ]

Dmitry Mishin wrote:
> Stephen,
>
> A virtualized container can be secure if it is a complete system
> virtualization, not just an application container. OpenVZ implements this and
> is used heavily around the world. And of course, we take great care to keep a
> hostile root from killing the whole system.

OpenVZ power !!

> OpenVZ uses virtualization at the IP level (implemented by Andrey Savochkin,
> http://marc.theaimsgroup.com/?l=linux-netdev&m=115572448503723), with all
> necessary network objects isolated/virtualized, such as sockets, devices,
> routes, netfilters, etc.

No, it uses virtualization at layer 2, and I had already mentioned it
before (see the first email of the thread), but thank you for the email
thread pointer.


The point of the discussion is not to convince Stephen that layer 2 or
layer 3 is best, but to present the pros and cons of each solution and to
get the point of view of a network guru.


Regards.

-- Daniel






Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Hagen Paul Pfeifer
* David Miller | 2006-10-26 17:02:21 [-0700]:

>Your email client turned the tabs into spaces in the patch making it
>useless.

Sorry, my mistake! I am traveling and pasted the patch into my editor, which
ate all the tabs. One more time: sorry!


Check if user has CAP_NET_ADMIN capability to change
congestion control algorithm.


Signed-off-by: Hagen Paul Pfeifer <[EMAIL PROTECTED]>

---
 net/ipv4/tcp_cong.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index af0aca1..c1ae2e9 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -10,6 +10,7 @@ #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -151,6 +152,9 @@ int tcp_set_congestion_control(struct so
 	struct tcp_congestion_ops *ca;
 	int err = 0;
 
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
 	if (ca == icsk->icsk_ca_ops)
-- 
1.4.1.1


Re: [PATCH 2.6.19-rc3 1/2] ehea: kzalloc GFP_ATOMIC fix

2006-10-27 Thread Christoph Raisch

Andrew Morton <[EMAIL PROTECTED]> wrote on 27.10.2006 05:13:13:

> On Wed, 25 Oct 2006 13:11:42 +0200
> Jan-Bernd Themann <[EMAIL PROTECTED]> wrote:
>
> > This patch fixes kzalloc parameters (GFP_ATOMIC instead of GFP_KERNEL)
>
> why?


These few kzallocs can run in atomic context in some situations, so
GFP_KERNEL is not a good idea there: a GFP_KERNEL allocation may sleep.

Christoph R.



RE: [PATCH] s2io: add PCI error recovery support

2006-10-27 Thread Ananda Raju
Having looked at all the scenarios, I think the first patch is OK. Can you add
the watchdog timer fix to that first patch and resubmit?

-----Original Message-----
From: Linas Vepstas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 26, 2006 3:52 PM
To: Ananda Raju
Cc: Wen Xiong; linux-kernel@vger.kernel.org;
[EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik;
Andrew Morton
Subject: Re: [PATCH] s2io: add PCI error recovery support

Hi.

On Thu, Oct 26, 2006 at 05:56:34AM -0400, Ananda Raju wrote:
> Hi, 
> Can you try attached patch. The attached patch is simple. We set card
> state as down in error_detecct() so that all entry points return error
> and don't proceed further.
> 
> In slot_reset() we do s2io_card_down() will reset adapter. 
> In io_resume() we bringup the driver. 

Simplicity is always better. However, some questions/comments:

> @@ -4175,6 +4186,10 @@ static irqreturn_t s2io_isr(int irq, voi
>   mac_info_t *mac_control;
>   struct config_param *config;
>  
> + if (atomic_read(&sp->card_state) == CARD_DOWN) {
> + return IRQ_NONE;
> + }

I used

    if (sp->pdev->error_state != pci_channel_io_normal)

here for a reason: pdev->error_state is set even in interrupt
context, that is, it gets set even if interrupts are disabled, and
so it represents the actual state immediately. By contrast, the
error callbacks do not get called until possibly much later,
and so sp->card_state = CARD_DOWN might not get set for a while.

If, for any reason, e.g. some obscure corner case, the s2io
generates zillions of interrupts, this could result in a soft lockup.
I actually saw this in the symbios device driver, which will
regenerate an interrupt until it's acknowledged -- and so it
sat there, spinning. :-(

I was returning IRQ_HANDLED instead of IRQ_NONE, so as to avoid
falling into handle_bad_irq() or report_bad_irq(). I haven't
seen this happen on s2io, but thought it would still be wise.

If this can't happen, then there's no problem here.
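
The difference between the two checks can be sketched in plain userspace C.
This is only a toy model of the argument above, not driver code: every type
and function name below (model_nic, isr_check_card_state, and so on) is
hypothetical, and the enum values merely echo the kernel's
pci_channel_io_normal and IRQ_NONE/IRQ_HANDLED. It assumes the error state
is updated as soon as the error is detected, while the card_down flag is
only updated later by the error callback.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's channel state and IRQ return codes. */
enum model_channel_state { MODEL_IO_NORMAL, MODEL_IO_FROZEN };
enum model_irqreturn { MODEL_IRQ_NONE, MODEL_IRQ_HANDLED };

struct model_nic {
	enum model_channel_state error_state; /* set as soon as the error is detected */
	bool card_down;                       /* set later, by the error callback */
	unsigned long serviced;               /* interrupts that reached normal processing */
};

/* Variant 1: test the driver's own card state, return "not mine" when down. */
static enum model_irqreturn isr_check_card_state(struct model_nic *nic)
{
	if (nic->card_down)
		return MODEL_IRQ_NONE;
	nic->serviced++;	/* a real handler would touch device registers here */
	return MODEL_IRQ_HANDLED;
}

/* Variant 2: test the channel state and claim the interrupt without servicing it. */
static enum model_irqreturn isr_check_channel_state(struct model_nic *nic)
{
	if (nic->error_state != MODEL_IO_NORMAL)
		return MODEL_IRQ_HANDLED;
	nic->serviced++;
	return MODEL_IRQ_HANDLED;
}
```

In the window where the error has been detected but the callback has not run
yet, variant 1 still goes on to service the interrupt, while variant 2
already backs off.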

> +/**
> + * s2io_io_slot_reset - called after the pci bus has been reset.
> + * @pdev: Pointer to PCI device
> + *
> + * Restart the card from scratch, as if from a cold-boot.
> + */
> +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
> +{

At this point, the card has just experienced a hardware reset
(the #RST wire was held low for 250 millisecs, followed by
a settle time of 2 seconds, followed by whatever BIOS thinks
it needed to do, followed by a restore of the pci config space
to what it was after a cold boot). So the card is in a "fresh"
state; in theory it's identical to a cold boot. So ...
are you sure you want to "down" at this point?

> + s2io_card_down(sp);
> + sp->device_close_flag = TRUE;   /* Device is shut down.
*/


One problem I'm having is that the watchdog timer sometimes
pops and tries to reset the card before s2io_card_down()
has a chance to run. I fixed this ... 

==
So -- just for grins, I thought to myself, "Maybe I can make 
s2io be the first adapter ever to fully recover without 
a hard reset of the card."

The idea is simple:

1) enable MMIO
2) call s2io_card_down()
3) enable DMA
4) call s2io_card_up()

I have a patch that does this, but then hit a few more snags.
I haven't yet nailed down all the trouble spots, maybe tomorrow.

--linas




Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 12:43:11 +0200
Hagen Paul Pfeifer <[EMAIL PROTECTED]> wrote:

> * David Miller | 2006-10-26 17:02:21 [-0700]:
> 
> >Your email client turned the tabs into spaces in the patch making it
> >useless.
> 
> Sorry my mistake! I am en route and I paste the patch into my editor, who eat
> all tabs. One more time: sorry!
> 
> 
> Check if user has CAP_NET_ADMIN capability to change
> congestion control algorithm.
> 
> 
> Signed-off-by: Hagen Paul Pfeifer <[EMAIL PROTECTED]>

Please no, it makes the socket option useless.
If you want to tag some "bad apples" that's okay, but it would need
some more infrastructure.
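
For readers who have not used the socket option under discussion, here is a
minimal userspace sketch, assuming a Linux system whose <netinet/tcp.h>
defines TCP_CONGESTION. The two helper names are made up for this example;
only socket(), getsockopt() and setsockopt() are real interfaces. It shows
the per-socket selection that the proposed capability check would restrict
to privileged processes.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: copy the name of the congestion control algorithm
 * attached to fd into buf. Returns 0 on success, -errno on failure. */
static int get_congestion_control(int fd, char *buf, size_t buflen)
{
	socklen_t optlen;

	if (buflen < 2)
		return -EINVAL;
	memset(buf, 0, buflen);
	optlen = (socklen_t)(buflen - 1);	/* keep the last byte as NUL */
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &optlen) < 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: ask the kernel to use the named algorithm on fd.
 * Returns 0 on success, -errno on failure (for example -EPERM if the
 * kernel refuses the request for this process). */
static int set_congestion_control(int fd, const char *name)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
		       (socklen_t)strlen(name)) < 0)
		return -errno;
	return 0;
}
```

An application would call set_congestion_control(fd, "reno") on a TCP
socket before connecting; with the patch applied as posted, that call
would fail with -EPERM for any process lacking CAP_NET_ADMIN.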


Re: [PATCH] Rewrite e100_phys_id

2006-10-27 Thread Auke Kok

Matthew Wilcox wrote:
> On Thu, Oct 26, 2006 at 01:04:32PM -0700, Auke Kok wrote:
>> no objections, so I'll ACK it with the notion that I'm going to let our
>> labs do some more testing on it with all the latest changes to it.
>
> Thanks, Auke.  Here's the equivalent patch for e1000.  I don't have a
> convenient machine to test it on, but it reduces the size of the driver
> by 1.5k.

this is a bit (!) more complex than e100, so I'm going to take a bit of time
to review this patch.


thanks,

Auke





diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 7ecce43..1e22da6 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -257,9 +257,6 @@ #endif
 	struct work_struct reset_task;
 	uint8_t fc_autoneg;
 
-	struct timer_list blink_timer;
-	unsigned long led_status;
-
 	/* TX */
 	struct e1000_tx_ring *tx_ring;  /* One per active queue */
 	unsigned long tx_queue_len;
diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c
index 773821e..620afa5 100644
--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -1819,61 +1819,15 @@ e1000_set_wol(struct net_device *netdev,
 	return 0;
 }
 
-/* toggle LED 4 times per second = 2 "blinks" per second */
-#define E1000_ID_INTERVAL	(HZ/4)
-
-/* bit defines for adapter->led_status */
-#define E1000_LED_ON		0
-
-static void
-e1000_led_blink_callback(unsigned long data)
-{
-	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
-
-	if (test_and_change_bit(E1000_LED_ON, &adapter->led_status))
-		e1000_led_off(&adapter->hw);
-	else
-		e1000_led_on(&adapter->hw);
-
-	mod_timer(&adapter->blink_timer, jiffies + E1000_ID_INTERVAL);
-}
-
 static int
 e1000_phys_id(struct net_device *netdev, uint32_t data)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 
-	if (!data || data > (uint32_t)(MAX_SCHEDULE_TIMEOUT / HZ))
-		data = (uint32_t)(MAX_SCHEDULE_TIMEOUT / HZ);
-
-	if (adapter->hw.mac_type < e1000_82571) {
-		if (!adapter->blink_timer.function) {
-			init_timer(&adapter->blink_timer);
-			adapter->blink_timer.function = e1000_led_blink_callback;
-			adapter->blink_timer.data = (unsigned long) adapter;
-		}
-		e1000_setup_led(&adapter->hw);
-		mod_timer(&adapter->blink_timer, jiffies);
-		msleep_interruptible(data * 1000);
-		del_timer_sync(&adapter->blink_timer);
-	} else if (adapter->hw.phy_type == e1000_phy_ife) {
-		if (!adapter->blink_timer.function) {
-			init_timer(&adapter->blink_timer);
-			adapter->blink_timer.function = e1000_led_blink_callback;
-			adapter->blink_timer.data = (unsigned long) adapter;
-		}
-		mod_timer(&adapter->blink_timer, jiffies);
-		msleep_interruptible(data * 1000);
-		del_timer_sync(&adapter->blink_timer);
-		e1000_write_phy_reg(&(adapter->hw), IFE_PHY_SPECIAL_CONTROL_LED, 0);
-	} else {
-		e1000_blink_led_start(&adapter->hw);
-		msleep_interruptible(data * 1000);
-	}
+	if (data == 0)
+		data = 2;
 
-	e1000_led_off(&adapter->hw);
-	clear_bit(E1000_LED_ON, &adapter->led_status);
-	e1000_cleanup_led(&adapter->hw);
+	e1000_blink_led(&adapter->hw, data);
 
 	return 0;
 }
diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
index 65077f3..db5e999 100644
--- a/drivers/net/e1000/e1000_hw.c
+++ b/drivers/net/e1000/e1000_hw.c
@@ -6071,7 +6071,7 @@ e1000_id_led_init(struct e1000_hw * hw)
  *
  * hw - Struct containing variables accessed by shared code
  */
-int32_t
+static int32_t
 e1000_setup_led(struct e1000_hw *hw)
 {
 uint32_t ledctl;
@@ -6123,50 +6123,11 @@ e1000_setup_led(struct e1000_hw *hw)
 
 
 /**
- * Used on 82571 and later Si that has LED blink bits.
- * Callers must use their own timer and should have already called
- * e1000_id_led_init()
- * Call e1000_cleanup led() to stop blinking
- *
- * hw - Struct containing variables accessed by shared code
- */
-int32_t
-e1000_blink_led_start(struct e1000_hw *hw)
-{
-int16_t  i;
-uint32_t ledctl_blink = 0;
-
-DEBUGFUNC("e1000_id_led_blink_on");
-
-if (hw->mac_type < e1000_82571) {
-/* Nothing to do */
-return E1000_SUCCESS;
-}
-if (hw->media_type == e1000_media_type_fiber) {
-/* always blink LED0 for PCI-E fiber */
-ledctl_blink = E1000_LEDCTL_LED0_BLINK |
- (E1000_LEDCTL_MODE_LED_ON <

Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Hagen Paul Pfeifer
* Stephen Hemminger | 2006-10-27 07:41:02 [-0700]:

>Please no, it makes the socket option useless.

Technically no; in the sense of usability for everybody, yes. You are right,
Stephen; as a programmer I understand you completely!

But on the other side: we know for sure that this IS a problem if we allow
everybody to "prefer his socket".

In my opinion we should put fairness before usability! As John Heffner
suggested, we could introduce a ranking system for congestion control
algorithms, but this solution seems a little bit oversized and maybe can't be
completely guaranteed (complex interaction between the protocols in different
environments and so on, you know).

HGN




-- 
 /°\   --- JOIN NOW!!! --- 
 \ /  ASCII ribbon campaign
  X   against HTML 
 / \in mail and news   


Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Stephen Hemminger

Hagen Paul Pfeifer wrote:
> * Stephen Hemminger | 2006-10-27 07:41:02 [-0700]:
>
>> Please no, it makes the socket option useless.
>
> Technically no; in the sense of usability for everybody, yes. You are right,
> Stephen; as a programmer I understand you completely!
>
> But on the other side: we know for sure that this IS a problem if we allow
> everybody to "prefer his socket".
>
> In my opinion we should put fairness before usability! As John Heffner
> suggested, we could introduce a ranking system for congestion control
> algorithms, but this solution seems a little bit oversized and maybe can't be
> completely guaranteed (complex interaction between the protocols in different
> environments and so on, you know).
>
> HGN

If there is a dangerous choice, then it should be removed. Otherwise I can't
see the problem. It is a bigger risk to have to escalate the privileges of an
application just to allow it to use something.


[IPROUTE] manpage for rtmon

2006-10-27 Thread Michael Prokop
Hello,

another manpage, this time for rtmon. Would be great if it could be
applied to the next release too.

regards,
-mika-
-- 
 ,'"`. http://www.michael-prokop.at/
(  grml.org -» Linux Live-CD for texttool-users and sysadmins
 `._,' http://www.grml.org/
.TH RTMON 8
.SH NAME
rtmon \- listens to and monitors RTnetlink
.SH SYNOPSIS
.B rtmon
.RI "[ options ] file FILE [ all | LISTofOBJECTS ]"
.SH DESCRIPTION
This manual page documents briefly the
.B rtmon
command.
.PP
\fBrtmon\fP is an RTnetlink listener. RTnetlink allows the kernel's routing
tables to be read and altered.

rtmon should be started before the first network configuration command is 
issued. For example if you insert:

 rtmon file /var/log/rtmon.log

in a startup script, you will be able to view the full history later.
Certainly, it is possible to start rtmon at any time. It prepends the history 
with the state snapshot dumped at the moment of starting.
.SH OPTIONS
rtmon supports the following options:
.TP
.B \-Version
Print version and exit.
.TP
.B help
Show summary of options.
.TP
.B file FILE [ all | LISTofOBJECTS ]
Log output to FILE. LISTofOBJECTS is the list of object types that we want to
monitor. It may contain 'link', 'address', 'route' and 'all'. Here 'link'
specifies the network device, 'address' the protocol (IP or IPv6) address on a
device, 'route' the routing table entry, and 'all' does what the name says.
.TP
.B \-family [ inet | inet6 | link | help ]
Specify protocol family. 'inet' is IPv4, 'inet6' is IPv6, 'link' means that no 
networking protocol is involved and 'help' prints usage information.
.TP
.B \-4
Use IPv4. Shortcut for -family inet.
.TP
.B \-6
Use IPv6. Shortcut for -family inet6.
.TP
.B \-0
Use a special family identifier meaning that no networking protocol is 
involved. Shortcut for -family link.
.SH USAGE EXAMPLES
.TP
.B # rtmon file /var/log/rtmon.log
Log to file /var/log/rtmon.log, then run:
.TP
.B # ip monitor file /var/log/rtmon.log
to display logged output from file.
.SH SEE ALSO
.BR ip (8)
.SH AUTHOR
rtmon was written by Alexey Kuznetsov <[EMAIL PROTECTED]>.
.PP
This manual page was written by Michael Prokop <[EMAIL PROTECTED]>,
for the Debian project (but may be used by others).




[PATCH] sky2: not experimental

2006-10-27 Thread Stephen Hemminger
The sky2 driver is no longer in experimental state.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- sky2.orig/drivers/net/Kconfig	2006-10-27 10:16:44.0 -0700
+++ sky2/drivers/net/Kconfig	2006-10-27 10:20:20.0 -0700
@@ -2112,7 +2112,7 @@
 
 config SKY2
 	tristate "SysKonnect Yukon2 support (EXPERIMENTAL)"
-	depends on PCI && EXPERIMENTAL
+	depends on PCI
 	select CRC32
 	---help---
 	  This driver supports Gigabit Ethernet adapters based on the
@@ -2120,8 +2120,8 @@
 	  Marvell 88E8021/88E8022/88E8035/88E8036/88E8038/88E8050/88E8052/
 	  88E8053/88E8055/88E8061/88E8062, SysKonnect SK-9E21D/SK-9S21
 
-	  This driver does not support the original Yukon chipset: a seperate
-	  driver, skge, is provided for Yukon-based adapters.
+	  There is companion driver for the older Marvell Yukon and
+	  Genesis based adapters: skge.
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called sky2.  This is recommended.


[take21 2/4] kevent: poll/select() notifications.

2006-10-27 Thread Evgeniy Polyakov

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
process wake-up, a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..f81299f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
struct mutexinotify_mutex;  /* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+   struct kevent_storage   st;
+#endif
+
unsigned long   i_state;
unsigned long   dirtied_when;   /* jiffies of first dirtying */
 
@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
struct list_headf_ep_links;
spinlock_t  f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+   struct kevent_storage   st;
+#endif
struct address_space*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 000..fb74e0f
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+   struct poll_table_structpt;
+   struct kevent   *k;
+};
+
+struct kevent_poll_wait_container
+{
+   struct list_headcontainer_entry;
+   wait_queue_head_t   *whead;
+   wait_queue_twait;
+   struct kevent   *k;
+};
+
+struct kevent_poll_private
+{
+   struct list_headcontainer_list;
+   spinlock_t  container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+   unsigned mode, int sync, void *key)
+{
+   struct kevent_poll_wait_container *cont =
+   container_of(wait, struct kevent_poll_wait_container, wait);
+   struct kevent *k = cont->k;
+   struct file *file = k->st->origin;
+   u32 revents;
+
+   revents = file->f_op->poll(file, NULL);
+
+   kevent_storage_ready(k->st, NULL, revents);
+
+   return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+   struct poll_table_struct *poll_table)
+{
+   struct kevent *k =
+   container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+   struct kevent_poll_private *priv = k->priv;
+   struct kevent_poll_wait_container *cont;
+   unsigned long flags;
+
+   cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+   if (!cont) {
+   kevent_break(k);
+   return;
+   }
+
+   cont->k = k;
+   init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+   cont->whead = whead;
+
+   spin_lock_irqsave(&priv->container_lock, flags);
+   list_add_tail(&cont->container_entry, &priv->container_list);
+   spin_unlock_irqrestore(&priv->container_lock, flags);
+
+   add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+   struct file *file;
+   int err, ready = 0;
+   unsigned int revents;
+   struct kevent_poll_ctl ctl;
+   struct kevent_poll_private *priv;
+
+   file = fget(k->event.id.raw[0]);
+   if (!file)
+   return -ENODEV;
+
+   err = -EINVAL;
+   if (!file->f_op || !file->f_op->poll)
+   goto err_out_fput;
+
+   err = -ENOMEM;
+   priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+   if (!priv)
+   goto err_out_fput;
+
+   spin_lock_init(&priv->container_lock);
+   INIT_LIST_HEAD(&priv->container_list);
+
+   k->priv = priv;
+
+   ctl.k = k;
+   init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+

[take21 1/4] kevent: Core files.

2006-10-27 Thread Evgeniy Polyakov

Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some bits of documentation can be found on project's homepage (and links from 
there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a9560eb 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,6 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+   .long sys_kevent_get_events
+   .long sys_kevent_ctl/* 320 */
+   .long sys_kevent_wait
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..cf18955 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,11 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
-   .quad sys_tee
+   .quad sys_tee   /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+   .quad sys_kevent_get_events
+   .quad sys_kevent_ctl/* 320 */
+   .quad sys_kevent_wait
 ia32_syscall_end:  
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..f009677 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,13 @@ #define __NR_tee  315
 #define __NR_vmsplice  316
 #define __NR_move_pages317
 #define __NR_getcpu318
+#define __NR_kevent_get_events 319
+#define __NR_kevent_ctl320
+#define __NR_kevent_wait   321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 322
 #include 
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..c53d156 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice 278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait   282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 #include 
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 000..125414c
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,205 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+   kevent_callback_t   callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY   0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER0x4
+
+struct kevent
+{
+   /* Used for kevent freeing.*/
+   struct rcu_head rcu_head;
+   struct ukevent  event;
+   /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+   spinlock_t  ulock;
+
+   /* Entry of user's tree. */
+   struct rb_node  kevent_node;
+   /* Entry of origin's queue. */
+   struct list_headstorage_entry;
+   /* Entry of user's ready. */
+   struct list_headready_entry;
+
+   u32 flags;
+
+   /* User who requested this kevent. */
+   struct kevent_user  *user;
+ 

[take21 4/4] kevent: Timer notifications.

2006-10-27 Thread Evgeniy Polyakov

Timer notifications.

Timer notifications can be used for fine grained per-process time 
management, since interval timers are very inconvenient to use, 
and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>
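
As an illustration of the id.raw[] convention described above, here is a
small userspace sketch of splitting a period into whole seconds and
remaining nanoseconds. The struct and helper names (example_kevent_id,
example_period_to_id, example_id_to_period) are hypothetical and stand in
for whatever the real userspace structure looks like; only the meaning of
raw[0] and raw[1] is taken from the description.

```c
#include <stdint.h>

#define EXAMPLE_NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the event id; only raw[] matters here. */
struct example_kevent_id {
	uint32_t raw[2];
};

/* Split a period given in nanoseconds into raw[0] (seconds) and
 * raw[1] (nanoseconds). Returns 0 on success, or -1 if the number
 * of seconds does not fit into 32 bits. */
static int example_period_to_id(uint64_t period_ns,
				struct example_kevent_id *id)
{
	uint64_t sec = period_ns / EXAMPLE_NSEC_PER_SEC;

	if (sec > UINT32_MAX)
		return -1;
	id->raw[0] = (uint32_t)sec;
	id->raw[1] = (uint32_t)(period_ns % EXAMPLE_NSEC_PER_SEC);
	return 0;
}

/* Inverse helper: rebuild the period in nanoseconds from raw[]. */
static uint64_t example_id_to_period(const struct example_kevent_id *id)
{
	return (uint64_t)id->raw[0] * EXAMPLE_NSEC_PER_SEC + id->raw[1];
}
```

For a 1.5 second period, this yields raw[0] = 1 and raw[1] = 500000000.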

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 000..04acc46
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,113 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct kevent_timer
+{
+   struct hrtimer  ktimer;
+   struct kevent_storage   ktimer_storage;
+   struct kevent   *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+   struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+   struct kevent *k = t->ktimer_event;
+
+   kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+   hrtimer_forward(timer, timer->base->softirq_time,
+   ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+   return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+   int err;
+   struct kevent_timer *t;
+
+   t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+   if (!t)
+   return -ENOMEM;
+
+   hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+   t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+   t->ktimer.function = kevent_timer_func;
+   t->ktimer_event = k;
+
+   err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+   if (err)
+   goto err_out_free;
+   lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+   err = kevent_storage_enqueue(&t->ktimer_storage, k);
+   if (err)
+   goto err_out_st_fini;
+
+   printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer);
+   hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+   return 0;
+
+err_out_st_fini:
+   kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+   kfree(t);
+
+   return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+   struct kevent_storage *st = k->st;
+   struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+   hrtimer_cancel(&t->ktimer);
+   kevent_storage_dequeue(st, k);
+   kfree(t);
+
+   return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+   k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+   return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+   struct kevent_callbacks tc = {
+   .callback = &kevent_timer_callback,
+   .enqueue = &kevent_timer_enqueue,
+   .dequeue = &kevent_timer_dequeue};
+
+   return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[take21 0/4] kevent: Generic event handling mechanism.

2006-10-27 Thread Evgeniy Polyakov

Generic event handling mechanism.

Consider for inclusion.

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s with a client connection rate of 4000 con/s
over a 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which
is very close to wire speed once headers and the like are taken into account.
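
To put the throughput figure above in perspective, here is a small arithmetic sketch. It assumes that "KB" means 1024 bytes and that the link carries 100 * 10^6 bits per second; the function name `link_utilization` is hypothetical. Under those assumptions the reported payload rate is roughly 87% of the raw link rate, with the remainder available for protocol headers, connection setup and teardown, and framing.

```c
#include <assert.h>

/* Fraction of the raw link rate consumed by the measured payload rate.
 * Assumes 1 KB = 1024 bytes and 1 Mbit = 10^6 bits. */
static double link_utilization(double payload_kb_per_s, double link_mbit_per_s)
{
    double payload_bytes_per_s;
    double link_bytes_per_s;

    assert(link_mbit_per_s > 0.0);
    payload_bytes_per_s = payload_kb_per_s * 1024.0;
    link_bytes_per_s = link_mbit_per_s * 1e6 / 8.0;
    return payload_bytes_per_s / link_bytes_per_s;
}
```

Calling link_utilization(10582.7, 100.0) gives a ratio just under 0.87.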

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table.
At least for a web server, the frequency of addition/deletion of new
kevents is comparable with the number of search accesses, i.e. most of
the time events are added, accessed only a couple of times and then
removed, so it justifies RB tree usage over AVL tree, since the latter
has much slower deletion time (max O(log(N)) compared to 3 ops),
although faster search time (1.44*O(log(N)) vs. 2*O(log(N))).
So for kevents I use RB tree for now and later, when my AVL tree
implementation is ready, it will be possible to compare them.
 * Changed readiness check for socket notifications.
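
The search-cost factors quoted in the RB tree item above can be made concrete with a little integer arithmetic. This is only a sketch: the helper names are hypothetical, the results are scaled by 100 to avoid floating point, and the 1.44 and 2 factors are simply the ones quoted in this changelog applied to ceil(log2(N)).

```c
#include <assert.h>

/* Smallest h such that 2^h >= n, i.e. ceil(log2(n)) for n >= 1. */
static unsigned int ceil_log2(unsigned long n)
{
    unsigned int h = 0;

    assert(n >= 1);
    while (h < 8 * sizeof(n) - 1 && (1UL << h) < n)
        h++;
    return h;
}

/* Approximate worst-case search depth, scaled by 100. */
static unsigned int avl_depth_x100(unsigned long n)
{
    return 144 * ceil_log2(n);   /* ~1.44 * log2(N) */
}

static unsigned int rb_depth_x100(unsigned long n)
{
    return 200 * ceil_log2(n);   /* ~2 * log2(N) */
}
```

For N = 4096 kevents this gives a bound of about 17 levels for an AVL tree against 24 for an RB tree.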

With both of the above changes it is possible to achieve more than 3380
req/second, compared to 2200 (sometimes 2500) req/second for epoll(), for a
trivial web server and httperf client on the same hardware.
It is possible that the above kevent limit is due to the maximum number of
kevents allowed at a time, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages)
calculation
 * export kevent_socket_notify(), since it is used in network protocols
which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers, this forces timer API
update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has been
updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
This syscall waits until either timeout expires or at least one event
becomes ready. It also commits that @num events from @start are processed
by userspace and thus can be removed or rearmed (depending on their
flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not get lock around user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer
if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop which should save
us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added header shared between userspace and kernelspace instead of
embedding them in one
 * core restructuring to remove forward declarations
 * some whitespace/coding style cleanup
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitute pages
- allocate new page for events only when new added kevent requires it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and flag

[take21 3/4] kevent: Socket notifications.

2006-10-27 Thread Evgeniy Polyakov

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features
instead of epoll, its performance increased more than noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on project's homepage.

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -164,12 +165,18 @@ #endif
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+   kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_SOCKET
+   kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include 
 #include 
 #include   /* struct sk_buff */
 #include 
+#include 
 
 #include 
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+   struct socket socket;
+   struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+   return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+   return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+   kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)  \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
 }
 
-struct socket_alloc {
-   struct socket socket;
-   struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-   return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-   return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+   kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
  (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 000..c865b3e
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,129 @@
+/*
+ * kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+static int kevent_socket_callback(struct kevent *k)
+{
+   struct inode *inode = k->st->origin;
+   return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL);
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+   struct inode *inode;
+   struct socket *sock;
+   int err = -ENODEV;
+
+   so

Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-10-27 Thread Evgeniy Polyakov
On Fri, Oct 27, 2006 at 08:10:01PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> 
> Generic event handling mechanism.
> 
> Consider for inclusion.
> 
> Changes from 'take20' patchset:
>  * new ring buffer implementation

Test userspace application can be found in archive on project's
homepage. It is also attached to this mail.

Short design notes about ring buffer implementation.

Ring buffer is designed in a way that first ready kevent will be at
ring->uidx position, and all other ready events will be in FIFO order
after it. So when we need to commit num events, it means we should just
remove first num kevents from ready queue and commit them. We do not use
any special locking to protect this function against simultaneous
running - kevent dequeueing is atomic, and we do not care about order in
which events were committed.
An example: thread 1 and thread 2 simultaneously call kevent_wait() to
commit 2 and 3 events. It is possible that first thread will commit
events 0 and 2 while second thread will commit events 1, 3 and 4. If
there were only 3 ready events, then one of the calls will return lesser
number of committed events than it was requested.
ring->uidx update is atomic, since it is protected by u->ready_lock,
which removes race with kevent_user_ring_add_event().

If user asks to commit events which have been removed by
kevent_get_events() recently (for example when one thread looked into
ring indexes and started to commit events which were simultaneously
committed by another thread through kevent_get_events()), kevent_wait()
will not commit unprocessed events, but will return the number of actually
committed events instead.

It is forbidden to try to commit events not from the start of the
buffer, but from some 'further' event.

An example: if ready events use positions 2-5, it is permitted to start
to commit 3 events from position 0, in this case positions 0 and 1 will
be omitted and only the event in position 2 will be committed and
kevent_wait() will return 1, since only one event was actually
committed.
It is forbidden to try to commit from position 4, 0 will be returned.
This means that if some events were committed using kevent_get_events(),
they will not be counted, instead userspace should check ring index and
try to commit again.
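
The commit rules described above can be modelled in a few lines of userspace C. This is a sketch of the described behaviour only, not the kernel implementation: `ring_model` and `ring_commit` are hypothetical names, ring wrap-around and locking are ignored, and `first` plays the role of ring->uidx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the ready part of the ring: ready events occupy
 * positions [first, first + count). Wrap-around is not modelled. */
struct ring_model {
    unsigned int first;   /* position of the first ready event (ring->uidx) */
    unsigned int count;   /* number of ready events */
};

/* Try to commit num events starting at position start.
 * Returns the number of ready events actually committed. */
static unsigned int ring_commit(struct ring_model *r, unsigned int start,
                                unsigned int num)
{
    unsigned int end = start + num;
    unsigned int committed;

    assert(r != NULL);
    if (start > r->first)   /* starting past the first ready event is forbidden */
        return 0;
    if (end <= r->first)    /* the request covers no ready events at all */
        return 0;

    committed = end - r->first;
    if (committed > r->count)
        committed = r->count;

    r->first += committed;
    r->count -= committed;
    return committed;
}
```

With ready events in positions 2-5, committing 3 events from position 0 returns 1, and committing from position 4 returns 0, matching the examples in the notes. Two callers asking for 2 and 3 events when only 3 are ready get 2 and 1 respectively.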

-- 
Evgeniy Polyakov
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 

#define PAGE_SIZE   4096
#include 

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name (type1 arg1, type2 arg2, type3 arg3) \
{\
return syscall(__NR_##name, arg1, arg2, arg3);\
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{\
return syscall(__NR_##name, arg1, arg2, arg3, arg4);\
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
  type5,arg5) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5) \
{\
return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5);\
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
  type5,arg5,type6,arg6) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5, type6 arg6) \
{\
return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5, arg6);\
}

_syscall4(int, kevent_ctl, int, arg1, unsigned int, argv2, unsigned int, argv3, 
void *, argv4);
_syscall6(int, kevent_get_events, int, arg1, unsigned int, argv2, unsigned int, 
argv3, __u64, argv4, void *, argv5, unsigned, arg6);
_syscall4(int, kevent_wait, int, arg1, unsigned int, arg2, unsigned int, argv3, 
__u64, argv4);

#define ulog(f, a...) fprintf(stderr, "%8u: "f, time(NULL), ##a)
#define ulog_err(f, a...) ulog(f ": %s [%d].\n", ##a, strerror(errno), errno)

static void usage(char *p)
{
ulog("Usage: %s -t type -e event -o oneshot -p path -n wait_num -f kevent_file -r ready_num -h\n", p);
}

static int evtest_mmap(int fd, struct kevent_mring **ring, int number)
{
int i;
off_t o = 0;

for (i=0; i 0) {
switch (ch) {
case 'f':
file = optarg;
break;
case 'r':
ready_num = atoi(optarg);
break;
case 'n':
wait_num = atoi(optarg);
break;
case 't':
tm_sec = atoi(optarg);
break;
case 'T':
tm_nsec = atoi(optarg);
break;
case 'o':
oneshot = atoi(optarg);
break;
default:

Re: [openib-general] [PATCH 1/9] NetEffect 10Gb RNIC Driver: kernel Kconfig and makefiles

2006-10-27 Thread James Lentini


On Thu, 26 Oct 2006, Glenn Grundstrom wrote:

> diff -ruNp old/drivers/infiniband/hw/nes/Makefile
> new/drivers/infiniband/hw/nes/Makefile
> --- old/drivers/infiniband/hw/nes/Makefile1969-12-31
> 18:00:00.0 -0600
> +++ new/drivers/infiniband/hw/nes/Makefile2006-10-25
> 11:10:26.0 -0500
> @@ -0,0 +1,27 @@
> +EXTRA_CFLAGS += -Idrivers/infiniband/include
> -Idrivers/infiniband/hw/nes/nes_tcpip/include
> +
> +ifdef CONFIG_INFINIBAND_NES_DEBUG
> +EXTRA_CFLAGS += -DNES_DEBUG
> +endif

The NES_DEBUG flag is unnecessary. You can check for 
CONFIG_INFINIBAND_NES_DEBUG in the code. See 
CONFIG_INFINIBAND_MTHCA_DEBUG for an example.
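
A minimal sketch of what such a check might look like in the source, written as plain userspace C so it can be compiled on its own. The macro `nes_dbg`, the helper `nes_debug_enabled()` and the use of fprintf() are illustrative assumptions, not the driver's actual code; in a kernel build the CONFIG_ symbol would come from Kconfig rather than from an extra -D flag in the Makefile.

```c
#include <stdio.h>

/* Test the Kconfig symbol directly instead of a separate NES_DEBUG define. */
#ifdef CONFIG_INFINIBAND_NES_DEBUG
#define NES_DEBUG_ENABLED 1
#define nes_dbg(fmt, ...) fprintf(stderr, "nes: " fmt, ##__VA_ARGS__)
#else
#define NES_DEBUG_ENABLED 0
/* Compiles away to nothing when debugging is not configured. */
#define nes_dbg(fmt, ...) do { } while (0)
#endif

/* Hypothetical helper so callers (and tests) can query the setting. */
static int nes_debug_enabled(void)
{
    return NES_DEBUG_ENABLED;
}
```

When the symbol is not defined, calls such as nes_dbg("qp %d\n", 3) produce no code at all.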

> +
> +ifneq ($(KERNELRELEASE),)
> + obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o
> +
> + iw_nes-objs := \
> + nes.o \
> + nes_hw.o \
> + nes_nic.o \
> + nes_cm.o \
> + nes_utils.o \
> + nes_verbs.o 
> +else
> + KERNELDIR ?= /usr/src/linux
> + PWD := $(shell pwd)
> +
> +default:
> + $(MAKE) -C $(KERNELDIR) M=$(PWD) modules
> +
> +clean:
> + $(MAKE) -C $(KERNELDIR) M=$(PWD) clean
> +
> +endif

In-tree drivers don't provide support for out-of-tree builds. See 
drivers/infiniband/hw/mthca/Makefile for an example of how to 
simplify this.


Re: [openib-general] [PATCH 3/9] NetEffect 10Gb RNIC Driver: openfabrics connection manager c file

2006-10-27 Thread Tom Tucker

[...snip...]
> +extern void set_interface(
> +UINT32ip_addr,

These should probably be the standard linux types u32, or uint32

> +UINT32mask,
> +UINT32bcastaddr,
> +UINT32type
> +   );

[...snip...]

> + struct NES_sockaddr_in  inet_addr;
> + struct sockaddr_in  kinet_addr;

Is there some reason why you need your own sockaddr and sockaddr_in
structures? 

[...snip...]
> +
> +/**
> + * nes_disconnect
> + * 
> + * @param cm_id
> + * @param abrupt
> + * 
> + * @return int
> + */
> +int nes_disconnect(struct iw_cm_id *cm_id, int abrupt)
> +{
> + struct ib_qp_attr attr;
> + struct ib_qp *ibqp;
> + struct nes_qp *nesqp;
> + struct nes_dev *nesdev = to_nesdev(cm_id->device);
> + int err = 0;
> + u8 u8temp;
> +
> + dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
> + dprintk("%s: netdev refcnt = %u.\n", __FUNCTION__,
> atomic_read(&nesdev->netdev->refcnt));
> +
> + /* If the qp was already destroyed, then there's no QP */
> + if (cm_id->provider_data == 0)
> + return 0;
> +
> + nesqp = (struct nes_qp *)cm_id->provider_data;
> + ibqp = &nesqp->ibqp;
> +
> + /* Disassociate the QP from this cm_id */
> + cm_id->provider_data = 0;
> + cm_id->rem_ref(cm_id);
> + nesqp->cm_id = 0;
> +
> + stack_ops_p->decelerate_socket(nesqp->socket, 
> +(struct nes_uploaded_qp_context *)
> +nesqp->nesqp_context);
> +  
> + if (nesqp->active_conn) {
> +   u8temp = 1 << (ntohs(cm_id->local_addr.sin_port)&7);
> +   nesdev->apbv_table[ntohs(cm_id->local_addr.sin_port)>>3] &=
> ~(u8temp);
> + } else {
> + dev_put(nesdev->netdev);
> +/* Need to free the Last Streaming Mode Message */
> +pci_free_consistent(nesdev->pcidev, 
> +
> nesqp->private_data_len+sizeof(*nesqp->ietf_frame), 
> +nesqp->ietf_frame,
> +nesqp->ietf_frame_pbase);

This is mailer perversion. You need to turn off wrapping in your mailer.
It makes it hard to review the patch never mind apply it.

> +}
> +
> + if (nesqp->ksock) sock_release(nesqp->ksock);
> + stack_ops_p->sock_ops_p->close( nesqp->socket );
> + nesqp->ksock = 0;
> + nesqp->socket = 0;
> + if (nesqp->wq) {
> + destroy_workqueue(nesqp->wq);

This will deadlock if this function is called from a workqueue thread
and CONFIG_HOTPLUG_CPU is enabled. 

> + nesqp->wq = NULL;
> + }
> +
> + memset(&attr, 0, sizeof(struct ib_qp_attr));
> + if (abrupt)
> + attr.qp_state = IB_QPS_ERR;
> + else
> + attr.qp_state = IB_QPS_SQD;
> +
> + return err;
> +}
> +
> +
> +/**
> + * nes_accept
> + * 
> + * @param cm_id
> + * @param conn_param
> + * 
> + * @return int
> + */
> +int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param
> *conn_param)
> +{
> + struct nes_qp *nesqp;
> + struct nes_dev *nesdev;
> + struct nes_adapter *nesadapter;
> + struct ib_qp *ibqp;
> +struct nes_hw_qp_wqe *wqe;
> + struct nes_v4_quad nes_quad;
> + struct ib_qp_attr attr;
> +struct iw_cm_event cm_event;
> +
> + dprintk("%s:%s:%u: data len = %u\n", 
> + __FILE__, __FUNCTION__, __LINE__,
> conn_param->private_data_len);
> +
> + ibqp = nes_get_qp(cm_id->device, conn_param->qpn);
> + if (!ibqp)
> + return -EINVAL;
> + nesqp = to_nesqp(ibqp);
> + nesdev = to_nesdev(nesqp->ibqp.device);
> + nesadapter = nesdev->nesadapter;
> + dprintk("%s: netdev refcnt = %u.\n", __FUNCTION__,
> atomic_read(&nesdev->netdev->refcnt));
> +
> +nesqp->ietf_frame = pci_alloc_consistent(nesdev->pcidev, 
> +
> sizeof(*nesqp->ietf_frame)+conn_param->private_data_len,
> + &nesqp->ietf_frame_pbase);
> +if (!nesqp->ietf_frame) {
> +dprintk(KERN_ERR PFX "%s: Unable to allocate memory for private
> data\n", __FUNCTION__);
> +return -ENOMEM;
> +}
> +dprintk(PFX "%s: PCI consistent memory for "
> +"private data located @ %p (pa = 0x%08lX.) size = %u.\n", 
> +__FUNCTION__, nesqp->ietf_frame, (unsigned
> long)nesqp->ietf_frame_pbase,
> +conn_param->private_data_len+sizeof(*nesqp->ietf_frame));
> +nesqp->private_data_len = conn_param->private_data_len;
> +
> +strcpy(&nesqp->ietf_frame->key[0], IEFT_MPA_KEY_REP);
> +memcpy(&nesqp->ietf_frame->private_data, conn_param->private_data,
> conn_param->private_data_len);
> +nesqp->ietf_frame->private_data_size =
> cpu_to_be16(conn_param->private_data_len);
> +nesqp->ietf_frame->rev = mpa_version;
> +nesqp->ietf_frame->flags = IETF_MPA_FLAGS_CRC;
> +
> +wqe = &nesqp->hwqp.sq_vbase[0];
> +*((struct nes_

[PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
My proposed method for restricting TCP choices to fair algorithms.
This is a net-wide, not system-wide, issue; it should not be done
by a kernel policy choice (capability), but by a build choice.

--- sky2.orig/net/ipv4/Kconfig  2006-10-27 10:10:47.0 -0700
+++ sky2/net/ipv4/Kconfig   2006-10-27 10:15:56.0 -0700
@@ -470,6 +470,16 @@
 
 if TCP_CONG_ADVANCED
 
+config TCP_CONG_UNFAIR
+bool "Allow unfair congestion control algorithms"
+   depends on EXPERIMENTAL
+---help---
+ Some of the congestion control algorithms are for testing
+ and research purposes and should not be deployed on public
+ networks because of the possibility of unfair behavior.
+ These algorithms may be useful for future development
+ or comparison purposes.
+
 config TCP_CONG_BIC
tristate "Binary Increase Congestion (BIC) control"
default m
@@ -551,7 +561,7 @@
 
 config TCP_CONG_SCALABLE
tristate "Scalable TCP"
-   depends on EXPERIMENTAL
+   depends on TCP_CONG_UNFAIR
default n
---help---
Scalable TCP is a sender-side only change to TCP which uses a


Re: [IPROUTE] manpage for rtmon

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 19:22:11 +0200
Michael Prokop <[EMAIL PROTECTED]> wrote:

> User-Agent: mutt-ng devel-r316 (Debian)
> 
> Hello,

added.

-- 
Stephen Hemminger <[EMAIL PROTECTED]>


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread John Heffner
I think "unfair" is a difficult word.  Unfair to what?  It's true that 
Scalable TCP is unfair to itself in that flows with unequal shares do 
not converge, but it's not clear what its interactions are with other 
congestion control algorithms.  It's not clear to me that it's 
significantly more unfair wrt. reno than BIC, etc.  "Known to be broken" 
might be more correct language. :)


One thought would be to use a module parameter that sets one bit of 
state: allow unprivileged use.  Each module could have a sensible 
default value.
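
A userspace sketch of the "one bit of state" idea, for illustration only. The struct, table and function names are hypothetical, and the per-algorithm defaults are placeholders rather than a judgement about any particular algorithm; in a real module the flag would presumably be exposed as a module parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a registered congestion control. */
struct cc_algo {
    const char *name;
    int allow_unprivileged;   /* the one bit of state; placeholder defaults */
};

static const struct cc_algo cc_table[] = {
    { "reno",     1 },
    { "bic",      1 },
    { "scalable", 0 },
};

/* Return 0 if the caller may select the algorithm, or a negative errno. */
static int cc_select(const char *name, int privileged)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(cc_table) / sizeof(cc_table[0]); i++) {
        if (strcmp(cc_table[i].name, name) != 0)
            continue;
        if (!cc_table[i].allow_unprivileged && !privileged)
            return -EPERM;   /* restricted and caller is unprivileged */
        return 0;
    }
    return -ENOENT;          /* no such algorithm registered */
}
```

An unprivileged caller asking for a restricted entry gets -EPERM, while a privileged caller, or any caller asking for an unrestricted entry, succeeds.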


  -John


Stephen Hemminger wrote:

My proposed method restricting TCP choices to fair algorithms.
This a net wide, not system wide issue, it should not be done
by kernel policy choice (capability), but by a build choice.

--- sky2.orig/net/ipv4/Kconfig  2006-10-27 10:10:47.0 -0700
+++ sky2/net/ipv4/Kconfig   2006-10-27 10:15:56.0 -0700
@@ -470,6 +470,16 @@
 
 if TCP_CONG_ADVANCED
 
+config TCP_CONG_UNFAIR

+bool "Allow unfair congestion control algorithms"
+   depends on EXPERIMENTAL
+---help---
+ Some of the congestion control algorithms are for testing
+ and research purposes and should not deployed on public
+ networks because of the possiblity of unfair behavior.
+ These algorithms may be useful for future development
+ or comparison purposes.
+
 config TCP_CONG_BIC
tristate "Binary Increase Congestion (BIC) control"
default m
@@ -551,7 +561,7 @@
 
 config TCP_CONG_SCALABLE

tristate "Scalable TCP"
-   depends on EXPERIMENTAL
+   depends on TCP_CONG_UNFAIR
default n
---help---
Scalable TCP is a sender-side only change to TCP which uses a




[PATCH] tcp: allow restricting congestion control choices

2006-10-27 Thread Stephen Hemminger
Here is an alternative that allows runtime based restriction on some
TCP congestion control choices. 

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

---
 include/net/tcp.h   |1 +
 net/ipv4/tcp_cong.c |4 
 2 files changed, 5 insertions(+)

--- sky2.orig/include/net/tcp.h 2006-10-27 10:46:19.0 -0700
+++ sky2/include/net/tcp.h  2006-10-27 10:46:55.0 -0700
@@ -651,6 +651,7 @@
 
charname[TCP_CA_NAME_MAX];
struct module   *owner;
+   int restricted; /* NET_ADMIN only */
 };
 
 extern int tcp_register_congestion_control(struct tcp_congestion_ops *type);
--- sky2.orig/net/ipv4/tcp_cong.c   2006-10-27 10:51:47.0 -0700
+++ sky2/net/ipv4/tcp_cong.c2006-10-27 10:56:36.0 -0700
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -159,6 +160,9 @@
if (!ca)
err = -ENOENT;
 
+   else if (ca->restricted && !capable(CAP_NET_ADMIN))
+   err = -EPERM;
+
else if (!try_module_get(ca->owner))
err = -EBUSY;
 


[PATCH] tcp: setsockopt congestion control autoload

2006-10-27 Thread Stephen Hemminger
If an application asks for a congestion control type with setsockopt()
then it may be available as a module not included in the kernel already.
If it has permission to load modules then the tcp congestion
module should be autoloaded if needed.  This is done already when
the default selection is changed with sysctl, but not when an application
requests it via setsockopt().
 
Add a similar additional check to the sysctl path as well.
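
The lookup, conditional autoload and second lookup described above can be sketched as a small userspace model. Everything here is hypothetical scaffolding: `fake_request_module()` stands in for request_module("tcp_%s", name) and pretends that only "bic" exists as a loadable module, and `can_load_modules` stands in for the capable(CAP_SYS_MODULE) check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALGOS 8

/* Hypothetical registry of currently loaded congestion control names. */
static const char *loaded[MAX_ALGOS] = { "reno" };
static int nloaded = 1;

static int find_algo(const char *name)
{
    int i;

    for (i = 0; i < nloaded; i++)
        if (strcmp(loaded[i], name) == 0)
            return i;
    return -1;
}

/* Stand-in for module loading: only "bic" can be loaded in this model. */
static void fake_request_module(const char *name)
{
    if (strcmp(name, "bic") == 0 && nloaded < MAX_ALGOS && find_algo(name) < 0)
        loaded[nloaded++] = "bic";
}

/* Look the name up; if it is missing and the caller may load modules,
 * try to load it and look again. */
static int set_congestion_control(const char *name, int can_load_modules)
{
    int idx;

    assert(name != NULL);
    idx = find_algo(name);
    if (idx < 0 && can_load_modules) {
        fake_request_module(name);
        idx = find_algo(name);
    }
    return idx < 0 ? -ENOENT : 0;
}
```

A caller without permission to load modules gets -ENOENT for an algorithm that is not yet loaded; once a permitted caller has triggered the load, later requests succeed for everyone.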

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
 
---
 net/ipv4/tcp_cong.c |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/net/ipv4/tcp_cong.c   2006-10-27 10:56:36.0 -0700
+++ b/net/ipv4/tcp_cong.c   2006-10-27 11:09:36.0 -0700
@@ -114,7 +114,7 @@
spin_lock(&tcp_cong_list_lock);
ca = tcp_ca_find(name);
 #ifdef CONFIG_KMOD
-   if (!ca) {
+   if (!ca && capable(CAP_SYS_MODULE)) {
spin_unlock(&tcp_cong_list_lock);
 
request_module("tcp_%s", name);
@@ -154,9 +154,19 @@
 
rcu_read_lock();
ca = tcp_ca_find(name);
+   /* no change asking for existing value */
if (ca == icsk->icsk_ca_ops)
goto out;
 
+#ifdef CONFIG_KMOD
+   /* not found attempt to autoload module */
+   if (!ca && capable(CAP_SYS_MODULE)) {
+   rcu_read_unlock();
+   request_module("tcp_%s", name);
+   rcu_read_lock();
+   ca = tcp_ca_find(name);
+   }
+#endif
if (!ca)
err = -ENOENT;
 

Stephen Hemminger <[EMAIL PROTECTED]>


Re: [PATCH 9/13] [SCTP] Merge IPv4 and IPv6 versions of get_saddr() with their corresponding get_dst().

2006-10-27 Thread Sridhar Samudrala
On Tue, 2006-10-17 at 03:19 +0300, Ville Nuorvala wrote:
> As the IPv6 route lookup now also returns the selected source address
> there is no need for a separate source address lookup. In fact, the
> source address selection needs to be moved to get_dst() because the
> selected IPv6 source address isn't always stored in the route.
> Sometimes this makes it impossible to guess the correct address later on.
> 

Ville,

Overall the patch looks pretty good. I found only 1 issue in 
sctp_v6_get_dst(). See below.





> 
> +/* Returns the dst cache entry for the given source and destination ip
> + * addresses.
> + */
> +static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
> +  union sctp_addr *daddr,
> +  union sctp_addr *saddr)
> +{
> + struct dst_entry *dst;
> + struct flowi fl;
> + struct sctp_bind_addr *bp;
> + rwlock_t *addr_lock;
> + struct sctp_sockaddr_entry *laddr;
> + struct list_head *pos;
> + struct rt6_info *rt;
> + union sctp_addr baddr;
> + sctp_scope_t scope;
> + __u8 matchlen = 0;
> + __u8 bmatchlen;
> +
> + memset(&fl, 0, sizeof(fl));
> + ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
> + if (ipv6_addr_type(&daddr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
> + fl.oif = daddr->v6.sin6_scope_id;
> +
> + ipv6_addr_copy(&fl.fl6_src, &saddr->v6.sin6_addr);
> + SCTP_DEBUG_PRINTK("%s: DST=" NIP6_FMT " SRC=" NIP6_FMT " ",
> +   __FUNCTION__, NIP6(fl.fl6_dst), NIP6(fl.fl6_src));
> +
> + dst = ip6_route_output(NULL, &fl);
> + if (dst->error) {
> + dst_release(dst);
> + dst = NULL;
> + }
> + if (!ipv6_addr_any(&saddr->v6.sin6_addr))
> + goto out;
> + if (!asoc) {
> + if (dst)
> + ipv6_addr_copy(&saddr->v6.sin6_addr, &fl.fl6_src);
> + goto out;
> + }
> + bp = &asoc->base.bind_addr;
> + addr_lock = &asoc->base.addr_lock;
> +
> + if (dst) {
> + /* Walk through the bind address list and look for a bind
> +  * address that matches the source address of the returned rt.
> +  */
> + sctp_v6_fl_saddr(&baddr, &fl, bp->port);
Here we are checking whether the source address returned in the dst matches
one of the addresses in the bind address list of the association, not the
source address that is passed to this routine (it could be INADDR_ANY).
So this should be changed back to sctp_v6_dst_saddr().

Thanks
Sridhar

> + sctp_read_lock(addr_lock);
> + list_for_each(pos, &bp->address_list) {
> + laddr = list_entry(pos, struct sctp_sockaddr_entry,
> +list);
> + if (!laddr->use_as_src)
> + continue;
> + if (sctp_v6_cmp_addr(&baddr, &laddr->a))
> + goto init_saddr;
> + }
> + sctp_read_unlock(addr_lock);
> +
> + /* Invalid rt or none of the bound addresses match the source
> +  * address. So release it.
> +  */
> + dst_release(dst);
> + dst = NULL;
> + }
> +
> + /* Go through the bind address list and find the best source address
> +  * that matches the scope of the destination address.
> +  */
> + memset(&baddr, 0, sizeof(union sctp_addr));
> + scope = sctp_scope(daddr);
> + sctp_read_lock(addr_lock);
> + list_for_each(pos, &bp->address_list) {
> + laddr = list_entry(pos, struct sctp_sockaddr_entry, list);
> + 
> + if (!laddr->use_as_src ||
> + laddr->a.sa.sa_family != AF_INET6 ||
> + scope > sctp_scope(&laddr->a) ||
> + (ipv6_addr_type(&laddr->a.v6.sin6_addr) &
> +  IPV6_ADDR_LINKLOCAL &&
> +  laddr->a.v6.sin6_scope_id != fl.oif))
> + continue;
> +
> + bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
> + if (!dst || (matchlen < bmatchlen)) {
> + struct dst_entry *dst2;
> + ipv6_addr_copy(&fl.fl6_src, &laddr->a.v6.sin6_addr);
> + dst2 = ip6_route_output(NULL, &fl);
> + if (dst2->error) {
> + dst_release(dst2);
> + dst2 = NULL;
> + continue;
> + }
> + dst_release(dst);
> + dst = dst2;
> + memcpy(&baddr, &laddr->a, sizeof(union sctp_addr));
> + matchlen = bmatchlen;
> + }
> + }
> + if (dst)
> + goto init_saddr;
> +out_unlock:
> + sctp_read_unlock(addr_lock);
> +out:
> + if (dst) {
> + rt = (struct rt6_info *) dst;
> +   

Re: [PATCH] s2io: add PCI error recovery support

2006-10-27 Thread Linas Vepstas
On Fri, Oct 27, 2006 at 07:35:18AM -0400, Ananda Raju wrote:
> Looking at all scenarios I feel the first patch is OK. Can you add the
> watchdog timer fix to first initial patch and resubmit. 

Appended below.

> So -- just for grins, I thought to myself, "Maybe I can make 
> s2io be the first adapter ever to fully recover without 
> a hard reset of the card."

... I couldn't quite make this work. Since the patch below
already works, I didn't see much point exerting myself further.

--linas

This patch adds PCI error recovery support to the 
s2io 10-Gigabit ethernet device driver. Third revision,
blocks interrupts and the watchdog.

Tested, seems to work well.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: Raghavendra Koushik <[EMAIL PROTECTED]>
Cc: Ananda Raju <[EMAIL PROTECTED]>
Cc: Wen Xiong <[EMAIL PROTECTED]>


 drivers/net/s2io.c |  121 +
 drivers/net/s2io.h |5 ++
 2 files changed, 126 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c  2006-10-27 10:49:07.0 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c   2006-10-27 13:55:01.0 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _
 
 MODULE_DEVICE_TABLE(pci, s2io_tbl);
 
+static struct pci_error_handlers s2io_err_handler = {
+   .error_detected = s2io_io_error_detected,
+   .slot_reset = s2io_io_slot_reset,
+   .resume = s2io_io_resume,
+};
+
 static struct pci_driver s2io_driver = {
   .name = "S2IO",
   .id_table = s2io_tbl,
   .probe = s2io_init_nic,
   .remove = __devexit_p(s2io_rem_nic),
+  .err_handler = &s2io_err_handler,
 };
 
 /* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -3159,6 +3166,11 @@ static void alarm_intr_handler(struct s2
register u64 val64 = 0, err_reg = 0;
u64 cnt;
int i;
+
+   if ((nic->pdev->error_state != pci_channel_io_normal) &&
+(nic->pdev->error_state != 0))
+   return;
+
nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0;
/* Handling the XPAK counters update */
if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) {
@@ -4171,6 +4183,11 @@ static irqreturn_t s2io_isr(int irq, voi
mac_info_t *mac_control;
struct config_param *config;
 
+   /* Pretend we handled any irq's from a disconnected card */
+   if ((sp->pdev->error_state != pci_channel_io_normal) &&
+(sp->pdev->error_state != 0))
+   return IRQ_HANDLED;
+
atomic_inc(&sp->isr_cnt);
mac_control = &sp->mac_control;
config = &sp->config;
@@ -7564,3 +7581,107 @@ static void lro_append_pkt(nic_t *sp, lr
sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
return;
 }
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+   pci_channel_state_t state)
+{
+   struct net_device *netdev = pci_get_drvdata(pdev);
+   nic_t *sp = netdev->priv;
+
+   netif_device_detach(netdev);
+
+   if (netif_running(netdev)) {
+   unsigned long flags;
+
+   /* The following is an abbreviated subset of the
+* steps taken by s2io_card_down(), avoiding
+* steps that touch the card itself.
+*/
+   del_timer_sync(&sp->alarm_timer);
+   atomic_set(&sp->card_state, CARD_DOWN);
+
+   /* Kill tasklet. */
+   tasklet_kill(&sp->task);
+
+   /* Free all Tx buffers */
+   spin_lock_irqsave(&sp->tx_lock, flags);
+   free_tx_buffers(sp);
+   spin_unlock_irqrestore(&sp->tx_lock, flags);
+
+   /* Free all Rx buffers */
+   spin_lock_irqsave(&sp->rx_lock, flags);
+   free_rx_buffers(sp);
+   spin_unlock_irqrestore(&sp->rx_lock, flags);
+
+   clear_bit(0, &(sp->link_state));
+   sp->device_close_flag = TRUE;   /* Device is shut down. */
+   }
+   pci_disable_device(pdev);
+
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ * At this point, the card has experienced a hard reset,
+ * followed by fixups by BIOS, and has its config space
+ * set up identically to what it was at cold boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+   struct net_device *netdev = pci_get_dr

Re: [openib-general] [PATCH 1/5] NetEffect 10Gb RNIC Userspace Library: userspace config generation

2006-10-27 Thread Roland Dreier
 > I don't think the userspace stuff belongs on netdev. Someone please
 > correct me if I'm wrong.

Yeah, it's not a bad thing to get wider review, but your userspace
library is pretty much your business.  If you screw it up it doesn't
hurt anyone else, so I'm happy to let you write it however you want.

 - R.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [openib-general] [PATCH 1/5] NetEffect 10Gb RNIC Userspace Library: userspace config generation

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 10:56:45 -0700
Roland Dreier <[EMAIL PROTECTED]> wrote:

>  > I don't think the userspace stuff belongs on netdev. Someone please
>  > correct me if I'm wrong.
> 
> Yeah, it's not a bad thing to get wider review, but your userspace
> library is pretty much your business.  If you screw it up it doesn't
> hurt anyone else, so I'm happy to let you write it however you want.
> 
>  - R.
>

I prefer a pointer to the project download source.
Seeing the userspace stuff helps answer questions where the administration
process is confusing (or could/should be done differently).

-- 
Stephen Hemminger <[EMAIL PROTECTED]>


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Fri, 27 Oct 2006 10:30:16 -0700

> My proposed method restricts TCP choices to fair algorithms.
> This is a net-wide, not system-wide, issue; it should not be done
> by kernel policy choice (capability), but by a build choice.

I think this sucks even worse than the current situation.

How difficult is it to understand that an administrator might
like to be able to build in and experiment with some congestion
control algorithms, yet still be able to keep his normal users
from using them?


Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Fri, 27 Oct 2006 07:41:02 -0700

> Please no, it makes the socket option useless.
> If you want to tag some "bad apples" that's okay, but it would need
> some more infrastructure.

The behavior of the TCP stack is a system wide decision.

If anything it should be "everything besides the default
and Reno is off limits to unprivileged users" with an
administrative method to override that.
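
For readers following the thread, the per-socket choice being argued over
appears to be the Linux TCP_CONGESTION socket option. Below is a minimal
user-space sketch of how an application would request an algorithm by name.
The helper names are made up for illustration, and whether the call succeeds
for an unprivileged user is exactly the policy question under discussion.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef TCP_CONGESTION
#define TCP_CONGESTION 13	/* value used by the Linux headers */
#endif

/* Hypothetical helper: ask the kernel to use the named algorithm on
 * this socket only. Returns 0 on success, -1 on failure with errno
 * set by setsockopt() (for example when the name is not registered).
 */
int set_congestion_control(int fd, const char *name)
{
	assert(name != NULL);
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
			  name, (socklen_t)strlen(name));
}

/* Hypothetical helper: read back the algorithm attached to the socket.
 * The buffer is zeroed first so a short name is always NUL-terminated
 * when buf is larger than the name.
 */
int get_congestion_control(int fd, char *buf, socklen_t len)
{
	assert(buf != NULL);
	memset(buf, 0, len);
	return getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len);
}
```

An application would call set_congestion_control(fd, "reno") right after
socket() and before connect(). A name the kernel does not know is rejected,
so callers should be prepared to fall back to the system default.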


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 14:17:49 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Fri, 27 Oct 2006 10:30:16 -0700
> 
> > My proposed method restricts TCP choices to fair algorithms.
> > This is a net-wide, not system-wide, issue; it should not be done
> > by kernel policy choice (capability), but by a build choice.
> 
> I think this sucks even worse than the current situation.
> 
> How difficult is it to understand that an administrator might
> like to be able to build in and experiment with some congestion
> control algorithms, yet still be able to keep his normal users
> from using them?

Only some (very few) have any bad consequences. So the typical
distribution should be able to switch with most available for everyone,
and only a few needing special privileges.


-- 
Stephen Hemminger <[EMAIL PROTECTED]>


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Fri, 27 Oct 2006 14:24:02 -0700

> Only some (very few) have any bad consequences. So the typical
> distribution should be able to switch with most available for everyone,
> and only a few needing special privileges.

I would strongly disagree as we've had several OOPS'er class bugs in
the less frequently used algorithms.

I stand by my position that an administrator's wish to do this is
quite valid.

It's bad enough that people are all over us for the default algorithm
we have chosen, so it'd be extremely irresponsible and even worse if
we allowed users to select any of the other "research" algorithms for
their TCP connections by default just because those modules happened
to be configured into the kernel.

This userspace convenience argument holds zero water.

Provide a way for the administrator to control the situation fully,
and choose a sane default which errs on the side of caution for the
sake of internet stability.


2.4/2.6 share in linux routers ?

2006-10-27 Thread Yakov Lerner

Hello,

I'd like to find/gather estimates of the 2.4 vs 2.6 share in [small]
Linux routers in 2006. Can anyone offer estimates and/or references?

My own estimate is that a definite majority of small Linux routers
in 2006 run 2.4 (I'd say >75%). Can anyone offer support or correction?

Which factors make 2.4 or 2.6 more attractive for a small Linux router
(128-256 MB RAM)?

Yakov Lerner


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 14:37:01 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Fri, 27 Oct 2006 14:24:02 -0700
> 
> > Only some (very few) have any bad consequences. So the typical
> > distribution should be able to switch with most available for everyone,
> > and only a few needing special privileges.
> 
> I would strongly disagree as we've had several OOPS'er class bugs in
> the less frequently used algorithms.
> 

Then tag those as restricted.  Why should we keep apps away from
the simple ones?


Re: 2.4/2.6 share in linux routers ?

2006-10-27 Thread David Miller

Please stop all of this cross posting.  I've just seen you post
this same exact email on the netfilter lists too.



Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Fri, 27 Oct 2006 14:59:13 -0700

> On Fri, 27 Oct 2006 14:37:01 -0700 (PDT)
> David Miller <[EMAIL PROTECTED]> wrote:
> 
> > From: Stephen Hemminger <[EMAIL PROTECTED]>
> > Date: Fri, 27 Oct 2006 14:24:02 -0700
> > 
> > > Only some (very few) have any bad consequences. So the typical
> > > distribution should be able to switch with most available for everyone,
> > > and only a few needing special privileges.
> > 
> > I would strongly disagree as we've had several OOPS'er class bugs in
> > the less frequently used algorithms.
> > 
> 
> Then tag those as restricted.  Why should we keep apps away from
> the simple ones?

You can't predict bugs, but what you can do is know that the lesser
used algorithms are by definition less tested and therefore more
likely to have bugs.  Everything except the default and Reno are
lesser used.

Safe by default, there is no other choice.  You fail to respond to
THAT part of my email.  That's the important point.  Let me
reiterate:

> It's bad enough that people are all over us for the default algorithm
> we have chosen, so it'd be extremely irresponsible and even worse if
> we allowed users to select any of the other "research" algorithms for
> their TCP connections by default just because those modules happened
> to be configured into the kernel.
>
> This userspace convenience argument holds zero water.
>
> Provide a way for the administrator to control the situation fully,
> and choose a sane default which errs on the side of caution for the
> sake of internet stability.

Please reread this and consider why it's important.


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 15:12:38 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Fri, 27 Oct 2006 14:59:13 -0700
> 
> > On Fri, 27 Oct 2006 14:37:01 -0700 (PDT)
> > David Miller <[EMAIL PROTECTED]> wrote:
> > 
> > > From: Stephen Hemminger <[EMAIL PROTECTED]>
> > > Date: Fri, 27 Oct 2006 14:24:02 -0700
> > > 
> > > > Only some (very few) have any bad consequences. So the typical
> > > > distribution should be able to switch with most available for everyone,
> > > > and only a few needing special privileges.
> > > 
> > > I would strongly disagree as we've had several OOPS'er class bugs in
> > > the less frequently used algorithms.
> > > 
> > 
> > Then tag those as restricted.  Why should we keep apps away from
> > the simple ones?
> 
> You can't predict bugs, but what you can do is know that the lesser
> used algorithms are by definition less tested and therefore more
> likely to have bugs.  Everything except the default and Reno are
> lesser used.

If they aren't usable they should be marked BROKEN or something
like that. The stability argument doesn't really work, we don't
like to let root kill the system either.
 
> Safe by default, there is no other choice.  You fail to respond to
> THAT part of my email.  That's the important point.  Let me
> reiterate:
> 
> > It's bad enough that people are all over us for the default algorithm
> > we have chosen, so it'd be extremely irresponsible and even worse if
> > we allowed users to select any of the other "research" algorithms for
> > their TCP connections by default just because those modules happened
> > to be configured into the kernel.

Make it hard for them to configure then.  I don't want your
distro to ship with the risky ones turned on.  But we should allow
use of reno, bic, cubic, lp, htcp, and westwood (maybe) by regular
users if admin allows.

> > This userspace convenience argument holds zero water.
> >
> > Provide a way for the administrator to control the situation fully,
> > and choose a sane default which errs on the side of caution for the
> > sake of internet stability.
> 
> Please reread this and consider why it's important.

The current situation is fine. You have to ask for them in the configuration,
and root has to either load the module or set it as default.

The restricted flag patch which you have ignored, would be a way to
allow them to be configured but tag the "bad apples" for only
root usage.






Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Fri, 27 Oct 2006 15:21:49 -0700

> The restricted flag patch which you have ignored, would be a way to
> allow them to be configured but tag the "bad apples" for only
> root usage.

I haven't ignored it, it's in my backlog below more important
things like Appletalk OOPS'ers.


Re: 2.4/2.6 share in linux routers ?

2006-10-27 Thread Yakov Lerner

On 10/28/06, David Miller <[EMAIL PROTECTED]> wrote:

> Please stop all of this cross posting.  I've just seen you post
> this same exact email on the netfilter lists too.

Sorry


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
How about another way of controlling this via sysctl?

First, add code for a read-only file:
/proc/sys/net/ipv4/tcp_available_congestion_control  (or shorter name)
This will show everything compiled in (even if not loaded yet), similar
to /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

Second, add a flag (allowed) to the tcp_congestion structure [inverse of
the earlier restricted flag].

Third, add a read-write file:
/proc/sys/net/ipv4/tcp_allowed_congestion_control
to show and set/clear the allowed flag. The default value would be
"reno xxx", where xxx is whatever the default value from the kernel
config is (currently cubic).

I would use sysfs for this, but it makes sense not to spread TCP stuff
across both sysctl and sysfs.
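
The three steps above can be sketched in user space roughly as follows. This
is only an illustration under stated assumptions: the table, the ca_* names,
and the all-or-nothing handling of an unknown name are hypothetical and are
not taken from any posted patch. The initial flags follow the proposed
default of "reno" plus the configured default (cubic).

```c
#include <assert.h>
#include <string.h>

#define CA_NAME_MAX 16

/* Hypothetical stand-in for the kernel's congestion control registry;
 * "allowed" models the per-algorithm flag proposed in the message.
 */
struct ca_entry {
	char name[CA_NAME_MAX];
	int allowed;	/* nonzero: unprivileged users may select it */
};

static struct ca_entry ca_table[] = {
	{ "reno", 1 }, { "cubic", 1 }, { "bic", 0 }, { "htcp", 0 },
};
#define CA_COUNT (sizeof(ca_table) / sizeof(ca_table[0]))

static struct ca_entry *ca_find(const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < CA_COUNT; i++)
		if (strlen(ca_table[i].name) == len &&
		    strncmp(ca_table[i].name, name, len) == 0)
			return &ca_table[i];
	return NULL;
}

/* Returns 1 if the caller may select the named algorithm, else 0. */
int ca_may_use(const char *name, int privileged)
{
	struct ca_entry *ca = ca_find(name, strlen(name));

	return ca ? (privileged || ca->allowed) : 0;
}

/* Walk a space-separated list; with apply == 0 only validate names. */
static int ca_walk(const char *list, int apply)
{
	const char *p = list;

	while (*p) {
		size_t len;
		struct ca_entry *ca;

		p += strspn(p, " \n");
		len = strcspn(p, " \n");
		if (len) {
			ca = ca_find(p, len);
			if (!ca)
				return -1;
			if (apply)
				ca->allowed = 1;
		}
		p += len;
	}
	return 0;
}

/* Models a write to the proposed tcp_allowed_congestion_control file.
 * Returns 0 on success, or -1 (flags unchanged) if a name is unknown.
 */
int ca_set_allowed(const char *list)
{
	size_t i;

	assert(list != NULL);
	if (ca_walk(list, 0) < 0)
		return -1;
	for (i = 0; i < CA_COUNT; i++)
		ca_table[i].allowed = 0;
	return ca_walk(list, 1);
}
```

With this shape, the per-socket selection path only needs a ca_may_use()
style check, and the administrator changes policy at run time by writing a
new list, without rebuilding the kernel.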


Re: [PATCH 2.6.19-rc3 v2 1/2] amso1100 - Use dma_alloc_coherent instead of kmalloc/dma_map_single.

2006-10-27 Thread Roland Dreier
tsk, tsk:

fatal: 7 lines add trailing whitespaces.

applied to for-2.6.19 anyway, thanks.


Re: [PATCH 2.6.19-rc3 v2 2/2] amso1100 - Fix incorrect pr_debug().

2006-10-27 Thread Roland Dreier
Applied, thanks.


[RFC] tcp: available congestion control

2006-10-27 Thread Stephen Hemminger
Nice way to see what congestion control modules are loaded.
It does impose a soft limit of 32 possibilities.
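
The string building at the heart of this can be tried in user space with a
sketch like the one below. It assumes a plain array of names in place of the
kernel's registered list, and the explicit truncation check is an addition
of this sketch rather than something the patch does. The function and macro
names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define CA_NAME_MAX 16
#define CA_MAX 32

/* Join names into buf as a space-separated list.
 * Returns how many names fit completely. A name that would be cut
 * off is dropped, so buf always holds whole names.
 */
int ca_build_list(char *buf, int maxlen,
		  const char *const *names, int count)
{
	int offs = 0, i, n;

	assert(buf != NULL && maxlen > 0);
	buf[0] = '\0';
	for (i = 0; i < count; i++) {
		/* snprintf() reports the length it wanted to write,
		 * which can exceed the space that was left.
		 */
		n = snprintf(buf + offs, maxlen - offs, "%s%s",
			     offs == 0 ? "" : " ", names[i]);
		if (n < 0 || n >= maxlen - offs) {
			buf[offs] = '\0';	/* drop the partial name */
			break;
		}
		offs += n;
	}
	return i;
}
```

A buffer of CA_MAX * (CA_NAME_MAX + 1) bytes, as in the patch, holds 32
names of maximum length plus separators, which is where the soft limit of
32 comes from.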

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

---
 include/linux/sysctl.h |1 +
 include/net/tcp.h  |3 +++
 net/ipv4/sysctl_net_ipv4.c |   25 -
 net/ipv4/tcp_cong.c|   14 ++
 4 files changed, 42 insertions(+), 1 deletion(-)

--- skge.orig/include/linux/sysctl.h
+++ skge/include/linux/sysctl.h
@@ -418,6 +418,7 @@ enum
NET_CIPSOV4_CACHE_BUCKET_SIZE=119,
NET_CIPSOV4_RBM_OPTFMT=120,
NET_CIPSOV4_RBM_STRICTVALID=121,
+   NET_TCP_AVAIL_CONG_CONTROL=122,
 };
 
 enum {
--- skge.orig/include/net/tcp.h
+++ skge/include/net/tcp.h
@@ -621,6 +621,8 @@ enum tcp_ca_event {
  * Interface for adding new TCP congestion control handlers
  */
 #define TCP_CA_NAME_MAX16
+#define TCP_CA_MAX 32
+
 struct tcp_congestion_ops {
struct list_headlist;
 
@@ -659,6 +661,7 @@ extern void tcp_unregister_congestion_co
 extern void tcp_init_congestion_control(struct sock *sk);
 extern void tcp_cleanup_congestion_control(struct sock *sk);
 extern int tcp_set_default_congestion_control(const char *name);
+extern void tcp_get_available_congestion_control(char *name, int maxlen);
 extern void tcp_get_default_congestion_control(char *name);
 extern int tcp_set_congestion_control(struct sock *sk, const char *name);
 extern void tcp_slow_start(struct tcp_sock *tp);
--- skge.orig/net/ipv4/sysctl_net_ipv4.c
+++ skge/net/ipv4/sysctl_net_ipv4.c
@@ -108,6 +108,22 @@ static int proc_tcp_congestion_control(c
return ret;
 }
 
+static int proc_tcp_available_congestion_control(ctl_table *ctl,
+int write, struct file * filp,
+void __user *buffer, size_t *lenp,
+loff_t *ppos)
+{
+   char val[TCP_CA_MAX*(TCP_CA_NAME_MAX+1)];
+   ctl_table tbl = {
+   .data = val,
+   .maxlen = TCP_CA_MAX*(TCP_CA_NAME_MAX+1),
+   };
+
+   tcp_get_available_congestion_control(val, tbl.maxlen);
+
+   return proc_dostring(&tbl, write, filp, buffer, lenp, ppos);
+}
+
 static int sysctl_tcp_congestion_control(ctl_table *table, int __user *name,
 int nlen, void __user *oldval,
 size_t __user *oldlenp,
@@ -133,9 +149,9 @@ static int __init tcp_congestion_default
 {
return tcp_set_default_congestion_control(CONFIG_DEFAULT_TCP_CONG);
 }
-
 late_initcall(tcp_congestion_default);
 
+
 ctl_table ipv4_table[] = {
 {
.ctl_name   = NET_IPV4_TCP_TIMESTAMPS,
@@ -738,6 +754,13 @@ ctl_table ipv4_table[] = {
.proc_handler   = &proc_dointvec,
},
 #endif /* CONFIG_NETLABEL */
+   {
+   .ctl_name   = NET_TCP_AVAIL_CONG_CONTROL,
+   .procname   = "tcp_available_congestion_control",
+   .mode   = 0444,
+   .maxlen = TCP_CA_MAX*(TCP_CA_NAME_MAX+1),
+   .proc_handler   = &proc_tcp_available_congestion_control,
+   },
{ .ctl_name = 0 }
 };
 
--- skge.orig/net/ipv4/tcp_cong.c
+++ skge/net/ipv4/tcp_cong.c
@@ -144,6 +144,20 @@ void tcp_get_default_congestion_control(
rcu_read_unlock();
 }
 
+/* Build string with list of available congestion control values */
+void tcp_get_available_congestion_control(char *name, int maxlen)
+{
+   struct tcp_congestion_ops *ca;
+   int offs = 0;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(ca, &tcp_cong_list, list) {
+   offs += snprintf(name + offs, maxlen - offs, "%s%s",
+offs == 0 ? "" : " ", ca->name);
+   }
+   rcu_read_unlock();
+}
+
 /* Change congestion control for socket */
 int tcp_set_congestion_control(struct sock *sk, const char *name)
 {