Re: Network virtualization/isolation

2006-10-27 Thread Dmitry Mishin
On Thursday 26 October 2006 19:56, Stephen Hemminger wrote:
 On Thu, 26 Oct 2006 11:44:55 +0200

 Daniel Lezcano [EMAIL PROTECTED] wrote:
  Stephen Hemminger wrote:
   On Wed, 25 Oct 2006 17:51:28 +0200
  
   Daniel Lezcano [EMAIL PROTECTED] wrote:
  Hi Stephen,
  
  Work on container enablement in the kernel is currently making good
  progress. The ipc, pid, utsname and filesystem system resources are
  isolated/virtualized using the namespace concept.
  
  Network virtualization/isolation, however, is still missing. Two
  approaches have been proposed: isolation at layer 2 and isolation at
  layer 3.
  
  The first instantiates one network device per namespace and adds a peer
  network device to the root namespace; all routing resources are
  relative to the namespace. This work is done by Andrey Savochkin
  from the OpenVZ project.
  
  The second relies on routes and associates a network namespace
  pointer with each route. For incoming traffic, the packet follows an
  input route and retrieves the associated network namespace. For
  outgoing traffic, the packet, identified by the network namespace it
  comes from, follows only the routes matching that namespace. This is
  my work.
  
  IMHO, we need both approaches: layer 2 to provide *very* strong
  isolation for system containers, at a performance cost, and layer 3 to
  provide good isolation for lightweight or application containers when
  performance matters more.
  
  Do you have some suggestions ? What is your point of view on that ?
  
  Thanks in advance.
  
 -- Daniel
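
The layer-3 idea described above can be illustrated with a small userspace
sketch. It is not taken from either patch set: the struct and function names
are hypothetical, addresses are plain host-order integers, and the lookup is
a first-match linear scan rather than the kernel's real routing lookup. It
only shows what tagging each route with a namespace pointer might look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a network namespace. */
struct net_ns {
	int id;
};

/* A route tagged with the namespace that owns it. */
struct ns_route {
	uint32_t prefix;	/* network address, host byte order */
	uint32_t mask;		/* netmask, host byte order */
	const struct net_ns *ns;
};

/* Input path: the route matching the destination yields the namespace. */
const struct net_ns *ns_for_input(const struct ns_route *tbl, size_t n,
				  uint32_t daddr)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if ((daddr & tbl[i].mask) == tbl[i].prefix)
			return tbl[i].ns;
	return NULL;
}

/* Output path: only routes owned by the sender's namespace are considered. */
const struct ns_route *route_for_output(const struct ns_route *tbl, size_t n,
					const struct net_ns *ns, uint32_t daddr)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tbl[i].ns == ns && (daddr & tbl[i].mask) == tbl[i].prefix)
			return &tbl[i];
	return NULL;
}
```

In this sketch a sender in one namespace simply gets no route for a
destination that only another namespace's routes cover, which is the
isolation property the layer-3 proposal relies on.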
  
   Any solution should allow both and it should build on the existing
   netfilter infrastructure.
 
  The problem is that netfilter cannot provide good isolation. For
  example, how would the netstat command be handled? How do we avoid
  seeing IP addresses assigned to another container when running
  ifconfig? Furthermore, one of the main benefits of network isolation
  is container mobility, and that is only possible if the network
  resources inside the kernel can be identified per container in order
  to checkpoint/restart them.
 
  The all-in-namespace solution, i.e. at layer 2, is very good in terms of
  isolation but adds a non-negligible overhead. Layer 3 isolation
  has insignificant overhead and provides good isolation, well suited to
  application containers.
 
  Unfortunately, from an implementation point of view, layer 3 cannot
  be a subset of layer 2 isolation when using all-in-namespace, and layer
  2 isolation cannot be an extension of layer 3 isolation.
 
  I think the layer 2 and layer 3 implementations can coexist. You
  can, for example, create a system container with layer 2 isolation and
  add layer 3 isolation inside it.
 
  Does that make sense ?
 
  -- Daniel

 Assuming you are talking about pseudo-virtualized environments,
 there are several different discussions.

 1. How should the namespace be isolated for the virtualized containered
applications?

 2. How should traffic be restricted into/out of those containers. This
is where existing netfilter, classification, etc, should be used.
The network code is overly rich as it is, we don't need another
abstraction.

 3. Can the virtualized containers be secure? No, we really can't keep a
hostile root in a container from killing the system without going to
a hypervisor.
Stephen, 

A virtualized container can be secure if it provides complete system
virtualization, not just an application container. OpenVZ implements this
and is widely used around the world. And of course, we take great care to
keep a hostile root from killing the whole system.

OpenVZ uses virtualization at the IP level (implemented by Andrey Savochkin,
http://marc.theaimsgroup.com/?l=linux-netdev&m=115572448503723), with all
necessary network objects isolated/virtualized, such as sockets, devices,
routes, netfilters, etc.

-- 
Thanks,
Dmitry.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network virtualization/isolation

2006-10-27 Thread Daniel Lezcano


[ ... ]

Dmitry Mishin wrote:
 Stephen,
 
 A virtualized container can be secure if it provides complete system
 virtualization, not just an application container. OpenVZ implements this
 and is widely used around the world. And of course, we take great care to
 keep a hostile root from killing the whole system.

OpenVZ power !!

 OpenVZ uses virtualization at the IP level (implemented by Andrey Savochkin,
 http://marc.theaimsgroup.com/?l=linux-netdev&m=115572448503723), with all
 necessary network objects isolated/virtualized, such as sockets, devices,
 routes, netfilters, etc.

No, it uses virtualization at layer 2, and I already mentioned that
(see the first email of the thread), but thank you for the email
thread pointer.

The point of the discussion is not to convince Stephen that layer 2 or
layer 3 is best, but to present the pros and cons of each solution and to
get the point of view of a network guru.


Regards.

-- Daniel






Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Hagen Paul Pfeifer
* David Miller | 2006-10-26 17:02:21 [-0700]:

 Your email client turned the tabs into spaces in the patch making it
 useless.

Sorry, my mistake! I am traveling and pasted the patch into my editor, which
ate all the tabs. Once more: sorry!


Check if user has CAP_NET_ADMIN capability to change
congestion control algorithm.


Signed-off-by: Hagen Paul Pfeifer [EMAIL PROTECTED]

---
 net/ipv4/tcp_cong.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index af0aca1..c1ae2e9 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -10,6 +10,7 @@ #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/capability.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -151,6 +152,9 @@ int tcp_set_congestion_control(struct so
 	struct tcp_congestion_ops *ca;
 	int err = 0;
 
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
 	if (ca == icsk->icsk_ca_ops)
-- 
1.4.1.1


RE: [PATCH] s2io: add PCI error recovery support

2006-10-27 Thread Ananda Raju
Looking at all the scenarios, I think the first patch is OK. Can you add the
watchdog timer fix to the first patch and resubmit?

-----Original Message-----
From: Linas Vepstas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 26, 2006 3:52 PM
To: Ananda Raju
Cc: Wen Xiong; linux-kernel@vger.kernel.org;
[EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik;
Andrew Morton
Subject: Re: [PATCH] s2io: add PCI error recovery support

Hi.

On Thu, Oct 26, 2006 at 05:56:34AM -0400, Ananda Raju wrote:
 Hi, 
 Can you try the attached patch? It is simple: we set the card
 state to down in error_detected() so that all entry points return an error
 and don't proceed further.
 
 In slot_reset() we call s2io_card_down(), which resets the adapter.
 In io_resume() we bring the driver back up.

Simplicity is always better. However, some questions/comments:

 @@ -4175,6 +4186,10 @@ static irqreturn_t s2io_isr(int irq, voi
  	mac_info_t *mac_control;
  	struct config_param *config;
  
 +	if (atomic_read(&sp->card_state) == CARD_DOWN) {
 +		return IRQ_NONE;
 +	}

I used

	if (sp->pdev->error_state != pci_channel_io_normal)

here for a reason: pdev->error_state is set even in interrupt
context, that is, it gets set even if interrupts are disabled, and
so it represents the actual state immediately. By contrast, the
error callbacks do not get called until possibly much later,
and so sp->card_state = CARD_DOWN might not get set for a while.

If, for any reason, e.g. some obscure corner case, the s2io
generates zillions of interrupts, this could result in a soft lockup.
I actually saw this in the symbios device driver, which will
regenerate an interrupt until it's acknowledged -- and so it
sat there, spinning. :-(

I was returning IRQ_HANDLED instead of IRQ_NONE, so as to avoid
falling into handle_bad_irq() or report_bad_irq(). I haven't
seen this happen on s2io, but thought it would still be wise.

If this can't happen, then there's no problem here.
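
To make the difference concrete, here is a minimal userspace mock of the
check discussed above. It is only a sketch: the mock_* types and constants
imitate the kernel names used in this thread but are hypothetical, and the
"pending" counter stands in for whatever the real handler reads from the
adapter.

```c
#include <assert.h>
#include <stddef.h>

/* Mock return codes, modelled on IRQ_NONE / IRQ_HANDLED. */
enum mock_irqreturn { MOCK_IRQ_NONE = 0, MOCK_IRQ_HANDLED = 1 };

/* Mock channel states, modelled on pci_channel_io_normal and friends. */
enum mock_channel_state { MOCK_IO_NORMAL = 1, MOCK_IO_FROZEN = 2 };

struct mock_pci_dev {
	enum mock_channel_state error_state;
};

struct mock_nic {
	struct mock_pci_dev *pdev;
	unsigned int pending;	/* events waiting to be serviced */
	unsigned long handled;	/* events serviced so far */
};

enum mock_irqreturn mock_isr(struct mock_nic *sp)
{
	assert(sp != NULL && sp->pdev != NULL);

	/*
	 * While the channel is not in the normal state, do not touch the
	 * device; claim the interrupt so it is not reported as spurious.
	 */
	if (sp->pdev->error_state != MOCK_IO_NORMAL)
		return MOCK_IRQ_HANDLED;

	if (sp->pending == 0)
		return MOCK_IRQ_NONE;	/* not our interrupt */

	sp->handled += sp->pending;
	sp->pending = 0;
	return MOCK_IRQ_HANDLED;
}
```

The point of the sketch is only the ordering: the error state is tested
first, before any access that would normally go to the hardware.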

 +/**
 + * s2io_io_slot_reset - called after the pci bus has been reset.
 + * @pdev: Pointer to PCI device
 + *
 + * Restart the card from scratch, as if from a cold-boot.
 + */
 +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
 +{

At this point, the card has just experienced a hardware reset
(the #RST wire was held low for 250 millisecs, followed by
a settle time of 2 seconds, followed by whatever BIOS thinks
it needed to do, followed by a restore of the pci config space
to what it was after a cold boot). So the card is in a fresh
state; in theory it's identical to a cold boot. So ...
are you sure you want to down the card at this point?

 +	s2io_card_down(sp);
 +	sp->device_close_flag = TRUE;	/* Device is shut down. */


One problem I'm having is that the watchdog timer sometimes
pops and tries to reset the card before s2io_card_down()
has a chance to run. I fixed this ... 

==
So -- just for grins, I thought to myself, Maybe I can make 
s2io be the first adapter ever to fully recover without 
a hard reset of the card.

The idea is simple: 

1) enable MMIO,
2) call s2io_card_down()
3) enable DMA
4) call s2io_card_up()

I have a patch that does this, but then hit a few more snags.
I haven't yet nailed down all the trouble spots, maybe tomorrow.

--linas




Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 12:43:11 +0200
Hagen Paul Pfeifer [EMAIL PROTECTED] wrote:

 * David Miller | 2006-10-26 17:02:21 [-0700]:
 
 Your email client turned the tabs into spaces in the patch making it
 useless.
 
 Sorry my mistake! I am en route and I paste the patch into my editor, who eat
 all tabs. One more time: sorry!
 
 
 Check if user has CAP_NET_ADMIN capability to change
 congestion control algorithm.
 
 
 Signed-off-by: Hagen Paul Pfeifer [EMAIL PROTECTED]

Please no, it makes the socket option useless.
If you want to tag some bad apples, that's okay, but it would need
some more infrastructure.
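
For readers who have not used the option being discussed: an application
selects the congestion control algorithm per socket with setsockopt() and
TCP_CONGESTION. The helper below is a minimal sketch for Linux; its name and
its negative-errno return convention are choices made for this example, not
part of any patch in this thread. With the proposed check in place, a call
like this from an unprivileged process would be expected to fail with EPERM.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Ask the kernel to use the named congestion control algorithm
 * (for example "reno") on a TCP socket.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_congestion_control(int fd, const char *name)
{
	assert(name != NULL);

	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
		       name, (socklen_t)strlen(name)) < 0)
		return -errno;
	return 0;
}
```

A caller would typically create the socket with socket(AF_INET, SOCK_STREAM,
0), call set_congestion_control(fd, "reno") before connecting, and fall back
to the system default if the call fails.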


Re: [PATCH] Rewrite e100_phys_id

2006-10-27 Thread Auke Kok

Matthew Wilcox wrote:

On Thu, Oct 26, 2006 at 01:04:32PM -0700, Auke Kok wrote:
no objections, so I'll ACK it with the notion that I'm going to let our 
labs do some more testing on it with all the latest changes to it.


Thanks, Auke.  Here's the equivalent patch for e1000.  I don't have a
convenient machine to test it on, but it reduces the size of the driver
by 1.5k.


this is a bit (!) more complex than e100, so I'm going to take a bit of time to review 
this patch.


thanks,

Auke





diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 7ecce43..1e22da6 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -257,9 +257,6 @@ #endif
 	struct work_struct reset_task;
 	uint8_t fc_autoneg;
 
-	struct timer_list blink_timer;
-	unsigned long led_status;
-
 	/* TX */
 	struct e1000_tx_ring *tx_ring;	/* One per active queue */
 	unsigned long tx_queue_len;
diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c
index 773821e..620afa5 100644
--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -1819,61 +1819,15 @@ e1000_set_wol(struct net_device *netdev,
 	return 0;
 }
 
-/* toggle LED 4 times per second = 2 blinks per second */
-#define E1000_ID_INTERVAL	(HZ/4)
-
-/* bit defines for adapter->led_status */
-#define E1000_LED_ON		0
-
-static void
-e1000_led_blink_callback(unsigned long data)
-{
-	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
-
-	if (test_and_change_bit(E1000_LED_ON, &adapter->led_status))
-		e1000_led_off(&adapter->hw);
-	else
-		e1000_led_on(&adapter->hw);
-
-	mod_timer(&adapter->blink_timer, jiffies + E1000_ID_INTERVAL);
-}
-
 static int
 e1000_phys_id(struct net_device *netdev, uint32_t data)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 
-	if (!data || data > (uint32_t)(MAX_SCHEDULE_TIMEOUT / HZ))
-		data = (uint32_t)(MAX_SCHEDULE_TIMEOUT / HZ);
-
-	if (adapter->hw.mac_type < e1000_82571) {
-		if (!adapter->blink_timer.function) {
-			init_timer(&adapter->blink_timer);
-			adapter->blink_timer.function = e1000_led_blink_callback;
-			adapter->blink_timer.data = (unsigned long) adapter;
-		}
-		e1000_setup_led(&adapter->hw);
-		mod_timer(&adapter->blink_timer, jiffies);
-		msleep_interruptible(data * 1000);
-		del_timer_sync(&adapter->blink_timer);
-	} else if (adapter->hw.phy_type == e1000_phy_ife) {
-		if (!adapter->blink_timer.function) {
-			init_timer(&adapter->blink_timer);
-			adapter->blink_timer.function = e1000_led_blink_callback;
-			adapter->blink_timer.data = (unsigned long) adapter;
-		}
-		mod_timer(&adapter->blink_timer, jiffies);
-		msleep_interruptible(data * 1000);
-		del_timer_sync(&adapter->blink_timer);
-		e1000_write_phy_reg(&(adapter->hw), IFE_PHY_SPECIAL_CONTROL_LED, 0);
-	} else {
-		e1000_blink_led_start(&adapter->hw);
-		msleep_interruptible(data * 1000);
-	}
+	if (data == 0)
+		data = 2;
 
-	e1000_led_off(&adapter->hw);
-	clear_bit(E1000_LED_ON, &adapter->led_status);
-	e1000_cleanup_led(&adapter->hw);
+	e1000_blink_led(&adapter->hw, data);
 
 	return 0;
 }
diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
index 65077f3..db5e999 100644
--- a/drivers/net/e1000/e1000_hw.c
+++ b/drivers/net/e1000/e1000_hw.c
@@ -6071,7 +6071,7 @@ e1000_id_led_init(struct e1000_hw * hw)
  *
  * hw - Struct containing variables accessed by shared code
  */
-int32_t
+static int32_t
 e1000_setup_led(struct e1000_hw *hw)
 {
     uint32_t ledctl;
@@ -6123,50 +6123,11 @@ e1000_setup_led(struct e1000_hw *hw)
 
 
 /**
- * Used on 82571 and later Si that has LED blink bits.
- * Callers must use their own timer and should have already called
- * e1000_id_led_init()
- * Call e1000_cleanup led() to stop blinking
- *
- * hw - Struct containing variables accessed by shared code
- */
-int32_t
-e1000_blink_led_start(struct e1000_hw *hw)
-{
-    int16_t  i;
-    uint32_t ledctl_blink = 0;
-
-    DEBUGFUNC("e1000_id_led_blink_on");
-
-    if (hw->mac_type < e1000_82571) {
-        /* Nothing to do */
-        return E1000_SUCCESS;
-    }
-    if (hw->media_type == e1000_media_type_fiber) {
-        /* always blink LED0 for PCI-E fiber */
-        ledctl_blink = E1000_LEDCTL_LED0_BLINK |
-                     (E1000_LEDCTL_MODE_LED_ON << E1000_LEDCTL_LED0_MODE_SHIFT);
-    } else {
-

Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Hagen Paul Pfeifer
* Stephen Hemminger | 2006-10-27 07:41:02 [-0700]:

 Please no, it makes the socket option useless.

Technically no; in terms of usability for everybody, yes. You are right,
Stephen, and as a programmer I understand you completely!

But on the other hand, we know for sure that it IS a problem if we allow
everybody to favor his own socket.

In my opinion we should put fairness before usability! As John Heffner
suggested, we could introduce a ranking system for congestion control
algorithms, but that solution seems a bit oversized and probably can't be
completely guaranteed (complex interactions between the protocols in
different environments and so on, you know).

HGN




-- 
 /°\   --- JOIN NOW!!! --- 
 \ /  ASCII ribbon campaign
  X   against HTML 
 / \in mail and news   


Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread Stephen Hemminger

Hagen Paul Pfeifer wrote:

 * Stephen Hemminger | 2006-10-27 07:41:02 [-0700]:

  Please no, it makes the socket option useless.

 Technically no; in terms of usability for everybody, yes. You are right,
 Stephen, and as a programmer I understand you completely!

 But on the other hand, we know for sure that it IS a problem if we allow
 everybody to favor his own socket.

 In my opinion we should put fairness before usability! As John Heffner
 suggested, we could introduce a ranking system for congestion control
 algorithms, but that solution seems a bit oversized and probably can't be
 completely guaranteed (complex interactions between the protocols in
 different environments and so on, you know).

 HGN

If there is a dangerous choice, then it should be removed. Otherwise I can't
see the problem. It is a bigger risk to have to escalate the privileges
of an application just to allow it to use something.


[IPROUTE] manpage for rtmon

2006-10-27 Thread Michael Prokop
Hello,

another manpage, this time for rtmon. Would be great if it could be
applied to the next release too.

regards,
-mika-
-- 
 ,'`. http://www.michael-prokop.at/
(  grml.org -» Linux Live-CD for texttool-users and sysadmins
 `._,' http://www.grml.org/
.TH RTMON 8
.SH NAME
rtmon \- listens to and monitors RTnetlink
.SH SYNOPSIS
.B rtmon
.RI [ options ] file FILE [ all | LISTofOBJECTS ]
.SH DESCRIPTION
This manual page briefly documents the
.B rtmon
command.
.PP
\fBrtmon\fP is an RTnetlink listener. RTnetlink allows the kernel's routing
tables to be read and altered.

rtmon should be started before the first network configuration command is
issued. For example, if you insert:

 rtmon file /var/log/rtmon.log

in a startup script, you will be able to view the full history later.
It is also possible to start rtmon at any time. It prepends the history
with a snapshot of the state dumped at startup.
.SH OPTIONS
rtmon supports the following options:
.TP
.B \-Version
Print version and exit.
.TP
.B help
Show summary of options.
.TP
.B file FILE [ all | LISTofOBJECTS ]
Log output to FILE. LISTofOBJECTS is the list of object types to monitor.
It may contain 'link', 'address', 'route' and 'all'. 'link' specifies the
network device, 'address' the protocol (IP or IPv6) address on a device,
'route' the routing table entry, and 'all' selects every object type.
.TP
.B \-family [ inet | inet6 | link | help ]
Specify protocol family. 'inet' is IPv4, 'inet6' is IPv6, 'link' means that no 
networking protocol is involved and 'help' prints usage information.
.TP
.B \-4
Use IPv4. Shortcut for -family inet.
.TP
.B \-6
Use IPv6. Shortcut for -family inet6.
.TP
.B \-0
Use a special family identifier meaning that no networking protocol is 
involved. Shortcut for -family link.
.SH USAGE EXAMPLES
.TP
.B # rtmon file /var/log/rtmon.log
Log to file /var/log/rtmon.log, then run:
.TP
.B # ip monitor file /var/log/rtmon.log
to display logged output from file.
.SH SEE ALSO
.BR ip (8)
.SH AUTHOR
rtmon was written by Alexey Kuznetsov [EMAIL PROTECTED].
.PP
This manual page was written by Michael Prokop [EMAIL PROTECTED],
for the Debian project (but may be used by others).




[PATCH] sky2: not experimental

2006-10-27 Thread Stephen Hemminger
The sky2 driver is no longer in experimental state.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- sky2.orig/drivers/net/Kconfig	2006-10-27 10:16:44.000000000 -0700
+++ sky2/drivers/net/Kconfig	2006-10-27 10:20:20.000000000 -0700
@@ -2112,7 +2112,7 @@
 
 config SKY2
 	tristate "SysKonnect Yukon2 support (EXPERIMENTAL)"
-	depends on PCI && EXPERIMENTAL
+	depends on PCI
 	select CRC32
 	---help---
 	  This driver supports Gigabit Ethernet adapters based on the
@@ -2120,8 +2120,8 @@
 	  Marvell 88E8021/88E8022/88E8035/88E8036/88E8038/88E8050/88E8052/
 	  88E8053/88E8055/88E8061/88E8062, SysKonnect SK-9E21D/SK-9S21
 
-	  This driver does not support the original Yukon chipset: a seperate
-	  driver, skge, is provided for Yukon-based adapters.
+	  There is companion driver for the older Marvell Yukon and
+	  Genesis based adapters: skge.
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called sky2.  This is recommended.


[take21 1/4] kevent: Core files.

2006-10-27 Thread Evgeniy Polyakov

Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some documentation can be found on the project's homepage (and links from
there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a9560eb 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,6 @@ ENTRY(sys_call_table)
 	.long sys_vmsplice
 	.long sys_move_pages
 	.long sys_getcpu
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl		/* 320 */
+	.long sys_kevent_wait
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..cf18955 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,11 @@ #endif
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
 	.quad sys_getcpu
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl		/* 320 */
+	.quad sys_kevent_wait
 ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..f009677 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,13 @@ #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
+#define __NR_kevent_get_events	319
+#define __NR_kevent_ctl		320
+#define __NR_kevent_wait	321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 322
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..c53d156 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..125414c
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,205 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's tree. */
+	struct rb_node		kevent_node;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+   struct 

[take21 4/4] kevent: Timer notifications.

2006-10-27 Thread Evgeniy Polyakov

Timer notifications.

Timer notifications can be used for fine grained per-process time 
management, since interval timers are very inconvenient to use, 
and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds
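
As an illustration of the seconds/nanoseconds split above, here is a small
userspace sketch that converts an interval to and from such a pair. The
struct timer_id type and both helper names are hypothetical, and the use of
two 32-bit words for raw[] is an assumption made for this example, not
something stated in this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the two raw id words of a timer kevent. */
struct timer_id {
	uint32_t raw[2];	/* raw[0]: seconds, raw[1]: nanoseconds */
};

/* Split an interval given in nanoseconds into the two raw words. */
struct timer_id timer_id_from_ns(uint64_t ns)
{
	struct timer_id id;

	/* The seconds part must fit into one 32-bit word. */
	assert(ns / NSEC_PER_SEC <= UINT32_MAX);
	id.raw[0] = (uint32_t)(ns / NSEC_PER_SEC);
	id.raw[1] = (uint32_t)(ns % NSEC_PER_SEC);
	return id;
}

/* Recombine the two raw words into a single nanosecond count. */
uint64_t timer_id_to_ns(const struct timer_id *id)
{
	assert(id != NULL);
	return (uint64_t)id->raw[0] * NSEC_PER_SEC + id->raw[1];
}
```

For example, an interval of 1.5 seconds becomes raw[0] = 1 and
raw[1] = 500000000.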

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..04acc46
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,113 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct hrtimer		ktimer;
+	struct kevent_storage	ktimer_storage;
+	struct kevent		*ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+	struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+	struct kevent *k = t->ktimer_event;
+
+	kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+	hrtimer_forward(timer, timer->base->softirq_time,
+			ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+	return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+	t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+	t->ktimer.function = kevent_timer_func;
+	t->ktimer_event = k;
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+
+	printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer);
+	hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+	return 0;
+
+err_out_st_fini:
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	hrtimer_cancel(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = kevent_timer_callback,
+		.enqueue = kevent_timer_enqueue,
+		.dequeue = kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+



[take21 0/4] kevent: Generic event handling mechanism.

2006-10-27 Thread Evgeniy Polyakov

Generic event handling mechanism.

Consider for inclusion.

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s with a client connection rate of 4000 con/s
over a 100 Mbit LAN. Data IO over the network was about 10582.7 KB/s, which
is very close to wire speed once headers and the like are taken into account.
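
The "close to wire speed" remark can be checked with a little arithmetic.
The helpers below are only a sketch written for this purpose (their names
are made up), and whether "KB" in the figure above means 1000 or 1024 bytes
is an assumption either way, so the factor is passed in explicitly.

```c
#include <assert.h>

/* Convert a payload rate in KB/s to Mbit/s; bytes_per_kb is 1000 or 1024. */
double kb_per_s_to_mbit(double kb_per_s, double bytes_per_kb)
{
	assert(kb_per_s >= 0.0 && bytes_per_kb > 0.0);
	return kb_per_s * bytes_per_kb * 8.0 / 1e6;
}

/* Average payload per request in KB, given the request rate. */
double kb_per_request(double kb_per_s, double req_per_s)
{
	assert(req_per_s > 0.0);
	return kb_per_s / req_per_s;
}
```

With 10582.7 KB/s this gives roughly 84.7 Mbit/s (1000-byte KB) or
86.7 Mbit/s (1024-byte KB) of payload on a 100 Mbit link, before any
Ethernet, IP and TCP header overhead, and about 2.7 KB of payload per
request at 3960 req/s.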

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table.
   At least for a web server, the frequency of addition/deletion of
   kevents is comparable with the number of search accesses, i.e. most
   of the time events are added, accessed only a couple of times and
   then removed. This justifies using an RB tree over an AVL tree,
   since the latter has much slower deletion time (max O(log(N))
   compared to 3 ops), although faster search time (1.44*O(log(N))
   vs. 2*O(log(N))). So for kevents I use an RB tree for now; later,
   when my AVL tree implementation is ready, it will be possible to
   compare them.
 * Changed readiness check for socket notifications.

With both of the above changes it is possible to achieve more than 3380
req/second, compared to 2200, sometimes 2500 req/second, for epoll() with
a trivial web server and httperf client on the same hardware.
It is possible that the above kevent limit is due to the maximum number
of kevents allowed at a time, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages)
   calculation
 * export kevent_socket_notify(), since it is used in network protocols
   which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers, this forces timer API update at
http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has been updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
This syscall waits until either timeout expires or at least one event
becomes ready. It also commits that @num events from @start are processed
by userspace and thus can be removed or rearmed (depending on their flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not get lock around user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80-column comment issues
 * added a header shared between userspace and kernelspace instead of embedding them
in one
 * core restructuring to remove forward declarations
 * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed -nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitute pages
- allocate new page for events only when a newly added kevent requires it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and 

[take21 3/4] kevent: Socket notifications.

2006-10-27 Thread Evgeniy Polyakov

Socket notifications.

This patch includes socket send/recv/accept notifications.
With a trivial web server based on kevent and these features
instead of epoll, performance increased more than noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ #endif
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>  /* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+   struct socket socket;
+   struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)  \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-   struct socket socket;
-   struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 			wake_up_interruptible(sk->sk_sleep);
+			kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 			if (!inet_csk_ack_scheduled(sk))
 				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 							  (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 000..c865b3e
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,129 @@
+/*
+ * kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int 

Re: [openib-general] [PATCH 1/9] NetEffect 10Gb RNIC Driver: kernel Kconfig and makefiles

2006-10-27 Thread James Lentini


On Thu, 26 Oct 2006, Glenn Grundstrom wrote:

 diff -ruNp old/drivers/infiniband/hw/nes/Makefile
 new/drivers/infiniband/hw/nes/Makefile
 --- old/drivers/infiniband/hw/nes/Makefile1969-12-31
 18:00:00.0 -0600
 +++ new/drivers/infiniband/hw/nes/Makefile2006-10-25
 11:10:26.0 -0500
 @@ -0,0 +1,27 @@
 +EXTRA_CFLAGS += -Idrivers/infiniband/include
 -Idrivers/infiniband/hw/nes/nes_tcpip/include
 +
 +ifdef CONFIG_INFINIBAND_NES_DEBUG
 +EXTRA_CFLAGS += -DNES_DEBUG
 +endif

The NES_DEBUG flag is unnecessary. You can check for 
CONFIG_INFINIBAND_NES_DEBUG in the code. See 
CONFIG_INFINIBAND_MTHCA_DEBUG for an example.
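
As a rough illustration of that suggestion, here is a minimal userspace sketch that keys the debug macro directly on the Kconfig symbol, which kbuild already defines when the option is enabled. The names nes_dbg() and nes_trace_connect() are made up for this example and are not taken from the driver; printf() stands in for printk() so the sketch compiles outside the kernel.

```c
#include <assert.h>
#include <stdio.h>

/*
 * kbuild defines CONFIG_INFINIBAND_NES_DEBUG when the option is enabled,
 * so the Makefile does not need an extra -DNES_DEBUG.
 * nes_dbg() is a hypothetical name; kernel code would call printk().
 */
#ifdef CONFIG_INFINIBAND_NES_DEBUG
#define nes_dbg(fmt, ...) printf("nes: " fmt, ##__VA_ARGS__)
#define NES_DEBUG_ENABLED 1
#else
#define nes_dbg(fmt, ...) do { } while (0)
#define NES_DEBUG_ENABLED 0
#endif

/* Hypothetical caller: returns 1 if a debug line was emitted, else 0. */
static int nes_trace_connect(unsigned int port)
{
	assert(port != 0);
	nes_dbg("connect on port %u\n", port);
	return NES_DEBUG_ENABLED;
}
```

Built without the symbol, the macro expands to nothing and the call costs nothing at runtime.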

 +
 +ifneq ($(KERNELRELEASE),)
 + obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o
 +
 + iw_nes-objs := \
 + nes.o \
 + nes_hw.o \
 + nes_nic.o \
 + nes_cm.o \
 + nes_utils.o \
 + nes_verbs.o 
 +else
 + KERNELDIR ?= /usr/src/linux
 + PWD := $(shell pwd)
 +
 +default:
 + $(MAKE) -C $(KERNELDIR) M=$(PWD) modules
 +
 +clean:
 + $(MAKE) -C $(KERNELDIR) M=$(PWD) clean
 +
 +endif

In tree drivers don't provide support for out-of-tree builds. See 
drivers/infiniband/hw/mthca/Makefile for an example of how to 
simplify this.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [openib-general] [PATCH 3/9] NetEffect 10Gb RNIC Driver: openfabrics connection manager c file

2006-10-27 Thread Tom Tucker

[...snip...]
 +extern void set_interface(
 +UINT32ip_addr,

These should probably be the standard linux types u32, or uint32

 +UINT32mask,
 +UINT32bcastaddr,
 +UINT32type
 +   );

[...snip...]

 + struct NES_sockaddr_in  inet_addr;
 + struct sockaddr_in  kinet_addr;

Is there some reason why you need your own sockaddr and sockaddr_in
structures? 

[...snip...]
 +
 +/**
 + * nes_disconnect
 + * 
 + * @param cm_id
 + * @param abrupt
 + * 
 + * @return int
 + */
 +int nes_disconnect(struct iw_cm_id *cm_id, int abrupt)
 +{
 + struct ib_qp_attr attr;
 + struct ib_qp *ibqp;
 + struct nes_qp *nesqp;
 + struct nes_dev *nesdev = to_nesdev(cm_id-device);
 + int err = 0;
 + u8 u8temp;
 +
 + dprintk(%s:%s:%u\n, __FILE__, __FUNCTION__, __LINE__);
 + dprintk(%s: netdev refcnt = %u.\n, __FUNCTION__,
 atomic_read(nesdev-netdev-refcnt));
 +
 + /* If the qp was already destroyed, then there's no QP */
 + if (cm_id-provider_data == 0)
 + return 0;
 +
 + nesqp = (struct nes_qp *)cm_id-provider_data;
 + ibqp = nesqp-ibqp;
 +
 + /* Disassociate the QP from this cm_id */
 + cm_id-provider_data = 0;
 + cm_id-rem_ref(cm_id);
 + nesqp-cm_id = 0;
 +
 + stack_ops_p-decelerate_socket(nesqp-socket, 
 +(struct nes_uploaded_qp_context *)
 +nesqp-nesqp_context);
 +  
 + if (nesqp-active_conn) {
 +   u8temp = 1  (ntohs(cm_id-local_addr.sin_port)7);
 +   nesdev-apbv_table[ntohs(cm_id-local_addr.sin_port)3] =
 ~(u8temp);
 + } else {
 + dev_put(nesdev-netdev);
 +/* Need to free the Last Streaming Mode Message */
 +pci_free_consistent(nesdev-pcidev, 
 +
 nesqp-private_data_len+sizeof(*nesqp-ietf_frame), 
 +nesqp-ietf_frame,
 +nesqp-ietf_frame_pbase);

This is mailer perversion. You need to turn off wrapping in your mailer.
It makes it hard to review the patch never mind apply it.

 +}
 +
 + if (nesqp-ksock) sock_release(nesqp-ksock);
 + stack_ops_p-sock_ops_p-close( nesqp-socket );
 + nesqp-ksock = 0;
 + nesqp-socket = 0;
 + if (nesqp-wq) {
 + destroy_workqueue(nesqp-wq);

This will deadlock if this function is called from a workqueue thread
and CONFIG_HOTPLUG_CPU is enabled. 

 + nesqp-wq = NULL;
 + }
 +
 + memset(attr, 0, sizeof(struct ib_qp_attr));
 + if (abrupt)
 + attr.qp_state = IB_QPS_ERR;
 + else
 + attr.qp_state = IB_QPS_SQD;
 +
 + return err;
 +}
 +
 +
 +/**
 + * nes_accept
 + * 
 + * @param cm_id
 + * @param conn_param
 + * 
 + * @return int
 + */
 +int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param
 *conn_param)
 +{
 + struct nes_qp *nesqp;
 + struct nes_dev *nesdev;
 + struct nes_adapter *nesadapter;
 + struct ib_qp *ibqp;
 +struct nes_hw_qp_wqe *wqe;
 + struct nes_v4_quad nes_quad;
 + struct ib_qp_attr attr;
 +struct iw_cm_event cm_event;
 +
 + dprintk(%s:%s:%u: data len = %u\n, 
 + __FILE__, __FUNCTION__, __LINE__,
 conn_param-private_data_len);
 +
 + ibqp = nes_get_qp(cm_id-device, conn_param-qpn);
 + if (!ibqp)
 + return -EINVAL;
 + nesqp = to_nesqp(ibqp);
 + nesdev = to_nesdev(nesqp-ibqp.device);
 + nesadapter = nesdev-nesadapter;
 + dprintk(%s: netdev refcnt = %u.\n, __FUNCTION__,
 atomic_read(nesdev-netdev-refcnt));
 +
 +nesqp-ietf_frame = pci_alloc_consistent(nesdev-pcidev, 
 +
 sizeof(*nesqp-ietf_frame)+conn_param-private_data_len,
 + nesqp-ietf_frame_pbase);
 +if (!nesqp-ietf_frame) {
 +dprintk(KERN_ERR PFX %s: Unable to allocate memory for private
 data\n, __FUNCTION__);
 +return -ENOMEM;
 +}
 +dprintk(PFX %s: PCI consistent memory for 
 +private data located @ %p (pa = 0x%08lX.) size = %u.\n, 
 +__FUNCTION__, nesqp-ietf_frame, (unsigned
 long)nesqp-ietf_frame_pbase,
 +conn_param-private_data_len+sizeof(*nesqp-ietf_frame));
 +nesqp-private_data_len = conn_param-private_data_len;
 +
 +strcpy(nesqp-ietf_frame-key[0], IEFT_MPA_KEY_REP);
 +memcpy(nesqp-ietf_frame-private_data, conn_param-private_data,
 conn_param-private_data_len);
 +nesqp-ietf_frame-private_data_size =
 cpu_to_be16(conn_param-private_data_len);
 +nesqp-ietf_frame-rev = mpa_version;
 +nesqp-ietf_frame-flags = IETF_MPA_FLAGS_CRC;
 +
 +wqe = nesqp-hwqp.sq_vbase[0];
 +*((struct nes_qp
 **)wqe-wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX]) = nesqp;
 + *((u64 *)wqe-wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX]) |=
 NES_SW_CONTEXT_ALIGN1;
 +wqe-wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] =
 

[PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
My proposed method for restricting TCP choices to fair algorithms.
This is a net-wide, not system-wide, issue; it should not be handled
by a kernel policy choice (capability), but by a build choice.

--- sky2.orig/net/ipv4/Kconfig  2006-10-27 10:10:47.0 -0700
+++ sky2/net/ipv4/Kconfig   2006-10-27 10:15:56.0 -0700
@@ -470,6 +470,16 @@
 
 if TCP_CONG_ADVANCED
 
+config TCP_CONG_UNFAIR
+	bool "Allow unfair congestion control algorithms"
+	depends on EXPERIMENTAL
+	---help---
+	  Some of the congestion control algorithms are for testing
+	  and research purposes and should not be deployed on public
+	  networks because of the possibility of unfair behavior.
+	  These algorithms may be useful for future development
+	  or comparison purposes.
+
 config TCP_CONG_BIC
 	tristate "Binary Increase Congestion (BIC) control"
 	default m
@@ -551,7 +561,7 @@
 
 config TCP_CONG_SCALABLE
 	tristate "Scalable TCP"
-	depends on EXPERIMENTAL
+	depends on TCP_CONG_UNFAIR
 	default n
 	---help---
 	Scalable TCP is a sender-side only change to TCP which uses a


Re: [IPROUTE] manpage for rtmon

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 19:22:11 +0200
Michael Prokop [EMAIL PROTECTED] wrote:

 User-Agent: mutt-ng devel-r316 (Debian)
 
 Hello,

added.

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread John Heffner
I think "unfair" is a difficult word.  Unfair to what?  It's true that
Scalable TCP is unfair to itself in that flows with unequal shares do
not converge, but it's not clear what its interactions are with other
congestion control algorithms.  It's not clear to me that it's
significantly more unfair wrt. reno than BIC, etc.  "Known to be broken"
might be more correct language. :)


One thought would be to use a module parameter that sets one bit of 
state: allow unprivileged use.  Each module could have a sensible 
default value.
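
To make the idea concrete, here is a small userspace model of such a per-module bit. It is only a sketch: struct cong_algo, may_select() and the default values in the table are invented for illustration, and the kernel has no such interface in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model: one "allow unprivileged use" bit per algorithm. */
struct cong_algo {
	const char *name;
	bool allow_unprivileged;	/* would be set by a module parameter */
};

/* Example defaults only; the real choice per module is left open above. */
static struct cong_algo algos[] = {
	{ "reno",     true  },
	{ "bic",      true  },
	{ "scalable", false },
};

/* Returns 0 if allowed, -1 if the name is unknown, -2 if privilege is needed. */
static int may_select(const char *name, bool privileged)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; i < sizeof(algos) / sizeof(algos[0]); i++) {
		if (strcmp(algos[i].name, name) != 0)
			continue;
		if (privileged || algos[i].allow_unprivileged)
			return 0;
		return -2;
	}
	return -1;
}
```

With this shape, an administrator flips one bit per module instead of rebuilding the kernel.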


  -John


Stephen Hemminger wrote:

My proposed method restricting TCP choices to fair algorithms.
This a net wide, not system wide issue, it should not be done
by kernel policy choice (capability), but by a build choice.

--- sky2.orig/net/ipv4/Kconfig  2006-10-27 10:10:47.0 -0700
+++ sky2/net/ipv4/Kconfig   2006-10-27 10:15:56.0 -0700
@@ -470,6 +470,16 @@
 
 if TCP_CONG_ADVANCED
 
+config TCP_CONG_UNFAIR

+bool Allow unfair congestion control algorithms
+   depends on EXPERIMENTAL
+---help---
+ Some of the congestion control algorithms are for testing
+ and research purposes and should not deployed on public
+ networks because of the possiblity of unfair behavior.
+ These algorithms may be useful for future development
+ or comparison purposes.
+
 config TCP_CONG_BIC
tristate Binary Increase Congestion (BIC) control
default m
@@ -551,7 +561,7 @@
 
 config TCP_CONG_SCALABLE

tristate Scalable TCP
-   depends on EXPERIMENTAL
+   depends on TCP_CONG_UNFAIR
default n
---help---
Scalable TCP is a sender-side only change to TCP which uses a




[PATCH] tcp: setsockopt congestion control autoload

2006-10-27 Thread Stephen Hemminger
If an application asks for a congestion control type with setsockopt(),
it may be available as a module that is not yet loaded into the kernel.
If the application has permission to load modules, then the TCP congestion
module should be autoloaded if needed.  This is done already when
the default selection is changed with sysctl, but not when an application
requests it via setsockopt().
 
Add a similar permission check to the sysctl path as well.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
 
---
 net/ipv4/tcp_cong.c |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/net/ipv4/tcp_cong.c   2006-10-27 10:56:36.0 -0700
+++ b/net/ipv4/tcp_cong.c   2006-10-27 11:09:36.0 -0700
@@ -114,7 +114,7 @@
 	spin_lock(&tcp_cong_list_lock);
 	ca = tcp_ca_find(name);
 #ifdef CONFIG_KMOD
-	if (!ca) {
+	if (!ca && capable(CAP_SYS_MODULE)) {
 		spin_unlock(&tcp_cong_list_lock);
 
 		request_module("tcp_%s", name);
@@ -154,9 +154,19 @@
 
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
+	/* no change asking for existing value */
 	if (ca == icsk->icsk_ca_ops)
 		goto out;
 
+#ifdef CONFIG_KMOD
+	/* not found attempt to autoload module */
+	if (!ca && capable(CAP_SYS_MODULE)) {
+		rcu_read_unlock();
+		request_module("tcp_%s", name);
+		rcu_read_lock();
+		ca = tcp_ca_find(name);
+	}
+#endif
 	if (!ca)
 		err = -ENOENT;
 

Stephen Hemminger [EMAIL PROTECTED]
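
For readers who want to exercise the setsockopt() path this patch touches, the following is a minimal userspace sketch using the standard TCP_CONGESTION socket option. The helper name try_congestion_control() is made up for this example, and the fallback definition of TCP_CONGESTION is only there for older libc headers.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_CONGESTION
#define TCP_CONGESTION 13	/* Linux value; fallback for old libc headers */
#endif

/*
 * Hypothetical helper: ask for congestion control algorithm `name` on a
 * fresh TCP socket. Returns 0 on success, -1 on failure with errno set.
 */
static int try_congestion_control(const char *name)
{
	int fd, ret, saved_errno;

	assert(name != NULL);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	ret = setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;

	return ret;
}
```

On a Linux system a request for "reno" would normally succeed, since Reno is always built in, while a name with no matching algorithm or module is expected to fail (with ENOENT in the code above).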


Re: [PATCH 9/13] [SCTP] Merge IPv4 and IPv6 versions of get_saddr() with their corresponding get_dst().

2006-10-27 Thread Sridhar Samudrala
On Tue, 2006-10-17 at 03:19 +0300, Ville Nuorvala wrote:
 As the IPv6 route lookup now also returns the selected source address
 there is no need for a separate source address lookup. In fact, the
 source address selection needs to be moved to get_dst() because the
 selected IPv6 source address isn't always stored in the route.
 Sometimes this makes it impossible to guess the correct address later on.
 

Ville,

Overall the patch looks pretty good. I found only 1 issue in 
sctp_v6_get_dst(). See below.


snip


 
 +/* Returns the dst cache entry for the given source and destination ip
 + * addresses.
 + */
 +static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
 +  union sctp_addr *daddr,
 +  union sctp_addr *saddr)
 +{
 + struct dst_entry *dst;
 + struct flowi fl;
 + struct sctp_bind_addr *bp;
 + rwlock_t *addr_lock;
 + struct sctp_sockaddr_entry *laddr;
 + struct list_head *pos;
 + struct rt6_info *rt;
 + union sctp_addr baddr;
 + sctp_scope_t scope;
 + __u8 matchlen = 0;
 + __u8 bmatchlen;
 +
 + memset(fl, 0, sizeof(fl));
 + ipv6_addr_copy(fl.fl6_dst, daddr-v6.sin6_addr);
 + if (ipv6_addr_type(daddr-v6.sin6_addr)  IPV6_ADDR_LINKLOCAL)
 + fl.oif = daddr-v6.sin6_scope_id;
 +
 + ipv6_addr_copy(fl.fl6_src, saddr-v6.sin6_addr);
 + SCTP_DEBUG_PRINTK(%s: DST= NIP6_FMT  SRC= NIP6_FMT  ,
 +   __FUNCTION__, NIP6(fl.fl6_dst), NIP6(fl.fl6_src));
 +
 + dst = ip6_route_output(NULL, fl);
 + if (dst-error) {
 + dst_release(dst);
 + dst = NULL;
 + }
 + if (!ipv6_addr_any(saddr-v6.sin6_addr))
 + goto out;
 + if (!asoc) {
 + if (dst)
 + ipv6_addr_copy(saddr-v6.sin6_addr, fl.fl6_src);
 + goto out;
 + }
 + bp = asoc-base.bind_addr;
 + addr_lock = asoc-base.addr_lock;
 +
 + if (dst) {
 + /* Walk through the bind address list and look for a bind
 +  * address that matches the source address of the returned rt.
 +  */
 + sctp_v6_fl_saddr(baddr, fl, bp-port);
Here we are checking whether the source address returned in the dst matches one of
the addresses in the bind address list of the association, not the source address
that is passed to this routine (it could be INADDR_ANY).
So this should be changed back to sctp_v6_dst_saddr().

Thanks
Sridhar

 + sctp_read_lock(addr_lock);
 + list_for_each(pos, bp-address_list) {
 + laddr = list_entry(pos, struct sctp_sockaddr_entry,
 +list);
 + if (!laddr-use_as_src)
 + continue;
 + if (sctp_v6_cmp_addr(baddr, laddr-a))
 + goto init_saddr;
 + }
 + sctp_read_unlock(addr_lock);
 +
 + /* Invalid rt or none of the bound addresses match the source
 +  * address. So release it.
 +  */
 + dst_release(dst);
 + dst = NULL;
 + }
 +
 + /* Go through the bind address list and find the best source address
 +  * that matches the scope of the destination address.
 +  */
 + memset(baddr, 0, sizeof(union sctp_addr));
 + scope = sctp_scope(daddr);
 + sctp_read_lock(addr_lock);
 + list_for_each(pos, bp-address_list) {
 + laddr = list_entry(pos, struct sctp_sockaddr_entry, list);
 + 
 + if (!laddr-use_as_src ||
 + laddr-a.sa.sa_family != AF_INET6 ||
 + scope  sctp_scope(laddr-a) ||
 + (ipv6_addr_type(laddr-a.v6.sin6_addr) 
 +  IPV6_ADDR_LINKLOCAL 
 +  laddr-a.v6.sin6_scope_id != fl.oif))
 + continue;
 +
 + bmatchlen = sctp_v6_addr_match_len(daddr, laddr-a);
 + if (!dst || (matchlen  bmatchlen)) {
 + struct dst_entry *dst2;
 + ipv6_addr_copy(fl.fl6_src, laddr-a.v6.sin6_addr);
 + dst2 = ip6_route_output(NULL, fl);
 + if (dst2-error) {
 + dst_release(dst2);
 + dst2 = NULL;
 + continue;
 + }
 + dst_release(dst);
 + dst = dst2;
 + memcpy(baddr, laddr-a, sizeof(union sctp_addr));
 + matchlen = bmatchlen;
 + }
 + }
 + if (dst)
 + goto init_saddr;
 +out_unlock:
 + sctp_read_unlock(addr_lock);
 +out:
 + if (dst) {
 + rt = (struct rt6_info *) dst;
 + SCTP_DEBUG_PRINTK(SRC= NIP6_FMT
 +rt6_dst= NIP6_FMT
 +rt6_src= NIP6_FMT \n,
 +

Re: [PATCH] s2io: add PCI error recovery support

2006-10-27 Thread Linas Vepstas
On Fri, Oct 27, 2006 at 07:35:18AM -0400, Ananda Raju wrote:
 Looking at all scenarios I feel the first patch is OK. Can you add the
 watchdog timer fix to first initial patch and resubmit. 

Appended below.

 So -- just for grins, I thought to myself, Maybe I can make 
 s2io be the first adapter ever to fully recover without 
 a hard reset of the card.

... I couldn't quite make this work. Since the patch below
already works, I didn't see much point exerting myself further.

--linas

This patch adds PCI error recovery support to the 
s2io 10-Gigabit ethernet device driver. Third revision,
blocks interrupts and the watchdog.

Tested, seems to work well.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Cc: Raghavendra Koushik [EMAIL PROTECTED]
Cc: Ananda Raju [EMAIL PROTECTED]
Cc: Wen Xiong [EMAIL PROTECTED]


 drivers/net/s2io.c |  121 +
 drivers/net/s2io.h |5 ++
 2 files changed, 126 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c	2006-10-27 10:49:07.0 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c	2006-10-27 13:55:01.0 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _
 
 MODULE_DEVICE_TABLE(pci, s2io_tbl);
 
+static struct pci_error_handlers s2io_err_handler = {
+   .error_detected = s2io_io_error_detected,
+   .slot_reset = s2io_io_slot_reset,
+   .resume = s2io_io_resume,
+};
+
 static struct pci_driver s2io_driver = {
 	.name = "S2IO",
 	.id_table = s2io_tbl,
 	.probe = s2io_init_nic,
 	.remove = __devexit_p(s2io_rem_nic),
+	.err_handler = &s2io_err_handler,
 };
 
 /* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -3159,6 +3166,11 @@ static void alarm_intr_handler(struct s2
 	register u64 val64 = 0, err_reg = 0;
 	u64 cnt;
 	int i;
+
+	if ((nic->pdev->error_state != pci_channel_io_normal) &&
+	    (nic->pdev->error_state != 0))
+		return;
+
 	nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0;
 	/* Handling the XPAK counters update */
 	if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) {
@@ -4171,6 +4183,11 @@ static irqreturn_t s2io_isr(int irq, voi
 	mac_info_t *mac_control;
 	struct config_param *config;
 
+	/* Pretend we handled any irq's from a disconnected card */
+	if ((sp->pdev->error_state != pci_channel_io_normal) &&
+	    (sp->pdev->error_state != 0))
+		return IRQ_HANDLED;
+
 	atomic_inc(&sp->isr_cnt);
 	mac_control = &sp->mac_control;
 	config = &sp->config;
@@ -7564,3 +7581,107 @@ static void lro_append_pkt(nic_t *sp, lr
 	sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
 	return;
 }
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+					       pci_channel_state_t state)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	nic_t *sp = netdev->priv;
+
+	netif_device_detach(netdev);
+
+	if (netif_running(netdev)) {
+		unsigned long flags;
+
+		/* The following is an abbreviated subset of the
+		 * steps taken by s2io_card_down(), avoiding
+		 * steps that touch the card itself.
+		 */
+		del_timer_sync(&sp->alarm_timer);
+		atomic_set(&sp->card_state, CARD_DOWN);
+
+		/* Kill tasklet. */
+		tasklet_kill(&sp->task);
+
+		/* Free all Tx buffers */
+		spin_lock_irqsave(&sp->tx_lock, flags);
+		free_tx_buffers(sp);
+		spin_unlock_irqrestore(&sp->tx_lock, flags);
+
+		/* Free all Rx buffers */
+		spin_lock_irqsave(&sp->rx_lock, flags);
+		free_rx_buffers(sp);
+		spin_unlock_irqrestore(&sp->rx_lock, flags);
+
+		clear_bit(0, &(sp->link_state));
+		sp->device_close_flag = TRUE;	/* Device is shut down. */
+	}
+	pci_disable_device(pdev);
+
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ * At this point, the card has experienced a hard reset,
+ * followed by fixups by BIOS, and has its config space
+ * set up identically to what it was at cold boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	nic_t *sp = netdev->priv;
+
+   if 

Re: [openib-general] [PATCH 1/5] NetEffect 10Gb RNIC Userspace Library: userspace config generation

2006-10-27 Thread Roland Dreier
  I don't think the userspace stuff belongs on netdev. Someone please
  correct me if I'm wrong.

Yeah, it's not a bad thing to get wider review, but your userspace
library is pretty much your business.  If you screw it up it doesn't
hurt anyone else, so I'm happy to let you write it however you want.

 - R.


Re: [openib-general] [PATCH 1/5] NetEffect 10Gb RNIC Userspace Library: userspace config generation

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 10:56:45 -0700
Roland Dreier [EMAIL PROTECTED] wrote:

   I don't think the userspace stuff belongs on netdev. Someone please
   correct me if I'm wrong.
 
 Yeah, it's not a bad thing to get wider review, but your userspace
 library is pretty much your business.  If you screw it up it doesn't
 hurt anyone else, so I'm happy to let you write it however you want.
 
  - R.


I prefer a pointer to the project download source.
Seeing the userspace stuff helps answer questions where the administration
process is confusing (or could/should be done differently).

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 10:30:16 -0700

 My proposed method restricting TCP choices to fair algorithms.
 This a net wide, not system wide issue, it should not be done
 by kernel policy choice (capability), but by a build choice.

I think this sucks even worse than the current situation.

How difficult is it to understand that an administrator might
like to be able to build in and experiment with some congestion
control algorithms, yet still be able to keep his normal users
from using them?


Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm

2006-10-27 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 07:41:02 -0700

 Please no, it makes the socket option useless.
 If you want to tag some bad apples thats okay, but would need
 some more infrastructure.

The behavior of the TCP stack is a system wide decision.

If anything, everything besides the default and Reno should be
off limits to unprivileged users, with an administrative method
to override that.
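
A policy of that shape could look roughly like the userspace sketch below: the default and Reno are always selectable, everything else needs either privilege or an entry in an administrator-maintained list. All names here (default_algo, admin_allowed, admin_set_allowed(), user_may_select()) and the choice of "bic" as the default are assumptions for illustration, not kernel interfaces from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical state: system default plus an admin-controlled allow-list. */
static const char *default_algo = "bic";	/* assumed default */
static char admin_allowed[64] = "";		/* space-separated names */

/* Administrative override: replace the allow-list (truncates if too long). */
static void admin_set_allowed(const char *list)
{
	strncpy(admin_allowed, list, sizeof(admin_allowed) - 1);
	admin_allowed[sizeof(admin_allowed) - 1] = '\0';
}

/* Exact-token match of `name` within a space-separated list. */
static bool in_list(const char *list, const char *name)
{
	size_t len = strlen(name);
	const char *p = list;

	if (len == 0)
		return false;

	while (*p) {
		size_t tok = strcspn(p, " ");

		if (tok == len && strncmp(p, name, len) == 0)
			return true;
		p += tok;
		p += strspn(p, " ");
	}
	return false;
}

/* The default and Reno are always allowed; the rest needs privilege or a list entry. */
static bool user_may_select(const char *name, bool privileged)
{
	assert(name != NULL);

	if (privileged)
		return true;
	if (strcmp(name, default_algo) == 0 || strcmp(name, "reno") == 0)
		return true;
	return in_list(admin_allowed, name);
}
```

The allow-list starts empty, so the default behaviour errs on the side of caution until the administrator opts in.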


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 14:17:49 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Fri, 27 Oct 2006 10:30:16 -0700
 
  My proposed method restricting TCP choices to fair algorithms.
  This a net wide, not system wide issue, it should not be done
  by kernel policy choice (capability), but by a build choice.
 
 I think this sucks even worse than the current situation.
 
 How difficult is it to understand that an administrator might
 like to be able to build in and experiment with some congestion
 control algorithms, yet still be able to keep his normal users
 from using them?

Only some (very few) have any bad consequences. So the typical
distribution should be able to switch with most available for everyone,
and only a few needing special privileges.


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 14:24:02 -0700

 Only some (very few) have any bad consequences. So the typical
 distribution should be able to switch with most available for everyone,
 and only a few needing special privileges.

I would strongly disagree as we've had several OOPS'er class bugs in
the less frequently used algorithms.

I stand by my position that an administrator's wish to do this is
quite valid.

It's bad enough that people are all over us for the default algorithm
we have chosen, so it'd be extremely irresponsible and even worse if
we allowed users to select any of the other research algorithms for
their TCP connections by default just because those modules happened
to be configured into the kernel.

This userspace convenience argument holds zero water.

Provide a way for the administrator to control the situation fully,
and choose a sane default which errs on the side of caution for the
sake of internet stability.


2.4/2.6 share in linux routers ?

2006-10-27 Thread Yakov Lerner

Hello,

I'd like to gather estimates of the 2.4 vs 2.6 share in [small]
Linux routers in 2006. Can anyone offer estimates and/or references?

My own estimate is that a definite majority is 2.4 (I'd say 75% for 2.4)
in small Linux routers in 2006. Can anyone offer support or correction?

Which factors make 2.4 or 2.6 more attractive for a small Linux router
(128-256 MB RAM)?

Yakov Lerner


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 14:37:01 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Fri, 27 Oct 2006 14:24:02 -0700
 
  Only some (very few) have any bad consequences. So the typical
  distribution should be able to switch with most available for everyone,
  and only a few needing special privileges.
 
 I would strongly disagree as we've had several OOPS'er class bugs in
 the less frequently used algorithms.
 

Then tag those as restricted.  Why should we keep apps away from
the simple ones?


Re: 2.4/2.6 share in linux routers ?

2006-10-27 Thread David Miller

Please stop all of this cross posting.  I've just seen you post
this same exact email on the netfilter lists too.



Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 14:59:13 -0700

 On Fri, 27 Oct 2006 14:37:01 -0700 (PDT)
 David Miller [EMAIL PROTECTED] wrote:
 
  From: Stephen Hemminger [EMAIL PROTECTED]
  Date: Fri, 27 Oct 2006 14:24:02 -0700
  
   Only some (very few) have any bad consequences. So the typical
   distribution should be able to switch with most available for everyone,
   and only a few needing special privileges.
  
  I would strongly disagree as we've had several OOPS'er class bugs in
  the less frequently used algorithms.
  
 
 Then tag those as restricted.  Why should we keep apps away from
 the simple ones?

You can't predict bugs, but what you can do is know that the lesser
used algorithms are by definition less tested and therefore more
likely to have bugs.  Everything except the default and Reno is
lesser used.

Safe by default, there is no other choice.  You fail to respond to
THAT part of my email.  That's the important point.  Let me
reiterate:

 It's bad enough that people are all over us for the default algorithm
 we have chosen, so it'd be extremely irresponsible and even worse if
 we allowed users to select any of the other research algorithms for
 their TCP connections by default just because those modules happened
 to be configured into the kernel.

 This userspace convenience argument holds zero water.

 Provide a way for the administrator to control the situation fully,
 and choose a sane default which errs on the side of caution for the
 sake of internet stability.

Please reread this and consider why it's important.


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
On Fri, 27 Oct 2006 15:12:38 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Fri, 27 Oct 2006 14:59:13 -0700
 
  On Fri, 27 Oct 2006 14:37:01 -0700 (PDT)
  David Miller [EMAIL PROTECTED] wrote:
  
   From: Stephen Hemminger [EMAIL PROTECTED]
   Date: Fri, 27 Oct 2006 14:24:02 -0700
   
    Only some (very few) have any bad consequences. So the typical
    distribution should be able to switch with most available for everyone,
    and only a few needing special privileges.
   
   I would strongly disagree as we've had several OOPS'er class bugs in
   the less frequently used algorithms.
   
  
  Then tag those as restricted.  Why should we keep apps away from
  the simple ones?
 
 You can't predict bugs, but what you can do is know that the lesser
 used algorithms are by definition less tested and therefore more
 likely to have bugs.  Everything except the default and Reno is
 lesser used.

If they aren't usable they should be marked BROKEN or something
like that. The stability argument doesn't really work, we don't
like to let root kill the system either.
 
 Safe by default, there is no other choice.  You fail to respond to
 THAT part of my email.  That's the important point.  Let me
 reiterate:
 
  It's bad enough that people are all over us for the default algorithm
  we have chosen, so it'd be extremely irresponsible and even worse if
  we allowed users to select any of the other research algorithms for
  their TCP connections by default just because those modules happened
  to be configured into the kernel.

Make it hard for them to configure then.  I don't want your
distro to ship with the risky ones turned on.  But we should allow
use of reno, bic, cubic, lp, htcp, and westwood (maybe) by regular
users if admin allows.

  This userspace convenience argument holds zero water.
 
  Provide a way for the administrator to control the situation fully,
  and choose a sane default which errs on the side of caution for the
  sake of internet stability.
 
 Please reread this and consider why it's important.

The current situation is fine. You have to ask for them in the configuration,
and root has to either load the module or set it as default.

The restricted flag patch, which you have ignored, would be a way to
allow them to be configured while tagging the bad apples for
root-only usage.
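
For reference, the per-connection choice being argued about is made from
userspace with the TCP_CONGESTION socket option. The following is only a
minimal sketch, assuming Linux and <netinet/tcp.h>; the helper names
set_cong and get_cong are hypothetical and do not come from any patch in
this thread.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to use the named congestion
 * control algorithm on socket fd.
 * Returns 0 on success, -1 with errno set on failure. */
int set_cong(int fd, const char *name)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
			  name, (socklen_t)strlen(name));
}

/* Hypothetical helper: read back the name of the algorithm currently
 * attached to socket fd into buf (len bytes available).
 * Returns 0 on success, -1 with errno set on failure. */
int get_cong(int fd, char *buf, socklen_t len)
{
	return getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len);
}
```

An application would call set_cong(fd, "reno") on a TCP socket before
connecting; whether an unprivileged caller may name anything other than
the defaults is exactly the policy question in this thread.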






Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 27 Oct 2006 15:21:49 -0700

 The restricted flag patch, which you have ignored, would be a way to
 allow them to be configured while tagging the bad apples for
 root-only usage.

I haven't ignored it, it's in my backlog below more important
things like Appletalk OOPS'ers.


Re: 2.4/2.6 share in linux routers ?

2006-10-27 Thread Yakov Lerner

On 10/28/06, David Miller [EMAIL PROTECTED] wrote:


Please stop all of this cross posting.  I've just seen you post
this same exact email on the netfilter lists too.


Sorry


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread Stephen Hemminger
How about another way of controlling this via sysctl?

First, add a read-only
/proc/sys/net/ipv4/tcp_available_congestion_control  (or shorter name).
This will show everything compiled in (even if not loaded yet), similar
to /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.

Second, add a flag (allowed) to the tcp_congestion structure [the inverse
of the earlier restricted flag].

Third, add a read-write
/proc/sys/net/ipv4/tcp_allowed_congestion_control
to show and set/clear the allowed flag. The default value would be
"reno xxx", where xxx is whatever the default value from the kernel
config is (currently cubic).

I would use sysfs for this, but it makes sense not to spread TCP stuff
across both sysctl and sysfs.
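
A rough userspace model of the three steps above may make the semantics of
the allowed flag more concrete. Everything in it (struct cong_entry,
cong_table, the function names and the table contents) is hypothetical and
is not taken from any posted patch; letting privileged callers bypass the
flag is an assumption carried over from the earlier restricted-flag idea.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: each registered algorithm carries an "allowed"
 * flag that the administrator can set or clear. Illustrative only,
 * not kernel code. */
struct cong_entry {
	const char *name;
	bool allowed;	/* usable by unprivileged processes */
};

struct cong_entry cong_table[] = {
	{ "reno",  true  },	/* always part of the default allowed set */
	{ "cubic", true  },	/* assumed to be the build-time default */
	{ "bic",   false },
	{ "htcp",  false },
};

#define CONG_COUNT (sizeof(cong_table) / sizeof(cong_table[0]))

/* Look up an algorithm by name; NULL if it is not registered. */
struct cong_entry *cong_find(const char *name)
{
	for (size_t i = 0; i < CONG_COUNT; i++)
		if (strcmp(cong_table[i].name, name) == 0)
			return &cong_table[i];
	return NULL;
}

/* What a write to the "allowed" sysctl would do for a single name.
 * Returns 0 on success, -1 if the name is not registered. */
int cong_set_allowed(const char *name, bool allowed)
{
	struct cong_entry *e = cong_find(name);

	if (!e)
		return -1;
	e->allowed = allowed;
	return 0;
}

/* Policy check: privileged callers may pick anything registered,
 * everyone else only what is flagged as allowed. */
bool cong_may_use(const char *name, bool privileged)
{
	const struct cong_entry *e = cong_find(name);

	return e != NULL && (privileged || e->allowed);
}
```

In this model the "available" list is simply every name in cong_table,
while the "allowed" list is the subset whose flag is set.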


Re: [PATCH 2.6.19-rc3 v2 1/2] amso1100 - Use dma_alloc_coherent instead of kmalloc/dma_map_single.

2006-10-27 Thread Roland Dreier
tsk, tsk:

fatal: 7 lines add trailing whitespaces.

applied to for-2.6.19 anyway, thanks.


Re: [PATCH 2.6.19-rc3 v2 2/2] amso1100 - Fix incorrect pr_debug().

2006-10-27 Thread Roland Dreier
Applied, thanks.


[RFC] tcp: available congestion control

2006-10-27 Thread Stephen Hemminger
Nice way to see what congestion control modules are loaded.
It does impose a soft limit of 32 possibilities.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 include/linux/sysctl.h     |    1 +
 include/net/tcp.h          |    3 +++
 net/ipv4/sysctl_net_ipv4.c |   25 ++++++++++++++++++++++++-
 net/ipv4/tcp_cong.c        |   14 ++++++++++++++
 4 files changed, 42 insertions(+), 1 deletion(-)

--- skge.orig/include/linux/sysctl.h
+++ skge/include/linux/sysctl.h
@@ -418,6 +418,7 @@ enum
    NET_CIPSOV4_CACHE_BUCKET_SIZE=119,
    NET_CIPSOV4_RBM_OPTFMT=120,
    NET_CIPSOV4_RBM_STRICTVALID=121,
+   NET_TCP_AVAIL_CONG_CONTROL=122,
 };
 
 enum {
--- skge.orig/include/net/tcp.h
+++ skge/include/net/tcp.h
@@ -621,6 +621,8 @@ enum tcp_ca_event {
  * Interface for adding new TCP congestion control handlers
  */
 #define TCP_CA_NAME_MAX 16
+#define TCP_CA_MAX 32
+
 struct tcp_congestion_ops {
    struct list_head list;
 
@@ -659,6 +661,7 @@ extern void tcp_unregister_congestion_co
 extern void tcp_init_congestion_control(struct sock *sk);
 extern void tcp_cleanup_congestion_control(struct sock *sk);
 extern int tcp_set_default_congestion_control(const char *name);
+extern void tcp_get_available_congestion_control(char *name, int maxlen);
 extern void tcp_get_default_congestion_control(char *name);
 extern int tcp_set_congestion_control(struct sock *sk, const char *name);
 extern void tcp_slow_start(struct tcp_sock *tp);
--- skge.orig/net/ipv4/sysctl_net_ipv4.c
+++ skge/net/ipv4/sysctl_net_ipv4.c
@@ -108,6 +108,22 @@ static int proc_tcp_congestion_control(c
    return ret;
 }
 
+static int proc_tcp_available_congestion_control(ctl_table *ctl,
+                        int write, struct file * filp,
+                        void __user *buffer, size_t *lenp,
+                        loff_t *ppos)
+{
+   char val[TCP_CA_MAX*(TCP_CA_NAME_MAX+1)];
+   ctl_table tbl = {
+       .data = val,
+       .maxlen = TCP_CA_MAX*(TCP_CA_NAME_MAX+1),
+   };
+
+   tcp_get_available_congestion_control(val, tbl.maxlen);
+
+   return proc_dostring(&tbl, write, filp, buffer, lenp, ppos);
+}
+
 static int sysctl_tcp_congestion_control(ctl_table *table, int __user *name,
 int nlen, void __user *oldval,
 size_t __user *oldlenp,
@@ -133,9 +149,9 @@ static int __init tcp_congestion_default
 {
    return tcp_set_default_congestion_control(CONFIG_DEFAULT_TCP_CONG);
 }
-
 late_initcall(tcp_congestion_default);
 
+
 ctl_table ipv4_table[] = {
 {
        .ctl_name   = NET_IPV4_TCP_TIMESTAMPS,
@@ -738,6 +754,13 @@ ctl_table ipv4_table[] = {
        .proc_handler   = proc_dointvec,
    },
 #endif /* CONFIG_NETLABEL */
+   {
+       .ctl_name   = NET_TCP_AVAIL_CONG_CONTROL,
+       .procname   = "tcp_available_congestion_control",
+       .mode       = 0444,
+       .maxlen     = TCP_CA_MAX*(TCP_CA_NAME_MAX+1),
+       .proc_handler   = proc_tcp_available_congestion_control,
+   },
    { .ctl_name = 0 }
 };
 
--- skge.orig/net/ipv4/tcp_cong.c
+++ skge/net/ipv4/tcp_cong.c
@@ -144,6 +144,20 @@ void tcp_get_default_congestion_control(
    rcu_read_unlock();
 }
 
+/* Build string with list of available congestion control values */
+void tcp_get_available_congestion_control(char *name, int maxlen)
+{
+   struct tcp_congestion_ops *ca;
+   int offs = 0;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(ca, &tcp_cong_list, list) {
+       offs += snprintf(name + offs, maxlen - offs, "%s%s",
+                offs == 0 ? "" : " ", ca->name);
+   }
+   rcu_read_unlock();
+}
+
 /* Change congestion control for socket */
 int tcp_set_congestion_control(struct sock *sk, const char *name)
 {
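
The list-building loop in the patch depends on the buffer being large
enough for every registered name (TCP_CA_MAX entries of TCP_CA_NAME_MAX+1
bytes each), which is where the soft limit of 32 comes from: snprintf()
returns the length it would have written, so a truncated entry would push
the offset past the end of the buffer. A possible bounded variant is
sketched below as plain userspace C; join_names and its parameters are
hypothetical, not part of the patch.

```c
#include <stddef.h>
#include <stdio.h>

/* Join names[0..count-1] into buf, separated by single spaces.
 * Stops before the first entry that would not fit, so buf is always
 * NUL-terminated and the offset never exceeds maxlen.
 * Returns the number of names actually written. */
size_t join_names(char *buf, size_t maxlen,
		  const char *const *names, size_t count)
{
	size_t offs = 0;
	size_t i;

	if (maxlen == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < count; i++) {
		int n = snprintf(buf + offs, maxlen - offs, "%s%s",
				 offs == 0 ? "" : " ", names[i]);

		/* A return value >= the remaining space means the
		 * entry was truncated. */
		if (n < 0 || (size_t)n >= maxlen - offs) {
			buf[offs] = '\0';	/* drop the partial entry */
			break;
		}
		offs += (size_t)n;
	}
	return i;
}
```

With a buffer sized as in the patch the check never triggers for 32 or
fewer names; it only matters once more algorithms are registered than the
buffer was sized for.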