Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006-11-07 Thread Al Boldi
Zhao Xiaoming wrote:
 The latest update:
 It seems that the Linux kernel memory management mechanisms, including
 the buddy and slab algorithms, are not very efficient under my test
 conditions, in which the TCP stack requires a lot of (hundreds of MB of)
 packet buffers and releases them very frequently.
 Here is the proof. After changing my kernel configuration to use a
 2G/2G VM split, LOWMEM consumption dropped to 270M bytes, compared with
 640M bytes for the 1G/3G kernel. All test conditions are the same and
 the memory pages allocated by the TCP stack are also the same, 34K ~ 38K
 pages. In other words, 'lost' memory changed from ~500M to ~130M.
 So all I can do is guess that having many more free pages makes
 the slab/buddy algorithms more efficient, so they waste less memory.

I kind of agree, and always compile for a 2G/2G VM split, as this also seems 
to affect certain OOM conditions positively.

What isn't quite clear, though: why is the 2G/2G VM split not the default?
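A back-of-the-envelope check of the figures quoted above, as a minimal C sketch. It assumes 4 KiB pages and an i386-style 128 MiB vmalloc/fixmap reserve carved out of the kernel's virtual window (which is why a 1 GiB kernel window gives the familiar ~896 MiB of lowmem), and it reads "34K ~ 38K pages" as roughly 36,000 pages. The function names are hypothetical.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, as on i386. */
#define PAGE_SIZE_KIB        4u
/* Assumption: i386 keeps ~128 MiB of the kernel's virtual window for
 * vmalloc/fixmap, so that part cannot hold directly mapped lowmem. */
#define VMALLOC_RESERVE_MIB  128u

/* Upper bound on directly mapped lowmem for a given kernel window size. */
static unsigned int lowmem_limit_mib(unsigned int kernel_window_mib)
{
    assert(kernel_window_mib > VMALLOC_RESERVE_MIB);
    return kernel_window_mib - VMALLOC_RESERVE_MIB;
}

/* Lowmem in use that is not accounted for by the TCP stack's own pages. */
static double lost_mib(double lowmem_used_mib, unsigned long tcp_pages)
{
    double tcp_mib = (double)tcp_pages * PAGE_SIZE_KIB / 1024.0;
    return lowmem_used_mib - tcp_mib;
}
```

With 36,000 pages (about 141 MiB), lost_mib(640.0, 36000) comes out near 499 MiB and lost_mib(270.0, 36000) near 129 MiB, in line with the ~500M and ~130M in the report. Under the same assumptions, lowmem_limit_mib(1024) is 896 and lowmem_limit_mib(2048) is 1920, which is why a 2G/2G split leaves so much more lowmem headroom.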


Thanks!

--
Al

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [XFRM/IPV6] fix next header offset in decode_session

2006-11-07 Thread Jean-Philippe Andriot

The way to get the next protocol number of an IPv6 tunnel changed
after IP6CB was introduced, but I think we should go back to the
previous version here.
In our case, I think there was confusion between the pointer
to the first byte of the next header and the value of the next
header field.

Signed-off-by: Andriot Jean-Philippe [EMAIL PROTECTED]

--- xfrm6_policy.c.org  2006-11-07 09:45:47.0 +0100
+++ xfrm6_policy.c  2006-11-07 09:46:19.0 +0100
@@ -255,7 +255,7 @@ _decode_session6(struct sk_buff *skb, st
u16 offset = skb->h.raw - skb->nh.raw;
struct ipv6hdr *hdr = skb->nh.ipv6h;
struct ipv6_opt_hdr *exthdr;
-   u8 nexthdr = skb->nh.raw[IP6CB(skb)->nhoff];
+   u8 nexthdr = skb->nh.ipv6h->nexthdr;
 
memset(fl, 0, sizeof(struct flowi));
ipv6_addr_copy(&fl->fl6_dst, &hdr->daddr);



Re: [PATCH/RFC] add netpoll support for gianfar

2006-11-07 Thread Jeff Garzik

Vitaly Wool wrote:

The patch inlined below adds NET_POLL_CONTROLLER support for gianfar network 
driver.


As noted, this patch is out of date.  2.6.19-rc kernels removed the 
pt_regs argument from all irq handlers.


Jeff





Re: [PATCH 1/3] ethtool: marvell register dump

2006-11-07 Thread Jeff Garzik

Stephen Hemminger wrote:

This is a consolidation of earlier marvell register decode patches to ethtool.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


applied patch 1 of 3

patches 2 and 3 still in the queue under consideration.




Re: [PATCH] Fixed a number of bugs in the PHY Layer

2006-11-07 Thread Jeff Garzik

Andy Fleming wrote:

* genphy_update_link is now exported
* Added a fix from [EMAIL PROTECTED] which changes forcing so it
  only updates the link.  Otherwise, it never tries the lower
  values, since it is always overwriting the speed/duplex values
  with the current ones, rather than the intended ones.
* Fixed a bug where bringing up a PHY with no link caused it to
  timeout, and enter forcing mode.  Once in forcing mode,
  plugging in the link didn't autonegotiate.  Now the AN state
  detects the lack of link, and enters the NO_LINK state.  AN
  only times out if the link is up and AN fails
* Cleaned up the PHY_AN case, reducing one level of indentation
  for the timeout code.
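The changed autonegotiation behaviour in the list above can be pictured as a small state-transition function. This is a hypothetical userspace sketch, not the phylib code: the demo_ state names only loosely mirror the PHY layer's states, and the three boolean inputs stand in for whatever the real state machine polls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, loosely modelled on the PHY layer's state machine. */
enum demo_phy_state {
    DEMO_PHY_AN,       /* autonegotiation in progress */
    DEMO_PHY_NOLINK,   /* no link partner: wait instead of forcing */
    DEMO_PHY_FORCING,  /* link up but AN failed: fall back to forcing */
    DEMO_PHY_RUNNING,  /* AN completed */
};

/* One step of the AN state after the fix: a missing link never leads to
 * forcing mode; AN only times out if the link is up and AN has failed. */
static enum demo_phy_state demo_an_step(bool link_up, bool an_complete,
                                        bool an_timed_out)
{
    if (!link_up)
        return DEMO_PHY_NOLINK;
    if (an_complete)
        return DEMO_PHY_RUNNING;
    if (an_timed_out)
        return DEMO_PHY_FORCING;
    return DEMO_PHY_AN;
}
```

Checking the link before the timeout is the point of the fix: an unplugged cable lands in the no-link state, so plugging it in later can still autonegotiate.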


applied

Please include a Signed-off-by line in future patches!

Jeff





Re: [patch 4/8] sundance: fix TX Pause bug (reset_tx, intr_handler)

2006-11-07 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Jesse Huang [EMAIL PROTECTED]

Fix TX Pause bug (reset_tx, intr_handler).  When MaxCollisions occurs, Tx
needs to be re-enabled.  But just after re-enabling, MaxCollisions may occur
again, together with TxStatusOverflow.  In that case the driver cannot see the
new MaxCollisions and re-enable Tx again, because of the TxStatusOverflow.  For
this reason, after re-enabling Tx, we need to make sure Tx was actually enabled.
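The "re-enable, then verify" idea can be sketched against a simulated device. Everything here is hypothetical: the demo_nic model, the retry bound, and the way a pending MaxCollisions event knocks Tx back off stand in for the real sundance registers, which this sketch does not touch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model: not the real sundance registers. */
struct demo_nic {
    bool tx_enabled;
    unsigned int collisions_pending; /* MaxCollisions events still to arrive */
};

/* Simulated "enable Tx" write: a pending MaxCollisions event may
 * disable Tx again immediately after it is enabled. */
static void demo_tx_enable(struct demo_nic *nic)
{
    nic->tx_enabled = true;
    if (nic->collisions_pending > 0) {
        nic->collisions_pending--;
        nic->tx_enabled = false;
    }
}

/* Re-enable Tx, then check that it really is enabled; retry a bounded
 * number of times rather than trusting a single write. */
static bool demo_reenable_tx(struct demo_nic *nic, unsigned int max_tries)
{
    for (unsigned int i = 0; i < max_tries; i++) {
        demo_tx_enable(nic);
        if (nic->tx_enabled)
            return true;
    }
    return false;
}
```

The bound keeps the loop from spinning forever if the device never recovers; the caller can then fall back to a full Tx reset.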

Signed-off-by: Jesse Huang [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]


applied




Re: [patch 3/8] sundance: remove TxStartThresh and RxEarlyThresh

2006-11-07 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Jesse Huang [EMAIL PROTECTED]

Because of a patent issue, we need to remove TxStartThresh and RxEarlyThresh.
The patent in question is the cut-through patent: if this function is used, Tx
starts transmitting after only a little data has been moved into the Tx FIFO.
We are not allowed to use this function on the
DFE530/DFE550/DFE580/DL10050/IP100/IP100A.  Removing it will decrease
performance slightly.

Signed-off-by: Jesse Huang [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]


applied




Re: [patch 6/8] sundance: correct initial and close hardware step.

2006-11-07 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Jesse Huang [EMAIL PROTECTED]

Correct the hardware initialization and close steps.  On some embedded systems,
bringing the IP100A down and up causes a DMA crash.  We add some steps so that
the IP100A can be brought down and up safely.

Signed-off-by: Jesse Huang [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]


applied




Re: [PATCH] [XFRM/IPV6] fix next header offset in decode_session

2006-11-07 Thread YOSHIFUJI Hideaki / 吉藤英明
In article [EMAIL PROTECTED] (at Tue, 7 Nov 2006 10:30:02 +0100), 
Jean-Philippe Andriot [EMAIL PROTECTED] says:

 The way to get the next protocol number of an IPv6 tunnel changes
 after introducing IP6CB, but I think we should go back to the
 previous version here.
:
 struct ipv6_opt_hdr *exthdr;
 -   u8 nexthdr = skb->nh.raw[IP6CB(skb)->nhoff];
 +   u8 nexthdr = skb->nh.ipv6h->nexthdr;
  
 memset(fl, 0, sizeof(struct flowi));

I disagree.

If you do this, you refer to the first extension header only.
We need to skip preceding extension headers using IP6CB(skb)->nhoff,
which holds the offset of the current nexthdr.
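To illustrate the point in userspace: the fixed IPv6 header's nexthdr byte only names the first header that follows, so finding the upper-layer protocol means following the chain, remembering at each step the offset of the byte that holds the current nexthdr (the role IP6CB(skb)->nhoff plays in the kernel). This minimal sketch handles only hop-by-hop, routing and destination options headers, which share the {nexthdr, hdrlen} layout; the helper names and constants are illustrative, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV6_HDR_LEN      40u
#define IPV6_NEXTHDR_OFF   6u  /* offset of nexthdr inside the fixed header */
#define NEXTHDR_HOP        0u  /* hop-by-hop options */
#define NEXTHDR_ROUTING   43u
#define NEXTHDR_DEST      60u  /* destination options */

static int is_generic_exthdr(uint8_t nh)
{
    return nh == NEXTHDR_HOP || nh == NEXTHDR_ROUTING || nh == NEXTHDR_DEST;
}

/* Return the protocol named by the last nexthdr byte reached. nhoff always
 * points at the byte holding the current nexthdr value. */
static uint8_t ipv6_upper_proto(const uint8_t *pkt, size_t len)
{
    size_t nhoff = IPV6_NEXTHDR_OFF;
    size_t off = IPV6_HDR_LEN;       /* start of the header nexthdr names */

    assert(len >= IPV6_HDR_LEN);
    while (is_generic_exthdr(pkt[nhoff]) && off + 2 <= len) {
        nhoff = off;                           /* ext header's own nexthdr */
        off += ((size_t)pkt[off + 1] + 1) * 8; /* hdrlen: 8-octet units past the first 8 */
    }
    return pkt[nhoff];
}
```

For a packet carrying one hop-by-hop header before TCP, pkt[6] is 0 (hop-by-hop), which is all that ipv6h->nexthdr reports, while the walk returns 6 (TCP).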

--yoshfuji


Re: Zero checksum in netconsole/netdump packets

2006-11-07 Thread Gerrit Renker
Quoting Chris Lalancette:
|  Hello,
|   I realized that all of the packets that go from the crashing machine to 
the netdump server have a zero checksum. 
snip
|   Assuming that this is just an oversight, attached is a simple patch to 
compute the UDP checksum in netpoll_send_udp.
|  
|  Signed-off-by: Chris Lalancette [EMAIL PROTECTED]
|  
RFC 768 allows the checksum to be skipped by leaving uh->check at 0, hence it 
is not illegal.
But without David's suggestion the code is not valid, since otherwise there is
no way of distinguishing a computed `0' from an ignored `0' field:
	if (udph->check == 0)
		udph->check = -1;
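A self-contained userspace sketch of the rule discussed above: compute the UDP checksum over the IPv4 pseudo-header plus the UDP header and payload, and transmit a computed 0 as 0xffff, since 0 on the wire means "no checksum". The helper names are hypothetical; in the kernel the equivalent work is done by csum_partial() and csum_tcpudp_magic(), as in the netpoll patch in this thread. Addresses are taken in host byte order and the result is returned in host order, to be stored big-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add 16-bit big-endian words to a running one's-complement sum. */
static uint32_t csum_add_bytes(uint32_t sum, const uint8_t *p, size_t len)
{
    while (len > 1) {
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                         /* odd trailing byte, zero-padded */
        sum += (uint32_t)p[0] << 8;
    return sum;
}

/* Fold carries back into 16 bits and take the one's complement. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* UDP checksum over the IPv4 pseudo-header and the UDP datagram, whose
 * check field (bytes 6-7) must be zero on entry. */
static uint16_t udp4_checksum(uint32_t saddr, uint32_t daddr,
                              const uint8_t *udp, uint16_t udp_len)
{
    uint32_t sum = 0;

    assert(udp_len >= 8 && udp[6] == 0 && udp[7] == 0);
    sum += (saddr >> 16) + (saddr & 0xffff);
    sum += (daddr >> 16) + (daddr & 0xffff);
    sum += 17;                       /* IPPROTO_UDP */
    sum += udp_len;
    sum = csum_add_bytes(sum, udp, udp_len);

    uint16_t check = csum_fold(sum);
    return check == 0 ? 0xffff : check;  /* 0 would mean "no checksum" */
}
```

A receiver can verify a datagram by summing the same fields with the transmitted checksum in place: for an intact packet the folded, complemented result is 0, whether the sender sent the computed value or the 0xffff substitute.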


Re: [PATCH/RFC] add netpoll support for gianfar

2006-11-07 Thread Vitaly Wool
On Mon, 6 Nov 2006 15:26:33 -0600
Andy Fleming [EMAIL PROTECTED] wrote:
 
 You are passing extra arguments, here

Oh yes, thanks. I was out of sync here.

 1) Do we need the disable/enable irq stuff?  It seems like we should  
 be able to either just *mask* the interrupts at the controller, or  
 rely on the locks to disable the interrupts.

I don't see how masking the ints at the controller differs much from 
disable_irq.
Locking all the interrupts is definitely worse than disabling selected ones. 
Also, introducing locks here means that we'll need to handle that specifically 
for -rt kernels.
 
 2) If we are calling gfar_transmit and gfar_receive, shouldn't we  
 call gfar_error?
 
 3) I think it should be possible to just call gfar_interrupt() in  
 every situation, but I'm not very familiar with net poll's  
 requirements (You can add that into your evaluation of #1, too).

Oh yes, that's a nice idea, thanks.

Vitaly


[PATCH/RFC] add netpoll support for gianfar: respin

2006-11-07 Thread Vitaly Wool
The patch inlined below adds NET_POLL_CONTROLLER support for gianfar network 
driver, slightly modified wrt the comments from Andy Fleming.

 drivers/net/gianfar.c |   33 +
 1 file changed, 33 insertions(+)

Signed-off-by: Vitaly Wool [EMAIL PROTECTED]
  
Index: powerpc/drivers/net/gianfar.c
===
--- powerpc.orig/drivers/net/gianfar.c
+++ powerpc/drivers/net/gianfar.c
@@ -133,6 +133,9 @@ static void gfar_set_hash_for_addr(struc
 #ifdef CONFIG_GFAR_NAPI
 static int gfar_poll(struct net_device *dev, int *budget);
 #endif
+#ifdef CONFIG_NET_POLL_CONTROLLER
+static void gfar_netpoll(struct net_device *dev);
+#endif
 int gfar_clean_rx_ring(struct net_device *dev, int rx_work_limit);
 static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb, int length);
 static void gfar_vlan_rx_register(struct net_device *netdev,
@@ -260,6 +263,9 @@ static int gfar_probe(struct platform_de
dev->poll = gfar_poll;
dev->weight = GFAR_DEV_WEIGHT;
 #endif
+#ifdef CONFIG_NET_POLL_CONTROLLER
+   dev->poll_controller = gfar_netpoll;
+#endif
dev->stop = gfar_close;
dev->get_stats = gfar_get_stats;
dev->change_mtu = gfar_change_mtu;
@@ -1536,6 +1542,33 @@ static int gfar_poll(struct net_device *
 }
 #endif
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/*
+ * Polling 'interrupt' - used by things like netconsole to send skbs
+ * without having to re-enable interrupts. It's not called while
+ * the interrupt routine is executing.
+ */
+static void gfar_netpoll(struct net_device *dev)
+{
+   struct gfar_private *priv = netdev_priv(dev);
+
+   /* If the device has multiple interrupts, run tx/rx */
+   if (priv->einfo->device_flags & FSL_GIANFAR_DEV_HAS_MULTI_INTR) {
+       disable_irq(priv->interruptTransmit);
+       disable_irq(priv->interruptReceive);
+       disable_irq(priv->interruptError);
+       gfar_interrupt(priv->interruptTransmit, dev);
+       enable_irq(priv->interruptError);
+       enable_irq(priv->interruptReceive);
+       enable_irq(priv->interruptTransmit);
+   } else {
+       disable_irq(priv->interruptTransmit);
+       gfar_interrupt(priv->interruptTransmit, dev);
+       enable_irq(priv->interruptTransmit);
+   }
+}
+#endif
+
 /* The interrupt handler for devices with one interrupt */
 static irqreturn_t gfar_interrupt(int irq, void *dev_id)
 {



Re: [PATCH 2/3] add dev_to_node()

2006-11-07 Thread Christoph Hellwig
On Mon, Nov 06, 2006 at 10:25:36PM -0800, Ravikiran G Thirumalai wrote:
 On Sun, Nov 05, 2006 at 12:53:23AM +0100, Christoph Hellwig wrote:
  On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
   On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
   
   This will break the compile for !NUMA if someone ends up doing a bisect
   and lands here as a bisect point.
   
   You introduce this nice wrapper..
  
  The dev_to_node wrapper is not enough as we can't assign to (-1) for
  the non-NUMA case.  So I added a second macro, set_dev_node for that.
  
  The patch below compiles and works on numa and non-NUMA platforms.
  
  
 
 Hi Christoph,
 dev_to_node does not work as expected on x86_64 (and i386).  This is because
 node value returned by pcibus_to_node is initialized after a struct device
 is created with current x86_64 code.
 
 We need the node value initialized before the call to pci_scan_bus_parented,
 as the generic devices are allocated and initialized
 off pci_scan_child_bus, which gets called from pci_scan_bus_parented
 The following patch does that using pci_sysdata introduced by the PCI
 domain patches in -mm.

Ah, nice that some non-Cell folks actually care about this patch.  As far
as my knowledge of the x86_64 PCI code goes, that patch looks fine to me.
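Why a second macro is needed for the !NUMA case can be shown with a userspace model of the wrapper pair. This is a sketch in the spirit of the patch discussed above, not its exact text: struct demo_device and DEMO_CONFIG_NUMA are stand-ins for struct device and CONFIG_NUMA. Without NUMA, dev_to_node() expands to the constant -1, so an assignment through it cannot compile, and set_dev_node() has to be a no-op instead.

```c
#include <assert.h>

/* Stand-in for struct device: the node field exists only with NUMA. */
struct demo_device {
    const char *name;
#ifdef DEMO_CONFIG_NUMA
    int numa_node;
#endif
};

#ifdef DEMO_CONFIG_NUMA
#define dev_to_node(dev)        ((dev)->numa_node)
#define set_dev_node(dev, node) ((dev)->numa_node = (node))
#else
/* No field to read or write: reads yield -1 ("no node"), writes vanish.
 * Writing dev_to_node(dev) = node here would expand to (-1) = node. */
#define dev_to_node(dev)        (-1)
#define set_dev_node(dev, node) do { (void)(dev); (void)(node); } while (0)
#endif
```

Compiled without DEMO_CONFIG_NUMA, set_dev_node() compiles away and dev_to_node() always reports -1; defining it switches both macros to the real field.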



[git patches] net driver fixes

2006-11-07 Thread Jeff Garzik

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git 
upstream-linus

to receive the following updates:

 drivers/net/b44.c  |5 +++--
 drivers/net/e1000/e1000_main.c |7 +++
 2 files changed, 10 insertions(+), 2 deletions(-)

Auke Kok:
  e1000: Fix regression: garbled stats and irq allocation during swsusp

Johannes Berg:
  b44: change comment about irq mask register

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index 1ec2174..474a4e3 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -908,8 +908,9 @@ static irqreturn_t b44_interrupt(int irq
istat = br32(bp, B44_ISTAT);
imask = br32(bp, B44_IMASK);
 
-   /* ??? What the fuck is the purpose of the interrupt mask
-* ??? register if we have to mask it out by hand anyways?
+   /* The interrupt mask register controls which interrupt bits
+* will actually raise an interrupt to the CPU when set by hw/firmware,
+* but doesn't mask off the bits.
 */
istat &= imask;
if (istat) {
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 8d04752..726ec5e 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -4800,6 +4800,9 @@ #endif
if (adapter->hw.phy_type == e1000_phy_igp_3)
e1000_phy_powerdown_workaround(&adapter->hw);
 
+   if (netif_running(netdev))
+   e1000_free_irq(adapter);
+
/* Release control of h/w to f/w.  If f/w is AMT enabled, this
 * would have already happened in close and is redundant. */
e1000_release_hw_control(adapter);
@@ -4830,6 +4833,10 @@ e1000_resume(struct pci_dev *pdev)
pci_enable_wake(pdev, PCI_D3hot, 0);
pci_enable_wake(pdev, PCI_D3cold, 0);
 
+   if (netif_running(netdev) && (err = e1000_request_irq(adapter)))
+   return err;
+
+   e1000_power_up_phy(adapter);
e1000_reset(adapter);
E1000_WRITE_REG(&adapter->hw, WUS, ~0);
 


Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Jeff Garzik

Evgeniy Polyakov wrote:

Generic event handling mechanism.

Consider for inclusion.

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents
With this release and a fixed userspace web server, it was possible to 
achieve 3960+ req/s with a client connection rate of 4000 con/s
over a 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which
is very close to wire speed if we take headers and the like into account.
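How close "close to wire speed" is can be worked out with a few lines of C. The framing figures are assumptions, not from the post: "KB" is read as 1024 bytes, and full-sized 1500-byte IPv4/TCP frames with timestamps are assumed, each costing 38 extra bytes of Ethernet header, FCS, preamble and inter-frame gap on the wire.

```c
#include <assert.h>

/* Assumptions, not from the original post: "KB" means 1024 bytes,
 * 1500-byte MTU, IPv4, TCP timestamps on, no VLAN tags. */
#define MTU_BYTES          1500.0
#define IP_TCP_HDR_BYTES     52.0  /* 20 IPv4 + 20 TCP + 12 timestamp option */
#define ETH_OVERHEAD_BYTES   38.0  /* 14 header + 4 FCS + 8 preamble/SFD + 12 gap */

/* Best-case TCP payload rate for full-sized frames on a given link. */
static double max_tcp_goodput_mbit(double link_mbit)
{
    assert(link_mbit > 0.0);
    return link_mbit * (MTU_BYTES - IP_TCP_HDR_BYTES)
                     / (MTU_BYTES + ETH_OVERHEAD_BYTES);
}

/* Convert KiB/s to Mbit/s (10^6 bits per second). */
static double kib_per_s_to_mbit(double kib_per_s)
{
    return kib_per_s * 1024.0 * 8.0 / 1e6;
}
```

Under these assumptions the reported 10582.7 KB/s is about 86.7 Mbit/s of payload against a best case of about 94.1 Mbit/s on 100 Mbit Ethernet, i.e. roughly 92% of what the link can carry.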


OK, now that ring buffer is here, I definitely like the direction this 
code is taking.  I just committed the patches to a local repo for a good 
in-depth review.


Could you write up a simple text file, documenting (a) your proposed 
syscalls and (b) your ring buffer design?



Overall I have a Linux design wish, that I hope kevent can fulfill:

To develop completely async applications (generally network servers, in 
Linux-land) and increase the chance of zero-copy I/O, network and file 
I/O submission and completion should be as async as possible.


As such, syscalls themselves have become a serializing bottleneck that 
isn't strictly necessary.  A fully-async application should be able to 
submit file read, file write, and network write requests 
asynchronously... in batches.  Network reads, and file I/O completions 
should be received asynchronously, potentially in batches.


Even with epoll and AIO syscalls, Linux isn't quite up to the task.

So to me, the design of the userspace interface that solves this problem 
is a fundamental issue.


My best guess at a solution would be two classes of mmap'd ring buffers, 
request and response.  Let the app allocate one or more.  Then have two 
hooks, (a) kick the kernel to read the request ring, and (b) kick the 
app when one or more events have arrived on a ring.


But that's just thinking out loud.  I welcome any solution that gives 
userspace a fully-async submission/completion interface for both network 
and file I/O.


Setting the standard for a good interface here means Linux will kick ass 
for decades more to come ;-)  This is IMO a Big Deal(tm).


Jeff




Re: Zero checksum in netconsole/netdump packets

2006-11-07 Thread Krzysztof Oledzki



On Tue, 7 Nov 2006, Gerrit Renker wrote:


Quoting Chris Lalancette:
|  Hello,
|   I realized that all of the packets that go from the crashing machine to 
the netdump server have a zero checksum.
snip
|   Assuming that this is just an oversight, attached is a simple patch to 
compute the UDP checksum in netpoll_send_udp.
|
|  Signed-off-by: Chris Lalancette [EMAIL PROTECTED]
|
RFC 768 allows the checksum to be skipped by leaving uh->check at 0, hence it 
is not illegal.


BTW: leaving the UDP checksum at 0 is only valid for IPv4; with IPv6 we _have 
to_ compute a checksum.


Best regards,

Krzysztof Olędzki

Re: [PATCH 4/4] skge: version 1.9

2006-11-07 Thread Michael Stone
The skge 1.9 patch is looking good on older syskonnect fiber cards.  
Stability issues seem to be taken care of and performance is good. There 
are some strange interactions with bonding, however. If I try to put 
both interfaces of an sk-9844 into a bonded interface, I only see 
traffic from one of them. If I try to config the bonded interface down, 
the system hangs. If I tcpdump either of the individual interfaces 
(before bonding them) I see all the expected traffic.


Mike Stone


Re: Zero checksum in netconsole/netdump packets

2006-11-07 Thread Chris Lalancette

David Miller wrote:

From: Chris Lalancette [EMAIL PROTECTED]
Date: Mon, 06 Nov 2006 18:40:59 -0500


 Assuming that this is just an oversight, attached is a simple
 patch to compute the UDP checksum in netpoll_send_udp.


If the resulting checksum is zero, you should set it to
all 1's, like the real UDP code does.


David,
Ah, thanks.  Forgot about that.  I re-spun the patch with the change 
(attached).  I also moved the UDP checksum calculation up to where the rest of 
the UDP header setup is, to make it more consistent.

Thanks again for the comments!

Signed-off-by: Chris Lalancette [EMAIL PROTECTED]
--- linux-2.6/net/core/netpoll.c.orig	2006-11-06 18:16:58.0 -0500
+++ linux-2.6/net/core/netpoll.c	2006-11-07 08:16:29.0 -0500
@@ -340,6 +340,12 @@ void netpoll_send_udp(struct netpoll *np
 	udph->dest = htons(np->remote_port);
 	udph->len = htons(udp_len);
 	udph->check = 0;
+	udph->check = csum_tcpudp_magic(htonl(np->local_ip),
+	htonl(np->remote_ip),
+	udp_len, IPPROTO_UDP,
+	csum_partial((unsigned char *)udph, udp_len, 0));
+	if (udph->check == 0)
+		udph->check = -1;
 
 	skb->nh.iph = iph = (struct iphdr *)skb_push(skb, sizeof(*iph));
 


Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Jeff Garzik

At an aside...  This may be useful.  Or not.

Al Viro had an interesting idea about kernel-userspace data passing 
interfaces.  He had suggested creating a task-specific filesystem 
derived from ramfs.  Through the normal VFS/VM codepaths, the user can 
easily create [subject to resource/priv checks] a buffer that is locked 
into the pagecache.  Using mmap, read, write, whatever they prefer. 
Derive from tmpfs, and the buffers are swappable.


Then it would be a simple matter to associate a file stored in 
keventfs with a ring buffer guaranteed to be pagecache-friendly.


Heck, that might make zero-copy easier in some cases, too.  And using a 
filesystem would mean that you could do all this without adding 
syscalls, by using special (poll-able!) files in the filesystem for 
control and notification purposes.


Jeff





Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Evgeniy Polyakov
On Tue, Nov 07, 2006 at 06:46:58AM -0500, Jeff Garzik ([EMAIL PROTECTED]) wrote:
 At an aside...  This may be useful.  Or not.
 
 Al Viro had an interesting idea about kernel-userspace data passing 
 interfaces.  He had suggested creating a task-specific filesystem 
 derived from ramfs.  Through the normal VFS/VM codepaths, the user can 
 easily create [subject to resource/priv checks] a buffer that is locked 
 into the pagecache.  Using mmap, read, write, whatever they prefer. 
 Derive from tmpfs, and the buffers are swappable.

It looks like Al likes filesystems more than any other part of the kernel
tree...
The existing ring buffer is created in the process' memory, so it is swappable
too (which is probably the most significant part of this ring buffer 
version), but in theory the kevent file descriptor could be obtained not from
the char device but from a special filesystem (in fact, it was done that
way in the first releases, but then I was asked to remove that
functionality).

 Then it would be a simple matter to associate a file stored in 
 keventfs with a ring buffer guaranteed to be pagecache-friendly.
 
 Heck, that might make zero-copy easier in some cases, too.  And using a 
 filesystem would mean that you could do all this without adding 
 syscalls, by using special (poll-able!) files in the filesystem for 
 control and notification purposes.

There are a great many ideas about zero-copy networking, for both sending
and receiving, and some of them are even implemented at different layers
(from a special allocator down to splice() with one additional
allocation/copy).

   Jeff

-- 
Evgeniy Polyakov


Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Evgeniy Polyakov
On Tue, Nov 07, 2006 at 06:26:09AM -0500, Jeff Garzik ([EMAIL PROTECTED]) wrote:
 Evgeniy Polyakov wrote:
 Generic event handling mechanism.
 
 Consider for inclusion.
 
 Changes from 'take20' patchset:
  * new ring buffer implementation
  * removed artificial limit on possible number of kevents
 With this release and a fixed userspace web server, it was possible to 
 achieve 3960+ req/s with a client connection rate of 4000 con/s
 over a 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which
 is very close to wire speed if we take headers and the like into account.
 
 OK, now that ring buffer is here, I definitely like the direction this 
 code is taking.  I just committed the patches to a local repo for a good 
 in-depth review.

This is the third ring buffer; the fourth one, which should satisfy
everyone, will be in the next release.

 Could you write up a simple text file, documenting (a) your proposed 
 syscalls and (b) your ring buffer design?

An initial draft describing the supported syscalls can be found on the documentation page at
http://linux-net.osdl.org/index.php/Kevent

Ring buffer background is pasted below (quoted from my blog, so do not
pay too much attention if something is occasionally out of sync).

The new ring buffer is implemented fully in userspace, in the process' memory,
which means that no memory is pinned, it can be of almost any size, and
several threads and processes can access it simultaneously.
There is a new system call

int kevent_ring_init(int ctl_fd, struct ring_buffer *ring, unsigned int num);

which initializes kevent's ring buffer (int ctl_fd is a kevent file
descriptor, struct ring_buffer *ring is a userspace-allocated ring
buffer, and unsigned int num is the maximum number of events (struct
ukevent) which can be placed into that buffer).
The ring buffer is described by the following structure:

struct kevent_ring
{
	unsigned int	ring_kidx, ring_uidx;
	struct ukevent	event[0];
};

where unsigned int ring_kidx and ring_uidx are, respectively, the kernel's
last position (i.e. the position pointing to the first place after the last
kevent put by the kernel into the ring buffer) and the last userspace commit
position (i.e. the position where the first unread kevent lives).
I will release an appropriate userspace test application when tests are
completed.
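A single-threaded userspace model of the two-index scheme described above may help. The demo_ names are hypothetical and struct demo_event is a stand-in for struct ukevent. The indices run freely and are reduced modulo a power-of-two size, a common way (assumed here, not taken from kevent) to tell a full ring from an empty one with only two indices. As the post goes on to say, real use shared between threads needs locking around reads and commits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ukevent. */
struct demo_event {
    unsigned int id;
};

/* Must be a power of two so that free-running unsigned indices can be
 * reduced with % even after they wrap around. */
#define DEMO_RING_SIZE 4u

struct demo_ring {
    unsigned int ring_kidx;  /* one past the last event written by the "kernel" */
    unsigned int ring_uidx;  /* first event userspace has not committed yet */
    struct demo_event event[DEMO_RING_SIZE];
};

/* Producer: append an event, failing if every slot is still uncommitted. */
static bool demo_ring_put(struct demo_ring *r, struct demo_event ev)
{
    if (r->ring_kidx - r->ring_uidx == DEMO_RING_SIZE)
        return false;
    r->event[r->ring_kidx % DEMO_RING_SIZE] = ev;
    r->ring_kidx++;
    return true;
}

/* Consumer: look at the oldest uncommitted event without committing it. */
static bool demo_ring_peek(const struct demo_ring *r, struct demo_event *out)
{
    if (r->ring_kidx == r->ring_uidx)
        return false;
    *out = r->event[r->ring_uidx % DEMO_RING_SIZE];
    return true;
}

/* Consumer: mark the n oldest events as processed, freeing their slots. */
static void demo_ring_commit(struct demo_ring *r, unsigned int n)
{
    assert(n <= r->ring_kidx - r->ring_uidx);
    r->ring_uidx += n;
}
```

Committing only from the oldest event onward mirrors the kevent_wait() behaviour described in the next paragraph: a slot becomes reusable only once everything before it has been committed.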

When a kevent is removed (not dequeued when it is ready, but just
removed), it is not copied into the ring buffer even if it was ready:
if it is removed, no one cares about it (otherwise the user would wait until
it became ready and get it the usual way, using kevent_get_events()
or kevent_wait()), so there is no need to copy it to the ring buffer.
Dequeueing a kevent (calling kevent_get_events()) means that the user
has processed the previously dequeued kevent and is ready to process a new
one, so the position in the ring buffer previously occupied by
that event can be reused by the currently dequeued event. In a world where
only one way of getting events is used (either the usual way with
kevent_get_events(), or the ring buffer with kevent_wait()), this should not be a
problem, since kevent_wait() only allows marking a number of events as
processed by userspace starting from the beginning (i.e. from the last
processed event). But if several threads use different models, questions
arise: for example, one thread can start reading events
from the ring buffer while another thread calls
kevent_get_events(), which can overwrite those events. Another
thread can also call kevent_wait() to commit those events (i.e. mark them as
processed by userspace so the kernel can free or requeue them), so
appropriate locking is required in userspace in any case.

So I want to repeat that, with a userspace ring buffer, events in the
ring buffer can be replaced without the knowledge of the
thread currently reading them (when another thread calls
kevent_get_events() or kevent_wait()), so appropriate locking between
threads or processes that can simultaneously access the same ring
buffer is required.

Having a userspace ring buffer allows glibc to make all kevent syscalls
so-called 'cancellation points', i.e. when a thread has been
cancelled in a kevent syscall, it can be safely removed and no events
will be lost, since each syscall copies events into a special ring
buffer accessible from other threads or even processes (if shared
memory is used).


 
 Overall I have a Linux design wish, that I hope kevent can fulfill:
 
 To develop completely async applications (generally network servers, in 
 Linux-land) and increase the chance of zero-copy I/O, network and file 
 I/O submission and completion should be as async as possible.
 
 As such, syscalls themselves have become a serializing bottleneck that 
 isn't strictly necessary.  A fully-async application should be able to 
 submit file read, file write, and network write requests 
 asynchronously... in batches.  Network reads, and file I/O completions 
 should be received 

Re: [take22 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Jeff Garzik

Nate Diller wrote:

Indecisiveness has certainly been an issue here, but I remember akpm
and Ulrich both giving concrete suggestions.  I was particularly
interested in Andrew's request to explain and justify the differences
between kevent and BSD's kqueue interface.  Was there a discussion
that I missed?  I am very interested to see your work on this
mechanism merged, because you've clearly emphasized performance and
shown impressive results.  But it seems like we lose out on a lot by
throwing out all the applications that already use kqueue.



kqueue looks pretty nice, the filter/note models in particular.  I don't 
see anything about ring buffers though.


I also wonder about the asynchronous event side (send), not just the 
event reception side.


Jeff




Re: [take22 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Jeff Garzik

David Miller wrote:

From: Pavel Machek [EMAIL PROTECTED]
Date: Fri, 3 Nov 2006 09:57:12 +0100


Not sure what you are smoking, but there's an unsigned long in the *BSD
version; "let's rewrite it from scratch" sounds like a very bad idea. What
about fixing that one bit you don't like?


I disagree; it's more like: since we have to be structure-incompatible
anyway, let's design something superior if we can.


Definitely agreed.

Jeff





Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Jeff Garzik

Evgeniy Polyakov wrote:

Well, kevent network and FS AIO are suspended for now (although first


Why?

IMO, getting async event submission right is important.  It should be 
designed in parallel with async event reception.


Jeff




Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Jeff Garzik

Evgeniy Polyakov wrote:

The mmap ring buffer implementation was stopped by Andrew Morton and Ulrich
Drepper; the process' memory is used instead. copy_to_user() is slower (and
sometimes noticeably so), but there are major advantages to such an approach.



Hmmm.  I say there are advantages to both.

Perhaps create a kevent_direct_limit resource limit for each thread. 
By default, each thread could mmap $n pinned pagecache pages.  Sysadmin 
can tune certain app resource limits to permit more.


I would think that retaining the option to avoid copy_to_user() 
-somehow- in -some- cases would be wise.


Jeff




Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Evgeniy Polyakov
On Tue, Nov 07, 2006 at 07:17:03AM -0500, Jeff Garzik ([EMAIL PROTECTED]) wrote:
 Evgeniy Polyakov wrote:
 Well, kevent network and FS AIO are suspended for now (although first
 
 Why?
 
 IMO, getting async event submission right is important.  It should be 
 designed in parallel with async event reception.

It was not only designed but also implemented, but...

FS AIO was confirmed to have a correct design, but there were minor (from
my point of view) layering problems
(it was almost suggested that I give myself a lobotomy after I put a
get_block() callback into address_space_operations; there was also some
duplication of mpage_readpages() code, done asynchronously, in
kevent/kevent_aio.c. I did that to keep kevent as separate as possible;
both changes can live in fs/ with an appropriate callback export).

I have postponed network AIO for a while, since judging by how hard core
changes are to get through, that looks like the better decision...
With Ulrich's DMA allocation API (if it existed as more than a
proposal) it would be possible to speed up NAIO a bit further.

The kevent-based FS AIO patch can be found, for example, here (it contains the
full kevent subsystem with network AIO and FS AIO):
http://tservice.net.ru/~s0mbre/archive/kevent/kevent_full.diff.3

Network aio homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

   Jeff
 

-- 
Evgeniy Polyakov


[PATCH 0/3] NetXen: 1G/10G Ethernet Driver updates

2006-11-07 Thread Amit S. Kale

Hi All,

I will be sending NetXen 1G/10G Ethernet driver updates in subsequent
emails. Kindly review them and feel free to send feedback.


Thanks,
--Amit



Re: [PATCH 1/3] NetXen: Fixed /sys mapping between device and driver

2006-11-07 Thread Ingo Oeser
Hi Amit,

one minor nitpick:

You wrote:
 diff --git a/drivers/net/netxen/netxen_nic_main.c b/drivers/net/netxen/netxen_nic_main.c
 index b54ea16..4effb87 100644
 --- a/drivers/net/netxen/netxen_nic_main.c
 +++ b/drivers/net/netxen/netxen_nic_main.c
[...]
 @@ -1040,7 +1041,7 @@ static int netxen_nic_poll(struct net_de
   netxen_nic_enable_int(adapter);
   }
 
 - return (done ? 0 : 1);
 + return (!done);
return !done;

Please drop the parentheses here (CodingStyle).

Just respin or send this change along with later patchsets.
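For an int argument, `!done` evaluates to 1 when `done` is 0 and to 0 otherwise, so it returns exactly what `done ? 0 : 1` did, and the extra parentheses add nothing. A small standalone check; the two helper names are illustrative and not part of the driver:

```c
/* Original form from netxen_nic_poll(). */
static int poll_ret_ternary(int done)
{
	return done ? 0 : 1;
}

/* Form suggested in review: logical NOT already yields 0 or 1. */
static int poll_ret_not(int done)
{
	return !done;
}
```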

Regards

Ingo Oeser


[PATCH 3/3] NetXen: 1G/10G Ethernet Driver updates

2006-11-07 Thread Amit S. Kale


NetXen: 1G/10G Ethernet Driver updates
- Driver cleanup
- These fixes make the driver work on machines with 4G of memory
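The bounce buffer changes in this patch suggest the usual pattern: when a buffer's bus address is beyond what the device can reach, the data is copied through a buffer in low memory instead. A minimal sketch of that test, assuming a 32-bit device limit; `needs_bounce()` and `EXAMPLE_DMA_MASK` are hypothetical and do not appear in the NetXen patch, whose own mask is NETXEN_DMA_MASK:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed device limit for illustration only: 32-bit bus addresses. */
#define EXAMPLE_DMA_MASK 0xffffffffULL

/* Hypothetical helper: true if [addr, addr + len) is not entirely at or
 * below 'mask', so the data would have to go through a bounce buffer in
 * low memory. Requires len > 0; written so that nothing overflows. */
static bool needs_bounce(uint64_t addr, uint64_t len, uint64_t mask)
{
	return addr > mask || len - 1 > mask - addr;
}
```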

Signed-off-by: Amit S. Kale [EMAIL PROTECTED]

 netxen_nic.h  |   41 ++
 netxen_nic_ethtool.c  |   19 ++--
 netxen_nic_hdr.h  |0
 netxen_nic_hw.c   |   10 +-
 netxen_nic_hw.h   |4
 netxen_nic_init.c |   51 +++-
 netxen_nic_ioctl.h|0
 netxen_nic_isr.c  |3
 netxen_nic_main.c |  204 +++---
 netxen_nic_niu.c  |0
 netxen_nic_phan_reg.h |   10 +-
 11 files changed, 293 insertions(+), 49 deletions(-)

diff --git a/drivers/net/netxen/netxen_nic.h b/drivers/net/netxen/netxen_nic.h
index d0d9a29..104f60d 100644
--- a/drivers/net/netxen/netxen_nic.h
+++ b/drivers/net/netxen/netxen_nic.h
@@ -6,12 +6,12 @@
  * modify it under the terms of the GNU General Public License
  * as published by the Free Software Foundation; either version 2
  * of the License, or (at your option) any later version.
- * 
+ *

  * This program is distributed in the hope that it will be useful, but
  * WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
- * 
+ *

  * You should have received a copy of the GNU General Public License
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place - Suite 330, Boston,
@@ -90,8 +90,8 @@ #define ADDR_IN_WINDOW1(off)  \
  * normalize a 64MB crb address to 32MB PCI window
  * To use NETXEN_CRB_NORMALIZE, window _must_ be set to 1
  */
-#define NETXEN_CRB_NORMAL(reg)\
-   (reg) - NETXEN_CRB_PCIX_HOST2 + NETXEN_CRB_PCIX_HOST
+#define NETXEN_CRB_NORMAL(reg) \
+   ((reg) - NETXEN_CRB_PCIX_HOST2 + NETXEN_CRB_PCIX_HOST)

 #define NETXEN_CRB_NORMALIZE(adapter, reg) \
pci_base_offset(adapter, NETXEN_CRB_NORMAL(reg))
@@ -165,7 +165,7 @@ #define RCV_DESC_TYPE(ID) \

 #define MAX_CMD_DESCRIPTORS  1024
 #define MAX_RCV_DESCRIPTORS  32768
-#define MAX_JUMBO_RCV_DESCRIPTORS  1024
+#define MAX_JUMBO_RCV_DESCRIPTORS  4096
 #define MAX_RCVSTATUS_DESCRIPTORS  MAX_RCV_DESCRIPTORS
 #define MAX_JUMBO_RCV_DESC MAX_JUMBO_RCV_DESCRIPTORS
 #define MAX_RCV_DESC   MAX_RCV_DESCRIPTORS
@@ -593,6 +593,16 @@ struct netxen_skb_frag {
u32 length;
 };

+/* Bounce buffer index */
+struct bounce_index {
+   /* Index of a buffer */
+   unsigned buffer_index;
+   /* Offset inside the buffer */
+   unsigned buffer_offset;
+};
+
+#define IS_BOUNCE 0xcafebb
+
 /*Following defines are for the state of the buffers*/
 #define NETXEN_BUFFER_FREE  0
 #define NETXEN_BUFFER_BUSY  1
@@ -612,6 +622,8 @@ struct netxen_cmd_buffer {
unsigned long time_stamp;
u32 state;
u32 no_of_descriptors;
+   u32 tx_bounce_buff;
+   struct bounce_index bnext;
 };

 /* In rx_buffer, we do not need multiple fragments as is a single buffer */
@@ -620,6 +632,9 @@ struct netxen_rx_buffer {
u64 dma;
u16 ref_handle;
u16 state;
+   u32 rx_bounce_buff;
+   struct bounce_index bnext;
+   char *bounce_ptr;
 };

 /* Board types */
@@ -704,6 +719,7 @@ struct netxen_recv_context {
 };

 #define NETXEN_NIC_MSI_ENABLED 0x02
+#define NETXEN_DMA_MASK  0xfffe

 struct netxen_drvops;

@@ -938,9 +954,7 @@ static inline void netxen_nic_disable_in
/*
 * ISR_INT_MASK: Can be read from window 0 or 1.
 */
-   writel(0x7ff,
-  (void __iomem
-   *)(PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK)));
+   writel(0x7ff, PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK));

 }

@@ -960,14 +974,12 @@ static inline void netxen_nic_enable_int
break;
}

-   writel(mask,
-  (void __iomem
-   *)(PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK)));
+   writel(mask, PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK));

if (!(adapter->flags & NETXEN_NIC_MSI_ENABLED)) {
mask = 0xbff;
-   writel(mask, (void __iomem *)
-  (PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_TARGET_MASK)));
+   writel(mask, PCI_OFFSET_SECOND_RANGE(adapter,
+ISR_INT_TARGET_MASK));
}
 }

@@ -1041,6 +1053,9 @@ static inline void get_brd_name_by_type(

 int netxen_is_flash_supported(struct netxen_adapter *adapter);
 int netxen_get_flash_mac_addr(struct netxen_adapter *adapter, u64 mac[]);
+int netxen_get_next_bounce_buffer(struct bounce_index *head,
+ struct bounce_index *tail,
+ struct bounce_index *biret, unsigned len);

 extern void netxen_change_ringparam(struct netxen_adapter *adapter);
 extern int netxen_rom_fast_read(struct netxen_adapter 
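The NETXEN_CRB_NORMAL hunk in the patch above wraps the whole macro body in parentheses. Without them, the expansion can bind to surrounding operators. A small sketch with made-up offsets (`EX_PCIX_HOST`, `EX_PCIX_HOST2` and the `EX_` macros are illustrative; the real NETXEN_CRB_PCIX_HOST* values are defined elsewhere in the driver and are not shown here):

```c
/* Hypothetical offsets, for illustration only. */
#define EX_PCIX_HOST2 0x4000u
#define EX_PCIX_HOST  0x1000u

/* Before the patch: the expansion is not parenthesized. */
#define EX_CRB_NORMAL_OLD(reg) (reg) - EX_PCIX_HOST2 + EX_PCIX_HOST
/* After the patch: the whole expansion is parenthesized. */
#define EX_CRB_NORMAL_NEW(reg) ((reg) - EX_PCIX_HOST2 + EX_PCIX_HOST)

/* 2 * OLD(reg) expands to 2 * (reg) - HOST2 + HOST, which is not
 * twice the normalized address. */
static unsigned int scaled_old(unsigned int reg)
{
	return 2 * EX_CRB_NORMAL_OLD(reg);
}

static unsigned int scaled_new(unsigned int reg)
{
	return 2 * EX_CRB_NORMAL_NEW(reg);
}
```

Passed directly as a function argument, as in NETXEN_CRB_NORMALIZE, both forms yield the same value; the difference only shows up when the macro is combined with a higher-precedence operator such as `*`.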

[PATCH 3/3] mlsxfrm: Various fixes

2006-11-07 Thread Venkat Yekkirala
Fix the selection of an SA for an outgoing packet to be at the same
context as the originating socket/flow. This eliminates the SELinux
policy's ability to use/sendto SAs with contexts other than the socket's.

With this patch applied, the SELinux policy will require one or more of the
following for a socket to be able to communicate with/without SAs:

1. To enable a socket to communicate without using labeled-IPSec SAs:

allow socket_t unlabeled_t:association { sendto recvfrom };

2. To enable a socket to communicate with labeled-IPSec SAs:

allow socket_t self:association { sendto };
allow socket_t peer_sa_t:association { recvfrom };

Signed-off-by: Venkat Yekkirala [EMAIL PROTECTED]
---
 include/linux/security.h|   19 -
 net/xfrm/xfrm_policy.c  |3 
 security/dummy.c|7 -
 security/selinux/hooks.c|   26 --
 security/selinux/include/security.h |2 
 security/selinux/include/xfrm.h |7 -
 security/selinux/ss/services.c  |   44 +++
 security/selinux/xfrm.c |   97 --
 8 files changed, 112 insertions(+), 93 deletions(-)

--- net-2.6.xfrm2/include/linux/security.h  2006-10-25 12:26:20.0 -0500
+++ net-2.6/include/linux/security.h  2006-11-01 11:22:17.0 -0600
@@ -886,11 +886,6 @@ struct request_sock;
  * @xp contains the policy to check for a match.
  * @fl contains the flow to check for a match.
  * Return 1 if there is a match.
- * @xfrm_flow_state_match:
- * @fl contains the flow key to match.
- * @xfrm points to the xfrm_state to match.
- * @xp points to the xfrm_policy to match.
- * Return 1 if there is a match.
  * @xfrm_decode_session:
  * @skb points to skb to decode.
  * @secid points to the flow key secid to set.
@@ -1388,8 +1383,6 @@ struct security_operations {
int (*xfrm_policy_lookup)(struct xfrm_policy *xp, u32 fl_secid, u8 dir);
int (*xfrm_state_pol_flow_match)(struct xfrm_state *x,
struct xfrm_policy *xp, struct flowi *fl);
-   int (*xfrm_flow_state_match)(struct flowi *fl, struct xfrm_state *xfrm,
-   struct xfrm_policy *xp);
int (*xfrm_decode_session)(struct sk_buff *skb, u32 *secid, int ckall);
 #endif /* CONFIG_SECURITY_NETWORK_XFRM */
 
@@ -3186,12 +3179,6 @@ static inline int security_xfrm_state_po
return security_ops->xfrm_state_pol_flow_match(x, xp, fl);
 }
 
-static inline int security_xfrm_flow_state_match(struct flowi *fl,
-   struct xfrm_state *xfrm, struct xfrm_policy *xp)
-{
-   return security_ops->xfrm_flow_state_match(fl, xfrm, xp);
-}
-
 static inline int security_xfrm_decode_session(struct sk_buff *skb, u32 *secid)
 {
return security_ops->xfrm_decode_session(skb, secid, 1);
@@ -3255,12 +3242,6 @@ static inline int security_xfrm_state_po
return 1;
 }
 
-static inline int security_xfrm_flow_state_match(struct flowi *fl,
-   struct xfrm_state *xfrm, struct xfrm_policy *xp)
-{
-   return 1;
-}
-
 static inline int security_xfrm_decode_session(struct sk_buff *skb, u32 *secid)
 {
return 0;
--- net-2.6.xfrm2/net/xfrm/xfrm_policy.c  2006-11-01 11:25:39.0 -0600
+++ net-2.6/net/xfrm/xfrm_policy.c  2006-11-01 12:10:23.0 -0600
@@ -1894,7 +1894,8 @@ int xfrm_bundle_ok(struct xfrm_policy *p
 
if (fl && !xfrm_selector_match(&dst->xfrm->sel, fl, family))
return 0;
-   if (fl && !security_xfrm_flow_state_match(fl, dst->xfrm, pol))
+   if (fl && pol &&
+   !security_xfrm_state_pol_flow_match(dst->xfrm, pol, fl))
return 0;
if (dst->xfrm->km.state != XFRM_STATE_VALID)
return 0;
--- net-2.6.xfrm2/security/dummy.c  2006-10-25 12:23:47.0 -0500
+++ net-2.6/security/dummy.c  2006-11-01 11:22:34.0 -0600
@@ -886,12 +886,6 @@ static int dummy_xfrm_state_pol_flow_mat
return 1;
 }
 
-static int dummy_xfrm_flow_state_match(struct flowi *fl, struct xfrm_state *xfrm,
-   struct xfrm_policy *xp)
-{
-   return 1;
-}
-
 static int dummy_xfrm_decode_session(struct sk_buff *skb, u32 *fl, int ckall)
 {
return 0;
@@ -1126,7 +1120,6 @@ void security_fixup_ops (struct security
set_to_dummy_if_null(ops, xfrm_state_delete_security);
set_to_dummy_if_null(ops, xfrm_policy_lookup);
set_to_dummy_if_null(ops, xfrm_state_pol_flow_match);
-   set_to_dummy_if_null(ops, xfrm_flow_state_match);
set_to_dummy_if_null(ops, xfrm_decode_session);
 #endif /* CONFIG_SECURITY_NETWORK_XFRM */
 #ifdef CONFIG_KEYS
--- net-2.6.xfrm2/security/selinux/include/xfrm.h   2006-11-07 09:49:24.0 -0600
+++ net-2.6/security/selinux/include/xfrm.h 2006-11-07 10:03:20.0 -0600
@@ -19,9 +19,6 @@ int selinux_xfrm_state_delete(struct xfr
 int 

Re: [PATCH 4/4] skge: version 1.9

2006-11-07 Thread Stephen Hemminger
On Tue, 07 Nov 2006 08:25:07 -0500
Michael Stone [EMAIL PROTECTED] wrote:

 The skge 1.9 patch is looking good on older syskonnect fiber cards.  
 Stability issues seem to be taken care of and performance is good. There 
 are some strange interactions with bonding, however. If I try to put 
 both interfaces of an sk-9844 into a bonded interface, I only see 
 traffic from one of them. If I try to config the bonded interface down, 
 the system hangs. If I tcpdump either of the individual interfaces 
 (before bonding them) I see all the expected traffic.
 
 Mike Stone

Which form of bonding link checking are you using? It could be that
bonding MII checking is confused.

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()

2006-11-07 Thread Stephen Hemminger
On Tue, 7 Nov 2006 07:49:43 +0100
Jarek Poplawski [EMAIL PROTECTED] wrote:

 On Mon, Nov 06, 2006 at 09:44:49AM -0800, Stephen Hemminger wrote:
  On Mon, 6 Nov 2006 12:33:53 +0100
  Jarek Poplawski [EMAIL PROTECTED] wrote:
  
   After hlist_del() next and pprev pointers are not NULL
   so hlist_unhashed() doesn't work properly.
   
   
   Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
   ---
   
   
   diff -Nurp linux-2.6.19-rc4-git10-/net/sched/sch_htb.c linux-2.6.19-rc4-git10/net/sched/sch_htb.c
   --- linux-2.6.19-rc4-git10-/net/sched/sch_htb.c   2006-11-06 11:42:41.0 +0100
   +++ linux-2.6.19-rc4-git10/net/sched/sch_htb.c    2006-11-06 11:53:15.0 +0100
   @@ -1284,8 +1284,10 @@ static void htb_destroy_class(struct Qdi
   struct htb_class, sibling));

 /* note: this delete may happen twice (see htb_delete) */
   - if (!hlist_unhashed(&cl->hlist))
   + if (!hlist_unhashed(&cl->hlist)) {
 hlist_del(&cl->hlist);
   + INIT_HLIST_NODE(&cl->hlist);
   + }
  
  why not use hlist_del_init?

Your patch duplicated the code in hlist_del_init().  Why not do:

--- a/net/sched/sch_htb.c   2006-11-07 09:48:22.0 -0800
+++ b/net/sched/sch_htb.c   2006-11-07 09:49:01.0 -0800
@@ -1284,8 +1284,7 @@
  struct htb_class, sibling));
 
/* note: this delete may happen twice (see htb_delete) */
-   if (!hlist_unhashed(&cl->hlist))
-   hlist_del(&cl->hlist);
+   hlist_del_init(&cl->hlist);
list_del(cl-sibling);
 
if (cl-prio_activity)
@@ -1333,8 +1332,7 @@
sch_tree_lock(sch);
 
/* delete from hash and active; remainder in destroy_class */
-   if (!hlist_unhashed(&cl->hlist))
-   hlist_del(&cl->hlist);
+   hlist_del_init(&cl->hlist);
 
if (cl-prio_activity)
htb_deactivate(q, cl);
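The problem described in the original report can be seen in a userspace model of the hlist helpers: hlist_del() poisons the node's pointers rather than clearing them, so hlist_unhashed(), which only tests `pprev` for NULL, still reports the node as hashed. The model below mirrors the 2.6-era helpers under that assumption; it is not kernel code, and the `hnode_*` names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of the kernel's struct hlist_node (not kernel code). */
struct hnode {
	struct hnode *next;
	struct hnode **pprev;
};

/* Non-NULL stand-ins for LIST_POISON1 and LIST_POISON2. */
static struct hnode poison1;
static struct hnode *poison2;

static void hnode_init(struct hnode *n)		/* INIT_HLIST_NODE() */
{
	n->next = NULL;
	n->pprev = NULL;
}

static bool hnode_unhashed(const struct hnode *n)	/* hlist_unhashed() */
{
	return !n->pprev;
}

static void hnode_add_head(struct hnode *n, struct hnode **head)
{
	n->next = *head;
	if (*head)
		(*head)->pprev = &n->next;
	*head = n;
	n->pprev = head;
}

static void hnode_unlink(struct hnode *n)	/* __hlist_del() */
{
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
}

/* hlist_del(): unlink, then poison. pprev stays non-NULL, so
 * hnode_unhashed() still reports the node as hashed. */
static void hnode_del(struct hnode *n)
{
	hnode_unlink(n);
	n->next = &poison1;
	n->pprev = &poison2;
}

/* hlist_del_init(): unlink only if hashed, then reinitialize, so a
 * second call is a harmless no-op. */
static void hnode_del_init(struct hnode *n)
{
	if (!hnode_unhashed(n)) {
		hnode_unlink(n);
		hnode_init(n);
	}
}
```

Because hnode_del_init() already performs the unhashed check itself, callers do not need their own `if (!hlist_unhashed(...))` guard, which is the duplication the reply points out.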


Re: [PATCH] Rewrite e100_phys_id

2006-11-07 Thread Auke Kok

Matthew Wilcox wrote:

The motivator for this was to fix the sparse warning:

drivers/net/e100.c:2418:48: warning: cast truncates bits from constant
value (83126e978d4fdf becomes 978d4fdf)
drivers/net/e100.c:2419:37: warning: cast truncates bits from constant
value (83126e978d4fdf becomes 978d4fdf)

Initially, I tried a quick fix, but when it ran into difficulties, I
looked at tg3.c to see how it does it.  I liked their way better, so I
rewrote e100.c to be similar.  It shaves ~700 bytes off the size of the
driver, and a few bytes off the size of struct nic, so I think it's a
win all round.  Tested on the internal interface of an HP Integrity rx2600.


Bad news: it's completely hosed. The adapter does some indistinguishable blinking for a 
second, then stops blinking altogether.


I might revert the code to the old version. I guess I should have tested it 
right away.


I'm not even going to touch the e1000 patch for now ;)

Auke



Signed-off-by: Matthew Wilcox [EMAIL PROTECTED]

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index a3a08a5..aade1e9 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -556,7 +556,6 @@ struct nic {
struct params params;
struct net_device_stats net_stats;
struct timer_list watchdog;
-   struct timer_list blink_timer;
struct mii_if_info mii;
struct work_struct tx_timeout_task;
enum loopback loopback;
@@ -581,7 +580,6 @@ struct nic {
u32 rx_over_length_errors;
 
u8 rev_id;
-   u16 leds;
u16 eeprom_wc;
u16 eeprom[256];
spinlock_t mdio_lock;
@@ -2168,23 +2166,6 @@ err_clean_rx:
return err;
 }
 
-#define MII_LED_CONTROL 0x1B
-static void e100_blink_led(unsigned long data)
-{
-   struct nic *nic = (struct nic *)data;
-   enum led_state {
-   led_on = 0x01,
-   led_off= 0x04,
-   led_on_559 = 0x05,
-   led_on_557 = 0x07,
-   };
-
-   nic->leds = (nic->leds & led_on) ? led_off :
-   (nic->mac < mac_82559_D101M) ? led_on_557 : led_on_559;
-   mdio_write(nic->netdev, nic->mii.phy_id, MII_LED_CONTROL, nic->leds);
-   mod_timer(&nic->blink_timer, jiffies + HZ / 4);
-}
-
 static int e100_get_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
 {
struct nic *nic = netdev_priv(netdev);
@@ -2411,16 +2392,32 @@ static void e100_diag_test(struct net_de
msleep_interruptible(4 * 1000);
 }
 
+#define MII_LED_CONTROL 0x1B
 static int e100_phys_id(struct net_device *netdev, u32 data)
 {
struct nic *nic = netdev_priv(netdev);
+   int i;
+
+   enum led_state {
+   led_off= 0x04,
+   led_on_559 = 0x05,
+   led_on_557 = 0x07,
+   };
+   u16 leds = led_off;
+
+   if (data == 0)
+   data = 2;
+
+   for (i = 0; i < (data * 2); i++) {
+   leds = (leds == led_off) ?
+   (nic->mac < mac_82559_D101M) ? led_on_557 : led_on_559 :
+   led_off;
+   mdio_write(nic->netdev, nic->mii.phy_id, MII_LED_CONTROL, leds);
+   if (msleep_interruptible(500))
+   break;
+   }
 
-   if(!data || data > (u32)(MAX_SCHEDULE_TIMEOUT / HZ))
-   data = (u32)(MAX_SCHEDULE_TIMEOUT / HZ);
-   mod_timer(&nic->blink_timer, jiffies);
-   msleep_interruptible(data * 1000);
-   del_timer_sync(&nic->blink_timer);
-   mdio_write(netdev, nic->mii.phy_id, MII_LED_CONTROL, 0);
+   mdio_write(netdev, nic->mii.phy_id, MII_LED_CONTROL, led_off);
 
return 0;
 }
@@ -2633,9 +2630,6 @@ #endif
init_timer(&nic->watchdog);
nic->watchdog.function = e100_watchdog;
nic->watchdog.data = (unsigned long)nic;
-   init_timer(&nic->blink_timer);
-   nic->blink_timer.function = e100_blink_led;
-   nic->blink_timer.data = (unsigned long)nic;

INIT_WORK(&nic->tx_timeout_task,
(void (*)(void *))e100_tx_timeout_task, netdev);



Re: [PATCH 0/11] convert d80211 to a proper protocol

2006-11-07 Thread Johannes Berg
I've put a new patchset up at

http://johannes.sipsolutions.net/files/d80211-cleanup/

It now contains:
001-cfg80211-fix-Makefile.patch
002-cfg80211-wext-compat.patch

as before.

003-d80211-reduce-mdev-1.patch
004-d80211-reduce-mdev-2.patch
005-d80211-cleanup-rxmgmt.patch
006-d80211-scan-sanity.patch

similar to before, but modified to apply without the previous cookie
patch.

I decided to drop the cookie patch because it unnecessarily breaks
drivers. We obviously haven't figured out what we want, so let's just go
for the lowest common denominator.

I think these are mostly cleanups and it all compiles fine after each
one. No API changes. Haven't gotten around to testing it yet.

johannes




Re: [PATCH 0/11] convert d80211 to a proper protocol

2006-11-07 Thread Ivo van Doorn
Hi,

 http://johannes.sipsolutions.net/files/d80211-cleanup/

You might want to fix the permissions on the folder again ;)

Ivo


Re: [PATCH] Rewrite e100_phys_id

2006-11-07 Thread Matthew Wilcox
On Tue, Nov 07, 2006 at 10:33:14AM -0800, Auke Kok wrote:
 Matthew Wilcox wrote:
 Tested on the internal interface of an HP Integrity rx2600.
 
 bad news, it's completely hosed. The adapter does some indistinguishable 
 blinking for a second, then stops blinking altogether.

Weird.  I tested it on the only e100 I have access to, and it worked.
I've just reviewed the patch you quoted below, and I don't see what the
problem is.  

I wonder if this is wrong:

 -mdio_write(netdev, nic->mii.phy_id, MII_LED_CONTROL, 0);
 +mdio_write(netdev, nic->mii.phy_id, MII_LED_CONTROL, led_off);

but everything else seems pretty straight-forward.
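One way to check the rewritten loop without hardware is to model the sequence of values it writes to MII_LED_CONTROL. The sketch below follows the patch (default of 2 when `data` is 0, one toggle per 500 ms step, a final write of led_off) but skips the sleeps; `phys_id_writes()` and `is_557` are illustrative names, with `is_557` standing in for the `nic->mac < mac_82559_D101M` test:

```c
#include <stddef.h>
#include <stdint.h>

/* LED control values from the patch. */
enum led_state {
	led_off    = 0x04,
	led_on_559 = 0x05,
	led_on_557 = 0x07,
};

/* Record, in order, the values the rewritten e100_phys_id() would write
 * to MII_LED_CONTROL, without sleeping or touching hardware.
 * Returns the number of values stored in out[]. */
static size_t phys_id_writes(uint32_t data, int is_557,
			     uint16_t *out, size_t cap)
{
	uint16_t leds = led_off;
	uint16_t on = is_557 ? led_on_557 : led_on_559;
	size_t n = 0;
	uint32_t i;

	if (data == 0)		/* no duration given: default to 2 seconds */
		data = 2;

	for (i = 0; i < data * 2 && n < cap; i++) {	/* one toggle per 500 ms */
		leds = (leds == led_off) ? on : led_off;
		out[n++] = leds;
	}
	if (n < cap)
		out[n++] = led_off;	/* final write; the old code wrote 0 here */
	return n;
}
```

Going by the quoted diff, the final value is not the only difference: the old code treated `data == 0` as "blink for (u32)(MAX_SCHEDULE_TIMEOUT / HZ) seconds", effectively until interrupted, while the new code blinks for about 2 seconds.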

 I might revert the code to the old situation. I guess I should have tested 
 it initially right away.
 
 I'm not even going to touch the e1000 patch for now ;)
 
 Auke
 
 


Re: [PATCH 4/4] skge: version 1.9

2006-11-07 Thread Michael Stone

On Tue, Nov 07, 2006 at 09:51:04AM -0800, Stephen Hemminger wrote:

Which form of bonding link checking are you using. It could be that
bonding MII checking is confused.


I'm not specifying anything, just ifenslave bond0 eth2 eth3

Mike Stone


Re: [PATCH 1/3] mlsxfrm: Various fixes

2006-11-07 Thread Eric Paris
On Tue, 2006-11-07 at 11:17 -0600, Venkat Yekkirala wrote:
  int selinux_xfrm_policy_alloc(struct xfrm_policy *xp,
 - struct xfrm_user_sec_ctx *uctx, struct sock *sk)
 + struct xfrm_user_sec_ctx *uctx)
  {
   int err;
 - u32 sid;
  
 - BUG_ON(!xp);
 - BUG_ON(uctx && sk);
 -
 - if (sk) {
 - struct sk_security_struct *ssec = sk->sk_security;
 - sid = ssec->sid;
 - }
 - else
 - sid = SECSID_NULL;
 + BUG_ON(!xp || !uctx);
  
 - err = selinux_xfrm_sec_ctx_alloc(&xp->security, uctx, NULL, sid);
 + err = selinux_xfrm_sec_ctx_alloc(&xp->security, uctx, 0);
   return err;
  }

BUG_ON() with an || makes this a slight bit trickier to debug if
something goes wrong.  I'd have to dig around a little in the assembly
and look at the registers in the back trace to know which of the 2 was
the problem.  I personally would rather have separate

BUG_ON(!xp);
BUG_ON(!uctx);
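The point can be illustrated in userspace. The kernel's BUG() report names the file and line ("kernel BUG at file:line!"), not the expression, so two checks on separate lines tell you which pointer was NULL while one combined check cannot. `MODEL_BUG_ON` and the helpers are illustrative, and unlike BUG_ON() they do not halt:

```c
#include <stddef.h>

/* Userspace stand-in for BUG_ON(): records the line of the first check
 * that fired instead of stopping execution. 0 means nothing fired. */
static int bug_line;

#define MODEL_BUG_ON(cond) \
	do { if ((cond) && bug_line == 0) bug_line = __LINE__; } while (0)

/* One combined check, as in the patch. */
static int check_combined(const void *xp, const void *uctx)
{
	bug_line = 0;
	MODEL_BUG_ON(!xp || !uctx);
	return bug_line;
}

/* Separate checks, as suggested in review. */
static int check_split(const void *xp, const void *uctx)
{
	bug_line = 0;
	MODEL_BUG_ON(!xp);
	MODEL_BUG_ON(!uctx);
	return bug_line;
}
```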

Probably not worth resubmitting over, but worth changing if you have to make
another set of these.

-Eric



Re: [PATCH 4/4] skge: version 1.9

2006-11-07 Thread Stephen Hemminger
On Tue, 07 Nov 2006 13:58:31 -0500
Michael Stone [EMAIL PROTECTED] wrote:

 On Tue, Nov 07, 2006 at 09:51:04AM -0800, Stephen Hemminger wrote:
 Which form of bonding link checking are you using. It could be that
 bonding MII checking is confused.
 
 I'm not specifying anything, just ifenslave bond0 eth2 eth3
 
 Mike Stone

Do both ports report carrier present?
ethtool eth2
ethtool eth3



-- 
Stephen Hemminger [EMAIL PROTECTED]


Fw: 2.6.19-rc1: Volanomark slowdown

2006-11-07 Thread Stephen Hemminger


Begin forwarded message:

Date: Tue, 07 Nov 2006 10:32:34 -0800
From: Tim Chen [EMAIL PROTECTED]
Newsgroups: linux.dev.kernel
Subject: 2.6.19-rc1: Volanomark slowdown



The patch

[TCP]: Send ACKs each 2nd received segment
commit: 1ef9696c909060ccdae3ade245ca88692b49285b
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1ef9696c909060ccdae3ade245ca88692b49285b

reduced Volanomark benchmark throughput by 10%.
This is because Volanomark sends short messages
(100 bytes) on its TCP connections, and this patch
increases ACK traffic by a factor of 3.5.

By adopting this patch, we assume that with
small segments, low delay is important enough
that we are willing to spend bandwidth on more ACKs.

Is there any real application out there
for which this new behavior could be a concern?

Tim 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH 4/4] skge: version 1.9

2006-11-07 Thread Michael Stone

On Tue, Nov 07, 2006 at 11:18:07AM -0800, Stephen Hemminger wrote:

Do both ports report carrier present?
ethtool eth2
ethtool eth3


Link detected? yes

Mike Stone


[take23 2/5] kevent: Core files.

2006-11-07 Thread Evgeniy Polyakov

Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some bits of documentation can be found on the project's homepage (and in links
from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..fa8075b 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,7 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+   .long sys_kevent_get_events
+   .long sys_kevent_ctl    /* 320 */
+   .long sys_kevent_wait
+   .long sys_kevent_ring_init
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..95fb252 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,12 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
-   .quad sys_tee
+   .quad sys_tee   /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+   .quad sys_kevent_get_events
+   .quad sys_kevent_ctl    /* 320 */
+   .quad sys_kevent_wait
+   .quad sys_kevent_ring_init
 ia32_syscall_end:  
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..2161ef2 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,14 @@ #define __NR_tee  315
 #define __NR_vmsplice  316
 #define __NR_move_pages    317
 #define __NR_getcpu        318
+#define __NR_kevent_get_events 319
+#define __NR_kevent_ctl    320
+#define __NR_kevent_wait   321
+#define __NR_kevent_ring_init  322
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 323
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..3669c0f 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,18 @@ #define __NR_vmsplice 278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages    279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl    281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait   282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
+#define __NR_kevent_ring_init  283
+__SYSCALL(__NR_kevent_ring_init, sys_kevent_ring_init)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ring_init
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 000..781ffa8
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,201 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+   kevent_callback_t   callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY   0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER    0x4
+
+struct kevent
+{
+   /* Used for kevent freeing.*/
+   struct rcu_head rcu_head;
+   struct ukevent  event;
+   /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+   spinlock_t  ulock;
+
+   /* Entry 

[take23 1/5] kevent: Description.

2006-11-07 Thread Evgeniy Polyakov

Description.

int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);

fd - the file descriptor referring to the kevent queue to manipulate.
It is obtained by opening the /dev/kevent character device, which is
registered as a misc device (misc major number, dynamically assigned
minor number).

cmd - the requested operation. It can be one of the following:
KEVENT_CTL_ADD - add event notification 
KEVENT_CTL_REMOVE - remove event notification 
KEVENT_CTL_MODIFY - modify existing notification 

num - number of struct ukevent in the array pointed to by arg 
arg - array of struct ukevent

When called, kevent_ctl carries out the operation specified in the cmd parameter.
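As a rough illustration of what the arg/num pair carries, the following C sketch fills an array of request descriptors for a KEVENT_CTL_ADD call. The structs below are hypothetical, trimmed-down stand-ins containing only the members described later in this document (id, type, event, req_flags); the real struct ukevent in the patchset's <linux/ukevent.h> has more fields, and the numeric values of constants such as KEVENT_REQ_ONESHOT are not given here, so they are passed in as parameters. The sketch does not invoke the syscall itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-ins for the patchset's structures;
 * the real struct ukevent has more members. */
struct kevent_id_sketch {
	uint32_t raw[2];		/* e.g. raw[0] = file descriptor */
};

struct ukevent_sketch {
	struct kevent_id_sketch id;	/* what to watch */
	uint32_t type;			/* subsystem, e.g. socket or timer */
	uint32_t event;			/* which event of that subsystem */
	uint32_t req_flags;		/* e.g. a one-shot request flag */
};

/* Fill out[0..nfds-1] with one request per descriptor and return the
 * count, i.e. the values that would be passed as arg and num. */
static unsigned int build_add_requests(struct ukevent_sketch *out,
				       const int *fds, unsigned int nfds,
				       uint32_t type, uint32_t event,
				       uint32_t req_flags)
{
	unsigned int i;

	memset(out, 0, nfds * sizeof(*out));	/* unused fields start zeroed */
	for (i = 0; i < nfds; i++) {
		assert(fds[i] >= 0);		/* must be a valid descriptor */
		out[i].id.raw[0] = (uint32_t)fds[i];
		out[i].type = type;
		out[i].event = event;
		out[i].req_flags = req_flags;
	}
	return nfds;
}
```

With the real struct ukevent, the filled array and the returned count would be passed to kevent_ctl() as arg and num.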
-

 int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
		       __u64 timeout, struct ukevent *buf, unsigned flags)

ctl_fd - file descriptor referring to the kevent queue
min_nr - minimum number of completed events that kevent_get_events will block waiting for
max_nr - number of struct ukevent in buf
timeout - number of nanoseconds to wait before returning fewer than min_nr events.
If this is -1, wait forever.
buf - pointer to an array of struct ukevent
flags - unused

kevent_get_events will wait up to timeout nanoseconds for at least min_nr
completed events, copying completed struct ukevents to buf and deleting any
KEVENT_REQ_ONESHOT event requests.
In nonblocking mode it returns as many events as possible, but not more than
max_nr.
In blocking mode it waits until the timeout expires or at least min_nr events
are ready.
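The return rules above can be summarised as a small decision function. This is a userspace model written for illustration, not code from the patch; treating nonblocking mode as a timeout that has already expired is an assumption of the model.

```c
#include <assert.h>

/* Userspace model (not kernel code) of the kevent_get_events() rules
 * described above.
 *   ready     - events currently ready in the queue
 *   min_nr    - minimum number the caller asked to wait for
 *   max_nr    - capacity of buf
 *   timed_out - non-zero once the timeout has elapsed; nonblocking mode
 *               is modeled as an already-expired timeout (assumption)
 * Returns the number of events copied to buf, or -1 while a blocking
 * caller would still be sleeping. */
static int get_events_result(unsigned int ready, unsigned int min_nr,
			     unsigned int max_nr, int timed_out)
{
	assert(max_nr > 0);

	if (ready < min_nr && !timed_out)
		return -1;			/* keep waiting for min_nr */

	/* Never copy more than max_nr, even if more are ready. */
	return (int)(ready < max_nr ? ready : max_nr);
}
```

For example, with min_nr = 4 and max_nr = 16, one ready event keeps a blocking caller asleep, while 40 ready events return only 16.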
-

 int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)

ctl_fd - file descriptor referring to the kevent queue
num - number of processed kevents
timeout - number of nanoseconds to wait until there is free space in the
kevent queue

This syscall waits until either the timeout expires or at least one event
becomes ready.
It also copies num events into the special ring buffer and requeues them (or
removes them, depending on their flags).
-

 int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num)

ctl_fd - file descriptor referring to the kevent queue 
num - size of the ring buffer in events 

 struct kevent_ring
 {
	unsigned int ring_kidx;
	struct ukevent event[0];
 };

ring_kidx - the index in the ring buffer at which the kernel will put new
  events when kevent_wait() or kevent_get_events() is called
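The role of ring_kidx can be illustrated with a simplified userspace ring. This is a sketch, not the patchset's layout: plain ints stand in for struct ukevent, ring_put() plays the kernel's part, and the reader keeps its own index and consumes everything between it and ring_kidx. Overrun detection (a writer lapping the reader makes the two indices equal again) and the cross-thread locking discussed below are deliberately left out; all names are hypothetical.

```c
#include <assert.h>

#define RING_SIZE 4

/* Hypothetical userspace view of a kevent-style ring. */
struct ring_sketch {
	unsigned int ring_kidx;		/* next slot the "kernel" will fill */
	int event[RING_SIZE];		/* stand-in for struct ukevent */
};

/* Stand-in for the kernel side placing one ready event. */
static void ring_put(struct ring_sketch *r, int ev)
{
	r->event[r->ring_kidx] = ev;
	r->ring_kidx = (r->ring_kidx + 1) % RING_SIZE;
}

/* Copy events from *uidx up to ring_kidx into out; returns how many.
 * *uidx is the reader's private position and wraps like ring_kidx. */
static unsigned int ring_drain(const struct ring_sketch *r,
			       unsigned int *uidx, int *out)
{
	unsigned int n = 0;

	assert(*uidx < RING_SIZE);
	while (*uidx != r->ring_kidx) {
		out[n++] = r->event[*uidx];
		*uidx = (*uidx + 1) % RING_SIZE;
	}
	return n;
}
```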

Example userspace code (ring_buffer.c) can be found on the project's homepage.

Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
a thread has been cancelled in a kevent syscall, the thread can be safely
removed and no events will be lost, since each syscall (kevent_wait() or
kevent_get_events()) copies events into the special ring buffer, which is
accessible from other threads or even processes (if shared memory is used).

When a kevent is removed (not dequeued when it becomes ready, but simply
removed), it is not copied into the ring buffer even if it was ready: if it
is removed, no one cares about it (otherwise the user would wait until it
becomes ready and get it the usual way, through kevent_get_events() or
kevent_wait()), so there is no need to copy it to the ring buffer.

With a userspace ring buffer, events in the ring buffer can be replaced
without the knowledge of the thread currently reading them (when another
thread calls kevent_get_events() or kevent_wait()), so appropriate locking
is required between threads or processes which can simultaneously access
the same ring buffer.
-

The bulk of the interface is entirely done through the ukevent struct. 
It is used to add event requests, modify existing event requests, 
specify which event requests to remove, and return completed events.

struct ukevent contains the following members:

struct kevent_id id
Id of this request, e.g. socket number, file descriptor and so on 
__u32 type
Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on 
__u32 event
Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED 
__u32 req_flags
Per-event request flags,

KEVENT_REQ_ONESHOT
event will be removed when it is ready 

KEVENT_REQ_WAKEUP_ONE
When several threads wait on the same kevent queue and requested the 
same event, 
for example 'wake me up when new client has connected, so I could call 
accept()', 
then all threads will be awakened when new client has connected, but 
only one of 
them can process the data. This problem is known as thundering nerd 
problem. 
Events which have this flag set will not be marked as ready (and 
appropriate 

[take23 3/5] kevent: poll/select() notifications.

2006-11-07 Thread Evgeniy Polyakov

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup; there are a lot of allocations, and so on).
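The "callback through process wakeup" mechanism can be modeled in userspace as a wait queue whose entries carry a wake-up function, which is the role init_waitqueue_func_entry() plays in the patch below. This is a simplified sketch with hypothetical names and no locking; for brevity the readiness mask is passed to the callback, whereas the patch's kevent_poll_wait_callback() re-queries f_op->poll() before calling kevent_storage_ready().

```c
#include <assert.h>
#include <stddef.h>

struct waiter;
typedef int (*wake_fn)(struct waiter *w, unsigned int revents);

/* One wait-queue entry carrying its own wake-up function. */
struct waiter {
	struct waiter *next;
	wake_fn func;			/* called on every wake-up */
	unsigned int last_revents;	/* what the callback recorded */
};

struct wait_head {
	struct waiter *first;
};

static void add_waiter(struct wait_head *h, struct waiter *w, wake_fn fn)
{
	assert(fn != NULL);
	w->func = fn;
	w->last_revents = 0;
	w->next = h->first;
	h->first = w;
}

/* Wake everyone: each entry's callback runs in the waker's context,
 * which is where a kevent-like layer would mark its event ready. */
static int wake_all(struct wait_head *h, unsigned int revents)
{
	int woken = 0;
	struct waiter *w;

	for (w = h->first; w != NULL; w = w->next) {
		w->func(w, revents);
		woken++;
	}
	return woken;
}

/* Example callback: just record the readiness mask. */
static int record_revents(struct waiter *w, unsigned int revents)
{
	w->last_revents = revents;
	return 0;
}
```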

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..f81299f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
 struct mutex    inotify_mutex;  /* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+   struct kevent_storage   st;
+#endif
+
unsigned long   i_state;
unsigned long   dirtied_when;   /* jiffies of first dirtying */
 
@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
 struct list_head   f_ep_links;
spinlock_t  f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+   struct kevent_storage   st;
+#endif
 struct address_space   *f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 000..94facbb
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+   struct poll_table_struct   pt;
+   struct kevent   *k;
+};
+
+struct kevent_poll_wait_container
+{
+   struct list_head    container_entry;
+   wait_queue_head_t   *whead;
+   wait_queue_t    wait;
+   struct kevent   *k;
+};
+
+struct kevent_poll_private
+{
+   struct list_head    container_list;
+   spinlock_t  container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+   unsigned mode, int sync, void *key)
+{
+   struct kevent_poll_wait_container *cont =
+   container_of(wait, struct kevent_poll_wait_container, wait);
+   struct kevent *k = cont->k;
+   struct file *file = k->st->origin;
+   u32 revents;
+
+   revents = file->f_op->poll(file, NULL);
+
+   kevent_storage_ready(k->st, NULL, revents);
+
+   return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+   struct poll_table_struct *poll_table)
+{
+   struct kevent *k =
+   container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+   struct kevent_poll_private *priv = k->priv;
+   struct kevent_poll_wait_container *cont;
+   unsigned long flags;
+
+   cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+   if (!cont) {
+   kevent_break(k);
+   return;
+   }
+
+   cont->k = k;
+   init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+   cont->whead = whead;
+
+   spin_lock_irqsave(&priv->container_lock, flags);
+   list_add_tail(&cont->container_entry, &priv->container_list);
+   spin_unlock_irqrestore(&priv->container_lock, flags);
+
+   add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+   struct file *file;
+   int err, ready = 0;
+   unsigned int revents;
+   struct kevent_poll_ctl ctl;
+   struct kevent_poll_private *priv;
+
+   file = fget(k->event.id.raw[0]);
+   if (!file)
+   return -EBADF;
+
+   err = -EINVAL;
+   if (!file->f_op || !file->f_op->poll)
+   goto err_out_fput;
+
+   err = -ENOMEM;
+   priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+   if (!priv)
+   goto err_out_fput;
+
+ 

[take23 4/5] kevent: Socket notifications.

2006-11-07 Thread Evgeniy Polyakov

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features
instead of epoll, its performance increased considerably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ #endif
}
 inode->i_private = 0;
 inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+   kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_SOCKET
+   kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
 if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>  /* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+   struct socket socket;
+   struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+   return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+   return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 sk->sk_backlog.tail = skb;
 }
 skb->next = NULL;
+   kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)  \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 return si->kiocb;
 }
 
-struct socket_alloc {
-   struct socket socket;
-   struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-   return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-   return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 tp->ucopy.memory = 0;
 } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 wake_up_interruptible(sk->sk_sleep);
+   kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
  (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 000..7f74110
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,135 @@
+/*
+ * kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int 

[take23 5/5] kevent: Timer notifications.

2006-11-07 Thread Evgeniy Polyakov

Timer notifications.

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds
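The seconds/nanoseconds encoding and the periodic re-arming done in kevent_timer_func() can be modeled in plain C. The forwarding function below is a userspace approximation of the arithmetic hrtimer_forward() performs on a periodic timer (advance the expiry by whole intervals until it lies in the future, and report how many intervals were skipped); it uses plain 64-bit nanosecond counts instead of ktime_t, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/* Interval as encoded in the request: raw[0] seconds, raw[1] nanoseconds. */
static uint64_t interval_ns(uint32_t sec, uint32_t nsec)
{
	return (uint64_t)sec * NSEC_PER_SEC_SKETCH + nsec;
}

/* Advance *expires by whole intervals until it is strictly after now.
 * Returns the number of intervals added (0 if not yet expired). */
static uint64_t forward_expiry(uint64_t *expires, uint64_t now,
			       uint64_t interval)
{
	uint64_t n;

	assert(interval > 0);
	if (now < *expires)
		return 0;			/* still in the future */
	n = (now - *expires) / interval + 1;	/* overruns, incl. this one */
	*expires += n * interval;
	return n;
}
```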

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 000..df93049
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,112 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+   struct hrtimer  ktimer;
+   struct kevent_storage   ktimer_storage;
+   struct kevent   *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+   struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+   struct kevent *k = t->ktimer_event;
+
+   kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+   hrtimer_forward(timer, timer->base->softirq_time,
+   ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+   return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+   int err;
+   struct kevent_timer *t;
+
+   t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+   if (!t)
+   return -ENOMEM;
+
+   hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+   t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+   t->ktimer.function = kevent_timer_func;
+   t->ktimer_event = k;
+
+   err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+   if (err)
+   goto err_out_free;
+   lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+   err = kevent_storage_enqueue(&t->ktimer_storage, k);
+   if (err)
+   goto err_out_st_fini;
+
+   hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+   return 0;
+
+err_out_st_fini:
+   kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+   kfree(t);
+
+   return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+   struct kevent_storage *st = k->st;
+   struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+   hrtimer_cancel(&t->ktimer);
+   kevent_storage_dequeue(st, k);
+   kfree(t);
+
+   return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+   k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+   return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+   struct kevent_callbacks tc = {
+   .callback = kevent_timer_callback,
+   .enqueue = kevent_timer_enqueue,
+   .dequeue = kevent_timer_dequeue};
+
+   return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[take23 0/5] kevent: Generic event handling mechanism.

2006-11-07 Thread Evgeniy Polyakov

Generic event handling mechanism.

Kevent is a generic subsystem which allows handling event notifications.
It supports both level- and edge-triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and it
allows working with essentially any kind of event.

Events are provided to the kernel through a control syscall and can be read
back through an mmaped ring or a syscall.
Kevent updates (i.e. readiness switching) happen directly from the internals
of the appropriate state machine of the underlying subsystem (like
network, filesystem, timer or any other).

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projectsitem=kevent

Documentation page:
http://linux-net.osdl.org/index.php/Kevent

Consider for inclusion.

Changes from 'take22' patchset:
 * new ring buffer implementation in process memory
 * wakeup-one-thread flag
 * edge-triggered behaviour
With this release, an additional independent benchmark shows kevent's speed
compared to epoll:
Eric Dumazet created a special benchmark which creates a set of AF_INET
sockets, and two threads simultaneously read and write data from/into them.
Here are the results:
epoll (no EPOLLET): 57428 events/sec
kevent (no ET): 59794 events/sec
epoll (with EPOLLET): 71000 events/sec
kevent (with ET): 78265 events/sec
Maximum (busy loop reading events): 88482 events/sec
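The relative differences in these results can be computed directly; this small C helper (its name is illustrative) expresses one rate as a percentage gain over another.

```c
#include <assert.h>

/* Percentage by which rate b exceeds rate a. */
static double speedup_pct(double a, double b)
{
	assert(a > 0.0);
	return (b / a - 1.0) * 100.0;
}
```

For example, speedup_pct(57428, 59794) is about 4.1% (no edge triggering) and speedup_pct(71000, 78265) is about 10.2% (edge-triggered); kevent with ET reaches roughly 88% of the 88482 events/sec busy-loop maximum.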

Changes from 'take21' patchset:
 * minor cleanups (different return values, removed unneeded variables,
   whitespace and so on)
 * fixed a bug in kevent removal when the kevent being removed
   is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s with a client connection rate of 4000 con/s
over a 100 Mbit LAN; data I/O over the network was about 10582.7 KB/s, which
is very close to wire speed once headers and the like are taken into account.
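To check the wire-speed claim, the reported KB/s figure can be converted to Mbit/s. The sketch assumes 1 KB = 1024 bytes, which is an assumption about the benchmark tool's units; with 1 KB = 1000 bytes the result would be about 84.7 Mbit/s instead.

```c
#include <assert.h>

/* Convert KB/s to Mbit/s, assuming 1 KB = 1024 bytes (an assumption
 * about the benchmark tool's units). */
static double kbps_to_mbit(double kb_per_s)
{
	assert(kb_per_s >= 0.0);
	return kb_per_s * 1024.0 * 8.0 / 1e6;
}
```

kbps_to_mbit(10582.7) gives about 86.7 Mbit/s on a 100 Mbit/s link, before Ethernet, IP and TCP header overhead is added (assuming the figure counts payload bytes only).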

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table.
At least for a web server, the frequency of addition/deletion of kevents is
comparable with the number of search accesses, i.e. most of the time events
are added, accessed only a couple of times and then removed. This justifies
using an RB tree over an AVL tree, since the latter has much slower deletion
(up to O(log(N)) rotations, compared to at most 3 for an RB tree), although
faster search (worst-case height of about 1.44*log2(N) vs. 2*log2(N)).
So for kevents I use an RB tree for now, and later, when my AVL tree
implementation is ready, it will be possible to compare them.
 * Changed readiness check for socket notifications.
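The search-cost comparison in the first bullet refers to worst-case tree heights. The C sketch below computes them for a given N using the standard bounds: an AVL tree of height h needs at least N(h) = N(h-1) + N(h-2) + 1 nodes, and a red-black tree's height is at most twice its black height, which is at most floor(log2(N+1)). Heights count nodes on the longest root-to-leaf path; the function names are illustrative.

```c
#include <assert.h>

/* Largest possible height of an AVL tree holding n nodes: the biggest h
 * whose minimal AVL tree, N(h) = N(h-1) + N(h-2) + 1 with N(0) = 0 and
 * N(1) = 1, still fits in n nodes. */
static unsigned int avl_max_height(unsigned long n)
{
	unsigned long prev = 0, cur = 1;	/* N(h-1), N(h) */
	unsigned int h = 1;

	assert(n >= 1);
	for (;;) {
		unsigned long next = cur + prev + 1;	/* N(h+1) */

		if (next > n)
			return h;
		prev = cur;
		cur = next;
		h++;
	}
}

/* Upper bound on red-black tree height for n nodes: at most twice the
 * black height, and the black height is at most floor(log2(n + 1)). */
static unsigned int rb_max_height(unsigned long n)
{
	const unsigned int bits = 8 * sizeof(unsigned long);
	unsigned int bh = 0;

	assert(n >= 1);
	while (bh + 1 < bits && ((n + 1) >> (bh + 1)) != 0)
		bh++;
	return 2 * bh;
}
```

For N = 4096 (the kevent limit mentioned in this changelog), avl_max_height() gives 16 and rb_max_height() gives 24, consistent with the 1.44 vs. 2 factors quoted above.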

With both of the above changes it is possible to achieve more than 3380 req/second,
compared to 2200 (sometimes 2500) req/second for epoll(), with a trivial web server
and an httperf client on the same hardware.
It is possible that the above kevent limit is due to the maximum number of kevents
allowed at a time, which is 4096.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages) calculation
 * export kevent_socket_notify(), since it is used in network protocols which can
   be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer API
   update, see http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has been
   updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
This syscall waits until either the timeout expires or at least one event
becomes ready. It also commits that @num events from @start have been
processed by userspace and thus can be removed or rearmed (depending on
their flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not get lock around user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be 

Re: [take21 0/4] kevent: Generic event handling mechanism.

2006-11-07 Thread Andrew Morton
On Tue, 07 Nov 2006 07:32:20 -0500
Jeff Garzik [EMAIL PROTECTED] wrote:

 Evgeniy Polyakov wrote:
  Mmap ring buffer implementation was stopped by Andrew Morton and Ulrich
  Drepper, process' memory is used instead. copy_to_user() is slower (and
  sometimes noticeably), but there are major advantages of such approach.
 
 
 h.  I say there are advantages to both.

My problem with the old mmapped ringbuffer was that it permitted each user
to pin (typically) 48MB of unswappable memory.  Plus this pinned-memory
problem would put upper bounds on the ring size.

 Perhaps create a kevent_direct_limit resource limit for each thread. 
 By default, each thread could mmap $n pinned pagecache pages.  Sysadmin 
 can tune certain app resource limits to permit more.
 
 I would think that retaining the option to avoid copy_to_user() 
 -somehow- in -some- cases would be wise.

What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
his machine, with his kernel config, at least).

Which is kinda weird and unexpected and is something which we should
investigate independently from this project.  (Rather than simply going
and bypassing it!)



Re: [PATCH 4/4] skge: version 1.9

2006-11-07 Thread Jay Vosburgh
Michael Stone [EMAIL PROTECTED] wrote:

The skge 1.9 patch is looking good on older syskonnect fiber cards.
Stability issues seem to be taken care of and performance is good. There
are some strange interactions with bonding, however. If I try to put both
interfaces of an sk-9844 into a bonded interface, I only see traffic from
one of them. If I try to config the bonded interface down, the system
hangs. If I tcpdump either of the individual interfaces (before bonding
them) I see all the expected traffic.

Can you provide some bonding configuration details?  Which mode,
options, etc, as well as the relevant bits from dmesg (you can send it
to me privately if it's huge)?  I don't have any skge hardware, so I'm
not able to test this locally.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: 2.6.19-rc1: Volanomark slowdown

2006-11-07 Thread David Miller
From: Tim Chen [EMAIL PROTECTED]
Date: Tue, 07 Nov 2006 10:32:34 -0800

[ Please bring up networking questions on netdev@vger.kernel.org
  as that is the place where networking developers read bug reports
  and questions, they by-in-large do not read linux-kernel at all. ]

 [TCP]: Send ACKs each 2nd received segment
 commit: 1ef9696c909060ccdae3ade245ca88692b49285b
 http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1ef9696c909060ccdae3ade245ca88692b49285b
 
 reduced Volanomark benchmark throughput by 10%.
 This is because Volanomark sends short messages (100 bytes) on its TCP
 connections.  This patch increases ACK traffic by 3.5 times.
 
 By adopting this patch, we assume that with small segments, having a
 short delay is important enough that we are willing to reduce bandwidth
 with more ACKs.
 
 Is there any real application out there for which this new behavior
 could be a concern?

That's unfortunate, because without that patch connections can hang
which is more important to fix than your performance test.  :-)

If we don't ACK every two segments, stacks which grow the congestion
window based upon packet counting will not grow the congestion window
properly when they are sending smaller than MSS sized segments.

This topic has been discussed quite a bit, you may want to do some
searching in the netdev archives to read some of that.


Re: [PATCH 3/3] mlsxfrm: Various fixes

2006-11-07 Thread Stephen Smalley
On Tue, 2006-11-07 at 15:29 -0500, Paul Moore wrote:
 Venkat Yekkirala wrote:
  +/*
  + * security_sid_compare() - compares two given sid contexts.
  + * Returns 1 if they are equal, 0 otherwise.
  + */
  +int security_sid_compare(u32 sid1, u32 sid2)
  +{
  +   struct context *context1;
  +   struct context *context2;
  +   int rc;
  +
  +   if (!ss_initialized)
  +   return 1;
  +
  +   if (sid1 == sid2)
  +   return 1;
  +   else if (sid1 > SECINITSID_NUM && sid2 > SECINITSID_NUM)
  +   return 0;
  +
  +   /* explicit comparison in order */
  +
  +   POLICY_RDLOCK;
  +   context1 = sidtab_search(&sidtab, sid1);
  +   if (!context1) {
  +   printk(KERN_ERR "security_sid_compare:  unrecognized SID "
  +  "%u\n", sid1);
  +   rc = 0;
  +   goto out_unlock;
  +   }
  +
  +   context2 = sidtab_search(&sidtab, sid2);
  +   if (!context2) {
  +   printk(KERN_ERR "security_sid_compare:  unrecognized SID "
  +  "%u\n", sid2);
  +   rc = 0;
  +   goto out_unlock;
  +   }
  +
  +   rc = context_cmp(context1, context2);
  +
  +out_unlock:
  +   POLICY_RDUNLOCK;
  +   return rc;
  +}
 
 I understand wanting a generic LSM interface to do secid token comparisons, but
 in the SELinux implementation of this function I think we can get away with only
 a simple sid1 == sid2 since the security server shouldn't be creating
 duplicate SID/secid values for identical contexts, I think.  Did you run into
 something in testing that would indicate otherwise?

Such duplication can occur among the initial SIDs.  Not sure though when
that would apply here, and it would only apply if both SIDs were initial
SIDs.
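A toy model of the short-cut being discussed: two distinct SIDs above the initial-SID range are treated as different contexts without a lookup, while initial SIDs, which may alias the same context, are compared by content. The table, names and context strings are hypothetical stand-ins for the sidtab; this is not SELinux code.

```c
#include <assert.h>
#include <string.h>

/* SIDs 1..INITSID_NUM_SKETCH are "initial" SIDs and may alias one
 * context; later SIDs are assumed unique per context. */
#define INITSID_NUM_SKETCH 3

static const char *toy_sidtab[] = {
	NULL,				/* SID 0 unused */
	"system_u:object_r:a_t",	/* initial SID 1 */
	"system_u:object_r:a_t",	/* initial SID 2, same context as 1 */
	"system_u:object_r:b_t",	/* initial SID 3 */
	"user_u:object_r:a_t",		/* dynamic SID 4 */
	"user_u:object_r:b_t",		/* dynamic SID 5 */
};

/* Returns 1 if the SIDs name the same context, 0 otherwise. */
static int toy_sid_compare(unsigned int sid1, unsigned int sid2)
{
	const unsigned int nsids = sizeof(toy_sidtab) / sizeof(toy_sidtab[0]);

	assert(nsids > INITSID_NUM_SKETCH);
	if (sid1 == sid2)
		return 1;
	/* Two distinct non-initial SIDs can never share a context. */
	if (sid1 > INITSID_NUM_SKETCH && sid2 > INITSID_NUM_SKETCH)
		return 0;
	if (sid1 == 0 || sid2 == 0 || sid1 >= nsids || sid2 >= nsids)
		return 0;			/* unrecognized SID */
	return strcmp(toy_sidtab[sid1], toy_sidtab[sid2]) == 0;
}
```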

-- 
Stephen Smalley
National Security Agency



Re: 2.6.19-rc1: Volanomark slowdown

2006-11-07 Thread John Heffner

David Miller wrote:

If we don't ACK every two segments, stacks which grow the congestion
window based upon packet counting will not grow the congestion window
properly when they are sending smaller than MSS sized segments.


The only stack I know of that does this currently is linux, and in doing 
so does not conform to the spec. ;)  Sending to a BSD receiver will 
result in the same behavior, so the right place to fix this is on the 
sending side.  (I know the issue of packet vs. byte counting has come up 
many times over the last 10 years or so, and many arguments have been 
made on either side... I don't mean this to be flame bait but it's clear 
what will happen in this scenario.)


One way of viewing the current situation is that linux's packet counting 
plus ABC is more conservative than byte counting -- sometimes much more 
so.  Packet counting without ABC may be more or less conservative than 
byte counting, depending on segment sizes and receiver ACK strategy. 
Without ABC, linux is vulnerable to aggressive ACKing to inflate the 
cwnd.  This is a kind of ugly state of affairs.
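To make the comparison concrete, here is a deliberately simplified slow-start model in C. It is not Linux's implementation: the ABC variant assumes the sender adds one segment and then resets its byte counter once a full MSS has been acknowledged, and byte counting follows the RFC 3465 slow-start rule of adding min(bytes acked, L*SMSS) per ACK. Growth is expressed as added sending capacity in bytes so the three rules can be compared; the function names are illustrative.

```c
#include <assert.h>

/* Packet counting: +1 segment (seg_bytes of capacity) per ACK. */
static unsigned long grow_packet(unsigned int n_acks, unsigned int seg_bytes)
{
	return (unsigned long)n_acks * seg_bytes;
}

/* Packet counting with ABC as assumed here: accumulate acked bytes and
 * add one segment, then reset the counter, once a full MSS is acked. */
static unsigned long grow_packet_abc(unsigned int n_acks,
				     unsigned int acked_bytes,
				     unsigned int mss, unsigned int seg_bytes)
{
	unsigned long bytes_acked = 0, segs = 0;
	unsigned int i;

	assert(mss > 0);
	for (i = 0; i < n_acks; i++) {
		bytes_acked += acked_bytes;
		if (bytes_acked >= mss) {
			segs++;
			bytes_acked = 0;
		}
	}
	return segs * seg_bytes;
}

/* Byte counting (RFC 3465 slow start): + min(acked, L * MSS) per ACK. */
static unsigned long grow_bytes(unsigned int n_acks, unsigned int acked_bytes,
				unsigned int mss, unsigned int L)
{
	unsigned long cap = (unsigned long)L * mss;
	unsigned long inc = acked_bytes < cap ? acked_bytes : cap;

	return (unsigned long)n_acks * inc;
}
```

With 100-byte segments, a 1460-byte MSS and an ACK every second segment (40 ACKs of 200 bytes each), the model gives 4000 bytes for packet counting, 500 bytes for packet counting with ABC and 8000 bytes for byte counting. If the receiver ACKs every segment, packet counting without ABC catches up with byte counting, and more aggressive ACKing would push it further.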


Unfortunately I see no clear way to reconcile these issues short of 
switching to byte counting.  Obviously this would be a big change as 
packet counting is deeply ingrained in not only the congestion control 
but also the recovery code.


  -John


Re: 2.6.19-rc1: Volanomark slowdown

2006-11-07 Thread David Miller
From: John Heffner [EMAIL PROTECTED]
Date: Tue, 07 Nov 2006 16:50:33 -0500

 The only stack I know of that does this currently is linux, and in doing 
 so does not conform to the spec. ;)  Sending to a BSD receiver will 
 result in the same behavior, so the right place to fix this is on the 
 sending side.  (I know the issue of packet vs. byte counting has come up 
 many times over the last 10 years or so, and many arguments have been 
 made on either side... I don't mean this to be flame bait but it's clear 
 what will happen in this scenario.)

John, you cannot change the N-million existing Linux systems
out there doing congestion control via byte counting.  You
cannot do this no matter how much you wish it so :-)


Re: 2.6.19-rc1: Volanomark slowdown

2006-11-07 Thread John Heffner

David Miller wrote:

From: John Heffner [EMAIL PROTECTED]
Date: Tue, 07 Nov 2006 16:50:33 -0500

The only stack I know of that does this currently is linux, and in doing 
so does not conform to the spec. ;)  Sending to a BSD receiver will 
result in the same behavior, so the right place to fix this is on the 
sending side.  (I know the issue of packet vs. byte counting has come up 
many times over the last 10 years or so, and many arguments have been 
made on either side... I don't mean this to be flame bait but it's clear 
what will happen in this scenario.)


John, you cannot change the N-million existing Linux systems
out there doing congestion control via byte counting.  You
cannot do this no matter how much you wish it so :-)


That would make our lives easier, wouldn't it? ;)  Clearly there are 
some combinations of TCP stacks out there that won't interoperate well 
under certain workloads.  Making new versions of the stack work well is 
the best we can hope for...


Fixing the sending side does not mean we have to back out the 
work-around on the receiving side.


  -John


Re: [PATCH] Rewrite e100_phys_id

2006-11-07 Thread Auke Kok

Matthew Wilcox wrote:

On Tue, Nov 07, 2006 at 10:33:14AM -0800, Auke Kok wrote:

Matthew Wilcox wrote:

Tested on the internal interface of an HP Integrity rx2600.
bad news, it's completely hosed. The adapter does some indistinguishable 
blinking for a second, then stops blinking altogether.


Weird.  I tested it on the only e100 I have access to, and it worked.
I've just reviewed the patch you quoted below, and I don't see what the
problem is.  


I don't understand it either, and will dig into this after I get more coffee.

The point is that `ethtool -p` now exits after only 500ms; it should loop until ^C is 
pressed. Somehow msleep_interruptible() always seems to return 0 on my platform? Very strange.
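The semantics that matter here are those of msleep_interruptible(). It returns 0 when the full interval elapsed, and it returns the remaining milliseconds when a signal cut the sleep short. The sketch below models a blink-until-^C loop in user space. It is hypothetical and is not the e100 driver's code: `fake_msleep_interruptible()`, `phys_id_loop()`, and the convention that `seconds == 0` means "until interrupted" are all assumptions.

```c
#include <assert.h>

/* Hypothetical stand-in for msleep_interruptible(): returns 0 when the
 * whole interval elapsed, or the milliseconds still remaining when a
 * signal (e.g. ^C) cut the sleep short. A signal is simulated as
 * arriving during sleep number `signal_at` (1-based); 0 = never. */
struct fake_clock {
    unsigned int sleeps;
    unsigned int signal_at;
};

static unsigned int fake_msleep_interruptible(struct fake_clock *c,
                                              unsigned int msecs)
{
    c->sleeps++;
    if (c->signal_at && c->sleeps == c->signal_at)
        return msecs / 2;   /* interrupted halfway: time remains */
    return 0;               /* full interval slept */
}

/* Toggle the LED every 500 ms until a signal arrives, or for `seconds`
 * seconds if seconds > 0 (an assumed convention). Returns the number
 * of toggles performed. */
static unsigned int phys_id_loop(struct fake_clock *c, unsigned int seconds)
{
    unsigned int toggles = 0;
    unsigned int limit = seconds * 2;   /* two 500 ms steps per second */

    for (;;) {
        toggles++;                      /* LED on/off would go here */
        if (fake_msleep_interruptible(c, 500))
            break;                      /* nonzero: a signal arrived */
        if (limit && toggles >= limit)
            break;                      /* requested duration elapsed */
    }
    return toggles;
}
```

Note the polarity. Because 0 means the sleep completed, a loop that treats a 0 return as its cue to stop would quit after the first 500 ms step.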


Auke


Re: [PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()

2006-11-07 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Tue, 7 Nov 2006 09:50:07 -0800

 Your patch duplicated the code in hlist_del_init().  Why not do:

Indeed, this is the patch I will apply.

Thanks Stephen.


Re: [take23 3/5] kevent: poll/select() notifications.

2006-11-07 Thread Davide Libenzi
On Tue, 7 Nov 2006, Evgeniy Polyakov wrote:

 +static int kevent_poll_wait_callback(wait_queue_t *wait,
 + unsigned mode, int sync, void *key)
 +{
 + struct kevent_poll_wait_container *cont =
 + container_of(wait, struct kevent_poll_wait_container, wait);
 + struct kevent *k = cont->k;
 + struct file *file = k->st->origin;
 + u32 revents;
 +
 + revents = file->f_op->poll(file, NULL);
 +
 + kevent_storage_ready(k->st, NULL, revents);
 +
 + return 0;
 +}

Are you sure you can safely call file->f_op->poll() from inside a callback-based
wakeup? The low-level driver may be calling the wakeup with one of its locks
held, and file->f_op->poll() may then try to acquire the same lock. I remember
there was a discussion about this, and having to assume the above is not safe
made the epoll code more complex (and slower, since an extra O(R) loop was
needed to fetch events).
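The alternative Davide alludes to can be sketched as follows. The wakeup callback only links the item onto a ready list, and ->poll() runs later, in the event-collection pass, when no driver lock is held. This is a simplified user-space model under that assumption, with a boolean standing in for the driver's lock. It is not the epoll or kevent implementation, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ITEMS 8

struct item {
    bool on_ready_list;
    unsigned int mask;          /* events the "driver" would report */
};

struct watcher {
    struct item *ready[MAX_ITEMS];
    int nready;
};

static bool driver_lock_held;   /* stands in for a driver spinlock */

/* A real driver's poll may take the driver lock, so calling it while
 * that lock is held could deadlock; the assert models that rule. */
static unsigned int item_poll(struct item *it)
{
    assert(!driver_lock_held);
    return it->mask;
}

/* Called from the wakeup path with the driver lock held:
 * O(1) bookkeeping only, never item_poll(). */
static void wakeup_callback(struct watcher *w, struct item *it)
{
    if (!it->on_ready_list && w->nready < MAX_ITEMS) {
        it->on_ready_list = true;
        w->ready[w->nready++] = it;
    }
}

/* The extra O(R) pass over the R ready items, run without the
 * driver lock. This sketch drops events beyond `max`. */
static int collect_events(struct watcher *w, unsigned int *out, int max)
{
    int n = 0;

    for (int i = 0; i < w->nready; i++) {
        struct item *it = w->ready[i];
        unsigned int ev = item_poll(it);

        it->on_ready_list = false;
        if (ev && n < max)
            out[n++] = ev;
    }
    w->nready = 0;
    return n;
}
```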



- Davide




Re: tg3_read_partno(): possible array overrun

2006-11-07 Thread David Miller
From: Michael Chan [EMAIL PROTECTED]
Date: Mon, 06 Nov 2006 12:07:31 -0800

 On Mon, 2006-11-06 at 10:45 +0100, Adrian Bunk wrote:
  The Coverity checker noted the following in drivers/net/tg3.c:
  
  --  snip  --
  
  The problem is that vpd_data[i + 2] could be vpd_data[255 + 2].
 
 Thanks.  This should fix it:
 
 [TG3]: Fix array overrun in tg3_read_partno().
 
 Use proper upper limits for the loops and check for all error
 conditions.
 
 The problem was noticed by Adrian Bunk.
 
 Signed-off-by: Michael Chan [EMAIL PROTECTED] 

Applied, thanks Michael.
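The general shape of such a fix is to bound the loop so that every index it dereferences, including `i + 2`, stays inside the buffer. Michael's patch itself is not quoted here, so the following is only a hypothetical sketch of a bounds-checked scan. `find_tag()` and the layout it assumes (a tag byte followed by a 16-bit little-endian length) are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find `tag` in a VPD-like buffer where each tag byte is followed by
 * a 16-bit little-endian block length. Returns the tag's offset, or
 * -1 if it is absent or its block would run past the end. The loop
 * condition guarantees buf[i + 2] is always in range. */
static long find_tag(const uint8_t *buf, size_t len, uint8_t tag)
{
    for (size_t i = 0; i + 2 < len; i++) {
        if (buf[i] != tag)
            continue;

        size_t block = buf[i + 1] | ((size_t)buf[i + 2] << 8);

        if (i + 3 + block > len)
            return -1;          /* declared length overruns buffer */
        return (long)i;
    }
    return -1;
}
```

A real VPD parser would skip over whole blocks instead of scanning byte by byte. The sketch keeps only the bounds logic.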


Re: [PATCH wireless-2.6-git] prism54: WPA/RSN support for fullmac cards

2006-11-07 Thread John W. Linville
On Fri, Nov 03, 2006 at 01:41:46PM -0500, Luis R. Rodriguez wrote:
 On 11/3/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 yes, especially mgt_commit_list caused a lot of headaches, until I removed
 DOT11_OID_PSM from the cache list.
 Now, I can hammer it with ping -f for hours.
 
 nice, perhaps that's been the culprit all along... going to dig to see
 if I find a fullmac prism card. Would like to get this merged in.

Any resolution on this?

-- 
John W. Linville
[EMAIL PROTECTED]


Please pull 'upstream-fixes' branch of wireless-2.6

2006-11-07 Thread John W. Linville
The following changes since commit edd106fc8ac1826dbe231b70ce0762db24133e5c:
  Auke Kok:
e1000: Fix regression: garbled stats and irq allocation during swsusp

are found in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-fixes

Adrian Bunk:
  bcm43xx: Add error checking in bcm43xx_sprom_write()

Michael Buesch:
  bcm43xx: Drain TX status before starting IRQs

 drivers/net/wireless/bcm43xx/bcm43xx_main.c |   22 --
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/bcm43xx/bcm43xx_main.c b/drivers/net/wireless/bcm43xx/bcm43xx_main.c
index 65edb56..a1b7838 100644
--- a/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ b/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -746,7 +746,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
if (err)
goto err_ctlreg;
spromctl |= 0x10; /* SPROM WRITE enable. */
-   bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+   err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
if (err)
goto err_ctlreg;
/* We must burn lots of CPU cycles here, but that does not
@@ -768,7 +768,7 @@ int bcm43xx_sprom_write(struct bcm43xx_p
mdelay(20);
}
spromctl &= ~0x10; /* SPROM WRITE enable. */
-   bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
+   err = bcm43xx_pci_write_config32(bcm, BCM43xx_PCICFG_SPROMCTL, spromctl);
if (err)
goto err_ctlreg;
mdelay(500);
@@ -1463,6 +1463,23 @@ static void handle_irq_transmit_status(s
}
 }
 
+static void drain_txstatus_queue(struct bcm43xx_private *bcm)
+{
+   u32 dummy;
+
+   if (bcm->current_core->rev < 5)
+   return;
+   /* Read all entries from the microcode TXstatus FIFO
+* and throw them away.
+*/
+   while (1) {
+   dummy = bcm43xx_read32(bcm, BCM43xx_MMIO_XMITSTAT_0);
+   if (!dummy)
+   break;
+   dummy = bcm43xx_read32(bcm, BCM43xx_MMIO_XMITSTAT_1);
+   }
+}
+
 static void bcm43xx_generate_noise_sample(struct bcm43xx_private *bcm)
 {
bcm43xx_shm_write16(bcm, BCM43xx_SHM_SHARED, 0x408, 0x7F7F);
@@ -3532,6 +3549,7 @@ int bcm43xx_select_wireless_core(struct 
bcm43xx_macfilter_clear(bcm, BCM43xx_MACFILTER_ASSOC);
bcm43xx_macfilter_set(bcm, BCM43xx_MACFILTER_SELF, (u8 *)(bcm->net_dev->dev_addr));
bcm43xx_security_init(bcm);
+   drain_txstatus_queue(bcm);
ieee80211softmac_start(bcm->net_dev);
 
/* Let's go! Be careful after enabling the IRQs.
-- 
John W. Linville
[EMAIL PROTECTED]
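drain_txstatus_queue() reads XMITSTAT_0 until it returns 0, and reads XMITSTAT_1 after each non-zero entry. The user-space model below mimics that loop against an array instead of MMIO. It is hypothetical. The struct and helpers are invented, and so is the assumption that reading the second word is what pops an entry; none of it comes from bcm43xx documentation. The real function also returns early on cores older than revision 5, which the sketch omits.

```c
#include <assert.h>
#include <stdint.h>

/* Fake TX-status FIFO: `words` holds (status0, status1) pairs. */
struct fake_fifo {
    const uint32_t *words;
    unsigned int npairs;
    unsigned int pos;
};

/* First status word of the current entry; 0 means "queue empty". */
static uint32_t read_stat0(struct fake_fifo *f)
{
    return f->pos < f->npairs ? f->words[2 * f->pos] : 0;
}

/* Second status word; in this model, reading it pops the entry. */
static uint32_t read_stat1(struct fake_fifo *f)
{
    uint32_t v = f->pos < f->npairs ? f->words[2 * f->pos + 1] : 0;

    f->pos++;
    return v;
}

/* Throw away every queued entry; returns how many were dropped. */
static unsigned int drain_fifo(struct fake_fifo *f)
{
    unsigned int dropped = 0;

    while (read_stat0(f) != 0) {
        (void)read_stat1(f);
        dropped++;
    }
    return dropped;
}
```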


Please pull 'upstream' branch of wireless-2.6

2006-11-07 Thread John W. Linville
The following changes since commit d4f748365129ccfc9dadf6fb14331e45e33cc4ed:
  John W. Linville:
Merge branch 'upstream-fixes' into upstream

are found in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream

John W. Linville:
  wireless: clean-up some check return code warnings

Larry Finger:
  bcm43xx: remove badness variable and related routine
  bcm43xx: Remove useless core enable/disable messages
  ieee80211softmac: fix verbosity when debug disabled

 drivers/net/wireless/bcm43xx/bcm43xx_main.c   |   56 +
 drivers/net/wireless/hostap/hostap_pci.c  |8 +++-
 drivers/net/wireless/ipw2100.c|8 +++-
 drivers/net/wireless/ipw2200.c|8 +++-
 drivers/net/wireless/orinoco_pci.h|7 +++
 drivers/net/wireless/prism54/islpci_hotplug.c |   20 +++--
 net/ieee80211/softmac/ieee80211softmac_auth.c |   10 ++--
 7 files changed, 60 insertions(+), 57 deletions(-)

diff --git a/drivers/net/wireless/bcm43xx/bcm43xx_main.c b/drivers/net/wireless/bcm43xx/bcm43xx_main.c
index c6bd868..60a9745 100644
--- a/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ b/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -2684,14 +2684,10 @@ #endif
bcm->chip_id, bcm->chip_rev);
dprintk(KERN_INFO PFX "Number of cores: %d\n", core_count);
if (bcm->core_chipcommon.available) {
-   dprintk(KERN_INFO PFX "Core 0: ID 0x%x, rev 0x%x, vendor 0x%x, %s\n",
-   core_id, core_rev, core_vendor,
-   bcm43xx_core_enabled(bcm) ? "enabled" : "disabled");
-   }
-
-   if (bcm->core_chipcommon.available)
+   dprintk(KERN_INFO PFX "Core 0: ID 0x%x, rev 0x%x, vendor 0x%x\n",
+   core_id, core_rev, core_vendor);
current_core = 1;
-   else
+   } else
current_core = 0;
for ( ; current_core < core_count; current_core++) {
struct bcm43xx_coreinfo *core;
@@ -2709,9 +2705,8 @@ #endif
core_rev = (sb_id_hi & 0xF);
core_vendor = (sb_id_hi  0x)  16;
 
-   dprintk(KERN_INFO PFX "Core %d: ID 0x%x, rev 0x%x, vendor 0x%x, %s\n",
-   current_core, core_id, core_rev, core_vendor,
-   bcm43xx_core_enabled(bcm) ? "enabled" : "disabled" );
+   dprintk(KERN_INFO PFX "Core %d: ID 0x%x, rev 0x%x, vendor 0x%x\n",
+   current_core, core_id, core_rev, core_vendor);
 
core = NULL;
switch (core_id) {
@@ -3209,55 +3204,27 @@ static void bcm43xx_periodic_every15sec(
 
 static void do_periodic_work(struct bcm43xx_private *bcm)
 {
-   unsigned int state;
-
-   state = bcm->periodic_state;
-   if (state % 8 == 0)
+   if (bcm->periodic_state % 8 == 0)
bcm43xx_periodic_every120sec(bcm);
-   if (state % 4 == 0)
+   if (bcm->periodic_state % 4 == 0)
bcm43xx_periodic_every60sec(bcm);
-   if (state % 2 == 0)
+   if (bcm->periodic_state % 2 == 0)
bcm43xx_periodic_every30sec(bcm);
-   if (state % 1 == 0)
-   bcm43xx_periodic_every15sec(bcm);
-   bcm->periodic_state = state + 1;
+   bcm43xx_periodic_every15sec(bcm);
 
schedule_delayed_work(&bcm->periodic_work, HZ * 15);
 }
 
-/* Estimate a Badness value based on the periodic work
- * state-machine state. Badness is worse (bigger), if the
- * periodic work will take longer.
- */
-static int estimate_periodic_work_badness(unsigned int state)
-{
-   int badness = 0;
-
-   if (state % 8 == 0) /* every 120 sec */
-   badness += 10;
-   if (state % 4 == 0) /* every 60 sec */
-   badness += 5;
-   if (state % 2 == 0) /* every 30 sec */
-   badness += 1;
-   if (state % 1 == 0) /* every 15 sec */
-   badness += 1;
-
-#define BADNESS_LIMIT  4
-   return badness;
-}
-
 static void bcm43xx_periodic_work_handler(void *d)
 {
struct bcm43xx_private *bcm = d;
struct net_device *net_dev = bcm->net_dev;
unsigned long flags;
u32 savedirqs = 0;
-   int badness;
unsigned long orig_trans_start = 0;
 
mutex_lock(&bcm->mutex);
-   badness = estimate_periodic_work_badness(bcm->periodic_state);
-   if (badness  BADNESS_LIMIT) {
+   if (unlikely(bcm->periodic_state % 4 == 0)) {
/* Periodic work will take a long time, so we want it to
 * be preemtible.
 */
@@ -3289,7 +3256,7 @@ static void bcm43xx_periodic_work_handle
 
do_periodic_work(bcm);
 
-   if (badness  BADNESS_LIMIT) {
+   if (unlikely(bcm->periodic_state % 4 == 0)) {
spin_lock_irqsave(&bcm->irq_lock, flags);
tasklet_enable(&bcm->isr_tasklet);
bcm43xx_interrupt_enable(bcm, savedirqs);
@@ -3300,6 +3267,7 @@ 
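The periodic work above runs every 15 seconds. It uses `periodic_state` modulo 2, 4 and 8 to pick the 30, 60 and 120 second jobs, and it takes the preemptible path when `periodic_state % 4 == 0`. A minimal user-space sketch of that dispatch follows. It assumes the counter is incremented once per tick, here by the caller's loop, because the part of the diff showing where the patch increments it is cut off. The struct and function names are hypothetical.

```c
#include <assert.h>

/* How often each job ran, for checking the schedule. */
struct periodic_counts {
    unsigned int every15, every30, every60, every120;
};

/* One 15-second tick: the counter value selects the longer jobs. */
static void do_tick(unsigned int state, struct periodic_counts *c)
{
    if (state % 8 == 0)
        c->every120++;
    if (state % 4 == 0)
        c->every60++;
    if (state % 2 == 0)
        c->every30++;
    c->every15++;           /* "state % 1 == 0" was always true */
}

/* Drive `nticks` ticks, incrementing the counter once per tick. */
static void run_ticks(unsigned int nticks, struct periodic_counts *c)
{
    for (unsigned int state = 0; state < nticks; state++)
        do_tick(state, c);
}
```

Over eight ticks (two minutes), the 15/30/60/120 second jobs run 8, 4, 2 and 1 times respectively.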

Re: [PATCH 1/3] NetXen: Fixed /sys mapping between device and driver

2006-11-07 Thread Amit S. Kale
Hi Ingo,

Will do.
Thanks for reviewing it.
-Amit

On Tuesday 07 November 2006 22:19, Ingo Oeser wrote:
 Hi Amit,

 one minor nitpick:

 You wrote:
  diff --git a/drivers/net/netxen/netxen_nic_main.c b/drivers/net/netxen/netxen_nic_main.c
  index b54ea16..4effb87 100644
  --- a/drivers/net/netxen/netxen_nic_main.c
  +++ b/drivers/net/netxen/netxen_nic_main.c

 [...]

  @@ -1040,7 +1041,7 @@ static int netxen_nic_poll(struct net_de
  netxen_nic_enable_int(adapter);
  }
 
  -   return (done ? 0 : 1);
  +   return (!done);

   return !done;

 Please lose the parentheses here (CodingStyle).

 Just respin or send this change along with later patchsets.

 Regards

 Ingo Oeser


[PATCH] Add support for configuring the PHY connection interface

2006-11-07 Thread Andy Fleming
Most PHYs connect to an ethernet controller over a GMII or MII
interface.  However, a growing number are connected over
different interfaces, such as RGMII or SGMII.

The ethernet driver will tell the PHY what type of connection it
is by setting it manually, or passing it in through phy_connect
(or phy_attach).

Changes include:
* Updates to documentation
* Updates to other PHY Lib consumers
* Changes to PHY Lib to add interface support
* Some minor changes to whitespace in phy.h
* interface values now passed to gianfar

Signed-off-by: Andrew Fleming [EMAIL PROTECTED]
---
 Documentation/networking/phy.txt   |   11 ---
 arch/powerpc/sysdev/fsl_soc.c  |   36 
 drivers/net/au1000_eth.c   |3 ++-
 drivers/net/fs_enet/fs_enet-main.c |3 ++-
 drivers/net/gianfar.c  |5 +++--
 drivers/net/phy/phy_device.c   |   29 -
 include/linux/phy.h|   32 ++--
 7 files changed, 97 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index 29ccae4..1c9873d 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -97,11 +97,12 @@ Letting the PHY Abstraction Layer do Eve
  
  Next, you need to know the device name of the PHY connected to this device. 
  The name will look something like, phy0:0, where the first number is the
- bus id, and the second is the PHY's address on that bus.
+ bus id, and the second is the PHY's address on that bus.  Typically,
+ the bus is responsible for making its ID unique.
  
  Now, to connect, just call this function:
  
-   phydev = phy_connect(dev, phy_name, adjust_link, flags);
+   phydev = phy_connect(dev, phy_name, adjust_link, flags, interface);
 
  phydev is a pointer to the phy_device structure which represents the PHY.  If
  phy_connect is successful, it will return the pointer.  dev, here, is the
@@ -115,6 +116,10 @@ Letting the PHY Abstraction Layer do Eve
  This is useful if the system has put hardware restrictions on
  the PHY/controller, of which the PHY needs to be aware.
 
+ interface is a u32 which specifies the connection type used
+ between the controller and the PHY.  Examples are GMII, MII,
+ RGMII, and SGMII.  For a full list, see include/linux/phy.h
+
  Now just make sure that phydev->supported and phydev->advertising have any
  values pruned from them which don't make sense for your controller (a 10/100
  controller may be connected to a gigabit capable PHY, so you would need to
@@ -191,7 +196,7 @@ Doing it all yourself
start, or disables then frees them for stop.
 
  struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
-u32 flags);
+u32 flags, u32 interface);
 
Attaches a network device to a particular PHY, binding the PHY to a generic
driver if none was found during bus initialization.  Passes in
diff --git a/arch/powerpc/sysdev/fsl_soc.c b/arch/powerpc/sysdev/fsl_soc.c
index b4b5b4a..b053370 100644
--- a/arch/powerpc/sysdev/fsl_soc.c
+++ b/arch/powerpc/sysdev/fsl_soc.c
@@ -211,6 +211,36 @@ static int __init gfar_set_flags(struct 
return device_flags;
 }
 
+/* Return the Linux interface mode type based on the
+ * specification in the device-tree */
+static int __init gfar_get_interface(struct device_node *np)
+{
+   const char *istr;
+   int interface = 0;
+
+   istr = get_property(np, "interface", NULL);
+
+   if (istr == NULL)
+   istr = "GMII";
+
+   if (!strcasecmp(istr, "GMII"))
+   interface = PHY_INTERFACE_MODE_GMII;
+   else if (!strcasecmp(istr, "MII"))
+   interface = PHY_INTERFACE_MODE_MII;
+   else if (!strcasecmp(istr, "RGMII"))
+   interface = PHY_INTERFACE_MODE_RGMII;
+   else if (!strcasecmp(istr, "SGMII"))
+   interface = PHY_INTERFACE_MODE_SGMII;
+   else if (!strcasecmp(istr, "TBI"))
+   interface = PHY_INTERFACE_MODE_TBI;
+   else if (!strcasecmp(istr, "RMII"))
+   interface = PHY_INTERFACE_MODE_RMII;
+   else if (!strcasecmp(istr, "RTBI"))
+   interface = PHY_INTERFACE_MODE_RTBI;
+   interface = PHY_INTERFACE_MODE_RTBI;
+
+   return interface;
+}
+
 static struct device_node * __init gfar_get_phy_node(struct device_node *np)
 {
const phandle *ph;
@@ -342,6 +372,12 @@ static int __init gfar_of_init(void)
if (mac_addr)
memcpy(gfar_data.mac_addr, mac_addr, 6);
 
+   gfar_data.interface = gfar_get_interface(np);
+   if (gfar_data.interface == 0) {
+   printk("gfar %d failed to set interface\n", num);
+   continue;
+   }
+
ret = gfar_set_phy_info(np, gfar_data.phy_id,
gfar_data.bus_id, gfar_data.phy_flags);
if (ret) {
diff --git a/drivers/net/au1000_eth.c 

[PATCH] Add support for Marvell 88e1111S and 88e1145

2006-11-07 Thread Andy Fleming
This patch requires the new support for configurable PHY
interfaces.

Changes include:
* New support for 88e1145
* New support for 88e1111S
* Fixing 88e1101 driver to not match non-88e1101 PHYs
* Increases in feature support across Marvell PHY product line
* Fixes a bunch of whitespace issues found by Lindent

Signed-off-by: Andrew Fleming [EMAIL PROTECTED]
---
 drivers/net/phy/marvell.c |  156 ++---
 1 files changed, 144 insertions(+), 12 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 0ad2532..5320ab9 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -43,6 +43,19 @@ #define MII_M1011_IMASK  0x12
 #define MII_M1011_IMASK_INIT   0x6400
 #define MII_M1011_IMASK_CLEAR  0x
 
+#define MII_M1011_PHY_SCR  0x10
+#define MII_M1011_PHY_SCR_AUTO_CROSS   0x0060
+
+#define MII_M1145_PHY_EXT_CR   0x14
+#define MII_M1145_RGMII_RX_DELAY   0x0080
+#define MII_M1145_RGMII_TX_DELAY   0x0002
+
+#define M1145_DEV_FLAGS_RESISTANCE 0x0001
+
+#define MII_M_PHY_LED_CONTROL  0x18
+#define MII_M_PHY_LED_DIRECT   0x4100
+#define MII_M_PHY_LED_COMBINE  0x411c
+
 MODULE_DESCRIPTION("Marvell PHY driver");
 MODULE_AUTHOR("Andy Fleming");
 MODULE_LICENSE("GPL");
@@ -64,7 +77,7 @@ static int marvell_config_intr(struct ph
 {
int err;
 
-   if(phydev->interrupts == PHY_INTERRUPT_ENABLED)
+   if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
err = phy_write(phydev, MII_M1011_IMASK, MII_M1011_IMASK_INIT);
else
err = phy_write(phydev, MII_M1011_IMASK, MII_M1011_IMASK_CLEAR);
@@ -104,34 +117,153 @@ static int marvell_config_aneg(struct ph
if (err < 0)
return err;
 
+   err = phy_write(phydev, MII_M1011_PHY_SCR,
+   MII_M1011_PHY_SCR_AUTO_CROSS);
+   if (err < 0)
+   return err;
+
+   err = phy_write(phydev, MII_M_PHY_LED_CONTROL,
+   MII_M_PHY_LED_DIRECT);
+   if (err < 0)
+   return err;
 
err = genphy_config_aneg(phydev);
 
return err;
 }
 
+static int m88e1145_config_init(struct phy_device *phydev)
+{
+   int err;
+
+   /* Take care of errata E0 & E1 */
+   err = phy_write(phydev, 0x1d, 0x001b);
+   if (err < 0)
+   return err;
+
+   err = phy_write(phydev, 0x1e, 0x418f);
+   if (err < 0)
+   return err;
+
+   err = phy_write(phydev, 0x1d, 0x0016);
+   if (err < 0)
+   return err;
+
+   err = phy_write(phydev, 0x1e, 0xa2da);
+   if (err < 0)
+   return err;
+
+   if (phydev->interface == PHY_INTERFACE_MODE_RGMII) {
+   int temp = phy_read(phydev, MII_M1145_PHY_EXT_CR);
+   if (temp < 0)
+   return temp;
+
+   temp |= (MII_M1145_RGMII_RX_DELAY | MII_M1145_RGMII_TX_DELAY);
+
+   err = phy_write(phydev, MII_M1145_PHY_EXT_CR, temp);
+   if (err < 0)
+   return err;
+
+   if (phydev->dev_flags & M1145_DEV_FLAGS_RESISTANCE) {
+   err = phy_write(phydev, 0x1d, 0x0012);
+   if (err < 0)
+   return err;
+
+   temp = phy_read(phydev, 0x1e);
+   if (temp < 0)
+   return temp;
+
+   temp &= 0xf03f;
+   temp |= 2 << 9; /* 36 ohm */
+   temp |= 2 << 6; /* 39 ohm */
+
+   err = phy_write(phydev, 0x1e, temp);
+   if (err < 0)
+   return err;
+
+   err = phy_write(phydev, 0x1d, 0x3);
+   if (err < 0)
+   return err;
+
+   err = phy_write(phydev, 0x1e, 0x8000);
+   if (err < 0)
+   return err;
+   }
+   }
+
+   return 0;
+}
 
 static struct phy_driver m88e1101_driver = {
-   .phy_id = 0x01410c00,
-   .phy_id_mask= 0xff00,
-   .name   = "Marvell 88E1101",
-   .features   = PHY_GBIT_FEATURES,
-   .flags  = PHY_HAS_INTERRUPT,
-   .config_aneg= marvell_config_aneg,
-   .read_status= genphy_read_status,
-   .ack_interrupt  = marvell_ack_interrupt,
-   .config_intr= marvell_config_intr,
-   .driver = { .owner = THIS_MODULE,},
+   .phy_id = 0x01410c60,
+   .phy_id_mask = 0xfff0,
+   .name = "Marvell 88E1101",
+   .features = PHY_GBIT_FEATURES,
+   .flags = PHY_HAS_INTERRUPT,
+   .config_aneg = marvell_config_aneg,
+   .read_status = genphy_read_status,
+   .ack_interrupt = marvell_ack_interrupt,
+   .config_intr = marvell_config_intr,
+   .driver = {.owner = THIS_MODULE,},
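m88e1145_config_init() uses two read-modify-write patterns. First, it ORs the RGMII RX/TX delay bits into the extended control register. Second, it clears bits 11:6 of an errata register (mask 0xf03f) and then sets two 3-bit fields, which the patch's comments label 36 ohm and 39 ohm. The arithmetic alone is sketched below. The helper names are hypothetical, and no PHY access is modelled.

```c
#include <assert.h>
#include <stdint.h>

#define RGMII_RX_DELAY 0x0080   /* value of MII_M1145_RGMII_RX_DELAY */
#define RGMII_TX_DELAY 0x0002   /* value of MII_M1145_RGMII_TX_DELAY */

/* Set both RGMII clock delays, leaving every other bit untouched. */
static uint16_t ext_cr_enable_delays(uint16_t ext_cr)
{
    return ext_cr | RGMII_RX_DELAY | RGMII_TX_DELAY;
}

/* Replace the 3-bit fields at bits 11:9 and 8:6 with `a` and `b`.
 * 0xf03f keeps every bit except 11:6. */
static uint16_t set_resistance_fields(uint16_t reg, unsigned int a,
                                      unsigned int b)
{
    reg &= 0xf03f;
    reg |= (uint16_t)((a & 0x7) << 9);
    reg |= (uint16_t)((b & 0x7) << 6);
    return reg;
}
```

With the patch's values (2 and 2), an all-ones register becomes 0xf4bf and an all-zeros register becomes 0x0480.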

Re: [PATCH] Add support for configuring the PHY connection interface

2006-11-07 Thread Kumar Gala


On Nov 8, 2006, at 12:10 AM, Andy Fleming wrote:


Most PHYs connect to an ethernet controller over a GMII or MII
interface.  However, a growing number are connected over
different interfaces, such as RGMII or SGMII.

The ethernet driver will tell the PHY what type of connection it
is by setting it manually, or passing it in through phy_connect
(or phy_attach).

Changes include:
* Updates to documentation
* Updates to other PHY Lib consumers
* Changes to PHY Lib to add interface support
* Some minor changes to whitespace in phy.h
* interface values now passed to gianfar

Signed-off-by: Andrew Fleming [EMAIL PROTECTED]


Any reason to not make interface an enum?

- k



Re: [PATCH] Add support for configuring the PHY connection interface

2006-11-07 Thread Andy Fleming


On Nov 8, 2006, at 00:16, Kumar Gala wrote:



On Nov 8, 2006, at 12:10 AM, Andy Fleming wrote:


Most PHYs connect to an ethernet controller over a GMII or MII
interface.  However, a growing number are connected over
different interfaces, such as RGMII or SGMII.

The ethernet driver will tell the PHY what type of connection it
is by setting it manually, or passing it in through phy_connect
(or phy_attach).

Changes include:
* Updates to documentation
* Updates to other PHY Lib consumers
* Changes to PHY Lib to add interface support
* Some minor changes to whitespace in phy.h
* interface values now passed to gianfar

Signed-off-by: Andrew Fleming [EMAIL PROTECTED]


Any reason to not make interface an enum?


I became mildly attached to the notion of having a reduced bit.

I'd be open to changing it.
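As one illustration of Kumar's suggestion, the if/else chain in gfar_get_interface() could instead be driven by a table that maps device-tree strings to an enum. The enum and its names below are hypothetical, not the ones in include/linux/phy.h. Two behaviours follow the patch: a missing property defaults to GMII, and 0 means "unknown".

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>    /* strcasecmp() (POSIX) */

enum phy_iface {
    PHY_IFACE_NA = 0,   /* 0 doubles as "unknown", as in the patch */
    PHY_IFACE_MII,
    PHY_IFACE_GMII,
    PHY_IFACE_SGMII,
    PHY_IFACE_TBI,
    PHY_IFACE_RMII,
    PHY_IFACE_RGMII,
    PHY_IFACE_RTBI,
};

static const struct {
    const char *name;
    enum phy_iface mode;
} iface_table[] = {
    { "MII",   PHY_IFACE_MII },
    { "GMII",  PHY_IFACE_GMII },
    { "SGMII", PHY_IFACE_SGMII },
    { "TBI",   PHY_IFACE_TBI },
    { "RMII",  PHY_IFACE_RMII },
    { "RGMII", PHY_IFACE_RGMII },
    { "RTBI",  PHY_IFACE_RTBI },
};

/* Table-driven equivalent of the if/else chain; NULL means the
 * property was absent. */
static enum phy_iface parse_iface(const char *s)
{
    if (s == NULL)
        s = "GMII";
    for (size_t i = 0; i < sizeof(iface_table) / sizeof(iface_table[0]); i++)
        if (strcasecmp(s, iface_table[i].name) == 0)
            return iface_table[i].mode;
    return PHY_IFACE_NA;
}
```

Adding a new mode then means one enum entry and one table row, and the compiler can type-check callers that take `enum phy_iface`.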


Re: [PATCH 2.6.19-rc4-git10][PKT_SCHED] sch_htb: INIT_HLIST_NODE after hlist_del()

2006-11-07 Thread Jarek Poplawski
On Tue, Nov 07, 2006 at 09:50:07AM -0800, Stephen Hemminger wrote:
 On Tue, 7 Nov 2006 07:49:43 +0100
 Jarek Poplawski [EMAIL PROTECTED] wrote:
...
 Your patch duplicated the code in hlist_del_init().  Why not do:
 
 --- a/net/sched/sch_htb.c 2006-11-07 09:48:22.0 -0800
 +++ b/net/sched/sch_htb.c 2006-11-07 09:49:01.0 -0800
 @@ -1284,8 +1284,7 @@
 struct htb_class, sibling));
  
   /* note: this delete may happen twice (see htb_delete) */
 - if (!hlist_unhashed(&cl->hlist))
 - hlist_del(&cl->hlist);
 + hlist_del_init(&cl->hlist);
   list_del(&cl->sibling);
  
   if (cl->prio_activity)
 @@ -1333,8 +1332,7 @@
   sch_tree_lock(sch);
  
   /* delete from hash and active; remainder in destroy_class */
 - if (!hlist_unhashed(&cl->hlist))
 - hlist_del(&cl->hlist);
 + hlist_del_init(&cl->hlist);
  
   if (cl->prio_activity)
   htb_deactivate(q, cl);
 

I understood your first suggestion. But after sending
my patch I found that it also hides a real problem:
excessive deletion in one, and possibly more, places.
So this should probably be done the right way, with the
hlist_unhashed() test left only in a BUG_ON()...
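The two positions differ as follows. hlist_del_init() checks whether the node is still hashed and reinitialises it after unlinking, so a second call is a harmless no-op. That is why it can replace the open-coded test. Jarek's concern is that the same idempotence also hides a double deletion. The simplified user-space copy of the primitives below (no poisoning, no RCU) shows both behaviours. `hlist_unlink()` stands in for the kernel's __hlist_del(). `hlist_del_checked()` is a hypothetical helper that uses assert() in place of BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void INIT_HLIST_NODE(struct hlist_node *n)
{
    n->next = NULL;
    n->pprev = NULL;
}

static int hlist_unhashed(const struct hlist_node *n)
{
    return !n->pprev;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink without reinitialising (the kernel's __hlist_del). */
static void hlist_unlink(struct hlist_node *n)
{
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
}

/* Stephen's suggestion: a second call on the same node is a no-op. */
static void hlist_del_init(struct hlist_node *n)
{
    if (!hlist_unhashed(n)) {
        hlist_unlink(n);
        INIT_HLIST_NODE(n);
    }
}

/* Jarek's direction, roughly: a second deletion is treated as a bug. */
static void hlist_del_checked(struct hlist_node *n)
{
    assert(!hlist_unhashed(n));     /* stands in for BUG_ON() */
    hlist_unlink(n);
    INIT_HLIST_NODE(n);
}
```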

Cheers,
Jarek P. 


Re: [BUG] [2.6.19-rc4-mm2] can't compile drivers/acpi/processor_idle.c

2006-11-07 Thread Andrew Morton
On Wed, 8 Nov 2006 15:01:41 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

 
 While compiling 2.6.19-rc4-mm2 on ia64, I met this compile error.
 ==
   CC [M]  drivers/acpi/processor_idle.o
 drivers/acpi/processor_idle.c:43:22: asm/apic.h: No such file or directory
 drivers/acpi/processor_idle.c: In function `acpi_processor_power_seq_show':
 drivers/acpi/processor_idle.c:1202: warning: long long unsigned int format, u64 arg (arg 5)
 ==
 
 This is because of acpi-include-apic-h.patch, maybe.
 ia64 doesn't have asm/apic.h

That got fixed (by ugly means).

 my .config is attached.

But rc5-mm1 remains broken with that .config:

arch/ia64/pci/pci.c: In function `pci_acpi_scan_root':
arch/ia64/pci/pci.c:354: warning: implicit declaration of function `pxm_to_node'
...
arch/ia64/pci/built-in.o(.text+0xe92): In function `pci_acpi_scan_root':
: undefined reference to `pxm_to_node'

This bug exists in mainline.




Also,

drivers/built-in.o(.text+0xd9a72): In function `e1000_xmit_frame':
: undefined reference to `csum_ipv6_magic'

I don't know how this got broken.  ia64 seems to be the only architecture
which doesn't have an implementation of csum_ipv6_magic().  This bug
appears to be introduced by git-netdev-all.patch.



Re: [BUG] [2.6.19-rc4-mm2] can't compile drivers/acpi/processor_idle.c

2006-11-07 Thread David Miller
From: Andrew Morton [EMAIL PROTECTED]
Date: Tue, 7 Nov 2006 22:52:59 -0800

 Also,
 
 drivers/built-in.o(.text+0xd9a72): In function `e1000_xmit_frame':
 : undefined reference to `csum_ipv6_magic'
 
 I don't know how this got broken.  ia64 seems to be the only architecture
 which doesn't have an implementation of csum_ipv6_magic().  This bug
 appears to be introduced by git-netdev-all.patch.

There is a generic version, which e1000 would get if it included
the net/ip_checksum.h header file.
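As background, csum_ipv6_magic() computes the Internet checksum of the IPv6 pseudo-header, folded together with a partial checksum of the payload. The pseudo-header consists of the source and destination addresses, the 32-bit upper-layer length, and the next-header value (RFC 2460, section 8.1). The portable sketch below is hypothetical and is not the generic kernel version David refers to. Unlike the kernel helper, it works on raw byte arrays and returns the result in host byte order, so a caller would convert it with htons() before storing it in a header.

```c
#include <assert.h>
#include <stdint.h>

/* One's-complement checksum over the IPv6 pseudo-header plus a
 * partial payload sum `sum`. Addresses are 16 bytes, network order. */
static uint16_t ipv6_pseudo_csum(const uint8_t saddr[16],
                                 const uint8_t daddr[16],
                                 uint32_t len, uint8_t proto,
                                 uint32_t sum)
{
    uint64_t acc = sum;

    for (int i = 0; i < 16; i += 2) {       /* 16-bit big-endian words */
        acc += (uint32_t)(saddr[i] << 8 | saddr[i + 1]);
        acc += (uint32_t)(daddr[i] << 8 | daddr[i + 1]);
    }
    acc += len >> 16;           /* 32-bit upper-layer packet length */
    acc += len & 0xffff;
    acc += proto;               /* next header, zero-padded to 32 bits */

    while (acc >> 16)           /* fold carries back into 16 bits */
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)~acc;
}
```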


Re: [BUG] [2.6.19-rc4-mm2] can't compile drivers/acpi/processor_idle.c

2006-11-07 Thread KAMEZAWA Hiroyuki
On Tue, 7 Nov 2006 22:52:59 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

 On Wed, 8 Nov 2006 15:01:41 +0900
 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 
  
  While compiling 2.6.19-rc4-mm2 on ia64, I met this compile error.
  ==
CC [M]  drivers/acpi/processor_idle.o
  drivers/acpi/processor_idle.c:43:22: asm/apic.h: No such file or directory
  drivers/acpi/processor_idle.c: In function `acpi_processor_power_seq_show':
  drivers/acpi/processor_idle.c:1202: warning: long long unsigned int format, u64 arg (arg 5)
  ==
  
  This is because of acpi-include-apic-h.patch, maybe.
  ia64 doesn't have asm/apic.h
 
 That got fixed (by ugly means).

Ah, okay. I'll move to rc5-mm1. Thank you.

 
  my .config is attached.
 
 But rc5-mm1 remains broken with that .config:
 
 arch/ia64/pci/pci.c: In function `pci_acpi_scan_root':
 arch/ia64/pci/pci.c:354: warning: implicit declaration of function `pxm_to_node'
 ...
 arch/ia64/pci/built-in.o(.text+0xe92): In function `pci_acpi_scan_root':
 : undefined reference to `pxm_to_node'
 
 This bug exists in mainline.
 

How about this ? Maybe ia64 people's review is necessary.

-Kame
==
When both ACPI and NUMA are enabled, pxm_to_node() is used, and it is defined in
drivers/acpi/numa.c, so NUMA should select ACPI_NUMA whenever ACPI is set.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]

Index: linux-2.6.19-rc4-mm2/arch/ia64/Kconfig
===
--- linux-2.6.19-rc4-mm2.orig/arch/ia64/Kconfig 2006-11-08 14:15:21.0 +0900
+++ linux-2.6.19-rc4-mm2/arch/ia64/Kconfig  2006-11-08 16:16:40.0 +0900
@@ -353,6 +353,7 @@
bool "NUMA support"
depends on !IA64_HP_SIM && !FLATMEM
default y if IA64_SGI_SN2
+   select ACPI_NUMA if ACPI
help
  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
  Access).  This option is for configuring high-end multiprocessor
