[PATCH 0/2] myri10ge updates for 2.6.23

2007-08-24 Thread Brice Goglin
Hi Jeff,

Now that Greg pushed my fix to expose the pcie_get_readrq() prototype in
linux/pci.h, I am resending my rework of Peter Oruba's patch to use
pcie_get/set_readrq() in myri10ge. Please apply for 2.6.23.

1. use pcie_get/set_readrq
2. update driver version to 1.3.2-1.269



Also, we noticed that packet forwarding is faster on our hardware when
receiving in linear skb instead of pages (about 10Gb/s vs. 7), so we are
thinking of submitting a Kconfig option to switch to the old linear skb
RX code. Would this be acceptable?

Thanks,
Brice



[PATCH 1/2] myri10ge: use pcie_get/set_readrq

2007-08-24 Thread Brice Goglin
Based on a patch from Peter Oruba, convert myri10ge to use pcie_get_readrq()
and pcie_set_readrq() instead of our own PCI calls and arithmetic.

These driver changes incorporate the proposed PCI-X / PCI-Express read byte
count interface.  Reading and setting those values is no longer done
manually; instead, wrapper functions are called to allow quirks for some
PCI bridges.

Signed-off-by: Brice Goglin [EMAIL PROTECTED]
Signed-off-by: Peter Oruba [EMAIL PROTECTED]
Based on work by Stephen Hemminger [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---
 drivers/net/myri10ge/myri10ge.c |   32 ++--
 1 file changed, 6 insertions(+), 26 deletions(-)

Index: linux-2.6.git/drivers/net/myri10ge/myri10ge.c
===================================================================
--- linux-2.6.git.orig/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:43:01.000000000 +0200
+++ linux-2.6.git/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:45:29.000000000 +0200
@@ -2514,26 +2514,20 @@
 {
 	struct pci_dev *pdev = mgp->pdev;
 	struct device *dev = &pdev->dev;
-	int cap, status;
-	u16 val;
+	int status;
 
 	mgp->tx.boundary = 4096;
 	/*
 	 * Verify the max read request size was set to 4KB
 	 * before trying the test with 4KB.
 	 */
-	cap = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-	if (cap < 64) {
-		dev_err(dev, "Bad PCI_CAP_ID_EXP location %d\n", cap);
-		goto abort;
-	}
-	status = pci_read_config_word(pdev, cap + PCI_EXP_DEVCTL, &val);
-	if (status != 0) {
+	status = pcie_get_readrq(pdev);
+	if (status < 0) {
 		dev_err(dev, "Couldn't read max read req size: %d\n", status);
 		goto abort;
 	}
-	if ((val & (5 << 12)) != (5 << 12)) {
-		dev_warn(dev, "Max Read Request size != 4096 (0x%x)\n", val);
+	if (status != 4096) {
+		dev_warn(dev, "Max Read Request size != 4096 (%d)\n", status);
 		mgp->tx.boundary = 2048;
 	}
 	/*
/*
@@ -2850,9 +2844,7 @@
 	size_t bytes;
 	int i;
 	int status = -ENXIO;
-	int cap;
 	int dac_enabled;
-	u16 val;
 
 	netdev = alloc_etherdev(sizeof(*mgp));
 	if (netdev == NULL) {
@@ -2884,19 +2876,7 @@
 		= pci_find_capability(pdev, PCI_CAP_ID_VNDR);
 
 	/* Set our max read request to 4KB */
-	cap = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-	if (cap < 64) {
-		dev_err(&pdev->dev, "Bad PCI_CAP_ID_EXP location %d\n", cap);
-		goto abort_with_netdev;
-	}
-	status = pci_read_config_word(pdev, cap + PCI_EXP_DEVCTL, &val);
-	if (status != 0) {
-		dev_err(&pdev->dev, "Error %d reading PCI_EXP_DEVCTL\n",
-			status);
-		goto abort_with_netdev;
-	}
-	val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12);
-	status = pci_write_config_word(pdev, cap + PCI_EXP_DEVCTL, val);
+	status = pcie_set_readrq(pdev, 4096);
 	if (status != 0) {
 		dev_err(&pdev->dev, "Error %d writing PCI_EXP_DEVCTL\n",
 			status);




[PATCH 2/2] myri10ge: update driver version to 1.3.2-1.269

2007-08-24 Thread Brice Goglin
Update myri10ge driver version to 1.3.2-1.269.

Signed-off-by: Brice Goglin [EMAIL PROTECTED]
---
 drivers/net/myri10ge/myri10ge.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.git/drivers/net/myri10ge/myri10ge.c
===================================================================
--- linux-2.6.git.orig/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:45:29.000000000 +0200
+++ linux-2.6.git/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:45:38.000000000 +0200
@@ -72,7 +72,7 @@
 #include "myri10ge_mcp.h"
 #include "myri10ge_mcp_gen_header.h"
 
-#define MYRI10GE_VERSION_STR "1.3.1-1.248"
+#define MYRI10GE_VERSION_STR "1.3.2-1.269"
 
 MODULE_DESCRIPTION("Myricom 10G driver (10GbE)");
 MODULE_AUTHOR("Maintainer: [EMAIL PROTECTED]");




Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread Lennert Buytenhek
On Fri, Aug 24, 2007 at 01:08:25AM +0100, TJ wrote:

 An RFC 793 standard TCP handshake requires three packets:
 
 client SYN            ->  server LISTENING
 client                <-  SYN ACK server SYN_RECEIVED
 client ACK            ->  server ESTABLISHED
 
 client PSH ACK + data ->  server
 
 TCP_DEFER_ACCEPT is designed to increase performance by reducing the
 number of TCP packets exchanged before the client can pass data:
 
 client SYN            ->  server LISTENING
 client                <-  SYN ACK server SYN_RECEIVED
 
 client PSH ACK + data ->  server ESTABLISHED
 
 At present with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake
 as invalid, dropping the ACK from the client without replying, so the
 client doesn't know the server has in fact set its internal ACKed flag.
 
 If the client doesn't send a packet containing data before the SYN_ACK
 time-outs finally expire, the connection will be dropped.

I brought this up a long, long time ago, and I seem to remember
Alexey Kuznetsov explained to me at the time that this was intentional.

I can't find the thread in the mailing list archives anymore, though
-- and my memory might be failing me.
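
For anyone following along, enabling the option is a one-liner on the
listening socket (a minimal sketch; listen_fd stands for an
already-created listening socket, and the value is the timeout in
seconds to wait for data):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	/* Complete accept() only once data arrives; wait at most 5s.
	 * A bare ACK alone will not make accept() return. */
	int secs = 5;
	setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
		   &secs, sizeof(secs));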


Re: [PATCH 12/30] net: No point in casting kmalloc return values in Gianfar Ethernet Driver

2007-08-24 Thread Kumar Gala


On Aug 23, 2007, at 6:59 PM, Jesper Juhl wrote:


kmalloc() returns a void pointer, so there's no need to cast its return
value in drivers/net/gianfar.c.

Signed-off-by: Jesper Juhl [EMAIL PROTECTED]


Acked-by: Kumar Gala [EMAIL PROTECTED]

- k
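
For anyone wondering why such casts are redundant, a minimal
illustration (hypothetical buffer; kernel context assumed):

	/* kmalloc() returns void *, which converts implicitly to any
	 * object pointer type in C, so the cast adds nothing: */
	char *buf1 = kmalloc(len, GFP_KERNEL);		/* preferred */
	char *buf2 = (char *)kmalloc(len, GFP_KERNEL);	/* redundant cast */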


Re: [PATCH] bridge: sysfs locking fix.

2007-08-24 Thread Daniel Lezcano

Stephen Hemminger wrote:

Forget earlier patch, it is wrong...

The stp change code generates a "sleeping function called from invalid context"
warning because rtnl_lock() is called with BH disabled. This fixes it by not
acquiring and then dropping the bridge lock.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- a/net/bridge/br_sysfs_br.c	2007-08-06 09:26:48.000000000 +0100
+++ b/net/bridge/br_sysfs_br.c	2007-08-14 14:29:52.000000000 +0100
@@ -147,20 +147,26 @@ static ssize_t show_stp_state(struct dev
 	return sprintf(buf, "%d\n", br->stp_enabled);
 }
 
-static void set_stp_state(struct net_bridge *br, unsigned long val)
-{
-	rtnl_lock();
-	spin_unlock_bh(&br->lock);
-	br_stp_set_enabled(br, val);
-	spin_lock_bh(&br->lock);
-	rtnl_unlock();
-}
 
 static ssize_t store_stp_state(struct device *d,
 			       struct device_attribute *attr, const char *buf,
 			       size_t len)
 {
-	return store_bridge_parm(d, buf, len, set_stp_state);
+	struct net_bridge *br = to_bridge(d);
+	char *endp;
+	unsigned long val;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	val = simple_strtoul(buf, &endp, 0);
+	if (endp == buf)
+		return -EINVAL;
+
+	rtnl_lock();
+	br_stp_set_enabled(br, val);
+	rtnl_unlock();
+


Shouldn't the len value be returned at the end of the function?


 }
 static DEVICE_ATTR(stp_state, S_IRUGO | S_IWUSR, show_stp_state,
 		   store_stp_state);
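
For clarity, the tail Daniel seems to be asking for would look like
this (a sketch; sysfs store hooks conventionally return the number of
bytes consumed):

	rtnl_lock();
	br_stp_set_enabled(br, val);
	rtnl_unlock();

	return len;	/* report the whole buffer as consumed */
}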


Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread Alexey Kuznetsov
Hello!

  At present with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake
  as invalid, dropping the ACK from the client without replying, so the
  client doesn't know the server has in fact set its internal ACKed flag.
  
  If the client doesn't send a packet containing data before the SYN_ACK
  time-outs finally expire, the connection will be dropped.
 
 A brought this up a long, long time ago, and I seem to remember
 Alexey Kuznetsov explained me at the time that this was intentional.

Obviously, I said something like "it is exactly what TCP_DEFER_ACCEPT does".


There is no protocol violation here; the ACK from the client is considered
lost, which is quite normal and happens all the time. The handshake is not
complete, so the server remains in SYN-RECV state and continues to retransmit
the SYN-ACK. If the client is cheating and not going to send its request,
the connection will time out.

Alexey



[Question] the precondition of calling alloc_skb()/kfree_skb()?

2007-08-24 Thread Li Yu
Hi, all:

I encountered a problem using sk_buff.

I am using the 2.4.20 kernel; when burst traffic comes in, the kernel
complains with a bug report at skbuff.c:316:

311 void __kfree_skb(struct sk_buff *skb)
312 {
313 	if (skb->list) {
314 		printk(KERN_WARNING "Warning: kfree_skb passed an skb still "
315 		       "on a list (from %p).\n", NET_CALLER(skb));
316 		BUG(); /* HERE!!! */
317 	}
	/* snip some code here */
332 }


I have looked at dev_kfree_skb_irq() and its relatives, and how to use
them. In fact we work in a pure-poll I/O model, so the NIC cannot issue
any interrupt.

Also, I searched Google and there are many similar reports like the one
above, but almost all of them got no reply. I suspect there is some
undocumented requirement in the sk_buff API.

So I would like to know: are there preconditions for calling alloc_skb()
or the *_kfree_skb_*() functions? Thanks in advance.

Good luck.

- Yu Li
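
For reference, the invariant the BUG() at skbuff.c:316 enforces is that
an skb must not be linked into any list when it is freed. A minimal
sketch of the usual pattern (2.4-era API; rxq is a hypothetical queue):

	struct sk_buff_head rxq;	/* hypothetical receive queue */
	struct sk_buff *skb;

	skb_queue_head_init(&rxq);
	/* ... packets are queued with skb_queue_tail(&rxq, skb) ... */

	/* skb_dequeue() unlinks the skb (clearing skb->list), so
	 * freeing it afterwards is safe: */
	while ((skb = skb_dequeue(&rxq)) != NULL)
		kfree_skb(skb);

	/* calling kfree_skb() on an skb still linked into a queue is
	 * exactly what trips the BUG() above */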



Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread TJ
On Fri, 2007-08-24 at 12:40 +0400, Alexey Kuznetsov wrote:

 There is no protocol violation here; the ACK from the client is considered
 lost, which is quite normal and happens all the time. The handshake is not
 complete, so the server remains in SYN-RECV state and continues to retransmit
 the SYN-ACK. If the client is cheating and not going to send its request,
 the connection will time out.

Thanks for the responses.

Do we have any authoritative references on this? Who implemented it
originally?

Right now Juniper are claiming the issue that brought this to the
surface (the bug linked to in my original post) is a problem with the
implementation of TCP_DEFER_ACCEPT.

My position so far is that the Juniper DX OS is not following the HTTP
standard because it doesn't send a request with the connection, and as I
read the end of section 1.4 of RFC2616, an HTTP connection should be
accompanied by a request.

Can anyone confirm my interpretation or provide references to firm it
up, or refute it?

There is also a very real practical problem here:

Since version 2.1.5 apache enables TCP_DEFER_ACCEPT *by default* without
mention of it in the configuration file.

As time goes on the number of apache v2.1.5+ deployments is only going
to rise, and I'd hate for anyone else to go through the 5+ weeks of pain
the system admins at the e-commerce operation I was helping went
through, not to mention the last 2 weeks feeling like I was chasing
ghosts - it's an absolute pain to track down and identify!

Therefore, anyone deploying apache web servers in a web-farm behind the
Juniper DX load-balancers and using TCP multiplexing (for which they pay
a hefty licence fee!) is liable to suffer the random drop effects
described in my bug report.

Because several other HTTP load-balancers deploy similar methods of
holding open connections to the servers and pipe-lining requests, this
could affect more than just Juniper.

Any other suggestions/reactions on the Linux kernel side? I'm intending
to post a comment to the apache-dev mailing list once I've gathered the
strands together.

Thanks again.

TJ.
Ubuntu ACPI Kernel Team.



[PATCH] [XFRM] : Fix pointer copy size for encap_tmpl and coaddr.

2007-08-24 Thread Masahide NAKAMURA
This is a minor fix to the sizeof arguments used with kmemdup().

Signed-off-by: Masahide NAKAMURA [EMAIL PROTECTED]
---
 net/xfrm/xfrm_user.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 0b8491f..46076f5 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -299,14 +299,14 @@ static struct xfrm_state *xfrm_state_construct(struct xfrm_usersa_info *p,
 
 	if (attrs[XFRMA_ENCAP]) {
 		x->encap = kmemdup(nla_data(attrs[XFRMA_ENCAP]),
-				   sizeof(x->encap), GFP_KERNEL);
+				   sizeof(*x->encap), GFP_KERNEL);
 		if (x->encap == NULL)
 			goto error;
 	}
 
 	if (attrs[XFRMA_COADDR]) {
 		x->coaddr = kmemdup(nla_data(attrs[XFRMA_COADDR]),
-				    sizeof(x->coaddr), GFP_KERNEL);
+				    sizeof(*x->coaddr), GFP_KERNEL);
 		if (x->coaddr == NULL)
 			goto error;
 	}
-- 
1.4.4.2
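
The pitfall in isolation (a sketch; src and tmpl are hypothetical):

	struct xfrm_encap_tmpl *tmpl, *src = get_template();

	/* WRONG: sizeof(tmpl) is the size of the pointer itself
	 * (4 or 8 bytes), so only part of the struct is duplicated */
	tmpl = kmemdup(src, sizeof(tmpl), GFP_KERNEL);

	/* RIGHT: sizeof(*tmpl) is the size of the pointed-to struct */
	tmpl = kmemdup(src, sizeof(*tmpl), GFP_KERNEL);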



[PATCH] [IPV6] XFRM: Fix connected socket to use transformation.

2007-08-24 Thread Masahide NAKAMURA
When the XFRM policy and state become ready after a TCP connection has
started, the traffic should be transformed immediately; however, it is
not on IPv6 TCP.

This depends on the dst cache replacement policy for connected sockets.
The replacement seems to always be done for IPv4; however, in the IPv6
case it is done only when the routing cookie has changed.

This patch fixes that, so that a non-transformed dst can be replaced by
a transformed one.
This behavior is required by MIPv6 and improves IPv6 IPsec.

Signed-off-by: Noriaki TAKAMIYA [EMAIL PROTECTED]
Signed-off-by: Masahide NAKAMURA [EMAIL PROTECTED]
---
 include/net/ip6_fib.h|2 ++
 net/ipv6/inet6_connection_sock.c |   34 --
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index c48ea87..85d6d9f 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -105,6 +105,8 @@ struct rt6_info
 	struct rt6key			rt6i_src;
 
 	u8				rt6i_protocol;
+
+	u32				rt6i_flow_cache_genid;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 116f94a..f389322 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -139,6 +139,36 @@ void inet6_csk_addr2sockaddr(struct sock *sk, struct sockaddr *uaddr)
 
 EXPORT_SYMBOL_GPL(inet6_csk_addr2sockaddr);
 
+static inline
+void __inet6_csk_dst_store(struct sock *sk, struct dst_entry *dst,
+			   struct in6_addr *daddr, struct in6_addr *saddr)
+{
+	struct rt6_info *rt = (struct rt6_info *)dst;
+
+	__ip6_dst_store(sk, dst, daddr, saddr);
+	rt->rt6i_flow_cache_genid = atomic_read(&flow_cache_genid);
+}
+
+static inline
+struct dst_entry *__inet6_csk_dst_check(struct sock *sk, u32 cookie)
+{
+	struct dst_entry *dst;
+	struct rt6_info *rt;
+
+	dst = __sk_dst_check(sk, cookie);
+	if (!dst)
+		goto end;
+
+	rt = (struct rt6_info *)dst;
+	if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
+		sk->sk_dst_cache = NULL;
+		dst_release(dst);
+		dst = NULL;
+	}
+ end:
+	return dst;
+}
+
 int inet6_csk_xmit(struct sk_buff *skb, int ipfragok)
 {
 	struct sock *sk = skb->sk;
@@ -166,7 +196,7 @@ int inet6_csk_xmit(struct sk_buff *skb, int ipfragok)
 		final_p = &final;
 	}
 
-	dst = __sk_dst_check(sk, np->dst_cookie);
+	dst = __inet6_csk_dst_check(sk, np->dst_cookie);
 
 	if (dst == NULL) {
 		int err = ip6_dst_lookup(sk, &dst, &fl);
@@ -186,7 +216,7 @@ int inet6_csk_xmit(struct sk_buff *skb, int ipfragok)
 			return err;
 		}
 
-		__ip6_dst_store(sk, dst, NULL, NULL);
+		__inet6_csk_dst_store(sk, dst, NULL, NULL);
 	}
 
 	skb->dst = dst_clone(dst);
-- 
1.4.4.2



[PATCH 1/2] [IPV6] IPSEC: Omit redirect for tunnelled packet.

2007-08-24 Thread Masahide NAKAMURA
An IPv6 IPsec tunnel gateway incorrectly sends a redirect to the
router or sender when the network device on which the IPsec tunnelled
packet arrived is the same as the one the decapsulated packet is
sent out on.

With this patch, the redirect is omitted when the forwarding skbuff
carries a secpath, since such an skbuff should be assumed to be a
packet the gateway itself decapsulated from an IPsec tunnel.

It may be a rare case for an IPsec security gateway, however it is
not rare when the gateway is a MIPv6 Home Agent, since the other
tunnel end-point is a Mobile Node which changes its attached network.

Signed-off-by: Masahide NAKAMURA [EMAIL PROTECTED]
---
 net/ipv6/ip6_output.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5dead39..07b82c2 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -441,8 +441,10 @@ int ip6_forward(struct sk_buff *skb)
 
 	/* IPv6 specs say nothing about it, but it is clear that we cannot
 	   send redirects to source routed frames.
+	   We don't send redirects to frames decapsulated from IPsec.
 	 */
-	if (skb->dev == dst->dev && dst->neighbour && opt->srcrt == 0) {
+	if (skb->dev == dst->dev && dst->neighbour && opt->srcrt == 0 &&
+	    !skb->sp) {
 		struct in6_addr *target = NULL;
 		struct rt6_info *rt;
 		struct neighbour *n = dst->neighbour;
-- 
1.4.4.2



[PATCH 2/2] [IPV4] IPSEC: Omit redirect for tunnelled packet.

2007-08-24 Thread Masahide NAKAMURA
An IPv4 IPsec tunnel gateway incorrectly sends a redirect to the sender
(if it is an onlink host) when the network device on which the IPsec
tunnelled packet arrived is the same as the one the decapsulated packet
is sent out on.

With this patch, the redirect is omitted when the forwarding skbuff
carries a secpath, since such an skbuff should be assumed to be a
packet the gateway itself decapsulated from an IPsec tunnel.

Request for comments:
Alternatively, we could change net/ipv4/route.c (__mkroute_input) to
set the RTCF_DOREDIRECT flag only when the skbuff has no secpath. That
would be better than this patch from a performance point of view,
because the IPv4 redirect judgement would be done in the routing slow
path. However, it would have to take care of resource changes between
the SAD (XFRM states) and the routing table. In other words, when the
IPv4 SAD is changed, does the related routing entry go back through the
slow path? If not, it is reasonable to apply this patch.

Signed-off-by: Masahide NAKAMURA [EMAIL PROTECTED]
---
 net/ipv4/ip_forward.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 8c95cf0..afbf938 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -105,7 +105,7 @@ int ip_forward(struct sk_buff *skb)
 	 *	We now generate an ICMP HOST REDIRECT giving the route
 	 *	we calculated.
 	 */
-	if (rt->rt_flags & RTCF_DOREDIRECT && !opt->srr)
+	if (rt->rt_flags & RTCF_DOREDIRECT && !opt->srr && !skb->sp)
 		ip_rt_send_redirect(skb);
 
 	skb->priority = rt_tos2priority(iph->tos);
-- 
1.4.4.2



Re: [ANNOUNCE] iproute2-2.6.23-rc3

2007-08-24 Thread Jarek Poplawski
On 22-08-2007 20:08, Stephen Hemminger wrote:
 There have been a lot of changes for 2.6.23, so here is a test release
 of iproute2 that should capture all the submitted patches
 
 
 http://developer.osdl.org/shemminger/iproute2/download/iproute2-2.6.23-rc3.tar.gz

But... isn't it forged, btw?!

Cheers,
Jarek P.


Re: [IPv6] Add v4mapped address inline

2007-08-24 Thread YOSHIFUJI Hideaki / 吉藤英明
In article [EMAIL PROTECTED] (at Thu, 23 Aug 2007 14:14:35 -0400), Brian Haley [EMAIL PROTECTED] says:

 YOSHIFUJI Hideaki / 吉藤英明 wrote:
  Please put this just after ipv6_addr_any(), not after
  ipv6_addr_diff().
 
 Ok, updated patch attached.
 
 -Brian
 
 
 Add v4mapped address inline to avoid calls to ipv6_addr_type().
 
 Signed-off-by: Brian Haley [EMAIL PROTECTED]
Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED]

--yoshfuji
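
The patch body travelled as an attachment and is not reproduced above;
for context, the helper under discussion presumably takes the
well-known form (a sketch, not necessarily Brian's exact patch):

	static inline int ipv6_addr_v4mapped(const struct in6_addr *a)
	{
		return ((a->s6_addr32[0] | a->s6_addr32[1]) == 0 &&
			 a->s6_addr32[2] == htonl(0x0000ffff));
	}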


Re: [PATCH] [XFRM] : Fix pointer copy size for encap_tmpl and coaddr.

2007-08-24 Thread Thomas Graf
* Masahide NAKAMURA [EMAIL PROTECTED] 2007-08-24 19:05
 This is a minor fix to the sizeof arguments used with kmemdup().

Thanks for catching this!


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Denys Vlasenko
On Thursday 16 August 2007 01:39, Satyam Sharma wrote:

  static inline void wait_for_init_deassert(atomic_t *deassert)
  {
 - while (!atomic_read(deassert));
 + while (!atomic_read(deassert))
 + cpu_relax();
   return;
  }

For less-than-brilliant people like me, it's totally non-obvious that
cpu_relax() is needed for correctness here, not just to make P4 happy.

IOW: the name atomic_read quite unambiguously means "I will read
this variable from main memory". Which is not true, and creates
potential for confusion and bugs.
--
vda
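
To make Denys's point concrete, a sketch of the failure mode (assuming
an atomic_read() that compiles to a plain, non-volatile load, which
this thread establishes is the case on some architectures):

	/* without a compiler barrier, gcc may load deassert->counter
	 * once, keep it in a register, and spin on the stale value: */
	while (!atomic_read(deassert))
		;		/* load may be hoisted out of the loop */

	/* cpu_relax() contains a compiler barrier (on x86 it is
	 * "rep; nop" with a memory clobber), forcing a reload of the
	 * variable on every iteration: */
	while (!atomic_read(deassert))
		cpu_relax();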


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread jamal
On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..]
 Here you can see there is a major difference in the TX CPU utilization
 (99 % with TSO disabled versus only 39 % with TSO enabled), although
 the TSO disabled case was able to squeeze out a little extra performance
 from its extra CPU utilization.  

Good stuff. What kind of machine? SMP?
Seems the receive side of the sender is also consuming a lot more cpu
i suspect because receiver is generating a lot more ACKs with TSO.
Does the choice of the tcp congestion control algorithm affect results?
it would be interesting to see both MTUs with either TCP BIC vs good old
reno on sender (probably without changing what the receiver does). BIC
seems to be the default lately.

 Interestingly, with TSO enabled, the
 receiver actually consumed more CPU than with TSO disabled, 

I would suspect that the fact that a lot more packets make it into the
receiver with TSO contributes.

 so I guess
 the receiver CPU saturation in that case (99 %) was what restricted
 its performance somewhat (this was consistent across a few test runs).

Unfortunately the receiver plays a big role in such tests - if it is
bottlenecked then you are not really testing the limits of the
transmitter. 

cheers,
jamal



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Saturday 18 August 2007 05:13, Linus Torvalds wrote:
 On Sat, 18 Aug 2007, Satyam Sharma wrote:
  No code does (or would do, or should do):
 
  x.counter++;
 
  on an atomic_t x; anyway.

 That's just an example of a general problem.

 No, you don't use x.counter++. But you *do* use
 
 	if (atomic_read(&x) >= 1)
 
 and loading into a register is stupid and pointless, when you could just
 do it as a regular memory-operand to the cmp instruction.

It doesn't mean that (volatile int*) cast is bad, it means that current gcc
is bad (or not good enough). IOW: instead of avoiding volatile cast,
it's better to fix the compiler.

 And as far as the compiler is concerned, the problem is the 100% same:
 combining operations with the volatile memop.

 The fact is, a compiler that thinks that

   movl mem,reg
   cmpl $val,reg

 is any better than

   cmpl $val,mem

 is just not a very good compiler.

Linus, in all honesty gcc has many more cases of suboptimal code,
case of volatile is just one of many.

Off the top of my head:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417

unsigned v;
void f(unsigned A) { v = ((unsigned long long)A) * 365384439 >> (27+32); }

gcc-4.1.1 -S -Os -fomit-frame-pointer t.c

f:
	movl	$365384439, %eax
	mull	4(%esp)
	movl	%edx, %eax	<= ?
	shrl	$27, %eax
	movl	%eax, v
	ret

Why is it moving %edx to %eax?

gcc-4.2.1 -S -Os -fomit-frame-pointer t.c

f:
	movl	$365384439, %eax
	mull	4(%esp)
	movl	%edx, %eax	<= ?
	xorl	%edx, %edx	<= ??!
	shrl	$27, %eax
	movl	%eax, v
	ret

Progress... Now we also zero out %edx afterwards for no apparent reason.
--
vda


RE: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Kenn Humborg
 On Thursday 16 August 2007 01:39, Satyam Sharma wrote:
 
   static inline void wait_for_init_deassert(atomic_t *deassert)
   {
  -   while (!atomic_read(deassert));
  +   while (!atomic_read(deassert))
  +   cpu_relax();
  return;
   }
 
 For less-than-brilliant people like me, it's totally non-obvious that
 cpu_relax() is needed for correctness here, not just to make P4 happy.
 
 IOW: the name atomic_read quite unambiguously means "I will read
 this variable from main memory". Which is not true and creates
 potential for confusion and bugs.

To me, atomic_read means a read which is synchronized with other 
changes to the variable (using the atomic_XXX functions) in such 
a way that I will always only see the "before" or "after"
state of the variable - never an intermediate state while a 
modification is happening.  It doesn't imply that I have to 
see the "after" state immediately after another thread modifies
it.

Perhaps the Linux atomic_XXX functions work like that, or used
to work like that, but it's counter-intuitive to me that atomic
should imply a memory read.

Later,
Kenn



Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Andi Kleen
On Friday 24 August 2007 13:59:32 Denys Vlasenko wrote:
 On Thursday 16 August 2007 01:39, Satyam Sharma wrote:
 
   static inline void wait_for_init_deassert(atomic_t *deassert)
   {
  -   while (!atomic_read(deassert));
  +   while (!atomic_read(deassert))
  +   cpu_relax();
  return;
   }
 
 For less-than-brilliant people like me, it's totally non-obvious that
 cpu_relax() is needed for correctness here, not just to make P4 happy.

I find it also non obvious. It would be really better to have a barrier
or equivalent (volatile or variable clobber) in the atomic_read()
 
 IOW: the name atomic_read quite unambiguously means "I will read
 this variable from main memory". Which is not true and creates
 potential for confusion and bugs.

Agreed.

-Andi


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread jamal
On Thu, 2007-23-08 at 20:34 -0700, Stephen Hemminger wrote:

 A current hot topic of research is reducing the number of ACK's to make TCP
 work better over asymmetric links like 3G.

One other good reason to reduce ACKs to battery-powered (3G) terminals
is that it reduces power consumption, i.e. you get longer battery life.

cheers,
jamal



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Thursday 16 August 2007 00:22, Paul Mackerras wrote:
 Satyam Sharma writes:
 In the kernel we use atomic variables in precisely those situations
 where a variable is potentially accessed concurrently by multiple
 CPUs, and where each CPU needs to see updates done by other CPUs in a
 timely fashion.  That is what they are for.  Therefore the compiler
 must not cache values of atomic variables in registers; each
 atomic_read must result in a load and each atomic_set must result in a
 store.  Anything else will just lead to subtle bugs.

Amen.
--
vda


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Satyam Sharma
Hi Denys,


On Fri, 24 Aug 2007, Denys Vlasenko wrote:

 On Thursday 16 August 2007 01:39, Satyam Sharma wrote:
 
   static inline void wait_for_init_deassert(atomic_t *deassert)
   {
  -   while (!atomic_read(deassert));
  +   while (!atomic_read(deassert))
  +   cpu_relax();
  return;
   }
 
  For less-than-brilliant people like me, it's totally non-obvious that
 cpu_relax() is needed for correctness here, not just to make P4 happy.

This thread has been round-and-round with exactly the same discussions
:-) I had proposed few such variants to make a compiler barrier implicit
in atomic_{read,set} myself, but frankly, at least personally speaking
(now that I know better), I'm not so much in favour of implicit barriers
(compiler, memory or both) in atomic_{read,set}.

This might sound like an about-turn if you read my own postings to Nick
Piggin from a week back, but I do agree with most of his opinions on the
matter now -- separation of barriers from atomic ops is actually good,
beneficial to certain code that knows what it's doing, and explicit usage
of barriers stands out more clearly (most people here who deal with it
do know cpu_relax() is an explicit compiler barrier) compared to an
implicit usage in an atomic_read() or such variant ...


  IOW: the name atomic_read quite unambiguously means "I will read
  this variable from main memory". Which is not true and creates
  potential for confusion and bugs.

I'd have to disagree here -- atomic ops are all about _atomicity_ of
memory accesses, not _making_ them happen (or visible to other CPUs)
_then and there_ itself. The latter are the job of barriers.

The behaviour (and expectations) are quite comprehensively covered in
atomic_ops.txt -- let alone atomic_{read,set}, even atomic_{inc,dec}
are permitted by archs' implementations to _not_ have any memory
barriers, for that matter. [It is unrelated that on x86 making them
SMP-safe requires the use of the LOCK prefix that also happens to be
an implicit memory barrier.]

An argument was also made about consistency of atomic_{read,set} w.r.t.
the other atomic ops -- but clearly, they are all already consistent!
All of them are atomic :-) The fact that atomic_{read,set} do _not_
require any inline asm or LOCK prefix whereas the others do, has to do
with the fact that unlike all others, atomic_{read,set} are not RMW ops
and hence guaranteed to be atomic just as they are in plain  simple C.

But if people do seem to have a mixed / confused notion of atomicity
and barriers, and if there's consensus, then as I'd said earlier, I
have no issues in going with the consensus (eg. having API variants).
Linus would be more difficult to convince, however, I suspect :-)


Satyam


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Denys Vlasenko
On Friday 24 August 2007 13:12, Kenn Humborg wrote:
  On Thursday 16 August 2007 01:39, Satyam Sharma wrote:
static inline void wait_for_init_deassert(atomic_t *deassert)
{
   - while (!atomic_read(deassert));
   + while (!atomic_read(deassert))
   + cpu_relax();
 return;
}
 
   For less-than-brilliant people like me, it's totally non-obvious that
  cpu_relax() is needed for correctness here, not just to make P4 happy.
 
   IOW: the name atomic_read quite unambiguously means "I will read
   this variable from main memory". Which is not true and creates
   potential for confusion and bugs.

 To me, atomic_read means a read which is synchronized with other
 changes to the variable (using the atomic_XXX functions) in such
 a way that I will always only see the before or after
 state of the variable - never an intermediate state while a
 modification is happening.  It doesn't imply that I have to
 see the after state immediately after another thread modifies
 it.

So you are ok with the compiler propagating n1 to n2 here:

n1 += atomic_read(&x);
other_variable++;
n2 += atomic_read(&x);

without accessing x a second time. What's the point? Any sane coder
will say that explicitly anyway:

tmp = atomic_read(&x);
n1 += tmp;
other_variable++;
n2 += tmp;

if only for the sake of code readability. Because first code
is definitely hinting that it reads RAM twice, and it's actively *bad*
for code readability when in fact it's not the case!

Locking, compiler and CPU barriers are complicated enough already,
please don't make them even harder to understand.
--
vda


RFC: issues concerning the next NAPI interface

2007-08-24 Thread Jan-Bernd Themann
Hi,

when I tried to get the eHEA driver working with the new interface,
the following issues came up.

1) The current implementation of netif_rx_schedule, netif_rx_complete
   and net_rx_action has the following problem: netif_rx_schedule
   sets the NAPI_STATE_SCHED flag and adds the NAPI instance to the poll_list.
   net_rx_action checks NAPI_STATE_SCHED; if it is set, it will add the device
   to the poll_list again (as well). netif_rx_complete clears NAPI_STATE_SCHED.
   If an interrupt handler calls netif_rx_schedule on CPU 2
   after netif_rx_complete has been called on CPU 1 (and the poll function
   has not returned yet), the NAPI instance will be added twice to the
   poll_list (by netif_rx_schedule and net_rx_action). Problems occur when
   netif_rx_complete is then called twice for the device (BUG() is called).

2) If an ethernet chip supports multiple receive queues, the queues are
   currently all processed on the CPU where the interrupt comes in. This
   is because netif_rx_schedule will always add the rx queue to that CPU's
   napi poll_list. The result under heavy pressure is that all queues will
   gather on the weakest CPU (the one with the highest CPU load) after some
   time, as they will stay there until the entire queue is completely emptied.
   On SMP systems this behaviour is not desired. It should also work well
   without interrupt pinning.
   It would be nice if it were possible to schedule queues to other CPUs, or
   at least to use interrupts to move the queue to another CPU (not nice,
   as you never know which one you will hit).
   I'm not sure how bad the tradeoff would be.

3) On modern systems the incoming packets are processed very fast. Especially
   on SMP systems when we use multiple queues we process only a few packets
   per napi poll cycle. So NAPI does not work very well here and the interrupt
   rate is still high. What we need would be some sort of timer polling mode
   which will schedule a device after a certain amount of time for high load
   situations (see the sketch below). With high precision timers this could
   work well. Current usual timers are too slow. A finer granularity would be
   needed to keep the latency down (and queue length moderate).
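
A rough sketch of what such a timer-driven poll might look like (all
names are hypothetical; assumes hrtimers and the pre-2.6.24
netif_rx_schedule() interface):

	/* instead of re-enabling the device IRQ in the poll routine,
	 * re-arm a high-resolution timer that kicks the poll again */
	static enum hrtimer_restart ehea_poll_timer_fn(struct hrtimer *timer)
	{
		struct ehea_port *port = container_of(timer, struct ehea_port,
						      poll_timer);

		netif_rx_schedule(port->netdev);	/* queue NAPI poll */
		return HRTIMER_NORESTART;
	}

	/* in the poll routine, under high load: */
	hrtimer_start(&port->poll_timer,
		      ktime_set(0, 50 * 1000),	/* 50 usecs */
		      HRTIMER_MODE_REL);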

What do you think?

Thanks,
Jan-Bernd


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Jan-Bernd Themann
Hi,

On Friday 24 August 2007 17:37, [EMAIL PROTECTED] wrote:
 On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
  ...
  3) On modern systems the incoming packets are processed very fast. Especially
     on SMP systems when we use multiple queues we process only a few packets
     per napi poll cycle. So NAPI does not work very well here and the interrupt
     rate is still high. What we need would be some sort of timer polling mode
     which will schedule a device after a certain amount of time for high load
     situations. With high precision timers this could work well. Current
     usual timers are too slow. A finer granularity would be needed to keep the
     latency down (and queue length moderate).
 
 We found the same on ia64-sn systems with tg3 a couple of years
 ago. Using simple interrupt coalescing (don't interrupt until
 you've received N packets or M usecs have elapsed) worked
 reasonably well in practice. If your h/w supports that (and I'd
 guess it does, since it's such a simple thing), you might try
 it.
 

I don't see how this should work. Our latest machines are fast enough that they
simply empty the queue during the first poll iteration (in most cases).
Even if you wait until X packets have been received, it does not help for
the next poll cycle. The average number of packets we process per poll cycle
is low. So a timer would be preferable that periodically polls the
queue, without the need to generate a HW interrupt. This would allow us
to wait until a reasonable number of packets have been received in the
meantime, to keep the poll overhead low. This would also be useful in
combination with LRO.

Regards,
Jan-Bernd


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Stephen Hemminger
On Fri, 24 Aug 2007 17:47:15 +0200
Jan-Bernd Themann [EMAIL PROTECTED] wrote:

 Hi,
 
 On Friday 24 August 2007 17:37, [EMAIL PROTECTED] wrote:
  On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
   ...
   3) On modern systems the incoming packets are processed very fast. Especially
      on SMP systems when we use multiple queues we process only a few packets
      per napi poll cycle. So NAPI does not work very well here and the interrupt
      rate is still high. What we need would be some sort of timer polling mode
      which will schedule a device after a certain amount of time for high load
      situations. With high precision timers this could work well. Current
      usual timers are too slow. A finer granularity would be needed to keep the
      latency down (and queue length moderate).
  
  We found the same on ia64-sn systems with tg3 a couple of years
  ago. Using simple interrupt coalescing (don't interrupt until
  you've received N packets or M usecs have elapsed) worked
  reasonably well in practice. If your h/w supports that (and I'd
  guess it does, since it's such a simple thing), you might try
  it.
 
 I don't see how this should work. Our latest machines are fast enough that they
 simply empty the queue during the first poll iteration (in most cases).
 Even if you wait until X packets have been received, it does not help for
 the next poll cycle. The average number of packets we process per poll cycle
 is low. So a timer would be preferable that periodically polls the
 queue, without the need to generate a HW interrupt. This would allow us
 to wait until a reasonable number of packets have been received in the
 meantime, to keep the poll overhead low. This would also be useful in
 combination with LRO.
 

You need hardware support for deferred interrupts. Most devices have it
(e1000, sky2, tg3) and it interacts well with NAPI. It is not a generic
thing you want done by the stack; you want the hardware to hold off
interrupts until X packets or Y usecs have expired.

The parameters for controlling it are already in ethtool; the issue is
finding a good default set of values for a wide range of applications
and architectures. Maybe some heuristic based on processor speed would
be a good starting point. The dynamic irq moderation stuff is not widely
used because it is too hard to get right.
-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH 14/30] net: Kill some unneeded allocation return value casts in libertas

2007-08-24 Thread Dan Williams
On Fri, 2007-08-24 at 02:03 +0200, Jesper Juhl wrote:
 kmalloc() and friends return void*, no need to cast it.

Applied to libertas-2.6 'for-linville' branch, thanks.

Dan

 Signed-off-by: Jesper Juhl [EMAIL PROTECTED]
 ---
  drivers/net/wireless/libertas/debugfs.c |    2 +-
  drivers/net/wireless/libertas/ethtool.c |    3 +--
  2 files changed, 2 insertions(+), 3 deletions(-)
 
 diff --git a/drivers/net/wireless/libertas/debugfs.c b/drivers/net/wireless/libertas/debugfs.c
 index 715cbda..6ade63e 100644
 --- a/drivers/net/wireless/libertas/debugfs.c
 +++ b/drivers/net/wireless/libertas/debugfs.c
 @@ -1839,7 +1839,7 @@ static ssize_t wlan_debugfs_write(struct file *f, const char __user *buf,
 	char *p2;
 	struct debug_data *d = (struct debug_data *)f->private_data;
 
 -	pdata = (char *)kmalloc(cnt, GFP_KERNEL);
 +	pdata = kmalloc(cnt, GFP_KERNEL);
 	if (pdata == NULL)
 		return 0;
 
 diff --git a/drivers/net/wireless/libertas/ethtool.c b/drivers/net/wireless/libertas/ethtool.c
 index 96f1974..7dad493 100644
 --- a/drivers/net/wireless/libertas/ethtool.c
 +++ b/drivers/net/wireless/libertas/ethtool.c
 @@ -60,8 +60,7 @@ static int libertas_ethtool_get_eeprom(struct net_device *dev,
 
 //	mutex_lock(&priv->mutex);
 
 -	adapter->prdeeprom =
 -		(char *)kmalloc(eeprom->len+sizeof(regctrl), GFP_KERNEL);
 +	adapter->prdeeprom = kmalloc(eeprom->len+sizeof(regctrl), GFP_KERNEL);
 	if (!adapter->prdeeprom)
 		return -ENOMEM;
 	memcpy(adapter->prdeeprom, &regctrl, sizeof(regctrl));



Re: [Devel] [PATCH 1/1] Dynamically allocate the loopback device

2007-08-24 Thread Denis V. Lunev
[EMAIL PROTECTED] wrote:
 From: Daniel Lezcano [EMAIL PROTECTED]
 
 Doing this makes loopback.c a better example of how to do a
 simple network device, and it removes the special case
 single static allocation of a struct net_device, hopefully
 making maintenance easier.
 
 Applies against net-2.6.24
 
 Tested on i386, x86_64
 Compiled on ia64, sparc

I think a small note that the initialization order is changed would be good
to record. After this, the loopback device MUST be allocated before any other
networking subsystem initialization. And this is an important change.

Regards,
Den


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread akepner
On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
 ...
 3) On modern systems the incoming packets are processed very fast. Especially
    on SMP systems when we use multiple queues we process only a few packets
    per napi poll cycle. So NAPI does not work very well here and the interrupt
    rate is still high. What we need would be some sort of timer polling mode
    which will schedule a device after a certain amount of time for high load
    situations. With high precision timers this could work well. Current
    usual timers are too slow. A finer granularity would be needed to keep the
    latency down (and queue length moderate).
 

We found the same on ia64-sn systems with tg3 a couple of years 
ago. Using simple interrupt coalescing (don't interrupt until 
you've received N packets or M usecs have elapsed) worked 
reasonably well in practice. If your h/w supports that (and I'd 
guess it does, since it's such a simple thing), you might try 
it.

-- 
Arthur



[PATCH 1/1] Dynamically allocate the loopback device

2007-08-24 Thread dlezcano
From: Daniel Lezcano [EMAIL PROTECTED]

Doing this makes loopback.c a better example of how to do a
simple network device, and it removes the special case
single static allocation of a struct net_device, hopefully
making maintenance easier.

Applies against net-2.6.24

Tested on i386, x86_64
Compiled on ia64, sparc

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-By: Kirill Korotaev [EMAIL PROTECTED]
Acked-by: Benjamin Thery [EMAIL PROTECTED]
---
 drivers/net/loopback.c           |   63 +++---
 include/linux/netdevice.h        |    2 +-
 net/core/dst.c                   |    8 ++--
 net/decnet/dn_dev.c              |    4 +-
 net/decnet/dn_route.c            |   14 
 net/ipv4/devinet.c               |    6 ++--
 net/ipv4/ipconfig.c              |    6 ++--
 net/ipv4/ipvs/ip_vs_core.c       |    2 +-
 net/ipv4/route.c                 |   18 +-
 net/ipv4/xfrm4_policy.c          |    2 +-
 net/ipv6/addrconf.c              |   15 +---
 net/ipv6/ip6_input.c             |    2 +-
 net/ipv6/netfilter/ip6t_REJECT.c |    2 +-
 net/ipv6/route.c                 |   15 +++-
 net/ipv6/xfrm6_policy.c          |    2 +-
 net/xfrm/xfrm_policy.c           |    4 +-
 16 files changed, 89 insertions(+), 76 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 5106c23..3642aff 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -199,44 +199,57 @@ static const struct ethtool_ops loopback_ethtool_ops = {
 	.get_rx_csum		= always_on,
 };
 
-/*
- * The loopback device is special. There is only one instance and
- * it is statically allocated. Don't do this for other devices.
- */
-struct net_device loopback_dev = {
-	.name			= "lo",
-	.get_stats		= get_stats,
-	.mtu			= (16 * 1024) + 20 + 20 + 12,
-	.hard_start_xmit	= loopback_xmit,
-	.hard_header		= eth_header,
-	.hard_header_cache	= eth_header_cache,
-	.header_cache_update	= eth_header_cache_update,
-	.hard_header_len	= ETH_HLEN,		/* 14	*/
-	.addr_len		= ETH_ALEN,		/* 6	*/
-	.tx_queue_len		= 0,
-	.type			= ARPHRD_LOOPBACK,	/* 0x0001 */
-	.rebuild_header		= eth_rebuild_header,
-	.flags			= IFF_LOOPBACK,
-	.features		= NETIF_F_SG | NETIF_F_FRAGLIST
+static void loopback_setup(struct net_device *dev)
+{
+	dev->get_stats		= get_stats;
+	dev->mtu		= (16 * 1024) + 20 + 20 + 12;
+	dev->hard_start_xmit	= loopback_xmit;
+	dev->hard_header	= eth_header;
+	dev->hard_header_cache	= eth_header_cache;
+	dev->header_cache_update = eth_header_cache_update;
+	dev->hard_header_len	= ETH_HLEN;		/* 14	*/
+	dev->addr_len		= ETH_ALEN;		/* 6	*/
+	dev->tx_queue_len	= 0;
+	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001 */
+	dev->rebuild_header	= eth_rebuild_header;
+	dev->flags		= IFF_LOOPBACK;
+	dev->features		= NETIF_F_SG | NETIF_F_FRAGLIST
 #ifdef LOOPBACK_TSO
 				  | NETIF_F_TSO
 #endif
 				  | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA
-				  | NETIF_F_LLTX,
-	.ethtool_ops		= &loopback_ethtool_ops,
-};
+				  | NETIF_F_LLTX;
+	dev->ethtool_ops	= &loopback_ethtool_ops;
+}
 
 /* Setup and register the loopback device. */
 static int __init loopback_init(void)
 {
-	int err = register_netdev(&loopback_dev);
+	struct net_device *dev;
+	int err;
+
+	err = -ENOMEM;
+	dev = alloc_netdev(0, "lo", loopback_setup);
+	if (!dev)
+		goto out;
+
+	err = register_netdev(dev);
+	if (err)
+		goto out_free_netdev;
 
+	err = 0;
+	loopback_dev = dev;
+
+out:
 	if (err)
 		panic("loopback: Failed to register netdevice: %d\n", err);
-
 	return err;
+out_free_netdev:
+	free_netdev(dev);
+	goto out;
 };
 
-module_init(loopback_init);
+fs_initcall(loopback_init);
 
+struct net_device *loopback_dev;
 EXPORT_SYMBOL(loopback_dev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8d12f02..7cd0641 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -680,7 +680,7 @@ struct packet_type {
 #include <linux/interrupt.h>
 #include <linux/notifier.h>
 
-extern struct net_device	loopback_dev;	/* The loopback */
+extern struct net_device	*loopback_dev;	/* The loopback */
 extern struct list_head		dev_base_head;	/* All devices */
 extern rwlock_t			dev_base_lock;

Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread John Heffner

TJ wrote:

Right now Juniper are claiming the issue that brought this to the
surface (the bug linked to in my original post) is a problem with the
implementation of TCP_DEFER_ACCEPT.

My position so far is that the Juniper DX OS is not following the HTTP
standard because it doesn't send a request with the connection, and as I
read the end of section 1.4 of RFC2616, an HTTP connection should be
accompanied by a request.

Can anyone confirm my interpretation or provide references to firm it
up, or refute it?


You can think of TCP_DEFER_ACCEPT as an implicit application close() 
after a certain timeout, when no request is received.  All HTTP servers 
do this anyway (though I think technically they're supposed to send a 
408 Request Timeout error, it seems many do not).  It's a very valid 
question for Juniper as to why their box is failing to fill requests 
when its back-end connection has gone away, instead of re-establishing 
the connection and filling the request.


  -John


RE: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Luck, Tony
  static inline void wait_for_init_deassert(atomic_t *deassert)
  {
 -while (!atomic_read(deassert));
 +while (!atomic_read(deassert))
 +cpu_relax();
  return;
  }

 For less-than-brilliant people like me, it's totally non-obvious that
 cpu_relax() is needed for correctness here, not just to make P4 happy.

Not just P4 ... there are other threaded cpus where it is useful to
let the core know that this is a busy loop so it would be a good thing
to let other threads have priority.

Even on a non-threaded cpu the cpu_relax() could be useful in the
future to hint to the cpu that it could drop into a lower-power state.

But I agree with your main point that the loop without the cpu_relax()
looks like it ought to work because atomic_read() ought to actually
go out and read memory each time around the loop.

-Tony


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Linas Vepstas
On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
 3) On modern systems the incoming packets are processed very fast. Especially
    on SMP systems when we use multiple queues we process only a few packets
    per napi poll cycle. So NAPI does not work very well here and the interrupt
    rate is still high.

I saw this too, on a system that is modern but not terribly fast, and
only slightly (2-way) smp. (the spidernet)

I experimented with various solutions, none were terribly exciting.  The
thing that killed all of them was a crazy test case that someone sprung on
me:  they had written a worst-case network ping-pong app: send one
packet, wait for reply, send one packet, etc.

If I waited (indefinitely) for a second packet to show up, the test case 
completely stalled (since no second packet would ever arrive).  And if I 
introduced a timer to wait for a second packet, then I just increased 
the latency in the response to the first packet, and this was noticed, 
and folks complained.  

In the end, I just let it be, and let the system work as a busy-beaver, 
with the high interrupt rate. Is this a wise thing to do?  I was
thinking that, if the system is under heavy load, then the interrupt
rate would fall, since (for less pathological network loads) more 
packets would queue up before the poll was serviced.  But I did not
actually measure the interrupt rate under heavy load ... 

--linas


Re: [Devel] [PATCH 1/1] Dynamically allocate the loopback device

2007-08-24 Thread Daniel Lezcano

Denis V. Lunev wrote:

[EMAIL PROTECTED] wrote:

From: Daniel Lezcano [EMAIL PROTECTED]

Doing this makes loopback.c a better example of how to do a
simple network device, and it removes the special case
single static allocation of a struct net_device, hopefully
making maintenance easier.

Applies against net-2.6.24

Tested on i386, x86_64
Compiled on ia64, sparc


I think a small note that the initialization order is changed would be good
to record. After this, the loopback device MUST be allocated before any other
networking subsystem initialization. And this is an important change.

Regards,
Den



Thanks Denis to point that.

-- Daniel



Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread David Stevens
Stephen Hemminger [EMAIL PROTECTED] wrote on 08/24/2007 08:52:03 AM:

 You need hardware support for deferred interrupts. Most devices have it
 (e1000, sky2, tg3) and it interacts well with NAPI. It is not a generic
 thing you want done by the stack, you want the hardware to hold off
 interrupts until X packets or Y usecs have expired.

For generic hardware that doesn't support it, couldn't you use an
estimator and adjust the timer dynamically in software based on sampled
values? Switch to per-packet interrupts when the receive rate is low...
Actually, that's how I thought NAPI worked before I found out otherwise
(ie, before I looked :-)).

The hardware-accelerated one is essentially siloing as done by ancient
serial devices on UNIX systems. If you had a tunable for a target count,
and an estimator for the time interval, then switch to per-packet when
the estimator exceeds a tunable max threshold (and also, I suppose, if
you near overflowing the ring on the min timer granularity), you get
almost all of it, right?
Problem is if it increases rapidly, you may drop packets before you
notice that the ring is full in the current estimated interval.

 +-DLS




Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Linas Vepstas
On Fri, Aug 24, 2007 at 08:52:03AM -0700, Stephen Hemminger wrote:
 
 You need hardware support for deferred interrupts. Most devices have it 
 (e1000, sky2, tg3)
 and it interacts well with NAPI. It is not a generic thing you want done by 
 the stack,
 you want the hardware to hold off interrupts until X packets or Y usecs have 
 expired.

Just to be clear, in the previous email I posted on this thread, I
described a worst-case network ping-pong test case (send a packet, wait
for reply), and found out that a deferred interrupt scheme just damaged
the performance of the test case.  Since the folks who came up with the
test case were adamant, I turned off the deferred interrupts.
While deferred interrupts are an obvious solution, I decided that
they weren't a good solution. (And I have no other solution to offer).

--linas

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread Rick Jones


A current hot topic of research is reducing the number of ACKs to make TCP
work better over asymmetric links like 3G.


Oy.  People running Solaris and HP-UX have been researching ACK reductions 
since 1997 if not earlier.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Christoph Lameter
On Fri, 24 Aug 2007, Satyam Sharma wrote:

 But if people do seem to have a mixed / confused notion of atomicity
 and barriers, and if there's consensus, then as I'd said earlier, I
 have no issues in going with the consensus (eg. having API variants).
 Linus would be more difficult to convince, however, I suspect :-)

The confusion may be the result of us having barrier semantics in 
atomic_read. If we take that out then we may avoid future confusions.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread James Chapman

Stephen Hemminger wrote:

On Fri, 24 Aug 2007 17:47:15 +0200
Jan-Bernd Themann [EMAIL PROTECTED] wrote:


Hi,

On Friday 24 August 2007 17:37, [EMAIL PROTECTED] wrote:

On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:

...
3) On modern systems the incoming packets are processed very fast.
   Especially on SMP systems when we use multiple queues we process only
   a few packets per napi poll cycle. So NAPI does not work very well
   here and the interrupt rate is still high. What we need would be some
   sort of timer polling mode which will schedule a device after a
   certain amount of time for high load situations. With high precision
   timers this could work well. Current usual timers are too slow. A
   finer granularity would be needed to keep the latency down (and queue
   length moderate).

We found the same on ia64-sn systems with tg3 a couple of years 
ago. Using simple interrupt coalescing (don't interrupt until 
you've received N packets or M usecs have elapsed) worked 
reasonably well in practice. If your h/w supports that (and I'd 
guess it does, since it's such a simple thing), you might try 
it.



I don't see how this should work. Our latest machines are fast enough that they
simply empty the queue during the first poll iteration (in most cases).
Even if you wait until X packets have been received, it does not help for
the next poll cycle. The average number of packets we process per poll queue
is low. So a timer would be preferable that periodically polls the 
queue, without the need of generating a HW interrupt. This would allow us

to wait until a reasonable amount of packets have been received in the meantime
to keep the poll overhead low. This would also be useful in combination
with LRO.



You need hardware support for deferred interrupts. Most devices have it (e1000, 
sky2, tg3)
and it interacts well with NAPI. It is not a generic thing you want done by the 
stack,
you want the hardware to hold off interrupts until X packets or Y usecs have 
expired.


Does hardware interrupt mitigation really interact well with NAPI? In my 
experience, holding off interrupts for X packets or Y usecs does more 
harm than good; such hardware features are useful only when the OS has 
no NAPI-like mechanism.


When tuning NAPI drivers for packets/sec performance (which is a good 
indicator of driver performance), I make sure that the driver stays in 
NAPI polled mode while it has any rx or tx work to do. If the CPU is 
fast enough that all work is always completed on each poll, I have the 
driver stay in polled mode until dev->poll() is called N times with no 
work being done. This keeps interrupts disabled for reasonable traffic 
levels, while minimizing packet processing latency. No need for hardware 
interrupt mitigation.
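
In rough code (against the 2.6.23-style dev->poll() interface; my_priv,
my_rx(), my_tx_clean() and the idle threshold are invented names):

#define IDLE_POLLS_MAX 4	/* empty polls before re-enabling the irq */

struct my_priv {
	int idle_polls;
	/* ... */
};

static int my_poll(struct net_device *dev, int *budget)
{
	struct my_priv *priv = netdev_priv(dev);
	int limit = min(dev->quota, *budget);
	int rx_done = my_rx(dev, limit);	/* packets received */
	int tx_done = my_tx_clean(dev);		/* descriptors freed */

	*budget -= rx_done;
	dev->quota -= rx_done;

	if (rx_done || tx_done)
		priv->idle_polls = 0;
	else if (++priv->idle_polls >= IDLE_POLLS_MAX) {
		netif_rx_complete(dev);		/* leave polled mode... */
		my_enable_irq(dev);		/* ...and re-arm interrupts */
		return 0;
	}
	return 1;				/* stay on the poll list */
}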



The parameters for controlling it are already in ethtool, the issue is finding 
a good
default set of values for a wide range of applications and architectures. Maybe 
some
heuristic based on processor speed would be a good starting point. The dynamic 
irq
moderation stuff is not widely used because it is too hard to get right.


I agree. It would be nice to find a way for the typical user to derive 
best values for these knobs for his/her particular system. Perhaps a 
tool using pktgen and network device phy internal loopback could be 
developed?


--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Linus Torvalds


On Fri, 24 Aug 2007, Denys Vlasenko wrote:

  No, you don't use x.counter++. But you *do* use
 
  if (atomic_read(&x) <= 1)
 
  and loading into a register is stupid and pointless, when you could just
  do it as a regular memory-operand to the cmp instruction.
 
 It doesn't mean that (volatile int*) cast is bad, it means that current gcc
 is bad (or not good enough). IOW: instead of avoiding volatile cast,
 it's better to fix the compiler.

I would agree that fixing the compiler in this case would be a good thing, 
even quite regardless of any atomic_read() discussion.

I just have a strong suspicion that volatile performance is so low down 
the list of any C compiler persons interest, that it's never going to 
happen. And quite frankly, I cannot blame the gcc guys for it.

That's especially as volatile really isn't a very good feature of the C 
language, and is likely to get *less* interesting rather than more (as 
user space starts to be more and more threaded, volatile gets less and 
less useful).

[ Ie, currently, I think you can validly use volatile in a sigatomic_t 
  kind of way, where there is a single thread, but with asynchronous 
  events. In that kind of situation, I think it's probably useful. But 
  once you get multiple threads, it gets pointless.

  Sure: you could use volatile together with something like Dekker's or 
  Peterson's algorithm that doesn't depend on cache coherency (that's 
  basically what the C volatile keyword approximates: not atomic 
  accesses, but *uncached* accesses!) But let's face it, that's way past 
  insane. ]

So I wouldn't expect volatile to ever really generate better code. It 
might happen as a side effect of other improvements (eg, I might hope that 
the SSA work would eventually lead to gcc having a much better defined 
model of valid optimizations, and maybe better code generation for 
volatile accesses fall out cleanly out of that), but in the end, it's such 
an ugly special case in C, and so seldom used, that I wouldn't depend on 
it.

 Linus, in all honesty gcc has many more cases of suboptimal code,
 case of volatile is just one of many.

Well, the thing is, quite often, many of those suboptimal code 
generations fall into two distinct classes:

 - complex C code. I can't really blame the compiler too much for this. 
   Some things are *hard* to optimize, and for various scalability 
   reasons, you often end up having limits in the compiler where it 
   doesn't even _try_ doing certain optimizations if you have excessive 
   complexity.

 - bad register allocation. Register allocation really is hard, and 
   sometimes gcc just does the obviously wrong thing, and you end up 
   having totally unnecessary spills.

 Off the top of my head:

Yes, unsigned long long with x86 has always generated atrocious code. In 
fact, I would say that historically it was really *really* bad. These 
days, gcc actually does a pretty good job, but I'm not surprised that it's 
still quite possible to find cases where it did some optimization (in this 
case, apparently noticing that shift by >= 32 bits causes the low 
register to be pointless) and then missed *another* optimization (better 
register use) because that optimization had been done *before* the first 
optimization was done.

That's a *classic* example of compiler code generation issues, and quite 
frankly, I think that's very different from the issue of volatile.

Quite frankly, I'd like there to be more competition in the open source 
compiler game, and that might cause some upheavals, but on the whole, gcc 
actually does a pretty damn good job. 

Linus
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Rick Jones

Just to be clear, in the previous email I posted on this thread, I
described a worst-case network ping-pong test case (send a packet, wait
for reply), and found out that a deferred interrupt scheme just damaged
the performance of the test case.  Since the folks who came up with the
test case were adamant, I turned off the deferred interrupts.
While deferred interrupts are an obvious solution, I decided that
they weren't a good solution. (And I have no other solution to offer).


Sounds exactly like the default netperf TCP_RR test and any number of other 
benchmarks.  The send a request, wait for reply, send next request, etc etc 
etc is a rather common application behaviour after all.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [3/4] 2.6.23-rc3: known regressions v3

2007-08-24 Thread Michal Piotrowski
Hi all,

Here is a list of some known regressions in 2.6.23-rc3.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name               Regressions fixed since 21-Jun-2007
Adrian Bunk        9
Andi Kleen         5
Linus Torvalds     5
Andrew Morton      4
Al Viro            3
Alan Stern         3
Cornelia Huck      3
Jens Axboe         3
Tejun Heo          3



Networking

Subject : NETDEV WATCHDOG: eth0: transmit timed out
References  : http://lkml.org/lkml/2007/8/13/737
Last known good : ?
Submitter   : Karl Meyer [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : Francois Romieu [EMAIL PROTECTED]
Status  : problem is being debugged

Subject : Weird network problems with 2.6.23-rc2
References  : http://lkml.org/lkml/2007/8/11/40
Last known good : ?
Submitter   : Shish [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown

Subject : New wake ups from sky2
References  : http://lkml.org/lkml/2007/7/20/386
Last known good : ?
Submitter   : Thomas Meyer [EMAIL PROTECTED]
Caused-By   : Stephen Hemminger [EMAIL PROTECTED]
  commit eb35cf60e462491249166182e3e755d3d5d91a28
Handled-By  : Stephen Hemminger [EMAIL PROTECTED]
Status  : unknown



Power management

Subject : 2.6.23-rc2 swsusp, suddenly increased uptime
References  : http://lkml.org/lkml/2007/8/12/249
Last known good : ?
Submitter   : Thomas Voegtle [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : Rafael J. Wysocki [EMAIL PROTECTED]
Status  : problem is being debugged

Subject : resume from ram much slower
References  : http://lkml.org/lkml/2007/8/10/275
Last known good : 2.6.23-rc1 ?
Submitter   : Arkadiusz Miskiewicz [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : Rafael J. Wysocki [EMAIL PROTECTED]
Status  : problem is being debugged



Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2/2] 2.6.23-rc3: known regressions with patches v3

2007-08-24 Thread Michal Piotrowski
Hi all,

Here is a list of some known regressions in 2.6.23-rc3
with patches available.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name               Regressions fixed since 21-Jun-2007
Adrian Bunk        9
Andi Kleen         5
Linus Torvalds     5
Andrew Morton      4
Al Viro            3
Alan Stern         3
Cornelia Huck      3
Jens Axboe         3
Tejun Heo          3



MTD

Subject : error: implicit declaration of function 'cfi_interleave'
References  : http://lkml.org/lkml/2007/8/6/272
Last known good : ?
Submitter   : Ingo Molnar [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : David Woodhouse [EMAIL PROTECTED]
Patch   : http://lkml.org/lkml/2007/8/9/586
Status  : patch available



Networking

Subject : BUG: when using 'brctl stp'
References  : http://lkml.org/lkml/2007/8/10/441
Last known good : 2.6.23-rc1
Submitter   : Daniel K. [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : Stephen Hemminger [EMAIL PROTECTED]
Status  : fix applied by David Miller

Subject : sky2 boot crash in sky2_mac_intr
References  : http://lkml.org/lkml/2007/7/24/91
Last known good : ?
Submitter   : Florian Lohoff [EMAIL PROTECTED]
Caused-By   : 
Handled-By  : Stephen Hemminger [EMAIL PROTECTED]
Patch   : http://marc.info/?l=linux-netdevm=118651402523966w=2
Status  : patch available



Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [3/4] 2.6.23-rc3: known regressions v3

2007-08-24 Thread Stephen Hemminger
 Subject : New wake ups from sky2
 References  : http://lkml.org/lkml/2007/7/20/386
 Last known good : ?
 Submitter   : Thomas Meyer [EMAIL PROTECTED]
 Caused-By   : Stephen Hemminger [EMAIL PROTECTED]
   commit eb35cf60e462491249166182e3e755d3d5d91a28
 Handled-By  : Stephen Hemminger [EMAIL PROTECTED]
 Status  : unknown
 


Fix posted to netdev (sky2 1.17 series), but Jeff hasn't 
applied it.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Shirley Ma
 Just to be clear, in the previous email I posted on this thread, I
 described a worst-case network ping-pong test case (send a packet, wait
 for reply), and found out that a deferred interrupt scheme just damaged
 the performance of the test case. 

When splitting the rx and tx handlers, I found some performance gain from
deferring interrupts on tx, not rx, in the IPoIB driver.

Shirley
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] [PATCH 1/1] Dynamically allocate the loopback device

2007-08-24 Thread Stephen Hemminger
On Fri, 24 Aug 2007 19:55:47 +0400
Denis V. Lunev [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] wrote:
  From: Daniel Lezcano [EMAIL PROTECTED]
  
  Doing this makes loopback.c a better example of how to do a
  simple network device, and it removes the special case
  single static allocation of a struct net_device, hopefully
  making maintenance easier.
  
  Applies against net-2.6.24
  
  Tested on i386, x86_64
  Compiled on ia64, sparc
 
 I think that a small note, that initialization order is changed will be
 good to record. After this, loopback MUST be allocated before any other
 networking subsystem initialization. And this is an important change.
 
 Regards,
 Den

Yes, this code would break when other drivers are directly linked
in. 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Linus Torvalds


On Fri, 24 Aug 2007, Denys Vlasenko wrote:
 
 So you are ok with compiler propagating n1 to n2 here:
 
 n1 += atomic_read(x);
 other_variable++;
 n2 += atomic_read(x);
 
 without accessing x a second time. What's the point? Any sane coder
 will say that explicitly anyway:

No.

This is a common mistake, and it's total crap.

Any sane coder will often use inline functions, macros, etc helpers to 
do certain abstract things. Those things may contain atomic_read() 
calls.

The biggest reason for compilers doing CSE is exactly the fact that many 
opportunities for CSE simply *are*not*visible* on a source code level. 

That is true of things like atomic_read() equally as to things like shared 
offsets inside structure member accesses. No difference what-so-ever.
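
To make that concrete with a toy example (invented names, not from any
real driver):

#define QUEUE_MAX 128
struct queue { atomic_t pending; };

static inline int queue_busy(struct queue *q)
{
	return atomic_read(&q->pending) > 0;
}

static inline int queue_full(struct queue *q)
{
	return atomic_read(&q->pending) >= QUEUE_MAX;
}

int queue_check(struct queue *q)
{
	/* No repeated read is visible here at the source level: only
	 * after inlining does the compiler see two loads of q->pending
	 * that it could merge -- exactly the CSE that a volatile
	 * atomic_read() forbids. */
	if (queue_full(q))
		return -1;
	return queue_busy(q);
}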

Yes, we have, traditionally, tried to make it *easy* for the compiler to 
generate good code. So when we can, and when we look at performance for 
some really hot path, we *will* write the source code so that the compiler 
doesn't even have the option to screw it up, and that includes things like 
doing CSE at a source code level so that we don't see the compiler 
re-doing accesses unnecessarily.

And I'm not saying we shouldn't do that. But performance is not an 
either-or kind of situation, and we should:

 - spend the time at a source code level: make it reasonably easy for the 
   compiler to generate good code, and use the right algorithms at a 
   higher level (and order structures etc so that they have good cache 
   behaviour).

 - .. *and* expect the compiler to handle the cases we didn't do by hand
   pretty well anyway. In particular, quite often, abstraction levels at a 
   software level means that we give compilers stupid code, because some 
   function may have a certain high-level abstraction rule, but then on a 
   particular architecture it's actually a no-op, and the compiler should 
   get to untangle our stupid code and generate good end results.

 - .. *and* expect the hardware to be sane and do a good job even when the 
   compiler didn't generate perfect code or there were unlucky cache miss
   patterns etc.

and if we do all of that, we'll get good performance. But you really do 
want all three levels. It's not enough to be good at any one level (or 
even any two).

Linus
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] isdn capi driver broken on 64 bit.

2007-08-24 Thread Stephen Hemminger
The following driver API is broken on any architecture with 64 bit addresses,
because of a cast that loses the high bits.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- a/drivers/isdn/capi/capidrv.c   2007-06-25 09:03:12.0 -0700
+++ b/drivers/isdn/capi/capidrv.c   2007-08-24 11:06:46.0 -0700
@@ -1855,6 +1855,9 @@ static int if_sendbuf(int id, int channe
 		return 0;
 	}
 	datahandle = nccip->datahandle;
+
+	/* This won't work on 64 bit! */
+	BUILD_BUG_ON(sizeof(skb->data) > sizeof(u32));
 	capi_fill_DATA_B3_REQ(&sendcmsg, global.ap.applid, card->msgid++,
 			      nccip->ncci,	/* adr */
 			      (u32) skb->data,	/* Data */
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread Bill Fink
On Fri, 24 Aug 2007, jamal wrote:

 On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
 
 [..]
  Here you can see there is a major difference in the TX CPU utilization
  (99 % with TSO disabled versus only 39 % with TSO enabled), although
  the TSO disabled case was able to squeeze out a little extra performance
  from its extra CPU utilization.  
 
 Good stuff. What kind of machine? SMP?

Tyan Thunder K8WE S2895ANRF motherboard with Nvidia nForce
Professional 2200+2050 chipset, 2 AMD Opteron 254 2.8 GHz CPUs,
4 GB PC3200 ECC REG-DDR 400 memory, and 2 PCI-Express x16 slots
(2 buses).

It is SMP but both the NIC interrupts and nuttcp are bound to
CPU 0.  And all other non-kernel system processes are bound to
CPU 1.

 Seems the receive side of the sender is also consuming a lot more cpu
 i suspect because receiver is generating a lot more ACKs with TSO.

Odd.  I just reran the TCP CUBIC -M1460 tests, and with TSO enabled
on the transmitter, there were about 153709 eth2 interrupts on the
receiver, while with TSO disabled there was actually a somewhat higher
number (164988) of receiver side eth2 interrupts, although the receive
side CPU utilization was actually lower in that case.

On the transmit side (different test run), the TSO enabled case had
about 161773 eth2 interrupts whereas the TSO disabled case had about
165179 eth2 interrupts.

 Does the choice of the tcp congestion control algorithm affect results?
 it would be interesting to see both MTUs with either TCP BIC vs good old
 reno on sender (probably without changing what the receiver does). BIC
 seems to be the default lately.

These tests were with the default TCP CUBIC (with initial_ssthresh
set to 0).

With TCP BIC (and initial_ssthresh set to 0):

TSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11751.3750 MB /  10.00 sec = 9853.9839 Mbps 100 %TX 83 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4999.3321 MB /  10.06 sec = 4167.7872 Mbps 38 %TX 100 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB /  10.00 sec = 9910.0682 Mbps 99 %TX 81 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5502.6250 MB /  10.00 sec = 4614.3297 Mbps 100 %TX 84 %RX

And with TCP Reno:

TSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11782.6250 MB /  10.00 sec = 9880.2613 Mbps 100 %TX 77 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5024.6649 MB /  10.06 sec = 4191.6574 Mbps 38 %TX 99 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB /  10.00 sec = 9910.0860 Mbps 99 %TX 77 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5284. MB /  10.00 sec = 4430.9604 Mbps 99 %TX 79 %RX

Very similar results to the original TCP CUBIC tests.

  Interestingly, with TSO enabled, the
  receiver actually consumed more CPU than with TSO disabled, 
 
 I would suspect the fact that a lot more packets making it into the
 receiver for TSO contributes.
 
  so I guess
  the receiver CPU saturation in that case (99 %) was what restricted
  its performance somewhat (this was consistent across a few test runs).
 
 Unfortunately the receiver plays a big role in such tests - if it is
 bottlenecked then you are not really testing the limits of the
 transmitter. 

It might be interesting to see what effect the LRO changes would have
on this.  Once they are in a stable released kernel, I might try that
out, or maybe even before if I get some spare time (but that's in very
short supply right now).

-Thanks

-Bill
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Jan-Bernd Themann

James Chapman wrote:

Stephen Hemminger wrote:

On Fri, 24 Aug 2007 17:47:15 +0200
Jan-Bernd Themann [EMAIL PROTECTED] wrote:


Hi,

On Friday 24 August 2007 17:37, [EMAIL PROTECTED] wrote:

On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:

...
3) On modern systems the incoming packets are processed very fast.
   Especially on SMP systems when we use multiple queues we process only
   a few packets per napi poll cycle. So NAPI does not work very well
   here and the interrupt rate is still high. What we need would be some
   sort of timer polling mode which will schedule a device after a
   certain amount of time for high load situations. With high precision
   timers this could work well. Current usual timers are too slow. A
   finer granularity would be needed to keep the latency down (and queue
   length moderate).

We found the same on ia64-sn systems with tg3 a couple of years 
ago. Using simple interrupt coalescing (don't interrupt until 
you've received N packets or M usecs have elapsed) worked 
reasonably well in practice. If your h/w supports that (and I'd 
guess it does, since it's such a simple thing), you might try it.


I don't see how this should work. Our latest machines are fast enough
that they simply empty the queue during the first poll iteration (in
most cases). Even if you wait until X packets have been received, it
does not help for the next poll cycle. The average number of packets we
process per poll queue is low. So a timer would be preferable that
periodically polls the queue, without the need of generating a HW
interrupt. This would allow us to wait until a reasonable amount of
packets have been received in the meantime to keep the poll overhead
low. This would also be useful in combination with LRO.



You need hardware support for deferred interrupts. Most devices have 
it (e1000, sky2, tg3)
and it interacts well with NAPI. It is not a generic thing you want 
done by the stack,
you want the hardware to hold off interrupts until X packets or Y 
usecs have expired.


Does hardware interrupt mitigation really interact well with NAPI? In 
my experience, holding off interrupts for X packets or Y usecs does 
more harm than good; such hardware features are useful only when the 
OS has no NAPI-like mechanism.


When tuning NAPI drivers for packets/sec performance (which is a good 
indicator of driver performance), I make sure that the driver stays in 
NAPI polled mode while it has any rx or tx work to do. If the CPU is 
fast enough that all work is always completed on each poll, I have the 
driver stay in polled mode until dev->poll() is called N times with no 
work being done. This keeps interrupts disabled for reasonable traffic 
levels, while minimizing packet processing latency. No need for 
hardware interrupt mitigation.
Yes, that was one idea as well. But the problem with that is that
net_rx_action will call the same poll function over and over again in a
row if there are no further network devices. The problem with this
approach is that you always poll just a very few packets each time. This
does not work well with LRO, as there are no packets to aggregate...
So it would make more sense to wait for a certain time before trying it
again.
Second problem: after jiffies has incremented by one in net_rx_action
(after some poll rounds), net_rx_action will quit and return control to
the softIRQ handler. The poll function is called again as the softIRQ
handler thinks there is more work to be done. So even then we do not
wait... After some rounds in the softIRQ handler, we finally wait some
time.




The parameters for controlling it are already in ethtool, the issue 
is finding a good
default set of values for a wide range of applications and 
architectures. Maybe some
heuristic based on processor speed would be a good starting point. 
The dynamic irq

moderation stuff is not widely used because it is too hard to get right.


I agree. It would be nice to find a way for the typical user to derive 
best values for these knobs for his/her particular system. Perhaps a 
tool using pktgen and network device phy internal loopback could be 
developed?





-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread Rick Jones

Bill Fink wrote:

On Thu, 23 Aug 2007, Rick Jones wrote:



jamal wrote:


[TSO already passed - iirc, it has been
demonstrated to really not add much to throughput (can't improve much
over closeness to wire speed) but improve CPU utilization].


In the one gig space sure, but in the 10 Gig space, TSO on/off does make a 
difference for throughput.



Not too much.

TSO enabled:

[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB /  10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# ethtool -K eth2 tso off
[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB /  10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

Pretty negligible difference it seems.


Leaves one wondering how often more than one segment was sent to the card in the 
9000 byte case :)


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Bodo Eggert
Linas Vepstas [EMAIL PROTECTED] wrote:
 On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:

 3) On modern systems the incoming packets are processed very fast. Especially
 on SMP systems when we use multiple queues we process only a few packets
 per napi poll cycle. So NAPI does not work very well here and the interrupt
 rate is still high.
 
 I saw this too, on a system that is modern but not terribly fast, and
 only slightly (2-way) smp. (the spidernet)
 
 I experimented wih various solutions, none were terribly exciting.  The
 thing that killed all of them was a crazy test case that someone sprung on
 me:  They had written a worst-case network ping-pong app: send one
 packet, wait for reply, send one packet, etc.
 
 If I waited (indefinitely) for a second packet to show up, the test case
 completely stalled (since no second packet would ever arrive).  And if I
 introduced a timer to wait for a second packet, then I just increased
 the latency in the response to the first packet, and this was noticed,
 and folks complained.

Possible solution / possible brainfart:

Introduce a timer, but don't start to use it to combine packets unless you
receive n packets within the timeframe. If you receive less than m packets
within one timeframe, stop using the timer. The system should now have a
decent response time when the network is idle, and when the network is
busy, nobody will complain about the latency.-)
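
Roughly, with invented names and watermarks:

/* Only run the coalescing timer while the per-timeframe packet count
 * stays between the two watermarks. */
#define COALESCE_START_N 32	/* >= n pkts/frame: start combining */
#define COALESCE_STOP_M   8	/* <  m pkts/frame: back to per-packet */

struct coalesce_state {
	unsigned int pkts_this_frame;
	int timer_active;
};

static void coalesce_frame_expired(struct coalesce_state *c)
{
	if (!c->timer_active && c->pkts_this_frame >= COALESCE_START_N)
		c->timer_active = 1;	/* busy: combine packets */
	else if (c->timer_active && c->pkts_this_frame < COALESCE_STOP_M)
		c->timer_active = 0;	/* idle: immediate service */

	c->pkts_this_frame = 0;		/* start the next timeframe */
}

The gap between n and m is what keeps the mode from flapping on bursty
traffic.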
-- 
Funny quotes:
22. When everything's going your way, you're in the wrong lane and going
the wrong way.
Friß, Spammer: [EMAIL PROTECTED] [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] [02/10] pasemi_mac: Stop using the pci config space accessors for register read/writes

2007-08-24 Thread Olof Johansson
On Fri, Aug 24, 2007 at 02:05:31PM +1000, Stephen Rothwell wrote:
 On Thu, 23 Aug 2007 13:13:10 -0500 Olof Johansson [EMAIL PROTECTED] wrote:
 
   out:
  -	pci_dev_put(mac->iob_pdev);
  -out_put_dma_pdev:
  -	pci_dev_put(mac->dma_pdev);
  -out_free_netdev:
  +	if (mac->iob_pdev)
  +		pci_dev_put(mac->iob_pdev);
  +	if (mac->dma_pdev)
  +		pci_dev_put(mac->dma_pdev);
 
 It is not documented as such (as far as I can see), but pci_dev_put is
 safe to call with NULL. And there are other places in the kernel that
 explicitly use that fact.

Some places check, others do not. I'll leave it be for now but might take
care of it during some future cleanup. Thanks for pointing it out, though.


-Olof
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] iproute2-2.6.23-rc3

2007-08-24 Thread Stephen Hemminger
On Fri, 24 Aug 2007 12:10:44 +0200
Jarek Poplawski [EMAIL PROTECTED] wrote:

 On 22-08-2007 20:08, Stephen Hemminger wrote:
  There have been a lot of changes for 2.6.23, so here is a test release
  of iproute2 that should capture all the submitted patches
  
  
  http://developer.osdl.org/shemminger/iproute2/download/iproute2-2.6.23-rc3.tar.gz
 
 But... isn't it forged, btw?!

No, I just didn't sign a temporary testing version.  A final version
will be out after 2.6.23

-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Friday 24 August 2007 18:15, Christoph Lameter wrote:
 On Fri, 24 Aug 2007, Denys Vlasenko wrote:
  On Thursday 16 August 2007 00:22, Paul Mackerras wrote:
   Satyam Sharma writes:
   In the kernel we use atomic variables in precisely those situations
   where a variable is potentially accessed concurrently by multiple
   CPUs, and where each CPU needs to see updates done by other CPUs in a
   timely fashion.  That is what they are for.  Therefore the compiler
   must not cache values of atomic variables in registers; each
   atomic_read must result in a load and each atomic_set must result in a
   store.  Anything else will just lead to subtle bugs.
 
  Amen.

 A timely fashion? One cannot rely on something like that when coding.
 The visibility of updates is ensured by barriers and not by some fuzzy
 notion of timeliness.

But here you do have some notion of time:

while (atomic_read(&x))
	continue;

"continue" when other CPU(s) decrement it down to zero.
If the read includes an insn which accesses RAM, you will
see the new value sometime after the other CPU decrements it.
"Sometime after" is on the order of nanoseconds here.
It is a valid concept of time, right?

The whole confusion is about whether atomic_read implies
read from RAM or not. I am in a camp which thinks it does.
You are in an opposite one.

We just need a less ambiguous name.

What about this:

/**
 * atomic_read - read atomic variable
 * @v: pointer of type atomic_t
 *
 * Atomically reads the value of @v.
 * No compiler barrier implied.
 */
#define atomic_read(v)  ((v)->counter)

+/**
+ * atomic_read_uncached - read atomic variable from memory
+ * @v: pointer of type atomic_t
+ *
+ * Atomically reads the value of @v. This is guaranteed to emit an insn
+ * which accesses memory, atomically. No ordering guarantees!
+ */
+#define atomic_read_uncached(v)  asm_or_volatile_ptr_magic(v)

I was thinking of s/atomic_read/atomic_get/ too, but it implies taking
atomic a-la get_cpu()...
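
The hazard I care about, as a sketch (the macro names are invented):

/* If atomic_read() is a plain load, the compiler may hoist it out of
 * a loop like
 *
 *	while (atomic_read(&x))
 *		continue;
 *
 * turning it into one load followed by an infinite spin.  A volatile
 * access forces a reload from memory on every iteration, so the other
 * CPU's decrement is eventually seen.
 */
#define atomic_read_plain(v)	((v)->counter)			 /* hoistable */
#define atomic_read_forced(v)	(*(volatile int *)&(v)->counter) /* reloads */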
--
vda
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Denys Vlasenko
On Friday 24 August 2007 18:06, Christoph Lameter wrote:
 On Fri, 24 Aug 2007, Satyam Sharma wrote:
  But if people do seem to have a mixed / confused notion of atomicity
  and barriers, and if there's consensus, then as I'd said earlier, I
  have no issues in going with the consensus (eg. having API variants).
  Linus would be more difficult to convince, however, I suspect :-)

 The confusion may be the result of us having barrier semantics in
 atomic_read. If we take that out then we may avoid future confusions.

I think better name may help. Nuke atomic_read() altogether.

n = atomic_value(x);	// doesn't hint as strongly at reading as atomic_read
n = atomic_fetch(x);	// yes, we _do_ touch RAM
n = atomic_read_uncached(x); // or this

How does that sound?
--
vda
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()

2007-08-24 Thread Chris Snook

Denys Vlasenko wrote:

On Friday 24 August 2007 18:06, Christoph Lameter wrote:

On Fri, 24 Aug 2007, Satyam Sharma wrote:

But if people do seem to have a mixed / confused notion of atomicity
and barriers, and if there's consensus, then as I'd said earlier, I
have no issues in going with the consensus (eg. having API variants).
Linus would be more difficult to convince, however, I suspect :-)

The confusion may be the result of us having barrier semantics in
atomic_read. If we take that out then we may avoid future confusions.


I think better name may help. Nuke atomic_read() altogether.

n = atomic_value(x);	// doesn't hint as strongly at reading as atomic_read
n = atomic_fetch(x);	// yes, we _do_ touch RAM
n = atomic_read_uncached(x); // or this

How does that sound?


atomic_value() vs. atomic_fetch() should be rather unambiguous. 
atomic_read_uncached() begs the question of precisely which cache we are 
avoiding, and could itself cause confusion.


So, if I were writing atomic.h from scratch, knowing what I know now, I think 
I'd use atomic_value() and atomic_fetch().  The problem is that there are a lot 
of existing users of atomic_read(), and we can't write a script to correctly 
guess their intent.  I'm not sure auditing all uses of atomic_read() is really 
worth the comparatively minuscule benefits.


We could play it safe and convert them all to atomic_fetch(), or we could 
acknowledge that changing the semantics 8 months ago was not at all disastrous, 
and make them all atomic_value(), allowing people to use atomic_fetch() where 
they really care.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Linas Vepstas
On Fri, Aug 24, 2007 at 09:04:56PM +0200, Bodo Eggert wrote:
 Linas Vepstas [EMAIL PROTECTED] wrote:
  On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
  3) On modern systems the incoming packets are processed very fast. 
  Especially
  on SMP systems when we use multiple queues we process only a few packets
  per napi poll cycle. So NAPI does not work very well here and the interrupt
  rate is still high.
  
  worst-case network ping-pong app: send one
  packet, wait for reply, send one packet, etc.
 
 Possible solution / possible brainfart:
 
 Introduce a timer, but don't start to use it to combine packets unless you
 receive n packets within the timeframe. If you receive less than m packets
 within one timeframe, stop using the timer. The system should now have a
 decent response time when the network is idle, and when the network is
 busy, nobody will complain about the latency.-)

Ohh, that was inspirational. Let me free-associate some wild ideas.

Suppose we keep a running average of the recent packet arrival rate,
let's say it's 10 per millisecond (typical for a gigabit eth running
flat-out).  If we could poll the driver at a rate of 10-20 per
millisecond (i.e. letting the OS do other useful work for 0.05 millisec),
then we could potentially service the card without ever having to enable 
interrupts on the card, and without hurting latency.

If the packet arrival rate becomes slow enough, we go back to an
interrupt-driven scheme (to keep latency down).

The main problem here is that, even for HZ=1000 machines, this amounts 
to 10-20 polls per jiffy.  Which, if implemented in kernel, requires 
using the high-resolution timers. And, umm, don't the HR timers require
a cpu timer interrupt to make them go? So it's not clear that this is much
of a win.

The eHEA is a 10 gigabit device, so it can expect 80-100 packets per
millisecond for large packets, and even more, say 1K packets per
millisec, for small packets. (Even the spec for my 1Gb spidernet card
claims its internal rate is 1M packets/sec.) 

Another possibility is to set HZ to 5000 or 20000 or something humongous
... after all cpus are now faster! But, since this might be wasteful,
maybe we could make HZ be dynamically variable: have high HZ rates when
there's lots of network/disk activity, and low HZ rates when not. That
means a non-constant jiffy.

If all drivers used interrupt mitigation, then the variable-high
frequency jiffy could take their place, and be more fair to everyone.
Most drivers would be polled most of the time when they're busy, and 
only use interrupts when they're not.
 
--linas
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] move hardware header functions out of netdevice

2007-08-24 Thread Stephen Hemminger
The following patch series starts the process of moving function
pointers out of the network device structure. This saves space and
separates code from data.

The first step is moving the functions dealing with hardware
headers.

Patches are against current net-2.6.24 tree. Basic functional
testing on ethernet part, not on all the other protocols affected.
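
The direction, as a rough sketch (not the final form of the series):
gather the header callbacks into one shared const structure, so each
net_device carries a single pointer instead of several mutable function
pointers:

struct header_ops {
	int	(*create)(struct sk_buff *skb, struct net_device *dev,
			  unsigned short type, const void *daddr,
			  const void *saddr, unsigned len);
	int	(*parse)(const struct sk_buff *skb, unsigned char *haddr);
	/* ... */
};

/* in struct net_device:
 *	const struct header_ops *header_ops;
 * with, e.g., all ethernet devices sharing one static const instance.
 */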

-- 
Stephen Hemminger [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] net: wrap hard_header_parse

2007-08-24 Thread Stephen Hemminger
Wrap the hard_header_parse function to simplify the next step
of the header_ops conversion.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- a/include/linux/netdevice.h 2007-08-23 21:25:57.0 -0700
+++ b/include/linux/netdevice.h 2007-08-23 22:25:35.0 -0700
@@ -639,7 +639,7 @@ struct net_device
 	void			(*vlan_rx_kill_vid)(struct net_device *dev,
 						    unsigned short vid);
 
-	int			(*hard_header_parse)(struct sk_buff *skb,
+	int			(*hard_header_parse)(const struct sk_buff *skb,
 						     unsigned char *haddr);
 	int			(*neigh_setup)(struct net_device *dev, struct neigh_parms *);
 #ifdef CONFIG_NETPOLL
@@ -787,6 +787,16 @@ static inline int dev_hard_header(struct
return dev-hard_header(skb, dev, type, daddr, saddr, len);
 }
 
+static inline int dev_parse_header(const struct sk_buff *skb,
+				   unsigned char *haddr)
+{
+	const struct net_device *dev = skb->dev;
+
+	if (!dev->hard_header_parse)
+		return 0;
+	return dev->hard_header_parse(skb, haddr);
+}
+
 typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr, int 
len);
 extern int register_gifconf(unsigned int family, gifconf_func_t * 
gifconf);
 static inline int unregister_gifconf(unsigned int family)
--- a/net/netfilter/nfnetlink_log.c 2007-08-23 09:44:22.0 -0700
+++ b/net/netfilter/nfnetlink_log.c 2007-08-23 21:43:32.0 -0700
@@ -480,12 +480,13 @@ __build_packet_message(struct nfulnl_ins
NFA_PUT(inst-skb, NFULA_MARK, sizeof(tmp_uint), tmp_uint);
}
 
-	if (indev && skb->dev && skb->dev->hard_header_parse) {
+	if (indev && skb->dev) {
 		struct nfulnl_msg_packet_hw phw;
-		int len = skb->dev->hard_header_parse((struct sk_buff *)skb,
-						      phw.hw_addr);
-		phw.hw_addrlen = htons(len);
-		NFA_PUT(inst->skb, NFULA_HWADDR, sizeof(phw), &phw);
+		int len = dev_parse_header(skb, phw.hw_addr);
+		if (len > 0) {
+			phw.hw_addrlen = htons(len);
+			NFA_PUT(inst->skb, NFULA_HWADDR, sizeof(phw), &phw);
+		}
 	}
 
if (skb-tstamp.tv64) {
--- a/net/netfilter/nfnetlink_queue.c   2007-08-23 09:44:22.0 -0700
+++ b/net/netfilter/nfnetlink_queue.c   2007-08-23 21:33:50.0 -0700
@@ -485,14 +485,13 @@ nfqnl_build_packet_message(struct nfqnl_
NFA_PUT(skb, NFQA_MARK, sizeof(u_int32_t), tmp_uint);
}
 
-	if (indev && entskb->dev
-	    && entskb->dev->hard_header_parse) {
+	if (indev && entskb->dev) {
 		struct nfqnl_msg_packet_hw phw;
-
-		int len = entskb->dev->hard_header_parse(entskb,
-							 phw.hw_addr);
-		phw.hw_addrlen = htons(len);
-		NFA_PUT(skb, NFQA_HWADDR, sizeof(phw), &phw);
+		int len = dev_parse_header(entskb, phw.hw_addr);
+		if (len) {
+			phw.hw_addrlen = htons(len);
+			NFA_PUT(skb, NFQA_HWADDR, sizeof(phw), &phw);
+		}
 	}
 
if (entskb-tstamp.tv64) {
--- a/net/packet/af_packet.c	2007-08-23 21:25:57.0 -0700
+++ b/net/packet/af_packet.c	2007-08-23 22:25:19.0 -0700
@@ -512,10 +512,8 @@ static int packet_rcv(struct sk_buff *sk
 		sll->sll_ifindex = orig_dev->ifindex;
 	else
 		sll->sll_ifindex = dev->ifindex;
-	sll->sll_halen = 0;
 
-	if (dev->hard_header_parse)
-		sll->sll_halen = dev->hard_header_parse(skb, sll->sll_addr);
+	sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
 
 	PACKET_SKB_CB(skb)->origlen = skb->len;
 
@@ -649,9 +647,7 @@ static int tpacket_rcv(struct sk_buff *s
 		h->tp_usec = tv.tv_usec;
 
 	sll = (struct sockaddr_ll*)((u8*)h + TPACKET_ALIGN(sizeof(*h)));
-	sll->sll_halen = 0;
-	if (dev->hard_header_parse)
-		sll->sll_halen = dev->hard_header_parse(skb, sll->sll_addr);
+	sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
 	sll->sll_family = AF_PACKET;
 	sll->sll_hatype = dev->type;
 	sll->sll_protocol = skb->protocol;
--- a/net/ethernet/eth.c	2007-08-23 21:25:57.0 -0700
+++ b/net/ethernet/eth.c	2007-08-23 22:25:19.0 -0700
@@ -207,9 +207,9 @@ EXPORT_SYMBOL(eth_type_trans);
  * @skb: packet to extract header from
  * @haddr: destination buffer
  */
-static int eth_header_parse(struct sk_buff *skb, unsigned char *haddr)
+static int eth_header_parse(const struct sk_buff *skb, unsigned char *haddr)
 {
-	struct ethhdr *eth = eth_hdr(skb);
+	const struct ethhdr *eth = eth_hdr(skb);
 	memcpy(haddr, eth->h_source, 

[PATCH 1/3] net: wrap netdevice hardware header creation

2007-08-24 Thread Stephen Hemminger
Add an inline for the common usage of hardware header creation, and
fix a bug in IPV6 mcast where the assumption about the negative return
value was wrong.

A negative return from hard_header means not enough space was available
(ie -N bytes).

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- a/include/linux/netdevice.h 2007-08-23 09:44:19.0 -0700
+++ b/include/linux/netdevice.h 2007-08-24 12:47:11.0 -0700
@@ -778,6 +778,15 @@ extern int dev_restart(struct net_devic
 extern int netpoll_trap(void);
 #endif
 
+static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev,
+				  unsigned short type,
+				  void *daddr, void *saddr, unsigned len)
+{
+	if (!dev->hard_header)
+		return 0;
+	return dev->hard_header(skb, dev, type, daddr, saddr, len);
+}
+
 typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr, int 
len);
 extern int register_gifconf(unsigned int family, gifconf_func_t * 
gifconf);
 static inline int unregister_gifconf(unsigned int family)
--- a/net/ipv4/arp.c2007-08-23 09:44:22.0 -0700
+++ b/net/ipv4/arp.c2007-08-24 12:47:11.0 -0700
@@ -590,8 +590,7 @@ struct sk_buff *arp_create(int type, int
/*
 *  Fill the device header for the ARP frame
 */
-	if (dev->hard_header &&
-	    dev->hard_header(skb,dev,ptype,dest_hw,src_hw,skb->len) < 0)
+	if (dev_hard_header(skb, dev, ptype, dest_hw, src_hw, skb->len) < 0)
 		goto out;
 
/*
--- a/net/core/neighbour.c  2007-08-23 09:44:22.0 -0700
+++ b/net/core/neighbour.c  2007-08-24 12:47:11.0 -0700
@@ -1123,9 +1123,8 @@ int neigh_compat_output(struct sk_buff *
 
__skb_pull(skb, skb_network_offset(skb));
 
-	if (dev->hard_header &&
-	    dev->hard_header(skb, dev, ntohs(skb->protocol), NULL, NULL,
-			     skb->len) < 0 &&
+	if (dev_hard_header(skb, dev, ntohs(skb->protocol), NULL, NULL,
+			    skb->len) < 0 &&
 	    dev->rebuild_header(skb))
 		return 0;
 
@@ -1152,13 +1151,13 @@ int neigh_resolve_output(struct sk_buff 
 		write_lock_bh(&neigh->lock);
 		if (!dst->hh)
 			neigh_hh_init(neigh, dst, dst->ops->protocol);
-		err = dev->hard_header(skb, dev, ntohs(skb->protocol),
-				       neigh->ha, NULL, skb->len);
+		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
+				      neigh->ha, NULL, skb->len);
 		write_unlock_bh(&neigh->lock);
 	} else {
 		read_lock_bh(&neigh->lock);
-		err = dev->hard_header(skb, dev, ntohs(skb->protocol),
-				       neigh->ha, NULL, skb->len);
+		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
+				      neigh->ha, NULL, skb->len);
 		read_unlock_bh(&neigh->lock);
 	}
 	if (err >= 0)
@@ -1189,8 +1188,8 @@ int neigh_connected_output(struct sk_buf
 	__skb_pull(skb, skb_network_offset(skb));
 
 	read_lock_bh(&neigh->lock);
-	err = dev->hard_header(skb, dev, ntohs(skb->protocol),
-			       neigh->ha, NULL, skb->len);
+	err = dev_hard_header(skb, dev, ntohs(skb->protocol),
+			      neigh->ha, NULL, skb->len);
 	read_unlock_bh(&neigh->lock);
 	if (err >= 0)
 		err = neigh->ops->queue_xmit(skb);
--- a/net/8021q/vlan_dev.c	2007-08-23 09:44:21.0 -0700
+++ b/net/8021q/vlan_dev.c	2007-08-24 12:47:11.0 -0700
@@ -419,21 +419,19 @@ int vlan_dev_hard_header(struct sk_buff 
 
 	if (build_vlan_header) {
 		/* Now make the underlying real hard header */
-		rc = dev->hard_header(skb, dev, ETH_P_8021Q, daddr, saddr, len + VLAN_HLEN);
-
-		if (rc > 0) {
+		rc = dev_hard_header(skb, dev, ETH_P_8021Q, daddr, saddr,
+				     len + VLAN_HLEN);
+		if (rc > 0)
 			rc += VLAN_HLEN;
-		} else if (rc < 0) {
+		else if (rc < 0)
 			rc -= VLAN_HLEN;
-		}
-	} else {
+	} else
 		/* If here, then we'll just make a normal looking ethernet frame,
 		 * but, the hard_start_xmit method will insert the tag (it has to
 		 * be able to do this for bridged and other skbs that don't come
 		 * down the protocol stack in an orderly manner.
 		 */
-		rc = dev->hard_header(skb, dev, type, daddr, saddr, len);
-	}
+		rc = dev_hard_header(skb, dev, type, daddr, saddr, len);
 
 	return rc;
 }

[PATCH] via-velocity: use standard VLAN interface (resend)

2007-08-24 Thread Stephen Hemminger
The via-velocity driver is using a non-standard VLAN interface configured
via module parameters (yuck).

Replace it with the standard acceleration interface.
This solves a number of problems with being able to handle multiple
VLANs and to reconfigure dynamically.

This is compile tested only, don't have this board.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


---
 drivers/net/via-velocity.c |   71 +++--
 drivers/net/via-velocity.h |3 +
 2 files changed, 45 insertions(+), 29 deletions(-)

--- a/drivers/net/via-velocity.c2007-08-18 07:50:10.0 -0700
+++ b/drivers/net/via-velocity.c2007-08-24 13:49:17.0 -0700
@@ -72,6 +72,7 @@
 #include <linux/mii.h>
 #include <linux/in.h>
 #include <linux/if_arp.h>
+#include <linux/if_vlan.h>
 #include <linux/ip.h>
 #include <linux/tcp.h>
 #include <linux/udp.h>
@@ -111,15 +112,6 @@ VELOCITY_PARAM(RxDescriptors, Number of
 #define TX_DESC_DEF 64
 VELOCITY_PARAM(TxDescriptors, "Number of transmit descriptors");
 
-#define VLAN_ID_MIN 0
-#define VLAN_ID_MAX 4095
-#define VLAN_ID_DEF 0
-/* VID_setting[] is used for setting the VID of NIC.
-   0: default VID.
-   1-4094: other VIDs.
-*/
-VELOCITY_PARAM(VID_setting, "802.1Q VLAN ID");
-
 #define RX_THRESH_MIN   0
 #define RX_THRESH_MAX   3
 #define RX_THRESH_DEF   0
@@ -147,13 +139,6 @@ VELOCITY_PARAM(rx_thresh, Receive fifo 
 */
 VELOCITY_PARAM(DMA_length, "DMA length");
 
-#define TAGGING_DEF 0
-/* enable_tagging[] is used for enabling 802.1Q VID tagging.
-   0: disable VID seeting(default).
-   1: enable VID setting.
-*/
-VELOCITY_PARAM(enable_tagging, "Enable 802.1Q tagging");
-
 #define IP_ALIG_DEF 0
 /* IP_byte_align[] is used for IP header DWORD byte aligned
0: indicate the IP header won't be DWORD byte aligned.(Default) .
@@ -442,8 +427,7 @@ static void __devinit velocity_get_optio
 	velocity_set_int_opt(&opts->DMA_length, DMA_length[index], DMA_LENGTH_MIN, DMA_LENGTH_MAX, DMA_LENGTH_DEF, "DMA_length", devname);
 	velocity_set_int_opt(&opts->numrx, RxDescriptors[index], RX_DESC_MIN, RX_DESC_MAX, RX_DESC_DEF, "RxDescriptors", devname);
 	velocity_set_int_opt(&opts->numtx, TxDescriptors[index], TX_DESC_MIN, TX_DESC_MAX, TX_DESC_DEF, "TxDescriptors", devname);
-	velocity_set_int_opt(&opts->vid, VID_setting[index], VLAN_ID_MIN, VLAN_ID_MAX, VLAN_ID_DEF, "VID_setting", devname);
-	velocity_set_bool_opt(&opts->flags, enable_tagging[index], TAGGING_DEF, VELOCITY_FLAGS_TAGGING, "enable_tagging", devname);
+
 	velocity_set_bool_opt(&opts->flags, txcsum_offload[index], TX_CSUM_DEF, VELOCITY_FLAGS_TX_CSUM, "txcsum_offload", devname);
 	velocity_set_int_opt(&opts->flow_cntl, flow_control[index], FLOW_CNTL_MIN, FLOW_CNTL_MAX, FLOW_CNTL_DEF, "flow_control", devname);
 	velocity_set_bool_opt(&opts->flags, IP_byte_align[index], IP_ALIG_DEF, VELOCITY_FLAGS_IP_ALIGN, "IP_byte_align", devname);
@@ -465,6 +449,7 @@ static void __devinit velocity_get_optio
 static void velocity_init_cam_filter(struct velocity_info *vptr)
 {
 	struct mac_regs __iomem *regs = vptr->mac_regs;
+	unsigned short vid;
 
 	/* Turn on MCFG_PQEN, turn off MCFG_RTGOPT */
 	WORD_REG_BITS_SET(MCFG_PQEN, MCFG_RTGOPT, &regs->MCFG);
@@ -477,13 +462,19 @@ static void velocity_init_cam_filter(str
 	mac_set_cam_mask(regs, vptr->mCAMmask, VELOCITY_MULTICAST_CAM);
 
 	/* Enable first VCAM */
-	if (vptr->flags & VELOCITY_FLAGS_TAGGING) {
-		/* If Tagging option is enabled and VLAN ID is not zero, then
-		   turn on MCFG_RTGOPT also */
-		if (vptr->options.vid != 0)
-			WORD_REG_BITS_ON(MCFG_RTGOPT, &regs->MCFG);
+	if (vptr->vlgrp) {
+		for (vid = 0; vid < VLAN_VID_MASK; vid++) {
+			if (vlan_group_get_device(vptr->vlgrp, vid)) {
+				/* If Tagging option is enabled and
+				   VLAN ID is not zero, then
+				   turn on MCFG_RTGOPT also */
+				if (vid != 0)
+					WORD_REG_BITS_ON(MCFG_RTGOPT, &regs->MCFG);
 
-		mac_set_cam(regs, 0, (u8 *) &(vptr->options.vid), VELOCITY_VLAN_ID_CAM);
+				mac_set_cam(regs, 0, (u8 *) &vid,
+					    VELOCITY_VLAN_ID_CAM);
+			}
+		}
 		vptr->vCAMmask[0] |= 1;
 		mac_set_cam_mask(regs, vptr->vCAMmask, VELOCITY_VLAN_ID_CAM);
 	} else {
@@ -494,6 +485,26 @@ static void velocity_init_cam_filter(str
 	}
 }
 
+static void velocity_vlan_rx_add_vid(struct net_device *dev, unsigned short vid)
+{
+	struct velocity_info *vptr = netdev_priv(dev);
+
+	spin_lock_irq(&vptr->lock);
+	velocity_init_cam_filter(vptr);
+	spin_unlock_irq(&vptr->lock);
+}
+
+static void velocity_vlan_rx_kill_vid(struct net_device *dev, unsigned short vid)
+{
+	

Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Jan-Bernd Themann

Linas Vepstas wrote:

On Fri, Aug 24, 2007 at 09:04:56PM +0200, Bodo Eggert wrote:
  

Linas Vepstas [EMAIL PROTECTED] wrote:


On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
  

3) On modern systems the incoming packets are processed very fast. Especially
on SMP systems when we use multiple queues we process only a few packets
per napi poll cycle. So NAPI does not work very well here and the interrupt
rate is still high.


worst-case network ping-pong app: send one
packet, wait for reply, send one packet, etc.
  

Possible solution / possible brainfart:

Introduce a timer, but don't start to use it to combine packets unless you
receive n packets within the timeframe. If you receive less than m packets
within one timeframe, stop using the timer. The system should now have a
decent response time when the network is idle, and when the network is
busy, nobody will complain about the latency.-)



Ohh, that was inspirational. Let me free-associate some wild ideas.

Suppose we keep a running average of the recent packet arrival rate,
Lets say its 10 per millisecond (typical for a gigabit eth runnning
flat-out).  If we could poll the driver at a rate of 10-20 per
millisecond (i.e. letting the OS do other useful work for 0.05 millisec),
then we could potentially service the card without ever having to enable 
interrupts on the card, and without hurting latency.


If the packet arrival rate becomes slow enough, we go back to an
interrupt-driven scheme (to keep latency down).

The main problem here is that, even for HZ=1000 machines, this amounts 
to 10-20 polls per jiffy.  Which, if implemented in kernel, requires 
using the high-resolution timers. And, umm, don't the HR timers require

a cpu timer interrupt to make them go? So its not clear that this is much
of a win.
  

That is indeed a good question. At least for 10G eHEA we see
that the average number of packets per poll cycle is very low.
With high-precision timers we could control the poll interval
better and thus make sure we get enough packets on the queue in
high-load situations to benefit from LRO while keeping the
latency moderate. When the traffic load is low we could just
stick to plain NAPI. I don't know how expensive high-precision
timers are; we probably just have to test it (when they become
available for POWER in our case). However, having more packets
per poll run would make LRO more efficient and thus the total
CPU utilization would decrease.

I guess on most systems there are not many different network
cards working in parallel. So if the driver could set the poll
interval for its devices, it could be well optimized depending
on the NIC's characteristics.

Maybe it would be good enough to have a timer that schedules
the device for NAPI (and thus triggers SoftIRQs, which will
trigger NAPI). Whether this timer would be used via a generic
interface or be implemented as a driver-private solution
would depend on whether other drivers want / need this feature
as well. Drivers / NICs that work fine with plain NAPI wouldn't
have to use the timer :-)
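
Roughly, such a timer-driven NAPI kick could look like this (a minimal
sketch, assuming the 2.6.24 napi_struct API and hrtimers; the my_*
names and the interval are illustrative, not from any existing driver):

	struct my_adapter {
		struct napi_struct napi;
		struct hrtimer poll_timer;
		ktime_t poll_interval;		/* e.g. 50-100 us under load */
	};

	static enum hrtimer_restart my_poll_timer_fn(struct hrtimer *timer)
	{
		struct my_adapter *ap =
			container_of(timer, struct my_adapter, poll_timer);

		/* kick the softirq poll instead of waiting for a card IRQ */
		napi_schedule(&ap->napi);
		return HRTIMER_NORESTART;	/* rearmed from the poll routine */
	}

	static void my_start_polling(struct my_adapter *ap)
	{
		hrtimer_init(&ap->poll_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
		ap->poll_timer.function = my_poll_timer_fn;
		hrtimer_start(&ap->poll_timer, ap->poll_interval,
			      HRTIMER_MODE_REL);
	}

The poll routine would then decide, based on load, whether to rearm the
timer or fall back to plain interrupt-driven NAPI.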

I tried to implement something with normal timers, but the result
was anything but great. The timers seem to be far too slow.
I'm not sure whether it would help to increase HZ from 1000 to
2500 or more.

Regards,
Jan-Bernd



Re: [PATCH] via-velocity: use standard VLAN interface (resend)

2007-08-24 Thread Al Viro
On Fri, Aug 24, 2007 at 01:56:49PM -0700, Stephen Hemminger wrote:

  static void velocity_init_cam_filter(struct velocity_info *vptr)
  {
  	struct mac_regs __iomem * regs = vptr->mac_regs;
 +	unsigned short vid;
  
 -	mac_set_cam(regs, 0, (u8 *) &(vptr->options.vid), VELOCITY_VLAN_ID_CAM);
 +	mac_set_cam(regs, 0, (u8 *) &vid,
 +		    VELOCITY_VLAN_ID_CAM);

This mac_set_cam() dreck should be split into two properly typed functions.


Re: [Devel] [PATCH 1/1] Dynamically allocate the loopback device

2007-08-24 Thread Denis V. Lunev
No, and this is important. Loopback is initialized in an fs_initcall,
which runs sufficiently before module_init.

I have checked the code and do not see initialization-order mistakes
right now. But, from now on, the maintainer should pay attention to this
unfortunate consequence :(
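
For reference, the ordering relied on here, in initcall terms (a sketch;
fs_initcall is initcall level 5 and runs before module_init, which for
built-in code is device_initcall, level 6; the second name is illustrative):

	fs_initcall(loopback_init);	/* level 5: loopback exists first */
	module_init(other_netdev_init);	/* level 6: every other net driver */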

Regards,
Den

Stephen Hemminger wrote:
 On Fri, 24 Aug 2007 19:55:47 +0400
 Denis V. Lunev [EMAIL PROTECTED] wrote:
 
 [EMAIL PROTECTED] wrote:
 From: Daniel Lezcano [EMAIL PROTECTED]

 Doing this makes loopback.c a better example of how to do a
 simple network device, and it removes the special case
 single static allocation of a struct net_device, hopefully
 making maintenance easier.

 Applies against net-2.6.24

 Tested on i386, x86_64
 Compiled on ia64, sparc
 I think a small note that the initialization order has changed would be
 good to record. After this, loopback MUST be allocated before any other
 networking subsystem initialization. And this is an important change.

 Regards,
 Den
 
 Yes, this code would break when other drivers are directly linked
 in. 
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/containers
 
 ___
 Devel mailing list
 [EMAIL PROTECTED]
 https://openvz.org/mailman/listinfo/devel
 



Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 08:14:16 -0400

 Seems the receive side of the sender is also consuming a lot more CPU;
 I suspect because the receiver is generating a lot more ACKs with TSO.

I've seen this behavior before on a receiver with a weak CPU, and the
issue is that batching too much actually hurts the receiver.

If the data packets were better spaced out, the receiver would handle
the load better.

This is the thing the TOE guys keep talking about overcoming with
their packet pacing algorithms in their on-card TOE stack.

My hunch is that even if in the non-TSO case the TX packets were all
back to back in the card's TX ring, TSO still spits them out faster on
the wire.


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread David Miller
From: Jan-Bernd Themann [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 15:59:16 +0200

    It would be nice if it were possible to schedule queues to other CPUs,
    or at least to use interrupts to put the queue on another cpu (not
    nice, as you never know which one you will hit).
    I'm not sure how bad the tradeoff would be.

Once the per-cpu NAPI poll queues start needing locks, much of the
gain will be lost.  This is strictly what we want to avoid.

We need real facilities for IRQ distribution policies.  With that none
of this is an issue.

This is also a platform-specific problem with IRQ behavior; the IRQ
distribution scheme you mention would never occur on sparc64, for
example.  We use a fixed round-robin distribution of interrupts to
CPUs there; they don't move.

Each scheme has its advantages, but you want a different scheme here
than what is implemented, and the fix is therefore not in the
networking :-)

Furthermore, most cards that will be using multi-queue will be
using hashes on the packet headers to choose the MSI-X interrupt
and thus the cpu to be targeted.  Those cards will want fixed
instead of dynamic interrupt-to-cpu distribution schemes as well,
so your problem is not unique and they'll need the same fix as
you do.


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Linas Vepstas
On Fri, Aug 24, 2007 at 11:11:56PM +0200, Jan-Bernd Themann wrote:
 (when they are available for
 POWER in our case). 

hrtimers worked fine on the powerpc cell arch last summer.
I assume they work on p5 and p6 too, no?

 I tried to implement something with normal timers, but the result
 was anything but great. The timers seem to be far too slow.
 I'm not sure whether it would help to increase HZ from 1000 to
 2500 or more.

Heh. Do the math. Even on 1-gigabit cards, that's not enough:

(1 gigabit/sec) x (byte/8 bits) x (packet/1500 bytes) x (sec/1000 jiffies)

is 83 packets per jiffy (for big packets; even more for small packets,
and more again for 10-gigabit cards). So polling once per jiffy is a
latency disaster.

--linas  



Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread David Miller
From: Jan-Bernd Themann [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 15:59:16 +0200

 1) The current implementation of netif_rx_schedule, netif_rx_complete
    and net_rx_action has the following problem: netif_rx_schedule
    sets the NAPI_STATE_SCHED flag and adds the NAPI instance to the poll_list.
    net_rx_action checks NAPI_STATE_SCHED; if set, it will add the device
    to the poll_list again (as well). netif_rx_complete clears
    NAPI_STATE_SCHED.
    If an interrupt handler calls netif_rx_schedule on CPU 2
    after netif_rx_complete has been called on CPU 1 (and the poll function
    has not returned yet), the NAPI instance will be added twice to the
    poll_list (by netif_rx_schedule and net_rx_action). Problems occur when
    netif_rx_complete is then called twice for the device (BUG() is called).

Indeed, this is the who should manage the list problem.
Probably the answer is that whoever transitions the NAPI_STATE_SCHED
bit from cleared to set should do the list addition.

Patches welcome :-)
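
For illustration, the rule is essentially this (a sketch against the
2.6.24 napi_struct API, not an actual patch):

	if (!test_and_set_bit(NAPI_STATE_SCHED, &n->state))
		__napi_schedule(n);	/* only the 0->1 transition queues it */

The atomic test_and_set_bit() picks exactly one CPU as the owner of the
list addition, so the instance can never end up queued twice.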

 3) On modern systems the incoming packets are processed very fast. Especially
    on SMP systems, when we use multiple queues we process only a few packets
    per napi poll cycle. So NAPI does not work very well here and the
    interrupt rate is still high. What we would need is some sort of
    timer-polling mode which would schedule a device after a certain amount
    of time for high-load situations. With high-precision timers this could
    work well. Current usual timers are too slow. A finer granularity would
    be needed to keep the latency down (and queue length moderate).

This is why minimal levels of HW interrupt mitigation should be enabled
in your chip.  If it does not support this, you will indeed need to look
into using high resolution timers or other schemes to alleviate this.

I do not think it deserves a generic core networking helper facility,
the chips that can't mitigate interrupts are few and obscure.


[PATCH] via-velocity: more cleanup

2007-08-24 Thread Stephen Hemminger
Per Al's suggestion, get rid of the stupid stuff:
remove the cam_type switch,
deinline things that aren't important for speed,
turn the big macros into functions,
remove some dead/unused code,
and use const char * for the chip name.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- a/drivers/net/via-velocity.c	2007-08-24 13:49:17.000000000 -0700
+++ b/drivers/net/via-velocity.c	2007-08-24 14:39:14.000000000 -0700
@@ -85,6 +85,163 @@
 static int velocity_nics = 0;
 static int msglevel = MSG_LEVEL_INFO;
 
+/**
+ * mac_get_cam_mask-   Read a CAM mask
+ * @regs: register block for this velocity
+ * @mask: buffer to store mask
+ *
+ * Fetch the mask bits of the selected CAM and store them into the
+ * provided mask buffer.
+ */
+
+static void mac_get_cam_mask(struct mac_regs __iomem * regs, u8 * mask)
+{
+	int i;
+
+	/* Select CAM mask */
+	BYTE_REG_BITS_SET(CAMCR_PS_CAM_MASK, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+
+	writeb(0, &regs->CAMADDR);
+
+	/* read mask */
+	for (i = 0; i < 8; i++)
+		*mask++ = readb(&(regs->MARCAM[i]));
+
+	/* disable CAMEN */
+	writeb(0, &regs->CAMADDR);
+
+	/* Select mar */
+	BYTE_REG_BITS_SET(CAMCR_PS_MAR, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+
+}
+
+
+/**
+ * mac_set_cam_mask	-	Set a CAM mask
+ * @regs: register block for this velocity
+ * @mask: CAM mask to load
+ *
+ * Store a new mask into a CAM
+ */
+
+static void mac_set_cam_mask(struct mac_regs __iomem * regs, u8 * mask)
+{
+	int i;
+	/* Select CAM mask */
+	BYTE_REG_BITS_SET(CAMCR_PS_CAM_MASK, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+
+	writeb(CAMADDR_CAMEN, &regs->CAMADDR);
+
+	for (i = 0; i < 8; i++) {
+		writeb(*mask++, &(regs->MARCAM[i]));
+	}
+	/* disable CAMEN */
+	writeb(0, &regs->CAMADDR);
+
+	/* Select mar */
+	BYTE_REG_BITS_SET(CAMCR_PS_MAR, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+}
+
+static void mac_set_vlan_cam_mask(struct mac_regs __iomem * regs, u8 * mask)
+{
+	int i;
+	/* Select CAM mask */
+	BYTE_REG_BITS_SET(CAMCR_PS_CAM_MASK, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+
+	writeb(CAMADDR_CAMEN | CAMADDR_VCAMSL, &regs->CAMADDR);
+
+	for (i = 0; i < 8; i++) {
+		writeb(*mask++, &(regs->MARCAM[i]));
+	}
+	/* disable CAMEN */
+	writeb(0, &regs->CAMADDR);
+
+	/* Select mar */
+	BYTE_REG_BITS_SET(CAMCR_PS_MAR, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+}
+
+/**
+ * mac_set_cam	-	set CAM data
+ * @regs: register block of this velocity
+ * @idx: Cam index
+ * @addr: 2 or 6 bytes of CAM data
+ *
+ * Load an address or vlan tag into a CAM
+ */
+
+static void mac_set_cam(struct mac_regs __iomem * regs, int idx, const u8 *addr)
+{
+	int i;
+
+	/* Select CAM mask */
+	BYTE_REG_BITS_SET(CAMCR_PS_CAM_DATA, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+
+	idx &= (64 - 1);
+
+	writeb(CAMADDR_CAMEN | idx, &regs->CAMADDR);
+
+	for (i = 0; i < 6; i++) {
+		writeb(*addr++, &(regs->MARCAM[i]));
+	}
+	BYTE_REG_BITS_ON(CAMCR_CAMWR, &regs->CAMCR);
+
+	udelay(10);
+
+	writeb(0, &regs->CAMADDR);
+
+	/* Select mar */
+	BYTE_REG_BITS_SET(CAMCR_PS_MAR, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+}
+
+static void mac_set_vlan_cam(struct mac_regs __iomem * regs, int idx,
+			     const u8 *addr)
+{
+
+	/* Select CAM mask */
+	BYTE_REG_BITS_SET(CAMCR_PS_CAM_DATA, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+
+	idx &= (64 - 1);
+
+	writeb(CAMADDR_CAMEN | CAMADDR_VCAMSL | idx, &regs->CAMADDR);
+	writew(*((u16 *) addr), &regs->MARCAM[0]);
+
+	BYTE_REG_BITS_ON(CAMCR_CAMWR, &regs->CAMCR);
+
+	udelay(10);
+
+	writeb(0, &regs->CAMADDR);
+
+	/* Select mar */
+	BYTE_REG_BITS_SET(CAMCR_PS_MAR, CAMCR_PS1 | CAMCR_PS0, &regs->CAMCR);
+}
+
+
+/**
+ * mac_wol_reset	-	reset WOL after exiting low power
+ * @regs: register block of this velocity
+ *
+ * Called after we drop out of wake on lan mode in order to
+ * reset the Wake on lan features. This function doesn't restore
+ * the rest of the logic from the result of sleep/wakeup
+ */
+
+static void mac_wol_reset(struct mac_regs __iomem * regs)
+{
+
+	/* Turn off SWPTAG right after leaving power mode */
+	BYTE_REG_BITS_OFF(STICKHW_SWPTAG, &regs->STICKHW);
+	/* clear sticky bits */
+	BYTE_REG_BITS_OFF((STICKHW_DS1 | STICKHW_DS0), &regs->STICKHW);
+
+	BYTE_REG_BITS_OFF(CHIPGCR_FCGMII, &regs->CHIPGCR);
+	BYTE_REG_BITS_OFF(CHIPGCR_FCMODE, &regs->CHIPGCR);
+	/* disable force PME-enable */
+	writeb(WOLCFG_PMEOVR, &regs->WOLCFGClr);
+	/* disable power-event config bit */
+	writew(0xFFFF, &regs->WOLCRClr);
+	/* clear power status */
+	writew(0xFFFF, &regs->WOLSRClr);
+}
 
 static int velocity_mii_ioctl(struct net_device *dev, 

Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread David Miller
From: [EMAIL PROTECTED] (Linas Vepstas)
Date: Fri, 24 Aug 2007 11:45:41 -0500

 In the end, I just let it be, and let the system work as a
 busy-beaver, with the high interrupt rate. Is this a wise thing to
 do?

The tradeoff is always going to be latency vs. throughput.

A sane default should defer enough to catch multiple packets coming in
at something close to line rate, but not so much that latency unduly
suffers.


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread David Miller
From: David Stevens [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 09:50:58 -0700

 Problem is if it increases rapidly, you may drop packets
 before you notice that the ring is full in the current estimated
 interval.

This is one of many reasons why hardware interrupt mitigation
is really needed for this.


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread David Miller
From: James Chapman [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 18:16:45 +0100

 Does hardware interrupt mitigation really interact well with NAPI?

It interacts quite excellently.

There was a long saga about this with tg3 and huge SGI numa
systems with large costs for interrupt processing, and the
fix was to do a minimal amount of interrupt mitigation and
this basically cleared up all the problems.

Someone should reference that thread _now_ before this discussion goes
too far and we repeat a lot of information and people like myself have
to stay up all night correcting the misinformation and
misunderstandings that are basically guaranteed for this topic :)


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread Linas Vepstas
On Fri, Aug 24, 2007 at 02:44:36PM -0700, David Miller wrote:
 From: David Stevens [EMAIL PROTECTED]
 Date: Fri, 24 Aug 2007 09:50:58 -0700
 
  Problem is if it increases rapidly, you may drop packets
  before you notice that the ring is full in the current estimated
  interval.
 
 This is one of many reasons why hardware interrupt mitigation
 is really needed for this.

When turning off interrupts, don't turn them *all* off.
Leave the queue-full interrupt always on.

--linas


Re: RFC: issues concerning the next NAPI interface

2007-08-24 Thread akepner
On Fri, Aug 24, 2007 at 02:47:11PM -0700, David Miller wrote:

 
 Someone should reference that thread _now_ before this discussion goes
 too far and we repeat a lot of information ..

Here's part of the thread:
http://marc.info/?t=11159530601&r=1&w=2

Also, Jamal's paper may be of interest - Google for "When NAPI comes
to town".

-- 
Arthur



Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread Herbert Xu
On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:

 My hunch is that even if in the non-TSO case the TX packets were all
 back to back in the cards TX ring, TSO still spits them out faster on
 the wire.

If this is the case then we should see an improvement by
disabling TSO and enabling GSO.
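
(Testing that would be something along the lines of
ethtool -K eth0 tso off gso on, assuming an ethtool recent enough to
know about the GSO flag.)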

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] via-velocity: more cleanup

2007-08-24 Thread Al Viro
On Fri, Aug 24, 2007 at 02:40:45PM -0700, Stephen Hemminger wrote:
 +static void mac_set_vlan_cam(struct mac_regs __iomem * regs, int idx,
 +  const u8 *addr)

ITYM const u16 *, if not an outright u16.  These casts (one below and
ones in callers) really should die.

 +	writew(*((u16 *) addr), &regs->MARCAM[0]);
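
A properly typed version might look like this (a sketch only, reusing
the register helpers from the patch; whether to take u16 or a
const u16 * is exactly the question above):

	static void mac_set_vlan_cam(struct mac_regs __iomem *regs, int idx,
				     u16 vid)
	{
		/* Select CAM data page */
		BYTE_REG_BITS_SET(CAMCR_PS_CAM_DATA, CAMCR_PS1 | CAMCR_PS0,
				  &regs->CAMCR);

		writeb(CAMADDR_CAMEN | CAMADDR_VCAMSL | (idx & (64 - 1)),
		       &regs->CAMADDR);
		writew(vid, &regs->MARCAM[0]);	/* no (u16 *) cast needed */

		BYTE_REG_BITS_ON(CAMCR_CAMWR, &regs->CAMCR);
		udelay(10);
		writeb(0, &regs->CAMADDR);

		/* Select mar */
		BYTE_REG_BITS_SET(CAMCR_PS_MAR, CAMCR_PS1 | CAMCR_PS0,
				  &regs->CAMCR);
	}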


Re: [PATCH v2] [02/10] pasemi_mac: Stop using the pci config space accessors for register read/writes

2007-08-24 Thread Stephen Rothwell
On Fri, 24 Aug 2007 13:11:04 -0500 Olof Johansson [EMAIL PROTECTED] wrote:

 On Fri, Aug 24, 2007 at 02:05:31PM +1000, Stephen Rothwell wrote:
  
  It is not documented as such (as far as I can see), but pci_dev_put is
  safe to call with NULL. And there are other places in the kernel that
  explicitly use that fact.
 
 Some places check, others do not. I'll leave it be for now but might take
 care of it during some future cleanup. Thanks for pointing it out, though.

No worries.
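
For reference, the pattern under discussion is simply (illustrative
device IDs, not necessarily the driver's actual lookup):

	struct pci_dev *iob = pci_get_device(PCI_VENDOR_ID_PASEMI,
					     0xa001, NULL);
	/* ... use iob if non-NULL ... */
	pci_dev_put(iob);	/* safe even when the lookup returned NULL */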

-- 
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/




Re: [PATCH 2.6.23 RESEND] cxgb3 - Fix dev-priv usage

2007-08-24 Thread Jeff Garzik

Divy Le Ray wrote:

From: Divy Le Ray [EMAIL PROTECTED]

cxgb3 used netdev_priv() and dev->priv for different purposes.
In 2.6.23, netdev_priv() == dev->priv, so cxgb3 needs a fix.
This patch is a partial backport of Dave Miller's changes in the 
net-2.6.24 git branch. 


Without this fix, cxgb3 crashes on 2.6.23.

Signed-off-by: Divy Le Ray [EMAIL PROTECTED]
---

 drivers/net/cxgb3/adapter.h   |   10 +++
 drivers/net/cxgb3/cxgb3_main.c|  126 +
 drivers/net/cxgb3/cxgb3_offload.c |6 +-
 drivers/net/cxgb3/sge.c   |   23 ---
 drivers/net/cxgb3/t3cdev.h|3 -
 5 files changed, 100 insertions(+), 68 deletions(-)



applied




Re: [PATCH 1/2] [DM9000] Added support for big-endian hosts

2007-08-24 Thread Jeff Garzik

Laurent Pinchart wrote:

This patch splits the receive status into 8-bit wide fields and converts
the packet length from little-endian to CPU byte order.

Signed-off-by: Laurent Pinchart [EMAIL PROTECTED]
---
 drivers/net/dm9000.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/dm9000.c b/drivers/net/dm9000.c
index c3de81b..a424810 100644
--- a/drivers/net/dm9000.c
+++ b/drivers/net/dm9000.c
@@ -894,7 +894,8 @@ dm9000_timer(unsigned long data)
 }
 
 struct dm9000_rxhdr {
-	u16	RxStatus;
+	u8	RxPktReady;
+	u8	RxStatus;
 	u16	RxLen;
 } __attribute__((__packed__));


why does this not need endian conversions as well?
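
(The rule at stake, sketched: single bytes have no byte order, so the
new u8 fields are endian-safe by construction; only multi-byte fields
coming from the little-endian chip need converting when used, e.g.

	u16 len = le16_to_cpu(rxhdr.RxLen);	/* swaps on big-endian hosts */

assuming, as the changelog says, that the chip stores the length
little-endian.)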

Jeff





Re: [PATCH 4/4] ehea: show physical port state

2007-08-24 Thread Jeff Garzik

Jan-Bernd Themann wrote:

Introduces a module parameter to decide whether the physical
port link state is propagated to the network stack or not.
It makes sense not to take the physical port state into account
on machines with multiple logical partitions that communicate
with each other: this is always possible no matter what the
physical port state is, so the eHEA can be considered a switch there.

Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]

---
 drivers/net/ehea/ehea.h  |5 -
 drivers/net/ehea/ehea_main.c |   14 +-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index d67f97b..8d58be5 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -39,7 +39,7 @@
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"

-#define DRV_VERSION	"EHEA_0073"
+#define DRV_VERSION	"EHEA_0074"
 
 /* eHEA capability flags */

 #define DLPAR_PORT_ADD_REM 1
@@ -402,6 +402,8 @@ struct ehea_mc_list {
 
 #define EHEA_PORT_UP 1

 #define EHEA_PORT_DOWN 0
+#define EHEA_PHY_LINK_UP 1
+#define EHEA_PHY_LINK_DOWN 0
 #define EHEA_MAX_PORT_RES 16
 struct ehea_port {
struct ehea_adapter *adapter;/* adapter that owns this port */
@@ -427,6 +429,7 @@ struct ehea_port {
u32 msg_enable;
u32 sig_comp_iv;
u32 state;
+   u8 phy_link;
u8 full_duplex;
u8 autoneg;
u8 num_def_qps;
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index db57474..1804c99 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -53,17 +53,21 @@ static int rq3_entries = EHEA_DEF_ENTRIES_RQ3;
 static int sq_entries = EHEA_DEF_ENTRIES_SQ;
 static int use_mcs = 0;
 static int num_tx_qps = EHEA_NUM_TX_QP;
+static int show_phys_link = 0;
 
 module_param(msg_level, int, 0);

 module_param(rq1_entries, int, 0);
 module_param(rq2_entries, int, 0);
 module_param(rq3_entries, int, 0);
 module_param(sq_entries, int, 0);
+module_param(show_phys_link, int, 0);
 module_param(use_mcs, int, 0);
 module_param(num_tx_qps, int, 0);
 
 MODULE_PARM_DESC(num_tx_qps, "Number of TX-QPS");

 MODULE_PARM_DESC(msg_level, "msg_level");
+MODULE_PARM_DESC(show_phys_link, "Show link state of external port. "
+		 "1:yes, 0:no. Default = 0");
 MODULE_PARM_DESC(rq3_entries, "Number of entries for Receive Queue 3 "
 		 "[2^x - 1], x = [6..14]. Default = "
 		 __MODULE_STRING(EHEA_DEF_ENTRIES_RQ3) ")");
@@ -814,7 +818,9 @@ int ehea_set_portspeed(struct ehea_port *port, u32 port_speed)
 			ehea_error("Failed setting port speed");
 		}
 	}
-	netif_carrier_on(port->netdev);
+	if (!show_phys_link || (port->phy_link == EHEA_PHY_LINK_UP))
+		netif_carrier_on(port->netdev);
+
 	kfree(cb4);
 out:
 	return ret;
@@ -869,13 +875,19 @@ static void ehea_parse_eqe(struct ehea_adapter *adapter, u64 eqe)
 	}

 	if (EHEA_BMASK_GET(NEQE_EXTSWITCH_PORT_UP, eqe)) {
+		port->phy_link = EHEA_PHY_LINK_UP;
 		if (netif_msg_link(port))
 			ehea_info("%s: Physical port up",
 				  port->netdev->name);
+		if (show_phys_link)
+			netif_carrier_on(port->netdev);
 	} else {
+		port->phy_link = EHEA_PHY_LINK_DOWN;
 		if (netif_msg_link(port))
 			ehea_info("%s: Physical port down",
 				  port->netdev->name);
+		if (show_phys_link)
+			netif_carrier_off(port->netdev);


I think it's misnamed, calling it show_xxx, because this (as the
change description notes) controls propagation of the carrier state to
the network stack.
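
Something along these lines would describe the behavior better (a
hypothetical rename, not a merged patch):

	static int prop_carrier_state = 0;
	module_param(prop_carrier_state, int, 0);
	MODULE_PARM_DESC(prop_carrier_state,
			 "Propagate carrier state of physical interface to "
			 "the stack. 1:yes, 0:no. Default = 0");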


Jeff





Re: [PATCH] ucc_geth: kill unused include

2007-08-24 Thread Jeff Garzik

Kumar Gala wrote:

The ucc_geth_mii code is based on the gianfar_mii code that used to include
ocp.h.  ucc_geth never needed it, and it causes issues when we want to kill
arch/ppc includes from arch/powerpc.

Signed-off-by: Kumar Gala [EMAIL PROTECTED]
---

Jeff, if you have an issue with this for 2.6.23, I'd prefer to push it via
the powerpc.git trees in 2.6.24 as part of a larger cleanup.  Let me know
one way or the other.

- k

 drivers/net/ucc_geth_mii.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ucc_geth_mii.c b/drivers/net/ucc_geth_mii.c
index 6c257b8..df884f0 100644
--- a/drivers/net/ucc_geth_mii.c
+++ b/drivers/net/ucc_geth_mii.c
@@ -32,7 +32,6 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/platform_device.h>
-#include <asm/ocp.h>
 #include <linux/crc32.h>
 #include <linux/mii.h>


Feel free to push via PPC git




Re: [PATCH] DM9000: fix interface hang under load

2007-08-24 Thread Jeff Garzik

Florian Westphal wrote:

When transferring data at full speed, the DM9000 network interface
sometimes stops sending/receiving data. Worse, ksoftirqd consumes
100% cpu and the net TX watchdog never triggers.
Fix by using spin_lock_irqsave() in dm9000_start_xmit() to prevent the
interrupt handler from interfering.

Signed-off-by: Florian Westphal [EMAIL PROTECTED]
---
 Actually the comments ('Disable all interrupts', iow(db, DM9000_IMR, IMR_PAR), etc.)
 give the impression that the interrupt handler cannot run during dm9000_start_xmit(),
 however this isn't correct (perhaps the chipset has some weird timing issues?).
 The interface lockup usually occurs between 30 and 360 seconds after starting to
 transmit data (netcat /dev/zero) at full speed; with this patch applied I haven't
 been able to reproduce the hangs yet (ran for > 2h).
 FTR: This is a dm9000 on XScale-PXA255 rev 6 (ARMv5TE)/Compulab CM-x255, i.e.
 a module not supported by the vanilla kernel. Tested on (patched) 2.6.18.
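
The shape of the fix, for context (a sketch assuming dm9000's
board_info_t priv structure and its existing lock; not the applied
diff itself):

	static int dm9000_start_xmit(struct sk_buff *skb,
				     struct net_device *dev)
	{
		board_info_t *db = netdev_priv(dev);
		unsigned long flags;

		spin_lock_irqsave(&db->lock, flags);
		/* ... write skb to the chip's TX SRAM, request send ... */
		spin_unlock_irqrestore(&db->lock, flags);

		return 0;
	}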

 dm9000.c |   25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)


applied




Re: [METH] Don't use GFP_DMA for zone allocation.

2007-08-24 Thread Jeff Garzik

Ralf Baechle wrote:

IP32 doesn't even have a ZONE_DMA so no point in using GFP_DMA in any
IP32-specific device driver.

Signed-off-by: Ralf Baechle [EMAIL PROTECTED]


applied




Re: [PATCH 1/4] ehea: fix interface to DLPAR tools

2007-08-24 Thread Jeff Garzik

Jan-Bernd Themann wrote:

Userspace DLPAR tool expects decimal numbers to be written to
and read from sysfs entries.

Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]


applied 1-3




Re: [PATCH 1/2] myri10ge: use pcie_get/set_readrq

2007-08-24 Thread Jeff Garzik

Brice Goglin wrote:

Based on a patch from Peter Oruba, convert myri10ge to use pcie_get_readrq()
and pcie_set_readrq() instead of our own PCI calls and arithmetics.

These driver changes incorporate the proposed PCI-X / PCI-Express read byte
count interface.  Reading and setting those values doesn't take place
manually, instead wrapping functions are called to allow quirks for some
PCI bridges.

Signed-off-by: Brice Goglin [EMAIL PROTECTED]
Signed-off by: Peter Oruba [EMAIL PROTECTED]
Based on work by Stephen Hemminger [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---
 drivers/net/myri10ge/myri10ge.c |   32 ++--
 1 file changed, 6 insertions(+), 26 deletions(-)


applied 1-2




Re: [PATCH] [NET]: fix multicast list when cloning sockets

2007-08-24 Thread David Miller
From: Flavio Leitner [EMAIL PROTECTED]
Date: Tue, 31 Jul 2007 15:29:40 -0300

 On Tue, Jul 31, 2007 at 12:00:41AM -0300, Arnaldo Carvalho de Melo wrote:
  On 7/30/07, David Miller [EMAIL PROTECTED] wrote:
    Allowing non-datagram sockets to end up with a non-NULL inet->mc_list
   in the first place is a bug.
  
   Multicast subscriptions cannot even be used with TCP and DCCP, which
   are the only two users of these connection oriented socket functions.
  
   The first thing that TCP and DCCP do, in fact, for input packet
   processing is drop the packet if it is not unicast.
  
   Therefore the fix really is for the inet layer to reject multicast
   subscription requests on sockets for which that absolutely does not
   make sense.  There is no reason these functions in
   inet_connection_sock.c should need to be mindful of multicast
   state. :-)
  
  Well, we can add a BUG_ON there then 8)
  
  Flavio, take a look at  do_ip_setsockopt in net/ipv4/ip_sockglue.c, in
  the IP_{ADD,DROP}_MEMBERSHIP labels.
  
  Don't forget IPV6 (net/ipv6/ipv6_sockglue.c)
 
 yes, right. What about the one below?
 
 [NET]: Fix IP_ADD/DROP_MEMBERSHIP to handle only connectionless
 
 Fix IP[V6]_ADD_MEMBERSHIP and IP[V6]_DROP_MEMBERSHIP to
 return -EPROTO for connection-oriented sockets.
 
 Signed-off-by: Flavio Leitner [EMAIL PROTECTED]
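
The shape of such a check, for illustration (connection-oriented
sockets are the inet_connection_sock users, so something like this in
the setsockopt paths; a sketch, not the applied diff):

	case IP_ADD_MEMBERSHIP:
	case IP_DROP_MEMBERSHIP:
		if (inet_sk(sk)->is_icsk)
			return -EPROTO;	/* no multicast on TCP/DCCP sockets */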

This looks great, patch applied.

Thanks!


Re: [kj] is_power_of_2 in net/core/neighbour.c

2007-08-24 Thread David Miller
From: vignesh babu [EMAIL PROTECTED]
Date: Mon, 13 Aug 2007 18:33:47 +0530

 Replacing the open-coded n & (n - 1) power-of-2 check with is_power_of_2(n)
 
 Signed-off-by: vignesh babu [EMAIL PROTECTED]
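
For reference, the transformation (variable name illustrative;
is_power_of_2() lives in <linux/log2.h>):

	- BUG_ON(new_entries & (new_entries - 1));
	+ BUG_ON(!is_power_of_2(new_entries));

Note the helper also rejects n == 0, which the open-coded test would
let through as a power of two.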

Patch applied, thanks.


Re: [PATCH] ethernet: optimize memcpy and memset

2007-08-24 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 17 Aug 2007 16:29:50 -0700

 The ethernet header management only needs to handle a fixed-size
 address (6 bytes). If the memcpy/memset calls are changed to
 be passed a constant length, the compiler can optimize for
 this case (and, if it is smart, eliminate string instructions).
 
 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
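
The effect described, illustrated on the ethernet header helpers: the
length is the compile-time constant ETH_ALEN (== 6), so gcc can expand
the calls inline:

	memcpy(eth->h_dest, daddr, ETH_ALEN);
	memcpy(eth->h_source, dev->dev_addr, ETH_ALEN);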

Applied.


Re: [PATCH] atm: replace DPRINTK() with pr_debug

2007-08-24 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 17 Aug 2007 18:31:31 -0700

 Get rid of the DPRINTK macro in ATM and use pr_debug (from kernel.h).
 Using the standard macro is cleaner and ensures bad arguments and
 format strings are caught even when debugging is off.
 
 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
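
For reference (illustrative call): pr_debug() compiles away unless
DEBUG is defined, but unlike an empty homegrown macro the format
string and arguments are still type-checked:

	pr_debug("%s: vcc %p, skb %p\n", __func__, vcc, skb);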

Applied to net-2.6.24, thanks Stephen.


Re: [PATCH] net/802: indentation cleanup

2007-08-24 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Fri, 17 Aug 2007 18:53:11 -0700

 Run the 802 related protocols through Lindent (and hand cleanup)
 to fix indentation and whitespace style issues.

Applied to net-2.6.24, thanks.


Re: [PATCH] net/802: indentation cleanup

2007-08-24 Thread David Miller
From: David Miller [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 22:39:40 -0700 (PDT)

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Fri, 17 Aug 2007 18:53:11 -0700
 
  Run the 802 related protocols through Lindent (and hand cleanup)
  to fix indentation and whitespace style issues.
 
 Applied to net-2.6.24, thanks.

Actually reverted.

Nothing in the world makes me more furious than a coding
style change that wasn't even compile tested.

net/802/tr.c: In function 'tr_add_rif_info':
net/802/tr.c:400: error: expected identifier before '!' token

Stephen, I see you do things like this, forget sign-offs, and many
other things that all say, in big huge letters, sloppy.

Please shape up and test your changes no matter how trivial.

Thanks.


Re: [PATCH 1/5] [TCP]: Remove unnecessary wrapper tcp_packets_out_dec

2007-08-24 Thread David Miller
From: Ilpo Järvinen [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 16:16:29 +0300

 Makes the caller side more obvious; there's no need to have
 a wrapper for this one-liner!
 
 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Applied, thanks.


Re: [PATCH 2/5] [TCP]: tcp_packets_out_inc to tcp_output.c (no callers elsewhere)

2007-08-24 Thread David Miller
From: Ilpo Järvinen [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 16:16:30 +0300

 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Applied.


Re: [PATCH 3/5] [TCP]: Rename tcp_ack_packets_out - tcp_rearm_rto

2007-08-24 Thread David Miller
From: Ilpo Järvinen [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 16:16:31 +0300

 The only thing that tiny function does is rearm the RTO (if
 necessary), so name it accordingly.
 
 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Applied, thanks.


Re: [PATCH 4/5] [TCP]: Discard fuzzy SACK blocks

2007-08-24 Thread David Miller
From: Ilpo Järvinen [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 16:16:32 +0300

 The SACK processing code has been a sort of Russian roulette, as no
 validation of SACK blocks was previously attempted. Besides, it
 is not very clear what all kinds of broken SACK blocks really
 mean (e.g., one that has its start and end sequence numbers
 reversed). So now close the roulette once and for all.
 
 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Thanks a lot for coding this up, I like it a lot, applied.

I have some minor worries about the D-SACK lower bound, but
it's probably OK and I'm just being paranoid :-)
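
The kind of validation involved, sketched with the tcp before()/after()
seq helpers (not Ilpo's exact code; D-SACK blocks below snd_una get the
separate handling the thread mentions):

	/* a usable SACK block must be ordered and within what was sent */
	if (!before(start_seq, end_seq))
		return 0;		/* reversed or empty block */
	if (after(end_seq, tp->snd_nxt) || !after(end_seq, tp->snd_una))
		return 0;		/* outside the send window */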


Re: [PATCH 5/5] [TCP] MIB: Add counters for discarded SACK blocks

2007-08-24 Thread David Miller
From: Ilpo Järvinen [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 16:16:33 +0300

 In the DSACK case, some events are not extraordinary, such as a
 DSACK generated by packet duplication. Such DSACKs can easily arrive
 below snd_una when undo_marker is not set (TCP being in CA_Open), and
 counting them among SACK discards would likely just mislead if they
 occur in some scenario where there are other problems as well.
 Similarly, excessively delayed packets can cause normal DSACKs.
 Therefore, separate counters are allocated for DSACK events.
 
 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]

Also applied, thanks a lot!


Re: ppp dependency on slhc

2007-08-24 Thread David Miller
From: Andrew Morton [EMAIL PROTECTED]
Date: Tue, 21 Aug 2007 01:30:57 -0700

 ERROR: slhc_init [drivers/net/ppp_generic.ko] undefined!
 ERROR: slhc_free [drivers/net/ppp_generic.ko] undefined!
 ERROR: slhc_uncompress [drivers/net/ppp_generic.ko] undefined!
 ERROR: slhc_compress [drivers/net/ppp_generic.ko] undefined!
 ERROR: slhc_toss [drivers/net/ppp_generic.ko] undefined!
 ERROR: slhc_remember [drivers/net/ppp_generic.ko] undefined!
 
 yet another reminder that select doesn't work ;)

Indeed :-)

However it is a good example of the kind of case select was
made for: nobody should have to know about SLHC in order to
get PPP offered in the config.
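
In Kconfig terms that is just (matching what drivers/net/Kconfig does):

	config PPP
		tristate "PPP (point-to-point protocol) support"
		select SLHC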

