Re: [PATCH] e1000: Work around 82571 completion timeout on Pseries HW

2007-06-23 Thread Christoph Hellwig
On Thu, May 17, 2007 at 09:58:03AM -0500, Wen Xiong wrote:
 It really shouldn't be there at all because something in either the 
 intel
 or pseries hardware is totally buggy and we should disable features in
 the buggy one completely.
 
 Hi,
 
 There is no hardware issue on either the Intel or the PPC side.  The patch 
 works around a loophole in an early version of the PCI SIG spec. 
 Later revisions of the spec have corrected it.  We can implement it 
 for PPC only.  Other vendors may have the same issue.

In this case we should add a blacklist for implementations of the old
spec.  There should be a way to find specific bridges in the OF firmware
tree on powerpc, and similar things on other platforms as well.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix race in AF_UNIX

2007-06-23 Thread Miklos Szeredi
Eric, thanks for looking at this.

   There are races involving the garbage collector, that can throw away
   perfectly good packets with AF_UNIX sockets in them.
   
   The problems arise when a socket goes from installed to in-flight or
   vice versa during garbage collection.  Since gc is done with a
   spinlock held, this only shows up on SMP.
   
   Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
  
  I'm going to hold off on this one for now.
  
  Holding all of the read locks kind of defeats the purpose of using
  the per-socket lock.
  
  Can't you just lock purely around the receive queue operation?
 
  That's already protected by the receive queue spinlock.  The race
  however happens _between_ pushing the root set and marking of the
  in-flight but reachable sockets.
 
  If in that space any of the AF_UNIX sockets goes from in-flight to
  installed into a file descriptor, the garbage collector can miss it.
  If we want to protect against this using unix_sk(s)->readlock, then we
  have to hold all of them for the duration of the marking.
 
  Al, Alan, you have more experience with this piece of code.  Do you
  have better ideas about how to fix this?
 
 I haven't looked at the code closely enough to be confident of
 changing something in this area.  However the classic solution to this
 kind of gc problem is to mark things that are manipulated during
 garbage collection as dirty (not orphaned).
 
 It should be possible to fix this problem by simply changing gc_tree
 when we perform a problematic manipulation of a passed socket, such
 as installing a passed socket into the file descriptors of a process.
 
 Essentially the idea is moving the current code in the direction of
 an incremental gc algorithm.
 
 
 If I understand the race properly.  What happens is that we dequeue
 a socket (which has packets in it passing sockets) before the
 garbage collector gets to it.  Therefore the garbage collector
 never processes that socket.  So it sounds like we just
 need to call maybe_unmark_and_push or possibly just wait for
 the garbage collector to complete when we do that and the packet
 we have pulled out 

Right.  But the devil is in the details, and (as you correctly point
out later) to implement this, the whole locking scheme needs to be
overhauled.  Problems:

 - Using the queue lock to make the dequeue and the fd detach atomic
   wrt the GC is difficult, if not impossible: they are far from
   each other with various magic in between.  It would need thorough
   understanding of these functions and _big_ changes to implement.

 - Sleeping on u->readlock in GC is currently not possible, since that
   could deadlock with unix_dgram_recvmsg().  That function could
   probably be modified to release u->readlock, while waiting for
   data, similarly to unix_stream_recvmsg() at the cost of some added
   complexity.

 - Sleeping on u->readlock is also impossible, because GC is holding
   unix_table_lock for the whole operation.  We could release
   unix_table_lock, but then would have to cope with sockets coming
   and going, making the current socket iterator unworkable.

So theoretically it's quite simple, but it needs big changes.  And
this wouldn't even solve all the problems with the GC, like being a
possible DoS vector.

Miklos


Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Pierre Ynard
On Sat, Jun 23, 2007, David Miller wrote:
 From: David Stevens [EMAIL PROTECTED]
  Auto-configured addresses are used by the kernel. It has to
  have those addresses. But the kernel doesn't do DNS look-ups, or
  write resolv.conf; that's the difference, for me.
 
 I totally agree with David, this stuff definitely does not belong
 in the kernel.

It is my understanding that you think that IP stack configuration
belongs in the kernel whereas DNS does not, right?

Then I have a question: does RS-RA management belong in the kernel or
not?

-- 
Pierre Ynard
WTS #51 - No phone
A soul in a body is like a drawing on a sheet of paper.


[NETLINK]: attr: add nested compat attribute type

2007-06-23 Thread Patrick McHardy
Add support for the nested compat attribute type to netlink.

Thomas, I forgot to CC you on the related rtnetlink/iproute
patches, please have a look on netdev.

[NETLINK]: attr: add nested compat attribute type

Add a nested compat attribute type that can be used to convert
attributes that contain a structure to nested attributes in a
backwards compatible way.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit fb99bf7aa7d9dd2af24d75d3a574ef17c4bae079
tree 4373b1c05a23d2544e7f53a9769ccc973ab913f7
parent 82a7e0e31d94515507be3ed8ac8e7866ab9ab928
author Patrick McHardy [EMAIL PROTECTED] Sat, 23 Jun 2007 11:24:26 +0200
committer Patrick McHardy [EMAIL PROTECTED] Sat, 23 Jun 2007 11:24:26 +0200

 include/net/netlink.h |   84 +
 net/netlink/attr.c    |   11 ++
 2 files changed, 95 insertions(+), 0 deletions(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 7b510a9..d7b824b 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -118,6 +118,9 @@
  * Nested Attributes Construction:
  *   nla_nest_start(skb, type)        start a nested attribute
  *   nla_nest_end(skb, nla)           finalize a nested attribute
 + *   nla_nest_compat_start(skb, type, start a nested compat attribute
 + *                         len, data)
 + *   nla_nest_compat_end(skb, type)   finalize a nested compat attribute
  *   nla_nest_cancel(skb, nla)        cancel nested attribute construction
  *
  * Attribute Length Calculations:
@@ -152,6 +155,7 @@
  *   nla_find_nested()         find attribute in nested attributes
  *   nla_parse()               parse and validate stream of attrs
  *   nla_parse_nested()        parse nested attributes
 + *   nla_parse_nested_compat() parse nested compat attributes
  *   nla_for_each_attr()       loop over all attributes
  *   nla_for_each_nested()     loop over the nested attributes
  *=
@@ -170,6 +174,7 @@ enum {
NLA_FLAG,
NLA_MSECS,
NLA_NESTED,
+   NLA_NESTED_COMPAT,
NLA_NUL_STRING,
NLA_BINARY,
__NLA_TYPE_MAX,
@@ -190,6 +195,7 @@ enum {
  *NLA_NUL_STRING   Maximum length of string (excluding NUL)
  *NLA_FLAG Unused
  *NLA_BINARY   Maximum length of attribute payload
+ *NLA_NESTED_COMPATExact length of structure payload
  *All otherExact length of attribute payload
  *
  * Example:
@@ -733,6 +739,39 @@ static inline int nla_parse_nested(struct nlattr *tb[], int maxtype,
 {
return nla_parse(tb, maxtype, nla_data(nla), nla_len(nla), policy);
 }
+
+/**
+ * nla_parse_nested_compat - parse nested compat attributes
+ * @tb: destination array with maxtype+1 elements
+ * @maxtype: maximum attribute type to be expected
+ * @nla: attribute containing the nested attributes
+ * @data: pointer to point to contained structure
+ * @len: length of contained structure
+ * @policy: validation policy
+ *
+ * Parse a nested compat attribute. The compat attribute contains a structure
+ * and optionally a set of nested attributes. On success the data pointer
+ * points to the nested data and tb contains the parsed attributes
+ * (see nla_parse).
+ */
+static inline int __nla_parse_nested_compat(struct nlattr *tb[], int maxtype,
+   struct nlattr *nla,
+   const struct nla_policy *policy,
+   int len)
+{
+   if (nla_len(nla) < len)
+   return -1;
+   if (nla_len(nla) >= NLA_ALIGN(len) + sizeof(struct nlattr))
+   return nla_parse_nested(tb, maxtype,
+   nla_data(nla) + NLA_ALIGN(len),
+   policy);
+   memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1));
+   return 0;
+}
+
+#define nla_parse_nested_compat(tb, maxtype, nla, policy, data, len) \
+({ data = nla_len(nla) >= len ? nla_data(nla) : NULL; \
+   __nla_parse_nested_compat(tb, maxtype, nla, policy, len); })
 /**
  * nla_put_u8 - Add a u16 netlink attribute to a socket buffer
  * @skb: socket buffer to add attribute to
@@ -965,6 +1004,51 @@ static inline int nla_nest_end(struct sk_buff *skb, struct nlattr *start)
 }
 
 /**
+ * nla_nest_compat_start - Start a new level of nested compat attributes
+ * @skb: socket buffer to add attributes to
+ * @attrtype: attribute type of container
+ * @attrlen: length of structure
+ * @data: pointer to structure
+ *
+ * Start a nested compat attribute that contains both a structure and
+ * a set of nested attributes.
+ *
+ * Returns the container attribute
+ */
+static inline struct nlattr *nla_nest_compat_start(struct sk_buff *skb,
+  int attrtype, int attrlen,
+  

Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Michael Buesch
On Saturday 23 June 2007 02:09:18 C. Scott Ananian wrote:
   diff -ruHpN -X dontdiff linux-2.6.22-rc5-orig/include/net/ip6_rdnss.h
   linux-2.6.22-rc5/include/net/ip6_rdnss.h
   --- linux-2.6.22-rc5-orig/include/net/ip6_rdnss.h  1969-12-31 19:00:00.0 -0500
   +++ linux-2.6.22-rc5/include/net/ip6_rdnss.h  2007-06-21 18:16:33.0 -0400
   @@ -0,0 +1,58 @@
   +#ifndef _NET_IP6_RDNSS_H
   +#define _NET_IP6_RDNSS_H
   +
   +#ifdef __KERNEL__
   +
   +#include <linux/in6.h>
   +
   +struct nd_opt_rdnss {
   +__u8type;
   +__u8length;
   +#if defined(__BIG_ENDIAN_BITFIELD)
   +__u8priority:4,
   +open:1,
   +reserved1:3;
   +#elif defined(__LITTLE_ENDIAN_BITFIELD)
   +__u8reserved1:3,
   +open:1,
   +priority:4;
   +#else
   +# error not little or big endian
   +#endif
 
  That is not endianness-safe. Don't use foo:x at all
  for stuff where a specific endianness is needed. The
  compiler doesn't make any guarantee about it.
 
 This was copied directly from include/net/ip6_route.h.  I believe that
 it does in fact work, and I (for one) find this much more readable
 than the alternative.  If it is in fact broken, then
 include/net/ip6_route.h (and the 35 other files which use this #ifdef
 in this manner) should be fixed.

Yeah, it might work. But I think the compiler doesn't guarantee
you anything about it.

-- 
Greetings Michael.


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Patrick McHardy
Eric W. Biederman wrote:
 -- The basic design
 
 There will be a network namespace structure that holds the global
 variables for a network namespace, making those global variables
 per network namespace.
 
 One of those per network namespace global variables will be the
 loopback device.  Which means the network namespace a packet resides
 in can be found simply by examining the network device or the socket
 the packet is traversing.
 
 Either a pointer to this global structure will be passed into
 the functions that need to reference per network namespace variables
 or a structure that is already passed in (such as the network device)
 will be modified to contain a pointer to the network namespace
 structure.


I believe OpenVZ stores the current namespace somewhere global,
which avoids passing the namespace around. Couldn't you do this
as well?

 Depending upon the data structure it will either be modified to hold
 a per entry network namespace pointer or it there will be a separate
 copy per network namespace.  For large global data structures like
 the ipv4 routing cache hash table adding an additional pointer to the
 entries appears the more reasonable solution.


So the routing cache is shared between all namespaces?

 --- Performance
 
 In initial measurements the only performance overhead we have been
 able to measure is getting the packet to the network namespace.
 Going through ethernet bridging or routing seems to trigger copies
 of the packet that slow things down.  When packets go directly to
 the network namespace no performance penalty has yet been measured.


It would be interesting to find out what's triggering these copies.
Do you have NAT enabled?


[SKBUFF]: Fix incorrect config #ifdef around skb_copy_secmark

2007-06-23 Thread Patrick McHardy
[SKBUFF]: Fix incorrect config #ifdef around skb_copy_secmark

secmark doesn't depend on CONFIG_NET_SCHED.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7c6a34e..8d43ae6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -434,8 +434,8 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
n->tc_verd = CLR_TC_MUNGED(n->tc_verd);
C(iif);
 #endif
-   skb_copy_secmark(n, skb);
 #endif
+   skb_copy_secmark(n, skb);
C(truesize);
atomic_set(&n->users, 1);
C(head);


Re: [patch 5/7] CAN: Add virtual CAN netdevice driver

2007-06-23 Thread Oliver Hartkopp
Patrick McHardy wrote:
 Urs Thuermann wrote:
   
 Patrick McHardy [EMAIL PROTECTED] writes:


 
 Is there a reason why you're still doing the allocate n devices
 on init thing instead of using the rtnl_link API?
   
 Sorry, it's simply a matter of time.  We have been extremely busy with
 other projects and two presentations (mgmt, customers, and press) the
 last two weeks and have worked on the other changes this week.  I'm
 sorry I haven't yet been able to look at your rtnl_link code close
 enough, but it's definitely on my todo list.  Starting on Sunday I'll
 be on a business trip to .jp for a week, and I hope I get to it in
 that week, otherwise on return.
 


 Sorry, but busy is no reason for merging code that has deprecated
 (at least by me :)) behaviour. Please change this before submitting
 for inclusion.
   

Dear Patrick,

I was just looking through the mailings regarding your suggested changes
(e.g. in VLAN, DUMMY and IFB); none of them has gone into the kernel yet,
and the discussion on some topics (especially in the VLAN case) is still
running.

I have got an impression of what you intend to have, and it looks
reasonable and good to me. But it is still in progress, and therefore I
would not like to be the first adopter, as you might understand. There is
no question that we would update to your approach once it is part of the
kernel, finalized in discussion and somewhat stable. But it doesn't seem
adequate to me to make support for your brand-new approach some kind of
gate for inclusion into the mainstream kernel :-(

So it looks to me like we should get feedback from Jamal on whether our
usage of skb->iif fits the intention of skb->iif, and whether we should
set the incoming interface index ourselves or let netif_receive_skb()
do this job. After that discussion I currently cannot see any reason why
the PF_CAN support should not go into the mainstream kernel. I get daily
positive community feedback about this implementation for the Linux
kernel and its elegant manner of usage for application programmers.

On our TODO list there are the netlink support as well as the usage of
hrtimers in our broadcast manager, but neither has any vital influence
on the new protocol family PF_CAN, and therefore they should not slow
down the inclusion process. Be sure that we'll support netlink
immediately once it hits the road for other drivers as well.

Best regards,
Oliver




Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Rémi Denis-Courmont
Hello,

On Saturday 23 June 2007, David Stevens wrote:
 Why not make the application that writes resolv.conf
 also listen on a raw ICMPv6 socket? I don't believe you'd need
 any kernel changes, then, and it seems pretty simple and
 straightforward.

Unfortunately, ICMPv6 raw sockets will not work quite properly here, 
without modifications. At the moment, such a socket will queue just 
about any Router Advertisement that is received by the host.

Now, assuming the userland daemon did sanity check the message (properly 
formatted, source and destination addresses are sane, etc), it needs to 
know whether the IPv6 kernel stack has accepted it or not. It could 
be that the interface the RA was received on had autoconf disabled at 
the time the packet showed up, or it could be that the system is 
currently configured as a router, or it could be that we have a 
SeND-patched kernel and the RA did not pass authentication checks.

And then, what happens if IPv6 networking has been initialized before 
init got the chance to start the daemon, for instance root over 
NFS/IPv6? The RA is lost.

Similarly, the daemon has no way to know when information gathered from 
an RA becomes invalid. Of course, it can duplicate the lifetime timers 
in userland, but only the kernel knows if the link has been reset to 
off and on earlier than lifetime expiration.


Whether parsing RDNSS-in-RA belongs in the kernel is irrelevant to me, as 
the kernel does not provide any interface for userland to do it 
properly at the moment.

-- 
Rémi Denis-Courmont
http://www.remlab.net/




Re: [patch 5/7] CAN: Add virtual CAN netdevice driver

2007-06-23 Thread Patrick McHardy
Oliver Hartkopp wrote:
 Patrick McHardy wrote:
 

Sorry, it's simply a matter of time.  We have been extremely busy with
other projects and two presentations (mgmt, customers, and press) the
last two weeks and have worked on the other changes this week.  I'm
sorry I haven't yet been able to look at your rtnl_link code close
enough, but it's definitely on my todo list.  Starting on Sunday I'll
be on a business trip to .jp for a week, and I hope I get to it in
that week, otherwise on return.



Sorry, but busy is no reason for merging code that has deprecated
(at least by me :)) behaviour. Please change this before submitting
for inclusion.
  
 
 i was just looking through the mailings regarding your suggested changes
 (e.g. in VLAN, DUMMY and IFB) and none of them currently went into the
 kernel and the discussion on some topics (especially in the VLAN case)
 is just running.


They are all in the net-2.6.23 tree.


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Ben Greear

Patrick McHardy wrote:

Eric W. Biederman wrote:
  

-- The basic design

There will be a network namespace structure that holds the global
variables for a network namespace, making those global variables
per network namespace.

One of those per network namespace global variables will be the
loopback device.  Which means the network namespace a packet resides
in can be found simply by examining the network device or the socket
the packet is traversing.

Either a pointer to this global structure will be passed into
the functions that need to reference per network namespace variables
or a structure that is already passed in (such as the network device)
will be modified to contain a pointer to the network namespace
structure.




I believe OpenVZ stores the current namespace somewhere global,
which avoids passing the namespace around. Couldn't you do this
as well?
  

Will we be able to have a single application be in multiple name-spaces?

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: [SKBUFF]: Fix incorrect config #ifdef around skb_copy_secmark

2007-06-23 Thread James Morris
Thanks.

Acked-by: James Morris [EMAIL PROTECTED]




-- 
James Morris
[EMAIL PROTECTED]


Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Rémi Denis-Courmont
On Saturday 23 June 2007, David Stevens wrote:
 No, in fact! I didn't hear anyone suggesting that all of
 neighbor discovery be pushed out of the kernel. All I suggested is
 that you read a raw ICMPv6 socket for RA's that have the RDNS header
 and the app _process_the_RDNS_header. The kernel should still
 continue to do everything it needs to with the kernel data in the RA.
 Then you just need a hash table (or maybe just a list -- there
 shouldn't be a lot of them) and a timer to delete them when the RDNS
 expiration hits. Easy, right?

The exact thing I pointed out does not work. I *DID* write RA parsing in 
userland in the past.

 You might have to change the icmp6_filter, if RA's are not
 already copied to raw sockets (I don't know either way offhand),
 but that's a trivial kernel patch; otherwise, I don't believe you
 have to do anything but read the socket and process the RDNS header
 on RAs you receive.

To reiterate:

How do I authenticate SeND RA? How do I deal with the link going down 
before the expiration? How do I know this interface is doing autoconf 
at all?

-- 
Rémi Denis-Courmont
http://www.remlab.net/




Re: [PATCH] Ethernet driver for EISA only SNI RM200/RM400 machines

2007-06-23 Thread Andrew Morton
 On Fri, 22 Jun 2007 21:53:58 +0200 [EMAIL PROTECTED] (Thomas Bogendoerfer) wrote:
 Hi,
 
 This is a new ethernet driver, which uses the code taken out of lasi_82596
 (done by the other patch I just sent).
 
 Thomas.
 
 
 Ethernet driver for EISA only SNI RM200/RM400 machines
 
 ...

 +static char sni_82596_string[] = "snirm_82596";

const?

 +
 +#define DMA_ALLOC  dma_alloc_coherent
 +#define DMA_FREE   dma_free_coherent
 +#define DMA_WBACK(priv, addr, len) do { } while (0)
 +#define DMA_INV(priv, addr, len)   do { } while (0)
 +#define DMA_WBACK_INV(priv, addr, len) do { } while (0)
 +
 +#define SYSBUS  0x4400
 +
 +/* big endian CPU, 82596 little endian */
 +#define SWAP32(x)   cpu_to_le32((u32)(x))
 +#define SWAP16(x)   cpu_to_le16((u16)(x))
 +
 +#define OPT_MPU_16BIT    0x01
 +
 +static inline void CA(struct net_device *dev);
 +static inline void MPU_PORT(struct net_device *dev, int c, dma_addr_t x);

These two functions' implementations could be moved to before the #include,
so we wouldn't need to forward-declare them?

 +#include "lib82596.c"

ugh.  Is this really unavoidable?

 +MODULE_AUTHOR("Thomas Bogendoerfer");
 +MODULE_DESCRIPTION("i82596 driver");
 +MODULE_LICENSE("GPL");
 +module_param(i596_debug, int, 0);
 +MODULE_PARM_DESC(i596_debug, "82596 debug mask");
 +
 +static inline void CA(struct net_device *dev)
 +{
 + struct i596_private *lp = netdev_priv(dev);
 + 
 + writel(0, lp->ca);
 +}
 +
 +
 +static inline void MPU_PORT(struct net_device *dev, int c, dma_addr_t x)
 +{
 + struct i596_private *lp = netdev_priv(dev);
 +
 + u32 v = (u32) (c) | (u32) (x);
 + 
 + if (lp->options & OPT_MPU_16BIT) {
 + writew(v & 0xffff, lp->mpu_port);
 + wmb(); udelay(1); /* order writes to MPU port */

Nope, please put these on separate lines.  No exceptions..

 + writew(v >> 16, lp->mpu_port);
 + } else {
 + writel(v, lp->mpu_port);
 + wmb(); udelay(1); /* order writes to MPU port */
 + writel(v, lp->mpu_port);
 + }
 +}

Three callsites: This looks too large to inline.

I see no reason why this and CA() have upper-case names?

 +
 +static int __devinit sni_82596_probe(struct platform_device *dev)
 +{
 + struct  net_device *netdevice;
 + struct i596_private *lp;
 + struct  resource *res, *ca, *idprom, *options;
 + int retval = -ENODEV;
 + static int init;
 + void __iomem *mpu_addr = NULL;
 + void __iomem *ca_addr = NULL;
 + u8 __iomem *eth_addr = NULL;
 + 
 + if (init == 0) {
 + printk(KERN_INFO SNI_82596_DRIVER_VERSION "\n");
 + init++;
 + }

Might as well do this message in the module_init() function?  There's a
per-probed-device message later on anyway.

The patchset tries to add rather a lot of new trailing whitespace btw.

 + res = platform_get_resource(dev, IORESOURCE_MEM, 0);
 + if (!res)
 + goto probe_failed;
 + mpu_addr = ioremap_nocache(res->start, 4);
 + if (!mpu_addr) {
 + retval = -ENOMEM;
 + goto probe_failed;
 + }
 + ca = platform_get_resource(dev, IORESOURCE_MEM, 1);
 + if (!ca)
 + goto probe_failed;
 + ca_addr = ioremap_nocache(ca->start, 4);
 + if (!ca_addr) {
 + retval = -ENOMEM;
 + goto probe_failed;
 + }
 + idprom = platform_get_resource(dev, IORESOURCE_MEM, 2);
 + if (!idprom)
 + goto probe_failed;
 + eth_addr = ioremap_nocache(idprom->start, 0x10);
 + if (!eth_addr) {
 + retval = -ENOMEM;
 + goto probe_failed;
 + }
 + options = platform_get_resource(dev, 0, 0);
 + if (!options)
 + goto probe_failed;
 +
 + printk(KERN_INFO "Found i82596 at 0x%x\n", res->start);
 +
 + netdevice = alloc_etherdev(sizeof(struct i596_private));
 + if (!netdevice) {
 + retval = -ENOMEM;
 + goto probe_failed;
 + }
 + SET_NETDEV_DEV(netdevice, &dev->dev);
 + platform_set_drvdata (dev, netdevice);
 +
 + netdevice->base_addr = res->start;
 + netdevice->irq = platform_get_irq(dev, 0);
 + 
 + /* someone seams to like messed up stuff */
 + netdevice->dev_addr[0] = readb(eth_addr + 0x0b);
 + netdevice->dev_addr[1] = readb(eth_addr + 0x0a);
 + netdevice->dev_addr[2] = readb(eth_addr + 0x09);
 + netdevice->dev_addr[3] = readb(eth_addr + 0x08);
 + netdevice->dev_addr[4] = readb(eth_addr + 0x07);
 + netdevice->dev_addr[5] = readb(eth_addr + 0x06);
 + iounmap(eth_addr);
 + 
 + if (!netdevice->irq) {
 + printk(KERN_ERR "%s: IRQ not found for i82596 at 0x%lx\n",
 + __FILE__, netdevice->base_addr);
 + goto probe_failed;
 + }
 + 
 + lp = netdev_priv(netdevice);
 + lp->options = options->flags & IORESOURCE_BITS;
 + lp->ca = ca_addr;
 + lp->mpu_port = mpu_addr;
 + 
 + retval = 

Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread David Stevens
[EMAIL PROTECTED] wrote on 06/23/2007 07:47:06 AM:

 Rémi and Simon give my responses very eloquently.  Although you could
 have yet-another-network-daemon redundantly process RA messages, the
 kernel is doing it already and it makes sense to just push this

It would be two pieces looking at the same packet, but it isn't
redundant processing. The kernel would ignore the RDNS header, and the
app would ignore everything else; everything would be processed once.

 Although parsing
 RA messages and processing expiry in userland looks barely-possible
 now,
barely possible?? See below.

 SeND support is really necessary for long-term IPv6 security, and
 duplicating SeND functionality in userland would be a nightmare.
 Further, the neighbor discovery protocol involves Router Solicitation
 messages which elicit the Router Advertisement reply, and we really
 don't want userland sending redundant Router Solicitation messages
 around, just because the kernel doesn't want to tell it what Router
 Advertisements it received.  I considered storing the *complete*
 Router Advertisement messages received and pushing them unparsed to
 userland, just to get around the bogus DNS in the kernel politics
 (hint: it's not a resolver in the kernel, it's just nameserver
 addresses being stored).  Does anyone really suggest that this would
 be a better solution?

No, in fact! I didn't hear anyone suggesting that all of
neighbor discovery be pushed out of the kernel. All I suggested is
that you read a raw ICMPv6 socket for RA's that have the RDNS header
and the app _process_the_RDNS_header. The kernel should still continue
to do everything it needs to with the kernel data in the RA. Then you
just need a hash table (or maybe just a list -- there shouldn't be a
lot of them) and a timer to delete them when the RDNS expiration hits.
Easy, right?
You might have to change the icmp6_filter, if RA's are not
already copied to raw sockets (I don't know either way offhand),
but that's a trivial kernel patch; otherwise, I don't believe you
have to do anything but read the socket and process the RDNS header
on RAs you receive.
 
 +-DLS



Re: [PATCH] fix race in AF_UNIX

2007-06-23 Thread Eric W. Biederman
Miklos Szeredi [EMAIL PROTECTED] writes:

 Right.  But the devil is in the details, and (as you correctly point
 out later) to implement this, the whole locking scheme needs to be
 overhauled.  Problems:

  - Using the queue lock to make the dequeue and the fd detach atomic
wrt the GC is difficult, if not impossible: they are far from
each other with various magic in between.  It would need thorough
understanding of these functions and _big_ changes to implement.

  - Sleeping on u->readlock in GC is currently not possible, since that
could deadlock with unix_dgram_recvmsg().  That function could
probably be modified to release u->readlock, while waiting for
data, similarly to unix_stream_recvmsg() at the cost of some added
complexity.

  - Sleeping on u->readlock is also impossible, because GC is holding
unix_table_lock for the whole operation.  We could release
unix_table_lock, but then would have to cope with sockets coming
and going, making the current socket iterator unworkable.

 So theoretically it's quite simple, but it needs big changes.  And
 this wouldn't even solve all the problems with the GC, like being a
 possible DoS vector.

Making the GC fully incremental will solve the DoS vector problem as
well.  Basically you do a fixed amount of reclaim in the new socket
allocation code.

It appears clear that since we can't stop the world and garbage
collect we need an incremental collector.

Eric


Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Simon Arlott
On 23/06/07 15:47, C. Scott Ananian wrote:
 Advertisements it received.  I considered storing the *complete*
 Router Advertisement messages received and pushing them unparsed to
 userland, just to get around the bogus DNS in the kernel politics
 (hint: it's not a resolver in the kernel, it's just nameserver
 addresses being stored).  Does anyone really suggest that this would
 be a better solution?

Yes, but I don't think it should be completely unparsed - it should be
possible to retrieve the data for a specific attribute type with
expiration information and with notification of changes.

The kernel has to read RAs anyway, so why shouldn't it store them in a
way that userspace can access on demand? A /proc file in resolv.conf
format is definitely *wrong*, and while I'd argue DNS is special enough
to have its attributes exported, is it really too much to have the
kernel provide everything from the last valid message in a partially
parsed format? Applications would then parse the data section for RA
attributes they understand.

-- 
Simon Arlott


Re: [patch 5/7] CAN: Add virtual CAN netdevice driver

2007-06-23 Thread Oliver Hartkopp
Patrick McHardy wrote:
 Oliver Hartkopp wrote:
   
 I was just looking through the mailings regarding your suggested changes
 (e.g. in VLAN, DUMMY and IFB) and none of them currently went into the
 kernel (..)
 


 They are all in the net-2.6.23 tree.
   

Ah, ok - that wasn't on my radar as I missed the mail from Dave to you
at June 13th ...

@Dave: Please consider scheduling the PF_CAN stuff for inclusion into
2.6.23 also. Thx.

Btw. for next week, we'll ...

1. ... wait for Jamal's feedback about skb->iif usage
2. ... move the vcan driver to the new netlink API

So that we can finally go for net-2.6.23 at the end of next week, if
there are no new issues from other reviewers until then.

@Patrick: The changes in dummy.c and ifb.c for the netlink support do
not look very complicated (not even for me ;-)). When these changes are
implemented, how do I create/remove my interfaces? Is there any
userspace tool like 'tc' for that?

Thx & regards,
Oliver




Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread C. Scott Ananian

Rémi and Simon give my responses very eloquently.  Although you could
have yet-another-network-daemon redundantly process RA messages, the
kernel is doing it already and it makes sense to just push this
information to userland using /proc and/or netlink.  Although parsing
RA messages and processing expiry in userland looks barely-possible
now, SeND support is really necessary for long-term IPv6 security, and
duplicating SeND functionality in userland would be a nightmare.
Further, the neighbor discovery protocol involves Router Solicitation
messages which elicit the Router Advertisement reply, and we really
don't want userland sending redundant Router Solicitation messages
around, just because the kernel doesn't want to tell it what Router
Advertisements it received.  I considered storing the *complete*
Router Advertisement messages received and pushing them unparsed to
userland, just to get around the "bogus DNS in the kernel" politics
(hint: it's not a resolver in the kernel, it's just nameserver
addresses being stored).  Does anyone really suggest that this would
be a better solution?

The goal is to push the userland component into glibc, likely through
a NSS resolver plugin.  Current glibc doesn't do any processing to
determine when /etc/resolv.conf has changed, which is a problem for
long-running applications.  Exporting RDNSS-in-RA via netlink messages
(or by poll() on a /proc file as is done for /proc/pid/mounts, which
was suggested in linux-kernel) is an elegant solution that (as Rémi
noted) cleanly handles interface up/down/reconfig, route expiration,
and (eventually) the cryptographic neighbor discovery protocol without
weaving a web of hairs from the kernel to the resolver.
--scott

--
( http://cscott.net/ )


Re: [patch 5/7] CAN: Add virtual CAN netdevice driver

2007-06-23 Thread Patrick McHardy
Oliver Hartkopp wrote:
 @Patrick: The changes in dummy.c and ifb.c for the netlink support do
 not look very complicated (not even for me ;-))


I have a patch to make it even simpler, it basically needs only
the rtnl_link_ops structures initialized with one or two members
for devices like dummy and ifb. Will push once we're through the
patches I sent recently, until then please use the current interface.

 When these changes are
 implemented, how do i create/remove my interfaces? Is there any
 userspace tool like 'tc' for that?


It's ip. I think I've CCed you or one of your colleagues on
the patches, otherwise please check the list. For a device like
yours it only needs the patch implementing general RTM_NEWLINK
support, unless you want to make the loopback parameter
configurable, in which case you would need to add something
like iplink_vlan that parses the parameter.

BTW, in case the loopback device is required for normal
operation it might make sense to create *one* device by
default, but four identical devices seems a bit extreme.



Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
Patrick McHardy [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:
 -- The basic design
 
 There will be a network namespace structure that holds the global
 variables for a network namespace, making those global variables
 per network namespace.
 
 One of those per network namespace global variables will be the
 loopback device.  Which means the network namespace a packet resides
 in can be found simply by examining the network device or the socket
 the packet is traversing.
 
 Either a pointer to this global structure will be passed into
 the functions that need to reference per network namespace variables
 or a structure that is already passed in (such as the network device)
 will be modified to contain a pointer to the network namespace
 structure.


 I believe OpenVZ stores the current namespace somewhere global,
 which avoids passing the namespace around. Couldn't you do this
 as well?

It sucks.  Especially in the corner cases.   Think macvlan
with the real network device in one namespace and the ``vlan''
device in another namespace.

The implementation of a global is also a little questionable.
Last I looked it didn't work on the transmit path at all and
was interesting on the receive path.

Further and fundamentally all a global achieves is removing the need
for the noise patches where you pass the pointer into the various
functions.  For long term maintenance it doesn't help anything.

All of the other changes such as messing with the
initialization/cleanup and changing access to access the per network
namespace data structure, and modifying the code partly along the way
to reject working in other non-default network namespaces that are
truly intrusive we both still have to make.

So except as an implementation detail how we pass the per network
namespace pointer is uninteresting.

Currently I am trying for the least clever most straight forward
implementation I can find, that doesn't give us a regression 
in network stack performance.

So yes if we want to do passing through a magic per cpu global on
the packet receive path now is the time to decide to do that.
Currently I don't see the advantage in doing that so I'm not
suggesting it.

In general if there are any specific objections people have written
complicated code that allows us to avoid those objections, so it
should just be a matter of dusting those patches off.  I would much
rather go with something stupid and simple if people are willing to
merge that however.

 Depending upon the data structure it will either be modified to hold
 a per-entry network namespace pointer, or there will be a separate
 copy per network namespace.  For large global data structures like
 the ipv4 routing cache hash table adding an additional pointer to the
 entries appears the more reasonable solution.


 So the routing cache is shared between all namespaces?

Yes.  Each namespace has its own view, so semantically it's not
shared.  But the initial fan-out of the hash table (2M or something)
isn't something we want to replicate on a per-namespace basis, even
assuming the huge page allocations could happen.

So we just tag the entries and add the network namespace as one more
part of the key when doing hash table lookups.

 --- Performance
 
 In initial measurements the only performance overhead we have been
 able to measure is getting the packet to the network namespace.
 Going through ethernet bridging or routing seems to trigger copies
 of the packet that slow things down.  When packets go directly to
 the network namespace no performance penalty has yet been measured.


 It would be interesting to find out whats triggering these copies.
 Do you have NAT enabled?

I would have to go back and look.  There was a skb_cow call someplace
in the routing path.  Something else with ipfilter, ethernet bridging.
So yes it is probably interesting to dig into.  

So the thread where we dug into this last time to the point of
identifying the problem is here:
https://lists.linux-foundation.org/pipermail/containers/2007-March/004309.html
The problem in the bridging was here:
https://lists.linux-foundation.org/pipermail/containers/2007-March/004336.html
I can't find a good pointer to the bit of discussion that described
the routing.  I just remember it was an skb_cow somewhere in the
routing output path, I believe at the point where we write in the new
destination IP.  I haven't a clue why the copy was triggering.

Design-wise the interesting bit was that nothing was measurable when the
network device was in the network namespace.  So adding an extra
pointer parameter to functions and dereferencing the pointer has not
measurably affected performance at this point.

Eric







Re: [patch 5/7] CAN: Add virtual CAN netdevice driver

2007-06-23 Thread Patrick McHardy
Oliver Hartkopp wrote:
 Patrick McHardy wrote:
 
BTW, in case the loopback device is required for normal
operation it might make sense to create *one* device by
default, but four identical devices seems a bit extreme.

 
 As i wrote before CAN addressing consists of CAN-Identifiers and the
 used interface. The use of four vcan's is definitely a usual case!


It should create as many devices as necessary to operate (similar
to the loopback device) by default. Optional interfaces that are
used for addressing reasons should be manually added by the user
as needed. And it should not use module parameters for that please.


Re: [PATCH] e1000: Work around 82571 completion timeout on Pseries HW

2007-06-23 Thread Kok, Auke

Christoph Hellwig wrote:

On Thu, May 17, 2007 at 09:58:03AM -0500, Wen Xiong wrote:
It really shouldn't be there at all because something in either the 

intel

or pseries hardware is totally buggy and we should disable features in
the buggy one completely.

Hi,

There is no hardware issue on either Intel or PPC.  The patch works
around a loophole in an early version of the PCI-SIG spec.
The later PCI-SIG spec has corrected it.  We can implement it
for PPC only.  Other vendors may have the same issue.


In this case we should add a blacklist for implementations of the old
spec.  There should be a way to find specific bridges in the OF firmware
tree on powerpc and similar things on other platforms aswell.


Yes, this is almost what we did.

IBM is currently testing my patches that implement a generic PCI quirk that will
be enabled only for selected root complex IDs that require the (1.0a spec)
device to disable the completion timeouts.


They are currently validating this test on the affected hardware. I expect to 
get the results within a week and then I will post the patch.


Since this is one of the few holes in between the two specs (where manual 
intervention is needed) I think that a single quirk is a fairly sane approach.


Cheers,

Auke


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
Stephen Hemminger [EMAIL PROTECTED] writes:

 On Sat, 23 Jun 2007 08:20:40 -0700
 Ben Greear [EMAIL PROTECTED] wrote:

 Patrick McHardy wrote:
  Eric W. Biederman wrote:

  -- The basic design
 
  There will be a network namespace structure that holds the global
  variables for a network namespace, making those global variables
  per network namespace.
 
  One of those per network namespace global variables will be the
  loopback device.  Which means the network namespace a packet resides
  in can be found simply by examining the network device or the socket
  the packet is traversing.
 
  Either a pointer to this global structure will be passed into
  the functions that need to reference per network namespace variables
  or a structure that is already passed in (such as the network device)
  will be modified to contain a pointer to the network namespace
  structure.
  
 
 
  I believe OpenVZ stores the current namespace somewhere global,
  which avoids passing the namespace around. Couldn't you do this
  as well?

 Maybe the current namespace should be attached to something else
 like sysfs root? Having multiple namespace indirection possibilities
 leads to interesting cases where current namespace is not correctly
 associated with current sysfs tree or current proc tree, ...

Yes.  There are some oddities there.  

In my current tree there is code that makes proc and sysfs match the inspecting
process.

I haven't quite solved the inspection problem where we want to look at
the namespace of a different process.  But as long as we have clean
code to do the basics that isn't a big leap when we come to it.

I'm not really seeing any problems along this line at this point.

The big problem at this point is code review and merging, and in
particular breaking this work up into small enough pieces that they
can be digested, successfully code reviewed and merged.

Eric


Re: [PATCH] ps3: gigabit ethernet driver for PS3, take2

2007-06-23 Thread Geoff Levand
MOKUNO Masakazu wrote:
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -2920,6 +2920,12 @@ M: [EMAIL PROTECTED]
  L:   [EMAIL PROTECTED]
  S:   Maintained
  
 +PS3 NETWORK SUPPORT
 +P:   Masakazu Mokuno
 +M:   [EMAIL PROTECTED]
 +L:   netdev@vger.kernel.org


I think you should put [EMAIL PROTECTED] for
the mail list.  Users will get better support
and I will be able to keep track of the inquiries.

All PS3 developers monitor [EMAIL PROTECTED],
but few if any monitor [EMAIL PROTECTED]

-Geoff




Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread David Stevens
Rémi Denis-Courmont [EMAIL PROTECTED] wrote on 06/23/2007 09:51:55 AM:

 How do I authenticate SeND RA? How do I deal with the link going down 
 before the expiration? How do I know this interface is doing autoconf 
 at all?

The kernel should do the authentication, as it will for other RA's,
and should not deliver (IMAO) unauthenticated packets. If it is, I would
consider that a bug (for all cases, not just this), and that would be a
good thing to fix. :-)
An interface going down doesn't directly invalidate a DNS server
address, though it may not be the best through another interface. Since
it is a list, I think doing nothing for this case wouldn't be terrible.
This is no worse than the existing resolver code. But if you really need it,
you can monitor netlink, or poll the interface flags on whatever interval
you require for detection.
As for autoconf, that's available from sysctl, I assume from /proc
somewhere, too. That usually doesn't change, but if you want to account
for runtime configuration changes, you can always monitor netlink and
reread when new addresses appear, too.

There certainly may be complications I haven't thought of, since
I haven't implemented it. But I still don't see a good case for using the
kernel as a DNS database.

+-DLS



Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
Ben Greear [EMAIL PROTECTED] writes:

 Patrick McHardy wrote:
 Eric W. Biederman wrote:

 -- The basic design

 There will be a network namespace structure that holds the global
 variables for a network namespace, making those global variables
 per network namespace.

 One of those per network namespace global variables will be the
 loopback device.  Which means the network namespace a packet resides
 in can be found simply by examining the network device or the socket
 the packet is traversing.

 Either a pointer to this global structure will be passed into
 the functions that need to reference per network namespace variables
 or a structure that is already passed in (such as the network device)
 will be modified to contain a pointer to the network namespace
 structure.



 I believe OpenVZ stores the current namespace somewhere global,
 which avoids passing the namespace around. Couldn't you do this
 as well?

 Will we be able to have a single application be in multiple name-spaces?

A single application certainly.   But then an application can be composed
of multiple processes which can be composed of multiple threads.

In my current patches a single task_struct belongs to a single network
namespace.  That namespace is used when creating sockets.  The sockets
themselves have a namespace tag and that is used when transmitting
packets, or otherwise operating on the socket.

So if you pass a socket from one process to another you can have
sockets that belong to different network namespaces in a single task.

Eric


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Stephen Hemminger
On Sat, 23 Jun 2007 08:20:40 -0700
Ben Greear [EMAIL PROTECTED] wrote:

 Patrick McHardy wrote:
  Eric W. Biederman wrote:

  -- The basic design
 
  There will be a network namespace structure that holds the global
  variables for a network namespace, making those global variables
  per network namespace.
 
  One of those per network namespace global variables will be the
  loopback device.  Which means the network namespace a packet resides
  in can be found simply by examining the network device or the socket
  the packet is traversing.
 
  Either a pointer to this global structure will be passed into
  the functions that need to reference per network namespace variables
  or a structure that is already passed in (such as the network device)
  will be modified to contain a pointer to the network namespace
  structure.
  
 
 
  I believe OpenVZ stores the current namespace somewhere global,
  which avoids passing the namespace around. Couldn't you do this
  as well?

Maybe the current namespace should be attached to something else
like sysfs root? Having multiple namespace indirection possibilities
leads to interesting cases where current namespace is not correctly
associated with current sysfs tree or current proc tree, ...


 Will we be able to have a single application be in multiple name-spaces?

That would break the whole point of namespaces...



[PATCH][Resend] TIPC: Fix infinite loop in netlink handler

2007-06-23 Thread Florian Westphal
From: Florian Westphal [EMAIL PROTECTED]

The tipc netlink config handler uses the nlmsg_pid from the
request header as destination for its reply. If the application
initialized nlmsg_pid to 0, the reply is looped back to the kernel,
causing hangup. Fix: use nlmsg_pid of the skb that triggered the
request.

Signed-off-by: Florian Westphal [EMAIL PROTECTED]
---
I already sent this to netdev@ on the 19th, but the patch itself
was neither acked nor Nacked. This is a crash that can be
triggered trivially -- please fix this bug.

 net/tipc/netlink.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/tipc/netlink.c b/net/tipc/netlink.c
index 4cdafa2..6a7f7b4 100644
--- a/net/tipc/netlink.c
+++ b/net/tipc/netlink.c
@@ -60,7 +60,7 @@ static int handle_cmd(struct sk_buff *skb, struct genl_info *info)
 		rep_nlh = nlmsg_hdr(rep_buf);
 		memcpy(rep_nlh, req_nlh, hdr_space);
 		rep_nlh->nlmsg_len = rep_buf->len;
-		genlmsg_unicast(rep_buf, req_nlh->nlmsg_pid);
+		genlmsg_unicast(rep_buf, NETLINK_CB(skb).pid);
 	}
 
 	return 0;


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Patrick McHardy
Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
I believe OpenVZ stores the current namespace somewhere global,
which avoids passing the namespace around. Couldn't you do this
as well?
 
 
 It sucks.  Especially in the corner cases.   Think macvlan
 with the real network device in one namespace and the ``vlan''
 device in another namespace.
 
 The implementation of a global is also a little questionable.
 Last I looked it didn't work on the transmit path at all and
 was interesting on the receive path.
 
 Further and fundamentally all a global achieves is removing the need
 for the noise patches where you pass the pointer into the various
 functions.  For long term maintenance it doesn't help anything.
 
 All of the other changes such as messing with the
 initialization/cleanup and changing access to access the per network
 namespace data structure, and modifying the code partly along the way
 to reject working in other non-default network namespaces that are
 truly intrusive we both still have to make.
 
 So except as an implementation detail how we pass the per network
 namespace pointer is uninteresting.
 
 Currently I am trying for the least clever most straight forward
 implementation I can find, that doesn't give us a regression 
 in network stack performance.
 
 So yes if we want to do passing through a magic per cpu global on
 the packet receive path now is the time to decide to do that.
 Currently I don't see the advantage in doing that so I'm not
 suggesting it.


I think your approach is fine and is probably a lot easier
to review than using something global.

Depending upon the data structure it will either be modified to hold
a per-entry network namespace pointer, or there will be a separate
copy per network namespace.  For large global data structures like
the ipv4 routing cache hash table adding an additional pointer to the
entries appears the more reasonable solution.


So the routing cache is shared between all namespaces?
 
 
 Yes.  Each namespace has its own view, so semantically it's not
 shared.  But the initial fan-out of the hash table (2M or something)
 isn't something we want to replicate on a per-namespace basis, even
 assuming the huge page allocations could happen.
 
 So we just tag the entries and add the network namespace as one more
 part of the key when doing hash table lookups.


I can wait for the patches, but I would be interested in how
GC is performed and whether limits can be configured per
namespace.


Re: [PATCH v2.6.22-rc5] cxgb2: handle possible NULL pointer dereferencing, take 2

2007-06-23 Thread Andrew Morton
 On Thu, 21 Jun 2007 18:48:30 +0530 pradeep singh [EMAIL PROTECTED] wrote:
 Hi,
 My mistake.
 Resending after reformatting the patch by hand.
Looks like gmail messes up plain-text patches.
 

That's still mangled so I typed it in again.

Please always include a full changelog with each version of a patch.

I do not know what this patch does - please provide a changelog.  In this
case it should tell us whether and how this null pointer deref is actually
occurring and if so, why.

As well as a full description of the problem which it solves, a changelog
should also describe _how_ it solved it, but that is sufficiently obvious
in this case.


Thanks.


Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread David Stevens
Rémi Denis-Courmont [EMAIL PROTECTED] wrote on 06/23/2007 11:13:01 AM:

An implementation might perform additional validity checks on the
ICMPv6 message content and discard malformed packets.  However, a
portable application must not assume that such validity checks have
been performed.

This doesn't say that unauthenticated packets must be delivered,
and I don't think the portability of an RDNS daemon is an issue. But
even if you really wanted to run the same code on a non-Linux machine,
it just means that your daemon code would have to do its own 
authentication.
Reading /proc or netlink with packet formats you've defined to get this
information is not more portable to non-Linux machines, right?
I don't see any issue here. If an application is relying on the
ability to see forged packets for portability reasons, it's probably not
an application you want running on your machine. :-)

 That would encourage people into running open recursive DNS servers 
 which is widely known and documented as a bad practice. Definitely a 
 very bad idea.
I don't understand your point here. I'm talking about client
behaviour, and if the client fails for a server from a downed interface,
I don't see how that's different from removing the server from the
list, which is what you want to do. Nobody should feel encouraged to
do anything different on the server side-- at least not by me!
 
  But if you really need it, you can monitor netlink, or poll the
  interface flags on whatever interval you require for detection.
 
  As for autoconf, that's available from sysctl, I assume from
  /proc somewhere, too. That usually doesn't change, but if you want to
  account for runtime configuration changes, you can always monitor
  netlink and reread when new addresses appear, too.
 
 There are a bunch of parameters that determine whether an interface 
 accepts RAs or not. I doubt it's wise to try to reimplement that into 
 userspace, particularly if it is subject to change.

I'm not suggesting re-implementing anything; I'm saying you can
read the current state at application level, if you need it. If you
think it's difficult to get the correct information from existing
API's, then improving those API's is always worthwhile. I don't
believe it's excessively difficult to determine if autoconf is in
use, though.

 My point 
 is raw IPv6 sockets are not usable for the time being, and I do not see 
 any way to fix it without modifying the kernel.

I disagree about raw sockets being unusable, but "without modifying the
kernel" isn't a constraint.
modifying the kernel != put DNS server info in the kernel;
if there's a bug, or some minor tweaking that'd help the feature along,
I'd support that. The important point for me is that the basic mechanisms
are already in place, and I think it'd be best to use those rather than
creating a new interface for all of this.

 The userspace DNS configuration daemon might need to be started later 
 than the kernel autoconf - another issue that needs help from the 
 kernel.
Easily done; the init scripts are what bring the interfaces up
in the first place, so start the daemon before those run. Adding an
entry in inittab so it'll be automatically restarted if it dies is also
a reasonable thing. RA's are resent periodically, and they can be lost
anyway, so not the end of the world if you miss one, either.

+-DLS



Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Rémi Denis-Courmont
Le samedi 23 juin 2007, David Stevens a écrit :
 This doesn't say that unauthenticated packets must be
 delivered, and I don't think the portability of an RDNS daemon is an
 issue. But even if you really wanted to run the same code on a
 non-Linux machine, it just means that your daemon code would have to
 do its own authentication.
 Reading /proc or netlink with packet formats you've defined to get
 this information is not more portable to non-Linux machines, right? I
 don't see any issue here. If an application is relying on the ability
 to see forged packets for portability reasons, it's probably not an
 application you want running on your machine. :-)

It so happens that the very userland applications that are currently 
using raw ICMPv6 sockets to see RAs *DO* want to see them all. As far 
as I know, they are all monitoring software (radvdump from radvd, 
rdisc6 from ndisc6, and probably scapy as well) where you do want to 
see problematic packets.

All in all, this would break well-behaved standard-abiding userland 
applications...

  The userspace DNS configuration daemon might need to be started
  later than the kernel autoconf - another issue that needs help from
  the kernel.

 Easily done; the init scripts are what bring the interfaces
 up in the first place, so start the daemon before those run. Adding
 an entry in inittab so it'll be automatically restarted if it dies is
 also a reasonable thing. RA's are resent periodically, and they can
 be lost anyway, so not the end of the world if you miss one, either.

What about NFS root? The network interface will already be up before 
even the real init gets started, let alone the userland RDNSS daemon.

resent periodically... at a default rate of one every 10 minutes! I 
surely hope your desktop boots up faster than that. Besides, some links 
do not have unsolicited advertisements at all (I have seen such a PPPoA 
link for instance). An ugly kludge would be to send a RS from userland, 
but that's not so great considering routers are rate-limiting their 
RAs.

The only way is for the kernel to remember something about the last 
processed RA. That disqualifies raw ICMPv6 sockets.

-- 
Rémi Denis-Courmont
http://www.remlab.net/




Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
Patrick McHardy [EMAIL PROTECTED] writes:

Depending upon the data structure it will either be modified to hold
a per-entry network namespace pointer, or there will be a separate
copy per network namespace.  For large global data structures like
the ipv4 routing cache hash table adding an additional pointer to the
entries appears the more reasonable solution.


So the routing cache is shared between all namespaces?
 
 
 Yes.  Each namespace has its own view, so semantically it's not
 shared.  But the initial fan-out of the hash table (2M or something)
 isn't something we want to replicate on a per-namespace basis, even
 assuming the huge page allocations could happen.
 
 So we just tag the entries and add the network namespace as one more
 part of the key when doing hash table look ups.


 I can wait for the patches, but I would be interested in how
 GC is performed and whether limits can be configured per
 namespace.

Currently I believe the gc code is unmodified in my patches.

Currently I have been focusing on the normal semantics and just
making something work in a mergeable fashion.

Limits and the like are comparatively easy to add in after the
rest is working so I haven't been focusing on that.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Rémi Denis-Courmont
On Saturday 23 June 2007, David Stevens wrote:
 The kernel should do the authentication, as it will for other
 RA's,
 and should not deliver (IMAO) unauthenticated packets. If it is, I
 would consider that a bug (for all cases, not just this), and that
 would be a good thing to fix. :-)

I am all for an interface whereby the kernel queues all accepted RAs 
for userland to process additional parameters... but that's totally 
NOT how ICMPv6 raw sockets currently work, and it would be a very 
significant departure from the Advanced IPv6 Socket API (RFC 3542, in 
particular §3.3):

   An implementation might perform additional validity checks on the
   ICMPv6 message content and discard malformed packets.  However, a
   portable application must not assume that such validity checks have
   been performed.

Being malformed does not include failing authentication, or the local 
host not using autoconf. I am all for a setsockopt() that limits 
delivery to accepted RAs, but it does not currently exist.

 An interface going down doesn't directly invalidate a DNS
 server address, though it may not be the best through another
 interface. Since it is a list, I think doing nothing for this case
 wouldn't be terrible. This is no worse than the existing resolver
 code.

That would encourage people into running open recursive DNS servers 
which is widely known and documented as a bad practice. Definitely a 
very bad idea.

 But if you really need it, you can monitor netlink, or poll the
 interface flags on whatever interval you require for detection.

 As for autoconf, that's available from sysctl, I assume from
 /proc somewhere, too. That usually doesn't change, but if you want to
 account for runtime configuration changes, you can always monitor
 netlink and reread when new addresses appear, too.

There are a bunch of parameters that determine whether an interface 
accepts RAs or not. I doubt it's wise to try to reimplement that in 
userspace, particularly if it is subject to change.

 There certainly may be complications I haven't thought of,
 since I haven't implemented it. But I still don't see a good case for
 using the kernel as a DNS database.

I never said the kernel needed to parse DNS messages by itself. My point 
is raw IPv6 sockets are not usable for the time being, and I do not see 
any way to fix this without modifying the kernel.

The userspace DNS configuration daemon might need to be started later 
than the kernel autoconf - another issue that needs help from the 
kernel.

-- 
Rémi Denis-Courmont
http://www.remlab.net/


Re: [patch 5/7] CAN: Add virtual CAN netdevice driver

2007-06-23 Thread Oliver Hartkopp
Patrick McHardy wrote:

 BTW, in case the loopback device is required for normal
 operation it might make sense to create *one* device by
 default, but four identical devices seems a bit extreme.

   

As I wrote before, CAN addressing consists of CAN identifiers and the
used interface. The use of four vcans is definitely a usual case!

Oliver




Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Carl-Daniel Hailfinger
On 23.06.2007 19:19, Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
 Eric W. Biederman wrote:
 
 Depending upon the data structure it will either be modified to hold
 a per entry network namespace pointer, or there will be a separate
 copy per network namespace.  For large global data structures like
 the ipv4 routing cache hash table adding an additional pointer to the
 entries appears the more reasonable solution.

 So the routing cache is shared between all namespaces?
 
 Yes.  Each namespace has its own view, so semantically it's not
 shared.  But the initial fan-out of the hash table (2M or something)
 isn't something we want to replicate on a per-namespace basis, even
 assuming the huge page allocations could happen.
 
 So we just tag the entries and add the network namespace as one more
 part of the key when doing hash table look ups.

Can one namespace DoS other namespaces' access to the routing cache?
Two scenarios come to mind:
* provoking hash collisions
* lock contention (sorry, haven't checked whether/how we do locking)

Regards,
Carl-Daniel


Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread Pierre Ynard
On Sat, Jun 23, 2007, David Stevens wrote:
 There certainly may be complications I haven't thought of, since
 I haven't implemented it. But I still don't see a good case for using the
 kernel as a DNS database.

Excuse me for being a bit confused by the approach that you suggest, as
so far it doesn't look very good to me either: I would be glad if you
could clarify some points for the sake of the discussion.

 The kernel should do the authentication, as it will for other
 RA's, and should not deliver (IMAO) unauthenticated packets. If it is,
 I would consider that a bug (for all cases, not just this), and that
 would be a good thing to fix. :-)

You were talking about a raw ICMPv6 socket, right? Though, isn't the
point of a raw socket to be raw? Would it be a not-so-raw raw
socket, dropping a few unwanted packets?

 An interface going down doesn't directly invalidate a DNS
 server address, though it may not be the best through another
 interface. Since it is a list, I think doing nothing for this case
 wouldn't be terrible. This is no worse than the existing resolver
 code. But if you really need it, you can monitor netlink, or poll the
 interface flags on whatever interval you require for detection.
 As for autoconf, that's available from sysctl, I assume from
 /proc somewhere, too. That usually doesn't change, but if you want
 to account for runtime configuration changes, you can always monitor
 netlink and reread when new addresses appear, too.

If I understand well you suggest that in order to do things properly,
the application should keep track of a lot of kernel-related stuff? I
mean, the daemon, as the simple piece of code that you seem to have in
mind, should only care about processing RA options that it receives:
network/RA/configuration/availability concerns are precisely the role
of the kernel, which it is already fulfilling, isn't it? It just looks
naturally workable in the case where the kernel processes these options
first, and then hands them to the daemon.

Also, I think that RAs can be considered as a part of IPv6, right? As
opposed to DHCP that is indeed an applicative protocol, I can't see why
parts of a network protocol should be managed by a (non-networking)
userland application. Saying that it can only be used at application
layer doesn't look like a very good case for having networking packets
handled by userland instead of the kernel, and seems rather selfish from
the OS. Am I expecting too much as a user?

I had the understanding that it was a better design to clearly handle
autoconfiguration in one place, and not to scatter it between kernel
and userland. For some reason, it is done in the kernel: do you mean
that now the kernel should only support partial, half-way handling of
RAs? It may seem a bit awkward as a solution.

To me, it looks much more consistent that, since the kernel already
parses the RA options that it needs, it be in charge of wholly processing
the RA and extracting and exporting all its options. That would indeed be
practical, less error-prone and maybe more efficient than duplicating
all the work in userland. Couldn't it be?

After all, the fact that RDNSS be accepted as an RA option is an
argument to say that it belongs in the kernel, not as DNS, but as an RA
option. As you are saying to Rémi, your intent is to fix or enhance the
existing, generic means of the kernel to provide an accurate access to
these RA options, right? Isn't it just what we all want?

-- 
Pierre Ynard
WTS #51 - No phone
A soul in a body is like a drawing on a sheet of paper.


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Ben Greear

Eric W. Biederman wrote:

Ben Greear [EMAIL PROTECTED] writes:

  

Will we be able to have a single application be in multiple name-spaces?



A single application certainly.   But then an application can be composed
of multiple processes which can be composed of multiple threads.

In my current patches a single task_struct belongs to a single network
namespace.  That namespace is used when creating sockets.  The sockets
themselves have a namespace tag and that is used when transmitting
packets, or otherwise operating on the socket.

So if you pass a socket from one process to another you can have
sockets that belong to different network namespaces in a single task.
  
Any chance it could allow one to use a single threaded, single process 
and do something like

int fd1 = socket(, namespace1);
int fd2 = socket(, namespace2);

Or, maybe a sockopt or similar call to move a socket into a particular 
namespace?


I can certainly see it being useful to allow a default name-space per
process, but it would be nice to also allow explicit assignment of a
socket to a name-space for applications that want to span a large
number of name-spaces.

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Ben Greear

Stephen Hemminger wrote:
  

Will we be able to have a single application be in multiple name-spaces?



That would break the whole point of namespaces...
  
I was hoping that I could open a socket in one name-space and another
in another name-space, and send traffic between them, within a single
application.  This is basically what I can do now with my send-to-self
patch (and, for more clever virtual-routing schemes + NAT, with a
conn-track patch that Patrick cooked up for me).  It seems these
patches I use are not acceptable for merge, so I was hoping name-spaces
might work instead.

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
Carl-Daniel Hailfinger [EMAIL PROTECTED] writes:

 Can one namespace DoS other namespaces' access to the routing cache?
 Two scenarios come to mind:
 * provoking hash collisions
 * lock contention (sorry, haven't checked whether/how we do locking)

My initial expectation is that the protections we have to prevent one user from
performing a DoS on another user generally cover the cases between namespaces as
well.

Further, in general, global caches and global resource management are
more efficient than per-namespace management.

Eric


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
Ben Greear [EMAIL PROTECTED] writes:

 Any chance it could allow one to use a single threaded, single process and do
 something like
 int fd1 = socket(, namespace1);
 int fd2 = socket(, namespace2);

 Or, maybe a sockopt or similar call to move a socket into a particular
 namespace?

 I can certainly see it being useful to allow a default name-space per process,
 but it would be nice
 to also allow explicit assignment of a socket to a name-space for applications
 that want to span
 a large number of name-spaces.

That isn't the primary use case so I have not considered it much.
A setsockopt call might be possible.

It is also possible to have a bunch of children opening sockets for you
and passing them to the process that wants to do the work. If you have a
sufficiently slow socket creation rate that will not be a problem just
a little cumbersome.

If you can open all of your sockets upfront it is possible to do
something where you open your sockets then unshare your network
namespace and repeat.

I am committed to making general infrastructure not something that is
targeted in a brittle way at only one scenario.

So it may be that we can cover your scenario.  However it is just
enough off of the beaten path that I'm not going to worry about it
the first time through.  It looks like it is a very small step from
where I am at to where you want to be.  So you may be able to cook
up something that will satisfy your requirements relatively easily.

Eric


Re: [RFD] First draft of RDNSS-in-RA support for IPv6 DNS autoconfiguration

2007-06-23 Thread David Miller
From: Michael Buesch [EMAIL PROTECTED]
Date: Sat, 23 Jun 2007 11:07:14 +0200

 Yeah, it might work. But I think the compiler doesn't guarantee
 you anything about it.

The compiler actually does guarantee these things, and that's why we
have the endian bitfield macros.  You're overreacting; we've
been using this stuff for more than 10 years in the basic
IPv4 header structure, so stop this nonsense.



Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Ben Greear

Eric W. Biederman wrote:

So it may be that we can cover your scenario.  However it is just
enough off of the beaten path that I'm not going to worry about it
the first time through.  It looks like it is a very small step from
where I am at to where you want to be.  So you may be able to cook
up something that will satisfy your requirements relatively easily.
  

That sounds fair to me.

I will assume that as long as you can migrate sockets with the methods you
described, it should not be that difficult to do the same with a sockopt
or similar.

I'll revisit this when your patches are in mainline.

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: [PATCH][Resend] TIPC: Fix infinite loop in netlink handler

2007-06-23 Thread David Miller
From: Florian Westphal [EMAIL PROTECTED]
Date: Sat, 23 Jun 2007 20:25:46 +0200

 From: Florian Westphal [EMAIL PROTECTED]
 
 The tipc netlink config handler uses the nlmsg_pid from the
 request header as destination for its reply. If the application
 initialized nlmsg_pid to 0, the reply is looped back to the kernel,
 causing hangup. Fix: use nlmsg_pid of the skb that triggered the
 request.
 
 Signed-off-by: Florian Westphal [EMAIL PROTECTED]

I have this patch already, I'm just backlogged :-)

Please be patient.


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Benny Amorsen
>>>>> "DM" == David Miller [EMAIL PROTECTED] writes:

DM To be honest I think this form of virtualization is a complete
DM waste of time, even the openvz approach.

You are only considering the security values of OpenVZ. Where I work,
OpenVZ and Linux-vserver are used for their ability to cleanly
separate processes. Security-wise, we could get the same effect just
by running the processes as separate users, but management-wise it is
so much easier to give them a completely separate environment.
OpenVZ's network virtualization enables us to do things which are
completely impossible with both the vanilla kernel and Xen -- e.g.
hundreds of virtual routers, with their own routing daemons.

Policy routing just doesn't cut it; it's cumbersome to set up, limited
to 256 tables, and routing daemons generally can't handle it well, if
at all.


/Benny





[PATCH] NET: Multiple queue hardware support

2007-06-23 Thread PJ Waskiewicz
Please consider these patches for 2.6.23 inclusion.

These patches are built against Patrick McHardy's recently submitted
RTNETLINK nested compat attribute patches.  They're needed to preserve
ABI between sch_{rr|prio} and iproute2.

Updates since the last submission:

1. Added checks for netif_subqueue_stopped() to net/core/netpoll.c,
   net/core/pktgen.c, and to software device hard_start_xmit in
   dev_queue_xmit().

2. Removed TCA_PRIO_TEST and added TCA_PRIO_MQ for sch_prio and sch_rr.

3. Fixed dependency issues in net/sched/Kconfig with NET_SCH_RR.

4. Implemented the new nested compat attribute API for MQ in NET_SCH_PRIO
   and NET_SCH_RR.

5. Allow sch_rr and sch_prio to turn multiqueue hardware support on and off
   at load time.

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.  The requested e1000 changes are still being resolved,
and will be sent at a later date.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz [EMAIL PROTECTED]


[PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-23 Thread PJ Waskiewicz
Updated: Added checks for netif_subqueue_stopped() to netpoll,
pktgen, and software device dev_queue_xmit().  This will ensure
external events to these subsystems will be handled correctly if
a subqueue is shut down.

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
---

 include/linux/etherdevice.h |3 +-
 include/linux/netdevice.h   |   62 ++-
 include/linux/skbuff.h  |4 ++-
 net/core/dev.c  |   27 +--
 net/core/netpoll.c  |8 +++---
 net/core/pktgen.c   |   10 +--
 net/core/skbuff.c   |3 ++
 net/ethernet/eth.c  |9 +++---
 8 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh,
 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7913ee..6509eb4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+   /* Give a control state for each queue.  This struct may contain
+* per-queue locks in the future.
+*/
+   unsigned long   state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -325,6 +333,7 @@ struct net_device
#define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO2048/* Enable software GSO. */
 #define NETIF_F_LLTX   4096/* LockLess TX */
+#define NETIF_F_MULTI_QUEUE16384   /* Has multiple TX/RX queues */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -543,6 +552,10 @@ struct net_device
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
+
+   /* The TX queue control structures */
+   int egress_subqueue_count;
+   struct net_device_subqueue  egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -705,6 +718,48 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ u16 queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+			       &dev->egress_subqueue[queue_index].state))
+   __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -995,8 +1050,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern voidether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-  void (*setup)(struct 

[PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-23 Thread PJ Waskiewicz
Updated: This patch applies on top of Patrick McHardy's RTNETLINK
nested compat attribute patches.  These are required to preserve
ABI for iproute2 when working with the multiqueue qdiscs.

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware
support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
---

 include/linux/pkt_sched.h |4 +-
 net/sched/Kconfig |   30 +
 net/sched/sch_generic.c   |3 +
 net/sched/sch_prio.c  |  106 -
 4 files changed, 129 insertions(+), 14 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 09808b7..ec3a9a5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -103,8 +103,8 @@ struct tc_prio_qopt
 
 enum
 {
-   TCA_PRIO_UNPSEC,
-   TCA_PRIO_TEST,
+   TCA_PRIO_UNSPEC,
+   TCA_PRIO_MQ,
__TCA_PRIO_MAX
 };
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..7f14fa6 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -102,8 +102,16 @@ config NET_SCH_ATM
  To compile this code as a module, choose M here: the
  module will be called sch_atm.
 
+config NET_SCH_BANDS
+	bool "Multi Band Queueing (PRIO and RR)"
+	---help---
+	  Say Y here if you want to use n-band multiqueue packet
+	  schedulers.  These include a priority-based scheduler and
+	  a round-robin scheduler.
+
 config NET_SCH_PRIO
 	tristate "Multi Band Priority Queueing (PRIO)"
+   depends on NET_SCH_BANDS
---help---
  Say Y here if you want to use an n-band priority queue packet
  scheduler.
@@ -111,6 +119,28 @@ config NET_SCH_PRIO
  To compile this code as a module, choose M here: the
  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+   depends on NET_SCH_BANDS
+   select NET_SCH_PRIO
+   ---help---
+ Say Y here if you want to use an n-band round robin packet
+ scheduler.
+
+ The module uses sch_prio for its framework and is aliased as
+ sch_rr, so it will load sch_prio, although it is referred
+ to using sch_rr.
+
+config NET_SCH_BANDS_MQ
+	bool "Multiple hardware queue support"
+   depends on NET_SCH_BANDS
+   ---help---
+ Say Y here if you want to allow the PRIO and RR qdiscs to assign
+ flows to multiple hardware queues on an ethernet device.  This
+ will still work on devices with 1 queue.
+
+ Most people will say N here.
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
---help---
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9461e8a..203d5c4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
 	spin_unlock(&dev->queue_lock);
 
ret = NETDEV_TX_BUSY;
-   if (!netif_queue_stopped(dev))
+	if (!netif_queue_stopped(dev) &&
+	    !netif_subqueue_stopped(dev, skb->queue_mapping))
/* churn baby churn .. */
ret = dev_hard_start_xmit(skb, dev);
 
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 40a13e8..8a716f0 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,11 @@
 struct prio_sched_data
 {
int bands;
+   int curband; /* for round-robin */
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+   unsigned char mq;
 };
 
 
@@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
if (TC_H_MAJ(band))
band = 0;
+	if (q->mq)
+		skb->queue_mapping =
+			q->prio2band[band & TC_PRIO_MAX];
+	else
+		skb->queue_mapping = 0;
 	return q->queues[q->prio2band[band & TC_PRIO_MAX]];
}
band = res.classid;
}
band = TC_H_MIN(band) - 1;
-	if (band >= q->bands)
+	if (band >= q->bands) {
+		if (q->mq)
+			skb->queue_mapping = q->prio2band[0];
+		else
+			skb->queue_mapping = 0;
 		return q->queues[q->prio2band[0]];
+	}
 
+	if (q->mq)
+		skb->queue_mapping = band;
+	else
+		skb->queue_mapping = 0;
 	return q->queues[band];
 }
 
@@ -144,17 +160,57 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-   

[PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation

2007-06-23 Thread PJ Waskiewicz
Add a brief howto to Documentation/networking for multiqueue.  It
explains how to use the multiqueue API in a driver to support
multiqueue paths from the stack, as well as the qdiscs to use for
feeding a multiqueue device.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
---

 Documentation/networking/multiqueue.txt |  106 +++
 1 files changed, 106 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt 
b/Documentation/networking/multiqueue.txt
new file mode 100644
index 000..b7ede56
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,106 @@
+
+   HOWTO for multiqueue network device support
+   ===
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO or RR for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+-
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+---
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+	if ((adapter->hw.mac.type == e1000_82571) ||
+	    (adapter->hw.mac.type == e1000_82572) ||
+	    (adapter->hw.mac.type == e1000_80003es2lan))
+		netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+---
+
+Currently two qdiscs support multiqueue devices.  A new round-robin qdisc,
+sch_rr, and sch_prio. The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb-queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
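
A driver's transmit routine can consume this field directly.  Below is an
illustrative kernel-style sketch (struct my_adapter, struct my_ring,
my_hw_xmit() and my_ring_full() are hypothetical placeholders, not from any
real driver) of how a hard_start_xmit implementation might combine
skb->queue_mapping with the per-queue flow control described in Section 1:

```c
/* Illustrative sketch only -- the my_* names are hypothetical. */
static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct my_adapter *adapter = netdev_priv(dev);
	int queue = skb->queue_mapping;        /* chosen by sch_prio/sch_rr */
	struct my_ring *ring = &adapter->tx_ring[queue];

	my_hw_xmit(ring, skb);                 /* post skb to this Tx ring */

	/* Flow-control only this subqueue; the other queues keep running. */
	if (my_ring_full(ring))
		netif_stop_subqueue(dev, queue);

	return NETDEV_TX_OK;
}
```

The point of the sketch is the last hunk: a full ring stops only its own
subqueue instead of the whole device, as netif_stop_queue() would.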
+
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc.  It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
+built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of
+bands requested is equal to the number of queues on the hardware.  If they
+are equal, it sets up a one-to-one mapping between the queues and bands.  If
+they're not equal, it will not load the qdisc.  This is the same behavior
+for RR.  Once the association is made, any skb that is classified will have
+skb->queue_mapping set, which will allow the driver to properly queue skb's
+to multiple queues.
+
+
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+---
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue
+
+This will create 4 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 4 Tx queues, the band mapping
+would look like:
+
+band 0 = queue 0
+band 1 = queue 1
+band 2 = queue 2
+band 3 = queue 3
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on the TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the normal
+traffic classification, which is band 1.  Therefore pings will be sent out
+queue 1.

[PATCH] iproute2: sch_rr support in tc

2007-06-23 Thread PJ Waskiewicz
Updated: This patch applies on top of Patrick McHardy's RTNETLINK
patches to add nested compat attributes.  This is needed to maintain
ABI for sch_{rr|prio} in the kernel with respect to tc.  A new option,
namely multiqueue, was added to sch_prio and sch_rr.  This will allow
a user to turn multiqueue support on for sch_prio or sch_rr at loadtime.
Also, tc qdisc ls will display whether or not multiqueue is enabled on
that qdisc.

This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
---

 include/linux/pkt_sched.h |    2 +-
 tc/Makefile               |    1 +
 tc/q_prio.c               |   15 ++++++++++-----
 tc/q_rr.c                 |  126 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 138 insertions(+), 6 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index fa0ec53..ec3a9a5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -104,7 +104,7 @@ struct tc_prio_qopt
 enum
 {
TCA_PRIO_UNSPEC,
-   TCA_PRIO_TEST,
+   TCA_PRIO_MQ,
__TCA_PRIO_MAX
 };
 
diff --git a/tc/Makefile b/tc/Makefile
index 9d618ff..62e2697 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ TCMODULES += q_fifo.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
+TCMODULES += q_rr.o
 TCMODULES += q_tbf.o
 TCMODULES += q_cbq.o
 TCMODULES += f_rsvp.o
diff --git a/tc/q_prio.c b/tc/q_prio.c
index 4934416..b34bc05 100644
--- a/tc/q_prio.c
+++ b/tc/q_prio.c
@@ -29,7 +29,7 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...\n");
+   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...[multiqueue]\n");
 }
 
 #define usage() return(-1)
@@ -41,6 +41,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
int idx = 0;
   struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
struct rtattr *nest;
+   unsigned char mq = 0;
 
   while (argc > 0) {
   if (strcmp(*argv, "bands") == 0) {
@@ -58,6 +59,8 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
return -1;
}
pmap_mode = 1;
+   } else if (strcmp(*argv, "multiqueue") == 0) {
+   mq = 1;
   } else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
@@ -92,7 +95,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
}
 */
nest = addattr_nest_compat(n, 1024, TCA_OPTIONS, opt, sizeof(opt));
-   addattr32(n, 1024, TCA_PRIO_TEST, 123);
+   addattr32(n, 1024, TCA_PRIO_MQ, mq);
addattr_nest_compat_end(n, nest);
return 0;
 }
@@ -106,15 +109,17 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
if (opt == NULL)
return 0;
 
-   if (parse_rtattr_nested_compat(tb, TCA_PRIO_MAX, opt, (void *)qopt, sizeof(*qopt)))
+   if (parse_rtattr_nested_compat(tb, TCA_PRIO_MAX, opt, qopt, sizeof(*qopt)))
return -1;
 
   fprintf(f, "bands %u priomap ", qopt->bands);
   for (i=0; i<=TC_PRIO_MAX; i++)
   fprintf(f, " %d", qopt->priomap[i]);
 
-   if (tb[TCA_PRIO_TEST])
-   fprintf(f, " TCA_PRIO_TEST: %u ", *(__u32 *)RTA_DATA(tb[TCA_PRIO_TEST]));
+   if (tb[TCA_PRIO_MQ])
+   fprintf(f, " multiqueue: %s ",
+   *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ]) ? "on" : "off");
+
return 0;
 }
 
diff --git a/tc/q_rr.c b/tc/q_rr.c
new file mode 100644
index 000..f74f4d5
--- /dev/null
+++ b/tc/q_rr.c
@@ -0,0 +1,126 @@
+/*
+ * q_rr.c  RR.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:PJ Waskiewicz, [EMAIL PROTECTED]
+ * Original Authors:   Alexey Kuznetsov, [EMAIL PROTECTED] (from PRIO)
+ *
+ * Changes:
+ *
+ * Ole Husgaard [EMAIL PROTECTED]: 990513: prio2band map was always reset.
+ * J Hadi Salim [EMAIL PROTECTED]: 990609: priomap fix.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... rr bands NUMBER priomap P1 P2... [multiqueue]\n");
+}
+
+#define usage() return(-1)
+
+static int 

Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
David Miller [EMAIL PROTECTED] writes:

 From: [EMAIL PROTECTED] (Eric W. Biederman)
 Date: Sat, 23 Jun 2007 11:19:34 -0600

 Further and fundamentally all a global achieves is removing the need
 for the noise patches where you pass the pointer into the various
 functions.  For long term maintenance it doesn't help anything.

 I don't accept that we have to add another function argument
 to a bunch of core routines just to support this crap,
 especially since you give no way to turn it off and get
 that function argument slot back.

 To be honest I think this form of virtualization is a complete
 waste of time, even the openvz approach.

 We're protecting the kernel from itself, and that's an endless
 uphill battle that you will never win.  Let's do this kind of
 stuff properly with a real minimal hypervisor, hopefully with
 appropriate hardware level support and good virtualized device
 interfaces, instead of this namespace stuff.

 At least the hypervisor approach you have some chance to fully
 harden in some verifyable and truly protected way, with
 namespaces it's just a pipe dream and everyone who works on
 these namespace approaches knows that very well.

 The only positive thing that came out of this work is the
 great auditing that the openvz folks have done and the bugs
 they have found, but it basically ends right there.

Dave thank you for your candor, it looks like I have finally made
the pieces small enough that we can discuss them.

If you want the argument to compile out.  That is not a problem at all.
I dropped that part from my patch because it makes infrastructure more
complicated and there appeared to be no gain.  However having a type
that you can pass that the compiler can optimize away is not a
problem.  Basically you just make the argument:

typedef struct {} you_can_compile_me_out;  /* when you don't want it. */
typedef void * you_can_compile_me_out; /* when you do want it. */

And gcc will generate no code to pass the argument when you compile
it out.


As far as the hardening goes.  There is definitely a point there,
short of a kernel proof subsystem that sounds correct to me.

There are some other factors that make a different tradeoff interesting.
First hypervisors do not allow global optimizations (because of the
better isolation) so have an inherent performance disadvantage.
Something like a 10x scaling penalty from the figures I have seen.

Even more interesting for me is the possibility of unmodified
application migration.  Where the limiting factor is that
you cannot reliably restore an application because the global
identifiers are not available.

So yes monolithic kernels may have grown so complex that they cannot
be verified and thus you cannot actually keep untrusted users from
doing bad things to each other with any degree of certainty.

However the interesting cases for me are cases where the users are not
aggressively hostile with each other but being stuck with one set of
global identifiers are a problem.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Jeff Garzik

David Miller wrote:

I don't accept that we have to add another function argument
to a bunch of core routines just to support this crap,
especially since you give no way to turn it off and get
that function argument slot back.

To be honest I think this form of virtualization is a complete
waste of time, even the openvz approach.

We're protecting the kernel from itself, and that's an endless
uphill battle that you will never win.  Let's do this kind of
stuff properly with a real minimal hypervisor, hopefully with
appropriate hardware level support and good virtualized device
interfaces, instead of this namespace stuff.


Strongly seconded.  This containerized virtualization approach just 
bloats up the kernel for something that is inherently fragile and IMO 
less secure -- protecting the kernel from itself.


Plenty of other virt approaches don't stir the code like this, while 
simultaneously providing fewer, more-clean entry points for the 
virtualization to occur.


And that's speaking WITHOUT my vendor hat on...

Jeff


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Eric W. Biederman
Jeff Garzik [EMAIL PROTECTED] writes:

 David Miller wrote:
 I don't accept that we have to add another function argument
 to a bunch of core routines just to support this crap,
 especially since you give no way to turn it off and get
 that function argument slot back.

 To be honest I think this form of virtualization is a complete
 waste of time, even the openvz approach.

 We're protecting the kernel from itself, and that's an endless
 uphill battle that you will never win.  Let's do this kind of
 stuff properly with a real minimal hypervisor, hopefully with
 appropriate hardware level support and good virtualized device
 interfaces, instead of this namespace stuff.

 Strongly seconded.  This containerized virtualization approach just bloats up
 the kernel for something that is inherently fragile and IMO less secure --
 protecting the kernel from itself.

 Plenty of other virt approaches don't stir the code like this, while
 simultaneously providing fewer, more-clean entry points for the virtualization
 to occur.

Wrong.  I really don't want to get into a "my virtualization approach is better
than yours".  But this is flat out wrong.

99% of the changes I'm talking about introducing are just:
- variable 
+ ptr->variable

There are more pieces mostly with when we initialize those variables but
that is the essence of the change.

And as opposed to other virtualization approaches so far no one has been
able to measure the overhead.  I suspect there will be a few more cache
line misses somewhere but they haven't shown up yet.

If the only use was strong isolation which Dave complains about I would
concur that the namespace approach is inappropriate.  However there are
a lot other uses.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread Jeff Garzik

Eric W. Biederman wrote:

Jeff Garzik [EMAIL PROTECTED] writes:


David Miller wrote:

I don't accept that we have to add another function argument
to a bunch of core routines just to support this crap,
especially since you give no way to turn it off and get
that function argument slot back.

To be honest I think this form of virtualization is a complete
waste of time, even the openvz approach.

We're protecting the kernel from itself, and that's an endless
uphill battle that you will never win.  Let's do this kind of
stuff properly with a real minimal hypervisor, hopefully with
appropriate hardware level support and good virtualized device
interfaces, instead of this namespace stuff.

Strongly seconded.  This containerized virtualization approach just bloats up
the kernel for something that is inherently fragile and IMO less secure --
protecting the kernel from itself.

Plenty of other virt approaches don't stir the code like this, while
simultaneously providing fewer, more-clean entry points for the virtualization
to occur.


Wrong.  I really don't want to get into a "my virtualization approach is better
than yours".  But this is flat out wrong.



99% of the changes I'm talking about introducing are just:
- variable 
+ ptr->variable


There are more pieces mostly with when we initialize those variables but
that is the essence of the change.


You completely dodged the main objection.  Which is OK if you are 
selling something to marketing departments, but not OK


Containers introduce chroot-jail-like features that give one a false 
sense of security, while still requiring one to poke holes in the 
illusion to get hardware-specific tasks accomplished.


The capable/not-capable model (i.e. superuser / normal user) is _still_ 
being secured locally, even after decades of work and whitepapers and 
audits.


You are drinking Deep Kool-Aid if you think adding containers to the 
myriad kernel subsystems does anything besides increasing fragility, and 
decreasing security.  You are securing in-kernel subsystems against 
other in-kernel subsystems.  superuser/user model made that difficult 
enough... now containers add exponential audit complexity to that.  Who 
is to say that a local root does not also pierce the container model?




And as opposed to other virtualization approaches so far no one has been
able to measure the overhead.  I suspect there will be a few more cache
line misses somewhere but they haven't shown up yet.

If the only use was strong isolation which Dave complains about I would
concur that the namespace approach is inappropriate.  However there are
a lot other uses.


Sure there are uses.  There are uses to putting the X server into the 
kernel, too.  At some point complexity and featuritis has to take a back 
seat to basic sanity.


Jeff


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread David Miller
From: Benny Amorsen [EMAIL PROTECTED]
Date: 23 Jun 2007 23:22:38 +0200

 Policy routing just doesn't cut it; it's cumbersome to set up, limited
 to 256 tables

False.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NET 00/02]: MACVLAN driver

2007-06-23 Thread Mark Smith
(Apologies for not maintaining thread id, I'm not subscribed.)

 We don't have any clean interfaces by which to do this MAC
 programming, and we do need something for it soon.


Yep, that's been on my long term wish list for a while, as well.

Overall I would like to see a more flexible way of allowing the net 
stack to learn each NIC's RX filter capabilities, and exploiting them. 
Plenty of NICs, even 100Mbps ones, support RX filter management that 
allows scanning for $hw_limit unicast addresses, before having to put 
the hardware into promisc mode.


A thought I had when I discovered this ability in the
Natsemi/NS83815 chip was to use these RX filters for perfect multicast
DA matching until they ran out, and then reverting to the normal
Multicast DA matching mechanisms.

Another alternative use I thought of was to use these filters to filter
out different ethernet protocol types e.g. if an interface is only
going to be processing IPv4 packets, program these filters to only
accept frames with type 0800 for IP and 0806 for ARP, reverting to
non-filtering if there are too many protocol types, as per the way the
interfaces operate today.

I think it could be useful to expose the ability to have the NIC ignore
broadcast packets, or more generally, expose the three categories of
address recognition that NICs seem to allow to be enabled / disabled -
unicast, multicast and broadcast.

If an interface then didn't need to have broadcast reception enabled
e.g. an IPv6 only interface (or Appletalk), then it wouldn't be,
preventing the host from having to process broadcasts it's going to
ignore anyway.

A future common scenario where this ability might be useful would be
LANs with a mix of IPv4 only, IPv4/IPv6 and IPv6-only nodes.

The ability to enable/disable unicast, multicast and broadcast address
recognition individually on a NIC seems to be widespread -  I've found
that the original early to mid 90s Ne2K chip, the NS8390D, the Netgear
FA311/FA312 chip, the NS83815 and the SMC Epic/100 chip all have
specific individual register values for those three types of addresses.

Regards,
Mark.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread David Miller
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Sat, 23 Jun 2007 15:41:16 -0600

 If you want the argument to compile out.  That is not a problem at all.
 I dropped that part from my patch because it makes infrastructure more
 complicated and there appeared to be no gain.  However having a type
 that you can pass that the compiler can optimize away is not a
 problem.  Basically you just make the argument:
 
 typedef struct {} you_can_compile_me_out;  /* when you don't want it. */
 typedef void * you_can_compile_me_out; /* when you do want it. */
 
 And gcc will generate no code to pass the argument when you compile
 it out.

I don't want to have to see or be aware of the types or the
fact that we support namespaces when I work on the networking
code.

This is why I like the security layer in the kernel we have,
I can disable it and it's completely not there.  And I can
be completely ignorant of its existence when I work on the
networking stack.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-23 Thread David Miller
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Sat, 23 Jun 2007 16:56:49 -0600

 If the only use was strong isolation which Dave complains about I would
 concur that the namespace approach is inappropriate.  However there are
 a lot other uses.

By your very admission the only appropriate use case is when users
are not hostile and can be trusted to some extent.

And that by definition makes it not appropriate for a general purpose
operating system like Linux.

Containers are I believe a step backwards, and we're better than that.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html