date:20060810


generic_lock does not appear to be used at all.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: d80211 merge plans


Mohamed Abbas wrote:

David Miller wrote:

I think this is a non-starter until the SMP problems are worked
out.  Is it still SMP challenged?


I have been using the d80211 stack for about a month and have not encountered any SMP
issues. We are currently involving validation engineers to do more
stress tests and will see if any SMP issues come up.



Well, tests are interesting, but I would rather see a real _analysis_ of 
the locking.


Locking is provable, you know...

Jeff



Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Wed, 2006-08-09 at 16:54 -0700, David Miller wrote:
 From: Peter Zijlstra [EMAIL PROTECTED]
 Date: Wed, 09 Aug 2006 15:32:33 +0200

  The idea is to drop all !NFS packets (or even more specific only
  keep those NFS packets that belong to the critical mount), and
  everybody doing critical IO over layered networks like IPSec or
  other tunnel constructs asks for trouble - Just DON'T do that.

 People are doing I/O over IP exactly for its ubiquity and
 flexibility.  It seems a major limitation of the design if you cancel
 out major components of this flexibility.

We're not, that was a bit of my own frustration leaking out; I think 
this whole push to IP based storage is a bit silly. I'm just not going 
to help the admin whose server just hangs because his VPN key expired.

Running critical resources remotely like this is tricky, and every 
hop/layer you put in between increases the risk of something going bad.
The only setup I think even remotely sane is a dedicated network in the
very same room - not unlike FC but cheaper (which I think is the whole
push behind this, eth is cheap)


Re: [PATCH 4/6] ehea: header files

2006-08-10 Thread Michael Ellerman

Hi Jan-Bernd,

I haven't read all of this, but a few things caught my eye ...

cheers

On Wed, 2006-08-09 at 10:39 +0200, Jan-Bernd Themann wrote:
 Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]
 
 
   drivers/net/ehea/ehea.h    |  452 +
   drivers/net/ehea/ehea_hw.h |  319 +++
   2 files changed, 771 insertions(+)
 
 
 
 --- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea.h 1969-12-31 16:00:00.0 -0800
 +++ kernel/drivers/net/ehea/ehea.h2006-08-08 23:59:39.927452928 -0700
 @@ -0,0 +1,452 @@
 +/*
 + *  linux/drivers/net/ehea/ehea.h
 + *
 + *  eHEA ethernet device driver for IBM eServer System p
 + *
 + *  (C) Copyright IBM Corp. 2006
 + *
 + *  Authors:
 + *   Christoph Raisch [EMAIL PROTECTED]
 + *   Jan-Bernd Themann [EMAIL PROTECTED]
 + *   Heiko-Joerg Schick [EMAIL PROTECTED]
 + *   Thomas Klein [EMAIL PROTECTED]
 + *
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License as published by
 + * the Free Software Foundation; either version 2, or (at your option)
 + * any later version.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.   See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program; if not, write to the Free Software
 + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 + */
 +
 +#ifndef __EHEA_H__
 +#define __EHEA_H__
 +
 +#include <linux/version.h>
 +#include <linux/module.h>
 +#include <linux/moduleparam.h>
 +#include <linux/kernel.h>
 +#include <linux/vmalloc.h>
 +#include <linux/mm.h>
 +#include <linux/slab.h>
 +#include <linux/sched.h>
 +#include <linux/err.h>
 +#include <linux/list.h>
 +#include <linux/netdevice.h>
 +#include <linux/etherdevice.h>
 +#include <linux/kthread.h>
 +#include <linux/ethtool.h>
 +#include <linux/if_vlan.h>
 +#include <asm/ibmebus.h>
 +#include <asm/of_device.h>
 +#include <asm/abs_addr.h>
 +#include <asm/semaphore.h>
 +#include <asm/current.h>
 +#include <asm/io.h>
 +
 +#define EHEA_DRIVER_NAME     "IBM eHEA"
 +#define EHEA_DRIVER_VERSION  "EHEA_0015"
 +
 +#define NET_IP_ALIGN 0
 +#define EHEA_NUM_TX_QP 1
 +#ifdef EHEA_SMALL_QUEUES
 +#define EHEA_MAX_CQE_COUNT     1020
 +#define EHEA_MAX_ENTRIES_SQ    1020
 +#define EHEA_MAX_ENTRIES_RQ1   4080
 +#define EHEA_MAX_ENTRIES_RQ2   1020
 +#define EHEA_MAX_ENTRIES_RQ3   1020
 +#define EHEA_SWQE_REFILL_TH     100
 +#else
 +#define EHEA_MAX_CQE_COUNT    32000
 +#define EHEA_MAX_ENTRIES_SQ   16000
 +#define EHEA_MAX_ENTRIES_RQ1  32080
 +#define EHEA_MAX_ENTRIES_RQ2   4020
 +#define EHEA_MAX_ENTRIES_RQ3   4020
 +#define EHEA_SWQE_REFILL_TH    1000
 +#endif
 +
 +#define EHEA_MAX_ENTRIES_EQ   20
 +
 +#define EHEA_SG_SQ  2
 +#define EHEA_SG_RQ1 1
 +#define EHEA_SG_RQ2 0
 +#define EHEA_SG_RQ3 0
 +
 +#define EHEA_MAX_PACKET_SIZE 9022 /* for jumbo frame */
 +#define EHEA_RQ2_PKT_SIZE    1522
 +#define EHEA_LL_PKT_SIZE      256
 +
 +/* Send completion signaling */
 +#define EHEA_SIG_IV 1000
 +#define EHEA_SIG_IV_LONG 4
 +
 +/* Protection Domain Identifier */
 +#define EHEA_PD_ID 0xaabcdeff
 +
 +#define EHEA_RQ2_THRESHOLD 1
 +/* use RQ3 threshold of 1522 bytes */
 +#define EHEA_RQ3_THRESHOLD 9
 +
 +#define EHEA_SPEED_10G 1
 +#define EHEA_SPEED_1G   1000
 +#define EHEA_SPEED_100M  100
 +#define EHEA_SPEED_10M    10
 +
 +/* Broadcast/Multicast registration types */
 +#define EHEA_BCMC_SCOPE_ALL  0x08
 +#define EHEA_BCMC_SCOPE_SINGLE   0x00
 +#define EHEA_BCMC_MULTICAST  0x04
 +#define EHEA_BCMC_BROADCAST  0x00
 +#define EHEA_BCMC_UNTAGGED   0x02
 +#define EHEA_BCMC_TAGGED 0x00
 +#define EHEA_BCMC_VLANID_ALL 0x01
 +#define EHEA_BCMC_VLANID_SINGLE  0x00
 +
 +/* Use this define to kmallocate PHYP control blocks */
 +#define H_CB_ALIGNMENT   4096
 +
 +#define EHEA_PAGESHIFT  12
 +#define EHEA_PAGESIZE   4096UL
 +#define EHEA_CACHE_LINE 128

This looks like a very bad idea, what happens if you're running on a
machine with 64K pages?
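
One way to read the concern: the 4096 constants may describe the adapter's page size, which need not match the kernel's. A minimal userspace sketch, assuming the hardware really does work in 4 KiB units (HW_PAGE_SHIFT, hw_pages_for() and hw_pages_per_system_page() are illustrative names, not part of the patch), that derives the ratio at run time instead of assuming the two sizes are equal:

```c
#include <assert.h>
#include <unistd.h>

/* Assumed fixed page size of the device, independent of the OS page size. */
#define HW_PAGE_SHIFT 12
#define HW_PAGE_SIZE  (1UL << HW_PAGE_SHIFT)

/* Number of device pages needed to cover `bytes`, rounding up. */
unsigned long hw_pages_for(unsigned long bytes)
{
	return (bytes + HW_PAGE_SIZE - 1) >> HW_PAGE_SHIFT;
}

/* Device pages per OS page: 1 with 4 KiB OS pages, 16 with 64 KiB ones. */
unsigned long hw_pages_per_system_page(void)
{
	long sys = sysconf(_SC_PAGESIZE);

	assert(sys > 0);
	return hw_pages_for((unsigned long)sys);
}
```

Keeping the device constant under its own name makes it obvious at each use whether a size is counted in device pages or in kernel pages.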

 +
 +#define EHEA_ENABLE  1
 +#define EHEA_DISABLE 0

Do you really need hash defines for 0 and 1 ? They're fairly well
understood in C as meaning true and false.

 +
 +/* Memory Regions */
 +#define EHEA_MR_MAX_TX_PAGES 20
 +#define EHEA_MR_TX_DATA_PN 3
 +#define EHEA_MR_ACC_CTRL 0x0080
 +#define EHEA_RWQES_PER_MR_RQ2 10
 +#define EHEA_RWQES_PER_MR_RQ3 10
 +
 +
 +void ehea_set_ethtool_ops(struct net_device *netdev);
 +
 +#ifndef KEEP_EDEBS_BELOW
 +#define KEEP_EDEBS_BELOW 8
 +#endif
 +
 +extern int ehea_trace_level;
 +
 +#ifdef EHEA_NO_EDEB
 +#define EDEB_P_GENERIC(level, idstring, format, args...) \
 + while (0 == 1) { \
 + if(unlikely (level <= ehea_trace_level)) { \
 +

Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 16:58 -0700, David Miller wrote:
 From: Peter Zijlstra [EMAIL PROTECTED]
 Date: Wed, 09 Aug 2006 16:07:20 +0200

  Hmm, what does sk_buff::input_dev do? That seems to store the initial
  device?

 You can run grep on the tree just as easily as I can which is what I
 did to answer this question.  It only takes a few seconds of your
 time to grep the source tree for things like skb->input_dev, so
 would you please do that before asking more questions like this?

That is exactly what I did, but I wanted a bit of confirmation. Sorry if it
offends you, but I'm a bit new to this network thing.

 It does store the initial device, but as Thomas tried so hard to
 explain to you guys these device pointers in the skb are transient and
 you cannot refer to them outside of packet receive processing.

Yes, I understood that after Thomas' last mail.

 The reason is that there is no refcounting performed on these devices
 when they are attached to the skb, for performance reasons, and thus
 the device can be downed, the module for it removed, etc. long before
 the skb is freed up.

I understood that, thanks.


Re: [take6 1/3] kevent: Core files.

From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 10:14:33 +0400

 On Wed, Aug 09, 2006 at 03:21:27PM -0700, Andrew Morton ([EMAIL PROTECTED]) 
 wrote:
  On big-endian machines, this pointer will appear to be word-swapped as far
  as a 64-bit kernel is concerned.  Or something.

  IOW: What's going on here??

 It is user data - I put a union there just to simplify userspace, so it
 should not require any typecasting.

And this is consistent with the similar mechanism we use for
netlink socket dumping, so that we don't have compat layer
crap just because we provide a place for the user to store
his pointer or whatever there.

   + k->kevent_entry.next = LIST_POISON1;
   + k->storage_entry.prev = LIST_POISON2;
   + k->ready_entry.next = LIST_POISON1;

  Nope ;)

 I use pointer checks to determine if entry is in the list or not, why it
 is frowned upon here?

As Andrew mentioned in another posting, these poison macros
are likely to simply go away some day, so you should not use
them.

If you want pointer encoded tags you use internally, define your own.

Re: [take6 1/3] kevent: Core files.

On Wed, Aug 09, 2006 at 11:42:35PM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
+   k->kevent_entry.next = LIST_POISON1;
+   k->storage_entry.prev = LIST_POISON2;
+   k->ready_entry.next = LIST_POISON1;
   
   Nope ;)
  
  I use pointer checks to determine if entry is in the list or not, why it
  is frowned upon here?
 
 As Andrew mentioned in another posting, these poison macros
 are likely to simply go away some day, so you should not use
 them.

They have existed for ages and can suddenly go away?..
 
 If you want pointer encoded tags you use internally, define your own.

I think if I add code like this
list_del(&k->entry);
k->entry.prev = KEVENT_POISON1;
k->entry.next = KEVENT_POISON2;

I will be advised to give myself a lobotomy.

I have enough space in flags in each kevent, so I will use some bits there.
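
A minimal sketch of what tracking list membership in flag bits could look like. The bit names, the struct and the helpers are hypothetical and only illustrate the idea; the real struct kevent layout is not shown in this message.

```c
/* Hypothetical flag bits; the names are illustrative, not from the patch. */
#define KEVENT_ON_STORAGE (1u << 0)
#define KEVENT_ON_READY   (1u << 1)

/* Stand-in for struct kevent; only the flags word matters here. */
struct kevent_sketch {
	unsigned int flags;
};

/* Record that the entry was linked into the list identified by `bit`. */
void mark_queued(struct kevent_sketch *k, unsigned int bit)
{
	k->flags |= bit;
}

/* Record that the entry was unlinked from the list identified by `bit`. */
void mark_removed(struct kevent_sketch *k, unsigned int bit)
{
	k->flags &= ~bit;
}

/* Non-zero if the entry is currently marked as linked into that list. */
int is_queued(const struct kevent_sketch *k, unsigned int bit)
{
	return (k->flags & bit) != 0;
}
```

For the bits to stay truthful, they would have to be updated under the same lock that protects the corresponding list operation.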

-- 
Evgeniy Polyakov

Re: [take6 1/3] kevent: Core files.

On Thu, 10 Aug 2006 10:14:33 +0400
Evgeniy Polyakov [EMAIL PROTECTED] wrote:

   + union {
   + __u32 user[2]; /* User's data. It is not used, just copied to/from user. */
   + void *ptr;
   + };
   +};
  
  What is this union for?
  
  `ptr' needs a __user tag, does it not?
 
 Not, it is never touched by kernel.

hrm, if you say so.

   +/*
   + * Must be called before event is going to be added into some origin's queue.
   + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
   + * If failed, kevent should not be used or kevent_enqueue() will fail to add
   + * this kevent into origin's queue with setting
   + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
   + */
   +int kevent_init(struct kevent *k)
   +{
   + spin_lock_init(&k->ulock);
   + k->kevent_entry.next = LIST_POISON1;
   + k->storage_entry.prev = LIST_POISON2;
   + k->ready_entry.next = LIST_POISON1;
  
  Nope ;)
 
 I use pointer checks to determine if entry is in the list or not, why it
 is frowned upon here?
 Please do not say about poisoning which takes a lot of cpu cycles to get
 new cachelines and so on - everything in that entry is in the cache,
 since entry was added/deleted/accessed through list walk macro.

poisoning which takes a lot of cpu cycles.  So there ;)

I assure you, that poisoning code might disappear at any time.

If you want to be able to determine whether a list_head has been detached
you can detach it with list_del_init() and then use list_empty() on it.
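
The list_del_init()/list_empty() idiom can be sketched in userspace as follows. This is a simplified stand-in that mirrors the semantics of the kernel's list.h helpers, not the kernel code itself; init_list_head() here plays the role of INIT_LIST_HEAD().

```c
#include <assert.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An initialised head or detached entry points at itself. */
void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert `entry` right after `head`. */
void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink `entry` and point it back at itself, so it tests as empty. */
void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	init_list_head(entry);
}

int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Walk an entry through the "not queued / queued / detached" states. */
void detach_example(void)
{
	struct list_head queue, entry;

	init_list_head(&queue);
	init_list_head(&entry);
	assert(list_empty(&entry));   /* never queued */

	list_add(&entry, &queue);
	assert(!list_empty(&entry));  /* linked into the queue */

	list_del_init(&entry);
	assert(list_empty(&entry));   /* detached again */
	assert(list_empty(&queue));
}
```

This plain version ignores RCU-protected lists, where re-initialising an entry that readers may still be traversing is not safe.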

   +}
   +
   +late_initcall(kevent_sys_init);
  
  Why is it late_initcall?  (A comment is needed)
 
 Why not?

Why?

There must have been some reason for having made this a late_initcall() and
that reason is 100% concealed from the reader of this code.

IOW, it needs a comment.

   +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
   +{
   + unsigned int *idx;
   + 
   + idx = (unsigned int *)u->pring[0];
  
  This is a bit ugly.
 
 I specially use first 4 bytes in the first page to store index there,
 since it must be accessed from userspace and kernelspace.

Sure, but the C language is the preferred way in which we communicate and
calculate pointer offsets.

   + idx[0] = num;
   +}
   +
   +/*
   + * Note that kevents does not exactly fill the page (each ukevent is 40 bytes),
   + * so we reuse 4 bytes at the beginning of the first page to store index.
   + * Take that into account if you want to change size of struct ukevent.
   + */
   +#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))
  
  How about doing
  
  struct ukevent_ring {
  unsigned int index;
  struct ukevent event[0];
  };
  
  and removing all those nasty typecasting and offsetting games?
  
  In fact you can even do
  
  struct ukevent_ring {
  struct ukevent event[(PAGE_SIZE - sizeof(unsigned int)) /
  sizeof(struct ukevent)];
  unsigned int index;
  };
  
  if you're careful ;)
 
 Ring takes more than one page, so it will be 
 struct ukevent_ring_0 and struct ukevent_ring_other.
 Is it really needed?
 Not a big problem, if you do think it is worth it.

Well, I've given a couple of prototype-style suggestions.  Please take a
look, see if all this open-coded offsetting magic can be done by the
compiler in some reliable and readable fashion.  It might not work out, but
I suspect it will.
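
A minimal userspace sketch of the first suggestion, letting the compiler compute the offsets. The 40-byte event size comes from the quoted comment; the 4096-byte page size, the type names and the helper names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* Stand-in for struct ukevent; only its 40-byte size matters here. */
struct ukevent_sketch {
	unsigned char raw[40];
};

/* Layout of one ring page: the index first, the events after it. */
struct ukevent_ring_sketch {
	unsigned int index;
	struct ukevent_sketch event[];
};

/* Events that fit in one page once the index has taken its space. */
#define EVENTS_PER_PAGE \
	((SKETCH_PAGE_SIZE - offsetof(struct ukevent_ring_sketch, event)) / \
	 sizeof(struct ukevent_sketch))

/* Store the ring index without casting the page to unsigned int *. */
void ring_set_index(void *page, unsigned int num)
{
	struct ukevent_ring_sketch *ring = page;

	ring->index = num;
}

/* Address of the i-th event slot in the page. */
struct ukevent_sketch *ring_event(void *page, unsigned int i)
{
	struct ukevent_ring_sketch *ring = page;

	assert(i < EVENTS_PER_PAGE);
	return &ring->event[i];
}
```

If the event size or the index type changes, EVENTS_PER_PAGE follows automatically instead of relying on a comment to remind the next editor.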

   + u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
   + if (!u->pring)
   + return -ENOMEM;
   +
   + for (i=0; i<pnum; ++i) {
   + u->pring[i] = __get_free_page(GFP_KERNEL);
   + if (!u->pring)
  
  bug: this is testing the wrong thing.
 
 How come?

Take a closer look ;)

 __get_free_page() can return 0 if page was not allocated.

And that 0 is copied to u->pring[0], not to u->pring.
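
For illustration, a userspace sketch of the loop with the check on the element that was just allocated, plus unwinding of the pages already obtained. malloc() and free() stand in for __get_free_page() and free_page(); the function names and the page size are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* Allocate `pnum` page-sized buffers; returns the array or NULL on failure. */
void **alloc_page_ring(unsigned int pnum)
{
	void **pring;
	unsigned int i;

	assert(pnum > 0);

	pring = calloc(pnum, sizeof(*pring));
	if (!pring)
		return NULL;

	for (i = 0; i < pnum; ++i) {
		pring[i] = malloc(SKETCH_PAGE_SIZE);
		if (!pring[i])          /* test the element, not the array */
			goto err_free;
	}
	return pring;

err_free:
	while (i-- > 0)                 /* release only what was allocated */
		free(pring[i]);
	free(pring);
	return NULL;
}

/* Release a ring obtained from alloc_page_ring(). */
void free_page_ring(void **pring, unsigned int pnum)
{
	unsigned int i;

	for (i = 0; i < pnum; ++i)
		free(pring[i]);
	free(pring);
}
```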

  The function name is mistyped.

Did you miss an OK?  It needs s/kevnet_user_mmap/kevent_user_mmap/g

  This code doesn't have many comments, does it?  What are we mapping here,
  and why would an application want to map it?
 
 That code awaits comments from the people who requested it.
 It is ring of the ready events, which can be read by userspace instead
 of calling syscall, so syscall just becomes wait until there is a
 place or something like that.

hm.  Well, please fully comment code prior to sending it out for review.  I
do go on about this, but trust me, it makes the review much more effective.

Afaict this mmap function gives a user a free way of getting pinned memory. 
What is the upper bound on the amount of memory which a user can thus
obtain?

   +static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
  
  wonders what this function does
 
 Let me guess... It modifies kevent? :)
 I will add comments.
 
   +{
   + struct kevent *k;
   + unsigned int hash = kevent_user_hash(uk);
   + int err = -ENODEV;
   + unsigned long flags;
   + 
   + spin_lock_irqsave(&u->kevent_lock, flags);

Re: [PATCH 1/6] ehea: interface to network stack

2006-08-10 Thread Michael Ellerman

On Thu, 2006-08-10 at 16:15 +1000, Michael Ellerman wrote:
  +   struct hcp_query_ehea_port_cb_2 *cb2 = NULL;
  +   struct net_device_stats *stats = &port->stats;
  +
  +   EDEB_EN(7, "net_device=%p", dev);
  +
  +   cb2 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
  +   if (!cb2) {
  +   EDEB_ERR(4, "No memory for cb2");
  +   goto get_stat_exit;
 
 You leak cb2 here.
 
  +   }
  +
  +   hret = ehea_h_query_ehea_port(adapter->handle,
  + port->logical_port_id,
  + H_PORT_CB2,
  + H_PORT_CB2_ALL,
  + cb2);
  +
  +   if (hret != H_SUCCESS) {
  +   EDEB_ERR(4, "query_ehea_port failed for cb2");
  +   goto get_stat_exit;

Sorry, here.
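
A userspace sketch of the usual way to avoid this kind of leak, with one exit label that frees the buffer and one that does not. calloc() and free() stand in for kzalloc() and kfree(); query_port(), get_stats_sketch() and live_buffers are hypothetical names used only for this illustration.

```c
#include <stdlib.h>

#define CB_SIZE 4096

int live_buffers;   /* outstanding allocations, tracked for the demo only */

/* Hypothetical stand-in for the firmware query; fails when asked to. */
static int query_port(void *cb, int should_fail)
{
	(void)cb;
	return should_fail ? -1 : 0;
}

int get_stats_sketch(int should_fail)
{
	int ret = -1;
	void *cb2;

	cb2 = calloc(1, CB_SIZE);
	if (!cb2)
		goto out;          /* nothing to free yet */
	live_buffers++;

	if (query_port(cb2, should_fail) != 0)
		goto out_free;     /* the error path must release cb2 too */

	/* ... copy the counters out of cb2 here ... */
	ret = 0;

out_free:
	free(cb2);
	live_buffers--;
out:
	return ret;
}
```

With separate labels, the allocation failure path skips the free while every later failure falls through it.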

cheers

-- 
Michael Ellerman
IBM OzLabs

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person



Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 12:18:44AM -0700, Andrew Morton ([EMAIL PROTECTED]) 
wrote:
+   spin_lock_init(&k->ulock);
+   k->kevent_entry.next = LIST_POISON1;
+   k->storage_entry.prev = LIST_POISON2;
+   k->ready_entry.next = LIST_POISON1;
   
   Nope ;)
  
  I use pointer checks to determine if entry is in the list or not, why it
  is frowned upon here?
  Please do not say about poisoning which takes a lot of cpu cycles to get
  new cachelines and so on - everything in that entry is in the cache,
  since entry was added/deleted/accessed through list walk macro.
 
 poisoning which takes a lot of cpu cycles.  So there ;)
 
 I assure you, that poisoning code might disappear at any time.
 
 If you want to be able to determine whether a list_head has been detached
 you can detach it with list_del_init() and then use list_empty() on it.

I can't due to RCU rules.

+}
+
+late_initcall(kevent_sys_init);
   
   Why is it late_initcall?  (A comment is needed)
  
  Why not?
 
 Why?
 
 There must have been some reason for having made this a late_initcall() and
 that reason is 100% concealed from the reader of this code.

kevent must be initialized before use, and it must happen before
userspace started, so I use late_initcall(), as I said it can be
anything other which is called before userspace.

 IOW, it needs a comment.

Sure.
I'm working right now on fixing all issues mentioned in this thread, and
comments are not on the last place.

+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+   unsigned int *idx;
+   
+   idx = (unsigned int *)u->pring[0];
   
   This is a bit ugly.
  
  I specially use first 4 bytes in the first page to store index there,
  since it must be accessed from userspace and kernelspace.
 
 Sure, but the C language is the preferred way in which we communicate and
 calculate pointer offsets.
 
+   idx[0] = num;
+}
+
+/*
+ * Note that kevents does not exactly fill the page (each ukevent is 40 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store index.
+ * Take that into account if you want to change size of struct ukevent.
+ */
+#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))
   
   How about doing
   
 struct ukevent_ring {
 unsigned int index;
 struct ukevent event[0];
 };
   
   and removing all those nasty typecasting and offsetting games?
   
   In fact you can even do
   
 struct ukevent_ring {
 struct ukevent event[(PAGE_SIZE - sizeof(unsigned int)) /
 sizeof(struct ukevent)];
 unsigned int index;
 };
   
   if you're careful ;)
  
  Ring takes more than one page, so it will be 
  struct ukevent_ring_0 and struct ukevent_ring_other.
  Is it really needed?
  Not a big problem, if you do think it is worth it.
 
 Well, I've given a couple of prototype-style suggestions.  Please take a
 look, see if all this open-coded offsetting magic can be done by the
 compiler in some reliable and readable fashion.  It might not work out, but
 I suspect it will.

I think I will use a structure with an index on each page, since kevents do
not exactly fill a page, and using a per-page counter instead of a global
one can be some kind of (later) optimisation.

+   u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+   if (!u->pring)
+   return -ENOMEM;
+
+   for (i=0; i<pnum; ++i) {
+   u->pring[i] = __get_free_page(GFP_KERNEL);
+   if (!u->pring)
   
   bug: this is testing the wrong thing.
  
  How come?
 
 Take a closer look ;)

[i] My fault :)

  __get_free_page() can return 0 if page was not allocated.
 
 And that 0 is copied to u->pring[0], not to u->pring.
 
   The function name is mistyped.
 
 Did you miss an OK?  It needs s/kevnet_user_mmap/kevent_user_mmap/g

It is already fixed :)

   This code doesn't have many comments, does it?  What are we mapping here,
   and why would an application want to map it?
  
  That code awaits comments from the people who requested it.
  It is ring of the ready events, which can be read by userspace instead
  of calling syscall, so syscall just becomes wait until there is a
  place or something like that.
 
 hm.  Well, please fully comment code prior to sending it out for review.  I
 do go on about this, but trust me, it makes the review much more effective.
 
 Afaict this mmap function gives a user a free way of getting pinned memory. 
 What is the upper bound on the amount of memory which a user can thus
 obtain?

It is limited by the maximum queue length, which is 4k entries right now, so
the maximum number of pages here is 4k*40/page_size, i.e. about 40 pages on
x86.

+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
   
   wonders what this function does
  
  Let me guess... It modifies kevent? :)
  I will add

Re: [take6 1/3] kevent: Core files.

On Thu, 10 Aug 2006 11:50:47 +0400
Evgeniy Polyakov [EMAIL PROTECTED] wrote:

  Afaict this mmap function gives a user a free way of getting pinned memory. 
  What is the upper bound on the amount of memory which a user can thus
  obtain?
 
 it is limited by maximum queue length which is 4k entries right now, so
 maximum number of pages here is 4k*40/page_size, i.e. about 40 pages on
 x86.

Is that per user or per fd?  If the latter that is, with the usual
RLIMIT_NOFILE, 160MBytes.  2GB with 64k pagesize.  Problem ;)
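
The arithmetic behind these figures can be written out as below. The 4k entries and 40 bytes per event come from the thread; the descriptor limit of 1024 is an assumption standing in for "the usual RLIMIT_NOFILE", and both helper names are illustrative.

```c
/*
 * Pages needed for `entries` events of `event_size` bytes when no event
 * straddles a page boundary. Assumes event_size <= page_size.
 */
unsigned long ring_pages(unsigned long entries, unsigned long event_size,
			 unsigned long page_size)
{
	unsigned long per_page = page_size / event_size;

	return (entries + per_page - 1) / per_page;
}

/* Worst-case pinned bytes if each of `nofile` descriptors maps a full ring. */
unsigned long long pinned_bytes(unsigned long pages_per_fd,
				unsigned long page_size,
				unsigned long nofile)
{
	return (unsigned long long)pages_per_fd * page_size * nofile;
}
```

With about 40 pages per fd, pinned_bytes(40, 4096, 1024) is 167772160 bytes, which is the 160MBytes quoted. Keeping 40 pages per fd at a 64k page size gives 2684354560 bytes, the same order as the 2GB figure; if the page count instead follows the 4k*40/page_size formula, ring_pages(4096, 40, 65536) is only 3, so the real bound depends on how the ring is sized.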


Re: [PATCH 2/5] socket: code style cleanup

On Wed, 09 Aug 2006 11:31:40 -0700
Stephen Hemminger [EMAIL PROTECTED] wrote:

 Make socket.c conform to current style:
   * run through Lindent
   * get rid of unneeded casts
   * split assignment and comparison where possible
 

stares at a stream of rejects.  Sighs

 -static ssize_t sock_aio_read(struct kiocb *iocb, char __user *buf,
 -  size_t size, loff_t pos);
 -static ssize_t sock_aio_write(struct kiocb *iocb, const char __user *buf,
 -   size_t size, loff_t pos);
 -static int sock_mmap(struct file *file, struct vm_area_struct * vma);
 +static ssize_t sock_aio_read(struct kiocb *iocb, char __user * buf,
 +  size_t size, loff_t pos);
 +static ssize_t sock_aio_write(struct kiocb *iocb, const char __user * buf,
 +   size_t size, loff_t pos);
 +static int sock_mmap(struct file *file, struct vm_area_struct *vma);

The s/ *buf/ * buf/ is inconsistent, illogical and, IMO, wrong.

goes off to fix the rejects

Re: [RFC] [GIT PATCH] IPv6 Routing / Ndisc Fixes

2006-08-10 Thread YOSHIFUJI Hideaki / 吉藤英明

Hello.

In article [EMAIL PROTECTED] (at Thu, 10 Aug 2006 00:37:14 +0300), Ville 
Nuorvala [EMAIL PROTECTED] says:

  commit e0ad64d5b44179ea1296d737dec23279c72c9636
  Author: YOSHIFUJI Hideaki [EMAIL PROTECTED]
  Date:   Wed Aug 9 17:08:33 2006 +0900
 
  [IPV6] NDISC: Allow redirects from other interfaces if it is not strict.
  
  Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED]
 
  diff --git a/net/ipv6/route.c b/net/ipv6/route.c
  index 4650787..1698fec 100644
  --- a/net/ipv6/route.c
  +++ b/net/ipv6/route.c
  @@ -1322,7 +1322,7 @@ restart:
 continue;
 if (!(rt->rt6i_flags & RTF_GATEWAY))
 continue;
  -  if (fl->oif != rt->rt6i_dev->ifindex)
  +  if ((flags & RT6_F_STRICT) && fl->oif != rt->rt6i_dev->ifindex)
 continue;
 if (!ipv6_addr_equal(&rdfl->gateway, &rt->rt6i_gateway))
 continue;
 
  
  Is this absolutely safe? Doesn't this enable a malicious node on another
  link to make a bogus redirect if it uses same link-local source address
  as the real router on the other link. Keep in mind that the RT6_F_STRICT
  flag is set based on the destination of the original redirected packet
  and doesn't in any way depend on the router or source address.
:

Ah, you're right.  I'll drop this.

As a result of the original lookup (with a possibly ambiguous output interface),
one interface for original output is selected.
Which means, we have a route for the (original) destination through that
interface.

Redirects shall come from that interface.
So, it is enough to lookup routes on that interface.

Thanks.

--yoshfuji

Re: [PATCH 2/5] socket: code style cleanup

From: Andrew Morton [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 01:19:50 -0700

 goes off to fix the rejects

Just pull from net-2.6.19, Stephen's stuff is all merged in
there.

I'll fix the * var stuff, I hate that too :-)

Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 01:02:54AM -0700, Andrew Morton ([EMAIL PROTECTED]) 
wrote:
   Afaict this mmap function gives a user a free way of getting pinned 
   memory. 
   What is the upper bound on the amount of memory which a user can thus
   obtain?
  
  it is limited by maximum queue length which is 4k entries right now, so
  maximum number of pages here is 4k*40/page_size, i.e. about 40 pages on
  x86.
 
 Is that per user or per fd?  If the latter that is, with the usual
 RLIMIT_NOFILE, 160MBytes.  2GB with 64k pagesize.  Problem ;)

Per kevent fd.
I have some ideas about a better mmap ring implementation, which would
dynamically grow its buffer when events are added and reuse the same
place for next events, but there are some nitpicks unresolved yet.
Let's not see there in next releases (no merge of course), until better 
solution is ready. I will change that area when other things are ready.

-- 
Evgeniy Polyakov

Re: [RFC] [GIT PATCH] IPv6 Routing / Ndisc Fixes

2006-08-10 Thread Ville Nuorvala

YOSHIFUJI Hideaki wrote:

 As a result of original lookup (with possible ambiguous output interface),
 one interface for original output is selected.
 Which means, we have a route for the (original) destination through that
 interface.
 
 Redirects shall come from that interface.
 So, it is enough to lookup routes on that interface.

Yes, exactly.

Regards,
Ville

Doubt about locking in ixgb driver

2006-08-10 Thread Mithlesh Thukral

hi all,

The transmit functions of ethernet drivers (dev->hard_start_xmit) are
protected to prevent multiple transmits from executing in parallel. The
general scheme used by most drivers is:
1. Reset the NETIF_F_LLTX flag in dev->features and then use the kernel locking
provided through HARD_TX_LOCK (net/core/dev.c:3417)
OR
2. Use an internal driver lock, generally kept in the adapter structure, to
prevent multiple accesses.

In the ixgb driver (drivers/net/ixgb/), there is a lock in the driver's adapter
structure (adapter->tx_lock). But this lock is released before the
ixgb_xmit_frame() function returns. adapter->tx_ring.next_to_use, which I
suppose is the index of the next element to use in tx_ring, is accessed outside
the area where the lock is held. What prevents a race condition when accessing
adapter->tx_ring.next_to_use?
How are multiple instances of xmit kept from running, or is it fine for
multiple instances of xmit to run?

Regards,
Mithlesh Thukral

Re: Hello, We had some patch need to submit for sundance.c


Jesse Huang wrote:

Dear All:

We have some patches that we need to submit. Could you tell me where to get the
current sundance.c so that I can generate those patch files?

Sorry, I only got this link:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;h=f13b2a195c708fe32d8c53d05988875a51bd52e1;hb=1668b19f75cb949f930814a23b74201ad6f76a53;f=drivers/net/sundance.c


You need to install the git software package, and then check out the 
upstream branch of

git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git

Then provide patches against the drivers/net/sundance.c driver found there.

git software download: http://www.kernel.org/pub/software/scm/git/
git overview: http://git.or.cz/
git tutorial: http://www.kernel.org/pub/software/scm/git/docs/tutorial.html
git man pages: http://www.kernel.org/pub/software/scm/git/docs

Thanks,

Jeff



Re: Hello, We had some patch need to submit for sundance.c

2006-08-10 Thread Jesse Huang

Hi Jeff:

I will use sundance.c in this tree to generate patch files.

Thanks for this information.

Jesse

- Original Message - 
From: Jeff Garzik [EMAIL PROTECTED]
To: Jesse Huang [EMAIL PROTECTED]
Cc: Francois Romieu [EMAIL PROTECTED];
linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Andrew Morton
[EMAIL PROTECTED]
Sent: Thursday, August 10, 2006 7:23 PM
Subject: Re: Hello, We had some patch need to submit for sundance.c


Jesse Huang wrote:
 Dear All:

 We have some patches that we need to submit. Could you tell me where to get the
 current sundance.c so that I can generate those patch files?

 Sorry, I only got this link:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;h=f13b2a195c708fe32d8c53d05988875a51bd52e1;hb=1668b19f75cb949f930814a23b74201ad6f76a53;f=drivers/net/sundance.c

You need to install the git software package, and then check out the
upstream branch of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git

Then provide patches against the drivers/net/sundance.c driver found there.

git software download: http://www.kernel.org/pub/software/scm/git/
git overview: http://git.or.cz/
git tutorial: http://www.kernel.org/pub/software/scm/git/docs/tutorial.html
git man pages: http://www.kernel.org/pub/software/scm/git/docs

Thanks,

Jeff



Possible leak of multicast source filter sctructure

2006-08-10 Thread Michal Ruzicka


Hi all!
It seems to me that there is a leak of struct ip_sf_socklist in the 
ip_mc_drop_socket function (in net/ipv4/igmp.c) which is called on socket 
close.


This patch corrects it:

diff -Naur linux-2.6.17.8.orig/net/ipv4/igmp.c linux-2.6.17.8/net/ipv4/igmp.c

--- linux-2.6.17.8.orig/net/ipv4/igmp.c 2006-08-07 06:18:54.0 +0200
+++ linux-2.6.17.8/net/ipv4/igmp.c 2006-08-10 10:38:04.0 +0200
@@ -2206,9 +2206,10 @@
   (void) ip_mc_leave_src(sk, iml, in_dev);
   ip_mc_dec_group(in_dev, iml->multi.imr_multiaddr.s_addr);
   in_dev_put(in_dev);
-  }
-  sock_kfree_s(sk, iml, sizeof(*iml));
+  } else if (iml->sflist != NULL)
+   sock_kfree_s(sk, iml->sflist, IP_SFLSIZE(iml->sflist->sl_max));

+  sock_kfree_s(sk, iml, sizeof(*iml));
 }
 rtnl_unlock();
}

The leak only happens if there are some multicast source filters set on a
socket which are bound to an interface that does not exist any more, as in
the following scenario:

1. create a temporary interface (say GRE tunnel)
3. join a multicast group and set a source filter on the temporary interface
via MCAST_JOIN_SOURCE_GROUP setsockopt call

4. destroy the temporary interface
5. close the socket

This sequence eventually leads to a call of the ip_mc_drop_socket
function, which fails to free the source filter structure ip_sf_socklist
pointed to from members of the socket's multicast address list. This
structure is normally freed in the ip_mc_leave_src function, but that
function is not called in this scenario because the interface that the
multicast group is joined on does not exist any more.
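
The failure mode described above can be modelled outside the kernel. What
follows is a minimal userspace sketch, not the kernel code: mc_entry,
tracked_alloc(), drop_entry_old(), drop_entry_patched() and run_scenario()
are hypothetical names, and the byte counter only stands in for the
accounting that sock_kfree_s() does on a real socket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Bytes currently allocated through the tracking helpers below. */
static size_t outstanding;

static void *tracked_alloc(size_t n)
{
	void *p = malloc(n);

	if (!p)
		abort();
	outstanding += n;
	return p;
}

static void tracked_free(void *p, size_t n)
{
	outstanding -= n;
	free(p);
}

/* Stand-in for one entry of a socket's multicast list. */
struct mc_entry {
	int    dev_present;  /* non-zero while the interface still exists */
	void  *sflist;       /* source filter list, may be NULL */
	size_t sflist_size;
};

/* The "leave source" step; it is only reached while the device exists. */
static void leave_src(struct mc_entry *e)
{
	if (e->sflist) {
		tracked_free(e->sflist, e->sflist_size);
		e->sflist = NULL;
	}
}

/* Old behaviour: the filter list is released only on the device path. */
void drop_entry_old(struct mc_entry *e)
{
	if (e->dev_present)
		leave_src(e);
	tracked_free(e, sizeof(*e));
}

/* Patched behaviour: also release the filter list when the device is gone. */
void drop_entry_patched(struct mc_entry *e)
{
	if (e->dev_present)
		leave_src(e);
	else if (e->sflist != NULL)
		tracked_free(e->sflist, e->sflist_size);
	tracked_free(e, sizeof(*e));
}

/* Join with a 64-byte filter, optionally destroy the device, then drop.
 * Returns the number of bytes the accounting still considers allocated. */
size_t run_scenario(void (*drop)(struct mc_entry *), int destroy_dev)
{
	struct mc_entry *e;
	void *filter;
	size_t leaked;

	outstanding = 0;
	e = tracked_alloc(sizeof(*e));
	e->dev_present = 1;
	e->sflist_size = 64;
	e->sflist = tracked_alloc(e->sflist_size);
	filter = e->sflist;

	if (destroy_dev)
		e->dev_present = 0;
	drop(e);

	leaked = outstanding;
	if (leaked)
		free(filter); /* keep the demo itself clean */
	return leaked;
}
```

In this model only the combination "device destroyed, old drop path" leaves
bytes outstanding, which matches the scenario listed in the steps.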


Thanks
Michal Ruzicka 


linux-2.6.17.8-mc_sf_leak.patch
Description: Binary data

Re: Possible leak of multicast source filter structure

From: Michal Ruzicka [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 14:07:06 +0200

 This patch corrects it:

Correct or not this patch is corrupted by your email client, turning
tabs into spaces among other things.  This makes your patch unusable.

Please configure your email client to not mangle the text of the patch
in any way and resubmit with your original surrounding description so
that it can be properly reviewed.

If in doubt, always email the patch to yourself as a test and try to
apply that patch as if you were the person who might be integrating
your work.

Thanks a lot.

Re: Possible leak of multicast source filter structure

From: David Miller [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 05:12:41 -0700 (PDT)

 From: Michal Ruzicka [EMAIL PROTECTED]
 Date: Thu, 10 Aug 2006 14:07:06 +0200

  This patch corrects it:

 Correct or not this patch is corrupted by your email client, turning
 tabs into spaces among other things.  This makes your patch unusable.

And yes I do realize you created an attachment before you
bark that back. :-)

Re: [PATCH 4/9] [TULIP] Clean tulip.h so it can be used by winbond-840.c


Grant Grundler wrote:

On Wed, Aug 09, 2006 at 01:33:18AM -0400, Jeff Garzik wrote:
2) nobody (but parisc folks?) knows what CBMA and CBIO mean.  Just use 
MMIO and PIO


CBIO is what's in the public documentation. I just want to make it
easy for anyone who bothers to read the documentation to be sure
they are reading about the right register.


Thanks for clarifying.  Nonetheless, I still prefer 'mmio' and 'pio'
because it's more universal.


Jeff




Re: [PATCH] sky2: phy power problems on 88e805X chips


Stephen Hemminger wrote:

On the 88E805X chipsets (used in laptops), the PHY was not getting powered
out of shutdown properly. The variable reg1 was getting reused incorrectly.
This is probably the cause of the bug.
http://bugzilla.kernel.org/show_bug.cgi?id=6471

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- netdev-2.6.orig/drivers/net/sky2.c  2006-08-09 14:13:36.0 -0700
+++ netdev-2.6/drivers/net/sky2.c   2006-08-09 14:14:07.0 -0700
@@ -233,6 +233,8 @@
 			if (hw->ports > 1)
 				reg1 |= PCI_Y2_PHY2_COMA;
 		}
+		sky2_pci_write32(hw, PCI_DEV_REG1, reg1);
+		udelay(100);
 
 		if (hw->chip_id == CHIP_ID_YUKON_EC_U) {


applied to #upstream-fixes, though I note that the obvious PCI posting 
bug remains.


You cannot be assured that the udelay(100) is truly effective without a 
flushing readl().
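
The PCI posting concern can be illustrated with a toy model. A write to a
device register may sit in a bridge buffer ("posted") for a while; a read
from the same device is the usual way to force it out. The sketch below is
plain userspace C, not driver code, and toy_dev, toy_write32(),
toy_read32() and toy_hw_value() are hypothetical names invented for the
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy device: one register plus a single-entry posted-write buffer. */
struct toy_dev {
	uint32_t reg;          /* value the hardware actually sees */
	uint32_t posted_val;   /* write still sitting in the bridge */
	int      posted_valid;
};

/* A posted write returns immediately; the device has not seen it yet. */
void toy_write32(struct toy_dev *d, uint32_t val)
{
	d->posted_val = val;
	d->posted_valid = 1;
}

/* A read is ordered after earlier writes, so it drains the buffer first. */
uint32_t toy_read32(struct toy_dev *d)
{
	if (d->posted_valid) {
		d->reg = d->posted_val;
		d->posted_valid = 0;
	}
	return d->reg;
}

/* What the hardware holds right now, without generating a bus read. */
uint32_t toy_hw_value(const struct toy_dev *d)
{
	return d->reg;
}
```

In this model a delay started right after toy_write32() would be counting
time during which the device may not have seen the write; calling
toy_read32() first closes that gap. Real bridges drain their buffers on
their own eventually, so the model exaggerates; it only shows why the start
of the delay is not tied to the arrival of the write.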


Jeff



Re: bonding questions: replaying call to set_multicast_list and sending IGMP doing Fail-Over

2006-08-10 Thread Or Gerlitz


Jay Vosburgh wrote:

I haven't studied the effects of having large amounts of
multicast traffic coming in under this situation.

However, I would suspect that the MAC filters found on
sufficiently modern network adapters would drop the incoming multicast
traffic on the backup slaves, as only the active slave in active-backup
mode has its multicast list set.  That information is sent to a slave
when it becomes the active slave; see the call to bond_mc_swap() made by
bond_change_active_slave().


OK, I agree the MAC filter would drop the incoming traffic on the backup
slaves because bond_mc_swap() calls bond_mc_delete() on the slave which
becomes a backup one. But as you have noted there might be some impact
on the switch.


Or.


Re: [RFC] gre: transparent ethernet bridging

2006-08-10 Thread Lennert Buytenhek

On Mon, Aug 07, 2006 at 11:55:14AM +1000, Philip Craig wrote:

  I have one machine at home that appears to be on my employer's network
  via such a tunnel.  I don't use bridging, because I don't need any other
  machine at home to access this tunnel.  I do want bridging, and not proxy
  ARP, because it allows me to run arpwatch, and doesn't require me to
  reconfigure something at the remote end if I, for example, want to add
  another IP address to my home box.
 
 Okay.
 If this is using Linux, do you have a patch that does this already?

I use vtun:

http://vtun.sourceforge.net/

But I would prefer using some in-kernel ethernet tunneling method with
ipsec instead.

[take7 0/1] kevent: generic event handling mechanism.

Hello.

Generic event handling mechanism.

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect that an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes
 * fully removed AIO stuff from patchset

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not
   turned on
 * do not use internal socket structures, use appropriate (exported) wrappers
   instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the same
   function, so it was learned to differentiate between various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in mmaped buffer implementation which would end up in wrong
   index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() to locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use array of callbacks of each type instead of each kevent callback
   initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of
   per-bucket locks
 * do not use kevent_user_ctl structure, instead provide needed arguments as
   syscall parameters
 * various indent cleanups
 * added optimisation, which is aimed to help when a lot of kevents are being
   copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr,
   unsigned int max_nr, unsigned int timeout, void __user *buf,
   unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial
   kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does
   not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]



-- 
Evgeniy Polyakov

[take7 1/1] kevent: core files and timer/poll notifications.

This patch includes core kevent files:
- userspace controlling
- kernelspace interfaces
- initialization
- notification state machines
- timer and poll/select notifications

With this patchset rate of requests per second has achieved 2500 req/sec
while with epoll/kqueue and similar techniques it is about 1600-1800
requests per second on my test hardware and trivial web server.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
.long sys_tee   /* 315 */
.long sys_vmsplice
.long sys_move_pages
+   .long sys_kevent_get_events
+   .long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
.quad sys_tee
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
+   .quad sys_kevent_get_events
+   .quad sys_kevent_ctl
 ia32_syscall_end:  
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range  314
 #define __NR_tee               315
 #define __NR_vmsplice          316
 #define __NR_move_pages        317
+#define __NR_kevent_get_events 318
+#define __NR_kevent_ctl        319
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 320
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice 278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages        279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl        281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..d3ff0cd
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,302 @@
+/*
+ * kevent.h
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+#define KEVENT_REQ_ONESHOT     0x1 /* Process this event only once and then dequeue. */
+
+/*
+ * Kevent return flags.
+ */
+#define KEVENT_RET_BROKEN      0x1 /* Kevent is broken. */
+#define KEVENT_RET_DONE        0x2 /* Kevent processing was finished successfully. */
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET          0
+#define KEVENT_INODE           1
+#define KEVENT_TIMER           2
+#define KEVENT_POLL            3
+#define KEVENT_NAIO            4
+#define KEVENT_AIO             5
+#define KEVENT_MAX             6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED     0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV     0x1
+#define KEVENT_SOCKET_ACCEPT   0x2
+#define KEVENT_SOCKET_SEND     0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE    0x1
+#define KEVENT_INODE_REMOVE    0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN     0x0001
+#define KEVENT_POLL_POLLPRI    0x0002
+#define KEVENT_POLL_POLLOUT    0x0004
+#define KEVENT_POLL_POLLERR    0x0008
+#define KEVENT_POLL_POLLHUP    0x0010
+#define KEVENT_POLL_POLLNVAL   0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define

Re: [take7 1/1] kevent: core files and timer/poll notifications.

On Thu, Aug 10, 2006 at 04:16:38PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 With this patchset rate of requests per second has achieved 2500 req/sec
 while with epoll/kqueue and similar techniques it is about 1600-1800
 requests per second on my test hardware and trivial web server.

Nope, it is old record from archives... Current one is 2600+

-- 
Evgeniy Polyakov

Re: [PATCH 0/6] htb: cleanup

* David Miller [EMAIL PROTECTED] 2006-08-02 15:18
 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 2 Aug 2006 12:56:36 -0700

  The HTB scheduler code is a mess, this patch set does some basic
  house cleaning.  The first four should cause no code change, but the
  last two need more testing.

 These patches look fine to me.  Once everyone thinks they
 are ready just let me know and I'll push them into net-2.6.19

I think they are ready.

[RFC][PATCH] VM deadlock prevention core -v3

Hi,

So I try again, please tell me if I'm still on crack and should go detox.
However if you do so, I kindly request some words on the how and why of it.

so I try to map to net_device in several fashions:

1) netdev_alloc_skb()
  - has an argument with the actual incoming device

2) free_skb_pages()
  - uses: skb->input_dev ?: skb->dev

this device is pinned by virtue of netdev_wait_memalloc(), which will
wait until all skb->memalloc skbuffs are destroyed. This will delay
module unload under severe memory pressure; I think this is acceptable
as one has other problems at that point.

3) inet_sock_destruct(), sk_set_memalloc()
  - both use: ip_dev_find(inet_sk(sk)->rcv_saddr)

if the latter two methods do not yield the same net_device the first has,
weird and wonderful stuff will happen.

Why? Currently the sole purpose is to be able to limit the number of
memalloc skbs per device (and for me to learn some).

I suspect something as simple as a bridge device will destroy this; with
that I suspect (3) will return the bridge device instead of the actual
input device.

So, if I'm still busted, is there any hope for this approach?

If not, I'll have to go do global skb memalloc accounting (largesmp
fanboys need not worry, this will only happen under severe load, at
that point the box will have other issues) and fudge the per-deviceness.

---

The core of the VM deadlock avoidance framework.

From the 'user' side of things it provides a function to mark a 'struct sock'
as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on
the receive side.

From the net_device side of things, the extra 'struct net_device *' argument
to {,__}netdev_alloc_skb() is used to attribute/account the memalloc usage.
When netdev_alloc_skb() finds it cannot allocate a struct sk_buff the regular
way it will grab some memory from the memalloc reserve.

Drivers that have been converted to the netdev_alloc_skb() family will 
automatically receive this feature.

Network paths will drop !SOCK_MEMALLOC packets ASAP when reserve is being used.

Memalloc sk_buff allocations are not done from the SLAB but are done using 
alloc_pages(). sk_buff::memalloc records this exception so that kfree_skbmem()
can do the right thing. NOTE this does not play very nice with skb_clone()
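
The two-stage allocation described above (regular allocator first, then a
bounded dip into the reserve) can be sketched in userspace. This is only a
model under stated assumptions: toy_netdev, toy_alloc() and toy_free() are
hypothetical, the "allocators" are plain counters, and the limit
rx_reserve * memalloc_socks is taken from the quoted patch discussion
rather than from any merged kernel API.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-device accounting, loosely following the description. */
struct toy_netdev {
	int        normal_free;      /* buffers left in the regular allocator */
	int        rx_reserve;       /* reserve buffers granted per memalloc socket */
	int        memalloc_socks;   /* sockets marked as memalloc users */
	atomic_int rx_reserve_used;  /* reserve buffers currently handed out */
};

enum toy_source { TOY_NONE, TOY_NORMAL, TOY_RESERVE };

/* Try the regular allocator first; fall back to the reserve within the limit. */
enum toy_source toy_alloc(struct toy_netdev *dev)
{
	if (dev->normal_free > 0) {
		dev->normal_free--;
		return TOY_NORMAL;
	}

	if (atomic_load(&dev->rx_reserve_used) >=
	    dev->rx_reserve * dev->memalloc_socks)
		return TOY_NONE;

	/* Account for the buffer before handing it out. */
	atomic_fetch_add(&dev->rx_reserve_used, 1);
	return TOY_RESERVE;
}

/* Return a buffer; reserve buffers give their slot back. */
void toy_free(struct toy_netdev *dev, enum toy_source src)
{
	if (src == TOY_RESERVE)
		atomic_fetch_sub(&dev->rx_reserve_used, 1);
	else if (src == TOY_NORMAL)
		dev->normal_free++;
}
```

Note that the limit check and the increment are two separate steps here, so
concurrent callers could briefly overshoot the limit; the sketch does not
try to close that window.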


Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
Signed-off-by: Daniel Phillips [EMAIL PROTECTED]

---
 include/linux/gfp.h       |    3
 include/linux/mmzone.h    |    1
 include/linux/netdevice.h |    7 ++
 include/linux/skbuff.h    |    3
 include/net/sock.h        |    8 ++
 mm/page_alloc.c           |   46 -
 net/core/dev.c            |   97
 net/core/skbuff.c         |  155 +++---
 net/core/sock.c           |   25 +++
 net/ethernet/eth.c        |    1
 net/ipv4/af_inet.c        |    8 ++
 net/ipv4/icmp.c           |    3
 net/ipv4/tcp_ipv4.c       |    3
 net/ipv4/udp.c            |    8 ++
 14 files changed, 355 insertions(+), 13 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_ZERO ((__force gfp_t)0x8000u) /* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x40000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-   __GFP_NOMEMALLOC|__GFP_HARDWALL)
+   __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_MEMALLOC)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand
void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int bytes);
 
 #include <linux/topology.h>
 /* Returns the number of the current Node. */
Index: linux-2.6/include/linux/netdevice.h
===
---

Re: [RFC][PATCH] VM deadlock prevention core -v3

On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) 
wrote:
 Hi,

Hello, Peter.

 So I try again, please tell me if I'm still on crack and should go detox.
 However if you do so, I kindly request some words on the how and why of it.

I think you should talk with doctor in that case, but not with kernel
hackers :)

I have some comments about the implementation, not the overall design,
since our points of view there are somewhat opposed.

...

 --- linux-2.6.orig/net/core/skbuff.c
 +++ linux-2.6/net/core/skbuff.c
 @@ -43,6 +43,7 @@
  #include <linux/kernel.h>
  #include <linux/sched.h>
  #include <linux/mm.h>
 +#include <linux/pagemap.h>
  #include <linux/interrupt.h>
  #include <linux/in.h>
  #include <linux/inet.h>
 @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
   *
   */
  
 +#define ceiling_log2(x)  fls((x) - 1)
 +
  /**
   *   __alloc_skb -   allocate a network buffer
   *   @size: size to allocate
 @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int
   struct sk_buff *skb;
   u8 *data;
  
 + size = SKB_DATA_ALIGN(size);
 +
 + if (gfp_mask & __GFP_MEMALLOC) {
 + /*
 +  * Fallback allocation for memalloc reserves.
 +  *
 +  * the page is populated like so:
 +  *
 +  *   struct sk_buff
 +  *   [ struct sk_buff ]
 +  *   [ atomic_t ]
 +  *   unsigned int
 +  *   struct skb_shared_info
 +  *   char []
 +  *
 +  * We have to do higher order allocations for icky jumbo
 +  * frame drivers :-(. They really should be migrated to
 +  * scather/gather DMA and use skb fragments.
 +  */
 + unsigned int data_offset =
 + sizeof(struct sk_buff) + sizeof(unsigned int);
 + unsigned long length = size + data_offset +
 + sizeof(struct skb_shared_info);
 + unsigned int pages;
 + unsigned int order;
 + struct page *page;
 + void *kaddr;
 +
 + /*
 +  * Force fclone alloc in order to fudge a lacking in skb_clone().
 +  */
 + fclone = 1;
 + if (fclone) {
 + data_offset += sizeof(struct sk_buff) + sizeof(atomic_t);
 + length += sizeof(struct sk_buff) + sizeof(atomic_t);
 + }
 + pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
 + order = ceiling_log2(pages);
 + skb = NULL;
 + if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
 + goto out;
 +
 + kaddr = pfn_to_kaddr(page_to_pfn(page));
 + skb = (struct sk_buff *)kaddr;
 +
 + *((unsigned int *)(kaddr + data_offset -
 + sizeof(unsigned int))) = order;
 + data = (u8 *)(kaddr + data_offset);
 +

Tricky, but since you are using your own allocator here, you could make it
less aggressive, i.e. not round the size up to a whole number of pages.

 + goto allocated;
 + }
 +
   cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
  
   /* Get the HEAD */
 @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int
   goto out;
  
   /* Get the DATA. Size must match skb_add_mtu(). */
 - size = SKB_DATA_ALIGN(size);

Bad sign.

   data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
   if (!data)
   goto nodata;
  
 +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 + unsigned length, gfp_t gfp_mask)
 +{
 + struct sk_buff *skb;
 +
 + WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
 + gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
 +
 + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
 + if (skb)
 + goto done;
 +
 + if (atomic_read(&dev->rx_reserve_used) >=
 + dev->rx_reserve * dev->memalloc_socks)
 + goto out;
 +
 + /*
 +  * pre-inc guards against a race with netdev_wait_memalloc()
 +  */
 + atomic_inc(&dev->rx_reserve_used);
 + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
 + if (unlikely(!skb)) {
 + atomic_dec(&dev->rx_reserve_used);
 + goto out;
 + }

Since you have added an atomic operation in that path, you can use the
device's reference counter instead and not care that it can disappear.

 +done:
 + skb->dev = dev;
 +out:
 + return skb;
 +}
 +
  static void skb_drop_list(struct sk_buff **listp)
  {
   struct sk_buff *list = *listp;
 @@ -313,10 +417,35 @@ static void skb_release_data(struct sk_b
   if (skb_shinfo(skb)->frag_list)
   skb_drop_fraglist(skb);
  
 - kfree(skb->head);
 + if (!skb->memalloc)
 + kfree(skb->head);
 +

Re: [RFC][PATCH] VM deadlock prevention core -v3

On Thu, 2006-08-10 at 18:02 +0400, Evgeniy Polyakov wrote:
 On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) 
 wrote:
  Hi,
 
 Hello, Peter.
 
  So I try again, please tell me if I'm still on crack and should go detox.
  However if you do so, I kindly request some words on the how and why of it.
 
 I think you should talk with doctor in that case, but not with kernel
 hackers :)
 
 I have some comments about implementation, not overall design, since we
 have slightly diametral points of view there.
 
 
 
  --- linux-2.6.orig/net/core/skbuff.c
  +++ linux-2.6/net/core/skbuff.c
  @@ -43,6 +43,7 @@
   #include <linux/kernel.h>
   #include <linux/sched.h>
   #include <linux/mm.h>
  +#include <linux/pagemap.h>
   #include <linux/interrupt.h>
   #include <linux/in.h>
   #include <linux/inet.h>
  @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
*
*/
   
  +#define ceiling_log2(x)fls((x) - 1)
  +
   /**
* __alloc_skb -   allocate a network buffer
* @size: size to allocate
  @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int
  struct sk_buff *skb;
  u8 *data;
   
  +   size = SKB_DATA_ALIGN(size);

I moved it here.

  +
  +   if (gfp_mask & __GFP_MEMALLOC) {
  +   /*
  +* Fallback allocation for memalloc reserves.
  +

 * This allocator is built on alloc_pages() so that freed
 * skbuffs return to the memalloc reserve immediately. SLAB
 * memory might not ever be returned.

This was missing,... 

  +* the page is populated like so:
  +*
  +*   struct sk_buff
  +*   [ struct sk_buff ]
  +*   [ atomic_t ]
  +*   unsigned int
  +*   struct skb_shared_info
  +*   char []
  +*
  +* We have to do higher order allocations for icky jumbo
  +* frame drivers :-(. They really should be migrated to
  +* scather/gather DMA and use skb fragments.
  +*/
  +   unsigned int data_offset =
  +   sizeof(struct sk_buff) + sizeof(unsigned int);
  +   unsigned long length = size + data_offset +
  +   sizeof(struct skb_shared_info);
  +   unsigned int pages;
  +   unsigned int order;
  +   struct page *page;
  +   void *kaddr;
  +
  +   /*
  +* Force fclone alloc in order to fudge a lacking in skb_clone().
  +*/
  +   fclone = 1;
  +   if (fclone) {
  +   data_offset += sizeof(struct sk_buff) + sizeof(atomic_t);
  +   length += sizeof(struct sk_buff) + sizeof(atomic_t);
  +   }
  +   pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
  +   order = ceiling_log2(pages);
  +   skb = NULL;
  +   if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
  +   goto out;
  +
  +   kaddr = pfn_to_kaddr(page_to_pfn(page));
  +   skb = (struct sk_buff *)kaddr;
  +
  +   *((unsigned int *)(kaddr + data_offset -
  +   sizeof(unsigned int))) = order;
  +   data = (u8 *)(kaddr + data_offset);
  +
 
 Tricky, but since you are using own allocator here, you could change it to
 be not so aggressive - i.e. do not round size to number of pages.

I'm not sure I follow you; I'm explicitly using alloc_pages()/free_page().
If I were to go smart here, I'd lose the whole reason for doing so.
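
The rounding under discussion can be worked through with a small userspace
sketch. It mirrors the two lines of the quoted patch that compute pages and
order, but everything else is an assumption: toy_fls() is a portable
stand-in for the kernel's fls(), and a 4 KiB page size is assumed via
TOY_PAGE_SHIFT. The helpers expect length > 0.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Position of the highest set bit, 1-based; 0 when x == 0. */
static int toy_fls(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Smallest order such that (1 << order) pages can hold `length` bytes. */
unsigned int toy_order_for(unsigned long length)
{
	unsigned int pages =
		(unsigned int)((length + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT);

	return (unsigned int)toy_fls(pages - 1);   /* ceiling_log2(pages) */
}

/* Bytes actually taken from the page allocator for that request. */
unsigned long toy_bytes_for(unsigned long length)
{
	return TOY_PAGE_SIZE << toy_order_for(length);
}
```

Under these assumptions a 9000-byte request needs three pages, is rounded
up to order 2, and so occupies 16384 bytes, which is the kind of overhead
the "not so aggressive" remark appears to be about.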

 
  +   goto allocated;
  +   }
  +
  cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
   
  /* Get the HEAD */
  @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int
  goto out;
   
  /* Get the DATA. Size must match skb_add_mtu(). */
  -   size = SKB_DATA_ALIGN(size);
 
 Bad sign.

See above.

  data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
  if (!data)
  goto nodata;
   
  +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
  +   unsigned length, gfp_t gfp_mask)
  +{
  +   struct sk_buff *skb;
  +
  +   WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
  +   gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
  +
  +   skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
  +   if (skb)
  +   goto done;
  +
  +   if (atomic_read(&dev->rx_reserve_used) >=
  +   dev->rx_reserve * dev->memalloc_socks)
  +   goto out;
  +
  +   /*
  +* pre-inc guards against a race with netdev_wait_memalloc()
  +*/
  +   atomic_inc(&dev->rx_reserve_used);
  +   skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
  +   if (unlikely(!skb)) {
  +   atomic_dec(&dev->rx_reserve_used);
  +   goto out;
  +   }
 
 Since you have added an atomic operation in that path, you can use the
 device's reference counter instead and not care that it can disappear.

Re: [PATCH 5/6] ehea: makefile

2006-08-10 Thread Thomas Klein


Sam Ravnborg wrote:

On Wed, Aug 09, 2006 at 10:40:20AM +0200, Jan-Bernd Themann wrote:
  

Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]


 drivers/net/ehea/Makefile |7 +++
 1 file changed, 7 insertions(+)



--- linux-2.6.18-rc4-orig/drivers/net/ehea/Makefile 1969-12-31 16:00:00.0 -0800
+++ kernel/drivers/net/ehea/Makefile 2006-08-08 23:59:38.083467216 -0700
@@ -0,0 +1,7 @@
+#
+# Makefile for the eHEA ethernet device driver for IBM eServer System p
+#
+
+ehea_mod-objs = ehea_main.o ehea_phyp.o ehea_qmr.o ehea_ethtool.o ehea_phyp.o
+obj-$(CONFIG_EHEA) += ehea_mod.o
+



Using -objs is deprecated, please use ehea_mod-y.
This needs to be documented and later warned upon which I will do soon.

Sam
  

Done. Will be included in next patch.


Re: [PATCH 1/6] ehea: interface to network stack

2006-08-10 Thread Jan-Bernd Themann


Hi Michael,

thanks for your very helpful comments so far, we'll provide a patch with
these and other fixes very soon.

See comments below.

Jan-Bernd

Michael Ellerman wrote:
 Hi Jan-Bernd,

 Comments below the code they refer to.

 On Wed, 2006-08-09 at 10:38 +0200, Jan-Bernd Themann wrote:
 Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]

   drivers/net/ehea/ehea_main.c | 2738 +++
   1 file changed, 2738 insertions(+)

 --- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea_main.c 1969-12-31 16:00:00.0 -0800
 +++ kernel/drivers/net/ehea/ehea_main.c 2006-08-08 23:59:39.683357016 -0700
 @@ -0,0 +1,2738 @@
 +/*
 + *  linux/drivers/net/ehea/ehea_main.c

 Putting the file name in the file is fairly redundant IMHO.

 + *  eHEA ethernet device driver for IBM eServer System p

 What's the actual hardware that this is for? System p covers a whole
 range of machines, do they really all support this driver?

 + *  (C) Copyright IBM Corp. 2006
 + *
 + *  Authors:
 + *   Christoph Raisch [EMAIL PROTECTED]
 + *   Jan-Bernd Themann [EMAIL PROTECTED]
 + *   Heiko-Joerg Schick [EMAIL PROTECTED]
 + *   Thomas Klein [EMAIL PROTECTED]
 + *
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License as published by
 + * the Free Software Foundation; either version 2, or (at your option)
 + * any later version.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program; if not, write to the Free Software
 + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 + */
 +
 +#define DEB_PREFIX "main"
 +
 +#include <linux/in.h>
 +#include <linux/ip.h>
 +#include <linux/tcp.h>
 +#include <linux/udp.h>
 +#include <linux/if.h>
 +#include <linux/list.h>
 +#include <net/ip.h>
 +
 +#include "ehea.h"
 +#include "ehea_qmr.h"
 +#include "ehea_phyp.h"
 +
 +
 +MODULE_LICENSE("GPL");
 +MODULE_AUTHOR("Christoph Raisch [EMAIL PROTECTED]");
 +MODULE_DESCRIPTION("IBM eServer HEA Driver");
 +MODULE_VERSION(EHEA_DRIVER_VERSION);
 +
 +static int __devinit ehea_probe(struct ibmebus_dev *dev,
 +  const struct of_device_id *id);
 +static int __devexit ehea_remove(struct ibmebus_dev *dev);
 +static int ehea_sense_port_attr(struct ehea_adapter *adapter, int portnum);

 I haven't looked closely, but can you rearrange the functions so you
 don't need these forward declarations?

yes, rearrangement is possible. Done.


 +int ehea_trace_level = 5;
 +
 +static struct net_device_stats *ehea_get_stats(struct net_device *dev)
 +{
 +  int i;
 +  u64 hret = H_HARDWARE;

 You unconditionally assign to hret below.

 +  u64 rx_packets = 0;

 Why not just update stats->rx_packets directly?

 +  struct ehea_port *port = (struct ehea_port*)dev->priv;
 +  struct ehea_adapter *adapter = port->adapter;

 I don't think you need adapter, you only use it in one place, just
 access it through port->adapter->handle (below).

done


 +  struct hcp_query_ehea_port_cb_2 *cb2 = NULL;
 +  struct net_device_stats *stats = &port->stats;
 +
 +  EDEB_EN(7, "net_device=%p", dev);
 +
 +  cb2 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
 +  if (!cb2) {
 +  EDEB_ERR(4, "No memory for cb2");
 +  goto get_stat_exit;

 You leak cb2 here.


done

 +  }
 +
 +  hret = ehea_h_query_ehea_port(adapter->handle,
 +                                port->logical_port_id,
 +                                H_PORT_CB2,
 +                                H_PORT_CB2_ALL,
 +                                cb2);
 +
 +  if (hret != H_SUCCESS) {
 +          EDEB_ERR(4, "query_ehea_port failed for cb2");
 +          goto get_stat_exit;
 +  }
 +
 +  EDEB_DMP(7, (u8*)cb2,
 +           sizeof(struct hcp_query_ehea_port_cb_2), "After HCALL");
 +
 +  for (i = 0; i < port->num_def_qps; i++) {
 +          rx_packets += port->port_res[i].rx_packets;
 +  }
 +
 +  stats->tx_packets = cb2->txucp + cb2->txmcp + cb2->txbcp;
 +  stats->multicast = cb2->rxmcp;
 +  stats->rx_errors = cb2->rxuerr;
 +  stats->rx_bytes = cb2->rxo;
 +  stats->tx_bytes = cb2->txo;
 +  stats->rx_packets = rx_packets;
 +
 +get_stat_exit:
 +  EDEB_EX(7, "");
 +  return stats;
 +}
 +
 +static inline u32 ehea_get_send_lkey(struct ehea_port_res *pr)
 +{
 +  return pr->send_mr.lkey;
 +}

 Get rid of this, it's only used once.


done

 +static inline u32 ehea_get_recv_lkey(struct ehea_port_res *pr)
 +{
 +  return pr->recv_mr.lkey;
 +}

 And this one only twice? Is it really useful?


done

 +
 +#define EHEA_OD_ADDR(address, segment) (((address)  (PAGE_SIZE - 1)) \
 +

Re: [PATCH 2/6] ehea: pHYP interface

2006-08-10 Thread Thomas Klein


Arnd Bergmann wrote:

On Wednesday 09 August 2006 10:38, Jan-Bernd Themann wrote:

--- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea_hcall.h 1969-12-31 
16:00:00.0 -0800
+++ kernel/drivers/net/ehea/ehea_hcall.h2006-08-08 23:59:38.111462960 
-0700
@@ -0,0 +1,52 @@



+
+/**
+ * This file contains HCALL defines that are to be included in the appropriate
+ * kernel files later
+ */
+
+#define H_ALLOC_HEA_RESOURCE   0x278
+#define H_MODIFY_HEA_QP        0x250
+#define H_QUERY_HEA_QP         0x254
+#define H_QUERY_HEA            0x258
+#define H_QUERY_HEA_PORT       0x25C
+#define H_MODIFY_HEA_PORT      0x260
+#define H_REG_BCMC             0x264
+#define H_DEREG_BCMC           0x268
+#define H_REGISTER_HEA_RPAGES  0x26C
+#define H_DISABLE_AND_GET_HEA  0x270
+#define H_GET_HEA_INFO         0x274
+#define H_ADD_CONN             0x284
+#define H_DEL_CONN             0x288


I guess these should go to include/asm-powerpc/hvcall.h instead.

  Arnd 


We posted a separate patch for hvcall.h 
(http://ozlabs.org/pipermail/linuxppc-dev/2006-August/025000.html).
As soon as this patch is accepted we'll remove the ehea_hcall.h header file.



Re: [PATCH 1/6] ehea: interface to network stack

2006-08-10 Thread Jan-Bernd Themann


Hi,

thanks for your comments!

We'll post a modified patch very soon.

Jan-Bernd

Alexey Dobriyan wrote:

On Wed, Aug 09, 2006 at 10:38:20AM +0200, Jan-Bernd Themann wrote:

--- linux-2.6.18-rc4-orig/drivers/net/ehea/ehea_main.c
+++ kernel/drivers/net/ehea/ehea_main.c



+static inline u64 get_swqe_addr(u64 tmp_addr, int addr_seg)
+{
+   u64 addr;
+   addr = tmp_addr;
+   return addr;
+}
+
+static inline u64 get_rwqe_addr(u64 tmp_addr)
+{
+   return tmp_addr;
+}


The point of this exercise?


has been removed




+static inline int ehea_refill_rq3_def(struct ehea_port_res *pr, int nr_of_wqes)


Way too big to be inline function.


+{
+   int i;
+   int ret = 0;
+   struct ehea_qp *qp;
+   struct ehea_rwqe *rwqe;
+   int skb_arr_rq3_len = pr->skb_arr_rq3_len;
+   struct sk_buff **skb_arr_rq3 = pr->skb_arr_rq3;
+   EDEB_EN(8, "pr=%p, nr_of_wqes=%d", pr, nr_of_wqes);
+   if (nr_of_wqes == 0)
+           return -EINVAL;
+   qp = pr->qp;
+   for (i = 0; i < nr_of_wqes; i++) {
+           int index = pr->skb_rq3_index++;
+           struct sk_buff *skb = dev_alloc_skb(EHEA_MAX_PACKET_SIZE
+                                               + NET_IP_ALIGN);
+
+           if (!skb) {
+                   EDEB_ERR(4, "No memory for skb. Only %d rwqe filled.",
+                            i);
+                   ret = -ENOMEM;
+                   break;
+           }
+           skb_reserve(skb, NET_IP_ALIGN);
+
+           rwqe = ehea_get_next_rwqe(qp, 3);
+           pr->skb_rq3_index %= skb_arr_rq3_len;
+           skb_arr_rq3[index] = skb;
+           rwqe->wr_id = EHEA_BMASK_SET(EHEA_WR_ID_TYPE, EHEA_RWQE3_TYPE)
+                   | EHEA_BMASK_SET(EHEA_WR_ID_INDEX, index);
+           rwqe->sg_list[0].l_key = ehea_get_recv_lkey(pr);
+           rwqe->sg_list[0].vaddr = get_rwqe_addr((u64)skb->data);
+           rwqe->sg_list[0].len = EHEA_MAX_PACKET_SIZE;
+           rwqe->data_segments = 1;
+   }
+
+   /* Ring doorbell */
+   iosync();
+   ehea_update_rq3a(qp, i);
+   EDEB_EX(8, "");
+   return ret;
+}
+
+
+static inline int ehea_refill_rq3(struct ehea_port_res *pr, int nr_of_wqes)
+{
+   return ehea_refill_rq3_def(pr, nr_of_wqes);
+}


ehea_refill_rq3[123] appears to be 1:1 wrappers around
ehea_refill_rq3[123]_def. Any idea behind them?



introduced for near future features


+   init_attr = (struct ehea_qp_init_attr*)
+   kzalloc(sizeof(struct ehea_qp_init_attr), GFP_KERNEL);


Useless cast.



removed


+   pr->skb_arr_sq = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
+   * (max_rq_entries + 1));


Useless cast


removed



+   pr->skb_arr_rq1 = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
+   * (max_rq_entries + 1));



+   pr->skb_arr_rq2 = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
+   * (max_rq_entries + 1));



+   pr->skb_arr_rq3 = (struct sk_buff**)vmalloc(sizeof(struct sk_buff*)
+   * (max_rq_entries + 1));



+static int ehea_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+   EDEB_ERR(4, "ioctl not supported: dev=%s cmd=%d", dev->name, cmd);


Then copy NULL into ->do_ioctl!



done


+   return -EOPNOTSUPP;
+}



+   ehea_port_cb_0 = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL);
+
+   if (!ehea_port_cb_0) {
+           EDEB_ERR(4, "No memory for ehea_port control block");
+           ret = -ENOMEM;
+           goto kzalloc_failed;
+   }
+
+   memcpy((u8*)(&(ehea_port_cb_0->port_mac_addr)),
+          (u8*)&(mac_addr->sa_data[0]), 6);


No casts on memcpy arguments.


done




+   memcpy((u8*)ehea_mcl_entry->macaddr, mc_mac_addr, ETH_ALEN);



+static inline void ehea_xmit2(struct sk_buff *skb,
+ struct net_device *dev, struct ehea_swqe *swqe,
+ u32 lkey)
+{
+   int nfrags;
+   unsigned short skb_protocol = skb->protocol;


Useless variable. And it should be __be16, FYI.



changed


+   nfrags = skb_shinfo(skb)->nr_frags;
+   EDEB_EN(7, "skb->nfrags=%d (0x%X)", nfrags, nfrags);
+
+   if (skb_protocol == ETH_P_IP) {


ITYM, htons(ETH_P_IP).



good point, thx
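
The byte-order point can be shown with a small userspace sketch. It assumes only the standard htons() and the ETH_P_* constants from <linux/if_ether.h>; proto_is_ipv4() is a hypothetical helper, and its argument stands in for skb->protocol, which holds the EtherType in network byte order.

```c
#include <arpa/inet.h>      /* htons() */
#include <linux/if_ether.h> /* ETH_P_IP, ETH_P_ARP */
#include <stdint.h>

/*
 * Hypothetical helper: proto_be stands in for skb->protocol, which is
 * stored in network byte order (big endian).
 */
static int proto_is_ipv4(uint16_t proto_be)
{
	/* Convert the host-order constant, not the packet field. */
	return proto_be == htons(ETH_P_IP);
}
```

Comparing the raw field against ETH_P_IP only happens to work on big-endian hosts, which is presumably why the unconverted comparison went unnoticed on the driver's target platform.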


+static inline void ehea_xmit3(struct sk_buff *skb,
+ struct net_device *dev, struct ehea_swqe *swqe)
+{
+   int i;
+   skb_frag_t *frag;
+   int nfrags = skb_shinfo(skb)->nr_frags;
+   u8 *imm_data = &swqe->u.immdata_nodesc.immediate_data[0];
+   u64 skb_protocol = skb->protocol;


Useless var.


removed




+
+   EDEB_EN(7, "");
+   if (likely(skb_protocol == ETH_P_IP)) {


   htons(ETH_P_IP)




Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-10 Thread Alexey Kuznetsov

Hello!

 This patch handles NLM_F_ECHO in netlink_rcv_skb() to
 handle it in a central point. Most subsystems currently
 interpret NLM_F_ECHO as to just unicast events to the
 originator of the change while the real meaning of the
 flag is to echo the request.

Don't you think it is useless to echo something back to the originator
who just sent it?

Actually, the purpose of NLM_F_ECHO was to tell the user what happened due to
his request. The answer is not the original request, which can contain
some incomplete fields etc., but full information about the object
deleted/added/changed. Moreover, the feedback can contain several
messages (though this is done accurately only in net/sched/), e.g. when
the request triggered deletion of one object and addition of another.

Obviously, it cannot be done in a central place.

Normally, it is not needed: "ip route add" does not tell the user what
was actually done, so it suppresses echo. But for multistage
operations it is absolutely necessary: the answer contains e.g. auto-allocated
handles, which should be given in subsequent requests.

Alexey
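
For reference, a minimal sketch of how a userspace requester would ask for this feedback. It only fills in a netlink header; init_echo_request() is a hypothetical helper, and the choice of RTM_NEWROUTE with NLM_F_CREATE is an assumption made for illustration, not something taken from the patch under discussion.

```c
#include <sys/socket.h>
#include <linux/netlink.h>   /* struct nlmsghdr, NLM_F_*, NLMSG_LENGTH */
#include <linux/rtnetlink.h> /* RTM_NEWROUTE, struct rtmsg */
#include <string.h>

/*
 * Hypothetical helper: prepare the header of a route-add request that asks
 * the kernel to report back what it actually did (NLM_F_ECHO).
 */
static void init_echo_request(struct nlmsghdr *nlh, unsigned int seq)
{
	memset(nlh, 0, sizeof(*nlh));
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
	nlh->nlmsg_type = RTM_NEWROUTE;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ECHO;
	nlh->nlmsg_seq = seq;	/* lets the caller match replies to the request */
}
```

The reply to such a request would then be read from the same socket; per the description above it may consist of more than one message.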

Possible leak of multicast source filter structure #2

2006-08-10 Thread Michal Ruzicka

Took some time but this time the inlined patch should be OK.

Hi all!
It seems to me that there is a leak of struct ip_sf_socklist in the 
ip_mc_drop_socket function (in net/ipv4/igmp.c) which is called on socket 
close.

This patch corrects it:

diff -Naur linux-2.6.17.8.orig/net/ipv4/igmp.c linux-2.6.17.8/net/ipv4/igmp.c
--- linux-2.6.17.8.orig/net/ipv4/igmp.c 2006-08-07 06:18:54.0 +0200
+++ linux-2.6.17.8/net/ipv4/igmp.c  2006-08-10 10:38:04.0 +0200
@@ -2206,9 +2206,10 @@
 			(void) ip_mc_leave_src(sk, iml, in_dev);
 			ip_mc_dec_group(in_dev, iml->multi.imr_multiaddr.s_addr);
 			in_dev_put(in_dev);
-		}
-		sock_kfree_s(sk, iml, sizeof(*iml));
+		} else if (iml->sflist != NULL)
+			sock_kfree_s(sk, iml->sflist, IP_SFLSIZE(iml->sflist->sl_max));
 
+		sock_kfree_s(sk, iml, sizeof(*iml));
 	}
 	rtnl_unlock();
 }


The leak only happens if there are some multicast source filters set on a
socket which are bound to an interface that does not exist any more, as in
the following scenario:
1. create a temporary interface (say a GRE tunnel)
2. create a socket
3. join a multicast group and set a source filter on the temporary interface
via a MCAST_JOIN_SOURCE_GROUP setsockopt call
4. destroy the temporary interface
5. close the socket

This sequence eventually leads to a call of the ip_mc_drop_socket
function, which fails to free the source filter structure ip_sf_socklist
pointed to from members of the socket's multicast address list. This structure
is normally freed in the ip_mc_leave_src function, but that function is not
called in this scenario because the interface that the multicast group was
joined on does not exist any more.
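
The socket-side steps of the scenario (2, 3 and 5) can be sketched in userspace as follows. This is a minimal sketch, not the reporter's test program: fill_source_req() and join_source_group() are hypothetical helpers, the interface index and addresses are supplied by the caller, and creating and destroying the temporary interface (steps 1 and 4) is assumed to happen outside this code.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: build the request used in step 3. ifindex is the
 * index of the temporary interface; group and source are dotted-quad strings.
 * Returns 0 on success, -1 if an address does not parse.
 */
static int fill_source_req(struct group_source_req *req, unsigned int ifindex,
			   const char *group, const char *source)
{
	struct sockaddr_in *grp = (struct sockaddr_in *)&req->gsr_group;
	struct sockaddr_in *src = (struct sockaddr_in *)&req->gsr_source;

	memset(req, 0, sizeof(*req));
	req->gsr_interface = ifindex;
	grp->sin_family = AF_INET;
	src->sin_family = AF_INET;
	if (inet_pton(AF_INET, group, &grp->sin_addr) != 1 ||
	    inet_pton(AF_INET, source, &src->sin_addr) != 1)
		return -1;
	return 0;
}

/* Steps 2 and 3: create a socket and join (group, source) on ifindex. */
static int join_source_group(unsigned int ifindex, const char *group,
			     const char *source)
{
	struct group_source_req req;
	int fd;

	if (fill_source_req(&req, ifindex, group, source) < 0)
		return -1;
	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_IP, MCAST_JOIN_SOURCE_GROUP,
		       &req, sizeof(req)) < 0) {
		close(fd);
		return -1;
	}
	/* Step 5 is close(fd), issued after the interface has been destroyed. */
	return fd;
}
```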

Thanks
Michal Ruzicka 


ipvs localhost client patch for 2.6?

2006-08-10 Thread Ryan Nowakowski

I found this patch for 2.4 that allows the host running ipvs to act
as its own client via a loopback connection.  Does anyone have a similar
patch for 2.6?

--- ip_vs_core.c.orig   2003-11-28 19:26:21.0 +0100
+++ ip_vs_core.c.list   2004-07-02 11:13:51.0 +0200
@@ -1036,7 +1036,7 @@
 *  Big tappo: only PACKET_HOST (nor loopback neither mcasts)
 *  ... don't know why 1st test DOES NOT include 2nd (?)
 */
-   if (skb->pkt_type != PACKET_HOST || skb->dev == &loopback_dev) {
+   if (skb->pkt_type != PACKET_HOST) { /* || skb->dev == &loopback_dev) { */
        IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
                  skb->pkt_type,
                  iph->protocol,
@@ -1059,6 +1059,13 @@
    iph = skb->nh.iph;
    h.raw = (char*) iph + ihl;
 
+    cp = ip_vs_conn_out_get(iph->protocol, iph->saddr, h.portp[0],
+                            iph->daddr, h.portp[1]);
+    if (cp) {
+           __ip_vs_conn_put(cp);
+           return (ip_vs_out(hooknum,skb_p,in,out,okfn));
+    }
+
+
/*
 * Check if the packet belongs to an existing connection entry
 */


Re: [PATCH 1/5] sock_create bad error return

On Wed, 09 Aug 2006 20:47:45 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 09 Aug 2006 11:31:39 -0700

  If socket create call races with module unload, it correctly
  fails the socket call but doesn't return an error. This race
  is theoretical because the sock->ops are always the same and
  non-modular.

  Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

 I think the intention of the code is to return
 -EAFNOSUPPORT which is set explicitly some lines
 above, and this makes sense because if we can't grab
 onto the module reference count it means the module
 is in the process of being unloaded.

It is the module reference count of the socket file ops, not the
protocol family reference count.  The protocol family code is
already handled a few lines above.

Since the socket code can't be built as a module, it is a dead end. I think
in-olden-times the idea was that networking could be built as a module
so that inode ops would have to be ref counted.


Re: Possible leak of multicast source filter structure

2006-08-10 Thread David Stevens

Michal,
This looks correct, but I think a better way to do it is:

in_dev = inetdev_by_index(...)
(void) ip_mc_leave_src()
if (in_dev) {
ip_mc_dec_group()
in_dev_put()
}

That way, sflist internal details aren't visible at this
level, and ip_mc_leave_src() collapses to the sock_kfree_s()
when in_dev is NULL.
Also, ip_mc_leave_group() has the same issue; looks
like it just needs the if (in_dev) removed before the call to
ip_mc_leave_src().
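
A userspace model of that control flow, purely for illustration: the structures and helpers below are simplified stand-ins with made-up names, not kernel code. The model only captures the property described above, that the filter list is released whether or not the device still exists.

```c
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures; all names are illustrative. */
struct sf_list { int sl_max; };
struct in_dev { int groups; };
struct mc_entry { struct sf_list *sflist; };

/*
 * Models ip_mc_leave_src() as described above: with no device it collapses
 * to freeing the per-socket filter list.
 * Returns 1 if a filter list was freed, 0 otherwise.
 */
static int leave_src(struct mc_entry *iml, struct in_dev *in_dev)
{
	int freed = iml->sflist != NULL;

	(void)in_dev;	/* device-side bookkeeping is left out of this model */
	free(iml->sflist);
	iml->sflist = NULL;
	return freed;
}

/* Models one iteration of the proposed loop body in ip_mc_drop_socket(). */
static int drop_entry(struct mc_entry *iml, struct in_dev *in_dev)
{
	int freed = leave_src(iml, in_dev);	/* no longer inside if (in_dev) */

	if (in_dev)
		in_dev->groups--;	/* stands in for ip_mc_dec_group()/in_dev_put() */
	free(iml);
	return freed;
}
```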

+-DLS


Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications

Thomas Graf wrote:
 Setting a socket error on all sockets subscribed to a group
 if an event notification of said group fails due to memory
 pressure only confuses applications and is of no use.
 
 This patch removes it altogether.

I disagree with this patch, how else are applications supposed
to know when they missed an update and are not in sync anymore?
I actually have a half-finished patch to add this in some spots
where it's missing (and uses better error codes).


Re: PATCH Fix bonding active-backup behavior for VLAN interfaces

2006-08-10 Thread Krzysztof Oledzki




On Thu, 3 Aug 2006, Krzysztof Oledzki wrote:




On Wed, 2 Aug 2006, David Miller wrote:
CUT

Finally, I'm still a little stumped about why this change is necessary
still, to be honest.


If I understand it correctly this patch fixes the "[PATCH] bonding: suppress
duplicate packets" patch:


http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8f903c708fcc2b579ebf16542bf6109bad593a1d;hp=ebe19a4ed78d4a11a7e01cdeda25f91b7f2fcb5a

It seems that the original patch does not work properly in a vlan accelerated
environment, which I reported on 31 Mar 2006:

http://marc.theaimsgroup.com/?l=bonding-devel&m=114381240718113&w=2

Anyway, I didn't test this patch yet but I'm going to do it ASAP.


OK, this patch really solves the bug from my report. Is there any chance
of a similar fix in net-2.6.19.git?


Best regards,

Krzysztof Olędzki

[PATCH] neighbor: use ALIGN() macro

Rather than open-coding the mask, it looks better to use the ALIGN()
macro from kernel.h.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- net-2.6.19.orig/include/net/neighbour.h
+++ net-2.6.19/include/net/neighbour.h
@@ -101,7 +101,7 @@ struct neighbour
 	__u8			dead;
 	atomic_t		probes;
 	rwlock_t		lock;
-	unsigned char		ha[(MAX_ADDR_LEN+sizeof(unsigned long)-1)&~(sizeof(unsigned long)-1)];
+	unsigned char		ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
 	struct hh_cache		*hh;
 	atomic_t		refcnt;
 	int			(*output)(struct sk_buff *skb);
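
A small standalone check that the two array-size expressions agree. This is a sketch under assumptions: ALIGN() is redefined locally with the usual round-up-to-power-of-two definition, and the MAX_ADDR_LEN value of 32 is assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the usual round-up macro; a must be a power of two. */
#define ALIGN(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

/* Assumed value, for illustration only. */
#define MAX_ADDR_LEN	32

/* Open-coded form from the removed line. */
#define HA_LEN_OPEN_CODED \
	((MAX_ADDR_LEN + sizeof(unsigned long) - 1) & ~(sizeof(unsigned long) - 1))

/* ALIGN() form from the added line. */
#define HA_LEN_ALIGN	ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))

/* Checked at compile time, so a mismatch would fail the build. */
static_assert(HA_LEN_OPEN_CODED == HA_LEN_ALIGN, "both forms must agree");
```

Both expressions round MAX_ADDR_LEN up to the next multiple of sizeof(unsigned long), so the layout of struct neighbour is unchanged by the patch.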

Re: [PATCH 4/5] net: socket family using RCU

2006-08-10 Thread Paul E. McKenney

On Wed, Aug 09, 2006 at 11:31:42AM -0700, Stephen Hemminger wrote:
 Replace the gross custom locking done in socket code for net_family[]
 with simple RCU usage. Some reordering necessary to avoid sleep
 issues with sock_alloc.

Definitely a good use of RCU from a read-intensive standpoint -- does
anyone other than Linux-kernel networking developers change the elements
of the net_family[] array except at boot and shutdown?  ;-)

Some comments included below.  Looks good, but one question about
things like atalk_create() being able to sleep and a place or two
where a comment would be good.

Thanx, Paul

 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
 
 ---
  net/socket.c |  171 
 +--
  1 file changed, 74 insertions(+), 97 deletions(-)
 
 --- net-2.6.orig/net/socket.c 2006-08-09 11:19:08.0 -0700
 +++ net-2.6/net/socket.c  2006-08-09 11:19:22.0 -0700
 @@ -59,11 +59,11 @@
   */
  
  #include <linux/mm.h>
 -#include <linux/smp_lock.h>
  #include <linux/socket.h>
  #include <linux/file.h>
  #include <linux/net.h>
  #include <linux/interrupt.h>
 +#include <linux/rcupdate.h>
  #include <linux/netdevice.h>
  #include <linux/proc_fs.h>
  #include <linux/seq_file.h>
 @@ -146,51 +146,8 @@
   *   The protocol list. Each protocol is registered in here.
   */
  
 -static struct net_proto_family *net_families[NPROTO];
 -
 -#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
 -static atomic_t net_family_lockct = ATOMIC_INIT(0);
  static DEFINE_SPINLOCK(net_family_lock);
 -
 -/* The strategy is: modifications net_family vector are short, do not
 -   sleep and veeery rare, but read access should be free of any exclusive
 -   locks.
 - */
 -
 -static void net_family_write_lock(void)
 -{
 - spin_lock(&net_family_lock);
 - while (atomic_read(&net_family_lockct) != 0) {
 - spin_unlock(&net_family_lock);
 -
 - yield();
 -
 - spin_lock(&net_family_lock);
 - }
 -}
 -
 -static __inline__ void net_family_write_unlock(void)
 -{
 - spin_unlock(&net_family_lock);
 -}
 -
 -static __inline__ void net_family_read_lock(void)
 -{
 - atomic_inc(&net_family_lockct);
 - spin_unlock_wait(&net_family_lock);
 -}
 -
 -static __inline__ void net_family_read_unlock(void)
 -{
 - atomic_dec(&net_family_lockct);
 -}
 -
 -#else
 -#define net_family_write_lock() do { } while(0)
 -#define net_family_write_unlock() do { } while(0)
 -#define net_family_read_lock() do { } while(0)
 -#define net_family_read_unlock() do { } while(0)
 -#endif
 +static const struct net_proto_family *net_families[NPROTO];
  
  /*
   *   Statistics counters of the socket lists
 @@ -1131,6 +1088,7 @@
  {
   int err;
   struct socket *sock;
 + const struct net_proto_family *pf;
  
   /*
*  Check protocol is in range
 @@ -1159,6 +1117,20 @@
   if (err)
   return err;
  
 + /*
 +  *  Allocate the socket and allow the family to set things up. if
 +  *  the protocol is 0, the family is instructed to select an appropriate
 +  *  default.
 +  */
 + sock = sock_alloc();
 + if (!sock) {
 + printk(KERN_WARNING "socket: no more sockets\n");
 + return -ENFILE; /* Not exactly a match, but its the
 +closest posix thing */
 + }
 +
 + sock->type = type;
 +
  #if defined(CONFIG_KMOD)
   /* Attempt to load a protocol module if the find failed.
*
 @@ -1166,70 +1138,59 @@
* requested real, full-featured networking support upon configuration.
* Otherwise module support will break!
*/
 - if (net_families[family] == NULL) {
 + if (net_families[family] == NULL)
   request_module("net-pf-%d", family);

OK, I'll bite...

What happens if the module is not present?  Or is this what the
"Otherwise module support will break" comment is getting at?

Also, this reference to net_families[family] is done without
rcu_dereference() and without any clear update-side lock.  This
just happens to be OK, since we are only testing for NULL, but
should at least have a comment.
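
The distinction between a reader that follows the pointer and one that only tests it for NULL can be modelled in userspace with C11 atomics. This is a rough analogue under stated assumptions, not RCU: the release store and acquire load stand in for rcu_assign_pointer() and rcu_dereference(), grace periods are not modelled at all, and the table size and helper names are made up for the sketch.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NPROTO 32	/* assumed table size for this sketch */

struct net_proto_family { int family; };

static _Atomic(const struct net_proto_family *) net_families[NPROTO];

/* Writer side: publish a fully initialised entry (role of rcu_assign_pointer). */
static int family_register(const struct net_proto_family *pf)
{
	if (pf->family < 0 || pf->family >= NPROTO)
		return -1;
	atomic_store_explicit(&net_families[pf->family], pf,
			      memory_order_release);
	return 0;
}

/* Reader that will follow the pointer (role of rcu_dereference). */
static const struct net_proto_family *family_get(int family)
{
	return atomic_load_explicit(&net_families[family],
				    memory_order_acquire);
}

/*
 * Reader that only tests for NULL and never follows the pointer, so it
 * needs no ordering against the writer's initialisation of the entry.
 */
static int family_present(int family)
{
	return atomic_load_explicit(&net_families[family],
				    memory_order_relaxed) != NULL;
}
```

In the model, family_present() corresponds to the bare NULL test before request_module(): it is safe only because nothing behind the pointer is read, which is the kind of fact a comment at that spot would record.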

 - }
  #endif
  
 - net_family_read_lock();
 - if (net_families[family] == NULL) {
 - err = -EAFNOSUPPORT;
 - goto out;
 - }
 -
 -/*
 - *   Allocate the socket and allow the family to set things up. if
 - *   the protocol is 0, the family is instructed to select an appropriate
 - *   default.
 - */
 -
 - if (!(sock = sock_alloc())) {
 - printk(KERN_WARNING "socket: no more sockets\n");
 - err = -ENFILE;  /* Not exactly a match, but its the
 -closest posix thing */
 - goto out;
 - }
 -
 - sock->type = type;
 + rcu_read_lock();
 + pf = rcu_dereference(net_families[family]);

OK, so the elements of the net_families array are protected by RCU.
All references should

Re: [PATCH 0/5] net socket family patches

2006-08-10 Thread Randy.Dunlap

On Thu, 10 Aug 2006 05:36:13 + (UTC) Alexey Toptygin wrote:

On Wed, 9 Aug 2006, David Miller wrote:

From: Stephen Hemminger [EMAIL PROTECTED]
Date: Wed, 09 Aug 2006 11:31:38 -0700

These patches cleanup the net socket family interface and
convert it to RCU. This is new stuff that should go into 2.6.19
(if it is ready). Andrew could you put it in -mm as well?

Andrew pulls net-2.6.19 so there is no need to ask him to
put networking patches explicitly into -mm

I've been wondering - are the relationships of which of the various kernel
trees pull patches from which other ones documented anywhere? If so, I'd
love to read about it.

Not really documented AFAIK, except what Andrew pulls into his -mm tree
for testing. His announcements [used to] list which (git or other) trees that
he has merged, along with non-tree patches. Now that is just in the
patch-list file, e.g., see

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/patch-list
and then search for git- to see which git trees it contains.

If you go down the maintainer's hierarchy, it gets more fuzzy. :)
Jeff Garzik pulls the wireless tree from John Linville and several
net driver trees from Francois Romieu, e.g. And Jeff pulls SATA
patches from Tejun Heo.

DaveM pulls net patches from Yoshifuji etc.

James Bottomley usually maintains 2 SCSI git trees: one for
2.6.current-rc fixes and one for 2.6.next merges. He recently documented
that in email to [EMAIL PROTECTED]

Most kernel git trees can be seen at www.kernel.org/git/.
Most kernel patch trees (git or other) are now listed in the
MAINTAINERS file.

HTH.
---
~Randy

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

* Alexey Kuznetsov [EMAIL PROTECTED] 2006-08-10 19:51
  This patch handles NLM_F_ECHO in netlink_rcv_skb() to
  handle it in a central point. Most subsystems currently
  interpret NLM_F_ECHO as to just unicast events to the
  originator of the change while the real meaning of the
  flag is to echo the request.
 
 Don't you think it is useless to echo something back to the originator
 who just sent it?
 
 Actually, the purpose of NLM_F_ECHO was to tell the user what happened due to
 his request. The answer is not the original request, which can contain
 some incomplete fields etc., but full information about the object
 deleted/added/changed. Moreover, the feedback can contain several
 messages (though this is done accurately only in net/sched/), e.g. when
 the request triggered deletion of one object and addition of another.
 
 Obviously, it cannot be done in a central place.
 
 Normally, it is not needed: "ip route add" does not tell the user what
 was actually done, so it suppresses echo. But for multistage
 operations it is absolutely necessary: the answer contains e.g. auto-allocated
 handles, which should be given in subsequent requests.

What's wrong with listening to the notification for that purpose?

Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications

* Patrick McHardy [EMAIL PROTECTED] 2006-08-10 20:09
 Thomas Graf wrote:
  Setting a socket error on all sockets subscribed to a group
  if an event notification of said group fails due to memory
  pressure only confuses applications and is of no use.
  
  This patch removes it altogether.
 
 I disagree with this patch, how else are applications supposed
 to know when they missed an update and are not in sync anymore?
 I actually have a half-finished patch to add this in some spots
 where it's missing (and uses better error codes).

The application has no idea what went wrong nor does it know
for which group, so it will have to resync all group subscriptions,
and as it only happens due to memory pressure that will fail
anyway.

Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications

Thomas Graf wrote:
 * Patrick McHardy [EMAIL PROTECTED] 2006-08-10 20:09
 
I disagree with this patch, how else are applications supposed
to know when they missed an update and are not in sync anymore?
I actually have a half-finished patch to add this in some spots
where it's missing (and uses better error codes).
 
 
 The application has no idea what went wrong nor does it know
 for which group, so it will have to resync all group subscriptions,
 and as it only happens due to memory pressure that will fail
 anyway.


The error code (-ENOMEM) gives it a pretty good idea what went
wrong. It's true that it doesn't know which group was affected
(that could be fixed), but at least it knows that something
went wrong and it needs to resync. If that fails due to memory
shortage as well it can schedule a delayed resync or something,
but without getting notified it has no chance of doing anything
useful. This makes notification essentially useless. If I can't
rely on getting either a notification or an error, I can't
rely on them at all.

Please put this back in.

Re: [PATCH 3/4] [NETLINK]: Dont set socket error for failed event notifications

* Patrick McHardy [EMAIL PROTECTED] 2006-08-10 21:08
 The error code (-ENOMEM) gives it a pretty good idea what went
 wrong. It's true that it doesn't know which group was affected
 (that could be fixed), but at least it knows that something
 went wrong and it needs to resync. If that fails due to memory
 shortage as well it can schedule a delayed resync or something,
 but without getting notified it has no chance of doing anything
 useful. This makes notification essentially useless. If I can't
 rely on getting either a notification or an error, I can't
 rely on them at all.
 
 Please put this back in.

Alright, I think it's pretty much theoretical but it doesn't
really matter to me.

Dave, please revert the whole patchset.

[NET 02/06]: Introduce RTA_TABLE/FRA_TABLE attributes

[NET]: Introduce RTA_TABLE/FRA_TABLE attributes

Introduce RTA_TABLE route attribute and FRA_TABLE routing rule attribute
to hold 32 bit routing table IDs. Userspace compatibility is provided by
continuing to accept and send the rtm_table field, but because of its
limited size it can only carry the low 8 bits of the table ID. This
implies that if larger IDs are used, _all_ userspace programs using them
need to use RTA_TABLE.
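
The compatibility rule can be modelled in a few lines of standalone C. This is an illustrative model only: get_table() and legacy_table_field() are hypothetical names, and a plain pointer stands in for the parsed attribute. It mirrors the behaviour described above, where the 32 bit attribute wins when present and the 8 bit header field is the fallback.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the lookup: attr points at the 32-bit attribute
 * payload if the message carried one, or is NULL otherwise; hdr_table is
 * the legacy 8-bit header field.
 */
static uint32_t get_table(const uint32_t *attr, uint8_t hdr_table)
{
	return attr ? *attr : hdr_table;
}

/* What ends up in the 8-bit header field for a given table ID. */
static uint8_t legacy_table_field(uint32_t table)
{
	return (uint8_t)(table & 0xff);
}
```

For a table ID of 1000, a program that reads only the legacy field sees 232 (1000 & 0xff), which is why every program handling such IDs has to use the new attribute.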

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit a9fe50e925cdc0471b88bcf6f3cc18278b63c984
tree 08d8bfa20011b5afa940126a8bb0c153729584c3
parent 29a0f4a779543907ddf8fbca55b6f1d0e0017f64
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:50:19 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:50:19 +0200

 include/linux/fib_rules.h |4 
 include/linux/rtnetlink.h |8 
 include/net/fib_rules.h   |7 +++
 net/core/fib_rules.c  |5 +++--
 net/decnet/dn_fib.c   |7 ---
 net/decnet/dn_route.c |1 +
 net/decnet/dn_table.c |1 +
 net/ipv4/fib_frontend.c   |7 ---
 net/ipv4/fib_rules.c  |1 +
 net/ipv4/fib_semantics.c  |1 +
 net/ipv4/route.c  |1 +
 net/ipv6/fib6_rules.c |1 +
 net/ipv6/route.c  |   13 +
 13 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/include/linux/fib_rules.h b/include/linux/fib_rules.h
index 5e503f0..19a82b6 100644
--- a/include/linux/fib_rules.h
+++ b/include/linux/fib_rules.h
@@ -36,6 +36,10 @@ enum
FRA_UNUSED5,
FRA_FWMARK, /* netfilter mark (IPv4) */
FRA_FLOW,   /* flow/class id */
+   FRA_UNUSED6,
+   FRA_UNUSED7,
+   FRA_UNUSED8,
+   FRA_TABLE,  /* Extended table id */
__FRA_MAX
 };
 
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 5deca87..b01bc8b 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -264,6 +264,7 @@ enum rtattr_type_t
RTA_CACHEINFO,
RTA_SESSION,
RTA_MP_ALGO,
+   RTA_TABLE,
__RTA_MAX
 };
 
@@ -716,6 +717,13 @@ #define BUG_TRAP(x) do { \
} \
 } while(0)
 
+static inline u32 rtm_get_table(struct rtattr **rta, u8 table)
+{
+   return RTA_GET_U32(rta[RTA_TABLE-1]);
+rtattr_failure:
+   return table;
+}
+
 #endif /* __KERNEL__ */
 
 
diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 61375d9..8e2f473 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -74,6 +74,13 @@ static inline void fib_rule_put(struct f
call_rcu(&rule->rcu, fib_rule_put_rcu);
 }
 
+static inline u32 frh_get_table(struct fib_rule_hdr *frh, struct nlattr **nla)
+{
+   if (nla[FRA_TABLE])
+   return nla_get_u32(nla[FRA_TABLE]);
+   return frh->table;
+}
+
 extern int fib_rules_register(struct fib_rules_ops *);
 extern int fib_rules_unregister(struct fib_rules_ops *);
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 2e7ed5d..97b196f 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -187,7 +187,7 @@ int fib_nl_newrule(struct sk_buff *skb, 
 
rule->action = frh->action;
rule->flags = frh->flags;
-   rule->table = frh->table;
+   rule->table = frh_get_table(frh, tb);
 
if (!rule->pref && ops->default_pref)
rule->pref = ops->default_pref();
@@ -245,7 +245,7 @@ int fib_nl_delrule(struct sk_buff *skb, 
if (frh->action && (frh->action != rule->action))
continue;
 
-   if (frh->table && (frh->table != rule->table))
+   if (frh->table && (frh_get_table(frh, tb) != rule->table))
continue;
 
if (tb[FRA_PRIORITY] &&
@@ -291,6 +291,7 @@ static int fib_nl_fill_rule(struct sk_bu
 
frh = nlmsg_data(nlh);
frh->table = rule->table;
+   NLA_PUT_U32(skb, FRA_TABLE, rule->table);
frh->res1 = 0;
frh->res2 = 0;
frh->action = rule->action;
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index 7b3bf5c..fb59637 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -491,7 +491,8 @@ static int dn_fib_check_attr(struct rtms
if (attr) {
if (RTA_PAYLOAD(attr) < 4 && RTA_PAYLOAD(attr) != 2)
return -EINVAL;
-   if (i != RTA_MULTIPATH && i != RTA_METRICS)
+   if (i != RTA_MULTIPATH && i != RTA_METRICS &&
+   i != RTA_TABLE)
rta[i-1] = (struct rtattr *)RTA_DATA(attr);
}
}
@@ -508,7 +509,7 @@ int dn_fib_rtm_delroute(struct sk_buff *
if (dn_fib_check_attr(r, rta))
return -EINVAL;
 
-   tb = dn_fib_get_table(r->rtm_table, 0);
+   tb = dn_fib_get_table(rtm_get_table(rta, r->rtm_table), 0);
if (tb)
return tb->delete(tb, r, (struct dn_kern_rta *)rta, nlh,

[NET 01/06]: Use u32 for routing table IDs

[NET]: Use u32 for routing table IDs

Use u32 for routing table IDs in net/ipv4 and net/decnet in preparation of
support for a larger number of routing tables. net/ipv6 already uses u32
everywhere and needs no further changes. No functional changes are made by
this patch.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 29a0f4a779543907ddf8fbca55b6f1d0e0017f64
tree c559ca79c2d6ab28ceb4a4c1d5ecd5ea81264f0d
parent 1b471cd32acdff18786bc06542c686d52decbc5a
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:45:22 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:45:22 +0200

 include/net/dn_fib.h |4 ++--
 include/net/ip_fib.h |   14 +++---
 net/decnet/dn_fib.c  |6 +++---
 net/decnet/dn_table.c|   10 +-
 net/ipv4/fib_frontend.c  |8 
 net/ipv4/fib_hash.c  |4 ++--
 net/ipv4/fib_lookup.h|4 ++--
 net/ipv4/fib_rules.c |2 +-
 net/ipv4/fib_semantics.c |4 ++--
 net/ipv4/fib_trie.c  |6 +++---
 10 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/net/dn_fib.h b/include/net/dn_fib.h
index 32bc8ce..cd9c378 100644
--- a/include/net/dn_fib.h
+++ b/include/net/dn_fib.h
@@ -94,7 +94,7 @@ #define DN_FIB_INFO(f) ((f)->fn_info)
 
 
 struct dn_fib_table {
-   int n;
+   u32 n;
 
int (*insert)(struct dn_fib_table *t, struct rtmsg *r, 
struct dn_kern_rta *rta, struct nlmsghdr *n, 
@@ -137,7 +137,7 @@ extern int dn_fib_sync_up(struct net_dev
 /*
  * dn_tables.c
  */
-extern struct dn_fib_table *dn_fib_get_table(int n, int creat);
+extern struct dn_fib_table *dn_fib_get_table(u32 n, int creat);
 extern struct dn_fib_table *dn_fib_empty_table(void);
 extern void dn_fib_table_init(void);
 extern void dn_fib_table_cleanup(void);
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index adf7358..0dcbf16 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -150,7 +150,7 @@ #define FIB_RES_NETMASK(res)(0)
 #endif /* CONFIG_IP_ROUTE_MULTIPATH_WRANDOM */
 
 struct fib_table {
-   unsigned char   tb_id;
+   u32 tb_id;
unsignedtb_stamp;
int (*tb_lookup)(struct fib_table *tb, const struct flowi 
*flp, struct fib_result *res);
int (*tb_insert)(struct fib_table *table, struct rtmsg *r,
@@ -173,14 +173,14 @@ #ifndef CONFIG_IP_MULTIPLE_TABLES
 extern struct fib_table *ip_fib_local_table;
 extern struct fib_table *ip_fib_main_table;
 
-static inline struct fib_table *fib_get_table(int id)
+static inline struct fib_table *fib_get_table(u32 id)
 {
if (id != RT_TABLE_LOCAL)
return ip_fib_main_table;
return ip_fib_local_table;
 }
 
-static inline struct fib_table *fib_new_table(int id)
+static inline struct fib_table *fib_new_table(u32 id)
 {
return fib_get_table(id);
 }
@@ -205,9 +205,9 @@ #define ip_fib_main_table (fib_tables[RT
 
 extern struct fib_table * fib_tables[RT_TABLE_MAX+1];
 extern int fib_lookup(struct flowi *flp, struct fib_result *res);
-extern struct fib_table *__fib_new_table(int id);
+extern struct fib_table *__fib_new_table(u32 id);
 
-static inline struct fib_table *fib_get_table(int id)
+static inline struct fib_table *fib_get_table(u32 id)
 {
if (id == 0)
id = RT_TABLE_MAIN;
@@ -215,7 +215,7 @@ static inline struct fib_table *fib_get_
return fib_tables[id];
 }
 
-static inline struct fib_table *fib_new_table(int id)
+static inline struct fib_table *fib_new_table(u32 id)
 {
if (id == 0)
id = RT_TABLE_MAIN;
@@ -248,7 +248,7 @@ extern int fib_convert_rtentry(int cmd, 
 extern u32  __fib_res_prefsrc(struct fib_result *res);
 
 /* Exported by fib_hash.c */
-extern struct fib_table *fib_hash_init(int id);
+extern struct fib_table *fib_hash_init(u32 id);
 
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 extern int fib4_rules_dump(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index ed5fb5c..7b3bf5c 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -534,8 +534,8 @@ int dn_fib_rtm_newroute(struct sk_buff *
 
 int dn_fib_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
-   int t;
-   int s_t;
+   u32 t;
+   u32 s_t;
struct dn_fib_table *tb;
 
if (NLMSG_PAYLOAD(cb->nlh, 0) >= sizeof(struct rtmsg) &&
@@ -765,7 +765,7 @@ void dn_fib_flush(void)
 {
 int flushed = 0;
 struct dn_fib_table *tb;
-int id;
+u32 id;
 
 for(id = RT_TABLE_MAX; id > 0; id--) {
 if ((tb = dn_fib_get_table(id, 0)) == NULL)
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index c6a2e41..b7c6c06 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -264,7 +264,7 @@ static int dn_fib_nh_match(struct rtmsg 
 }
 
 static int dn_fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
-u8

[IPV4 03/06]: Increase number of possible routing tables to 2^32

[IPV4]: Increase number of possible routing tables to 2^32

Increase the number of possible routing tables to 2^32 by replacing the
fixed sized array of pointers by a hash table and replacing iterations
over all possible table IDs by hash table walking.
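
A minimal userspace sketch of the idea, for readers skimming the diff. It is
not the kernel code: struct table, table_get() and table_new() are
hypothetical names, a plain singly linked chain stands in for hlist, and no
RCU or locking is shown. The bucket is chosen by masking the ID with the
(power of two) hash size, as the patch does with FIB_TABLE_HASHSZ.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_HASHSZ 256   /* power of two, same size as FIB_TABLE_HASHSZ */

/* Hypothetical, simplified stand-in for struct fib_table. */
struct table {
	struct table *next;    /* chain link; the kernel uses struct hlist_node */
	uint32_t id;
};

static struct table *table_hash[TABLE_HASHSZ];

/* Find a table by ID, or return NULL if it was never created. */
static struct table *table_get(uint32_t id)
{
	struct table *tb;

	for (tb = table_hash[id & (TABLE_HASHSZ - 1)]; tb != NULL; tb = tb->next)
		if (tb->id == id)
			return tb;
	return NULL;
}

/* Return the existing table for this ID, or allocate and link a new one. */
static struct table *table_new(uint32_t id)
{
	struct table *tb = table_get(id);
	unsigned int h = id & (TABLE_HASHSZ - 1);

	if (tb != NULL)
		return tb;
	tb = calloc(1, sizeof(*tb));
	if (tb == NULL)
		return NULL;
	tb->id = id;
	tb->next = table_hash[h];   /* insert at the head of the bucket */
	table_hash[h] = tb;
	return tb;
}
```

IDs 254 and 510 land in the same bucket here and are told apart by the ID
comparison in the chain walk, which is what lets the ID space grow without
growing the array.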

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 148d1ca7c199005b5a92f8154a7caf3f78529672
tree ee025abdbab6fe6a4eac916791b8a06f0622d71e
parent a9fe50e925cdc0471b88bcf6f3cc18278b63c984
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:52:30 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:52:30 +0200

 include/net/ip_fib.h|   25 ++--
 net/ipv4/fib_frontend.c |  102 +++
 net/ipv4/fib_hash.c |   26 ++--
 net/ipv4/fib_rules.c|4 +-
 net/ipv4/fib_trie.c |   26 ++--
 5 files changed, 101 insertions(+), 82 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 0dcbf16..8e9ba56 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -150,6 +150,7 @@ #define FIB_RES_NETMASK(res)(0)
 #endif /* CONFIG_IP_ROUTE_MULTIPATH_WRANDOM */
 
 struct fib_table {
+   struct hlist_node tb_hlist;
u32 tb_id;
unsignedtb_stamp;
int (*tb_lookup)(struct fib_table *tb, const struct flowi 
*flp, struct fib_result *res);
@@ -200,29 +201,13 @@ static inline void fib_select_default(co
 }
 
 #else /* CONFIG_IP_MULTIPLE_TABLES */
-#define ip_fib_local_table (fib_tables[RT_TABLE_LOCAL])
-#define ip_fib_main_table (fib_tables[RT_TABLE_MAIN])
+#define ip_fib_local_table fib_get_table(RT_TABLE_LOCAL)
+#define ip_fib_main_table fib_get_table(RT_TABLE_MAIN)
 
-extern struct fib_table * fib_tables[RT_TABLE_MAX+1];
 extern int fib_lookup(struct flowi *flp, struct fib_result *res);
-extern struct fib_table *__fib_new_table(u32 id);
-
-static inline struct fib_table *fib_get_table(u32 id)
-{
-   if (id == 0)
-   id = RT_TABLE_MAIN;
-
-   return fib_tables[id];
-}
-
-static inline struct fib_table *fib_new_table(u32 id)
-{
-   if (id == 0)
-   id = RT_TABLE_MAIN;
-
-   return fib_tables[id] ? : __fib_new_table(id);
-}
 
+extern struct fib_table *fib_new_table(u32 id);
+extern struct fib_table *fib_get_table(u32 id);
 extern void fib_select_default(const struct flowi *flp, struct fib_result 
*res);
 
 #endif /* CONFIG_IP_MULTIPLE_TABLES */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2696ede..ad4c14f 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -37,6 +37,7 @@ #include <linux/if_arp.h>
 #include <linux/skbuff.h>
 #include <linux/netlink.h>
 #include <linux/init.h>
+#include <linux/list.h>
 
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -51,48 +52,67 @@ #define FFprint(a...) printk(KERN_DEBUG 
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
-#define RT_TABLE_MIN RT_TABLE_MAIN
-
 struct fib_table *ip_fib_local_table;
 struct fib_table *ip_fib_main_table;
 
-#else
+#define FIB_TABLE_HASHSZ 1
+static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
 
-#define RT_TABLE_MIN 1
+#else
 
-struct fib_table *fib_tables[RT_TABLE_MAX+1];
+#define FIB_TABLE_HASHSZ 256
+static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
 
-struct fib_table *__fib_new_table(u32 id)
+struct fib_table *fib_new_table(u32 id)
 {
struct fib_table *tb;
+   unsigned int h;
 
+   if (id == 0)
+   id = RT_TABLE_MAIN;
+   tb = fib_get_table(id);
+   if (tb)
+   return tb;
tb = fib_hash_init(id);
if (!tb)
return NULL;
-   fib_tables[id] = tb;
+   h = id & (FIB_TABLE_HASHSZ - 1);
+   hlist_add_head_rcu(&tb->tb_hlist, &fib_table_hash[h]);
return tb;
 }
 
+struct fib_table *fib_get_table(u32 id)
+{
+   struct fib_table *tb;
+   struct hlist_node *node;
+   unsigned int h;
 
+   if (id == 0)
+   id = RT_TABLE_MAIN;
+   h = id & (FIB_TABLE_HASHSZ - 1);
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(tb, node, &fib_table_hash[h], tb_hlist) {
+   if (tb->tb_id == id) {
+   rcu_read_unlock();
+   return tb;
+   }
+   }
+   rcu_read_unlock();
+   return NULL;
+}
 #endif /* CONFIG_IP_MULTIPLE_TABLES */
 
-
 static void fib_flush(void)
 {
int flushed = 0;
-#ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_table *tb;
-   u32 id;
+   struct hlist_node *node;
+   unsigned int h;
 
-   for (id = RT_TABLE_MAX; id>0; id--) {
-   if ((tb = fib_get_table(id))==NULL)
-   continue;
-   flushed += tb->tb_flush(tb);
+   for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
+   hlist_for_each_entry(tb, node, &fib_table_hash[h], tb_hlist)
+   flushed += tb->tb_flush(tb);
}
-#else /* CONFIG_IP_MULTIPLE_TABLES */
-   flushed +=

[DECNET 05/06]: Increase number of possible routing tables to 2^32

[DECNET]: Increase number of possible routing tables to 2^32

Increase the number of possible routing tables to 2^32 by replacing the
fixed sized array of pointers by a hash table and replacing iterations
over all possible table IDs by hash table walking.
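
To make the "hash table walking" part concrete, here is a small userspace
sketch under stated assumptions: struct dn_table, dn_link() and dn_flush_all()
are hypothetical simplifications, and the entries counter merely stands in for
whatever a real flush would drop. The point is that the walk costs one pass
over the buckets plus the tables that actually exist, instead of one lookup
per possible ID (which stops being feasible at 2^32 IDs).

```c
#include <stddef.h>
#include <stdint.h>

#define DN_HASHSZ 256   /* same size as DN_FIB_TABLE_HASHSZ in the patch */

/* Hypothetical, simplified stand-in for struct dn_fib_table. */
struct dn_table {
	struct dn_table *next;
	uint32_t id;
	int entries;        /* assumption: number of routes a flush would drop */
};

static struct dn_table *dn_hash[DN_HASHSZ];

/* Link a table into its bucket, chosen by masking the ID. */
static void dn_link(struct dn_table *tb)
{
	unsigned int h = tb->id & (DN_HASHSZ - 1);

	tb->next = dn_hash[h];
	dn_hash[h] = tb;
}

/* Flush every existing table by walking the buckets, not the ID range. */
static int dn_flush_all(void)
{
	struct dn_table *tb;
	unsigned int h;
	int flushed = 0;

	for (h = 0; h < DN_HASHSZ; h++) {
		for (tb = dn_hash[h]; tb != NULL; tb = tb->next) {
			flushed += tb->entries;
			tb->entries = 0;
		}
	}
	return flushed;
}
```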

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 9203e4cdab89d96c474c6a903ef9a1f47c7eee07
tree e0c6a2c5e3a691919863b4eb871fc3a25ebd5d44
parent cad398a8f3ef363abba9e6450dded94a022c96fa
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:19 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:19 +0200

 include/net/dn_fib.h  |3 -
 net/decnet/dn_fib.c   |   49 ---
 net/decnet/dn_rules.c |2 -
 net/decnet/dn_table.c |  125 -
 4 files changed, 93 insertions(+), 86 deletions(-)

diff --git a/include/net/dn_fib.h b/include/net/dn_fib.h
index cd9c378..d97aa10 100644
--- a/include/net/dn_fib.h
+++ b/include/net/dn_fib.h
@@ -94,6 +94,7 @@ #define DN_FIB_INFO(f) ((f)->fn_info)
 
 
 struct dn_fib_table {
+   struct hlist_node hlist;
u32 n;
 
int (*insert)(struct dn_fib_table *t, struct rtmsg *r, 
@@ -177,8 +178,6 @@ static inline void dn_fib_res_put(struct
fib_rule_put(res->r);
 }
 
-extern struct dn_fib_table *dn_fib_tables[];
-
 #else /* Endnode */
 
 #define dn_fib_init()  do { } while(0)
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index fb59637..5ccca3e 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -532,39 +532,6 @@ int dn_fib_rtm_newroute(struct sk_buff *
return -ENOBUFS;
 }
 
-
-int dn_fib_dump(struct sk_buff *skb, struct netlink_callback *cb)
-{
-   u32 t;
-   u32 s_t;
-   struct dn_fib_table *tb;
-
-   if (NLMSG_PAYLOAD(cb->nlh, 0) >= sizeof(struct rtmsg) &&
-   ((struct rtmsg *)NLMSG_DATA(cb->nlh))->rtm_flags&RTM_F_CLONED)
-   return dn_cache_dump(skb, cb);
-
-   s_t = cb->args[0];
-   if (s_t == 0)
-   s_t = cb->args[0] = RT_MIN_TABLE;
-
-   for(t = s_t; t <= RT_TABLE_MAX; t++) {
-   if (t < s_t)
-   continue;
-   if (t > s_t)
-   memset(&cb->args[1], 0,
-  sizeof(cb->args) - sizeof(cb->args[0]));
-   tb = dn_fib_get_table(t, 0);
-   if (tb == NULL)
-   continue;
-   if (tb->dump(tb, skb, cb) < 0)
-   break;
-   }
-
-   cb->args[0] = t;
-
-   return skb->len;
-}
-
 static void fib_magic(int cmd, int type, __le16 dst, int dst_len, struct 
dn_ifaddr *ifa)
 {
struct dn_fib_table *tb;
@@ -762,22 +729,6 @@ int dn_fib_sync_up(struct net_device *de
 return ret;
 }
 
-void dn_fib_flush(void)
-{
-int flushed = 0;
-struct dn_fib_table *tb;
-u32 id;
-
-for(id = RT_TABLE_MAX; id > 0; id--) {
-if ((tb = dn_fib_get_table(id, 0)) == NULL)
-continue;
-flushed += tb->flush(tb);
-}
-
-if (flushed)
-dn_rt_cache_flush(-1);
-}
-
 static struct notifier_block dn_fib_dnaddr_notifier = {
.notifier_call = dn_fib_dnaddr_event,
 };
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 096f127..878312f 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -210,7 +210,7 @@ unsigned dnet_addr_type(__le16 addr)
struct flowi fl = { .nl_u = { .dn_u = { .daddr = addr } } };
struct dn_fib_res res;
unsigned ret = RTN_UNICAST;
-   struct dn_fib_table *tb = dn_fib_tables[RT_TABLE_LOCAL];
+   struct dn_fib_table *tb = dn_fib_get_table(RT_TABLE_LOCAL, 0);
 
res.r = NULL;
 
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index d2ad791..5701a3f 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -75,9 +75,9 @@ #define DN_FIB_SCAN_KEY(f, fp, key) \
 for( ; ((f) = *(fp)) != NULL && dn_key_eq((f)->fn_key, (key)); (fp) = &(f)->fn_next)
 
 #define RT_TABLE_MIN 1
-
+#define DN_FIB_TABLE_HASHSZ 256
+static struct hlist_head dn_fib_table_hash[DN_FIB_TABLE_HASHSZ];
 static DEFINE_RWLOCK(dn_fib_tables_lock);
-struct dn_fib_table *dn_fib_tables[RT_TABLE_MAX + 1];
 
 static kmem_cache_t *dn_hash_kmem __read_mostly;
 static int dn_fib_hash_zombies;
@@ -357,7 +357,7 @@ static __inline__ int dn_hash_dump_bucke
 {
int i, s_i;
 
-   s_i = cb->args[3];
+   s_i = cb->args[4];
for(i = 0; f; i++, f = f->fn_next) {
if (i < s_i)
continue;
@@ -370,11 +370,11 @@ static __inline__ int dn_hash_dump_bucke
(f->fn_state & DN_S_ZOMBIE) ? 0 : f->fn_type,
f->fn_scope, &f->fn_key, dz->dz_order, 
f->fn_info, NLM_F_MULTI) < 0) {
-   cb->args[3] = i;
+   cb->args[4] = i;
return -1;

[NET 06/06]: Increase RT_TABLE_MAX to 2^32

[NET]: Increase RT_TABLE_MAX to 2^32

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit f20cbb83204cd7e2ffa9cf4e8ee8b6353628d5d3
tree 8f0eaa4219506715449e7118037040f396875c99
parent 9203e4cdab89d96c474c6a903ef9a1f47c7eee07
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:49 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:54:49 +0200

 include/linux/rtnetlink.h |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index b01bc8b..a616c68 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -239,10 +239,8 @@ enum rt_class_t
RT_TABLE_DEFAULT=253,
RT_TABLE_MAIN=254,
RT_TABLE_LOCAL=255,
-   __RT_TABLE_MAX
+   RT_TABLE_MAX=0x
 };
-#define RT_TABLE_MAX (__RT_TABLE_MAX - 1)
-
 
 
 /* Routing message attributes */

[IPV6 04/06]: Increase number of possible routing tables to 2^32

[IPV6]: Increase number of possible routing tables to 2^32

Increase number of possible routing tables to 2^32 by replacing iterations
over all possible table IDs by hash table walking.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit cad398a8f3ef363abba9e6450dded94a022c96fa
tree 4fea9c50650ab65d942dca9c2545d1810b227839
parent 148d1ca7c199005b5a92f8154a7caf3f78529672
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:53:33 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 20:53:33 +0200

 include/net/ip6_route.h |7 ++
 net/ipv6/ip6_fib.c  |  171 ++-
 net/ipv6/route.c|  128 ---
 3 files changed, 159 insertions(+), 147 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 9bfa3cc..01bfe40 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -137,6 +137,13 @@ extern int inet6_rtm_newroute(struct sk_
 extern int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void 
*arg);
 extern int inet6_rtm_getroute(struct sk_buff *skb, struct nlmsghdr* nlh, void 
*arg);
 
+struct rt6_rtnl_dump_arg
+{
+   struct sk_buff *skb;
+   struct netlink_callback *cb;
+};
+
+extern int rt6_dump_route(struct rt6_info *rt, void *p_arg);
 extern void rt6_ifdown(struct net_device *dev);
 extern void rt6_mtu_change(struct net_device *dev, unsigned mtu);
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 1f23161..bececbe 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -158,7 +158,26 @@ static struct fib6_table fib6_main_tbl =
 };
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
+#define FIB_TABLE_HASHSZ 256
+#else
+#define FIB_TABLE_HASHSZ 1
+#endif
+static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
+
+static void fib6_link_table(struct fib6_table *tb)
+{
+   unsigned int h;
+
+   h = tb->tb6_id & (FIB_TABLE_HASHSZ - 1);
 
+   /*
+* No protection necessary, this is the only list mutatation
+* operation, tables never disappear once they exist.
+*/
+   hlist_add_head_rcu(&tb->tb6_hlist, &fib_table_hash[h]);
+}
+
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
 static struct fib6_table fib6_local_tbl = {
.tb6_id = RT6_TABLE_LOCAL,
.tb6_lock   = RW_LOCK_UNLOCKED,
@@ -168,9 +187,6 @@ static struct fib6_table fib6_local_tbl 
},
 };
 
-#define FIB_TABLE_HASHSZ 256
-static struct hlist_head fib_table_hash[FIB_TABLE_HASHSZ];
-
 static struct fib6_table *fib6_alloc_table(u32 id)
 {
struct fib6_table *table;
@@ -186,19 +202,6 @@ static struct fib6_table *fib6_alloc_tab
return table;
 }
 
-static void fib6_link_table(struct fib6_table *tb)
-{
-   unsigned int h;
-
-   h = tb->tb6_id & (FIB_TABLE_HASHSZ - 1);
-
-   /*
-* No protection necessary, this is the only list mutatation
-* operation, tables never disappear once they exist.
-*/
-   hlist_add_head_rcu(&tb->tb6_hlist, &fib_table_hash[h]);
-}
-
 struct fib6_table *fib6_new_table(u32 id)
 {
struct fib6_table *tb;
@@ -263,10 +266,135 @@ struct dst_entry *fib6_rule_lookup(struc
 
 static void __init fib6_tables_init(void)
 {
+   fib6_link_table(&fib6_main_tbl);
 }
 
 #endif
 
+static int fib6_dump_node(struct fib6_walker_t *w)
+{
+   int res;
+   struct rt6_info *rt;
+
+   for (rt = w->leaf; rt; rt = rt->u.next) {
+   res = rt6_dump_route(rt, w->args);
+   if (res < 0) {
+   /* Frame is full, suspend walking */
+   w->leaf = rt;
+   return 1;
+   }
+   BUG_TRAP(res!=0);
+   }
+   w->leaf = NULL;
+   return 0;
+}
+
+static void fib6_dump_end(struct netlink_callback *cb)
+{
+   struct fib6_walker_t *w = (void*)cb->args[2];
+
+   if (w) {
+   cb->args[2] = 0;
+   kfree(w);
+   }
+   cb->done = (void*)cb->args[3];
+   cb->args[1] = 3;
+}
+
+static int fib6_dump_done(struct netlink_callback *cb)
+{
+   fib6_dump_end(cb);
+   return cb->done ? cb->done(cb) : 0;
+}
+
+static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb,
+  struct netlink_callback *cb)
+{
+   struct fib6_walker_t *w;
+   int res;
+
+   w = (void *)cb->args[2];
+   w->root = &table->tb6_root;
+
+   if (cb->args[4] == 0) {
+   read_lock_bh(&table->tb6_lock);
+   res = fib6_walk(w);
+   read_unlock_bh(&table->tb6_lock);
+   if (res > 0)
+   cb->args[4] = 1;
+   } else {
+   read_lock_bh(&table->tb6_lock);
+   res = fib6_walk_continue(w);
+   read_unlock_bh(&table->tb6_lock);
+   if (res != 0) {
+   if (res < 0)
+   fib6_walker_unlink(w);
+   goto end;
+   }
+   fib6_walker_unlink(w);
+

Re: [RFC][PATCH] VM deadlock prevention core -v3

On Thu, Aug 10, 2006 at 04:46:31PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) 
wrote:
 On Thu, 2006-08-10 at 18:02 +0400, Evgeniy Polyakov wrote:
  On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra ([EMAIL 
  PROTECTED]) wrote:
   Hi,
  
  Hello, Peter.
  
   So I try again, please tell me if I'm still on crack and should go detox.
   However if you do so, I kindly request some words on the how and why of 
   it.
  
  I think you should talk with doctor in that case, but not with kernel
  hackers :)
  
  I have some comments about implementation, not overall design, since we
  have slightly diametral points of view there.
  
  
  
   --- linux-2.6.orig/net/core/skbuff.c
   +++ linux-2.6/net/core/skbuff.c
   @@ -43,6 +43,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mm.h>
   +#include <linux/pagemap.h>
#include <linux/interrupt.h>
#include <linux/in.h>
#include <linux/inet.h>
   @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
 *
 */

   +#define ceiling_log2(x)  fls((x) - 1)
   +
/**
 *   __alloc_skb -   allocate a network buffer
 *   @size: size to allocate
   @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int
 struct sk_buff *skb;
 u8 *data;

   + size = SKB_DATA_ALIGN(size);
 
 I moved it here.

Yep.

   +
   + if (gfp_mask & __GFP_MEMALLOC) {
   + /*
   +  * Fallback allocation for memalloc reserves.
   +
 
* This allocator is build on alloc_pages() so that freed
* skbuffs return to the memalloc reserve imediately. SLAB
* memory might not ever be returned.
 
 This was missing,... 
 
   +  * the page is populated like so:
   +  *
   +  *   struct sk_buff
   +  *   [ struct sk_buff ]
   +  *   [ atomic_t ]
   +  *   unsigned int
   +  *   struct skb_shared_info
   +  *   char []
   +  *
   +  * We have to do higher order allocations for icky jumbo
   +  * frame drivers :-(. They really should be migrated to
   +  * scather/gather DMA and use skb fragments.
   +  */
   + unsigned int data_offset =
   + sizeof(struct sk_buff) + sizeof(unsigned int);
   + unsigned long length = size + data_offset +
   + sizeof(struct skb_shared_info);
   + unsigned int pages;
   + unsigned int order;
   + struct page *page;
   + void *kaddr;
   +
   + /*
   +  * Force fclone alloc in order to fudge a lacking in 
   skb_clone().
   +  */
   + fclone = 1;
   + if (fclone) {
   + data_offset += sizeof(struct sk_buff) + 
   sizeof(atomic_t);
   + length += sizeof(struct sk_buff) + sizeof(atomic_t);
   + }
   + pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
   + order = ceiling_log2(pages);
   + skb = NULL;
   + if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
   + goto out;
   +
   + kaddr = pfn_to_kaddr(page_to_pfn(page));
   + skb = (struct sk_buff *)kaddr;
   +
   + *((unsigned int *)(kaddr + data_offset -
   + sizeof(unsigned int))) = order;
   + data = (u8 *)(kaddr + data_offset);
   +
  
  Tricky, but since you are using own allocator here, you could change it to
  be not so aggressive - i.e. do not round size to number of pages.
 
 I'm not sure I follow you. I'm explicitly using alloc_pages()/free_page();
 if I were to go smart here, I'd lose the whole reason for doing so.

You can use page to put there several skbs for example or at least add
there a fclone (fast clone).
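
For reference, the rounding under discussion can be sketched in userspace as
follows. This is only an illustration of the arithmetic in the quoted patch:
a 4 KiB page size is assumed, sketch_fls() is a portable stand-in for the
kernel's fls(), and pages_for() and bytes_allocated() are names made up here.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12                        /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static unsigned int sketch_fls(unsigned long x)
{
	unsigned int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Smallest order with (1 << order) >= pages, for pages >= 1. */
static unsigned int ceiling_log2(unsigned long pages)
{
	return sketch_fls(pages - 1);
}

/* Pages needed to hold length bytes, rounded up to whole pages. */
static unsigned long pages_for(unsigned long length)
{
	return (length + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* Bytes actually reserved by an order-based page allocation. */
static unsigned long bytes_allocated(unsigned long length)
{
	return SKETCH_PAGE_SIZE << ceiling_log2(pages_for(length));
}
```

Under these assumptions a 9000 byte request needs 3 pages but is rounded to
order 2, that is 16384 bytes, which is the slack that could hold a clone or
further skbs.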

  
   + goto allocated;
   + }
   +
 cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

 /* Get the HEAD */
   @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int
 goto out;

 /* Get the DATA. Size must match skb_add_mtu(). */
   - size = SKB_DATA_ALIGN(size);
  
  Bad sign.
 
 See above.

Yep, I've found.

 data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 if (!data)
 goto nodata;

   +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
   + unsigned length, gfp_t gfp_mask)
   +{
   + struct sk_buff *skb;
   +
   + WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
   + gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
   +
   + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
   + if (skb)
   + goto done;
   +
   + if (atomic_read(&dev->rx_reserve_used) >=
   + dev->rx_reserve * dev->memalloc_socks)
   + goto out;
   +
   + /*
   +  * pre-inc guards against a race with netdev_wait_memalloc()
   +  */
   + atomic_inc(&dev->rx_reserve_used);
   + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
   + if

Re: [RFC][PATCH] VM deadlock prevention core -v3


   Tricky, but since you are using own allocator here, you could change it to
   be not so aggressive - i.e. do not round size to number of pages.
  
  I'm not sure I follow you. I'm explicitly using alloc_pages()/free_page();
  if I were to go smart here, I'd lose the whole reason for doing so.
 
 You can use page to put there several skbs for example or at least add
 there a fclone (fast clone).

fclone support is there.

+struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+   unsigned length, gfp_t gfp_mask)
+{
+   struct sk_buff *skb;
+
+   WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
+   gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
+
+   skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
+   if (skb)
+   goto done;
+
+   if (atomic_read(&dev->rx_reserve_used) >=
+   dev->rx_reserve * dev->memalloc_socks)
+   goto out;
+
+   /*
+* pre-inc guards against a race with netdev_wait_memalloc()
+*/
+   atomic_inc(&dev->rx_reserve_used);
+   skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
+   if (unlikely(!skb)) {
+   atomic_dec(&dev->rx_reserve_used);
+   goto out;
+   }
   
   Since you have added an atomic operation in that path, you can use the
   device's reference counter instead and not care that it can disappear.
  
  Is that the sole reason taking a reference on the device is bad?
 
 Taking a reference is bad for performance reasons, since an atomic
 increment is not that cheap. If you do it for one variable for the
 purpose of reference counting, you can use the device's refcnt instead,
 which will solve some races.

Yes, I understand you. However I'm not sure if performance is the only
reason not to take a refcount on the device. Anyway, I think I have just
been convinced to abandon the per device thing and go global.
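
A rough userspace sketch of what a single global reserve counter could look
like, using C11 atomics. This is not the patch's code: reserve_used,
reserve_limit, reserve_charge() and reserve_uncharge() are hypothetical, the
limit is tiny on purpose, and the sketch increments first and backs out when
over the limit rather than reading the counter and then incrementing it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical global accounting for buffers taken from the reserve. */
static atomic_int reserve_used;
static const int reserve_limit = 2;   /* assumption: tiny limit for illustration */

/*
 * Try to charge one buffer to the reserve. The counter is bumped before
 * the limit check so that two concurrent callers cannot both slip under
 * the limit; the loser undoes its increment.
 */
static bool reserve_charge(void)
{
	if (atomic_fetch_add(&reserve_used, 1) >= reserve_limit) {
		atomic_fetch_sub(&reserve_used, 1);
		return false;
	}
	return true;
}

/* Give one buffer back, e.g. when the skb is freed. */
static void reserve_uncharge(void)
{
	atomic_fetch_sub(&reserve_used, 1);
}
```

With a global counter there is no device to look up when a buffer is freed,
so the question of holding a device reference does not arise in this sketch.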

@@ -434,6 +567,12 @@ struct sk_buff *skb_clone(struct sk_buff
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+   /*
+* should we special-case skb->memalloc cloning?
+* for now fudge it by forcing fast-clone alloc.
+*/
+   BUG_ON(skb->memalloc);
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
   
   Ugh... cloning is one of the giant's shoulders that the Linux network
   stack stands on...
  
  Yes, I'm aware of that, I have a plan to fix this, however I haven't had
  time
  to implement it. My immediate concern is the point wrt. the net_device
  mapping.
  
  My idea was: instead of the order, store the size, and allocate clone 
  skbuffs in the available room at the end of the page(s), allocating
  extra pages
  if needed.
 
 You can check if requested skb with fclone fits allocated pages, and if
 so use fclone magic, otherwise postpone clone allocation until it is
 required.

Yes the fclone magic works, however that will only let you have one
clone.
I'm just not confident no receive path will ever exceed that.

 Sockets can live without network devices at all, I expect it is enough
 to clean up in socket destructor, since packets can come from
 different devices into the same socket.

You are right if the reserve wasn't device bound - which I will abandon 
because you are right that with multi-path routing, bridge device and 
other advanced goodies this scheme is broken in that there is no
unambiguous
mapping from sockets to devices.


Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....

2006-08-10 Thread Valdis . Kletnieks

On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity
quite badly.  It basically totally refuses to answer an IPv6 Neighbor Solicit
packet or IPv6 Echo Request packet.  I run a 'tcpdump -n ipv6', and I see the
requests come in, and no packets leaving.  Interestingly enough, if I try to
ping6 *out* of the box, it's totally willing to send a Neighbor Solicit outbound
(although it appears to totally ignore the Neighbor Advert packet that comes
back). Of course, things don't work very well at all with busticated Neighbor
Solicit.

A kernel built with IPV6_MULTIPLE_TABLES=n works just fine.

The relevant ifconfig (eth3 is a 100mbit port, eth5 is a wireless card):

eth3  Link encap:Ethernet  HWaddr 00:06:5B:EA:8E:4E  
  inet addr:128.173.14.107  Bcast:128.173.15.255  Mask:255.255.252.0
  inet6 addr: 2001:468:c80:2103:206:5bff:feea:8e4e/64 Scope:Global
  inet6 addr: fe80::206:5bff:feea:8e4e/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:15529 errors:0 dropped:0 overruns:1 frame:0
  TX packets:2073 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:2333290 (2.2 MiB)  TX bytes:228862 (223.4 KiB)
  Interrupt:11 Base address:0x6800 

eth5  Link encap:Ethernet  HWaddr 00:02:2D:5C:11:48  
  inet addr:198.82.168.129  Bcast:198.82.168.255  Mask:255.255.255.0
  inet6 addr: 2001:468:c80:2181:202:2dff:fe5c:1148/64 Scope:Global
  inet6 addr: fe80::202:2dff:fe5c:1148/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:2096 errors:0 dropped:0 overruns:0 frame:0
  TX packets:144 errors:1 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:280919 (274.3 KiB)  TX bytes:22184 (21.6 KiB)
  Interrupt:11 Base address:0xe100 

lo        Link encap:Local Loopback  
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:1583 errors:0 dropped:0 overruns:0 frame:0
  TX packets:1583 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0 
  RX bytes:642598 (627.5 KiB)  TX bytes:642598 (627.5 KiB)

A working routing table:

netstat -r -n -A inet6
Kernel IPv6 routing table
Destination                               Next Hop                   Flags Metric Ref Use Iface
::1/128                                   ::                         U 0  12   1 lo
2001:468:c80:2103:206:5bff:feea:8e4e/128  ::                         U 0  41 lo
2001:468:c80:2103::/64                    ::                         UA256113   0 eth3
2001:468:c80:2181:202:2dff:fe5c:1148/128  ::                         U 0  01 lo
2001:468:c80:2181::/64                    ::                         UA25611   0 eth5
fe80::202:2dff:fe5c:1148/128              ::                         U 0  01 lo
fe80::206:5bff:feea:8e4e/128              ::                         U 0  21 lo
fe80::/64                                 ::                         U 25600 eth3
fe80::/64                                 ::                         U 25600 eth5
ff02::1/128                               ff02::1                    UC0  113   0 eth3
ff02::1/128                               ff02::1                    UC0  10 eth5
ff00::/8                                  ::                         U 25600 eth3
ff00::/8                                  ::                         U 25600 eth5
::/0                                      fe80::20f:35ff:fe3e:d41a   UGDA  1024   10 eth3
::/0                                      fe80::20f:35ff:fe3e:d41a   UGDA  1024   10 eth5





Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-10 Thread Alexey Kuznetsov

Hello!

 What's wrong with listening to the notification for that purpose?

Nothing! NLM_F_ECHO _is_ listening for notifications, just without
subscribing to multicast groups and without needing to figure out which
messages are yours. Beyond that, NLM_F_ECHO is entirely a subset of it.
Which still makes much more sense than echoing a known thing, doesn't it?

Alexey



Re: [IPROUTE]: Support IPv6 routing table filter

On Thu, 10 Aug 2006 22:42:58 +0200
Patrick McHardy [EMAIL PROTECTED] wrote:

 Support IPv6 routing table filter in presence of multiple tables,
 f.e. ip -6 route list table 123. Compatibility is preserved for
 kernels not supporting multiple IPv6 tables.
 
 
applied thanks


Re: skb_shared_info()

On Tue, Aug 08, 2006 at 04:39:15PM -0700, David Miller ([EMAIL PROTECTED]) 
wrote:
 
 I'm beginning to think that where we store the
 skb_shared_info() is a weakness of the SKB design.

Food for thought: unix sockets can use PAGE_SIZEd chunks of memory
(and they do it almost always), which are aligned to 2*PAGE_SIZE due to
alignment issues with skb_shared_info, so unix sockets waste 3.5 kb of
memory on each skb. I think it is time to resurrect idea of placing
shared_info inside skb and allow to allocate it from own cache for
special cases, what do you think?
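
The arithmetic behind that figure can be sketched as below. This is a
back-of-the-envelope illustration only: it assumes the allocator rounds
requests up to the next power of two, and the 512 byte size used for
skb_shared_info in the note after the code is a placeholder chosen to match
the quoted 3.5 kb, not a measured value. Both helper names are made up here.

```c
#include <stddef.h>

/*
 * Round n up to the next power of two, mimicking a size-class allocator
 * that only hands out power-of-two buckets. Assumes n >= 1.
 */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Bytes lost when a payload and a trailing shared-info block must share
 * one power-of-two allocation.
 */
static size_t wasted_bytes(size_t payload, size_t shinfo)
{
	return roundup_pow2(payload + shinfo) - (payload + shinfo);
}
```

For a 4096 byte payload plus a hypothetical 512 byte shared info block, the
request becomes 4608 bytes, is served from an 8192 byte bucket, and leaves
3584 bytes (3.5 KiB) unused. With the shared info stored elsewhere the same
payload would fit a 4096 byte bucket exactly.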

-- 
Evgeniy Polyakov

[IPROUTE]: Support IPv6 routing table filter

Support IPv6 routing table filter in presence of multiple tables,
f.e. ip -6 route list table 123. Compatibility is preserved for
kernels not supporting multiple IPv6 tables.


[IPROUTE]: Support IPv6 routing table filter

The current behaviour for IPv6 routing table filters is to derive the
table from the route type. This doesn't really work anymore now that IPv6
supports multiple tables. Add detection for IPv6 multiple table support
(relying on the fact that the first routes dumped belong to the local table
and have rtm_table == RT_TABLE_LOCAL with multiple tables) and handle it
like other protocols.
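
The detection boils down to a sticky flag, which the following standalone
sketch tries to show. It mirrors the condition used in the diff below (any
IPv6 route whose table is not RT_TABLE_MAIN flips the flag), but the function
name, its parameters and the SK_ prefixed constants are hypothetical and only
exist for this illustration.

```c
#include <stdbool.h>

#define SK_RT_TABLE_MAIN  254   /* values of RT_TABLE_MAIN / RT_TABLE_LOCAL */
#define SK_RT_TABLE_LOCAL 255

/* Sticky: once multiple-table support is seen, it stays detected. */
static bool ip6_multiple_tables;

/*
 * Returns true when the legacy, route-type based IPv6 table filter should
 * still be applied to this route, false when the generic rtm_table based
 * filtering used for other protocols applies.
 */
static bool use_legacy_ip6_filter(bool is_inet6, unsigned int rtm_table)
{
	if (is_inet6 && rtm_table != SK_RT_TABLE_MAIN)
		ip6_multiple_tables = true;
	return is_inet6 && !ip6_multiple_tables;
}
```

Because the local table is dumped first on kernels with multiple IPv6 tables,
the flag would already be set by the time main table routes are printed.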

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 14d210c56edd67973439acd67d916de84a6e0384
tree 5678d9dba5c1b8a0b25133a89bce5d4e473a1160
parent e81c1a22cd2408a8b490ce39bf6ece2d19919a3b
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 22:39:21 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 22:39:21 +0200

 ip/iproute.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 8f4a55d..1645f0b 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -138,6 +138,7 @@ int print_route(const struct sockaddr_nl
inet_prefix prefsrc;
inet_prefix via;
int host_len = -1;
+   static int ip6_multiple_tables;
SPRINT_BUF(b1);

 
@@ -163,7 +164,10 @@ int print_route(const struct sockaddr_nl
else if (r->rtm_family == AF_IPX)
host_len = 80;
 
-   if (r->rtm_family == AF_INET6) {
+   if (r->rtm_family == AF_INET6 && r->rtm_table != RT_TABLE_MAIN)
+   ip6_multiple_tables = 1;
+
+   if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) {
if (filter.tb) {
if (filter.tb < 0) {
if (!(r->rtm_flags&RTM_F_CLONED))

Re: [PATCH 4/5] net: socket family using RCU


On Wed, Aug 09, 2006 at 11:31:42AM -0700, Stephen Hemminger wrote:
 Replace the gross custom locking done in socket code for net_family[]
 with simple RCU usage. Some reordering necessary to avoid sleep
 issues with sock_alloc.

Definitely a good use of RCU from a read-intensive standpoint -- does
anyone other than Linux-kernel networking developers change the elements
of the net_family[] array except at boot and shutdown?  ;-)

Some comments included below.  Looks good, but one question about
things like atalk_create() being able to sleep and a place or two
where a comment would be good.


...

  
 +/*
 + *  Allocate the socket and allow the family to set things up. if
 + *  the protocol is 0, the family is instructed to select an 
 appropriate
 + *  default.
 + */
 +sock = sock_alloc();
 +if (!sock) {
 +printk(KERN_WARNING "socket: no more sockets\n");
 +return -ENFILE; /* Not exactly a match, but its the
 +   closest posix thing */
 +}
 +
 +sock->type = type;
 +
  #if defined(CONFIG_KMOD)
  /* Attempt to load a protocol module if the find failed.
   *
 @@ -1166,70 +1138,59 @@
   * requested real, full-featured networking support upon configuration.
   * Otherwise module support will break!
   */
 -if (net_families[family] == NULL) {
 +if (net_families[family] == NULL)
  request_module("net-pf-%d", family);

OK, I'll bite...

What happens if the module is not present?  Or is this what the
"Otherwise module support will break!" comment is getting at?

request_module loads the module (and blocks). One would
expect that the module loaded would set net_families[] via
sock_register, so later reference would succeed. Comment is
historical since intention was to make base socket code itself modular
which never was done, and is probably a bad idea to even consider.

If module is not present, then net_families[] will still be NULL.

Also, this reference to net_families[family] is done without
rcu_dereference() and without any clear update-side lock.  This
just happens to be OK, since we are only testing for NULL, but
should at least have a comment.

 -}
  #endif
  
 -net_family_read_lock();
 -if (net_families[family] == NULL) {
 -err = -EAFNOSUPPORT;
 -goto out;
 -}
 -
 -/*
 - *  Allocate the socket and allow the family to set things up. if
 - *  the protocol is 0, the family is instructed to select an appropriate
 - *  default.
 - */
 -
 -if (!(sock = sock_alloc())) {
 -printk(KERN_WARNING "socket: no more sockets\n");
 -err = -ENFILE;  /* Not exactly a match, but its the
 -   closest posix thing */
 -goto out;
 -}
 -
 -sock->type = type;
 +rcu_read_lock();
 +pf = rcu_dereference(net_families[family]);

OK, so the elements of the net_families array are protected by RCU.
All references should either be under rcu_read_lock() and accessed
via rcu_dereference() or under the update-side lock, whatever that
might be.


Yes, the net_family_lock

  
 -/*
 +/**
 + *  sock_unregister - remove a protocol handler
 + *  @family: protocol family to remove
 + *
   *  This function is called by a protocol handler that wants to
   *  remove its address family, and have it unlinked from the
 - *  SOCKET module.
 + *  new socket creation.
 + *
 + *  If protocol handler is a module, then it can use module reference
 + *  counts to protect against new references. If protocol handler is not
 + *  a module then it needs to provide its own protection in
 + *  the ops->create routine.
   */
 -
  int sock_unregister(int family)
  {
  if (family < 0 || family >= NPROTO)
 -return -1;
 +return -EINVAL;
  
 -net_family_write_lock();
 +spin_lock(&net_family_lock);
  net_families[family] = NULL;

And this one is covered by net_family_lock, so we are set, since this
is the last one.

 -net_family_write_unlock();
 +spin_unlock(&net_family_lock);
 +
 +synchronize_rcu();

OK, and the caller is presumably going to free up whatever needs to be
freed.

Or, if nothing need be freed, beyond this point, we know that all
non-sleeping code paths through any of the net_protocol_family
functions have completed.

(So, are all of the functions non-sleeping, or do we care?  The
definition of net_protocol_family in include/linux/net.h doesn't say
that they need to be non-sleeping...)

atalk_create() can potentially sleep in the following line of code:

   sk = sk_alloc(PF_APPLETALK, GFP_KERNEL, &ddp_proto, 1);

The module reference counts are used to prevent that, since the
appletalk module can't be unloaded until there are no more appletalk
sockets open (i.e. the ref count of the appletalk module is zero). To
prevent new references there is a call to try_module_get() before the
net_families[family]->create() call. This happens inside
rcu_read_lock.

What prevents

Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

Hello

* Alexey Kuznetsov [EMAIL PROTECTED] 2006-08-11 00:32
 Nothing! NLM_F_ECHO _is_ listening for notifications without subscription
 to multicast groups and need to figure out what messages are yours.
 But beyond this NLM_F_ECHO is totally subset of this.
 Which still makes much more sense then echoing of a know thing, does not it?

I get your point and I see the value. Unfortunately, probably due to
lack of documentation, this feature isn't used by any applications I
know of. We even put in the hacks to make identification of own caused
notifications easier by storing the netlink pid of the originator in
the notification message.

I will put this back in (document it! :) and hide it behind
nlmsg_notify() so we do it for all notifications for consistency.

I use echoing of the original request for debugging purposes; it allows
me to verify what is actually being parsed by the netlink family specific
parsing function. In libnl a flag enables NLM_F_ECHO in all messages, so
it becomes simple to verify what exactly is seen by the kernel side
parser by looking at the message log. I agree, there is no functional
value besides the possibility to implement a netlink ping with NLMSG_NOOP.

[IPROUTE 01/03]: Preparation for 32 bit table IDs

[IPROUTE]: Preparation for 32 bit table IDs

The route table filter uses an integer for the table number and the value
-1 to represent cloned routes. For 32 bit table IDs it needs to become an
unsigned, so this won't work anymore. Introduce a new filter flag "cloned"
and use it instead of filter.tb = -1.
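
To make the motivation concrete, here is a small sketch of argument handling
with an unsigned table ID and a separate flag. The struct and function are
hypothetical and much simpler than the real iproute2 parser (which also
resolves table names); the point is only that every 32 bit value, including
0xFFFFFFFF, stays usable as a table ID once "cache" no longer maps to -1.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical filter: table ID and "cloned" selection kept apart. */
struct table_filter {
    uint32_t tb;      /* 0 means "all tables" */
    int cloned;       /* nonzero: only cloned (cached) routes */
};

/* Returns 0 on success, -1 if the argument is not understood. */
static int parse_table_arg(struct table_filter *f, const char *arg)
{
    char *end;
    unsigned long id;

    if (strcmp(arg, "all") == 0) {
        f->tb = 0;
        return 0;
    }
    if (strcmp(arg, "cache") == 0) {
        f->cloned = 1;        /* no sentinel value stored in f->tb */
        return 0;
    }
    id = strtoul(arg, &end, 0);
    if (end == arg || *end != '\0' || id > UINT32_MAX)
        return -1;
    f->tb = (uint32_t)id;
    return 0;
}
```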

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 00d896184c5f8737269ac05264446c58133ec414
tree 3eb3760b7b5b8b5811cadeaaec1b949533fb5ffd
parent 14d210c56edd67973439acd67d916de84a6e0384
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:19:31 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:19:31 +0200

 ip/iproute.c |   42 +-
 1 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 1645f0b..cb674d7 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -89,6 +89,7 @@ static void usage(void)
 static struct
 {
int tb;
+   int cloned;
int flushed;
char *flushb;
int flushp;
@@ -168,22 +169,21 @@ int print_route(const struct sockaddr_nl
ip6_multiple_tables = 1;
 
if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) {
+   if (filter.cloned) {
+   if (!(r->rtm_flags&RTM_F_CLONED))
+   return 0;
+   }
if (filter.tb) {
-   if (filter.tb < 0) {
-   if (!(r->rtm_flags&RTM_F_CLONED))
-   return 0;
-   } else {
-   if (r->rtm_flags&RTM_F_CLONED)
+   if (r->rtm_flags&RTM_F_CLONED)
+   return 0;
+   if (filter.tb == RT_TABLE_LOCAL) {
+   if (r->rtm_type != RTN_LOCAL)
return 0;
-   if (filter.tb == RT_TABLE_LOCAL) {
-   if (r->rtm_type != RTN_LOCAL)
-   return 0;
-   } else if (filter.tb == RT_TABLE_MAIN) {
-   if (r->rtm_type == RTN_LOCAL)
-   return 0;
-   } else {
+   } else if (filter.tb == RT_TABLE_MAIN) {
+   if (r->rtm_type == RTN_LOCAL)
return 0;
-   }
+   } else {
+   return 0;
}
}
} else {
@@ -1045,19 +1045,19 @@ static int iproute_list_or_flush(int arg
NEXT_ARG();
if (rtnl_rttable_a2n(&tid, *argv)) {
if (strcmp(*argv, "all") == 0) {
-   tid = 0;
+   filter.tb = 0;
} else if (strcmp(*argv, "cache") == 0) {
-   tid = -1;
+   filter.cloned = 1;
} else if (strcmp(*argv, "help") == 0) {
usage();
} else {
invarg("table id value is invalid\n", *argv);
}
-   }
-   filter.tb = tid;
+   } else
+   filter.tb = tid;
} else if (matches(*argv, "cached") == 0 ||
   matches(*argv, "cloned") == 0) {
-   filter.tb = -1;
+   filter.cloned = 1;
} else if (strcmp(*argv, "tos") == 0 ||
   matches(*argv, "dsfield") == 0) {
__u32 tos;
@@ -1189,7 +1189,7 @@ static int iproute_list_or_flush(int arg
char flushb[4096-512];
time_t start = time(0);
 
-   if (filter.tb == -1) {
+   if (filter.cloned) {
if (do_ipv6 != AF_INET6) {
iproute_flush_cache();
if (show_stats)
@@ -1215,7 +1215,7 @@ static int iproute_list_or_flush(int arg
}
if (filter.flushed == 0) {
if (round == 0) {
-   if (filter.tb != -1 || do_ipv6 == AF_INET6)
+   if (!filter.cloned || do_ipv6 == AF_INET6)
fprintf(stderr, "Nothing to flush.\n");
} else if (show_stats)
printf("*** Flush is complete after %d round%s ***\n", round, round>1?"s":"");
@@ -1239,7 +1239,7 @@ static int

[IPROUTE 03/03]: Add support for larger number of routing tables

[IPROUTE]: Add support for larger number of routing tables

Add support for 2^32 routing tables by using the new RTA_TABLE
attribute for specifying tables > 255 and interpreting it if it is
sent by the kernel.

When tables > 255 are used on a kernel not supporting it an error will
occur because of the unknown netlink attribute.
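
The fallback this relies on can be sketched in isolation as follows. The
types are simplified stand-ins invented for the example (an 8 bit header
field and an optional 32 bit attribute), not the real struct rtmsg or rtattr
handling; the rtm_get_table() helper in the patch below is the actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed route header. */
struct route_header {
    uint8_t table;              /* 8 bit field, like rtm_table */
};

/*
 * table_attr points at the payload of an RTA_TABLE-like attribute,
 * or is NULL when the kernel did not send one.
 */
static uint32_t route_table_id(const struct route_header *hdr,
                               const uint32_t *table_attr)
{
    /* Prefer the 32 bit attribute; fall back to the legacy field. */
    if (table_attr != NULL)
        return *table_attr;
    return hdr->table;
}
```

Older kernels never send the attribute, so the header field keeps working
unchanged for tables up to 255.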

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 7980d6ceea890359173344e71c1139b252fd9894
tree 19a33af25df28c002569e85b34a8c90ca517d875
parent ccd621fbb5faa91a98479e9492baee525c6f10c0
author Patrick McHardy [EMAIL PROTECTED] Fri, 11 Aug 2006 00:03:32 +0200
committer Patrick McHardy [EMAIL PROTECTED] Fri, 11 Aug 2006 00:03:32 +0200

 include/linux/rtnetlink.h |4 ++--
 include/rt_names.h|2 +-
 ip/ip_common.h|8 
 ip/iproute.c  |   21 ++---
 ip/iprule.c   |   14 +++---
 lib/rt_names.c|4 ++--
 6 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 5e33a20..d63578c 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -238,9 +238,8 @@ enum rt_class_t
RT_TABLE_DEFAULT=253,
RT_TABLE_MAIN=254,
RT_TABLE_LOCAL=255,
-   __RT_TABLE_MAX
+   RT_TABLE_MAX=0xFFFFFFFF,
 };
-#define RT_TABLE_MAX (__RT_TABLE_MAX - 1)
 
 
 
@@ -263,6 +262,7 @@ enum rtattr_type_t
RTA_CACHEINFO,
RTA_SESSION,
RTA_MP_ALGO,
+   RTA_TABLE,
__RTA_MAX
 };
 
diff --git a/include/rt_names.h b/include/rt_names.h
index 2d9ef10..07a10e0 100644
--- a/include/rt_names.h
+++ b/include/rt_names.h
@@ -5,7 +5,7 @@ #include <asm/types.h>
 
 char* rtnl_rtprot_n2a(int id, char *buf, int len);
 char* rtnl_rtscope_n2a(int id, char *buf, int len);
-char* rtnl_rttable_n2a(int id, char *buf, int len);
+char* rtnl_rttable_n2a(__u32 id, char *buf, int len);
 char* rtnl_rtrealm_n2a(int id, char *buf, int len);
 char* rtnl_dsfield_n2a(int id, char *buf, int len);
 int rtnl_rtprot_a2n(__u32 *id, char *arg);
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 1fe4a69..8b286b0 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -32,4 +32,12 @@ extern int do_multiaddr(int argc, char *
 extern int do_multiroute(int argc, char **argv);
 extern int do_xfrm(int argc, char **argv);
 
+static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
+{
+   __u32 table = r->rtm_table;
+   if (tb[RTA_TABLE])
+   table = *(__u32*) RTA_DATA(tb[RTA_TABLE]);
+   return table;
+}
+
 extern struct rtnl_handle rth;
diff --git a/ip/iproute.c b/ip/iproute.c
index cb674d7..24e7a86 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -140,6 +140,7 @@ int print_route(const struct sockaddr_nl
inet_prefix via;
int host_len = -1;
static int ip6_multiple_tables;
+   __u32 table;
SPRINT_BUF(b1);

 
@@ -165,7 +166,10 @@ int print_route(const struct sockaddr_nl
else if (r->rtm_family == AF_IPX)
host_len = 80;
 
-   if (r->rtm_family == AF_INET6 && r->rtm_table != RT_TABLE_MAIN)
+   parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
+   table = rtm_get_table(r, tb);
+
+   if (r->rtm_family == AF_INET6 && table != RT_TABLE_MAIN)
ip6_multiple_tables = 1;
 
if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) {
@@ -187,7 +191,7 @@ int print_route(const struct sockaddr_nl
}
}
} else {
-   if (filter.tb > 0 && filter.tb != r->rtm_table)
+   if (filter.tb > 0 && filter.tb != table)
return 0;
}
if ((filter.protocol^r->rtm_protocol)&filter.protocolmask)
@@ -217,8 +221,6 @@ int print_route(const struct sockaddr_nl
if (filter.rprefsrc.family && r->rtm_family != filter.rprefsrc.family)
return 0;
 
-   parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
-
memset(&dst, 0, sizeof(dst));
dst.family = r->rtm_family;
if (tb[RTA_DST])
@@ -371,8 +373,8 @@ int print_route(const struct sockaddr_nl
fprintf(fp, "dev %s ", ll_index_to_name(*(int*)RTA_DATA(tb[RTA_OIF])));
 
if (!(r->rtm_flags&RTM_F_CLONED)) {
-   if (r->rtm_table != RT_TABLE_MAIN && !filter.tb)
-   fprintf(fp, " table %s ", rtnl_rttable_n2a(r->rtm_table, b1, sizeof(b1)));
+   if (table != RT_TABLE_MAIN && !filter.tb)
+   fprintf(fp, " table %s ", rtnl_rttable_n2a(table, b1, sizeof(b1)));
if (r->rtm_protocol != RTPROT_BOOT && filter.protocolmask != -1)
fprintf(fp, " proto %s ", rtnl_rtprot_n2a(r->rtm_protocol, b1, sizeof(b1)));
if (r->rtm_scope != RT_SCOPE_UNIVERSE && filter.scopemask != -1)
@@ -875,7 +877,12 @@ #endif
NEXT_ARG();
if (rtnl_rttable_a2n(&tid, *argv))
invarg("\"table\" value is invalid\n", *argv);
-

[IPROUTE 02/03]: Use hash for routing table name cache

[IPROUTE]: Use hash for routing table name cache

Use a hash for routing table name cache instead of the fixed size array.
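
For readers who want the idea without the file parsing, here is a minimal
sketch of such a chained hash keyed by table ID. All names are illustrative
and not taken from lib/rt_names.c; like the patch, it picks the bucket by
masking the ID with (size - 1), which assumes a power-of-two bucket count.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NAME_HASH_SIZE 256   /* must be a power of two for the mask below */

struct name_entry {
    struct name_entry *next; /* next entry in the same bucket */
    char *name;
    uint32_t id;
};

static struct name_entry *name_hash[NAME_HASH_SIZE];

/* Insert at the head of the bucket. Returns 0, or -1 if out of memory. */
static int name_hash_add(uint32_t id, const char *name)
{
    struct name_entry *e = malloc(sizeof(*e));
    size_t len = strlen(name) + 1;

    if (!e)
        return -1;
    e->name = malloc(len);
    if (!e->name) {
        free(e);
        return -1;
    }
    memcpy(e->name, name, len);
    e->id = id;
    e->next = name_hash[id & (NAME_HASH_SIZE - 1)];
    name_hash[id & (NAME_HASH_SIZE - 1)] = e;
    return 0;
}

/* Walk the bucket chain; returns NULL when the ID has no name. */
static const char *name_hash_find(uint32_t id)
{
    struct name_entry *e = name_hash[id & (NAME_HASH_SIZE - 1)];

    while (e && e->id != id)
        e = e->next;
    return e ? e->name : NULL;
}
```

IDs that differ only above the low 8 bits (254 and 510, say) share a bucket
and are told apart by the full ID comparison in the chain walk.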

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit ccd621fbb5faa91a98479e9492baee525c6f10c0
tree e4e1416406b5ed252b3b1a91efc3d8caadbf1bd0
parent 00d896184c5f8737269ac05264446c58133ec414
author Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:27:59 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 10 Aug 2006 23:27:59 +0200

 lib/rt_names.c |   96 +++-
 1 files changed, 74 insertions(+), 22 deletions(-)

diff --git a/lib/rt_names.c b/lib/rt_names.c
index 05046c2..b77ad4a 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -23,6 +23,51 @@ #include <linux/rtnetlink.h>
 
 #include "rt_names.h"
 
+struct rtnl_hash_entry {
+   struct rtnl_hash_entry *next;
+   char *  name;
+   unsigned int    id;
+};
+
+static void
+rtnl_hash_initialize(char *file, struct rtnl_hash_entry **hash, int size)
+{
+   struct rtnl_hash_entry *entry;
+   char buf[512];
+   FILE *fp;
+
+   fp = fopen(file, "r");
+   if (!fp)
+   return;
+   while (fgets(buf, sizeof(buf), fp)) {
+   char *p = buf;
+   int id;
+   char namebuf[512];
+
+   while (*p == ' ' || *p == '\t')
+   p++;
+   if (*p == '#' || *p == '\n' || *p == 0)
+   continue;
+   if (sscanf(p, "0x%x %s\n", &id, namebuf) != 2 &&
+   sscanf(p, "0x%x %s #", &id, namebuf) != 2 &&
+   sscanf(p, "%d %s\n", &id, namebuf) != 2 &&
+   sscanf(p, "%d %s #", &id, namebuf) != 2) {
+   fprintf(stderr, "Database %s is corrupted at %s\n",
+   file, p);
+   return;
+   }
+
+   if (id<0)
+   continue;
+   entry = malloc(sizeof(*entry));
+   entry->id   = id;
+   entry->name = strdup(namebuf);
+   entry->next = hash[id & (size - 1)];
+   hash[id & (size - 1)] = entry;
+   }
+   fclose(fp);
+}
+
 static void rtnl_tab_initialize(char *file, char **tab, int size)
 {
char buf[512];
@@ -57,7 +102,6 @@ static void rtnl_tab_initialize(char *fi
fclose(fp);
 }
 
-
 static char * rtnl_rtprot_tab[256] = {
[RTPROT_UNSPEC] = "none",
[RTPROT_REDIRECT] ="redirect",
@@ -266,9 +310,14 @@ int rtnl_rtrealm_a2n(__u32 *id, char *ar
 }
 
 
+static struct rtnl_hash_entry dflt_table_entry  = { .id = 253, .name = "default" };
+static struct rtnl_hash_entry main_table_entry  = { .id = 254, .name = "main" };
+static struct rtnl_hash_entry local_table_entry = { .id = 255, .name = "local" };
 
-static char * rtnl_rttable_tab[256] = {
-   "unspec",
+static struct rtnl_hash_entry * rtnl_rttable_hash[256] = {
+   [253] = &dflt_table_entry,
+   [254] = &main_table_entry,
+   [255] = &local_table_entry,
 };
 
 static int rtnl_rttable_init;
@@ -276,26 +325,26 @@ static int rtnl_rttable_init;
 static void rtnl_rttable_initialize(void)
 {
rtnl_rttable_init = 1;
-   rtnl_rttable_tab[255] = "local";
-   rtnl_rttable_tab[254] = "main";
-   rtnl_rttable_tab[253] = "default";
-   rtnl_tab_initialize("/etc/iproute2/rt_tables",
-   rtnl_rttable_tab, 256);
+   rtnl_hash_initialize("/etc/iproute2/rt_tables",
+rtnl_rttable_hash, 256);
 }
 
 char * rtnl_rttable_n2a(int id, char *buf, int len)
 {
-   if (id<0 || id>=256) {
-   snprintf(buf, len, "%d", id);
+   struct rtnl_hash_entry *entry;
+
+   if (id >= RT_TABLE_MAX) {
+   snprintf(buf, len, "%u", id);
return buf;
}
-   if (!rtnl_rttable_tab[id]) {
-   if (!rtnl_rttable_init)
-   rtnl_rttable_initialize();
-   }
-   if (rtnl_rttable_tab[id])
-   return rtnl_rttable_tab[id];
-   snprintf(buf, len, "%d", id);
+   if (!rtnl_rttable_init)
+   rtnl_rttable_initialize();
+   entry = rtnl_rttable_hash[id & 255];
+   while (entry && entry->id != id)
+   entry = entry->next;
+   if (entry)
+   return entry->name;
+   snprintf(buf, len, "%u", id);
return buf;
 }
 
@@ -303,6 +352,7 @@ int rtnl_rttable_a2n(__u32 *id, char *ar
 {
static char *cache = NULL;
static unsigned long res;
+   struct rtnl_hash_entry *entry;
char *end;
int i;
 
@@ -315,17 +365,19 @@ int rtnl_rttable_a2n(__u32 *id, char *ar
rtnl_rttable_initialize();
 
for (i=0; i<256; i++) {
-   if (rtnl_rttable_tab[i] &&
-   strcmp(rtnl_rttable_tab[i], arg) == 0) {
-   cache = rtnl_rttable_tab[i];
-   res = i;
+   entry = rtnl_rttable_hash[i];
+   while (entry

[IPROUTE 00/03]: Increase number of possible routing tables

These patches add support for a larger number of routing tables to iproute
and are needed for the patches doing the same for the kernel I just sent.
They apply on top of the [IPROUTE]: Support IPv6 routing table filter patch.

Please apply, thanks.


 include/linux/rtnetlink.h |4 -
 include/rt_names.h|2 
 ip/ip_common.h|8 +++
 ip/iproute.c  |   63 
 ip/iprule.c   |   14 +-
 lib/rt_names.c|  100 ++
 6 files changed, 133 insertions(+), 58 deletions(-)

Patrick McHardy:
  [IPROUTE]: Preparation for 32 bit table IDs
  [IPROUTE]: Use hash for routing table name cache
  [IPROUTE]: Add support for larger number of routing tables

Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()

2006-08-10 Thread Jesper Juhl

On 05/08/06, David Miller [EMAIL PROTECTED] wrote:

From: Jesper Juhl [EMAIL PROTECTED]
Date: Sat, 5 Aug 2006 01:30:49 +0200

 On 31/07/06, David Miller [EMAIL PROTECTED] wrote:
  From: Jesper Juhl [EMAIL PROTECTED]
  Date: Sun, 30 Jul 2006 23:51:20 +0200

   Looks ok to me.

  I've applied James's version of the fix, thanks everyone.

 Hmm, if you are referring to commit
 118075b3cdc90e0815362365f3fc64d672ace0d6 -

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=118075b3cdc90e0815362365f3fc64d672ace0d6
 then I think a mistake has crept in. That commit only initializes
 'cnt' to 0 - I don't see how that would fix the leak - looks like you
 forgot the business end of the patch...

See the commit right before that, the initialize of cnt to
zero is just to fix a compiler warning that resulted from
James's version of the fix.

Hmm, perhaps I'm going blind, but I don't see it.

The commit right before the one I linked to above is completely
unrelated : [ATALK]: Make CONFIG_DEV_APPLETALK a tristate.
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9cac2c35e26cc44978df654306bb92d7cfe7e2de

And if I download 2.6.18-rc4 the tcpprobe_read() function (still)
looks like this :

static ssize_t tcpprobe_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos)
{
   int error = 0, cnt = 0;
   unsigned char *tbuf;

   if (!buf || len < 0)
   return -EINVAL;

   if (len == 0)
   return 0;

   tbuf = vmalloc(len);
   if (!tbuf)
   return -ENOMEM;

   error = wait_event_interruptible(tcpw.wait,
__kfifo_len(tcpw.fifo) != 0);
   if (error)
   return error;

   cnt = kfifo_get(tcpw.fifo, tbuf, len);
   error = copy_to_user(buf, tbuf, cnt);

   vfree(tbuf);

   return error ? error : cnt;
}

That function still contains the 'tbuf' leak.
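
To spell out the problem: the early "return error" after the interruptible
wait skips the vfree(). A user-space sketch of the usual single-exit cleanup
idiom is shown below; buf_alloc(), buf_free() and read_sketch() are
hypothetical stand-ins for vmalloc(), vfree() and the read handler, and
wait_result merely simulates the return value of the wait.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int live_buffers;   /* allocations not yet freed, for illustration */

static void *buf_alloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        live_buffers++;
    return p;
}

static void buf_free(void *p)
{
    if (p)
        live_buffers--;
    free(p);
}

/* wait_result: 0 if data arrived, negative errno if the wait was interrupted. */
static long read_sketch(char *dst, size_t len, int wait_result)
{
    long ret;
    char *tbuf;

    if (len == 0)
        return 0;

    tbuf = buf_alloc(len);
    if (!tbuf)
        return -ENOMEM;

    ret = wait_result;
    if (ret)
        goto out_free;      /* a plain "return ret" here would leak tbuf */

    memset(tbuf, 'x', len); /* pretend the fifo delivered len bytes */
    memcpy(dst, tbuf, len);
    ret = (long)len;

out_free:
    buf_free(tbuf);
    return ret;
}
```

Every path taken after the allocation funnels through the out_free label, so
the buffer is released whether or not the wait was interrupted.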

I also couldn't find the fix in your git trees at
http://www.kernel.org/git/?p=linux/kernel/git/davem/net-2.6.19.git;a=summary
http://www.kernel.org/git/?p=linux/kernel/git/davem/net-2.6.git;a=summary

So either I'm going blind or a mistake has been made getting the fix
into mainline...

--
Jesper Juhl [EMAIL PROTECTED]
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html

Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....

[EMAIL PROTECTED] wrote:
 On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
 
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
 
 
 Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity
 quite badly.  It basically totally refuses to answer an IPv6 Neighbor Solicit
 packet or IPv6 Echo Request packet.  I run a 'tcpdump -n ipv6', and I see the
 requests come in, and no packets leaving.  Interestingly enough, if I try to
 ping6 *out* of the box, it's totally willing to send a Neighbor Solicit 
 outbound
 (although it appears to totally ignore the Neighbor Advert packet that comes
 back). Of course, things don't work very well at all with busticated Neighbor
 Solicit.
 
 A kernel built with IPV6_MULTIPLE_TABLES=n works just fine.

It should be fixed by this patch (already contained in net-2.6.19).


[IPV6]: Fix policy routing lookup

When the lookup in a table returns ip6_null_entry the policy routing lookup
returns it instead of continuing in the next table, which effectively means
it only searches the local table.
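
The intended behaviour can be modelled outside the kernel roughly as below.
This is only a sketch with invented names: each table is a lookup function,
null_entry plays the role of ip6_null_entry, and the loop stands in for the
rule walk that should move on to the next table after a miss.

```c
#include <stddef.h>

struct route { const char *name; };

static struct route null_entry  = { "null" };   /* "no route" sentinel */
static struct route local_route = { "local" };
static struct route main_route  = { "main" };

typedef struct route *(*table_lookup_fn)(int dst);

/* Two toy tables: local knows destination 1, main knows destination 2. */
static struct route *local_lookup(int dst)
{
    return dst == 1 ? &local_route : &null_entry;
}

static struct route *main_lookup(int dst)
{
    return dst == 2 ? &main_route : &null_entry;
}

static struct route *policy_lookup(const table_lookup_fn *tables, size_t n,
                                   int dst)
{
    size_t i;

    for (i = 0; i < n; i++) {
        struct route *rt = tables[i](dst);

        if (rt != &null_entry)
            return rt;      /* real match: stop searching */
        /* Sentinel: treat as a miss and try the next table. */
    }
    return &null_entry;     /* no table had a route */
}
```

Returning the sentinel from inside the loop, which is what the unpatched code
effectively did, would make destination 2 unreachable in this model because
the local table is consulted first.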

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]

---
commit 2b885e76c2b2c74d2dfe86a8140f0b41149f327c
tree 767711f03ea3e990ce02b3720718b77490027793
parent 5bd721a145d02a89a9b69adf3ede9d0b3647ae8b
author Patrick McHardy [EMAIL PROTECTED] Sun, 06 Aug 2006 22:24:08 -0700
committer David S. Miller [EMAIL PROTECTED] Sun, 06 Aug 2006 22:24:08 -0700

 net/ipv6/fib6_rules.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index c3c8195..94a46ec 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -94,8 +94,10 @@ int fib6_rule_action(struct fib_rule *ru
 
if (rt != &ip6_null_entry)
goto out;
-
dst_release(&rt->u.dst);
+   rt = NULL;
+   goto out;
+
 discard_pkt:
dst_hold(&rt->u.dst);
 out:

Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....

2006-08-10 Thread Valdis . Kletnieks

On Thu, 10 Aug 2006 22:02:03 +0200, Patrick McHardy said:

 [EMAIL PROTECTED] wrote:
  On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
  
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
  
  
  Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity

 It should be fixed by this patch (already contained in net-2.6.19).

Confirmed fixed, thanks...



Re: [IPROUTE 00/03]: Increase number of possible routing tables

On Fri, 11 Aug 2006 00:14:47 +0200 (MEST)
Patrick McHardy [EMAIL PROTECTED] wrote:

 These patches add support for a larger number of routing tables to iproute
 and are needed for the patches doing the same for the kernel I just sent.
 They apply on top of the [IPROUTE]: Support IPv6 routing table filter patch.
 
 Please apply, thanks.
 
 
  include/linux/rtnetlink.h |4 -
  include/rt_names.h|2 
  ip/ip_common.h|8 +++
  ip/iproute.c  |   63 
  ip/iprule.c   |   14 +-
  lib/rt_names.c|  100 
 ++
  6 files changed, 133 insertions(+), 58 deletions(-)
 
 Patrick McHardy:
   [IPROUTE]: Preparation for 32 bit table IDs
   [IPROUTE]: Use hash for routing table name cache
   [IPROUTE]: Add support for larger number of routing tables

Applied thanks. Let me know when you're done, it has been too long
since a real release of iproute2. I'll roll one as soon as the flow subsides.

Re: [IPROUTE 00/03]: Increase number of possible routing tables

Stephen Hemminger wrote:
 Applied thanks. Let me know when you're done, it has been too long
 since a real release of iproute2. I'll roll one as soon as the flow subsides.


I only have one more patchset I would like to submit soon, the time
cleanups. But they are only meant to make auditing for integer overflows
easier, so we can one day switch to a higher clock resolution. iproute
seems to be mostly fine, but the kernel will probably take a bit longer,
so I wouldn't mind missing this release. I'll submit them in the next
days anyway, but feel free to release without them.


Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()

From: Jesper Juhl [EMAIL PROTECTED]
Date: Fri, 11 Aug 2006 00:18:44 +0200

 Hmm, perhaps I'm going blind, but I don't see it.

I definitely screwed that changeset up somehow.  Thanks for catching
this.

Somehow James's fix got clobbered into just my subsequent warning fix,
like this:

commit 118075b3cdc90e0815362365f3fc64d672ace0d6
Author: James Morris [EMAIL PROTECTED]
Date:   Sun Jul 30 20:21:45 2006 -0700

[TCP]: fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()

Based upon a patch by Jesper Juhl.

Signed-off-by: James Morris [EMAIL PROTECTED]
Acked-by: Stephen Hemminger [EMAIL PROTECTED]
Acked-by: Jesper Juhl [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index d7d517a..b343532 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -114,7 +114,7 @@ static int tcpprobe_open(struct inode * 
 static ssize_t tcpprobe_read(struct file *file, char __user *buf,
 size_t len, loff_t *ppos)
 {
-   int error = 0, cnt;
+   int error = 0, cnt = 0;
unsigned char *tbuf;

if (!buf || len < 0)

Anyways, I'll fix this up, thanks again.

Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()

From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 10 Aug 2006 16:52:16 -0700

 Dave, here is my version...
 Don't leak memory on interrupted read. And only allocate
 as much memory as needed.

 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

I think I'm going to go with James's safe original fix
for now, thanks.

commit a7fc5b24a4921a6582ce47c0faf3a31858a80468
Author: David S. Miller [EMAIL PROTECTED]
Date:   Thu Aug 10 16:53:33 2006 -0700

[TCP]: Fix botched memory leak fix to tcpprobe_read().

Somehow I clobbered James's original fix and only my
subsequent compiler warning change went in for that
changeset.

Get the real fix in there.

Noticed by Jesper Juhl.

Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index b343532..dab37d2 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -130,11 +130,12 @@ static ssize_t tcpprobe_read(struct file
error = wait_event_interruptible(tcpw.wait,
 __kfifo_len(tcpw.fifo) != 0);
if (error)
-   return error;
+   goto out_free;

cnt = kfifo_get(tcpw.fifo, tbuf, len);
error = copy_to_user(buf, tbuf, cnt);

+out_free:
vfree(tbuf);

return error ? error : cnt;

Re: [PATCH] fix memory leak in net/ipv4/tcp_probe.c::tcpprobe_read()

Dave, here is my version...
Don't leak memory on interrupted read. And only allocate
as much memory as needed.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]


--- linux-2.6.orig/net/ipv4/tcp_probe.c 2006-08-10 16:32:36.0 -0700
+++ linux-2.6/net/ipv4/tcp_probe.c  2006-08-10 16:45:30.0 -0700
@@ -114,7 +114,7 @@
 static ssize_t tcpprobe_read(struct file *file, char __user *buf,
 size_t len, loff_t *ppos)
 {
-   int error = 0, cnt = 0;
+   int error, cnt;
unsigned char *tbuf;
 
if (!buf || len < 0)
@@ -123,15 +123,16 @@
if (len == 0)
return 0;
 
-   tbuf = vmalloc(len);
-   if (!tbuf)
-   return -ENOMEM;
-
error = wait_event_interruptible(tcpw.wait,
 __kfifo_len(tcpw.fifo) != 0);
if (error)
return error;
 
+   len = min(len, kfifo_len(tcpw.fifo));
+   tbuf = vmalloc(len);
+   if (!tbuf)
+   return -ENOMEM;
+
cnt = kfifo_get(tcpw.fifo, tbuf, len);
error = copy_to_user(buf, tbuf, cnt);
 

[PATCH 1/2] [BNX2]: Fix tx race condition

2006-08-10 Thread Michael Chan

[BNX2]: Fix tx race condition.

Fix a subtle race condition between bnx2_start_xmit() and bnx2_tx_int()
similar to the one in tg3 discovered by Herbert Xu:

CPU0                                    CPU1
bnx2_start_xmit()
        if (tx_ring_full) {
                tx_lock
                                        bnx2_tx()
                                                if (!netif_queue_stopped)
                netif_stop_queue()
                if (!tx_ring_full)
                                        update_tx_ring
                        netif_wake_queue()
                tx_unlock
        }

Even though tx_ring is updated before the if statement in bnx2_tx_int() in
program order, it can be re-ordered by the CPU as shown above.  This
scenario can cause the tx queue to be stopped forever if bnx2_tx_int() has
just freed up the entire tx_ring.  The possibility of this happening
should be very rare though.

The following changes are made, very much identical to the tg3 fix:

1. Add memory barrier to fix the above race condition.

2. Eliminate the private tx_lock altogether and rely solely on
netif_tx_lock.  This eliminates one spinlock in bnx2_start_xmit()
when the ring is full.

3. Because of 2, use netif_tx_lock in bnx2_tx_int() before calling
netif_wake_queue().

4. Add memory barrier to bnx2_tx_avail().

5. Add bp-tx_wake_thresh which is set to half the tx ring size.

6. Check for the full wake queue condition before getting
netif_tx_lock in bnx2_tx_int().  This reduces the number of unnecessary
spinlocks when the tx ring is full in a steady-state condition.
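
The stop/wake protocol described above can be sketched in portable C11 as
follows. This is an illustration only, not driver code: the ring is reduced
to two counters, the constants are arbitrary, atomic_thread_fence() stands in
for smp_mb(), the locking around the wake-up is omitted, and the re-check
after stopping the queue is an assumption modelled on the description rather
than on the hunk itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE   8u
#define WAKE_THRESH (RING_SIZE / 2)  /* like tx_wake_thresh = ring size / 2 */
#define MAX_FRAGS   2u               /* arbitrary stand-in for MAX_SKB_FRAGS */

static atomic_uint tx_prod, tx_cons;
static atomic_bool queue_stopped;

/* Free slots; the fence keeps the index loads from being reordered early. */
static unsigned tx_avail(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    return RING_SIZE - (atomic_load(&tx_prod) - atomic_load(&tx_cons));
}

/* Transmit path: queue one descriptor, stop the queue when nearly full. */
static void xmit_one(void)
{
    atomic_fetch_add(&tx_prod, 1);
    if (tx_avail() <= MAX_FRAGS) {
        atomic_store(&queue_stopped, true);
        /* Re-check: completions may have freed space in the meantime. */
        if (tx_avail() > WAKE_THRESH)
            atomic_store(&queue_stopped, false);
    }
}

/* Completion path: publish the new consumer index, then test for a wake-up. */
static void tx_complete(unsigned done)
{
    atomic_fetch_add(&tx_cons, done);
    /* Make the index update visible before reading the stopped flag. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load(&queue_stopped) && tx_avail() > WAKE_THRESH)
        atomic_store(&queue_stopped, false);
}
```

With these numbers the queue stops after six descriptors are queued and is
woken only once more than half of the ring is free again, which avoids
bouncing between stopped and running on every completion.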

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index db73de0..2099edb 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -209,8 +209,10 @@ MODULE_DEVICE_TABLE(pci, bnx2_pci_tbl);
 
 static inline u32 bnx2_tx_avail(struct bnx2 *bp)
 {
-	u32 diff = TX_RING_IDX(bp->tx_prod) - TX_RING_IDX(bp->tx_cons);
+	u32 diff;
 
+	smp_mb();
+	diff = TX_RING_IDX(bp->tx_prod) - TX_RING_IDX(bp->tx_cons);
 	if (diff > MAX_TX_DESC_CNT)
 		diff = (diff & MAX_TX_DESC_CNT) - 1;
 	return (bp->tx_ring_size - diff);
@@ -1686,15 +1688,20 @@ bnx2_tx_int(struct bnx2 *bp)
 	}
 
 	bp->tx_cons = sw_cons;
+	/* Need to make the tx_cons update visible to bnx2_start_xmit()
+	 * before checking for netif_queue_stopped().  Without the
+	 * memory barrier, there is a small possibility that bnx2_start_xmit()
+	 * will miss it and cause the queue to be stopped forever.
+	 */
+	smp_mb();
 
-	if (unlikely(netif_queue_stopped(bp->dev))) {
-		spin_lock(&bp->tx_lock);
+	if (unlikely(netif_queue_stopped(bp->dev)) &&
+		     (bnx2_tx_avail(bp) > bp->tx_wake_thresh)) {
+		netif_tx_lock(bp->dev);
 		if ((netif_queue_stopped(bp->dev)) &&
-		    (bnx2_tx_avail(bp) > MAX_SKB_FRAGS)) {
-
+		    (bnx2_tx_avail(bp) > bp->tx_wake_thresh))
 			netif_wake_queue(bp->dev);
-		}
-		spin_unlock(&bp->tx_lock);
+		netif_tx_unlock(bp->dev);
 	}
 }
 
@@ -3503,6 +3510,8 @@ bnx2_init_tx_ring(struct bnx2 *bp)
 	struct tx_bd *txbd;
 	u32 val;
 
+	bp->tx_wake_thresh = bp->tx_ring_size / 2;
+
 	txbd = &bp->tx_desc_ring[MAX_TX_DESC_CNT];
 
 	txbd->tx_bd_haddr_hi = (u64) bp->tx_desc_mapping >> 32;
@@ -4390,10 +4399,8 @@ bnx2_vlan_rx_kill_vid(struct net_device 
 #endif
 
 /* Called with netif_tx_lock.
- * hard_start_xmit is pseudo-lockless - a lock is only required when
- * the tx queue is full. This way, we get the benefit of lockless
- * operations most of the time without the complexities to handle
- * netif_stop_queue/wake_queue race conditions.
+ * bnx2_tx_int() runs without netif_tx_lock unless it needs to call
+ * netif_wake_queue().
  */
 static int
 bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -4512,12 +4519,9 @@ bnx2_start_xmit(struct sk_buff *skb, str
 	dev->trans_start = jiffies;
 
 	if (unlikely(bnx2_tx_avail(bp) <= MAX_SKB_FRAGS)) {
-		spin_lock(&bp->tx_lock);
 		netif_stop_queue(dev);
-
-		if (bnx2_tx_avail(bp) > MAX_SKB_FRAGS)
+		if (bnx2_tx_avail(bp) > bp->tx_wake_thresh)
 			netif_wake_queue(dev);
-		spin_unlock(&bp->tx_lock);
 	}
 
 	return NETDEV_TX_OK;
@@ -5628,7 +5632,6 @@ bnx2_init_board(struct pci_dev *pdev, st
 	bp->pdev = pdev;
 
 	spin_lock_init(&bp->phy_lock);
-	spin_lock_init(&bp->tx_lock);
 	INIT_WORK(&bp->reset_task, bnx2_reset_task, bp);
 
 	dev->base_addr = dev->mem_start = pci_resource_start(pdev, 0);
diff --git a/drivers/net/bnx2.h b/drivers/net/bnx2.h
index 658c5ee..fe80476 100644
--- a/drivers/net/bnx2.h
+++ b/drivers/net/bnx2.h
@@ -3890,10 +3890,6 @@ struct bnx2 {
u32 tx_prod_bseq

[PATCH 2/2] [BNX2]: Convert to netdev_alloc_skb()

2006-08-10 Thread Michael Chan

[BNX2]: Convert to netdev_alloc_skb()

Convert dev_alloc_skb() to netdev_alloc_skb() and increase default
rx ring size to 255. The old ring size of 100 was too small.

Update version to 1.4.44.

Signed-off-by: Michael Chan [EMAIL PROTECTED]

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 2099edb..652eb05 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -56,8 +56,8 @@
 
 #define DRV_MODULE_NAME		"bnx2"
 #define PFX DRV_MODULE_NAME	": "
-#define DRV_MODULE_VERSION	"1.4.43"
-#define DRV_MODULE_RELDATE	"June 28, 2006"
+#define DRV_MODULE_VERSION	"1.4.44"
+#define DRV_MODULE_RELDATE	"August 10, 2006"
 
 #define RUN_AT(x) (jiffies + (x))
 
@@ -1571,7 +1571,7 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 i
 	struct rx_bd *rxbd = &bp->rx_desc_ring[RX_RING(index)][RX_IDX(index)];
 	unsigned long align;
 
-	skb = dev_alloc_skb(bp->rx_buf_size);
+	skb = netdev_alloc_skb(bp->dev, bp->rx_buf_size);
 	if (skb == NULL) {
 		return -ENOMEM;
 	}
@@ -1580,7 +1580,6 @@ bnx2_alloc_rx_skb(struct bnx2 *bp, u16 i
 		skb_reserve(skb, 8 - align);
 	}
 
-	skb->dev = bp->dev;
 	mapping = pci_map_single(bp->pdev, skb->data, bp->rx_buf_use_size,
 		PCI_DMA_FROMDEVICE);
 
@@ -1793,7 +1792,7 @@ bnx2_rx_int(struct bnx2 *bp, int budget)
 		if ((bp->dev->mtu > 1500) && (len <= RX_COPY_THRESH)) {
 			struct sk_buff *new_skb;
 
-			new_skb = dev_alloc_skb(len + 2);
+			new_skb = netdev_alloc_skb(bp->dev, len + 2);
 			if (new_skb == NULL)
 				goto reuse_rx;
 
@@ -1804,7 +1803,6 @@ bnx2_rx_int(struct bnx2 *bp, int budget)
 
 			skb_reserve(new_skb, 2);
 			skb_put(new_skb, len);
-			new_skb->dev = bp->dev;
 
 			bnx2_reuse_rx_skb(bp, skb,
 				sw_ring_cons, sw_ring_prod);
@@ -3961,7 +3959,7 @@ bnx2_run_loopback(struct bnx2 *bp, int l
 		return -EINVAL;
 
 	pkt_size = 1514;
-	skb = dev_alloc_skb(pkt_size);
+	skb = netdev_alloc_skb(bp->dev, pkt_size);
 	if (!skb)
 		return -ENOMEM;
 	packet = skb_put(skb, pkt_size);
@@ -5754,7 +5752,7 @@ bnx2_init_board(struct pci_dev *pdev, st
 	bp->mac_addr[5] = (u8) reg;
 
 	bp->tx_ring_size = MAX_TX_DESC_CNT;
-	bnx2_set_rx_ring_size(bp, 100);
+	bnx2_set_rx_ring_size(bp, 255);
 
 	bp->rx_csum = 1;
 



[IPROUTE]: Fix struct alignment with cris architecture

2006-08-10 Thread Andy Gay

[IPROUTE]: Fix struct alignment with cris architecture

gcc for the cris arch does not pad structures to the next multiple of 4
bytes, as the i386 gcc does.

This causes errors like this when displaying xfrm policies:

# ip x p
!!!Deficit 3, rta_len=300
src 192.168.251.32/29 dst 192.168.251.32/29 
dir in priority 0 
!!!Deficit 3, rta_len=180
src 0.0.0.0/0 dst 192.168.251.32/29 
dir in priority 2208 


Similar errors are seen from ip x s.

This patch fixes the errors when printing. I'm not sure whether we
should worry about other uses of the affected structs; I have not seen
any other bad effects from this, so hopefully this is enough.

(Thanks to Herbert Xu for pointing out that NLMSG_SPACE is the correct
macro to use here.)
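
A small sketch of why the choice of macro matters. The helpers below are local stand-ins that mirror NLMSG_ALIGN, NLMSG_LENGTH and NLMSG_SPACE from <linux/netlink.h>, assuming the usual 4-byte alignment and 16-byte header; the names nl_align(), nl_length(), nl_space() and attr_len(), and the 165-byte payload, are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NL_ALIGNTO 4u    /* mirrors NLMSG_ALIGNTO */
#define NL_HDRLEN  16u   /* sizeof(struct nlmsghdr), already 4-byte aligned */

/* Round len up to the next multiple of 4, like NLMSG_ALIGN. */
size_t nl_align(size_t len)
{
	return (len + NL_ALIGNTO - 1) & ~(size_t)(NL_ALIGNTO - 1);
}

/* Header plus payload, no trailing padding, like NLMSG_LENGTH. */
size_t nl_length(size_t len)
{
	return len + NL_HDRLEN;
}

/* Header plus payload, padded to the alignment, like NLMSG_SPACE. */
size_t nl_space(size_t len)
{
	return nl_align(nl_length(len));
}

/* Bytes left over for attributes after the fixed-size payload. */
long attr_len(size_t nlmsg_len, size_t payload, int use_space)
{
	size_t fixed = use_space ? nl_space(payload) : nl_length(payload);

	assert(nlmsg_len >= fixed);
	return (long)nlmsg_len - (long)fixed;
}
```

For a hypothetical 165-byte struct followed by 300 bytes of attributes, the sender's message is 184 + 300 = 484 bytes long. Subtracting nl_length(165) leaves 303, three bytes more than the attributes occupy, which is the same size as the deficit in the output above; subtracting nl_space(165) leaves exactly 300. When sizeof() is already a multiple of 4 the two helpers agree, which would explain why the problem does not show up on i386.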

Tested against 2.6.17.6 kernel on i386, and 2.6.16.1 kernel on cris.

Signed-off-by: Andy Gay [EMAIL PROTECTED]

---

diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c
index 433b513..340e7df 100644
--- a/ip/xfrm_policy.c
+++ b/ip/xfrm_policy.c
@@ -354,15 +354,15 @@ int xfrm_policy_print(const struct socka
 
 	if (n->nlmsg_type == XFRM_MSG_DELPOLICY)  {
 		xpid = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xpid));
+		len -= NLMSG_SPACE(sizeof(*xpid));
 	} else if (n->nlmsg_type == XFRM_MSG_POLEXPIRE) {
 		xpexp = NLMSG_DATA(n);
 		xpinfo = &xpexp->pol;
-		len -= NLMSG_LENGTH(sizeof(*xpexp));
+		len -= NLMSG_SPACE(sizeof(*xpexp));
 	} else {
 		xpexp = NULL;
 		xpinfo = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xpinfo));
+		len -= NLMSG_SPACE(sizeof(*xpinfo));
 	}
 
 	if (len < 0) {
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 3eefaff..1d61685 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -575,15 +575,15 @@ int xfrm_state_print(const struct sockad
 	if (n->nlmsg_type == XFRM_MSG_DELSA) {
 		/* Dont blame me for this .. Herbert made me do it */
 		xsid = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xsid));
+		len -= NLMSG_SPACE(sizeof(*xsid));
 	} else if (n->nlmsg_type == XFRM_MSG_EXPIRE) {
 		xexp = NLMSG_DATA(n);
 		xsinfo = &xexp->state;
-		len -= NLMSG_LENGTH(sizeof(*xexp));
+		len -= NLMSG_SPACE(sizeof(*xexp));
 	} else {
 		xexp = NULL;
 		xsinfo = NLMSG_DATA(n);
-		len -= NLMSG_LENGTH(sizeof(*xsinfo));
+		len -= NLMSG_SPACE(sizeof(*xsinfo));
 	}
 
 	if (len < 0) {




Re: [PATCH] llc: multicast receive device match

From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 3 Aug 2006 10:05:58 -0700

 Fix from [EMAIL PROTECTED], STP packets are incorrectly received on all
 LLC datagram sockets, whichever interface they are bound to. 
 The llc_sap datagram receive logic sends packets with a unicast destination 
 MAC to one socket bound to that SAP and MAC, and multicast packets to all 
 sockets
 bound to that SAP. STP packets are multicast, and we do need to know
 on which interface they were received.

 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

Looks correct, I will apply this.

Thanks a lot.

Re: [PATCH 4/5] net: socket family using RCU

2006-08-10 Thread Paul E. McKenney

On Thu, Aug 10, 2006 at 01:28:27PM -0700, Stephen Hemminger wrote:
 
 On Wed, Aug 09, 2006 at 11:31:42AM -0700, Stephen Hemminger wrote:
  Replace the gross custom locking done in socket code for net_family[]
  with simple RCU usage. Some reordering necessary to avoid sleep
  issues with sock_alloc.
 
 Definitely a good use of RCU from a read-intensive standpoint -- does
 anyone other than Linux-kernel networking developers change the elements
 of the net_family[] array except at boot and shutdown?  ;-)
 
 Some comments included below.  Looks good, but one question about
 things like atalk_create() being able to sleep and a place or two
 where a comment would be good.
 
 
 ...
 
   
  +  /*
  +   *  Allocate the socket and allow the family to set things up. if
  +   *  the protocol is 0, the family is instructed to select an appropriate
  +   *  default.
  +   */
  +  sock = sock_alloc();
  +  if (!sock) {
  +          printk(KERN_WARNING "socket: no more sockets\n");
  +          return -ENFILE; /* Not exactly a match, but its the
  +                             closest posix thing */
  +  }
  +
  +  sock->type = type;
  +
   #if defined(CONFIG_KMOD)
 /* Attempt to load a protocol module if the find failed.
  *
  @@ -1166,70 +1138,59 @@
  * requested real, full-featured networking support upon configuration.
  * Otherwise module support will break!
  */
  -  if (net_families[family] == NULL) {
  +  if (net_families[family] == NULL)
             request_module("net-pf-%d", family);
 
 OK, I'll bite...
 
 What happens if the module is not present?  Or is this what the
 "Otherwise module support will break" comment is getting at?
 
 request_module loads the module (and blocks). One would
 expect that the module loaded would set net_families[] via
 sock_register, so later reference would succeed. Comment is
 historical since intention was to make base socket code itself modular
 which never was done, and is probably a bad idea to even consider.
 
 If module is not present, then net_families[] will still be NULL.

Got it!

 Also, this reference to net_families[family] is done without
 rcu_dereference() and without any clear update-side lock.  This
 just happens to be OK, since we are only testing for NULL, but
 should at least have a comment.
 
  -  }
   #endif
   
  -  net_family_read_lock();
  -  if (net_families[family] == NULL) {
  -  err = -EAFNOSUPPORT;
  -  goto out;
  -  }
  -
  -/*
  - *Allocate the socket and allow the family to set things up. if
  - *the protocol is 0, the family is instructed to select an appropriate
  - *default.
  - */
  -
  -  if (!(sock = sock_alloc())) {
  -          printk(KERN_WARNING "socket: no more sockets\n");
  -          err = -ENFILE;  /* Not exactly a match, but its the
  -                             closest posix thing */
  -          goto out;
  -  }
  -
  -  sock->type = type;
  +  rcu_read_lock();
  +  pf = rcu_dereference(net_families[family]);
 
 OK, so the elements of the net_families array are protected by RCU.
 All references should either be under rcu_read_lock() and accessed
 via rcu_dereference() or under the update-side lock, whatever that
 might be.
 
 
 Yes, the net_family_lock

Good.
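
The read-side rule discussed above (load the table slot once, then test and use that same snapshot) can be sketched in userspace with C11 atomics. This is a loose model under stated assumptions, not the kernel code: an acquire load stands in for rcu_dereference(), a release compare-and-swap stands in for the assignment made under net_family_lock, and the grace period provided by synchronize_rcu() is not modelled. All names (family_model, family_register(), family_unregister(), family_create(), demo_create(), NPROTO_MODEL) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NPROTO_MODEL 4   /* hypothetical table size */

struct family_model {
	int family;
	int (*create)(int protocol);
};

/* Zero-initialised, so every slot starts out NULL. */
struct family_model *_Atomic families[NPROTO_MODEL];

/* Publish a handler; release order makes *f visible before the pointer. */
int family_register(struct family_model *f)
{
	struct family_model *expected = NULL;

	assert(f != NULL);
	if (f->family < 0 || f->family >= NPROTO_MODEL)
		return -1;
	if (!atomic_compare_exchange_strong_explicit(&families[f->family],
			&expected, f,
			memory_order_release, memory_order_relaxed))
		return -2;   /* slot already taken */
	return 0;
}

/* Unpublish; the kernel would follow this with synchronize_rcu(). */
void family_unregister(int family)
{
	if (family < 0 || family >= NPROTO_MODEL)
		return;
	atomic_store_explicit(&families[family],
			      (struct family_model *)NULL,
			      memory_order_release);
}

/* Reader: one acquire load, then only the local copy is tested and used. */
int family_create(int family, int protocol)
{
	struct family_model *pf;

	if (family < 0 || family >= NPROTO_MODEL)
		return -1;
	pf = atomic_load_explicit(&families[family], memory_order_acquire);
	if (pf == NULL)
		return -2;
	return pf->create(protocol);
}

/* Trivial handler used only to exercise the model. */
int demo_create(int protocol)
{
	return protocol + 100;
}
```

Reading the slot twice, once for the NULL test and once for the call, would reopen the window in which an unregister can slip in between; taking a single snapshot avoids that in this model.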

   
  -/*
  +/**
  + *sock_unregister - remove a protocol handler
  + *@family: protocol family to remove
  + *
*This function is called by a protocol handler that wants to
*remove its address family, and have it unlinked from the
  - *SOCKET module.
  + *new socket creation.
  + *
  + *If protocol handler is a module, then it can use module reference
  + *counts to protect against new references. If protocol handler is not
  + *a module then it needs to provide its own protection in
  + *the ops->create routine.
*/
  -
   int sock_unregister(int family)
   {
     if (family < 0 || family >= NPROTO)
  -          return -1;
  +          return -EINVAL;
   
  -  net_family_write_lock();
  +  spin_lock(&net_family_lock);
     net_families[family] = NULL;
 
 And this one is covered by net_family_lock, so we are set, since this
 is the last one.
 
  -  net_family_write_unlock();
  +  spin_unlock(&net_family_lock);
  +
  +  synchronize_rcu();
 
 OK, and the caller is presumably going to free up whatever needs to be
 freed.
 
 Or, if nothing need be freed, beyond this point, we know that all
 non-sleeping code paths through any of the net_protocol_family
 functions have completed.
 
 (So, are all of the functions non-sleeping, or do we care?  The
 definition of net_protocol_family in include/linux/net.h doesn't say
 that they need to be non-sleeping...)
 
 atalk_create() can potentially sleep in the following line of code:
 
  sk = sk_alloc(PF_APPLETALK, GFP_KERNEL, &ddp_proto, 1);
 
 The module reference counts are used to prevent that. Since
 appletalk module can't be unloaded until there are no more appletalk

Re: [PATCH 3/6] ehea: queue management

2006-08-10 Thread Michael Neuling

Please add comments to make the code more readable, especially at the
start of functions/structures to describe what they do.  A large readme
at the start of ehea_main.c which gave an overview of the driver design
would be really useful.

Comments inline below... 

 +static void *ipz_qpageit_get_inc(struct ipz_queue *queue)
 +{
 +	void *retvalue = ipz_qeit_get(queue);
 +	queue->current_q_offset += queue->pagesize;
 +	if (queue->current_q_offset > queue->queue_length) {
 +		queue->current_q_offset -= queue->pagesize;
 +		retvalue = NULL;
 +	}
 +	else if ((((u64) retvalue) & (EHEA_PAGESIZE-1)) != 0) {
 +		EDEB(4, "ERROR!! not at PAGE-Boundary");
 +		return NULL;
 +	}
 +	EDEB(7, "queue=%p retvalue=%p", queue, retvalue);

Don't redefine these debugging macros, but if so, what is EDEB?

 + return retvalue;
 +}
 +
 +static int ipz_queue_ctor(struct ipz_queue *queue,
 +			  const u32 nr_of_pages,
 +			  const u32 pagesize, const u32 qe_size,
 +			  const u32 nr_of_sg)
 +{
 +	int f;
 +	EDEB_EN(7, "nr_of_pages=%x pagesize=%x qe_size=%x",
 +		nr_of_pages, pagesize, qe_size);
 +	queue->queue_length = nr_of_pages * pagesize;
 +	queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *));

 +	if (!queue->queue_pages) {
 +		EDEB(4, "ERROR!! didn't get the memory");
 +		return 0;
 +	}
 +	memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *));
 +
 +	for (f = 0; f < nr_of_pages; f++) {
 +		(queue->queue_pages)[f] =
 +		    (struct ipz_page *)get_zeroed_page(GFP_KERNEL);
 +		if (!(queue->queue_pages)[f]) {
 +			break;
 +		}
 +	}
 +	if (f < nr_of_pages) {
 +		int g;
 +		EDEB_ERR(4, "couldn't get 0ed pages queue=%p f=%x "
 +			 "nr_of_pages=%x", queue, f, nr_of_pages);
 +		for (g = 0; g < f; g++) {
 +			free_page((unsigned long)(queue->queue_pages)[g]);
 +		}
 +		return 0;

If you return here when calling from ehea_create_eq, I think you are
leaking the queue->queue_pages allocation (the pages they point to are
freed correctly).

Either way these allocations/deallocations seem too complicated.  Can
you write the dtor so it can be called to free the structure in any state?
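
One possible shape for that, sketched in userspace with malloc()/free() standing in for vmalloc()/get_zeroed_page(). The names queue_model, queue_ctor() and queue_dtor() are hypothetical, and the 0 / -1 return convention is this sketch's own choice rather than the driver's:

```c
#include <assert.h>
#include <stdlib.h>

struct queue_model {
	void **pages;            /* array of page pointers, NULL when unbuilt */
	unsigned int nr_pages;
	unsigned int page_size;
};

/* Safe on a zeroed, partly built, or already destroyed queue. */
void queue_dtor(struct queue_model *q)
{
	unsigned int i;

	if (!q || !q->pages)
		return;
	for (i = 0; i < q->nr_pages; i++)
		free(q->pages[i]);       /* free(NULL) is a no-op */
	free(q->pages);
	q->pages = NULL;
	q->nr_pages = 0;
}

/* Returns 0 on success, -1 on failure; every error path goes via the dtor. */
int queue_ctor(struct queue_model *q, unsigned int nr_pages,
	       unsigned int page_size)
{
	unsigned int i;

	assert(q != NULL);
	q->pages = NULL;
	q->nr_pages = 0;
	q->page_size = page_size;
	if (nr_pages == 0 || page_size == 0)
		return -1;

	q->pages = malloc(nr_pages * sizeof(*q->pages));
	if (!q->pages)
		return -1;
	for (i = 0; i < nr_pages; i++)
		q->pages[i] = NULL;      /* so a partial build can be torn down */
	q->nr_pages = nr_pages;

	for (i = 0; i < nr_pages; i++) {
		q->pages[i] = calloc(1, page_size);
		if (!q->pages[i]) {
			queue_dtor(q);   /* frees the array and any pages so far */
			return -1;
		}
	}
	return 0;
}
```

Because the pointer array is cleared before any page is allocated, the destructor does not need to know how far construction got, and callers can invoke it unconditionally on their own error paths.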

 +	}
 +	queue->current_q_offset = 0;
 +	queue->qe_size = qe_size;
 +	queue->act_nr_of_sg = nr_of_sg;
 +	queue->pagesize = pagesize;
 +	queue->toggle_state = 1;
 +	EDEB_EX(7, "queue_length=%x queue_pages=%p qe_size=%x"
 +		" act_nr_of_sg=%x", queue->queue_length, queue->queue_pages,
 +		queue->qe_size, queue->act_nr_of_sg);
 +	return 1;
 +}
 +
 +static int ipz_queue_dtor(struct ipz_queue *queue)
 +{
 +	int g;
 +	EDEB_EN(7, "ipz_queue pointer=%p", queue);
 +	if (!queue) {
 +		return 0;
 +	}
 +	if (!queue->queue_pages) {
 +		return 0;
 +	}

if (!queue || !queue->queue_pages)
	return 0;

 +	EDEB(7, "destructing a queue with the following properties:\n"
 +	     " queue_length=%x act_nr_of_sg=%x pagesize=%x qe_size=%x",
 +	     queue->queue_length, queue->act_nr_of_sg, queue->pagesize,
 +	     queue->qe_size);
 +	for (g = 0; g < (queue->queue_length / queue->pagesize); g++) {
 +		free_page((unsigned long)(queue->queue_pages)[g]);
 +	}
 +	vfree(queue->queue_pages);
 +
 +	EDEB_EX(7, "queue freed!");
 +	return 1;
 +}
 +
 +struct ehea_cq *ehea_cq_new(void)
 +{
 + struct ehea_cq *cq = vmalloc(sizeof(*cq));
 + if (cq)
 + memset(cq, 0, sizeof(*cq));
 + return cq;
 +}
 +
 +void ehea_cq_delete(struct ehea_cq *cq)
 +{
 + vfree(cq);
 +}

This is used in only two places.  Do we need it?  

If we do... can we static inline?

 +struct ehea_cq *ehea_create_cq(struct ehea_adapter *adapter,
 +			       int nr_of_cqe, u64 eq_handle, u32 cq_token)
 +{
 +	struct ehea_cq *cq = NULL;
 +	struct h_galpa gal;
 +
 +	u64 *cq_handle_ref;
 +	u32 act_nr_of_entries;
 +	u32 act_pages;
 +	u64 hret = H_HARDWARE;
 +	int ipz_rc;
 +	u32 counter;
 +	void *vpage = NULL;
 +	u64 rpage = 0;
 +
 +	EDEB_EN(7, "adapter=%p nr_of_cqe=%x , eq_handle: %016lX",
 +		adapter, nr_of_cqe, eq_handle);
 +
 +	cq = ehea_cq_new();
 +	if (!cq) {
 +		cq = NULL;
 +		EDEB_ERR(4, "ehea_create_cq ret=%p (-ENOMEM)", cq);
 +		goto create_cq_exit0;
 +	}
 +
 +	cq->attr.max_nr_of_cqes = nr_of_cqe;
 +	cq->attr.cq_token = cq_token;
 +	cq->attr.eq_handle = eq_handle;
 +
 +	cq->adapter = adapter;
 +
 +	cq_handle_ref = &cq->ipz_cq_handle;
 +	act_nr_of_entries = 0;
 +	act_pages = 0;
 +
 +	hret = ehea_h_alloc_resource_cq(adapter->handle,
 +					cq,
 +					&cq->attr,
 +

Re: [PATCH 3/6] ehea: queue management

2006-08-10 Thread Alexey Dobriyan

> > +static inline u32 map_swqe_size(u8 swqe_enc_size)
> > +{
> > +   return 128 << swqe_enc_size;
> > +}                ^
> > +                 |
> > +static inline u32|map_rwqe_size(u8 rwqe_enc_size)
> > +{                |
> > +   return 128 << rwqe_enc_size;
>                     ^
> > +}                |
>                     |
> Snap!  These are ide|tical...
>                     |
No, they aren't. -----+

> Name the function after what it does, not after the functions you expect
> to call it.


Re: [PATCH 3/6] ehea: queue management

2006-08-10 Thread Michael Neuling

> > > +static inline u32 map_swqe_size(u8 swqe_enc_size)
> > > +{
> > > +   return 128 << swqe_enc_size;
> > > +}                ^
> > > +                 |
> > > +static inline u32|map_rwqe_size(u8 rwqe_enc_size)
> > > +{                |
> > > +   return 128 << rwqe_enc_size;
> >                     ^
> > > +}                |
> >                     |
> > Snap!  These are ide|tical...
> >                     |
> No, they aren't. -----+

Functionally identical.

Mikey

Re: [RFC][PATCH] VM deadlock prevention core -v3

2006-08-10 Thread Indan Zupancic

On Thu, August 10, 2006 18:50, Peter Zijlstra said:
 You are right if the reserve wasn't device bound - which I will abandon
 because you are right that with multi-path routing, bridge device and
 other advanced goodies this scheme is broken in that there is no
 unambiguous mapping from sockets to devices.

The natural thing seems to make reserves socket bound, but that has
overhead too and the simplicity of a global reserve is very tempting.

What about adding a flag to sk_set_memalloc() which says if memalloc is on
or off on the socket? (Or add sk_unset_memalloc). That way it's possible
to switch it off again, which doesn't seem like that a bad idea, because
then it can be turned on only when the socket can be used to reduce total
memory usage. Also if it is turned off again when no more memory can be
freed by using this socket, it will solve the starvation problem as a
starved socket now has a new chance to do its thing.

Greetings,

Indan



Re: [take6 1/3] kevent: Core files.

On Thu, 10 Aug 2006 12:22:35 +0400
Evgeniy Polyakov [EMAIL PROTECTED] wrote:

 On Thu, Aug 10, 2006 at 01:02:54AM -0700, Andrew Morton ([EMAIL PROTECTED]) 
 wrote:
Afaict this mmap function gives a user a free way of getting pinned 
memory. 
What is the upper bound on the amount of memory which a user can thus
obtain?
   
   it is limited by maximum queue length which is 4k entries right now, so
   maximum number of pages here is 4k*40/page_size, i.e. about 40 pages on
   x86.
  
  Is that per user or per fd?  If the latter that is, with the usual
  RLIMIT_NOFILE, 160MBytes.  2GB with 64k pagesize.  Problem ;)
 
 Per kevent fd.
 I have some ideas about a better mmap ring implementation, which would
 dynamically grow its buffer when events are added and reuse the same
 place for next events, but there are some nitpicks unresolved yet.
 Let's not see there in next releases (no merge of course), until better 
 solution is ready. I will change that area when other things are ready.

This is not a problem with the mmap interface per-se.  If the proposed
event code permits each user to pin 160MB of kernel memory then that would
be a serious problem.
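
The 160MB figure can be reproduced with a few lines of arithmetic. The sketch below assumes what the quoted numbers imply: 4096 entries of 40 bytes each per kevent fd, 4 KiB pages, and a descriptor limit of 1024 (taken here as "the usual RLIMIT_NOFILE"). The function names ring_pages() and pinned_bytes() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pages needed for one ring: entries * entry_size, rounded up to whole pages. */
uint64_t ring_pages(uint64_t entries, uint64_t entry_size, uint64_t page_size)
{
	assert(page_size > 0);
	return (entries * entry_size + page_size - 1) / page_size;
}

/* Worst case if every descriptor a process may open is a kevent fd. */
uint64_t pinned_bytes(uint64_t max_fds, uint64_t entries,
		      uint64_t entry_size, uint64_t page_size)
{
	return max_fds * ring_pages(entries, entry_size, page_size) * page_size;
}
```

Under those assumptions ring_pages(4096, 40, 4096) is 40 pages per fd, and pinned_bytes(1024, 4096, 40, 4096) is 167772160 bytes, i.e. 160 MiB per process.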



remove unnecessary config.h includes from drivers/net/

2006-08-10 Thread Dave Jones

On Wed, Aug 09, 2006 at 09:04:38PM -0700, David Miller wrote:
  From: Dave Jones [EMAIL PROTECTED]
  Date: Wed, 9 Aug 2006 22:21:16 -0400
  
   config.h is automatically included by kbuild these days.
   
   Signed-off-by: Dave Jones [EMAIL PROTECTED]
  
  Applied to net-2.6.19, thanks Dave.

Here's a similar patch that does the same removals for drivers/net/

Signed-off-by: Dave Jones [EMAIL PROTECTED]

--- linux-2.6.17.noarch/drivers/net/irda/mcs7780.c~ 2006-08-10 21:35:23.0 -0400
+++ linux-2.6.17.noarch/drivers/net/irda/mcs7780.c  2006-08-10 21:35:25.0 -0400
@@ -45,7 +45,6 @@
 
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/errno.h>
--- linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c~ 2006-08-10 21:35:28.0 -0400
+++ linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c  2006-08-10 21:35:30.0 -0400
@@ -40,7 +40,6 @@
  /
 
 #include <linux/module.h>
-#include <linux/config.h> 
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/skbuff.h>
--- linux-2.6.17.noarch/drivers/net/smc911x.c~  2006-08-10 21:35:34.0 -0400
+++ linux-2.6.17.noarch/drivers/net/smc911x.c   2006-08-10 21:35:37.0 -0400
@@ -55,8 +55,6 @@ static const char version[] =
  )
 #endif
 
-
-#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
--- linux-2.6.17.noarch/drivers/net/netx-eth.c~ 2006-08-10 21:35:41.0 -0400
+++ linux-2.6.17.noarch/drivers/net/netx-eth.c  2006-08-10 21:35:42.0 -0400
@@ -17,7 +17,6 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 
-#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
--- linux-2.6.17.noarch/drivers/net/wan/cycx_main.c~ 2006-08-10 21:35:45.0 -0400
+++ linux-2.6.17.noarch/drivers/net/wan/cycx_main.c 2006-08-10 21:35:48.0 -0400
@@ -40,7 +40,6 @@
 * 1998/08/08   acme    Initial version.
 */
 
-#include <linux/config.h>  /* OS configuration options */
 #include <linux/stddef.h>  /* offsetof(), etc. */
 #include <linux/errno.h>   /* return codes */
 #include <linux/string.h>  /* inline memset(), etc. */
--- linux-2.6.17.noarch/drivers/net/wan/sdla.c~ 2006-08-10 21:35:51.0 -0400
+++ linux-2.6.17.noarch/drivers/net/wan/sdla.c  2006-08-10 21:35:53.0 -0400
@@ -32,7 +32,6 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include <linux/config.h> /* for CONFIG_DLCI_MAX */
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
--- linux-2.6.17.noarch/drivers/net/wan/dlci.c~ 2006-08-10 21:35:57.0 -0400
+++ linux-2.6.17.noarch/drivers/net/wan/dlci.c  2006-08-10 21:35:59.0 -0400
@@ -28,7 +28,6 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include <linux/config.h> /* for CONFIG_DLCI_COUNT */
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
--- linux-2.6.17.noarch/drivers/net/phy/vitesse.c~  2006-08-10 21:36:02.0 -0400
+++ linux-2.6.17.noarch/drivers/net/phy/vitesse.c   2006-08-10 21:36:04.0 -0400
@@ -12,7 +12,6 @@
  *
  */
 
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mii.h>
--- linux-2.6.17.noarch/drivers/net/phy/smsc.c~ 2006-08-10 21:36:07.0 -0400
+++ linux-2.6.17.noarch/drivers/net/phy/smsc.c  2006-08-10 21:36:08.0 -0400
@@ -14,7 +14,6 @@
  *
  */
 
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mii.h>
--- linux-2.6.17.noarch/drivers/net/hp100.c~ 2006-08-10 21:36:12.0 -0400
+++ linux-2.6.17.noarch/drivers/net/hp100.c 2006-08-10 21:36:14.0 -0400
@@ -111,7 +111,6 @@
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
 #include <linux/types.h>
-#include <linux/config.h>  /* for CONFIG_PCI */
 #include <linux/delay.h>
 #include <linux/init.h>
 #include <linux/bitops.h>
--- linux-2.6.17.noarch/drivers/net/3c501.c~ 2006-08-10 21:36:18.0 -0400
+++ linux-2.6.17.noarch/drivers/net/3c501.c 2006-08-10 21:36:20.0 -0400
@@ -120,7 +120,6 @@ static const char version[] =
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/errno.h>
-#include <linux/config.h>  /* for CONFIG_IP_MULTICAST */
 #include <linux/spinlock.h>
 #include <linux/ethtool.h>
 #include <linux/delay.h>

-- 
http://www.codemonkey.org.uk

2.6.18-rc3-mm2 - BUG in rt6_lookup() from ipv6_del_addr()

2006-08-10 Thread Valdis . Kletnieks

On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

After applying the patch that Patrick McHardy pointed me at, it lived
longer.  However, I'm now seeing problems at system shutdown (or anytime
you try to 'ifdown ethX' where ethX has an IPv6 address attached to it):

[  196.346000] BUG: unable to handle kernel NULL pointer dereference at virtual 
address 0014
[  196.347000]  printing eip:
[  196.348000] c032c436
[  196.348000] *pde = 
[  196.349000] Oops:  [#1]
[  196.349000] 4K_STACKS PREEMPT 
[  196.349000] last sysfs file: /class/net/eth1/address
[  196.349000] Modules linked in: thermal sony_acpi processor fan button 
battery ac nfnetlink i8k floppy nvram orinoco_cs orinoco hermes pcmcia 
firmware_class ohci1394 ieee1394 intel_agp agpgart iTCO_wdt yenta_socket 
rsrc_nonstatic pcmcia_core rtc
[  196.349000] CPU:0
[  196.349000] EIP:0060:[c032c436]Not tainted VLI
[  196.349000] EFLAGS: 00010246   (2.6.18-rc3-mm2 #4) 
[  196.349000] EIP is at rt6_lookup+0x47/0x83
[  196.349000] eax:    ebx:    ecx: 0005   edx: 
[  196.349000] esi: e8b25c98   edi: e8b25c20   ebp: e8b25c78   esp: e8b25c20
[  196.349000] ds: 007b   es: 007b   ss: 0068
[  196.349000] Process ip (pid: 2511, ti=e8b25000 task=effb0aa0 
task.ti=e8b25000)
[  196.349000] Stack: 0005  80fe    
  
[  196.349000]      
  
[  196.349000]   0008 eb6e98c8 e8b25ca8 
e8b25cb4 c0327c04 
[  196.349000] Call Trace:
[  196.349000]  [c0327c04] ipv6_del_addr+0x2ef/0x3a7
[  196.349000]  [c0327d3f] inet6_addr_del+0x83/0xbb
[  196.349000]  [c0327dd6] inet6_rtm_deladdr+0x5f/0x6b
[  196.349000]  [c02da097] rtnetlink_rcv_msg+0x1b3/0x1d6
[  196.349000]  [c02e011c] netlink_run_queue+0x5a/0xc6
[  196.349000]  [c02d9e9d] rtnetlink_rcv+0x29/0x42
[  196.349000]  [c02e0576] netlink_data_ready+0x12/0x49
[  196.349000]  [c02df518] netlink_sendskb+0x1c/0x4d
[  196.349000]  [c02dfea0] netlink_unicast+0x1c4/0x1d0
[  196.349000]  [c02e0557] netlink_sendmsg+0x274/0x281
[  196.349000]  [c02ca57e] sock_sendmsg+0xeb/0x106
[  196.349000]  [c02cad99] sys_sendto+0xbe/0xdc
[  196.349000]  [c02cb522] sys_socketcall+0xfb/0x186
[  196.349000]  [c0102849] sysenter_past_esp+0x56/0x79
[  196.349000] DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x79
[  196.349000] Leftover inexact backtrace:
[  196.349000]  [c01036c7] show_stack_log_lvl+0x8c/0x97
[  196.349000]  [c010381f] show_registers+0x14d/0x1de
[  196.349000]  [c0103a5b] die+0x1ab/0x26d
[  196.349000]  [c0352205] do_page_fault+0x3f8/0x4c5
[  196.349000]  [c0351271] error_code+0x39/0x40
[  196.349000]  [c0327c04] ipv6_del_addr+0x2ef/0x3a7
[  196.349000]  [c0327d3f] inet6_addr_del+0x83/0xbb
[  196.349000]  [c0327dd6] inet6_rtm_deladdr+0x5f/0x6b
[  196.349000]  [c02da097] rtnetlink_rcv_msg+0x1b3/0x1d6
[  196.349000]  [c02e011c] netlink_run_queue+0x5a/0xc6
[  196.349000]  [c02d9e9d] rtnetlink_rcv+0x29/0x42
[  196.349000]  [c02e0576] netlink_data_ready+0x12/0x49
[  196.349000]  [c02df518] netlink_sendskb+0x1c/0x4d
[  196.349000]  [c02dfea0] netlink_unicast+0x1c4/0x1d0
[  196.349000]  [c02e0557] netlink_sendmsg+0x274/0x281
[  196.349000]  [c02ca57e] sock_sendmsg+0xeb/0x106
[  196.349000]  [c02cad99] sys_sendto+0xbe/0xdc
[  196.349000]  [c02cb522] sys_socketcall+0xfb/0x186
[  196.349000]  [c0102849] sysenter_past_esp+0x56/0x79
[  196.349000] Code: eb ff 89 5d a8 8d 45 b0 b9 10 00 00 00 89 f2 e8 c9 e0 eb 
ff 31 d2 83 7d 08 00 0f 95 c2 b9 ad cc 32 c0 89 f8 e8 47 7c 01 00 89 c3 66 83 
7b 14 00 74 2d 8b 43 04 85 c0 7f 21 68 c4 19 37 c0 68 99 
[  196.349000] EIP: [c032c436] rt6_lookup+0x47/0x83 SS:ESP 0068:e8b25c20

The unlucky 'ip' process then gets a SIGSEGV and dies while holding a lock
of some sort, so later 'ip' processes get hung in 'D' state.

Checking the lkml and netdev archives didn't find any useful hits for
'ipv6_addr_rel'...



Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-10 Thread Rik van Riel


Thomas Graf wrote:


skb-dev is not guaranteed to still point to the allocating device
once the skb is freed again so reserve/unreserve isn't symmetric.
You'd need skb-alloc_dev or something.


There's another consequence of this property of the network
stack.

Every network interface must be able to fall back to these
MEMALLOC allocations, because the memory critical socket
could be on another network interface.  Hence, we cannot
know which network interfaces should (not) be marked MEMALLOC.

--
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it. - Brian W. Kernighan

Re: 2.6.18-rc3-mm2 - BUG in rt6_lookup() from ipv6_del_addr()