Re: [IPV4 0/9] TRIE performance patches

2008-02-01 Thread Robert Olsson

Hello, finally got some time to test...

Table w. 214k routes with full rDoS on two interfaces on 2 x AMD64 processors, 
speed 2814.43 MHz. Profiled with CPU_CLK_UNHALTED and rtstat.

w/o the latest fib_trie patches. Tput ~233 kpps  

samples  %        symbol name

109925   14.4513  fn_trie_lookup
109821   14.4376  ip_route_input
87245    11.4696  rt_intern_hash
31270 4.1109  kmem_cache_alloc
24159 3.1761  dev_queue_xmit
23200 3.0500  neigh_lookup
22464 2.9532  free_block
18412 2.4205  kmem_cache_free
17830 2.3440  dst_destroy
15740 2.0693  fib_get_table

With the latest fib_trie patches (Stephen's and others): same throughput 
~233 kpps, but we see a different profile. Why we don't get any 
better throughput is yet to be understood (the drops in 
qdisc could be the cause); more analysis is needed.

79242    14.3520  ip_route_input
65188    11.8066  fn_trie_lookup
64559    11.6927  rt_intern_hash
22901 4.1477  kmem_cache_alloc
21038 3.8103  check_leaf
16197 2.9335  neigh_lookup
14802 2.6809  free_block
14596 2.6436  ip_rcv_finish
12365 2.2395  fib_validate_source
12048 2.1821  dst_destroy

fib_hash throughput ~177 kpps.
Hard work for fib_hash here as we have many zones;
it can be faster with fewer zones.

200568   37.8013  fn_hash_lookup
58352    10.9977  ip_route_input
44495 8.3860  rt_intern_hash
12873 2.4262  kmem_cache_alloc
12115 2.2833  rt_may_expire
11691 2.2034  rt_garbage_collect
10821 2.0394  dev_queue_xmit
  1.8845  fib_validate_source
8762  1.6514  fib_get_table
8558  1.6129  fib_semantic_match


Cheers
--ro
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPV4 0/9] TRIE performance patches

2008-01-24 Thread Robert Olsson

Stephen Hemminger writes:

 > Dumping by prefix is possible, but unless 32x slower. Dumping in
 > address order is just as logical. Like I said, I'm investigating what
 > quagga handles.

 How about taking a snapshot in address order (as you did) into some 
 allocated memory, then returning from that memory in prefix order? This 
 would solve the -EBUSY problem too and give a consistent view of the
 routing table at the time of the dump/snapshot.
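A sketch of that idea, with illustrative names (none of this is the kernel dump API): snapshot the table in address order into private memory, then reorder the copy before handing it back.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical snapshot record -- illustrative only. */
struct rt_snap {
	uint32_t prefix;
	int      plen;
};

/* Longest prefix first; address order breaks ties, so the snapshot's
 * address ordering survives within each prefix length. */
static int cmp_prefix_order(const void *a, const void *b)
{
	const struct rt_snap *ra = a, *rb = b;

	if (ra->plen != rb->plen)
		return rb->plen - ra->plen;
	return (ra->prefix > rb->prefix) - (ra->prefix < rb->prefix);
}

/* Copy the table in address order while it is stable, then reorder the
 * private copy: no -EBUSY restarts, and the result is a consistent
 * view of the table as of snapshot time. */
static void reorder_snapshot(struct rt_snap *snap, size_t n)
{
	qsort(snap, n, sizeof(*snap), cmp_prefix_order);
}
```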

 Cheers
--ro


[IPV4 0/9] TRIE performance patches

2008-01-23 Thread Robert Olsson

Stephen Hemminger writes:

 > Time to handle a full BGP load (163K of routes).
 > 
 > Before:  LoadDumpFlush
 >
 > kmem_cache   3.8 13.07.2
 > iter 3.9 12.36.9
 > unordered3.1 11.94.9
 > find_node3.1  0.31.2

 I certainly like the speed, but what will we break when
 we don't return in longest-prefix order?

labb:/# ip r
default via 10.10.10.1 dev eth0 
5.0.0.0/8 via 192.168.2.2 dev eth3 
10.10.10.0/24 dev eth0  proto kernel  scope link  src 10.10.10.2 
10.10.11.0/24 dev eth1  proto kernel  scope link  src 10.10.11.1 
11.0.0.0/8 via 10.10.11.2 dev eth1 
192.168.1.0/24 dev eth2  proto kernel  scope link  src 192.168.1.2 
192.168.2.0/24 dev eth3  proto kernel  scope link  src 192.168.2.1 

labb:/# ip route list match 10.10.10.1
default via 10.10.10.1 dev eth0 
10.10.10.0/24 dev eth0  proto kernel  scope link  src 10.10.10.2 
labb:/# 

Maybe the unordered dump can be ordered cheaply...
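One cheap ordering, sketched under the assumption that prefix length is the only key that matters: a stable counting sort over the 33 possible lengths is a single O(n) pass, so the unordered dump could be grouped longest-prefix-first without a full comparison sort. Names are illustrative, not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

struct rt_ent {
	uint32_t prefix;
	int      plen;	/* 0..32 */
};

/* Stable counting sort on plen, longest prefix first.  out must have
 * room for n entries. */
static void order_by_plen(const struct rt_ent *in, struct rt_ent *out,
			  size_t n)
{
	size_t count[33] = {0};
	size_t start[33];
	size_t i, pos = 0;
	int p;

	for (i = 0; i < n; i++)
		count[in[i].plen]++;
	for (p = 32; p >= 0; p--) {	/* buckets emitted 32 down to 0 */
		start[p] = pos;
		pos += count[p];
	}
	for (i = 0; i < n; i++)		/* stable: keeps address order per plen */
		out[start[in[i].plen]++] = in[i];
}
```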

Cheers.
--ro



Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-21 Thread Robert Olsson

David Miller writes:

 > Yes, this semaphore thing is highly problematic.  In the most crucial
 > areas where network driver consistency matters the most for ease of
 > understanding and debugging, the Intel drivers choose to be different
 > :-(
 > 
 > The way the napi_disable() logic breaks out from high packet load in
 > net_rx_action() is it simply returns even leaving interrupts disabled
 > when a pending napi_disable() is pending.
 > 
 > This is what trips up the semaphore logic.
 > 
 > Robert, give this patch a try.


 Yes, it works. e1000 tested for ~3 hours with very high load and 
 the interface going up/down every 5th sec. Without the patch the IRQs 
 get disabled within a couple of seconds.

 A resolute way of handling the semaphores. :)
   
 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>
 
 Cheers
--ro


 > In the long term this semaphore should be completely eliminated,
 > there is no justification for it.
 > 
 > Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
 > 
 > diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
 > index 0c9a6f7..76c0fa6 100644
 > --- a/drivers/net/e1000/e1000_main.c
 > +++ b/drivers/net/e1000/e1000_main.c
 > @@ -632,6 +632,7 @@ e1000_down(struct e1000_adapter *adapter)
 >  
 >  #ifdef CONFIG_E1000_NAPI
 >  napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 >  #endif
 >  e1000_irq_disable(adapter);
 >  
 > diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
 > index 2ab3bfb..9cc5a6b 100644
 > --- a/drivers/net/e1000e/netdev.c
 > +++ b/drivers/net/e1000e/netdev.c
 > @@ -2183,6 +2183,7 @@ void e1000e_down(struct e1000_adapter *adapter)
 >  msleep(10);
 >  
 >  napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 >  e1000_irq_disable(adapter);
 >  
 >  del_timer_sync(&adapter->watchdog_timer);
 > diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
 > index d2fb88d..4f63839 100644
 > --- a/drivers/net/ixgb/ixgb_main.c
 > +++ b/drivers/net/ixgb/ixgb_main.c
 > @@ -296,6 +296,11 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t 
 > kill_watchdog)
 >  {
 >  struct net_device *netdev = adapter->netdev;
 >  
 > +#ifdef CONFIG_IXGB_NAPI
 > +napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 > +#endif
 > +
 >  ixgb_irq_disable(adapter);
 >  free_irq(adapter->pdev->irq, netdev);
 >  
 > @@ -304,9 +309,7 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t 
 > kill_watchdog)
 >  
 >  if(kill_watchdog)
 >  del_timer_sync(&adapter->watchdog_timer);
 > -#ifdef CONFIG_IXGB_NAPI
 > -napi_disable(&adapter->napi);
 > -#endif
 > +
 >  adapter->link_speed = 0;
 >  adapter->link_duplex = 0;
 >  netif_carrier_off(netdev);
 > diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
 > index de3f45e..a4265bc 100644
 > --- a/drivers/net/ixgbe/ixgbe_main.c
 > +++ b/drivers/net/ixgbe/ixgbe_main.c
 > @@ -1409,9 +1409,11 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 >  IXGBE_WRITE_FLUSH(&adapter->hw);
 >  msleep(10);
 >  
 > +napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 > +
 >  ixgbe_irq_disable(adapter);
 >  
 > -napi_disable(&adapter->napi);
 >  del_timer_sync(&adapter->watchdog_timer);
 >  
 >  netif_carrier_off(netdev);


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-18 Thread Robert Olsson

David Miller writes:

 > > eth0 e1000_irq_enable sem = 1<- ifconfig eth0 down
 > > eth0 e1000_irq_disable sem = 2
 > > 
 > > **e1000_open <- ifconfig eth0 up
 > > eth0 e1000_irq_disable sem = 3  Dead. irq's can't be enabled
 > > e1000_irq_enable miss
 > > eth0 e1000_irq_enable sem = 2
 > > e1000_irq_enable miss
 > > eth0 e1000_irq_enable sem = 1
 > > ADDRCONF(NETDEV_UP): eth0: link is not ready
 > 
 > Yes, this semaphore thing is highly problematic.  In the most crucial
 > areas where network driver consistency matters the most for ease of
 > understanding and debugging, the Intel drivers choose to be different

 I don't understand the idea of a semaphore for enabling/disabling 
 IRQs either; the overall logic must be safer/better without it.  
 
 > The way the napi_disable() logic breaks out from high packet load in
 > net_rx_action() is it simply returns even leaving interrupts disabled
 > when a pending napi_disable() is pending.
 > 
 > This is what trips up the semaphore logic.
 > 
 > Robert, give this patch a try.
 > 
 > In the long term this semaphore should be completely eliminated,
 > there is no justification for it.

 It's on the testing list...

 Cheers
--ro


 > 
 > Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
 > 
 > diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
 > index 0c9a6f7..76c0fa6 100644
 > --- a/drivers/net/e1000/e1000_main.c
 > +++ b/drivers/net/e1000/e1000_main.c
 > @@ -632,6 +632,7 @@ e1000_down(struct e1000_adapter *adapter)
 >  
 >  #ifdef CONFIG_E1000_NAPI
 >  napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 >  #endif
 >  e1000_irq_disable(adapter);
 >  
 > diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
 > index 2ab3bfb..9cc5a6b 100644
 > --- a/drivers/net/e1000e/netdev.c
 > +++ b/drivers/net/e1000e/netdev.c
 > @@ -2183,6 +2183,7 @@ void e1000e_down(struct e1000_adapter *adapter)
 >  msleep(10);
 >  
 >  napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 >  e1000_irq_disable(adapter);
 >  
 >  del_timer_sync(&adapter->watchdog_timer);
 > diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
 > index d2fb88d..4f63839 100644
 > --- a/drivers/net/ixgb/ixgb_main.c
 > +++ b/drivers/net/ixgb/ixgb_main.c
 > @@ -296,6 +296,11 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t 
 > kill_watchdog)
 >  {
 >  struct net_device *netdev = adapter->netdev;
 >  
 > +#ifdef CONFIG_IXGB_NAPI
 > +napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 > +#endif
 > +
 >  ixgb_irq_disable(adapter);
 >  free_irq(adapter->pdev->irq, netdev);
 >  
 > @@ -304,9 +309,7 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t 
 > kill_watchdog)
 >  
 >  if(kill_watchdog)
 >  del_timer_sync(&adapter->watchdog_timer);
 > -#ifdef CONFIG_IXGB_NAPI
 > -napi_disable(&adapter->napi);
 > -#endif
 > +
 >  adapter->link_speed = 0;
 >  adapter->link_duplex = 0;
 >  netif_carrier_off(netdev);
 > diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
 > index de3f45e..a4265bc 100644
 > --- a/drivers/net/ixgbe/ixgbe_main.c
 > +++ b/drivers/net/ixgbe/ixgbe_main.c
 > @@ -1409,9 +1409,11 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 >  IXGBE_WRITE_FLUSH(&adapter->hw);
 >  msleep(10);
 >  
 > +napi_disable(&adapter->napi);
 > +atomic_set(&adapter->irq_sem, 0);
 > +
 >  ixgbe_irq_disable(adapter);
 >  
 > -napi_disable(&adapter->napi);
 >  del_timer_sync(&adapter->watchdog_timer);
 >  
 >  netif_carrier_off(netdev);


Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Robert Olsson

David Miller writes:
 > > On Wednesday 16 January 2008, David Miller wrote:
 > > > Ok, here is the patch I'll propose to fix this.  The goal is to make
 > > > it as simple as possible without regressing the thing we were trying
 > > > to fix.
 > > 
 > > Looks good to me. Tested with -rc8.
 > 
 > Thanks for testing.

 Yes, that code looks nice. I'm using the patch, but I've noticed another 
 phenomenon with the current e1000 driver. There is a race when taking a 
 device down at high traffic loads. I've tracked and instrumented it, and 
 it seems like occasionally irq_sem can get bumped up so interrupts can't 
 be enabled again.


eth0 e1000_irq_enable sem = 1<- High netload
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1<- ifconfig eth0 down
eth0 e1000_irq_disable sem = 2

**e1000_open <- ifconfig eth0 up
eth0 e1000_irq_disable sem = 3  Dead. irq's can't be enabled
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 2
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 1
ADDRCONF(NETDEV_UP): eth0: link is not ready


Cheers
--ro

static void
e1000_irq_disable(struct e1000_adapter *adapter)
{
atomic_inc(&adapter->irq_sem);
E1000_WRITE_REG(&adapter->hw, IMC, ~0);
E1000_WRITE_FLUSH(&adapter->hw);
synchronize_irq(adapter->pdev->irq);

if(adapter->netdev->ifindex == 3)
printk("%s e1000_irq_disable sem = %d\n",  adapter->netdev->name,
   atomic_read(&adapter->irq_sem));
}

static void
e1000_irq_enable(struct e1000_adapter *adapter)
{
if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
E1000_WRITE_FLUSH(&adapter->hw);
}
else
printk("e1000_irq_enable miss\n");

if(adapter->netdev->ifindex == 3)
  printk("%s e1000_irq_enable sem = %d\n",  adapter->netdev->name,
 atomic_read(&adapter->irq_sem));
}
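The trace above can be condensed into a toy counting model (a sketch, not the driver code): disable increments, enable only unmasks when the count returns to zero, so one unmatched disable leaves interrupts masked forever.

```c
#include <assert.h>

static int irq_sem;	/* outstanding disables */
static int irq_masked;	/* 1 = hardware interrupts masked */

static void irq_disable(void)
{
	irq_sem++;
	irq_masked = 1;
}

static void irq_enable(void)
{
	if (--irq_sem == 0)
		irq_masked = 0;	/* only the last enable unmasks */
	/* else: the "e1000_irq_enable miss" case -- stays masked */
}

/* Replay the posted trace; returns the final irq_sem value. */
static int run_scenario(void)
{
	irq_disable();
	irq_enable();	/* steady state: sem 1 -> 0, unmasked */

	irq_disable();	/* ifconfig down: sem = 1 */
	irq_disable();	/* second disable in the down path: sem = 2 */
	irq_disable();	/* e1000_open disables again: sem = 3 */

	irq_enable();	/* miss: sem = 2 */
	irq_enable();	/* miss: sem = 1 -- IRQs can never be enabled */
	return irq_sem;
}
```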


Re: [RFC 6/6] fib_trie: combine leaf and info

2008-01-15 Thread Robert Olsson


Stephen Hemminger writes:
 > This is how I did it:

 Yes, it looks like an elegant solution. Did you even test it?
 Maybe we see some effects in just dumping a full table?

 Anyway, lookup should be tested in some way. We can do a lot
 of analyzing before getting to the right entry: local_table
 backtracking, main lookup with eventual backtracking, etc. So
 hopefully we get paid for this work.

 Also it might be an idea to do some analysis of the fib_aliases
 list. Maybe the trick can be done again? ;)

 Cheers   
--ro  


 > --- a/net/ipv4/fib_trie.c   2008-01-15 09:14:53.0 -0800
 > +++ b/net/ipv4/fib_trie.c   2008-01-15 09:21:48.0 -0800
 > @@ -101,13 +101,6 @@ struct node {
 >  t_key key;
 >  };
 >  
 > -struct leaf {
 > -unsigned long parent;
 > -t_key key;
 > -struct hlist_head list;
 > -struct rcu_head rcu;
 > -};
 > -
 >  struct leaf_info {
 >  struct hlist_node hlist;
 >  struct rcu_head rcu;
 > @@ -115,6 +108,13 @@ struct leaf_info {
 >  struct list_head falh;
 >  };
 >  
 > +struct leaf {
 > +unsigned long parent;
 > +t_key key;
 > +struct hlist_head list;
 > +struct rcu_head rcu;
 > +};
 > +
 >  struct tnode {
 >  unsigned long parent;
 >  t_key key;
 > @@ -321,16 +321,6 @@ static void __leaf_free_rcu(struct rcu_h
 >  kmem_cache_free(trie_leaf_kmem, leaf);
 >  }
 >  
 > -static void __leaf_info_free_rcu(struct rcu_head *head)
 > -{
 > -kfree(container_of(head, struct leaf_info, rcu));
 > -}
 > -
 > -static inline void free_leaf_info(struct leaf_info *leaf)
 > -{
 > -call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 > -}
 > -
 >  static struct tnode *tnode_alloc(size_t size)
 >  {
 >  struct page *pages;
 > @@ -357,7 +347,7 @@ static void __tnode_free_rcu(struct rcu_
 >  free_pages((unsigned long)tn, get_order(size));
 >  }
 >  
 > -static inline void tnode_free(struct tnode *tn)
 > +static void tnode_free(struct tnode *tn)
 >  {
 >  if (IS_LEAF(tn)) {
 >  struct leaf *l = (struct leaf *) tn;
 > @@ -376,16 +366,41 @@ static struct leaf *leaf_new(void)
 >  return l;
 >  }
 >  
 > +static void leaf_info_init(struct leaf_info *li, int plen)
 > +{
 > +li->plen = plen;
 > +INIT_LIST_HEAD(&li->falh);
 > +}
 > +
 > +static struct leaf_info *leaf_info_first(struct leaf *l, int plen)
 > +{
 > +struct leaf_info *li = (struct leaf_info *) (l + 1);
 > +leaf_info_init(li, plen);
 > +return li;
 > +}
 > +
 >  static struct leaf_info *leaf_info_new(int plen)
 >  {
 >  struct leaf_info *li = kmalloc(sizeof(struct leaf_info),  GFP_KERNEL);
 > -if (li) {
 > -li->plen = plen;
 > -INIT_LIST_HEAD(&li->falh);
 > -}
 > +if (li)
 > +leaf_info_init(li, plen);
 > +
 >  return li;
 >  }
 >  
 > +static void __leaf_info_free_rcu(struct rcu_head *head)
 > +{
 > +kfree(container_of(head, struct leaf_info, rcu));
 > +}
 > +
 > +static inline void free_leaf_info(struct leaf *l, struct leaf_info *leaf)
 > +{
 > +if (leaf == (struct leaf_info *)(l + 1))
 > +return;
 > +
 > +call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 > +}
 > +
 >  static struct tnode* tnode_new(t_key key, int pos, int bits)
 >  {
 >  size_t sz = sizeof(struct tnode) + (sizeof(struct node *) << bits);
 > @@ -1047,18 +1062,13 @@ static struct list_head *fib_insert_node
 >  insert_leaf_info(&l->list, li);
 >  goto done;
 >  }
 > -l = leaf_new();
 >  
 > +l = leaf_new();
 >  if (!l)
 >  return NULL;
 >  
 >  l->key = key;
 > -li = leaf_info_new(plen);
 > -
 > -if (!li) {
 > -tnode_free((struct tnode *) l);
 > -return NULL;
 > -}
 > +li = leaf_info_first(l, plen);
 >  
 >  fa_head = &li->falh;
 >  insert_leaf_info(&l->list, li);
 > @@ -1091,7 +1101,7 @@ static struct list_head *fib_insert_node
 >  }
 >  
 >  if (!tn) {
 > -free_leaf_info(li);
 > +free_leaf_info(l, li);
 >  tnode_free((struct tnode *) l);
 >  return NULL;
 >  }
 > @@ -1624,7 +1634,7 @@ static int fn_trie_delete(struct fib_tab
 >  
 >  if (list_empty(fa_head)) {
 >  hlist_del_rcu(&li->hlist);
 > -free_leaf_info(li);
 > +free_leaf_info(l, li);
 >  }
 >  
 >  if (hlist_empty(&l->list))
 > @@ -1668,7 +1678,7 @@ static int trie_flush_leaf(struct trie *
 >  
 >  if (list_empty(&li->falh)) {
 >  hlist_del_rcu(&li->hlist);
 > -free_leaf_info(li);
 > +free_leaf_info(l, li);
 >  }
 >  }
 >  return found;
 > @@ -1935,7 +1945,8 @@ void __init fib_hash_init(void)
 >  fn_alias_kmem = kmem_cache_create("ip_fib_alias", sizeof(struct 
 > fib_alias),
 >0, SLAB_PANIC, NULL);
 >  
 > -

Re: [RFC 6/6] fib_trie: combine leaf and info

2008-01-15 Thread Robert Olsson

Eric Dumazet writes:

 > 
 > So you think that a leaf cannot have 2 infos, one 'embeded' and one in the 
 > list ?
 
 Hello,
 
 The model I thought of is to have either: 

 1) One leaf_info embedded in the leaf: a fast-path leaf (FP-leaf).
 Or 
 2) The intact old leaf_info list with an arbitrary number of leaf_infos.

 If plen_iinfo is >= 0, it's an FP-leaf.
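In code terms the discriminant might look like this (a model with illustrative names, not the actual patch structures):

```c
/* Model of the two leaf flavours: plen_iinfo >= 0 marks an FP-leaf
 * whose single info is embedded; otherwise the old list is used. */
struct leaf_model {
	int plen_iinfo;	/* >= 0: FP-leaf with one embedded info */
	int has_list;	/* stand-in for the old leaf_info hlist  */
};

/* Mirrors the get_fa_head() logic discussed: an FP-leaf answers only
 * for its embedded prefix length; otherwise fall back to the list. */
static int fp_leaf_matches(const struct leaf_model *l, int plen)
{
	if (l->plen_iinfo >= 0)
		return l->plen_iinfo == plen;
	return l->has_list;
}
```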
 
 Cheers.
--ro



Re: [RFC 6/6] fib_trie: combine leaf and info

2008-01-15 Thread Robert Olsson

Stephen Hemminger writes:

 > Okay, I would rather see the leaf_info explicit inside the leaf, also
 > your scheme probably breaks if I add two prefixes and then delete the first.
 > Let me have a go at it.

 I took Eric's patch a bit further... 
 
 Support for delete and dump is needed before any testing at all,
 and maybe some macros for checking and setting FP-leafs.

 Cheers.
--ro


diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 6dab753..f5b276c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -97,22 +97,32 @@ typedef unsigned int t_key;
 #define IS_LEAF(n) (n->parent & T_LEAF)
 
 struct node {
-   unsigned long parent;
-   t_key key;
+   unsigned long   parent;
+   t_key   key;
 };
 
 struct leaf {
-   unsigned long parent;
-   t_key key;
-   struct hlist_head list;
-   struct rcu_head rcu;
+   unsigned long   parent;
+   t_key   key;
+   /*
+    * Because we often have only one info per leaf, we embed one here
+    * to save some space and speed up lookups (sharing a cache line):
+    * a sort of fastpath leaf (FP-leaf), indicated by a non-negative
+    * plen_iinfo.
+    * Note : not inside a structure so that we dont have holes on 64bit 
+    */
+   int plen_iinfo;
+   struct list_headfalh_iinfo;
+
+   struct hlist_head   list;
+   struct rcu_head rcu;
 };
 
 struct leaf_info {
-   struct hlist_node hlist;
-   struct rcu_head rcu;
-   int plen;
-   struct list_head falh;
+   struct hlist_node   hlist;
+   int plen;
+   struct list_headfalh;
+   struct rcu_head rcu;
 };
 
 struct tnode {
@@ -364,11 +374,13 @@ static inline void tnode_free(struct tnode *tn)
call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
-static struct leaf *leaf_new(void)
+static struct leaf *leaf_new(int plen)
 {
struct leaf *l = kmalloc(sizeof(struct leaf),  GFP_KERNEL);
if (l) {
l->parent = T_LEAF;
+   l->plen_iinfo = plen;
+   INIT_LIST_HEAD(&l->falh_iinfo);
INIT_HLIST_HEAD(&l->list);
}
return l;
@@ -890,7 +902,16 @@ static struct leaf_info *find_leaf_info(struct leaf *l, 
int plen)
 
 static inline struct list_head * get_fa_head(struct leaf *l, int plen)
 {
-   struct leaf_info *li = find_leaf_info(l, plen);
+   struct leaf_info *li;
+
+   if (l->plen_iinfo >= 0) {
+   if(l->plen_iinfo == plen)
+   return &l->falh_iinfo;
+   else
+   return NULL;
+   }
+
+   li = find_leaf_info(l, plen);
 
if (!li)
return NULL;
@@ -1036,6 +1057,21 @@ static struct list_head *fib_insert_node(struct trie *t, 
u32 key, int plen)
 
if (n != NULL && IS_LEAF(n) && tkey_equals(key, n->key)) {
l = (struct leaf *) n;
+
+   /* 
+    * Check if it is a FP-leaf; if so we have to convert 
+    * it to an ordinary leaf with leaf_infos.
+    */
+
+   if( l->plen_iinfo >= 0 ) {
+   li = leaf_info_new(l->plen_iinfo);
+   if (!li)
+   return NULL;
+
+   insert_leaf_info(&l->list, li);
+   l->plen_iinfo = -1;
+   }
+
li = leaf_info_new(plen);
 
if (!li)
@@ -1045,21 +1081,16 @@ static struct list_head *fib_insert_node(struct trie 
*t, u32 key, int plen)
insert_leaf_info(&l->list, li);
goto done;
}
-   l = leaf_new();
+
+   /* Insert a FP-leaf */
+   l = leaf_new(plen);
 
if (!l)
return NULL;
 
l->key = key;
-   li = leaf_info_new(plen);
-
-   if (!li) {
-   tnode_free((struct tnode *) l);
-   return NULL;
-   }
+   fa_head = &l->falh_iinfo;
 
-   fa_head = &li->falh;
-   insert_leaf_info(&l->list, li);
 
if (t->trie && n == NULL) {
/* Case 2: n is NULL, and will just insert a new leaf */
@@ -1089,7 +1120,6 @@ static struct list_head *fib_insert_node(struct trie *t, 
u32 key, int plen)
}
 
if (!tn) {
-   free_leaf_info(li);
tnode_free((struct tnode *) l);
return NULL;
}
@@ -1285,6 +1315,20 @@ static inline int check_leaf(struct trie *t, struct leaf 
*l,
struct hlist_head *hhead = &l->list;
struct hlist_node *node;
 
+   i = l->plen_iinfo;  
+   if( i >= 0) {   /* FP-leaf */
+   mask = inet_make_mask(i);
+   err = fib_semantic_match(&l->falh_iinfo, flp, res, 
htonl(l->key), mask, i);
+   if(err <= 0 ) 

Re: [PATCH 9/9] fix sparse warnings

2008-01-14 Thread Robert Olsson

Eric Dumazet writes:

 > > Thats 231173/241649 = 96% with the current Internet routing.
 > >
 > > How about if would have a fastpath and store one entry direct in the 
 > > leaf struct this to avoid loading the leaf_info list in most cases?
 > >
 > > One could believe that both lookup and dump could improve.
 > >
 > You mean to include one "leaf_info" inside leaf structure, so that we 
 > can access it without cache line miss ?

 Yes.

 Cheers
--ro


Re: [PATCH 2/9] get rid of unused revision element

2008-01-14 Thread Robert Olsson

David Miller writes:

 > > The revision element must have been part of an earlier design,
 > > because currently it is set but never used.
 > > 
 > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
 > 
 > I suspect Robert wanted to play around with some generation
 > ID optimizations but never got around to it.

 The idea was to have a selective flush of route cache entries when
 a fib insert/delete happened. From what I remember you added another/
 better solution: just a list with route cache entries pointing to the 
 parent route. So yes, this was obsoleted by your/our effort to avoid 
 total flushing of the route cache. Unfinished work.
 
 According to http://bgpupdates.potaroo.net/instability/bgpupd.html
 (last in page) we currently flush the route cache 2.80 times per second 
 when using full Internet routing with Linux. Maybe we're forced to pick 
 up this thread again someday.

 Cheers.
--ro
 
 


Re: [PATCH 9/9] fix sparse warnings

2008-01-14 Thread Robert Olsson

Thanks for hacking on and improving the trie... Another idea that could
also be tested: if we look into the routing table we see that most leaves 
only have one prefix.

Main:
Aver depth: 2.57
Max depth:  7
Leaves: 231173

ip route | wc -l 
241649

That's 231173/241649 = 96% with the current Internet routing.

How about having a fastpath and storing one entry directly in the 
leaf struct, to avoid loading the leaf_info list in most cases?

One could believe that both lookup and dump could improve.

Cheers.
--ro



Stephen Hemminger writes:

 > Remember that the code should be optimized for lookup, not management
 > operations. We ran into this during testing (the test suite was looking
 > for number of routes), thats why I put in the size field.
 > 
 > The existing dump code is really slow:
 > 
 > 1) FIB_TRIE   Under KVM:
 >  load 164393 routes  12.436 sec
 >  ip route | wc -l12.569 sec
 >  grep /proc/net/route25.357 sec
 > 
 > 99% of the cpu time is spent in nextleaf() during these dump operations.
 > 
 > 2) FIB_HASH  Under KVM:
 >  load 164393 routes  10.833 sec
 >  ip route | wc -l1.981 sec
 >  grep /proc/net/route0.204 sec


[FIB]: full_children & empty_children should be uint, not ushort

2008-01-13 Thread Robert Olsson

Eric Dumazet writes:
 > Eric Dumazet a écrit :
 > > 4) full_children & empty_children being 'unsigned short',
 > >we probably are limited to 2^15 elements, but I could not
 > >find this limit enforced somewhere.

 > Two fixes are possible : Enlarge full_children & empty_children to 32bits, 
 > or 
 > force a limit in code to never exceed 2^15 children in a tnode. I chose the 
 > first solution since it can be done with 0 memory cost on 64bit arches.

 Hello,

 Thanks for spotting this. No, we don't want to put limits on the (root) 
 node size. You see, the comment in the code is correct, so the unsigned 
 shorts are some leftover from old testing which could have hit us hard 
 as the routing table slowly grows.
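The wrap is easy to demonstrate: a root tnode with bits = 16 initialises empty_children to 1 << 16 = 65536, which an unsigned short silently truncates to 0. This sketch just shows the arithmetic, not the trie code:

```c
#include <limits.h>

/* What tnode_new()'s "tn->empty_children = 1 << bits" stores when the
 * field is an unsigned short: the value modulo 65536. */
static unsigned short wrapped_children(int bits)
{
	return (unsigned short)(1 << bits);
}
```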

 Cheers
        --ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>
 

 > [FIB]: full_children & empty_children should be uint, not ushort
 > 
 > If declared as unsigned short, these fields can overflow, and whole trie 
 > logic
 > is broken. I could not make the machine crash, but some tnode can never
 > be freed.
 > 
 > Note for 64 bit arches : By reordering t_key and parent in [node, leaf, 
 > tnode] 
 > structures, we can use 32 bits hole after t_key so that sizeof(struct tnode) 
 > doesnt change after this patch.
 > 
 > Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index f26ba31..9696722 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -97,13 +97,13 @@ typedef unsigned int t_key;
 >  #define IS_LEAF(n) (n->parent & T_LEAF)
 >  
 >  struct node {
 > -t_key key;
 >  unsigned long parent;
 > +t_key key;
 >  };
 >  
 >  struct leaf {
 > -t_key key;
 >  unsigned long parent;
 > +t_key key;
 >  struct hlist_head list;
 >  struct rcu_head rcu;
 >  };
 > @@ -116,12 +116,12 @@ struct leaf_info {
 >  };
 >  
 >  struct tnode {
 > -t_key key;
 >  unsigned long parent;
 > +t_key key;
 >  unsigned char pos;  /* 2log(KEYLENGTH) bits needed */
 >  unsigned char bits; /* 2log(KEYLENGTH) bits needed */
 > -unsigned short full_children;   /* KEYLENGTH bits needed */
 > -unsigned short empty_children;  /* KEYLENGTH bits needed */
 > +unsigned int full_children; /* KEYLENGTH bits needed */
 > +unsigned int empty_children;/* KEYLENGTH bits needed */
 >  struct rcu_head rcu;
 >  struct node *child[0];
 >  };
 > @@ -329,12 +329,12 @@ static inline void free_leaf_info(struct leaf_info 
 > *leaf)
 >  call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 >  }
 >  
 > -static struct tnode *tnode_alloc(unsigned int size)
 > +static struct tnode *tnode_alloc(size_t size)
 >  {
 >  struct page *pages;
 >  
 >  if (size <= PAGE_SIZE)
 > -return kcalloc(size, 1, GFP_KERNEL);
 > +return kzalloc(size, GFP_KERNEL);
 >  
 >  pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
 >  if (!pages)
 > @@ -346,8 +346,8 @@ static struct tnode *tnode_alloc(unsigned int size)
 >  static void __tnode_free_rcu(struct rcu_head *head)
 >  {
 >  struct tnode *tn = container_of(head, struct tnode, rcu);
 > -unsigned int size = sizeof(struct tnode) +
 > -(1 << tn->bits) * sizeof(struct node *);
 > +size_t size = sizeof(struct tnode) +
 > +  (sizeof(struct node *) << tn->bits);
 >  
 >  if (size <= PAGE_SIZE)
 >  kfree(tn);
 > @@ -386,8 +386,7 @@ static struct leaf_info *leaf_info_new(int plen)
 >  
 >  static struct tnode* tnode_new(t_key key, int pos, int bits)
 >  {
 > -int nchildren = 1<<bits;
 > -int sz = sizeof(struct tnode) + nchildren * sizeof(struct node *);
 > +size_t sz = sizeof(struct tnode) + (sizeof(struct node *) << bits);
 >  struct tnode *tn = tnode_alloc(sz);
 >  
 >  if (tn) {
 > @@ -399,8 +398,8 @@ static struct tnode* tnode_new(t_key key, int pos, int 
 > bits)
 >  tn->empty_children = 1<<bits;
 >  }
 >  
 > -pr_debug("AT %p s=%u %u\n", tn, (unsigned int) sizeof(struct tnode),
 > - (unsigned int) (sizeof(struct node) * 1<<bits));
 > +pr_debug("AT %p s=%u %lu\n", tn, (unsigned int) sizeof(struct tnode),
 > + (unsigned long) (sizeof(struct node) << bits));
 >  return tn;
 >  }
 >  


Re: [RFC] net: napi fix

2007-12-20 Thread Robert Olsson

David Miller writes:

 > > Is the netif_running() check even required?
 > 
 > No, it is not.
 > 
 > When a device is brought down, one of the first things
 > that happens is that we wait for all pending NAPI polls
 > to complete, then block any new polls from starting.

 Hello!

 Yes, but the reason was to not wait for all pending polls to
 complete, so that a server/router can be rebooted even under high
 load and DoS. We've experienced some nasty problems with this.

 Cheers.
--ro


Re: net_rx_action/NAPI oops [PATCH]

2007-11-30 Thread Robert Olsson

 Hello!

 After further investigations. The bug was just in front of us...

 ifconfig down in combination with the test for || !netif_running() can 
 return a full quota with netif_rx_complete() already done, which causes 
 the oops in net_rx_action. Of course the load must be high enough to 
 fill the quota.

 I've found a bunch of drivers having this bug.
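The contract being violated can be modelled in a few lines (a sketch with illustrative names, not the kernel code): net_rx_action assumes a poll handler that consumed its whole budget did not complete, so completing while returning work_done == budget leaves both sides thinking they still own the napi struct.

```c
/* 1 in ->completed stands for a netif_rx_complete() call. */
struct napi_model {
	int completed;
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* Buggy poll: device going down, quota filled -- completes anyway
 * and still reports a full budget. */
static int poll_buggy(struct napi_model *n, int budget, int pkts, int running)
{
	int work = min_int(pkts, budget);

	if (work < budget || !running)
		n->completed = 1;
	return work;
}

/* Fixed poll: trim work_done first, so a completing handler never
 * returns the full budget. */
static int poll_fixed(struct napi_model *n, int budget, int pkts, int running)
{
	int work = min_int(pkts, budget);

	if (work < budget || !running) {
		if (work == budget)
			work--;
		n->completed = 1;
	}
	return work;
}
```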

 Cheers.
--ro

 
Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index cf39473..f4137ad 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3947,6 +3947,10 @@ e1000_clean(struct napi_struct *napi, int budget)
 quit_polling:
if (likely(adapter->itr_setting & 3))
e1000_set_itr(adapter);
+
+   if(work_done == budget)
+   work_done--;
+   
netif_rx_complete(poll_dev, napi);
e1000_irq_enable(adapter);
}
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4fd2e23..e43b5ca 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1408,6 +1408,9 @@ static int e1000_clean(struct napi_struct *napi, int 
budget)
if ((!tx_cleaned && (work_done < budget)) ||
   !netif_running(poll_dev)) {
 quit_polling:
+   if(work_done == budget)
+   work_done--;
+
if (adapter->itr_setting & 3)
e1000_set_itr(adapter);
netif_rx_complete(poll_dev, napi);
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 3021234..e3064ef 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1783,6 +1783,10 @@ ixgb_clean(struct napi_struct *napi, int budget)
 
/* if no Tx and not enough Rx work done, exit the polling mode */
if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
+
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
ixgb_irq_enable(adapter);
}
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 00bc525..204f5fa 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -579,6 +579,9 @@ static int ixgbe_clean_rxonly(struct napi_struct *napi, int budget)
/* If no Tx and not enough Rx work done, exit the polling mode */
if ((work_done < budget) || !netif_running(netdev)) {
 quit_polling:
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
if (!test_bit(__IXGBE_DOWN, &adapter->state))
IXGBE_WRITE_REG(&adapter->hw, IXGBE_EIMS,
@@ -1483,6 +1486,9 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
if ((!tx_cleaned && (work_done < budget)) ||
!netif_running(adapter->netdev)) {
 quit_polling:
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
ixgbe_irq_enable(adapter);
}
diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 3dbaec6..c566491 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1998,6 +1998,10 @@ static int e100_poll(struct napi_struct *napi, int budget)
 
/* If no Rx and Tx cleanup work was done, exit polling mode. */
if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
+
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
e100_enable_irq(nic);
}


Re: net_rx_action/NAPI oops [PATCH]

2007-11-28 Thread Robert Olsson


 No, it doesn't. Besides, napi_disable and napi_synchronize are identical.
 I was trying to disarm interrupts this way too.

 The patch I sent yesterday is the only cure so far, but I don't know if
 it's 100% bulletproof either.

 I was stress-testing the patch but ran into new problems... (scheduling)
 
 Cheers.
--ro


Stephen Hemminger writes:
 > Would this fix it?
 > 
 > --- a/drivers/net/e1000/e1000_main.c 2007-11-15 21:13:12.0 -0800
 > +++ b/drivers/net/e1000/e1000_main.c 2007-11-28 08:37:03.0 -0800
 > @@ -630,10 +630,10 @@ e1000_down(struct e1000_adapter *adapter
 >   * reschedule our watchdog timer */
 >  set_bit(__E1000_DOWN, &adapter->flags);
 >  
 > +e1000_irq_disable(adapter);
 >  #ifdef CONFIG_E1000_NAPI
 > -napi_disable(&adapter->napi);
 > +napi_synchronize(&adapter->napi);
 >  #endif
 > -e1000_irq_disable(adapter);
 >  
 >  del_timer_sync(&adapter->tx_fifo_stall_timer);
 >  del_timer_sync(&adapter->watchdog_timer);


Re: net_rx_action/NAPI oops [PATCH]

2007-11-28 Thread Robert Olsson

Stephen Hemminger writes:

 > It is considered a driver bug in 2.6.24 to call netif_rx_complete (clear 
 > NAPI_STATE_SCHED)
 > and do a full quota. That bug already had to be fixed in other drivers,
 > look like e1000 has same problem.

 From what I see, the problem is not related to ->poll, but it's a variant
 of the same theme: here it's napi_disable in _down, via ifconfig down,
 that clears NAPI_STATE_SCHED while the driver delivers a full quota.
 
 Cheers
--ro


Re: net_rx_action/NAPI oops [PATCH]

2007-11-28 Thread Robert Olsson

Kok, Auke writes:
 > 
 > Robert, please give that patch a try (it fixes a crash that I had here as 
 > well)
 > and let us know if it works for you.

 No, it doesn't cure the problem I've reported.

 Cheers.
--ro


 BTW, you can try to reproduce the problem... Let pktgen feed a couple of
 streams into some interfaces (I use two), then ifconfig down one of the
 incoming interfaces.


net_rx_action/NAPI oops [PATCH]

2007-11-27 Thread Robert Olsson

Hello!

I've discovered a bug while testing the new multiqueue NAPI code. In
high-load situations, when we take down an interface we get a kernel panic.
The oops is below.

From what I see, this happens when the driver does napi_disable() and clears
NAPI_STATE_SCHED. In net_rx_action there is a check for work == weight, a
sort of indirect test, but that's no longer enough to cover the load
situation where NAPI_STATE_SCHED has been cleared (by e1000_down in my case)
and we still have a full quota. This is against the latest git, but I'd
guess it is the same in all recent kernels.
There might be different solutions... one variant is below:

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/core/dev.c b/net/core/dev.c
index 043e2f8..1031233 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2207,7 +2207,7 @@ static void net_rx_action(struct softirq_action *h)
 * still "owns" the NAPI instance and therefore can
 * move the instance around on the list at-will.
 */
-   if (unlikely(work == weight))
+   if (unlikely(work == weight) && (test_bit(NAPI_STATE_SCHED, &n->state)))
list_move_tail(&n->poll_list, list);
 
netpoll_poll_unlock(have);


Cheers
--ro


labb:/# ifconfig  eth0 down 

BUG: unable to handle kernel paging request at virtual address 00100104
printing eip: c0433d67 *pde =  
Oops: 0002 [#1] SMP 
Modules linked in:

Pid: 4, comm: ksoftirqd/0 Not tainted (2.6.24-rc3bifrost-gb3664d45-dirty #32)
EIP: 0060:[] EFLAGS: 00010046 CPU: 0
EIP is at net_rx_action+0x107/0x120
EAX: 00100100 EBX: f757d4e0 ECX: c200d334 EDX: 00200200
ESI: 0040 EDI: c200d334 EBP: 00ec ESP: f7c6bf78
 DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
Process ksoftirqd/0 (pid: 4, ti=f7c6a000 task=f7c58ab0 task.ti=f7c6a000)
Stack: c0236217 c200ce9c c200ce9c  fffcf892 0040 0005 c05b2a98 
   c0603e60 0008 c022a275  c06066c0 c06066c0 0246  
   c022a5e0  c022a327 c06066c0 c022a636 fffc  c02384f2 
Call Trace:
 [] __rcu_process_callbacks+0x107/0x190
 [] __do_softirq+0x75/0xf0
 [] ksoftirqd+0x0/0xd0
 [] do_softirq+0x37/0x40
 [] ksoftirqd+0x56/0xd0
 [] kthread+0x42/0x70
 [] kthread+0x0/0x70
 [] kernel_thread_helper+0x7/0x18
 ===
Code: 88 8c 52 c0 e8 4b 1d df ff e8 96 0c dd ff c7 05 64 7d 63 c0 01 00 00 00 
e9 61 ff ff ff 8d b4
 26 00 00 00 00 8b 03 8b 53 04 89 f9 <89> 50 04 89 02 89 d8 8b 57 04 e8 5a 34 
eb ff e9 4a ff ff ff
 90 
EIP: [] net_rx_action+0x107/0x120 SS:ESP 0068:f7c6bf78
Kernel panic - not syncing: Fatal exception in interrupt


Re: pktgen question

2007-09-24 Thread Robert Olsson

Hi,

Steve Wise writes:
 > I think pktgen should be cloning the skbs using skb_clone().  Then it 
 > will work for all devices, eh?

 pktgen assumes, for "fastpath" sending, exclusive ownership of
 the skb, and does a skb_get to avoid final skb destruction so
 the same skb can be sent over and over. The idea is to avoid
 memory allocation and keep things in cache to give very high
 packet rates with identical packets.

 But if you need to alter the packet then the skb_get trick can't
 be done, and you have to turn off "fastpath" with clone_skb.

 > Perf-wise, you could clone the skbs up front, then deliver them to the 
 > nic in a tight loop.  This would mitigate the added overhead introduced 
 > by calling skb_clone() in the loop doing transmits...

 Sure, it can be done. It could replay sequences etc., but it will not
 beat the skb_get trick for sending identical packets. It has been
 proposed before, but I've avoided such efforts to keep things relatively
 small and simple. Really, pktgen should be reworked to have a small
 shim in the kernel and move the rest of the stuff to userland.

 Cheers.
--ro


Re: [PATCH] [-MM, FIX V3] e1000e: incorporate napi_struct changes from net-2.6.24.git

2007-09-08 Thread Robert Olsson

Kok, Auke writes:

> david,
 > 
 > while testing this patch I noticed that the poll routine is now called 
 > 100% of the time, and since I'm not doing much different than before, I 
 > suspect that something in the new napi code is staying in polling mode 
 > forever? Since e1000e is pretty much the same code as e1000, I doubt the 
 > problem is there, but you can probably tell better. ideas?

 Hello,
 
 Yes, a correct observation. I've spotted this bug too; it is caused by the
 policy change in the NAPI scheduling. Look at tx_cleaned.


 I suggest we revert this change for now.
 

 Cheers
        --ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


 diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 7b0bcdb..5cb883a 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3944,7 +3944,7 @@ e1000_clean(struct napi_struct *napi, int budget)
  &work_done, budget);
 
/* If no Tx and not enough Rx work done, exit the polling mode */
-   if ((tx_cleaned && (work_done < budget)) ||
+   if ((!tx_cleaned && (work_done == 0)) ||
   !netif_running(poll_dev)) {
 quit_polling:
if (likely(adapter->itr_setting & 3))






Re: pktgen terminating condition

2007-08-29 Thread Robert Olsson

jamal writes:
 > On Tue, 2007-28-08 at 21:43 -0700, Mandeep Singh Baines wrote:

 > I think its a good thing pktgen caught this; i am unsure however if it
 > is doing the right thing. Hoping Robert would respond.
 > One thing pktgen could do is restrict the amount of outstanding buffers
 > by using accounting the way sockets do - this at least wont stop
 > progress as it did in your case.

 Hello,

 Yes, it's a synchronization issue... the test is over and we have sent
 all pkts to the device, but pktgen cannot free the skb since it still
 has refcounts.

 IMO most drivers must have provisions to handle situations like this.
 I guess we can discuss whether a last-resort timer should be us, ms,
 or possibly seconds, but we shouldn't need a ping to make this happen.
 If we did, we'd probably have an ICMP packet sitting there waiting instead.

 Cheers
--ro


Re: Remove softirq scheduling from pktgen [PATCH]

2007-08-27 Thread Robert Olsson

Christoph Hellwig writes:

 > > Hello, It's not a job for pktgen.
 > 
 > Please also kill the do_softirq export while you're at it.


 Right, seems like pktgen luckily was the only user.

 Cheers
--ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0f546dd..dbbdcd7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -271,8 +271,6 @@ asmlinkage void do_softirq(void)
local_irq_restore(flags);
 }
 
-EXPORT_SYMBOL(do_softirq);
-
 #endif
 
 /*


Remove softirq scheduling from pktgen [PATCH]

2007-08-27 Thread Robert Olsson


Hello, It's not a job for pktgen.

Cheers
--ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 18601af..975e887 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -164,7 +164,7 @@
 #include  /* do_div */
 #include 
 
-#define VERSION  "pktgen v2.68: Packet Generator for packet performance testing.\n"
+#define VERSION  "pktgen v2.69: Packet Generator for packet performance testing.\n"
 
 /* The buckets are exponential in 'width' */
 #define LAT_BUCKETS_MAX 32
@@ -381,7 +381,6 @@ struct pktgen_thread {
struct list_head th_list;
struct task_struct *tsk;
char result[512];
-   u32 max_before_softirq; /* We'll call do_softirq to prevent starvation. */
 
	/* Field for thread to receive "posted" events terminate, stop ifs etc. */
 
@@ -1752,9 +1751,6 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 
BUG_ON(!t);
 
-   seq_printf(seq, "Name: %s  max_before_softirq: %d\n",
-  t->tsk->comm, t->max_before_softirq);
-
seq_printf(seq, "Running: ");
 
if_lock(t);
@@ -1787,7 +1783,6 @@ static ssize_t pktgen_thread_write(struct file *file,
int i = 0, max, len, ret;
char name[40];
char *pg_result;
-   unsigned long value = 0;
 
if (count < 1) {
//  sprintf(pg_result, "Wrong command format");
@@ -1861,12 +1856,8 @@ static ssize_t pktgen_thread_write(struct file *file,
}
 
if (!strcmp(name, "max_before_softirq")) {
-   len = num_arg(&user_buffer[i], 10, &value);
-   mutex_lock(&pktgen_thread_lock);
-   t->max_before_softirq = value;
-   mutex_unlock(&pktgen_thread_lock);
+   sprintf(pg_result, "OK: Note! max_before_softirq is obsoleted -- Do not use");
ret = count;
-   sprintf(pg_result, "OK: max_before_softirq=%lu", value);
goto out;
}
 
@@ -2145,7 +2136,6 @@ static void spin(struct pktgen_dev *pkt_dev, __u64 spin_until_us)
if (spin_until_us - now > jiffies_to_usecs(1) + 1)
schedule_timeout_interruptible(1);
else if (spin_until_us - now > 100) {
-   do_softirq();
if (!pkt_dev->running)
return;
if (need_resched())
@@ -3515,8 +3505,6 @@ static int pktgen_thread_worker(void *arg)
struct pktgen_thread *t = arg;
struct pktgen_dev *pkt_dev = NULL;
int cpu = t->cpu;
-   u32 max_before_softirq;
-   u32 tx_since_softirq = 0;
 
BUG_ON(smp_processor_id() != cpu);
 
@@ -3526,8 +3514,6 @@ static int pktgen_thread_worker(void *arg)
 
pr_debug("pktgen: starting pktgen/%d:  pid=%d\n", cpu, current->pid);
 
-   max_before_softirq = t->max_before_softirq;
-
set_current_state(TASK_INTERRUPTIBLE);
 
set_freezable();
@@ -3546,24 +3532,9 @@ static int pktgen_thread_worker(void *arg)
 
__set_current_state(TASK_RUNNING);
 
-   if (pkt_dev) {
-
+   if (pkt_dev) 
pktgen_xmit(pkt_dev);
 
-   /*
-* We like to stay RUNNING but must also give
-* others fair share.
-*/
-
-   tx_since_softirq += pkt_dev->last_ok;
-
-   if (tx_since_softirq > max_before_softirq) {
-   if (local_softirq_pending())
-   do_softirq();
-   tx_since_softirq = 0;
-   }
-   }
-
if (t->control & T_STOP) {
pktgen_stop(t);
t->control &= ~(T_STOP);


pktgen Multiqueue support [PATCH]

2007-08-27 Thread Robert Olsson


Hello,

Below is some pktgen support for sending into different TX queues.
This can of course be fed into input queues on other machines.

Cheers
--ro



Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index a0db4d1..18601af 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -186,6 +186,7 @@
 #define F_SVID_RND(1<<10)  /* Random SVLAN ID */
 #define F_FLOW_SEQ(1<<11)  /* Sequential flows */
 #define F_IPSEC_ON(1<<12)  /* ipsec on for flows */
+#define F_QUEUE_MAP_RND (1<<13)/* queue map Random */
 
 /* Thread control flag bits */
 #define T_TERMINATE   (1<<0)
@@ -328,6 +329,7 @@ struct pktgen_dev {
__be32 cur_daddr;
__u16 cur_udp_dst;
__u16 cur_udp_src;
+   __u16 cur_queue_map;
__u32 cur_pkt_size;
 
__u8 hh[14];
@@ -355,6 +357,10 @@ struct pktgen_dev {
unsigned lflow; /* Flow length  (config) */
unsigned nflows;/* accumulated flows (stats) */
unsigned curfl; /* current sequenced flow (state)*/
+
+   u16 queue_map_min;
+   u16 queue_map_max;
+
 #ifdef CONFIG_XFRM
__u8ipsmode;/* IPSEC mode (config) */
__u8ipsproto;   /* IPSEC type (config) */
@@ -611,6 +617,11 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
seq_printf(seq, " flows: %u flowlen: %u\n", pkt_dev->cflows,
   pkt_dev->lflow);
 
+   seq_printf(seq,
+  " queue_map_min: %u  queue_map_max: %u\n",
+  pkt_dev->queue_map_min,
+  pkt_dev->queue_map_max);
+
if (pkt_dev->flags & F_IPV6) {
char b1[128], b2[128], b3[128];
fmt_ip6(b1, pkt_dev->in6_saddr.s6_addr);
@@ -707,6 +718,9 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
if (pkt_dev->flags & F_MPLS_RND)
seq_printf(seq,  "MPLS_RND  ");
 
+   if (pkt_dev->flags & F_QUEUE_MAP_RND)
+   seq_printf(seq,  "QUEUE_MAP_RND  ");
+
if (pkt_dev->cflows) {
if (pkt_dev->flags & F_FLOW_SEQ)
seq_printf(seq,  "FLOW_SEQ  "); /*in sequence flows*/
@@ -762,6 +776,8 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
seq_printf(seq, " cur_udp_dst: %d  cur_udp_src: %d\n",
   pkt_dev->cur_udp_dst, pkt_dev->cur_udp_src);
 
+   seq_printf(seq, " cur_queue_map: %u\n", pkt_dev->cur_queue_map);
+
seq_printf(seq, " flows: %u\n", pkt_dev->nflows);
 
if (pkt_dev->result[0])
@@ -1213,6 +1229,11 @@ static ssize_t pktgen_if_write(struct file *file,
else if (strcmp(f, "FLOW_SEQ") == 0)
pkt_dev->flags |= F_FLOW_SEQ;
 
+   else if (strcmp(f, "QUEUE_MAP_RND") == 0)
+   pkt_dev->flags |= F_QUEUE_MAP_RND;
+
+   else if (strcmp(f, "!QUEUE_MAP_RND") == 0)
+   pkt_dev->flags &= ~F_QUEUE_MAP_RND;
 #ifdef CONFIG_XFRM
else if (strcmp(f, "IPSEC") == 0)
pkt_dev->flags |= F_IPSEC_ON;
@@ -1517,6 +1538,28 @@ static ssize_t pktgen_if_write(struct file *file,
return count;
}
 
+   if (!strcmp(name, "queue_map_min")) {
+   len = num_arg(&user_buffer[i], 5, &value);
+   if (len < 0) {
+   return len;
+   }
+   i += len;
+   pkt_dev->queue_map_min = value;
+   sprintf(pg_result, "OK: queue_map_min=%u", pkt_dev->queue_map_min);
+   return count;
+   }
+
+   if (!strcmp(name, "queue_map_max")) {
+   len = num_arg(&user_buffer[i], 5, &value);
+   if (len < 0) {
+   return len;
+   }
+   i += len;
+   pkt_dev->queue_map_max = value;
+   sprintf(pg_result, "OK: queue_map_max=%u", pkt_dev->queue_map_max);
+   return count;
+   }
+
if (!strcmp(name, "mpls")) {
unsigned n, offset;
len = get_labels(&user_buffer[i], pkt_dev);
@@ -2378,6 +2421,20 @@ static void mod_cur_headers(struct pktgen_dev *pkt_dev)
pkt_dev->cur_pkt_size = t;
}
 
+   if (pkt_dev->queue_map_min < pkt_dev->queue_map_max) {
+   __u16 t;
+   if (pkt_dev->flags & F_QUEUE_MAP_RND) {
+   t = random32() %
+   (pkt_dev->queue_map_max - pkt_dev->queue_map_min + 1)
+

pktgen multiqueue oops

2007-08-27 Thread Robert Olsson

Hello,

Initially pkt_dev can be NULL; this causes netif_subqueue_stopped to
oops. The patch below should cure it. But maybe the pktgen TX logic
should be reworked to better support the new multiqueue code.

Cheers
--ro

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 7bae576..a0db4d1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3331,8 +3331,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
}
 
if ((netif_queue_stopped(odev) ||
-netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
-need_resched()) {
+(pkt_dev->skb && 
+ netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping))) ||
+   need_resched()) {
idle_start = getCurUs();
 
if (!netif_running(odev)) {


[RFT 0/4] fib_trie cleanup patches

2007-08-08 Thread Robert Olsson

Stephen Hemminger writes:

 > I don't have a machine with anywhere near enough routes to test this,
 > so would someone with many routes give it a go and make sure nothing
 > got busted in the process.

 Hello! 

 It's not only the number of routes that's important...

 Anyway, I've done what I can to verify that route selection (comparing
 lookups via netlink against a userland longest-prefix-match using the same
 full routing table) and locking (concurrent rDoS on multiple CPUs,
 also while loading the full routing table) are intact.

 Keep TKEY_GET_MASK in patch #2 as it gives some hint of what's going on.

 And how about Paul E. McKenney's comment about rcu_dereference()
 covering the initial fetch?

 BTW, right now the lab is set up for the tests described above...

 Cheers.

--ro

 


Re: [RFT] fib_trie: cleanup

2007-07-26 Thread Robert Olsson

David Miller writes:
 > From: Stephen Hemminger <[EMAIL PROTECTED]>
 > Date: Thu, 26 Jul 2007 09:46:48 +0100
 > 
 > > Try this out:
 > >  * replace macro's with inlines
 > >  * get rid of places doing multiple evaluations of NODE_PARENT
 > 
 > No objections from me.
 > 
 > Robert?

 Fine, it looks cleaner and compiles... thanks Stephen.

 BTW,
 I think it might be possible to improve the functional/performance parts
 too. It's the list handling I'm thinking of.

 I've observed that in most cases there is only one leaf_info in the leaf
 list: for 96% of current Internet prefixes in the router I just looked into
 (below). So optimizing for this case could be an idea. Some variant where
 the leaf holds the leaf_info data directly could be worth testing.

 Cheers.
--ro


 Main:
Aver depth: 2.57
Max depth:  6
Leaves: 215131
Internal nodes: 52394
  1: 27662  2: 9982  3: 8510  4: 3625  5: 1684  6: 626  7: 240  8: 64  
16: 1
Pointers: 427924
Null ptrs: 160400
Total size: 7102  kB


ip route list | wc -l 
 224649

215131/224649 = 96%


Re: [WIP][PATCHES] Network xmit batching

2007-06-19 Thread Robert Olsson

jamal writes:

 > What is your kernel config in regards to HRES timers? Robert mentioned
 > to me that the clock source maybe causing issues with pktgen (maybe even
 > qos). Robert, insights?

 pktgen heavily uses gettimeofday. I was using tsc as the clock source with
 our Opterons in the lab. In late 2.6.20 gettimeofday was changed so tsc
 couldn't be used on Opterons (for pktgen at least).

 To give you an example, with Intel's 82571EB we could send 1.488 Mpps using
 tsc but with hpet only 400 kpps.

 But Evgeniy is most likely using the same clocksource for both the mainline
 and batch tests, so there must be something different...


 Cheers.
--ro


Re: [PATCH] NET: Multiqueue network device support.

2007-06-13 Thread Robert Olsson

jamal writes:

 > I think the one described by Leonid has not just 8 tx/rx rings but also
 > a separate register set, MSI binding etc iirc. The only shared resources
 > as far as i understood Leonid are the bus and the ethernet wire.
 
 AFAIK most new NICs will look like this...

 I still lack a lot of crucial hardware understanding.

 What will happen if we, for some reason, are not capable of serving
 one TX ring? Is the NIC still working, so we continue
 filling/sending/clearing on the other rings?

 > So in such a case (assuming 8 rings), 
 > One model is creating 4 netdev devices each based on single tx/rx ring
 > and register set and then having a mother netdev (what you call the
 > bond) that feeds these children netdev based on some qos parametrization
 > is very sensible. Each of the children netdevices (by virtue of how we
 > do things today) could be tied to a CPU for effectiveness (because our
 > per CPU work is based on netdevs).

 Some kind of supervising function for the TX is probably needed, as we
 still want to see the device as one entity. But if upcoming HW supports
 parallelism straight to the TX ring, we would of course like to use it to
 get minimal cache effects. It's up to how this "master netdev" or queue
 supervisor can be designed.
  
 > For the Leonid-NIC (for lack of better name) it may be harder to do
 > parallelization on rcv if you use what i said above. But you could
 > use a different model on receive - such as create a single netdev and
 > with 8 rcv rings and MSI tied on rcv to 8 different CPUs 

 Yes, that should be the way to do it... and ethtool or something to hint
 the NIC how the incoming data is classified wrt the available CPUs. Maybe
 something more dynamic for the brave ones.

 Cheers
-ro


Re: [PATCH] NET: Multiqueue network device support.

2007-06-13 Thread Robert Olsson

jamal writes:

 > The key arguement i make (from day one actually) is to leave the
 > majority of the work to the driver.
 > My view of wireless WMM etc is it is a different media behavior
 > (compared to wired ethernet) which means a different view of strategy
 > for when it opens the valve to allow in more packets. 802.11 media has
 > embedded signalling which is usable. Guy Cohen gave a good use case
 > which i responded to. Do you wanna look at that and respond?

 Hello,

 Haven't got all the details. IMO we need to support some "bonding-like"
 scenario too, where one CPU is feeding just one TX ring (and TX buffers
 are cleared by the same CPU). We probably don't want to stall all queuing
 when one ring is full.
 
 The scenario I see is to support parallelism in forwarding/firewalling etc.
 For example, when the RX load gets split by HW onto different CPUs, for
 cache reasons we want to process on the same CPU even for TX.

 If the RX HW split keeps packets from the same flow on the same CPU, we
 shouldn't get reordering within flows.

 Cheers
--ro
 


[PATCH] pktgen IPSEC 4/4: Add IPSEC support to pktgen

2007-06-12 Thread Robert Olsson

jamal writes:
 > 4 of 4

 > [PKTGEN] IPSEC support
 > Added transport mode ESP support for starters.
 > I will send more of these modes and types once i have resolved
 > the tunnel mode isses.
 > 
 > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>


 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>
 
 Cheers
--ro




 > 
 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > index bc4fb3b..bcec8e4 100644
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -152,6 +152,9 @@
 >  #include 
 >  #include 
 >  #include 
 > +#ifdef CONFIG_XFRM
 > +#include 
 > +#endif
 >  #include 
 >  #include 
 >  #include 
 > @@ -182,6 +185,7 @@
 >  #define F_VID_RND (1<<9)/* Random VLAN ID */
 >  #define F_SVID_RND(1<<10)   /* Random SVLAN ID */
 >  #define F_FLOW_SEQ(1<<11)   /* Sequential flows */
 > +#define F_IPSEC_ON(1<<12)   /* ipsec on for flows */
 >  
 >  /* Thread control flag bits */
 >  #define T_TERMINATE   (1<<0)
 > @@ -208,6 +212,9 @@ static struct proc_dir_entry *pg_proc_dir = NULL;
 >  struct flow_state {
 >  __be32 cur_daddr;
 >  int count;
 > +#ifdef CONFIG_XFRM
 > +struct xfrm_state *x;
 > +#endif
 >  __u32 flags;
 >  };
 >  
 > @@ -348,7 +355,10 @@ struct pktgen_dev {
 >  unsigned lflow; /* Flow length  (config) */
 >  unsigned nflows;/* accumulated flows (stats) */
 >  unsigned curfl; /* current sequenced flow (state)*/
 > -
 > +#ifdef CONFIG_XFRM
 > +__u8ipsmode;/* IPSEC mode (config) */
 > +__u8ipsproto;   /* IPSEC type (config) */
 > +#endif
 >  char result[512];
 >  };
 >  
 > @@ -704,6 +714,9 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 >  seq_printf(seq,  "FLOW_RND  ");
 >  }
 >  
 > +if (pkt_dev->flags & F_IPSEC_ON)
 > +seq_printf(seq,  "IPSEC  ");
 > +
 >  if (pkt_dev->flags & F_MACSRC_RND)
 >  seq_printf(seq, "MACSRC_RND  ");
 >  
 > @@ -1198,6 +1211,11 @@ static ssize_t pktgen_if_write(struct file *file,
 >  else if (strcmp(f, "FLOW_SEQ") == 0)
 >  pkt_dev->flags |= F_FLOW_SEQ;
 >  
 > +#ifdef CONFIG_XFRM
 > +else if (strcmp(f, "IPSEC") == 0)
 > +pkt_dev->flags |= F_IPSEC_ON;
 > +#endif
 > +
 >  else if (strcmp(f, "!IPV6") == 0)
 >  pkt_dev->flags &= ~F_IPV6;
 >  
 > @@ -1206,7 +1224,7 @@ static ssize_t pktgen_if_write(struct file *file,
 >  "Flag -:%s:- unknown\nAvailable flags, (prepend 
 > ! to un-set flag):\n%s",
 >  f,
 >  "IPSRC_RND, IPDST_RND, UDPSRC_RND, UDPDST_RND, "
 > -"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, 
 > MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ\n");
 > +"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, 
 > MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ, IPSEC\n");
 >  return count;
 >  }
 >  sprintf(pg_result, "OK: flags=0x%x", pkt_dev->flags);
 > @@ -2094,6 +2112,7 @@ static void spin(struct pktgen_dev *pkt_dev, __u64 
 > spin_until_us)
 >  
 >  static inline void set_pkt_overhead(struct pktgen_dev *pkt_dev)
 >  {
 > +pkt_dev->pkt_overhead = 0;
 >  pkt_dev->pkt_overhead += pkt_dev->nr_labels*sizeof(u32);
 >  pkt_dev->pkt_overhead += VLAN_TAG_SIZE(pkt_dev);
 >  pkt_dev->pkt_overhead += SVLAN_TAG_SIZE(pkt_dev);
 > @@ -2130,6 +2149,31 @@ static inline int f_pick(struct pktgen_dev *pkt_dev)
 >  return pkt_dev->curfl;
 >  }
 >  
 > +
 > +#ifdef CONFIG_XFRM
 > +/* If there was already an IPSEC SA, we keep it as is, else
 > + * we go look for it ...
 > +*/
 > +inline
 > +void get_ipsec_sa(struct pktgen_dev *pkt_dev, int flow)
 > +{
 > +struct xfrm_state *x = pkt_dev->flows[flow].x;
 > +if (!x) {
 > +/*slow path: we dont already have xfrm_state*/
 > +x = xfrm_stateonly_find((xfrm_address_t *)&pkt_dev->cur_daddr,
 > +(xfrm_address_t *)&pkt_dev->cur_saddr,
 > +AF_INET,
 > +pkt_dev->ipsmode,
 > +pkt_dev->ipsproto, 0);
 > +if (x) {
 > +pkt_dev->flows[flow].x =

[PATCH] pktgen IPSEC 1/4: Centralize pktgen packet overhead management

2007-06-12 Thread Robert Olsson

jamal writes:
 > Manual labor still ... 1 of 4

 > [PKTGEN] Centralize packet overhead tracking
 > Track the extra packet overhead for VLAN tags, MPLS, IPSEC etc
 > 
 > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>


   Thanks, Jamal.

   I guess the ipsec part is to be considered work-in-progress,
   and you're doing both the work and the progress.
  
   Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

   Cheers
--ro

 
 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > index 9cd3a1c..1352316 100644
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -228,6 +228,7 @@ struct pktgen_dev {
 >  
 >  int min_pkt_size;   /* = ETH_ZLEN; */
 >  int max_pkt_size;   /* = ETH_ZLEN; */
 > +int pkt_overhead;   /* overhead for MPLS, VLANs, IPSEC etc */
 >  int nfrags;
 >  __u32 delay_us; /* Default delay */
 >  __u32 delay_ns;
 > @@ -2075,6 +2076,13 @@ static void spin(struct pktgen_dev *pkt_dev, __u64 
 > spin_until_us)
 >  pkt_dev->idle_acc += now - start;
 >  }
 >  
 > +static inline void set_pkt_overhead(struct pktgen_dev *pkt_dev)
 > +{
 > +pkt_dev->pkt_overhead += pkt_dev->nr_labels*sizeof(u32);
 > +pkt_dev->pkt_overhead += VLAN_TAG_SIZE(pkt_dev);
 > +pkt_dev->pkt_overhead += SVLAN_TAG_SIZE(pkt_dev);
 > +}
 > +
 >  /* Increment/randomize headers according to flags and current values
 >   * for IP src/dest, UDP src/dst port, MAC-Addr src/dst
 >   */
 > @@ -2323,9 +2331,7 @@ static struct sk_buff *fill_packet_ipv4(struct 
 > net_device *odev,
 >  
 >  datalen = (odev->hard_header_len + 16) & ~0xf;
 >  skb = alloc_skb(pkt_dev->cur_pkt_size + 64 + datalen +
 > -pkt_dev->nr_labels*sizeof(u32) +
 > -VLAN_TAG_SIZE(pkt_dev) + SVLAN_TAG_SIZE(pkt_dev),
 > -GFP_ATOMIC);
 > +pkt_dev->pkt_overhead, GFP_ATOMIC);
 >  if (!skb) {
 >  sprintf(pkt_dev->result, "No memory");
 >  return NULL;
 > @@ -2368,7 +2374,7 @@ static struct sk_buff *fill_packet_ipv4(struct 
 > net_device *odev,
 >  
 >  /* Eth + IPh + UDPh + mpls */
 >  datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
 > -  pkt_dev->nr_labels*sizeof(u32) - VLAN_TAG_SIZE(pkt_dev) - 
 > SVLAN_TAG_SIZE(pkt_dev);
 > +  pkt_dev->pkt_overhead;
 >  if (datalen < sizeof(struct pktgen_hdr))
 >  datalen = sizeof(struct pktgen_hdr);
 >  
 > @@ -2391,8 +2397,7 @@ static struct sk_buff *fill_packet_ipv4(struct 
 > net_device *odev,
 >  iph->check = ip_fast_csum((void *)iph, iph->ihl);
 >  skb->protocol = protocol;
 >  skb->mac_header = (skb->network_header - ETH_HLEN -
 > -   pkt_dev->nr_labels * sizeof(u32) -
 > -   VLAN_TAG_SIZE(pkt_dev) - SVLAN_TAG_SIZE(pkt_dev));
 > +   pkt_dev->pkt_overhead);
 >  skb->dev = odev;
 >  skb->pkt_type = PACKET_HOST;
 >  
 > @@ -2662,9 +2667,7 @@ static struct sk_buff *fill_packet_ipv6(struct 
 > net_device *odev,
 >  mod_cur_headers(pkt_dev);
 >  
 >  skb = alloc_skb(pkt_dev->cur_pkt_size + 64 + 16 +
 > -pkt_dev->nr_labels*sizeof(u32) +
 > -VLAN_TAG_SIZE(pkt_dev) + SVLAN_TAG_SIZE(pkt_dev),
 > -GFP_ATOMIC);
 > +pkt_dev->pkt_overhead, GFP_ATOMIC);
 >  if (!skb) {
 >  sprintf(pkt_dev->result, "No memory");
 >  return NULL;
 > @@ -2708,7 +2711,7 @@ static struct sk_buff *fill_packet_ipv6(struct 
 > net_device *odev,
 >  /* Eth + IPh + UDPh + mpls */
 >  datalen = pkt_dev->cur_pkt_size - 14 -
 >sizeof(struct ipv6hdr) - sizeof(struct udphdr) -
 > -  pkt_dev->nr_labels*sizeof(u32) - VLAN_TAG_SIZE(pkt_dev) - 
 > SVLAN_TAG_SIZE(pkt_dev);
 > +  pkt_dev->pkt_overhead;
 >  
 >  if (datalen < sizeof(struct pktgen_hdr)) {
 >  datalen = sizeof(struct pktgen_hdr);
 > @@ -2738,8 +2741,7 @@ static struct sk_buff *fill_packet_ipv6(struct 
 > net_device *odev,
 >  ipv6_addr_copy(&iph->saddr, &pkt_dev->cur_in6_saddr);
 >  
 >  skb->mac_header = (skb->network_header - ETH_HLEN -
 > -   pkt_dev->nr_labels * sizeof(u32) -
 > -   VLAN_TAG_SIZE(pkt_dev) - SVLAN_TAG_SIZE(pkt_dev));
 > +   pkt_dev->pkt_overhead);
 >  skb->protocol = protocol;
 >  skb->dev = odev;
 >  s
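The idea in the patch above is to compute the extra per-packet header bytes once, instead of repeating the MPLS/VLAN/SVLAN arithmetic at every allocation and datalen calculation. A standalone sketch of the accounting (the struct and macros here are illustrative mocks, not the kernel's):

```c
#include <assert.h>

/* Illustrative mock of the relevant pktgen_dev fields; 0xffff means
 * "no VLAN tag configured", as in pktgen. */
struct pg_mock {
	unsigned int nr_labels;	/* number of MPLS labels */
	unsigned int vlan_id;
	unsigned int svlan_id;
	int pkt_overhead;
};

#define MOCK_VLAN_TAG_SIZE(d)	((d)->vlan_id  != 0xffff ? 4 : 0)
#define MOCK_SVLAN_TAG_SIZE(d)	((d)->svlan_id != 0xffff ? 4 : 0)

/* One central place accounts for all per-packet header overhead;
 * the skb allocation and datalen math then just read pkt_overhead. */
static void mock_set_pkt_overhead(struct pg_mock *d)
{
	d->pkt_overhead = 0;	/* reset, then accumulate */
	d->pkt_overhead += d->nr_labels * sizeof(unsigned int);
	d->pkt_overhead += MOCK_VLAN_TAG_SIZE(d);
	d->pkt_overhead += MOCK_SVLAN_TAG_SIZE(d);
}
```

Note the kernel patch itself only accumulates with `+=`; the reset at the top here is added to make the mock idempotent.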

[PATCH] pktgen IPSEC 2/4: Introduce pktgen sequential flows

2007-06-12 Thread Robert Olsson




jamal writes:
 > 2 of 4
 > 
 > cheers,
 > jamal
 > commit 882c296bb3f153e1ac770a874c75cfb2bab8481b
 > Author: Jamal Hadi Salim <[EMAIL PROTECTED]>
 > Date:   Tue Jun 12 07:24:00 2007 -0400
 > 
 > [PKTGEN] Introduce sequential flows
 > 
 > By default all flows in pktgen are randomly selected.
 > This patch introduces ability to have all defined flows to
 > be sent sequentially. Robert defined randomness to be the
 > default behavior.
 > 
 > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>

   Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

   Cheers
--ro



 > 
 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > index 1352316..bc4fb3b 100644
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -181,6 +181,7 @@
 >  #define F_MPLS_RND(1<<8)/* Random MPLS labels */
 >  #define F_VID_RND (1<<9)/* Random VLAN ID */
 >  #define F_SVID_RND(1<<10)   /* Random SVLAN ID */
 > +#define F_FLOW_SEQ(1<<11)   /* Sequential flows */
 >  
 >  /* Thread control flag bits */
 >  #define T_TERMINATE   (1<<0)
 > @@ -207,8 +208,12 @@ static struct proc_dir_entry *pg_proc_dir = NULL;
 >  struct flow_state {
 >  __be32 cur_daddr;
 >  int count;
 > +__u32 flags;
 >  };
 >  
 > +/* flow flag bits */
 > +#define F_INIT   (1<<0) /* flow has been initialized */
 > +
 >  struct pktgen_dev {
 >  /*
 >   * Try to keep frequent/infrequent used vars. separated.
 > @@ -342,6 +347,7 @@ struct pktgen_dev {
 >  unsigned cflows;/* Concurrent flows (config) */
 >  unsigned lflow; /* Flow length  (config) */
 >  unsigned nflows;/* accumulated flows (stats) */
 > +unsigned curfl; /* current sequenced flow (state)*/
 >  
 >  char result[512];
 >  };
 > @@ -691,6 +697,13 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 >  if (pkt_dev->flags & F_MPLS_RND)
 >  seq_printf(seq,  "MPLS_RND  ");
 >  
 > +if (pkt_dev->cflows) {
 > +if (pkt_dev->flags & F_FLOW_SEQ)
 > +seq_printf(seq,  "FLOW_SEQ  "); /*in sequence flows*/
 > +else
 > +seq_printf(seq,  "FLOW_RND  ");
 > +}
 > +
 >  if (pkt_dev->flags & F_MACSRC_RND)
 >  seq_printf(seq, "MACSRC_RND  ");
 >  
 > @@ -1182,6 +1195,9 @@ static ssize_t pktgen_if_write(struct file *file,
 >  else if (strcmp(f, "!SVID_RND") == 0)
 >  pkt_dev->flags &= ~F_SVID_RND;
 >  
 > +else if (strcmp(f, "FLOW_SEQ") == 0)
 > +pkt_dev->flags |= F_FLOW_SEQ;
 > +
 >  else if (strcmp(f, "!IPV6") == 0)
 >  pkt_dev->flags &= ~F_IPV6;
 >  
 > @@ -1190,7 +1206,7 @@ static ssize_t pktgen_if_write(struct file *file,
 >  "Flag -:%s:- unknown\nAvailable flags, (prepend 
 > ! to un-set flag):\n%s",
 >  f,
 >  "IPSRC_RND, IPDST_RND, UDPSRC_RND, UDPDST_RND, "
 > -"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, 
 > MPLS_RND, VID_RND, SVID_RND\n");
 > +"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, 
 > MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ\n");
 >  return count;
 >  }
 >  sprintf(pg_result, "OK: flags=0x%x", pkt_dev->flags);
 > @@ -2083,6 +2099,37 @@ static inline void set_pkt_overhead(struct pktgen_dev 
 > *pkt_dev)
 >  pkt_dev->pkt_overhead += SVLAN_TAG_SIZE(pkt_dev);
 >  }
 >  
 > +static inline int f_seen(struct pktgen_dev *pkt_dev, int flow)
 > +{
 > +
 > +if (pkt_dev->flows[flow].flags & F_INIT)
 > +return 1;
 > +else
 > +return 0;
 > +}
 > +
 > +static inline int f_pick(struct pktgen_dev *pkt_dev)
 > +{
 > +int flow = pkt_dev->curfl;
 > +
 > +if (pkt_dev->flags & F_FLOW_SEQ) {
 > +if (pkt_dev->flows[flow].count >= pkt_dev->lflow) {
 > +/* reset time */
 > +pkt_dev->flows[flow].count = 0;
 > +pkt_dev->curfl += 1;
 > +if (pkt_dev->curfl >= pkt_dev->cflows)
 > +pkt_dev->curfl = 0; /*reset */
 > +}
 > +} else {
 > +flow = random32() % pkt_dev->cflow
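The FLOW_SEQ selection above (truncated by the archive) walks flows in order: a flow is used for lflow packets, then the next flow is selected, wrapping at cflows. A userland sketch of that state machine (simplified; the random fallback and F_INIT handling are omitted):

```c
#include <assert.h>

/* Minimal flow-sequencing state, mirroring the curfl/count/lflow/cflows
 * fields of struct pktgen_dev. */
struct seq_state {
	unsigned int cflows;	/* configured number of flows */
	unsigned int lflow;	/* packets per flow before advancing */
	unsigned int curfl;	/* current flow index */
	unsigned int count;	/* packets emitted on current flow */
};

/* Pick the flow for the next packet: stay on curfl until lflow
 * packets have been sent, then advance and wrap. */
static unsigned int pick_flow_seq(struct seq_state *s)
{
	if (s->count >= s->lflow) {
		s->count = 0;
		if (++s->curfl >= s->cflows)
			s->curfl = 0;	/* wrap to first flow */
	}
	s->count++;
	return s->curfl;
}
```

With cflows=2 and lflow=2 this yields the packet-to-flow sequence 0, 0, 1, 1, 0, ...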

[PATCH] pktgen IPSEC 3/4: Introduce xfrm SAD only lookup

2007-06-12 Thread Robert Olsson

jamal writes:
 > 3 of 4 ..

 > [XFRM] Introduce standalone SAD lookup
 > This allows other in-kernel functions to do SAD lookups.
 > The only known user at the moment is pktgen.
 > 
 > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>
 

 xfrm is not my area..... 

 Acked-by: Robert Olsson <[EMAIL PROTECTED]>

 Cheers
--ro


 > 
 > diff --git a/include/net/xfrm.h b/include/net/xfrm.h
 > index 311f25a..79d2c37 100644
 > --- a/include/net/xfrm.h
 > +++ b/include/net/xfrm.h
 > @@ -920,6 +920,10 @@ extern struct xfrm_state 
 > *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t
 >struct flowi *fl, struct xfrm_tmpl 
 > *tmpl,
 >struct xfrm_policy *pol, int *err,
 >unsigned short family);
 > +extern struct xfrm_state * xfrm_stateonly_find(xfrm_address_t *daddr,
 > +   xfrm_address_t *saddr,
 > +   unsigned short family,
 > +   u8 mode, u8 proto, u32 reqid);
 >  extern int xfrm_state_check_expire(struct xfrm_state *x);
 >  extern void xfrm_state_insert(struct xfrm_state *x);
 >  extern int xfrm_state_add(struct xfrm_state *x);
 > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
 > index 85f3f43..b8562e4 100644
 > --- a/net/xfrm/xfrm_state.c
 > +++ b/net/xfrm/xfrm_state.c
 > @@ -686,6 +686,41 @@ out:
 >  return x;
 >  }
 >  
 > +struct xfrm_state *
 > +xfrm_stateonly_find(xfrm_address_t *daddr, xfrm_address_t *saddr,
 > +unsigned short family, u8 mode, u8 proto, u32 reqid)
 > +{
 > +unsigned int h = xfrm_dst_hash(daddr, saddr, reqid, family);
 > +struct xfrm_state *rx = NULL, *x = NULL;
 > +struct hlist_node *entry;
 > +
 > +spin_lock(&xfrm_state_lock);
 > +hlist_for_each_entry(x, entry, xfrm_state_bydst+h, bydst) {
 > +if (x->props.family == family &&
 > +x->props.reqid == reqid &&
 > +!(x->props.flags & XFRM_STATE_WILDRECV) &&
 > +xfrm_state_addr_check(x, daddr, saddr, family) &&
 > +mode == x->props.mode &&
 > +proto == x->id.proto)  {
 > +
 > +if (x->km.state != XFRM_STATE_VALID)
 > +continue;
 > +else {
 > +rx = x;
 > +break;
 > +}
 > +}
 > +}
 > +
 > +if (rx)
 > +xfrm_state_hold(rx);
 > +spin_unlock(&xfrm_state_lock);
 > +
 > +
 > +return rx;
 > +}
 > +EXPORT_SYMBOL(xfrm_stateonly_find);
 > +
 >  static void __xfrm_state_insert(struct xfrm_state *x)
 >  {
 >  unsigned int h;
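The lookup pattern in xfrm_stateonly_find() above is: hash the (daddr, saddr, reqid) tuple to pick a bucket, walk that one chain, and return the first entry whose properties match and whose state is valid. A stripped-down sketch with illustrative stand-in types (the real function also takes xfrm_state_lock and holds a refcount on the result):

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 16

enum { STATE_DEAD = 0, STATE_VALID = 1 };

/* Illustrative stand-in for struct xfrm_state. */
struct sa_entry {
	unsigned short family;
	unsigned int reqid;
	unsigned char mode, proto;
	int km_state;
	struct sa_entry *next;	/* chain within one hash bucket */
};

static struct sa_entry *buckets[NBUCKETS];

/* Walk one bucket; skip entries that match but are not yet valid,
 * exactly as the kernel loop does with "continue". */
static struct sa_entry *sad_find(unsigned int h, unsigned short family,
				 unsigned int reqid, unsigned char mode,
				 unsigned char proto)
{
	struct sa_entry *x;

	for (x = buckets[h % NBUCKETS]; x; x = x->next) {
		if (x->family == family && x->reqid == reqid &&
		    x->mode == mode && x->proto == proto) {
			if (x->km_state != STATE_VALID)
				continue;
			return x;	/* first valid match wins */
		}
	}
	return NULL;
}
```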
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fib_hash removal

2007-03-26 Thread Robert Olsson


Paul E. McKenney writes:

 > Those of use who dive into networking only occasionally would much
 > appreciate this.  ;-)

 No problem here... 

 Cheers
--ro

Acked-by: Robert Olsson <[EMAIL PROTECTED]>
Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]> (but trivial)


diff -urpNa -X dontdiff linux-2.6.20/net/ipv4/fib_trie.c 
linux-2.6.20-fn_trie_RTNL/net/ipv4/fib_trie.c
--- linux-2.6.20/net/ipv4/fib_trie.c2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20-fn_trie_RTNL/net/ipv4/fib_trie.c   2007-03-20 
08:13:59.0 -0700
@@ -1124,6 +1124,9 @@ err:
return fa_head;
 }
 
+/*
+ * Caller must hold RTNL.
+ */
 static int fn_trie_insert(struct fib_table *tb, struct fib_config *cfg)
 {
struct trie *t = (struct trie *) tb->tb_data;
@@ -1543,6 +1546,9 @@ static int trie_leaf_remove(struct trie 
return 1;
 }
 
+/*
+ * Caller must hold RTNL.
+ */
 static int fn_trie_delete(struct fib_table *tb, struct fib_config *cfg)
 {
struct trie *t = (struct trie *) tb->tb_data;
@@ -1721,6 +1727,9 @@ up:
return NULL; /* Ready. Root of trie */
 }
 
+/*
+ * Caller must hold RTNL.
+ */
 static int fn_trie_flush(struct fib_table *tb)
 {
struct trie *t = (struct trie *) tb->tb_data;



Re: [PATCH] fib_hash removal

2007-03-19 Thread Robert Olsson

Paul E. McKenney writes:

 > > We have two users of trie_leaf_remove, fn_trie_flush and fn_trie_delete;
 > > both are holding RTNL. So there shouldn't be any need for this preempt stuff.
 > > This is assumed to be a leftover from an older RCU take.
 > 
 > True enough!  One request -- would it be reasonable to add to
 > trie_leaf_remove()'s comment to state that RTNL must be held
 > by the caller?
 
 Thanks for your review.  Yes, but it's implicitly assumed that the updater
 side holds RTNL, and we have a comment that states we're run by the updater.
 If we mention RTNL here we should probably comment other places too. ;)

 Cheers
   --ro


[PATCH] fib_hash removal

2007-03-16 Thread Robert Olsson

Hello, Just discussed this Patrick...

We have two users of trie_leaf_remove, fn_trie_flush and fn_trie_delete;
both are holding RTNL. So there shouldn't be any need for this preempt stuff.
This is assumed to be a leftover from an older RCU take.

> Mhh .. I think I just remembered something - me incorrectly suggesting
> to add it there while we were talking about this at OLS :) IIRC the
> idea was to make sure tnode_free (which at that time didn't use
> call_rcu) wouldn't free memory while still in use in a rcu read-side
> critical section. It should have been synchronize_rcu of course,
> but with tnode_free using call_rcu it seems to be completely
> unnecessary. So I guess we can simply remove it.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>
Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Cheers.
--ro


diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1534,7 +1534,6 @@ static int trie_leaf_remove(struct trie 
t->revision++;
t->size--;
 
-   preempt_disable();
tp = NODE_PARENT(n);
tnode_free((struct tnode *) n);
 
@@ -1544,7 +1543,6 @@ static int trie_leaf_remove(struct trie 
rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
} else
rcu_assign_pointer(t->trie, NULL);
-   preempt_enable();
 
return 1;
 }



[PATCH] fib_trie root node settings

2007-03-16 Thread Robert Olsson

The threshold for the root node can be set more aggressively to get
better tree compression. The new setting makes the root grow
from 16 to 19 bits and gives a substantial improvement in average depth,
this with the current table of 214393 prefixes.

But really, the dynamic resize needs more investigation,
both in terms of convergence and performance, and maybe it should
be possible to change...

Maybe this is just for the brave to start with, or we may have to back
this out.

- Current ---
Main:
Aver depth: 2.59
Max depth:  9
Leaves: 205530
Internal nodes: 52634
  1: 26507  2: 11693  3: 8007  4: 3700  5: 1776  6: 658  7: 241  8: 51  
16: 1
Pointers: 431426
Null ptrs: 173263
Total size: 6934  kB

- With the new setting ---
Main:
Aver depth: 2.06
Max depth:  8
Leaves: 205530
Internal nodes: 48235
  1: 23812  2: 10023  3: 8089  4: 3972  5: 2332  6: 6  19: 1
Pointers: 815276
Null ptrs: 561512
Total size: 8330  kB


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 7ef5948..1560c54 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -293,8 +293,8 @@ static inline void check_tnode(const struct tnode *tn)
 
 static int halve_threshold = 25;
 static int inflate_threshold = 50;
-static int halve_threshold_root = 15;
-static int inflate_threshold_root = 25; 
+static int halve_threshold_root = 8;
+static int inflate_threshold_root = 15; 
 
 
 static void __alias_free_mem(struct rcu_head *head)
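For reference, these thresholds feed the inflate loop in resize(): a tnode keeps doubling while its occupancy estimate exceeds the threshold percentage. The predicate can be sketched standalone (the counts in the usage below are illustrative, not measured):

```c
#include <assert.h>

/* The inflate test from resize() in fib_trie.c: a tnode is doubled
 * while it has full children and
 *   50 * (full + child_length - empty) >= threshold * child_length
 * Lowering the root thresholds (as in this patch) therefore makes
 * the root node inflate sooner and grow wider. */
static int should_inflate(unsigned int full, unsigned int empty,
			  unsigned int child_length, unsigned int threshold)
{
	return full > 0 &&
	       50 * (full + child_length - empty) >= threshold * child_length;
}
```

For example, a 16-bit root (65536 children) with 100 full and 41636 empty children does not inflate at threshold 25 but does at threshold 15, which is how the new setting lets the root grow.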


[PATCH] fib_trie resize break

2007-03-16 Thread Robert Olsson

Hello.

The patch below adds a break condition for the resize operations. If
we don't achieve the desired fill factor, a warning is printed. The trie
should still be operational, but new thresholds should be considered.

Cheers
--ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 1e589b9..7ef5948 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.407"
+#define VERSION "0.408"
 
 #include 
 #include 
@@ -460,6 +460,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
struct tnode *old_tn;
int inflate_threshold_use;
int halve_threshold_use;
+   int max_resize;
 
if (!tn)
return NULL;
@@ -560,7 +561,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
inflate_threshold_use = inflate_threshold;
 
err = 0;
-   while ((tn->full_children > 0 &&
+   max_resize = 10;
+   while ((tn->full_children > 0 &&  max_resize-- &&
   50 * (tn->full_children + tnode_child_length(tn) - 
tn->empty_children) >=
inflate_threshold_use * 
tnode_child_length(tn))) {
 
@@ -575,6 +577,17 @@ static struct node *resize(struct trie *t, struct tnode 
*tn)
}
}
 
+   if(max_resize < 0) {
+   
+   if(!tn->parent) 
+   printk(KERN_WARNING "Fix inflate_threshold_root. Now=%d 
size=%d bits\n", 
+  inflate_threshold_root, tn->bits);
+   
+   else 
+   printk(KERN_WARNING "Fix inflate_threshold. Now=%d 
size=%d bits\n", 
+  inflate_threshold, tn->bits);
+   }
+
check_tnode(tn);
 
/*
@@ -591,7 +604,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
halve_threshold_use = halve_threshold;
 
err = 0;
-   while (tn->bits > 1 &&
+   max_resize = 10;
+   while (tn->bits > 1 &&  max_resize-- &&
   100 * (tnode_child_length(tn) - tn->empty_children) <
   halve_threshold_use * tnode_child_length(tn)) {
 
@@ -606,6 +620,16 @@ static struct node *resize(struct trie *t, struct tnode 
*tn)
}
}
 
+   if(max_resize < 0) {
+   
+   if(!tn->parent) 
+   printk(KERN_WARNING "Fix halve_threshold_root. Now=%d 
size=%d bits\n", 
+  halve_threshold_root, tn->bits);
+   
+   else 
+   printk(KERN_WARNING "Fix halve_threshold. Now=%d 
size=%d bits\n", 
+  halve_threshold, tn->bits);
+   }
 
/* Only one child remains */
if (tn->empty_children == tnode_child_length(tn) - 1)


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-08 Thread Robert Olsson

David Miller writes:

 > Even a nearly perfect hash has small lumps in distribution, and we
 > should not penalize entries which fall into these lumps.
 > 
 > Let us call T the threshold at which we would grow the routing hash
 > table.  As we approach T we start to GC.  Let's assume hash table
 > has shift = 2. and T would (with T=N+(N>>1) algorithm) therefore be
 > 6.
 > 
 > TABLE:   [0] DST1, DST2
 >  [1] DST3, DST4, DST5
 > 
 > DST6 arrives, what should we do?
 > 
 > If we just accept it and don't GC some existing entries, we
 > will grow the hash table.  This is the wrong thing to do if
 > our true working set is smaller than 6 entries and thus some
 > of the existing entries are unlikely to be reused and thus
 > could be purged to keep us from hitting T.
 > 
 > If they are all active, growing is the right thing to do.
 > 
 > This is the crux of the whole routing cache problem.
 
 Yes, it's very complex... It would be better to have the GC process and
 data structure more independent, at least for us mortals.
 With the unicache I was trying to achieve something like this:

 The data structure is pretty independent and optimal wrt inserts and
 deletes (GC). Well, not 100% perfect, as we don't want to resize the
 root node - which is close to a hash resize. Stefan Nilsson did some
 work in his "dynamic tries"; halve_threshold and inflate_threshold
 are controlling the resize alone. AFAIK the two different thresholds
 were used to prevent oscillation and to provide dampening. Maybe some
 ideas can be considered.

 About GC, if we forget what can be done with active GC for now...
 IMO the "most important" GC for constant load is the on-demand
 or passive kind, where we are triggered by a fixed threshold. (A fixed
 equilibrium point.)

 > I am of the opinion that LRU, for routes not attached to sockets, is
 > probably the best thing to do here.
 
 Yes among other things rt_score checks for this.

 > Furthermore at high packet rates, the current rt_may_expire() logic
 > probably is not very effective since it's granularity is limited to
 > jiffies.  We can quite easily create 100,000 or more entries per
 > jiffie when HZ=100 during rDOS, for example.  So perhaps some global
 > LRU algorithm using ktime is more appropriate.

 Timer-based GC. In my world this is just to get rid of entries when 
 traffic has stopped/dropped. 

 > Global LRU is not easy without touching a lot of memory.  But I'm
 > sure some clever trick can be discovered by someone :)

 Yes, as we have to scan all entries. This has to be balanced against the
 work we incur if we remove something that needs to be "restored".

 > It is amusing, but it seems that for rDOS workload most optimal
 > routing hash would be tiny one like my example above.  All packets
 > essentially miss the routing cache and create new entry.  So
 > keeping the working set as small as possible is what you want
 > to do since no matter how large you grow your hit rate will be
 > zero :-)

 Yes, Alexey and I tried a long time ago to limit the lengths of the
 hash chains. We saw improved results, but we didn't find any usable
 measure for rDoS.
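 As an aside, the growth threshold in the quoted example, T = N + (N >> 1)
 where N is the number of hash slots, is simply 1.5x the table size:

```c
#include <assert.h>

/* T = N + (N >> 1): grow the hash table once the entry count
 * reaches 1.5x the number of slots. With shift = 2 (N = 4 slots)
 * this gives T = 6, matching the example quoted above. */
static unsigned int grow_threshold(unsigned int shift)
{
	unsigned int n = 1u << shift;	/* number of slots */

	return n + (n >> 1);
}
```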

 Cheers
--ro


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

Eric Dumazet writes:

 > With 2^20 entries, your actual limit of 2^19 entries in root node will 
 > probably show us quite different numbers for order-1,2,3,4... tnodes

 Yep, the trie will get deeper and lookups more costly, as will inserts
 and deletes. The 2^19 limit was because I was hitting a memory allocation
 problem that I never sorted out.

 > Yes, numbers you gave us basically showed a big root node, and mainly leaves 
 > and very few tnodes.
 > 
 > I was interested to see the distribution in case the root-node limit is hit, 
 > and we load into the table a *lot* of entries.

 Max length etc... well, maybe the root restriction should be removed and
 replaced with just a max size instead.

 Cheers
--ro


Re: linux 2.6 Ipv4 routing enhancement (fwd)

2007-03-06 Thread Robert Olsson

Richard Kojedzinszky writes:

 > Sorry for sending the tgz with .svn included. And i did not send 
 > instructions.
 > To do a test with fib_trie, issue
 > $ make clean all ROUTE_ALG=TRIE & ./try a
 > with fib_radix:
 > $ make clean all ROUTE_ALG=RADIX & ./try a
 > with fib_lef:
 > $ make clean all ROUTE_ALG=LEF SBBITS=4 & ./try a

 Thanks. First, I do my testing in kernel context and in the
 forwarding path with full semantic match, so it's not that easy to compare.
 But I'll take a look. BTW, does your test verify correct prefix matching?
 
 FYI. some old fib work on robur.slu.se

 # Look with just hlist
  /pub/Linux/net-development/fib_hlist/

 # 24 bit hash lookup
  /pub/Linux/net-development/fib_hash2/
 
 And some hlist/hash2/trie comparisons in:
 /pub/Linux/tmp/trie-talk-kth.pdf

 Cheers
--ro


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

Eric Dumazet writes:

 > Indeed. It would be nice to see how it performs with say 2^20 elements...
 > Because with your data, I wonder if the extra complexity of the trash is 
 > worth 
 > it (since most lookups are going to only hit the hash and give the answer 
 > without intermediate nodes)

 I don't know if I understand you fully. Yes, in most cases the first lookup via
 the "hash header" will take us directly to the correct leaf. If there are "collisions"
 we have to sort them out by adding intermediate nodes.

 Something like a resizable hash where each bucket is in turn a
 resizable hash, etc.

 Cheers
--ro


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

Eric Dumazet writes:

 > Well, maybe... but after looking robert's trash, I discovered its model is 
 > essentially a big (2^18 slots) root node (our hash table), and very few 
 > order:1,2,3 nodes.
 
 It's getting "hashlike", yes. I guess all effective algorithms today do
 some sort of "index" lookup, and for a large number of entries we cannot expect
 to find the next node in the same cache line, so the "tree depth" becomes a
 crucial performance factor. IMO nothing can beat a perfectly distributed and
 perfectly sized hash. The trash work is an effort to get close with a dynamic
 data structure.

 Cheers
--ro


[RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Robert Olsson

David Miller writes:
 
 Interesting.
 
 > Actually, more accurately, the conflict exists in how this GC
 > logic is implemented.  The core issue is that hash table size
 > guides the GC processing, and hash table growth therefore
 > modifies those GC goals.  So with the patch below we'll just
 > keep growing the hash table instead of giving GC some time to
 > try to keep the working set in equilibrium before doing the
 > hash grow.
 
 AFAIK the equilibrium is a resizing function as well, but using a fixed
 hash table. So can we do without equilibrium resizing if tables
 are dynamic? I think so.

 With the hash data structure we could monitor the average chain
 length, or just the size, and resize the hash based on that.
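 Assuming a reasonably uniform hash, entries divided by slots approximates the
 average chain length, so the monitoring could be driven by that ratio alone.
 A sketch (the thresholds here are arbitrary illustrations, not proposed values):

```c
#include <assert.h>

/* With a uniform hash, entries / slots ~= average chain length,
 * so a resize decision can watch that ratio instead of tracking
 * per-chain lengths. Thresholds are illustrative. */
static int hash_should_grow(unsigned int entries, unsigned int slots,
			    unsigned int max_avg_chain)
{
	return entries > slots * max_avg_chain;
}

static int hash_should_shrink(unsigned int entries, unsigned int slots,
			      unsigned int min_slots)
{
	/* shrink when under a quarter full, but keep a floor */
	return slots > min_slots && entries < slots / 4;
}
```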

 > One idea is to put the hash grow check in the garbage collector,
 > and put the hash shrink check in rt_del().
 > 
 > In fact, it would be a good time to perhaps hack up some entirely
 > new passive GC logic for the routing cache.

 Could be. Remember the GC in the hash chain as well, which was added
 later; although it doesn't decrease the number of entries, it gives
 an upper limit. Also, the gc-goal must be picked so it does not force
 unwanted resizing.

 > BTW, another thing that plays into this is that Robert's TRASH work
 > could make this patch not necessary :-)

 It has "built-in" resize and chain control, and the gc-goal is chosen so as
 not to unnecessarily resize the root node.

 Cheers.
--ro


linux 2.6 Ipv4 routing enhancement (fwd)

2007-03-06 Thread Robert Olsson

Richard Kojedzinszky writes:

 > traffic, and also update the routing table (from BGP), the route cache 
 > seemed to be the bottleneck, as upon every fib update the whole route 
 > cache is flushed, and sometimes it took as many cpu cycles to let some 
 > packets being dropped. Meanwhile i knew that *BSD systems do not use such 
 > a cache, and of course without it a router can provide a constant 
 > performance, not depending on the number of different ip flows, and 
 > updating the fib does not take such a long time.

 Hmm, I think there is a cache in *BSD* too.

 Anyway, you're correct that the GC and insertion/deletion of routes
 flush the cache and can cause packet drops when all flows have
 to get recreated. Yes, it's something that needs to be addressed, but
 it's not that common that people use dynamic routing protocols.

 Anyway, Dave and Alexey started to look into this some time ago; I got
 involved later, and there were some ideas on how to deal with this. That work
 didn't come to an end, so if you want to contribute I think we'd all be
 happy.

 > For this to be solved, i have played with ipv4 routing in linux kernel a 
 > bit. I have done two separate things:
 > - developed a new fib algorithm in fib_trie's place for ipv4
 > - rewrote the kernel not to use it's dst cache

 Just for routing?

 > The fib algorithm is like cisco's CEF (at least if my knowledge is correct), 
 > but first I use a 16-branching tree, to look up the address by 4 bit steps, 
 > and 
 > each node in this tree contains a simple sub-tree which is a radix tree, of 
 > course with maximum possible height 4. I think this is very simple, and is 
 > nearly 3 times faster than fib_trie. Now it has a missing feature: it does 
 > not 
 > export the fib in /proc/net/route.

 Full semantic match...
 
 The LC-trie scales tree branching automatically, so looking at a Linux
 router running a full BGP feed with 204300 prefixes we see:

 1: 27567  2: 10127  3: 8149  4: 3630  5: 1529  6: 558  7: 197  8: 53  16: 1

 Root node is 16 bits here too, and Aver depth: 2.60.
 So 3 times faster than fib_trie would be a full sensation. How do you test?

 > The second thing i have done to minimize the cpu cycles during the 
 > forwarding 
 > phase, rewriting ip_input.c, route.c and some others to lef.c, and having a 
 > minimal functionality. I mean, for example, when a packet gets through the 
 > lef 
 > functions, ipsec policies are not checked.

 It would be nice to see profiles both before and with your patch.

 > And to be more efficient, I attached a neighbour pointer to each fib entry, 
 > and 
 > using this the lookup + forwarding code is very fast.

 > Of course, the route cache needs very little time to forward packets when 
 > there 
 > are a small number of different ip flows, but when dealing with traffic in 
 > an 
 > ISP at core level, this cannot be stated.

 > So I have done tests with LEF, and compared them to the original linux 
 > kernel's 
 > performance.
 > With the worst case, LEF performed nearly 90% of the linux kernel with the 
 > most 
 > optimal case. Of course original linux performs poorly with the worst case.

 Send them, with profiles if possible...

 Cheers.

--ro


Re: Extensible hashing and RCU

2007-03-05 Thread Robert Olsson

Michael K. Edwards writes:

 > This, incidentally, seems very similar to the process that Robert
 > Olsson and Stefan Nilsson have gone through with their trie/hash
 > project.  Although I haven't tried it out yet and don't have any basis
 > for an independent opinion, the data and analysis provided in their
 > paper are quite convincing. 

 I've some info about this "process" :) I moved fib_trie.c to userland to
 extend it to longer and variable key lengths. The test program was doing
 insert/lookup/remove with random data, with blistering performance
 (very flat trees).

 So, quite happy, I moved this trie back into the kernel and started testing
 with "real data" - IP addresses and rDoS. The result was disastrous: very deep
 trees and awful network performance. Random data is very different from
 real data was the lesson learned, again. I gave up.

 A couple of days later an idea came up. I remembered the proof from the
 LC-trie paper that the length of the key does not impact tree depth. So I
 went to Stefan: can't we just fix up the data rather than fiddling with a
 new improved algorithm? The result was this "hash" header to boost tree
 compression.
 
 Cheers
--ro

 


[PATCH 4/4] pktgen: fix device name handling

2007-02-28 Thread Robert Olsson


Yes, it seems to handle device name changes. So configuration scripts should
use ifindex now :)

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Cheers.
--ro



Stephen Hemminger writes:
 > Since devices can change name and other wierdness, don't hold onto
 > a copy of device name, instead use pointer to output device.
 > 
 > Fix a couple of leaks in error handling path as well.
 > 
 > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
 > 
 > ---
 >  net/core/pktgen.c |  137 
 > +++---
 >  1 file changed, 70 insertions(+), 67 deletions(-)
 > 
 > --- pktgen.orig/net/core/pktgen.c2007-02-27 12:08:58.0 -0800
 > +++ pktgen/net/core/pktgen.c 2007-02-27 12:11:32.0 -0800
 > @@ -210,15 +210,11 @@
 >  };
 >  
 >  struct pktgen_dev {
 > -
 >  /*
 >   * Try to keep frequent/infrequent used vars. separated.
 >   */
 > -
 > -char ifname[IFNAMSIZ];
 > -char result[512];
 > -
 > -struct pktgen_thread *pg_thread;/* the owner */
 > +struct proc_dir_entry *entry;   /* proc file */
 > +struct pktgen_thread *pg_thread;/* the owner */
 >  struct list_head list;  /* Used for chaining in the thread's 
 > run-queue */
 >  
 >  int running;/* if this changes to false, the test will stop 
 > */
 > @@ -345,6 +341,8 @@
 >  unsigned cflows;/* Concurrent flows (config) */
 >  unsigned lflow; /* Flow length  (config) */
 >  unsigned nflows;/* accumulated flows (stats) */
 > +
 > +char result[512];
 >  };
 >  
 >  struct pktgen_hdr {
 > @@ -497,7 +495,7 @@
 >  static int pktgen_stop_device(struct pktgen_dev *pkt_dev);
 >  static void pktgen_stop(struct pktgen_thread *t);
 >  static void pktgen_clear_counters(struct pktgen_dev *pkt_dev);
 > -static int pktgen_mark_device(const char *ifname);
 > +
 >  static unsigned int scan_ip6(const char *s, char ip[16]);
 >  static unsigned int fmt_ip6(char *s, const char ip[16]);
 >  
 > @@ -591,7 +589,7 @@
 > " frags: %d  delay: %u  clone_skb: %d  ifname: %s\n",
 > pkt_dev->nfrags,
 > 1000 * pkt_dev->delay_us + pkt_dev->delay_ns,
 > -   pkt_dev->clone_skb, pkt_dev->ifname);
 > +   pkt_dev->clone_skb, pkt_dev->odev->name);
 >  
 >  seq_printf(seq, " flows: %u flowlen: %u\n", pkt_dev->cflows,
 > pkt_dev->lflow);
 > @@ -1682,13 +1680,13 @@
 >  if_lock(t);
 >  list_for_each_entry(pkt_dev, &t->if_list, list)
 >  if (pkt_dev->running)
 > -seq_printf(seq, "%s ", pkt_dev->ifname);
 > +seq_printf(seq, "%s ", pkt_dev->odev->name);
 >  
 >  seq_printf(seq, "\nStopped: ");
 >  
 >  list_for_each_entry(pkt_dev, &t->if_list, list)
 >  if (!pkt_dev->running)
 > -seq_printf(seq, "%s ", pkt_dev->ifname);
 > +seq_printf(seq, "%s ", pkt_dev->odev->name);
 >  
 >  if (t->result[0])
 >  seq_printf(seq, "\nResult: %s\n", t->result);
 > @@ -1834,12 +1832,11 @@
 >  /*
 >   * mark a device for removal
 >   */
 > -static int pktgen_mark_device(const char *ifname)
 > +static void pktgen_mark_device(const char *ifname)
 >  {
 >  struct pktgen_dev *pkt_dev = NULL;
 >  const int max_tries = 10, msec_per_try = 125;
 >  int i = 0;
 > -int ret = 0;
 >  
 >  mutex_lock(&pktgen_thread_lock);
 >  pr_debug("pktgen: pktgen_mark_device marking %s for removal\n", ifname);
 > @@ -1860,32 +1857,49 @@
 >  printk("pktgen_mark_device: timed out after waiting "
 > "%d msec for device %s to be removed\n",
 > msec_per_try * i, ifname);
 > -ret = 1;
 >  break;
 >  }
 >  
 >  }
 >  
 >  mutex_unlock(&pktgen_thread_lock);
 > +}
 >  
 > -return ret;
 > +static void pktgen_change_name(struct net_device *dev)
 > +{
 > +struct pktgen_thread *t;
 > +
 > +list_for_each_entry(t, &pktgen_threads, th_list) {
 > +struct pktgen_dev *pkt_dev;
 > +
 > +list_for_each_entry(pkt_dev, &t->if_list, list) {
 > +if (pkt_dev->odev != dev)
 > +continue;
 > +
 > +remove_proc_entry(

[PATCH 3/4] pktgen: don't use __constant_htonl()

2007-02-28 Thread Robert Olsson

OK!

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Cheers.
--ro

Stephen Hemminger writes:
 > The existing htonl() macro is smart enough to do the same code as
 > using __constant_htonl() and it looks cleaner.
 > 
 > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
 > 
 > ---
 >  net/core/pktgen.c |   24 
 >  1 file changed, 12 insertions(+), 12 deletions(-)
 > 
 > --- pktgen.orig/net/core/pktgen.c2007-02-26 14:40:31.0 -0800
 > +++ pktgen/net/core/pktgen.c 2007-02-26 15:36:38.0 -0800
 > @@ -167,7 +167,7 @@
 >  #define LAT_BUCKETS_MAX 32
 >  #define IP_NAME_SZ 32
 >  #define MAX_MPLS_LABELS 16 /* This is the max label stack depth */
 > -#define MPLS_STACK_BOTTOM __constant_htonl(0x0100)
 > +#define MPLS_STACK_BOTTOM htonl(0x0100)
 >  
 >  /* Device flag bits */
 >  #define F_IPSRC_RND   (1<<0)/* IP-Src Random  */
 > @@ -2297,7 +2297,7 @@
 >  int datalen, iplen;
 >  struct iphdr *iph;
 >  struct pktgen_hdr *pgh = NULL;
 > -__be16 protocol = __constant_htons(ETH_P_IP);
 > +__be16 protocol = htons(ETH_P_IP);
 >  __be32 *mpls;
 >  __be16 *vlan_tci = NULL; /* Encapsulates priority and 
 > VLAN ID */
 >  __be16 *vlan_encapsulated_proto = NULL;  /* packet type ID field (or 
 > len) for VLAN tag */
 > @@ -2306,10 +2306,10 @@
 >  
 >  
 >  if (pkt_dev->nr_labels)
 > -protocol = __constant_htons(ETH_P_MPLS_UC);
 > +protocol = htons(ETH_P_MPLS_UC);
 >  
 >  if (pkt_dev->vlan_id != 0x)
 > -protocol = __constant_htons(ETH_P_8021Q);
 > +protocol = htons(ETH_P_8021Q);
 >  
 >  /* Update any of the values, used when we're incrementing various
 >   * fields.
 > @@ -2341,14 +2341,14 @@
 > pkt_dev->svlan_cfi,
 > pkt_dev->svlan_p);
 >  svlan_encapsulated_proto = (__be16 *)skb_put(skb, 
 > sizeof(__be16));
 > -*svlan_encapsulated_proto = 
 > __constant_htons(ETH_P_8021Q);
 > +*svlan_encapsulated_proto = htons(ETH_P_8021Q);
 >  }
 >  vlan_tci = (__be16 *)skb_put(skb, sizeof(__be16));
 >  *vlan_tci = build_tci(pkt_dev->vlan_id,
 >pkt_dev->vlan_cfi,
 >pkt_dev->vlan_p);
 >  vlan_encapsulated_proto = (__be16 *)skb_put(skb, 
 > sizeof(__be16));
 > -*vlan_encapsulated_proto = __constant_htons(ETH_P_IP);
 > +*vlan_encapsulated_proto = htons(ETH_P_IP);
 >  }
 >  
 >  iph = (struct iphdr *)skb_put(skb, sizeof(struct iphdr));
 > @@ -2635,7 +2635,7 @@
 >  int datalen;
 >  struct ipv6hdr *iph;
 >  struct pktgen_hdr *pgh = NULL;
 > -__be16 protocol = __constant_htons(ETH_P_IPV6);
 > +__be16 protocol = htons(ETH_P_IPV6);
 >  __be32 *mpls;
 >  __be16 *vlan_tci = NULL; /* Encapsulates priority and 
 > VLAN ID */
 >  __be16 *vlan_encapsulated_proto = NULL;  /* packet type ID field (or 
 > len) for VLAN tag */
 > @@ -2643,10 +2643,10 @@
 >  __be16 *svlan_encapsulated_proto = NULL; /* packet type ID field (or 
 > len) for SVLAN tag */
 >  
 >  if (pkt_dev->nr_labels)
 > -protocol = __constant_htons(ETH_P_MPLS_UC);
 > +protocol = htons(ETH_P_MPLS_UC);
 >  
 >  if (pkt_dev->vlan_id != 0x)
 > -protocol = __constant_htons(ETH_P_8021Q);
 > +protocol = htons(ETH_P_8021Q);
 >  
 >  /* Update any of the values, used when we're incrementing various
 >   * fields.
 > @@ -2677,14 +2677,14 @@
 > pkt_dev->svlan_cfi,
 > pkt_dev->svlan_p);
 >  svlan_encapsulated_proto = (__be16 *)skb_put(skb, 
 > sizeof(__be16));
 > -*svlan_encapsulated_proto = 
 > __constant_htons(ETH_P_8021Q);
 > +*svlan_encapsulated_proto = htons(ETH_P_8021Q);
 >  }
 >  vlan_tci = (__be16 *)skb_put(skb, sizeof(__be16));
 >  *vlan_tci = build_tci(pkt_dev->vlan_id,
 >pkt_dev->vlan_cfi,
 >pkt_dev->vlan_p);
 >  vlan_encapsulated_proto = (__be16 *)skb_put(skb, 
 > sizeof(__be16));
 > -*vlan_encapsulated_proto = __constant_htons(ETH_P_IPV6);
 > +*vlan_encapsulated_proto = htons(ETH_P_IPV6);
 >  

[PATCH 2/4] pktgen: use random32

2007-02-28 Thread Robert Olsson

Thanks!

It seems like the network code has a preference for net_random(), but the two
are the same now.

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Cheers.

--ro

Stephen Hemminger writes:
 > Can use random32() now.
 > 
 > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
 > 
 > ---
 >  net/core/pktgen.c |   52 
 > +++-
 >  1 file changed, 19 insertions(+), 33 deletions(-)
 > 
 > --- pktgen.orig/net/core/pktgen.c2007-02-26 14:34:36.0 -0800
 > +++ pktgen/net/core/pktgen.c 2007-02-26 14:39:53.0 -0800
 > @@ -464,17 +464,6 @@
 >  return tmp;
 >  }
 >  
 > -static inline u32 pktgen_random(void)
 > -{
 > -#if 0
 > -__u32 n;
 > -get_random_bytes(&n, 4);
 > -return n;
 > -#else
 > -return net_random();
 > -#endif
 > -}
 > -
 >  static inline __u64 getCurMs(void)
 >  {
 >  struct timeval tv;
 > @@ -2091,7 +2080,7 @@
 >  int flow = 0;
 >  
 >  if (pkt_dev->cflows) {
 > -flow = pktgen_random() % pkt_dev->cflows;
 > +flow = random32() % pkt_dev->cflows;
 >  
 >  if (pkt_dev->flows[flow].count > pkt_dev->lflow)
 >  pkt_dev->flows[flow].count = 0;
 > @@ -2103,7 +2092,7 @@
 >  __u32 tmp;
 >  
 >  if (pkt_dev->flags & F_MACSRC_RND)
 > -mc = pktgen_random() % (pkt_dev->src_mac_count);
 > +mc = random32() % pkt_dev->src_mac_count;
 >  else {
 >  mc = pkt_dev->cur_src_mac_offset++;
 >  if (pkt_dev->cur_src_mac_offset >
 > @@ -2129,7 +2118,7 @@
 >  __u32 tmp;
 >  
 >  if (pkt_dev->flags & F_MACDST_RND)
 > -mc = pktgen_random() % (pkt_dev->dst_mac_count);
 > +mc = random32() % pkt_dev->dst_mac_count;
 >  
 >  else {
 >  mc = pkt_dev->cur_dst_mac_offset++;
 > @@ -2156,24 +2145,23 @@
 >  for(i = 0; i < pkt_dev->nr_labels; i++)
 >  if (pkt_dev->labels[i] & MPLS_STACK_BOTTOM)
 >  pkt_dev->labels[i] = MPLS_STACK_BOTTOM |
 > - ((__force __be32)pktgen_random() &
 > + ((__force __be32)random32() &
 >htonl(0x000f));
 >  }
 >  
 >  if ((pkt_dev->flags & F_VID_RND) && (pkt_dev->vlan_id != 0x)) {
 > -pkt_dev->vlan_id = pktgen_random() % 4096;
 > +pkt_dev->vlan_id = random32() & (4096-1);
 >  }
 >  
 >  if ((pkt_dev->flags & F_SVID_RND) && (pkt_dev->svlan_id != 0x)) {
 > -pkt_dev->svlan_id = pktgen_random() % 4096;
 > +pkt_dev->svlan_id = random32() & (4096 - 1);
 >  }
 >  
 >  if (pkt_dev->udp_src_min < pkt_dev->udp_src_max) {
 >  if (pkt_dev->flags & F_UDPSRC_RND)
 > -pkt_dev->cur_udp_src =
 > -((pktgen_random() %
 > -  (pkt_dev->udp_src_max - pkt_dev->udp_src_min)) +
 > - pkt_dev->udp_src_min);
 > +pkt_dev->cur_udp_src = random32() %
 > +(pkt_dev->udp_src_max - pkt_dev->udp_src_min)
 > ++ pkt_dev->udp_src_min;
 >  
 >  else {
 >  pkt_dev->cur_udp_src++;
 > @@ -2184,10 +2172,9 @@
 >  
 >  if (pkt_dev->udp_dst_min < pkt_dev->udp_dst_max) {
 >  if (pkt_dev->flags & F_UDPDST_RND) {
 > -pkt_dev->cur_udp_dst =
 > -((pktgen_random() %
 > -  (pkt_dev->udp_dst_max - pkt_dev->udp_dst_min)) +
 > - pkt_dev->udp_dst_min);
 > +pkt_dev->cur_udp_dst = random32() %
 > +(pkt_dev->udp_dst_max - pkt_dev->udp_dst_min)
 > ++ pkt_dev->udp_dst_min;
 >  } else {
 >  pkt_dev->cur_udp_dst++;
 >  if (pkt_dev->cur_udp_dst >= pkt_dev->udp_dst_max)
 > @@ -2202,7 +2189,7 @@
 > saddr_max))) {
 >  __u32 t;
 >  if (pkt_dev->flags & F_IPSRC_RND)
 > -t = ((pktgen_random

[PATCH 1/4] pktgen: use pr_debug

2007-02-28 Thread Robert Olsson

Thanks!

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

--ro



Stephen Hemminger writes:
 > Remove private debug macro and replace with standard version
 > 
 > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
 > 
 > 
 > ---
 >  net/core/pktgen.c |   34 +++---
 >  1 file changed, 15 insertions(+), 19 deletions(-)
 > 
 > --- pktgen.orig/net/core/pktgen.c2007-02-26 13:21:54.0 -0800
 > +++ pktgen/net/core/pktgen.c 2007-02-26 13:22:04.0 -0800
 > @@ -163,9 +163,6 @@
 >  
 >  #define VERSION  "pktgen v2.68: Packet Generator for packet performance 
 > testing.\n"
 >  
 > -/* #define PG_DEBUG(a) a */
 > -#define PG_DEBUG(a)
 > -
 >  /* The buckets are exponential in 'width' */
 >  #define LAT_BUCKETS_MAX 32
 >  #define IP_NAME_SZ 32
 > @@ -1856,8 +1853,7 @@
 >  int ret = 0;
 >  
 >  mutex_lock(&pktgen_thread_lock);
 > -PG_DEBUG(printk("pktgen: pktgen_mark_device marking %s for removal\n",
 > -ifname));
 > +pr_debug("pktgen: pktgen_mark_device marking %s for removal\n", ifname);
 >  
 >  while (1) {
 >  
 > @@ -1866,8 +1862,8 @@
 >  break;  /* success */
 >  
 >  mutex_unlock(&pktgen_thread_lock);
 > -PG_DEBUG(printk("pktgen: pktgen_mark_device waiting for %s "
 > -"to disappear\n", ifname));
 > +pr_debug("pktgen: pktgen_mark_device waiting for %s "
 > +"to disappear\n", ifname);
 >  schedule_timeout_interruptible(msecs_to_jiffies(msec_per_try));
 >  mutex_lock(&pktgen_thread_lock);
 >  
 > @@ -2847,7 +2843,7 @@
 >  struct pktgen_dev *pkt_dev;
 >  int started = 0;
 >  
 > -PG_DEBUG(printk("pktgen: entering pktgen_run. %p\n", t));
 > +pr_debug("pktgen: entering pktgen_run. %p\n", t);
 >  
 >  if_lock(t);
 >  list_for_each_entry(pkt_dev, &t->if_list, list) {
 > @@ -2879,7 +2875,7 @@
 >  {
 >  struct pktgen_thread *t;
 >  
 > -PG_DEBUG(printk("pktgen: entering pktgen_stop_all_threads_ifs.\n"));
 > +pr_debug("pktgen: entering pktgen_stop_all_threads_ifs.\n");
 >  
 >  mutex_lock(&pktgen_thread_lock);
 >  
 > @@ -2947,7 +2943,7 @@
 >  {
 >  struct pktgen_thread *t;
 >  
 > -PG_DEBUG(printk("pktgen: entering pktgen_run_all_threads.\n"));
 > +pr_debug("pktgen: entering pktgen_run_all_threads.\n");
 >  
 >  mutex_lock(&pktgen_thread_lock);
 >  
 > @@ -3039,7 +3035,7 @@
 >  {
 >  struct pktgen_dev *pkt_dev;
 >  
 > -PG_DEBUG(printk("pktgen: entering pktgen_stop\n"));
 > +pr_debug("pktgen: entering pktgen_stop\n");
 >  
 >  if_lock(t);
 >  
 > @@ -3063,7 +3059,7 @@
 >  struct list_head *q, *n;
 >  struct pktgen_dev *cur;
 >  
 > -PG_DEBUG(printk("pktgen: entering pktgen_rem_one_if\n"));
 > +pr_debug("pktgen: entering pktgen_rem_one_if\n");
 >  
 >  if_lock(t);
 >  
 > @@ -3092,7 +3088,7 @@
 >  
 >  /* Remove all devices, free mem */
 >  
 > -PG_DEBUG(printk("pktgen: entering pktgen_rem_all_ifs\n"));
 > +pr_debug("pktgen: entering pktgen_rem_all_ifs\n");
 >  if_lock(t);
 >  
 >  list_for_each_safe(q, n, &t->if_list) {
 > @@ -3275,7 +3271,7 @@
 >  
 >  t->pid = current->pid;
 >  
 > -PG_DEBUG(printk("pktgen: starting pktgen/%d:  pid=%d\n", cpu, 
 > current->pid));
 > +pr_debug("pktgen: starting pktgen/%d:  pid=%d\n", cpu, current->pid);
 >  
 >  max_before_softirq = t->max_before_softirq;
 >  
 > @@ -3336,13 +3332,13 @@
 >  set_current_state(TASK_INTERRUPTIBLE);
 >  }
 >  
 > -PG_DEBUG(printk("pktgen: %s stopping all device\n", t->tsk->comm));
 > +pr_debug("pktgen: %s stopping all device\n", t->tsk->comm);
 >  pktgen_stop(t);
 >  
 > -PG_DEBUG(printk("pktgen: %s removing all device\n", t->tsk->comm));
 > +pr_debug("pktgen: %s removing all device\n", t->tsk->comm);
 >  pktgen_rem_all_ifs(t);
 >  
 > -PG_DEBUG(printk("pktgen: %s removing thread.\n", t->tsk->comm));
 > +pr_debug("pktgen: %s removing thread.\n", t->tsk->comm);
 >  pktgen_rem_thread(t);
 >  
 >  return 0;
 > @@ -3361,7 +3357,7 @@
 >  }
 >  
 > 

Re: Extensible hashing and RCU

2007-02-21 Thread Robert Olsson

David Miller writes:

 > But what about if tree lookup were free :-)
 > 
 > This is why I consider Robert Olsson's trash work the most promising,
 > if we stick sockets into his full flow identified routing cache trie
 > entries, we can eliminate lookup altogether.
 > 
 > Just like how he already uses traffic inspection to kill cache entries
 > when FIN shutdown sequence is snooped, we can explicitly flush such
 > entries when socket is closed fully on local system.

 Below are current tree stats from a "production" flow-logging application at a 
 large university (real traffic) using this full-flow lookup.

 With 100k flows we typically see an average depth of 1.3 and a max depth of 4.
 Right now there is only a dst entry in the leaf nodes, so yes, anything else we
 can put in a leaf is "free".

trie:
Aver depth: 1.31
Max depth:  4
Leaves: 99930
Internal nodes: 14925
  1: 13450  2: 1465  3: 9  18: 1
Pointers: 294976
Null ptrs: 180122
Total size: 7259  kB

Cheers.
--ro


Re: Extensible hashing and RCU

2007-02-19 Thread Robert Olsson

Andi Kleen writes:

 > > If not, you loose.
 > 
 > It all depends on if the higher levels on the trie are small
 > enough to be kept in cache. Even with two cache misses it might
 > still break even, but have better scalability.

 Yes, the trick is to keep the root large to allow a very flat tree and few
 cache misses. Stefan Nilsson (author of the LC-trie) and I were able to
 improve the LC-trie quite a bit; we called this trie+hash -> trash.

 The paper discusses the trie/hash design... (you've seen it)

 http://www.nada.kth.se/~snilsson/public/papers/trash/

 > Another advantage would be to eliminate the need for large memory
 > blocks, which cause problems too e.g. on NUMA. It certainly would
 > save quite some memory if the tree levels are allocated on demand
 > only. However breaking it up might also cost more TLB misses, 
 > but those could be eliminated by preallocating the tree in
 > the same way as the hash today. Don't know if it's needed or not.
 > 
 > I guess someone needs to code it up and try it.
 
 I've implemented trie/trash as a replacement for the dst hash, doing full-key
 lookup for ipv4 (unicache) to start with. The work is still focusing on the
 nasty part, packet forwarding, as we don't want to break this, so the benefits
 of full flow lookup are not yet accounted for. I.e. the full flow lookup could
 give us the socket at no cost and do some conntrack support like Evgeniy did
 in his netchannels patches.

 Below are some recent comparisons and profiles for packet forwarding.

Input: 2 * 65k concurrent flows, eth0->eth1 and eth2->eth3 in forwarding,
each on a separate CPU. Opteron 2218 (2.6 GHz), net-2.6.21 git. Numbers are
approximate but should still be representative. Profiles are collected.

Performance comparison
--
Table below holds: dst-entries in use, lookup hits, slow path and total pps

Flowlen 40

250k  1020 + 21 = 1041 kpps  Vanilla rt_hash=32k
1M     950 + 29 =  979 kpps  Vanilla rt_hash=131k
260k   980 + 24 = 1004 kpps  Unicache

Flowlen 4 (rdos)

290k 560 + 162 = 722 kpps  Vanilla  rt_hash=32k
1M   400 + 165 = 565 kpps  Vanilla  rt_hash=131k
230k 570 + 170 = 740 kpps  Unicache 


unicache flen=4 pkts

c02df84f 5257 7.72078 tkey_extract_bits
c023151a 5230 7.68112 e1000_clean_rx_irq
c02df908 3306 4.85541 tkey_equals
c014cf31 3296 4.84072 kfree
c02f8c3b 3067 4.5044  ip_route_input
c02fbdf0 2948 4.32963 ip_forward
c023024e 2809 4.12548 e1000_xmit_frame
c02e06f1 2792 4.10052 trie_lookup
c02fd764 2159 3.17085 ip_output
c032591c 1965 2.88593 fn_trie_lookup
c014cd82 1456 2.13838 kmem_cache_alloc
c02fa941 1337 1.96361 ip_rcv
c014ced0 1334 1.9592  kmem_cache_free
c02e1538 1289 1.89311 unicache_tcp_establish
c02e2d70 1218 1.78884 dev_queue_xmit
c02e31af 1074 1.57735 netif_receive_skb
c02f8484 1053 1.54651 ip_route_input_slow
c02db552 987  1.44957 __alloc_skb
c02e626f 913  1.34089 dst_alloc
c02edaad 828  1.21606 __qdisc_run
c0321ccf 810  1.18962 fib_get_table
c02e14c1 782  1.1485  match_pktgen
c02e6375 766  1.125   dst_destroy
c02e10e8 728  1.06919 unicache_hash_code
c0231242 647  0.950227 e1000_clean_tx_irq
c02f7d23 625  0.917916ipv4_dst_destroy

unicache flen=40 pkts
-
c023151a 6742 10.3704 e1000_clean_rx_irq
c02df908 4553 7.00332 tkey_equals
c02fbdf0 4455 6.85258 ip_forward
c02e06f1 4067 6.25577 trie_lookup
c02f8c3b 3951 6.07734 ip_route_input
c02df84f 3929 6.0435  tkey_extract_bits
c023024e 3538 5.44207 e1000_xmit_frame
c014cf31 3152 4.84834 kfree
c02fd764 2711 4.17ip_output
c02e1538 1930 2.96868 unicache_tcp_establish
c02fa941 1696 2.60875 ip_rcv
c02e31af 1466 2.25497 netif_receive_skb
c02e2d70 1412 2.17191 dev_queue_xmit
c014cd82 1397 2.14883 kmem_cache_alloc
c02db552 1394 2.14422 __alloc_skb
c02edaad 1032 1.5874  __qdisc_run
c02ed6b8 957  1.47204 eth_header
c02e15dd 904  1.39051 unicache_garbage_collect_active
c02db94e 861  1.32437 kfree_skb
c0231242 794  1.22131 e1000_clean_tx_irq
c022fd58 778  1.1967  e1000_tx_map
c014ce73 756  1.16286 __kmalloc
c014ced0 740  1.13825 kmem_cache_free
c02e14c1 701  1.07826 match_pktgen
c023002c 621  0.955208 e1000_tx_queue
c02e78fa 519  0.798314neigh_resolve_output

Vanilla w. flen=4 pkts rt_hash=32k
--
c02f6852 15704 22.9102 ip_route_input
c023151a 5324 7.76705 e1000_clean_rx_irq
c02f84a1 4457 6.5022  ip_rcv
c02f9950 3065 4.47145 ip_forward
c023024e 2630 3.83684 e1000_xmit_frame
c0323380 2343 3.41814 fn_trie_lookup
c02fb2c4 2181 3.1818  ip_output
c02f4a3b 1839 2.68287 rt_intern_hash
c02f4480 1762 2.57054 rt_may_expire
c02f60

NAPI quota and budget

2007-02-05 Thread Robert Olsson

Hello!

K. Salah writes:
 
 > I have a question about the quota per poll in NAPI.  Any idea how the
 > quota of 64 packets per poll per NIC was selected?  Why 64, not any
 > other number?  Is there science behind this number.

 The number comes from experimentation. The "science" is that it's
 possible to set any number at all.

 The first NAPI driver, tulip (100 Mbps), increased the RX ring size from 32
 to 128 descriptors and used 16 pkts per poll.

 A quota of 64 for GigE devices (tested with 256 descriptors) seemed to be a
 good compromise in experiments.  16 < 64 < netdev_budget
  
 > And the same goes for the budget of 300?
 > 
 > Was that done based on expermintation?

 Yes, it was the size of the central backlog queue before NAPI. It's possible
 netdev_budget can be increased a bit today, as packet-processing speed
 has increased.

 Cheers.
--ro


Re: [PATCH] net: Fix the fib trie iterator to work with a single entry routing tables

2007-01-26 Thread Robert Olsson

David Miller writes:

 > > Yes the case when the trie is just a single leaf got wrong with the
 > > iterator and your patchs cures it. I think we have a similar problem
 > > with /proc/net/fib_trie
 > 
 > I'm happy to review a fix for that :-)
 
When the main table is just a single leaf, it gets printed as belonging to the
local table in /proc/net/fib_trie. A fix is below.

Cheers.
    --ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


--- net-2.6.20/net/ipv4/fib_trie.c.orig 2007-01-26 13:18:13.0 +0100
+++ net-2.6.20/net/ipv4/fib_trie.c  2007-01-26 13:19:36.0 +0100
@@ -2290,16 +2290,17 @@
if (v == SEQ_START_TOKEN)
return 0;
 
+   if (!NODE_PARENT(n)) {
+   if (iter->trie == trie_local)
+   seq_puts(seq, ":\n");
+   else
+   seq_puts(seq, ":\n");
+   } 
+
if (IS_TNODE(n)) {
struct tnode *tn = (struct tnode *) n;
__be32 prf = htonl(MASK_PFX(tn->key, tn->pos));
 
-   if (!NODE_PARENT(n)) {
-   if (iter->trie == trie_local)
-   seq_puts(seq, ":\n");
-   else
-   seq_puts(seq, ":\n");
-   } 
seq_indent(seq, iter->depth-1);
seq_printf(seq, "  +-- %d.%d.%d.%d/%d %d %d %d\n",
   NIPQUAD(prf), tn->pos, tn->bits, tn->full_children, 
 

 


[PATCH] net: Fix the fib trie iterator to work with a single entry routing tables

2007-01-24 Thread Robert Olsson


Hello!

Yes, the case where the trie is just a single leaf went wrong with the
iterator, and your patch cures it. I think we have a similar problem with
/proc/net/fib_trie.

Cheers
--ro

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>



Eric W. Biederman writes:
 > 
 > In a kernel with trie routing enabled I had a simple routing setup
 > with only a single route to the outside world and no default
 > route. "ip route table list main" showed my the route just fine but
 > /proc/net/route was an empty file.  What was going on?
 > 
 > Thinking it was a bug in something I did and I looked deeper.  Eventually
 > I setup a second route and everything looked correct, huh?  Finally I
 > realized that the it was just the iterator pair in fib_trie_get_first,
 > fib_trie_get_next just could not handle a routing table with a single entry.
 > 
 > So to save myself and others further confusion, here is a simple fix for
 > the fib proc iterator so it works even when there is only a single route
 > in a routing table.
 > 
 > Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
 > ---
 >  net/ipv4/fib_trie.c |   21 -
 >  1 files changed, 16 insertions(+), 5 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index cfb249c..13307c0 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -1989,6 +1989,10 @@ static struct node *fib_trie_get_next(struct 
 > fib_trie_iter *iter)
 >  unsigned cindex = iter->index;
 >  struct tnode *p;
 >  
 > +/* A single entry routing table */
 > +if (!tn)
 > +return NULL;
 > +
 >  pr_debug("get_next iter={node=%p index=%d depth=%d}\n",
 >   iter->tnode, iter->index, iter->depth);
 >  rescan:
 > @@ -2037,11 +2041,18 @@ static struct node *fib_trie_get_first(struct 
 > fib_trie_iter *iter,
 >  if(!iter)
 >  return NULL;
 >  
 > -if (n && IS_TNODE(n)) {
 > -iter->tnode = (struct tnode *) n;
 > -iter->trie = t;
 > -iter->index = 0;
 > -iter->depth = 1;
 > +if (n) {
 > +if (IS_TNODE(n)) {
 > +iter->tnode = (struct tnode *) n;
 > +iter->trie = t;
 > +iter->index = 0;
 > +iter->depth = 1;
 > +} else {
 > +iter->tnode = NULL;
 > +iter->trie  = t;
 > +iter->index = 0;
 > +iter->depth = 0;
 > +}
 >  return n;
 >  }
 >  return NULL;
 > -- 
 > 1.4.4.1.g278f


Re: pktgen

2006-12-01 Thread Robert Olsson

Alexey Dobriyan writes:

 > 
 > Confused now. Is my "t->control &= ~(T_TERMINATE);" fix deprecated by
 > completions?
 
 It's not needed with the completion patch, as that does the job a bit more
 mainstream. The T_TERMINATE fix seems to work well; I've tested on a machine
 with 4 CPUs. Only once did I notice that only 2 of 4 threads got started,
 but it could be something else...

 And the race is hardly seen with any real use of pktgen... Let's hear
 what others think... and the completion patch was out-of-date.

 Cheers.
--ro
 


Re: pktgen

2006-12-01 Thread Robert Olsson

David Miller writes:

 > Agreed.
 > 
 > Robert, please fix this by using a completion so that we can
 > wait for the threads to start up, something like this:

Included. It passes my tests, but let Alexey and others test it too.
Cheers.
--ro

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 733d86d..a630a73 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -147,6 +147,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -160,7 +161,7 @@
 #include  /* do_div */
 #include 
 
-#define VERSION  "pktgen v2.68: Packet Generator for packet performance 
testing.\n"
+#define VERSION  "pktgen v2.69: Packet Generator for packet performance 
testing.\n"
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -206,6 +207,11 @@ static struct proc_dir_entry *pg_proc_di
 #define VLAN_TAG_SIZE(x) ((x)->vlan_id == 0x ? 0 : 4)
 #define SVLAN_TAG_SIZE(x) ((x)->svlan_id == 0x ? 0 : 4)
 
+struct pktgen_thread_info {
+   struct pktgen_thread *t;
+   struct completion *c;
+};
+
 struct flow_state {
__u32 cur_daddr;
int count;
@@ -3264,10 +3270,11 @@ out:;
  * Main loop of the thread goes here
  */
 
-static void pktgen_thread_worker(struct pktgen_thread *t)
+static void pktgen_thread_worker(struct pktgen_thread_info *info)
 {
DEFINE_WAIT(wait);
struct pktgen_dev *pkt_dev = NULL;
+   struct pktgen_thread *t = info->t;
int cpu = t->cpu;
sigset_t tmpsig;
u32 max_before_softirq;
@@ -3307,6 +3314,8 @@ static void pktgen_thread_worker(struct 
__set_current_state(TASK_INTERRUPTIBLE);
mb();
 
+complete(info->c);
+
while (1) {
 
__set_current_state(TASK_RUNNING);
@@ -3518,6 +3527,8 @@ static struct pktgen_thread *__init pktg
 static int __init pktgen_create_thread(const char *name, int cpu)
 {
int err;
+   struct pktgen_thread_info info;
+struct completion started;
struct pktgen_thread *t = NULL;
struct proc_dir_entry *pe;
 
@@ -3558,7 +3569,11 @@ static int __init pktgen_create_thread(c
 
t->removed = 0;
 
-   err = kernel_thread((void *)pktgen_thread_worker, (void *)t,
+   init_completion(&started);
+info.t = t;
+info.c = &started;
+
+   err = kernel_thread((void *)pktgen_thread_worker, (void *)&info,
  CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
if (err < 0) {
printk("pktgen: kernel_thread() failed for cpu %d\n", t->cpu);
@@ -3568,6 +3583,7 @@ static int __init pktgen_create_thread(c
return err;
}
 
+   wait_for_completion(&started);
return 0;
 }
 


Re: pktgen

2006-11-30 Thread Robert Olsson


Hello!

Seems you found a race when rmmod is done before the module is fully started.

Try:

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 733d86d..ac0b4b1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -160,7 +160,7 @@
 #include  /* do_div */
 #include 
 
-#define VERSION  "pktgen v2.68: Packet Generator for packet performance 
testing.\n"
+#define VERSION  "pktgen v2.69: Packet Generator for packet performance 
testing.\n"
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -3673,6 +3673,8 @@ static void __exit pg_cleanup(void)
struct list_head *q, *n;
wait_queue_head_t queue;
init_waitqueue_head(&queue);
+   
+   schedule_timeout_interruptible(msecs_to_jiffies(125));
 
/* Stop all interfaces & threads */
 


for i in 1 2 3 4 5 ; do modprobe pktgen ; rmmod pktgen ; done

In dmesg
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.

Cheers.
--ro



Alexey Dobriyan writes:
 > On 11/30/06, David Miller <[EMAIL PROTECTED]> wrote:
 > > From: Alexey Dobriyan <[EMAIL PROTECTED]>
 > > Date: Wed, 29 Nov 2006 23:04:37 +0300
 > >
 > > > Looks like worker thread strategically clears it if scheduled at wrong
 > > > moment.
 > > >
 > > > --- a/net/core/pktgen.c
 > > > +++ b/net/core/pktgen.c
 > > > @@ -3292,7 +3292,6 @@ static void pktgen_thread_worker(struct
 > > >
 > > >  init_waitqueue_head(&t->queue);
 > > >
 > > > -t->control &= ~(T_TERMINATE);
 > > >  t->control &= ~(T_RUN);
 > > >  t->control &= ~(T_STOP);
 > > >  t->control &= ~(T_REMDEVALL);
 > >
 > > Good catch Alexey.  Did you rerun the load/unload test with
 > > this fix applied?  If it fixes things, I'll merge it.
 > 
 > Well, yes, it fixes things, except Ctrl+C getting you out of
 > modprobe/rmmod loop will spit
 > backtrace again. And other flags: T_RUN, T_STOP. Clearance is not
 > needed due to kZalloc and
 > create bugs as demostrated.
 > 
 > Give me some time.


Re: pktgen patch available for perusal.

2006-11-06 Thread Robert Olsson

Ben Greear writes:
 > > Changes:
 > > * use a nano-second timer based on the scheduler timer (TSC) for relative 
 > > times, instead of get_time_of_day.

Seems I missed setting tsc as the clocksource. It makes a difference:
performance is back to normal and I'm less confused.

e1000 82546GB @ 1.6 GHz Opteron. Kernel 2.6.19-rc1_Bifrost_-g18e199c6-dirty 

echo acpi_pm> 
/sys/devices/system/clocksource/clocksource0/current_clocksource   

psize pps
-
60  556333
124  526942
252  452981
508  234996
1020  119748
1496  82248

echo tsc> 
/sys/devices/system/clocksource/clocksource0/current_clocksource 

psize pps
-
60  819914
124  747286
252  452975
508  234993
1020  119749
1496  82247

Cheers. 
--ro


Re: pktgen patch available for perusal.

2006-11-06 Thread Robert Olsson

jamal writes:

 > If you are listening then start with:
 > 
 > 1) Do a simple test with just udp traffic as above, doing simple
 > accounting. This helps you to get a feel on how things work.
 > 2) modify the matching rules to match your magic cookie
 > 3) write a simple action invoked by your matching rules and use tc to
 > add it to the policy.
 > 4) integrate in your app now that you know what you are doing.

 Yes. Sounds like a simple and general solution. No callbacks, no #ifdefs,
 no extra modules. Just a little recipe in pktgen.txt.

 Cheers.
--ro
 


Re: pktgen patch available for perusal.

2006-11-03 Thread Robert Olsson

Ben Greear writes:

 > It requires a hook in dev.c, or at least that is the only way I can 
 > think to implement it.

Well, the hook can be placed anywhere along the packet path, even in drivers.
In tulip I didn't even take the packet off the ring in some experiments.

And there are plenty of existing hooks already, e.g. PRE_ROUTE?

 > I don't think that hook is going to be allowed into the kernel, so the 
 > best I was hoping for  was to have the code in pktgen with #if 0 and let 
 > users patch their kernel to add the hook...

Right, so what I was trying to say was: rather than having #if 0 and dead code
in pktgen, it could be kept separate.

Cheers.
--ro



Re: pktgen patch available for perusal.

2006-11-03 Thread Robert Olsson

Ben Greear writes:

 > I'd be thrilled to have the receive logic go into pktgen, even if it was #if 
 > 0 with a comment
 > showing how to patch dev.c to get it working.  It would make my out-of-tree 
 > patch smaller
 > and should help others who are doing research and driver development...


 Just an idea... the RX part is pretty separate from TX. How about we keep
 this code in a small, separate, optional module? It could be developed into
 some general RX probe.

 Cheers.
--ro


pktgen patch available for perusal.

2006-11-01 Thread Robert Olsson

Ben Greear writes:
 > I've completed the first pass of my changes to pktgen in 2.6.18.
 > Many of these features are probably DOA based on previous conversations,
 > but perhaps this will help someone

 Thanks. Well, sometimes there is a need to capture and drop packets at
 various points, so it's handy to have code fragments ready and just add the
 hook where it's needed, in a driver or the RX softirq etc.

 > Changes:
 > * use a nano-second timer based on the scheduler timer (TSC) for relative 
 > times, instead of get_time_of_day.

 Any chance you could extract this one so we can compare with get_time_of_day?
 This pops up in the profiles, and TX numbers are not what they used to be.

 Cheers.
--ro


RFC: Removing busy-spin in pktgen.

2006-10-24 Thread Robert Olsson

Ben Greear writes:
 > I'm planning to re-merge my long-lost pktgen branch with the kernel
 > tree's pktgen.
 > 
 > I believe the main difference is that my out-of-tree pktgen does not do the
 > busy-spin, but waits on a queue for the net-device to wake it's tx-queue
 > when over-driving a NIC.
 > 
 > To implement this, I added a hook in the netdev-wake-queue logic and let
 > pktgen register itself as interested
 > 
 > Is there any interest in adding this sort of feature to the official
 > pktgen?

 Hello!

 So the question is: should pktgen hook into netif_wake_queue, or keep
 busy-spinning?

 My first instinct is to avoid the hook in general code and let pktgen spin,
 so that only testing is affected.
 
 Cheers.
--ro


[PATCH] DSCP support for pktgen

2006-09-26 Thread Robert Olsson

Francesco Fondelli writes:

 > Anyway, I've been asked to add support for managing DSCP codepoints, so
 > one can test DiffServ capable routers. It's very simple code and is working 
 > for
 > me.

Looks fine too. We'll ask Dave to apply this one as well.

Cheers.
--ro


Signed-off-by: Francesco Fondelli <[EMAIL PROTECTED]>
Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>



--- new/net/core/pktgen.c.vlan  2006-09-26 14:53:34.0 +0200
+++ new/net/core/pktgen.c   2006-09-26 14:54:49.0 +0200
@@ -160,7 +160,7 @@
 #include  /* do_div */
 #include 
 
-#define VERSION  "pktgen v2.67: Packet Generator for packet performance testing.\n"
+#define VERSION  "pktgen v2.68: Packet Generator for packet performance testing.\n"
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -292,6 +292,10 @@
__u16 udp_dst_min;  /* inclusive, dest UDP port */
__u16 udp_dst_max;  /* exclusive, dest UDP port */
 
+   /* DSCP + ECN */
+   __u8 tos;/* six most significant bits of (former) IPv4 TOS are for dscp codepoint */
+   __u8 traffic_class;  /* ditto for the (former) Traffic Class in IPv6 (see RFC 3260, sec. 4) */
+
/* MPLS */
unsigned nr_labels; /* Depth of stack, 0 = no MPLS */
__be32 labels[MAX_MPLS_LABELS];
@@ -671,6 +675,14 @@
   pkt_dev->svlan_id, pkt_dev->svlan_p, pkt_dev->svlan_cfi);
}
 
+   if (pkt_dev->tos) {
+   seq_printf(seq, " tos: 0x%02x\n", pkt_dev->tos);
+   }
+
+   if (pkt_dev->traffic_class) {
+   seq_printf(seq, " traffic_class: 0x%02x\n", pkt_dev->traffic_class);
+   }
+
seq_printf(seq, " Flags: ");
 
if (pkt_dev->flags & F_IPV6)
@@ -748,12 +760,12 @@
 }
 
 
-static int hex32_arg(const char __user *user_buffer, __u32 *num)
+static int hex32_arg(const char __user *user_buffer, unsigned long maxlen, __u32 *num)
 {
int i = 0;
*num = 0;
 
-   for(; i < 8; i++) {
+   for(; i < maxlen; i++) {
char c;
*num <<= 4;
if (get_user(c, &user_buffer[i]))
@@ -848,7 +860,7 @@
pkt_dev->nr_labels = 0;
do {
__u32 tmp;
-   len = hex32_arg(&buffer[i], &tmp);
+   len = hex32_arg(&buffer[i], 8, &tmp);
if (len <= 0)
return len;
pkt_dev->labels[n] = htonl(tmp);
@@ -1185,11 +1197,15 @@
else if (strcmp(f, "!SVID_RND") == 0)
pkt_dev->flags &= ~F_SVID_RND;
 
+   else if (strcmp(f, "!IPV6") == 0)
+   pkt_dev->flags &= ~F_IPV6;
+
else {
sprintf(pg_result,
"Flag -:%s:- unknown\nAvailable flags, (prepend ! to un-set flag):\n%s",
f,
-   "IPSRC_RND, IPDST_RND, TXSIZE_RND, UDPSRC_RND, UDPDST_RND, MACSRC_RND, MACDST_RND\n");
+   "IPSRC_RND, IPDST_RND, UDPSRC_RND, UDPDST_RND, "
+   "MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, MPLS_RND, VID_RND, SVID_RND\n");
return count;
}
sprintf(pg_result, "OK: flags=0x%x", pkt_dev->flags);
@@ -1615,6 +1631,38 @@
return count;
}
 
+   if (!strcmp(name, "tos")) {
+   __u32 tmp_value = 0;
+   len = hex32_arg(&user_buffer[i], 2, &tmp_value);
+   if (len < 0) {
+   return len;
+   }
+   i += len;
+   if (len == 2) {
+   pkt_dev->tos = tmp_value;
+   sprintf(pg_result, "OK: tos=0x%02x", pkt_dev->tos);
+   } else {
+   sprintf(pg_result, "ERROR: tos must be 00-ff");
+   }
+   return count;
+   }
+
+   if (!strcmp(name, "traffic_class")) {
+   __u32 tmp_value = 0;
+   len = hex32_arg(&user_buffer[i], 2, &tmp_value);
+   if (len < 0) {
+   return len;
+   }
+   i += len;
+   if (len == 2) {
+   pkt_dev->traffic_class = tmp_value;
+   sprintf(pg_result, "OK: traffic_class=0x%02x", pkt_dev->traffic_class);
+   } else {
+   sprintf(pg_result, "ERROR: traffic_class must be 00-ff");
+   }
+   return count;
+   }
+
sprintf(pkt_dev->result, &qu

[PATCH] vlan support for pktgen

2006-09-26 Thread Robert Olsson

Francesco Fondelli writes:

 > The attached patch allows pktgen to produce 802.1Q and Q-in-Q tagged frames.
 > I have used it for stress test a bridge and seems ok to me.
 > Unfortunately I have no access to net-2.6.x git tree so the diff is against
 > 2.6.17.13.

Thanks a lot.

Looks fine, but I haven't been able to test Q-in-Q. Below is a git diff.
We'll ask Dave to apply it.

Cheers.
--ro


Signed-off-by: Francesco Fondelli <[EMAIL PROTECTED]>
Acked-by:  Steven Whitehouse <[EMAIL PROTECTED]>
Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index c23e9c0..e418224 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -109,6 +109,8 @@
  *
  * MPLS support by Steven Whitehouse <[EMAIL PROTECTED]>
  *
+ * 802.1Q/Q-in-Q support by Francesco Fondelli (FF) <[EMAIL PROTECTED]>
+ *
  */
 #include 
 #include 
@@ -137,6 +139,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -178,6 +181,8 @@
 #define F_TXSIZE_RND  (1<<6)   /* Transmit size is random */
 #define F_IPV6(1<<7)   /* Interface in IPV6 Mode */
 #define F_MPLS_RND(1<<8)   /* Random MPLS labels */
+#define F_VID_RND (1<<9)   /* Random VLAN ID */
+#define F_SVID_RND(1<<10)  /* Random SVLAN ID */
 
 /* Thread control flag bits */
 #define T_TERMINATE   (1<<0)
@@ -198,6 +203,9 @@ static struct proc_dir_entry *pg_proc_di
 
 #define MAX_CFLOWS  65536
 
+#define VLAN_TAG_SIZE(x) ((x)->vlan_id == 0xffff ? 0 : 4)
+#define SVLAN_TAG_SIZE(x) ((x)->svlan_id == 0xffff ? 0 : 4)
+
 struct flow_state {
__u32 cur_daddr;
int count;
@@ -287,6 +295,15 @@ struct pktgen_dev {
/* MPLS */
unsigned nr_labels; /* Depth of stack, 0 = no MPLS */
__be32 labels[MAX_MPLS_LABELS];
+   
+   /* VLAN/SVLAN (802.1Q/Q-in-Q) */
+   __u8  vlan_p;
+   __u8  vlan_cfi;
+   __u16 vlan_id;  /* 0xffff means no vlan tag */
+
+   __u8  svlan_p;
+   __u8  svlan_cfi;
+   __u16 svlan_id; /* 0xffff means no svlan tag */
 
__u32 src_mac_count;/* How many MACs to iterate through */
__u32 dst_mac_count;/* How many MACs to iterate through */
@@ -644,6 +661,16 @@ static int pktgen_if_show(struct seq_fil
   i == pkt_dev->nr_labels-1 ? "\n" : ", ");
}
 
+   if (pkt_dev->vlan_id != 0xffff) {
+   seq_printf(seq, " vlan_id: %u  vlan_p: %u  vlan_cfi: %u\n",
+  pkt_dev->vlan_id, pkt_dev->vlan_p, pkt_dev->vlan_cfi);
+   }
+   
+   if (pkt_dev->svlan_id != 0xffff) {
+   seq_printf(seq, " svlan_id: %u  vlan_p: %u  vlan_cfi: %u\n",
+  pkt_dev->svlan_id, pkt_dev->svlan_p, pkt_dev->svlan_cfi);
+   }
+
seq_printf(seq, " Flags: ");
 
if (pkt_dev->flags & F_IPV6)
@@ -673,6 +700,12 @@ static int pktgen_if_show(struct seq_fil
if (pkt_dev->flags & F_MACDST_RND)
seq_printf(seq, "MACDST_RND  ");
 
+   if (pkt_dev->flags & F_VID_RND)
+   seq_printf(seq, "VID_RND  ");
+
+   if (pkt_dev->flags & F_SVID_RND)
+   seq_printf(seq, "SVID_RND  ");
+
seq_puts(seq, "\n");
 
sa = pkt_dev->started_at;
@@ -1140,6 +1173,18 @@ static ssize_t pktgen_if_write(struct fi
else if (strcmp(f, "!MPLS_RND") == 0)
pkt_dev->flags &= ~F_MPLS_RND;
 
+   else if (strcmp(f, "VID_RND") == 0)
+   pkt_dev->flags |= F_VID_RND;
+
+   else if (strcmp(f, "!VID_RND") == 0)
+   pkt_dev->flags &= ~F_VID_RND;
+
+   else if (strcmp(f, "SVID_RND") == 0)
+   pkt_dev->flags |= F_SVID_RND;
+
+   else if (strcmp(f, "!SVID_RND") == 0)
+   pkt_dev->flags &= ~F_SVID_RND;
+
else {
sprintf(pg_result,
"Flag -:%s:- unknown\nAvailable flags, (prepend 
! to un-set flag):\n%s",
@@ -1445,6 +1490,128 @@ static ssize_t pktgen_if_write(struct fi
offset += sprintf(pg_result + offset,
  "%08x%s", ntohl(pkt_dev->labels[n]),
  n == pkt_dev->nr_labels-1 ? "" : ",");
+   
+   if (pkt_dev->nr_labels && pkt_dev->vlan_id != 0xffff) {
+   pkt_dev->vlan_id = 0xffff; /* turn off VLAN/SVLAN */
+   pkt_dev->svlan_id = 0xffff;
+   
+  

[PATCH] Documentation update pktgen

2006-09-26 Thread Robert Olsson


And the docs.

Cheers.
--ro


Signed-off-by: Francesco Fondelli <[EMAIL PROTECTED]>
Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/Documentation/networking/pktgen.txt 
b/Documentation/networking/pktgen.txt
index 278771c..6062e35 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -100,6 +100,7 @@ Examples:
  are: IPSRC_RND #IP Source is random (between min/max),
   IPDST_RND, UDPSRC_RND,
   UDPDST_RND, MACSRC_RND, MACDST_RND 
+  MPLS_RND, VID_RND, SVID_RND 
 
  pgset "udp_src_min 9"   set UDP source port min, If < udp_src_max, then
  cycle through the port range.
@@ -125,6 +126,21 @@ Examples:
 
  pgset "mpls 0"  turn off mpls (or any invalid argument works 
too!)
 
+ pgset "vlan_id 77"   set VLAN ID 0-4095
+ pgset "vlan_p 3" set priority bit 0-7 (default 0)
+ pgset "vlan_cfi 0"   set canonical format identifier 0-1 (default 0)
+
+ pgset "svlan_id 22"  set SVLAN ID 0-4095
+ pgset "svlan_p 3"set priority bit 0-7 (default 0)
+ pgset "svlan_cfi 0"  set canonical format identifier 0-1 (default 0)
+
+ pgset "vlan_id 9999" > 4095 remove vlan and svlan tags
+ pgset "svlan 9999"   > 4095 remove svlan tag
+
+
+ pgset "tos XX"   set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00)
+ pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00)
+
  pgset stop  aborts injection. Also, ^C aborts generator.
 
 


[PATCH] vlan support for pktgen

2006-09-14 Thread Robert Olsson

Francesco Fondelli writes:

 > The attached patch allows pktgen to produce 802.1Q and Q-in-Q tagged frames.
 > I have used it for stress test a bridge and seems ok to me.
 > Unfortunately I have no access to net-2.6.x git tree so the diff is against
 > 2.6.17.13.
 > If you have a moment look over it, I think this feature could be useful for
 > someone else.

 Thanks a lot. The code seems straightforward, but I'm not too experienced
 with VLAN hacking and need to find some test environment when I get a chance;
 maybe someone else has better opportunities to verify.

 Cheers.
--ro


Re: Paper on lookup in Linux.

2006-09-04 Thread Robert Olsson

Andi Kleen writes:

 > The reason I'm asking is that we still have trouble with the TCP hash tables
 > taking far too much memory, and your new data structure might provide a nice 
 > alternative. 

 Yes, it's dynamic and self-tuning, so there is no need to reserve memory in
 advance, and performance is still comparable to a perfect hash. Insert/delete
 and GC are fast, and locking is also doable; we used RCU like fib_trie, but
 at _bh level in this "application".
 
 Cheers.
--ro


Re: Paper on lookup in Linux.

2006-09-04 Thread Robert Olsson

Andi Kleen writes:
 > On Monday 04 September 2006 13:43, Robert Olsson wrote:
 > > 
 > > Hello.
 > > 
 > > People on this list might find this paper interesting:
 > > http://www.csc.kth.se/~snilsson/public/papers/trash/
 > 
 > Looks nice. Have you looked at using it for local TCP/UDP socket
 > lookups too or would that be part of the "unified flow cache"? 


 No, we haven't put struct socket in the leaves (result nodes) yet; so far we
 just kept dst entries and some stateful flow variables that we used for
 active GC and flow logging. So it's a 128-bit flow version of the dst hash.
 It would have taken a lot more work, and we wanted to get this paper out.
 Also, ip_route_input is not the place to have the lookup if a unified cache
 is to include IPv6. So we focused on designing and building the lookup core.


 Cheers.
--ro

 struct leaf {
__u32 key[LPK];
unsigned long parent;
void *obj;  /* Generic leaf object */
__u32 state;
unsigned int len;
int iif;
struct timeval start;
struct rcu_head rcu;
};


Paper on lookup in Linux.

2006-09-04 Thread Robert Olsson

Hello.

People on this list might find this paper interesting:
http://www.csc.kth.se/~snilsson/public/papers/trash/

Abstract is below. Feel free to redistribute.

Cheers. 
--ro


TRASH - A dynamic LC-trie and hash data structure
Robert Olsson and Stefan Nilsson

A dynamic LC-trie is currently used in the Linux kernel to implement address 
lookup in the IP routing table. The main virtue of this data structure is that 
it supports both fast address lookups and frequent updates of the table. Also, 
it has an efficient memory management scheme and supports multi-processor 
architectures using the RCU locking mechanism. The structure scales nicely: 
the expected number of memory accesses for one lookup is O(log log n), where 
n is the number of entries in the lookup table. In particular, the time does 
not depend on the length of the keys: 32-bit IPv4 addresses and 128-bit 
addresses do not make a difference in this respect.

In this article we introduce TRASH, a combination of a dynamic LC-trie and a 
hash function. TRASH is a general purpose data structure supporting fast 
lookup, insert and delete operations for arbitrarily long bit strings. TRASH 
enhances the level-compression part of the LC-trie by prepending a header to 
each key. The header is a hash value based on the complete key. The extended 
keys will behave like uniformly distributed data and hence the average and 
maximum depth is typically very small, in practice less than 1.5 and 5, 
respectively.

We have implemented the scheme in the Linux kernel as a replacement for the 
dst cache (IPv4) and performed a full scale test on a production router using 
128-bit flow-based lookups. The Linux implementation of TRASH inherits the 
efficient RCU locking mechanism from the dynamic LC-trie implementation. In 
particular, the lookup time increases only marginally for longer keys and TRASH
is highly insensitive to different types of data. The performance figures are 
very promising and the cache mechanism could easily be extended to serve as a 
unified lookup for fast socket lookup, flow logging, connection tracking and 
stateful networking in general.

Keywords: trie, LC-trie, hash, hashtrie, Linux, flow lookup, garbage collection
Trita-CSC-TCS 2006:2, ISRN/KTH/CSC/TCS-2006/2-SE, ISSN 1653-7092. 



[e1000]: flow control on by default - good idea really?

2006-07-05 Thread Robert Olsson

jamal writes:

 > The default setup on the e1000 has rx flow control turned on.
 > I was sending at wire rate gige from the device - which is about
 > 1.48Mpps. The e1000 was in turn sending me flow control packets
 > as per default/expected behavior. Unfortunately, it was sending
 > a very large amount of packets. At one point i was seeing upto
 > 1Mpps and on average, the flow control packets were consuming
 > 60-70% of the bandwidth. Even when i fixed this behavior to act
 > properly, allowing flow control on consumed up to 15% of the bandwidth. 
 > Clearly, this is a bad thing. Yes, the device in the first instance was
 > at fault. But i have argued in the past that NAPI does just fine without
 > flow control being turned on, so even chewing 5% of bandwidth on flow
 > control is a bad thing..

 Interesting numbers. But one can argue that if there were no control packets, 
 there would be data packets sent and dropped instead. Still, I'll agree with 
 you, as flow control seems hard to monitor and tune, and it takes debugging 
 down to link and chip level. It is much easier and more robust just to read 
 the dropped-packet counter on the box that couldn't handle the load than to 
 chase opaque flow-control packets somewhere else and try to figure out 
 what's going on. 
  
 > As a compromise, can we declare flow control as an advanced feature
 > and turn it off by default? People who feel it is valuable and know
 > what they are doing can turn it off.

 At least for NAPI-scheme drivers.

 > If you want more details just shoot.

 Cheers.
--ro


Re: [Bugme-new] [Bug 6682] New: BUG: soft lockup detected on CPU#0! / ksoftirqd takse 100% CPU

2006-06-20 Thread Robert Olsson

Hello!

Yes, it seems the system is very loaded for some reason 

> > sometimes a day) we get 100% usage on ksoftirqd/0 and following messages 
   in logs:

as all softirqs are run via ksoftirqd. That's still OK, but why doesn't the 
watchdog get any CPU share at all? A mismatch in priorities? 

Herbert Xu writes:
> On Mon, Jun 19, 2006 at 10:20:10PM +0000, Andrew Morton wrote:
 > >
 > > >  [] dev_queue_xmit+0xe0/0x203
 > > >  [] ip_output+0x1e1/0x237
 > > >  [] ip_forward+0x181/0x1df
 > > >  [] ip_rcv+0x40c/0x485
 > > >  [] netif_receive_skb+0x12f/0x165
 > > >  [] e1000_clean_rx_irq+0x389/0x410 [e1000]
 > > >  [] e1000_clean+0x94/0x12f [e1000]
 > > >  [] net_rx_action+0x69/0xf0
 > > >  [] __do_softirq+0x55/0xbd
 > > >  [] do_softirq+0x2d/0x31
 > > >  [] local_bh_enable+0x5a/0x65
 > > >  [] rt_run_flush+0x5f/0x80

Normal for a router...

 > Could you tell us the frequency of route updates on this machine?
 > Route updates are pretty expensive especially when a large number
 > of flows hits your machine right afterwards.

Yes, the flush is costly and unfortunately hard to avoid. We discussed this a
bit before...

 > You can monitor this by running ip mon.  You might just be getting
 > bogus route updates causing unnecessary flushes to the routing cache.

I just sampled 10 minutes on one of our routers carrying 2 * (full BGP).
Remember that Zebra/Quagga keeps just one set of routes in the kernel. Anyway,
during the 10 minutes I looked, I saw 4 insertions/deletions per second on average.

Cheers.
--ro

 
 


Re: netif_tx_disable and lockless TX

2006-06-02 Thread Robert Olsson

Stephen Hemminger writes:

 > I also noticed that you really don't save much by doing TX cleaning at 
 > hardirq, because in hardirq you need to do dev_kfree_irq and that causes 
 > a softirq (for the routing case where users=1). So when routing it 
 > doesn't make much difference, both methods cause the softirq delayed 
 > processing to be invoked. For locally generated packets which are 
 > cloned, the hardirq will drop the ref count, and that is faster than 
 > doing the whole softirq round trip.

 Right. Also the other way around: repeated ->poll calls can avoid TX hardirqs.

 Cheers.
--ro


Re: netif_tx_disable and lockless TX

2006-05-31 Thread Robert Olsson

jamal writes:

 > Latency-wise: TX completion interrupt provides the best latency.
 > Processing in the poll() -aka softirq- was almost close to the hardirq
 > variant. So if you can make things run in a softirq such as transmit
 > one, then the numbers will likely stay the same.
 
 I don't remember whether we tried a tasklet for TX à la Herbert's suggestion,
 but we did use tasklets for controlling RX processing, to avoid hardirq
 livelock, in pre-NAPI versions.

 We had variants of the tulip driver with TX cleaning both at ->poll and at
 hardirq and didn't see any performance difference. The ->poll variant was
 much cleaner, but we kept Alexey's original work for tulip.

 > Sorry, I havent been following discussions on netchannels[1] so i am not
 > qualified to comment on the "replacement" part Dave mentioned earlier.
 > What I can say is the tx processing doesnt have to be part of the NAPI
 > poll() and still use hardirq.

 Yes, true, but I see good TX numbers with newer boards (wire-rate small
 packets) with cleaning in ->poll. Also, Linux is now very safe in network
 "overload" situations; moving work to hardirq may change that.

 Cheers.

--ro


fib_trie leaf freeing [PATCH]

2006-04-04 Thread Robert Olsson

Hello!

It seems leaves (end nodes) have been freed by __tnode_free_rcu and not by
__leaf_free_rcu. This fixes the problem: only tnode_free is now used, and it
checks for the appropriate node type. free_leaf can be removed.

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Cheers.
--ro


diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.406"
+#define VERSION "0.407"
 
 #include 
 #include 
@@ -314,11 +314,6 @@ static void __leaf_free_rcu(struct rcu_h
kfree(container_of(head, struct leaf, rcu));
 }
 
-static inline void free_leaf(struct leaf *leaf)
-{
-   call_rcu(&leaf->rcu, __leaf_free_rcu);
-}
-
 static void __leaf_info_free_rcu(struct rcu_head *head)
 {
kfree(container_of(head, struct leaf_info, rcu));
@@ -357,7 +352,12 @@ static void __tnode_free_rcu(struct rcu_
 
 static inline void tnode_free(struct tnode *tn)
 {
-   call_rcu(&tn->rcu, __tnode_free_rcu);
+   if (IS_LEAF(tn)) {
+   struct leaf *l = (struct leaf *) tn;
+   call_rcu_bh(&l->rcu, __leaf_free_rcu);
+   }
+   else
+   call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
 static struct leaf *leaf_new(void)


Re: MPLS extension for pktgen

2006-03-22 Thread Robert Olsson


Steven Whitehouse writes:
 > Here is the updated patch. If you are happy with this, then will you send
 > it on to Dave, or should I do that?


 Hello! 

 Yes, the minor problem with the "return code" is fixed; now:
 Result: OK: mpls=0001000a,0002000a,0000000a

 So I guess Dave will apply it. 

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>
 
 Thanks.
--ro
  




 > 
 > Signed-off-by: Steven Whitehouse <[EMAIL PROTECTED]>
 > 
 > diff --git a/Documentation/networking/pktgen.txt 
 > b/Documentation/networking/pktgen.txt
 > --- a/Documentation/networking/pktgen.txt
 > +++ b/Documentation/networking/pktgen.txt
 > @@ -109,6 +109,22 @@ Examples:
 >   cycle through the port range.
 >   pgset "udp_dst_max 9"   set UDP destination port max.
 >  
 > + pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example
 > + outer label=16,middle label=32,
 > + inner label=0 (IPv4 NULL)) Note that
 > + there must be no spaces between the
 > + arguments. Leading zeros are required.
 > + Do not set the bottom of stack bit,
 > + thats done automatically. If you do
 > + set the bottom of stack bit, that
 > + indicates that you want to randomly
 > + generate that address and the flag
 > + MPLS_RND will be turned on. You
 > + can have any mix of random and fixed
 > + labels in the label stack.
 > +
 > + pgset "mpls 0"   turn off mpls (or any invalid argument works 
 > too!)
 > +
 >   pgset stop   aborts injection. Also, ^C aborts generator.
 >  
 >  
 > @@ -167,6 +183,8 @@ pkt_size 
 >  min_pkt_size
 >  max_pkt_size
 >  
 > +mpls
 > +
 >  udp_src_min
 >  udp_src_max
 >  
 > @@ -211,4 +229,4 @@ Grant Grundler for testing on IA-64 and 
 >  Stephen Hemminger, Andi Kleen, Dave Miller and many others.
 >  
 >  
 > -Good luck with the linux net-development.
 > \ No newline at end of file
 > +Good luck with the linux net-development.
 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -106,6 +106,9 @@
 >   *
 >   * interruptible_sleep_on_timeout() replaced Nishanth Aravamudan <[EMAIL 
 > PROTECTED]> 
 >   * 050103
 > + *
 > + * MPLS support by Steven Whitehouse <[EMAIL PROTECTED]>
 > + *
 >   */
 >  #include 
 >  #include 
 > @@ -154,7 +157,7 @@
 >  #include   /* do_div */
 >  #include 
 >  
 > -#define VERSION  "pktgen v2.66: Packet Generator for packet performance 
 > testing.\n"
 > +#define VERSION  "pktgen v2.67: Packet Generator for packet performance 
 > testing.\n"
 >  
 >  /* #define PG_DEBUG(a) a */
 >  #define PG_DEBUG(a)
 > @@ -162,6 +165,8 @@
 >  /* The buckets are exponential in 'width' */
 >  #define LAT_BUCKETS_MAX 32
 >  #define IP_NAME_SZ 32
 > +#define MAX_MPLS_LABELS 16 /* This is the max label stack depth */
 >  #define MPLS_STACK_BOTTOM __constant_htonl(0x00000100)
 >  
 >  /* Device flag bits */
 >  #define F_IPSRC_RND   (1<<0)/* IP-Src Random  */
 > @@ -172,6 +177,7 @@
 >  #define F_MACDST_RND  (1<<5)/* MAC-Dst Random */
 >  #define F_TXSIZE_RND  (1<<6)/* Transmit size is random */
 >  #define F_IPV6(1<<7)/* Interface in IPV6 Mode */
 > +#define F_MPLS_RND(1<<8)/* Random MPLS labels */
 >  
 >  /* Thread control flag bits */
 >  #define T_TERMINATE   (1<<0)
 > @@ -278,6 +284,10 @@ struct pktgen_dev {
 >  __u16 udp_dst_min;  /* inclusive, dest UDP port */
 >  __u16 udp_dst_max;  /* exclusive, dest UDP port */
 >  
 > +/* MPLS */
 > +unsigned nr_labels; /* Depth of stack, 0 = no MPLS */
 > +__be32 labels[MAX_MPLS_LABELS];
 > +
 >  __u32 src_mac_count;/* How many MACs to iterate through */
 >  __u32 dst_mac_count;/* How many MACs to iterate through */
 >  
 > @@ -623,9 +633,19 @@ static int pktgen_if_show(struct seq_fil
 > pkt_dev->udp_dst_min, pkt_dev->udp_dst_max);
 >  
 >  seq_printf(seq,
 > -   " src_mac_count: %d  dst_mac_count: %d \n Flags: ",
 > +   " src_mac_count: %d  dst_mac_count: 

MPLS extension for pktgen

2006-03-21 Thread Robert Olsson

Steven Whitehouse writes:

 > I've been looking into MPLS recently and so one of the first things that
 > would be useful is a testbed to generate test traffic, and hence the
 > attached patch to pktgen.
 > 
 > If you have a moment to look over it, then please let me know if you
 > would give it your blessing. The patch is against davem's current
 > net-2.6.17 tree,

 Nice. Well, I never thought about MPLS, but it seems possible too. With MPLS
 enabled it seems to send something my tcpdump does not understand, so I trust
 you there. And it does not seem to break standard IPv4 sending, so it should
 be OK.

 But I guess the MPLS result code is not what you expected...

 echo "mpls 0001000a,0002000a,0000000a"  >  /proc/net/pktgen/eth1
 cat /proc/net/pktgen/eth1 | grep Res
 Result: 0000000a

  sprintf(pg_result, "OK: mpls=");
  for(n = 0; n < pkt_dev->nr_labels; n++)
   sprintf(pg_result, "%08x%s", ntohl(pkt_dev->labels[n]),
n == pkt_dev->nr_labels-1 ? "" : ",");

 Cheers.
--ro



 


Re: PKTGEN: Adds thread limit parameter.

2006-03-13 Thread Robert Olsson

Luiz Fernando Capitulino writes:

 >  Well, I wouldn't say it's _really_ needed. But it really avoids having
 > too many thread entries in the pktgen's /proc directory, and as a good
 > result, you will not have pending threads which will never run as well.
 > 
 >  Also note that the patch is trivial, if you look at it in detail,
 > you'll see that the biggest change we have is the 'if' part. The rest I
 > would call cosmetic because the behaivor is the same.
 > 
 > | If so -- Wouldn't a concept of a bitmask to control also which CPU's
 > | that runs the threads be more general?
 > 
 >  Sounds like a bit complex, and would be my turn to ask if it's
 > really needed. :)

 A bit set for each CPU where a pktgen process will be running.
  
 >  * Some minor fixes and cleanups, like functions returns being not
 >  checked.
 > 
 >  * A new command called 'rem_device' to remove one device at a time
 > (currently, we can only remove all devices in one shoot with
 > 'rem_devices_all')
 > 
 >  * Ports pktgen to use the kernel thread API

 The current model was chosen for simplicity and to bind a device to a
 specific CPU so it never changes. Any change will need to be carefully
 tested.
 
 >  * cleanup the debug function usage
 
 I would like to remove the do_softirq stuff from pktgen... 

 Cheers.
--ro


PKTGEN: Adds thread limit parameter.

2006-03-13 Thread Robert Olsson

Hello!

Is this really needed?

If so -- wouldn't a bitmask controlling which CPUs run the threads be more
general?

Cheers.
--ro


Luiz Fernando Capitulino writes:
 > 
 > Currently, pktgen will create one thread for each online CPU in the
 > system. That behaivor may be annoying if you're using pktgen in a
 > large SMP system.
 > 
 > This patch adds a new module parameter called 'pg_max_threads', which
 > can be set to the maximum number of threads pktgen should create.
 > 
 > For example, if you're playing with pktgen in SMP system with 8
 > CPUs, but want only two pktgen's threads, you can do:
 > 
 >modprobe pktgen pg_max_threads=2
 > 
 > Signed-off-by: Luiz Capitulino <[EMAIL PROTECTED]>
 > 
 > ---
 > 
 >  net/core/pktgen.c |   23 +--
 >  1 files changed, 17 insertions(+), 6 deletions(-)
 > 
 > e210ad47d0db52496fdaabdd50bfe6ee0bcc53cd
 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > index 37b25a6..994aef1 100644
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -154,7 +154,7 @@
 >  #include   /* do_div */
 >  #include 
 >  
 > -#define VERSION  "pktgen v2.66: Packet Generator for packet performance 
 > testing.\n"
 > +#define VERSION  "pktgen v2.67: Packet Generator for packet performance 
 > testing.\n"
 >  
 >  /* #define PG_DEBUG(a) a */
 >  #define PG_DEBUG(a)
 > @@ -488,6 +488,7 @@ static unsigned int fmt_ip6(char *s, con
 >  static int pg_count_d = 1000;   /* 1000 pkts by default */
 >  static int pg_delay_d;
 >  static int pg_clone_skb_d;
 > +static int pg_max_threads;
 >  static int debug;
 >  
 >  static DEFINE_MUTEX(pktgen_thread_lock);
 > @@ -3184,7 +3185,7 @@ static int pktgen_remove_device(struct p
 >  
 >  static int __init pg_init(void)
 >  {
 > -int cpu;
 > +int i, threads;
 >  struct proc_dir_entry *pe;
 >  
 >  printk(version);
 > @@ -3208,15 +3209,24 @@ static int __init pg_init(void)
 >  /* Register us to receive netdevice events */
 >  register_netdevice_notifier(&pktgen_notifier_block);
 >  
 > -for_each_online_cpu(cpu) {
 > +threads = num_online_cpus();
 > +
 > +/*
 > + * Check if we should have less threads than the number
 > + * of online CPUs
 > + */
 > +if ((pg_max_threads > 0) && (pg_max_threads < threads))
 > +threads = pg_max_threads;
 > +
 > +for (i = 0; i < threads; i++) {
 >  int err;
 >  char buf[30];
 >  
 > -sprintf(buf, "kpktgend_%i", cpu);
 > -err = pktgen_create_thread(buf, cpu);
 > +sprintf(buf, "kpktgend_%i", i);
 > +err = pktgen_create_thread(buf, i);
 >  if (err)
 >  printk("pktgen: WARNING: Cannot create thread for cpu 
 > %d (%d)\n",
 > -cpu, err);
 > +i, err);
 >  }
 >  
 >  if (list_empty(&pktgen_threads)) {
 > @@ -3263,4 +3273,5 @@ MODULE_LICENSE("GPL");
 >  module_param(pg_count_d, int, 0);
 >  module_param(pg_delay_d, int, 0);
 >  module_param(pg_clone_skb_d, int, 0);
 > +module_param(pg_max_threads, int, 0);
 >  module_param(debug, int, 0);
 > -- 
 > 1.2.4.gbe76
 > 
 > 
 > 
 > -- 
 > Luiz Fernando N. Capitulino


Re: [PATCH] pktgen: fix races between control/worker threads

2006-02-22 Thread Robert Olsson

Jesse Brandeburg writes:
 > 
 > I looked quickly at this on a couple different machines and wasn't
 > able to reproduce, so don't let me block the patch.  I think it's a
 > good patch FWIW

 OK! 
 We'll ask Dave to apply it.

 Cheers.
--ro


fib_trie stats fix [PATCH]

2006-02-22 Thread Robert Olsson


fib_trie stats has been buggy and has caused oopses on some platforms, such as OpenWrt.
The patch below should cure those problems.

Cheers.
--ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


--- linux-2.6.16-rc4/net/ipv4/fib_trie.c.061021 2006-02-21 22:22:57.0 
+0100
+++ linux-2.6.16-rc4/net/ipv4/fib_trie.c2006-02-21 22:31:10.0 
+0100
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.405"
+#define VERSION "0.406"
 
 #include 
 #include 
@@ -84,7 +84,7 @@
 #include "fib_lookup.h"
 
 #undef CONFIG_IP_FIB_TRIE_STATS
-#define MAX_CHILDS 16384
+#define MAX_STAT_DEPTH 32
 
 #define KEYLENGTH (8*sizeof(t_key))
 #define MASK_PFX(k, l) (((l)==0)?0:(k >> (KEYLENGTH-l)) << (KEYLENGTH-l))
@@ -154,7 +154,7 @@
unsigned int tnodes;
unsigned int leaves;
unsigned int nullpointers;
-   unsigned int nodesizes[MAX_CHILDS];
+   unsigned int nodesizes[MAX_STAT_DEPTH];
 };
 
 struct trie {
@@ -2080,7 +2080,9 @@
int i;
 
s->tnodes++;
-   s->nodesizes[tn->bits]++;
+   if(tn->bits < MAX_STAT_DEPTH)
+   s->nodesizes[tn->bits]++;
+
for (i = 0; i < (1<bits); i++)
if (!tn->child[i])
s->nullpointers++;
@@ -2110,8 +2112,8 @@
seq_printf(seq, "\tInternal nodes: %d\n\t", stat->tnodes);
bytes += sizeof(struct tnode) * stat->tnodes;
 
-   max = MAX_CHILDS-1;
-   while (max >= 0 && stat->nodesizes[max] == 0)
+   max = MAX_STAT_DEPTH;
+   while (max > 0 && stat->nodesizes[max-1] == 0)
max--;
 
pointers = 0;



fib_trie initialzation fix [PATCH]

2006-02-22 Thread Robert Olsson


Hello!

In some kernel configs the /proc functions seem to be accessed before the trie 
is initialized. The patch below checks for this.


Cheers.
--ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


--- linux-2.6.16-rc4/net/ipv4/fib_trie.c.orig   2006-02-21 22:21:36.0 
+0100
+++ linux-2.6.16-rc4/net/ipv4/fib_trie.c2006-02-22 13:44:37.0 
+0100
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.404"
+#define VERSION "0.405"
 
 #include 
 #include 
@@ -2040,7 +2040,15 @@
 static struct node *fib_trie_get_first(struct fib_trie_iter *iter,
   struct trie *t)
 {
-   struct node *n = rcu_dereference(t->trie);
+   struct node *n ;
+
+   if(!t)
+   return NULL;
+   
+   n = rcu_dereference(t->trie);
+
+   if(!iter)
+   return NULL;
 
if (n && IS_TNODE(n)) {
iter->tnode = (struct tnode *) n;


Re: [PATCH] pktgen: fix races between control/worker threads

2006-02-17 Thread Robert Olsson

Arthur Kepner writes:

 Thanks. These races should be cured now. I've tested some runs and
 it works, but I didn't see any problems before either. We'll hear from 
 Jesse whether this cured his problems.

 Cheers.
--ro
 
 
Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


 > New patch differs from the previous one only in the ordering
 > of the two lines above.
 > 
 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -153,7 +153,7 @@
 >  #include 
 >  
 >  
 > -#define VERSION  "pktgen v2.63: Packet Generator for packet performance 
 > testing.\n"
 > +#define VERSION  "pktgen v2.64: Packet Generator for packet performance 
 > testing.\n"
 >  
 >  /* #define PG_DEBUG(a) a */
 >  #define PG_DEBUG(a) 
 > @@ -176,7 +176,8 @@
 >  #define T_TERMINATE   (1<<0)  
 >  #define T_STOP(1<<1)  /* Stop run */
 >  #define T_RUN (1<<2)  /* Start run */
 > -#define T_REMDEV  (1<<3)  /* Remove all devs */
 > +#define T_REMDEVALL   (1<<3)  /* Remove all devs */
 > +#define T_REMDEV  (1<<4)  /* Remove one dev */
 >  
 >  /* Locks */
 >  #define   thread_lock()down(&pktgen_sem)
 > @@ -218,6 +219,8 @@ struct pktgen_dev {
 >   * we will do a random selection from within the range.
 >   */
 >  __u32 flags; 
 > +int removal_mark;   /* non-zero => the device is marked for 
 > + * removal by worker thread */
 >  
 >  int min_pkt_size;/* = ETH_ZLEN; */
 >  int max_pkt_size;/* = ETH_ZLEN; */
 > @@ -481,7 +484,7 @@ static void pktgen_stop_all_threads_ifs(
 >  static int pktgen_stop_device(struct pktgen_dev *pkt_dev);
 >  static void pktgen_stop(struct pktgen_thread* t);
 >  static void pktgen_clear_counters(struct pktgen_dev *pkt_dev);
 > -static struct pktgen_dev *pktgen_NN_threads(const char* dev_name, int 
 > remove);
 > +static int pktgen_mark_device(const char* ifname);
 >  static unsigned int scan_ip6(const char *s,char ip[16]);
 >  static unsigned int fmt_ip6(char *s,const char ip[16]);
 >  
 > @@ -1406,7 +1409,7 @@ static ssize_t pktgen_thread_write(struc
 >  
 >  if (!strcmp(name, "rem_device_all")) {
 >  thread_lock();
 > -t->control |= T_REMDEV;
 > +t->control |= T_REMDEVALL;
 >  thread_unlock();
 >  schedule_timeout_interruptible(msecs_to_jiffies(125));  /* 
 > Propagate thread->control  */
 >  ret = count;
 > @@ -1457,7 +1460,8 @@ static struct pktgen_dev *__pktgen_NN_th
 >  if (pkt_dev) {
 >  if(remove) { 
 >  if_lock(t);
 > -pktgen_remove_device(t, pkt_dev);
 > +pkt_dev->removal_mark = 1;
 > +t->control |=  T_REMDEV;
 >  if_unlock(t);
 >  }
 >  break;
 > @@ -1467,13 +1471,44 @@ static struct pktgen_dev *__pktgen_NN_th
 >  return pkt_dev;
 >  }
 >  
 > -static struct pktgen_dev *pktgen_NN_threads(const char* ifname, int remove) 
 > +/* 
 > + * mark a device for removal 
 > + */
 > +static int pktgen_mark_device(const char* ifname) 
 >  {
 >  struct pktgen_dev *pkt_dev = NULL;
 > +const int max_tries = 10, msec_per_try = 125; 
 > +int i = 0;
 > +int ret = 0;
 > +
 >  thread_lock();
 > -pkt_dev = __pktgen_NN_threads(ifname, remove);
 > -thread_unlock();
 > -return pkt_dev;
 > +PG_DEBUG(printk("pktgen: pktgen_mark_device marking %s for 
 > removal\n", 
 > +ifname));
 > +
 > +while(1) { 
 > +
 > +pkt_dev = __pktgen_NN_threads(ifname, REMOVE);
 > +if (pkt_dev == NULL) break; /* success */
 > +
 > +thread_unlock();
 > +PG_DEBUG(printk("pktgen: pktgen_mark_device waiting for %s "
 > +"to disappear\n", ifname));
 > +schedule_timeout_interruptible(msecs_to_jiffies(msec_per_try));
 > +thread_lock();
 > +
 > +if (++i >= max_tries) {
 > +printk("pktgen_mark_device: timed out after waiting "
 > +"%d msec for device %s to be removed\n",
 > +msec_per_try*i, ifname);
 > +ret = 1;
 > +break;
 > +}
 

[PATCH] pktgen: fix races between control/worker threads

2006-02-16 Thread Robert Olsson

Arthur Kepner writes:
 > 
 > Let's try this again. How does this look, Robert?

 Yep, better.

 >  if(remove) { 

 > +t->control |=  T_REMDEV;
 > +pkt_dev->removal_mark = 1;

 >  }

 Guess you should mark before you set control to avoid the race above.
 Eventually T_REMDEV could be sent from the syncing loop you added
 too; well, maybe it's not needed.

 Luiz Fernando Capitulino <[EMAIL PROTECTED]> posted a 
 couple of patches before. One of the patches converted the thread list 
 to use the kernel list functions. Unfortunately it depended on a huge 
 whitespace patch.

 Cheers.

--ro

 


[PATCH] pktgen: potential double free of pktgen_dev's skb

2006-02-15 Thread Robert Olsson

Arthur Kepner writes:

 > There's a race in pktgen which can lead to a double
 > free of a pktgen_dev's skb. If a worker thread is in
 > the midst of doing fill_packet(), and the controlling
 > thread gets a "stop" message, the already freed skb
 > can be freed once again in pktgen_stop_device().

Hello!

A device can only belong to one CPU/thread, and to avoid races control 
messages are posted to the thread. In the case you mention, pktgen_stop 
is mostly called from the same CPU/thread. 

pktgen_stop_all_threads_ifs is an exception; it should be possible to 
set running in struct pktgen_dev to false for all devs in the thread 
and have pktgen_thread_worker do the pktgen_stop to avoid races.

The same trick could be used in the case we get NETDEV_UNREGISTER.
So in __pktgen_NN_threads, instead of removing the device, we set
running to false and pktgen_thread_worker does the remove job. It 
might need some syncing.  Anything you could try?

Cheers.
--ro



RE: Van Jacobson net channels

2006-02-02 Thread Robert Olsson

Leonid Grossman writes:

 > Right. Interrupt moderation is done on per channel basis. 
 > The only addition to the current NAPI mechanism I'd like to see is to
 > have NAPI setting desired interrupt rate (once interrupts are ON),
 > rather than use an interrupt per packet or a driver default. Arguably,
 > NAPI can figure out desired interrupt rate a bit better than a driver
 > can.

 In the current scheme a driver can easily use a dynamic interrupt scheme;
 in fact, tulip has used this for years. At low rates there are no delays at 
 all; if it reaches some threshold, it increases the interrupt latency. It can 
 be done in several levels. The best threshold seems, luckily, just to be to 
 count the number of packets sitting in the RX ring when ->poll is called. 
 Jamal experimented heavily with this and gave a talk at OLS 2000.

 Yes, if the net channel classifier runs in hardirq we get back to the livelock 
 situation sooner or later. IMO interrupts should just be a signal to 
 indicate work.

 Cheers.
--ro

 


[PATCH 0/4] - pktgen: refinements and small fixes (V2).

2006-01-30 Thread Robert Olsson

Luiz Fernando Capitulino writes:

 > [PATCH 1/4]  pktgen: Lindent run.
 > [PATCH 2/4]  pktgen: Ports thread list to Kernel list implementation.
 > [PATCH 3/4]  pktgen: Fix kernel_thread() fail leak.
 > [PATCH 4/4]  pktgen: Fix Initialization fail leak.
 > 
 >  The changes from V1 are:
 > 
 >  1. More fixes made by hand after Lindent run
 >  2. Re-diffed agains't Dave's net-2.6.17 tree

 Should be fine; I've used the previous version of the patches for a
 couple of days now. Thanks.

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

 Cheers.
--ro


Re: fib_rules w. RCU lock [PATCH]

2006-01-27 Thread Robert Olsson

David S. Miller writes:

 > > +  r = (struct fib_rule *) fib_rules.first;

 > I don't think this is right.
 > fib_rules.first is a pointer to the hlist_node (within the fib_rule)
 > not the fib_rule itself.

 It works because the hlist_node is the first member of struct fib_rule,
 but I agree it might be safer/better to use container_of, i.e.:

 r = container_of(fib_rules.first, struct fib_rule, hlist);
 My colleague Jens Laas spotted a similar cast, which is corrected as well.
 
 > What is the preempt_disable() actually protecting here?
 > It's a simple assignment to -1, and since the disable
 > occurs inside the ifindex test, it is not protecting
 > that either.

 The preempts are removed and an updated version of the patch is enclosed.

 Cheers.
        --ro


Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>


--- linux-2.6.14.5/net/ipv4/fib_rules.c.orig2006-01-27 10:18:32.0 
+0100
+++ linux-2.6.14.5/net/ipv4/fib_rules.c 2006-01-27 13:38:02.0 +0100
@@ -39,6 +39,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -51,7 +53,7 @@
 
 struct fib_rule
 {
-   struct fib_rule *r_next;
+   struct hlist_node hlist;
atomic_tr_clntref;
u32 r_preference;
unsigned char   r_table;
@@ -74,6 +76,7 @@
 #endif
charr_ifname[IFNAMSIZ];
int r_dead;
+   struct  rcu_head rcu;
 };
 
 static struct fib_rule default_rule = {
@@ -84,7 +87,6 @@
 };
 
 static struct fib_rule main_rule = {
-   .r_next =   &default_rule,
.r_clntref =ATOMIC_INIT(2),
.r_preference = 0x7FFE,
.r_table =  RT_TABLE_MAIN,
@@ -92,23 +94,24 @@
 };
 
 static struct fib_rule local_rule = {
-   .r_next =   &main_rule,
.r_clntref =ATOMIC_INIT(2),
.r_table =  RT_TABLE_LOCAL,
.r_action = RTN_UNICAST,
 };
 
-static struct fib_rule *fib_rules = &local_rule;
-static DEFINE_RWLOCK(fib_rules_lock);
+struct hlist_head fib_rules;
+
+/* writer func called from netlink -- rtnl_sem hold*/
 
 int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 {
struct rtattr **rta = arg;
struct rtmsg *rtm = NLMSG_DATA(nlh);
-   struct fib_rule *r, **rp;
+   struct fib_rule *r;
+   struct hlist_node *node;
int err = -ESRCH;
 
-   for (rp=&fib_rules; (r=*rp) != NULL; rp=&r->r_next) {
+   hlist_for_each_entry(r, node, &fib_rules, hlist) {
if ((!rta[RTA_SRC-1] || memcmp(RTA_DATA(rta[RTA_SRC-1]), 
&r->r_src, 4) == 0) &&
rtm->rtm_src_len == r->r_src_len &&
rtm->rtm_dst_len == r->r_dst_len &&
@@ -125,10 +128,8 @@
if (r == &local_rule)
break;
 
-   write_lock_bh(&fib_rules_lock);
-   *rp = r->r_next;
+   hlist_del_rcu(&r->hlist);   
r->r_dead = 1;
-   write_unlock_bh(&fib_rules_lock);
fib_rule_put(r);
err = 0;
break;
@@ -149,21 +150,30 @@
return NULL;
 }
 
+static inline void fib_rule_put_rcu(struct rcu_head *head)
+{
+   struct fib_rule *r = container_of(head, struct fib_rule, rcu);
+   kfree(r);
+}
+
 void fib_rule_put(struct fib_rule *r)
 {
if (atomic_dec_and_test(&r->r_clntref)) {
if (r->r_dead)
-   kfree(r);
+   call_rcu(&r->rcu, fib_rule_put_rcu);
else
printk("Freeing alive rule %p\n", r);
}
 }
 
+/* writer func called from netlink -- rtnl_sem hold*/
+
 int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
 {
struct rtattr **rta = arg;
struct rtmsg *rtm = NLMSG_DATA(nlh);
-   struct fib_rule *r, *new_r, **rp;
+   struct fib_rule *r, *new_r, *last = NULL;
+   struct hlist_node *node = NULL;
unsigned char table_id;
 
if (rtm->rtm_src_len > 32 || rtm->rtm_dst_len > 32 ||
@@ -187,6 +197,7 @@
if (!new_r)
return -ENOMEM;
memset(new_r, 0, sizeof(*new_r));
+
if (rta[RTA_SRC-1])
memcpy(&new_r->r_src, RTA_DATA(rta[RTA_SRC-1]), 4);
if (rta[RTA_DST-1])
@@ -219,28 +230,28 @@
if (rta[RTA_FLOW-1])
memcpy(&new_r->r_tclassid, RTA_DATA(rta[RTA_FLOW-1]), 4);
 #endif
+   r = container_of(fib_rules.first, struct fib_rule, hlist);
 
-   rp = &fib_rules;
if (!new_r->r_preference) {
-   r = fib_rules;
-   if (r && (r = r->r_next) != NULL) {
- 

Re: driver skb reuse

2006-01-25 Thread Robert Olsson

Benjamin LaHaise writes:

 > Right.  Btw, looking over your changes for skb reuse and how the e1000 
 > lays out its data structures, I think there is still some room for 
 > improvement: currently you still touch the skb that the e1000 used to 
 > allocate the receive buffers in.  That cacheline is likely to be L1 cold 
 > because of the large number of outstanding buffers the e1000 uses.  If you 
 > instead only touch the buffer_info structure to populate a freshly 
 > allocated skb, you can eliminate a couple of cache misses and increase 
 > the chances that the skb you're reusing stays in the L1 cache.  That will 
 > require a lot more surgery to the e1000 code, though, but it could be 
 > worth it.
 
 Getting rid of some cache misses would definitely be a big win. We go from 
 tutorial to real world. Are you brave enough to outline some code? 


 Cheers.
--ro


Re: driver skb reuse

2006-01-25 Thread Robert Olsson

jamal writes:

 > > Here we process (drop) about 13% more packets when skbs get reused.
 > > 
 > Very cool.
 > Robert, it would be interesting to see something more interesting
 > (longer code path) such as udp. You can still use a packetgen to shoot
 > at the box (netperf is a wimp), send it to udp port 9 which just
 > swallows packets.

 Neither UDP nor TCP kfree's the skb in its normal path today, so it cannot 
 be reused. ICMP packets seem to get reused as-is. Dave and Andi see some 
 possible improvement for TCP. 

 For UDP I just took a quick look. Only in case of errors does the skb get freed. 
 So, e.g., sending a UDP packet to a socket which is not open would make the skb 
 reused. Something we could test...

 The normal path, from what I understand, is to queue packets into the socket 
 receive queue:

 skb_queue_tail(&sk->sk_receive_queue, skb);

 Which could be an excellent place for UDP copy break code... which then
 makes skb reuse possible.

 Cheers.
--ro


Re: driver skb reuse

2006-01-24 Thread Robert Olsson

Benjamin LaHaise writes:

 > Instead of doing a completely separate skb reuse path, what happens if 
 > you remove the memset() from __alloc_skb() and instead do it in a slab 
 > ctor?  I remember seeing that in the profiles for af_unix.  Dave, could 
 > you refresh my memory why the slab ctor ended up being dropped?  Doing 
 > the memset() at free time is usually almost free since it is usually in 
 > the cache at that point.

 Right, but that's another issue... 

 This can be done; I played with a variant of this too. 

 I split alloc_skb into two parts so the memset() part could even be done 
 in the driver just before passing the skb to the stack.

 I did expect a big win, but I didn't see any gain. Strange, but the code 
 was bad, so it might be worth a try again. This is a bit different 
 from the skb reuse we discuss now, though.

 Cheers.
--ro



driver skb reuse

2006-01-24 Thread Robert Olsson

Hello! 
We discussed the reuse of skbs some time ago.

Below is some code to examine how skbs can be reused if the upper layer (RX 
softirq) can consume the skb, so that we within the NAPI path can detect and 
reuse it. It can give new possibilities for TCP optimization (davem), 
driver-common copybreak, etc. In the test below I use my usual lab setup but 
just let netfilter drop the packets. We win about 13% in this experiment.

Single Opteron 252 (2.6 GHz) CPU. e1000 6.2.15 Linux 2.6.14.5
Input rate 2 x 1.16 Mpps a netfilter drop rule for pktgen in the forward chain

Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 6680261  0 3319739  0  5  0  0  0 BRU
eth1   1500   0  0  0  0  0  3  0  0  0 BRU
eth2   1500   0 6680344  0 3319656  0  3  0  0  0 BRU
eth3   1500   0  0  0  0  0  3  0  0  0 BRU

With the consume/recycle code below.

Iface   MTU Met  RX-OK RX-ERR RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 7541191  0 2458809  0  5  0  0  0 BRU
eth1   1500   0  0  0  0  0  3  0  0  0 BRU
eth2   1500   0 7541150  0 2458850  0  3  0  0  0 BRU
eth3   1500   0  0  0  0  0  3  0  0  0 BRU

Here we process (drop) about 13% more packets when skbs get reused.


--- e1000_main.c.060124 2006-01-24 12:30:10.0 +0100
+++ e1000_main.c2006-01-24 12:28:58.0 +0100
@@ -3802,6 +3802,10 @@
 
skb->protocol = eth_type_trans(skb, netdev);
 #ifdef CONFIG_E1000_NAPI
+
+   /* Increment users so skb don't get destructed */
+   skb_get(skb);
+
 #ifdef NETIF_F_HW_VLAN_TX
if(unlikely(adapter->vlgrp &&
(status & E1000_RXD_STAT_VP))) {
@@ -3814,6 +3818,23 @@
 #else
netif_receive_skb(skb);
 #endif
+
+   /* 
+* If skb is consumed by RX softirq we can simply use it again 
+* otherwise undo the users increment with kfree
+*/
+
+   if (!multi_descriptor && atomic_read(&skb->users) == 1 && 
+   realloc_skb(skb, adapter->rx_buffer_len, GFP_ATOMIC)) {
+   
+   skb_reserve(skb, 16);
+   skb->dev = netdev;
+   buffer_info->skb = skb;
+   adapter->net_stats.rx_compressed++; 
+   }
+   else 
+   kfree_skb(skb);
+
 #else /* CONFIG_E1000_NAPI */
 #ifdef NETIF_F_HW_VLAN_TX
if(unlikely(adapter->vlgrp &&


realloc_skb from [EMAIL PROTECTED]


--- linux-2.6.14.5/include/linux/skbuff.h.orig  2006-01-19 15:51:08.0 
+0100
+++ linux-2.6.14.5/include/linux/skbuff.h   2006-01-24 11:34:44.0 
+0100
@@ -303,6 +303,9 @@
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
   gfp_t priority, int fclone);
+extern struct sk_buff *realloc_skb(struct sk_buff *skb, unsigned int size, 
+  int priority);
+
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
--- linux-2.6.14.5/net/core/skbuff.c.orig   2006-01-19 15:50:37.0 
+0100
+++ linux-2.6.14.5/net/core/skbuff.c2006-01-24 11:57:34.0 +0100
@@ -276,6 +276,59 @@
}
 }
 
+/**
+ * realloc_skb -   reset skb for new packet.
+ * @size: size to allocate
+ * @gfp_mask: allocation mask
+ *
+ * Allocate a new &sk_buff. The returned buffer has no headroom and a
+ * tail room of size bytes. The object has a reference count of one.
+ * The return is the buffer. On a failure the return is %NULL.
+ *
+ * Buffers may only be allocated from interrupts using a @gfp_mask of
+ * %GFP_ATOMIC.
+ */
+
+struct sk_buff *realloc_skb(struct sk_buff* skb, unsigned int size, int 
gfp_mask)
+{
+   int truesize = skb->truesize;
+   u8 *data = skb->head;
+
+   memset(skb, 0, offsetof(struct sk_buff, truesize));
+
+   /* Get the DATA. Size must match skb_add_mtu(). */
+   size = SKB_DATA_ALIGN(size);
+   if ((size+sizeof(struct sk_buff)) > truesize) {
+   skb_release_data(skb);
+   data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+   if (data == NULL)
+   goto nodata;
+   }
+
+   /* XXX: does not include slab overhead */
+   skb->truesize = size + sizeof(struct sk_buff);
+
+   /* Load the data pointers. */
+   skb->head = data;
+   skb->data = data;
+   skb->tail = data;
+   skb->end = data + size;
+
+   /* Set up other state */
+   skb->len = 0;
+   skb->cloned = 0;
+   skb->data_len = 0;
+
+

[PATCH 00/00] pktgen: refinements and small fixes.

2006-01-23 Thread Robert Olsson

Luiz Fernando Capitulino writes:

 >  [PATCH 00/02] pktgen: Ports thread list to Kernel list implementation.
 >  [PATCH 00/03] pktgen: Fix kernel_thread() fail leak.
 >  [PATCH 00/04] pktgen: Fix Initialization fail leak. 
 > 

 >  But I'm sending these patches first, just to know if I'm doing something
 > wrong.

 Thanks!

 Looks interesting. Yes, of course it's better to use the kernel list functions. 
 I'll patch my lab version and run it through some tests. 
 
 Yes keep on hacking...

 Cheers.
--ro


LOW throughput with pktgen and e1000

2005-12-28 Thread Robert Olsson

Aritz Bastida writes:

 > I need to use pktgen for sending packets at very high speed to another
 > machine, in order to test it under heavy network traffic. All my
 > previous injection test were done with a dual Pentium III 800 MHz. As
 > I needed a more powerful machine I got a Pentium 4 but the results are
 > quite similar.

 Also make sure HW flow control from the receiver is not throttling the 
 sender. Also, adapters are different, and PCI bridges add latency.
 Bus latency and bus speed are the most important factors, so a PIII might
 very well be faster than a XEON/P4/Opteron. 

 > These results are similar to the ones I got with the Pentium III. I
 > can't reach even 400kpps. With another machine, a dual AMD Opteron, I
 > can send as fast as 650kpps but unfortunately that is the machine
 > being tested. In the pktgen paper says "A single flow of 1.48Mpps is
 > seen with a Xeon 2.67 GHz using a patched e1000 driver (64 byte
 > packets)" [1]. Well, I don't know how much faster the Intel Xeon is,
 > but I do have a fast machine, and I do have an e1000 NIC.
 
 The 1.488 Mpps was with several patches, one with HW TX prefetching, which
 was catastrophic in production. It worked only in simplex mode.

 Below is from our dual 2.67 GHz XEON w. ServerWorks chipset and Intel 82546GB.
 More or less vanilla 2.6.14.  

 One
  880039pps 422Mb/sec (422418720bps) errors: 0

 Two concurrent
  850419pps 408Mb/sec (408201120bps) errors: 0
  850421pps 408Mb/sec (408202080bps) errors: 0

 All with 64-byte pkts (60 in pktgen); you're using 100. 

 Cheers.
--ro


Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements

2005-12-19 Thread Robert Olsson

David S. Miller writes:

 > I think a new variant of netif_receive_skb() might be needed  or
 > maybe not.  I don't see a need for a new ->poll() for example.

 Yes, poll should be fine as-is; for netif_receive_skb we have to see.
 
 > On the other hand, nobody checks the return value from
 > netif_receive_skb().  This would have been a nice avenue
 > were it not for the SKB refcounting, but I think we could
 > work around that by making the driver grab the extra reference
 > count before it invokes netif_receive_skb().
 > 
 > I even think you are hinting at exactly this :)

 Yes, something like this, at least as a start. It could be the first
 part to play with. I'll leave TCP optimizations to you. :) 

 Cheers.
--ro
 


Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements

2005-12-14 Thread Robert Olsson

jamal writes:

 > Essentially the approach would be the same as Robert's old recycle patch
 > where he doesn't recycle certain skbs - the only difference being in the
 > case of forwarding, the  recycle is done asynchronously at EOT whereas
 > this is done  synchronously upon return from host path. 
 > The beauty of the approach is you don't have to worry about recycling on
 > the wrong CPU ;-> (which has been a big problem for forwarding path)

 > I have to chime in and say for the host stack - I like it ;->

 No, we don't solve any problems for forwarding, but as Dave pointed out
 we can do nice things. Instead of dropping the skb in case of failures or
 netfilter etc. we can reuse the skb, and if the skb is consumed within
 the RX softirq we can just return it to the driver.
 
 You did the feedback mechanism NET_RX_XX stuff six years ago.
 Now it can possibly be used :)

 A practical problem is how to maintain compatibility with the current 
 behavior, which defaults to NET_RX_SKB_CONSUMED.
 
 A new driver entry point? And can we increase skb->users to delay 
 skb destruction until the driver gets the indication back? 
 So the driver will do the final kfree, and not the protocol layers 
 as now?  This is to avoid massive code changes.

 Thoughts?

 Cheers
--ro


Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements

2005-12-12 Thread Robert Olsson

jamal writes:
 > 
 > Ok, this makes things more interesting 
 > What worked for a XEON doesnt work the same way for an opteron.
 > 
 > For me, the copybreak (in its capacity as adding extra cycles that make
 > the prefetch look good) made things look good. Also, #125 gave a best
 > answer. None of these were the case from Roberts results.
 > 
 > Robert, what about just #1? Maybe thats the best compromise that would
 > work for all.

I've tried that before with a flow test and got a contribution from #2:
 
0 prefetch 756 kpps 
1 prefetch 805 kpps (first)
2 prefetch 821 kpps (first two)
5 prefetch 803 kpps (all)


 > Also, I am really hoping that someone will test with older hardware
 > where i claim to have seen prefetch causing problems.

We'll give this up for now unless you or somebody else has some very good idea 
how to handle prefetching in a generic way.

I'll use #12 and you'll use #125 Intel uses #12345 etc

Or do we all benefit from #12? 

Cheers.
--ro 

