Re: [IPV4 0/9] TRIE performance patches
 Hello, finally got some time to test... Table with 214k routes, with full
 rDoS on two interfaces, on 2 x AMD64 processors, speed 2814.43 MHz.
 Profiled with CPU_CLK_UNHALTED and rtstat.

 Without the latest fib_trie patches. Throughput ~233 kpps:

samples  %        symbol name
109925   14.4513  fn_trie_lookup
109821   14.4376  ip_route_input
87245    11.4696  rt_intern_hash
31270     4.1109  kmem_cache_alloc
24159     3.1761  dev_queue_xmit
23200     3.0500  neigh_lookup
22464     2.9532  free_block
18412     2.4205  kmem_cache_free
17830     2.3440  dst_destroy
15740     2.0693  fib_get_table

 With the latest fib_trie patches (Stephen's and others). Same throughput
 ~233 kpps but we see a different profile. Why we don't get any better
 throughput is yet to be understood (the drops in qdisc could be the
 cause); more analysis is needed.

79242    14.3520  ip_route_input
65188    11.8066  fn_trie_lookup
64559    11.6927  rt_intern_hash
22901     4.1477  kmem_cache_alloc
21038     3.8103  check_leaf
16197     2.9335  neigh_lookup
14802     2.6809  free_block
14596     2.6436  ip_rcv_finish
12365     2.2395  fib_validate_source
12048     2.1821  dst_destroy

 fib_hash throughput ~177 kpps. Hard work for fib_hash here as we have
 many zones; it can be faster with fewer zones.

200568   37.8013  fn_hash_lookup
58352    10.9977  ip_route_input
44495     8.3860  rt_intern_hash
12873     2.4262  kmem_cache_alloc
12115     2.2833  rt_may_expire
11691     2.2034  rt_garbage_collect
10821     2.0394  dev_queue_xmit
          1.8845  fib_validate_source
8762      1.6514  fib_get_table
8558      1.6129  fib_semantic_match

 Cheers
						--ro
Re: [IPV4 0/9] TRIE performance patches
Stephen Hemminger writes:

 > Dumping by prefix is possible, but unless 32x slower. Dumping in
 > address order is just as logical. Like I said, I'm investigating what
 > quagga handles.

 How about taking a snapshot in address order (as you did) into some
 allocated memory, then returning entries from that memory in prefix
 order (a sketch follows below)? This would solve the -EBUSY problem too
 and give a consistent view of the routing table at the time of the
 dump/snapshot.

 Cheers
						--ro
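 A minimal userland-style sketch of that snapshot-and-sort idea (names
 like "struct dump_entry" and order_snapshot() are illustrative
 assumptions, not kernel or iproute2 API):

#include <stdlib.h>

/* Hypothetical snapshot entry: network prefix plus prefix length. */
struct dump_entry {
	unsigned int key;	/* prefix, host byte order */
	int plen;		/* prefix length */
};

/* Sort so longer (more specific) prefixes come out first, restoring
 * longest-prefix order from an address-ordered snapshot. */
static int by_prefix(const void *a, const void *b)
{
	const struct dump_entry *ea = a, *eb = b;

	if (ea->plen != eb->plen)
		return eb->plen - ea->plen;	/* longer prefix first */
	return (ea->key > eb->key) - (ea->key < eb->key);
}

/* snap[] was filled in address order while walking the trie;
 * afterwards it can be dumped in prefix order. */
static void order_snapshot(struct dump_entry *snap, size_t n)
{
	qsort(snap, n, sizeof(*snap), by_prefix);
}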
[IPV4 0/9] TRIE performance patches
Stephen Hemminger writes:

 > Time to handle a full BGP load (163K of routes).
 >
 > Before:	Load	Dump	Flush
 >
 > kmem_cache	3.8	13.0	7.2
 > iter		3.9	12.3	6.9
 > unordered	3.1	11.9	4.9
 > find_node	3.1	0.3	1.2

 I certainly like the speed, but what will we break when we don't return
 routes in longest-prefix order?

labb:/# ip r
default via 10.10.10.1 dev eth0
5.0.0.0/8 via 192.168.2.2 dev eth3
10.10.10.0/24 dev eth0  proto kernel  scope link  src 10.10.10.2
10.10.11.0/24 dev eth1  proto kernel  scope link  src 10.10.11.1
11.0.0.0/8 via 10.10.11.2 dev eth1
192.168.1.0/24 dev eth2  proto kernel  scope link  src 192.168.1.2
192.168.2.0/24 dev eth3  proto kernel  scope link  src 192.168.2.1
labb:/# ip route list match 10.10.10.1
default via 10.10.10.1 dev eth0
10.10.10.0/24 dev eth0  proto kernel  scope link  src 10.10.10.2
labb:/#

 Maybe the unordered dump can be ordered cheaply...

 Cheers.
						--ro
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller writes:

 > Yes, this semaphore thing is highly problematic. In the most crucial
 > areas where network driver consistency matters the most for ease of
 > understanding and debugging, the Intel drivers choose to be different
 > :-(
 >
 > The way the napi_disable() logic breaks out from high packet load in
 > net_rx_action() is it simply returns, even leaving interrupts disabled,
 > when a napi_disable() is pending.
 >
 > This is what trips up the semaphore logic.
 >
 > Robert, give this patch a try.

 Yes, it works. e1000 tested for ~3 hours with very high load and the
 interface taken up/down every 5th second. Without the patch the irq's
 get disabled within a couple of seconds. A resolute way of handling the
 semaphores. :)

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

 Cheers
						--ro

 > In the long term this semaphore should be completely eliminated,
 > there is no justification for it.
 >
 > Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
 >
 > diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
 > index 0c9a6f7..76c0fa6 100644
 > --- a/drivers/net/e1000/e1000_main.c
 > +++ b/drivers/net/e1000/e1000_main.c
 > @@ -632,6 +632,7 @@ e1000_down(struct e1000_adapter *adapter)
 >
 >  #ifdef CONFIG_E1000_NAPI
 >  	napi_disable(&adapter->napi);
 > +	atomic_set(&adapter->irq_sem, 0);
 >  #endif
 >  	e1000_irq_disable(adapter);
 >
 > diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
 > index 2ab3bfb..9cc5a6b 100644
 > --- a/drivers/net/e1000e/netdev.c
 > +++ b/drivers/net/e1000e/netdev.c
 > @@ -2183,6 +2183,7 @@ void e1000e_down(struct e1000_adapter *adapter)
 >  	msleep(10);
 >
 >  	napi_disable(&adapter->napi);
 > +	atomic_set(&adapter->irq_sem, 0);
 >  	e1000_irq_disable(adapter);
 >
 >  	del_timer_sync(&adapter->watchdog_timer);
 > diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
 > index d2fb88d..4f63839 100644
 > --- a/drivers/net/ixgb/ixgb_main.c
 > +++ b/drivers/net/ixgb/ixgb_main.c
 > @@ -296,6 +296,11 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
 >  {
 >  	struct net_device *netdev = adapter->netdev;
 >
 > +#ifdef CONFIG_IXGB_NAPI
 > +	napi_disable(&adapter->napi);
 > +	atomic_set(&adapter->irq_sem, 0);
 > +#endif
 > +
 >  	ixgb_irq_disable(adapter);
 >  	free_irq(adapter->pdev->irq, netdev);
 >
 > @@ -304,9 +309,7 @@ ixgb_down(struct ixgb_adapter *adapter, boolean_t kill_watchdog)
 >
 >  	if(kill_watchdog)
 >  		del_timer_sync(&adapter->watchdog_timer);
 > -#ifdef CONFIG_IXGB_NAPI
 > -	napi_disable(&adapter->napi);
 > -#endif
 > +
 >  	adapter->link_speed = 0;
 >  	adapter->link_duplex = 0;
 >  	netif_carrier_off(netdev);
 > diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
 > index de3f45e..a4265bc 100644
 > --- a/drivers/net/ixgbe/ixgbe_main.c
 > +++ b/drivers/net/ixgbe/ixgbe_main.c
 > @@ -1409,9 +1409,11 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 >  	IXGBE_WRITE_FLUSH(&adapter->hw);
 >  	msleep(10);
 >
 > +	napi_disable(&adapter->napi);
 > +	atomic_set(&adapter->irq_sem, 0);
 > +
 >  	ixgbe_irq_disable(adapter);
 >
 > -	napi_disable(&adapter->napi);
 >  	del_timer_sync(&adapter->watchdog_timer);
 >
 >  	netif_carrier_off(netdev);
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller writes:

 > > eth0 e1000_irq_enable sem = 1     <- ifconfig eth0 down
 > > eth0 e1000_irq_disable sem = 2
 > >
 > > **e1000_open                      <- ifconfig eth0 up
 > > eth0 e1000_irq_disable sem = 3    Dead. irq's can't be enabled
 > > e1000_irq_enable miss
 > > eth0 e1000_irq_enable sem = 2
 > > e1000_irq_enable miss
 > > eth0 e1000_irq_enable sem = 1
 > > ADDRCONF(NETDEV_UP): eth0: link is not ready
 >
 > Yes, this semaphore thing is highly problematic. In the most crucial
 > areas where network driver consistency matters the most for ease of
 > understanding and debugging, the Intel drivers choose to be different

 I don't understand the idea with a semaphore for enabling/disabling
 irq's either; the overall logic must be safer/better without it.

 > The way the napi_disable() logic breaks out from high packet load in
 > net_rx_action() is it simply returns, even leaving interrupts disabled,
 > when a napi_disable() is pending.
 >
 > This is what trips up the semaphore logic.
 >
 > Robert, give this patch a try.
 >
 > In the long term this semaphore should be completely eliminated,
 > there is no justification for it.

 It's on the testing list...

 Cheers
						--ro
Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
David Miller writes:

 > > On Wednesday 16 January 2008, David Miller wrote:
 > > > Ok, here is the patch I'll propose to fix this. The goal is to make
 > > > it as simple as possible without regressing the thing we were trying
 > > > to fix.
 > >
 > > Looks good to me. Tested with -rc8.
 >
 > Thanks for testing.

 Yes, that code looks nice. I'm using the patch, but I've noticed another
 phenomenon with the current e1000 driver. There is a race when taking a
 device down at high traffic loads. I've tracked and instrumented it, and
 it seems like occasionally irq_sem can get bumped up so interrupts can't
 be enabled again.

eth0 e1000_irq_enable sem = 1     <- High netload
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1     <- ifconfig eth0 down
eth0 e1000_irq_disable sem = 2

**e1000_open                      <- ifconfig eth0 up
eth0 e1000_irq_disable sem = 3    Dead. irq's can't be enabled
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 2
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 1
ADDRCONF(NETDEV_UP): eth0: link is not ready

 Cheers
						--ro

static void
e1000_irq_disable(struct e1000_adapter *adapter)
{
	atomic_inc(&adapter->irq_sem);
	E1000_WRITE_REG(&adapter->hw, IMC, ~0);
	E1000_WRITE_FLUSH(&adapter->hw);
	synchronize_irq(adapter->pdev->irq);

	if(adapter->netdev->ifindex == 3)
		printk("%s e1000_irq_disable sem = %d\n",
		       adapter->netdev->name, atomic_read(&adapter->irq_sem));
}

static void
e1000_irq_enable(struct e1000_adapter *adapter)
{
	if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
		E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
		E1000_WRITE_FLUSH(&adapter->hw);
	} else
		printk("e1000_irq_enable miss\n");

	if(adapter->netdev->ifindex == 3)
		printk("%s e1000_irq_enable sem = %d\n",
		       adapter->netdev->name, atomic_read(&adapter->irq_sem));
}
Re: [RFC 6/6] fib_trie: combine leaf and info
Stephen Hemminger writes:

 > This is how I did it:

 Yes, looks like an elegant solution. Did you get to test it yet? Maybe
 we would see some effects in just dumping a full table? Anyway, lookup
 should be tested in some way. We can do a lot of analyzing before
 getting to the right entry: local_table backtracking, main lookup with
 eventual backtracking etc. So hopefully we get paid for this work.

 Also, it might be an idea to do some analysis of the fib_aliases list.
 Maybe the trick can be done again? ;)

 Cheers
						--ro

 > --- a/net/ipv4/fib_trie.c	2008-01-15 09:14:53.0 -0800
 > +++ b/net/ipv4/fib_trie.c	2008-01-15 09:21:48.0 -0800
 > @@ -101,13 +101,6 @@ struct node {
 >  	t_key key;
 >  };
 >
 > -struct leaf {
 > -	unsigned long parent;
 > -	t_key key;
 > -	struct hlist_head list;
 > -	struct rcu_head rcu;
 > -};
 > -
 >  struct leaf_info {
 >  	struct hlist_node hlist;
 >  	struct rcu_head rcu;
 > @@ -115,6 +108,13 @@ struct leaf_info {
 >  	struct list_head falh;
 >  };
 >
 > +struct leaf {
 > +	unsigned long parent;
 > +	t_key key;
 > +	struct hlist_head list;
 > +	struct rcu_head rcu;
 > +};
 > +
 >  struct tnode {
 >  	unsigned long parent;
 >  	t_key key;
 > @@ -321,16 +321,6 @@ static void __leaf_free_rcu(struct rcu_h
 >  	kmem_cache_free(trie_leaf_kmem, leaf);
 >  }
 >
 > -static void __leaf_info_free_rcu(struct rcu_head *head)
 > -{
 > -	kfree(container_of(head, struct leaf_info, rcu));
 > -}
 > -
 > -static inline void free_leaf_info(struct leaf_info *leaf)
 > -{
 > -	call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 > -}
 > -
 >  static struct tnode *tnode_alloc(size_t size)
 >  {
 >  	struct page *pages;
 > @@ -357,7 +347,7 @@ static void __tnode_free_rcu(struct rcu_
 >  	free_pages((unsigned long)tn, get_order(size));
 >  }
 >
 > -static inline void tnode_free(struct tnode *tn)
 > +static void tnode_free(struct tnode *tn)
 >  {
 >  	if (IS_LEAF(tn)) {
 >  		struct leaf *l = (struct leaf *) tn;
 > @@ -376,16 +366,41 @@ static struct leaf *leaf_new(void)
 >  	return l;
 >  }
 >
 > +static void leaf_info_init(struct leaf_info *li, int plen)
 > +{
 > +	li->plen = plen;
 > +	INIT_LIST_HEAD(&li->falh);
 > +}
 > +
 > +static struct leaf_info *leaf_info_first(struct leaf *l, int plen)
 > +{
 > +	struct leaf_info *li = (struct leaf_info *) (l + 1);
 > +	leaf_info_init(li, plen);
 > +	return li;
 > +}
 > +
 >  static struct leaf_info *leaf_info_new(int plen)
 >  {
 >  	struct leaf_info *li = kmalloc(sizeof(struct leaf_info), GFP_KERNEL);
 > -	if (li) {
 > -		li->plen = plen;
 > -		INIT_LIST_HEAD(&li->falh);
 > -	}
 > +	if (li)
 > +		leaf_info_init(li, plen);
 > +
 >  	return li;
 >  }
 >
 > +static void __leaf_info_free_rcu(struct rcu_head *head)
 > +{
 > +	kfree(container_of(head, struct leaf_info, rcu));
 > +}
 > +
 > +static inline void free_leaf_info(struct leaf *l, struct leaf_info *leaf)
 > +{
 > +	if (leaf == (struct leaf_info *)(l + 1))
 > +		return;
 > +
 > +	call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 > +}
 > +
 >  static struct tnode* tnode_new(t_key key, int pos, int bits)
 >  {
 >  	size_t sz = sizeof(struct tnode) + (sizeof(struct node *) << bits);
 > @@ -1047,18 +1062,13 @@ static struct list_head *fib_insert_node
 >  		insert_leaf_info(&l->list, li);
 >  		goto done;
 >  	}
 > -	l = leaf_new();
 >
 > +	l = leaf_new();
 >  	if (!l)
 >  		return NULL;
 >
 >  	l->key = key;
 > -	li = leaf_info_new(plen);
 > -
 > -	if (!li) {
 > -		tnode_free((struct tnode *) l);
 > -		return NULL;
 > -	}
 > +	li = leaf_info_first(l, plen);
 >
 >  	fa_head = &li->falh;
 >  	insert_leaf_info(&l->list, li);
 > @@ -1091,7 +1101,7 @@ static struct list_head *fib_insert_node
 >  	}
 >
 >  	if (!tn) {
 > -		free_leaf_info(li);
 > +		free_leaf_info(l, li);
 >  		tnode_free((struct tnode *) l);
 >  		return NULL;
 >  	}
 > @@ -1624,7 +1634,7 @@ static int fn_trie_delete(struct fib_tab
 >
 >  	if (list_empty(fa_head)) {
 >  		hlist_del_rcu(&li->hlist);
 > -		free_leaf_info(li);
 > +		free_leaf_info(l, li);
 >  	}
 >
 >  	if (hlist_empty(&l->list))
 > @@ -1668,7 +1678,7 @@ static int trie_flush_leaf(struct trie *
 >
 >  		if (list_empty(&li->falh)) {
 >  			hlist_del_rcu(&li->hlist);
 > -			free_leaf_info(li);
 > +			free_leaf_info(l, li);
 >  		}
 >  	}
 >  	return found;
 > @@ -1935,7 +1945,8 @@ void __init fib_hash_init(void)
 >  	fn_alias_kmem = kmem_cache_create("ip_fib_alias", sizeof(struct fib_alias),
 >  					  0, SLAB_PANIC, NULL);
 >
 > -
Re: [RFC 6/6] fib_trie: combine leaf and info
Eric Dumazet writes:

 > So you think that a leaf cannot have 2 infos, one 'embedded' and one
 > in the list ?

 Hello,

 The model I thought of is to have either:

 1) One leaf_info embedded in the leaf: a fast-path leaf (FP-leaf).

 Or

 2) The intact old leaf_info list with an arbitrary number of leaf_infos.

 If plen_iinfo is >= 0, it's a FP-leaf.

 Cheers.
						--ro
Re: [RFC 6/6] fib_trie: combine leaf and info
Stephen Hemminger writes:

 > Okay, I would rather see the leaf_info explicit inside the leaf, also
 > your scheme probably breaks if I add two prefixes and then delete the
 > first. Let me have a go at it.

 I took Eric's patch a bit further... Support for delete and dump is
 needed before any testing at all, and maybe some macro for checking and
 setting FP-leafs.

 Cheers.
						--ro

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 6dab753..f5b276c 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -97,22 +97,32 @@ typedef unsigned int t_key;
 #define IS_LEAF(n) (n->parent & T_LEAF)
 
 struct node {
-	unsigned long parent;
-	t_key key;
+	unsigned long		parent;
+	t_key			key;
 };
 
 struct leaf {
-	unsigned long parent;
-	t_key key;
-	struct hlist_head list;
-	struct rcu_head rcu;
+	unsigned long		parent;
+	t_key			key;
+	/*
+	 * Because we often have only one info per leaf, we embed one here
+	 * to save some space and speed up lookups (sharing a cache line):
+	 * a sort of fast-path leaf (FP-leaf), indicated by a non-negative
+	 * plen_iinfo.
+	 * Note : not inside a structure so that we dont have holes on 64bit
+	 */
+	int			plen_iinfo;
+	struct list_head	falh_iinfo;
+
+	struct hlist_head	list;
+	struct rcu_head		rcu;
 };
 
 struct leaf_info {
-	struct hlist_node hlist;
-	struct rcu_head rcu;
-	int plen;
-	struct list_head falh;
+	struct hlist_node	hlist;
+	int			plen;
+	struct list_head	falh;
+	struct rcu_head		rcu;
 };
 
 struct tnode {
@@ -364,11 +374,13 @@ static inline void tnode_free(struct tnode *tn)
 	call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
-static struct leaf *leaf_new(void)
+static struct leaf *leaf_new(int plen)
 {
 	struct leaf *l = kmalloc(sizeof(struct leaf), GFP_KERNEL);
 	if (l) {
 		l->parent = T_LEAF;
+		l->plen_iinfo = plen;
+		INIT_LIST_HEAD(&l->falh_iinfo);
 		INIT_HLIST_HEAD(&l->list);
 	}
 	return l;
@@ -890,7 +902,16 @@ static struct leaf_info *find_leaf_info(struct leaf *l, int plen)
 static inline struct list_head *
 get_fa_head(struct leaf *l, int plen)
 {
-	struct leaf_info *li = find_leaf_info(l, plen);
+	struct leaf_info *li;
+
+	if (l->plen_iinfo >= 0) {
+		if(l->plen_iinfo == plen)
+			return &l->falh_iinfo;
+		else
+			return NULL;
+	}
+
+	li = find_leaf_info(l, plen);
 
 	if (!li)
 		return NULL;
@@ -1036,6 +1057,21 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen)
 	if (n != NULL && IS_LEAF(n) && tkey_equals(key, n->key)) {
 		l = (struct leaf *) n;
 
+		/*
+		 * Check if it is a FP-leaf; if so we have to convert
+		 * it to an ordinary leaf with leaf_infos
+		 */
+
+		if( l->plen_iinfo >= 0 ) {
+			li = leaf_info_new(l->plen_iinfo);
+			if (!li)
+				return NULL;
+
+			insert_leaf_info(&l->list, li);
+			l->plen_iinfo = -1;
+		}
+
 		li = leaf_info_new(plen);
 
 		if (!li)
@@ -1045,21 +1081,16 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen)
 		insert_leaf_info(&l->list, li);
 		goto done;
 	}
-	l = leaf_new();
+
+	/* Insert a FP-leaf */
+	l = leaf_new(plen);
 
 	if (!l)
 		return NULL;
 
 	l->key = key;
-	li = leaf_info_new(plen);
-
-	if (!li) {
-		tnode_free((struct tnode *) l);
-		return NULL;
-	}
+	fa_head = &l->falh_iinfo;
 
-	fa_head = &li->falh;
-	insert_leaf_info(&l->list, li);
 
 	if (t->trie && n == NULL) {
 		/* Case 2: n is NULL, and will just insert a new leaf */
@@ -1089,7 +1120,6 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen)
 	}
 
 	if (!tn) {
-		free_leaf_info(li);
 		tnode_free((struct tnode *) l);
 		return NULL;
 	}
@@ -1285,6 +1315,20 @@ static inline int check_leaf(struct trie *t, struct leaf *l,
 	struct hlist_head *hhead = &l->list;
 	struct hlist_node *node;
 
+	i = l->plen_iinfo;
+	if( i >= 0) { /* FP-leaf */
+		mask = inet_make_mask(i);
+		err = fib_semantic_match(&l->falh_iinfo, flp, res, htonl(l->key), mask, i);
+		if(err <= 0 )
Re: [PATCH 9/9] fix sparse warnings
Eric Dumazet writes:

 > > Thats 231173/241649 = 96% with the current Internet routing.
 > >
 > > How about if we would have a fastpath and store one entry directly
 > > in the leaf struct, to avoid loading the leaf_info list in most
 > > cases?
 > >
 > > One could believe that both lookup and dump could improve.
 >
 > You mean to include one "leaf_info" inside the leaf structure, so that
 > we can access it without a cache line miss ?

 Yes.

 Cheers
						--ro
Re: [PATCH 2/9] get rid of unused revision element
David Miller writes:

 > > The revision element must have been part of an earlier design,
 > > because currently it is set but never used.
 > >
 > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
 >
 > I suspect Robert wanted to play around with some generation
 > ID optimizations but never got around to it.

 The idea was to have a selective flush of route cache entries when a
 fib insert/delete happened. From what I remember, you added another,
 better solution: just a list of route cache entries pointing to their
 parent route. So yes, this was obsoleted by your/our effort to avoid
 total flushing of the route cache. Unfinished work.

 According to http://bgpupdates.potaroo.net/instability/bgpupd.html
 (last in page) we currently flush the route cache 2.80 times per second
 when using full Internet routing with Linux. Maybe we're forced to pick
 up this thread again someday.

 Cheers.
						--ro
Re: [PATCH 9/9] fix sparse warnings
 Thanks for hacking on and improving the trie... Here is another idea
 that could also be tested. If we look into the routing table we see
 that most leaves only have one prefix:

Main:
	Aver depth:	2.57
	Max depth:	7
	Leaves:		231173

ip route | wc -l
241649

 That's 231173/241649 = 96% with the current Internet routing.

 How about having a fastpath and storing one entry directly in the leaf
 struct, to avoid loading the leaf_info list in most cases? (A sketch of
 the idea is at the end of this message.)

 One could believe that both lookup and dump could improve.

 Cheers.
						--ro

Stephen Hemminger writes:

 > Remember that the code should be optimized for lookup, not management
 > operations. We ran into this during testing (the test suite was looking
 > for number of routes), thats why I put in the size field.
 >
 > The existing dump code is really slow:
 >
 > 1) FIB_TRIE Under KVM:
 >    load 164393 routes	12.436 sec
 >    ip route | wc -l	12.569 sec
 >    grep /proc/net/route	25.357 sec
 >
 > 99% of the cpu time is spent in nextleaf() during these dump operations.
 >
 > 2) FIB_HASH Under KVM:
 >    load 164393 routes	10.833 sec
 >    ip route | wc -l	1.981 sec
 >    grep /proc/net/route	0.204 sec
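 A minimal sketch of the embedded fast-path idea; the field names
 (plen_iinfo, falh_iinfo) mirror those tried in the "[RFC 6/6] fib_trie:
 combine leaf and info" patches elsewhere in this archive and are
 illustrative, not merged kernel code:

/* Sketch only: embed the common single-prefix info in the leaf so a
 * lookup normally touches one cache line and skips the hlist walk. */
struct leaf {
	unsigned long parent;
	t_key key;
	int plen_iinfo;			/* >= 0 means fast-path (FP) leaf */
	struct list_head falh_iinfo;	/* fib_alias list for the embedded info */
	struct hlist_head list;		/* fallback: ordinary leaf_info list */
	struct rcu_head rcu;
};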
[FIB]: full_children & empty_children should be uint, not ushort
Eric Dumazet writes:

 > Eric Dumazet a écrit :
 > > 4) full_children & empty_children being 'unsigned short',
 > >    we probably are limited to 2^15 elements, but I could not
 > >    find this limit enforced somewhere.
 >
 > Two fixes are possible : Enlarge full_children & empty_children to
 > 32bits, or force a limit in code to never exceed 2^15 children in a
 > tnode. I chose the first solution since it can be done with 0 memory
 > cost on 64bit arches.

 Hello,

 Thanks for spotting this. No, we don't want to put limits on the (root)
 node size. You see, the comment in the code is correct, so the unsigned
 shorts are some leftover from old testing which could have hit us hard
 as the routing table slowly grows.

 Cheers
						--ro

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

 > [FIB]: full_children & empty_children should be uint, not ushort
 >
 > If declared as unsigned short, these fields can overflow, and the whole
 > trie logic is broken. I could not make the machine crash, but some
 > tnodes can never be freed.
 >
 > Note for 64 bit arches : By reordering t_key and parent in [node, leaf,
 > tnode] structures, we can use the 32 bit hole after t_key so that
 > sizeof(struct tnode) doesnt change after this patch.
 >
 > Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
 >
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index f26ba31..9696722 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -97,13 +97,13 @@ typedef unsigned int t_key;
 >  #define IS_LEAF(n) (n->parent & T_LEAF)
 >
 >  struct node {
 > -	t_key key;
 >  	unsigned long parent;
 > +	t_key key;
 >  };
 >
 >  struct leaf {
 > -	t_key key;
 >  	unsigned long parent;
 > +	t_key key;
 >  	struct hlist_head list;
 >  	struct rcu_head rcu;
 >  };
 > @@ -116,12 +116,12 @@ struct leaf_info {
 >  };
 >
 >  struct tnode {
 > -	t_key key;
 >  	unsigned long parent;
 > +	t_key key;
 >  	unsigned char pos;		/* 2log(KEYLENGTH) bits needed */
 >  	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
 > -	unsigned short full_children;	/* KEYLENGTH bits needed */
 > -	unsigned short empty_children;	/* KEYLENGTH bits needed */
 > +	unsigned int full_children;	/* KEYLENGTH bits needed */
 > +	unsigned int empty_children;	/* KEYLENGTH bits needed */
 >  	struct rcu_head rcu;
 >  	struct node *child[0];
 >  };
 > @@ -329,12 +329,12 @@ static inline void free_leaf_info(struct leaf_info *leaf)
 >  	call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 >  }
 >
 > -static struct tnode *tnode_alloc(unsigned int size)
 > +static struct tnode *tnode_alloc(size_t size)
 >  {
 >  	struct page *pages;
 >
 >  	if (size <= PAGE_SIZE)
 > -		return kcalloc(size, 1, GFP_KERNEL);
 > +		return kzalloc(size, GFP_KERNEL);
 >
 >  	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
 >  	if (!pages)
 > @@ -346,8 +346,8 @@ static struct tnode *tnode_alloc(unsigned int size)
 >  static void __tnode_free_rcu(struct rcu_head *head)
 >  {
 >  	struct tnode *tn = container_of(head, struct tnode, rcu);
 > -	unsigned int size = sizeof(struct tnode) +
 > -			    (1 << tn->bits) * sizeof(struct node *);
 > +	size_t size = sizeof(struct tnode) +
 > +		      (sizeof(struct node *) << tn->bits);
 >
 >  	if (size <= PAGE_SIZE)
 >  		kfree(tn);
 > @@ -386,8 +386,7 @@ static struct leaf_info *leaf_info_new(int plen)
 >
 >  static struct tnode* tnode_new(t_key key, int pos, int bits)
 >  {
 > -	int nchildren = 1<<bits;
 > -	int sz = sizeof(struct tnode) + nchildren * sizeof(struct node *);
 > +	size_t sz = sizeof(struct tnode) + (sizeof(struct node *) << bits);
 >  	struct tnode *tn = tnode_alloc(sz);
 >
 >  	if (tn) {
 > @@ -399,8 +398,8 @@ static struct tnode* tnode_new(t_key key, int pos, int bits)
 >  		tn->empty_children = 1<<bits;
 >  	}
 >
 > -	pr_debug("AT %p s=%u %u\n", tn, (unsigned int) sizeof(struct tnode),
 > -		 (unsigned int) (sizeof(struct node) * 1<<bits));
 > +	pr_debug("AT %p s=%u %lu\n", tn, (unsigned int) sizeof(struct tnode),
 > +		 (unsigned long) (sizeof(struct node) << bits));
 >  	return tn;
 >  }
Re: [RFC] net: napi fix
David Miller writes:

 > > Is the netif_running() check even required?
 >
 > No, it is not.
 >
 > When a device is brought down, one of the first things
 > that happens is that we wait for all pending NAPI polls
 > to complete, then block any new polls from starting.

 Hello!

 Yes, but the reason for the check was to NOT wait for all pending polls
 to complete, so a server/router could be rebooted even under high load
 and DoS. We've experienced some nasty problems with this.

 Cheers.
						--ro
Re: net_rx_action/NAPI oops [PATCH]
 Hello!

 After further investigation, the bug was just in front of us...
 ifconfig down, in combination with the test for || !netif_running(),
 can return a full quota with netif_rx_complete() already done, which
 causes the oops in net_rx_action. Of course the load must be high
 enough to fill the quota. I've found a bunch of drivers having this
 bug.

 Cheers.
						--ro

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index cf39473..f4137ad 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3947,6 +3947,10 @@ e1000_clean(struct napi_struct *napi, int budget)
 quit_polling:
 		if (likely(adapter->itr_setting & 3))
 			e1000_set_itr(adapter);
+
+		if(work_done == budget)
+			work_done--;
+
 		netif_rx_complete(poll_dev, napi);
 		e1000_irq_enable(adapter);
 	}
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4fd2e23..e43b5ca 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1408,6 +1408,9 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	if ((!tx_cleaned && (work_done < budget)) ||
 	    !netif_running(poll_dev)) {
 quit_polling:
+		if(work_done == budget)
+			work_done--;
+
 		if (adapter->itr_setting & 3)
 			e1000_set_itr(adapter);
 		netif_rx_complete(poll_dev, napi);
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 3021234..e3064ef 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1783,6 +1783,10 @@ ixgb_clean(struct napi_struct *napi, int budget)
 
 	/* if no Tx and not enough Rx work done, exit the polling mode */
 	if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
+
+		if(work_done == budget)
+			work_done--;
+
 		netif_rx_complete(netdev, napi);
 		ixgb_irq_enable(adapter);
 	}
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 00bc525..204f5fa 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -579,6 +579,9 @@ static int ixgbe_clean_rxonly(struct napi_struct *napi, int budget)
 	/* If no Tx and not enough Rx work done, exit the polling mode */
 	if ((work_done < budget) || !netif_running(netdev)) {
 quit_polling:
+		if(work_done == budget)
+			work_done--;
+
 		netif_rx_complete(netdev, napi);
 		if (!test_bit(__IXGBE_DOWN, &adapter->state))
 			IXGBE_WRITE_REG(&adapter->hw, IXGBE_EIMS,
@@ -1483,6 +1486,9 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
 	if ((!tx_cleaned && (work_done < budget)) ||
 	    !netif_running(adapter->netdev)) {
 quit_polling:
+		if(work_done == budget)
+			work_done--;
+
 		netif_rx_complete(netdev, napi);
 		ixgbe_irq_enable(adapter);
 	}
diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 3dbaec6..c566491 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1998,6 +1998,10 @@ static int e100_poll(struct napi_struct *napi, int budget)
 
 	/* If no Rx and Tx cleanup work was done, exit polling mode. */
 	if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
+
+		if(work_done == budget)
+			work_done--;
+
 		netif_rx_complete(netdev, napi);
 		e100_enable_irq(nic);
 	}
Re: net_rx_action/NAPI oops [PATCH]
 No, it doesn't. Besides, napi_disable and napi_synchronize are
 identical here; I was trying to disarm interrupts this way too. The
 patch I sent yesterday is the only cure so far, but I don't know if
 it's 100% bullet proof either. I was stress-testing that patch but ran
 into new problems... (scheduling)

 Cheers.
						--ro

Stephen Hemminger writes:

 > Would this fix it?
 >
 > --- a/drivers/net/e1000/e1000_main.c	2007-11-15 21:13:12.0 -0800
 > +++ b/drivers/net/e1000/e1000_main.c	2007-11-28 08:37:03.0 -0800
 > @@ -630,10 +630,10 @@ e1000_down(struct e1000_adapter
 >  	 * reschedule our watchdog timer */
 >  	set_bit(__E1000_DOWN, &adapter->flags);
 >
 > +	e1000_irq_disable(adapter);
 >  #ifdef CONFIG_E1000_NAPI
 > -	napi_disable(&adapter->napi);
 > +	napi_synchronize(&adapter->napi);
 >  #endif
 > -	e1000_irq_disable(adapter);
 >
 >  	del_timer_sync(&adapter->tx_fifo_stall_timer);
 >  	del_timer_sync(&adapter->watchdog_timer);
Re: net_rx_action/NAPI oops [PATCH]
Stephen Hemminger writes:

 > It is considered a driver bug in 2.6.24 to call netif_rx_complete
 > (clear NAPI_STATE_SCHED) and do a full quota. That bug already had to
 > be fixed in other drivers; looks like e1000 has the same problem.

 From what I see the problem is not related to ->poll. But it's a
 variant of the same theme: here it's napi_disable in _down, via
 ifconfig down, that clears NAPI_STATE_SCHED while the driver delivers
 a full quota.

 Cheers
						--ro
Re: net_rx_action/NAPI oops [PATCH]
Kok, Auke writes:

 > Robert, please give that patch a try (it fixes a crash that I had here
 > as well) and let us know if it works for you.

 No, it doesn't cure the problem I've reported.

 Cheers.
						--ro

 BTW. You can try to verify the problem... Let pktgen feed a couple of
 streams into some interfaces (I use two), then ifconfig down one of the
 incoming interfaces.
net_rx_action/NAPI oops [PATCH]
 Hello!

 I've discovered a bug while testing the new multiQ NAPI code. In
 high-load situations when we take down an interface we get a kernel
 panic. The oops is below.

 From what I see this happens when the driver does napi_disable() and
 clears NAPI_STATE_SCHED. In net_rx_action there is a check for
 work == weight, a sort of indirect test, but that's now not enough to
 cover the load situation where we have NAPI_STATE_SCHED cleared (by
 e1000_down in my case) and still a full quota.

 Latest git, but I'll guess it is the same in all later kernels. There
 might be different solutions... one variant is below:

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/core/dev.c b/net/core/dev.c
index 043e2f8..1031233 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2207,7 +2207,7 @@ static void net_rx_action(struct softirq_action *h)
 		 * still "owns" the NAPI instance and therefore can
 		 * move the instance around on the list at-will.
 		 */
-		if (unlikely(work == weight))
+		if (unlikely(work == weight) && (test_bit(NAPI_STATE_SCHED, &n->state)))
 			list_move_tail(&n->poll_list, list);
 
 		netpoll_poll_unlock(have);

 Cheers
						--ro

labb:/# ifconfig eth0 down
BUG: unable to handle kernel paging request at virtual address 00100104
 printing eip:
c0433d67
*pde =
Oops: 0002 [#1] SMP
Modules linked in:
Pid: 4, comm: ksoftirqd/0 Not tainted (2.6.24-rc3bifrost-gb3664d45-dirty #32)
EIP: 0060:[] EFLAGS: 00010046 CPU: 0
EIP is at net_rx_action+0x107/0x120
EAX: 00100100 EBX: f757d4e0 ECX: c200d334 EDX: 00200200 ESI: 0040 EDI: c200d334
EBP: 00ec ESP: f7c6bf78
 DS: 007b ES: 007b FS: 00d8 GS: SS: 0068
Process ksoftirqd/0 (pid: 4, ti=f7c6a000 task=f7c58ab0 task.ti=f7c6a000)
Stack: c0236217 c200ce9c c200ce9c fffcf892 0040 0005 c05b2a98 c0603e60
       0008 c022a275 c06066c0 c06066c0 0246 c022a5e0 c022a327 c06066c0
       c022a636 fffc c02384f2
Call Trace:
 [] __rcu_process_callbacks+0x107/0x190
 [] __do_softirq+0x75/0xf0
 [] ksoftirqd+0x0/0xd0
 [] do_softirq+0x37/0x40
 [] ksoftirqd+0x56/0xd0
 [] kthread+0x42/0x70
 [] kthread+0x0/0x70
 [] kernel_thread_helper+0x7/0x18
 =======================
Code: 88 8c 52 c0 e8 4b 1d df ff e8 96 0c dd ff c7 05 64 7d 63 c0 01 00 00 00 e9 61 ff ff ff 8d b4 26 00 00 00 00 8b 03 8b 53 04 89 f9 <89> 50 04 89 02 89 d8 8b 57 04 e8 5a 34 eb ff e9 4a ff ff ff 90
EIP: [] net_rx_action+0x107/0x120 SS:ESP 0068:f7c6bf78
Kernel panic - not syncing: Fatal exception in interrupt
Re: pktgen question
 Hi,

Steve Wise writes:

 > I think pktgen should be cloning the skbs using skb_clone(). Then it
 > will work for all devices, eh?

 pktgen assumes exclusive ownership of the skb for "fastpath" sending,
 and does an skb_get to avoid the final skb destruction so the same skb
 can be sent over and over (a sketch of the trick is at the end of this
 message). The idea is to avoid memory allocation and keep things in
 cache to give very high packet rates with identical packets. But if you
 need to alter the packet, then the skb_get trick can't be done and you
 have to turn off the "fastpath" with clone_skb.

 > Perf-wise, you could clone the skbs up front, then deliver them to the
 > nic in a tight loop. This would mitigate the added overhead introduced
 > by calling skb_clone() in the loop doing transmits...

 Sure, it can be done. It could replay sequences etc, but it will not
 beat the skb_get trick for sending identical packets. It has been
 proposed before, but I've avoided such efforts to keep things
 relatively small and simple. Really, pktgen should be reworked to have
 a small shim in kernel and move the rest of the stuff to userland.

 Cheers.
						--ro
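 A minimal sketch of that refcounting trick (simplified, not the literal
 pktgen code; fill_packet() stands in for pktgen's packet builder, and
 error handling is omitted):

/* Build the packet once, then resend the same skb over and over.
 * skb_get() bumps skb->users, so the driver's eventual kfree_skb()
 * only drops a reference instead of freeing the buffer. */
struct sk_buff *skb = fill_packet(odev, pkt_dev);	/* hypothetical builder */

while (pkt_dev->running) {
	skb_get(skb);				/* one extra ref per send */
	if (odev->hard_start_xmit(skb, odev) != NETDEV_TX_OK)
		kfree_skb(skb);			/* xmit failed: drop our ref */
}
kfree_skb(skb);					/* final put actually frees it */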
Re: [PATCH] [-MM, FIX V3] e1000e: incorporate napi_struct changes from net-2.6.24.git
Kok, Auke writes:

 > david,
 >
 > while testing this patch I noticed that the poll routine is now called
 > 100% of the time, and since I'm not doing much different than before, I
 > suspect that something in the new napi code is staying in polling mode
 > forever? Since e1000e is pretty much the same code as e1000, I doubt the
 > problem is there, but you can probably tell better. ideas?

 Hello,

 Yes, a correct observation. I've spotted this bug too, and it is caused
 by the policy change in the NAPI scheduling. Look at tx_cleaned. I
 suggest we revert this change for now.

 Cheers
						--ro

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 7b0bcdb..5cb883a 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3944,7 +3944,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 			  &work_done, budget);
 
 	/* If no Tx and not enough Rx work done, exit the polling mode */
-	if ((tx_cleaned && (work_done < budget)) ||
+	if ((!tx_cleaned && (work_done == 0)) ||
 	    !netif_running(poll_dev)) {
 quit_polling:
 		if (likely(adapter->itr_setting & 3))
Re: pktgen terminating condition
jamal writes:

 > On Tue, 2007-28-08 at 21:43 -0700, Mandeep Singh Baines wrote:
 >
 > I think its a good thing pktgen caught this; i am unsure however if it
 > is doing the right thing. Hoping Robert would respond.
 > One thing pktgen could do is restrict the amount of outstanding buffers
 > by using accounting the way sockets do - this at least wont stop
 > progress as it did in your case.

 Hello,

 Yes, it's a synchronization issue... the test is over and we have sent
 all pkts to the device, but pktgen cannot free the skb as it still has
 refcounts. IMO most drivers have provisions to handle situations like
 this. I guess we can discuss a last-resort timer, and whether it should
 be us, ms or possibly seconds, but we shouldn't need a ping to make
 this happen. If so, we probably have an ICMP packet sitting there
 waiting instead. (See the sketch at the end of this message.)

 Cheers
						--ro
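 A hedged sketch of such a last-resort wait (the 100 ms bound and the
 surrounding structure are assumptions for illustration, not pktgen
 API):

/* Wait for the driver to release its references to the skb, but give
 * up after a bounded time so a stuck device can't hang the shutdown. */
unsigned long timeout = jiffies + msecs_to_jiffies(100);	/* assumed bound */

while (atomic_read(&pkt_dev->skb->users) != 1) {
	if (time_after(jiffies, timeout)) {
		printk(KERN_WARNING "pktgen: skb still in use, giving up\n");
		break;					/* the last resort */
	}
	schedule_timeout_interruptible(1);		/* let TX complete */
}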
Re: Remove softirq scheduling from pktgen [PATCH]
Christoph Hellwig writes:

 > > Hello, It's not a job for pktgen.
 >
 > Please also kill the do_softirq export while you're at it.

 Right. It seems pktgen luckily was the only user.

 Cheers
						--ro

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0f546dd..dbbdcd7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -271,8 +271,6 @@ asmlinkage void do_softirq(void)
 	local_irq_restore(flags);
 }
 
-EXPORT_SYMBOL(do_softirq);
-
 #endif
 
 /*
Remove softirq scheduling from pktgen [PATCH]
 Hello, It's not a job for pktgen.

 Cheers
						--ro

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 18601af..975e887 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -164,7 +164,7 @@
 #include <asm/div64.h>		/* do_div */
 #include <asm/timex.h>
 
-#define VERSION "pktgen v2.68: Packet Generator for packet performance testing.\n"
+#define VERSION "pktgen v2.69: Packet Generator for packet performance testing.\n"
 
 /* The buckets are exponential in 'width' */
 #define LAT_BUCKETS_MAX 32
@@ -381,7 +381,6 @@ struct pktgen_thread {
 	struct list_head th_list;
 	struct task_struct *tsk;
 	char result[512];
-	u32 max_before_softirq;	/* We'll call do_softirq to prevent starvation. */
 
 	/* Field for thread to receive "posted" events terminate, stop ifs etc. */
@@ -1752,9 +1751,6 @@ static int pktgen_thread_show(struct seq_file *seq, void *v)
 
 	BUG_ON(!t);
 
-	seq_printf(seq, "Name: %s  max_before_softirq: %d\n",
-		   t->tsk->comm, t->max_before_softirq);
-
 	seq_printf(seq, "Running: ");
 
 	if_lock(t);
@@ -1787,7 +1783,6 @@ static ssize_t pktgen_thread_write(struct file *file,
 	int i = 0, max, len, ret;
 	char name[40];
 	char *pg_result;
-	unsigned long value = 0;
 
 	if (count < 1) {
 		//	sprintf(pg_result, "Wrong command format");
@@ -1861,12 +1856,8 @@ static ssize_t pktgen_thread_write(struct file *file,
 	}
 
 	if (!strcmp(name, "max_before_softirq")) {
-		len = num_arg(&user_buffer[i], 10, &value);
-		mutex_lock(&pktgen_thread_lock);
-		t->max_before_softirq = value;
-		mutex_unlock(&pktgen_thread_lock);
+		sprintf(pg_result, "OK: Note! max_before_softirq is obsoleted -- Do not use");
 		ret = count;
-		sprintf(pg_result, "OK: max_before_softirq=%lu", value);
 		goto out;
 	}
@@ -2145,7 +2136,6 @@ static void spin(struct pktgen_dev *pkt_dev, __u64 spin_until_us)
 		if (spin_until_us - now > jiffies_to_usecs(1) + 1)
 			schedule_timeout_interruptible(1);
 		else if (spin_until_us - now > 100) {
-			do_softirq();
 			if (!pkt_dev->running)
 				return;
 			if (need_resched())
@@ -3515,8 +3505,6 @@ static int pktgen_thread_worker(void *arg)
 	struct pktgen_thread *t = arg;
 	struct pktgen_dev *pkt_dev = NULL;
 	int cpu = t->cpu;
-	u32 max_before_softirq;
-	u32 tx_since_softirq = 0;
 
 	BUG_ON(smp_processor_id() != cpu);
 
@@ -3526,8 +3514,6 @@ static int pktgen_thread_worker(void *arg)
 
 	pr_debug("pktgen: starting pktgen/%d:  pid=%d\n", cpu, current->pid);
 
-	max_before_softirq = t->max_before_softirq;
-
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	set_freezable();
@@ -3546,24 +3532,9 @@ static int pktgen_thread_worker(void *arg)
 
 		__set_current_state(TASK_RUNNING);
 
-		if (pkt_dev) {
-
+		if (pkt_dev)
 			pktgen_xmit(pkt_dev);
 
-			/*
-			 * We like to stay RUNNING but must also give
-			 * others fair share.
-			 */
-
-			tx_since_softirq += pkt_dev->last_ok;
-
-			if (tx_since_softirq > max_before_softirq) {
-				if (local_softirq_pending())
-					do_softirq();
-				tx_since_softirq = 0;
-			}
-		}
-
 		if (t->control & T_STOP) {
 			pktgen_stop(t);
 			t->control &= ~(T_STOP);
pktgen Multiqueue support [PATCH]
 Hello,

 Below is some pktgen support for sending into different TX queues.
 This can of course be fed into input queues on other machines.

 Cheers
						--ro

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index a0db4d1..18601af 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -186,6 +186,7 @@
 #define F_SVID_RND    (1<<10)	/* Random SVLAN ID */
 #define F_FLOW_SEQ    (1<<11)	/* Sequential flows */
 #define F_IPSEC_ON    (1<<12)	/* ipsec on for flows */
+#define F_QUEUE_MAP_RND (1<<13)	/* queue map Random */
 
 /* Thread control flag bits */
 #define T_TERMINATE   (1<<0)
@@ -328,6 +329,7 @@ struct pktgen_dev {
 	__be32 cur_daddr;
 	__u16 cur_udp_dst;
 	__u16 cur_udp_src;
+	__u16 cur_queue_map;
 	__u32 cur_pkt_size;
 
 	__u8 hh[14];
@@ -355,6 +357,10 @@ struct pktgen_dev {
 	unsigned lflow;		/* Flow length  (config) */
 	unsigned nflows;	/* accumulated flows (stats) */
 	unsigned curfl;		/* current sequenced flow (state)*/
+
+	u16 queue_map_min;
+	u16 queue_map_max;
+
 #ifdef CONFIG_XFRM
 	__u8	ipsmode;	/* IPSEC mode (config) */
 	__u8	ipsproto;	/* IPSEC type (config) */
@@ -611,6 +617,11 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 	seq_printf(seq, "     flows: %u flowlen: %u\n",
 		   pkt_dev->cflows, pkt_dev->lflow);
 
+	seq_printf(seq,
+		   "     queue_map_min: %u  queue_map_max: %u\n",
+		   pkt_dev->queue_map_min,
+		   pkt_dev->queue_map_max);
+
 	if (pkt_dev->flags & F_IPV6) {
 		char b1[128], b2[128], b3[128];
 		fmt_ip6(b1, pkt_dev->in6_saddr.s6_addr);
@@ -707,6 +718,9 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 	if (pkt_dev->flags & F_MPLS_RND)
 		seq_printf(seq, "MPLS_RND  ");
 
+	if (pkt_dev->flags & F_QUEUE_MAP_RND)
+		seq_printf(seq, "QUEUE_MAP_RND  ");
+
 	if (pkt_dev->cflows) {
 		if (pkt_dev->flags & F_FLOW_SEQ)
 			seq_printf(seq, "FLOW_SEQ  "); /*in sequence flows*/
@@ -762,6 +776,8 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 	seq_printf(seq, "     cur_udp_dst: %d  cur_udp_src: %d\n",
 		   pkt_dev->cur_udp_dst, pkt_dev->cur_udp_src);
 
+	seq_printf(seq, "     cur_queue_map: %u\n", pkt_dev->cur_queue_map);
+
 	seq_printf(seq, "     flows: %u\n", pkt_dev->nflows);
 
 	if (pkt_dev->result[0])
@@ -1213,6 +1229,11 @@ static ssize_t pktgen_if_write(struct file *file,
 		else if (strcmp(f, "FLOW_SEQ") == 0)
 			pkt_dev->flags |= F_FLOW_SEQ;
 
+		else if (strcmp(f, "QUEUE_MAP_RND") == 0)
+			pkt_dev->flags |= F_QUEUE_MAP_RND;
+
+		else if (strcmp(f, "!QUEUE_MAP_RND") == 0)
+			pkt_dev->flags &= ~F_QUEUE_MAP_RND;
 #ifdef CONFIG_XFRM
 		else if (strcmp(f, "IPSEC") == 0)
 			pkt_dev->flags |= F_IPSEC_ON;
@@ -1517,6 +1538,28 @@ static ssize_t pktgen_if_write(struct file *file,
 		return count;
 	}
 
+	if (!strcmp(name, "queue_map_min")) {
+		len = num_arg(&user_buffer[i], 5, &value);
+		if (len < 0) {
+			return len;
+		}
+		i += len;
+		pkt_dev->queue_map_min = value;
+		sprintf(pg_result, "OK: queue_map_min=%u", pkt_dev->queue_map_min);
+		return count;
+	}
+
+	if (!strcmp(name, "queue_map_max")) {
+		len = num_arg(&user_buffer[i], 5, &value);
+		if (len < 0) {
+			return len;
+		}
+		i += len;
+		pkt_dev->queue_map_max = value;
+		sprintf(pg_result, "OK: queue_map_max=%u", pkt_dev->queue_map_max);
+		return count;
+	}
+
 	if (!strcmp(name, "mpls")) {
 		unsigned n, offset;
 		len = get_labels(&user_buffer[i], pkt_dev);
@@ -2378,6 +2421,20 @@ static void mod_cur_headers(struct pktgen_dev *pkt_dev)
 		pkt_dev->cur_pkt_size = t;
 	}
 
+	if (pkt_dev->queue_map_min < pkt_dev->queue_map_max) {
+		__u16 t;
+		if (pkt_dev->flags & F_QUEUE_MAP_RND) {
+			t = random32() %
+				(pkt_dev->queue_map_max - pkt_dev->queue_map_min + 1)
+
pktgen multiqueue oops
 Hello,

 Initially pkt_dev can be NULL, which causes netif_subqueue_stopped to
 oops. The patch below should cure it. But maybe the pktgen TX logic
 should be reworked to better support the new multiqueue support.

 Cheers
						--ro

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 7bae576..a0db4d1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3331,8 +3331,9 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 	}
 
 	if ((netif_queue_stopped(odev) ||
-	     netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping)) ||
-	    need_resched()) {
+	     (pkt_dev->skb &&
+	      netif_subqueue_stopped(odev, pkt_dev->skb->queue_mapping))) ||
+	     need_resched()) {
 		idle_start = getCurUs();
 
 		if (!netif_running(odev)) {
[RFT 0/4] fib_trie cleanup patches
Stephen Hemminger writes:

 > I don't have a machine with anywhere near enough routes to test this,
 > so would someone with many routes give it a go and make sure nothing
 > got busted in the process.

 Hello!

 It's not only the number of routes that's important... Anyway, I've
 done what I can to verify that route selection (comparing lookups via
 netlink against a userland longest-prefix-match using the same full
 routing table; a sketch of such a reference matcher is at the end of
 this message) and locking (concurrent rDoS on multiple CPU:s, also
 while loading the full routing table) are intact.

 Keep TKEY_GET_MASK in patch #2 as it gives some hint what's going on.
 And how about Paul E. McKenney's comment about rcu_dereference()
 covering the initial fetch?

 BTW. Right now the lab is set up for the tests described above...

 Cheers.
						--ro
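 A sketch of the kind of userland reference matcher used for such a
 cross-check (linear scan; rt_ref, tab and n are illustrative names, not
 a real tool's API):

/* Trivial longest-prefix match over a snapshot of the routing table,
 * used only as an oracle to compare against kernel lookup results. */
struct rt_ref {
	unsigned int prefix;	/* host byte order */
	int plen;
};

static int lpm_lookup(const struct rt_ref *tab, int n, unsigned int dst)
{
	int i, best = -1, best_plen = -1;

	for (i = 0; i < n; i++) {
		unsigned int mask =
			tab[i].plen ? ~0u << (32 - tab[i].plen) : 0;

		if ((dst & mask) == (tab[i].prefix & mask) &&
		    tab[i].plen > best_plen) {
			best = i;
			best_plen = tab[i].plen;
		}
	}
	return best;	/* index of the most specific match, or -1 */
}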
Re: [RFT] fib_trie: cleanup
David Miller writes:

 > From: Stephen Hemminger <[EMAIL PROTECTED]>
 > Date: Thu, 26 Jul 2007 09:46:48 +0100
 >
 > > Try this out:
 > >  * replace macro's with inlines
 > >  * get rid of places doing multiple evaluations of NODE_PARENT
 >
 > No objections from me.
 >
 > Robert?

 Fine, it looks cleaner and compiles... thanks Stephen.

 BTW, I think it might be possible to improve the functional/performance
 parts too. It's the list handling I'm thinking of. I've observed that
 in most cases there is only one leaf_info in the leaf list: 96% of the
 current Internet prefixes in the router I just looked into (below). So
 optimizing for this case could be an idea. Some variant where the leaf
 holds the leaf_info data directly could be worth testing.

 Cheers.
						--ro

Main:
	Aver depth:     2.57
	Max depth:      6
	Leaves:         215131
	Internal nodes: 52394
	  1: 27662  2: 9982  3: 8510  4: 3625  5: 1684  6: 626  7: 240  8: 64  16: 1
	Pointers: 427924
	Null ptrs: 160400
	Total size: 7102 kB

ip route list | wc -l
224649

215131/224649 = 96%
Re: [WIP][PATCHES] Network xmit batching
jamal writes:

 > What is your kernel config in regards to HRES timers? Robert mentioned
 > to me that the clock source may be causing issues with pktgen (maybe
 > even qos). Robert, insights?

 pktgen heavily uses gettimeofday. I was using tsc as the clock source
 with our opterons in the lab. In late 2.6.20 gettimeofday was changed
 so tsc couldn't be used on opterons (with pktgen at least). To give you
 an example: with Intel's 82571EB we could send 1.488 Mpps using tsc,
 but with hpet only 400 kpps.

 But Evgeniy is most likely using the same clocksource both for the
 mainline and the batch tests, so there must be something else that
 differs...

 Cheers.
						--ro
Re: [PATCH] NET: Multiqueue network device support.
jamal writes:

 > I think the one described by Leonid has not just 8 tx/rx rings but also
 > a separate register set, MSI binding etc iirc. The only shared resources
 > as far as i understood Leonid are the bus and the ethernet wire.

 AFAIK most new NICs will look like this... I still lack a lot of
 crucial hardware understanding. What will happen if we for some reason
 are not capable of serving one TX ring? The NIC still works, so do we
 continue filling/sending/clearing on the other rings?

 > So in such a case (assuming 8 rings),
 > One model is creating 4 netdev devices each based on single tx/rx ring
 > and register set and then having a mother netdev (what you call the
 > bond) that feeds these children netdev based on some qos parametrization
 > is very sensible. Each of the children netdevices (by virtue of how we
 > do things today) could be tied to a CPU for effectiveness (because our
 > per CPU work is based on netdevs).

 Some kind of supervising function for the TX is probably needed, as we
 still want to see the device as one entity. But if upcoming HW supports
 parallelism straight to the TX ring, we of course would like to use
 that to get minimal cache effects. It all depends on how this "master
 netdev" or queue supervisor can be designed.

 > For the Leonid-NIC (for lack of better name) it may be harder to do
 > parallelization on rcv if you use what i said above. But you could
 > use a different model on receive - such as create a single netdev and
 > with 8 rcv rings and MSI tied on rcv to 8 different CPUs

 Yes, that should be the way to do it... and ethtool or something to
 hint the NIC how the incoming data gets classified wrt the available
 CPUs. Maybe something more dynamic for the brave ones.

 Cheers
						--ro
Re: [PATCH] NET: Multiqueue network device support.
jamal writes:

 > The key argument i make (from day one actually) is to leave the
 > majority of the work to the driver.
 > My view of wireless WMM etc is it is a different media behavior
 > (compared to wired ethernet) which means a different view of strategy
 > for when it opens the valve to allow in more packets. 802.11 media has
 > embedded signalling which is usable. Guy Cohen gave a good use case
 > which i responded to. Do you wanna look at that and respond?

 Hello,

 I haven't got all the details. IMO we need to support some
 "bonding-like" scenario too, where one CPU feeds just one TX ring (and
 TX buffers are cleared by the same CPU). We probably don't want to
 stall all queuing when one ring is full.

 The scenario I see is to support parallelism in forwarding/firewalling
 etc., for example when the RX load gets split onto different CPUs by
 hardware and, for cache reasons, we want to process on the same CPU
 even for TX. If the RX hardware split keeps packets from the same flow
 on the same CPU, we shouldn't get reordering within flows.

 Cheers
						--ro
[PATCH] pktgen IPSEC 4/4: Add IPSEC support to pktgen
jamal writes:

 > 4 of 4
 > [PKTGEN] IPSEC support
 > Added transport mode ESP support for starters.
 > I will send more of these modes and types once i have resolved
 > the tunnel mode issues.
 >
 > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>

 Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

 Cheers
						--ro

 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c
 > index bc4fb3b..bcec8e4 100644
 > --- a/net/core/pktgen.c
 > +++ b/net/core/pktgen.c
 > @@ -152,6 +152,9 @@
 >  #include <net/checksum.h>
 >  #include <net/ipv6.h>
 >  #include <net/addrconf.h>
 > +#ifdef CONFIG_XFRM
 > +#include <net/xfrm.h>
 > +#endif
 >  #include <asm/byteorder.h>
 >  #include <linux/rcupdate.h>
 >  #include <linux/bitops.h>
 > @@ -182,6 +185,7 @@
 >  #define F_VID_RND     (1<<9)	/* Random VLAN ID */
 >  #define F_SVID_RND    (1<<10)	/* Random SVLAN ID */
 >  #define F_FLOW_SEQ    (1<<11)	/* Sequential flows */
 > +#define F_IPSEC_ON    (1<<12)	/* ipsec on for flows */
 >
 >  /* Thread control flag bits */
 >  #define T_TERMINATE   (1<<0)
 > @@ -208,6 +212,9 @@ static struct proc_dir_entry *pg_proc_dir = NULL;
 >  struct flow_state {
 >  	__be32 cur_daddr;
 >  	int count;
 > +#ifdef CONFIG_XFRM
 > +	struct xfrm_state *x;
 > +#endif
 >  	__u32 flags;
 >  };
 >
 > @@ -348,7 +355,10 @@ struct pktgen_dev {
 >  	unsigned lflow;		/* Flow length  (config) */
 >  	unsigned nflows;	/* accumulated flows (stats) */
 >  	unsigned curfl;		/* current sequenced flow (state)*/
 > -
 > +#ifdef CONFIG_XFRM
 > +	__u8	ipsmode;	/* IPSEC mode (config) */
 > +	__u8	ipsproto;	/* IPSEC type (config) */
 > +#endif
 >  	char result[512];
 >  };
 >
 > @@ -704,6 +714,9 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 >  			seq_printf(seq, "FLOW_RND  ");
 >  	}
 >
 > +	if (pkt_dev->flags & F_IPSEC_ON)
 > +		seq_printf(seq, "IPSEC  ");
 > +
 >  	if (pkt_dev->flags & F_MACSRC_RND)
 >  		seq_printf(seq, "MACSRC_RND  ");
 >
 > @@ -1198,6 +1211,11 @@ static ssize_t pktgen_if_write(struct file *file,
 >  		else if (strcmp(f, "FLOW_SEQ") == 0)
 >  			pkt_dev->flags |= F_FLOW_SEQ;
 >
 > +#ifdef CONFIG_XFRM
 > +		else if (strcmp(f, "IPSEC") == 0)
 > +			pkt_dev->flags |= F_IPSEC_ON;
 > +#endif
 > +
 >  		else if (strcmp(f, "!IPV6") == 0)
 >  			pkt_dev->flags &= ~F_IPV6;
 >
 > @@ -1206,7 +1224,7 @@ static ssize_t pktgen_if_write(struct file *file,
 >  				"Flag -:%s:- unknown\nAvailable flags, (prepend ! to un-set flag):\n%s",
 >  				f,
 >  				"IPSRC_RND, IPDST_RND, UDPSRC_RND, UDPDST_RND, "
 > -				"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ\n");
 > +				"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ, IPSEC\n");
 >  			return count;
 >  		}
 >  		sprintf(pg_result, "OK: flags=0x%x", pkt_dev->flags);
 > @@ -2094,6 +2112,7 @@ static void spin(struct pktgen_dev *pkt_dev, __u64 spin_until_us)
 >
 >  static inline void set_pkt_overhead(struct pktgen_dev *pkt_dev)
 >  {
 > +	pkt_dev->pkt_overhead = 0;
 >  	pkt_dev->pkt_overhead += pkt_dev->nr_labels*sizeof(u32);
 >  	pkt_dev->pkt_overhead += VLAN_TAG_SIZE(pkt_dev);
 >  	pkt_dev->pkt_overhead += SVLAN_TAG_SIZE(pkt_dev);
 > @@ -2130,6 +2149,31 @@ static inline int f_pick(struct pktgen_dev *pkt_dev)
 >  	return pkt_dev->curfl;
 >  }
 >
 > +
 > +#ifdef CONFIG_XFRM
 > +/* If there was already an IPSEC SA, we keep it as is, else
 > + * we go look for it ...
 > +*/
 > +inline
 > +void get_ipsec_sa(struct pktgen_dev *pkt_dev, int flow)
 > +{
 > +	struct xfrm_state *x = pkt_dev->flows[flow].x;
 > +	if (!x) {
 > +		/*slow path: we dont already have xfrm_state*/
 > +		x = xfrm_stateonly_find((xfrm_address_t *)&pkt_dev->cur_daddr,
 > +					(xfrm_address_t *)&pkt_dev->cur_saddr,
 > +					AF_INET,
 > +					pkt_dev->ipsmode,
 > +					pkt_dev->ipsproto, 0);
 > +		if (x) {
 > +			pkt_dev->flows[flow].x =
[PATCH] pktgen IPSEC 1/4: Centralize pktgen packet overhead management
jamal writes: > Manual labor still ... 1 of 4 > [PKTGEN] Centralize packet overhead tracking > Track the extra packet overhead for VLAN tags, MPLS, IPSEC etc > > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]> Thanks, Jamal. I'll guess the ipsec part is to be considered work-in-progress and you're doing both the work and the progress. Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> Cheers --ro > diff --git a/net/core/pktgen.c b/net/core/pktgen.c > index 9cd3a1c..1352316 100644 > --- a/net/core/pktgen.c > +++ b/net/core/pktgen.c > @@ -228,6 +228,7 @@ struct pktgen_dev { > > int min_pkt_size; /* = ETH_ZLEN; */ > int max_pkt_size; /* = ETH_ZLEN; */ > +int pkt_overhead; /* overhead for MPLS, VLANs, IPSEC etc */ > int nfrags; > __u32 delay_us; /* Default delay */ > __u32 delay_ns; > @@ -2075,6 +2076,13 @@ static void spin(struct pktgen_dev *pkt_dev, __u64 > spin_until_us) > pkt_dev->idle_acc += now - start; > } > > +static inline void set_pkt_overhead(struct pktgen_dev *pkt_dev) > +{ > +pkt_dev->pkt_overhead += pkt_dev->nr_labels*sizeof(u32); > +pkt_dev->pkt_overhead += VLAN_TAG_SIZE(pkt_dev); > +pkt_dev->pkt_overhead += SVLAN_TAG_SIZE(pkt_dev); > +} > + > /* Increment/randomize headers according to flags and current values > * for IP src/dest, UDP src/dst port, MAC-Addr src/dst > */ > @@ -2323,9 +2331,7 @@ static struct sk_buff *fill_packet_ipv4(struct > net_device *odev, > > datalen = (odev->hard_header_len + 16) & ~0xf; > skb = alloc_skb(pkt_dev->cur_pkt_size + 64 + datalen + > -pkt_dev->nr_labels*sizeof(u32) + > -VLAN_TAG_SIZE(pkt_dev) + SVLAN_TAG_SIZE(pkt_dev), > -GFP_ATOMIC); > +pkt_dev->pkt_overhead, GFP_ATOMIC); > if (!skb) { > sprintf(pkt_dev->result, "No memory"); > return NULL; > @@ -2368,7 +2374,7 @@ static struct sk_buff *fill_packet_ipv4(struct > net_device *odev, > > /* Eth + IPh + UDPh + mpls */ > datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 - > - pkt_dev->nr_labels*sizeof(u32) - VLAN_TAG_SIZE(pkt_dev) - > SVLAN_TAG_SIZE(pkt_dev); > + pkt_dev->pkt_overhead; > if (datalen < sizeof(struct pktgen_hdr)) > datalen = sizeof(struct pktgen_hdr); > > @@ -2391,8 +2397,7 @@ static struct sk_buff *fill_packet_ipv4(struct > net_device *odev, > iph->check = ip_fast_csum((void *)iph, iph->ihl); > skb->protocol = protocol; > skb->mac_header = (skb->network_header - ETH_HLEN - > - pkt_dev->nr_labels * sizeof(u32) - > - VLAN_TAG_SIZE(pkt_dev) - SVLAN_TAG_SIZE(pkt_dev)); > + pkt_dev->pkt_overhead); > skb->dev = odev; > skb->pkt_type = PACKET_HOST; > > @@ -2662,9 +2667,7 @@ static struct sk_buff *fill_packet_ipv6(struct > net_device *odev, > mod_cur_headers(pkt_dev); > > skb = alloc_skb(pkt_dev->cur_pkt_size + 64 + 16 + > -pkt_dev->nr_labels*sizeof(u32) + > -VLAN_TAG_SIZE(pkt_dev) + SVLAN_TAG_SIZE(pkt_dev), > -GFP_ATOMIC); > +pkt_dev->pkt_overhead, GFP_ATOMIC); > if (!skb) { > sprintf(pkt_dev->result, "No memory"); > return NULL; > @@ -2708,7 +2711,7 @@ static struct sk_buff *fill_packet_ipv6(struct > net_device *odev, > /* Eth + IPh + UDPh + mpls */ > datalen = pkt_dev->cur_pkt_size - 14 - >sizeof(struct ipv6hdr) - sizeof(struct udphdr) - > - pkt_dev->nr_labels*sizeof(u32) - VLAN_TAG_SIZE(pkt_dev) - > SVLAN_TAG_SIZE(pkt_dev); > + pkt_dev->pkt_overhead; > > if (datalen < sizeof(struct pktgen_hdr)) { > datalen = sizeof(struct pktgen_hdr); > @@ -2738,8 +2741,7 @@ static struct sk_buff *fill_packet_ipv6(struct > net_device *odev, > ipv6_addr_copy(&iph->saddr, &pkt_dev->cur_in6_saddr); > > skb->mac_header = (skb->network_header - ETH_HLEN - > - pkt_dev->nr_labels * sizeof(u32) - > - 
VLAN_TAG_SIZE(pkt_dev) - SVLAN_TAG_SIZE(pkt_dev)); > + pkt_dev->pkt_overhead); > skb->protocol = protocol; > skb->dev = odev; > s
[PATCH] pktgen IPSEC 2/4: Introduce pktgen sequential flows
jamal writes: > 2 of 4 > > cheers, > jamal > commit 882c296bb3f153e1ac770a874c75cfb2bab8481b > Author: Jamal Hadi Salim <[EMAIL PROTECTED]> > Date: Tue Jun 12 07:24:00 2007 -0400 > > [PKTGEN] Introduce sequential flows > > By default all flows in pktgen are randomly selected. > This patch introduces ability to have all defined flows to > be sent sequentially. Robert defined randomness to be the > default behavior. > > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]> Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> Cheers --ro > > diff --git a/net/core/pktgen.c b/net/core/pktgen.c > index 1352316..bc4fb3b 100644 > --- a/net/core/pktgen.c > +++ b/net/core/pktgen.c > @@ -181,6 +181,7 @@ > #define F_MPLS_RND(1<<8)/* Random MPLS labels */ > #define F_VID_RND (1<<9)/* Random VLAN ID */ > #define F_SVID_RND(1<<10) /* Random SVLAN ID */ > +#define F_FLOW_SEQ(1<<11) /* Sequential flows */ > > /* Thread control flag bits */ > #define T_TERMINATE (1<<0) > @@ -207,8 +208,12 @@ static struct proc_dir_entry *pg_proc_dir = NULL; > struct flow_state { > __be32 cur_daddr; > int count; > +__u32 flags; > }; > > +/* flow flag bits */ > +#define F_INIT (1<<0) /* flow has been initialized */ > + > struct pktgen_dev { > /* > * Try to keep frequent/infrequent used vars. separated. > @@ -342,6 +347,7 @@ struct pktgen_dev { > unsigned cflows;/* Concurrent flows (config) */ > unsigned lflow; /* Flow length (config) */ > unsigned nflows;/* accumulated flows (stats) */ > +unsigned curfl; /* current sequenced flow (state)*/ > > char result[512]; > }; > @@ -691,6 +697,13 @@ static int pktgen_if_show(struct seq_file *seq, void *v) > if (pkt_dev->flags & F_MPLS_RND) > seq_printf(seq, "MPLS_RND "); > > +if (pkt_dev->cflows) { > +if (pkt_dev->flags & F_FLOW_SEQ) > +seq_printf(seq, "FLOW_SEQ "); /*in sequence flows*/ > +else > +seq_printf(seq, "FLOW_RND "); > +} > + > if (pkt_dev->flags & F_MACSRC_RND) > seq_printf(seq, "MACSRC_RND "); > > @@ -1182,6 +1195,9 @@ static ssize_t pktgen_if_write(struct file *file, > else if (strcmp(f, "!SVID_RND") == 0) > pkt_dev->flags &= ~F_SVID_RND; > > +else if (strcmp(f, "FLOW_SEQ") == 0) > +pkt_dev->flags |= F_FLOW_SEQ; > + > else if (strcmp(f, "!IPV6") == 0) > pkt_dev->flags &= ~F_IPV6; > > @@ -1190,7 +1206,7 @@ static ssize_t pktgen_if_write(struct file *file, > "Flag -:%s:- unknown\nAvailable flags, (prepend > ! to un-set flag):\n%s", > f, > "IPSRC_RND, IPDST_RND, UDPSRC_RND, UDPDST_RND, " > -"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, > MPLS_RND, VID_RND, SVID_RND\n"); > +"MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, > MPLS_RND, VID_RND, SVID_RND, FLOW_SEQ\n"); > return count; > } > sprintf(pg_result, "OK: flags=0x%x", pkt_dev->flags); > @@ -2083,6 +2099,37 @@ static inline void set_pkt_overhead(struct pktgen_dev > *pkt_dev) > pkt_dev->pkt_overhead += SVLAN_TAG_SIZE(pkt_dev); > } > > +static inline int f_seen(struct pktgen_dev *pkt_dev, int flow) > +{ > + > +if (pkt_dev->flows[flow].flags & F_INIT) > +return 1; > +else > +return 0; > +} > + > +static inline int f_pick(struct pktgen_dev *pkt_dev) > +{ > +int flow = pkt_dev->curfl; > + > +if (pkt_dev->flags & F_FLOW_SEQ) { > +if (pkt_dev->flows[flow].count >= pkt_dev->lflow) { > +/* reset time */ > +pkt_dev->flows[flow].count = 0; > +pkt_dev->curfl += 1; > +if (pkt_dev->curfl >= pkt_dev->cflows) > +pkt_dev->curfl = 0; /*reset */ > +} > +} else { > +flow = random32() % pkt_dev->cflow
[PATCH] pktgen IPSEC 3/4: Introduce xfrm SAD only lookup
jamal writes: > 3 of 4 .. > [XFRM] Introduce standalone SAD lookup > This allows other in-kernel functions to do SAD lookups. > The only known user at the moment is pktgen. > > Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]> xfrm is not my area..... Acked-by: Robert Olsson <[EMAIL PROTECTED]> Cheers --ro > > diff --git a/include/net/xfrm.h b/include/net/xfrm.h > index 311f25a..79d2c37 100644 > --- a/include/net/xfrm.h > +++ b/include/net/xfrm.h > @@ -920,6 +920,10 @@ extern struct xfrm_state > *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t >struct flowi *fl, struct xfrm_tmpl > *tmpl, >struct xfrm_policy *pol, int *err, >unsigned short family); > +extern struct xfrm_state * xfrm_stateonly_find(xfrm_address_t *daddr, > + xfrm_address_t *saddr, > + unsigned short family, > + u8 mode, u8 proto, u32 reqid); > extern int xfrm_state_check_expire(struct xfrm_state *x); > extern void xfrm_state_insert(struct xfrm_state *x); > extern int xfrm_state_add(struct xfrm_state *x); > diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c > index 85f3f43..b8562e4 100644 > --- a/net/xfrm/xfrm_state.c > +++ b/net/xfrm/xfrm_state.c > @@ -686,6 +686,41 @@ out: > return x; > } > > +struct xfrm_state * > +xfrm_stateonly_find(xfrm_address_t *daddr, xfrm_address_t *saddr, > +unsigned short family, u8 mode, u8 proto, u32 reqid) > +{ > +unsigned int h = xfrm_dst_hash(daddr, saddr, reqid, family); > +struct xfrm_state *rx = NULL, *x = NULL; > +struct hlist_node *entry; > + > +spin_lock(&xfrm_state_lock); > +hlist_for_each_entry(x, entry, xfrm_state_bydst+h, bydst) { > +if (x->props.family == family && > +x->props.reqid == reqid && > +!(x->props.flags & XFRM_STATE_WILDRECV) && > +xfrm_state_addr_check(x, daddr, saddr, family) && > +mode == x->props.mode && > +proto == x->id.proto) { > + > +if (x->km.state != XFRM_STATE_VALID) > +continue; > +else { > +rx = x; > +break; > +} > +} > +} > + > +if (rx) > +xfrm_state_hold(rx); > +spin_unlock(&xfrm_state_lock); > + > + > +return rx; > +} > +EXPORT_SYMBOL(xfrm_stateonly_find); > + > static void __xfrm_state_insert(struct xfrm_state *x) > { > unsigned int h; - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
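For context, the pktgen caller in patch 4/4 uses this lookup roughly as
follows (condensed from that patch): per flow, the SA is looked up once
on the slow path and then cached in the flow state.

/* Condensed from pktgen's get_ipsec_sa() in patch 4/4 */
struct xfrm_state *x = pkt_dev->flows[flow].x;

if (!x) {
	/* slow path: we don't already have an xfrm_state */
	x = xfrm_stateonly_find((xfrm_address_t *)&pkt_dev->cur_daddr,
				(xfrm_address_t *)&pkt_dev->cur_saddr,
				AF_INET,
				pkt_dev->ipsmode,
				pkt_dev->ipsproto, 0);
	if (x)
		pkt_dev->flows[flow].x = x;	/* ref held by the lookup */
}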
Re: [PATCH] fib_hash removal
Paul E. McKenney writes:
 > Those of us who dive into networking only occasionally would much
 > appreciate this. ;-)

 No problem here...

 Cheers
 --ro

Acked-by: Robert Olsson <[EMAIL PROTECTED]>

Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]> (but trivial)

diff -urpNa -X dontdiff linux-2.6.20/net/ipv4/fib_trie.c linux-2.6.20-fn_trie_RTNL/net/ipv4/fib_trie.c
--- linux-2.6.20/net/ipv4/fib_trie.c	2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20-fn_trie_RTNL/net/ipv4/fib_trie.c	2007-03-20 08:13:59.0 -0700
@@ -1124,6 +1124,9 @@ err:
 	return fa_head;
 }
 
+/*
+ * Caller must hold RTNL.
+ */
 static int fn_trie_insert(struct fib_table *tb, struct fib_config *cfg)
 {
 	struct trie *t = (struct trie *) tb->tb_data;
@@ -1543,6 +1546,9 @@ static int trie_leaf_remove(struct trie
 	return 1;
 }
 
+/*
+ * Caller must hold RTNL.
+ */
 static int fn_trie_delete(struct fib_table *tb, struct fib_config *cfg)
 {
 	struct trie *t = (struct trie *) tb->tb_data;
@@ -1721,6 +1727,9 @@ up:
 	return NULL; /* Ready. Root of trie */
 }
 
+/*
+ * Caller must hold RTNL.
+ */
 static int fn_trie_flush(struct fib_table *tb)
 {
 	struct trie *t = (struct trie *) tb->tb_data;
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fib_hash removal
Paul E. McKenney writes:
 > > We have two users of trie_leaf_remove, fn_trie_flush and fn_trie_delete;
 > > both are holding RTNL. So there shouldn't be any need for this preempt
 > > stuff. This is assumed to be a leftover from an older RCU take.
 >
 > True enough! One request -- would it be reasonable to add to
 > trie_leaf_remove()'s comment to state that RTNL must be held
 > by the caller?

 Thanks for your review. Yes, but it's implicitly assumed that the
 updater side holds RTNL, and we have a comment that states we're run
 by the updater. If we mention RTNL here we should probably comment
 other places too. ;)

 Cheers
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fib_hash removal
Hello,

Just discussed this with Patrick... We have two users of
trie_leaf_remove, fn_trie_flush and fn_trie_delete; both are holding
RTNL. So there shouldn't be any need for this preempt stuff. It is
assumed to be a leftover from an older RCU take.

 > Mhh .. I think I just remembered something - me incorrectly suggesting
 > to add it there while we were talking about this at OLS :) IIRC the
 > idea was to make sure tnode_free (which at that time didn't use
 > call_rcu) wouldn't free memory while still in use in a rcu read-side
 > critical section. It should have been synchronize_rcu of course,
 > but with tnode_free using call_rcu it seems to be completely
 > unnecessary. So I guess we can simply remove it.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>
Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Cheers.
--ro

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1534,7 +1534,6 @@ static int trie_leaf_remove(struct trie
 	t->revision++;
 	t->size--;
 
-	preempt_disable();
 	tp = NODE_PARENT(n);
 	tnode_free((struct tnode *) n);
 
@@ -1544,7 +1543,6 @@ static int trie_leaf_remove(struct trie
 		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
 	} else
 		rcu_assign_pointer(t->trie, NULL);
-	preempt_enable();
 
 	return 1;
 }
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fib_trie root node settings
The threshold for the root node can be set more aggressively to get
better tree compression. The new setting makes the root grow from 16 to
19 bits, with a substantial improvement in average depth; this is with
the current table of 214393 prefixes. But really, the dynamic resize
needs more investigation, both in terms of convergence and performance,
and maybe it should be possible to change at runtime... Maybe just for
the brave to start with, or we may have to back this out.

- Current ---
Main:
	Aver depth:     2.59
	Max depth:      9
	Leaves:         205530
	Internal nodes: 52634
	  1: 26507  2: 11693  3: 8007  4: 3700  5: 1776  6: 658  7: 241  8: 51  16: 1
	Pointers: 431426
	Null ptrs: 173263
	Total size: 6934 kB

- With the new setting ---
Main:
	Aver depth:     2.06
	Max depth:      8
	Leaves:         205530
	Internal nodes: 48235
	  1: 23812  2: 10023  3: 8089  4: 3972  5: 2332  6: 6  19: 1
	Pointers: 815276
	Null ptrs: 561512
	Total size: 8330 kB

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 7ef5948..1560c54 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -293,8 +293,8 @@ static inline void check_tnode(const struct tnode *tn)
 
 static int halve_threshold = 25;
 static int inflate_threshold = 50;
-static int halve_threshold_root = 15;
-static int inflate_threshold_root = 25;
+static int halve_threshold_root = 8;
+static int inflate_threshold_root = 15;
 
 
 static void __alias_free_mem(struct rcu_head *head)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fib_trie resize break
Hello.

The patch below adds a break condition for the resize operations. If we
don't achieve the desired fill factor, a warning is printed. The trie
should still be operational, but new thresholds should then be
considered.

Cheers
--ro

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 1e589b9..7ef5948 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.407"
+#define VERSION "0.408"
 
 #include
 #include
@@ -460,6 +460,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	struct tnode *old_tn;
 	int inflate_threshold_use;
 	int halve_threshold_use;
+	int max_resize;
 
 	if (!tn)
 		return NULL;
@@ -560,7 +561,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 		inflate_threshold_use = inflate_threshold;
 
 	err = 0;
-	while ((tn->full_children > 0 &&
+	max_resize = 10;
+	while ((tn->full_children > 0 && max_resize-- &&
 		50 * (tn->full_children + tnode_child_length(tn) - tn->empty_children)
 		>= inflate_threshold_use * tnode_child_length(tn))) {
@@ -575,6 +577,17 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 		}
 	}
 
+	if(max_resize < 0) {
+
+		if(!tn->parent)
+			printk(KERN_WARNING "Fix inflate_threshold_root. Now=%d size=%d bits\n",
+			       inflate_threshold_root, tn->bits);
+
+		else
+			printk(KERN_WARNING "Fix inflate_threshold. Now=%d size=%d bits\n",
+			       inflate_threshold, tn->bits);
+	}
+
 	check_tnode(tn);
 
 	/*
@@ -591,7 +604,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 		halve_threshold_use = halve_threshold;
 
 	err = 0;
-	while (tn->bits > 1 &&
+	max_resize = 10;
+	while (tn->bits > 1 && max_resize-- &&
 	       100 * (tnode_child_length(tn) - tn->empty_children)
 	       < halve_threshold_use * tnode_child_length(tn)) {
@@ -606,6 +620,16 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 		}
 	}
 
+	if(max_resize < 0) {
+
+		if(!tn->parent)
+			printk(KERN_WARNING "Fix halve_threshold_root. Now=%d size=%d bits\n",
+			       halve_threshold_root, tn->bits);
+
+		else
+			printk(KERN_WARNING "Fix halve_threshold. Now=%d size=%d bits\n",
+			       halve_threshold, tn->bits);
+	}
 
 	/* Only one child remains */
 	if (tn->empty_children == tnode_child_length(tn) - 1)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
David Miller writes:
 > Even a nearly perfect hash has small lumps in distribution, and we
 > should not penalize entries which fall into these lumps.
 >
 > Let us call T the threshold at which we would grow the routing hash
 > table. As we approach T we start to GC. Let's assume hash table
 > has shift = 2. and T would (with T=N+(N>>1) algorithm) therefore be
 > 6.
 >
 > TABLE: [0] DST1, DST2
 >        [1] DST3, DST4, DST5
 >
 > DST6 arrives, what should we do?
 >
 > If we just accept it and don't GC some existing entries, we
 > will grow the hash table. This is the wrong thing to do if
 > our true working set is smaller than 6 entries and thus some
 > of the existing entries are unlikely to be reused and thus
 > could be purged to keep us from hitting T.
 >
 > If they are all active, growing is the right thing to do.
 >
 > This is the crux of the whole routing cache problem.

 Yes, it's very complex... It would be better to have the GC process
 and the data structure more independent, at least for us mortals. With
 the unicache I was trying to achieve something like this: a data
 structure that is pretty independent and optimal wrt inserts and
 deletes (GC). Well, not 100% perfect, as we don't want to resize the
 root node - which is close to a hash resize.

 Stefan Nilsson did some work in his "dynamic tries"; halve_threshold
 and inflate_threshold control the resize alone. AFAIK the two
 different thresholds were used to prevent oscillation and provide
 dampening. Maybe some of those ideas can be considered.

 About GC: if we forget what can be done with active GC for now, IMO
 the "most important" GC for constant load is the on-demand or passive
 kind, where we are triggered by a fixed threshold. (A fixed
 equilibrium point - a rough sketch follows after this mail.)

 > I am of the opinion that LRU, for routes not attached to sockets, is
 > probably the best thing to do here.

 Yes, among other things rt_score checks for this.

 > Furthermore at high packet rates, the current rt_may_expire() logic
 > probably is not very effective since it's granularity is limited to
 > jiffies. We can quite easily create 100,000 or more entries per
 > jiffie when HZ=100 during rDOS, for example. So perhaps some global
 > LRU algorithm using ktime is more appropriate.

 Timer-based GC, in my world, is just there to get rid of entries when
 traffic has stopped or dropped.

 > Global LRU is not easy without touching a lot of memory. But I'm
 > sure some clever trick can be discovered by someone :)

 Yes, as we have to scan all entries. This has to be balanced against
 the work we have to do if we remove something that needs to be
 "restored".

 > It is amusing, but it seems that for rDOS workload most optimal
 > routing hash would be tiny one like my example above. All packets
 > essentially miss the routing cache and create new entry. So
 > keeping the working set as small as possible is what you want
 > to do since no matter how large you grow your hit rate will be
 > zero :-)

 Yes, Alexey and I tried a long time ago to limit the lengths of the
 hash chains. We saw improved results, but we didn't find any usable
 measure for rDoS.

 Cheers
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
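As a rough illustration of the passive, threshold-triggered GC
described above (purely a sketch with made-up names - struct cache,
cache_evict_one() - not the kernel's rt_garbage_collect()):

#include <asm/atomic.h>

struct entry;				/* opaque cached object */

struct cache {
	atomic_t entries;
	int gc_thresh;			/* the fixed equilibrium point */
};

static void cache_evict_one(struct cache *c);	/* e.g. lowest score/LRU */

static void cache_insert(struct cache *c, struct entry *e)
{
	/* On-demand GC: inserting past the threshold evicts first, so
	 * under constant load the structure hovers around gc_thresh
	 * instead of being swept periodically by a timer. */
	if (atomic_read(&c->entries) >= c->gc_thresh)
		cache_evict_one(c);

	atomic_inc(&c->entries);
	/* link e into the structure here */
}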
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
Eric Dumazet writes:
 > With 2^20 entries, your actual limit of 2^19 entries in root node will
 > probably show us quite different numbers for order-1,2,3,4... tnodes

 Yep, the trie will get deeper and lookup more costly, as will insert
 and delete. The 2^19 limit was because I was hitting a memory
 allocation problem that I never sorted out.

 > Yes, numbers you gave us basically showed a big root node, and mainly leaves
 > and very few tnodes.
 >
 > I was interested to see the distribution in case the root-node limit is hit,
 > and we load into the table a *lot* of entries.

 Max length etc... well, maybe the root restriction should be removed
 and we should just have a max size instead.

 Cheers
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: linux 2.6 Ipv4 routing enhancement (fwd)
Richard Kojedzinszky writes:
 > Sorry for sending the tgz with .svn included. And i did not send
 > instructions.
 > To do a test with fib_trie, issue
 > $ make clean all ROUTE_ALG=TRIE & ./try a
 > with fib_radix:
 > $ make clean all ROUTE_ALG=RADIX & ./try a
 > with fib_lef:
 > $ make clean all ROUTE_ALG=LEF SBBITS=4 & ./try a

 Thanks. I usually do my testing in kernel context, in the forwarding
 path with full semantic match, so it's not that easy to compare. But
 I'll take a look. BTW, when you test, do you do correct prefix
 matching?

 FYI, some old fib work on robur.slu.se:

 # Look with just hlist
 /pub/Linux/net-development/fib_hlist/

 # 24 bit hash lookup
 /pub/Linux/net-development/fib_hash2/

 And some hlist/hash2/trie comparisons in:
 /pub/Linux/tmp/trie-talk-kth.pdf

 Cheers
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
Eric Dumazet writes:
 > Indeed. It would be nice to see how it performs with say 2^20 elements...
 > Because with your data, I wonder if the extra complexity of the trash is worth
 > it (since most lookups are going to only hit the hash and give the answer
 > without intermediate nodes)

 I don't know if I understand you fully. Yes, in most cases the first
 lookup via the "hash header" will take us directly to the correct
 leaf. If there are "collisions" we have to sort them out by adding
 intermediate nodes. Something like a resizable hash where each bucket
 in turn is a resizable hash, etc. (see the structure sketch below).

 Cheers
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
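A structural sketch of that "hash of hashes" view, loosely modeled on
fib_trie's tnode; the field names are illustrative, not the exact
kernel definitions:

#include <linux/types.h>

/* Each internal node is an array of 2^bits child pointers indexed by
 * a slice of the key; a child is either a leaf or another such node,
 * i.e. a resizable hash whose buckets are themselves resizable
 * hashes, resolving "collisions" one level down. */
struct node {
	unsigned long parent;		/* parent pointer + node-type bit */
	u32 key;
};

struct tnode {
	unsigned long parent;
	u32 key;
	unsigned char pos;		/* first key bit this node indexes on */
	unsigned char bits;		/* log2 of the child array size */
	struct node *child[0];		/* 1 << bits slots, resized on demand */
};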
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
Eric Dumazet writes:
 > Well, maybe... but after looking robert's trash, I discovered its model is
 > essentially a big (2^18 slots) root node (our hash table), and very few
 > order:1,2,3 nodes.

 It's getting "hash-like", yes. I guess all effective algorithms today
 do some sort of "index" lookup, and for a large number of entries we
 cannot expect to find the next node in the same cache line, so the
 "tree depth" becomes a crucial performance factor. IMO nothing can
 beat a perfectly distributed and perfectly sized hash. The trash work
 is an effort to get close to that with a dynamic data structure.

 Cheers
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH]: Dynamically sized routing cache hash table.
David Miller writes:

 Interesting.

 > Actually, more accurately, the conflict exists in how this GC
 > logic is implemented. The core issue is that hash table size
 > guides the GC processing, and hash table growth therefore
 > modifies those GC goals. So with the patch below we'll just
 > keep growing the hash table instead of giving GC some time to
 > try to keep the working set in equilibrium before doing the
 > hash grow.

 AFAIK the equilibrium is a resizing function as well, but using a
 fixed hash table. So can we do without equilibrium resizing if tables
 are dynamic? I think so. With the hash data structure we could monitor
 the average chain length, or just the size, and resize the hash based
 on that. (A sketch of this is below.)

 > One idea is to put the hash grow check in the garbage collector,
 > and put the hash shrink check in rt_del().
 >
 > In fact, it would be a good time to perhaps hack up some entirely
 > new passive GC logic for the routing cache.

 Could be. Remember the GC in the hash chain too, which was added
 later; although it doesn't decrease the number of entries, it gives an
 upper limit. Also, the gc-goal must be picked so it does not force
 unwanted resizing.

 > BTW, another thing that plays into this is that Robert's TRASH work
 > could make this patch not necessary :-)

 It has "built-in" resize and chain control, and the gc-goal is chosen
 so as not to unnecessarily resize the root node.

 Cheers.
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
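A rough sketch of resizing from observed load instead of from GC goals,
as suggested above; dyn_hash, dyn_hash_resize() and MIN_SIZE are
hypothetical names, not the actual route.c code:

#include <asm/atomic.h>

#define MIN_SIZE 16			/* hypothetical floor */

struct dyn_hash {
	unsigned int size;		/* buckets, power of two */
	atomic_t entries;
};

static void dyn_hash_resize(struct dyn_hash *h, unsigned int new_size);

static void dyn_hash_check(struct dyn_hash *h)
{
	unsigned int n = atomic_read(&h->entries);

	/* Grow when the average chain length passes 2, shrink when the
	 * table is mostly empty; the GC never needs to know the size. */
	if (n > 2 * h->size)
		dyn_hash_resize(h, h->size << 1);
	else if (h->size > MIN_SIZE && n < h->size / 4)
		dyn_hash_resize(h, h->size >> 1);
}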
linux 2.6 Ipv4 routing enhancement (fwd)
Richard Kojedzinszky writes:
 > traffic, and also update the routing table (from BGP), the route cache
 > seemed to be the bottleneck, as upon every fib update the whole route
 > cache is flushed, and sometimes it took as many cpu cycles to let some
 > packets being dropped. Meanwhile i knew that *BSD systems do not use such
 > a cache, and of course without it a router can provide a constant
 > performance, not depending on the number of different ip flows, and
 > updating the fib does not take such a long time.

 Hmm, I think there is a cache in *BSD* too. Anyway, you're correct
 that the GC and insert/deletion of routes flush the cache, and this
 can cause packet drops when all flows have to get recreated. Yes, it's
 something that needs to be addressed, but it's not that common that
 people use dynamic routing protocols. Anyway, Dave and Alexey started
 to look into this some time ago; I got involved later, and there were
 some ideas about how to deal with this. This work didn't come to an
 end. So if you want to contribute, I think we'll all be happy.

 > For this to be solved, i have played with ipv4 routing in linux kernel a
 > bit. I have done two separate things:
 > - developed a new fib algorithm in fib_trie's place for ipv4
 > - rewrote the kernel not to use it's dst cache

 Just for routing?

 > The fib algorithm is like cisco's CEF (at least if my knowledge is correct),
 > but first I use a 16-branching tree, to look up the address by 4 bit steps, and
 > each node in this tree contains a simple sub-tree which is a radix tree, of
 > course with maximum possible height 4. I think this is very simple, and is
 > nearly 3 times faster than fib_trie. Now it has a missing feature: it does not
 > export the fib in /proc/net/route.

 Full semantic match? The LC-trie scales tree branching automatically,
 so looking at a Linux router running a full BGP feed with 204300
 prefixes we see:

	  1: 27567  2: 10127  3: 8149  4: 3630  5: 1529  6: 558  7: 197  8: 53  16: 1

 The root node is 16-bit too, and the average depth is 2.60. So 3 times
 faster than fib_trie - that would be a sensation. How do you test?

 > The second thing i have done to minimize the cpu cycles during the forwarding
 > phase, rewriting ip_input.c, route.c and some others to lef.c, and having a
 > minimal functionality. I mean, for example, when a packet gets through the lef
 > functions, ipsec policies are not checked.

 It would be nice to see a profile before and after your patch.

 > And to be more efficient, I attached a neighbour pointer to each fib entry, and
 > using this the lookup + forwarding code is very fast.
 > Of course, the route cache needs very little time to forward packets when there
 > are a small number of different ip flows, but when dealing with traffic in an
 > ISP at core level, this cannot be stated.
 > So I have done tests with LEF, and compared them to the original linux kernel's
 > performance.
 > With the worst case, LEF performed nearly 90% of the linux kernel with the most
 > optimal case. Of course original linux performs poorly with the worst case.

 Send them, and with profiles if possible...

 Cheers.
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
Michael K. Edwards writes:
 > This, incidentally, seems very similar to the process that Robert
 > Olsson and Stefan Nilsson have gone through with their trie/hash
 > project. Although I haven't tried it out yet and don't have any basis
 > for an independent opinion, the data and analysis provided in their
 > paper are quite convincing.

 I have some info about this "process" :)

 I moved fib_trie.c to userland to extend it to longer and variable key
 lengths. The test program did insert/lookup/remove with random data,
 with blistering performance (very flat trees). So, quite happy, I
 moved this trie back into the kernel and started testing with "real
 data" - IP addresses and rDoS. The result was disastrous: very deep
 trees and awful network performance. Random data is very different
 from real data was the lesson learned, again. Gave up.

 A couple of days later an idea came up. I remembered the proof from
 the LC-trie paper that the length of the key does not impact tree
 depth. So I went to Stefan: can't we just fix up the data rather than
 fiddling with a new improved algorithm? The result was this "hash"
 header to boost tree compression (sketched below).

 Cheers
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
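A hedged sketch of the "hash header" key idea from the trash paper:
hash the flow key and place the hash first, so what the trie sees is
uniformly distributed and stays flat even for real (clustered)
addresses. The field layout here is illustrative, not the paper's
exact format:

#include <linux/types.h>
#include <linux/jhash.h>

struct trash_key {
	u32 hash;			/* hash of the tuple, matched first */
	__be32 saddr, daddr;
	__be16 sport, dport;
};

static void make_key(struct trash_key *k, __be32 saddr, __be32 daddr,
		     __be16 sport, __be16 dport)
{
	k->saddr = saddr;
	k->daddr = daddr;
	k->sport = sport;
	k->dport = dport;
	/* The hash "fixes up the data": random-looking bits up front
	 * give the compression the flat trees it gets on random keys. */
	k->hash = jhash_3words((__force u32)saddr, (__force u32)daddr,
			       ((__force u32)sport << 16) |
			       (__force u32)dport, 0);
}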
[PATCH 4/4] pktgen: fix device name handling
Yes it seems be handle dev name change. So configuration scripts should use ifindex now :) Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> Cheers. --ro Stephen Hemminger writes: > Since devices can change name and other wierdness, don't hold onto > a copy of device name, instead use pointer to output device. > > Fix a couple of leaks in error handling path as well. > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> > > --- > net/core/pktgen.c | 137 > +++--- > 1 file changed, 70 insertions(+), 67 deletions(-) > > --- pktgen.orig/net/core/pktgen.c2007-02-27 12:08:58.0 -0800 > +++ pktgen/net/core/pktgen.c 2007-02-27 12:11:32.0 -0800 > @@ -210,15 +210,11 @@ > }; > > struct pktgen_dev { > - > /* > * Try to keep frequent/infrequent used vars. separated. > */ > - > -char ifname[IFNAMSIZ]; > -char result[512]; > - > -struct pktgen_thread *pg_thread;/* the owner */ > +struct proc_dir_entry *entry; /* proc file */ > +struct pktgen_thread *pg_thread;/* the owner */ > struct list_head list; /* Used for chaining in the thread's > run-queue */ > > int running;/* if this changes to false, the test will stop > */ > @@ -345,6 +341,8 @@ > unsigned cflows;/* Concurrent flows (config) */ > unsigned lflow; /* Flow length (config) */ > unsigned nflows;/* accumulated flows (stats) */ > + > +char result[512]; > }; > > struct pktgen_hdr { > @@ -497,7 +495,7 @@ > static int pktgen_stop_device(struct pktgen_dev *pkt_dev); > static void pktgen_stop(struct pktgen_thread *t); > static void pktgen_clear_counters(struct pktgen_dev *pkt_dev); > -static int pktgen_mark_device(const char *ifname); > + > static unsigned int scan_ip6(const char *s, char ip[16]); > static unsigned int fmt_ip6(char *s, const char ip[16]); > > @@ -591,7 +589,7 @@ > " frags: %d delay: %u clone_skb: %d ifname: %s\n", > pkt_dev->nfrags, > 1000 * pkt_dev->delay_us + pkt_dev->delay_ns, > - pkt_dev->clone_skb, pkt_dev->ifname); > + pkt_dev->clone_skb, pkt_dev->odev->name); > > seq_printf(seq, " flows: %u flowlen: %u\n", pkt_dev->cflows, > pkt_dev->lflow); > @@ -1682,13 +1680,13 @@ > if_lock(t); > list_for_each_entry(pkt_dev, &t->if_list, list) > if (pkt_dev->running) > -seq_printf(seq, "%s ", pkt_dev->ifname); > +seq_printf(seq, "%s ", pkt_dev->odev->name); > > seq_printf(seq, "\nStopped: "); > > list_for_each_entry(pkt_dev, &t->if_list, list) > if (!pkt_dev->running) > -seq_printf(seq, "%s ", pkt_dev->ifname); > +seq_printf(seq, "%s ", pkt_dev->odev->name); > > if (t->result[0]) > seq_printf(seq, "\nResult: %s\n", t->result); > @@ -1834,12 +1832,11 @@ > /* > * mark a device for removal > */ > -static int pktgen_mark_device(const char *ifname) > +static void pktgen_mark_device(const char *ifname) > { > struct pktgen_dev *pkt_dev = NULL; > const int max_tries = 10, msec_per_try = 125; > int i = 0; > -int ret = 0; > > mutex_lock(&pktgen_thread_lock); > pr_debug("pktgen: pktgen_mark_device marking %s for removal\n", ifname); > @@ -1860,32 +1857,49 @@ > printk("pktgen_mark_device: timed out after waiting " > "%d msec for device %s to be removed\n", > msec_per_try * i, ifname); > -ret = 1; > break; > } > > } > > mutex_unlock(&pktgen_thread_lock); > +} > > -return ret; > +static void pktgen_change_name(struct net_device *dev) > +{ > +struct pktgen_thread *t; > + > +list_for_each_entry(t, &pktgen_threads, th_list) { > +struct pktgen_dev *pkt_dev; > + > +list_for_each_entry(pkt_dev, &t->if_list, list) { > +if (pkt_dev->odev != dev) > +continue; > + > +remove_proc_entry(
[PATCH 3/4] pktgen: don't use __constant_htonl()
OK! Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> Cheers. --ro Stephen Hemminger writes: > The existing htonl() macro is smart enough to do the same code as > using __constant_htonl() and it looks cleaner. > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> > > --- > net/core/pktgen.c | 24 > 1 file changed, 12 insertions(+), 12 deletions(-) > > --- pktgen.orig/net/core/pktgen.c2007-02-26 14:40:31.0 -0800 > +++ pktgen/net/core/pktgen.c 2007-02-26 15:36:38.0 -0800 > @@ -167,7 +167,7 @@ > #define LAT_BUCKETS_MAX 32 > #define IP_NAME_SZ 32 > #define MAX_MPLS_LABELS 16 /* This is the max label stack depth */ > -#define MPLS_STACK_BOTTOM __constant_htonl(0x0100) > +#define MPLS_STACK_BOTTOM htonl(0x0100) > > /* Device flag bits */ > #define F_IPSRC_RND (1<<0)/* IP-Src Random */ > @@ -2297,7 +2297,7 @@ > int datalen, iplen; > struct iphdr *iph; > struct pktgen_hdr *pgh = NULL; > -__be16 protocol = __constant_htons(ETH_P_IP); > +__be16 protocol = htons(ETH_P_IP); > __be32 *mpls; > __be16 *vlan_tci = NULL; /* Encapsulates priority and > VLAN ID */ > __be16 *vlan_encapsulated_proto = NULL; /* packet type ID field (or > len) for VLAN tag */ > @@ -2306,10 +2306,10 @@ > > > if (pkt_dev->nr_labels) > -protocol = __constant_htons(ETH_P_MPLS_UC); > +protocol = htons(ETH_P_MPLS_UC); > > if (pkt_dev->vlan_id != 0x) > -protocol = __constant_htons(ETH_P_8021Q); > +protocol = htons(ETH_P_8021Q); > > /* Update any of the values, used when we're incrementing various > * fields. > @@ -2341,14 +2341,14 @@ > pkt_dev->svlan_cfi, > pkt_dev->svlan_p); > svlan_encapsulated_proto = (__be16 *)skb_put(skb, > sizeof(__be16)); > -*svlan_encapsulated_proto = > __constant_htons(ETH_P_8021Q); > +*svlan_encapsulated_proto = htons(ETH_P_8021Q); > } > vlan_tci = (__be16 *)skb_put(skb, sizeof(__be16)); > *vlan_tci = build_tci(pkt_dev->vlan_id, >pkt_dev->vlan_cfi, >pkt_dev->vlan_p); > vlan_encapsulated_proto = (__be16 *)skb_put(skb, > sizeof(__be16)); > -*vlan_encapsulated_proto = __constant_htons(ETH_P_IP); > +*vlan_encapsulated_proto = htons(ETH_P_IP); > } > > iph = (struct iphdr *)skb_put(skb, sizeof(struct iphdr)); > @@ -2635,7 +2635,7 @@ > int datalen; > struct ipv6hdr *iph; > struct pktgen_hdr *pgh = NULL; > -__be16 protocol = __constant_htons(ETH_P_IPV6); > +__be16 protocol = htons(ETH_P_IPV6); > __be32 *mpls; > __be16 *vlan_tci = NULL; /* Encapsulates priority and > VLAN ID */ > __be16 *vlan_encapsulated_proto = NULL; /* packet type ID field (or > len) for VLAN tag */ > @@ -2643,10 +2643,10 @@ > __be16 *svlan_encapsulated_proto = NULL; /* packet type ID field (or > len) for SVLAN tag */ > > if (pkt_dev->nr_labels) > -protocol = __constant_htons(ETH_P_MPLS_UC); > +protocol = htons(ETH_P_MPLS_UC); > > if (pkt_dev->vlan_id != 0x) > -protocol = __constant_htons(ETH_P_8021Q); > +protocol = htons(ETH_P_8021Q); > > /* Update any of the values, used when we're incrementing various > * fields. > @@ -2677,14 +2677,14 @@ > pkt_dev->svlan_cfi, > pkt_dev->svlan_p); > svlan_encapsulated_proto = (__be16 *)skb_put(skb, > sizeof(__be16)); > -*svlan_encapsulated_proto = > __constant_htons(ETH_P_8021Q); > +*svlan_encapsulated_proto = htons(ETH_P_8021Q); > } > vlan_tci = (__be16 *)skb_put(skb, sizeof(__be16)); > *vlan_tci = build_tci(pkt_dev->vlan_id, >pkt_dev->vlan_cfi, >pkt_dev->vlan_p); > vlan_encapsulated_proto = (__be16 *)skb_put(skb, > sizeof(__be16)); > -*vlan_encapsulated_proto = __constant_htons(ETH_P_IPV6); > +*vlan_encapsulated_proto = htons(ETH_P_IPV6); >
[PATCH 2/4] pktgen: use random32
Thanks! It seems like network code has preference for net_random() but they are the same now. Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> Cheers. --ro Stephen Hemminger writes: > Can use random32() now. > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> > > --- > net/core/pktgen.c | 52 > +++- > 1 file changed, 19 insertions(+), 33 deletions(-) > > --- pktgen.orig/net/core/pktgen.c2007-02-26 14:34:36.0 -0800 > +++ pktgen/net/core/pktgen.c 2007-02-26 14:39:53.0 -0800 > @@ -464,17 +464,6 @@ > return tmp; > } > > -static inline u32 pktgen_random(void) > -{ > -#if 0 > -__u32 n; > -get_random_bytes(&n, 4); > -return n; > -#else > -return net_random(); > -#endif > -} > - > static inline __u64 getCurMs(void) > { > struct timeval tv; > @@ -2091,7 +2080,7 @@ > int flow = 0; > > if (pkt_dev->cflows) { > -flow = pktgen_random() % pkt_dev->cflows; > +flow = random32() % pkt_dev->cflows; > > if (pkt_dev->flows[flow].count > pkt_dev->lflow) > pkt_dev->flows[flow].count = 0; > @@ -2103,7 +2092,7 @@ > __u32 tmp; > > if (pkt_dev->flags & F_MACSRC_RND) > -mc = pktgen_random() % (pkt_dev->src_mac_count); > +mc = random32() % pkt_dev->src_mac_count; > else { > mc = pkt_dev->cur_src_mac_offset++; > if (pkt_dev->cur_src_mac_offset > > @@ -2129,7 +2118,7 @@ > __u32 tmp; > > if (pkt_dev->flags & F_MACDST_RND) > -mc = pktgen_random() % (pkt_dev->dst_mac_count); > +mc = random32() % pkt_dev->dst_mac_count; > > else { > mc = pkt_dev->cur_dst_mac_offset++; > @@ -2156,24 +2145,23 @@ > for(i = 0; i < pkt_dev->nr_labels; i++) > if (pkt_dev->labels[i] & MPLS_STACK_BOTTOM) > pkt_dev->labels[i] = MPLS_STACK_BOTTOM | > - ((__force __be32)pktgen_random() & > + ((__force __be32)random32() & >htonl(0x000f)); > } > > if ((pkt_dev->flags & F_VID_RND) && (pkt_dev->vlan_id != 0x)) { > -pkt_dev->vlan_id = pktgen_random() % 4096; > +pkt_dev->vlan_id = random32() & (4096-1); > } > > if ((pkt_dev->flags & F_SVID_RND) && (pkt_dev->svlan_id != 0x)) { > -pkt_dev->svlan_id = pktgen_random() % 4096; > +pkt_dev->svlan_id = random32() & (4096 - 1); > } > > if (pkt_dev->udp_src_min < pkt_dev->udp_src_max) { > if (pkt_dev->flags & F_UDPSRC_RND) > -pkt_dev->cur_udp_src = > -((pktgen_random() % > - (pkt_dev->udp_src_max - pkt_dev->udp_src_min)) + > - pkt_dev->udp_src_min); > +pkt_dev->cur_udp_src = random32() % > +(pkt_dev->udp_src_max - pkt_dev->udp_src_min) > ++ pkt_dev->udp_src_min; > > else { > pkt_dev->cur_udp_src++; > @@ -2184,10 +2172,9 @@ > > if (pkt_dev->udp_dst_min < pkt_dev->udp_dst_max) { > if (pkt_dev->flags & F_UDPDST_RND) { > -pkt_dev->cur_udp_dst = > -((pktgen_random() % > - (pkt_dev->udp_dst_max - pkt_dev->udp_dst_min)) + > - pkt_dev->udp_dst_min); > +pkt_dev->cur_udp_dst = random32() % > +(pkt_dev->udp_dst_max - pkt_dev->udp_dst_min) > ++ pkt_dev->udp_dst_min; > } else { > pkt_dev->cur_udp_dst++; > if (pkt_dev->cur_udp_dst >= pkt_dev->udp_dst_max) > @@ -2202,7 +2189,7 @@ > saddr_max))) { > __u32 t; > if (pkt_dev->flags & F_IPSRC_RND) > -t = ((pktgen_random
[PATCH 1/4] pktgen: use pr_debug
Thanks! Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> --ro Stephen Hemminger writes: > Remove private debug macro and replace with standard version > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> > > > --- > net/core/pktgen.c | 34 +++--- > 1 file changed, 15 insertions(+), 19 deletions(-) > > --- pktgen.orig/net/core/pktgen.c2007-02-26 13:21:54.0 -0800 > +++ pktgen/net/core/pktgen.c 2007-02-26 13:22:04.0 -0800 > @@ -163,9 +163,6 @@ > > #define VERSION "pktgen v2.68: Packet Generator for packet performance > testing.\n" > > -/* #define PG_DEBUG(a) a */ > -#define PG_DEBUG(a) > - > /* The buckets are exponential in 'width' */ > #define LAT_BUCKETS_MAX 32 > #define IP_NAME_SZ 32 > @@ -1856,8 +1853,7 @@ > int ret = 0; > > mutex_lock(&pktgen_thread_lock); > -PG_DEBUG(printk("pktgen: pktgen_mark_device marking %s for removal\n", > -ifname)); > +pr_debug("pktgen: pktgen_mark_device marking %s for removal\n", ifname); > > while (1) { > > @@ -1866,8 +1862,8 @@ > break; /* success */ > > mutex_unlock(&pktgen_thread_lock); > -PG_DEBUG(printk("pktgen: pktgen_mark_device waiting for %s " > -"to disappear\n", ifname)); > +pr_debug("pktgen: pktgen_mark_device waiting for %s " > +"to disappear\n", ifname); > schedule_timeout_interruptible(msecs_to_jiffies(msec_per_try)); > mutex_lock(&pktgen_thread_lock); > > @@ -2847,7 +2843,7 @@ > struct pktgen_dev *pkt_dev; > int started = 0; > > -PG_DEBUG(printk("pktgen: entering pktgen_run. %p\n", t)); > +pr_debug("pktgen: entering pktgen_run. %p\n", t); > > if_lock(t); > list_for_each_entry(pkt_dev, &t->if_list, list) { > @@ -2879,7 +2875,7 @@ > { > struct pktgen_thread *t; > > -PG_DEBUG(printk("pktgen: entering pktgen_stop_all_threads_ifs.\n")); > +pr_debug("pktgen: entering pktgen_stop_all_threads_ifs.\n"); > > mutex_lock(&pktgen_thread_lock); > > @@ -2947,7 +2943,7 @@ > { > struct pktgen_thread *t; > > -PG_DEBUG(printk("pktgen: entering pktgen_run_all_threads.\n")); > +pr_debug("pktgen: entering pktgen_run_all_threads.\n"); > > mutex_lock(&pktgen_thread_lock); > > @@ -3039,7 +3035,7 @@ > { > struct pktgen_dev *pkt_dev; > > -PG_DEBUG(printk("pktgen: entering pktgen_stop\n")); > +pr_debug("pktgen: entering pktgen_stop\n"); > > if_lock(t); > > @@ -3063,7 +3059,7 @@ > struct list_head *q, *n; > struct pktgen_dev *cur; > > -PG_DEBUG(printk("pktgen: entering pktgen_rem_one_if\n")); > +pr_debug("pktgen: entering pktgen_rem_one_if\n"); > > if_lock(t); > > @@ -3092,7 +3088,7 @@ > > /* Remove all devices, free mem */ > > -PG_DEBUG(printk("pktgen: entering pktgen_rem_all_ifs\n")); > +pr_debug("pktgen: entering pktgen_rem_all_ifs\n"); > if_lock(t); > > list_for_each_safe(q, n, &t->if_list) { > @@ -3275,7 +3271,7 @@ > > t->pid = current->pid; > > -PG_DEBUG(printk("pktgen: starting pktgen/%d: pid=%d\n", cpu, > current->pid)); > +pr_debug("pktgen: starting pktgen/%d: pid=%d\n", cpu, current->pid); > > max_before_softirq = t->max_before_softirq; > > @@ -3336,13 +3332,13 @@ > set_current_state(TASK_INTERRUPTIBLE); > } > > -PG_DEBUG(printk("pktgen: %s stopping all device\n", t->tsk->comm)); > +pr_debug("pktgen: %s stopping all device\n", t->tsk->comm); > pktgen_stop(t); > > -PG_DEBUG(printk("pktgen: %s removing all device\n", t->tsk->comm)); > +pr_debug("pktgen: %s removing all device\n", t->tsk->comm); > pktgen_rem_all_ifs(t); > > -PG_DEBUG(printk("pktgen: %s removing thread.\n", t->tsk->comm)); > +pr_debug("pktgen: %s removing thread.\n", t->tsk->comm); > pktgen_rem_thread(t); > > return 0; > @@ -3361,7 +3357,7 @@ > } > >
Re: Extensible hashing and RCU
David Miller writes:
 > But what about if tree lookup were free :-)
 >
 > This is why I consider Robert Olsson's trash work the most promising,
 > if we stick sockets into his full flow identified routing cache trie
 > entries, we can eliminate lookup altogether.
 >
 > Just like how he already uses traffic inspection to kill cache entries
 > when FIN shutdown sequence is snooped, we can explicitly flush such
 > entries when socket is closed fully on local system.

 Below are current tree stats from a "production" flow-logging
 application at a large university (real traffic) using this full flow
 lookup. With 100k flows we typically see an average depth of 1.3 and a
 max depth of 4. Right now there is only a dst entry in the leaf nodes.
 So yes, anything else we can put in the leaf is "free" (see the sketch
 below).

 trie:
	Aver depth:     1.31
	Max depth:      4
	Leaves:         99930
	Internal nodes: 14925
	  1: 13450  2: 1465  3: 9  18: 1
	Pointers: 294976
	Null ptrs: 180122
	Total size: 7259 kB

 Cheers.
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
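Illustrative leaf layout for the point above; the fields besides the
dst entry are hypothetical, not the actual trash leaf:

#include <net/dst.h>
#include <net/sock.h>

/* Once the full-flow key has taken us to the leaf, anything stored
 * next to the dst entry is reachable at no extra lookup cost. */
struct flow_leaf {
	u32 key[4];			/* full flow key incl. hash header */
	struct dst_entry *dst;		/* what the leaf holds today */
	struct sock *sk;		/* a socket could ride along "for free" */
};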
Re: Extensible hashing and RCU
Andi Kleen writes:
 > > If not, you lose.
 >
 > It all depends on if the higher levels on the trie are small
 > enough to be kept in cache. Even with two cache misses it might
 > still break even, but have better scalability.

 Yes, the trick is to keep the root large to allow a very flat tree and
 few cache misses. Stefan Nilsson (author of the LC-trie) and I were
 able to improve the LC-trie quite a bit; we called this trie+hash ->
 trash. The paper discusses trie/hash... (you've seen it):
 http://www.nada.kth.se/~snilsson/public/papers/trash/

 > Another advantage would be to eliminate the need for large memory
 > blocks, which cause problems too e.g. on NUMA. It certainly would
 > save quite some memory if the tree levels are allocated on demand
 > only. However breaking it up might also cost more TLB misses,
 > but those could be eliminated by preallocating the tree in
 > the same way as the hash today.

 Don't know if it's needed or not.

 > I guess someone needs to code it up and try it.

 I've implemented trie/trash as a replacement for the dst hash, with
 full-key lookup for ipv4 (unicache), to start with. I'm still focusing
 on the nasty parts, packet forwarding, as we don't want to break this.
 So the benefits of full flow lookup are not accounted for, i.e. the
 full flow lookup could give the socket at no cost and do some
 conntrack support like Evgeniy did in the netchannel patches.

 Below are some recent comparisons and profiles for packet forwarding.
 Input: 2 * 65k concurrent flows, eth0->eth1 and eth2->eth3 forwarded
 on separate CPUs, Opteron 2218 (2.6 GHz), net-2.6.21 git. Numbers are
 very approximate but should still be representative. Profiles are
 collected.

 Performance comparison
 ----------------------
 Table below holds: dst entries in use, lookup hits + slow path = total kpps

 Flowlen 40
  250k   1020 + 21  = 1041 kpps   Vanilla rt_hash=32k
  1M      950 + 29  =  979 kpps   Vanilla rt_hash=131k
  260k    980 + 24  = 1004 kpps   Unicache

 Flowlen 4 (rdos)
  290k    560 + 162 = 722 kpps    Vanilla rt_hash=32k
  1M      400 + 165 = 565 kpps    Vanilla rt_hash=131k
  230k    570 + 170 = 740 kpps    Unicache

 unicache flen=4 pkts
 --------------------
 c02df84f 5257  7.72078  tkey_extract_bits
 c023151a 5230  7.68112  e1000_clean_rx_irq
 c02df908 3306  4.85541  tkey_equals
 c014cf31 3296  4.84072  kfree
 c02f8c3b 3067  4.5044   ip_route_input
 c02fbdf0 2948  4.32963  ip_forward
 c023024e 2809  4.12548  e1000_xmit_frame
 c02e06f1 2792  4.10052  trie_lookup
 c02fd764 2159  3.17085  ip_output
 c032591c 1965  2.88593  fn_trie_lookup
 c014cd82 1456  2.13838  kmem_cache_alloc
 c02fa941 1337  1.96361  ip_rcv
 c014ced0 1334  1.9592   kmem_cache_free
 c02e1538 1289  1.89311  unicache_tcp_establish
 c02e2d70 1218  1.78884  dev_queue_xmit
 c02e31af 1074  1.57735  netif_receive_skb
 c02f8484 1053  1.54651  ip_route_input_slow
 c02db552 987   1.44957  __alloc_skb
 c02e626f 913   1.34089  dst_alloc
 c02edaad 828   1.21606  __qdisc_run
 c0321ccf 810   1.18962  fib_get_table
 c02e14c1 782   1.1485   match_pktgen
 c02e6375 766   1.125    dst_destroy
 c02e10e8 728   1.06919  unicache_hash_code
 c0231242 647   0.950227 e1000_clean_tx_irq
 c02f7d23 625   0.917916 ipv4_dst_destroy

 unicache flen=40 pkts
 ---------------------
 c023151a 6742  10.3704  e1000_clean_rx_irq
 c02df908 4553  7.00332  tkey_equals
 c02fbdf0 4455  6.85258  ip_forward
 c02e06f1 4067  6.25577  trie_lookup
 c02f8c3b 3951  6.07734  ip_route_input
 c02df84f 3929  6.0435   tkey_extract_bits
 c023024e 3538  5.44207  e1000_xmit_frame
 c014cf31 3152  4.84834  kfree
 c02fd764 2711  4.17     ip_output
 c02e1538 1930  2.96868  unicache_tcp_establish
 c02fa941 1696  2.60875  ip_rcv
 c02e31af 1466  2.25497  netif_receive_skb
 c02e2d70 1412  2.17191  dev_queue_xmit
 c014cd82 1397  2.14883  kmem_cache_alloc
 c02db552 1394  2.14422  __alloc_skb
 c02edaad 1032  1.5874   __qdisc_run
 c02ed6b8 957   1.47204  eth_header
 c02e15dd 904   1.39051  unicache_garbage_collect_active
 c02db94e 861   1.32437  kfree_skb
 c0231242 794   1.22131  e1000_clean_tx_irq
 c022fd58 778   1.1967   e1000_tx_map
 c014ce73 756   1.16286  __kmalloc
 c014ced0 740   1.13825  kmem_cache_free
 c02e14c1 701   1.07826  match_pktgen
 c023002c 621   0.955208 e1000_tx_queue
 c02e78fa 519   0.798314 neigh_resolve_output

 Vanilla w. flen=4 pkts rt_hash=32k
 ----------------------------------
 c02f6852 15704 22.9102  ip_route_input
 c023151a 5324  7.76705  e1000_clean_rx_irq
 c02f84a1 4457  6.5022   ip_rcv
 c02f9950 3065  4.47145  ip_forward
 c023024e 2630  3.83684  e1000_xmit_frame
 c0323380 2343  3.41814  fn_trie_lookup
 c02fb2c4 2181  3.1818   ip_output
 c02f4a3b 1839  2.68287  rt_intern_hash
 c02f4480 1762  2.57054  rt_may_expire
 c02f60
NAPI quota and budget
Hello!

K. Salah writes:
 > I have a question about the quota per poll in NAPI. Any idea how the
 > quota of 64 packets per poll per NIC was selected? Why 64, not any
 > other number? Is there science behind this number.

 The number comes from experimentation. The "science" is that it's
 possible to set any number at all. The first NAPI driver, tulip
 (100 Mbps), increased the RX ring size from 32 to 128 descriptors and
 used 16 pkts per poll. A quota of 64 for GigE devices (tested with 256
 descriptors) seemed to be a good compromise in experiments.

	16 < 64 < netdev_budget

 > And the same goes for the budget of 300?
 >
 > Was that done based on experimentation?

 Yes, it was the size of the central buffer before NAPI. It's possible
 netdev_budget can be increased a bit today, as packet processing speed
 has increased. A sketch of how quota and budget interact is below.

 Cheers.
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
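A minimal sketch of the pre-2.6.24 NAPI poll contract, showing how the
per-device quota (the device weight, e.g. 64) and the global
netdev_budget (300) interact; my_clean_rx_ring() and my_enable_irqs()
are hypothetical driver internals:

#include <linux/netdevice.h>

static int my_clean_rx_ring(struct net_device *dev, int limit);
static void my_enable_irqs(struct net_device *dev);

static int my_poll(struct net_device *dev, int *budget)
{
	/* Never do more than our own quota or what is left of the
	 * global softirq budget, whichever is smaller. */
	int work_to_do = min(*budget, dev->quota);
	int work_done = my_clean_rx_ring(dev, work_to_do);

	*budget -= work_done;
	dev->quota -= work_done;

	if (work_done < work_to_do) {
		/* Ring drained: leave the poll list, back to IRQs. */
		netif_rx_complete(dev);
		my_enable_irqs(dev);
		return 0;
	}
	return 1;	/* still work left: net_rx_action keeps us queued */
}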
Re: [PATCH] net: Fix the fib trie iterator to work with a single entry routing tables
David Miller writes:
 > > Yes the case when the trie is just a single leaf got wrong with the
 > > iterator and your patch cures it. I think we have a similar problem
 > > with /proc/net/fib_trie
 >
 > I'm happy to review a fix for that :-)

 When the main table is just a single leaf, it gets printed as
 belonging to the local table in /proc/net/fib_trie. A fix is below.

 Cheers.
 --ro

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

--- net-2.6.20/net/ipv4/fib_trie.c.orig	2007-01-26 13:18:13.0 +0100
+++ net-2.6.20/net/ipv4/fib_trie.c	2007-01-26 13:19:36.0 +0100
@@ -2290,16 +2290,17 @@
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
+	if (!NODE_PARENT(n)) {
+		if (iter->trie == trie_local)
+			seq_puts(seq, "<local>:\n");
+		else
+			seq_puts(seq, "<main>:\n");
+	}
+
 	if (IS_TNODE(n)) {
 		struct tnode *tn = (struct tnode *) n;
 		__be32 prf = htonl(MASK_PFX(tn->key, tn->pos));
 
-		if (!NODE_PARENT(n)) {
-			if (iter->trie == trie_local)
-				seq_puts(seq, "<local>:\n");
-			else
-				seq_puts(seq, "<main>:\n");
-		}
 		seq_indent(seq, iter->depth-1);
 		seq_printf(seq, " +-- %d.%d.%d.%d/%d %d %d %d\n",
 			   NIPQUAD(prf), tn->pos, tn->bits, tn->full_children,
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: Fix the fib trie iterator to work with a single entry routing tables
Hello!

Yes, the case when the trie is just a single leaf got wrong with the
iterator, and your patch cures it. I think we have a similar problem
with /proc/net/fib_trie.

Cheers
--ro

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Eric W. Biederman writes:
 >
 > In a kernel with trie routing enabled I had a simple routing setup
 > with only a single route to the outside world and no default
 > route. "ip route table list main" showed me the route just fine but
 > /proc/net/route was an empty file. What was going on?
 >
 > Thinking it was a bug in something I did and I looked deeper. Eventually
 > I setup a second route and everything looked correct, huh? Finally I
 > realized that it was just the iterator pair in fib_trie_get_first,
 > fib_trie_get_next just could not handle a routing table with a single entry.
 >
 > So to save myself and others further confusion, here is a simple fix for
 > the fib proc iterator so it works even when there is only a single route
 > in a routing table.
 >
 > Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
 > ---
 >  net/ipv4/fib_trie.c | 21 -
 >  1 files changed, 16 insertions(+), 5 deletions(-)
 >
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index cfb249c..13307c0 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -1989,6 +1989,10 @@ static struct node *fib_trie_get_next(struct
 > fib_trie_iter *iter)
 >  	unsigned cindex = iter->index;
 >  	struct tnode *p;
 >
 > +	/* A single entry routing table */
 > +	if (!tn)
 > +		return NULL;
 > +
 >  	pr_debug("get_next iter={node=%p index=%d depth=%d}\n",
 >  		 iter->tnode, iter->index, iter->depth);
 >  rescan:
 > @@ -2037,11 +2041,18 @@ static struct node *fib_trie_get_first(struct
 > fib_trie_iter *iter,
 >  	if(!iter)
 >  		return NULL;
 >
 > -	if (n && IS_TNODE(n)) {
 > -		iter->tnode = (struct tnode *) n;
 > -		iter->trie = t;
 > -		iter->index = 0;
 > -		iter->depth = 1;
 > +	if (n) {
 > +		if (IS_TNODE(n)) {
 > +			iter->tnode = (struct tnode *) n;
 > +			iter->trie = t;
 > +			iter->index = 0;
 > +			iter->depth = 1;
 > +		} else {
 > +			iter->tnode = NULL;
 > +			iter->trie = t;
 > +			iter->index = 0;
 > +			iter->depth = 0;
 > +		}
 >  		return n;
 >  	}
 >  	return NULL;
 > --
 > 1.4.4.1.g278f
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen
Alexey Dobriyan writes:
 > > Confused now. Is my "t->control &= ~(T_TERMINATE);" fix deprecated by
 > > completions?

 It's not needed with the completion patch, as that does the job in a
 bit more mainstream way. The T_TERMINATE fix seems to work well; I've
 tested on a multi-CPU machine. Only once did I notice that just 2 of 4
 threads got started, but it could have been something else... And the
 race is hardly seen with any real use of pktgen. Let's hear what
 others think... and the completions patch was out-of-date.

 Cheers.
 --ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen
David Miller writes:
 > Agreed.
 >
 > Robert, please fix this by using a completion so that we can
 > wait for the threads to start up, something like this:

 Included. It passes my test, but let Alexey and others test it too.

 Cheers.
 --ro

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 733d86d..a630a73 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -147,6 +147,7 @@
 #include
 #include
 #include
+#include <linux/completion.h>
 #include
 #include
 #include
@@ -160,7 +161,7 @@
 #include /* do_div */
 #include
 
-#define VERSION "pktgen v2.68: Packet Generator for packet performance testing.\n"
+#define VERSION "pktgen v2.69: Packet Generator for packet performance testing.\n"
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -206,6 +207,11 @@ static struct proc_dir_entry *pg_proc_di
 #define VLAN_TAG_SIZE(x) ((x)->vlan_id == 0x ? 0 : 4)
 #define SVLAN_TAG_SIZE(x) ((x)->svlan_id == 0x ? 0 : 4)
 
+struct pktgen_thread_info {
+	struct pktgen_thread *t;
+	struct completion *c;
+};
+
 struct flow_state {
 	__u32 cur_daddr;
 	int count;
@@ -3264,10 +3270,11 @@ out:;
  * Main loop of the thread goes here
  */
 
-static void pktgen_thread_worker(struct pktgen_thread *t)
+static void pktgen_thread_worker(struct pktgen_thread_info *info)
 {
 	DEFINE_WAIT(wait);
 	struct pktgen_dev *pkt_dev = NULL;
+	struct pktgen_thread *t = info->t;
 	int cpu = t->cpu;
 	sigset_t tmpsig;
 	u32 max_before_softirq;
@@ -3307,6 +3314,8 @@ static void pktgen_thread_worker(struct
 	__set_current_state(TASK_INTERRUPTIBLE);
 	mb();
 
+	complete(info->c);
+
 	while (1) {
 
 		__set_current_state(TASK_RUNNING);
@@ -3518,6 +3527,8 @@ static struct pktgen_thread *__init pktg
 static int __init pktgen_create_thread(const char *name, int cpu)
 {
 	int err;
+	struct pktgen_thread_info info;
+	struct completion started;
 	struct pktgen_thread *t = NULL;
 	struct proc_dir_entry *pe;
@@ -3558,7 +3569,11 @@ static int __init pktgen_create_thread(c
 
 	t->removed = 0;
 
-	err = kernel_thread((void *)pktgen_thread_worker, (void *)t,
+	init_completion(&started);
+	info.t = t;
+	info.c = &started;
+
+	err = kernel_thread((void *)pktgen_thread_worker, (void *)&info,
 			    CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
 	if (err < 0) {
 		printk("pktgen: kernel_thread() failed for cpu %d\n", t->cpu);
@@ -3568,6 +3583,7 @@ static int __init pktgen_create_thread(c
 		return err;
 	}
 
+	wait_for_completion(&started);
 	return 0;
 }
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen
Hello! Seems you found a race when rmmod is done before it's fully started Try: diff --git a/net/core/pktgen.c b/net/core/pktgen.c index 733d86d..ac0b4b1 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -160,7 +160,7 @@ #include /* do_div */ #include -#define VERSION "pktgen v2.68: Packet Generator for packet performance testing.\n" +#define VERSION "pktgen v2.69: Packet Generator for packet performance testing.\n" /* #define PG_DEBUG(a) a */ #define PG_DEBUG(a) @@ -3673,6 +3673,8 @@ static void __exit pg_cleanup(void) struct list_head *q, *n; wait_queue_head_t queue; init_waitqueue_head(&queue); + + schedule_timeout_interruptible(msecs_to_jiffies(125)); /* Stop all interfaces & threads */ for i in 1 2 3 4 5 ; do modprobe pktgen ; rmmod pktgen ; done In dmesg pktgen v2.69: Packet Generator for packet performance testing. pktgen v2.69: Packet Generator for packet performance testing. pktgen v2.69: Packet Generator for packet performance testing. pktgen v2.69: Packet Generator for packet performance testing. pktgen v2.69: Packet Generator for packet performance testing. Cheers. --ro Alexey Dobriyan writes: > On 11/30/06, David Miller <[EMAIL PROTECTED]> wrote: > > From: Alexey Dobriyan <[EMAIL PROTECTED]> > > Date: Wed, 29 Nov 2006 23:04:37 +0300 > > > > > Looks like worker thread strategically clears it if scheduled at wrong > > > moment. > > > > > > --- a/net/core/pktgen.c > > > +++ b/net/core/pktgen.c > > > @@ -3292,7 +3292,6 @@ static void pktgen_thread_worker(struct > > > > > > init_waitqueue_head(&t->queue); > > > > > > -t->control &= ~(T_TERMINATE); > > > t->control &= ~(T_RUN); > > > t->control &= ~(T_STOP); > > > t->control &= ~(T_REMDEVALL); > > > > Good catch Alexey. Did you rerun the load/unload test with > > this fix applied? If it fixes things, I'll merge it. > > Well, yes, it fixes things, except Ctrl+C getting you out of > modprobe/rmmod loop will spit > backtrace again. And other flags: T_RUN, T_STOP. Clearance is not > needed due to kZalloc and > create bugs as demostrated. > > Give me some time. > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen patch available for perusal.
Ben Greear writes:
 > > Changes:
 > > * use a nano-second timer based on the scheduler timer (TSC) for relative
 > > times, instead of get_time_of_day.

Seems I forgot to set tsc as the clocksource. It makes a difference.
Performance is back to normal and I'm less confused.

e1000 82546GB @ 1.6 GHz Opteron.
Kernel 2.6.19-rc1_Bifrost_-g18e199c6-dirty

echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource

psize      pps
--------------
  60    556333
 124    526942
 252    452981
 508    234996
1020    119748
1496     82248

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

psize      pps
--------------
  60    819914
 124    747286
 252    452975
 508    234993
1020    119749
1496     82247

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
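Before a run, the current and available clocksources can be checked the
same way (assuming the standard clocksource sysfs interface):

	cat /sys/devices/system/clocksource/clocksource0/available_clocksource
	cat /sys/devices/system/clocksource/clocksource0/current_clocksource
	echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource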
Re: pktgen patch available for perusal.
jamal writes:
 > If you are listening then start with:
 > 
 > 1) Do a simple test with just udp traffic as above, doing simple
 > accounting. This helps you to get a feel on how things work.
 > 2) modify the matching rules to match your magic cookie
 > 3) write a simple action invoked by your matching rules and use tc to
 > add it to the policy.
 > 4) integrate in your app now that you know what you are doing.

Yes. Sounds like a simple and general solution. No callbacks, no
#ifdef's, no extra modules. Just a little recipe in pktgen.txt; see the
sketch below.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
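A hedged sketch of such a recipe: match the pktgen magic cookie
(0xbe9be955, at the start of the UDP payload) with a u32 ingress filter
and count or drop it. The offsets assume a plain 20-byte IPv4 header
(u32 offsets here are relative to the IP header), so adjust for your
setup:

	tc qdisc add dev eth0 ingress
	tc filter add dev eth0 parent ffff: protocol ip u32 \
		match u32 0xbe9be955 0xffffffff at 28 \
		action drop
	tc -s filter show dev eth0 parent ffff:

The statistics from the last command give per-rule packet counts, which
is already the "simple accounting" of step 1.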
Re: pktgen patch available for perusal.
Ben Greear writes:
 > It requires a hook in dev.c, or at least that is the only way I can
 > think to implement it.

Well, the hook could be placed along the packet path, even in drivers.
In tulip I didn't even take the packet off the ring in some experiments.
And there are plenty of existing hooks already, e.g. NF_IP_PRE_ROUTING?

 > I don't think that hook is going to be allowed into the kernel, so the
 > best I was hoping for was to have the code in pktgen with #if 0 and let
 > users patch their kernel to add the hook...

Right, so what I was trying to say was: rather than having #if 0 and
dead code in pktgen, it could be kept separate.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen patch available for perusal.
Ben Greear writes:
 > I'd be thrilled to have the receive logic go into pktgen, even if it was #if 0 with a comment
 > showing how to patch dev.c to get it working. It would make my out-of-tree patch smaller
 > and should help others who are doing research and driver development...

Just an idea... The RX part is pretty separate from TX. How about if we
keep this code in a small separate optional module? It could be
developed into some general RX probe, along the lines of the sketch
below.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
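A minimal sketch of what such a separate RX-probe module could look
like, using the existing packet-tap hook (dev_add_pack) instead of a
new hook in dev.c. Module and symbol names are made up, and the handler
signature assumes the 2.6-era packet_type API with the orig_dev
argument:

	#include <linux/module.h>
	#include <linux/netdevice.h>
	#include <linux/skbuff.h>
	#include <linux/if_ether.h>

	static unsigned long rxprobe_count;

	static int rxprobe_rcv(struct sk_buff *skb, struct net_device *dev,
			       struct packet_type *pt, struct net_device *orig_dev)
	{
		rxprobe_count++;	/* account only; a real probe could timestamp etc. */
		/* an ETH_P_ALL tap gets its own reference to the skb; dropping
		 * it here does not stop normal stack processing */
		kfree_skb(skb);
		return 0;
	}

	static struct packet_type rxprobe_pt = {
		.type = __constant_htons(ETH_P_ALL),	/* tap every protocol */
		.func = rxprobe_rcv,
	};

	static int __init rxprobe_init(void)
	{
		dev_add_pack(&rxprobe_pt);
		return 0;
	}

	static void __exit rxprobe_exit(void)
	{
		dev_remove_pack(&rxprobe_pt);
		printk(KERN_INFO "rxprobe: saw %lu packets\n", rxprobe_count);
	}

	module_init(rxprobe_init);
	module_exit(rxprobe_exit);
	MODULE_LICENSE("GPL");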
pktgen patch available for perusal.
Ben Greear writes:
 > I've completed the first pass of my changes to pktgen in 2.6.18.
 > Many of these features are probably DOA based on previous conversations,
 > but perhaps this will help someone

Thanks. Well, sometimes there is a need to capture and drop pkts at
various points, so it's handy to have code fragments ready and just add
the hook where it's needed, in the driver or rx softirq etc.

 > Changes:
 > * use a nano-second timer based on the scheduler timer (TSC) for relative
 > times, instead of get_time_of_day.

Any chance you could extract this one so we can compare with
get_time_of_day? This pops up in the profiles, and the TX numbers are
not what they used to be.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
RFC: Removing busy-spin in pktgen.
Ben Greear writes:
 > I'm planning to re-merge my long-lost pktgen branch with the kernel
 > tree's pktgen.
 > 
 > I believe the main difference is that my out-of-tree pktgen does not do the
 > busy-spin, but waits on a queue for the net-device to wake it's tx-queue
 > when over-driving a NIC.
 > 
 > To implement this, I added a hook in the netdev-wake-queue logic and let
 > pktgen register itself as interested
 > 
 > Is there any interest in adding this sort of feature to the official
 > pktgen?

Hello! So the question is: should pktgen hook into netif_wake_queue, or
should pktgen just busy-spin? My first instinct is to avoid the hook in
general code and let pktgen spin, so that only testing is affected.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] DSCP support for pktgen
Francesco Fondelli writes: > Anyway, I've been asked to add support for managing DSCP codepoints, so > one can test DiffServ capable routers. It's very simple code and is working > for > me. Looks fine too. We ask Dave to apply again. Cheers. --ro Signed-off-by: Francesco Fondelli <[EMAIL PROTECTED]> Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> --- new/net/core/pktgen.c.vlan 2006-09-26 14:53:34.0 +0200 +++ new/net/core/pktgen.c 2006-09-26 14:54:49.0 +0200 @@ -160,7 +160,7 @@ #include /* do_div */ #include -#define VERSION "pktgen v2.67: Packet Generator for packet performance testing.\n" +#define VERSION "pktgen v2.68: Packet Generator for packet performance testing.\n" /* #define PG_DEBUG(a) a */ #define PG_DEBUG(a) @@ -292,6 +292,10 @@ __u16 udp_dst_min; /* inclusive, dest UDP port */ __u16 udp_dst_max; /* exclusive, dest UDP port */ + /* DSCP + ECN */ + __u8 tos;/* six most significant bits of (former) IPv4 TOS are for dscp codepoint */ + __u8 traffic_class; /* ditto for the (former) Traffic Class in IPv6 (see RFC 3260, sec. 4) */ + /* MPLS */ unsigned nr_labels; /* Depth of stack, 0 = no MPLS */ __be32 labels[MAX_MPLS_LABELS]; @@ -671,6 +675,14 @@ pkt_dev->svlan_id, pkt_dev->svlan_p, pkt_dev->svlan_cfi); } + if (pkt_dev->tos) { + seq_printf(seq, " tos: 0x%02x\n", pkt_dev->tos); + } + + if (pkt_dev->traffic_class) { + seq_printf(seq, " traffic_class: 0x%02x\n", pkt_dev->traffic_class); + } + seq_printf(seq, " Flags: "); if (pkt_dev->flags & F_IPV6) @@ -748,12 +760,12 @@ } -static int hex32_arg(const char __user *user_buffer, __u32 *num) +static int hex32_arg(const char __user *user_buffer, unsigned long maxlen, __u32 *num) { int i = 0; *num = 0; - for(; i < 8; i++) { + for(; i < maxlen; i++) { char c; *num <<= 4; if (get_user(c, &user_buffer[i])) @@ -848,7 +860,7 @@ pkt_dev->nr_labels = 0; do { __u32 tmp; - len = hex32_arg(&buffer[i], &tmp); + len = hex32_arg(&buffer[i], 8, &tmp); if (len <= 0) return len; pkt_dev->labels[n] = htonl(tmp); @@ -1185,11 +1197,15 @@ else if (strcmp(f, "!SVID_RND") == 0) pkt_dev->flags &= ~F_SVID_RND; + else if (strcmp(f, "!IPV6") == 0) + pkt_dev->flags &= ~F_IPV6; + else { sprintf(pg_result, "Flag -:%s:- unknown\nAvailable flags, (prepend ! to un-set flag):\n%s", f, - "IPSRC_RND, IPDST_RND, TXSIZE_RND, UDPSRC_RND, UDPDST_RND, MACSRC_RND, MACDST_RND\n"); + "IPSRC_RND, IPDST_RND, UDPSRC_RND, UDPDST_RND, " + "MACSRC_RND, MACDST_RND, TXSIZE_RND, IPV6, MPLS_RND, VID_RND, SVID_RND\n"); return count; } sprintf(pg_result, "OK: flags=0x%x", pkt_dev->flags); @@ -1615,6 +1631,38 @@ return count; } + if (!strcmp(name, "tos")) { + __u32 tmp_value = 0; + len = hex32_arg(&user_buffer[i], 2, &tmp_value); + if (len < 0) { + return len; + } + i += len; + if (len == 2) { + pkt_dev->tos = tmp_value; + sprintf(pg_result, "OK: tos=0x%02x", pkt_dev->tos); + } else { + sprintf(pg_result, "ERROR: tos must be 00-ff"); + } + return count; + } + + if (!strcmp(name, "traffic_class")) { + __u32 tmp_value = 0; + len = hex32_arg(&user_buffer[i], 2, &tmp_value); + if (len < 0) { + return len; + } + i += len; + if (len == 2) { + pkt_dev->traffic_class = tmp_value; + sprintf(pg_result, "OK: traffic_class=0x%02x", pkt_dev->traffic_class); + } else { + sprintf(pg_result, "ERROR: traffic_class must be 00-ff"); + } + return count; + } + sprintf(pkt_dev->result, &qu
[PATCH] vlan support for pktgen
Francesco Fondelli writes: > The attached patch allows pktgen to produce 802.1Q and Q-in-Q tagged frames. > I have used it for stress test a bridge and seems ok to me. > Unfortunately I have no access to net-2.6.x git tree so the diff is against > 2.6.17.13. Thanks a lot. Looks fine but haven't been able to test Q-in-Q. Below is a git diff. We ask Dave to apply. Cheers. --ro Signed-off-by: Francesco Fondelli <[EMAIL PROTECTED]> Acked-by: Steven Whitehouse <[EMAIL PROTECTED]> Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> diff --git a/net/core/pktgen.c b/net/core/pktgen.c index c23e9c0..e418224 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -109,6 +109,8 @@ * * MPLS support by Steven Whitehouse <[EMAIL PROTECTED]> * + * 802.1Q/Q-in-Q support by Francesco Fondelli (FF) <[EMAIL PROTECTED]> + * */ #include #include @@ -137,6 +139,7 @@ #include #include #include +#include #include #include #include @@ -178,6 +181,8 @@ #define F_TXSIZE_RND (1<<6) /* Transmit size is random */ #define F_IPV6(1<<7) /* Interface in IPV6 Mode */ #define F_MPLS_RND(1<<8) /* Random MPLS labels */ +#define F_VID_RND (1<<9) /* Random VLAN ID */ +#define F_SVID_RND(1<<10) /* Random SVLAN ID */ /* Thread control flag bits */ #define T_TERMINATE (1<<0) @@ -198,6 +203,9 @@ static struct proc_dir_entry *pg_proc_di #define MAX_CFLOWS 65536 +#define VLAN_TAG_SIZE(x) ((x)->vlan_id == 0x ? 0 : 4) +#define SVLAN_TAG_SIZE(x) ((x)->svlan_id == 0x ? 0 : 4) + struct flow_state { __u32 cur_daddr; int count; @@ -287,6 +295,15 @@ struct pktgen_dev { /* MPLS */ unsigned nr_labels; /* Depth of stack, 0 = no MPLS */ __be32 labels[MAX_MPLS_LABELS]; + + /* VLAN/SVLAN (802.1Q/Q-in-Q) */ + __u8 vlan_p; + __u8 vlan_cfi; + __u16 vlan_id; /* 0x means no vlan tag */ + + __u8 svlan_p; + __u8 svlan_cfi; + __u16 svlan_id; /* 0x means no svlan tag */ __u32 src_mac_count;/* How many MACs to iterate through */ __u32 dst_mac_count;/* How many MACs to iterate through */ @@ -644,6 +661,16 @@ static int pktgen_if_show(struct seq_fil i == pkt_dev->nr_labels-1 ? "\n" : ", "); } + if (pkt_dev->vlan_id != 0x) { + seq_printf(seq, " vlan_id: %u vlan_p: %u vlan_cfi: %u\n", + pkt_dev->vlan_id, pkt_dev->vlan_p, pkt_dev->vlan_cfi); + } + + if (pkt_dev->svlan_id != 0x) { + seq_printf(seq, " svlan_id: %u vlan_p: %u vlan_cfi: %u\n", + pkt_dev->svlan_id, pkt_dev->svlan_p, pkt_dev->svlan_cfi); + } + seq_printf(seq, " Flags: "); if (pkt_dev->flags & F_IPV6) @@ -673,6 +700,12 @@ static int pktgen_if_show(struct seq_fil if (pkt_dev->flags & F_MACDST_RND) seq_printf(seq, "MACDST_RND "); + if (pkt_dev->flags & F_VID_RND) + seq_printf(seq, "VID_RND "); + + if (pkt_dev->flags & F_SVID_RND) + seq_printf(seq, "SVID_RND "); + seq_puts(seq, "\n"); sa = pkt_dev->started_at; @@ -1140,6 +1173,18 @@ static ssize_t pktgen_if_write(struct fi else if (strcmp(f, "!MPLS_RND") == 0) pkt_dev->flags &= ~F_MPLS_RND; + else if (strcmp(f, "VID_RND") == 0) + pkt_dev->flags |= F_VID_RND; + + else if (strcmp(f, "!VID_RND") == 0) + pkt_dev->flags &= ~F_VID_RND; + + else if (strcmp(f, "SVID_RND") == 0) + pkt_dev->flags |= F_SVID_RND; + + else if (strcmp(f, "!SVID_RND") == 0) + pkt_dev->flags &= ~F_SVID_RND; + else { sprintf(pg_result, "Flag -:%s:- unknown\nAvailable flags, (prepend ! to un-set flag):\n%s", @@ -1445,6 +1490,128 @@ static ssize_t pktgen_if_write(struct fi offset += sprintf(pg_result + offset, "%08x%s", ntohl(pkt_dev->labels[n]), n == pkt_dev->nr_labels-1 ? 
"" : ","); + + if (pkt_dev->nr_labels && pkt_dev->vlan_id != 0x) { + pkt_dev->vlan_id = 0x; /* turn off VLAN/SVLAN */ + pkt_dev->svlan_id = 0x; + +
[PATCH] Documentation update pktgen
And the docs. Cheers. --ro Signed-off-by: Francesco Fondelli <[EMAIL PROTECTED]> Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index 278771c..6062e35 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -100,6 +100,7 @@ Examples: are: IPSRC_RND #IP Source is random (between min/max), IPDST_RND, UDPSRC_RND, UDPDST_RND, MACSRC_RND, MACDST_RND + MPLS_RND, VID_RND, SVID_RND pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then cycle through the port range. @@ -125,6 +126,21 @@ Examples: pgset "mpls 0" turn off mpls (or any invalid argument works too!) + pgset "vlan_id 77" set VLAN ID 0-4095 + pgset "vlan_p 3" set priority bit 0-7 (default 0) + pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "svlan_id 22" set SVLAN ID 0-4095 + pgset "svlan_p 3"set priority bit 0-7 (default 0) + pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "vlan_id " > 4095 remove vlan and svlan tags + pgset "svlan " > 4095 remove svlan tag + + + pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00) + pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00) + pgset stop aborts injection. Also, ^C aborts generator. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vlan support for pktgen
Francesco Fondelli writes:
 > The attached patch allows pktgen to produce 802.1Q and Q-in-Q tagged frames.
 > I have used it for stress test a bridge and seems ok to me.
 > Unfortunately I have no access to net-2.6.x git tree so the diff is against
 > 2.6.17.13.
 > If you have a moment look over it, I think this feature could be useful for
 > someone else.

Thanks a lot. The code seems straightforward, but I'm not too
experienced with VLAN hacking and need to find a test environment when I
get a chance; maybe someone else has better opportunities to verify.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Paper on lookup in Linux.
Andi Kleen writes:
 > The reason I'm asking is that we still have trouble with the TCP hash tables
 > taking far too much memory, and your new data structure might provide a nice
 > alternative.

Yes, it's dynamic and self-tuning, so there is no need to reserve memory
in advance, and it still has performance comparable to a perfect hash.
Insert/delete and GC are fast, and locking is also doable: we used RCU
like fib_trie, but at _bh level in this "application".

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Paper on lookup in Linux.
Andi Kleen writes:
 > On Monday 04 September 2006 13:43, Robert Olsson wrote:
 > > 
 > > Hello.
 > > 
 > > People on this list might find this paper interesting:
 > > http://www.csc.kth.se/~snilsson/public/papers/trash/
 > 
 > Looks nice. Have you looked at using it for local TCP/UDP socket
 > lookups too or would that be part of the "unified flow cache"?

No, we haven't put struct socket in the leaves (the result node) yet; we
just kept dst entries and some stateful flow variables that we used for
active GC and flow logging so far. So it is a 128-bit flow version of
the dst hash. It would have taken a lot more work, and we wanted to get
this paper out. Also, ip_route_input is not the place to have the lookup
if "unified" is to include ip6. So we focused on designing and building
the lookup core.

Cheers.
--ro

struct leaf {
	__u32 key[LPK];
	unsigned long parent;
	void *obj;		/* Generic leaf object */
	__u32 state;
	unsigned int len;
	int iif;
	struct timeval start;
	struct rcu_head rcu;
};
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
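To make the "hash header" idea from the paper concrete: a hash of the
complete key is prepended so the extended keys behave like uniformly
distributed data. A hedged sketch, where jhash2 and the zero initval
are illustrative choices rather than what the implementation
necessarily uses:

	#include <linux/jhash.h>

	#define LPK 4	/* 4 x 32-bit words = one 128-bit flow key, as in struct leaf */

	/* the extended key is (header, key); level compression near the
	 * root then sees near-uniform bits regardless of the input data */
	static inline u32 trash_header(const u32 key[LPK])
	{
		return jhash2(key, LPK, 0);
	}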
Paper on lookup in Linux.
Hello.

People on this list might find this paper interesting:
http://www.csc.kth.se/~snilsson/public/papers/trash/

The abstract is below. Feel free to redistribute.

Cheers.
--ro

TRASH - A dynamic LC-trie and hash data structure
Robert Olsson and Stefan Nilsson

A dynamic LC-trie is currently used in the Linux kernel to implement
address lookup in the IP routing table. The main virtue of this data
structure is that it supports both fast address lookups and frequent
updates of the table. Also, it has an efficient memory management scheme
and supports multi-processor architectures using the RCU locking
mechanism. The structure scales nicely: the expected number of memory
accesses for one lookup is O(log log n), where n is the number of
entries in the lookup table. In particular, the time does not depend on
the length of the keys; 32-bit IPv4 addresses and 128-bit addresses make
no difference in this respect.

In this article we introduce TRASH, a combination of a dynamic LC-trie
and a hash function. TRASH is a general purpose data structure
supporting fast lookup, insert and delete operations for arbitrarily
long bit strings. TRASH enhances the level-compression part of the
LC-trie by prepending a header to each key. The header is a hash value
based on the complete key. The extended keys will behave like uniformly
distributed data, and hence the average and maximum depth is typically
very small, in practice less than 1.5 and 5, respectively.

We have implemented the scheme in the Linux kernel as a replacement for
the dst cache (IPv4) and performed a full scale test on a production
router using 128-bit flow-based lookups. The Linux implementation of
TRASH inherits the efficient RCU locking mechanism from the dynamic
LC-trie implementation. In particular, the lookup time increases only
marginally for longer keys, and TRASH is highly insensitive to different
types of data. The performance figures are very promising, and the cache
mechanism could easily be extended to serve as a unified lookup for fast
socket lookup, flow logging, connection tracking and stateful networking
in general.

Keywords: trie, LC-trie, hash, hashtrie, Linux, flow lookup, garbage
collection

Trita-CSC-TCS 2006:2, ISRN/KTH/CSC/TCS-2006/2-SE, ISSN 1653-7092.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[e1000]: flow control on by default - good idea really?
jamal writes:
 > The default setup on the e1000 has rx flow control turned on.
 > I was sending at wire rate gige from the device - which is about
 > 1.48Mpps. The e1000 was in turn sending me flow control packets
 > as per default/expected behavior. Unfortunately, it was sending
 > a very large amount of packets. At one point i was seeing upto
 > 1Mpps and on average, the flow control packets were consuming
 > 60-70% of the bandwidth. Even when i fixed this behavior to act
 > properly, allowing flow control on consumed up to 15% of the bandwidth.
 > Clearly, this is a bad thing. Yes, the device in the first instance was
 > at fault. But i have argued in the past that NAPI does just fine without
 > flow control being turned on, so even chewing 5% of bandwidth on flow
 > control is a bad thing..

Interesting numbers. One can argue that if there were no control packets
there would be data packets sent and dropped instead. But I'll agree
with you, as flow control seems hard to monitor and tune, and it takes
debugging down to the link and chip levels. It is much easier and more
robust just to read the dropped-packet counter on the box that couldn't
handle the load, rather than chasing opaque flow control packets
somewhere else and trying to figure out what's going on.

 > As a compromise, can we declare flow control as an advanced feature
 > and turn it off by default? People who feel it is valuable and know
 > what they are doing can turn it off.

At least for NAPI-scheme drivers.

 > If you want more details just shoot.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
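For anyone wanting to experiment, pause-frame flow control can be
inspected and disabled per interface with ethtool (flag support varies
by driver):

	ethtool -a eth0                            # show current pause settings
	ethtool -A eth0 autoneg off rx off tx off  # turn flow control off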
Re: [Bugme-new] [Bug 6682] New: BUG: soft lockup detected on CPU#0! / ksoftirqd takse 100% CPU
Hello!

Yes, it seems the system is very loaded for some reason

 > > sometimes a day) we get 100% usage on ksoftirqd/0 and following messages in logs:

as all softirq's are run via ksoftirqd. That's still OK, but why doesn't
the watchdog get any CPU share at all? A mismatch in priorities?

Herbert Xu writes:
 > On Mon, Jun 19, 2006 at 10:20:10PM +0000, Andrew Morton wrote:
 > > 
 > > > [] dev_queue_xmit+0xe0/0x203
 > > > [] ip_output+0x1e1/0x237
 > > > [] ip_forward+0x181/0x1df
 > > > [] ip_rcv+0x40c/0x485
 > > > [] netif_receive_skb+0x12f/0x165
 > > > [] e1000_clean_rx_irq+0x389/0x410 [e1000]
 > > > [] e1000_clean+0x94/0x12f [e1000]
 > > > [] net_rx_action+0x69/0xf0
 > > > [] __do_softirq+0x55/0xbd
 > > > [] do_softirq+0x2d/0x31
 > > > [] local_bh_enable+0x5a/0x65
 > > > [] rt_run_flush+0x5f/0x80

Normal for a router...

 > Could you tell us the frequency of route updates on this machine?
 > Route updates are pretty expensive especially when a large number
 > of flows hits your machine right afterwards.

Yes, a flush is costly and unfortunately hard to avoid. We discussed
this a bit before...

 > You can monitor this by running ip mon. You might just be getting
 > bogus route updates causing unnecessary flushes to the routing cache.

Just sampled 10 minutes in one of our routers with 2 * (full BGP). Well,
remember Zebra/Quagga has just one set of routes in the kernel. Anyway,
during the 10 minutes I looked I got 4 (insertions/deletions) per second
on average.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
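To put numbers on Herbert's suggestion, the update stream can simply be
watched and counted (one line per route insert/delete):

	# watch routing table updates in real time
	ip monitor route
	# rough rate estimate: count events for 10 minutes
	ip monitor route > /tmp/rtmon.log & sleep 600; kill %1
	wc -l /tmp/rtmon.log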
Re: netif_tx_disable and lockless TX
Stephen Hemminger writes: > I also noticed that you really don't save much by doing TX cleaning at > hardirq, because in hardirq you need to do dev_kfree_irq and that causes > a softirq (for the routing case where users=1). So when routing it > doesn't make much difference, both methods cause the softirq delayed > processing to be invoked. For locally generated packets which are > cloned, the hardirq will drop the ref count, and that is faster than > doing the whole softirq round trip. Right. Also the other way around, repeated ->poll can avoid TX hardirq's. Cheers. --ro - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netif_tx_disable and lockless TX
jamal writes:
 > Latency-wise: TX completion interrupt provides the best latency.
 > Processing in the poll() -aka softirq- was almost close to the hardirq
 > variant. So if you can make things run in a softirq such as transmit
 > one, then the numbers will likely stay the same.

I don't remember whether we tried a tasklet for TX a la Herbert's
suggestion, but we did use tasklets for controlling RX processing to
avoid hardirq livelock in pre-NAPI versions. I had variants of the tulip
driver with both TX cleaning at ->poll and TX cleaning at hardirq and
didn't see any performance difference. The ->poll variant was much
cleaner, but we kept Alexey's original work for tulip.

 > Sorry, I havent been following discussions on netchannels[1] so i am not
 > qualified to comment on the "replacement" part Dave mentioned earlier.
 > What I can say is the tx processing doesnt have to be part of the NAPI
 > poll() and still use hardirq.

Yes, true, but I see good TX numbers with newer boards (wire rate with
small packets) with cleaning in ->poll. Also, Linux is now very safe in
network "overload" situations; moving work to hardirq may change that.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
fib_trie leaf freeing [PATCH]
Hello!

It seems that leaves (end nodes) have been freed by __tnode_free_rcu and
not by __leaf_free_rcu. This fixes the problem: only tnode_free is now
used, and it checks for the appropriate node type; free_leaf can be
removed.

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

Cheers.
--ro

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.406"
+#define VERSION "0.407"
 
 #include
 #include
@@ -314,11 +314,6 @@ static void __leaf_free_rcu(struct rcu_h
 	kfree(container_of(head, struct leaf, rcu));
 }
 
-static inline void free_leaf(struct leaf *leaf)
-{
-	call_rcu(&leaf->rcu, __leaf_free_rcu);
-}
-
 static void __leaf_info_free_rcu(struct rcu_head *head)
 {
 	kfree(container_of(head, struct leaf_info, rcu));
@@ -357,7 +352,12 @@ static void __tnode_free_rcu(struct rcu_
 
 static inline void tnode_free(struct tnode *tn)
 {
-	call_rcu(&tn->rcu, __tnode_free_rcu);
+	if(IS_LEAF(tn)) {
+		struct leaf *l = (struct leaf *) tn;
+		call_rcu_bh(&l->rcu, __leaf_free_rcu);
+	}
+	else
+		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
 static struct leaf *leaf_new(void)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MPLS extension for pktgen
Steven Whitehouse writes: > Here is the updated patch. If you are happy with this, then will you send > it on to Dave, or should I do that? Hello! Yes the minor problem with the "return code" is fixed as: Result: OK: mpls=0001000a,0002000a,000a So I'll guess Dave will apply it. Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> Thanks. --ro > > Signed-off-by: Steven Whitehouse <[EMAIL PROTECTED]> > > diff --git a/Documentation/networking/pktgen.txt > b/Documentation/networking/pktgen.txt > --- a/Documentation/networking/pktgen.txt > +++ b/Documentation/networking/pktgen.txt > @@ -109,6 +109,22 @@ Examples: > cycle through the port range. > pgset "udp_dst_max 9" set UDP destination port max. > > + pgset "mpls 0001000a,0002000a,000a" set MPLS labels (in this example > + outer label=16,middle label=32, > + inner label=0 (IPv4 NULL)) Note that > + there must be no spaces between the > + arguments. Leading zeros are required. > + Do not set the bottom of stack bit, > + thats done automatically. If you do > + set the bottom of stack bit, that > + indicates that you want to randomly > + generate that address and the flag > + MPLS_RND will be turned on. You > + can have any mix of random and fixed > + labels in the label stack. > + > + pgset "mpls 0" turn off mpls (or any invalid argument works > too!) > + > pgset stop aborts injection. Also, ^C aborts generator. > > > @@ -167,6 +183,8 @@ pkt_size > min_pkt_size > max_pkt_size > > +mpls > + > udp_src_min > udp_src_max > > @@ -211,4 +229,4 @@ Grant Grundler for testing on IA-64 and > Stephen Hemminger, Andi Kleen, Dave Miller and many others. > > > -Good luck with the linux net-development. > \ No newline at end of file > +Good luck with the linux net-development. > diff --git a/net/core/pktgen.c b/net/core/pktgen.c > --- a/net/core/pktgen.c > +++ b/net/core/pktgen.c > @@ -106,6 +106,9 @@ > * > * interruptible_sleep_on_timeout() replaced Nishanth Aravamudan <[EMAIL > PROTECTED]> > * 050103 > + * > + * MPLS support by Steven Whitehouse <[EMAIL PROTECTED]> > + * > */ > #include > #include > @@ -154,7 +157,7 @@ > #include /* do_div */ > #include > > -#define VERSION "pktgen v2.66: Packet Generator for packet performance > testing.\n" > +#define VERSION "pktgen v2.67: Packet Generator for packet performance > testing.\n" > > /* #define PG_DEBUG(a) a */ > #define PG_DEBUG(a) > @@ -162,6 +165,8 @@ > /* The buckets are exponential in 'width' */ > #define LAT_BUCKETS_MAX 32 > #define IP_NAME_SZ 32 > +#define MAX_MPLS_LABELS 16 /* This is the max label stack depth */ > +#define MPLS_STACK_BOTTOM __constant_htonl(0x0100) > > /* Device flag bits */ > #define F_IPSRC_RND (1<<0)/* IP-Src Random */ > @@ -172,6 +177,7 @@ > #define F_MACDST_RND (1<<5)/* MAC-Dst Random */ > #define F_TXSIZE_RND (1<<6)/* Transmit size is random */ > #define F_IPV6(1<<7)/* Interface in IPV6 Mode */ > +#define F_MPLS_RND(1<<8)/* Random MPLS labels */ > > /* Thread control flag bits */ > #define T_TERMINATE (1<<0) > @@ -278,6 +284,10 @@ struct pktgen_dev { > __u16 udp_dst_min; /* inclusive, dest UDP port */ > __u16 udp_dst_max; /* exclusive, dest UDP port */ > > +/* MPLS */ > +unsigned nr_labels; /* Depth of stack, 0 = no MPLS */ > +__be32 labels[MAX_MPLS_LABELS]; > + > __u32 src_mac_count;/* How many MACs to iterate through */ > __u32 dst_mac_count;/* How many MACs to iterate through */ > > @@ -623,9 +633,19 @@ static int pktgen_if_show(struct seq_fil > pkt_dev->udp_dst_min, pkt_dev->udp_dst_max); > > seq_printf(seq, > - " src_mac_count: %d dst_mac_count: %d \n Flags: ", > + " 
src_mac_count: %d dst_mac_count:
MPLS extension for pktgen
Steven Whitehouse writes:
 > I've been looking into MPLS recently and so one of the first things that
 > would be useful is a testbed to generate test traffic, and hence the
 > attached patch to pktgen.
 > 
 > If you have a moment to look over it, then please let me know if you
 > would give it your blessing. The patch is against davem's current
 > net-2.6.17 tree,

Nice. Well, I never thought about MPLS, but it seems possible too. With
MPLS enabled it seems to send something my tcpdump does not understand,
so I trust you there. And it does not seem to break standard IPv4
sending. So it should be OK.

But I'll guess the mpls result code is not what you expected:

echo "mpls 0001000a,0002000a,000a" > /proc/net/pktgen/eth1
cat /proc/net/pktgen/eth1 | grep Res
Result: 000a

	sprintf(pg_result, "OK: mpls=");
	for(n = 0; n < pkt_dev->nr_labels; n++)
		sprintf(pg_result, "%08x%s",
			ntohl(pkt_dev->labels[n]),
			n == pkt_dev->nr_labels-1 ? "" : ",");

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
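The bug is visible in the snippet above: every sprintf() writes at the
start of pg_result, so only the last label survives. A sketch of the
obvious fix, accumulating at an offset (this mirrors what the updated
patch later does):

	int offset = sprintf(pg_result, "OK: mpls=");

	for (n = 0; n < pkt_dev->nr_labels; n++)
		offset += sprintf(pg_result + offset, "%08x%s",
				  ntohl(pkt_dev->labels[n]),
				  n == pkt_dev->nr_labels - 1 ? "" : ",");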
Re: PKTGEN: Adds thread limit parameter.
Luiz Fernando Capitulino writes:
 > Well, I wouldn't say it's _really_ needed. But it really avoids having
 > too many thread entries in the pktgen's /proc directory, and as a good
 > result, you will not have pending threads which will never run as well.
 > 
 > Also note that the patch is trivial, if you look at it in detail,
 > you'll see that the biggest change we have is the 'if' part. The rest I
 > would call cosmetic because the behaivor is the same.
 > 
 > | If so -- Wouldn't a concept of a bitmask to control also which CPU's
 > | that runs the threads be more general?
 > 
 > Sounds like a bit complex, and would be my turn to ask if it's
 > really needed. :)

A bit set for each CPU where there will be a pktgen thread running; see
the sketch below.

 > * Some minor fixes and cleanups, like functions returns being not
 > checked.
 > 
 > * A new command called 'rem_device' to remove one device at a time
 > (currently, we can only remove all devices in one shoot with
 > 'rem_devices_all')
 > 
 > * Ports pktgen to use the kernel thread API

The current model was chosen for simplicity and to bind a device to a
specific CPU so it never changes. A change will need to be carefully
tested.

 > * cleanup the debug function usage

I would like to remove the do_softirq stuff from pktgen...

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
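A hedged sketch of the bitmask idea, with a hypothetical pg_cpu_mask
module parameter (bit N set means "run a worker on CPU N"); none of
these names exist in pktgen today:

	static unsigned long pg_cpu_mask = ~0UL;  /* default: a thread per online CPU */
	module_param(pg_cpu_mask, ulong, 0);

	/* in pg_init(), replacing the plain for_each_online_cpu() loop: */
	for_each_online_cpu(cpu) {
		char buf[30];

		if (!(pg_cpu_mask & (1UL << cpu)))
			continue;	/* CPU not selected: no kpktgend here */

		sprintf(buf, "kpktgend_%i", cpu);
		if (pktgen_create_thread(buf, cpu))
			printk("pktgen: WARNING: cannot create thread for cpu %d\n",
			       cpu);
	}

Loading with e.g. "modprobe pktgen pg_cpu_mask=0x3" would then give
workers on CPUs 0 and 1 only, which also subsumes the thread-limit
parameter.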
PKTGEN: Adds thread limit parameter.
Hello! Really needed? If so -- Wouldn't a concept of a bitmask to control also which CPU's that runs the threads be more general? Cheers. --ro Luiz Fernando Capitulino writes: > > Currently, pktgen will create one thread for each online CPU in the > system. That behaivor may be annoying if you're using pktgen in a > large SMP system. > > This patch adds a new module parameter called 'pg_max_threads', which > can be set to the maximum number of threads pktgen should create. > > For example, if you're playing with pktgen in SMP system with 8 > CPUs, but want only two pktgen's threads, you can do: > >modprobe pktgen pg_max_threads=2 > > Signed-off-by: Luiz Capitulino <[EMAIL PROTECTED]> > > --- > > net/core/pktgen.c | 23 +-- > 1 files changed, 17 insertions(+), 6 deletions(-) > > e210ad47d0db52496fdaabdd50bfe6ee0bcc53cd > diff --git a/net/core/pktgen.c b/net/core/pktgen.c > index 37b25a6..994aef1 100644 > --- a/net/core/pktgen.c > +++ b/net/core/pktgen.c > @@ -154,7 +154,7 @@ > #include /* do_div */ > #include > > -#define VERSION "pktgen v2.66: Packet Generator for packet performance > testing.\n" > +#define VERSION "pktgen v2.67: Packet Generator for packet performance > testing.\n" > > /* #define PG_DEBUG(a) a */ > #define PG_DEBUG(a) > @@ -488,6 +488,7 @@ static unsigned int fmt_ip6(char *s, con > static int pg_count_d = 1000; /* 1000 pkts by default */ > static int pg_delay_d; > static int pg_clone_skb_d; > +static int pg_max_threads; > static int debug; > > static DEFINE_MUTEX(pktgen_thread_lock); > @@ -3184,7 +3185,7 @@ static int pktgen_remove_device(struct p > > static int __init pg_init(void) > { > -int cpu; > +int i, threads; > struct proc_dir_entry *pe; > > printk(version); > @@ -3208,15 +3209,24 @@ static int __init pg_init(void) > /* Register us to receive netdevice events */ > register_netdevice_notifier(&pktgen_notifier_block); > > -for_each_online_cpu(cpu) { > +threads = num_online_cpus(); > + > +/* > + * Check if we should have less threads than the number > + * of online CPUs > + */ > +if ((pg_max_threads > 0) && (pg_max_threads < threads)) > +threads = pg_max_threads; > + > +for (i = 0; i < threads; i++) { > int err; > char buf[30]; > > -sprintf(buf, "kpktgend_%i", cpu); > -err = pktgen_create_thread(buf, cpu); > +sprintf(buf, "kpktgend_%i", i); > +err = pktgen_create_thread(buf, i); > if (err) > printk("pktgen: WARNING: Cannot create thread for cpu > %d (%d)\n", > -cpu, err); > +i, err); > } > > if (list_empty(&pktgen_threads)) { > @@ -3263,4 +3273,5 @@ MODULE_LICENSE("GPL"); > module_param(pg_count_d, int, 0); > module_param(pg_delay_d, int, 0); > module_param(pg_clone_skb_d, int, 0); > +module_param(pg_max_threads, int, 0); > module_param(debug, int, 0); > -- > 1.2.4.gbe76 > > > > -- > Luiz Fernando N. Capitulino - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pktgen: fix races between control/worker threads
Jesse Brandeburg writes:
 > 
 > I looked quickly at this on a couple different machines and wasn't
 > able to reproduce, so don't let me block the patch. I think its a
 > good patch FWIW

OK! We'll ask Dave to apply it.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
fib_trie stats fix [PATCH]
The fib_trie stats code (/proc/net/fib_triestat) has been buggy and
caused oopses on some platforms, such as OpenWrt. The patch below should
cure those problems.

Cheers.
--ro

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

--- linux-2.6.16-rc4/net/ipv4/fib_trie.c.061021	2006-02-21 22:22:57.0 +0100
+++ linux-2.6.16-rc4/net/ipv4/fib_trie.c	2006-02-21 22:31:10.0 +0100
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.405"
+#define VERSION "0.406"
 
 #include
 #include
@@ -84,7 +84,7 @@
 #include "fib_lookup.h"
 
 #undef CONFIG_IP_FIB_TRIE_STATS
-#define MAX_CHILDS 16384
+#define MAX_STAT_DEPTH 32
 
 #define KEYLENGTH (8*sizeof(t_key))
 #define MASK_PFX(k, l) (((l)==0)?0:(k >> (KEYLENGTH-l)) << (KEYLENGTH-l))
@@ -154,7 +154,7 @@
 	unsigned int tnodes;
 	unsigned int leaves;
 	unsigned int nullpointers;
-	unsigned int nodesizes[MAX_CHILDS];
+	unsigned int nodesizes[MAX_STAT_DEPTH];
 };
 
 struct trie {
@@ -2080,7 +2080,9 @@
 		int i;
 
 		s->tnodes++;
-		s->nodesizes[tn->bits]++;
+		if(tn->bits < MAX_STAT_DEPTH)
+			s->nodesizes[tn->bits]++;
+
 		for (i = 0; i < (1<<tn->bits); i++)
 			if (!tn->child[i])
 				s->nullpointers++;
@@ -2110,8 +2112,8 @@
 	seq_printf(seq, "\tInternal nodes: %d\n\t", stat->tnodes);
 	bytes += sizeof(struct tnode) * stat->tnodes;
 
-	max = MAX_CHILDS-1;
-	while (max >= 0 && stat->nodesizes[max] == 0)
+	max = MAX_STAT_DEPTH;
+	while (max > 0 && stat->nodesizes[max-1] == 0)
 		max--;
 
 	pointers = 0;
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
fib_trie initialzation fix [PATCH]
Hello!

In some kernel configs the /proc functions seem to be accessed before
the trie is initialized. The patch below checks for this.

Cheers.
--ro

Signed-off-by: Robert Olsson <[EMAIL PROTECTED]>

--- linux-2.6.16-rc4/net/ipv4/fib_trie.c.orig	2006-02-21 22:21:36.0 +0100
+++ linux-2.6.16-rc4/net/ipv4/fib_trie.c	2006-02-22 13:44:37.0 +0100
@@ -50,7 +50,7 @@
  * Patrick McHardy <[EMAIL PROTECTED]>
  */
 
-#define VERSION "0.404"
+#define VERSION "0.405"
 
 #include
 #include
@@ -2040,7 +2040,15 @@
 static struct node *fib_trie_get_first(struct fib_trie_iter *iter,
 					struct trie *t)
 {
-	struct node *n = rcu_dereference(t->trie);
+	struct node *n;
+
+	if(!t)
+		return NULL;
+
+	n = rcu_dereference(t->trie);
+
+	if(!iter)
+		return NULL;
 
 	if (n && IS_TNODE(n)) {
 		iter->tnode = (struct tnode *) n;
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pktgen: fix races between control/worker threads
Arthur Kepner writes: Tanks. These races should be cured now I've tested a some runs and it works but I didn't see any problems before either. We'll hear from Jesse if this cured his problems. Cheers. --ro Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> > New patch differs from the previous one only in the ordering > of the two lines above. > > diff --git a/net/core/pktgen.c b/net/core/pktgen.c > --- a/net/core/pktgen.c > +++ b/net/core/pktgen.c > @@ -153,7 +153,7 @@ > #include > > > -#define VERSION "pktgen v2.63: Packet Generator for packet performance > testing.\n" > +#define VERSION "pktgen v2.64: Packet Generator for packet performance > testing.\n" > > /* #define PG_DEBUG(a) a */ > #define PG_DEBUG(a) > @@ -176,7 +176,8 @@ > #define T_TERMINATE (1<<0) > #define T_STOP(1<<1) /* Stop run */ > #define T_RUN (1<<2) /* Start run */ > -#define T_REMDEV (1<<3) /* Remove all devs */ > +#define T_REMDEVALL (1<<3) /* Remove all devs */ > +#define T_REMDEV (1<<4) /* Remove one dev */ > > /* Locks */ > #define thread_lock()down(&pktgen_sem) > @@ -218,6 +219,8 @@ struct pktgen_dev { > * we will do a random selection from within the range. > */ > __u32 flags; > +int removal_mark; /* non-zero => the device is marked for > + * removal by worker thread */ > > int min_pkt_size;/* = ETH_ZLEN; */ > int max_pkt_size;/* = ETH_ZLEN; */ > @@ -481,7 +484,7 @@ static void pktgen_stop_all_threads_ifs( > static int pktgen_stop_device(struct pktgen_dev *pkt_dev); > static void pktgen_stop(struct pktgen_thread* t); > static void pktgen_clear_counters(struct pktgen_dev *pkt_dev); > -static struct pktgen_dev *pktgen_NN_threads(const char* dev_name, int > remove); > +static int pktgen_mark_device(const char* ifname); > static unsigned int scan_ip6(const char *s,char ip[16]); > static unsigned int fmt_ip6(char *s,const char ip[16]); > > @@ -1406,7 +1409,7 @@ static ssize_t pktgen_thread_write(struc > > if (!strcmp(name, "rem_device_all")) { > thread_lock(); > -t->control |= T_REMDEV; > +t->control |= T_REMDEVALL; > thread_unlock(); > schedule_timeout_interruptible(msecs_to_jiffies(125)); /* > Propagate thread->control */ > ret = count; > @@ -1457,7 +1460,8 @@ static struct pktgen_dev *__pktgen_NN_th > if (pkt_dev) { > if(remove) { > if_lock(t); > -pktgen_remove_device(t, pkt_dev); > +pkt_dev->removal_mark = 1; > +t->control |= T_REMDEV; > if_unlock(t); > } > break; > @@ -1467,13 +1471,44 @@ static struct pktgen_dev *__pktgen_NN_th > return pkt_dev; > } > > -static struct pktgen_dev *pktgen_NN_threads(const char* ifname, int remove) > +/* > + * mark a device for removal > + */ > +static int pktgen_mark_device(const char* ifname) > { > struct pktgen_dev *pkt_dev = NULL; > +const int max_tries = 10, msec_per_try = 125; > +int i = 0; > +int ret = 0; > + > thread_lock(); > -pkt_dev = __pktgen_NN_threads(ifname, remove); > -thread_unlock(); > -return pkt_dev; > +PG_DEBUG(printk("pktgen: pktgen_mark_device marking %s for > removal\n", > +ifname)); > + > +while(1) { > + > +pkt_dev = __pktgen_NN_threads(ifname, REMOVE); > +if (pkt_dev == NULL) break; /* success */ > + > +thread_unlock(); > +PG_DEBUG(printk("pktgen: pktgen_mark_device waiting for %s " > +"to disappear\n", ifname)); > +schedule_timeout_interruptible(msecs_to_jiffies(msec_per_try)); > +thread_lock(); > + > +if (++i >= max_tries) { > +printk("pktgen_mark_device: timed out after waiting " > +"%d msec for device %s to be removed\n", > +msec_per_try*i, ifname); > +ret = 1; > +break; > +}
[PATCH] pktgen: fix races between control/worker threads
Arthur Kepner writes:
 > 
 > Let's try this again. How does this look, Robert?

Yep, better.

 > 	if(remove) {
 > +		t->control |= T_REMDEV;
 > +		pkt_dev->removal_mark = 1;
 > 	}

Guess you should set the mark before you set control, to avoid the race
above. Eventually T_REMDEV could be sent from the syncing loop you added
too, but maybe that's not needed.

Luiz Fernando Capitulino <[EMAIL PROTECTED]> posted a couple of patches
before. One of the patches converted the thread list to the kernel list
functions. Unfortunately it depended on a huge whitespace patch.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] pktgen: potential double free of pktgen_dev's skb
Arthur Kepner writes:
 > There's a race in pktgen which can lead to a double
 > free of a pktgen_dev's skb. If a worker thread is in
 > the midst of doing fill_packet(), and the controlling
 > thread gets a "stop" message, the already freed skb
 > can be freed once again in pktgen_stop_device().

Hello!

A device can only belong to one CPU/thread, and to avoid races control
messages are posted to the thread. In the cases you mention,
pktgen_stop is mostly called from the same CPU/thread.
pktgen_stop_all_threads_ifs is an exception; it should be possible to
set running in struct pktgen_dev to false for all devs in the thread and
have pktgen_thread_worker do the pktgen_stop, to avoid races. The same
trick could be used in the case where we get NETDEV_UNREGISTER: in
__pktgen_NN_threads, instead of removing the device, we set running to
false and let pktgen_thread_worker do the remove job. It might need some
syncing. Anything you could try?

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Van Jacobson net channels
Leonid Grossman writes:
 > Right. Interrupt moderation is done on per channel basis.
 > The only addition to the current NAPI mechanism I'd like to see is to
 > have NAPI setting desired interrupt rate (once interrupts are ON),
 > rather than use an interrupt per packet or a driver default. Arguably,
 > NAPI can figure out desired interrupt rate a bit better than a driver
 > can.

In the current scheme a driver can easily use a dynamic interrupt
scheme; in fact tulip has used this for years. At low rates there are no
delays at all; if we reach some threshold, the interrupt latency is
increased. It can be done in several levels. Luckily, the best threshold
seems to be simply the number of packets sitting in the RX ring when
->poll is called. Jamal experimented heavily with this and gave a talk
at OLS 2000; a sketch of the idea is below.

Yes, if the net channel classifier runs in hardirq we get back to the
livelock situation sooner or later. IMO interrupts should just be a
signal to indicate work.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
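A hedged sketch of that heuristic: scale the adapter's RX interrupt
delay with how many packets each ->poll() found on the ring. All names
(my_adapter, MY_RX_DELAY_REG) and the thresholds are made up for
illustration:

	static void adjust_rx_moderation(struct my_adapter *ap, int pkts_this_poll)
	{
		u32 delay;

		if (pkts_this_poll < 8)
			delay = 0;	/* light load: interrupt immediately */
		else if (pkts_this_poll < 64)
			delay = 32;	/* moderate load: batch a little */
		else
			delay = 128;	/* near livelock: batch hard */

		if (delay != ap->rx_int_delay) {
			ap->rx_int_delay = delay;
			writel(delay, ap->hw_addr + MY_RX_DELAY_REG);
		}
	}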
[PATCH 0/4] - pktgen: refinements and small fixes (V2).
Luiz Fernando Capitulino writes: > [PATCH 1/4] pktgen: Lindent run. > [PATCH 2/4] pktgen: Ports thread list to Kernel list implementation. > [PATCH 3/4] pktgen: Fix kernel_thread() fail leak. > [PATCH 4/4] pktgen: Fix Initialization fail leak. > > The changes from V1 are: > > 1. More fixes made by hand after Lindent run > 2. Re-diffed agains't Dave's net-2.6.17 tree Should be fine I've used the previous version of the patches for a couple of days now. Thanks. Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> Cheers. --ro - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: fib_rules w. RCU lock [PATCH]
David S. Miller writes: > > + r = (struct fib_rule *) fib_rules.first; > I don't think this is right. > fib_rules.first is a pointer to the hlist_node (within the fib_rule) > not the fib_rule itself. It works because the hlist_node is the first member of the struct fib_rule but I agree it might safer/better to use i.e contailer_of. r = container_of(fib_rules.first, struct fib_rule, hlist); My colleauge Jens Laas spotted a simiar cast which is corrected as well. > What is the preempt_disable() actually protecting here? > It's a simple assignment to -1, and since the disable > occurs inside the ifindex test, it is not protecting > that either. The preempts are removed and an updated version of the patch is enclosed. Cheers. --ro Signed-off-by: Robert Olsson <[EMAIL PROTECTED]> --- linux-2.6.14.5/net/ipv4/fib_rules.c.orig2006-01-27 10:18:32.0 +0100 +++ linux-2.6.14.5/net/ipv4/fib_rules.c 2006-01-27 13:38:02.0 +0100 @@ -39,6 +39,8 @@ #include #include #include +#include +#include #include #include @@ -51,7 +53,7 @@ struct fib_rule { - struct fib_rule *r_next; + struct hlist_node hlist; atomic_tr_clntref; u32 r_preference; unsigned char r_table; @@ -74,6 +76,7 @@ #endif charr_ifname[IFNAMSIZ]; int r_dead; + struct rcu_head rcu; }; static struct fib_rule default_rule = { @@ -84,7 +87,6 @@ }; static struct fib_rule main_rule = { - .r_next = &default_rule, .r_clntref =ATOMIC_INIT(2), .r_preference = 0x7FFE, .r_table = RT_TABLE_MAIN, @@ -92,23 +94,24 @@ }; static struct fib_rule local_rule = { - .r_next = &main_rule, .r_clntref =ATOMIC_INIT(2), .r_table = RT_TABLE_LOCAL, .r_action = RTN_UNICAST, }; -static struct fib_rule *fib_rules = &local_rule; -static DEFINE_RWLOCK(fib_rules_lock); +struct hlist_head fib_rules; + +/* writer func called from netlink -- rtnl_sem hold*/ int inet_rtm_delrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg) { struct rtattr **rta = arg; struct rtmsg *rtm = NLMSG_DATA(nlh); - struct fib_rule *r, **rp; + struct fib_rule *r; + struct hlist_node *node; int err = -ESRCH; - for (rp=&fib_rules; (r=*rp) != NULL; rp=&r->r_next) { + hlist_for_each_entry(r, node, &fib_rules, hlist) { if ((!rta[RTA_SRC-1] || memcmp(RTA_DATA(rta[RTA_SRC-1]), &r->r_src, 4) == 0) && rtm->rtm_src_len == r->r_src_len && rtm->rtm_dst_len == r->r_dst_len && @@ -125,10 +128,8 @@ if (r == &local_rule) break; - write_lock_bh(&fib_rules_lock); - *rp = r->r_next; + hlist_del_rcu(&r->hlist); r->r_dead = 1; - write_unlock_bh(&fib_rules_lock); fib_rule_put(r); err = 0; break; @@ -149,21 +150,30 @@ return NULL; } +static inline void fib_rule_put_rcu(struct rcu_head *head) +{ + struct fib_rule *r = container_of(head, struct fib_rule, rcu); + kfree(r); +} + void fib_rule_put(struct fib_rule *r) { if (atomic_dec_and_test(&r->r_clntref)) { if (r->r_dead) - kfree(r); + call_rcu(&r->rcu, fib_rule_put_rcu); else printk("Freeing alive rule %p\n", r); } } +/* writer func called from netlink -- rtnl_sem hold*/ + int inet_rtm_newrule(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg) { struct rtattr **rta = arg; struct rtmsg *rtm = NLMSG_DATA(nlh); - struct fib_rule *r, *new_r, **rp; + struct fib_rule *r, *new_r, *last = NULL; + struct hlist_node *node = NULL; unsigned char table_id; if (rtm->rtm_src_len > 32 || rtm->rtm_dst_len > 32 || @@ -187,6 +197,7 @@ if (!new_r) return -ENOMEM; memset(new_r, 0, sizeof(*new_r)); + if (rta[RTA_SRC-1]) memcpy(&new_r->r_src, RTA_DATA(rta[RTA_SRC-1]), 4); if (rta[RTA_DST-1]) @@ -219,28 +230,28 @@ if (rta[RTA_FLOW-1]) memcpy(&new_r->r_tclassid, RTA_DATA(rta[RTA_FLOW-1]), 4); 
#endif + r = container_of(fib_rules.first, struct fib_rule, hlist); - rp = &fib_rules; if (!new_r->r_preference) { - r = fib_rules; - if (r && (r = r->r_next) != NULL) { -
Re: driver skb reuse
Benjamin LaHaise writes:
 > Right. Btw, looking over your changes for skb reuse and how the e1000
 > lays out its data structures, I think there is still some room for
 > improvement: currently you still touch the skb that the e1000 used to
 > allocate the receive buffers in. That cacheline is likely to be L1 cold
 > because of the large number of outstanding buffers the e1000 uses. If you
 > instead only touch the buffer_info structure to populate a freshly
 > allocated skb, you can eliminate a couple of cache misses and increase
 > the chances that the skb you're reusing stays in the L1 cache. That will
 > require a lot more surgery to the e1000 code, though, but it could be
 > worth it.

Getting rid of some cache misses would definitely be a big win. But now
we go from tutorial to the real world. Are you brave enough to outline
some code?

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
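In the meantime, a hedged sketch of what Benjamin describes: refill a
ring slot from a freshly allocated skb while touching only buffer_info
and the new skb, never the old L1-cold one. The names mirror e1000, but
this is an illustration, not a driver patch:

	static void refill_rx_slot(struct e1000_adapter *adapter,
				   struct e1000_buffer *buffer_info)
	{
		struct sk_buff *skb = dev_alloc_skb(adapter->rx_buffer_len + 16);

		if (!skb)
			return;			/* leave the slot empty; retry next poll */

		skb_reserve(skb, 16);		/* 16-byte align the IP header */
		buffer_info->skb = skb;
		buffer_info->dma = pci_map_single(adapter->pdev, skb->data,
						  adapter->rx_buffer_len,
						  PCI_DMA_FROMDEVICE);
		/* the descriptor rearm uses buffer_info->dma only, so the
		 * previous skb's cachelines are never touched here */
	}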
Re: driver skb reuse
jamal writes:
 > > Here we process (drop) about 13% more packets when skb's get reused.
 > 
 > Very cool.
 > Robert, it would be interesting to see something more interesting
 > (longer code path) such as udp. You can still use a packetgen to shoot
 > at the box (netperf is a wimp), send it to udp port 9 which just
 > swallows packets.

Neither UDP nor TCP kfrees the skb in its normal path today, so it
cannot be reused. ICMP packets seem to get reused as-is. Dave and Andi:
some possible improvement for TCP there.

For UDP I just took a quick look. Only in case of errors does the skb
get freed. So e.g. sending UDP packets to a socket which is not open
would make the skb's reused. That's something we could test...

The normal path, from what I understand, is to queue packets into the
socket receive queue:

	skb_queue_tail(&sk->sk_receive_queue, skb);

which could be an excellent place for UDP copy-break code... which then
makes skb reuse possible.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
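A hedged sketch of that copy-break, right at the queueing point
(RX_COPYBREAK is a made-up threshold; error and accounting paths are
omitted):

	#define RX_COPYBREAK 256	/* made-up cutoff for "small" packets */

	/* in the UDP queueing path, before skb_queue_tail(): */
	if (skb->len < RX_COPYBREAK) {
		struct sk_buff *small = skb_copy(skb, GFP_ATOMIC);

		if (small) {
			/* drop our reference to the big RX buffer: the
			 * driver's users==1 test can now detect it as
			 * consumed and put it back on the ring */
			kfree_skb(skb);
			skb = small;
		}
	}
	skb_queue_tail(&sk->sk_receive_queue, skb);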
Re: driver skb reuse
Benjamin LaHaise writes:
 > Instead of doing a completely separate skb reuse path, what happens if
 > you remove the memset() from __alloc_skb() and instead do it in a slab
 > ctor? I remember seeing that in the profiles for af_unix. Dave, could
 > you refresh my memory why the slab ctor ended up being dropped? Doing
 > the memset() at free time is usually almost free since it is usually in
 > the cache at that point.

Right, but that's another issue... It can be done; I played with a
variant of this too. I split alloc_skb into two parts to get the
memset() part done in the driver, just before passing the skb to the
stack. I expected a big win but didn't see any gain. Strange, but the
code was bad, so it might be worth a try again. Still, this is a bit
different from the skb reuse we are discussing now.

Cheers.
--ro
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
driver skb reuse
Hello! We discussed the reuse of skb's some time ago. Below is some code to
examine how skb's can be reused when the upper layer (RX softirq) consumes
the skb, so that we can detect this in the NAPI path and reuse the skb. It
can give new possibilities for TCP optimization (davem), a common driver
copybreak, etc.

In the test below I use my usual lab setup but just let netfilter drop the
packets. We win about 13% in this experiment. Single Opteron 252 (2.6 GHz)
CPU, e1000 6.2.15, Linux 2.6.14.5. Input rate 2 x 1.16 Mpps, with a
netfilter drop rule for pktgen in the forward chain.

Iface  MTU Met   RX-OK RX-ERR  RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0  1500  0  6680261      0 3319739      0      5      0      0      0 BRU
eth1  1500  0        0      0       0      0      3      0      0      0 BRU
eth2  1500  0  6680344      0 3319656      0      3      0      0      0 BRU
eth3  1500  0        0      0       0      0      3      0      0      0 BRU

With the consume/recycle code below:

Iface  MTU Met   RX-OK RX-ERR  RX-DRP RX-OVR  TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0  1500  0  7541191      0 2458809      0      5      0      0      0 BRU
eth1  1500  0        0      0       0      0      3      0      0      0 BRU
eth2  1500  0  7541150      0 2458850      0      3      0      0      0 BRU
eth3  1500  0        0      0       0      0      3      0      0      0 BRU

Here we process (drop) about 13% more packets when skb's get reused.

--- e1000_main.c.060124	2006-01-24 12:30:10.0 +0100
+++ e1000_main.c	2006-01-24 12:28:58.0 +0100
@@ -3802,6 +3802,10 @@
 		skb->protocol = eth_type_trans(skb, netdev);
 #ifdef CONFIG_E1000_NAPI
+
+		/* Increment users so skb don't get destructed */
+		skb_get(skb);
+
 #ifdef NETIF_F_HW_VLAN_TX
 		if(unlikely(adapter->vlgrp &&
 			    (status & E1000_RXD_STAT_VP))) {
@@ -3814,6 +3818,23 @@
 #else
 			netif_receive_skb(skb);
 #endif
+
+		/*
+		 * If skb is consumed by RX softirq we can simply use it again
+		 * otherwise undo the users increment with kfree
+		 */
+
+		if (!multi_descriptor && atomic_read(&skb->users) == 1 &&
+		    realloc_skb(skb, adapter->rx_buffer_len, GFP_ATOMIC)) {
+
+			skb_reserve(skb, 16);
+			skb->dev = netdev;
+			buffer_info->skb = skb;
+			adapter->net_stats.rx_compressed++;
+		}
+		else
+			kfree_skb(skb);
+
 #else /* CONFIG_E1000_NAPI */
 #ifdef NETIF_F_HW_VLAN_TX
 		if(unlikely(adapter->vlgrp &&

realloc_skb from [EMAIL PROTECTED]

--- linux-2.6.14.5/include/linux/skbuff.h.orig	2006-01-19 15:51:08.0 +0100
+++ linux-2.6.14.5/include/linux/skbuff.h	2006-01-24 11:34:44.0 +0100
@@ -303,6 +303,9 @@
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone);
+extern struct sk_buff *realloc_skb(struct sk_buff *skb, unsigned int size,
+				   int priority);
+
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
--- linux-2.6.14.5/net/core/skbuff.c.orig	2006-01-19 15:50:37.0 +0100
+++ linux-2.6.14.5/net/core/skbuff.c	2006-01-24 11:57:34.0 +0100
@@ -276,6 +276,59 @@
 	}
 }
 
+/**
+ *	realloc_skb	-	reset skb for new packet.
+ *	@size: size to allocate
+ *	@gfp_mask: allocation mask
+ *
+ *	Allocate a new &sk_buff. The returned buffer has no headroom and a
+ *	tail room of size bytes. The object has a reference count of one.
+ *	The return is the buffer. On a failure the return is %NULL.
+ *
+ *	Buffers may only be allocated from interrupts using a @gfp_mask of
+ *	%GFP_ATOMIC.
+ */
+
+struct sk_buff *realloc_skb(struct sk_buff *skb, unsigned int size, int gfp_mask)
+{
+	int truesize = skb->truesize;
+	u8 *data = skb->head;
+
+	memset(skb, 0, offsetof(struct sk_buff, truesize));
+
+	/* Get the DATA. Size must match skb_add_mtu(). */
+	size = SKB_DATA_ALIGN(size);
+	if ((size + sizeof(struct sk_buff)) > truesize) {
+		skb_release_data(skb);
+		data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+		if (data == NULL)
+			goto nodata;
+	}
+
+	/* XXX: does not include slab overhead */
+	skb->truesize = size + sizeof(struct sk_buff);
+
+	/* Load the data pointers. */
+	skb->head = data;
+	skb->data = data;
+	skb->tail = data;
+	skb->end  = data + size;
+
+	/* Set up other state */
+	skb->len = 0;
+	skb->cloned = 0;
+	skb->data_len = 0;
+
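The posting is cut off above. For readability, the function's likely tail,
modeled on the end of __alloc_skb() in 2.6.14; this is a reconstruction,
not text from the original patch:

	/* reset the shared info block and hand back a one-reference skb,
	 * mirroring what __alloc_skb() does after its memset() */
	atomic_set(&skb->users, 1);
	atomic_set(&(skb_shinfo(skb)->dataref), 1);
	skb_shinfo(skb)->nr_frags = 0;
	skb_shinfo(skb)->tso_size = 0;
	skb_shinfo(skb)->tso_segs = 0;
	skb_shinfo(skb)->frag_list = NULL;

	return skb;

nodata:
	kmem_cache_free(skbuff_head_cache, skb);	/* give the header back too */
	return NULL;
}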
[PATCH 00/00] pktgen: refinements and small fixes.
Luiz Fernando Capitulino writes:
> [PATCH 00/02] pktgen: Ports thread list to Kernel list implementation.
> [PATCH 00/03] pktgen: Fix kernel_thread() fail leak.
> [PATCH 00/04] pktgen: Fix Initialization fail leak.
>
> But I'm sending these patches first, just to know if I'm doing something
> wrong. Thanks!

Looks interesting. Yes, it is of course better to use the kernel's list
functions (see the sketch below for the kind of conversion involved). I'll
apply this to my lab version and run it through some tests. Keep on
hacking...

Cheers.
					--ro
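For reference, the thread-list conversion is the standard struct list_head
pattern; a sketch with illustrative field names, not necessarily those in
Luiz's actual patches:

#include <linux/list.h>

struct pktgen_thread {
	struct list_head th_list;	/* replaces open-coded ->next chaining */
	int cpu;
	/* ... */
};

static LIST_HEAD(pktgen_threads);

/* walk the global thread list with the generic iterator */
static struct pktgen_thread *pktgen_find_thread(int cpu)
{
	struct pktgen_thread *t;

	list_for_each_entry(t, &pktgen_threads, th_list)
		if (t->cpu == cpu)
			return t;
	return NULL;
}

/* registration becomes a one-liner instead of hand-rolled pointer surgery */
static void pktgen_register_thread(struct pktgen_thread *t)
{
	list_add_tail(&t->th_list, &pktgen_threads);
}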
LOW throughput with pktgen and e1000
Aritz Bastida writes:
> I need to use pktgen for sending packets at very high speed to another
> machine, in order to test it under heavy network traffic. All my
> previous injection test were done with a dual Pentium III 800 MHz. As
> I needed a more powerful machine I got a Pentium 4 but the results are
> quite similar.

Also make sure HW flow control from the receiver is not throttling the
sender (a small snippet for turning pause frames off follows below).
Adapters differ, and PCI bridges add latency. Bus latency and bus speed
are the most important factors, so a PIII might very well be faster than
a XEON/P4/Opteron.

> These results are similar to the ones I got with the Pentium III. I
> can't reach even 400kpps. With another machine, a dual AMD Opteron, I
> can send as fast as 650kpps but unfortunately that is the machine
> being tested. In the pktgen paper says "A single flow of 1.48Mpps is
> seen with a Xeon 2.67 GHz using a patched e1000 driver (64 byte
> packets)" [1]. Well, I don't know how much faster the Intel Xeon is,
> but I do have a fast machine, and I do have an e1000 NIC.

The 1.488 Mpps figure was with several patches, one of which did HW TX
prefetching and was catastrophic in production; it worked only in simplex
mode. Below are numbers from our dual 2.67 GHz XEON with ServerWorks
chipset and Intel 82546GB, on a more or less vanilla 2.6.14.

One flow:
 880039pps 422Mb/sec (422418720bps) errors: 0

Two concurrent flows:
 850419pps 408Mb/sec (408201120bps) errors: 0
 850421pps 408Mb/sec (408202080bps) errors: 0

All with 64 byte pkts (60 in pktgen); you're using 100.

Cheers.
					--ro
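If pause frames are suspected, they can be switched off from userspace via
the ethtool ioctl; a minimal sketch doing the equivalent of
"ethtool -A ethX rx off tx off", with error handling kept short:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
	struct ethtool_pauseparam pp;
	struct ifreq ifr;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <ifname>\n", argv[0]);
		return 1;
	}

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&pp, 0, sizeof(pp));
	pp.cmd = ETHTOOL_SPAUSEPARAM;
	pp.autoneg = 0;		/* don't negotiate pause frames */
	pp.rx_pause = 0;	/* ignore received pause frames */
	pp.tx_pause = 0;	/* never send pause frames */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&pp;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_SPAUSEPARAM");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}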
Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements
David S. Miller writes:
> I think a new variant of netif_receive_skb() might be needed or
> maybe not. I don't see a need for a new ->poll() for example.

Yes, ->poll() should be fine as-is; for netif_receive_skb() we have to see.

> On the other hand, nobody checks the return value from
> netif_receive_skb(). This would have been a nice avenue
> were it not for the SKB refcounting, but I think we could
> work around that by making the driver grab the extra reference
> count before it invokes netif_receive_skb().
>
> I even think you are hinting at exactly this :)

Yes, something like this, at least as a start (a sketch of it follows
below). It could be the first part to play with. I'll leave the TCP
optimizations to you. :)

Cheers.
					--ro
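As a driver-side sketch of Dave's refcount idea; this mirrors the e1000
experiment posted earlier in the thread, and e1000_recycle_rx_buffer() is
an illustrative name, not an existing function:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical helper: reinitialize the skb and put it back in the rx ring */
extern void e1000_recycle_rx_buffer(struct sk_buff *skb);

static void rx_indicate_and_maybe_recycle(struct sk_buff *skb)
{
	skb_get(skb);			/* users: 1 -> 2, stack can't free it */
	netif_receive_skb(skb);

	if (atomic_read(&skb->users) == 1) {
		/* RX softirq dropped its reference: packet fully consumed,
		 * so the buffer is ours again */
		e1000_recycle_rx_buffer(skb);
	} else {
		/* stack kept a reference (socket queue, clone, ...):
		 * drop our extra hold and forget the buffer */
		kfree_skb(skb);
	}
}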
Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements
jamal writes:
> Essentially the approach would be the same as Robert's old recycle patch
> where he doesnt recycle certain skbs - the only difference being in the
> case of forwarding, the recycle is done asynchronously at EOT whereas
> this is done synchronously upon return from host path.
> The beauty of the approach is you dont have to worry about recycling on
> the wrong CPU ;-> (which has been a big problem for forwarding path)
> I have to chime in and say for the host stack - I like it ;->

No, we don't solve any problems for forwarding, but as Dave pointed out we
can do nice things. Instead of dropping the skb in case of failures,
netfilter verdicts etc., we can reuse it, and if the skb is consumed
within the RX softirq we can simply return it to the driver. You did the
feedback mechanism (the NET_RX_XX stuff) six years ago. Now it can
possibly be used. :)

A practical problem is how to maintain compatibility with the current
behavior, which defaults to NET_RX_SKB_CONSUMED. A new driver entry point
(sketched below)? And can we increase skb->users to delay skb destruction
until the driver gets the indication back? Then the driver would do the
final kfree, rather than the protocol layers as now. This would avoid
massive code changes. Thoughts?

Cheers
					--ro
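A minimal sketch of what that contract could look like, with
NET_RX_SKB_CONSUMED as a hypothetical verdict; no such return code exists
in mainline today:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical: "the RX softirq is completely done with this skb".
 * Aliased to NET_RX_SUCCESS here only so the sketch compiles. */
#define NET_RX_SKB_CONSUMED	NET_RX_SUCCESS

/* hand the skb up and learn from the verdict whether it may be recycled */
static int rx_indicate(struct sk_buff *skb)
{
	int verdict = netif_receive_skb(skb);

	if (verdict == NET_RX_SKB_CONSUMED)
		return 1;	/* buffer is ours again: recycle into rx ring */

	return 0;		/* stack kept it: final kfree_skb() happens there */
}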
Re: Resend [PATCH netdev-2.6 2/8] e1000: Performance Enhancements
jamal writes:
>
> Ok, this makes things more interesting
> What worked for a XEON doesnt work the same way for an opteron.
>
> For me, the copybreak (in its capacity as adding extra cycles that make
> the prefetch look good) made things look good. Also, #125 gave a best
> answer. None of these were the case from Roberts results.
>
> Robert, what about just #1? Maybe thats the best compromise that would
> work for all.

I've tried that before with a flow test and got a contribution from #2:

 0 prefetches  756 kpps
 1 prefetch    805 kpps  (first)
 2 prefetches  821 kpps  (first two)
 5 prefetches  803 kpps  (all)

> Also, I am really hoping that someone will test with older hardware
> where i claim to have seen prefetch causing problems.

We'll give this up for now unless you or somebody else has a very good
idea of how to handle prefetching in a generic way. I'll use #12 and
you'll use #125; Intel uses #12345 etc. Or do we all benefit from #12?
(A sketch of what #12 amounts to follows below.)

Cheers.
					--ro
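For concreteness, a sketch of the #12 variant in the rx clean path,
assuming sites #1 and #2 are data prefetches on the first two cache lines
of the received frame; the numbering follows this thread, and the actual
placement in the Intel patch may differ:

#include <linux/cache.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

/* warm the header cache lines before eth_type_trans() and the stack
 * dereference them; each prefetch pulls in one L1 cache line */
static inline void rx_prefetch_frame(const struct sk_buff *skb)
{
	prefetch(skb->data);				/* #1: Ethernet + IP header */
	prefetch(skb->data + L1_CACHE_BYTES);		/* #2: rest of the headers */
}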