Re: tbench regression in 2.6.25-rc1

2008-02-20 Thread Zhang, Yanmin
Compared with kernel 2.6.24, the tbench result has a regression with
2.6.25-rc1.
1) On the 2 quad-core-processor stoakley: 4%.
2) On the 4 quad-core-processor tigerton: more than 30%.

A bisect located the patch below.

b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Tue Nov 13 21:33:32 2007 -0800

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6.  It's also currently
creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.

The above patch changes the cache line alignment, especially of member __refcnt. I did a test by adding 2 unsigned long paddings before lastuse, so the 3 members, lastuse/__refcnt/__use, are moved to the next cache line. The performance is recovered.

I created a patch to rearrange the members in struct dst_entry.

With Eric's and Valdis Kletnieks's suggestions, I made a finer arrangement.
1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so sizeof(dst_entry)=200 no matter whether CONFIG_NET_CLS_ROUTE=y/n. I tested many patches on my 16-core tigerton by moving tclassid to different places. It looks like tclassid could also have an impact on performance. If tclassid is moved before metrics, or not moved at all, the performance isn't good, so I moved it behind metrics.
2) Add comments before __refcnt.

On 16-core tigerton:
If CONFIG_NET_CLS_ROUTE=y, the result with the below patch is about 18% better than the one without the patch;
if CONFIG_NET_CLS_ROUTE=n, the result with the below patch is about 30% better than the one without the patch.

With 32-bit 2.6.25-rc1 on 8-core stoakley, the new patch doesn't introduce a regression.

Thank Eric, Valdis, and David!

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
Acked-by: Eric Dumazet <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 +0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required */
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE
-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
struct neighbour*neighbour;
struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: tbench regression in 2.6.25-rc1

2008-02-19 Thread Zhang, Yanmin
On Tue, 2008-02-19 at 08:40 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 
> >> On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:
> >>
> >>> I also think __refcnt is the key. I did a new test by adding 2 unsigned
> >>> long paddings before lastuse, so the 3 members are moved to the next
> >>> cache line. The performance is recovered.
> >>>
> >>> How about the below patch? Almost all performance is recovered with the
> >>> new patch.
> >>>
> >>> Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> >> Could you add a comment someplace that says "refcnt wants to be on a
> >> different cache line from input/output/ops or performance tanks badly",
> >> to warn some future kernel hacker who starts adding new fields to the
> >> structure?
> > Ok. Below is the new patch.
> > 
> > 1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so
> > sizeof(dst_entry)=200 no matter whether CONFIG_NET_CLS_ROUTE=y/n. I tested
> > many patches on my 16-core tigerton by moving tclassid to different
> > places. It looks like tclassid could also have an impact on performance.
> > If tclassid is moved before metrics, or not moved at all, the performance
> > isn't good, so I moved it behind metrics.
> > 
> > 2) Add comments before __refcnt.
> > 
> > If CONFIG_NET_CLS_ROUTE=y, the result with the below patch is about 18%
> > better than the one without the patch.
> > 
> > If CONFIG_NET_CLS_ROUTE=n, the result with the below patch is about 30%
> > better than the one without the patch.
> > 
> > Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> > 
> > ---
> > 
> > --- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
> > +++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 +0800
> > @@ -52,15 +52,10 @@ struct dst_entry
> > unsigned short  header_len; /* more space at head required */
> > unsigned short  trailer_len;/* space to reserve at tail */
> >  
> > -   u32 metrics[RTAX_MAX];
> > -   struct dst_entry*path;
> > -
> > -   unsigned long   rate_last;  /* rate limiting for ICMP */
> > unsigned intrate_tokens;
> > +   unsigned long   rate_last;  /* rate limiting for ICMP */
> >  
> > -#ifdef CONFIG_NET_CLS_ROUTE
> > -   __u32   tclassid;
> > -#endif
> > +   struct dst_entry*path;
> >  
> > struct neighbour*neighbour;
> > struct hh_cache *hh;
> > @@ -70,10 +65,20 @@ struct dst_entry
> > int (*output)(struct sk_buff*);
> >  
> > struct  dst_ops *ops;
> > -   
> > -   unsigned long   lastuse;
> > +
> > +   u32 metrics[RTAX_MAX];
> > +
> > +#ifdef CONFIG_NET_CLS_ROUTE
> > +   __u32   tclassid;
> > +#endif
> > +
> > +   /*
> > +* __refcnt wants to be on a different cache line from
> > +* input/output/ops or performance tanks badly
> > +*/
> > atomic_t__refcnt;   /* client references*/
> > int __use;
> > +   unsigned long   lastuse;
> > union {
> > struct dst_entry *next;
> > struct rtable*rt_next;
> > 
> > 
> > 
> 
> I prefer this patch, but unfortunately your perf numbers are for 64-bit
> kernels.
> 
> Could you please test now with a 32-bit one?
I tested it with 32-bit 2.6.25-rc1 on 8-core stoakley. The result shows almost no difference between the pure kernel and the patched kernel.

New update: On 8-core stoakley, the regression becomes 2~3% with kernel 2.6.25-rc2. On tigerton, the regression is still 30% with 2.6.25-rc2. On Tulsa (8 cores + hyper-threading), the regression is still 4% with 2.6.25-rc2.

With my patch, on tigerton, almost all of the regression disappears. On Tulsa, only about 2% of the regression disappears.

So this issue is triggered with multiple CPUs. Perhaps the process scheduler is another factor causing the issue, but it's very hard to change the scheduler.


Eric,

I tested your new patch in function loopback_xmit. It brings no improvement, but it doesn't introduce new issues either. As you tested it on a dual-core machine and got an improvement, how about merging your patch with mine?

-yanmin




Re: tbench regression in 2.6.25-rc1

2008-02-19 Thread Zhang, Yanmin
On Tue, 2008-02-19 at 08:35 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:
> >> On Mon, 18 Feb 2008 16:12:38 +0800
> >> "Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:
> >>
> >>> On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
> >>>> From: Eric Dumazet <[EMAIL PROTECTED]>
> >>>> Date: Fri, 15 Feb 2008 15:21:48 +0100
> >>>>
> >>>>> On linux-2.6.25-rc1 x86_64 :
> >>>>>
> >>>>> offsetof(struct dst_entry, lastuse)=0xb0
> >>>>> offsetof(struct dst_entry, __refcnt)=0xb8
> >>>>> offsetof(struct dst_entry, __use)=0xbc
> >>>>> offsetof(struct dst_entry, next)=0xc0
> >>>>>
> >>>>> So it should be optimal... I don't know why tbench prefers __refcnt
> >>>>> being on 0xc0, since in this case lastuse will be on a different
> >>>>> cache line...
> >>>>>
> >>>>> Each incoming IP packet will need to change lastuse, __refcnt and
> >>>>> __use, so keeping them in the same cache line is a win.
> >>>>>
> >>>>> I suspect then that even this patch could help tbench, since it
> >>>>> avoids writing lastuse...
> >>>> I think your suspicions are right, and even moreso
> >>>> it helps to keep __refcnt out of the same cache line
> >>>> as input/output/ops which are read-almost-entirely :-
> >>> I think you are right. The issue is these three variables sharing the
> >>> same cache line with input/output/ops.
> >>>
> >>>> )
> >>>>
> >>>> I haven't done an exhaustive analysis, but it seems that
> >>>> the write traffic to lastuse and __refcnt are about the
> >>>> same.  However if we find that __refcnt gets hit more
> >>>> than lastuse in this workload, it explains the regression.
> >>> I also think __refcnt is the key. I did a new test by adding 2 unsigned
> >>> long paddings before lastuse, so the 3 members are moved to the next
> >>> cache line. The performance is recovered.
> >>>
> >>> How about the below patch? Almost all performance is recovered with the
> >>> new patch.
> >>>
> >>> Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> >>>
> >>> ---
> >>>
> >>> --- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
> >>> +++ linux-2.6.25-rc1_work/include/net/dst.h  2008-02-21 14:36:22.0 +0800
> >>> @@ -52,11 +52,10 @@ struct dst_entry
> >>>   unsigned short  header_len; /* more space at head required */
> >>>   unsigned short  trailer_len;/* space to reserve at tail */
> >>>  
> >>> - u32 metrics[RTAX_MAX];
> >>> - struct dst_entry*path;
> >>> -
> >>> - unsigned long   rate_last;  /* rate limiting for ICMP */
> >>>   unsigned intrate_tokens;
> >>> + unsigned long   rate_last;  /* rate limiting for ICMP */
> >>> +
> >>> + struct dst_entry*path;
> >>>  
> >>>  #ifdef CONFIG_NET_CLS_ROUTE
> >>>   __u32   tclassid;
> >>> @@ -70,10 +69,12 @@ struct dst_entry
> >>>   int (*output)(struct sk_buff*);
> >>>  
> >>>   struct  dst_ops *ops;
> >>> - 
> >>> - unsigned long   lastuse;
> >>> +
> >>> + u32 metrics[RTAX_MAX];
> >>> +
> >>>   atomic_t__refcnt;   /* client references*/
> >>>   int __use;
> >>> + unsigned long   lastuse;
> >>>   union {
> >>>   struct dst_entry *next;
> >>>   struct rtable*rt_next;
> >>>
> >>>
> >> Well, after this patch, we grow dst_entry by 8 bytes :
> > With my .config, it doesn't grow, perhaps because I don't enable
> > CONFIG_NET_CLS_ROUTE. I will move tclassid under ops.
> > 
> >> sizeof(struct dst_entry)=0xd0
> >> offsetof(struct dst_entry, input)=0x68
>

Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 
> On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:
> 
> > I also think __refcnt is the key. I did a new test by adding 2 unsigned
> > long paddings before lastuse, so the 3 members are moved to the next
> > cache line. The performance is recovered.
> > 
> > How about the below patch? Almost all performance is recovered with the
> > new patch.
> > 
> > Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> 
> Could you add a comment someplace that says "refcnt wants to be on a
> different cache line from input/output/ops or performance tanks badly", to
> warn some future kernel hacker who starts adding new fields to the structure?
Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so sizeof(dst_entry)=200 no matter whether CONFIG_NET_CLS_ROUTE=y/n. I tested many patches on my 16-core tigerton by moving tclassid to different places. It looks like tclassid could also have an impact on performance. If tclassid is moved before metrics, or not moved at all, the performance isn't good, so I moved it behind metrics.

2) Add comments before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with the below patch is about 18% better than the one without the patch.

If CONFIG_NET_CLS_ROUTE=n, the result with the below patch is about 30% better than the one without the patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 +0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required */
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE
-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
struct neighbour*neighbour;
struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:
> On Mon, 18 Feb 2008 16:12:38 +0800
> "Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:
> 
> > On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
> > > From: Eric Dumazet <[EMAIL PROTECTED]>
> > > Date: Fri, 15 Feb 2008 15:21:48 +0100
> > > 
> > > > On linux-2.6.25-rc1 x86_64 :
> > > > 
> > > > offsetof(struct dst_entry, lastuse)=0xb0
> > > > offsetof(struct dst_entry, __refcnt)=0xb8
> > > > offsetof(struct dst_entry, __use)=0xbc
> > > > offsetof(struct dst_entry, next)=0xc0
> > > > 
> > > > So it should be optimal... I don't know why tbench prefers __refcnt
> > > > being on 0xc0, since in this case lastuse will be on a different
> > > > cache line...
> > > > 
> > > > Each incoming IP packet will need to change lastuse, __refcnt and
> > > > __use, so keeping them in the same cache line is a win.
> > > > 
> > > > I suspect then that even this patch could help tbench, since it
> > > > avoids writing lastuse...
> > > 
> > > I think your suspicions are right, and even moreso
> > > it helps to keep __refcnt out of the same cache line
> > > as input/output/ops which are read-almost-entirely :-
> > I think you are right. The issue is these three variables sharing the same
> > cache line with input/output/ops.
> > 
> > > )
> > > 
> > > I haven't done an exhaustive analysis, but it seems that
> > > the write traffic to lastuse and __refcnt are about the
> > > same.  However if we find that __refcnt gets hit more
> > > than lastuse in this workload, it explains the regression.
> > I also think __refcnt is the key. I did a new test by adding 2 unsigned
> > long paddings before lastuse, so the 3 members are moved to the next
> > cache line. The performance is recovered.
> > 
> > How about the below patch? Almost all performance is recovered with the
> > new patch.
> > 
> > Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> > 
> > ---
> > 
> > --- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
> > +++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 +0800
> > @@ -52,11 +52,10 @@ struct dst_entry
> > unsigned short  header_len; /* more space at head required */
> > unsigned short  trailer_len;/* space to reserve at tail */
> >  
> > -   u32 metrics[RTAX_MAX];
> > -   struct dst_entry*path;
> > -
> > -   unsigned long   rate_last;  /* rate limiting for ICMP */
> > unsigned intrate_tokens;
> > +   unsigned long   rate_last;  /* rate limiting for ICMP */
> > +
> > +   struct dst_entry*path;
> >  
> >  #ifdef CONFIG_NET_CLS_ROUTE
> > __u32   tclassid;
> > @@ -70,10 +69,12 @@ struct dst_entry
> > int (*output)(struct sk_buff*);
> >  
> > struct  dst_ops *ops;
> > -   
> > -   unsigned long   lastuse;
> > +
> > +   u32 metrics[RTAX_MAX];
> > +
> > atomic_t__refcnt;   /* client references*/
> > int __use;
> > +   unsigned long   lastuse;
> > union {
> > struct dst_entry *next;
> > struct rtable*rt_next;
> > 
> > 
> 
> Well, after this patch, we grow dst_entry by 8 bytes :
With my .config, it doesn't grow, perhaps because I don't enable CONFIG_NET_CLS_ROUTE. I will move tclassid under ops.

> 
> sizeof(struct dst_entry)=0xd0
> offsetof(struct dst_entry, input)=0x68
> offsetof(struct dst_entry, output)=0x70
> offsetof(struct dst_entry, __refcnt)=0xb4
> offsetof(struct dst_entry, lastuse)=0xc0
> offsetof(struct dst_entry, __use)=0xb8
> sizeof(struct rtable)=0x140
> 
> 
> So we dirty two cache lines instead of one, unless your CPU has 128-byte
> cache lines?
> 
> I am quite surprised that my patch to not change lastuse if already set to
> jiffies changes nothing...
> 
> If you have some time, could you also test this (unrelated) patch ?
> 
> We can avoid dirtying a cache line of the loopback device all the time.
> 
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index f2a6e71..0a4186a 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
> return 0;
> }
>  #endif
> -   dev->last_rx = jiffies;
> +#ifdef CONFIG_SMP
> +   if (dev->last_rx != jiffies)
> +#endif
> +   dev->last_rx = jiffies;
>  
> /* it's OK to use per_cpu_ptr() because BHs are off */
> pcpu_lstats = netdev_priv(dev);
> 
Although I didn't test it, I don't think it's OK. The key is that __refcnt shares the same cache line with ops/input/output.

-yanmin




Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
> From: Eric Dumazet <[EMAIL PROTECTED]>
> Date: Fri, 15 Feb 2008 15:21:48 +0100
> 
> > On linux-2.6.25-rc1 x86_64 :
> > 
> > offsetof(struct dst_entry, lastuse)=0xb0
> > offsetof(struct dst_entry, __refcnt)=0xb8
> > offsetof(struct dst_entry, __use)=0xbc
> > offsetof(struct dst_entry, next)=0xc0
> > 
> > So it should be optimal... I don't know why tbench prefers __refcnt being
> > on 0xc0, since in this case lastuse will be on a different cache line...
> > 
> > Each incoming IP packet will need to change lastuse, __refcnt and __use,
> > so keeping them in the same cache line is a win.
> > 
> > I suspect then that even this patch could help tbench, since it avoids
> > writing lastuse...
> 
> I think your suspicions are right, and even moreso
> it helps to keep __refcnt out of the same cache line
> as input/output/ops which are read-almost-entirely :-
I think you are right. The issue is these three variables sharing the same cache line with input/output/ops.

> )
> 
> I haven't done an exhaustive analysis, but it seems that
> the write traffic to lastuse and __refcnt are about the
> same.  However if we find that __refcnt gets hit more
> than lastuse in this workload, it explains the regression.
I also think __refcnt is the key. I did a new test by adding 2 unsigned long paddings before lastuse, so the 3 members are moved to the next cache line. The performance is recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 +0800
@@ -52,11 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required */
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
+
+   struct dst_entry*path;
 
 #ifdef CONFIG_NET_CLS_ROUTE
__u32   tclassid;
@@ -70,10 +69,12 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




Re: tbench regression in 2.6.25-rc1

2008-02-17 Thread Zhang, Yanmin
On Fri, 2008-02-15 at 15:21 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > On Fri, 2008-02-15 at 07:05 +0100, Eric Dumazet wrote:
> >   
> >> Zhang, Yanmin a écrit :
> >> 
> >>> Comparing with kernel 2.6.24, tbench result has regression with
> >>> 2.6.25-rc1.
> >>>
> >>> 1) On 2 quad-core processor stoakley: 4%.
> >>> 2) On 4 quad-core processor tigerton: more than 30%.
> >>>
> >>> bisect located below patch.
> >>>
> >>> b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
> >>> commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
> >>> Author: Herbert Xu <[EMAIL PROTECTED]>
> >>> Date:   Tue Nov 13 21:33:32 2007 -0800
> >>>
> >>> [IPV6]: Move nfheader_len into rt6_info
> >>> 
> >>> The dst member nfheader_len is only used by IPv6.  It's also currently
> >>> creating a rather ugly alignment hole in struct dst.  Therefore this
> >>> patch moves it from there into struct rt6_info.
> >>>
> >>>
> >>> As tbench uses IPv4, the patch's real impact on IPv4 is that it deletes
> >>> nfheader_len from dst_entry. It might change the cache line alignment.
> >>>
> >>> To verify my finding, I just added nfheader_len back to dst_entry in
> >>> 2.6.25-rc1 and reran tbench on the 2 machines. Performance could be
> >>> recovered completely.
> >>>
> >>> I started cpu_number*2 tbench processes. On my 16-core tigerton:
> >>> #./tbench_srv &
> >>> #./tbench 32 127.0.0.1
> >>>
> >>> -yanmin
> >>>   
> >> Yup. struct dst is sensitive to alignements, especially for benches.
> >>
> >> In the real world, we need to make sure that the next pointer starts at
> >> a cache line boundary (or a little bit after), so that RT cache lookups
> >> use one cache line per entry instead of two. This permits better
> >> behavior under DDoS attacks.
> >>
> >> (check commit 1e19e02ca0c5e33ea73a25127dbe6c3b8fcaac4b for reference)
> >>
> >> Are you using a 64 or a 32 bit kernel ?
> >> 
> > A 64-bit x86-64 machine. On another 4-way Madison Itanium machine, tbench
> > has a similar regression.
> >
> >   
> 
> On linux-2.6.25-rc1 x86_64 :
> 
> offsetof(struct dst_entry, lastuse)=0xb0
> offsetof(struct dst_entry, __refcnt)=0xb8
> offsetof(struct dst_entry, __use)=0xbc
> offsetof(struct dst_entry, next)=0xc0
> 
> So it should be optimal... I don't know why tbench prefers __refcnt being
> on 0xc0, since in this case lastuse will be on a different cache line...
> 
> Each incoming IP packet will need to change lastuse, __refcnt and __use, 
> so keeping them in the same cache line is a win.
> 
> I suspect then that even this patch could help tbench, since it avoids
> writing lastuse...
> 
> diff --git a/include/net/dst.h b/include/net/dst.h
> index e3ac7d0..24d3c4e 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -147,7 +147,8 @@ static inline void dst_use(struct dst_entry *dst, unsigned long time)
>  {
> dst_hold(dst);
> dst->__use++;
> -   dst->lastuse = time;
> +   if (time != dst->lastuse)
> +   dst->lastuse = time;
>  }
I did a quick test and this patch doesn't help tbench.

Thanks,
-yanmin




Re: tbench regression in 2.6.25-rc1

2008-02-14 Thread Zhang, Yanmin
On Fri, 2008-02-15 at 07:05 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > Comparing with kernel 2.6.24, tbench result has regression with
> > 2.6.25-rc1.
> > 
> > 1) On 2 quad-core processor stoakley: 4%.
> > 2) On 4 quad-core processor tigerton: more than 30%.
> > 
> > bisect located below patch.
> > 
> > b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
> > commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
> > Author: Herbert Xu <[EMAIL PROTECTED]>
> > Date:   Tue Nov 13 21:33:32 2007 -0800
> > 
> > [IPV6]: Move nfheader_len into rt6_info
> > 
> > The dst member nfheader_len is only used by IPv6.  It's also currently
> > creating a rather ugly alignment hole in struct dst.  Therefore this
> > patch moves it from there into struct rt6_info.
> > 
> > 
> > As tbench uses IPv4, the patch's real impact on IPv4 is that it deletes
> > nfheader_len from dst_entry. It might change the cache line alignment.
> > 
> > To verify my finding, I just added nfheader_len back to dst_entry in
> > 2.6.25-rc1 and reran tbench on the 2 machines. Performance could be
> > recovered completely.
> > 
> > I started cpu_number*2 tbench processes. On my 16-core tigerton:
> > #./tbench_srv &
> > #./tbench 32 127.0.0.1
> > 
> > -yanmin
> 
> Yup. struct dst is sensitive to alignements, especially for benches.
> 
> In the real world, we need to make sure that the next pointer starts at a
> cache line boundary (or a little bit after), so that RT cache lookups use
> one cache line per entry instead of two. This permits better behavior under
> DDoS attacks.
> 
> (check commit 1e19e02ca0c5e33ea73a25127dbe6c3b8fcaac4b for reference)
> 
> Are you using a 64 or a 32 bit kernel ?
A 64-bit x86-64 machine. On another 4-way Madison Itanium machine, tbench has a similar regression.

-yanmin




tbench regression in 2.6.25-rc1

2008-02-14 Thread Zhang, Yanmin
Compared with kernel 2.6.24, the tbench result has a regression with
2.6.25-rc1.

1) On the 2 quad-core-processor stoakley: 4%.
2) On the 4 quad-core-processor tigerton: more than 30%.

A bisect located the patch below.

b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Tue Nov 13 21:33:32 2007 -0800

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6.  It's also currently
creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.


As tbench uses IPv4, the patch's real impact on IPv4 is that it deletes nfheader_len from dst_entry. It might change the cache line alignment.

To verify my finding, I just added nfheader_len back to dst_entry in 2.6.25-rc1
and reran tbench on the 2 machines. Performance could be recovered completely.

I started cpu_number*2 tbench processes. On my 16-core tigerton:
#./tbench_srv &
#./tbench 32 127.0.0.1

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-22 Thread Zhang, Yanmin
On Wed, 2008-01-23 at 08:42 +0800, Zhang, Yanmin wrote:
> On Tue, 2008-01-22 at 10:36 -0800, Rick Jones wrote:
> > When parsing the -P option in scan_socket_args() of src/nettest_bsd.c, 
> > netperf is using "break_args()" from src/netsh.c which indeed if the 
> > command line says "-P 12345" will set both the local and remote port 
> > numbers to 12345.  If instead you were to say "-P 12345,"  it will use 
> > 12345 only for the netperf side.  If you say "-P ,12345" it will use 
> > 12345 only for the netserver side.  To set both sides at once to 
> > different values it would be "-P 12345,54321"
> > 
> > In theory, send_udp_rr() in src/nettest_bsd.c (or I suppose 
> > scan_socket_args() could have more code added to it to check for a UDP 
> > test over loopback, but probably needs to be a check for any local IP, 
> > and unless this becomes something bigger than "Doctor! Doctor! It hurts 
> > when I do this!" :) I'm inclined to leave it as caveat benchmarker and 
> > perhaps some additional text in the manual.
> I will instrument the kernel to see if it works as expected.
> 
> When an issue is found, we shouldn't escape by saying it has nothing to do
> with us.
> 
I went through the netperf source again and did a step-by-step debug with gdb.

Both sides bind 0.0.0.0:12384 to their own sockets; netperf binds first. When netperf calls connect to configure the server 127.0.0.1:12384, the kernel chooses socket A's queue. The kernel is correct.

Another question is that no matter who binds 0.0.0.0:12384 first, netperf always sends packets to its own socket. I suspect the connect API called by netperf to configure the server ip/port has this side effect, as the server doesn't call connect.

It would be good to add additional text to the netperf manual.

Sorry, and thanks for your kind responses.

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-22 Thread Zhang, Yanmin
On Tue, 2008-01-22 at 10:36 -0800, Rick Jones wrote:
> When parsing the -P option in scan_socket_args() of src/nettest_bsd.c, 
> netperf is using "break_args()" from src/netsh.c which indeed if the 
> command line says "-P 12345" will set both the local and remote port 
> numbers to 12345.  If instead you were to say "-P 12345,"  it will use 
> 12345 only for the netperf side.  If you say "-P ,12345" it will use 
> 12345 only for the netserver side.  To set both sides at once to 
> different values it would be "-P 12345,54321"
> 
> In theory, send_udp_rr() in src/nettest_bsd.c (or I suppose 
> scan_socket_args() could have more code added to it to check for a UDP 
> test over loopback, but probably needs to be a check for any local IP, 
> and unless this becomes something bigger than "Doctor! Doctor! It hurts 
> when I do this!" :) I'm inclined to leave it as caveat benchmarker and 
> perhaps some additional text in the manual.
I will instrument the kernel to see if it works as expected.

When an issue is found, we shouldn't escape by saying it has nothing to do with us.

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-21 Thread Zhang, Yanmin
On Tue, 2008-01-22 at 07:27 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > On Tue, 2008-01-22 at 13:24 +0800, Zhang, Yanmin wrote:
> >> On Mon, 2008-01-14 at 09:46 -0800, Rick Jones wrote:
> >>>>> *) netperf/netserver support CPU affinity within themselves with the 
> >>>>> global -T option to netperf.  Is the result with taskset much 
> >>>>> different? 
> >>>>>   The equivalent to the above would be to run netperf with:
> >>>>>
> >>>>> ./netperf -T 0,7 ..
> >>>> I checked the source codes and didn't find this option.
> >>>> I use netperf V2.3 (I found the number in the makefile).
> >>> Indeed, that version pre-dates the -T option.  If you weren't already 
> >>> chasing a regression I'd suggest an upgrade to 2.4.mumble.  Once you are 
> >>> at a point where changing another variable won't muddle things you may 
> >>> want to consider upgrading.
> >>>
> >>> happy benchmarking,
> >> Rick,
> >>
> >> I found my UDP_RR testing just loops inside netperf instead of 
> >> ping-ponging between netserver and netperf. Is that correct? TCP_RR is ok.
> >>
> >> #./netserver
> >> #./netperf -t UDP_RR -l 60 -H 127.0.0.1 -i 30,3 -I 99,5 -- -P 12384 -r 1,1
> > I dug into netperf and netserver.
> > 
> > netperf binds ip 0.0.0.0 and port 12384 to its own socket. netserver binds
> > ip 127.0.0.1 and port 12384 to its own socket. Then netperf calls connect
> > to set up server 127.0.0.1 and port 12384 and starts sending UDP packets,
> > but all the packets netperf sends are just received by netperf itself.
> > netserver doesn't receive any packet.
> > 
> > I think the netperf binding should fail, or netperf shouldn't get the
> > packets it sends out, because netserver has already bound port 12384.
> > 
> > I am wondering if the UDP stack in the kernel has a bug.
> 
> If :
> - socket A is bound to 0.0.0.0:12384 and
> - socket B is bound to 127.0.0.1:12384
> 
> Then packets sent to 127.0.0.1:12384 should be queued for socket B
> 
> If they are queued to socket A as you believe it is currently done, then yes 
> there is a bug in kernel.
I double-checked, and they are queued to socket A. If I define a different
local port for netperf, packets are queued to socket B.

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-21 Thread Zhang, Yanmin
On Mon, 2008-01-21 at 22:22 -0800, David Miller wrote:
> From: "Zhang, Yanmin" <[EMAIL PROTECTED]>
> Date: Tue, 22 Jan 2008 14:07:19 +0800
> 
> > I am wondering if UDP stack in kernel has a bug.
> 
> If one server binds to INADDR_ANY with port N, then any other socket
> can be bound to a specific IP address with port N.  When packets
> come in destined for port N, the delivery will be prioritized
> to whichever socket has the more specific and matching binding.
What does 'more specific' mean here? I assume 127.0.0.1 should be
prioritized over 0.0.0.0, which means packets should be queued to the
127.0.0.1 socket first.

> 
> So the kernel is fine.
But the kernel currently queues the packets to the 0.0.0.0 socket.

> 
> Netperf just needs to be more careful in order to handle this kind of
> case more cleanly.
It would be better if the kernel behaved more reasonably here.

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-21 Thread Zhang, Yanmin
On Tue, 2008-01-22 at 13:24 +0800, Zhang, Yanmin wrote:
> On Mon, 2008-01-14 at 09:46 -0800, Rick Jones wrote:
> > >>*) netperf/netserver support CPU affinity within themselves with the 
> > >>global -T option to netperf.  Is the result with taskset much different? 
> > >>   The equivalent to the above would be to run netperf with:
> > >>
> > >>./netperf -T 0,7 ..
> > > 
> > > I checked the source codes and didn't find this option.
> > > I use netperf V2.3 (I found the number in the makefile).
> > 
> > Indeed, that version pre-dates the -T option.  If you weren't already 
> > chasing a regression I'd suggest an upgrade to 2.4.mumble.  Once you are 
> > at a point where changing another variable won't muddle things you may 
> > want to consider upgrading.
> > 
> > happy benchmarking,
> Rick,
> 
> I found my UDP_RR testing just loops inside netperf instead of ping-ponging
> between netserver and netperf. Is that correct? TCP_RR is ok.
> 
> #./netserver
> #./netperf -t UDP_RR -l 60 -H 127.0.0.1 -i 30,3 -I 99,5 -- -P 12384 -r 1,1
I dug into netperf and netserver.

netperf binds ip 0.0.0.0 and port 12384 to its own socket. netserver binds ip
127.0.0.1 and port 12384 to its own socket. Then netperf calls connect to set
up server 127.0.0.1 and port 12384 and starts sending UDP packets, but all the
packets netperf sends are just received by netperf itself. netserver doesn't
receive any packet.

I think the netperf binding should fail, or netperf shouldn't get the packets
it sends out, because netserver has already bound port 12384.

I am wondering if the UDP stack in the kernel has a bug.

TCP_RR testing doesn't have this issue.

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-21 Thread Zhang, Yanmin
On Mon, 2008-01-14 at 09:46 -0800, Rick Jones wrote:
> >>*) netperf/netserver support CPU affinity within themselves with the 
> >>global -T option to netperf.  Is the result with taskset much different? 
> >>   The equivalent to the above would be to run netperf with:
> >>
> >>./netperf -T 0,7 ..
> > 
> > I checked the source codes and didn't find this option.
> > I use netperf V2.3 (I found the number in the makefile).
> 
> Indeed, that version pre-dates the -T option.  If you weren't already 
> chasing a regression I'd suggest an upgrade to 2.4.mumble.  Once you are 
> at a point where changing another variable won't muddle things you may 
> want to consider upgrading.
> 
> happy benchmarking,
Rick,

I found my UDP_RR testing just loops inside netperf instead of ping-ponging
between netserver and netperf. Is that correct? TCP_RR is ok.

#./netserver
#./netperf -t UDP_RR -l 60 -H 127.0.0.1 -i 30,3 -I 99,5 -- -P 12384 -r 1,1

Thanks,
-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-15 Thread Zhang, Yanmin
On Wed, 2008-01-16 at 08:34 +0800, Zhang, Yanmin wrote:
> On Mon, 2008-01-14 at 21:53 +1100, Herbert Xu wrote:
> > On Mon, Jan 14, 2008 at 08:44:40AM +0000, Ilpo Järvinen wrote:
> > >
> > > > > I tried to use bisect to locate the bad patch between 2.6.22 and 
> > > > > 2.6.23-rc1,
> > > > > but the bisected kernel wasn't stable and went crazy.
> > > 
> > > TCP work between that is very much non-existing.
> > 
> > Make sure you haven't switched between SLAB/SLUB while testing this.
> I am sure I didn't switch. In addition, I tried both SLAB and SLUB and made
> sure the regression is still there with CONFIG_SLAB=y.
I retried the bisect between 2.6.22 and 2.6.23-rc1. This time I enabled
CONFIG_SLAB=y, deleted the warmup procedure from the testing scripts, and
bound the 2 processes to the same logical processor. The regression is about
20%, which is larger than when binding the 2 processes to different cores.

The new bisect reported that the CFS core patch causes it. The results of
every step look stable.

dd41f596cda0d7d6e4a8b139ffdfabcefdd46528 is first bad commit
commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528
Author: Ingo Molnar <[EMAIL PROTECTED]>
Date:   Mon Jul 9 18:51:59 2007 +0200

sched: cfs core code

apply the CFS core code.

this change switches over the scheduler core to CFS's modular
design and makes use of kernel/sched_fair/rt/idletask.c to implement
Linux's scheduling policies.

thanks to Andrew Morton and Thomas Gleixner for lots of detailed review
feedback and for fixlets.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
Signed-off-by: Mike Galbraith <[EMAIL PROTECTED]>
Signed-off-by: Dmitry Adamushko <[EMAIL PROTECTED]>
Signed-off-by: Srivatsa Vaddagiri <[EMAIL PROTECTED]>


-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-15 Thread Zhang, Yanmin
On Mon, 2008-01-14 at 21:53 +1100, Herbert Xu wrote:
> On Mon, Jan 14, 2008 at 08:44:40AM +0000, Ilpo Järvinen wrote:
> >
> > > > I tried to use bisect to locate the bad patch between 2.6.22 and 
> > > > 2.6.23-rc1,
> > > > but the bisected kernel wasn't stable and went crazy.
> > 
> > TCP work between that is very much non-existing.
> 
> Make sure you haven't switched between SLAB/SLUB while testing this.
I am sure I didn't switch. In addition, I tried both SLAB and SLUB and made
sure the regression is still there with CONFIG_SLAB=y.

Thanks,
-yanmin



Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-14 Thread Zhang, Yanmin
On Mon, 2008-01-14 at 11:21 +0200, Ilpo Järvinen wrote:
> On Mon, 14 Jan 2008, Ilpo Järvinen wrote:
> 
> > On Fri, 11 Jan 2008, Zhang, Yanmin wrote:
> > 
> > > On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: 
> > > > 
> > > > As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's
> > > > regression is between 16%~11%.
> > > > 
> > > > I tried to use bisect to locate the bad patch between 2.6.22 and 
> > > > 2.6.23-rc1,
> > > > but the bisected kernel wasn't stable and went crazy.
> > 
> > TCP work between that is very much non-existing.
> 
> I _really_ meant 2.6.22 - 2.6.23-rc1, not 2.6.24-rc1 in case you had a 
> typo
I did bisect between 2.6.22 and 2.6.23-rc1. I also tested the latest 2.6.24-rc.

>  there which is not that uncommon while typing kernel versions... :-)
Thanks. I will retry the bisect and bind the server/client to the same logical
processor; I hope the results will be stable this time.

Manual testing showed the same or more regression when I bind the processes
to the same cpu.


Thanks a lot!

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-13 Thread Zhang, Yanmin
On Fri, 2008-01-11 at 09:56 -0800, Rick Jones wrote:
> >>The test command is:
> >>#sudo taskset -c 7 ./netserver
> >>#sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 
> >>-- -r 1,1
> 
> A couple of comments/questions on the command lines:
Thanks for your kind comments.

> 
> *) netperf/netserver support CPU affinity within themselves with the 
> global -T option to netperf.  Is the result with taskset much different? 
>The equivalent to the above would be to run netperf with:
> 
> ./netperf -T 0,7 ..
I checked the source code and didn't find this option.
I use netperf V2.3 (I found the version number in the makefile).

> .
> 
> The one possibly salient difference between the two is that when done 
> within netperf, the initial process creation will take place wherever 
> the scheduler wants it.
> 
> *) The -i option to set the confidence iteration count will silently cap 
> the max at 30.
Indeed, you are right.

-yanmin




Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22

2008-01-11 Thread Zhang, Yanmin
On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: 
> The regression is:
> 1) stoakley with 2 quad-core processors: 11%;
> 2) Tulsa with 4 dual-core (+HyperThreading) processors: 13%;
I have a new update on this issue, and I also cc the netdev maillist.
Thanks to David Miller for pointing me to the netdev maillist.

> 
> The test command is:
> #sudo taskset -c 7 ./netserver
> #sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- 
> -r 1,1
> 
> As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's
> regression is between 16%~11%.
> 
> I tried to use bisect to locate the bad patch between 2.6.22 and 2.6.23-rc1,
> but the bisected kernel wasn't stable and went crazy.
> 
> I tried both CONFIG_SLUB=y and CONFIG_SLAB=y to make sure SLUB isn't the
> culprit.
> 
> The oprofile data with CONFIG_SLAB=y. Top cpu utilizations are:
> 1) 2.6.22 
> 2067379   9.4888  vmlinux  schedule
> 1873604   8.5994  vmlinux  mwait_idle
> 1568131   7.1974  vmlinux  resched_task
> 1066976   4.8972  vmlinux  tcp_v4_rcv
> 986641    4.5285  vmlinux  tcp_rcv_established
> 979518    4.4958  vmlinux  find_busiest_group
> 767069    3.5207  vmlinux  sock_def_readable
> 736808    3.3818  vmlinux  tcp_sendmsg
> 595889    2.7350  vmlinux  task_rq_lock
> 557193    2.5574  vmlinux  tcp_ack
> 470570    2.1598  vmlinux  __mod_timer
> 392220    1.8002  vmlinux  __alloc_skb
> 358106    1.6436  vmlinux  skb_release_data
> 313372    1.4383  vmlinux  skb_clone
> 
> 2) 2.6.24-rc7
> 2668426  12.4497  vmlinux  schedule
> 955698    4.4589  vmlinux  skb_release_data
> 836311    3.9018  vmlinux  tcp_v4_rcv
> 762398    3.5570  vmlinux  skb_release_all
> 728907    3.4007  vmlinux  task_rq_lock
> 705037    3.2894  vmlinux  __wake_up
> 694206    3.2388  vmlinux  __mod_timer
> 617616    2.8815  vmlinux  mwait_idle
> 
> It looks like tcp in 2.6.22 sends more packets, but frees far fewer skbs
> than 2.6.24-rc6. tcp_rcv_established in 2.6.22 stands out in cpu utilization.
I instrumented the kernel to capture the function call counts.
1) 2.6.22
skb_release_data: 50148649
tcp_ack:          25062858
tcp_transmit_skb: 25063150
tcp_v4_rcv:       25063279

2) 2.6.24-rc6
skb_release_data: 21429692
tcp_ack:          10707710
tcp_transmit_skb: 10707866
tcp_v4_rcv:       10707959

The data doesn't show that 2.6.22 sends more packets while freeing far fewer
skbs than 2.6.24-rc6.

The data does show that skb_release_data in kernel 2.6.22 is called more than
twice as often as in 2.6.24-rc6, while the netperf result shows only about a
10% regression.

As each packet carries only 1 byte, I suspect 2.6.24-rc6 tries to merge
packets after waiting for some latency. 2.6.22 might not have that wait, or
its latency is very small, so 2.6.22 sends the packets almost immediately. I
will check the source code later.

-yanmin




Re: [PATCH] ixgb: add PCI Error recovery callbacks

2006-07-05 Thread Zhang, Yanmin
On Thu, 2006-07-06 at 03:44, Linas Vepstas wrote:
> On Wed, Jul 05, 2006 at 08:49:27AM -0700, Auke Kok wrote:
> > Zhang, Yanmin wrote:
> > >On Fri, 2006-06-30 at 00:26, Linas Vepstas wrote:
> > >>Adds PCI Error recovery callbacks to the Intel 10-gigabit ethernet
> > >>ixgb device driver. Lightly tested, works.
> > >
> > >Both pci_disable_device and ixgb_down would access the device. It doesn't
> > >follow Documentation/pci-error-recovery.txt that error_detected shouldn't 
> > >do
> > >any access to the device.
> > 
> > Moreover, it was Linas who wrote this documentation in the first place :)
> 
> On the pSeries, it's harmless to try to do i/o; the i/o will be blocked.
In the future, we might move the pci error recovery code into generic code to
support other platforms, which might not block I/O. So it's better to follow
Documentation/pci-error-recovery.txt when adding error recovery code to a
driver.


Re: [PATCH] ixgb: add PCI Error recovery callbacks

2006-07-02 Thread Zhang, Yanmin
On Fri, 2006-06-30 at 00:26, Linas Vepstas wrote:
> Adds PCI Error recovery callbacks to the Intel 10-gigabit ethernet
> ixgb device driver. Lightly tested, works.
> 
> Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
> +/**
> + * ixgb_io_error_detected() - called when PCI error is detected
> + * @pdev: pointer to pci device with error
> + * @state: pci channel state after error
> + *
> + * This callback is called by the PCI subsystem whenever
> + * a PCI bus error is detected.
> + */
> +static pci_ers_result_t ixgb_io_error_detected (struct pci_dev *pdev,
> +  enum pci_channel_state state)
> +{
> + struct net_device *netdev = pci_get_drvdata(pdev);
> + struct ixgb_adapter *adapter = netdev->priv;
> +
> + if(netif_running(netdev))
> + ixgb_down(adapter, TRUE);
> +
> + pci_disable_device(pdev);
> +
> + /* Request a slot reset. */
> + return PCI_ERS_RESULT_NEED_RESET;
> +}
Both pci_disable_device and ixgb_down access the device. This doesn't follow
Documentation/pci-error-recovery.txt, which says error_detected shouldn't do
any access to the device.


RE: [PATCH] PCI Error Recovery: e1000 network device driver

2006-04-28 Thread Zhang, Yanmin
>>-Original Message-
>>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Linas Vepstas
>>Sent: March 25, 2006 11:22
>>To: Greg KH
>>Cc: Jeff Garzik; Ronciak, John; Brandeburg, Jesse; Kirsher, Jeffrey T; 
>>linux-kernel@vger.kernel.org; netdev@vger.kernel.org;
>>[EMAIL PROTECTED]; [EMAIL PROTECTED]; Linux NICS
>>Subject: Re: [PATCH] PCI Error Recovery: e1000 network device driver
>>
>>On Fri, Mar 24, 2006 at 06:22:06PM -0800, Greg KH wrote:
>>> ... a bit
>>> different from the traditional kernel coding style.
>>
>>Sorry, this is due to inattention on my part; I get cross-eyed
>>after staring at the same code for too long. The patch below should
>>fix things.
>>
>>--linas
>>
>>
>>[PATCH] PCI Error Recovery: e1000 network device driver
>>
>>Various PCI bus errors can be signaled by newer PCI controllers.  This
>>patch adds the PCI error recovery callbacks to the intel gigabit
>>ethernet e1000 device driver. The patch has been tested, and appears
>>to work well.
>>
>>Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
>>Acked-by: Jesse Brandeburg <[EMAIL PROTECTED]>
>>
>>
>>
>> drivers/net/e1000/e1000_main.c |  114 -
>> 1 files changed, 113 insertions(+), 1 deletion(-)
>>
>>Index: linux-2.6.16-git6/drivers/net/e1000/e1000_main.c
>>===
>>--- linux-2.6.16-git6.orig/drivers/net/e1000/e1000_main.c  2006-03-23 15:48:01.0 -0600
>>+++ linux-2.6.16-git6/drivers/net/e1000/e1000_main.c  2006-03-24 15:14:40.431371705 -0600
>>@@ -226,6 +226,16 @@ static int e1000_resume(struct pci_dev *
>>+/**
>>+ * e1000_io_error_detected - called when PCI error is detected
>>+ * @pdev: Pointer to PCI device
>>+ * @state: The current pci conneection state
>>+ *
>>+ * This function is called after a PCI bus error affecting
>>+ * this device has been detected.
>>+ */
>>+static pci_ers_result_t e1000_io_error_detected(struct pci_dev *pdev, 
>>pci_channel_state_t state)
>>+{
>>+ struct net_device *netdev = pci_get_drvdata(pdev);
>>+ struct e1000_adapter *adapter = netdev->priv;
>>+
>>+ netif_device_detach(netdev);
>>+
>>+ if (netif_running(netdev))
>>+ e1000_down(adapter);
[YM] e1000_down does device IO, so it's not appropriate to call it here.


>>+
>>+ /* Request a slot reset. */
>>+ return PCI_ERS_RESULT_NEED_RESET;
>>+}
>>+
>>+/**
>>+ * e1000_io_slot_reset - called after the pci bus has been reset.
>>+ * @pdev: Pointer to PCI device
>>+ *
>>+ * Restart the card from scratch, as if from a cold-boot. Implementation
>>+ * resembles the first-half of the e1000_resume routine.
>>+ */
>>+static pci_ers_result_t e1000_io_slot_reset(struct pci_dev *pdev)
>>+{
>>+ struct net_device *netdev = pci_get_drvdata(pdev);
>>+ struct e1000_adapter *adapter = netdev->priv;
>>+
>>+ if (pci_enable_device(pdev)) {
>>+ printk(KERN_ERR "e1000: Cannot re-enable PCI device after 
>>reset.\n");
>>+ return PCI_ERS_RESULT_DISCONNECT;
>>+ }
>>+ pci_set_master(pdev);
>>+
>>+ pci_enable_wake(pdev, 3, 0);
>>+ pci_enable_wake(pdev, 4, 0); /* 4 == D3 cold */
[YM] I suggest using PCI_D3hot and PCI_D3cold instead of hard-coded numbers.


Re: [PATCH] PCI Error Recovery: e100 network device driver

2006-04-28 Thread Zhang, Yanmin
On Fri, 2006-04-07 at 06:24, Linas Vepstas wrote:
> Please apply and forward upstream.
> 
> --linas
> 
> [PATCH] PCI Error Recovery: e100 network device driver
> 
> Various PCI bus errors can be signaled by newer PCI controllers.  This
> patch adds the PCI error recovery callbacks to the intel ethernet e100
> device driver. The patch has been tested, and appears to work well.
> 
> Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
> Acked-by: Jesse Brandeburg <[EMAIL PROTECTED]>
I am enabling PCI Express AER (Advanced Error Reporting) in the kernel and am
glad to see many drivers supporting pci error handling.


> 
> 
> 
>  drivers/net/e100.c |   65 +
>  1 files changed, 65 insertions(+)
> 
> Index: linux-2.6.17-rc1/drivers/net/e100.c
> ===
> --- linux-2.6.17-rc1.orig/drivers/net/e100.c  2006-04-05 09:56:06.0 -0500
> +++ linux-2.6.17-rc1/drivers/net/e100.c   2006-04-06 15:17:29.0 -0500
> @@ -2781,6 +2781,70 @@ static void e100_shutdown(struct pci_dev
>  }
>  
> 
> +/* -- PCI Error Recovery infrastructure  -- */
> +/** e100_io_error_detected() is called when PCI error is detected */
> +static pci_ers_result_t e100_io_error_detected(struct pci_dev *pdev, 
> pci_channel_state_t state)
> +{
> + struct net_device *netdev = pci_get_drvdata(pdev);
> +
> + /* Same as calling e100_down(netdev_priv(netdev)), but generic */
> + netdev->stop(netdev);
The e100 stop method e100_close calls e100_down, which does IO. Doesn't that
violate the rule defined in Documentation/pci-error-recovery.txt that
error_detected shouldn't do any IO?
I suggest creating a new function, such as e100_close_noreset.


> +
> + /* Detach; put netif into state similar to hotplug unplug */
> + netif_poll_enable(netdev);
> + netif_device_detach(netdev);
> +
> + /* Request a slot reset. */
> + return PCI_ERS_RESULT_NEED_RESET;
> +}
> +
> +/** e100_io_slot_reset is called after the pci bus has been reset.
> + *  Restart the card from scratch. */
> +static pci_ers_result_t e100_io_slot_reset(struct pci_dev *pdev)
> +{
> + struct net_device *netdev = pci_get_drvdata(pdev);
> + struct nic *nic = netdev_priv(netdev);
> +
> + if(pci_enable_device(pdev)) {
> + printk(KERN_ERR "e100: Cannot re-enable PCI device after 
> reset.\n");
> + return PCI_ERS_RESULT_DISCONNECT;
> + }
> + pci_set_master(pdev);
> +
> + /* Only one device per card can do a reset */
> + if (0 != PCI_FUNC (pdev->devfn))
> + return PCI_ERS_RESULT_RECOVERED;
> + e100_hw_reset(nic);
> + e100_phy_init(nic);
Should pci_set_master be called after e100_hw_reset, as in e100_probe?


> +
> + return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +/** e100_io_resume is called when the error recovery driver
> + *  tells us that its OK to resume normal operation.
> + */
> +static void e100_io_resume(struct pci_dev *pdev)
> +{
> + struct net_device *netdev = pci_get_drvdata(pdev);
> + struct nic *nic = netdev_priv(netdev);
> +
> + /* ack any pending wake events, disable PME */
> + pci_enable_wake(pdev, 0, 0);
> +
> + netif_device_attach(netdev);
> + if(netif_running(netdev)) {
> + e100_open (netdev);
> + mod_timer(&nic->watchdog, jiffies);
e100_open calls e100_up, which already sets the watchdog timer. Why set it
again?

> + }
> +}
> +