Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-10 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> Let me work on the readv/writev support (unless someone beat me to it).

Please also move it to the address_space_operations level.  Yes, there are
performance benefits from simply omitting the LFS checks, the mmap
consistency fixes, etc.  But they're there for a reason.
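
(For readers who want to picture what "the address_space_operations level" means here: a rough sketch only, not Ken's patch.  The ->direct_IO hook signature is recalled from 2.6-era kernels and may differ in detail; raw_fast_direct_IO() and raw_fast_aops are invented names.)

#include <linux/fs.h>
#include <linux/aio.h>
#include <linux/uio.h>

/* Hypothetical slim O_DIRECT path hooked in at the a_ops level for the
 * raw/block device mapping.  Anything it cannot handle would be punted
 * to the generic path so the LFS checks and mmap consistency fixes
 * mentioned above still apply. */
static ssize_t raw_fast_direct_IO(int rw, struct kiocb *iocb,
                                  const struct iovec *iov, loff_t offset,
                                  unsigned long nr_segs)
{
        /* ... build and submit bios directly; fall back to
         * __blockdev_direct_IO() for anything unusual ... */
        return -EINVAL;         /* placeholder in this sketch */
}

static struct address_space_operations raw_fast_aops = {
        .direct_IO      = raw_fast_direct_IO,
        /* .readpage, .writepage, etc. as for the normal blockdev mapping */
};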



RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-10 Thread Chen, Kenneth W
Andrew Morton wrote on Thursday, March 10, 2005 12:31 PM
> >  > Fine-grained alignment is probably too hard, and it should fall back to
> >  > __blockdev_direct_IO().
> >  >
> >  > Does it do the right thing with a request which is non-page-aligned, but
> >  > 512-byte aligned?
> >  >
> >  > readv and writev?
> >  >
> >
> >  That's why direct_io_worker() is slower.  It does everything and handles
> >  every possible usage scenario out there.  I hope making the function fatter
> >  is not in the plan.
>
> We just cannot make a change like this if it does not support readv and
> writev well, and if it does not support down-to-512-byte size and
> alignment.  It will break applications.

I must have misread your mail.  Yes, it does support 512-byte size and alignment.
Let me work on the readv/writev support (unless someone beat me to it).

- Ken
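
(A minimal sketch of the 512-byte size/alignment rule being discussed, not code from either patch; the helper name is invented.)

#include <linux/types.h>

/* A raw/blockdev O_DIRECT request qualifies for a slim submission path
 * only if the user buffer address, the file offset and the length are
 * all multiples of the 512-byte sector size; anything else (including,
 * once readv/writev is supported, any misaligned segment) should fall
 * back to the generic __blockdev_direct_IO() path. */
static inline int raw_rw_sector_aligned(unsigned long uaddr,
                                        loff_t offset, size_t len)
{
        return ((uaddr | offset | len) & 511) == 0;
}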




Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-10 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> Losing 6% just from the Linux kernel is a huge deal for this type of benchmark.
>  People work for days to implement features which might give sub-percentage
>  gains.  Making software run faster is not easy, but making software run slower
>  apparently is a fairly easy task.
> 
> 

heh

> 
>  > Fine-grained alignment is probably too hard, and it should fall back to
>  > __blockdev_direct_IO().
>  >
>  > Does it do the right thing with a request which is non-page-aligned, but
>  > 512-byte aligned?
>  >
>  > readv and writev?
>  >
> 
>  That's why direct_io_worker() is slower.  It does everything and handles
>  every possible usage scenario out there.  I hope making the function fatter
>  is not in the plan.

We just cannot make a change like this if it does not support readv and
writev well, and if it does not support down-to-512-byte size and
alignment.  It will break applications.



RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-10 Thread Chen, Kenneth W
Andrew Morton wrote on Wednesday, March 09, 2005 8:10 PM
> > 2.6.9 kernel is 6% slower compared to distributor's 2.4 kernel (RHEL3).
> > Roughly 2% came from storage driver (I'm not allowed to say anything beyond
> > that, there is a fix though).
>
> The codepaths are indeed longer in 2.6.

Thank you for acknowledging this.


> > 2% came from DIO.
>
> hm, that's not a lot.
> 
> 2% is pretty thin :(

This is the exact reason that I did not want to put these numbers out
in the first place: most people underestimate the magnitude of these
percentage points.

Now I have to give a speech on "performance optimization 101".  Take a
look at this page: http://www.suse.de/~aj/SPEC/CINT/d-permanent/index.html
This page tracks the development of gcc and measures the performance of
gcc with SPECint2000.  Study the last chart, take out your calculator,
and work out how much performance gcc gained over the course of 3.5
years of development.  Also please factor in the kind of manpower that
went into the compiler development.

Only when people understand the kind of scale to expect when evaluating a
complex piece of software can we talk about the database transaction
processing benchmark.  This benchmark goes one step further: it benchmarks
the entire software stack (kernel/library/application/compiler), it benchmarks
the entire hardware platform (cpu/memory/IO/chipset), and on the grand
scale it benchmarks system integration: storage, network, interconnect,
mid-tier app server, front end clients, etc.  Any specific function or
component represents only a small portion of the entire system, essential
but small.  For example, the hottest function in the kernel takes 7.5% of
the 20% kernel time.  If we threw away that function entirely, there would
be only a 1.5% direct impact on total cpu cycles.
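
(Working that arithmetic out explicitly: 0.075 x 0.20 = 0.015, i.e. the hottest
kernel function accounts for roughly 1.5% of total CPU cycles, which is where
the figure above comes from.)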

So what's the point?  The point is that whether a number is thin or thick
has to be judged against the complexity of the SUT (system under test).  It
has to be judged against a relevant scale for that particular workload.
And the scale has to be laid out correctly so that it represents the weight
of each component.

Losing 6% just from the Linux kernel is a huge deal for this type of benchmark.
People work for days to implement features which might give sub-percentage
gains.  Making software run faster is not easy, but making software run slower
apparently is a fairly easy task.



> Fine-grained alignment is probably too hard, and it should fall back to
> __blockdev_direct_IO().
>
> Does it do the right thing with a request which is non-page-aligned, but
> 512-byte aligned?
>
> readv and writev?
>

That's why direct_io_worker() is slower.  It does everything and handles
every possible usage scenario out there.  I hope making the function fatter
is not in the plan.

- Ken


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> > Did you generate a kernel profile?
> 
>  Top 40 kernel hot functions, percentage is normalized to kernel utilization.
> 
>  _spin_unlock_irqrestore  23.54%
>  _spin_unlock_irq 19.27%

Cripes.

Is that with CONFIG_PREEMPT?  If so, and if you disable CONFIG_PREEMPT,
this cost should be accounted to the spin_unlock() caller and we can see
who the culprit is.   Perhaps dio->bio_lock.



Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Jesse Barnes
On Wednesday, March 9, 2005 3:23 pm, Andi Kleen wrote:
> "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > Just to clarify here, these data need to be taken with a grain of salt.  A
> > high count in _spin_unlock_* functions does not automatically point to
> > lock contention.  It's one of the blind spots of timer-based profiling
> > on ia64.  There are some lock contentions in the 2.6 kernel that
> > we are staring at.  Please do not misinterpret the numbers here.
>
> Why don't you use oprofile? It uses NMIs and can profile "inside"
> interrupt disabled sections.

That was oprofile output, but on ia64, NMIs are maskable due to the way irq 
disabling works.  Here's a very hackish patch that changes the kernel to use 
cr.tpr instead of psr.i for interrupt control.  Making oprofile use real ia64 
NMIs is left as an exercise for the reader :)

Jesse
= arch/ia64/Kconfig.debug 1.2 vs edited =
--- 1.2/arch/ia64/Kconfig.debug 2005-01-07 16:15:52 -08:00
+++ edited/arch/ia64/Kconfig.debug  2005-02-28 10:07:27 -08:00
@@ -56,6 +56,15 @@
  and restore instructions.  It's useful for tracking down spinlock
  problems, but slow!  If you're unsure, select N.
 
+config IA64_ALLOW_NMI
+   bool "Allow non-maskable interrupts"
+   help
+ The normal ia64 irq enable/disable code prevents even non-maskable
+ interrupts from occuring, which can be a problem for kernel
+ debuggers, watchdogs, and profilers.  Say Y here if you're interested
+ in NMIs and don't mind the small performance penalty this option
+ imposes.
+
 config SYSVIPC_COMPAT
bool
depends on COMPAT && SYSVIPC
= arch/ia64/kernel/head.S 1.31 vs edited =
--- 1.31/arch/ia64/kernel/head.S2005-01-28 15:50:13 -08:00
+++ edited/arch/ia64/kernel/head.S  2005-03-01 13:17:51 -08:00
@@ -59,6 +59,14 @@
.save rp, r0// terminate unwind chain with a NULL rp
.body
 
+#ifdef CONFIG_IA64_ALLOW_NMI   // disable interrupts initially (re-enabled in start_kernel())
+   mov r16=1<<16
+   ;;
+   mov cr.tpr=r16
+   ;;
+   srlz.d
+   ;;
+#endif
rsm psr.i | psr.ic
;;
srlz.i
@@ -129,8 +137,8 @@
/*
 * Switch into virtual mode:
 */
-   movl r16=(IA64_PSR_IT|IA64_PSR_IC|IA64_PSR_DT|IA64_PSR_RT|IA64_PSR_DFH|IA64_PSR_BN \
- |IA64_PSR_DI)
+   movl r16=(IA64_PSR_IT|IA64_PSR_IC|IA64_PSR_I|IA64_PSR_DT|IA64_PSR_RT|IA64_PSR_DFH|\
+ IA64_PSR_BN|IA64_PSR_DI)
;;
mov cr.ipsr=r16
movl r17=1f
= arch/ia64/kernel/irq_ia64.c 1.25 vs edited =
--- 1.25/arch/ia64/kernel/irq_ia64.c2005-01-22 15:54:49 -08:00
+++ edited/arch/ia64/kernel/irq_ia64.c  2005-03-01 12:50:18 -08:00
@@ -103,8 +103,6 @@
 void
 ia64_handle_irq (ia64_vector vector, struct pt_regs *regs)
 {
-   unsigned long saved_tpr;
-
 #if IRQ_DEBUG
{
unsigned long bsp, sp;
@@ -135,17 +133,9 @@
}
 #endif /* IRQ_DEBUG */
 
-   /*
-* Always set TPR to limit maximum interrupt nesting depth to
-* 16 (without this, it would be ~240, which could easily lead
-* to kernel stack overflows).
-*/
irq_enter();
-   saved_tpr = ia64_getreg(_IA64_REG_CR_TPR);
-   ia64_srlz_d();
while (vector != IA64_SPURIOUS_INT_VECTOR) {
if (!IS_RESCHEDULE(vector)) {
-   ia64_setreg(_IA64_REG_CR_TPR, vector);
ia64_srlz_d();
 
__do_IRQ(local_vector_to_irq(vector), regs);
@@ -154,7 +144,6 @@
 * Disable interrupts and send EOI:
 */
local_irq_disable();
-   ia64_setreg(_IA64_REG_CR_TPR, saved_tpr);
}
ia64_eoi();
vector = ia64_get_ivr();
@@ -165,6 +154,7 @@
 * come through until ia64_eoi() has been done.
 */
irq_exit();
+   local_irq_enable();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
= include/asm-ia64/hw_irq.h 1.15 vs edited =
--- 1.15/include/asm-ia64/hw_irq.h  2005-01-22 15:54:52 -08:00
+++ edited/include/asm-ia64/hw_irq.h2005-03-01 13:01:03 -08:00
@@ -36,6 +36,10 @@
 
 #define AUTO_ASSIGN-1
 
+#define IA64_NMI_VECTOR    0x02    /* NMI (note that this can be
+                                      masked if psr.i or psr.ic
+                                      are cleared) */
+
 #define IA64_SPURIOUS_INT_VECTOR   0x0f
 
 /*
= include/asm-ia64/system.h 1.48 vs edited =
--- 1.48/include/asm-ia64/system.h  2005-01-04 18:48:18 -08:00
+++ edited/include/asm-ia64/system.h2005-03-01 15:28:23 -08:00
@@ -107,12 +107,61 @@
 
 #define safe_halt() ia64_pal_halt_light()/* PAL_HALT_LIGHT */
 
+/* For spinlocks etc */
+#ifdef CONFIG_IA64_ALLOW_NMI
+
+#define IA64_TPR_MMI_BIT 

Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Vasquez
On Wed, 09 Mar 2005, Chen, Kenneth W wrote:

> Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
> > What does "1/3 of the total benchmark performance regression" mean?  One
> > third of 0.1% isn't very impressive.  You haven't told us anything at all
> > about the magnitude of this regression.
> 
> 2.6.9 kernel is 6% slower compared to distributor's 2.4 kernel (RHEL3).
> Roughly 2% came from storage driver (I'm not allowed to say anything beyond
> that, there is a fix though).
> 

Ok now, that statement piqued my interest -- since looking through a
previous email it seems you are using the qla2xxx driver.  Care to
elaborate?

Regards,
Andrew Vasquez


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Jesse Barnes
On Wednesday, March 9, 2005 3:23 pm, Andi Kleen wrote:
> "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > Just to clarify here, these data need to be taken with a grain of salt.  A
> > high count in _spin_unlock_* functions does not automatically point to
> > lock contention.  It's one of the blind spots of timer-based profiling
> > on ia64.  There are some lock contentions in the 2.6 kernel that
> > we are staring at.  Please do not misinterpret the numbers here.
>
> Why don't you use oprofile? It uses NMIs and can profile "inside"
> interrupt disabled sections.

Oh, and there are other ways of doing interrupt off profiling by using the 
PMU.  q-tools can do this I think.

Jesse


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread David Lang
On Wed, 9 Mar 2005, Chen, Kenneth W wrote:
> > Also, I'm rather peeved that we're hearing about this regression now rather
> > than two years ago.  And mystified as to why yours is the only group which
> > has reported it.
>
> The 2.6.X kernel has never been faster than the 2.4 kernel (RHEL3).  At one point
> in time, around 2.6.2, the gap was pretty close, at around 1%, but still slower.
> Around 2.6.5, we found the global plug list was causing huge lock contention on
> a 32-way NUMA box.  That got fixed in 2.6.7.  Then came 2.6.8, which took a big
> dip to close to a 20% regression.  Then we fixed a 17% regression in the scheduler
> (fixed with cache_decay_tick).  2.6.9 is the last one we measured and it is 6%
> slower.  It's a constantly moving target, a wild goose chase.
>
> I don't know why other people have not reported the problem; perhaps they
> haven't had a chance to run a transaction processing db workload on the 2.6 kernel.
> Perhaps they have not compared, perhaps they are working on the same problem.
> I just don't know.

Also the 2.6 kernel is SO much better in the case where you have many
threads (even if they are all completely idle) that that improvement may
be masking the regression that Ken is reporting (I've seen a 50%
performance hit on 2.4 with just a thousand or two threads compared to
2.6).  Let's face it, a typical linux box today starts up a LOT of stuff
that will never get used, but will count as an idle thread.

David Lang
--
There are two ways of constructing a software design. One way is to make it so 
simple that there are obviously no deficiencies. And the other way is to make 
it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
> > What does "1/3 of the total benchmark performance regression" mean?  One
> > third of 0.1% isn't very impressive.  You haven't told us anything at all
> > about the magnitude of this regression.
> 
> 2.6.9 kernel is 6% slower compared to distributor's 2.4 kernel (RHEL3).
> Roughly 2% came from storage driver (I'm not allowed to say anything beyond
> that, there is a fix though).

The codepaths are indeed longer in 2.6.

> 2% came from DIO.

hm, that's not a lot.

Once you redo that patch to use aops and to work with O_DIRECT, the paths
will get a little deeper, but not much.  We really should do this so that
O_DIRECT works, and in case someone has gone and mmapped the blockdev.

Fine-grained alignment is probably too hard, and it should fall back to
__blockdev_direct_IO().

Does it do the right thing with a request which is non-page-aligned, but
512-byte aligned?

readv and writev?

2% is pretty thin :(

> The rest of 2% is still unaccounted for.  We don't know where.

General cache replacement, perhaps.  9MB is a big cache though.

> ...
> Around 2.6.5, we found the global plug list was causing huge lock contention on
> a 32-way NUMA box.  That got fixed in 2.6.7.  Then came 2.6.8, which took a big
> dip to close to a 20% regression.  Then we fixed a 17% regression in the scheduler
> (fixed with cache_decay_tick).  2.6.9 is the last one we measured and it is 6%
> slower.  It's a constantly moving target, a wild goose chase.
> 

OK.  Seems that the 2.4 O(1) scheduler got it right for that machine.

> haven't had a chance to run a transaction processing db workload on the 2.6 kernel.
> Perhaps they have not compared, perhaps they are working on the same problem.
> I just don't know.

Maybe there are other factors which drown these little things out:
architecture improvements, choice of architecture, driver changes, etc.



Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
David Lang <[EMAIL PROTECTED]> wrote:
>
> (I've seen a 50% 
>  performance hit on 2.4 with just a thousand or two threads compared to 
>  2.6)

Was that 2.4 kernel a vendor kernel with the O(1) scheduler?


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
> What does "1/3 of the total benchmark performance regression" mean?  One
> third of 0.1% isn't very impressive.  You haven't told us anything at all
> about the magnitude of this regression.

2.6.9 kernel is 6% slower compared to distributor's 2.4 kernel (RHEL3).  Roughly
2% came from storage driver (I'm not allowed to say anything beyond that, there
is a fix though).

2% came from DIO.

The rest of 2% is still unaccounted for.  We don't know where.

> How much system time?  User time?  All that stuff.
20.5% in the kernel, 79.5% in user space.


> But the first thing to do is to work out where the cycles are going to.
You've seen the profile.  That's where all the cycles went.


> Also, I'm rather peeved that we're hearing about this regression now rather
> than two years ago.  And mystified as to why yours is the only group which
> has reported it.

The 2.6.X kernel has never been faster than the 2.4 kernel (RHEL3).  At one point
in time, around 2.6.2, the gap was pretty close, at around 1%, but still slower.
Around 2.6.5, we found the global plug list was causing huge lock contention on
a 32-way NUMA box.  That got fixed in 2.6.7.  Then came 2.6.8, which took a big
dip to close to a 20% regression.  Then we fixed a 17% regression in the scheduler
(fixed with cache_decay_tick).  2.6.9 is the last one we measured and it is 6%
slower.  It's a constantly moving target, a wild goose chase.

I don't know why other people have not reported the problem; perhaps they
haven't had a chance to run a transaction processing db workload on the 2.6 kernel.
Perhaps they have not compared, perhaps they are working on the same problem.
I just don't know.

- Ken




Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> This is all real: real benchmark running on real hardware, with real
>  result showing large performance regression.  Nothing synthetic here.
> 

Ken, could you *please* be more complete, more organized and more specific?

What does "1/3 of the total benchmark performance regression" mean?  One
third of 0.1% isn't very impressive.  You haven't told us anything at all
about the magnitude of this regression.

Where does the rest of the regression come from?

How much system time?  User time?  All that stuff.

>  And yes, it is all worth pursuing, the two patches on raw device recuperate
>  1/3 of the total benchmark performance regression.

The patch needs a fair bit of work, and if it still provides useful gains
when it's complete I guess it could make sense as some database special-case.

But the first thing to do is to work out where the cycles are going to.


Also, I'm rather peeved that we're hearing about this regression now rather
than two years ago.  And mystified as to why yours is the only group which
has reported it.


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> Andrew Morton wrote on Wednesday, March 09, 2005 2:45 PM
>  > >
>  > > > Did you generate a kernel profile?
>  > >
>  > >  Top 40 kernel hot functions, percentage is normalized to kernel utilization.
>  > >
>  > >  _spin_unlock_irqrestore    23.54%
>  > >  _spin_unlock_irq           19.27%
>  >
>  > Cripes.
>  >
>  > Is that with CONFIG_PREEMPT?  If so, and if you disable CONFIG_PREEMPT,
>  > this cost should be accounted to the spin_unlock() caller and we can see
>  > who the culprit is.   Perhaps dio->bio_lock.
> 
>  CONFIG_PREEMPT is off.
> 
>  Sorry for all the confusion; I probably shouldn't have posted the first profile,
>  which only confused people.  See the 2nd profile that I posted earlier (copied here again).
> 
>  scsi_request_fn  7.54%
>  finish_task_switch   6.25%
>  __blockdev_direct_IO 4.97%
>  __make_request   3.87%
>  scsi_end_request 3.54%
>  dio_bio_end_io   2.70%
>  follow_hugetlb_page  2.39%
>  __wake_up2.37%
>  aio_complete 1.82%

What are these percentages?  Total CPU time?  The direct-io stuff doesn't
look too bad.  It's surprising that tweaking the direct-io submission code
makes much difference.

hm.  __blockdev_direct_IO() doesn't actually do much.  I assume your damn
compiler went and inlined direct_io_worker() on us.
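
(An aside, not taken from this thread: one generic way to make that split
visible in a profile is to keep the worker out of line so it gets its own
symbol.)

#include <linux/types.h>

/* Sketch: __attribute__((noinline)) stops gcc (3.4 in Ken's setup) from
 * folding a hot static helper into its caller, so profile samples are
 * charged to the helper's own symbol rather than to a caller such as
 * __blockdev_direct_IO(). */
static ssize_t __attribute__((noinline)) hot_helper_for_profiling(void)
{
        return 0;
}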



RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Chen, Kenneth W wrote on Wednesday, March 09, 2005 5:45 PM
> Andrew Morton wrote on Wednesday, March 09, 2005 5:34 PM
> > What are these percentages?  Total CPU time?  The direct-io stuff doesn't
> > look too bad.  It's surprising that tweaking the direct-io submission code
> > makes much difference.
>
> Percentage is relative to total kernel time.  There are three DIO functions
> that showed up in the profile:
>
> __blockdev_direct_IO    4.97%
> dio_bio_end_io          2.70%
> dio_bio_complete        1.20%

For the sake of comparison, let's look at the effect of the performance patch on
the raw device.  In place of the above three functions, we now have two:

raw_file_rw 1.59%
raw_file_aio_rw 1.19%

A total saving of 6.09% (4.97+2.70+1.20 - 1.59-1.19).  That's only counting
the cpu cycles.  We have tons of other data showing significant kernel path
length reduction with the performance patch.  Cache misses are reduced across
the entire 3-level cache hierarchy; that's a secondary effect that cannot be
ignored, since the kernel is also competing for cache resources with the
application.

- Ken




RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Wednesday, March 09, 2005 2:45 PM
> >
> > > Did you generate a kernel profile?
> >
> >  Top 40 kernel hot functions, percentage is normalized to kernel utilization.
> >
> >  _spin_unlock_irqrestore    23.54%
> >  _spin_unlock_irq           19.27%
>
> Cripes.
>
> Is that with CONFIG_PREEMPT?  If so, and if you disable CONFIG_PREEMPT,
> this cost should be accounted to the spin_unlock() caller and we can see
> who the culprit is.   Perhaps dio->bio_lock.

CONFIG_PREEMPT is off.

Sorry for all the confusion; I probably shouldn't have posted the first profile,
which only confused people.  See the 2nd profile that I posted earlier (copied here again).

scsi_request_fn                   7.54%
finish_task_switch                6.25%
__blockdev_direct_IO              4.97%
__make_request                    3.87%
scsi_end_request                  3.54%
dio_bio_end_io                    2.70%
follow_hugetlb_page               2.39%
__wake_up                         2.37%
aio_complete                      1.82%
kmem_cache_alloc                  1.68%
__mod_timer                       1.63%
e1000_clean                       1.57%
__generic_file_aio_read           1.42%
mempool_alloc                     1.37%
put_page                          1.35%
e1000_intr                        1.31%
schedule                          1.25%
dio_bio_complete                  1.20%
scsi_device_unbusy                1.07%
kmem_cache_free                   1.06%
__copy_user                       1.04%
scsi_dispatch_cmd                 1.04%
__end_that_request_first          1.04%
generic_make_request              1.02%
kfree                             0.94%
__aio_get_req                     0.93%
sys_pread64                       0.83%
get_request                       0.79%
put_io_context                    0.76%
dnotify_parent                    0.73%
vfs_read                          0.73%
update_atime                      0.73%
finished_one_bio                  0.63%
generic_file_aio_write_nolock     0.63%
scsi_put_command                  0.62%
break_fault                       0.62%
e1000_xmit_frame                  0.62%
aio_read_evt                      0.59%
scsi_io_completion                0.59%
inode_times_differ                0.58%




RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Wednesday, March 09, 2005 5:34 PM
> What are these percentages?  Total CPU time?  The direct-io stuff doesn't
> look too bad.  It's surprising that tweaking the direct-io submission code
> makes much difference.

Percentage is relative to total kernel time.  There are three DIO functions
that showed up in the profile:

__blockdev_direct_IO    4.97%
dio_bio_end_io          2.70%
dio_bio_complete        1.20%

> hm.  __blockdev_direct_IO() doesn't actually do much.  I assume your damn
> compiler went and inlined direct_io_worker() on us.

We are using gcc version 3.4.3.  I suppose I can point the finger at gcc? :-P

- Ken




RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
For people who are dying to see a q-tools profile, here is one.
It's not a vanilla 2.6.9 kernel, but one with patches to the raw device
to get around the DIO performance problem.

- Ken


Flat profile of CPU_CYCLES in hist#0:
 Each histogram sample counts as 255.337u seconds
% time  self cumul calls self/call  tot/call name
  5.08  1.92  1.92 - - - schedule
  4.64  0.62  2.54 - - - __ia64_readw_relaxed
  4.03  0.54  3.08 - - - _stext
  3.03  0.41  3.49 - - - qla2x00_queuecommand
  2.73  0.37  3.86 - - - qla2x00_start_scsi
  1.92  0.26  4.12 - - - __mod_timer
  1.78  0.24  4.36 - - - scsi_request_fn
  1.68  0.23  4.58 - - - __copy_user
  1.45  0.20  4.78 - - - raw_file_rw
  1.30  0.17  4.95 - - - kmem_cache_alloc
  1.29  0.17  5.12 - - - mempool_alloc
  1.29  0.17  5.30 - - - follow_hugetlb_page
  1.19  0.16  5.46 - - - generic_make_request
  1.14  0.15  5.61 - - - qla2x00_next
  1.06  0.14  5.75 - - - memset
  1.03  0.14  5.89 - - - raw_file_aio_rw
  1.01  0.14  6.03 - - - e1000_clean
  0.93  0.13  6.15 - - - scsi_get_command
  0.93  0.12  6.28 - - - sd_init_command
  0.87  0.12  6.39 - - - __make_request
  0.87  0.12  6.51 - - - __aio_get_req
  0.81  0.11  6.62 - - - qla2300_intr_handler
  0.77  0.10  6.72 - - - put_io_context
  0.75  0.10  6.82 - - - qla2x00_process_completed_request
  0.74  0.10  6.92 - - - e1000_intr
  0.73  0.10  7.02 - - - get_request
  0.72  0.10  7.12 - - - rse_clear_invalid
  0.70  0.09  7.21 - - - aio_read_evt
  0.70  0.09  7.31 - - - e1000_xmit_frame
  0.70  0.09  7.40 - - - __bio_add_page
  0.69  0.09  7.49 - - - qla2x00_process_response_queue
  0.69  0.09  7.58 - - - vfs_read
  0.69  0.09  7.68 - - - break_fault
  0.67  0.09  7.77 - - - scsi_dispatch_cmd
  0.66  0.09  7.86 - - - try_to_wake_up
  0.64  0.09  7.94 - - - blk_queue_start_tag
  0.63  0.08  8.03 - - - sys_pread64
  0.62  0.08  8.11 - - - alt_dtlb_miss
  0.60  0.08  8.19 - - - ia64_spinlock_contention
  0.57  0.08  8.27 - - - skb_release_data
  0.55  0.07  8.34 - - - scsi_prep_fn
  0.53  0.07  8.41 - - - tcp_sendmsg
  0.52  0.07  8.48 - - - internal_add_timer
  0.51  0.07  8.55 - - - drive_stat_acct
  0.51  0.07  8.62 - - - tcp_transmit_skb
  0.50  0.07  8.69 - - - task_rq_lock
  0.49  0.07  8.75 - - - get_user_pages
  0.48  0.06  8.82 - - - tcp_rcv_established
  0.47  0.06  8.88 - - - kmem_cache_free
  0.47  0.06  8.94 - - - save_switch_stack
  0.46  0.06  9.00 - - - del_timer
  0.46  0.06  9.07 - - - aio_pread
  0.45  0.06  9.13 - - - bio_alloc
  0.44  0.06  9.19 - - - finish_task_switch
  0.44  0.06  9.25 - - - ip_queue_xmit
  0.43  0.06  9.30 - - - move_tasks
  0.42  0.06  9.36 - - - lock_sock
  0.40  0.05  9.41 - - - elv_queue_empty
  0.40  0.05  9.47 - - - bio_add_page
  0.39  0.05  9.52 - - - try_atomic_semop
  0.38  0.05  9.57 - - - qla2x00_done
  0.38  0.05  9.62 - - - tcp_recvmsg
  0.37  0.05  9.67 - - - put_page
  0.37  0.05  9.72 - - - elv_next_request
  0.36  0.05  9.77 - 

RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Jesse Barnes wrote on Wednesday, March 09, 2005 3:53 PM
> > "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > > Just to clarify here, these data need to be taken with a grain of salt.  A
> > > high count in _spin_unlock_* functions does not automatically point to
> > > lock contention.  It's one of the blind spots of timer-based profiling
> > > on ia64.  There are some lock contentions in the 2.6 kernel that
> > > we are staring at.  Please do not misinterpret the numbers here.
> >
> > Why don't you use oprofile? It uses NMIs and can profile "inside"
> > interrupt disabled sections.
>
> Oh, and there are other ways of doing interrupt off profiling by using the
> PMU.  q-tools can do this I think.

Thank you all for the suggestions.  I'm well aware of q-tools and have been using
it on and off.  It's just that I don't have any data collected with q-tools
for that particular hardware/software benchmark configuration.  I posted
whatever data I have.

- Ken




RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andi Kleen wrote on Wednesday, March 09, 2005 3:23 PM
> > Just to clarify here, these data need to be taken with a grain of salt.  A
> > high count in _spin_unlock_* functions does not automatically point to
> > lock contention.  It's one of the blind spots of timer-based profiling
> > on ia64.  There are some lock contentions in the 2.6 kernel that
> > we are staring at.  Please do not misinterpret the numbers here.
>
> Why don't you use oprofile? It uses NMIs and can profile "inside"
> interrupt disabled sections.

The profile is taken on ia64; we don't have NMIs.  Oprofile will produce
the same result.

- Ken




Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andi Kleen
"Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
>
> Just to clarify here, these data need to be taken with a grain of salt.  A
> high count in _spin_unlock_* functions does not automatically point to
> lock contention.  It's one of the blind spots of timer-based profiling
> on ia64.  There are some lock contentions in the 2.6 kernel that
> we are staring at.  Please do not misinterpret the numbers here.

Why don't you use oprofile? It uses NMIs and can profile "inside"
interrupt disabled sections.

-Andi


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Chen, Kenneth W wrote on Wednesday, March 09, 2005 1:59 PM
> > Did you generate a kernel profile?
>
> Top 40 kernel hot functions, percentage is normalized to kernel utilization.
>
> _spin_unlock_irqrestore   23.54%
> _spin_unlock_irq  19.27%
> 
>
> Profile with spin lock inlined, so that it is easier to see functions
> that has the lock contention, again top 40 hot functions:

Just to clarify here, these data need to be taken with a grain of salt.  A
high count in _spin_unlock_* functions does not automatically point to
lock contention.  It's one of the blind spots of timer-based profiling
on ia64.  There are some lock contentions in the 2.6 kernel that
we are staring at.  Please do not misinterpret the numbers here.

- Ken




RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Wednesday, March 09, 2005 12:05 PM
> "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
> > Let me answer the questions in reverse order.  We started with running
> > industry standard transaction processing database benchmark on 2.6 kernel,
> > on real hardware (4P smp, 64 GB memory, 450 disks) running industry
> > standard db application.  What we measured is that with best tuning done
> > to the system, 2.6 kernel has a huge performance regression relative to
> > its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).
>
> That's news to me.  I thought we were doing OK with big database stuff.
> Surely lots of people have been testing such things.

There are different levels of "big" stuff.  We used to work on a 32-way NUMA
box, but other show-stopper issues popped up before we got to the I/O stack.
The good thing that came out of that work is the removal of the global unplug lock.


> > And yes, it is all worth pursuing, the two patches on raw device recuperate
> > 1/3 of the total benchmark performance regression.
>
> On a real disk driver?  hm, I'm wrong then.
>

Yes, on a real disk driver (QLogic Fibre Channel) and with real 15K RPM disks.


> Did you generate a kernel profile?

Top 40 kernel hot functions, percentage is normalized to kernel utilization.

_spin_unlock_irqrestore          23.54%
_spin_unlock_irq                 19.27%
__blockdev_direct_IO              3.57%
follow_hugetlb_page               1.84%
e1000_clean                       1.38%
kmem_cache_alloc                  1.31%
put_page                          1.29%
__generic_file_aio_read           1.18%
e1000_intr                        1.07%
schedule                          1.01%
dio_bio_complete                  0.97%
mempool_alloc                     0.96%
kmem_cache_free                   0.90%
__end_that_request_first          0.88%
__copy_user                       0.82%
kfree                             0.77%
generic_make_request              0.73%
_spin_lock                        0.73%
kref_put                          0.73%
vfs_read                          0.68%
update_atime                      0.68%
scsi_dispatch_cmd                 0.67%
fget_light                        0.66%
put_io_context                    0.60%
_spin_lock_irqsave                0.58%
scsi_finish_command               0.58%
generic_file_aio_write_nolock     0.57%
inode_times_differ                0.55%
break_fault                       0.53%
__do_softirq                      0.48%
aio_read_evt                      0.48%
try_atomic_semop                  0.44%
sys_pread64                       0.43%
__bio_add_page                    0.43%
__mod_timer                       0.42%
bio_alloc                         0.41%
scsi_decide_disposition           0.40%
e1000_clean_rx_irq                0.39%
find_vma                          0.38%
dnotify_parent                    0.38%


Profile with spin locks inlined, so that it is easier to see the functions
that have the lock contention, again top 40 hot functions:

scsi_request_fn                   7.54%
finish_task_switch                6.25%
__blockdev_direct_IO              4.97%
__make_request                    3.87%
scsi_end_request                  3.54%
dio_bio_end_io                    2.70%
follow_hugetlb_page               2.39%
__wake_up                         2.37%
aio_complete                      1.82%
kmem_cache_alloc                  1.68%
__mod_timer                       1.63%
e1000_clean                       1.57%
__generic_file_aio_read           1.42%
mempool_alloc                     1.37%
put_page                          1.35%
e1000_intr                        1.31%
schedule                          1.25%
dio_bio_complete                  1.20%
scsi_device_unbusy                1.07%
kmem_cache_free                   1.06%
__copy_user                       1.04%
scsi_dispatch_cmd                 1.04%
__end_that_request_first          1.04%
generic_make_request              1.02%
kfree                             0.94%
__aio_get_req                     0.93%
sys_pread64                       0.83%
get_request                       0.79%
put_io_context                    0.76%
dnotify_parent                    0.73%
vfs_read                          0.73%
update_atime                      0.73%
finished_one_bio                  0.63%
generic_file_aio_write_nolock     0.63%
scsi_put_command                  0.62%
break_fault                       0.62%
e1000_xmit_frame                  0.62%
aio_read_evt                      0.59%
scsi_io_completion                0.59%
inode_times_differ                0.58%





Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> Andrew Morton wrote on Tuesday, March 08, 2005 10:28 PM
> > But before doing anything else, please bench this on real hardware,
> > see if it is worth pursuing.
> 
> Let me answer the questions in reverse order.  We started with running
> industry standard transaction processing database benchmark on 2.6 kernel,
> on real hardware (4P smp, 64 GB memory, 450 disks) running industry
> standard db application.  What we measured is that with best tuning done
> to the system, 2.6 kernel has a huge performance regression relative to
> its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).

That's news to me.  I thought we were doing OK with big database stuff. 
Surely lots of people have been testing such things.

> And yes, it is all worth pursuing, the two patches on raw device recuperate
> 1/3 of the total benchmark performance regression.

On a real disk driver?  hm, I'm wrong then.

Did you generate a kernel profile?



RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Tuesday, March 08, 2005 10:28 PM
> But before doing anything else, please bench this on real hardware,
> see if it is worth pursuing.

Let me answer the questions in reverse order.  We started with running
industry standard transaction processing database benchmark on 2.6 kernel,
on real hardware (4P smp, 64 GB memory, 450 disks) running industry
standard db application.  What we measured is that with best tuning done
to the system, 2.6 kernel has a huge performance regression relative to
its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).

Ever since we had that measurement, people kick my butt every day, asking
"after you told us how great the 2.6 kernel is, why is my workload running
significantly slower on this shiny 2.6 kernel?".  It hurts.  It hurts like a
sledgehammer nailed right in the middle of my head.

This is all real: a real benchmark running on real hardware, with real
results showing a large performance regression.  Nothing synthetic here.

And yes, it is all worth pursuing: the two patches on the raw device recuperate
1/3 of the total benchmark performance regression.

The reason I posted the pseudo disk driver is for people to see the effect
more easily without shelling out a couple of million dollars to buy all that
equipment.


> Once you bolt this onto a real device driver the proportional difference
> will fall, due to addition of the constant factor.
>
> Once you bolt all this onto a real disk controller all the numbers will get
> worse (but in a strictly proportional manner) due to the disk transfers
> depriving the CPU of memory bandwidth.
>

That's not how I would interpret the number.  Kernel utilization went up for
the 2.6 kernel running the same db workload.  One reason is that the I/O stack
taxes each I/O call a little more (or I should say is less efficient); even a
minuscule amount, given the sheer I/O rate, is amplified very quickly.  One CPU
cycle spent in the kernel means one less CPU cycle for the application.  My main
point is that with a less efficient I/O stack, the kernel is taking away valuable
compute resources the application needs to crunch SQL transactions.  And that
leads to lower performance.

One can extrapolate it the other way: make the kernel more efficient at
processing these I/O requests and kernel utilization goes down; the cycles
saved transfer to the application to crunch more SQL transactions, and
performance goes up.  I hope everyone is following me here.


> At 5 usecs per request I figure that's 3% CPU utilisation for 16k requests
> at 100 MB/sec.

Our smallest setup has 450 disks, and the workload will generate about 50,000
I/Os per second.  Larger setups will have higher I/O rates.
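
(Back-of-the-envelope only, taking Andrew's 5 usec/request figure above purely
as an assumption: 50,000 I/Os per second x 5 usec = 0.25 CPU-seconds per second,
i.e. about a quarter of one CPU, or roughly 6% of a 4-CPU box, spent on
per-request submission overhead alone.)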


> What sort of CPU?
>
> What speed CPU?
>
> What size requests?
>
> Reads or writes?
>

1.6 GHz Itanium 2, 9MB L3.
I/O requests are a mixture of 2KB and 16KB, with occasional larger sizes in bursts.
Both reads and writes, about a 50/50 split.

- Ken


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Tuesday, March 08, 2005 10:28 PM
 But before doing anything else, please bench this on real hardware,
 see if it is worth pursuing.

Let me answer the questions in reverse order.  We started with running
industry standard transaction processing database benchmark on 2.6 kernel,
on real hardware (4P smp, 64 GB memory, 450 disks) running industry
standard db application.  What we measured is that with best tuning done
to the system, 2.6 kernel has a huge performance regression relative to
its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).

Ever since we had that measurement, people kick my butt everyday and
asking after you telling us how great 2.6 kernel is, why is my workload
running significantly slower on this shinny 2.6 kernel?.  It hurts,
It hurts like a sledge hammer nailed right in the middle of my head.

This is all real: real benchmark running on real hardware, with real
result showing large performance regression.  Nothing synthetic here.

And yes, it is all worth pursuing, the two patches on raw device recuperate
1/3 of the total benchmark performance regression.

The reason I posted the pseudo disk driver is for people to see the effect
easier without shelling out a couple of million dollar to buy all that
equipment.


 Once you bolt this onto a real device driver the proportional difference
 will fall, due to addition of the constant factor.

 Once you bolt all this onto a real disk controller all the numbers will get
 worse (but in a strictly proportional manner) due to the disk transfers
 depriving the CPU of memory bandwidth.


That's not how I would interpret the number.  Kernel utilization went up for
2.6 kernel running the same db workload.  One reason is I/O stack is taxing a
little bit on each I/O call (or I should say less efficient), even with 
minuscule
amount, given the shear amount of I/O rate, it will be amplified very quickly.
One cpu cycle spend in the kernel means one less cpu cycle for the application.
My mean point is with less efficient I/O stack, kernel is actually taking away
valuable compute resources from application to crunch SQL transaction.  And that
leads to lower performance.

One can extrapolate it the other way: make kernel more efficient in processing
these I/O requests, kernel utilization goes down, cycle saved will transfer to
application to crunch more SQL transaction, and performance goes up.  I hope
everyone is following me here.


 At 5 usecs per request I figure that's 3% CPU utilisation for 16k requests
 at 100 MB/sec.

Our smallest setup has 450 disks, and the workload will generate about 50,000
I/O per second.  Larger setup will have more I/O rate.


 What sort of CPU?

 What speed CPU?

 What size requests?

 Reads or writes?


1.6 GHz Itanium2, 9M L3
I/O requests are mixture of 2KB and 16KB, occasionally some large size in burst.
Both read/write, about 50/50 split on rw.

- Ken


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
Chen, Kenneth W [EMAIL PROTECTED] wrote:

 Andrew Morton wrote on Tuesday, March 08, 2005 10:28 PM
  But before doing anything else, please bench this on real hardware,
  see if it is worth pursuing.
 
 Let me answer the questions in reverse order.  We started with running
 industry standard transaction processing database benchmark on 2.6 kernel,
 on real hardware (4P smp, 64 GB memory, 450 disks) running industry
 standard db application.  What we measured is that with best tuning done
 to the system, 2.6 kernel has a huge performance regression relative to
 its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).

That's news to me.  I thought we were doing OK with big database stuff. 
Surely lots of people have been testing such things.

 And yes, it is all worth pursuing, the two patches on raw device recuperate
 1/3 of the total benchmark performance regression.

On a real disk driver?  hm, I'm wrong then.

Did you generate a kernel profile?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Wednesday, March 09, 2005 12:05 PM
 Chen, Kenneth W [EMAIL PROTECTED] wrote:
  Let me answer the questions in reverse order.  We started with running
  industry standard transaction processing database benchmark on 2.6 kernel,
  on real hardware (4P smp, 64 GB memory, 450 disks) running industry
  standard db application.  What we measured is that with best tuning done
  to the system, 2.6 kernel has a huge performance regression relative to
  its predecessor 2.4 kernel (a kernel from RHEL3, 2.4.21 based).

 That's news to me.  I thought we were doing OK with big database stuff.
 Surely lots of people have been testing such things.

There are different level of big stuff.  We used to work on 32-way numa
box, but other show stopper issues popping up before we get to the I/O stack.
The good thing came out of that work is the removal of global unplug lock.


  And yes, it is all worth pursuing, the two patches on raw device recuperate
  1/3 of the total benchmark performance regression.

 On a real disk driver?  hm, I'm wrong then.


Yes, on real disk driver (qlogic fiber channel) and with real 15K rpm disks.


 Did you generate a kernel profile?

Top 40 kernel hot functions, percentage is normalized to kernel utilization.

_spin_unlock_irqrestore 23.54%
_spin_unlock_irq19.27%
__blockdev_direct_IO3.57%
follow_hugetlb_page 1.84%
e1000_clean 1.38%
kmem_cache_alloc1.31%
put_page1.29%
__generic_file_aio_read 1.18%
e1000_intr  1.07%
schedule1.01%
dio_bio_complete0.97%
mempool_alloc   0.96%
kmem_cache_free 0.90%
__end_that_request_first0.88%
__copy_user 0.82%
kfree   0.77%
generic_make_request0.73%
_spin_lock  0.73%
kref_put0.73%
vfs_read0.68%
update_atime0.68%
scsi_dispatch_cmd   0.67%
fget_light  0.66%
put_io_context  0.60%
_spin_lock_irqsave  0.58%
scsi_finish_command 0.58%
generic_file_aio_write_nolock   0.57%
inode_times_differ  0.55%
break_fault 0.53%
__do_softirq0.48%
aio_read_evt0.48%
try_atomic_semop0.44%
sys_pread64 0.43%
__bio_add_page  0.43%
__mod_timer 0.42%
bio_alloc   0.41%
scsi_decide_disposition 0.40%
e1000_clean_rx_irq  0.39%
find_vma0.38%
dnotify_parent  0.38%


Profile with spin lock inlined, so that it is easier to see functions
that has the lock contention, again top 40 hot functions:

scsi_request_fn                   7.54%
finish_task_switch                6.25%
__blockdev_direct_IO              4.97%
__make_request                    3.87%
scsi_end_request                  3.54%
dio_bio_end_io                    2.70%
follow_hugetlb_page               2.39%
__wake_up                         2.37%
aio_complete                      1.82%
kmem_cache_alloc                  1.68%
__mod_timer                       1.63%
e1000_clean                       1.57%
__generic_file_aio_read           1.42%
mempool_alloc                     1.37%
put_page                          1.35%
e1000_intr                        1.31%
schedule                          1.25%
dio_bio_complete                  1.20%
scsi_device_unbusy                1.07%
kmem_cache_free                   1.06%
__copy_user                       1.04%
scsi_dispatch_cmd                 1.04%
__end_that_request_first          1.04%
generic_make_request              1.02%
kfree                             0.94%
__aio_get_req                     0.93%
sys_pread64                       0.83%
get_request                       0.79%
put_io_context                    0.76%
dnotify_parent                    0.73%
vfs_read                          0.73%
update_atime                      0.73%
finished_one_bio                  0.63%
generic_file_aio_write_nolock     0.63%
scsi_put_command                  0.62%
break_fault                       0.62%
e1000_xmit_frame                  0.62%
aio_read_evt                      0.59%
scsi_io_completion                0.59%
inode_times_differ                0.58%



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Chen, Kenneth W wrote on Wednesday, March 09, 2005 1:59 PM
  Did you generate a kernel profile?

 Top 40 kernel hot functions, percentage is normalized to kernel utilization.

 _spin_unlock_irqrestore   23.54%
 _spin_unlock_irq  19.27%
 

 Profile with spin lock inlined, so that it is easier to see functions
 that has the lock contention, again top 40 hot functions:

Just to clarify here, these data need to be taken with a grain of salt.  A
high count in the _spin_unlock_* functions does not automatically point to
lock contention.  It's one of the blind-spot syndromes of timer-based
profiling on ia64.  There are some lock contentions in the 2.6 kernel that
we are staring at.  Please do not misinterpret the numbers here.
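
To spell the blind spot out (illustration only -- the lock name and the
work inside it are made-up):

    spin_lock_irqsave(&some_lock, flags);      /* irqs off: the profiling timer
                                                  tick cannot be delivered here */
    do_work_under_the_lock();                  /* ...so none of this time is seen... */
    spin_unlock_irqrestore(&some_lock, flags); /* irqs back on: the pending tick
                                                  fires here and the whole critical
                                                  section gets charged to
                                                  _spin_unlock_irqrestore() */

So a fat _spin_unlock_* entry can just as easily mean "a lot of time spent
with interrupts disabled" as it can mean lock contention.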

- Ken


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andi Kleen
Chen, Kenneth W [EMAIL PROTECTED] writes:

 Just to clarify here, these data need to be taken at grain of salt. A
 high count in _spin_unlock_* functions do not automatically points to
 lock contention.  It's one of the blind spot syndrome with timer based
 profile on ia64.  There are some lock contentions in 2.6 kernel that
 we are staring at.  Please do not misinterpret the number here.

Why don't you use oprofile? It uses NMIs and can profile inside
interrupt disabled sections.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andi Kleen wrote on Wednesday, March 09, 2005 3:23 PM
  Just to clarify here, these data need to be taken at grain of salt. A
  high count in _spin_unlock_* functions do not automatically points to
  lock contention.  It's one of the blind spot syndrome with timer based
  profile on ia64.  There are some lock contentions in 2.6 kernel that
  we are staring at.  Please do not misinterpret the number here.

 Why don't you use oprofile? It uses NMIs and can profile inside
 interrupt disabled sections.

The profile is taken on ia64.  We don't have NMIs.  Oprofile will produce
the same result.

- Ken


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Jesse Barnes wrote on Wednesday, March 09, 2005 3:53 PM
  Chen, Kenneth W [EMAIL PROTECTED] writes:
   Just to clarify here, these data need to be taken at grain of salt. A
   high count in _spin_unlock_* functions do not automatically points to
   lock contention.  It's one of the blind spot syndrome with timer based
   profile on ia64.  There are some lock contentions in 2.6 kernel that
   we are staring at.  Please do not misinterpret the number here.
 
  Why don't you use oprofile? It uses NMIs and can profile inside
  interrupt disabled sections.

 Oh, and there are other ways of doing interrupt off profiling by using the
 PMU.  q-tools can do this I think.

Thank you all for the suggestions.  I'm well aware of q-tools and have been
using them on and off.  It's just that I don't have any data collected with
q-tools for that particular hardware/software benchmark configuration.  I posted
whatever data I have.

- Ken


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
For people who are dying to see a q-tools profile, here is one.
It's not a vanilla 2.6.9 kernel, but one with patches in the raw device
to get around the DIO performance problem.

- Ken


Flat profile of CPU_CYCLES in hist#0:
 Each histogram sample counts as 255.337u seconds
% time  self cumul calls self/call  tot/call name
  5.08  1.92  1.92 - - - schedule
  4.64  0.62  2.54 - - - __ia64_readw_relaxed
  4.03  0.54  3.08 - - - _stext
  3.03  0.41  3.49 - - - qla2x00_queuecommand
  2.73  0.37  3.86 - - - qla2x00_start_scsi
  1.92  0.26  4.12 - - - __mod_timer
  1.78  0.24  4.36 - - - scsi_request_fn
  1.68  0.23  4.58 - - - __copy_user
  1.45  0.20  4.78 - - - raw_file_rw
  1.30  0.17  4.95 - - - kmem_cache_alloc
  1.29  0.17  5.12 - - - mempool_alloc
  1.29  0.17  5.30 - - - follow_hugetlb_page
  1.19  0.16  5.46 - - - generic_make_request
  1.14  0.15  5.61 - - - qla2x00_next
  1.06  0.14  5.75 - - - memset
  1.03  0.14  5.89 - - - raw_file_aio_rw
  1.01  0.14  6.03 - - - e1000_clean
  0.93  0.13  6.15 - - - scsi_get_command
  0.93  0.12  6.28 - - - sd_init_command
  0.87  0.12  6.39 - - - __make_request
  0.87  0.12  6.51 - - - __aio_get_req
  0.81  0.11  6.62 - - - qla2300_intr_handler
  0.77  0.10  6.72 - - - put_io_context
  0.75  0.10  6.82 - - - qla2x00_process_completed_request
  0.74  0.10  6.92 - - - e1000_intr
  0.73  0.10  7.02 - - - get_request
  0.72  0.10  7.12 - - - rse_clear_invalid
  0.70  0.09  7.21 - - - aio_read_evt
  0.70  0.09  7.31 - - - e1000_xmit_frame
  0.70  0.09  7.40 - - - __bio_add_page
  0.69  0.09  7.49 - - - qla2x00_process_response_queue
  0.69  0.09  7.58 - - - vfs_read
  0.69  0.09  7.68 - - - break_fault
  0.67  0.09  7.77 - - - scsi_dispatch_cmd
  0.66  0.09  7.86 - - - try_to_wake_up
  0.64  0.09  7.94 - - - blk_queue_start_tag
  0.63  0.08  8.03 - - - sys_pread64
  0.62  0.08  8.11 - - - alt_dtlb_miss
  0.60  0.08  8.19 - - - ia64_spinlock_contention
  0.57  0.08  8.27 - - - skb_release_data
  0.55  0.07  8.34 - - - scsi_prep_fn
  0.53  0.07  8.41 - - - tcp_sendmsg
  0.52  0.07  8.48 - - - internal_add_timer
  0.51  0.07  8.55 - - - drive_stat_acct
  0.51  0.07  8.62 - - - tcp_transmit_skb
  0.50  0.07  8.69 - - - task_rq_lock
  0.49  0.07  8.75 - - - get_user_pages
  0.48  0.06  8.82 - - - tcp_rcv_established
  0.47  0.06  8.88 - - - kmem_cache_free
  0.47  0.06  8.94 - - - save_switch_stack
  0.46  0.06  9.00 - - - del_timer
  0.46  0.06  9.07 - - - aio_pread
  0.45  0.06  9.13 - - - bio_alloc
  0.44  0.06  9.19 - - - finish_task_switch
  0.44  0.06  9.25 - - - ip_queue_xmit
  0.43  0.06  9.30 - - - move_tasks
  0.42  0.06  9.36 - - - lock_sock
  0.40  0.05  9.41 - - - elv_queue_empty
  0.40  0.05  9.47 - - - bio_add_page
  0.39  0.05  9.52 - - - try_atomic_semop
  0.38  0.05  9.57 - - - qla2x00_done
  0.38  0.05  9.62 - - - tcp_recvmsg
  0.37  0.05  9.67 - - - put_page
  0.37  0.05  9.72 - - - elv_next_request
  0.36  0.05  9.77 - 

RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Wednesday, March 09, 2005 5:34 PM
 What are these percentages?  Total CPU time?  The direct-io stuff doesn't
 look too bad.  It's surprising that tweaking the direct-io submission code
 makes much difference.

Percentage is relative to total kernel time.  There are three DIO functions
that showed up in the profile:

__blockdev_direct_IO              4.97%
dio_bio_end_io                    2.70%
dio_bio_complete                  1.20%

 hm.  __blockdev_direct_IO() doesn't actually do much.  I assume your damn
 compiler went and inlined direct_io_worker() on us.

We are using gcc version 3.4.3.  I suppose I can finger point gcc ? :-P

- Ken


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote on Wednesday, March 09, 2005 2:45 PM
 
   Did you generate a kernel profile?
 
   Top 40 kernel hot functions, percentage is normalized to kernel 
  utilization.
 
   _spin_unlock_irqrestore    23.54%
   _spin_unlock_irq   19.27%

 Cripes.

 Is that with CONFIG_PREEMPT?  If so, and if you disable CONFIG_PREEMPT,
 this cost should be accounted to the spin_unlock() caller and we can see
 who the culprit is.   Perhaps dio->bio_lock.

CONFIG_PREEMPT is off.

Sorry for all the confusion, I probably shouldn't have posted the first profile
and confused people.  See the 2nd profile that I posted earlier (copied here again).

scsi_request_fn                   7.54%
finish_task_switch                6.25%
__blockdev_direct_IO              4.97%
__make_request                    3.87%
scsi_end_request                  3.54%
dio_bio_end_io                    2.70%
follow_hugetlb_page               2.39%
__wake_up                         2.37%
aio_complete                      1.82%
kmem_cache_alloc                  1.68%
__mod_timer                       1.63%
e1000_clean                       1.57%
__generic_file_aio_read           1.42%
mempool_alloc                     1.37%
put_page                          1.35%
e1000_intr                        1.31%
schedule                          1.25%
dio_bio_complete                  1.20%
scsi_device_unbusy                1.07%
kmem_cache_free                   1.06%
__copy_user                       1.04%
scsi_dispatch_cmd                 1.04%
__end_that_request_first          1.04%
generic_make_request              1.02%
kfree                             0.94%
__aio_get_req                     0.93%
sys_pread64                       0.83%
get_request                       0.79%
put_io_context                    0.76%
dnotify_parent                    0.73%
vfs_read                          0.73%
update_atime                      0.73%
finished_one_bio                  0.63%
generic_file_aio_write_nolock     0.63%
scsi_put_command                  0.62%
break_fault                       0.62%
e1000_xmit_frame                  0.62%
aio_read_evt                      0.59%
scsi_io_completion                0.59%
inode_times_differ                0.58%


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Chen, Kenneth W wrote on Wednesday, March 09, 2005 5:45 PM
 Andrew Morton wrote on Wednesday, March 09, 2005 5:34 PM
  What are these percentages?  Total CPU time?  The direct-io stuff doesn't
  look too bad.  It's surprising that tweaking the direct-io submission code
  makes much difference.

 Percentage is relative to total kernel time.  There are three DIO functions
 showed up in the profile:

 __blockdev_direct_IO  4.97%
 dio_bio_end_io2.70%
 dio_bio_complete  1.20%

For the sake of comparison, let's look at the effect of performance patch on
raw device, in place of the above three functions, we now have two:

raw_file_rw 1.59%
raw_file_aio_rw 1.19%

A total saving of 6.09% (4.97 + 2.70 + 1.20 - 1.59 - 1.19).  That's only counting
the cpu cycles.  We have tons of other data showing significant kernel path
length reduction with the performance patch.  Cache misses are reduced across
the entire 3-level cache hierarchy; that's a secondary effect that cannot be
ignored, since the kernel is also competing for cache resources with the application.

- Ken


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
Chen, Kenneth W [EMAIL PROTECTED] wrote:

 Andrew Morton wrote on Wednesday, March 09, 2005 2:45 PM
   
 Did you generate a kernel profile?
   
 Top 40 kernel hot functions, percentage is normalized to kernel 
 utilization.
   
 _spin_unlock_irqrestore 23.54%
 _spin_unlock_irq        19.27%
  
   Cripes.
  
   Is that with CONFIG_PREEMPT?  If so, and if you disable CONFIG_PREEMPT,
   this cost should be accounted to the spin_unlock() caller and we can see
   who the culprit is.   Perhaps dio->bio_lock.
 
  CONFIG_PREEMPT is off.
 
  Sorry for all the confusion, I probably shouldn't post the first profile
  to confuse people.  See 2nd profile that I posted earlier (copied here 
 again).
 
  scsi_request_fn  7.54%
  finish_task_switch   6.25%
  __blockdev_direct_IO 4.97%
  __make_request   3.87%
  scsi_end_request 3.54%
  dio_bio_end_io   2.70%
  follow_hugetlb_page  2.39%
  __wake_up2.37%
  aio_complete 1.82%

What are these percentages?  Total CPU time?  The direct-io stuff doesn't
look too bad.  It's surprising that tweaking the direct-io submission code
makes much difference.

hm.  __blockdev_direct_IO() doesn't actually do much.  I assume your damn
compiler went and inlined direct_io_worker() on us.
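
(If someone wants to confirm that, one low-tech check -- purely a debugging
hack, not a proposed change -- is to stop gcc from folding the worker into
its caller and then re-profile:

    /* fs/direct-io.c */
    static ssize_t __attribute__((noinline))
    direct_io_worker(/* existing argument list unchanged */)

direct_io_worker() should then show up as its own line in the profile.)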

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
Chen, Kenneth W [EMAIL PROTECTED] wrote:

 This is all real: real benchmark running on real hardware, with real
  result showing large performance regression.  Nothing synthetic here.
 

Ken, could you *please* be more complete, more organized and more specific?

What does "1/3 of the total benchmark performance regression" mean?  One
third of 0.1% isn't very impressive.  You haven't told us anything at all
about the magnitude of this regression.

Where does the rest of the regression come from?

How much system time?  User time?  All that stuff.

  And yes, it is all worth pursuing, the two patches on raw device recuperate
  1/3 of the total benchmark performance regression.

The patch needs a fair bit of work, and if it still provides useful gains
when it's complete I guess it could make sense as some database special-case.

But the first thing to do is to work out where the cycles are going to.


Also, I'm rather peeved that we're hearing about this regression now rather
than two years ago.  And mystified as to why yours is the only group which
has reported it.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Chen, Kenneth W
Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
 What does 1/3 of the total benchmark performance regression mean?  One
 third of 0.1% isn't very impressive.  You haven't told us anything at all
 about the magnitude of this regression.

The 2.6.9 kernel is 6% slower compared to the distributor's 2.4 kernel (RHEL3).
Roughly 2% came from the storage driver (I'm not allowed to say anything beyond
that; there is a fix though).

2% came from DIO.

The rest of 2% is still unaccounted for.  We don't know where.

 How much system time?  User time?  All that stuff.
20.5% in the kernel, 79.5% in user space.


 But the first thing to do is to work out where the cycles are going to.
You've seen the profile.  That's where all the cycles went.


 Also, I'm rather peeved that we're hearing about this regression now rather
 than two years ago.  And mystified as to why yours is the only group which
 has reported it.

The 2.6.X kernel has never been faster than the 2.4 kernel (RHEL3).  At one point
in time, around 2.6.2, the gap was pretty close, at around 1%, but still slower.
Around 2.6.5, we found the global plug list was causing huge lock contention on
the 32-way numa box.  That got fixed in 2.6.7.  Then came 2.6.8, which took a big
dip at close to a 20% regression.  Then we fixed a 17% regression in the scheduler
(fixed with cache_decay_tick).  2.6.9 is the last one we measured and it is 6%
slower.  It's a constantly moving target, a wild goose chase.

I don't know why other people have not reported the problem; perhaps they
haven't had a chance to run a transaction-processing db workload on the 2.6
kernel.  Perhaps they have not compared, perhaps they are working on the same
problem.  I just don't know.

- Ken


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
David Lang [EMAIL PROTECTED] wrote:

 (I've seen a 50% 
  performance hit on 2.4 with just a thousand or two threads compared to 
  2.6)

Was that 2.4 kernel a vendor kernel with the O(1) scheduler?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
Chen, Kenneth W [EMAIL PROTECTED] wrote:

 Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
  What does 1/3 of the total benchmark performance regression mean?  One
  third of 0.1% isn't very impressive.  You haven't told us anything at all
  about the magnitude of this regression.
 
 2.6.9 kernel is 6% slower compare to distributor's 2.4 kernel (RHEL3).  Roughly
 2% came from storage driver (I'm not allowed to say anything beyond that, there
 is a fix though).

The codepaths are indeed longer in 2.6.

 2% came from DIO.

hm, that's not a lot.

Once you redo that patch to use aops and to work with O_DIRECT, the paths
will get a little deeper, but not much.  We really should do this so that
O_DIRECT works, and in case someone has gone and mmapped the blockdev.

Fine-grained alignment is probably too hard, and it should fall back to
__blockdev_direct_IO().

Does it do the right thing with a request which is non-page-aligned, but
512-byte aligned?

readv and writev?

2% is pretty thin :(

 The rest of 2% is still unaccounted for.  We don't know where.

General cache replacement, perhaps.  9MB is a big cache though.

 ...
 Around 2.6.5, we found global plug list is causing huge lock contention on
 32-way numa box.  That got fixed in 2.6.7.  Then comes 2.6.8 which took a big
 dip at close to 20% regression.  Then we fixed 17% regression in the scheduler
 (fixed with cache_decay_tick).  2.6.9 is the last one we measured and it is 6%
 slower.  It's a constant moving target, a wild goose to chase.
 

OK.  Seems that the 2.4 O(1) scheduler got it right for that machine.

 haven't got a chance to run transaction processing db workload on 2.6 kernel.
 Perhaps they have not compared, perhaps they are working on the same problem.
 I just don't know.

Maybe there are other factors which drown these little things out:
architecture improvements, choice of architecture, driver changes, etc.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread David Lang
On Wed, 9 Mar 2005, Chen, Kenneth W wrote:
Also, I'm rather peeved that we're hearing about this regression now rather
than two years ago.  And mystified as to why yours is the only group which
has reported it.
2.6.X kernel has never been faster than the 2.4 kernel (RHEL3).  At one 
point
of time, around 2.6.2, the gap is pretty close, at around 1%, but still slower.
Around 2.6.5, we found global plug list is causing huge lock contention on
32-way numa box.  That got fixed in 2.6.7.  Then comes 2.6.8 which took a big
dip at close to 20% regression.  Then we fixed 17% regression in the scheduler
(fixed with cache_decay_tick).  2.6.9 is the last one we measured and it is 6%
slower.  It's a constant moving target, a wild goose to chase.
I don't know why other people have not reported the problem, perhaps they
haven't got a chance to run transaction processing db workload on 2.6 kernel.
Perhaps they have not compared, perhaps they are working on the same problem.
I just don't know.
Also the 2.6 kernel is Soo much better in the case where you have many 
threads (even if they are all completely idle) that that improvement may 
be masking the regression that Ken is reporting (I've seen a 50% 
performance hit on 2.4 with just a thousand or two threads compared to 
2.6). let's face it, a typical linux box today starts up a LOT of stuff 
that will never get used, but will count as an idle thread.

David Lang
--
There are two ways of constructing a software design. One way is to make it so 
simple that there are obviously no deficiencies. And the other way is to make 
it so complicated that there are no obvious deficiencies.
 -- C.A.R. Hoare
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Jesse Barnes
On Wednesday, March 9, 2005 3:23 pm, Andi Kleen wrote:
 Chen, Kenneth W [EMAIL PROTECTED] writes:
  Just to clarify here, these data need to be taken at grain of salt. A
  high count in _spin_unlock_* functions do not automatically points to
  lock contention.  It's one of the blind spot syndrome with timer based
  profile on ia64.  There are some lock contentions in 2.6 kernel that
  we are staring at.  Please do not misinterpret the number here.

 Why don't you use oprofile? It uses NMIs and can profile inside
 interrupt disabled sections.

Oh, and there are other ways of doing interrupt off profiling by using the 
PMU.  q-tools can do this I think.

Jesse
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Vasquez
On Wed, 09 Mar 2005, Chen, Kenneth W wrote:

 Andrew Morton wrote Wednesday, March 09, 2005 6:26 PM
  What does 1/3 of the total benchmark performance regression mean?  One
  third of 0.1% isn't very impressive.  You haven't told us anything at all
  about the magnitude of this regression.
 
 2.6.9 kernel is 6% slower compare to distributor's 2.4 kernel (RHEL3).  Roughly
 2% came from storage driver (I'm not allowed to say anything beyond that, there
 is a fix though).
 

Ok now, that statement piqued my interest -- since looking through a
previous email it seems you are using the qla2xxx driver.  Care to
elaborate?

Regards,
Andrew Vasquez
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Jesse Barnes
On Wednesday, March 9, 2005 3:23 pm, Andi Kleen wrote:
 Chen, Kenneth W [EMAIL PROTECTED] writes:
  Just to clarify here, these data need to be taken at grain of salt. A
  high count in _spin_unlock_* functions do not automatically points to
  lock contention.  It's one of the blind spot syndrome with timer based
  profile on ia64.  There are some lock contentions in 2.6 kernel that
  we are staring at.  Please do not misinterpret the number here.

 Why don't you use oprofile? It uses NMIs and can profile inside
 interrupt disabled sections.

That was oprofile output, but on ia64, 'NMI's are maskable due to the way irq 
disabling works.  Here's a very hackish patch that changes the kernel to use 
cr.tpr instead of psr.i for interrupt control.  Making oprofile use real ia64 
NMIs is left as an exercise for the reader :)

Jesse
= arch/ia64/Kconfig.debug 1.2 vs edited =
--- 1.2/arch/ia64/Kconfig.debug 2005-01-07 16:15:52 -08:00
+++ edited/arch/ia64/Kconfig.debug  2005-02-28 10:07:27 -08:00
@@ -56,6 +56,15 @@
  and restore instructions.  It's useful for tracking down spinlock
  problems, but slow!  If you're unsure, select N.
 
+config IA64_ALLOW_NMI
+   bool "Allow non-maskable interrupts"
+   help
+ The normal ia64 irq enable/disable code prevents even non-maskable
+ interrupts from occuring, which can be a problem for kernel
+ debuggers, watchdogs, and profilers.  Say Y here if you're interested
+ in NMIs and don't mind the small performance penalty this option
+ imposes.
+
 config SYSVIPC_COMPAT
bool
depends on COMPAT && SYSVIPC
= arch/ia64/kernel/head.S 1.31 vs edited =
--- 1.31/arch/ia64/kernel/head.S2005-01-28 15:50:13 -08:00
+++ edited/arch/ia64/kernel/head.S  2005-03-01 13:17:51 -08:00
@@ -59,6 +59,14 @@
.save rp, r0// terminate unwind chain with a NULL rp
.body
 
+#ifdef CONFIG_IA64_ALLOW_NMI   // disable interrupts initially (re-enabled in start_kernel())
+   mov r16=1<<16
+   ;;
+   mov cr.tpr=r16
+   ;;
+   srlz.d
+   ;;
+#endif
rsm psr.i | psr.ic
;;
srlz.i
@@ -129,8 +137,8 @@
/*
 * Switch into virtual mode:
 */
-   movl r16=(IA64_PSR_IT|IA64_PSR_IC|IA64_PSR_DT|IA64_PSR_RT|IA64_PSR_DFH|IA64_PSR_BN \
- |IA64_PSR_DI)
+   movl r16=(IA64_PSR_IT|IA64_PSR_IC|IA64_PSR_I|IA64_PSR_DT|IA64_PSR_RT|IA64_PSR_DFH|\
+ IA64_PSR_BN|IA64_PSR_DI)
;;
mov cr.ipsr=r16
movl r17=1f
= arch/ia64/kernel/irq_ia64.c 1.25 vs edited =
--- 1.25/arch/ia64/kernel/irq_ia64.c2005-01-22 15:54:49 -08:00
+++ edited/arch/ia64/kernel/irq_ia64.c  2005-03-01 12:50:18 -08:00
@@ -103,8 +103,6 @@
 void
 ia64_handle_irq (ia64_vector vector, struct pt_regs *regs)
 {
-   unsigned long saved_tpr;
-
 #if IRQ_DEBUG
{
unsigned long bsp, sp;
@@ -135,17 +133,9 @@
}
 #endif /* IRQ_DEBUG */
 
-   /*
-* Always set TPR to limit maximum interrupt nesting depth to
-* 16 (without this, it would be ~240, which could easily lead
-* to kernel stack overflows).
-*/
irq_enter();
-   saved_tpr = ia64_getreg(_IA64_REG_CR_TPR);
-   ia64_srlz_d();
while (vector != IA64_SPURIOUS_INT_VECTOR) {
if (!IS_RESCHEDULE(vector)) {
-   ia64_setreg(_IA64_REG_CR_TPR, vector);
ia64_srlz_d();
 
__do_IRQ(local_vector_to_irq(vector), regs);
@@ -154,7 +144,6 @@
 * Disable interrupts and send EOI:
 */
local_irq_disable();
-   ia64_setreg(_IA64_REG_CR_TPR, saved_tpr);
}
ia64_eoi();
vector = ia64_get_ivr();
@@ -165,6 +154,7 @@
 * come through until ia64_eoi() has been done.
 */
irq_exit();
+   local_irq_enable();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
= include/asm-ia64/hw_irq.h 1.15 vs edited =
--- 1.15/include/asm-ia64/hw_irq.h  2005-01-22 15:54:52 -08:00
+++ edited/include/asm-ia64/hw_irq.h2005-03-01 13:01:03 -08:00
@@ -36,6 +36,10 @@
 
 #define AUTO_ASSIGN-1
 
+#define IA64_NMI_VECTOR   0x02   /* NMI (note that this can be
+  masked if psr.i or psr.ic
+  are cleared) */
+
 #define IA64_SPURIOUS_INT_VECTOR   0x0f
 
 /*
= include/asm-ia64/system.h 1.48 vs edited =
--- 1.48/include/asm-ia64/system.h  2005-01-04 18:48:18 -08:00
+++ edited/include/asm-ia64/system.h2005-03-01 15:28:23 -08:00
@@ -107,12 +107,61 @@
 
 #define safe_halt() ia64_pal_halt_light()/* PAL_HALT_LIGHT */
 
+/* For spinlocks etc */
+#ifdef CONFIG_IA64_ALLOW_NMI
+
+#define IA64_TPR_MMI_BIT (1<<16)
+
+#define 

Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-09 Thread Andrew Morton
Chen, Kenneth W [EMAIL PROTECTED] wrote:

  Did you generate a kernel profile?
 
  Top 40 kernel hot functions, percentage is normalized to kernel utilization.
 
  _spin_unlock_irqrestore  23.54%
  _spin_unlock_irq 19.27%

Cripes.

Is that with CONFIG_PREEMPT?  If so, and if you disable CONFIG_PREEMPT,
this cost should be accounted to the spin_unlock() caller and we can see
who the culprit is.   Perhaps dio->bio_lock.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-08 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
>
> Direct I/O on block device running 2.6.X kernel is a lot SLOWER
>  than running on a 2.4 Kernel!
> 

A little bit slower, it appears.   It used to be faster.

> ...
> 
>   synchronous I/O AIO
>   (pread/pwrite/read/write)   io_submit
>  2.4.21 based
>  (RHEL3)  265,122 229,810
> 
>  2.6.9    218,565 206,917
>  2.6.10   213,041 205,891
>  2.6.11   212,284 201,124

What sort of CPU?

What speed CPU?

What size requests?

Reads or writes?

At 5 usecs per request I figure that's 3% CPU utilisation for 16k requests
at 100 MB/sec.
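
(Spelling that estimate out as a sanity check, using the figures quoted
above and 1 MB = 1024 KB:

    100 MB/s / 16 KB per request    ~= 6,400 requests/s
    6,400 requests/s * 5 us/request ~= 32 ms of CPU per second ~= 3% of one CPU.)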

Once you bolt this onto a real device driver the proportional difference
will fall, due to addition of the constant factor.

Once you bolt all this onto a real disk controller all the numbers will get
worse (but in a strictly proportional manner) due to the disk transfers
depriving the CPU of memory bandwidth.

The raw driver is deprecated and we'd like to remove it.  The preferred way
of doing direct-IO against a blockdev is by opening it with O_DIRECT.
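
For reference, the preferred path looks something like this from user space
-- essentially the pread_null.c loop pointed at the block device itself and
opened with O_DIRECT (the device name and the 4KB request size are just
placeholders):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h>

int main(void)
{
        char *addr;
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);

        if (fd == -1) {
                perror("error opening\n");
                exit(0);
        }
        addr = memalign(4096, 4096);    /* O_DIRECT wants an aligned user buffer */
        if (addr == 0) {
                printf("no memory\n");
                exit(0);
        }
        while (1)
                pread(fd, addr, 4096, 0);
}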

Your patches don't address blockdevs opened with O_DIRECT.  What you should
do is to make def_blk_aops.direct_IO point at a new function.  That will
then work correctly with both raw and with open(/dev/hdX, O_DIRECT).
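
Schematically -- and this is only a sketch of the shape, not a tested patch;
blkdev_fast_direct_IO is a made-up name and the fast-path body is elided:

/* fs/block_dev.c: give the blockdev mapping its own direct_IO method */
static ssize_t
blkdev_fast_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
                        loff_t offset, unsigned long nr_segs)
{
        struct inode *inode = iocb->ki_filp->f_mapping->host;
        unsigned long blocksize_mask = (1 << inode->i_blkbits) - 1;

        /* only a single, block-aligned segment takes the fast path;
         * readv/writev and odd alignment fall back to the existing
         * __blockdev_direct_IO() based code so nothing breaks */
        if (nr_segs != 1 ||
            ((unsigned long) iov->iov_base | iov->iov_len | offset) & blocksize_mask)
                return blkdev_direct_IO(rw, iocb, iov, offset, nr_segs);

        /* ... build and submit bios directly, as in the raw.c patches ... */
        return -EINVAL;         /* placeholder */
}

and then in def_blk_aops:   .direct_IO = blkdev_fast_direct_IO,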


But before doing anything else, please bench this on real hardware, see if
it is worth pursuing.  And gather an oprofile trace of the existing code.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel - fix sync I/O path

2005-03-08 Thread Chen, Kenneth W
Christoph Hellwig wrote on Tuesday, March 08, 2005 6:20 PM
> this is not the blockdevice, but the obsolete raw device driver.  Please
> benchmark and if nessecary fix the blockdevice O_DIRECT codepath insted
> as the raw driver is slowly going away.

From a performance perspective, can the raw device be resurrected? (just asking)

- Ken


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Direct io on block device has performance regression on 2.6.x kernel - fix sync I/O path

2005-03-08 Thread Christoph Hellwig
> --- linux-2.6.9/drivers/char/raw.c  2004-10-18 14:54:37.0 -0700
> +++ linux-2.6.9.ken/drivers/char/raw.c  2005-03-08 17:22:07.0 -0800

this is not the blockdevice, but the obsolete raw device driver.  Please
benchmark and if necessary fix the blockdevice O_DIRECT codepath instead,
as the raw driver is slowly going away.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Direct io on block device has performance regression on 2.6.x kernel - fix AIO path

2005-03-08 Thread Chen, Kenneth W
This patch adds block device direct I/O for the AIO path.

30% performance gain!!

AIO (io_submit)
2.6.9   206,917
2.6.9+patches   268,484

- Ken


Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- linux-2.6.9/drivers/char/raw.c  2005-03-08 17:22:07.0 -0800
+++ linux-2.6.9.ken/drivers/char/raw.c  2005-03-08 17:25:38.0 -0800
@@ -385,21 +385,148 @@ static ssize_t raw_file_write(struct fil
return raw_file_rw(file, (char __user *) buf, count, ppos, WRITE);
 }

-static ssize_t raw_file_aio_write(struct kiocb *iocb, const char __user *buf,
-   size_t count, loff_t pos)
+int raw_end_aio(struct bio *bio, unsigned int bytes_done, int error)
 {
-   struct iovec local_iov = {
-   .iov_base = (char __user *)buf,
-   .iov_len = count
-   };
+   struct kiocb* iocb = bio->bi_private;
+   atomic_t* bio_count = (atomic_t*) &iocb->private;
+
+   if ((bio->bi_rw & 0x1) == READ)
+   bio_check_pages_dirty(bio);
+   else {
+   int i;
+   struct bio_vec *bvec = bio->bi_io_vec;
+   struct page *page;
+   for (i = 0; i < bio->bi_vcnt; i++) {
+   page = bvec[i].bv_page;
+   if (page)
+   put_page(page);
+   }
+   bio_put(bio);
+   }
+   if (atomic_dec_and_test(bio_count))
+   aio_complete(iocb, iocb->ki_nbytes, 0);

-   return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
+   return 0;
 }

+static ssize_t raw_file_aio_rw(struct kiocb *iocb, char __user *buf,
+   size_t count, loff_t pos, int rw)
+{
+   struct inode * inode = iocb->ki_filp->f_mapping->host;
+   unsigned long blkbits = inode->i_blkbits;
+   unsigned long blocksize_mask = (1<< blkbits) - 1;
+   struct page * quick_list[PAGE_QUICK_LIST];
+   int nr_pages, cur_offset, cur_len;
+   struct bio * bio;
+   unsigned long ret;
+   unsigned long addr = (unsigned long) buf;
+   loff_t size;
+   int pg_idx;
+   atomic_t *bio_count = (atomic_t *) &iocb->private;
+
+   if (count == 0)
+   return 0;
+
+   /* first check the alignment */
+   if (addr & blocksize_mask || count & blocksize_mask ||
+   count < 0 || pos & blocksize_mask)
+   return -EINVAL;
+
+   size = i_size_read(inode);
+   if (pos >= size)
+   return -ENXIO;
+   if (pos + count > size)
+   count = size - pos;
+
+   nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
+   addr / PAGE_SIZE;
+
+   pg_idx = PAGE_QUICK_LIST;
+   atomic_set(bio_count, 1);
+
+start:
+   bio = bio_alloc(GFP_KERNEL, nr_pages);
+   if (unlikely(bio == NULL)) {
+   if (atomic_read(bio_count) == 1)
+   return -ENOMEM;
+   else {
+   iocb->ki_nbytes = addr - (unsigned long) buf;
+   goto out;
+   }
+   }
+
+   /* initialize bio */
+   bio->bi_bdev = I_BDEV(inode);
+   bio->bi_end_io = raw_end_aio;
+   bio->bi_private = iocb;
+   bio->bi_sector = pos >> blkbits;
+
+   while (count > 0) {
+   cur_offset = addr & ~PAGE_MASK;
+   cur_len = PAGE_SIZE - cur_offset;
+   if (cur_len > count)
+   cur_len = count;
+
+   if (pg_idx >= PAGE_QUICK_LIST) {
+   down_read(&current->mm->mmap_sem);
+   ret = get_user_pages(current, current->mm, addr,
+   min(nr_pages, PAGE_QUICK_LIST),
+   rw==READ, 0, quick_list, NULL);
+   up_read(&current->mm->mmap_sem);
+   if (unlikely(ret < 0)) {
+   bio_put(bio);
+   if (atomic_read(bio_count) == 1)
+   return ret;
+   else {
+   iocb->ki_nbytes = addr - (unsigned long) buf;
+   goto out;
+   }
+   }
+   pg_idx = 0;
+   }
+
+   if (unlikely(!bio_add_page(bio, quick_list[pg_idx], cur_len, cur_offset))) {
+   atomic_inc(bio_count);
+   if (rw == READ)
+   bio_set_pages_dirty(bio);
+   submit_bio(rw, bio);
+   pos += addr - (unsigned long) buf;
+   goto start;
+   }
+
+   addr += cur_len;
+   count -= cur_len;
+   pg_idx++;
+   nr_pages--;
+   }
+
+   atomic_inc(bio_count);
+   if (rw == READ)
+ 

Direct io on block device has performance regression on 2.6.x kernel - pseudo disk driver

2005-03-08 Thread Chen, Kenneth W
The pseudo disk driver that I used to stress the kernel I/O stack
(anything above block layer, AIO/DIO/BIO).

- Ken



diff -Nur zero/blknull.c blknull/blknull.c
--- zero/blknull.c  1969-12-31 16:00:00.0 -0800
+++ blknull/blknull.c   2005-03-03 19:04:07.0 -0800
@@ -0,0 +1,97 @@
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/fs.h>
+#include <linux/bio.h>
+#include <linux/blkpg.h>
+#include <linux/spinlock.h>
+
+#include <linux/blkdev.h>
+#include <linux/genhd.h>
+
+#define BLK_NULL_MAJOR 60
+#define BLK_NULL_NAME  "blknull"
+
+
+MODULE_AUTHOR("Ken Chen");
+MODULE_DESCRIPTION("null block driver");
+MODULE_LICENSE("GPL");
+
+
+spinlock_t driver_lock;
+struct request_queue *q;
+struct gendisk *disk;
+
+
+static int null_open(struct inode *inode, struct file *filp)
+{
+   return 0;
+}
+
+static int null_release(struct inode *inode, struct file *filp)
+{
+   return 0;
+}
+
+static struct block_device_operations null_fops = {
+   .owner  = THIS_MODULE,
+   .open   = null_open,
+   .release= null_release,
+};
+
+static void do_null_request(request_queue_t *q)
+{
+   struct request *req;
+
+   while (!blk_queue_plugged(q)) {
+   req = elv_next_request(q);
+   if (!req)
+   break;
+
+   blkdev_dequeue_request(req);
+
+   end_that_request_first(req, 1, req->nr_sectors);
+   end_that_request_last(req);
+   }
+}
+
+static int __init init_blk_null_module(void)
+{
+
+   if (register_blkdev(BLK_NULL_MAJOR, BLK_NULL_NAME)) {
+   printk(KERN_ERR "Unable to register null blk device\n");
+   return 0;
+   }
+
+   spin_lock_init(&driver_lock);
+   q = blk_init_queue(do_null_request, &driver_lock);
+   if (q) {
+   disk = alloc_disk(1);
+
+   if (disk) {
+   disk->major = BLK_NULL_MAJOR;
+   disk->first_minor = 0;
+   disk->fops = &null_fops;
+   disk->capacity = 1<<30;
+   disk->queue = q;
+   memcpy(disk->disk_name, BLK_NULL_NAME, sizeof(BLK_NULL_NAME));
+   add_disk(disk);
+   return 1;
+   }
+
+   blk_cleanup_queue(q);
+   }
+   unregister_blkdev(BLK_NULL_MAJOR, BLK_NULL_NAME);
+   return 0;
+}
+
+static void __exit exit_blk_null_module(void)
+{
+   del_gendisk(disk);
+   blk_cleanup_queue(q);
+   unregister_blkdev(BLK_NULL_MAJOR, BLK_NULL_NAME);
+}
+
+module_init(init_blk_null_module);
+module_exit(exit_blk_null_module);
diff -Nur zero/Makefile blknull/Makefile
--- zero/Makefile   1969-12-31 16:00:00.0 -0800
+++ blknull/Makefile2005-03-03 18:42:55.0 -0800
@@ -0,0 +1 @@
+obj-m := blknull.o



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-08 Thread Chen, Kenneth W
OK, last one in the series: user level test programs that stress
the kernel I/O stack.  Pretty dull stuff.

- Ken



diff -Nur zero/aio_null.c blknull_test/aio_null.c
--- zero/aio_null.c 1969-12-31 16:00:00.0 -0800
+++ blknull_test/aio_null.c 2005-03-08 00:46:17.0 -0800
@@ -0,0 +1,76 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAXAIO 1024
+
+char   buf[4096] __attribute__((aligned(4096)));
+
+io_context_t   io_ctx;
+struct iocb   iocbpool[MAXAIO];
+struct io_event   ioevent[MAXAIO];
+
+void aio_setup(int n)
+{
+   int res = io_queue_init(n, &io_ctx);
+   if (res != 0) {
+   printf("io_queue_setup(%d) returned %d (%s)\n",
+   n, res, strerror(-res));
+   exit(0);
+   }
+}
+
+main(int argc, char* argv[])
+{
+   int fd, i, status, batch;
+   struct iocb* iocbbatch[MAXAIO];
+   int devidx;
+   off_t   offset;
+   unsigned long start, end;
+
+   batch = argc < 2 ? 100: atoi(argv[1]);
+   if (batch >= MAXAIO)
+   batch = MAXAIO;
+
+   aio_setup(MAXAIO);
+   fd = open("/dev/raw/raw1", O_RDONLY);
+
+   if (fd == -1) {
+   perror("error opening\n");
+   exit (0);
+   }
+   for (i=0; i<batch; i++) {
+   iocbbatch[i] = iocbpool+i;
+   io_prep_pread(iocbbatch[i], fd, buf, 4096, 0);
+   }
+
+   while (1) {
+   struct timespec ts={30,0};
+   int bufidx;
+   int reap;
+
+   status = io_submit(io_ctx, i, iocbbatch);
+   if (status != i) {
+   printf("bad io_submit: %d [%s]\n", status,
+   strerror(-status));
+   }
+
+   // reap at least batch count back
+   reap = io_getevents(io_ctx, batch, MAXAIO, ioevent, &ts);
+   if (reap < batch) {
+   printf("io_getevents returned=%d [%s]\n", reap,
+   strerror(-reap));
+   }
+
+   // check the return result of each event
+   for (i=0; i<reap; i++)
+   if (ioevent[i].res != 4096)
+   printf("error in read: %lx\n", ioevent[i].res);
+   } /* while (1) */
+}
diff -Nur zero/pread_null.c blknull_test/pread_null.c
--- zero/pread_null.c   1969-12-31 16:00:00.0 -0800
+++ blknull_test/pread_null.c   2005-03-08 00:44:20.0 -0800
@@ -0,0 +1,27 @@
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <malloc.h>
+
+main(int argc, char* argv[])
+{
+   int fd;
+   char *addr;
+
+   fd = open("/dev/raw/raw1", O_RDONLY);
+   if (fd == -1) {
+   perror("error opening\n");
+   exit(0);
+   }
+
+   addr = memalign(4096, 4096);
+   if (addr == 0) {
+   printf("no memory\n");
+   exit(0);
+   }
+
+   while (1) {
+   pread(fd, addr, 4096, 0);
+   }
+
+}
diff -Nur zero/makefile blknull_test/makefile
--- zero/makefile   1969-12-31 16:00:00.0 -0800
+++ blknull_test/makefile   2005-03-08 17:10:39.0 -0800
@@ -0,0 +1,10 @@
+all:   pread_null aio_null
+
+pread_null: pread_null.c
+   gcc -O3 -o $@ pread_null.c
+
+aio_null: aio_null.c
+   gcc -O3 -o $@ aio_null.c -laio
+
+clean:
+   rm -f pread_null aio_null



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Direct io on block device has performance regression on 2.6.x kernel - fix sync I/O path

2005-03-08 Thread Chen, Kenneth W
This patch adds block device direct I/O for the synchronous path.
I added it in the raw device code to demo the performance effect.

48% performance gain!!


synchronous I/O
(pread/pwrite/read/write)
2.6.9   218,565
2.6.9+patches   323,016
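
(As a cross-check, assuming the test loop is CPU bound so that throughput is
simply the inverse of per-I/O processing cost: 323,016 / 218,565 ~= 1.48,
which is where the 48% comes from; per request that is roughly 4.6 us down
to 3.1 us of CPU per 4 KB I/O.)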

- Ken


Signed-off-by: Ken Chen <[EMAIL PROTECTED]>


diff -Nurp linux-2.6.9/drivers/char/raw.c linux-2.6.9.ken/drivers/char/raw.c
--- linux-2.6.9/drivers/char/raw.c  2004-10-18 14:54:37.0 -0700
+++ linux-2.6.9.ken/drivers/char/raw.c  2005-03-08 17:22:07.0 -0800
@@ -238,15 +238,151 @@ out:
return err;
 }

+struct rio {
+   atomic_t bio_count;
+   struct task_struct *p;
+};
+
+int raw_end_io(struct bio *bio, unsigned int bytes_done, int error)
+{
+   struct rio * rio = bio->bi_private;
+
+   if ((bio->bi_rw & 0x1) == READ)
+   bio_check_pages_dirty(bio);
+   else {
+   int i;
+   struct bio_vec *bvec = bio->bi_io_vec;
+   struct page *page;
+   for (i = 0; i < bio->bi_vcnt; i++) {
+   page = bvec[i].bv_page;
+   if (page)
+   put_page(page);
+   }
+   bio_put(bio);
+   }
+
+   if (atomic_dec_and_test(&rio->bio_count))
+   wake_up_process(rio->p);
+   return 0;
+}
+
+#define PAGE_QUICK_LIST16
+static ssize_t raw_file_rw(struct file *filp, char __user *buf,
+   size_t count, loff_t *ppos, int rw)
+{
+   struct inode * inode = filp->f_mapping->host;
+   unsigned long blkbits = inode->i_blkbits;
+   unsigned long blocksize_mask = (1<< blkbits) - 1;
+   struct page * quick_list[PAGE_QUICK_LIST];
+   int nr_pages, cur_offset, cur_len, pg_idx;
+   struct bio * bio;
+   unsigned long ret;
+   unsigned long addr = (unsigned long) buf;
+   loff_t pos = *ppos, size;
+   struct rio rio;
+
+   if (count == 0)
+   return 0;
+
+   /* first check the alignment */
+   if (addr & blocksize_mask || count & blocksize_mask ||
+   count < 0 || pos & blocksize_mask)
+   return -EINVAL;
+
+   size = i_size_read(inode);
+   if (pos >= size)
+   return -ENXIO;
+   if (pos + count > size)
+   count = size - pos;
+
+   nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
+   addr / PAGE_SIZE;
+
+   pg_idx = PAGE_QUICK_LIST;
+   atomic_set(&rio.bio_count, 1);
+   rio.p = current;
+
+start:
+   bio = bio_alloc(GFP_KERNEL, nr_pages);
+   if (unlikely(bio == NULL)) {
+   if (atomic_read(&rio.bio_count) == 1)
+   return -ENOMEM;
+   else {
+   goto out;
+   }
+   }
+
+   /* initialize bio */
+   bio->bi_bdev = I_BDEV(inode);
+   bio->bi_end_io = raw_end_io;
+   bio->bi_private = &rio;
+   bio->bi_sector = pos >> blkbits;
+
+   while (count > 0) {
+   cur_offset = addr & ~PAGE_MASK;
+   cur_len = PAGE_SIZE - cur_offset;
+   if (cur_len > count)
+   cur_len = count;
+
+   if (pg_idx >= PAGE_QUICK_LIST) {
+   down_read(&current->mm->mmap_sem);
+   ret = get_user_pages(current, current->mm, addr,
+   min(nr_pages, PAGE_QUICK_LIST),
+   rw==READ, 0, quick_list, NULL);
+   up_read(&current->mm->mmap_sem);
+   if (unlikely(ret < 0)) {
+   bio_put(bio);
+   if (atomic_read(&rio.bio_count) == 1)
+   return ret;
+   else {
+   goto out;
+   }
+   }
+   pg_idx = 0;
+   }
+
+   if (unlikely(!bio_add_page(bio, quick_list[pg_idx], cur_len,
+   cur_offset))) {
+   atomic_inc(&rio.bio_count);
+   if (rw == READ)
+   bio_set_pages_dirty(bio);
+   submit_bio(rw, bio);
+   pos += addr - (unsigned long) buf;
+   goto start;
+   }
+
+   addr += cur_len;
+   count -= cur_len;
+   pg_idx++;
+   nr_pages--;
+   }
+
+   atomic_inc(&rio.bio_count);
+   if (rw == READ)
+   bio_set_pages_dirty(bio);
+   submit_bio(rw, bio);
+out:
+   set_current_state(TASK_UNINTERRUPTIBLE);
+   blk_run_address_space(inode->i_mapping);
+   if (!atomic_dec_and_test(&rio.bio_count))
+   io_schedule();
+   set_current_state(TASK_RUNNING);
+
+   ret = addr - (unsigned long) buf;
+   *ppos += ret;
+   

Direct io on block device has performance regression on 2.6.x kernel

2005-03-08 Thread Chen, Kenneth W
I don't know where to start, but let me start with the bombshell:

Direct I/O on block device running 2.6.X kernel is a lot SLOWER
than running on a 2.4 Kernel!


Processing a direct I/O request on 2.6 is taking a lot more cpu
time compared to the same I/O request running on a 2.4 kernel.

The proof: easy.  I started off by having a pseudo disk, a software
disk that has zero access latency.  By hooking this pseudo disk into
the block layer API, I can effectively stress the entire I/O stack
above the block level.  Combined with user level test programs that
simply submit direct I/O in a simple while loop, I can measure how
fast kernel can process these I/O requests.  The performance metric
can be either throughput (# of I/O per second) or per unit of work
(processing time per I/O).  For the data presented below, I'm using
throughput metric (meaning a larger number is better performance).
Pay attention to the relative percentages, as the absolute numbers depend
on the platform/CPU that the test suite runs on.


                synchronous I/O             AIO
                (pread/pwrite/read/write)   io_submit
2.4.21 based
(RHEL3)         265,122                     229,810

2.6.9           218,565                     206,917
2.6.10          213,041                     205,891
2.6.11          212,284                     201,124

From the above chart, you can see that the 2.6 kernel is at least 18%
slower in processing direct I/O (on a block device) in the synchronous
path and 10% slower in the AIO path compared to a distributor's 2.4
kernel.  What's worse, with each advance of kernel version, the I/O
path is becoming slower and slower.
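
(Put another way -- assuming the benchmark loop is fully CPU bound on one
processor, so throughput is just the inverse of per-I/O processing cost --
the synchronous-path numbers above work out to roughly:

    2.4.21 (RHEL3):   1 / 265,122 IO/s  ~= 3.8 us of CPU per I/O
    2.6.9:            1 / 218,565 IO/s  ~= 4.6 us of CPU per I/O

i.e. about 0.8 us of extra processing per 4 KB request.)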

Most of the performance regression for 2.6.9 came from dio layer (I
still have to find where the regression came from with 2.6.10 and 2.6.11).
DIO is just overloaded with too many areas to cover.  I think it's better
to break things up a little bit.

For example, by having a set of dedicated functions that do direct I/O
on block device improves the performance dramatically:

                synchronous I/O             AIO
                (pread/pwrite/read/write)   io_submit
2.4.21 based
(RHEL3)         265,122                     229,810
2.6.9           218,565                     206,917
2.6.9+patches   323,016                     268,484

See, we can actually be 22% faster in the synchronous path and 17% faster
in the AIO path, if we do it right!

Kernel patch and test suite to follow in the next couple postings.

- Ken




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Direct io on block device has performance regression on 2.6.x kernel

2005-03-08 Thread Chen, Kenneth W
I don't know where to start, but let me start with the bombshell:

Direct I/O on block device running 2.6.X kernel is a lot SLOWER
than running on a 2.4 Kernel!


Processing a direct I/O request on 2.6 is taking a lot more cpu
time compare to the same I/O request running on a 2.4 kernel.

The proof: easy.  I started off by having a pseudo disk, a software
disk that has zero access latency.  By hooking this pseudo disk into
the block layer API, I can effectively stress the entire I/O stack
above the block level.  Combined with user level test programs that
simply submit direct I/O in a simple while loop, I can measure how
fast kernel can process these I/O requests.  The performance metric
can be either throughput (# of I/O per second) or per unit of work
(processing time per I/O).  For the data presented below, I'm using
throughput metric (meaning larger number is better performance).
Pay attention to relative percentage as absolute number depends on
platform/CPU that test suite runs on.


synchronous I/O AIO
(pread/pwrite/read/write)   io_submit
2.4.21 based
(RHEL3) 265,122 229,810

2.6.9   218,565 206,917
2.6.10  213,041 205,891
2.6.11  212,284 201,124

From the above chart, you can see that 2.6 kernel is at least 18%
slower in processing direct I/O (on block device) in the synchronous
path and 10% slower in the AIO path compare to a distributor's 2.4
kernel.  What's worse, with each advance of kernel version, the I/O
path is becoming slower and slower.

Most of the performance regression for 2.6.9 came from dio layer (I
still have to find where the regression came from with 2.6.10 and 2.6.11).
DIO is just overloaded with too many areas to cover.  I think it's better
to break things up a little bit.

For example, by having a set of dedicated functions that do direct I/O
on block device improves the performance dramatically:

synchronous I/O AIO
(pread/pwrite/read/write)   io_submit
2.4.21 based
(RHEL3) 265,122 229,810
2.6.9   218,565 206,917
2.6.9+patches   323,016 268,484

See, we can be actually 22% faster in synchronous path and 17% faster
In the AIO path, if we do it right!

Kernel patch and test suite to follow in the next couple postings.

- Ken




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Direct io on block device has performance regression on 2.6.x kernel - fix sync I/O path

2005-03-08 Thread Chen, Kenneth W
This patch adds block device direct I/O for synchronous path.
I added in the raw device code to demo the performance effect.

48% performance gain!!


synchronous I/O
(pread/pwrite/read/write)
2.6.9   218,565
2.6.9+patches   323,016

- Ken


Signed-off-by: Ken Chen [EMAIL PROTECTED]


diff -Nurp linux-2.6.9/drivers/char/raw.c linux-2.6.9.ken/drivers/char/raw.c
--- linux-2.6.9/drivers/char/raw.c  2004-10-18 14:54:37.0 -0700
+++ linux-2.6.9.ken/drivers/char/raw.c  2005-03-08 17:22:07.0 -0800
@@ -238,15 +238,151 @@ out:
return err;
 }

+struct rio {
+   atomic_t bio_count;
+   struct task_struct *p;
+};
+
+int raw_end_io(struct bio *bio, unsigned int bytes_done, int error)
+{
+   struct rio * rio = bio-bi_private;
+
+   if ((bio-bi_rw  0x1) == READ)
+   bio_check_pages_dirty(bio);
+   else {
+   int i;
+   struct bio_vec *bvec = bio-bi_io_vec;
+   struct page *page;
+   for (i = 0; i  bio-bi_vcnt; i++) {
+   page = bvec[i].bv_page;
+   if (page)
+   put_page(page);
+   }
+   bio_put(bio);
+   }
+
+   if (atomic_dec_and_test(rio-bio_count))
+   wake_up_process(rio-p);
+   return 0;
+}
+
+#define PAGE_QUICK_LIST16
+static ssize_t raw_file_rw(struct file *filp, char __user *buf,
+   size_t count, loff_t *ppos, int rw)
+{
+   struct inode * inode = filp-f_mapping-host;
+   unsigned long blkbits = inode-i_blkbits;
+   unsigned long blocksize_mask = (1 blkbits) - 1;
+   struct page * quick_list[PAGE_QUICK_LIST];
+   int nr_pages, cur_offset, cur_len, pg_idx;
+   struct bio * bio;
+   unsigned long ret;
+   unsigned long addr = (unsigned long) buf;
+   loff_t pos = *ppos, size;
+   struct rio rio;
+
+   if (count == 0)
+   return 0;
+
+   /* first check the alignment */
+   if (addr  blocksize_mask || count  blocksize_mask ||
+   count  0 || pos  blocksize_mask)
+   return -EINVAL;
+
+   size = i_size_read(inode);
+   if (pos = size)
+   return -ENXIO;
+   if (pos + count  size)
+   count = size - pos;
+
+   nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
+   addr / PAGE_SIZE;
+
+   pg_idx = PAGE_QUICK_LIST;
+   atomic_set(rio.bio_count, 1);
+   rio.p = current;
+
+start:
+   bio = bio_alloc(GFP_KERNEL, nr_pages);
+   if (unlikely(bio == NULL)) {
+   if (atomic_read(rio.bio_count) == 1)
+   return -ENOMEM;
+   else {
+   goto out;
+   }
+   }
+
+   /* initialize bio */
+   bio-bi_bdev = I_BDEV(inode);
+   bio-bi_end_io = raw_end_io;
+   bio-bi_private = rio;
+   bio-bi_sector = pos  blkbits;
+
+   while (count  0) {
+   cur_offset = addr  ~PAGE_MASK;
+   cur_len = PAGE_SIZE - cur_offset;
+   if (cur_len  count)
+   cur_len = count;
+
+   if (pg_idx = PAGE_QUICK_LIST) {
+   down_read(current-mm-mmap_sem);
+   ret = get_user_pages(current, current-mm, addr,
+   min(nr_pages, PAGE_QUICK_LIST),
+   rw==READ, 0, quick_list, NULL);
+   up_read(current-mm-mmap_sem);
+   if (unlikely(ret  0)) {
+   bio_put(bio);
+   if (atomic_read(rio.bio_count) == 1)
+   return ret;
+   else {
+   goto out;
+   }
+   }
+   pg_idx = 0;
+   }
+
+   if (unlikely(!bio_add_page(bio, quick_list[pg_idx], cur_len,
+   cur_offset))) {
+   atomic_inc(rio.bio_count);
+   if (rw == READ)
+   bio_set_pages_dirty(bio);
+   submit_bio(rw, bio);
+   pos += addr - (unsigned long) buf;
+   goto start;
+   }
+
+   addr += cur_len;
+   count -= cur_len;
+   pg_idx++;
+   nr_pages--;
+   }
+
+   atomic_inc(rio.bio_count);
+   if (rw == READ)
+   bio_set_pages_dirty(bio);
+   submit_bio(rw, bio);
+out:
+   set_current_state(TASK_UNINTERRUPTIBLE);
+   blk_run_address_space(inode-i_mapping);
+   if (!atomic_dec_and_test(rio.bio_count))
+   io_schedule();
+   set_current_state(TASK_RUNNING);
+
+   ret = addr - (unsigned long) buf;
+ 

RE: Direct io on block device has performance regression on 2.6.x kernel

2005-03-08 Thread Chen, Kenneth W
OK, last one in the series: user level test programs that stress
the kernel I/O stack.  Pretty dull stuff.

- Ken



diff -Nur zero/aio_null.c blknull_test/aio_null.c
--- zero/aio_null.c 1969-12-31 16:00:00.0 -0800
+++ blknull_test/aio_null.c 2005-03-08 00:46:17.0 -0800
@@ -0,0 +1,76 @@
+#include stdio.h
+#include stdlib.h
+#include unistd.h
+#include fcntl.h
+#include sched.h
+#include signal.h
+#include sys/types.h
+#include linux/ioctl.h
+#include libaio.h
+
+#define MAXAIO 1024
+
+char   buf[4096] __attribute__((aligned(4096)));
+
+io_context_t   io_ctx;
+struct iocbiocbpool[MAXAIO];
+struct io_eventioevent[MAXAIO];
+
+void aio_setup(int n)
+{
+   int res = io_queue_init(n, io_ctx);
+   if (res != 0) {
+   printf(io_queue_setup(%d) returned %d (%s)\n,
+   n, res, strerror(-res));
+   exit(0);
+   }
+}
+
+main(int argc, char* argv[])
+{
+   int fd, i, status, batch;
+   struct iocb* iocbbatch[MAXAIO];
+   int devidx;
+   off_t   offset;
+   unsigned long start, end;
+
+   batch = argc  2 ? 100: atoi(argv[1]);
+   if (batch = MAXAIO)
+   batch = MAXAIO;
+
+   aio_setup(MAXAIO);
+   fd = open(/dev/raw/raw1, O_RDONLY);
+
+   if (fd == -1) {
+   perror(error opening\n);
+   exit (0);
+   }
+   for (i=0; ibatch; i++) {
+   iocbbatch[i] = iocbpool+i;
+   io_prep_pread(iocbbatch[i], fd, buf, 4096, 0);
+   }
+
+   while (1) {
+   struct timespec ts={30,0};
+   int bufidx;
+   int reap;
+
+   status = io_submit(io_ctx, i, iocbbatch);
+   if (status != i) {
+   printf(bad io_submit: %d [%s]\n, status,
+   strerror(-status));
+   }
+
+   // reap at least batch count back
+   reap = io_getevents(io_ctx, batch, MAXAIO, ioevent, ts);
+   if (reap  batch) {
+   printf(io_getevents returned=%d [%s]\n, reap,
+   strerror(-reap));
+   }
+
+   // check the return result of each event
+   for (i=0; ireap; i++)
+   if (ioevent[i].res != 4096)
+   printf(error in read: %lx\n, ioevent[i].res);
+   } /* while (1) */
+}
diff -Nur zero/pread_null.c blknull_test/pread_null.c
--- zero/pread_null.c   1969-12-31 16:00:00.0 -0800
+++ blknull_test/pread_null.c   2005-03-08 00:44:20.0 -0800
@@ -0,0 +1,27 @@
+#include stdio.h
+#include fcntl.h
+#include unistd.h
+#include malloc.h
+
+main(int argc, char* argv[])
+{
+   int fd;
+   char *addr;
+
+   fd = open(/dev/raw/raw1, O_RDONLY);
+   if (fd == -1) {
+   perror(error opening\n);
+   exit(0);
+   }
+
+   addr = memalign(4096, 4096);
+   if (addr == 0) {
+   printf(no memory\n);
+   exit(0);
+   }
+
+   while (1) {
+   pread(fd, addr, 4096, 0);
+   }
+
+}
diff -Nur zero/makefile blknull_test/makefile
--- zero/makefile   1969-12-31 16:00:00.0 -0800
+++ blknull_test/makefile   2005-03-08 17:10:39.0 -0800
@@ -0,0 +1,10 @@
+all:   pread_null aio_null
+
+pread_null: pread_null.c
+   gcc -O3 -o $@ pread_null.c
+
+aio_null: aio_null.c
+   gcc -O3 -o $@ aio_null.c -laio
+
+clean:
+   rm -f pread_null aio_null



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Direct io on block device has performance regression on 2.6.x kernel - pseudo disk driver

2005-03-08 Thread Chen, Kenneth W
The pseudo disk driver that I used to stress the kernel I/O stack
(anything above block layer, AIO/DIO/BIO).

- Ken



diff -Nur zero/blknull.c blknull/blknull.c
--- zero/blknull.c  1969-12-31 16:00:00.0 -0800
+++ blknull/blknull.c   2005-03-03 19:04:07.0 -0800
@@ -0,0 +1,97 @@
+#include linux/module.h
+#include linux/types.h
+#include linux/kernel.h
+#include linux/major.h
+#include linux/fs.h
+#include linux/bio.h
+#include linux/blkpg.h
+#include linux/spinlock.h
+
+#include linux/blkdev.h
+#include linux/genhd.h
+
+#define BLK_NULL_MAJOR 60
+#define BLK_NULL_NAME  blknull
+
+
+MODULE_AUTHOR("Ken Chen");
+MODULE_DESCRIPTION("null block driver");
+MODULE_LICENSE("GPL");
+
+
+spinlock_t driver_lock;
+struct request_queue *q;
+struct gendisk *disk;
+
+
+static int null_open(struct inode *inode, struct file *filp)
+{
+   return 0;
+}
+
+static int null_release(struct inode *inode, struct file *filp)
+{
+   return 0;
+}
+
+static struct block_device_operations null_fops = {
+   .owner  = THIS_MODULE,
+   .open   = null_open,
+   .release= null_release,
+};
+
+static void do_null_request(request_queue_t *q)
+{
+   struct request *req;
+
+   while (!blk_queue_plugged(q)) {
+   req = elv_next_request(q);
+   if (!req)
+   break;
+
+   blkdev_dequeue_request(req);
+
+   end_that_request_first(req, 1, req->nr_sectors);
+   end_that_request_last(req);
+   }
+}
+
+static int __init init_blk_null_module(void)
+{
+
+   if (register_blkdev(BLK_NULL_MAJOR, BLK_NULL_NAME)) {
+   printk(KERN_ERR "Unable to register null blk device\n");
+   return 0;
+   }
+
+   spin_lock_init(&driver_lock);
+   q = blk_init_queue(do_null_request, &driver_lock);
+   if (q) {
+   disk = alloc_disk(1);
+
+   if (disk) {
+   disk->major = BLK_NULL_MAJOR;
+   disk->first_minor = 0;
+   disk->fops = &null_fops;
+   disk->capacity = 1<<30;
+   disk->queue = q;
+   memcpy(disk->disk_name, BLK_NULL_NAME, sizeof(BLK_NULL_NAME));
+   add_disk(disk);
+   return 1;
+   }
+
+   blk_cleanup_queue(q);
+   }
+   unregister_blkdev(BLK_NULL_MAJOR, BLK_NULL_NAME);
+   return 0;
+}
+
+static void __exit exit_blk_null_module(void)
+{
+   del_gendisk(disk);
+   blk_cleanup_queue(q);
+   unregister_blkdev(BLK_NULL_MAJOR, BLK_NULL_NAME);
+}
+
+module_init(init_blk_null_module);
+module_exit(exit_blk_null_module);
diff -Nur zero/Makefile blknull/Makefile
--- zero/Makefile   1969-12-31 16:00:00.0 -0800
+++ blknull/Makefile2005-03-03 18:42:55.0 -0800
@@ -0,0 +1 @@
+obj-m := blknull.o





Direct io on block device has performance regression on 2.6.x kernel - fix AIO path

2005-03-08 Thread Chen, Kenneth W
This patch adds block device direct I/O for the AIO path.

30% performance gain!!

AIO (io_submit)
2.6.9   206,917
2.6.9+patches   268,484
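
(268,484 vs. 206,917 works out to a 29.8% improvement, which is where the 30%
figure above comes from.)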

- Ken


Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- linux-2.6.9/drivers/char/raw.c  2005-03-08 17:22:07.0 -0800
+++ linux-2.6.9.ken/drivers/char/raw.c  2005-03-08 17:25:38.0 -0800
@@ -385,21 +385,148 @@ static ssize_t raw_file_write(struct fil
return raw_file_rw(file, (char __user *) buf, count, ppos, WRITE);
 }

-static ssize_t raw_file_aio_write(struct kiocb *iocb, const char __user *buf,
-   size_t count, loff_t pos)
+int raw_end_aio(struct bio *bio, unsigned int bytes_done, int error)
 {
-   struct iovec local_iov = {
-   .iov_base = (char __user *)buf,
-   .iov_len = count
-   };
+   struct kiocb* iocb = bio->bi_private;
+   atomic_t* bio_count = (atomic_t*) &iocb->private;
+
+   if ((bio->bi_rw & 0x1) == READ)
+   bio_check_pages_dirty(bio);
+   else {
+   int i;
+   struct bio_vec *bvec = bio->bi_io_vec;
+   struct page *page;
+   for (i = 0; i < bio->bi_vcnt; i++) {
+   page = bvec[i].bv_page;
+   if (page)
+   put_page(page);
+   }
+   bio_put(bio);
+   }
+   if (atomic_dec_and_test(bio_count))
+   aio_complete(iocb, iocb->ki_nbytes, 0);

-   return generic_file_aio_write_nolock(iocb, &local_iov, 1, &iocb->ki_pos);
+   return 0;
 }

+static ssize_t raw_file_aio_rw(struct kiocb *iocb, char __user *buf,
+   size_t count, loff_t pos, int rw)
+{
+   struct inode * inode = iocb->ki_filp->f_mapping->host;
+   unsigned long blkbits = inode->i_blkbits;
+   unsigned long blocksize_mask = (1 << blkbits) - 1;
+   struct page * quick_list[PAGE_QUICK_LIST];
+   int nr_pages, cur_offset, cur_len;
+   struct bio * bio;
+   unsigned long ret;
+   unsigned long addr = (unsigned long) buf;
+   loff_t size;
+   int pg_idx;
+   atomic_t *bio_count = (atomic_t *) &iocb->private;
+
+   if (count == 0)
+   return 0;
+
+   /* first check the alignment */
+   if (addr & blocksize_mask || count & blocksize_mask ||
+       count < 0 || pos & blocksize_mask)
+   return -EINVAL;
+
+   size = i_size_read(inode);
+   if (pos >= size)
+   return -ENXIO;
+   if (pos + count > size)
+   count = size - pos;
+
+   nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
+   addr / PAGE_SIZE;
+
+   pg_idx = PAGE_QUICK_LIST;
+   atomic_set(bio_count, 1);
+
+start:
+   bio = bio_alloc(GFP_KERNEL, nr_pages);
+   if (unlikely(bio == NULL)) {
+   if (atomic_read(bio_count) == 1)
+   return -ENOMEM;
+   else {
+   iocb->ki_nbytes = addr - (unsigned long) buf;
+   goto out;
+   }
+   }
+
+   /* initialize bio */
+   bio->bi_bdev = I_BDEV(inode);
+   bio->bi_end_io = raw_end_aio;
+   bio->bi_private = iocb;
+   bio->bi_sector = pos >> blkbits;
+
+   while (count > 0) {
+   cur_offset = addr & ~PAGE_MASK;
+   cur_len = PAGE_SIZE - cur_offset;
+   if (cur_len > count)
+   cur_len = count;
+
+   if (pg_idx >= PAGE_QUICK_LIST) {
+   down_read(&current->mm->mmap_sem);
+   ret = get_user_pages(current, current->mm, addr,
+   min(nr_pages, PAGE_QUICK_LIST),
+   rw==READ, 0, quick_list, NULL);
+   up_read(&current->mm->mmap_sem);
+   if (unlikely(ret < 0)) {
+   bio_put(bio);
+   if (atomic_read(bio_count) == 1)
+   return ret;
+   else {
+   iocb->ki_nbytes = addr - (unsigned long) buf;
+   goto out;
+   }
+   }
+   pg_idx = 0;
+   }
+
+   if (unlikely(!bio_add_page(bio, quick_list[pg_idx], cur_len, cur_offset))) {
+   atomic_inc(bio_count);
+   if (rw == READ)
+   bio_set_pages_dirty(bio);
+   submit_bio(rw, bio);
+   pos += addr - (unsigned long) buf;
+   goto start;
+   }
+
+   addr += cur_len;
+   count -= cur_len;
+   pg_idx++;
+   nr_pages--;
+   }
+
+   atomic_inc(bio_count);
+   if (rw == READ)
+   

Re: Direct io on block device has performance regression on 2.6.x kernel - fix sync I/O path

2005-03-08 Thread Christoph Hellwig
>  --- linux-2.6.9/drivers/char/raw.c  2004-10-18 14:54:37.0 -0700
>  +++ linux-2.6.9.ken/drivers/char/raw.c  2005-03-08 17:22:07.0 -0800

this is not the blockdevice, but the obsolete raw device driver.  Please
benchmark and if necessary fix the blockdevice O_DIRECT codepath instead,
as the raw driver is slowly going away.
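
For comparison, a minimal sketch of what the same busy-loop benchmark could
look like against the block device O_DIRECT path Christoph refers to (this is
an illustration, not code from the thread; the device path /dev/sdb and the
error handling are assumptions):

#define _GNU_SOURCE     /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <malloc.h>

int main(void)
{
        char *addr;
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);

        if (fd == -1) {
                perror("open");
                exit(1);
        }
        /* O_DIRECT wants an aligned buffer, length and offset */
        addr = memalign(4096, 4096);
        if (addr == NULL) {
                printf("no memory\n");
                exit(1);
        }
        while (1)
                pread(fd, addr, 4096, 0);
}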



RE: Direct io on block device has performance regression on 2.6.x kernel - fix sync I/O path

2005-03-08 Thread Chen, Kenneth W
Christoph Hellwig wrote on Tuesday, March 08, 2005 6:20 PM
>  this is not the blockdevice, but the obsolete raw device driver.  Please
>  benchmark and if necessary fix the blockdevice O_DIRECT codepath instead,
>  as the raw driver is slowly going away.

From a performance perspective, can the raw device be resurrected? (just asking)

- Ken




Re: Direct io on block device has performance regression on 2.6.x kernel

2005-03-08 Thread Andrew Morton
"Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:

>  Direct I/O on block device running 2.6.X kernel is a lot SLOWER
>   than running on a 2.4 Kernel!
> 

A little bit slower, it appears.   It used to be faster.

>  ...
> 
>                  synchronous I/O              AIO
>                  (pread/pwrite/read/write)    io_submit
>  2.4.21 based
>  (RHEL3)         265,122                      229,810
> 
>  2.6.9           218,565                      206,917
>  2.6.10          213,041                      205,891
>  2.6.11          212,284                      201,124

What sort of CPU?

What speed CPU?

What size requests?

Reads or writes?

At 5 usecs per request I figure that's 3% CPU utilisation for 16k requests
at 100 MB/sec.
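
(As a back-of-the-envelope check: 100 MB/s in 16 KB requests is about 6,400
requests/s, and 6,400 x 5 usec comes to roughly 32 ms of CPU per second,
i.e. about 3%.)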

Once you bolt this onto a real device driver the proportional difference
will fall, due to addition of the constant factor.

Once you bolt all this onto a real disk controller all the numbers will get
worse (but in a strictly proportional manner) due to the disk transfers
depriving the CPU of memory bandwidth.

The raw driver is deprecated and we'd like to remove it.  The preferred way
of doing direct-IO against a blockdev is by opening it with O_DIRECT.

Your patches don't address blockdevs opened with O_DIRECT.  What you should
do is to make def_blk_aops.direct_IO point at a new function.  That will
then work correctly with both raw and with open(/dev/hdX, O_DIRECT).


But before doing anything else, please bench this on real hardware, see if
it is worth pursuing.  And gather an oprofile trace of the existing code.
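
To make that concrete, here is a minimal sketch (an illustration, not code
from this thread) of how such a fast path could be wired in at the
address_space_operations level in fs/block_dev.c.  The helper
fast_blkdev_direct_IO() is hypothetical and stands in for something along the
lines of Ken's raw_file_aio_rw(); anything it cannot handle falls back to the
generic __blockdev_direct_IO() machinery, here via the 2.6.9
blockdev_direct_IO_no_locking()/blkdev_get_blocks() helpers (assuming their
signatures are unchanged):

/* Sketch only: fast path for single, page-aligned segments; everything
 * else (readv/writev, merely 512-byte-aligned requests) takes the generic
 * direct-IO route.  fast_blkdev_direct_IO() is a hypothetical helper. */
static ssize_t
blkdev_direct_IO_fast(int rw, struct kiocb *iocb, const struct iovec *iov,
                loff_t offset, unsigned long nr_segs)
{
        struct file *file = iocb->ki_filp;
        struct inode *inode = file->f_mapping->host;

        if (nr_segs == 1 &&
            !((unsigned long)iov->iov_base & ~PAGE_MASK) &&
            !(iov->iov_len & ~PAGE_MASK) && !(offset & ~PAGE_MASK))
                return fast_blkdev_direct_IO(rw, iocb, inode, iov, offset);

        return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode),
                        iov, offset, nr_segs, blkdev_get_blocks, NULL);
}

/* wired up via:
 * struct address_space_operations def_blk_aops = {
 *      ...
 *      .direct_IO = blkdev_direct_IO_fast,
 * };
 */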

