Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-10 Thread Al Boldi
Andreas Dilger wrote:
> On Jan 26, 2008  08:27 +0300, Al Boldi wrote:
> > Jan Kara wrote:
> > > > data=ordered mode has proven reliable over the years, and it does
> > > > this by ordering filedata flushes before metadata flushes.  But this
> > > > sometimes causes contention on the order of a 10x slowdown for
> > > > certain apps, either due to the misuse of fsync or due to inherent
> > > > behaviour like db's, as well as inherent starvation issues exposed
> > > > by the data=ordered mode.
> > > >
> > > > data=writeback mode alleviates data=ordered mode slowdowns, but only
> > > > works per-mount and is too dangerous to run as a default mode.
> > > >
> > > > This RFC proposes to introduce a tunable that allows disabling
> > > > fsync and changing ordered into writeback writeout on a per-process
> > > > basis, like this:
> > > >
> > > >   echo 1 > /proc/`pidof process`/softsync
> > >
> > >   I guess disabling fsync() was already commented on enough. Regarding
> > > switching to writeback mode on a per-process basis - that's not easily
> > > possible, because sometimes data is not written out by the process that
> > > stored it (think of an mmapped file).
> >
> > Do you mean there is a locking problem?
> >
> > > And in case of DB, they use direct-io
> > > anyway most of the time so they don't care about journaling mode
> > > anyway.
> >
> > Testing with sqlite3 and mysql4 shows that performance drastically
> > improves with writeback writeout.
> >
> > >  But as Diego wrote, there is definitely some room for improvement in
> > > current data=ordered mode so the difference shouldn't be as big in the
> > > end.
> >
> > Yes, it would be nice to get to the bottom of this starvation problem,
> > but even then, the proposed tunable remains useful for misbehaving apps.
>
> Al, can you try a patch posted to linux-fsdevel and linux-ext4 from
> Hisashi Hifumi <[EMAIL PROTECTED]> to see if this improves
> your situation?  Dated Mon, 04 Feb 2008 19:15:25 +0900.
>
> [PATCH] ext3,4:fdatasync should skip metadata writeout when
> overwriting
>
> It may be that we already have a solution in that patch for database
> workloads where the pages are already allocated by avoiding the need
> for ordered mode journal flushing in that case.

Well, it seems that it does have a positive effect for the 'konqueror hangs' 
case, but doesn't improve the db case.

This shouldn't be surprising, as the db redundant writeout problem is 
localized not in fsync but rather in ext3_ordered_write_end.

Maybe some form of a staged merged commit could help.


Thanks!

--
Al

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-06 Thread Andreas Dilger
On Jan 26, 2008  08:27 +0300, Al Boldi wrote:
> Jan Kara wrote:
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention on the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> > >
> > > data=writeback mode alleviates data=ordered mode slowdowns, but only works
> > > per-mount and is too dangerous to run as a default mode.
> > >
> > > This RFC proposes to introduce a tunable that allows disabling fsync
> > > and changing ordered into writeback writeout on a per-process basis,
> > > like this:
> > >
> > >   echo 1 > /proc/`pidof process`/softsync
> >
> >   I guess disabling fsync() was already commented on enough. Regarding
> > switching to writeback mode on a per-process basis - that's not easily
> > possible, because sometimes data is not written out by the process that
> > stored it (think of an mmapped file).
> 
> Do you mean there is a locking problem?
> 
> > And in case of DB, they use direct-io
> > anyway most of the time so they don't care about journaling mode anyway.
> 
> Testing with sqlite3 and mysql4 shows that performance drastically improves 
> with writeback writeout.
> 
> >  But as Diego wrote, there is definitely some room for improvement in
> > current data=ordered mode so the difference shouldn't be as big in the
> > end.
> 
> Yes, it would be nice to get to the bottom of this starvation problem, but 
> even then, the proposed tunable remains useful for misbehaving apps.

Al, can you try a patch posted to linux-fsdevel and linux-ext4 from
Hisashi Hifumi <[EMAIL PROTECTED]> to see if this improves
your situation?  Dated Mon, 04 Feb 2008 19:15:25 +0900.

[PATCH] ext3,4:fdatasync should skip metadata writeout when overwriting

It may be that we already have a solution in that patch for database
workloads where the pages are already allocated by avoiding the need
for ordered mode journal flushing in that case.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-05 Thread Al Boldi
Jan Kara wrote:
> On Tue 05-02-08 10:07:44, Al Boldi wrote:
> > Jan Kara wrote:
> > > On Sat 02-02-08 00:26:00, Al Boldi wrote:
> > > > Chris Mason wrote:
> > > > > Al, could you please compare the write throughput from vmstat for
> > > > > the data=ordered vs data=writeback runs?  I would guess the
> > > > > data=ordered one has a lower overall write throughput.
> > > >
> > > > That's what I would have guessed, but it's actually going up 4-fold
> > > > for mysql from 559mb to 2135mb, while the db-size ends up at 549mb.
> > >
> > >   So you say we write 4-times as much data in ordered mode as in
> > > writeback mode. Hmm, probably possible because we force all the dirty
> > > data to disk when committing a transaction in ordered mode (and don't
> > > do this in writeback mode). So if the workload repeatedly dirties the
> > > whole DB, we are going to write the whole DB several times in ordered
> > > mode but in writeback mode we just keep the data in memory all the
> > > time. But this is what you ask for if you mount in ordered mode so I
> > > wouldn't consider it a bug.
> >
> > Ok, maybe not a bug, but a bit inefficient.  Check out this workload:
> >
> > sync;
> >
> > while :; do
> >   dd < /dev/full > /mnt/sda2/x.dmp bs=1M count=20
> >   rm -f /mnt/sda2/x.dmp
> >   usleep 1
> > done
:
:
> > Do you think these 12mb redundant writeouts could be buffered?
>
>   No, I don't think so. At least when I run it, the number of blocks
> written out varies, which confirms that these 12mb are just data blocks
> which happen to be in the file when the transaction commits (which is
> every 5 seconds).

Just a thought, but maybe double-buffering can help?

> And to satisfy journaling guarantees in ordered mode you must
> write them so you really have no choice...

Making this RFC rather useful.

What we need now is an implementation, which should be easy.

Maybe something on these lines:

<< in ext3_ordered_write_end >>
  if (current->soft_sync & 1)
          return ext3_writeback_write_end(...);

<< in ext3_ordered_writepage >>
  if (current->soft_sync & 2)
          return ext3_writeback_writepage(...);

<< in ext3_sync_file >>
  if (current->soft_sync & 4)
          return ret;

<< in ext3_file_write >>
  if (current->soft_sync & 8)
          return ret;

As you can see soft_sync is masked and bits are ordered by importance.

It would be neat if somebody interested could cook up a patch.


Thanks!

--
Al


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-05 Thread Jan Kara
On Tue 05-02-08 10:07:44, Al Boldi wrote:
> Jan Kara wrote:
> > On Sat 02-02-08 00:26:00, Al Boldi wrote:
> > > Chris Mason wrote:
> > > > Al, could you please compare the write throughput from vmstat for the
> > > > data=ordered vs data=writeback runs?  I would guess the data=ordered
> > > > one has a lower overall write throughput.
> > >
> > > That's what I would have guessed, but it's actually going up 4-fold for
> > > mysql from 559mb to 2135mb, while the db-size ends up at 549mb.
> >
> >   So you say we write 4-times as much data in ordered mode as in writeback
> > mode. Hmm, probably possible because we force all the dirty data to disk
> > when committing a transaction in ordered mode (and don't do this in
> > writeback mode). So if the workload repeatedly dirties the whole DB, we
> > are going to write the whole DB several times in ordered mode but in
> > writeback mode we just keep the data in memory all the time. But this is
> > what you ask for if you mount in ordered mode so I wouldn't consider it a
> > bug.
> 
> Ok, maybe not a bug, but a bit inefficient.  Check out this workload:
> 
> sync;
> 
> while :; do
>   dd < /dev/full > /mnt/sda2/x.dmp bs=1M count=20
>   rm -f /mnt/sda2/x.dmp
>   usleep 1
> done
> 
> vmstat 1 ( with mount /dev/sda2 /mnt/sda2 -o data=writeback) << note io-bo >>
> 
> procs ---memory-- ---swap-- -io --system-- cpu
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  0  0 293008   5232  5743600 0 0   18   206  4 80 16  0
>  1  0  0 282840   5232  6762000 0 0   18   238  3 81 16  0
>  1  0  0 297032   5244  5336400 0   152   21   211  4 79 17  0
>  1  0  0 285236   5244  6522400 0 0   18   232  4 80 16  0
>  1  0  0 299464   5244  5088000 0 0   18   222  4 80 16  0
>  1  0  0 290156   5244  6017600 0 0   18   236  3 80 17  0
>  0  0  0 302124   5256  4778800 0   152   21   213  4 80 16  0
>  1  0  0 292180   5256  5824800 0 0   18   239  3 81 16  0
>  1  0  0 287452   5256  6244400 0 0   18   202  3 80 17  0
>  1  0  0 293016   5256  5739200 0 0   18   250  4 80 16  0
>  0  0  0 302052   5256  4778800 0 0   19   194  3 81 16  0
>  1  0  0 297536   5268  5292800 0   152   20   233  4 79 17  0
>  1  0  0 286468   5268  6387200 0 0   18   212  3 81 16  0
>  1  0  0 301572   5268  4881200 0 0   18   267  4 79 17  0
>  1  0  0 292636   5268  5777600 0 0   18   208  4 80 16  0
>  1  0  0 302124   5280  4778800 0   152   21   237  4 80 16  0
>  1  0  0 291436   5280  5897600 0 0   18   205  3 81 16  0
>  1  0  0 302068   5280  4778800 0 0   18   234  3 81 16  0
>  1  0  0 293008   5280  5738800 0 0   18   221  4 79 17  0
>  1  0  0 297288   5292  5253200 0   156   22   233  2 81 16  1
>  1  0  0 294676   5292  5572400 0 0   19   199  3 81 16  0
> 
> 
> vmstat 1 (with mount /dev/sda2 /mnt/sda2 -o data=ordered)
> 
> procs ---memory-- ---swap-- -io --system-- cpu
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  0  0 291052   5156  5901600 0 0   19   223  3 82 15  0
>  1  0  0 291408   5156  5870400 0 0   18   218  3 81 16  0
>  1  0  0 291888   5156  5827600 020   23   229  3 80 17  0
>  1  0  0 300764   5168  4947200 0 12864   91   235  3 69 13 15
>  1  0  0 300740   5168  4945600 0 0   19   215  3 80 17  0
>  1  0  0 301088   5168  4904400 0 0   18   241  4 80 16  0
>  1  0  0 298220   5168  5187200 0 0   18   225  3 81 16  0
>  0  1  0 289168   5168  6075200 0 12712   45   237  3 77 15  5
>  1  0  0 300260   5180  4985200 0   152   68   211  4 72 15  9
>  1  0  0 298616   5180  5146000 0 0   18   237  3 81 16  0
>  1  0  0 296988   5180  5309200 0 0   18   223  3 81 16  0
>  1  0  0 296608   5180  5348000 0 0   18   223  3 81 16  0
>  0  0  0 301640   5192  4803600 0 12868   93   206  4 67 13 16
>  0  0  0 301624   5192  4803600 0 0   21   218  3 81 16  0
>  0  0  0 301600   5192  4803600 0 0   18   212  3 81 16  0
>  0  0  0 301584   5192  4803600 0 0   18   209  4 80 16  0
>  0  0  0 301568   5192  4803600 0 0   18   208  3 81 16  0
>  1  0  0 285520   5204  6454800 0 12864   95   216  3 69 13 15
>  2  0  0 285124   5204  6492400 0 0   18   222  4 80 16  0
> 1  0  0 283612   5204  6639200 0 0   18   231  3 81 16  0


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-04 Thread Al Boldi
Jan Kara wrote:
> On Sat 02-02-08 00:26:00, Al Boldi wrote:
> > Chris Mason wrote:
> > > Al, could you please compare the write throughput from vmstat for the
> > > data=ordered vs data=writeback runs?  I would guess the data=ordered
> > > one has a lower overall write throughput.
> >
> > That's what I would have guessed, but it's actually going up 4-fold for
> > mysql from 559mb to 2135mb, while the db-size ends up at 549mb.
>
>   So you say we write 4-times as much data in ordered mode as in writeback
> mode. Hmm, probably possible because we force all the dirty data to disk
> when committing a transaction in ordered mode (and don't do this in
> writeback mode). So if the workload repeatedly dirties the whole DB, we
> are going to write the whole DB several times in ordered mode but in
> writeback mode we just keep the data in memory all the time. But this is
> what you ask for if you mount in ordered mode so I wouldn't consider it a
> bug.

Ok, maybe not a bug, but a bit inefficient.  Check out this workload:

sync;

while :; do
  dd < /dev/full > /mnt/sda2/x.dmp bs=1M count=20
  rm -f /mnt/sda2/x.dmp
  usleep 1
done

vmstat 1 ( with mount /dev/sda2 /mnt/sda2 -o data=writeback) << note io-bo >>

procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  0 293008   5232  5743600 0 0   18   206  4 80 16  0
 1  0  0 282840   5232  6762000 0 0   18   238  3 81 16  0
 1  0  0 297032   5244  5336400 0   152   21   211  4 79 17  0
 1  0  0 285236   5244  6522400 0 0   18   232  4 80 16  0
 1  0  0 299464   5244  5088000 0 0   18   222  4 80 16  0
 1  0  0 290156   5244  6017600 0 0   18   236  3 80 17  0
 0  0  0 302124   5256  4778800 0   152   21   213  4 80 16  0
 1  0  0 292180   5256  5824800 0 0   18   239  3 81 16  0
 1  0  0 287452   5256  6244400 0 0   18   202  3 80 17  0
 1  0  0 293016   5256  5739200 0 0   18   250  4 80 16  0
 0  0  0 302052   5256  4778800 0 0   19   194  3 81 16  0
 1  0  0 297536   5268  5292800 0   152   20   233  4 79 17  0
 1  0  0 286468   5268  6387200 0 0   18   212  3 81 16  0
 1  0  0 301572   5268  4881200 0 0   18   267  4 79 17  0
 1  0  0 292636   5268  5777600 0 0   18   208  4 80 16  0
 1  0  0 302124   5280  4778800 0   152   21   237  4 80 16  0
 1  0  0 291436   5280  5897600 0 0   18   205  3 81 16  0
 1  0  0 302068   5280  4778800 0 0   18   234  3 81 16  0
 1  0  0 293008   5280  5738800 0 0   18   221  4 79 17  0
 1  0  0 297288   5292  5253200 0   156   22   233  2 81 16  1
 1  0  0 294676   5292  5572400 0 0   19   199  3 81 16  0


vmstat 1 (with mount /dev/sda2 /mnt/sda2 -o data=ordered)

procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  0 291052   5156  5901600 0 0   19   223  3 82 15  0
 1  0  0 291408   5156  5870400 0 0   18   218  3 81 16  0
 1  0  0 291888   5156  5827600 020   23   229  3 80 17  0
 1  0  0 300764   5168  4947200 0 12864   91   235  3 69 13 15
 1  0  0 300740   5168  4945600 0 0   19   215  3 80 17  0
 1  0  0 301088   5168  4904400 0 0   18   241  4 80 16  0
 1  0  0 298220   5168  5187200 0 0   18   225  3 81 16  0
 0  1  0 289168   5168  6075200 0 12712   45   237  3 77 15  5
 1  0  0 300260   5180  4985200 0   152   68   211  4 72 15  9
 1  0  0 298616   5180  5146000 0 0   18   237  3 81 16  0
 1  0  0 296988   5180  5309200 0 0   18   223  3 81 16  0
 1  0  0 296608   5180  5348000 0 0   18   223  3 81 16  0
 0  0  0 301640   5192  4803600 0 12868   93   206  4 67 13 16
 0  0  0 301624   5192  4803600 0 0   21   218  3 81 16  0
 0  0  0 301600   5192  4803600 0 0   18   212  3 81 16  0
 0  0  0 301584   5192  4803600 0 0   18   209  4 80 16  0
 0  0  0 301568   5192  4803600 0 0   18   208  3 81 16  0
 1  0  0 285520   5204  6454800 0 12864   95   216  3 69 13 15
 2  0  0 285124   5204  6492400 0 0   18   222  4 80 16  0
 1  0  0 283612   5204  6639200 0 0   18   231  3 81 16  0
 1  0  0 284216   5204  6573600 0 0   18   218  4 80 16  0
 0  1  0 289160   5204  6075200 0 12712   56   213  3 74 15  8

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-04 Thread Jan Kara
On Sat 02-02-08 00:26:00, Al Boldi wrote:
> Chris Mason wrote:
> > On Thursday 31 January 2008, Jan Kara wrote:
> > > On Thu 31-01-08 11:56:01, Chris Mason wrote:
> > > > On Thursday 31 January 2008, Al Boldi wrote:
> > > > > The big difference between ordered and writeback is that once the
> > > > > slowdown starts, ordered goes into ~100% iowait, whereas writeback
> > > > > continues 100% user.
> > > >
> > > > Does data=ordered write buffers in the order they were dirtied?  This
> > > > might explain the extreme problems in transactional workloads.
> > >
> > >   Well, it does, but we submit them to the block layer all at once so
> > > the elevator should sort the requests for us...
> >
> > nr_requests is fairly small, so a long stream of random requests should
> > still end up being random IO.
> >
> > Al, could you please compare the write throughput from vmstat for the
> > data=ordered vs data=writeback runs?  I would guess the data=ordered one
> > has a lower overall write throughput.
> 
> That's what I would have guessed, but it's actually going up 4-fold for
> mysql from 559mb to 2135mb, while the db-size ends up at 549mb.
  So you say we write 4-times as much data in ordered mode as in writeback
mode. Hmm, probably possible because we force all the dirty data to disk
when committing a transaction in ordered mode (and don't do this in
writeback mode). So if the workload repeatedly dirties the whole DB, we are
going to write the whole DB several times in ordered mode but in writeback
mode we just keep the data in memory all the time. But this is what you
ask for if you mount in ordered mode so I wouldn't consider it a bug.
  I still don't like your hack with per-process journal mode setting but we
could easily do per-file journal mode setting (we already have a flag to do
data journaling for a file) and that would help at least your DB
workload...

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-04 Thread Jan Kara
On Sat 02-02-08 00:26:00, Al Boldi wrote:
> Chris Mason wrote:
> > On Thursday 31 January 2008, Jan Kara wrote:
> > > On Thu 31-01-08 11:56:01, Chris Mason wrote:
> > > > On Thursday 31 January 2008, Al Boldi wrote:
> > > > > The big difference between ordered and writeback is that once the
> > > > > slowdown starts, ordered goes into ~100% iowait, whereas writeback
> > > > > continues 100% user.
> > > >
> > > > Does data=ordered write buffers in the order they were dirtied?  This
> > > > might explain the extreme problems in transactional workloads.
> > >
> > >   Well, it does but we submit them to block layer all at once so
> > > elevator should sort the requests for us...
> >
> > nr_requests is fairly small, so a long stream of random requests should
> > still end up being random IO.
> >
> > Al, could you please compare the write throughput from vmstat for the
> > data=ordered vs data=writeback runs?  I would guess the data=ordered one
> > has a lower overall write throughput.
>
> That's what I would have guessed, but it's actually going up 4-fold for
> mysql from 559mb to 2135mb, while the db-size ends up at 549mb.
  So you say we write 4-times as much data in ordered mode as in writeback
mode. Hmm, probably possible because we force all the dirty data to disk
when committing a transaction in ordered mode (and don't do this in
writeback mode). So if the workload repeatedly dirties the whole DB, we are
going to write the whole DB several times in ordered mode but in writeback
mode we just keep the data in memory all the time. But this is what you
ask for if you mount in ordered mode so I wouldn't consider it a bug.
  I still don't like your hack with per-process journal mode setting but we
could easily do per-file journal mode setting (we already have a flag to do
data journaling for a file) and that would help at least your DB
workload...
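
[Editorial note: the per-file data-journaling flag Honza mentions is already
reachable from userspace on ext3 through the inode attribute interface
(chattr +j / FS_JOURNAL_DATA_FL); a sketch, with a hypothetical path --
there is no per-file "writeback" flag, so only the stricter direction is
selectable today:]

```shell
# Force full data journaling for one file on an ext3 filesystem via the
# journal-data inode attribute.  /var/db/test.db is a hypothetical path.
chattr +j /var/db/test.db
lsattr /var/db/test.db    # the 'j' attribute should now appear in the list
```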

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-04 Thread Al Boldi
Jan Kara wrote:
> On Sat 02-02-08 00:26:00, Al Boldi wrote:
> > Chris Mason wrote:
> > > Al, could you please compare the write throughput from vmstat for the
> > > data=ordered vs data=writeback runs?  I would guess the data=ordered
> > > one has a lower overall write throughput.
> >
> > That's what I would have guessed, but it's actually going up 4-fold for
> > mysql from 559mb to 2135mb, while the db-size ends up at 549mb.
>
>   So you say we write 4-times as much data in ordered mode as in writeback
> mode. Hmm, probably possible because we force all the dirty data to disk
> when committing a transaction in ordered mode (and don't do this in
> writeback mode). So if the workload repeatedly dirties the whole DB, we
> are going to write the whole DB several times in ordered mode but in
> writeback mode we just keep the data in memory all the time. But this is
> what you ask for if you mount in ordered mode so I wouldn't consider it a
> bug.

Ok, maybe not a bug, but a bit inefficient.  Check out this workload:

sync;

while :; do
  dd < /dev/full > /mnt/sda2/x.dmp bs=1M count=20
  rm -f /mnt/sda2/x.dmp
  usleep 1
done

vmstat 1 (with mount /dev/sda2 /mnt/sda2 -o data=writeback); note the io-bo column

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  0 293008   5232  5743600 0 0   18   206  4 80 16  0
 1  0  0 282840   5232  6762000 0 0   18   238  3 81 16  0
 1  0  0 297032   5244  5336400 0   152   21   211  4 79 17  0
 1  0  0 285236   5244  6522400 0 0   18   232  4 80 16  0
 1  0  0 299464   5244  5088000 0 0   18   222  4 80 16  0
 1  0  0 290156   5244  6017600 0 0   18   236  3 80 17  0
 0  0  0 302124   5256  4778800 0   152   21   213  4 80 16  0
 1  0  0 292180   5256  5824800 0 0   18   239  3 81 16  0
 1  0  0 287452   5256  6244400 0 0   18   202  3 80 17  0
 1  0  0 293016   5256  5739200 0 0   18   250  4 80 16  0
 0  0  0 302052   5256  4778800 0 0   19   194  3 81 16  0
 1  0  0 297536   5268  5292800 0   152   20   233  4 79 17  0
 1  0  0 286468   5268  6387200 0 0   18   212  3 81 16  0
 1  0  0 301572   5268  4881200 0 0   18   267  4 79 17  0
 1  0  0 292636   5268  5777600 0 0   18   208  4 80 16  0
 1  0  0 302124   5280  4778800 0   152   21   237  4 80 16  0
 1  0  0 291436   5280  5897600 0 0   18   205  3 81 16  0
 1  0  0 302068   5280  4778800 0 0   18   234  3 81 16  0
 1  0  0 293008   5280  5738800 0 0   18   221  4 79 17  0
 1  0  0 297288   5292  5253200 0   156   22   233  2 81 16  1
 1  0  0 294676   5292  5572400 0 0   19   199  3 81 16  0


vmstat 1 (with mount /dev/sda2 /mnt/sda2 -o data=ordered)

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  0 291052   5156  5901600 0 0   19   223  3 82 15  0
 1  0  0 291408   5156  5870400 0 0   18   218  3 81 16  0
 1  0  0 291888   5156  5827600 020   23   229  3 80 17  0
 1  0  0 300764   5168  4947200 0 12864   91   235  3 69 13 15
 1  0  0 300740   5168  4945600 0 0   19   215  3 80 17  0
 1  0  0 301088   5168  4904400 0 0   18   241  4 80 16  0
 1  0  0 298220   5168  5187200 0 0   18   225  3 81 16  0
 0  1  0 289168   5168  6075200 0 12712   45   237  3 77 15  5
 1  0  0 300260   5180  4985200 0   152   68   211  4 72 15  9
 1  0  0 298616   5180  5146000 0 0   18   237  3 81 16  0
 1  0  0 296988   5180  5309200 0 0   18   223  3 81 16  0
 1  0  0 296608   5180  5348000 0 0   18   223  3 81 16  0
 0  0  0 301640   5192  4803600 0 12868   93   206  4 67 13 16
 0  0  0 301624   5192  4803600 0 0   21   218  3 81 16  0
 0  0  0 301600   5192  4803600 0 0   18   212  3 81 16  0
 0  0  0 301584   5192  4803600 0 0   18   209  4 80 16  0
 0  0  0 301568   5192  4803600 0 0   18   208  3 81 16  0
 1  0  0 285520   5204  6454800 0 12864   95   216  3 69 13 15
 2  0  0 285124   5204  6492400 0 0   18   222  4 80 16  0
 1  0  0 283612   5204  6639200 0 0   18   231  3 81 16  0
 1  0  0 284216   5204  6573600 0 0   18   218  4 80 16  0
 0  1  0 289160   5204  6075200 0 12712   56   213  3 74 15  8
 1  0  0 285884   5216  64128   

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-01 Thread Al Boldi
Chris Mason wrote:
> On Thursday 31 January 2008, Jan Kara wrote:
> > On Thu 31-01-08 11:56:01, Chris Mason wrote:
> > > On Thursday 31 January 2008, Al Boldi wrote:
> > > > The big difference between ordered and writeback is that once the
> > > > slowdown starts, ordered goes into ~100% iowait, whereas writeback
> > > > continues 100% user.
> > >
> > > Does data=ordered write buffers in the order they were dirtied?  This
> > > might explain the extreme problems in transactional workloads.
> >
> >   Well, it does but we submit them to block layer all at once so
> > elevator should sort the requests for us...
>
> nr_requests is fairly small, so a long stream of random requests should
> still end up being random IO.
>
> Al, could you please compare the write throughput from vmstat for the
> data=ordered vs data=writeback runs?  I would guess the data=ordered one
> has a lower overall write throughput.

That's what I would have guessed, but it's actually going up 4-fold for
mysql from 559mb to 2135mb, while the db-size ends up at 549mb.

This may mean that data=ordered isn't buffering redundant writes; or worse.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-31 Thread Chris Mason
On Thursday 31 January 2008, Jan Kara wrote:
> On Thu 31-01-08 11:56:01, Chris Mason wrote:
> > On Thursday 31 January 2008, Al Boldi wrote:
> > > Andreas Dilger wrote:
> > > > On Wednesday 30 January 2008, Al Boldi wrote:
> > > > > And, a quick test of successive 1sec delayed syncs shows no hangs
> > > > > until about 1 minute (~180mb) of db-writeout activity, when the
> > > > > sync abruptly hangs for minutes on end, and io-wait shows almost
> > > > > 100%.
> > > >
> > > > How large is the journal in this filesystem?  You can check via
> > > > "debugfs -R 'stat <8>' /dev/XXX".
> > >
> > > 32mb.
> > >
> > > > Is this affected by increasing
> > > > the journal size?  You can set the journal size via "mke2fs -J
> > > > size=400" at format time, or on an unmounted filesystem by running
> > > > "tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400
> > > > /dev/XXX".
> > >
> > > Setting size=400 doesn't help, nor does size=4.
> > >
> > > > I suspect that the stall is caused by the journal filling up, and
> > > > then waiting while the entire journal is checkpointed back to the
> > > > filesystem before the next transaction can start.
> > > >
> > > > It is possible to improve this behaviour in JBD by reducing the
> > > > amount of space that is cleared if the journal becomes "full", and
> > > > also doing journal checkpointing before it becomes full.  While that
> > > > may reduce performance a small amount, it would help avoid such huge
> > > > latency problems. I believe we have such a patch in one of the Lustre
> > > > branches already, and while I'm not sure what kernel it is for the
> > > > JBD code rarely changes much
> > >
> > > The big difference between ordered and writeback is that once the
> > > slowdown starts, ordered goes into ~100% iowait, whereas writeback
> > > continues 100% user.
> >
> > Does data=ordered write buffers in the order they were dirtied?  This
> > might explain the extreme problems in transactional workloads.
>
>   Well, it does but we submit them to block layer all at once so elevator
> should sort the requests for us...

nr_requests is fairly small, so a long stream of random requests should still 
end up being random IO.
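
[Editorial note: the queue depth Chris refers to is visible and tunable
through sysfs; the device name below is a placeholder, and 128 is the
usual default in kernels of this era:]

```shell
# Inspect, then raise, the block layer's per-queue request limit.
# sda is a placeholder device; writing the value requires root.
cat /sys/block/sda/queue/nr_requests
echo 512 > /sys/block/sda/queue/nr_requests
```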

Al, could you please compare the write throughput from vmstat for the 
data=ordered vs data=writeback runs?  I would guess the data=ordered one has 
a lower overall write throughput.
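
[Editorial note: one way to get comparable totals, assuming each run's
vmstat output was captured to a log (e.g. `vmstat 1 > ordered.log` during
the benchmark), is to sum the blocks-out column; the field position is an
assumption about vmstat's default layout and may differ across versions:]

```shell
# Sum the bo (blocks written out) column -- field 10 in vmstat's default
# output -- skipping the two header lines.
sum_bo() {
    awk 'NR > 2 { total += $10 } END { print total+0 }' "$1"
}
# Compare e.g.:  sum_bo ordered.log  vs  sum_bo writeback.log
```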

-chris


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-31 Thread Jan Kara
On Thu 31-01-08 11:56:01, Chris Mason wrote:
> On Thursday 31 January 2008, Al Boldi wrote:
> > Andreas Dilger wrote:
> > > On Wednesday 30 January 2008, Al Boldi wrote:
> > > > And, a quick test of successive 1sec delayed syncs shows no hangs until
> > > > about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> > > > hangs for minutes on end, and io-wait shows almost 100%.
> > >
> > > How large is the journal in this filesystem?  You can check via
> > > "debugfs -R 'stat <8>' /dev/XXX".
> >
> > 32mb.
> >
> > > Is this affected by increasing
> > > the journal size?  You can set the journal size via "mke2fs -J size=400"
> > > at format time, or on an unmounted filesystem by running
> > > "tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".
> >
> > Setting size=400 doesn't help, nor does size=4.
> >
> > > I suspect that the stall is caused by the journal filling up, and then
> > > waiting while the entire journal is checkpointed back to the filesystem
> > > before the next transaction can start.
> > >
> > > It is possible to improve this behaviour in JBD by reducing the amount
> > > of space that is cleared if the journal becomes "full", and also doing
> > > journal checkpointing before it becomes full.  While that may reduce
> > > performance a small amount, it would help avoid such huge latency
> > > problems. I believe we have such a patch in one of the Lustre branches
> > > already, and while I'm not sure what kernel it is for the JBD code rarely
> > > changes much
> >
> > The big difference between ordered and writeback is that once the slowdown
> > starts, ordered goes into ~100% iowait, whereas writeback continues 100%
> > user.
> 
> Does data=ordered write buffers in the order they were dirtied?  This might 
> explain the extreme problems in transactional workloads.
  Well, it does but we submit them to block layer all at once so elevator
should sort the requests for us...

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-31 Thread Chris Mason
On Thursday 31 January 2008, Al Boldi wrote:
> Andreas Dilger wrote:
> > On Wednesday 30 January 2008, Al Boldi wrote:
> > > And, a quick test of successive 1sec delayed syncs shows no hangs until
> > > about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> > > hangs for minutes on end, and io-wait shows almost 100%.
> >
> > How large is the journal in this filesystem?  You can check via
> > "debugfs -R 'stat <8>' /dev/XXX".
>
> 32mb.
>
> > Is this affected by increasing
> > the journal size?  You can set the journal size via "mke2fs -J size=400"
> > at format time, or on an unmounted filesystem by running
> > "tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".
>
> Setting size=400 doesn't help, nor does size=4.
>
> > I suspect that the stall is caused by the journal filling up, and then
> > waiting while the entire journal is checkpointed back to the filesystem
> > before the next transaction can start.
> >
> > It is possible to improve this behaviour in JBD by reducing the amount
> > of space that is cleared if the journal becomes "full", and also doing
> > journal checkpointing before it becomes full.  While that may reduce
> > performance a small amount, it would help avoid such huge latency
> > problems. I believe we have such a patch in one of the Lustre branches
> > already, and while I'm not sure what kernel it is for the JBD code rarely
> > changes much
>
> The big difference between ordered and writeback is that once the slowdown
> starts, ordered goes into ~100% iowait, whereas writeback continues 100%
> user.

Does data=ordered write buffers in the order they were dirtied?  This might 
explain the extreme problems in transactional workloads.

-chris





Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Al Boldi
Andreas Dilger wrote:
> On Wednesday 30 January 2008, Al Boldi wrote:
> > And, a quick test of successive 1sec delayed syncs shows no hangs until
> > about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> > hangs for minutes on end, and io-wait shows almost 100%.
>
> How large is the journal in this filesystem?  You can check via
> "debugfs -R 'stat <8>' /dev/XXX".

32mb.

> Is this affected by increasing
> the journal size?  You can set the journal size via "mke2fs -J size=400"
> at format time, or on an unmounted filesystem by running
> "tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".

Setting size=400 doesn't help, nor does size=4.

> I suspect that the stall is caused by the journal filling up, and then
> waiting while the entire journal is checkpointed back to the filesystem
> before the next transaction can start.
>
> It is possible to improve this behaviour in JBD by reducing the amount
> of space that is cleared if the journal becomes "full", and also doing
> journal checkpointing before it becomes full.  While that may reduce
> performance a small amount, it would help avoid such huge latency
> problems. I believe we have such a patch in one of the Lustre branches
> already, and while I'm not sure what kernel it is for the JBD code rarely
> changes much

The big difference between ordered and writeback is that once the slowdown 
starts, ordered goes into ~100% iowait, whereas writeback continues 100% 
user.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Andreas Dilger
On Wednesday 30 January 2008, Al Boldi wrote:
> And, a quick test of successive 1sec delayed syncs shows no hangs until
> about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> hangs for minutes on end, and io-wait shows almost 100%.

How large is the journal in this filesystem?  You can check via
"debugfs -R 'stat <8>' /dev/XXX".  Is this affected by increasing
the journal size?  You can set the journal size via "mke2fs -J size=400" 
at format time, or on an unmounted filesystem by running
"tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".
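
[Editorial note: taken together, and hedged as a sketch -- /dev/XXX is
Andreas's placeholder, the mount point is invented, and the filesystem
must be unmounted before the tune2fs steps:]

```shell
# Check the current journal size, then recreate the journal at 400MB.
debugfs -R 'stat <8>' /dev/XXX    # inode 8 is the journal; size is shown
umount /mnt/XXX
tune2fs -O ^has_journal /dev/XXX  # drop the existing journal
tune2fs -J size=400 /dev/XXX      # recreate it at 400MB
mount /dev/XXX /mnt/XXX
```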

I suspect that the stall is caused by the journal filling up, and then
waiting while the entire journal is checkpointed back to the filesystem
before the next transaction can start.

It is possible to improve this behaviour in JBD by reducing the amount
of space that is cleared if the journal becomes "full", and also doing
journal checkpointing before it becomes full.  While that may reduce
performance a small amount, it would help avoid such huge latency problems.
I believe we have such a patch in one of the Lustre branches already,
and while I'm not sure what kernel it is for, the JBD code rarely changes
much...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Al Boldi
Chris Mason wrote:
> On Wednesday 30 January 2008, Al Boldi wrote:
> > Jan Kara wrote:
> > > > Chris Snook wrote:
> > > > > Al Boldi wrote:
> > > > > > This RFC proposes to introduce a tunable which allows to disable
> > > > > > fsync and changes ordered into writeback writeout on a
> > > > > > per-process basis like this:
> > > > > >
> > > > > >   echo 1 > /proc/`pidof process`/softsync
> > > > >
> > > > > This is basically a kernel workaround for stupid app behavior.
> > > >
> > > > Exactly right to some extent, but don't forget the underlying
> > > > data=ordered starvation problem, which looks like a genuinely deep
> > > > problem maybe related to blockIO.
> > >
> > >   It is a problem with the way how ext3 does fsync (at least that's
> > > what we ended up with in that konqueror problem)... It has to flush
> > > the current transaction which means that app doing fsync() has to wait
> > > till all dirty data of all files on the filesystem are written (if we
> > > are in ordered mode). And that takes quite some time... There are
> > > possibilities how to avoid that but especially with freshly created
> > > files, it's tough and I don't see a way how to do it without some
> > > fundamental changes to JBD.
> >
> > Ok, but keep in mind that this starvation occurs even in the absence of
> > fsync, as the benchmarks show.
> >
> > And, a quick test of successive 1sec delayed syncs shows no hangs until
> > about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> > hangs for minutes on end, and io-wait shows almost 100%.
>
> Do you see this on older kernels as well?  The first thing we need to
> understand is if this particular stall is new.

2.6.24,22,19 and 2.4.32 show the same problem.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Chris Mason
On Wednesday 30 January 2008, Al Boldi wrote:
> Jan Kara wrote:
> > > Chris Snook wrote:
> > > > Al Boldi wrote:
> > > > > This RFC proposes to introduce a tunable which allows to disable
> > > > > fsync and changes ordered into writeback writeout on a per-process
> > > > > basis like this:
> > > > >
> > > > >   echo 1 > /proc/`pidof process`/softsync
> > > >
> > > > This is basically a kernel workaround for stupid app behavior.
> > >
> > > Exactly right to some extent, but don't forget the underlying
> > > data=ordered starvation problem, which looks like a genuinely deep
> > > problem maybe related to blockIO.
> >
> >   It is a problem with the way how ext3 does fsync (at least that's what
> > we ended up with in that konqueror problem)... It has to flush the
> > current transaction which means that app doing fsync() has to wait till
> > all dirty data of all files on the filesystem are written (if we are in
> > ordered mode). And that takes quite some time... There are possibilities
> > how to avoid that but especially with freshly created files, it's tough
> > and I don't see a way how to do it without some fundamental changes to
> > JBD.
>
> Ok, but keep in mind that this starvation occurs even in the absence of
> fsync, as the benchmarks show.
>
> And, a quick test of successive 1sec delayed syncs shows no hangs until
> about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> hangs for minutes on end, and io-wait shows almost 100%.

Do you see this on older kernels as well?  The first thing we need to 
understand is if this particular stall is new.

-chris




Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Andreas Dilger
On Wednesday 30 January 2008, Al Boldi wrote:
 And, a quick test of successive 1sec delayed syncs shows no hangs until
 about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
 hangs for minutes on end, and io-wait shows almost 100%.

How large is the journal in this filesystem?  You can check via
debugfs -R 'stat <8>' /dev/XXX.  Is this affected by increasing
the journal size?  You can set the journal size via mke2fs -J size=400
at format time, or on an unmounted filesystem by running
tune2fs -O ^has_journal /dev/XXX then tune2fs -J size=400 /dev/XXX.
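For reference, the same check can be tried without root on a scratch image file; a rough sketch (the file name and sizes are arbitrary, and it assumes e2fsprogs is installed):

```shell
# Build a small ext3 image with a 4MB journal, inspect it, then
# rebuild the journal larger.  /tmp/ext3.img is just a scratch file.
dd if=/dev/zero of=/tmp/ext3.img bs=1M count=64 2>/dev/null
mke2fs -q -F -j -J size=4 /tmp/ext3.img
debugfs -R 'stat <8>' /tmp/ext3.img 2>/dev/null | head -n 4  # inode 8 is the journal

tune2fs -O ^has_journal /tmp/ext3.img   # drop the old journal (fs must be unmounted)
tune2fs -J size=16 /tmp/ext3.img        # recreate it at 16MB
```

On a real device the same two tune2fs steps apply, just with the block device path in place of the image file.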

I suspect that the stall is caused by the journal filling up, and then
waiting while the entire journal is checkpointed back to the filesystem
before the next transaction can start.

It is possible to improve this behaviour in JBD by reducing the amount
of space that is cleared if the journal becomes full, and also doing
journal checkpointing before it becomes full.  While that may reduce
performance a small amount, it would help avoid such huge latency problems.
I believe we have such a patch in one of the Lustre branches already,
and while I'm not sure which kernel it is for, the JBD code rarely
changes much.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Al Boldi
Andreas Dilger wrote:
> On Wednesday 30 January 2008, Al Boldi wrote:
> > And, a quick test of successive 1sec delayed syncs shows no hangs until
> > about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> > hangs for minutes on end, and io-wait shows almost 100%.
>
> How large is the journal in this filesystem?  You can check via
> debugfs -R 'stat <8>' /dev/XXX.

32mb.

> Is this affected by increasing
> the journal size?  You can set the journal size via mke2fs -J size=400
> at format time, or on an unmounted filesystem by running
> tune2fs -O ^has_journal /dev/XXX then tune2fs -J size=400 /dev/XXX.

Setting size=400 doesn't help, nor does size=4.

> I suspect that the stall is caused by the journal filling up, and then
> waiting while the entire journal is checkpointed back to the filesystem
> before the next transaction can start.
>
> It is possible to improve this behaviour in JBD by reducing the amount
> of space that is cleared if the journal becomes full, and also doing
> journal checkpointing before it becomes full.  While that may reduce
> performance a small amount, it would help avoid such huge latency
> problems. I believe we have such a patch in one of the Lustre branches
> already, and while I'm not sure which kernel it is for, the JBD code
> rarely changes much.

The big difference between ordered and writeback is that once the slowdown 
starts, ordered goes into ~100% iowait, whereas writeback continues 100% 
user.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-29 Thread Al Boldi
Jan Kara wrote:
> > Chris Snook wrote:
> > > Al Boldi wrote:
> > > > This RFC proposes to introduce a tunable which allows to disable
> > > > fsync and changes ordered into writeback writeout on a per-process
> > > > basis like this:
> > > >
> > > >   echo 1 > /proc/`pidof process`/softsync
> > >
> > > This is basically a kernel workaround for stupid app behavior.
> >
> > Exactly right to some extent, but don't forget the underlying
> > data=ordered starvation problem, which looks like a genuinely deep
> > problem maybe related to blockIO.
>
>   It is a problem with the way how ext3 does fsync (at least that's what
> we ended up with in that konqueror problem)... It has to flush the
> current transaction which means that app doing fsync() has to wait till
> all dirty data of all files on the filesystem are written (if we are in
> ordered mode). And that takes quite some time... There are possibilities
> how to avoid that but especially with freshly created files, it's tough
> and I don't see a way how to do it without some fundamental changes to
> JBD.

Ok, but keep in mind that this starvation occurs even in the absence of 
fsync, as the benchmarks show.

And, a quick test of successive 1sec delayed syncs shows no hangs until about 
1 minute (~180mb) of db-writeout activity, when the sync abruptly hangs for 
minutes on end, and io-wait shows almost 100%.
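The delayed-sync probe described above can be approximated with a loop along these lines (the iteration count here is arbitrary; in the failing case one iteration suddenly jumps from near zero to minutes):

```shell
# Run sync once a second and report how long each call takes; a stall
# shows up as a sudden jump in the reported duration.
for i in 1 2 3; do
    sleep 1
    t0=$(date +%s)
    sync
    echo "sync $i took $(( $(date +%s) - t0 ))s"
done
```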

Now it turns out that 'echo 3 > /proc/.../drop_caches' has no effect, but 
doing it a few more times makes the hangs go away for a while, only to come 
back again and again.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-29 Thread Jan Kara
> Chris Snook wrote:
> > Al Boldi wrote:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> > >
> > > data=writeback mode alleviates data=order mode slowdowns, but only works
> > > per-mount and is too dangerous to run as a default mode.
> > >
> > > This RFC proposes to introduce a tunable which allows to disable fsync
> > > and changes ordered into writeback writeout on a per-process basis like
> > > this:
> > >
> > >   echo 1 > /proc/`pidof process`/softsync
> > >
> > >
> > > Your comments are much welcome!
> >
> > This is basically a kernel workaround for stupid app behavior.
> 
> Exactly right to some extent, but don't forget the underlying data=ordered 
> starvation problem, which looks like a genuinely deep problem maybe related 
> to blockIO.
  It is a problem with the way how ext3 does fsync (at least that's what
we ended up with in that konqueror problem)... It has to flush the
current transaction which means that app doing fsync() has to wait till
all dirty data of all files on the filesystem are written (if we are in
ordered mode). And that takes quite some time... There are possibilities
how to avoid that but especially with freshly created files, it's tough
and I don't see a way how to do it without some fundamental changes to
JBD.
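The effect described here can be felt with nothing more than dd; a rough sketch (file names and sizes are made up), where on data=ordered the tiny conv=fsync write can stall behind the unrelated bulk data attached to the same transaction:

```shell
# Dirty ~64MB of unrelated page cache, then time an fsync of 1 byte.
# On ext3 data=ordered the second dd may wait for bulk.dat to hit disk.
dd if=/dev/zero of=bulk.dat bs=1M count=64 2>/dev/null
time dd if=/dev/zero of=tiny.dat bs=1 count=1 conv=fsync 2>/dev/null
rm -f bulk.dat tiny.dat
```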

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Al Boldi
Jan Kara wrote:
> On Sat 26-01-08 08:27:59, Al Boldi wrote:
> > Do you mean there is a locking problem?
>
>   No, but if you write to an mmaped file, then we can find out only later
> we have dirty data in pages and we call writepage() on behalf of e.g.
> pdflush().

Ok, that's a special case, which we could code for, but doesn't seem 
worthwhile.  In any case, child-forks should inherit their parent's mode.

> > > And in case of DB, they use direct-io
> > > anyway most of the time so they don't care about journaling mode
> > > anyway.
> >
> > Testing with sqlite3 and mysql4 shows that performance drastically
> > improves with writeback writeout.
>
>   And do you have the databases configured to use direct IO or not?

I don't think so, but these tests are only meant to expose the underlying 
problem which needs to be fixed, while this RFC proposes a useful 
workaround.

In another post Jan Kara wrote:
>   Hmm, if you're willing to test patches, then you could try a debug
> patch: http://bugzilla.kernel.org/attachment.cgi?id=14574
>   and send me the output. What kind of load do you observe problems with
> and which problems exactly?

8M-record insert into indexed db-table:
          ordered   writeback
sqlite3:  75m22s    8m45s
mysql4:   23m35s    5m29s

Also, see the 'konqueror deadlocks in 2.6.22' thread.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Jan Kara
On Sat 26-01-08 08:27:43, Al Boldi wrote:
> Diego Calleja wrote:
> > El Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> escribió:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> >
> > There's a related bug in bugzilla:
> > http://bugzilla.kernel.org/show_bug.cgi?id=9546
> >
> > The diagnostic from Jan Kara is different though, but I think it may be
> > the same problem...
> >
> > "One process does data-intensive load. Thus in the ordered mode the
> > transaction is tiny but has tons of data buffers attached. If commit
> > happens, it takes a long time to sync all the data before the commit
> > can proceed... In the writeback mode, we don't wait for data buffers, in
> > the journal mode amount of data to be written is really limited by the
> > maximum size of a transaction and so we write by much smaller chunks
> > and better latency is thus ensured."
> >
> >
> > I'm hitting this bug too...it's surprising that there's not many people
> > reporting more bugs about this, because it's really annoying.
> >
> >
> > There's a patch by Jan Kara (that I'm including here because bugzilla
> > didn't include it and took me a while to find it) which I don't know if
> > it's supposed to fix the problem, but it'd be interesting to try:
> 
> Thanks a lot, but it doesn't fix it.
  Hmm, if you're willing to test patches, then you could try a debug patch:
http://bugzilla.kernel.org/attachment.cgi?id=14574
  and send me the output. What kind of load do you observe problems with
and which problems exactly?

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-28 Thread Jan Kara
On Sat 26-01-08 08:27:59, Al Boldi wrote:
> Jan Kara wrote:
> > > Greetings!
> > >
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> > >
> > > data=writeback mode alleviates data=order mode slowdowns, but only works
> > > per-mount and is too dangerous to run as a default mode.
> > >
> > > This RFC proposes to introduce a tunable which allows to disable fsync
> > > and changes ordered into writeback writeout on a per-process basis like
> > > this:
> > >
> > >   echo 1 > /proc/`pidof process`/softsync
> >
> >   I guess disabling fsync() was already commented on enough. Regarding
> > switching to writeback mode on per-process basis - not easily possible
> > because sometimes data is not written out by the process which stored
> > them (think of mmaped file).
> 
> Do you mean there is a locking problem?
  No, but if you write to an mmaped file, then we can find out only later
we have dirty data in pages and we call writepage() on behalf of e.g.
pdflush().

> > And in case of DB, they use direct-io
> > anyway most of the time so they don't care about journaling mode anyway.
> 
> Testing with sqlite3 and mysql4 shows that performance drastically improves 
> with writeback writeout.
  And do you have the databases configured to use direct IO or not?

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SUSE Labs, CR


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Chris Snook wrote:
> Al Boldi wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=order mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
> >
> >
> > Your comments are much welcome!
>
> This is basically a kernel workaround for stupid app behavior.

Exactly right to some extent, but don't forget the underlying data=ordered 
starvation problem, which looks like a genuinely deep problem maybe related 
to blockIO.

> It
> wouldn't be the first time we've provided such an option, but we shouldn't
> do it without a very good justification.  At the very least, we need a
> test case that demonstrates the problem

See the 'konqueror deadlocks in 2.6.22' thread.

> and benchmark results that prove that this approach actually fixes it.

8M-record insert into indexed db-table:
          ordered   writeback
sqlite3:  75m22s    8m45s
mysql4:   23m35s    5m29s

> I suspect we can find a cleaner fix for the problem.

I hope so, but even with a fix available addressing the data=ordered 
starvation issue, this tunable could remain useful for those apps that 
misbehave.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Jan Kara wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=order mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
>
>   I guess disabling fsync() was already commented on enough. Regarding
> switching to writeback mode on per-process basis - not easily possible
> because sometimes data is not written out by the process which stored
> them (think of mmaped file).

Do you mean there is a locking problem?

> And in case of DB, they use direct-io
> anyway most of the time so they don't care about journaling mode anyway.

Testing with sqlite3 and mysql4 shows that performance drastically improves 
with writeback writeout.

>  But as Diego wrote, there is definitely some room for improvement in
> current data=ordered mode so the difference shouldn't be as big in the
> end.

Yes, it would be nice to get to the bottom of this starvation problem, but 
even then, the proposed tunable remains useful for misbehaving apps.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
[EMAIL PROTECTED] wrote:
> On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi said:
> > This RFC proposes to introduce a tunable which allows to disable fsync
> > and changes ordered into writeback writeout on a per-process basis like
> > this:
:
:
> But if you want to give them enough rope to shoot themselves in the foot
> with, I'd suggest abusing LD_PRELOAD to replace the fsync() glibc code
> instead.  No need to clutter the kernel with rope that can be (and has
> been) done in userspace.

Ok that's possible, but as you cannot use LD_PRELOAD to deal with changing 
ordered into writeback mode, we might as well allow them to disable fsync 
here, because it is in the same use-case.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Diego Calleja wrote:
> El Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> escribió:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
>
> There's a related bug in bugzilla:
> http://bugzilla.kernel.org/show_bug.cgi?id=9546
>
> The diagnostic from Jan Kara is different though, but I think it may be
> the same problem...
>
> "One process does data-intensive load. Thus in the ordered mode the
> transaction is tiny but has tons of data buffers attached. If commit
> happens, it takes a long time to sync all the data before the commit
> can proceed... In the writeback mode, we don't wait for data buffers, in
> the journal mode amount of data to be written is really limited by the
> maximum size of a transaction and so we write by much smaller chunks
> and better latency is thus ensured."
>
>
> I'm hitting this bug too...it's surprising that there's not many people
> reporting more bugs about this, because it's really annoying.
>
>
> There's a patch by Jan Kara (that I'm including here because bugzilla
> didn't include it and took me a while to find it) which I don't know if
> it's supposed to fix the problem, but it'd be interesting to try:


Thanks a lot, but it doesn't fix it.

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread david

On Thu, 24 Jan 2008, Andreas Dilger wrote:
> On Jan 24, 2008  23:36 +0300, Al Boldi wrote:
> > data=ordered mode has proven reliable over the years, and it does this by
> > ordering filedata flushes before metadata flushes.  But this sometimes
> > causes contention in the order of a 10x slowdown for certain apps, either
> > due to the misuse of fsync or due to inherent behaviour like db's, as well
> > as inherent starvation issues exposed by the data=ordered mode.
> >
> > data=writeback mode alleviates data=order mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows to disable fsync and
> > changes ordered into writeback writeout on a per-process basis like this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
>
> If fsync performance is an issue for you, run the filesystem in data=journal
> mode, put the journal on a separate disk and make it big enough that you
> don't block on it to flush the data to the filesystem (but not so big that
> it is consuming all of your RAM).
>
> That keeps your data guarantees without hurting performance.

my understanding is that the journal is limited to 128M or so.  This 
prevents you from making it big enough to avoid all problems.

David Lang



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Andreas Dilger
On Jan 24, 2008  23:36 +0300, Al Boldi wrote:
> data=ordered mode has proven reliable over the years, and it does this by 
> ordering filedata flushes before metadata flushes.  But this sometimes 
> causes contention in the order of a 10x slowdown for certain apps, either 
> due to the misuse of fsync or due to inherent behaviour like db's, as well 
> as inherent starvation issues exposed by the data=ordered mode.
> 
> data=writeback mode alleviates data=order mode slowdowns, but only works 
> per-mount and is too dangerous to run as a default mode.
> 
> This RFC proposes to introduce a tunable which allows to disable fsync and 
> changes ordered into writeback writeout on a per-process basis like this:
> 
>   echo 1 > /proc/`pidof process`/softsync

If fsync performance is an issue for you, run the filesystem in data=journal
mode, put the journal on a separate disk and make it big enough that you
don't block on it to flush the data to the filesystem (but not so big that
it is consuming all of your RAM).

That keeps your data guarantees without hurting performance.
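For readers wanting to try this, a rough command sketch of the external-journal setup Andreas describes (device names, block size, and mount point are placeholders, not recommendations):

```shell
# Create a dedicated journal device; its block size must match the
# filesystem's.  /dev/sdb1 and /dev/sda1 are placeholder devices.
mke2fs -O journal_dev -b 4096 /dev/sdb1

# Detach the existing internal journal, then attach the external one.
# (The filesystem must be unmounted and clean for this.)
tune2fs -O ^has_journal /dev/sda1
tune2fs -j -J device=/dev/sdb1 /dev/sda1

# Mount with full data journaling, as suggested above.
mount -t ext3 -o data=journal /dev/sda1 /mnt
```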

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Jan Kara
> Greetings!
> 
> data=ordered mode has proven reliable over the years, and it does this by 
> ordering filedata flushes before metadata flushes.  But this sometimes 
> causes contention in the order of a 10x slowdown for certain apps, either 
> due to the misuse of fsync or due to inherent behaviour like db's, as well 
> as inherent starvation issues exposed by the data=ordered mode.
> 
> data=writeback mode alleviates data=ordered mode slowdowns, but only works
> per-mount and is too dangerous to run as a default mode.
> 
> This RFC proposes to introduce a tunable which allows disabling fsync and
> changing ordered into writeback writeout on a per-process basis like this:
> 
>   echo 1 > /proc/`pidof process`/softsync
  I guess disabling fsync() was already commented on enough. Regarding
switching to writeback mode on a per-process basis - that's not easily
possible, because sometimes data is not written out by the process which
stored it (think of an mmapped file). And in the case of DBs, they use
direct I/O most of the time anyway, so they don't care about the
journaling mode.
  But as Diego wrote, there is definitely some room for improvement in
current data=ordered mode so the difference shouldn't be as big in the
end.

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread david

On Thu, 24 Jan 2008, Andreas Dilger wrote:

> On Jan 24, 2008  23:36 +0300, Al Boldi wrote:
> > data=ordered mode has proven reliable over the years, and it does this by
> > ordering filedata flushes before metadata flushes.  But this sometimes
> > causes contention in the order of a 10x slowdown for certain apps, either
> > due to the misuse of fsync or due to inherent behaviour like db's, as well
> > as inherent starvation issues exposed by the data=ordered mode.
> >
> > data=writeback mode alleviates data=ordered mode slowdowns, but only works
> > per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows disabling fsync and
> > changing ordered into writeback writeout on a per-process basis like this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
>
> If fsync performance is an issue for you, run the filesystem in data=journal
> mode, put the journal on a separate disk and make it big enough that you
> don't block on it to flush the data to the filesystem (but not so big that
> it is consuming all of your RAM).

my understanding is that the journal is limited to 128M or so. This
prevents you from making it big enough to avoid all problems.

David Lang

> That keeps your data guarantees without hurting performance.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Chris Snook wrote:
> Al Boldi wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=ordered mode slowdowns, but only
> > works per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows disabling fsync
> > and changing ordered into writeback writeout on a per-process basis
> > like this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
> >
> > Your comments are much welcome!
>
> This is basically a kernel workaround for stupid app behavior.

Exactly right to some extent, but don't forget the underlying data=ordered
starvation problem, which looks like a genuinely deep problem, maybe
related to block I/O.

> It wouldn't be the first time we've provided such an option, but we
> shouldn't do it without a very good justification.  At the very least, we
> need a test case that demonstrates the problem

See the 'konqueror deadlocks in 2.6.22' thread.

> and benchmark results that prove that this approach actually fixes it.

8M-record insert into an indexed db-table:

          ordered   writeback
sqlite3:  75m22s    8m45s
mysql4 :  23m35s    5m29s

> I suspect we can find a cleaner fix for the problem.

I hope so, but even with a fix available addressing the data=ordered
starvation issue, this tunable could remain useful for those apps that
misbehave.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Jan Kara wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
> >
> > data=writeback mode alleviates data=ordered mode slowdowns, but only
> > works per-mount and is too dangerous to run as a default mode.
> >
> > This RFC proposes to introduce a tunable which allows disabling fsync
> > and changing ordered into writeback writeout on a per-process basis
> > like this:
> >
> >   echo 1 > /proc/`pidof process`/softsync
>
>   I guess disabling fsync() was already commented on enough. Regarding
> switching to writeback mode on per-process basis - not easily possible
> because sometimes data is not written out by the process which stored
> them (think of mmaped file).

Do you mean there is a locking problem?

> And in case of DB, they use direct-io
> anyway most of the time so they don't care about journaling mode anyway.

Testing with sqlite3 and mysql4 shows that performance drastically improves
with writeback writeout.

>   But as Diego wrote, there is definitely some room for improvement in
> current data=ordered mode so the difference shouldn't be as big in the
> end.

Yes, it would be nice to get to the bottom of this starvation problem, but
even then, the proposed tunable remains useful for misbehaving apps.


Thanks!

--
Al



Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Al Boldi
Diego Calleja wrote:
> On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> wrote:
> > Greetings!
> >
> > data=ordered mode has proven reliable over the years, and it does this
> > by ordering filedata flushes before metadata flushes.  But this
> > sometimes causes contention in the order of a 10x slowdown for certain
> > apps, either due to the misuse of fsync or due to inherent behaviour
> > like db's, as well as inherent starvation issues exposed by the
> > data=ordered mode.
>
> There's a related bug in bugzilla:
> http://bugzilla.kernel.org/show_bug.cgi?id=9546
>
> The diagnostic from Jan Kara is different though, but I think it may be
> the same problem...
>
> "One process does data-intensive load. Thus in the ordered mode the
> transaction is tiny but has tons of data buffers attached. If commit
> happens, it takes a long time to sync all the data before the commit
> can proceed... In the writeback mode, we don't wait for data buffers, in
> the journal mode the amount of data to be written is really limited by
> the maximum size of a transaction and so we write by much smaller chunks
> and better latency is thus ensured."
>
> I'm hitting this bug too...it's surprising that there aren't many people
> reporting more bugs about this, because it's really annoying.
>
> There's a patch by Jan Kara (which I'm including here because bugzilla
> didn't include it and it took me a while to find) which I don't know
> whether it's supposed to fix the problem, but it'd be interesting to try:

Thanks a lot, but it doesn't fix it.

--
Al




Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-24 Thread Chris Snook

Al Boldi wrote:
> Greetings!
>
> data=ordered mode has proven reliable over the years, and it does this by
> ordering filedata flushes before metadata flushes.  But this sometimes
> causes contention in the order of a 10x slowdown for certain apps, either
> due to the misuse of fsync or due to inherent behaviour like db's, as well
> as inherent starvation issues exposed by the data=ordered mode.
>
> data=writeback mode alleviates data=ordered mode slowdowns, but only works
> per-mount and is too dangerous to run as a default mode.
>
> This RFC proposes to introduce a tunable which allows disabling fsync and
> changing ordered into writeback writeout on a per-process basis like this:
>
>   echo 1 > /proc/`pidof process`/softsync
>
> Your comments are much welcome!


This is basically a kernel workaround for stupid app behavior.  It wouldn't be 
the first time we've provided such an option, but we shouldn't do it without a 
very good justification.  At the very least, we need a test case that 
demonstrates the problem and benchmark results that prove that this approach 
actually fixes it.  I suspect we can find a cleaner fix for the problem.


-- Chris


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-24 Thread Valdis . Kletnieks
On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi said:
> data=ordered mode has proven reliable over the years, and it does this by 
> ordering filedata flushes before metadata flushes.  But this sometimes 
> causes contention in the order of a 10x slowdown for certain apps, either 
> due to the misuse of fsync or due to inherent behaviour like db's, as well 
> as inherent starvation issues exposed by the data=ordered mode.

If they're misusing it, they should be fixed.  There should be a limit to
how much the kernel will do to reduce the pain of doing stupid things.

> This RFC proposes to introduce a tunable which allows to disable fsync and 
> changes ordered into writeback writeout on a per-process basis like this:

Well-written programs only call fsync() when they really do need the semantics
of fsync.  Disabling that is just *asking* for trouble.

From RFC 2821:

6.1 Reliable Delivery and Replies by Email

   When the receiver-SMTP accepts a piece of mail (by sending a "250 OK"
   message in response to DATA), it is accepting responsibility for
   delivering or relaying the message.  It must take this responsibility
   seriously.  It MUST NOT lose the message for frivolous reasons, such
   as because the host later crashes or because of a predictable
   resource shortage.

Some people really *do* think "the CPU took a machine check and after replacing
the motherboard, the resulting fsck ate the file" is a "frivolous" reason to
lose data.

But if you want to give them enough rope to shoot themselves in the foot with,
I'd suggest abusing LD_PRELOAD to replace the fsync() glibc code instead.  No
need to clutter the kernel with rope that can be (and has been) done in 
userspace.
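As a sketch of that LD_PRELOAD rope (the file name and build line are illustrative, and this assumes a glibc system), the shim amounts to a couple of no-op overrides:

```c
/* nosync.c - hypothetical no-op shim for fsync()/fdatasync().
 * Build:  gcc -shared -fPIC -o nosync.so nosync.c
 * Use:    LD_PRELOAD=./nosync.so some-misbehaving-app
 * The dynamic linker resolves fsync() to this copy instead of libc's, so
 * the app's flushes report success without forcing anything to disk;
 * durability is gone and data reaches disk only via normal writeback. */

int fsync(int fd)
{
	(void)fd;
	return 0;
}

int fdatasync(int fd)
{
	(void)fd;
	return 0;
}
```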




Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-24 Thread Diego Calleja
On Thu, 24 Jan 2008 23:36:00 +0300, Al Boldi <[EMAIL PROTECTED]> wrote:

> Greetings!
> 
> data=ordered mode has proven reliable over the years, and it does this by 
> ordering filedata flushes before metadata flushes.  But this sometimes 
> causes contention in the order of a 10x slowdown for certain apps, either 
> due to the misuse of fsync or due to inherent behaviour like db's, as well 
> as inherent starvation issues exposed by the data=ordered mode.

There's a related bug in bugzilla: 
http://bugzilla.kernel.org/show_bug.cgi?id=9546

The diagnostic from Jan Kara is different though, but I think it may be the same
problem...

"One process does data-intensive load. Thus in the ordered mode the
transaction is tiny but has tons of data buffers attached. If commit
happens, it takes a long time to sync all the data before the commit
can proceed... In the writeback mode, we don't wait for data buffers, in
the journal mode amount of data to be written is really limited by the
maximum size of a transaction and so we write by much smaller chunks
and better latency is thus ensured."


I'm hitting this bug too...it's surprising that there aren't many people
reporting more bugs about this, because it's really annoying.


There's a patch by Jan Kara (which I'm including here because bugzilla
didn't include it and it took me a while to find) which I don't know
whether it's supposed to fix the problem, but it'd be interesting to try:




Don't allow too many data buffers in a transaction.

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 08ff6c7..e6f9dd6 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -163,7 +163,7 @@ repeat_locked:
 	spin_lock(&transaction->t_handle_lock);
 	needed = transaction->t_outstanding_credits + nblocks;
 
-	if (needed > journal->j_max_transaction_buffers) {
+	if (needed > journal->j_max_transaction_buffers || atomic_read(&transaction->t_data_buf_count) > 32768) {
 		/*
 		 * If the current transaction is already too large, then start
 		 * to commit it: we can then go back and attach this handle to
@@ -1528,6 +1528,7 @@ static void __journal_temp_unlink_buffer(struct journal_head *jh)
 		return;
 	case BJ_SyncData:
 		list = &transaction->t_sync_datalist;
+		atomic_dec(&transaction->t_data_buf_count);
 		break;
 	case BJ_Metadata:
 		transaction->t_nr_buffers--;
@@ -1989,6 +1990,7 @@ void __journal_file_buffer(struct journal_head *jh,
 		return;
 	case BJ_SyncData:
 		list = &transaction->t_sync_datalist;
+		atomic_inc(&transaction->t_data_buf_count);
 		break;
 	case BJ_Metadata:
 		transaction->t_nr_buffers++;
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index d9ecd13..6dd284a 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -541,6 +541,12 @@ struct transaction_s
 	int			t_outstanding_credits;
 
 	/*
+	 * Number of data buffers on t_sync_datalist attached to
+	 * the transaction.
+	 */
+	atomic_t		t_data_buf_count;
+
+	/*
 	 * Forward and backward links for the circular list of all transactions
 	 * awaiting checkpoint. [j_list_lock]
 	 */

