Re: Reducing the bdi proportion calculation period to speed up disk write

2007-12-13 Thread Zhang, Yanmin
On Tue, 2007-12-11 at 11:11 +0100, Peter Zijlstra wrote:
> On Tue, 2007-12-11 at 14:25 +0800, zhejiang wrote:
> > The patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f implemented bdi per
> > device dirty threshold. It works well.
> > However, the period for proportion calculation may be too large.
> > For 8G memory, the calc_period_shift() will return 19 as the shift.
> > 
> > When we switch writing operation between different disks, there may be
> > potential performance issue.
> > 
> > For example, we first write to disk A, then write to disk B.
> > The proportion for disk B will increase slowly because the denominator
> > is too large (It's 2^18 + (global_count & counter_mask)).
> > The disk B will get small dirty page quota for a long time,
> > it will get blocked frequently though the total dirty page is under the
> > dirty page limit.
> > 
> > Peter provided a patch to avoid this issue, this patch allow violation
> > of bdi limits if there is a lot of room on the system.
> > It looks like:
> > 
> > +if (nr_reclaimable + nr_writeback < (background_thresh +
> > dirty_thresh) / 2)
> > + break; 
> > 
> > This patch really help to avoid congestion, but if the dirty pages
> > exceed about 3/4 of the dirty_thresh, congestion still happens if we
> > write to another disk. 
> > 
> > I think that we can reduce the period to speed up the proportion
> > adjustment. 
> > 
> > diff -Nur a/page-writeback.c b/page-writeback.c
> > --- a/page-writeback.c  2007-12-11 13:46:30.0 +0800
> > +++ b/page-writeback.c  2007-12-11 13:47:11.0 +0800
> > @@ -128,10 +128,7 @@
> >   */
> >  static int calc_period_shift(void)
> >  {
> > -   unsigned long dirty_total;
> > -
> > -   dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > 100;
> > -   return 2 + ilog2(dirty_total - 1);
> > +   return 12;
> >  }
> 
> Its a heuristic, it might need some tuning, but a static value is wrong.
> I think its generally true that the larger the machine memory size, the
> faster the storage subsystem. And the more likely it has more disks.
> 
> One of the reasons this value isn't static is that with your fixed 12 it
> becomes very hard to balance over more than 4096 active devices. Of
> course, it takes quite a special set-up to get into that situation.
I strongly agree with you that a static value is not a good idea.

> 
> As it is, it now takes about 2 * dirty limit to switch over, you could
> start by making that just a single, or maybe even half a, dirty limit.
We will do more testing to choose a better formula based on dirty_ratio
and total memory.

> 
> 
> Also, I'm not quite convinced your benchmark is all that useful. Do you
> really think it matches an actual frequently occurring usage pattern?
We used iozone to test 1.2GB sequential write/rewrite. It's hard to match
an actual usage pattern exactly, but I have an example: an administrator
might periodically back up big files to other free disks, although he/she
might not need it to be fast.

-yanmin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Reducing the bdi proportion calculation period to speed up disk write

2007-12-11 Thread Peter Zijlstra

On Tue, 2007-12-11 at 14:25 +0800, zhejiang wrote:
> The patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f implemented bdi per
> device dirty threshold. It works well.
> However, the period for proportion calculation may be too large.
> For 8G memory, the calc_period_shift() will return 19 as the shift.
> 
> When we switch writing operation between different disks, there may be
> potential performance issue.
> 
> For example, we first write to disk A, then write to disk B.
> The proportion for disk B will increase slowly because the denominator
> is too large (It's 2^18 + (global_count & counter_mask)).
> The disk B will get small dirty page quota for a long time,
> it will get blocked frequently though the total dirty page is under the
> dirty page limit.
> 
> Peter provided a patch to avoid this issue, this patch allow violation
> of bdi limits if there is a lot of room on the system.
> It looks like:
> 
> +if (nr_reclaimable + nr_writeback < (background_thresh +
> dirty_thresh) / 2)
> + break; 
> 
> This patch really help to avoid congestion, but if the dirty pages
> exceed about 3/4 of the dirty_thresh, congestion still happens if we
> write to another disk. 
> 
> I think that we can reduce the period to speed up the proportion
> adjustment. 
> 
> diff -Nur a/page-writeback.c b/page-writeback.c
> --- a/page-writeback.c  2007-12-11 13:46:30.0 +0800
> +++ b/page-writeback.c  2007-12-11 13:47:11.0 +0800
> @@ -128,10 +128,7 @@
>   */
>  static int calc_period_shift(void)
>  {
> -   unsigned long dirty_total;
> -
> -   dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> 100;
> -   return 2 + ilog2(dirty_total - 1);
> +   return 12;
>  }

It's a heuristic, it might need some tuning, but a static value is wrong.
I think it's generally true that the larger the machine's memory size, the
faster the storage subsystem, and the more likely it is to have more disks.

One of the reasons this value isn't static is that with your fixed 12 it
becomes very hard to balance over more than 4096 active devices. Of
course, it takes quite a special set-up to get into that situation.

As it is, it now takes about 2 * dirty limit to switch over; you could
start by making that just a single, or maybe even half a, dirty limit.


Also, I'm not quite convinced your benchmark is all that useful. Do you
really think it matches an actual frequently occurring usage pattern?





Reducing the bdi proportion calculation period to speed up disk write

2007-12-10 Thread zhejiang
The patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f implemented per-device
(bdi) dirty thresholds. It works well.
However, the period for the proportion calculation may be too large.
With 8G of memory, calc_period_shift() returns 19 as the shift.

When we switch writing between different disks, there may be a
performance issue.

For example, we first write to disk A, then write to disk B.
The proportion for disk B will increase slowly because the denominator
is too large (it's 2^18 + (global_count & counter_mask)).
Disk B will get a small dirty page quota for a long time, so
it will get blocked frequently even though the total number of dirty
pages is under the dirty page limit.

Peter provided a patch to avoid this issue; it allows the bdi limits to
be violated if there is a lot of room on the system.
It looks like:

+if (nr_reclaimable + nr_writeback < (background_thresh + dirty_thresh) / 2)
+ break;

This patch really helps to avoid congestion, but if the dirty pages
exceed about 3/4 of dirty_thresh, congestion still happens when we
write to another disk.

I think that we can reduce the period to speed up the proportion
adjustment. 

diff -Nur a/page-writeback.c b/page-writeback.c
--- a/page-writeback.c  2007-12-11 13:46:30.0 +0800
+++ b/page-writeback.c  2007-12-11 13:47:11.0 +0800
@@ -128,10 +128,7 @@
  */
 static int calc_period_shift(void)
 {
-   unsigned long dirty_total;
-
-   dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
-   return 2 + ilog2(dirty_total - 1);
+   return 12;
 }


On the 8G memory system, I did some testing with iozone.
I found that reducing the period helps to increase the write speed
when switching to a new disk.


Run "./iozone -B -i 0 -i 2 -r 4k -s 1000M" twice on disk B.
Here is the result:

1. With the patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
            First   Second
write       78M     173M
rewrite     112M    203M
randread    1710M   1697M
randwrite   192M    1412M

2. With Peter's patch
write       134M    169M
rewrite     134M    203M
randread    1717M   1705M
randwrite   179M    1412M

3. Adjust the shift to 12
write       260M    259M
rewrite     240M    246M
randread    1712M   1700M
randwrite   1409M   1409M

4. With Peter's patch and adjust the shift to 12
write       256M    239M
rewrite     253M    253M
randread    1704M   1716M
randwrite   1414M   1416M


Run "./iozone -B -i 0 -i 2 -r 4k -s 500M" twice on disk B.

1. With the patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
            First   Second
write       821M    725M
rewrite     144M    1299M
randread    1740M   1733M
randwrite   1444M   1440M

2. With Peter's patch
write       1100M   1112M
rewrite     1295M   1313M
randread    1745M   1744M
randwrite   1452M   1449M

3. Adjust the shift to 12
write       1021M   1104M
rewrite     1314M   1311M
randread    1741M   1737M
randwrite   1448M   1445M

4. With Peter's patch and adjust the shift to 12
write       1104M   1105M
rewrite     1292M   1308M
randread    1737M   1741M
randwrite   1449M   1449M