Re: Reducing the bdi proporion calculation period to speed up disk write
On Tue, 2007-12-11 at 11:11 +0100, Peter Zijlstra wrote: > On Tue, 2007-12-11 at 14:25 +0800, zhejiang wrote: > > The patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f implemented bdi per > > device dirty threshold. It works well. > > However, the period for proportion calculation may be too large. > > For 8G memory, the calc_period_shift() will return 19 as the shift. > > > > When we switch writing operation between different disks, there may be > > potential performance issue. > > > > For example, we first write to disk A, then write to disk B. > > The proportion for disk B will increase slowly because the denominator > > is too large (It's 2^18 + (global_count & counter_mask)). > > The disk B will get small dirty page quota for a long time, > > it will get blocked frequently though the total dirty page is under the > > dirty page limit. > > > > Peter provided a patch to avoid this issue, this patch allow violation > > of bdi limits if there is a lot of room on the system. > > It looks like: > > > > +if (nr_reclaimable + nr_writeback < (background_thresh + > > dirty_thresh) / 2) > > + break; > > > > This patch really help to avoid congestion, but if the dirty pages > > exceed about 3/4 of the dirty_thresh, congestion still happens if we > > write to another disk. > > > > I think that we can reduce the period to speed up the proportion > > adjustment. > > > > diff -Nur a/page-writeback.c b/page-writeback.c > > --- a/page-writeback.c 2007-12-11 13:46:30.0 +0800 > > +++ b/page-writeback.c 2007-12-11 13:47:11.0 +0800 > > @@ -128,10 +128,7 @@ > > */ > > static int calc_period_shift(void) > > { > > - unsigned long dirty_total; > > - > > - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / > > 100; > > - return 2 + ilog2(dirty_total - 1); > > + return 12; > > } > > Its a heuristic, it might need some tuning, but a static value is wrong. > I think its generally true that the larger the machine memory size, the > faster the storage subsystem. And the more likely it has more disks. > > One of the reasons this value isn't static is that with your fixed 12 it > becomes very hard to balance over more than 4096 active devices. Of > course, it takes quite a special set-up to get into that situation. I strongly agree with you that a static value is not a good idea. > > As it is, it now takes about 2 * dirty limit to switch over, you could > start by making that just a single, or maybe even half a, dirty limit. We will do more testing to choose a better formular based on dirty_ratio and total memory. > > > Also, I'm not quite convinced your benchmark is all that useful. Do you > really think it matches an actual frequently occurring usage pattern? We used iozone to test 1.2GB sequential write/rewrite. It's hard to match exactly an actual usage pattern, but I have an example. Administrator might backup big files to other free disks periodically although he/she might not need it fast. -yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Reducing the bdi proporion calculation period to speed up disk write
On Tue, 2007-12-11 at 14:25 +0800, zhejiang wrote: > The patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f implemented bdi per > device dirty threshold. It works well. > However, the period for proportion calculation may be too large. > For 8G memory, the calc_period_shift() will return 19 as the shift. > > When we switch writing operation between different disks, there may be > potential performance issue. > > For example, we first write to disk A, then write to disk B. > The proportion for disk B will increase slowly because the denominator > is too large (It's 2^18 + (global_count & counter_mask)). > The disk B will get small dirty page quota for a long time, > it will get blocked frequently though the total dirty page is under the > dirty page limit. > > Peter provided a patch to avoid this issue, this patch allow violation > of bdi limits if there is a lot of room on the system. > It looks like: > > +if (nr_reclaimable + nr_writeback < (background_thresh + > dirty_thresh) / 2) > + break; > > This patch really help to avoid congestion, but if the dirty pages > exceed about 3/4 of the dirty_thresh, congestion still happens if we > write to another disk. > > I think that we can reduce the period to speed up the proportion > adjustment. > > diff -Nur a/page-writeback.c b/page-writeback.c > --- a/page-writeback.c 2007-12-11 13:46:30.0 +0800 > +++ b/page-writeback.c 2007-12-11 13:47:11.0 +0800 > @@ -128,10 +128,7 @@ > */ > static int calc_period_shift(void) > { > - unsigned long dirty_total; > - > - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / > 100; > - return 2 + ilog2(dirty_total - 1); > + return 12; > } Its a heuristic, it might need some tuning, but a static value is wrong. I think its generally true that the larger the machine memory size, the faster the storage subsystem. And the more likely it has more disks. One of the reasons this value isn't static is that with your fixed 12 it becomes very hard to balance over more than 4096 active devices. Of course, it takes quite a special set-up to get into that situation. As it is, it now takes about 2 * dirty limit to switch over, you could start by making that just a single, or maybe even half a, dirty limit. Also, I'm not quite convinced your benchmark is all that useful. Do you really think it matches an actual frequently occurring usage pattern? signature.asc Description: This is a digitally signed message part
Reducing the bdi proporion calculation period to speed up disk write
The patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f implemented bdi per device dirty threshold. It works well. However, the period for proportion calculation may be too large. For 8G memory, the calc_period_shift() will return 19 as the shift. When we switch writing operation between different disks, there may be potential performance issue. For example, we first write to disk A, then write to disk B. The proportion for disk B will increase slowly because the denominator is too large (It's 2^18 + (global_count & counter_mask)). The disk B will get small dirty page quota for a long time, it will get blocked frequently though the total dirty page is under the dirty page limit. Peter provided a patch to avoid this issue, this patch allow violation of bdi limits if there is a lot of room on the system. It looks like: +if (nr_reclaimable + nr_writeback < (background_thresh + dirty_thresh) / 2) + break; This patch really help to avoid congestion, but if the dirty pages exceed about 3/4 of the dirty_thresh, congestion still happens if we write to another disk. I think that we can reduce the period to speed up the proportion adjustment. diff -Nur a/page-writeback.c b/page-writeback.c --- a/page-writeback.c 2007-12-11 13:46:30.0 +0800 +++ b/page-writeback.c 2007-12-11 13:47:11.0 +0800 @@ -128,10 +128,7 @@ */ static int calc_period_shift(void) { - unsigned long dirty_total; - - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100; - return 2 + ilog2(dirty_total - 1); + return 12; } In the 8G memory system, I did some testing with iozone. I found that reducing the period help to increase the write speed when switch to a new disk. Run "./iozone -B -i 0 -i 2 -r 4k -s 1000M" twice in the disk B. Here is the result: 1. With the patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f First Second write 78M 173M rewrite 112M203M randread1710M 1697M randwrite 192M1412M 2. With Peter's patch write 134M169M rewrite 134M203M randread1717M 1705M randwrite 179M1412M 3.Adjust the shift to 12 write 260M259M rewrite 240M246M randread1712M 1700M randwrite 1409M 1409M 4.With Peter's patch and adjust the shift to 12 write 256M239M rewrite 253M253M randread1704M 1716M randwrite 1414M 1416M Run "./iozone -B -i 0 -i 2 -r 4k -s 500M" twice in the disk B. 1. With the patch 04fbfdc14e5f48463820d6b9807daa5e9c92c51f First Second write 821M725M rewrite 144M1299M randread1740M 1733M randwrite 1444M 1440M 2. With Peter's patch write 1100M 1112M rewrite 1295M 1313M randread1745M 1744M randwrite 1452M 1449M 3.Adjust the shift to 12 write 1021M 1104M rewrite 1314M 1311M randread1741M 1737M randwrite 1448M 1445M 4.With Peter's patch and adjust the shift to 12 write 1104M 1105M rewrite 1292M 1308M randread1737M 1741M randwrite 1449M 1449M -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/