On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote: > On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote: > > On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote: > > > On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote: > > > > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote: > > > > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote: > > > > > > wbc.pages_skipped = 0; > > > > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned > > > > > > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; > > > > > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > > > > > > /* Wrote less than expected */ > > > > > > - congestion_wait(WRITE, HZ/10); > > > > > > - if (!wbc.encountered_congestion) > > > > > > + if (wbc.encountered_congestion || wbc.more_io) > > > > > > + congestion_wait(WRITE, HZ/10); > > > > > > + else > > > > > > break; > > > > > > } > > > > > > > > > > Why do you call congestion_wait() if there is more I/O to issue? If > > > > > we have a fast filesystem, this might cause the device queues to > > > > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we > > > > > will have trouble keeping the queues full, right? > > > > > > > > You mean slow writers and fast RAID? That would be exactly the case > > > > these patches try to improve. > > > > > > I mean any writers and a fast block device (raid or otherwise). > > > > > > > This patchset makes kupdate/background writeback more responsible, > > > > so that if (avg-write-speed < device-capabilities), the dirty data are > > > > synced timely, and we don't have to go for balance_dirty_pages(). > > > > > > Sure, but I'm asking about the effect of the patches on the > > > (avg-write-speed == device-capabilities) case. I agree that > > > they are necessary for timely syncing of data but I'm trying > > > to understand what effect they have on the normal write case > > > > > (i.e. keeping the disk at full write throughput). > > > > OK, I guess it is the focus of all your questions: Why should we sleep > > in congestion_wait() and possibly hurt the write throughput? I'll try > > to summary it: > > > > - congestion_wait() is necessary > > Besides device congestions, there may be other blockades we have to > > wait on, e.g. temporary page locks, NFS/journal issues(I guess). > > We skip locked pages in writeback, and if some filesystems have > blocking issues that require non-blocking writeback waits for some > I/O to complete before re-entering writeback, then perhaps they should be > setting wbc->encountered_congestion to tell writeback to back off.
We have wbc->pages_skipped for that :-) > The question I'm asking is that if more_io tells us we have more > work to do, why do we have to sleep first if the block dev is > able to take more I/O? See below. > > > > - congestion_wait() is called only when necessary > > congestion_wait() will only be called we saw blockades: > > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { > > congestion_wait(WRITE, HZ/10); > > } > > So in normal case, it may well write 128MB data without any waiting. > > Sure, but wbc.more_io doesn't indicate a blockade - just that there > is more work to do, right? It's not wbc.more_io, but the context(wbc.pages_skipped > 0) indicates a blockade: if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* all-written or blockade... */ if (wbc.encountered_congestion || wbc.more_io) /* blockade! */ congestion_wait(WRITE, HZ/10); else /* all-written! */ break; } We can also read the whole background_writeout() logic as while (!done) { /* sync _all_ sync-able data */ congestion_wait(100ms); } And an example run could be: sync 1000MB, skipped 100MB congestion_wait(100ms); sync 100MB, skipped 10MB congestion_wait(100ms); sync 10MB, all done Note that it's far from "wait 100ms for every 4MB" (which is merely the worst possible case). > > - congestion_wait() won't hurt write throughput > > When not congested, congestion_wait() will be wake up on each write > > completion. > > What happens if the I/O we issued has already completed before we > got back up to the congestion_wait() call? We'll spend 100ms > sleeping when we shouldn't have and throughput goes down by 10% on > every occurrence.... Ah, that was out of my imagination. Maybe we could do with if (wbc.more_io) congestion_wait(WRITE, 1); It's at least 10 times better. > if we've got more work to do, then we should do it without an > arbitrary, non-deterministic delay being inserted. If the delay is > needed to prevent he system from "going mad" (whatever tht means), > then what's the explaination for the system "going mad"? "going mad" means "busy waiting". > > Note that MAX_WRITEBACK_PAGES=1024 and > > /sys/block/sda/queue/max_sectors_kb=512(for me), > > which means we are gave the chance to sync 4MB on every 512KB written, > > which means we are able to submit write IOs 8 times faster than the > > device capability. congestion_wait() is a magical timer :-) > > So, with Jens Axboe's sglist chaining, that single I/O could now > be up to 32MB on some hardware. IOWs, we push 1024 pages, and that > could end up as a single I/O being issued to disk. > > Your magic just broke. :/ Hmm, congestion_wait(WRITE, 1) could re-establish the balance ;-) Which waits <10ms for HZ=100. > > > > So for your question of queue depth, the answer is: the queue length > > > > will not build up in the first place. > > > > > > Which queue are you talking about here? The block deivce queue? > > > > Yes, the elevator's queues. > > I think this is the wrong thing to be doing and is detrimental > to I/o perfomrance because it wil reduce elevator efficiency. > > The elevator can only work efficiently if we allow the queues to > build up. The deeper the queue, the better the elevator can sort the > I/o requests and keep the device at maximum efficiency. If we don't > push enough I/O into the queues the we miss opportunities to combine > adjacent I/Os and reduce the seek load of writeback. Also, a shallow > queue will run dry if we don't get back to it in time which is > possible if we wait for I/o to complete before we go and flush > more.... Sure, the queues should be filled as fast as possible. How fast can we fill the queue? Let's measure it: //generated by the patch below [ 871.430700] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 54289 global 29911 0 0 wc _M tw -12 sk 0 [ 871.444718] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 53253 global 28857 0 0 wc _M tw -12 sk 0 [ 871.458764] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 52217 global 27834 0 0 wc _M tw -12 sk 0 [ 871.472797] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 51181 global 26780 0 0 wc _M tw -12 sk 0 [ 871.486825] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 50145 global 25757 0 0 wc _M tw -12 sk 0 [ 871.500857] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 49109 global 24734 0 0 wc _M tw -12 sk 0 [ 871.514864] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 48073 global 23680 0 0 wc _M tw -12 sk 0 [ 871.528889] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 47037 global 22657 0 0 wc _M tw -12 sk 0 [ 871.542894] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 46001 global 21603 0 0 wc _M tw -12 sk 0 [ 871.556927] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 44965 global 20580 0 0 wc _M tw -12 sk 0 [ 871.570961] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 43929 global 19557 0 0 wc _M tw -12 sk 0 [ 871.584992] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 42893 global 18503 0 0 wc _M tw -12 sk 0 [ 871.599005] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 41857 global 17480 0 0 wc _M tw -12 sk 0 [ 871.613027] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 40821 global 16426 0 0 wc _M tw -12 sk 0 [ 871.628626] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 39785 global 15403 961 0 wc _M tw -12 sk 0 [ 871.644439] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 38749 global 14380 1550 0 wc _M tw -12 sk 0 [ 871.660267] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 37713 global 13326 2573 0 wc _M tw -12 sk 0 [ 871.676236] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 36677 global 12303 3224 0 wc _M tw -12 sk 0 [ 871.692021] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 35641 global 11280 4154 0 wc _M tw -12 sk 0 [ 871.707824] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 34605 global 10226 4929 0 wc _M tw -12 sk 0 [ 871.723638] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 33569 global 9203 5735 0 wc _M tw -12 sk 0 [ 871.739708] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 32533 global 8149 6603 0 wc _M tw -12 sk 0 [ 871.756407] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 31497 global 7126 7409 0 wc _M tw -12 sk 0 [ 871.772165] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 30461 global 6103 8246 0 wc _M tw -12 sk 0 [ 871.788035] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 29425 global 5049 9052 0 wc _M tw -12 sk 0 [ 871.803896] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 28389 global 4026 9982 0 wc _M tw -12 sk 0 [ 871.820427] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 27353 global 2972 10757 0 wc _M tw -12 sk 0 [ 871.836728] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 26317 global 1949 11656 0 wc _M tw -12 sk 0 [ 871.853286] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 25281 global 895 12431 0 wc _M tw -12 sk 0 [ 871.868762] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 24245 global 58 13051 0 wc __ tw 168 sk 0 It's an Intel Core 2 2.93GHz CPU and a SATA disk. The trace shows that - there's no congestion_wait() called in wb_kupdate() - it takes wb_kupdate() ~15ms to sync every 4MB So I guess congestion_wait(WRITE, 1) will be more than enough. However, wb_kupdate() is syncing the data a bit slow(4*1000/15=266MB/s), could it be because of a lot of cond_resched()? Fengguang --- mm/page-writeback.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) --- linux-2.6.23-rc8-mm2.orig/mm/page-writeback.c +++ linux-2.6.23-rc8-mm2/mm/page-writeback.c @@ -98,6 +98,26 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ +#define writeback_debug_report(n, wbc) do { \ + __writeback_debug_report(n, wbc, __FILE__, __LINE__, __FUNCTION__); \ +} while (0) + +void __writeback_debug_report(long n, struct writeback_control *wbc, + const char *file, int line, const char *func) +{ + printk("%s %d %s: %s(%d) %ld " + "global %lu %lu %lu " + "wc %c%c tw %ld sk %ld\n", + file, line, func, + current->comm, current->pid, n, + global_page_state(NR_FILE_DIRTY), + global_page_state(NR_WRITEBACK), + global_page_state(NR_UNSTABLE_NFS), + wbc->encountered_congestion ? 'C':'_', + wbc->more_io ? 'M':'_', + wbc->nr_to_write, + wbc->pages_skipped); +} static void background_writeout(unsigned long _min_pages); @@ -404,6 +424,7 @@ static void balance_dirty_pages(struct a pages_written += write_chunk - wbc.nr_to_write; get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); + writeback_debug_report(pages_written, &wbc); } /* @@ -568,6 +589,7 @@ static void background_writeout(unsigned wbc.pages_skipped = 0; writeback_inodes(&wbc); min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; + writeback_debug_report(min_pages, &wbc); if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* Wrote less than expected */ if (wbc.encountered_congestion) @@ -643,6 +665,7 @@ static void wb_kupdate(unsigned long arg wbc.encountered_congestion = 0; wbc.nr_to_write = MAX_WRITEBACK_PAGES; writeback_inodes(&wbc); + writeback_debug_report(nr_to_write, &wbc); if (wbc.nr_to_write > 0) { if (wbc.encountered_congestion) congestion_wait(WRITE, HZ/10); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/