Re: [PATCHSET v5] Make background writeback great again for the first time
On Fri 13-05-16 12:29:10, Jens Axboe wrote:
> Thanks Jan, this is great and super useful! I'm revamping certain parts of
> it to deal with write back caching better, and I'll take a look at the
> regressions that you reported.
>
> What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it would
> probably be a safe assumption that it's flagging itself as having a volatile
> write back cache, would that be a correct assumption?

Yes, it is SATA with writeback cache.

> Are you using scsi-mq, or do you have an IO scheduler attached to it?

The disk was using IO scheduler, however at this point I'm not 100% sure
which scheduler (deadline or cfq) was the default one for the distro that
was installed. The machine is currently testing something else so I cannot
reinstall it and check. Maybe I can rerun some tests later in the week when
the machine gets freed with scsi-mq or deadline IO scheduler so that we have
100% certain config.

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCHSET v5] Make background writeback great again for the first time
On 05/11/2016 10:36 AM, Jan Kara wrote:
On Tue 03-05-16 14:17:19, Jan Kara wrote:

The question remains how common a pattern where throttling of background
writeback delays also something else is. I'll schedule a couple of
benchmarks to measure impact of your patches for a wider range of workloads
(but sadly pretty limited set of hw). If ext3 is the only one seeing
issues, I would be willing to accept that ext3 takes the hit since it is
doing something rather stupid (but inherent in its journal design) and we
have a way to deal with this either by enabling delayed allocation or by
turning off the writeback throttling...

So I've run some benchmarks on a machine with 6 GB of RAM and SSD with queue
depth 32. The filesystem on the disk was XFS this time. I've found couple of
regressions. A clear one is with dbench (version 4). The average throughput
numbers look like:

                         Baseline                 WBT
Hmean mb/sec-1         30.26 (  0.00%)      18.67 (-38.28%)
Hmean mb/sec-2         40.71 (  0.00%)      31.25 (-23.23%)
Hmean mb/sec-4         52.67 (  0.00%)      46.83 (-11.09%)
Hmean mb/sec-8         69.51 (  0.00%)      64.35 ( -7.42%)
Hmean mb/sec-16        91.07 (  0.00%)      86.46 ( -5.07%)
Hmean mb/sec-32       115.10 (  0.00%)     110.29 ( -4.18%)
Hmean mb/sec-64       145.14 (  0.00%)     134.97 ( -7.00%)
Hmean mb/sec-512       93.99 (  0.00%)     133.85 ( 42.41%)

There were also some losses in a filebench webproxy workload (I can give you
exact details of the settings if you want to reproduce it).

Also, and this really puzzles me, I've seen higher read latencies in some
cases (I've verified they are not just noise by rerunning the test for
kernel with writeback throttling patches). For example with the following
fio job file:

[global]
direct=0
ioengine=sync
runtime=300
time_based
invalidate=1
blocksize=4096
size=10g	# Just random value, we are running time based workload
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=1g
fdatasync=256
readwrite=randwrite
numjobs=4

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

I get the following results:

Throughput                       Baseline                 WBT
Hmean kb/sec-writer-write      591.60 (  0.00%)     507.00 (-14.30%)
Hmean kb/sec-reader-read       211.81 (  0.00%)     137.53 (-35.07%)

So both read and write throughput have suffered. And latencies don't offset
for the loss either:

FIO read latency
Min         latency-read     1383.00 (  0.00%)     1519.00 (  -9.83%)
1st-qrtle   latency-read     3485.00 (  0.00%)     5235.00 ( -50.22%)
2nd-qrtle   latency-read     4708.00 (  0.00%)    15028.00 (-219.20%)
3rd-qrtle   latency-read    10286.00 (  0.00%)    57622.00 (-460.20%)
Max-90%     latency-read   195834.00 (  0.00%)   167149.00 (  14.65%)
Max-93%     latency-read   273145.00 (  0.00%)   200319.00 (  26.66%)
Max-95%     latency-read   335434.00 (  0.00%)   220695.00 (  34.21%)
Max-99%     latency-read   537017.00 (  0.00%)   347174.00 (  35.35%)
Max         latency-read   991101.00 (  0.00%)   485835.00 (  50.98%)
Mean        latency-read    51282.79 (  0.00%)    49953.95 (   2.59%)

So we have reduced the extra high read latencies which is nice but on
average there is no change.

And another fio jobfile which doesn't look great:

[global]
direct=0
ioengine=sync
runtime=300
blocksize=4096
invalidate=1
time_based
ramp_time=5	# Let the flusher thread start before taking measurements
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=$((MEMTOTAL_BYTES*2))
readwrite=randwrite

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

The throughput numbers look like:

Hmean kb/sec-writer-write    24707.22 (  0.00%)    19912.23 (-19.41%)
Hmean kb/sec-reader-read       886.65 (  0.00%)      905.71 (  2.15%)

So we've got significant hit in writes not really offset by a big increase
in reads. Read latency numbers look like (I show the WBT numbers for two runs
just so that one can see how variable the latency numbers are because I was
puzzled by very high max latency for WBT kernels - quartiles seem rather
stable higher percentiles and min/max are rather variable):

                                 Baseline                 WBT                   WBT
Min         latency-read     1230.00 (  0.00%)     1560.00 (-26.83%)     1100.00 ( 10.57%)
1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (  0.18%)     3351.00 (  0.18%)
2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (  0.44%)     4022.00 (  1.28%)
3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (  1.02%)     5095.00 (  1.98%)
Max-90% la
Re: [PATCHSET v5] Make background writeback great again for the first time
On Tue 03-05-16 14:17:19, Jan Kara wrote:
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

So I've run some benchmarks on a machine with 6 GB of RAM and SSD with queue
depth 32. The filesystem on the disk was XFS this time. I've found couple of
regressions. A clear one is with dbench (version 4). The average throughput
numbers look like:

                         Baseline                 WBT
Hmean mb/sec-1         30.26 (  0.00%)      18.67 (-38.28%)
Hmean mb/sec-2         40.71 (  0.00%)      31.25 (-23.23%)
Hmean mb/sec-4         52.67 (  0.00%)      46.83 (-11.09%)
Hmean mb/sec-8         69.51 (  0.00%)      64.35 ( -7.42%)
Hmean mb/sec-16        91.07 (  0.00%)      86.46 ( -5.07%)
Hmean mb/sec-32       115.10 (  0.00%)     110.29 ( -4.18%)
Hmean mb/sec-64       145.14 (  0.00%)     134.97 ( -7.00%)
Hmean mb/sec-512       93.99 (  0.00%)     133.85 ( 42.41%)

There were also some losses in a filebench webproxy workload (I can give you
exact details of the settings if you want to reproduce it).

Also, and this really puzzles me, I've seen higher read latencies in some
cases (I've verified they are not just noise by rerunning the test for
kernel with writeback throttling patches). For example with the following
fio job file:

[global]
direct=0
ioengine=sync
runtime=300
time_based
invalidate=1
blocksize=4096
size=10g	# Just random value, we are running time based workload
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=1g
fdatasync=256
readwrite=randwrite
numjobs=4

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

I get the following results:

Throughput                       Baseline                 WBT
Hmean kb/sec-writer-write      591.60 (  0.00%)     507.00 (-14.30%)
Hmean kb/sec-reader-read       211.81 (  0.00%)     137.53 (-35.07%)

So both read and write throughput have suffered. And latencies don't offset
for the loss either:

FIO read latency
Min         latency-read     1383.00 (  0.00%)     1519.00 (  -9.83%)
1st-qrtle   latency-read     3485.00 (  0.00%)     5235.00 ( -50.22%)
2nd-qrtle   latency-read     4708.00 (  0.00%)    15028.00 (-219.20%)
3rd-qrtle   latency-read    10286.00 (  0.00%)    57622.00 (-460.20%)
Max-90%     latency-read   195834.00 (  0.00%)   167149.00 (  14.65%)
Max-93%     latency-read   273145.00 (  0.00%)   200319.00 (  26.66%)
Max-95%     latency-read   335434.00 (  0.00%)   220695.00 (  34.21%)
Max-99%     latency-read   537017.00 (  0.00%)   347174.00 (  35.35%)
Max         latency-read   991101.00 (  0.00%)   485835.00 (  50.98%)
Mean        latency-read    51282.79 (  0.00%)    49953.95 (   2.59%)

So we have reduced the extra high read latencies which is nice but on
average there is no change.

And another fio jobfile which doesn't look great:

[global]
direct=0
ioengine=sync
runtime=300
blocksize=4096
invalidate=1
time_based
ramp_time=5	# Let the flusher thread start before taking measurements
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=$((MEMTOTAL_BYTES*2))
readwrite=randwrite

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

The throughput numbers look like:

Hmean kb/sec-writer-write    24707.22 (  0.00%)    19912.23 (-19.41%)
Hmean kb/sec-reader-read       886.65 (  0.00%)      905.71 (  2.15%)

So we've got significant hit in writes not really offset by a big increase
in reads. Read latency numbers look like (I show the WBT numbers for two runs
just so that one can see how variable the latency numbers are because I was
puzzled by very high max latency for WBT kernels - quartiles seem rather
stable higher percentiles and min/max are rather variable):

                                 Baseline                 WBT                   WBT
Min         latency-read     1230.00 (  0.00%)     1560.00 (-26.83%)     1100.00 ( 10.57%)
1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (  0.18%)     3351.00 (  0.18%)
2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (  0.44%)     4022.00 (  1.28%)
3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (  1.02%)     5095.00 (  1.98%)
Max-90%     latency-read     6594.00 (  0.
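A note on reading the percentage columns in the tables above: the quoted fio rows are consistent with a "positive is better" convention, where throughput changes are computed as (patched - baseline) / baseline and latency changes with the sign flipped. The sketch below is only an illustration of that reading, not the reporting tool's actual code; the function names are made up for this example. A few dbench rows come out a hundredth off when recomputed this way, presumably because the printed means are themselves rounded.

```python
def throughput_change_pct(baseline, patched):
    """Relative change for a higher-is-better metric.

    Negative means the patched kernel was slower.
    """
    return round((patched - baseline) / baseline * 100.0, 2)


def latency_change_pct(baseline, patched):
    """Relative change for a lower-is-better metric.

    The sign is flipped (assumed convention), so negative means the
    patched kernel had higher latency.
    """
    return round((baseline - patched) / baseline * 100.0, 2)


if __name__ == "__main__":
    # Rows taken from the first fio job's tables.
    print("writer throughput:", throughput_change_pct(591.60, 507.00))
    print("reader throughput:", throughput_change_pct(211.81, 137.53))
    print("min read latency: ", latency_change_pct(1383.00, 1519.00))
    print("max read latency: ", latency_change_pct(991101.00, 485835.00))
```

Under this reading, the -460.20% in the 3rd-quartile row means the latency grew to roughly 5.6 times the baseline, while the positive Max rows mean the tail got shorter.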
Re: [PATCHSET v5] Make background writeback great again for the first time
On Tue 03-05-16 09:42:40, Chris Mason wrote: > On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote: > > On Tue 03-05-16 08:40:11, Chris Mason wrote: > > > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote: > > > > On Thu 28-04-16 12:46:41, Jens Axboe wrote: > > > > > >>- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step)); > > > > > >>- rwb->wb_normal = (rwb->wb_max + 1) / 2; > > > > > >>- rwb->wb_background = (rwb->wb_max + 3) / 4; > > > > > >>+ if (rwb->queue_depth == 1) { > > > > > >>+ rwb->wb_max = rwb->wb_normal = 2; > > > > > >>+ rwb->wb_background = 1; > > > > > > > > > > > >This breaks the detection of too big scale_step in scale_up() where > > > > > >we key > > > > > >of wb_max == 1 value. However even with that fixed no luck :(: > > > > > > > > > > Yeah, I need to look at that. For QD=1, I think the only sensible > > > > > values for > > > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down. > > > > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > > > >Runtime: 105.126 107.125 105.641 > > > > > > > > > > > >So about the same as before. I'll try to debug this later today... > > > > > > > > > > Thanks, I'm very interested in what you find! > > > > > > > > OK, so the reason was relatively standard in the end. I was using ext3 > > > > (or > > > > more exactly ext4 without delayed allocation) for the test. The > > > > throttling > > > > of background writes gave more priority to writes from the journalling > > > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the > > > > journalling thread ended up having to do more data writeback to be able > > > > to > > > > commit a transaction (due to requirements of data=ordered mode) and it > > > > is > > > > less efficient at that than the normal flusher thread. 
> > > > > > > > So this is an example where throttling background writeback effectively > > > > just pushes more work into another context which does it less > > > > efficiently > > > > and indirectly makes everyone wait for it. ext3 has been always > > > > sensitive to > > > > issues like this. ext4 is using delayed allocation and thus only data > > > > writes into holes end up being part of a transaction -> simple dd test > > > > case > > > > doesn't hit that path. And indeed when I repeat the same test with ext4, > > > > the numbers with and without your patch are exactly the same. > > > > > > > > The question remains how common a pattern where throttling of background > > > > writeback delays also something else is. I'll schedule a couple of > > > > benchmarks to measure impact of your patches for a wider range of > > > > workloads > > > > (but sadly pretty limited set of hw). If ext3 is the only one seeing > > > > issues, I would be willing to accept that ext3 takes the hit since it is > > > > doing something rather stupid (but inherent in its journal design) and > > > > we > > > > have a way to deal with this either by enabling delayed allocation or by > > > > turning off the writeback throttling... > > > > > > At least in the case of io that we know is going to be data=ordered, we > > > can bump the prio of those pages? > > > > But how would flusher thread, which is submitting IO, know that? We would > > have to somehow mark inodes that are part of the running transaction and > > flusher thread could give more priority to such writeback - e.g. by using > > WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that, > > it could be doable. > > This would be specific to the data=ordered code in the FS. If there's > some way to test for an inode or a page's status in the data=ordered > list, the FS writepages call could flag the IO as higher prio? Oh, right, we could do that. I can experiment with that later. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCHSET v5] Make background writeback great again for the first time
On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote: > On Tue 03-05-16 08:40:11, Chris Mason wrote: > > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote: > > > On Thu 28-04-16 12:46:41, Jens Axboe wrote: > > > > >>- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step)); > > > > >>- rwb->wb_normal = (rwb->wb_max + 1) / 2; > > > > >>- rwb->wb_background = (rwb->wb_max + 3) / 4; > > > > >>+ if (rwb->queue_depth == 1) { > > > > >>+ rwb->wb_max = rwb->wb_normal = 2; > > > > >>+ rwb->wb_background = 1; > > > > > > > > > >This breaks the detection of too big scale_step in scale_up() where we > > > > >key > > > > >of wb_max == 1 value. However even with that fixed no luck :(: > > > > > > > > Yeah, I need to look at that. For QD=1, I think the only sensible > > > > values for > > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down. > > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > > >Runtime: 105.126 107.125 105.641 > > > > > > > > > >So about the same as before. I'll try to debug this later today... > > > > > > > > Thanks, I'm very interested in what you find! > > > > > > OK, so the reason was relatively standard in the end. I was using ext3 (or > > > more exactly ext4 without delayed allocation) for the test. The throttling > > > of background writes gave more priority to writes from the journalling > > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the > > > journalling thread ended up having to do more data writeback to be able to > > > commit a transaction (due to requirements of data=ordered mode) and it is > > > less efficient at that than the normal flusher thread. > > > > > > So this is an example where throttling background writeback effectively > > > just pushes more work into another context which does it less efficiently > > > and indirectly makes everyone wait for it. ext3 has been always sensitive > > > to > > > issues like this. 
ext4 is using delayed allocation and thus only data > > > writes into holes end up being part of a transaction -> simple dd test > > > case > > > doesn't hit that path. And indeed when I repeat the same test with ext4, > > > the numbers with and without your patch are exactly the same. > > > > > > The question remains how common a pattern where throttling of background > > > writeback delays also something else is. I'll schedule a couple of > > > benchmarks to measure impact of your patches for a wider range of > > > workloads > > > (but sadly pretty limited set of hw). If ext3 is the only one seeing > > > issues, I would be willing to accept that ext3 takes the hit since it is > > > doing something rather stupid (but inherent in its journal design) and we > > > have a way to deal with this either by enabling delayed allocation or by > > > turning off the writeback throttling... > > > > At least in the case of io that we know is going to be data=ordered, we > > can bump the prio of those pages? > > But how would flusher thread, which is submitting IO, know that? We would > have to somehow mark inodes that are part of the running transaction and > flusher thread could give more priority to such writeback - e.g. by using > WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that, > it could be doable. This would be specific to the data=ordered code in the FS. If there's some way to test for an inode or a page's status in the data=ordered list, the FS writepages call could flag the IO as higher prio? -chris
Re: [PATCHSET v5] Make background writeback great again for the first time
On Tue 03-05-16 08:40:11, Chris Mason wrote: > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote: > > On Thu 28-04-16 12:46:41, Jens Axboe wrote: > > > >>- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step)); > > > >>- rwb->wb_normal = (rwb->wb_max + 1) / 2; > > > >>- rwb->wb_background = (rwb->wb_max + 3) / 4; > > > >>+ if (rwb->queue_depth == 1) { > > > >>+ rwb->wb_max = rwb->wb_normal = 2; > > > >>+ rwb->wb_background = 1; > > > > > > > >This breaks the detection of too big scale_step in scale_up() where we > > > >key > > > >of wb_max == 1 value. However even with that fixed no luck :(: > > > > > > Yeah, I need to look at that. For QD=1, I think the only sensible values > > > for > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down. > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > >Runtime: 105.126 107.125 105.641 > > > > > > > >So about the same as before. I'll try to debug this later today... > > > > > > Thanks, I'm very interested in what you find! > > > > OK, so the reason was relatively standard in the end. I was using ext3 (or > > more exactly ext4 without delayed allocation) for the test. The throttling > > of background writes gave more priority to writes from the journalling > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the > > journalling thread ended up having to do more data writeback to be able to > > commit a transaction (due to requirements of data=ordered mode) and it is > > less efficient at that than the normal flusher thread. > > > > So this is an example where throttling background writeback effectively > > just pushes more work into another context which does it less efficiently > > and indirectly makes everyone wait for it. ext3 has been always sensitive to > > issues like this. ext4 is using delayed allocation and thus only data > > writes into holes end up being part of a transaction -> simple dd test case > > doesn't hit that path. 
And indeed when I repeat the same test with ext4, > > the numbers with and without your patch are exactly the same. > > > > The question remains how common a pattern where throttling of background > > writeback delays also something else is. I'll schedule a couple of > > benchmarks to measure impact of your patches for a wider range of workloads > > (but sadly pretty limited set of hw). If ext3 is the only one seeing > > issues, I would be willing to accept that ext3 takes the hit since it is > > doing something rather stupid (but inherent in its journal design) and we > > have a way to deal with this either by enabling delayed allocation or by > > turning off the writeback throttling... > > At least in the case of io that we know is going to be data=ordered, we > can bump the prio of those pages? But how would flusher thread, which is submitting IO, know that? We would have to somehow mark inodes that are part of the running transaction and flusher thread could give more priority to such writeback - e.g. by using WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that, it could be doable. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCHSET v5] Make background writeback great again for the first time
On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote: > On Thu 28-04-16 12:46:41, Jens Axboe wrote: > > >>- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step)); > > >>- rwb->wb_normal = (rwb->wb_max + 1) / 2; > > >>- rwb->wb_background = (rwb->wb_max + 3) / 4; > > >>+ if (rwb->queue_depth == 1) { > > >>+ rwb->wb_max = rwb->wb_normal = 2; > > >>+ rwb->wb_background = 1; > > > > > >This breaks the detection of too big scale_step in scale_up() where we key > > >of wb_max == 1 value. However even with that fixed no luck :(: > > > > Yeah, I need to look at that. For QD=1, I think the only sensible values for > > max/normal/bg is 2/2/1 and 1/1/1 if we step down. > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > >Runtime: 105.126 107.125 105.641 > > > > > >So about the same as before. I'll try to debug this later today... > > > > Thanks, I'm very interested in what you find! > > OK, so the reason was relatively standard in the end. I was using ext3 (or > more exactly ext4 without delayed allocation) for the test. The throttling > of background writes gave more priority to writes from the journalling > thread which happen with WRITE_SYNC and thus are not throttled. Thus the > journalling thread ended up having to do more data writeback to be able to > commit a transaction (due to requirements of data=ordered mode) and it is > less efficient at that than the normal flusher thread. > > So this is an example where throttling background writeback effectively > just pushes more work into another context which does it less efficiently > and indirectly makes everyone wait for it. ext3 has been always sensitive to > issues like this. ext4 is using delayed allocation and thus only data > writes into holes end up being part of a transaction -> simple dd test case > doesn't hit that path. And indeed when I repeat the same test with ext4, > the numbers with and without your patch are exactly the same. 
> > The question remains how common a pattern where throttling of background > writeback delays also something else is. I'll schedule a couple of > benchmarks to measure impact of your patches for a wider range of workloads > (but sadly pretty limited set of hw). If ext3 is the only one seeing > issues, I would be willing to accept that ext3 takes the hit since it is > doing something rather stupid (but inherent in its journal design) and we > have a way to deal with this either by enabling delayed allocation or by > turning off the writeback throttling... At least in the case of io that we know is going to be data=ordered, we can bump the prio of those pages? -chris
Re: [PATCHSET v5] Make background writeback great again for the first time
On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> >>+	if (rwb->queue_depth == 1) {
> >>+		rwb->wb_max = rwb->wb_normal = 2;
> >>+		rwb->wb_background = 1;
> >
> >This breaks the detection of too big scale_step in scale_up() where we key
> >of wb_max == 1 value. However even with that fixed no luck :(:
>
> Yeah, I need to look at that. For QD=1, I think the only sensible values for
> max/normal/bg is 2/2/1 and 1/1/1 if we step down.
>
> >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> >Runtime: 105.126 107.125 105.641
> >
> >So about the same as before. I'll try to debug this later today...
>
> Thanks, I'm very interested in what you find!

OK, so the reason was relatively standard in the end. I was using ext3 (or
more exactly ext4 without delayed allocation) for the test. The throttling
of background writes gave more priority to writes from the journalling
thread which happen with WRITE_SYNC and thus are not throttled. Thus the
journalling thread ended up having to do more data writeback to be able to
commit a transaction (due to requirements of data=ordered mode) and it is
less efficient at that than the normal flusher thread.

So this is an example where throttling background writeback effectively
just pushes more work into another context which does it less efficiently
and indirectly makes everyone wait for it. ext3 has been always sensitive to
issues like this. ext4 is using delayed allocation and thus only data
writes into holes end up being part of a transaction -> simple dd test case
doesn't hit that path. And indeed when I repeat the same test with ext4,
the numbers with and without your patch are exactly the same.

The question remains how common a pattern where throttling of background
writeback delays also something else is. I'll schedule a couple of
benchmarks to measure impact of your patches for a wider range of workloads
(but sadly pretty limited set of hw). If ext3 is the only one seeing
issues, I would be willing to accept that ext3 takes the hit since it is
doing something rather stupid (but inherent in its journal design) and we
have a way to deal with this either by enabling delayed allocation or by
turning off the writeback throttling...

								Honza
-- 
Jan Kara
SUSE Labs, CR
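For readers following the limit arithmetic in the quoted diff, here is a small Python model of it. The general formula is transcribed from the removed lines of the diff; the QD=1 branch follows Jens's remark that 2/2/1 and, after stepping down, 1/1/1 are the only sensible values. How the real patch ties that branch to scale_step is not shown in the thread, so that part is an assumption, as is the function name.

```python
def calc_wb_limits(depth, scale_step):
    """Return (wb_max, wb_normal, wb_background) for a given queue depth.

    This models the arithmetic discussed in the thread; it is not kernel code.
    """
    if depth == 1:
        # Assumption: 2/2/1 normally, 1/1/1 once we have stepped down.
        return (1, 1, 1) if scale_step > 0 else (2, 2, 1)

    # General case, from the removed ('-') lines of the quoted diff.
    wb_max = 1 + ((depth - 1) >> min(31, scale_step))
    wb_normal = (wb_max + 1) // 2
    wb_background = (wb_max + 3) // 4
    return (wb_max, wb_normal, wb_background)


if __name__ == "__main__":
    for depth in (1, 32):
        for step in range(0, 7):
            print(depth, step, calc_wb_limits(depth, step))
```

With the general formula, wb_max bottoms out at 1 once scale_step is large enough (for depth 32 that happens at step 5), which is the value the scale_up() check mentioned above keys off. A QD=1 special case that pins wb_max at 2 would never hit that value, which is the breakage Jan points out.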
Re: [PATCHSET v5] Make background writeback great again for the first time
On 04/28/2016 05:54 AM, Jan Kara wrote:
On Wed 27-04-16 14:59:15, Jens Axboe wrote:
On Wed, Apr 27 2016, Jens Axboe wrote:
On Wed, Apr 27 2016, Jens Axboe wrote:
On 04/27/2016 12:01 PM, Jan Kara wrote:

Hi,

On Tue 26-04-16 09:55:23, Jens Axboe wrote:

Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

I have posted plenty of results previously, I'll keep it shorter
this time. Here's a run on my laptop, using read-to-pipe-async for
reading a 5g file, and rewriting it. You can find this test program
in the fio git repo.

I have tested your patchset on my test system. Generally I have observed
noticeable drop in average throughput for heavy background writes without
any other disk activity and also somewhat increased variance in the
runtimes. It is most visible on this simple testcases:

dd if=/dev/zero of=/mnt/file bs=1M count=1

and

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync

The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
created before each dd run on a dedicated disk.

Without your patches I get pretty stable dd runtimes for both cases:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 87.9611 87.3279 87.2554

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 93.3502 93.2086 93.541

With your patches the numbers look like:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 108.183, 97.184, 99.9587

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 104.9, 102.775, 102.892

I have checked whether the variance is due to some interaction with CFQ
which is used for the disk. When I switched the disk to deadline, I still
get some variance although, the throughput is still ~10% lower:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 100.417 100.643 100.866

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 104.208 106.341 105.483

The disk is rotational SATA drive with writeback cache, queue depth of the
disk reported in /sys/block/sdb/device/queue_depth is 1.

So I think we still need some tweaking on the low end of the storage
spectrum so that we don't lose 10% of throughput for simple cases like
this.

Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
you are seeing smaller requests, and that is why it both varies and you
get lower throughput? I'll try and setup a test here similar to yours.

Jan, care to try the below patch? I can't fully reproduce your issue on a
SCSI disk limited to QD=1, but I have a feeling this might help. It's a
bit of a hack, but the general idea is to allow one more request to build
up for QD=1 devices. That eliminates wait time between one request
finishing, and the next being submitted.

That accidentally added a potentially stall, this one is both cleaner
and should have that fixed.

..
-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
-	rwb->wb_background = (rwb->wb_max + 3) / 4;
+	if (rwb->queue_depth == 1) {
+		rwb->wb_max = rwb->wb_normal = 2;
+		rwb->wb_background = 1;

This breaks the detection of too big scale_step in scale_up() where we key
of wb_max == 1 value. However even with that fixed no luck :(:

Yeah, I need to look at that. For QD=1, I think the only sensible values
for max/normal/bg is 2/2/1 and 1/1/1 if we step down.

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...

Thanks, I'm very interested in what you find!

-- 
Jens Axboe
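To put the dd runtimes quoted above side by side, a short script like the following could be used. It is just a convenience sketch: the lists are copied from Jan's non-fsync runs, the helper names are invented here, and the unit of the runtimes is not stated in the thread (presumably seconds).

```python
from statistics import mean, pstdev

# Runtimes for "dd if=/dev/zero of=/mnt/file bs=1M count=1" as quoted above.
BASELINE = [87.9611, 87.3279, 87.2554]
WBT_CFQ = [108.183, 97.184, 99.9587]
WBT_DEADLINE = [100.417, 100.643, 100.866]


def summarize(runtimes):
    """Return (mean, population standard deviation) of a list of runtimes."""
    return mean(runtimes), pstdev(runtimes)


def slowdown_pct(baseline, patched):
    """Increase in mean runtime of `patched` over `baseline`, in percent."""
    return (mean(patched) / mean(baseline) - 1.0) * 100.0


if __name__ == "__main__":
    for name, runs in (("baseline", BASELINE),
                       ("wbt + cfq", WBT_CFQ),
                       ("wbt + deadline", WBT_DEADLINE)):
        avg, spread = summarize(runs)
        print(f"{name:15s} mean={avg:8.3f} stdev={spread:6.3f}")
    print(f"cfq slowdown:      {slowdown_pct(BASELINE, WBT_CFQ):.1f}%")
    print(f"deadline slowdown: {slowdown_pct(BASELINE, WBT_DEADLINE):.1f}%")
```

With only three runs per configuration the spread figures are rough, but they are enough to separate the runtime increase from the run-to-run variance that Jan describes.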
Re: [PATCHSET v5] Make background writeback great again for the first time
On 04/27/2016 10:06 PM, xiakaixu wrote:
>> diff --git a/lib/wbt.c b/lib/wbt.c
>> index 650da911f24f..322f5e04e994 100644
>> --- a/lib/wbt.c
>> +++ b/lib/wbt.c
>> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>>  	else
>>  		limit = rwb->wb_normal;
>
> Hi Jens,
>
> This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
> enough. It is not a big deal anyway :)

I'll clean that up, thanks for noticing. No functional difference.

> Another question about this if branch:
>
>     if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
>         limit = 0;
>
> I can't follow the logic of this if branch. Why set limit equal to 0
> when the device supports write back caches and there are tasks being
> limited in balance_dirty_pages()? Could you please give more info
> about this? Thanks!

Sure. So for write back caching, we have to try a bit harder to ensure
that the device doesn't build up long internal queues with a lot of dirty
data in the cache. So for the case where we have write back caching AND we
don't have anyone waiting for the IO, allow the queue depth to drain to
zero before building it back up again. Does that make sense?

>> +	inflight = atomic_dec_return(&rwb->inflight);
>> +
>>  	/*
>> -	 * Don't wake anyone up if we are above the normal limit. If
>> -	 * throttling got disabled (limit == 0) with waiters, ensure
>> -	 * that we wake them up.
>> +	 * wbt got disabled with IO in flight. Wake up any potential
>> +	 * waiters, we don't have to do more than that.
>>  	 */
>> -	inflight = atomic_dec_return(&rwb->inflight);
>> -	if (limit && inflight >= limit) {
>> -		if (!rwb->wb_max)
>> -			wake_up_all(&rwb->wait);
>> +	if (!rwb_enabled(rwb)) {
>> +		wake_up_all(&rwb->wait);
>>  		return;
>>  	}
>
> Maybe it is better to execute this if branch earlier, so we can wake up
> potential waiters in time when wbt got disabled.

The !rwb_enabled() case will only happen if someone disabled wbt while we
had tracked IO in flight. We have to do it below the atomic_dec_return(),
so we could reorder that to be at the front. Ideally we just want it
out-of-line instead, as it's the unexpected slower path.

-- 
Jens Axboe
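The limit selection Jens explains above can be sketched in userspace roughly as follows. This is only a simplified model: `struct wb_model` and `wake_limit()` are hypothetical names made up for illustration and are not part of lib/wbt.c; only the condition itself is taken from the quoted patch.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the few rq_wb fields involved; this is not
 * the kernel structure.
 */
struct wb_model {
	bool wc;                 /* device reports a write back cache */
	int dirty_sleeping;      /* tasks throttled in balance_dirty_pages() */
	unsigned int wb_normal;  /* normal wakeup threshold */
};

/*
 * Threshold the inflight count is compared against when a tracked
 * request completes.
 */
static unsigned int wake_limit(const struct wb_model *m)
{
	/* Write back cache and nobody waiting on the IO: drain to zero. */
	if (m->wc && m->dirty_sleeping == 0)
		return 0;
	return m->wb_normal;
}
```

With `wc` set and no sleepers the function returns 0, which matches the "allow the queue depth to drain to zero" behaviour described in the reply; in every other case it falls back to `wb_normal`.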
Re: [PATCHSET v5] Make background writeback great again for the first time
On Wed 27-04-16 14:59:15, Jens Axboe wrote: > On Wed, Apr 27 2016, Jens Axboe wrote: > > On Wed, Apr 27 2016, Jens Axboe wrote: > > > On 04/27/2016 12:01 PM, Jan Kara wrote: > > > >Hi, > > > > > > > >On Tue 26-04-16 09:55:23, Jens Axboe wrote: > > > >>Since the dawn of time, our background buffered writeback has sucked. > > > >>When we do background buffered writeback, it should have little impact > > > >>on foreground activity. That's the definition of background activity... > > > >>But for as long as I can remember, heavy buffered writers have not > > > >>behaved like that. For instance, if I do something like this: > > > >> > > > >>$ dd if=/dev/zero of=foo bs=1M count=10k > > > >> > > > >>on my laptop, and then try and start chrome, it basically won't start > > > >>before the buffered writeback is done. Or, for server oriented > > > >>workloads, where installation of a big RPM (or similar) adversely > > > >>impacts database reads or sync writes. When that happens, I get people > > > >>yelling at me. > > > >> > > > >>I have posted plenty of results previously, I'll keep it shorter > > > >>this time. Here's a run on my laptop, using read-to-pipe-async for > > > >>reading a 5g file, and rewriting it. You can find this test program > > > >>in the fio git repo. > > > > > > > >I have tested your patchset on my test system. Generally I have observed > > > >noticeable drop in average throughput for heavy background writes without > > > >any other disk activity and also somewhat increased variance in the > > > >runtimes. It is most visible on this simple testcases: > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > > > > > > >and > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > > > > > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly > > > >created before each dd run on a dedicated disk. 
> > > > > > > >Without your patches I get pretty stable dd runtimes for both cases: > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > > >Runtimes: 87.9611 87.3279 87.2554 > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > >Runtimes: 93.3502 93.2086 93.541 > > > > > > > >With your patches the numbers look like: > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > > >Runtimes: 108.183, 97.184, 99.9587 > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > >Runtimes: 104.9, 102.775, 102.892 > > > > > > > >I have checked whether the variance is due to some interaction with CFQ > > > >which is used for the disk. When I switched the disk to deadline, I still > > > >get some variance although, the throughput is still ~10% lower: > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > > >Runtimes: 100.417 100.643 100.866 > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > >Runtimes: 104.208 106.341 105.483 > > > > > > > >The disk is rotational SATA drive with writeback cache, queue depth of > > > >the > > > >disk reported in /sys/block/sdb/device/queue_depth is 1. > > > > > > > >So I think we still need some tweaking on the low end of the storage > > > >spectrum so that we don't lose 10% of throughput for simple cases like > > > >this. > > > > > > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if > > > you are seeing smaller requests, and that is why it both varies and > > > you get lower throughput? I'll try and setup a test here similar to > > > yours. > > > > Jan, care to try the below patch? I can't fully reproduce your issue on > > a SCSI disk limited to QD=1, but I have a feeling this might help. It's > > a bit of a hack, but the general idea is to allow one more request to > > build up for QD=1 devices. That eliminates wait time between one request > > finishing, and the next being submitted. 
>
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> ..
> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	if (rwb->queue_depth == 1) {
> +		rwb->wb_max = rwb->wb_normal = 2;
> +		rwb->wb_background = 1;

This breaks the detection of a too big scale_step in scale_up(), where we
key off the wb_max == 1 value. However, even with that fixed, no luck :(:

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...

								Honza
-- 
Jan Kara
SUSE Labs, CR
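To make the scale_step remark concrete, here is a small userspace sketch of the limit arithmetic from the quoted hunk. The function and struct names are hypothetical, `depth` is assumed to be at least 1, and the fallback to the old formula for queue depths other than 1 is an assumption, since that part of the patch is elided in the quote.

```c
struct wb_limits {
	unsigned int wb_max;
	unsigned int wb_normal;
	unsigned int wb_background;
};

/* Limits as computed by the lines the patch removes; depth must be >= 1. */
static struct wb_limits limits_old(unsigned int depth, unsigned int scale_step)
{
	struct wb_limits l;
	unsigned int shift = scale_step < 31U ? scale_step : 31U;

	l.wb_max = 1 + ((depth - 1) >> shift);
	l.wb_normal = (l.wb_max + 1) / 2;
	l.wb_background = (l.wb_max + 3) / 4;
	return l;
}

/*
 * Patched variant for QD=1 devices. Falling back to the old formula for
 * other queue depths is an assumption; the quote does not show it.
 */
static struct wb_limits limits_patched(unsigned int queue_depth,
				       unsigned int depth,
				       unsigned int scale_step)
{
	if (queue_depth == 1) {
		struct wb_limits l = { 2, 2, 1 };

		return l;
	}
	return limits_old(depth, scale_step);
}
```

With the old formula, a depth of 32 at scale_step 0 gives wb_max 32, wb_normal 16 and wb_background 8, and a large enough scale_step shrinks wb_max to 1. With the QD=1 special case, wb_max stays at 2 whatever the scale_step is, so a check that keys off wb_max == 1 would never trigger for such devices.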
Re: [PATCHSET v5] Make background writeback great again for the first time
On 2016/4/28 4:59, Jens Axboe wrote: > On Wed, Apr 27 2016, Jens Axboe wrote: >> On Wed, Apr 27 2016, Jens Axboe wrote: >>> On 04/27/2016 12:01 PM, Jan Kara wrote: Hi, On Tue 26-04-16 09:55:23, Jens Axboe wrote: > Since the dawn of time, our background buffered writeback has sucked. > When we do background buffered writeback, it should have little impact > on foreground activity. That's the definition of background activity... > But for as long as I can remember, heavy buffered writers have not > behaved like that. For instance, if I do something like this: > > $ dd if=/dev/zero of=foo bs=1M count=10k > > on my laptop, and then try and start chrome, it basically won't start > before the buffered writeback is done. Or, for server oriented > workloads, where installation of a big RPM (or similar) adversely > impacts database reads or sync writes. When that happens, I get people > yelling at me. > > I have posted plenty of results previously, I'll keep it shorter > this time. Here's a run on my laptop, using read-to-pipe-async for > reading a 5g file, and rewriting it. You can find this test program > in the fio git repo. I have tested your patchset on my test system. Generally I have observed noticeable drop in average throughput for heavy background writes without any other disk activity and also somewhat increased variance in the runtimes. It is most visible on this simple testcases: dd if=/dev/zero of=/mnt/file bs=1M count=1 and dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly created before each dd run on a dedicated disk. 
Without your patches I get pretty stable dd runtimes for both cases: dd if=/dev/zero of=/mnt/file bs=1M count=1 Runtimes: 87.9611 87.3279 87.2554 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync Runtimes: 93.3502 93.2086 93.541 With your patches the numbers look like: dd if=/dev/zero of=/mnt/file bs=1M count=1 Runtimes: 108.183, 97.184, 99.9587 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync Runtimes: 104.9, 102.775, 102.892 I have checked whether the variance is due to some interaction with CFQ which is used for the disk. When I switched the disk to deadline, I still get some variance although, the throughput is still ~10% lower: dd if=/dev/zero of=/mnt/file bs=1M count=1 Runtimes: 100.417 100.643 100.866 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync Runtimes: 104.208 106.341 105.483 The disk is rotational SATA drive with writeback cache, queue depth of the disk reported in /sys/block/sdb/device/queue_depth is 1. So I think we still need some tweaking on the low end of the storage spectrum so that we don't lose 10% of throughput for simple cases like this. >>> >>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if >>> you are seeing smaller requests, and that is why it both varies and >>> you get lower throughput? I'll try and setup a test here similar to >>> yours. >> >> Jan, care to try the below patch? I can't fully reproduce your issue on >> a SCSI disk limited to QD=1, but I have a feeling this might help. It's >> a bit of a hack, but the general idea is to allow one more request to >> build up for QD=1 devices. That eliminates wait time between one request >> finishing, and the next being submitted. > > That accidentally added a potentially stall, this one is both cleaner > and should have that fixed. 
>
> diff --git a/lib/wbt.c b/lib/wbt.c
> index 650da911f24f..322f5e04e994 100644
> --- a/lib/wbt.c
> +++ b/lib/wbt.c
> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>  	else
>  		limit = rwb->wb_normal;

Hi Jens,

This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
enough. It is not a big deal anyway :)

Another question about this if branch:

    if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
        limit = 0;

I can't follow the logic of this if branch. Why set limit equal to 0 when
the device supports write back caches and there are tasks being limited in
balance_dirty_pages()? Could you please give more info about this? Thanks!

>
> +	inflight = atomic_dec_return(&rwb->inflight);
> +
>  	/*
> -	 * Don't wake anyone up if we are above the normal limit. If
> -	 * throttling got disabled (limit == 0) with waiters, ensure
> -	 * that we wake them up.
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
>  	 */
> -	inflight = atomic_dec_return(&rwb->inflight);
> -	if (limit && inflight >= limit) {
> -		if (!rwb->wb_max)
> -			wake_up_all(
Re: [PATCHSET v5] Make background writeback great again for the first time
On Wed, Apr 27 2016, Jens Axboe wrote: > On Wed, Apr 27 2016, Jens Axboe wrote: > > On 04/27/2016 12:01 PM, Jan Kara wrote: > > >Hi, > > > > > >On Tue 26-04-16 09:55:23, Jens Axboe wrote: > > >>Since the dawn of time, our background buffered writeback has sucked. > > >>When we do background buffered writeback, it should have little impact > > >>on foreground activity. That's the definition of background activity... > > >>But for as long as I can remember, heavy buffered writers have not > > >>behaved like that. For instance, if I do something like this: > > >> > > >>$ dd if=/dev/zero of=foo bs=1M count=10k > > >> > > >>on my laptop, and then try and start chrome, it basically won't start > > >>before the buffered writeback is done. Or, for server oriented > > >>workloads, where installation of a big RPM (or similar) adversely > > >>impacts database reads or sync writes. When that happens, I get people > > >>yelling at me. > > >> > > >>I have posted plenty of results previously, I'll keep it shorter > > >>this time. Here's a run on my laptop, using read-to-pipe-async for > > >>reading a 5g file, and rewriting it. You can find this test program > > >>in the fio git repo. > > > > > >I have tested your patchset on my test system. Generally I have observed > > >noticeable drop in average throughput for heavy background writes without > > >any other disk activity and also somewhat increased variance in the > > >runtimes. It is most visible on this simple testcases: > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > > > > >and > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > > > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly > > >created before each dd run on a dedicated disk. 
> > > > > >Without your patches I get pretty stable dd runtimes for both cases: > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > >Runtimes: 87.9611 87.3279 87.2554 > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > >Runtimes: 93.3502 93.2086 93.541 > > > > > >With your patches the numbers look like: > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > >Runtimes: 108.183, 97.184, 99.9587 > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > >Runtimes: 104.9, 102.775, 102.892 > > > > > >I have checked whether the variance is due to some interaction with CFQ > > >which is used for the disk. When I switched the disk to deadline, I still > > >get some variance although, the throughput is still ~10% lower: > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > >Runtimes: 100.417 100.643 100.866 > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > >Runtimes: 104.208 106.341 105.483 > > > > > >The disk is rotational SATA drive with writeback cache, queue depth of the > > >disk reported in /sys/block/sdb/device/queue_depth is 1. > > > > > >So I think we still need some tweaking on the low end of the storage > > >spectrum so that we don't lose 10% of throughput for simple cases like > > >this. > > > > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if > > you are seeing smaller requests, and that is why it both varies and > > you get lower throughput? I'll try and setup a test here similar to > > yours. > > Jan, care to try the below patch? I can't fully reproduce your issue on > a SCSI disk limited to QD=1, but I have a feeling this might help. It's > a bit of a hack, but the general idea is to allow one more request to > build up for QD=1 devices. That eliminates wait time between one request > finishing, and the next being submitted. That accidentally added a potentially stall, this one is both cleaner and should have that fixed. 
diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..322f5e04e994 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
 	else
 		limit = rwb->wb_normal;

+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}

+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight && inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;

@@ -150,14 +155,26 @@ static void calc_wb_limits(struct rq_wb *rwb)
 		return;
 	}

-	depth = min_t(unsigned int, RW
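The two versions of the patch in this thread differ in the early-return check of __wbt_done(): the earlier one (quoted further down) tests `inflight >= limit`, this one tests `inflight && inflight >= limit`. The sketch below models only that decision in userspace. The enum and function names are hypothetical, and the reading that the difference matters when the limit is 0 is an interpretation of the diffs, not something the thread states.

```c
#include <stdbool.h>

/* Hypothetical outcome of the completion path, for illustration only. */
enum wbt_action {
	WBT_WAKE_ALL,       /* wbt disabled: wake every waiter */
	WBT_NO_WAKE,        /* still at or above the limit: return early */
	WBT_CHECK_WAITERS   /* fall through to the waitqueue_active() check */
};

/* Check as in the earlier patch; inflight is the post-decrement count. */
static enum wbt_action done_v1(bool enabled, int inflight, int limit)
{
	if (!enabled)
		return WBT_WAKE_ALL;
	if (inflight >= limit)
		return WBT_NO_WAKE;
	return WBT_CHECK_WAITERS;
}

/* Check as in this patch: a fully drained queue never returns early. */
static enum wbt_action done_v2(bool enabled, int inflight, int limit)
{
	if (!enabled)
		return WBT_WAKE_ALL;
	if (inflight && inflight >= limit)
		return WBT_NO_WAKE;
	return WBT_CHECK_WAITERS;
}
```

In this model, a limit of 0 with nothing left in flight makes `done_v1()` return early without looking at the wait queue, while `done_v2()` goes on to check for waiters. That may be the potential stall mentioned in the message above, but treat that as a guess.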
Re: [PATCHSET v5] Make background writeback great again for the first time
On Wed, Apr 27 2016, Jens Axboe wrote: > On 04/27/2016 12:01 PM, Jan Kara wrote: > >Hi, > > > >On Tue 26-04-16 09:55:23, Jens Axboe wrote: > >>Since the dawn of time, our background buffered writeback has sucked. > >>When we do background buffered writeback, it should have little impact > >>on foreground activity. That's the definition of background activity... > >>But for as long as I can remember, heavy buffered writers have not > >>behaved like that. For instance, if I do something like this: > >> > >>$ dd if=/dev/zero of=foo bs=1M count=10k > >> > >>on my laptop, and then try and start chrome, it basically won't start > >>before the buffered writeback is done. Or, for server oriented > >>workloads, where installation of a big RPM (or similar) adversely > >>impacts database reads or sync writes. When that happens, I get people > >>yelling at me. > >> > >>I have posted plenty of results previously, I'll keep it shorter > >>this time. Here's a run on my laptop, using read-to-pipe-async for > >>reading a 5g file, and rewriting it. You can find this test program > >>in the fio git repo. > > > >I have tested your patchset on my test system. Generally I have observed > >noticeable drop in average throughput for heavy background writes without > >any other disk activity and also somewhat increased variance in the > >runtimes. It is most visible on this simple testcases: > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > > > >and > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > > > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly > >created before each dd run on a dedicated disk. 
> > > >Without your patches I get pretty stable dd runtimes for both cases: > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > >Runtimes: 87.9611 87.3279 87.2554 > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > >Runtimes: 93.3502 93.2086 93.541 > > > >With your patches the numbers look like: > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > >Runtimes: 108.183, 97.184, 99.9587 > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > >Runtimes: 104.9, 102.775, 102.892 > > > >I have checked whether the variance is due to some interaction with CFQ > >which is used for the disk. When I switched the disk to deadline, I still > >get some variance although, the throughput is still ~10% lower: > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 > >Runtimes: 100.417 100.643 100.866 > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync > >Runtimes: 104.208 106.341 105.483 > > > >The disk is rotational SATA drive with writeback cache, queue depth of the > >disk reported in /sys/block/sdb/device/queue_depth is 1. > > > >So I think we still need some tweaking on the low end of the storage > >spectrum so that we don't lose 10% of throughput for simple cases like > >this. > > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if > you are seeing smaller requests, and that is why it both varies and > you get lower throughput? I'll try and setup a test here similar to > yours. Jan, care to try the below patch? I can't fully reproduce your issue on a SCSI disk limited to QD=1, but I have a feeling this might help. It's a bit of a hack, but the general idea is to allow one more request to build up for QD=1 devices. That eliminates wait time between one request finishing, and the next being submitted. 
diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..6b24c8525ace 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -93,23 +93,30 @@ void __wbt_done(struct rq_wb *rwb)
 	 * If the device does write back caching, drop further down
 	 * before we wake people up.
 	 */
-	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
+	if (rwb->queue_depth == 1)
+		limit = 2;
+	else if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
 		limit = 0;
 	else
 		limit = rwb->wb_normal;

+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}

+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;

@@ -366,6 +373,9 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	} els
Re: [PATCHSET v5] Make background writeback great again for the first time
On 04/27/2016 12:01 PM, Jan Kara wrote: Hi, On Tue 26-04-16 09:55:23, Jens Axboe wrote: Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers have not behaved like that. For instance, if I do something like this: $ dd if=/dev/zero of=foo bs=1M count=10k on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done. Or, for server oriented workloads, where installation of a big RPM (or similar) adversely impacts database reads or sync writes. When that happens, I get people yelling at me. I have posted plenty of results previously, I'll keep it shorter this time. Here's a run on my laptop, using read-to-pipe-async for reading a 5g file, and rewriting it. You can find this test program in the fio git repo. I have tested your patchset on my test system. Generally I have observed noticeable drop in average throughput for heavy background writes without any other disk activity and also somewhat increased variance in the runtimes. It is most visible on this simple testcases: dd if=/dev/zero of=/mnt/file bs=1M count=1 and dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly created before each dd run on a dedicated disk. 
Without your patches I get pretty stable dd runtimes for both cases: dd if=/dev/zero of=/mnt/file bs=1M count=1 Runtimes: 87.9611 87.3279 87.2554 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync Runtimes: 93.3502 93.2086 93.541 With your patches the numbers look like: dd if=/dev/zero of=/mnt/file bs=1M count=1 Runtimes: 108.183, 97.184, 99.9587 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync Runtimes: 104.9, 102.775, 102.892 I have checked whether the variance is due to some interaction with CFQ which is used for the disk. When I switched the disk to deadline, I still get some variance although, the throughput is still ~10% lower: dd if=/dev/zero of=/mnt/file bs=1M count=1 Runtimes: 100.417 100.643 100.866 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync Runtimes: 104.208 106.341 105.483 The disk is rotational SATA drive with writeback cache, queue depth of the disk reported in /sys/block/sdb/device/queue_depth is 1. So I think we still need some tweaking on the low end of the storage spectrum so that we don't lose 10% of throughput for simple cases like this. Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if you are seeing smaller requests, and that is why it both varies and you get lower throughput? I'll try and setup a test here similar to yours. -- Jens Axboe
Re: [PATCHSET v5] Make background writeback great again for the first time
Hi,

On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> Since the dawn of time, our background buffered writeback has sucked.
> When we do background buffered writeback, it should have little impact
> on foreground activity. That's the definition of background activity...
> But for as long as I can remember, heavy buffered writers have not
> behaved like that. For instance, if I do something like this:
>
> $ dd if=/dev/zero of=foo bs=1M count=10k
>
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts database reads or sync writes. When that happens, I get people
> yelling at me.
>
> I have posted plenty of results previously, I'll keep it shorter
> this time. Here's a run on my laptop, using read-to-pipe-async for
> reading a 5g file, and rewriting it. You can find this test program
> in the fio git repo.

I have tested your patchset on my test system. Generally I have observed a
noticeable drop in average throughput for heavy background writes without
any other disk activity and also somewhat increased variance in the
runtimes. It is most visible on these simple testcases:

dd if=/dev/zero of=/mnt/file bs=1M count=1

and

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync

The machine has 4GB of RAM, /mnt is an ext3 filesystem that is freshly
created before each dd run on a dedicated disk.

Without your patches I get pretty stable dd runtimes for both cases:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 87.9611 87.3279 87.2554

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 93.3502 93.2086 93.541

With your patches the numbers look like:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 108.183, 97.184, 99.9587

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 104.9, 102.775, 102.892

I have checked whether the variance is due to some interaction with CFQ,
which is used for the disk. When I switched the disk to deadline, I still
get some variance, although the throughput is still ~10% lower:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 100.417 100.643 100.866

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 104.208 106.341 105.483

The disk is a rotational SATA drive with writeback cache; the queue depth
of the disk reported in /sys/block/sdb/device/queue_depth is 1.

So I think we still need some tweaking on the low end of the storage
spectrum so that we don't lose 10% of throughput for simple cases like
this.

								Honza
-- 
Jan Kara
SUSE Labs, CR
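For readers who want to check the "~10%" figure against the runtimes listed above, a small sketch of the arithmetic follows. The helper names are hypothetical, and it assumes each run writes the same amount of data, so that throughput is inversely proportional to runtime.

```c
#include <stddef.h>

/* Arithmetic mean of n runtimes. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/*
 * Throughput drop in percent, assuming a fixed amount of data so that
 * throughput scales as 1 / runtime.
 */
static double throughput_drop_pct(double base_runtime, double patched_runtime)
{
	return (1.0 - base_runtime / patched_runtime) * 100.0;
}
```

Feeding in the mean runtimes from the message gives drops of roughly 10 to 14 percent depending on the case (CFQ or deadline, with or without conv=fsync), which is in line with the estimate in the text.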