Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-16 Thread Jan Kara
On Fri 13-05-16 12:29:10, Jens Axboe wrote:
> Thanks Jan, this is great and super useful! I'm revamping certain parts of
> it to deal with write back caching better, and I'll take a look at the
> regressions that you reported.
> 
> What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it would
> probably be a safe assumption that it's flagging itself as having a volatile
> write back cache, would that be a correct assumption?

Yes, it is SATA with a writeback cache.

> Are you using scsi-mq, or do you have an IO scheduler attached to it?

The disk was using an IO scheduler; however, at this point I'm not 100% sure
which scheduler (deadline or cfq) was the default one for the distro that
was installed. The machine is currently testing something else, so I cannot
reinstall it and check. Maybe I can rerun some tests later in the week, once
the machine gets freed, with scsi-mq or the deadline IO scheduler so that we
have a 100% certain config.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-13 Thread Jens Axboe

On 05/11/2016 10:36 AM, Jan Kara wrote:

On Tue 03-05-16 14:17:19, Jan Kara wrote:

The question remains how common a pattern where throttling of background
writeback delays also something else is. I'll schedule a couple of
benchmarks to measure impact of your patches for a wider range of workloads
(but sadly pretty limited set of hw). If ext3 is the only one seeing
issues, I would be willing to accept that ext3 takes the hit since it is
doing something rather stupid (but inherent in its journal design) and we
have a way to deal with this either by enabling delayed allocation or by
turning off the writeback throttling...


So I've run some benchmarks on a machine with 6 GB of RAM and an SSD with
queue depth 32. The filesystem on the disk was XFS this time. I've found
a couple of regressions. A clear one is with dbench (version 4). The average
throughput numbers look like:

                           Baseline               WBT
Hmean  mb/sec-1       30.26 (  0.00%)    18.67 (-38.28%)
Hmean  mb/sec-2       40.71 (  0.00%)    31.25 (-23.23%)
Hmean  mb/sec-4       52.67 (  0.00%)    46.83 (-11.09%)
Hmean  mb/sec-8       69.51 (  0.00%)    64.35 ( -7.42%)
Hmean  mb/sec-16      91.07 (  0.00%)    86.46 ( -5.07%)
Hmean  mb/sec-32     115.10 (  0.00%)   110.29 ( -4.18%)
Hmean  mb/sec-64     145.14 (  0.00%)   134.97 ( -7.00%)
Hmean  mb/sec-512     93.99 (  0.00%)   133.85 ( 42.41%)

There were also some losses in a filebench webproxy workload (I can give
you exact details of the settings if you want to reproduce it).

Also, and this really puzzles me, I've seen higher read latencies in some
cases (I've verified they are not just noise by rerunning the test for the
kernel with the writeback throttling patches). For example, with the following
fio job file:

[global]
direct=0
ioengine=sync
runtime=300
time_based
invalidate=1
blocksize=4096
size=10g        # Just a random value, we are running a time-based workload
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=1g
fdatasync=256
readwrite=randwrite
numjobs=4

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

I get the following results:

Throughput                           Baseline               WBT
Hmean  kb/sec-writer-write   591.60 (  0.00%)   507.00 (-14.30%)
Hmean  kb/sec-reader-read    211.81 (  0.00%)   137.53 (-35.07%)

So both read and write throughput have suffered. And the latencies don't
offset the loss either:

FIO read latency
Min          latency-read      1383.00 (  0.00%)      1519.00 (  -9.83%)
1st-qrtle    latency-read      3485.00 (  0.00%)      5235.00 ( -50.22%)
2nd-qrtle    latency-read      4708.00 (  0.00%)     15028.00 (-219.20%)
3rd-qrtle    latency-read     10286.00 (  0.00%)     57622.00 (-460.20%)
Max-90%      latency-read    195834.00 (  0.00%)    167149.00 (  14.65%)
Max-93%      latency-read    273145.00 (  0.00%)    200319.00 (  26.66%)
Max-95%      latency-read    335434.00 (  0.00%)    220695.00 (  34.21%)
Max-99%      latency-read    537017.00 (  0.00%)    347174.00 (  35.35%)
Max          latency-read    991101.00 (  0.00%)    485835.00 (  50.98%)
Mean         latency-read     51282.79 (  0.00%)     49953.95 (   2.59%)

So we have reduced the extra-high read latencies, which is nice, but on
average there is no change.

And another fio jobfile which doesn't look great:

[global]
direct=0
ioengine=sync
runtime=300
blocksize=4096
invalidate=1
time_based
ramp_time=5 # Let the flusher thread start before taking measurements
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=$((MEMTOTAL_BYTES*2))
readwrite=randwrite

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

The throughput numbers look like:
Hmean  kb/sec-writer-write   24707.22 (  0.00%)   19912.23 (-19.41%)
Hmean  kb/sec-reader-read      886.65 (  0.00%)     905.71 (  2.15%)

So we've got a significant hit in writes that is not really offset by a big
increase in reads. The read latency numbers look like this (I show the WBT
numbers for two runs just so that one can see how variable the latency numbers
are, because I was puzzled by the very high max latency for the WBT kernels -
the quartiles seem rather stable, while the higher percentiles and min/max are
rather variable):

                                    Baseline                 WBT                 WBT
Min          latency-read    1230.00 (  0.00%)   1560.00 (-26.83%)   1100.00 ( 10.57%)
1st-qrtle    latency-read    3357.00 (  0.00%)   3351.00 (  0.18%)   3351.00 (  0.18%)
2nd-qrtle    latency-read    4074.00 (  0.00%)   4056.00 (  0.44%)   4022.00 (  1.28%)
3rd-qrtle    latency-read    5198.00 (  0.00%)   5145.00 (  1.02%)   5095.00 (  1.98%)
Max-90%

Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-11 Thread Jan Kara
On Tue 03-05-16 14:17:19, Jan Kara wrote:
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

So I've run some benchmarks on a machine with 6 GB of RAM and an SSD with
queue depth 32. The filesystem on the disk was XFS this time. I've found
a couple of regressions. A clear one is with dbench (version 4). The average
throughput numbers look like:

                           Baseline               WBT
Hmean  mb/sec-1       30.26 (  0.00%)    18.67 (-38.28%)
Hmean  mb/sec-2       40.71 (  0.00%)    31.25 (-23.23%)
Hmean  mb/sec-4       52.67 (  0.00%)    46.83 (-11.09%)
Hmean  mb/sec-8       69.51 (  0.00%)    64.35 ( -7.42%)
Hmean  mb/sec-16      91.07 (  0.00%)    86.46 ( -5.07%)
Hmean  mb/sec-32     115.10 (  0.00%)   110.29 ( -4.18%)
Hmean  mb/sec-64     145.14 (  0.00%)   134.97 ( -7.00%)
Hmean  mb/sec-512     93.99 (  0.00%)   133.85 ( 42.41%)

There were also some losses in a filebench webproxy workload (I can give
you exact details of the settings if you want to reproduce it).

Also, and this really puzzles me, I've seen higher read latencies in some
cases (I've verified they are not just noise by rerunning the test for the
kernel with the writeback throttling patches). For example, with the following
fio job file:

[global]
direct=0
ioengine=sync
runtime=300
time_based
invalidate=1
blocksize=4096
size=10g        # Just a random value, we are running a time-based workload
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=1g
fdatasync=256
readwrite=randwrite
numjobs=4

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

I get the following results:

Throughput                           Baseline               WBT
Hmean  kb/sec-writer-write   591.60 (  0.00%)   507.00 (-14.30%)
Hmean  kb/sec-reader-read    211.81 (  0.00%)   137.53 (-35.07%)

So both read and write throughput have suffered. And the latencies don't
offset the loss either:

FIO read latency
Min          latency-read      1383.00 (  0.00%)      1519.00 (  -9.83%)
1st-qrtle    latency-read      3485.00 (  0.00%)      5235.00 ( -50.22%)
2nd-qrtle    latency-read      4708.00 (  0.00%)     15028.00 (-219.20%)
3rd-qrtle    latency-read     10286.00 (  0.00%)     57622.00 (-460.20%)
Max-90%      latency-read    195834.00 (  0.00%)    167149.00 (  14.65%)
Max-93%      latency-read    273145.00 (  0.00%)    200319.00 (  26.66%)
Max-95%      latency-read    335434.00 (  0.00%)    220695.00 (  34.21%)
Max-99%      latency-read    537017.00 (  0.00%)    347174.00 (  35.35%)
Max          latency-read    991101.00 (  0.00%)    485835.00 (  50.98%)
Mean         latency-read     51282.79 (  0.00%)     49953.95 (   2.59%)

So we have reduced the extra-high read latencies, which is nice, but on
average there is no change.

And another fio jobfile which doesn't look great:

[global]
direct=0
ioengine=sync
runtime=300
blocksize=4096
invalidate=1
time_based
ramp_time=5 # Let the flusher thread start before taking measurements
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=$((MEMTOTAL_BYTES*2))
readwrite=randwrite

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

The throughput numbers look like:
Hmean  kb/sec-writer-write   24707.22 (  0.00%)   19912.23 (-19.41%)
Hmean  kb/sec-reader-read      886.65 (  0.00%)     905.71 (  2.15%)

So we've got a significant hit in writes that is not really offset by a big
increase in reads. The read latency numbers look like this (I show the WBT
numbers for two runs just so that one can see how variable the latency numbers
are, because I was puzzled by the very high max latency for the WBT kernels -
the quartiles seem rather stable, while the higher percentiles and min/max are
rather variable):

                                    Baseline                 WBT                 WBT
Min          latency-read    1230.00 (  0.00%)   1560.00 (-26.83%)   1100.00 ( 10.57%)
1st-qrtle    latency-read    3357.00 (  0.00%)   3351.00 (  0.18%)   3351.00 (  0.18%)
2nd-qrtle    latency-read    4074.00 (  0.00%)   4056.00 (  0.44%)   4022.00 (  1.28%)
3rd-qrtle    latency-read    5198.00 (  0.00%)   5145.00 (  1.02%)   5095.00 (  1.98%)
Max-90%      latency-read    6594.00 (

Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-03 Thread Jan Kara
On Tue 03-05-16 09:42:40, Chris Mason wrote:
> On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> > On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > > > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > > > >>-   rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > > > >>-   rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > > > >>-   rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > > > >>+   if (rwb->queue_depth == 1) {
> > > > > >>+   rwb->wb_max = rwb->wb_normal = 2;
> > > > > >>+   rwb->wb_background = 1;
> > > > > >
> > > > > >This breaks the detection of too big scale_step in scale_up() where 
> > > > > >we key
> > > > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > > > 
> > > > > Yeah, I need to look at that. For QD=1, I think the only sensible 
> > > > > values for
> > > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > > > 
> > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > > > > >Runtime: 105.126 107.125 105.641
> > > > > >
> > > > > >So about the same as before. I'll try to debug this later today...
> > > > > 
> > > > > Thanks, I'm very interested in what you find!
> > > > 
> > > > OK, so the reason was relatively standard in the end. I was using ext3 
> > > > (or
> > > > more exactly ext4 without delayed allocation) for the test. The 
> > > > throttling
> > > > of background writes gave more priority to writes from the journalling
> > > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > > > journalling thread ended up having to do more data writeback to be able 
> > > > to
> > > > commit a transaction (due to requirements of data=ordered mode) and it 
> > > > is
> > > > less efficient at that than the normal flusher thread.
> > > > 
> > > > So this is an example where throttling background writeback effectively
> > > > just pushes more work into another context which does it less 
> > > > efficiently
> > > > and indirectly makes everyone wait for it. ext3 has been always 
> > > > sensitive to
> > > > issues like this. ext4 is using delayed allocation and thus only data
> > > > writes into holes end up being part of a transaction -> simple dd test 
> > > > case
> > > > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > > > the numbers with and without your patch are exactly the same.
> > > > 
> > > > The question remains how common a pattern where throttling of background
> > > > writeback delays also something else is. I'll schedule a couple of
> > > > benchmarks to measure impact of your patches for a wider range of 
> > > > workloads
> > > > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > > > issues, I would be willing to accept that ext3 takes the hit since it is
> > > > doing something rather stupid (but inherent in its journal design) and 
> > > > we
> > > > have a way to deal with this either by enabling delayed allocation or by
> > > > turning off the writeback throttling...
> > > 
> > > At least in the case of io that we know is going to be data=ordered, we
> > > can bump the prio of those pages?
> > 
> > But how would flusher thread, which is submitting IO, know that? We would
> > have to somehow mark inodes that are part of the running transaction and
> > flusher thread could give more priority to such writeback - e.g. by using
> > WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> > it could be doable.
> 
> This would be specific to the data=ordered code in the FS.  If there's
> some way to test for an inode or a page's status in the data=ordered
> list, the FS writepages call could flag the IO as higher prio?

Oh, right, we could do that. I can experiment with that later.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-03 Thread Chris Mason
On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > > >>- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > > >>- rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > > >>- rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > > >>+ if (rwb->queue_depth == 1) {
> > > > >>+ rwb->wb_max = rwb->wb_normal = 2;
> > > > >>+ rwb->wb_background = 1;
> > > > >
> > > > >This breaks the detection of too big scale_step in scale_up() where we 
> > > > >key
> > > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > > 
> > > > Yeah, I need to look at that. For QD=1, I think the only sensible 
> > > > values for
> > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > > 
> > > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > > > >Runtime: 105.126 107.125 105.641
> > > > >
> > > > >So about the same as before. I'll try to debug this later today...
> > > > 
> > > > Thanks, I'm very interested in what you find!
> > > 
> > > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > > more exactly ext4 without delayed allocation) for the test. The throttling
> > > of background writes gave more priority to writes from the journalling
> > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > > journalling thread ended up having to do more data writeback to be able to
> > > commit a transaction (due to requirements of data=ordered mode) and it is
> > > less efficient at that than the normal flusher thread.
> > > 
> > > So this is an example where throttling background writeback effectively
> > > just pushes more work into another context which does it less efficiently
> > > and indirectly makes everyone wait for it. ext3 has been always sensitive 
> > > to
> > > issues like this. ext4 is using delayed allocation and thus only data
> > > writes into holes end up being part of a transaction -> simple dd test 
> > > case
> > > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > > the numbers with and without your patch are exactly the same.
> > > 
> > > The question remains how common a pattern where throttling of background
> > > writeback delays also something else is. I'll schedule a couple of
> > > benchmarks to measure impact of your patches for a wider range of 
> > > workloads
> > > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > > issues, I would be willing to accept that ext3 takes the hit since it is
> > > doing something rather stupid (but inherent in its journal design) and we
> > > have a way to deal with this either by enabling delayed allocation or by
> > > turning off the writeback throttling...
> > 
> > At least in the case of io that we know is going to be data=ordered, we
> > can bump the prio of those pages?
> 
> But how would flusher thread, which is submitting IO, know that? We would
> have to somehow mark inodes that are part of the running transaction and
> flusher thread could give more priority to such writeback - e.g. by using
> WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> it could be doable.

This would be specific to the data=ordered code in the FS.  If there's
some way to test for an inode or a page's status in the data=ordered
list, the FS writepages call could flag the IO as higher prio?

-chris



Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-03 Thread Jan Kara
On Tue 03-05-16 08:40:11, Chris Mason wrote:
> On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > >>-   rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > >>-   rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > >>-   rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > >>+   if (rwb->queue_depth == 1) {
> > > >>+   rwb->wb_max = rwb->wb_normal = 2;
> > > >>+   rwb->wb_background = 1;
> > > >
> > > >This breaks the detection of too big scale_step in scale_up() where we 
> > > >key
> > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > 
> > > Yeah, I need to look at that. For QD=1, I think the only sensible values 
> > > for
> > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > 
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > > >Runtime: 105.126 107.125 105.641
> > > >
> > > >So about the same as before. I'll try to debug this later today...
> > > 
> > > Thanks, I'm very interested in what you find!
> > 
> > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > more exactly ext4 without delayed allocation) for the test. The throttling
> > of background writes gave more priority to writes from the journalling
> > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > journalling thread ended up having to do more data writeback to be able to
> > commit a transaction (due to requirements of data=ordered mode) and it is
> > less efficient at that than the normal flusher thread.
> > 
> > So this is an example where throttling background writeback effectively
> > just pushes more work into another context which does it less efficiently
> > and indirectly makes everyone wait for it. ext3 has been always sensitive to
> > issues like this. ext4 is using delayed allocation and thus only data
> > writes into holes end up being part of a transaction -> simple dd test case
> > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > the numbers with and without your patch are exactly the same.
> > 
> > The question remains how common a pattern where throttling of background
> > writeback delays also something else is. I'll schedule a couple of
> > benchmarks to measure impact of your patches for a wider range of workloads
> > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > issues, I would be willing to accept that ext3 takes the hit since it is
> > doing something rather stupid (but inherent in its journal design) and we
> > have a way to deal with this either by enabling delayed allocation or by
> > turning off the writeback throttling...
> 
> At least in the case of io that we know is going to be data=ordered, we
> can bump the prio of those pages?

But how would the flusher thread, which is submitting the IO, know that? We
would have to somehow mark inodes that are part of the running transaction,
and the flusher thread could then give more priority to such writeback - e.g.
by using WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for
that, it could be doable.
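
A minimal, self-contained C sketch of that idea - mark inodes whose data is
pinned by the running data=ordered transaction, and let the flusher's
writepages path upgrade those writes so throttling does not delay the commit.
All identifiers below (the HYP_* flag, struct and helper) are hypothetical,
made up for illustration only, not existing kernel interfaces:

    #include <stdio.h>

    #define HYP_I_ORDERED_DATA (1u << 0)  /* hypothetical "data pinned by running transaction" bit */

    struct hyp_inode {
            unsigned int i_state;         /* hypothetical inode state flags */
    };

    enum hyp_write_prio { HYP_WRITE_BACKGROUND, HYP_WRITE_SYNC };

    /* What a filesystem's writepages path could consult when submitting IO
     * on behalf of the flusher thread: data the running data=ordered
     * transaction depends on gets WRITE_SYNC-like treatment so that
     * writeback throttling does not delay the commit. */
    static enum hyp_write_prio hyp_writeback_prio(const struct hyp_inode *inode)
    {
            if (inode->i_state & HYP_I_ORDERED_DATA)
                    return HYP_WRITE_SYNC;
            return HYP_WRITE_BACKGROUND;
    }

    int main(void)
    {
            struct hyp_inode plain = { .i_state = 0 };
            struct hyp_inode ordered = { .i_state = HYP_I_ORDERED_DATA };

            printf("plain inode   -> %s\n",
                   hyp_writeback_prio(&plain) == HYP_WRITE_SYNC ? "sync" : "background");
            printf("ordered inode -> %s\n",
                   hyp_writeback_prio(&ordered) == HYP_WRITE_SYNC ? "sync" : "background");
            return 0;
    }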

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-03 Thread Chris Mason
On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > >>- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > >>- rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > >>- rwb->wb_background = (rwb->wb_max + 3) / 4;
> > >>+ if (rwb->queue_depth == 1) {
> > >>+ rwb->wb_max = rwb->wb_normal = 2;
> > >>+ rwb->wb_background = 1;
> > >
> > >This breaks the detection of too big scale_step in scale_up() where we key
> > >of wb_max == 1 value. However even with that fixed no luck :(:
> > 
> > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > 
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > >Runtime: 105.126 107.125 105.641
> > >
> > >So about the same as before. I'll try to debug this later today...
> > 
> > Thanks, I'm very interested in what you find!
> 
> OK, so the reason was relatively standard in the end. I was using ext3 (or
> more exactly ext4 without delayed allocation) for the test. The throttling
> of background writes gave more priority to writes from the journalling
> thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> journalling thread ended up having to do more data writeback to be able to
> commit a transaction (due to requirements of data=ordered mode) and it is
> less efficient at that than the normal flusher thread.
> 
> So this is an example where throttling background writeback effectively
> just pushes more work into another context which does it less efficiently
> and indirectly makes everyone wait for it. ext3 has been always sensitive to
> issues like this. ext4 is using delayed allocation and thus only data
> writes into holes end up being part of a transaction -> simple dd test case
> doesn't hit that path. And indeed when I repeat the same test with ext4,
> the numbers with and without your patch are exactly the same.
> 
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

At least in the case of io that we know is going to be data=ordered, we
can bump the prio of those pages?

-chris



Re: [PATCHSET v5] Make background writeback great again for the first time

2016-05-03 Thread Jan Kara
On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> >>-   rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> >>-   rwb->wb_normal = (rwb->wb_max + 1) / 2;
> >>-   rwb->wb_background = (rwb->wb_max + 3) / 4;
> >>+   if (rwb->queue_depth == 1) {
> >>+   rwb->wb_max = rwb->wb_normal = 2;
> >>+   rwb->wb_background = 1;
> >
> >This breaks the detection of too big scale_step in scale_up() where we key
> >of wb_max == 1 value. However even with that fixed no luck :(:
> 
> Yeah, I need to look at that. For QD=1, I think the only sensible values for
> max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> 
> >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> >Runtime: 105.126 107.125 105.641
> >
> >So about the same as before. I'll try to debug this later today...
> 
> Thanks, I'm very interested in what you find!

OK, so the reason was relatively standard in the end. I was using ext3 (or
more exactly ext4 without delayed allocation) for the test. The throttling
of background writes gave more priority to writes from the journalling
thread, which happen with WRITE_SYNC and thus are not throttled. As a result,
the journalling thread ended up having to do more data writeback to be able to
commit a transaction (due to the requirements of data=ordered mode), and it is
less efficient at that than the normal flusher thread.
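
A simplified, standalone model of that behaviour (the enum and helper below
are made-up stand-ins, not the kernel's request flags or the actual wbt code):
only plain background writeback is held back by the wb_* limits, while reads
and sync writes such as the journalling thread's WRITE_SYNC go straight to the
device.

    #include <stdio.h>

    enum sketch_io_kind {
            SKETCH_READ,
            SKETCH_BACKGROUND_WRITE,   /* flusher-thread buffered writeback */
            SKETCH_SYNC_WRITE,         /* e.g. the journalling thread submitting with WRITE_SYNC */
    };

    /* Returns 1 if this class of IO is subject to the writeback throttle. */
    static int sketch_is_throttled(enum sketch_io_kind kind)
    {
            return kind == SKETCH_BACKGROUND_WRITE;
    }

    int main(void)
    {
            printf("background writeback throttled: %d\n",
                   sketch_is_throttled(SKETCH_BACKGROUND_WRITE));
            printf("journal sync write throttled:   %d\n",
                   sketch_is_throttled(SKETCH_SYNC_WRITE));
            printf("read throttled:                 %d\n",
                   sketch_is_throttled(SKETCH_READ));
            return 0;
    }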

So this is an example where throttling background writeback effectively
just pushes more work into another context which does it less efficiently,
and indirectly makes everyone wait for it. ext3 has always been sensitive to
issues like this. ext4 uses delayed allocation and thus only data writes into
holes end up being part of a transaction, so the simple dd test case doesn't
hit that path. And indeed, when I repeat the same test with ext4, the numbers
with and without your patch are exactly the same.

The question remains how common a pattern is where throttling of background
writeback also delays something else. I'll schedule a couple of benchmarks to
measure the impact of your patches for a wider range of workloads (but sadly
on a pretty limited set of hw). If ext3 is the only one seeing issues, I would
be willing to accept that ext3 takes the hit, since it is doing something
rather stupid (but inherent in its journal design) and we have a way to deal
with this either by enabling delayed allocation or by turning off the
writeback throttling...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-28 Thread Jens Axboe

On 04/28/2016 05:54 AM, Jan Kara wrote:

On Wed 27-04-16 14:59:15, Jens Axboe wrote:

On Wed, Apr 27 2016, Jens Axboe wrote:

On Wed, Apr 27 2016, Jens Axboe wrote:

On 04/27/2016 12:01 PM, Jan Kara wrote:

Hi,

On Tue 26-04-16 09:55:23, Jens Axboe wrote:

Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

I have posted plenty of results previously, I'll keep it shorter
this time. Here's a run on my laptop, using read-to-pipe-async for
reading a 5g file, and rewriting it. You can find this test program
in the fio git repo.


I have tested your patchset on my test system. Generally I have observed a
noticeable drop in average throughput for heavy background writes without
any other disk activity, and also somewhat increased variance in the
runtimes. It is most visible in these simple test cases:

dd if=/dev/zero of=/mnt/file bs=1M count=1

and

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync

The machine has 4GB of RAM, /mnt is an ext3 filesystem that is freshly
created before each dd run on a dedicated disk.

Without your patches I get pretty stable dd runtimes for both cases:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 87.9611 87.3279 87.2554

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 93.3502 93.2086 93.541

With your patches the numbers look like:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 108.183, 97.184, 99.9587

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 104.9, 102.775, 102.892

I have checked whether the variance is due to some interaction with CFQ,
which is used for the disk. When I switched the disk to deadline, I still
get some variance, although the throughput is still ~10% lower:

dd if=/dev/zero of=/mnt/file bs=1M count=1
Runtimes: 100.417 100.643 100.866

dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
Runtimes: 104.208 106.341 105.483

The disk is a rotational SATA drive with a writeback cache; the queue depth of
the disk reported in /sys/block/sdb/device/queue_depth is 1.

So I think we still need some tweaking on the low end of the storage
spectrum so that we don't lose 10% of throughput for simple cases like
this.


Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
you are seeing smaller requests, and that is why it both varies and
you get lower throughput? I'll try and set up a test here similar to
yours.


Jan, care to try the below patch? I can't fully reproduce your issue on
a SCSI disk limited to QD=1, but I have a feeling this might help. It's
a bit of a hack, but the general idea is to allow one more request to
build up for QD=1 devices. That eliminates wait time between one request
finishing, and the next being submitted.


That accidentally added a potential stall; this one is both cleaner
and should have that fixed.


..

-   rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
-   rwb->wb_normal = (rwb->wb_max + 1) / 2;
-   rwb->wb_background = (rwb->wb_max + 3) / 4;
+   if (rwb->queue_depth == 1) {
+   rwb->wb_max = rwb->wb_normal = 2;
+   rwb->wb_background = 1;


This breaks the detection of a too-big scale_step in scale_up(), where we key
off the wb_max == 1 value. However, even with that fixed, no luck :(:


Yeah, I need to look at that. For QD=1, I think the only sensible values
for max/normal/bg are 2/2/1, and 1/1/1 if we step down.
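
To make that concrete, here is a minimal standalone sketch of the idea (not
the patchset code; the struct and helper below are made up for
illustration):

#include <stdio.h>

struct wb_limits {
	unsigned int wb_max;
	unsigned int wb_normal;
	unsigned int wb_background;
};

/*
 * QD=1: allow one extra request to queue up (2/2/1); after a
 * scale-down step, fall back to the strictest setting (1/1/1).
 */
static void calc_qd1_limits(struct wb_limits *l, int scaled_down)
{
	if (scaled_down) {
		l->wb_max = l->wb_normal = l->wb_background = 1;
	} else {
		l->wb_max = l->wb_normal = 2;
		l->wb_background = 1;
	}
}

int main(void)
{
	struct wb_limits l;

	calc_qd1_limits(&l, 0);
	printf("normal:       max=%u normal=%u bg=%u\n",
	       l.wb_max, l.wb_normal, l.wb_background);
	calc_qd1_limits(&l, 1);
	printf("stepped down: max=%u normal=%u bg=%u\n",
	       l.wb_max, l.wb_normal, l.wb_background);
	return 0;
}

The point is just that a QD=1 device gets a fixed 2/2/1 ladder instead of
the shift-based scaling used for deeper queues.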



dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...


Thanks, I'm very interested in what you find!

--
Jens Axboe



Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-28 Thread Jens Axboe

On 04/27/2016 10:06 PM, xiakaixu wrote:

diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..322f5e04e994 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
else
limit = rwb->wb_normal;

Hi Jens,

This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
enough. It is not a big deal anyway :)


I'll clean that up, thanks for noticing. No functional difference.


Another question about this if branch:

if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
limit = 0;

I can't follow the logic of this if branch. Why is the limit set to 0
when the device supports write back caching and no tasks are currently
being throttled in balance_dirty_pages()? Could you please give more
info about this? Thanks!


Sure. So for write back caching, we have to try a bit harder to ensure 
that the device doesn't build up long internal queues with a lot of 
dirty data in the cache. So for the case where we have write back 
caching AND we don't have anyone waiting for the IO, allow the queue 
depth to drain to zero before building it back up again.
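
As a rough model of that decision, here is a self-contained userspace
sketch (assuming dirty_sleeping counts tasks currently throttled in
balance_dirty_pages(); these are not the actual kernel symbols):

#include <stdio.h>

static unsigned int pick_done_limit(int has_wb_cache, int dirty_sleeping,
				    unsigned int wb_normal)
{
	/*
	 * Write back caching and nobody waiting on the IO: let the
	 * device's queue drain completely before refilling it, so the
	 * cache doesn't pile up a lot of dirty data.
	 */
	if (has_wb_cache && !dirty_sleeping)
		return 0;

	/* Otherwise wake waiters once we drop below the normal limit. */
	return wb_normal;
}

int main(void)
{
	printf("wbc, nobody waiting: limit=%u\n", pick_done_limit(1, 0, 4));
	printf("wbc, tasks waiting:  limit=%u\n", pick_done_limit(1, 1, 4));
	printf("no wbc:              limit=%u\n", pick_done_limit(0, 0, 4));
	return 0;
}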


Does that make sense?



+   inflight = atomic_dec_return(&rwb->inflight);
+
/*
-* Don't wake anyone up if we are above the normal limit. If
-* throttling got disabled (limit == 0) with waiters, ensure
-* that we wake them up.
+* wbt got disabled with IO in flight. Wake up any potential
+* waiters, we don't have to do more than that.
 */
-   inflight = atomic_dec_return(&rwb->inflight);
-   if (limit && inflight >= limit) {
-   if (!rwb->wb_max)
-   wake_up_all(&rwb->wait);
+   if (!rwb_enabled(rwb)) {
+   wake_up_all(&rwb->wait);
return;
}


Maybe it would be better to execute this if branch earlier, so we can wake
up potential waiters in time when wbt gets disabled.


The !rwb_enabled() case will only happen if someone disabled wbt while 
we had tracked IO in flight. We have to do it below the
atomic_dec_return(), so we could reorder that to be at the front. 
Ideally we just want it out-of-line instead, as it's the unexpected 
slower path.
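
A rough sketch of that shape (illustrative userspace code, not the kernel
implementation, which would use unlikely() and a noinline helper):

#include <stdio.h>
#include <stdbool.h>
#include <stdatomic.h>

static atomic_int inflight = 3;
static atomic_bool wbt_enabled = true;

/* Slow path, kept out of line: wbt was disabled with IO in flight. */
static void __attribute__((noinline)) handle_disabled(void)
{
	printf("wbt disabled with IO in flight, waking all waiters\n");
}

static void wbt_done(void)
{
	int in = atomic_fetch_sub(&inflight, 1) - 1;

	/* Branch hint: the disabled case is the unexpected one. */
	if (__builtin_expect(!atomic_load(&wbt_enabled), 0)) {
		handle_disabled();
		return;
	}

	printf("fast path, inflight now %d\n", in);
}

int main(void)
{
	wbt_done();
	atomic_store(&wbt_enabled, false);
	wbt_done();
	return 0;
}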


--
Jens Axboe



Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-28 Thread Jan Kara
On Wed 27-04-16 14:59:15, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
> > On Wed, Apr 27 2016, Jens Axboe wrote:
> > > On 04/27/2016 12:01 PM, Jan Kara wrote:
> > > >Hi,
> > > >
> > > >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> > > >>Since the dawn of time, our background buffered writeback has sucked.
> > > >>When we do background buffered writeback, it should have little impact
> > > >>on foreground activity. That's the definition of background activity...
> > > >>But for as long as I can remember, heavy buffered writers have not
> > > >>behaved like that. For instance, if I do something like this:
> > > >>
> > > >>$ dd if=/dev/zero of=foo bs=1M count=10k
> > > >>
> > > >>on my laptop, and then try and start chrome, it basically won't start
> > > >>before the buffered writeback is done. Or, for server oriented
> > > >>workloads, where installation of a big RPM (or similar) adversely
> > > >>impacts database reads or sync writes. When that happens, I get people
> > > >>yelling at me.
> > > >>
> > > >>I have posted plenty of results previously, I'll keep it shorter
> > > >>this time. Here's a run on my laptop, using read-to-pipe-async for
> > > >>reading a 5g file, and rewriting it. You can find this test program
> > > >>in the fio git repo.
> > > >
> > > >I have tested your patchset on my test system. Generally I have observed
> > > >noticeable drop in average throughput for heavy background writes without
> > > >any other disk activity and also somewhat increased variance in the
> > > >runtimes. It is most visible on this simple testcases:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > > >
> > > >and
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > > >
> > > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> > > >created before each dd run on a dedicated disk.
> > > >
> > > >Without your patches I get pretty stable dd runtimes for both cases:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > > >Runtimes: 87.9611 87.3279 87.2554
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > > >Runtimes: 93.3502 93.2086 93.541
> > > >
> > > >With your patches the numbers look like:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > > >Runtimes: 108.183, 97.184, 99.9587
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > > >Runtimes: 104.9, 102.775, 102.892
> > > >
> > > >I have checked whether the variance is due to some interaction with CFQ
> > > >which is used for the disk. When I switched the disk to deadline, I still
> > > >get some variance although, the throughput is still ~10% lower:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > > >Runtimes: 100.417 100.643 100.866
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > > >Runtimes: 104.208 106.341 105.483
> > > >
> > > >The disk is rotational SATA drive with writeback cache, queue depth of 
> > > >the
> > > >disk reported in /sys/block/sdb/device/queue_depth is 1.
> > > >
> > > >So I think we still need some tweaking on the low end of the storage
> > > >spectrum so that we don't lose 10% of throughput for simple cases like
> > > >this.
> > > 
> > > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> > > you are seeing smaller requests, and that is why it both varies and
> > > you get lower throughput? I'll try and setup a test here similar to
> > > yours.
> > 
> > Jan, care to try the below patch? I can't fully reproduce your issue on
> > a SCSI disk limited to QD=1, but I have a feeling this might help. It's
> > a bit of a hack, but the general idea is to allow one more request to
> > build up for QD=1 devices. That eliminates wait time between one request
> > finishing, and the next being submitted.
> 
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> 
..
> - rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> - rwb->wb_normal = (rwb->wb_max + 1) / 2;
> - rwb->wb_background = (rwb->wb_max + 3) / 4;
> + if (rwb->queue_depth == 1) {
> + rwb->wb_max = rwb->wb_normal = 2;
> + rwb->wb_background = 1;

This breaks the detection of a too big scale_step in scale_up(), where we key
off the wb_max == 1 value. However, even with that fixed, no luck :(:

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-27 Thread xiakaixu
On 2016/4/28 4:59, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>> On 04/27/2016 12:01 PM, Jan Kara wrote:
 Hi,

 On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> Since the dawn of time, our background buffered writeback has sucked.
> When we do background buffered writeback, it should have little impact
> on foreground activity. That's the definition of background activity...
> But for as long as I can remember, heavy buffered writers have not
> behaved like that. For instance, if I do something like this:
>
> $ dd if=/dev/zero of=foo bs=1M count=10k
>
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts database reads or sync writes. When that happens, I get people
> yelling at me.
>
> I have posted plenty of results previously, I'll keep it shorter
> this time. Here's a run on my laptop, using read-to-pipe-async for
> reading a 5g file, and rewriting it. You can find this test program
> in the fio git repo.

 I have tested your patchset on my test system. Generally I have observed
 noticeable drop in average throughput for heavy background writes without
 any other disk activity and also somewhat increased variance in the
 runtimes. It is most visible on this simple testcases:

 dd if=/dev/zero of=/mnt/file bs=1M count=1

 and

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync

 The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
 created before each dd run on a dedicated disk.

 Without your patches I get pretty stable dd runtimes for both cases:

 dd if=/dev/zero of=/mnt/file bs=1M count=1
 Runtimes: 87.9611 87.3279 87.2554

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
 Runtimes: 93.3502 93.2086 93.541

 With your patches the numbers look like:

 dd if=/dev/zero of=/mnt/file bs=1M count=1
 Runtimes: 108.183, 97.184, 99.9587

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
 Runtimes: 104.9, 102.775, 102.892

 I have checked whether the variance is due to some interaction with CFQ
 which is used for the disk. When I switched the disk to deadline, I still
 get some variance although, the throughput is still ~10% lower:

 dd if=/dev/zero of=/mnt/file bs=1M count=1
 Runtimes: 100.417 100.643 100.866

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
 Runtimes: 104.208 106.341 105.483

 The disk is rotational SATA drive with writeback cache, queue depth of the
 disk reported in /sys/block/sdb/device/queue_depth is 1.

 So I think we still need some tweaking on the low end of the storage
 spectrum so that we don't lose 10% of throughput for simple cases like
 this.
>>>
>>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
>>> you are seeing smaller requests, and that is why it both varies and
>>> you get lower throughput? I'll try and setup a test here similar to
>>> yours.
>>
>> Jan, care to try the below patch? I can't fully reproduce your issue on
>> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
>> a bit of a hack, but the general idea is to allow one more request to
>> build up for QD=1 devices. That eliminates wait time between one request
>> finishing, and the next being submitted.
> 
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> 
> diff --git a/lib/wbt.c b/lib/wbt.c
> index 650da911f24f..322f5e04e994 100644
> --- a/lib/wbt.c
> +++ b/lib/wbt.c
> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>   else
>   limit = rwb->wb_normal;
Hi Jens,

This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
enough. It is not a big deal anyway :)


Another question about this if branch:

   if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
limit = 0;

I can't follow the logic of this if branch. Why is the limit set to 0
when the device supports write back caching and no tasks are currently
being throttled in balance_dirty_pages()? Could you please give more
info about this? Thanks!
>  
> + inflight = atomic_dec_return(&rwb->inflight);
> +
>   /*
> -  * Don't wake anyone up if we are above the normal limit. If
> -  * throttling got disabled (limit == 0) with waiters, ensure
> -  * that we wake them up.
> +  * wbt got disabled with IO in flight. Wake up any potential
> +  * waiters, we don't have to do more than that.
>*/
> - inflight = atomic_dec_return(&rwb->inflight);
> - if (limit && inflight >= limit) {
> - if (!rwb->wb_max)
> - wake_up_all(&rwb->wait);
> +

Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-27 Thread Jens Axboe
On Wed, Apr 27 2016, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
> > On 04/27/2016 12:01 PM, Jan Kara wrote:
> > >Hi,
> > >
> > >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> > >>Since the dawn of time, our background buffered writeback has sucked.
> > >>When we do background buffered writeback, it should have little impact
> > >>on foreground activity. That's the definition of background activity...
> > >>But for as long as I can remember, heavy buffered writers have not
> > >>behaved like that. For instance, if I do something like this:
> > >>
> > >>$ dd if=/dev/zero of=foo bs=1M count=10k
> > >>
> > >>on my laptop, and then try and start chrome, it basically won't start
> > >>before the buffered writeback is done. Or, for server oriented
> > >>workloads, where installation of a big RPM (or similar) adversely
> > >>impacts database reads or sync writes. When that happens, I get people
> > >>yelling at me.
> > >>
> > >>I have posted plenty of results previously, I'll keep it shorter
> > >>this time. Here's a run on my laptop, using read-to-pipe-async for
> > >>reading a 5g file, and rewriting it. You can find this test program
> > >>in the fio git repo.
> > >
> > >I have tested your patchset on my test system. Generally I have observed
> > >noticeable drop in average throughput for heavy background writes without
> > >any other disk activity and also somewhat increased variance in the
> > >runtimes. It is most visible on this simple testcases:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > >
> > >and
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > >
> > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> > >created before each dd run on a dedicated disk.
> > >
> > >Without your patches I get pretty stable dd runtimes for both cases:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > >Runtimes: 87.9611 87.3279 87.2554
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > >Runtimes: 93.3502 93.2086 93.541
> > >
> > >With your patches the numbers look like:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > >Runtimes: 108.183, 97.184, 99.9587
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > >Runtimes: 104.9, 102.775, 102.892
> > >
> > >I have checked whether the variance is due to some interaction with CFQ
> > >which is used for the disk. When I switched the disk to deadline, I still
> > >get some variance although, the throughput is still ~10% lower:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1
> > >Runtimes: 100.417 100.643 100.866
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> > >Runtimes: 104.208 106.341 105.483
> > >
> > >The disk is rotational SATA drive with writeback cache, queue depth of the
> > >disk reported in /sys/block/sdb/device/queue_depth is 1.
> > >
> > >So I think we still need some tweaking on the low end of the storage
> > >spectrum so that we don't lose 10% of throughput for simple cases like
> > >this.
> > 
> > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> > you are seeing smaller requests, and that is why it both varies and
> > you get lower throughput? I'll try and setup a test here similar to
> > yours.
> 
> Jan, care to try the below patch? I can't fully reproduce your issue on
> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
> a bit of a hack, but the general idea is to allow one more request to
> build up for QD=1 devices. That eliminates wait time between one request
> finishing, and the next being submitted.

That accidentally added a potential stall; this one is both cleaner
and should have that fixed.

diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..322f5e04e994 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
else
limit = rwb->wb_normal;
 
+   inflight = atomic_dec_return(&rwb->inflight);
+
/*
-* Don't wake anyone up if we are above the normal limit. If
-* throttling got disabled (limit == 0) with waiters, ensure
-* that we wake them up.
+* wbt got disabled with IO in flight. Wake up any potential
+* waiters, we don't have to do more than that.
 */
-   inflight = atomic_dec_return(&rwb->inflight);
-   if (limit && inflight >= limit) {
-   if (!rwb->wb_max)
-   wake_up_all(&rwb->wait);
+   if (!rwb_enabled(rwb)) {
+   wake_up_all(&rwb->wait);
return;
}
 
+   /*
+* Don't wake anyone up if we are above the normal limit.
+*/
+   if (inflight && inflight >= limit)
+   return;
+
if (waitqueue_active(&rwb->wait)) {
int diff = limit - inflight;
 
@@ -150,14 +155,26 @@ static void calc_wb_limits(struct rq_wb *rwb)
return;
}
 
-   depth = min_t(unsigned int, RWB_MAX_DEPTH, 

Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-27 Thread Jens Axboe
On Wed, Apr 27 2016, Jens Axboe wrote:
> On 04/27/2016 12:01 PM, Jan Kara wrote:
> >Hi,
> >
> >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> >>Since the dawn of time, our background buffered writeback has sucked.
> >>When we do background buffered writeback, it should have little impact
> >>on foreground activity. That's the definition of background activity...
> >>But for as long as I can remember, heavy buffered writers have not
> >>behaved like that. For instance, if I do something like this:
> >>
> >>$ dd if=/dev/zero of=foo bs=1M count=10k
> >>
> >>on my laptop, and then try and start chrome, it basically won't start
> >>before the buffered writeback is done. Or, for server oriented
> >>workloads, where installation of a big RPM (or similar) adversely
> >>impacts database reads or sync writes. When that happens, I get people
> >>yelling at me.
> >>
> >>I have posted plenty of results previously, I'll keep it shorter
> >>this time. Here's a run on my laptop, using read-to-pipe-async for
> >>reading a 5g file, and rewriting it. You can find this test program
> >>in the fio git repo.
> >
> >I have tested your patchset on my test system. Generally I have observed
> >noticeable drop in average throughput for heavy background writes without
> >any other disk activity and also somewhat increased variance in the
> >runtimes. It is most visible on this simple testcases:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1
> >
> >and
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> >
> >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> >created before each dd run on a dedicated disk.
> >
> >Without your patches I get pretty stable dd runtimes for both cases:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1
> >Runtimes: 87.9611 87.3279 87.2554
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> >Runtimes: 93.3502 93.2086 93.541
> >
> >With your patches the numbers look like:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1
> >Runtimes: 108.183, 97.184, 99.9587
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> >Runtimes: 104.9, 102.775, 102.892
> >
> >I have checked whether the variance is due to some interaction with CFQ
> >which is used for the disk. When I switched the disk to deadline, I still
> >get some variance although, the throughput is still ~10% lower:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1
> >Runtimes: 100.417 100.643 100.866
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
> >Runtimes: 104.208 106.341 105.483
> >
> >The disk is rotational SATA drive with writeback cache, queue depth of the
> >disk reported in /sys/block/sdb/device/queue_depth is 1.
> >
> >So I think we still need some tweaking on the low end of the storage
> >spectrum so that we don't lose 10% of throughput for simple cases like
> >this.
> 
> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> you are seeing smaller requests, and that is why it both varies and
> you get lower throughput? I'll try and setup a test here similar to
> yours.

Jan, care to try the below patch? I can't fully reproduce your issue on
a SCSI disk limited to QD=1, but I have a feeling this might help. It's
a bit of a hack, but the general idea is to allow one more request to
build up for QD=1 devices. That eliminates wait time between one request
finishing, and the next being submitted.


diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..6b24c8525ace 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -93,23 +93,30 @@ void __wbt_done(struct rq_wb *rwb)
 * If the device does write back caching, drop further down
 * before we wake people up.
 */
-   if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
+   if (rwb->queue_depth == 1)
+   limit = 2;
+   else if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
limit = 0;
else
limit = rwb->wb_normal;
 
+   inflight = atomic_dec_return(&rwb->inflight);
+
/*
-* Don't wake anyone up if we are above the normal limit. If
-* throttling got disabled (limit == 0) with waiters, ensure
-* that we wake them up.
+* wbt got disabled with IO in flight. Wake up any potential
+* waiters, we don't have to do more than that.
 */
-   inflight = atomic_dec_return(&rwb->inflight);
-   if (limit && inflight >= limit) {
-   if (!rwb->wb_max)
-   wake_up_all(&rwb->wait);
+   if (!rwb_enabled(rwb)) {
+   wake_up_all(&rwb->wait);
return;
}
 
+   /*
+* Don't wake anyone up if we are above the normal limit.
+*/
+   if (inflight >= limit)
+   return;
+
if (waitqueue_active(&rwb->wait)) {
int diff = limit - inflight;
 
@@ -366,6 +373,9 @@ static inline unsigned int get_limit(struct rq_wb *rwb, 
unsigned long rw)
} else
limit = 

Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-27 Thread Jens Axboe

On 04/27/2016 12:01 PM, Jan Kara wrote:

Hi,

On Tue 26-04-16 09:55:23, Jens Axboe wrote:

Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

I have posted plenty of results previously, I'll keep it shorter
this time. Here's a run on my laptop, using read-to-pipe-async for
reading a 5g file, and rewriting it. You can find this test program
in the fio git repo.


I have tested your patchset on my test system. Generally I have observed a
noticeable drop in average throughput for heavy background writes without
any other disk activity, and also somewhat increased variance in the
runtimes. It is most visible with these simple test cases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000

and

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync

The machine has 4GB of RAM, /mnt is an ext3 filesystem that is freshly
created before each dd run on a dedicated disk.

Without your patches I get pretty stable dd runtimes for both cases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 87.9611 87.3279 87.2554

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 93.3502 93.2086 93.541

With your patches the numbers look like:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 108.183, 97.184, 99.9587

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.9, 102.775, 102.892

I have checked whether the variance is due to some interaction with CFQ,
which is used for the disk. When I switched the disk to deadline, I still
get some variance, although the throughput is still ~10% lower:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 100.417 100.643 100.866

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.208 106.341 105.483

The disk is a rotational SATA drive with a writeback cache; the queue depth
of the disk reported in /sys/block/sdb/device/queue_depth is 1.

So I think we still need some tweaking on the low end of the storage
spectrum so that we don't lose 10% of throughput for simple cases like
this.


Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if you 
are seeing smaller requests, and that is why it both varies and you get 
lower throughput? I'll try and setup a test here similar to yours.


--
Jens Axboe



Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-27 Thread Jan Kara
Hi,

On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> Since the dawn of time, our background buffered writeback has sucked.
> When we do background buffered writeback, it should have little impact
> on foreground activity. That's the definition of background activity...
> But for as long as I can remember, heavy buffered writers have not
> behaved like that. For instance, if I do something like this:
> 
> $ dd if=/dev/zero of=foo bs=1M count=10k
> 
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts database reads or sync writes. When that happens, I get people
> yelling at me.
> 
> I have posted plenty of results previously, I'll keep it shorter
> this time. Here's a run on my laptop, using read-to-pipe-async for
> reading a 5g file, and rewriting it. You can find this test program
> in the fio git repo.

I have tested your patchset on my test system. Generally I have observed a
noticeable drop in average throughput for heavy background writes without
any other disk activity, and also somewhat increased variance in the
runtimes. It is most visible with these simple test cases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000

and

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync

The machine has 4GB of RAM, /mnt is an ext3 filesystem that is freshly
created before each dd run on a dedicated disk.

Without your patches I get pretty stable dd runtimes for both cases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 87.9611 87.3279 87.2554

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 93.3502 93.2086 93.541

With your patches the numbers look like:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 108.183, 97.184, 99.9587

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.9, 102.775, 102.892

I have checked whether the variance is due to some interaction with CFQ,
which is used for the disk. When I switched the disk to deadline, I still
get some variance, although the throughput is still ~10% lower:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 100.417 100.643 100.866

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.208 106.341 105.483

The disk is a rotational SATA drive with a writeback cache; the queue depth
of the disk reported in /sys/block/sdb/device/queue_depth is 1.

So I think we still need some tweaking on the low end of the storage
spectrum so that we don't lose 10% of throughput for simple cases like
this.

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCHSET v5] Make background writeback great again for the first time

2016-04-26 Thread Jens Axboe
Hi,

Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

I have posted plenty of results previously, I'll keep it shorter
this time. Here's a run on my laptop, using read-to-pipe-async for
reading a 5g file, and rewriting it. You can find this test program
in the fio git repo.
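
To reproduce, grabbing and building the helper is roughly (assuming the t/
test programs are built by the default make; if not, build that target
explicitly):

$ git clone git://git.kernel.dk/fio.git
$ cd fio && ./configure && make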

4.6-rc3:

$ t/read-to-pipe-async -f ~/5g > 5g-new

Latency percentiles (usec) (READERS)
50.0000th: 2
75.0000th: 3
90.0000th: 5
95.0000th: 7
99.0000th: 43
99.5000th: 77
99.9000th: 9008
99.9900th: 91008
99.9990th: 286208
99.9999th: 347648
Over=1251, min=0, max=358081
Latency percentiles (usec) (WRITERS)
50.0000th: 4
75.0000th: 8
90.0000th: 13
95.0000th: 15
99.0000th: 32
99.5000th: 43
99.9000th: 81
99.9900th: 2372
99.9990th: 104320
99.9999th: 349696
Over=63, min=1, max=358321
Read rate (KB/sec) : 91859
Write rate (KB/sec): 91859

4.6-rc3 + wb-buf-throttle

Latency percentiles (usec) (READERS)
50.0000th: 2
75.0000th: 3
90.0000th: 5
95.0000th: 8
99.0000th: 48
99.5000th: 79
99.9000th: 5304
99.9900th: 22496
99.9990th: 29408
99.9999th: 33728
Over=860, min=0, max=37599
Latency percentiles (usec) (WRITERS)
50.0000th: 4
75.0000th: 9
90.0000th: 14
95.0000th: 16
99.0000th: 34
99.5000th: 45
99.9000th: 87
99.9900th: 1342
99.9990th: 13648
99.9999th: 21280
Over=29, min=1, max=30457
Read rate (KB/sec) : 95832
Write rate (KB/sec): 95832

Better throughput and tighter latencies, for both reads and writes.
That's hard not to like.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. It's all about managing
the queues on the hardware side. The big change in this version is that
it should be pretty much auto-tuning - you no longer have to set a
given percentage of writeback bandwidth. I've implemented something
similar to CoDel to manage the writeback queue. See the last patch
for a full description, but the tldr is that we monitor min latencies
over a window of time, and scale the queue up/down based on that. This
needs a minimum of tunables, and it stays out of the way if your device
is fast enough. There's a single tunable now, wb_lat_usec, which simply
sets this latency target. Most people won't have to touch this; it'll
work pretty well just being in the ballpark.
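
Adjusting the target should be the usual sysfs exercise, something like the
below (the exact file location is an assumption based on the tunable name;
see the documentation patch for the authoritative path and units):

# set a ~20ms latency target on sdb, value purely illustrative
echo 20000 > /sys/block/sdb/queue/wb_lat_usec
cat /sys/block/sdb/queue/wb_lat_usec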

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server. The
patchset is fully stable; I have not observed problems. It passes full
xfstest runs and a variety of benchmarks as well. It works equally well
on blk-mq/scsi-mq and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. The
wb-buf-throttle-v5 branch will remain the same as this version. I've folded
the device write cache changes into my 4.7 branches, so they are not
a part of this posting. Get the full wb-buf-throttle branch, or apply
the patches here on top of my for-next. A full patch against Linus'
current tree can also be downloaded here:

http://brick.kernel.dk/snaps/wb-buf-throttle-v5.patch
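
Either route should get you a testable tree, roughly (a sketch - adjust
remote names and paths to taste):

# pull the branch into an existing kernel tree
git remote add linux-block git://git.kernel.dk/linux-block.git
git fetch linux-block
git checkout -b wb-buf-throttle linux-block/wb-buf-throttle

# or apply the snapshot patch on top of Linus' current tree
wget http://brick.kernel.dk/snaps/wb-buf-throttle-v5.patch
git apply wb-buf-throttle-v5.patch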

Changes since v4

- Add some documentation for the two queue sysfs files
- Kill off wb_stats sysfs file. Use the trace points to get this info
  now.
- Various work around making this block-layer agnostic. The main code
  now resides in lib/wbt.c and can be plugged into NFS as well, for
  instance.
- Fix an issue with double completions on the block layer side.
- Fix an issue where a long sync issue was disregarded if the stat
  sample wasn't valid.
- Speed up the division in rwb_arm_timer().
- Add logic to scale back up for 'unknown' latency events.
- Don't track sync 
