Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-04-01 Thread Holger Hoffstätte
On 04/01/16 03:01, Dave Chinner wrote:
> Can you go back to your original kernel, and lower nr_requests to 8?

Sure, did that, and as expected it didn't help much. Under prolonged stress
it was actually even a bit worse than with writeback throttling. IMHO that's not
really surprising either, since small queues now punish everyone, and in
interactive use I really do want to e.g. load hundreds of small thumbnails
at once, or du a directory.
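
For reference, lowering the queue depth is just a sysfs write; a minimal
sketch, assuming the disk under test shows up as /dev/sda:

$ cat /sys/block/sda/queue/nr_requests     # typically 128 by default
$ echo 8 | sudo tee /sys/block/sda/queue/nr_requests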

Instead of randomized (i.e. manual/interactive) testing I created a simple
stress tester:

#!/bin/sh
while true
do
    cp bigfile bigfile.out
done

and running that in the background turns the system into a tar pit,
which is laughable when you consider that I have 24G and 8 cores.

With the writeback patchset and wb_percent=1 (yes, really!) it is almost
unnoticeable, yet according to nmon it still writes ~250-280 MB/s.
This is with deadline on ext4 on an older SATA-3 SSD that can still
do a peak of ~465 MB/s (with dd).
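
(In case anyone wants to reproduce this: wb_percent is the tunable added by
the patchset; assuming it is exposed next to the other queue attributes, as
in the posted series, setting it looks like

$ echo 1 | sudo tee /sys/block/sda/queue/wb_percent

with sda being whatever device backs the filesystem under test.)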

cheers,
Holger



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-04-01 Thread Jens Axboe

On 04/01/2016 12:27 AM, Dave Chinner wrote:

On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:

On 03/31/2016 06:46 PM, Dave Chinner wrote:

virtio in guest, XFS direct IO -> no-op -> scsi in host.


That has write back caching enabled on the guest, correct?


No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
Sorry for not being clear about that.


That's fine, it's one less worry if that's not the case. So if you
cat the 'write_cache' file in the virtioblk sysfs block queue/
directory, it says 'write through'? Just want to confirm that we got
that propagated correctly.


No such file. But I did find:

$ cat /sys/block/vdc/cache_type
write back

Which is what I'd expect it to say, given the man page description
of cache=none:

Note that this is considered a writeback mode and the guest
OS must handle the disk write cache correctly in order to
avoid data corruption on host crashes.

To make it say "write through" I need to use cache=directsync, but
I have no need for such integrity guarantees on a volatile test
device...
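
(For reference, the two cache modes being compared map to qemu -drive
options roughly as follows; the image path and the other options are
placeholders:

# cache=none: O_DIRECT on the host, but the guest still sees a volatile write cache
$ qemu-system-x86_64 ... -drive file=/img/test.img,if=virtio,cache=none

# cache=directsync: O_DIRECT on the host and no volatile write cache (writethrough)
$ qemu-system-x86_64 ... -drive file=/img/test.img,if=virtio,cache=directsync
)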


I wasn't as concerned about the integrity side, more about whether, if it's
flagged as write back, we induce further throttling. But I'll see if I can get
your test case reproduced, then I don't see why it can't get fixed. I'm
off all of next week though, so it probably won't be until the week after...


--
Jens Axboe



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-04-01 Thread Jens Axboe

On 04/01/2016 12:16 AM, Dave Chinner wrote:

On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:

On 03/31/2016 09:29 PM, Jens Axboe wrote:

I can't seem to reproduce this at all. On an nvme device, I get a
fairly steady 60K/sec file creation rate, and we're nowhere near
being IO bound. So the throttling has no effect at all.


That's too slow to show the stalls - you're likely concurrency bound
in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
agcount=32 so that every thread works in its own AG.


That's the key, with that I get 300-400K ops/sec instead. I'll run some
testing with this tomorrow and see what I can find, it did one full run
now and I didn't see any issues, but I need to run it at various
settings and see if I can find the issue.


No stalls seen, I get the same performance with it disabled and with
it enabled, at both default settings, and lower ones
(wb_percent=20). Looking at iostat, we don't drive a lot of depth,
so it makes sense, even with the throttling we're doing essentially
the same amount of IO.


Try appending numa=fake=4 to your guest's kernel command line.

(that's what I'm using)


Sure, I can give that a go.


What does 'nr_requests' say for your virtio_blk device? Looks like
virtio_blk has a queue_depth setting, but it's not set by default,
and then it uses the free entries in the ring. But I don't know what
that is...


$ cat /sys/block/vdc/queue/nr_requests
128


OK, so that would put you in the 16/32/64 category for idle/normal/high 
priority writeback. Which fits with the iostat below, which is in the 
~16 range.
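
(To spell the arithmetic out: with nr_requests=128 those limits are just
fixed fractions of the queue depth - not the patch's actual code, merely
the ratios implied by the numbers above:

$ NR=128; echo "idle=$((NR / 8)) normal=$((NR / 4)) high=$((NR / 2))"
idle=16 normal=32 high=64
)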


So the META thing should help, it'll bump it up a bit. But we're also
seeing smaller requests, and I think that could be because after we
throttle, we could potentially have a merge candidate. The code doesn't
check for merges after sleeping; it does allow any merges before that,
though. That part is a little harder to read from the iostat numbers,
but there does seem to be a correlation between your higher queue depths
and bigger request sizes.



I'll try the "don't throttle REQ_META" patch, but this seems like a
fragile way to solve this problem - it shuts up the messenger, but
doesn't solve the problem for any other subsystem that might have a
similar issue. E.g. next we're going to have to make sure direct IO
(which is also REQ_WRITE dispatch) does not get throttled, and so
on


I don't think there's anything wrong with the REQ_META patch. Sure, we
could have better classifications (as discussed below), but that's
mainly tweaking. As long as we get the same answers, it's fine. There's
no throttling of O_DIRECT writes in the current code; it specifically
excludes those. It's only for unbounded writes, which is what
writeback tends to be.



It seems to me that the right thing to do here is add a separate
classification flag for IO that can be throttled. e.g. as
REQ_WRITEBACK and only background writeback work sets this flag.
That would ensure that when the IO is being dispatched from other
sources (e.g. fsync, sync_file_range(), direct IO, filesystem
metadata, etc) it is clear that it is not a target for throttling.
This would also allow us to easily switch off throttling if
writeback is occurring for memory reclaim reasons, and so on.
Throttling policy decisions belong above the block layer, even
though the throttle mechanism itself is in the block layer.


We're already doing all of that, it just doesn't include a specific
REQ_WRITEBACK flag. And yeah, that would clean up the checking for
request type, but functionally it should be the same as it is now. It'd
be a bit more robust and easier to read if we just had a REQ_WRITEBACK;
right now it's WRITE_SYNC vs WRITE for important vs not-important, with
a check for write vs O_DIRECT write as well.



--
Jens Axboe



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-04-01 Thread Dave Chinner
On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:46 PM, Dave Chinner wrote:
> >>>virtio in guest, XFS direct IO -> no-op -> scsi in host.
> >>
> >>That has write back caching enabled on the guest, correct?
> >
> >No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
> >Sorry for not being clear about that.
> 
> That's fine, it's one less worry if that's not the case. So if you
> cat the 'write_cache' file in the virtioblk sysfs block queue/
> directory, it says 'write through'? Just want to confirm that we got
> that propagated correctly.

No such file. But I did find:

$ cat /sys/block/vdc/cache_type 
write back

Which is what I'd expect it to say, given the man page description
of cache=none:

Note that this is considered a writeback mode and the guest
OS must handle the disk write cache correctly in order to
avoid data corruption on host crashes.

To make it say "write through" I need to use cache=directsync, but
I have no need for such integrity guarantees on a volatile test
device...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-04-01 Thread Dave Chinner
On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
> On 03/31/2016 09:29 PM, Jens Axboe wrote:
> >>>I can't seem to reproduce this at all. On an nvme device, I get a
> >>>fairly steady 60K/sec file creation rate, and we're nowhere near
> >>>being IO bound. So the throttling has no effect at all.
> >>
> >>That's too slow to show the stalls - you're likely concurrency bound
> >>in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
> >>agcount=32 so that every thread works in its own AG.
> >
> >That's the key, with that I get 300-400K ops/sec instead. I'll run some
> >testing with this tomorrow and see what I can find, it did one full run
> >now and I didn't see any issues, but I need to run it at various
> >settings and see if I can find the issue.
> 
> No stalls seen, I get the same performance with it disabled and with
> it enabled, at both default settings, and lower ones
> (wb_percent=20). Looking at iostat, we don't drive a lot of depth,
> so it makes sense, even with the throttling we're doing essentially
> the same amount of IO.

Try appending numa=fake=4 to your guest's kernel command line.

(that's what I'm using)
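
If it saves anyone a lookup: that's a guest kernel command line addition,
e.g. via grub (or -append when booting the guest with qemu -kernel). A
sketch, assuming a stock grub setup:

$ sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&numa=fake=4 /' /etc/default/grub
$ sudo update-grub && sudo reboot
$ numactl --hardware     # afterwards, should report 4 (emulated) nodes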

> 
> What does 'nr_requests' say for your virtio_blk device? Looks like
> virtio_blk has a queue_depth setting, but it's not set by default,
> and then it uses the free entries in the ring. But I don't know what
> that is...

$ cat /sys/block/vdc/queue/nr_requests 
128
$

Without the block throttling, guest IO (measured within the guest)
looks like this over a fair proportion of the test (5s sample time)

# iostat -d -x -m 5 /dev/vdc

Device:         rrqm/s   wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00 20443.00   6.20  436.60    0.05  269.89  1248.48    73.83  146.11  486.58  141.27   1.64  72.40
vdc               0.00 11567.60  19.20  161.40    0.05  146.08  1657.12   119.17  704.57  707.25  704.25   5.34  96.48
vdc               0.00 12723.20   3.20  437.40    0.05  193.65   900.38    29.46   57.12    1.75   57.52   0.78  34.56
vdc               0.00  1739.80  22.40  426.80    0.05  123.62   563.86    23.44   62.51   79.89   61.59   1.01  45.28
vdc               0.00 12553.80   0.00  521.20    0.00  210.86   828.54    34.38   65.96    0.00   65.96   0.97  50.80
vdc               0.00 12523.60  25.60  529.60    0.10  201.94   745.29    52.24   77.73    0.41   81.47   1.14  63.20
vdc               0.00  5419.80  22.40  502.60    0.05  158.34   617.90    24.42   63.81   30.96   65.27   1.31  68.80
vdc               0.00 12059.00   0.00  439.60    0.00  174.85   814.59    30.91   70.27    0.00   70.27   0.72  31.76
vdc               0.00  7578.00  25.60  397.00    0.10  139.18   675.00    15.72   37.26   61.19   35.72   0.73  30.72
vdc               0.00  9156.00   0.00  537.40    0.00  173.57   661.45    17.08   29.62    0.00   29.62   0.53  28.72
vdc               0.00  5274.80  22.40  377.60    0.05  136.42   698.77    26.17   68.33  186.96   61.30   1.53  61.36
vdc               0.00  9407.00   3.20  541.00    0.05  174.28   656.05    36.10   66.33    3.00   66.71   0.87  47.60
vdc               0.00  8687.20  22.40  410.40    0.05  150.98   714.70    39.91   92.21   93.82   92.12   1.39  60.32
vdc               0.00  8872.80   0.00  422.60    0.00  139.28   674.96    25.01   33.03    0.00   33.03   0.91  38.40
vdc               0.00  1081.60  22.40  241.00    0.05   68.88   535.97    10.78   82.89  137.86   77.79   2.25  59.20
vdc               0.00  9826.80   0.00  445.00    0.00  167.42   770.49    45.16  101.49    0.00  101.49   1.80  79.92
vdc               0.00  7394.00  22.40  447.60    0.05  157.34   685.83    18.06   38.42   77.64   36.46   1.46  68.48
vdc               0.00  9984.80   3.20  252.00    0.05  108.46   870.82    85.68  293.73   16.75  297.24   3.00  76.64
vdc               0.00     0.00  22.40  454.20    0.05  117.67   505.86     8.11   39.51   35.71   39.70   1.17  55.76
vdc               0.00 10273.20   0.00  418.80    0.00  156.76   766.57    90.52  179.40    0.00  179.40   1.85  77.52
vdc               0.00  5650.00  22.40  185.00    0.05   84.12   831.20   103.90  575.15   60.82  637.42   4.21  87.36
vdc               0.00  7193.00   0.00  308.80    0.00  120.71   800.56    63.77  194.35    0.00  194.35   2.24  69.12
vdc               0.00  4460.80   9.80  211.00    0.03   69.52   645.07    72.35  154.81  269.39  149.49   4.42  97.60
vdc               0.00   683.00  14.00  374.60    0.05   99.13   522.69    25.38  167.61  603.14  151.33   1.45  56.24
vdc               0.00  7140.20   1.80  275.20    0.03  104.53   773.06    85.25  202.67   32.44  203.79   2.80  77.68
vdc               0.00  6916.00   0.00  164.00    0.00   82.59  1031.33   126.20  813.60    0.00  813.60   6.10 100.00
vdc               0.00  2255.60  22.40  359.00    0.05  107.41   577.06

Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 09:29:30PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:56 PM, Dave Chinner wrote:
> >I'm not changing the host kernels - it's a production machine and so
> >it runs long uptime testing of stable kernels.  (e.g. catch slow
> >memory leaks, etc). So if you've disabled throttling in the guest, I
> >can't test the throttling changes.
> 
> Right, that'd definitely hide the problem for you. I'll see if I can
> get it in a reproducible state and take it from there.
> 
> On your host, you said it's SCSI backed, but what does the device look like?

HW RAID 0 w/ 1GB FBWC (dell h710, IIRC) of 2x200GB SATA SSDs
(actually 256GB, but 25% of each is left as spare, unused space).
Sustains about 35,000 random 4k write IOPS, up to 70k read IOPS.
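
(For anyone wanting to sanity-check numbers like that, a raw fio job along
these lines is the usual way to measure it - the device name is a
placeholder and the run is destructive:

$ fio --name=randwrite --filename=/dev/sdX --direct=1 --ioengine=libaio \
  --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
  --time_based --group_reporting
)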

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Jens Axboe

On 03/31/2016 09:29 PM, Jens Axboe wrote:

I can't seem to reproduce this at all. On an nvme device, I get a
fairly steady 60K/sec file creation rate, and we're nowhere near
being IO bound. So the throttling has no effect at all.


That's too slow to show the stalls - you're likely concurrency bound
in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
agcount=32 so that every thread works in its own AG.


That's the key, with that I get 300-400K ops/sec instead. I'll run some
testing with this tomorrow and see what I can find, it did one full run
now and I didn't see any issues, but I need to run it at various
settings and see if I can find the issue.


No stalls seen, I get the same performance with it disabled and with it 
enabled, at both default settings, and lower ones (wb_percent=20). 
Looking at iostat, we don't drive a lot of depth, so it makes sense, 
even with the throttling we're doing essentially the same amount of IO.


What does 'nr_requests' say for your virtio_blk device? Looks like 
virtio_blk has a queue_depth setting, but it's not set by default, and 
then it uses the free entries in the ring. But I don't know what that is...


--
Jens Axboe



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Jens Axboe

On 03/31/2016 09:29 PM, Jens Axboe wrote:

I'm not changing the host kernels - it's a production machine and so
it runs long uptime testing of stable kernels.  (e.g. catch slow
memory leaks, etc). So if you've disabled throttling in the guest, I
can't test the throttling changes.


Right, that'd definitely hide the problem for you. I'll see if I can get
it in a reproducible state and take it from there.


Though on the guest, if you could try with just this one applied:

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=f21fb0e42c7347bd639a17341dcd3f72c1a30d29

I'd appreciate it. It won't disable the throttling in the guest, just 
treat META and PRIO a bit differently.
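
(Pulling just that commit onto a local tree should be something like the
following - assuming the usual clone URL for that cgit instance:

$ git remote add linux-block git://git.kernel.dk/linux-block
$ git fetch linux-block wb-buf-throttle
$ git cherry-pick f21fb0e42c7347bd639a17341dcd3f72c1a30d29
)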


--
Jens Axboe



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Jens Axboe

On 03/31/2016 06:56 PM, Dave Chinner wrote:

On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:

On 03/31/2016 08:29 AM, Jens Axboe wrote:

What I see in these performance dips is the XFS transaction
subsystem stalling *completely* - instead of running at a steady
state of around 350,000 transactions/s, there are *zero*
transactions running for periods of up to ten seconds.  This
coincides with the CPU usage falling to almost zero as well.
AFAICT, the only thing that is running when the filesystem stalls
like this is memory reclaim.


I'll take a look at this, stalls should definitely not be occurring. How
much memory does the box have?


I can't seem to reproduce this at all. On an nvme device, I get a
fairly steady 60K/sec file creation rate, and we're nowhere near
being IO bound. So the throttling has no effect at all.


That's too slow to show the stalls - you're likely concurrency bound
in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
agcount=32 so that every thread works in its own AG.


That's the key, with that I get 300-400K ops/sec instead. I'll run some 
testing with this tomorrow and see what I can find, it did one full run 
now and I didn't see any issues, but I need to run it at various 
settings and see if I can find the issue.



On a raid0 on 4 flash devices, I get something that looks more IO
bound, for some reason. Still no impact of the throttling, however.
But given that your setup is this:

virtio in guest, XFS direct IO -> no-op -> scsi in host.

we do potentially have two throttling points, which we don't want.
Is both the guest and the host running the new code, or just the
guest?


Just the guest. Host is running a 4.2.x kernel, IIRC.


OK


In any case, can I talk you into trying with two patches on top of
the current code? It's the two newest patches here:

http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

The first treats REQ_META|REQ_PRIO like they should be treated, like
high priority IO. The second disables throttling for virtual
devices, so we only throttle on the backend. The latter should
probably be the other way around, but we need some way of conveying
that information to the backend.


I'm not changing the host kernels - it's a production machine and so
it runs long uptime testing of stable kernels.  (e.g. catch slow
memory leaks, etc). So if you've disabled throttling in the guest, I
can't test the throttling changes.


Right, that'd definitely hide the problem for you. I'll see if I can get 
it in a reproducible state and take it from there.


On your host, you said it's SCSI backed, but what does the device look like?

--
Jens Axboe



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Jens Axboe

On 03/31/2016 06:46 PM, Dave Chinner wrote:

On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:

On 03/31/2016 02:24 AM, Dave Chinner wrote:

On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:

Hi,

This patchset isn't as much a final solution as it is a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts data base reads or sync writes. When that happens, I get people
yelling at me.

Last time I posted this, I used flash storage as the example. But
this works equally well on rotating storage. Let's run a test case
that writes a lot. This test writes 50 files, each 100M, on XFS on
a regular hard drive. While this happens, we attempt to read
another file with fio.

Writers:

$ time (./write-files ; sync)
real1m6.304s
user0m0.020s
sys 0m12.210s
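
(The fio reader mentioned above is nothing fancy; a buffered sequential
read of a separate file along these lines shows the latency hit while the
writers run - file name and size are placeholders:

$ fio --name=reader --filename=readfile --size=1G --rw=read --bs=4k \
  --ioengine=psync --runtime=60 --time_based
)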


Great. So a basic IO test looks good - let's throw something more
complex at it. Say, a benchmark I've been using for years to stress
the IO subsystem, the filesystem and memory reclaim all at the same
time: a concurrent fsmark inode creation test.
(first google hit https://lkml.org/lkml/2013/9/10/46)


Is that how you are invoking it as well, with the same arguments?


Yes. And the VM is exactly the same, too - 16p/16GB RAM. Cut down
version of the script I use:

#!/bin/bash

QUOTA=
MKFSOPTS=
NFILES=10
DEV=/dev/vdc
LOGBSIZE=256k
FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
MNT=/mnt/scratch

while [ $# -gt 0 ]; do
 case "$1" in
 -q) QUOTA="uquota,gquota,pquota" ;;
 -N) NFILES=$2 ; shift ;;
 -d) DEV=$2 ; shift ;;
 -l) LOGBSIZE=$2; shift ;;
 --) shift ; break ;;
 esac
 shift
done
MKFSOPTS="$MKFSOPTS $*"

echo QUOTA=$QUOTA
echo MKFSOPTS=$MKFSOPTS
echo DEV=$DEV

sudo umount $MNT > /dev/null 2>&1
sudo mkfs.xfs -f $MKFSOPTS $DEV
sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
sudo chmod 777 $MNT
sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
time $FSMARK  -D  1  -S0  -n  $NFILES  -s  0  -L  32 \
 -d  $MNT/0  -d  $MNT/1 \
 -d  $MNT/2  -d  $MNT/3 \
 -d  $MNT/4  -d  $MNT/5 \
 -d  $MNT/6  -d  $MNT/7 \
 -d  $MNT/8  -d  $MNT/9 \
 -d  $MNT/10  -d  $MNT/11 \
 -d  $MNT/12  -d  $MNT/13 \
 -d  $MNT/14  -d  $MNT/15 \
 | tee >(stats --trim-outliers | tail -1 1>&2)
sync
sudo umount /mnt/scratch


Perfect, thanks!


The above was run without scsi-mq, and with the deadline scheduler;
results with CFQ are similarly depressing for this test. So IO scheduling
is in place for this test, it's not pure blk-mq without scheduling.


virtio in guest, XFS direct IO -> no-op -> scsi in host.


That has write back caching enabled on the guest, correct?


No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
Sorry for not being clear about that.


That's fine, it's one less worry if that's not the case. So if you cat 
the 'write_cache' file in the virtioblk sysfs block queue/ directory, it 
says 'write through'? Just want to confirm that we got that propagated 
correctly.



--
Jens Axboe



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 10:09:56PM +, Holger Hoffstätte wrote:
> 
> Hi,
> 
> Jens mentioned on Twitter I should post my experience here as well,
> so here we go.
> 
> I've backported this series (incl. updates) to stable-4.4.x - not too
> difficult, minus the NVM part which I don't need anyway - and have been
> running it for the past few days without any problem whatsoever, with
> GREAT success.
> 
> My use case is primarily larger amounts of stuff (transcoded movies,
> finished downloads, built Gentoo packages) that gets copied from tmpfs
> to SSD (or disk) and every time that happens, the system noticeably
> strangles readers (desktop, interactive shell). It does not really matter
> how I tune writeback via the write_expire/dirty_bytes knobs or the
> scheduler (and yes, I understand how they work); lowering the writeback
> limits helped a bit but the system is still overwhelmed. Jacking up
> deadline's writes_starved to unreasonable levels helps a bit, but in turn
> makes all writes suffer. Anything else - even tried BFQ for a while,
> which has its own unrelated problems - didn't really help either.

Can you go back to your original kernel, and lower nr_requests to 8?

Essentially all I see the block throttle doing is keeping the
request queue depth to somewhere between 8-12 requests, rather than
letting it blow out to near nr_requests (around 105-115), so it
would be interesting to note whether the block throttling has any
noticeable difference in behaviour when compared to just having a
very shallow request queue.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:
> On 03/31/2016 08:29 AM, Jens Axboe wrote:
> >>What I see in these performance dips is the XFS transaction
> >>subsystem stalling *completely* - instead of running at a steady
> >>state of around 350,000 transactions/s, there are *zero*
> >>transactions running for periods of up to ten seconds.  This
> >>coincides with the CPU usage falling to almost zero as well.
> >>AFAICT, the only thing that is running when the filesystem stalls
> >>like this is memory reclaim.
> >
> >I'll take a look at this, stalls should definitely not be occurring. How
> >much memory does the box have?
> 
> I can't seem to reproduce this at all. On an nvme device, I get a
> fairly steady 60K/sec file creation rate, and we're nowhere near
> being IO bound. So the throttling has no effect at all.

That's too slow to show the stalls - you're likely concurrency bound
in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
agcount=32 so that every thread works in its own AG.
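
Concretely, for the scratch device used here that's just:

$ mkfs.xfs -f -d agcount=32 /dev/vdc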

> On a raid0 on 4 flash devices, I get something that looks more IO
> bound, for some reason. Still no impact of the throttling, however.
> But given that your setup is this:
> 
>   virtio in guest, XFS direct IO -> no-op -> scsi in host.
> 
> we do potentially have two throttling points, which we don't want.
> Are both the guest and the host running the new code, or just the
> guest?

Just the guest. Host is running a 4.2.x kernel, IIRC.

> In any case, can I talk you into trying with two patches on top of
> the current code? It's the two newest patches here:
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle
> 
> The first treats REQ_META|REQ_PRIO like they should be treated, like
> high priority IO. The second disables throttling for virtual
> devices, so we only throttle on the backend. The latter should
> probably be the other way around, but we need some way of conveying
> that information to the backend.

I'm not changing the host kernels - it's a production machine and so
it runs long uptime testing of stable kernels.  (e.g. catch slow
memory leaks, etc). So if you've disabled throttling in the guest, I
can't test the throttling changes.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Dave Chinner
On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:
> On 03/31/2016 02:24 AM, Dave Chinner wrote:
> >On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> >>Hi,
> >>
> >>This patchset isn't as much a final solution as it is a demonstration
> >>of what I believe is a huge issue. Since the dawn of time, our
> >>background buffered writeback has sucked. When we do background
> >>buffered writeback, it should have little impact on foreground
> >>activity. That's the definition of background activity... But for as
> >>long as I can remember, heavy buffered writers have not behaved like
> >>that. For instance, if I do something like this:
> >>
> >>$ dd if=/dev/zero of=foo bs=1M count=10k
> >>
> >>on my laptop, and then try and start chrome, it basically won't start
> >>before the buffered writeback is done. Or, for server oriented
> >>workloads, where installation of a big RPM (or similar) adversely
> >>impacts data base reads or sync writes. When that happens, I get people
> >>yelling at me.
> >>
> >>Last time I posted this, I used flash storage as the example. But
> >>this works equally well on rotating storage. Let's run a test case
> >>that writes a lot. This test writes 50 files, each 100M, on XFS on
> >>a regular hard drive. While this happens, we attempt to read
> >>another file with fio.
> >>
> >>Writers:
> >>
> >>$ time (./write-files ; sync)
> >>real1m6.304s
> >>user0m0.020s
> >>sys 0m12.210s
> >
> >Great. So a basic IO test looks good - let's throw something more
> >complex at it. Say, a benchmark I've been using for years to stress
> >the IO subsystem, the filesystem and memory reclaim all at the same
> >time: a concurrent fsmark inode creation test.
> >(first google hit https://lkml.org/lkml/2013/9/10/46)
> 
> Is that how you are invoking it as well, with the same arguments?

Yes. And the VM is exactly the same, too - 16p/16GB RAM. Cut down
version of the script I use:

#!/bin/bash

QUOTA=
MKFSOPTS=
NFILES=10
DEV=/dev/vdc
LOGBSIZE=256k
FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
MNT=/mnt/scratch

while [ $# -gt 0 ]; do
case "$1" in
-q) QUOTA="uquota,gquota,pquota" ;;
-N) NFILES=$2 ; shift ;;
-d) DEV=$2 ; shift ;;
-l) LOGBSIZE=$2; shift ;;
--) shift ; break ;;
esac
shift
done
MKFSOPTS="$MKFSOPTS $*"

echo QUOTA=$QUOTA
echo MKFSOPTS=$MKFSOPTS
echo DEV=$DEV

sudo umount $MNT > /dev/null 2>&1
sudo mkfs.xfs -f $MKFSOPTS $DEV
sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
sudo chmod 777 $MNT
sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
time $FSMARK  -D  1  -S0  -n  $NFILES  -s  0  -L  32 \
-d  $MNT/0  -d  $MNT/1 \
-d  $MNT/2  -d  $MNT/3 \
-d  $MNT/4  -d  $MNT/5 \
-d  $MNT/6  -d  $MNT/7 \
-d  $MNT/8  -d  $MNT/9 \
-d  $MNT/10  -d  $MNT/11 \
-d  $MNT/12  -d  $MNT/13 \
-d  $MNT/14  -d  $MNT/15 \
| tee >(stats --trim-outliers | tail -1 1>&2)
sync
sudo umount /mnt/scratch
$
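
(So a run with the larger AG count boils down to saving the above as, say,
fsmark-test.sh - the script name is of course arbitrary - and invoking it as

$ ./fsmark-test.sh -d /dev/vdc -- -d agcount=32

since everything after the bare -- is passed straight through to mkfs.xfs.)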

> >>The above was run without scsi-mq, and with the deadline scheduler;
> >>results with CFQ are similarly depressing for this test. So IO scheduling
> >>is in place for this test, it's not pure blk-mq without scheduling.
> >
> >virtio in guest, XFS direct IO -> no-op -> scsi in host.
> 
> That has write back caching enabled on the guest, correct?

No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
Sorry for not being clear about that.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Holger Hoffstätte

Hi,

Jens mentioned on Twitter I should post my experience here as well,
so here we go.

I've backported this series (incl. updates) to stable-4.4.x - not too
difficult, minus the NVM part which I don't need anyway - and have been
running it for the past few days without any problem whatsoever, with
GREAT success.

My use case is primarily copying larger amounts of data (transcoded movies,
finished downloads, built Gentoo packages) from tmpfs to SSD (or disk), and
every time that happens the system noticeably strangles readers (desktop,
interactive shell). It does not really matter how I tune writeback via the
write_expire/dirty_bytes knobs or the scheduler (and yes, I understand how
they work); lowering the writeback limits helped a bit, but the system is
still overwhelmed. Jacking up deadline's writes_starved to unreasonable
levels helps a bit, but in turn makes all writes suffer. Nothing else
really helped either - I even tried BFQ for a while, which has its own
unrelated problems.
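
For reference, the knobs I am talking about are roughly these (the device
name and the values below are only placeholders, not recommendations):

# global writeback limits (illustrative values)
echo 268435456 > /proc/sys/vm/dirty_bytes             # cap dirty memory at 256MB
echo 67108864  > /proc/sys/vm/dirty_background_bytes  # background writeback from 64MB

# deadline scheduler tunables on the target device (sda is a placeholder)
cat /sys/block/sda/queue/scheduler
echo 1000 > /sys/block/sda/queue/iosched/write_expire    # default is 5000 (ms)
echo 4    > /sys/block/sda/queue/iosched/writes_starved  # default is 2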

With this patchset, buffered writeback in these situations is much
improved: copying several GBs at once to a SATA-3 SSD (or even an external
USB-2 disk with a measly 40 MB/s) doodles along in the background like it
always should have, and desktop work is not noticeably affected.

I guess the effect will be even more noticeable on slower block devices
(laptops, old SSDs or disks).

So: +1 would apply again!

cheers
Holger



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Jens Axboe

On 03/31/2016 08:29 AM, Jens Axboe wrote:

What I see in these performance dips is the XFS transaction
subsystem stalling *completely* - instead of running at a steady
state of around 350,000 transactions/s, there are *zero*
transactions running for periods of up to ten seconds.  This
coincides with the CPU usage falling to almost zero as well.
AFAICT, the only thing that is running when the filesystem stalls
like this is memory reclaim.


I'll take a look at this, stalls should definitely not be occurring. How
much memory does the box have?


I can't seem to reproduce this at all. On an nvme device, I get a fairly 
steady 60K/sec file creation rate, and we're nowhere near being IO 
bound. So the throttling has no effect at all.


On a raid0 of 4 flash devices, I get something that looks more IO bound,
for some reason. Still no impact of the throttling, however. But given
that your setup is this:


virtio in guest, XFS direct IO -> no-op -> scsi in host.

we do potentially have two throttling points, which we don't want. Are
both the guest and the host running the new code, or just the guest?


In any case, can I talk you into trying two patches on top of the
current code? They are the two newest patches here:


http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

The first treats REQ_META|REQ_PRIO requests as they should be treated: as
high priority IO. The second disables throttling for virtual devices, so
we only throttle on the backend. The latter should probably be the other
way around, but we need some way of conveying that information to the
backend.
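
For reference, pulling that branch in for a test build might look roughly
like this (the exact clone URL may differ; the branch name comes from the
cgit link above):

git remote add linux-block git://git.kernel.dk/linux-block.git
git fetch linux-block
git checkout -b wb-buf-throttle linux-block/wb-buf-throttle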


--
Jens Axboe



Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Jens Axboe

On 03/31/2016 02:24 AM, Dave Chinner wrote:

On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:

Hi,

This patchset isn't as much a final solution, as it's demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts data base reads or sync writes. When that happens, I get people
yelling at me.

Last time I posted this, I used flash storage as the example. But
this works equally well on rotating storage. Let's run a test case
that writes a lot. This test writes 50 files, each 100M, on XFS on
a regular hard drive. While this happens, we attempt to read
another file with fio.

Writers:

$ time (./write-files ; sync)
real    1m6.304s
user    0m0.020s
sys     0m12.210s


Great. So a basic IO test looks good - let's throw something more
complex at it. Say, a benchmark I've been using for years to stress
the IO subsystem, the filesystem and memory reclaim all at the same
time: a concurrent fsmark inode creation test.
(first google hit https://lkml.org/lkml/2013/9/10/46)


Is that how you are invoking it as well, with the same arguments?


This generates thousands of REQ_WRITE metadata IOs every second, so
if I understand how the throttle works correctly, these would be
classified as background writeback by the block layer throttle.
And

FSUse%      Count   Size   Files/sec   App Overhead
     0       1600      0    255845.0       10796891
     0       3200      0    261348.8       10842349
     0       4800      0    249172.3       14121232
     0       6400      0    245172.8       12453759
     0       8000      0    201249.5       14293100
     0       9600      0    200417.5       29496551

     0      11200      0     90399.6       40665397

     0      12800      0    212265.6       21839031
     0      14400      0    206398.8       32598378
     0      16000      0    197589.7       26266552
     0      17600      0    206405.2       16447795

     0      19200      0     99189.6       87650540

     0      20800      0    249720.8       12294862
     0      22400      0    138523.8       47330007

     0      24000      0     85486.2       14271096

     0      25600      0    157538.1       64430611
     0      27200      0    109677.8       47835961
     0      28800      0    207230.5       31301031
     0      30400      0    188739.6       33750424
     0      32000      0    174197.9       41402526
     0      33600      0    139152.0      100838085
     0      35200      0    203729.7       34833764
     0      36800      0    228277.4       12459062

     0      38400      0     94962.0       30189182

     0      40000      0    166221.9       40564922

     0      41600      0     62902.5       80098461

     0      43200      0    217932.6       22539354
     0      44800      0    189594.6       24692209
     0      46400      0    137834.1       39822038
     0      48000      0    240043.8       12779453
     0      49600      0    176830.8       16604133
     0      51200      0    180771.8       32860221

real    5m35.967s
user    3m57.054s
sys     48m53.332s

In those highlighted report points, the performance has dropped
significantly. The typical range I expect to see once memory has
filled (a bit over 8m inodes) is 180k-220k.  Runtime on a vanilla
kernel was 4m40s and there were no performance drops, so this
workload runs almost a minute slower with the block layer throttling
code.

What I see in these performance dips is the XFS transaction
subsystem stalling *completely* - instead of running at a steady
state of around 350,000 transactions/s, there are *zero*
transactions running for periods of up to ten seconds.  This
coincides with the CPU usage falling to almost zero as well.
AFAICT, the only thing that is running when the filesystem stalls
like this is memory reclaim.


I'll take a look at this, stalls should definitely not be occurring. How 
much memory does the box have?



Without the block throttling patches, the workload quickly finds a
steady state of around 7.5-8.5 million cached inodes, and it doesn't
vary much outside those bounds. 

Re: [PATCHSET v3][RFC] Make background writeback not suck

2016-03-31 Thread Dave Chinner
On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> Hi,
> 
> This patchset isn't as much a final solution, as it's demonstration
> of what I believe is a huge issue. Since the dawn of time, our
> background buffered writeback has sucked. When we do background
> buffered writeback, it should have little impact on foreground
> activity. That's the definition of background activity... But for as
> long as I can remember, heavy buffered writers have not behaved like
> that. For instance, if I do something like this:
> 
> $ dd if=/dev/zero of=foo bs=1M count=10k
> 
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts data base reads or sync writes. When that happens, I get people
> yelling at me.
> 
> Last time I posted this, I used flash storage as the example. But
> this works equally well on rotating storage. Let's run a test case
> that writes a lot. This test writes 50 files, each 100M, on XFS on
> a regular hard drive. While this happens, we attempt to read
> another file with fio.
> 
> Writers:
> 
> $ time (./write-files ; sync)
> real  1m6.304s
> user  0m0.020s
> sys   0m12.210s

Great. So a basic IO test looks good - let's throw something more
complex at it. Say, a benchmark I've been using for years to stress
the IO subsystem, the filesystem and memory reclaim all at the same
time: a concurrent fsmark inode creation test.
(first google hit https://lkml.org/lkml/2013/9/10/46)

This generates thousands of REQ_WRITE metadata IOs every second, so
if I understand how the throttle works correctly, these would be
classified as background writeback by the block layer throttle.
And

FSUse%      Count   Size   Files/sec   App Overhead
     0       1600      0    255845.0       10796891
     0       3200      0    261348.8       10842349
     0       4800      0    249172.3       14121232
     0       6400      0    245172.8       12453759
     0       8000      0    201249.5       14293100
     0       9600      0    200417.5       29496551
     0      11200      0     90399.6       40665397
     0      12800      0    212265.6       21839031
     0      14400      0    206398.8       32598378
     0      16000      0    197589.7       26266552
     0      17600      0    206405.2       16447795
     0      19200      0     99189.6       87650540
     0      20800      0    249720.8       12294862
     0      22400      0    138523.8       47330007
     0      24000      0     85486.2       14271096
     0      25600      0    157538.1       64430611
     0      27200      0    109677.8       47835961
     0      28800      0    207230.5       31301031
     0      30400      0    188739.6       33750424
     0      32000      0    174197.9       41402526
     0      33600      0    139152.0      100838085
     0      35200      0    203729.7       34833764
     0      36800      0    228277.4       12459062
     0      38400      0     94962.0       30189182
     0      40000      0    166221.9       40564922
     0      41600      0     62902.5       80098461
     0      43200      0    217932.6       22539354
     0      44800      0    189594.6       24692209
     0      46400      0    137834.1       39822038
     0      48000      0    240043.8       12779453
     0      49600      0    176830.8       16604133
     0      51200      0    180771.8       32860221

real    5m35.967s
user    3m57.054s
sys     48m53.332s

In those highlighted report points, the performance has dropped
significantly. The typical range I expect to see once memory has
filled (a bit over 8m inodes) is 180k-220k.  Runtime on a vanilla
kernel was 4m40s and there were no performance drops, so this
workload runs almost a minute slower with the block layer throttling
code.

What I see in these performance dips is the XFS transaction
subsystem stalling *completely* - instead of running at a steady
state of around 350,000 transactions/s, there are *zero*
transactions running for periods of up to ten seconds.  This
coincides with the CPU usage falling to almost zero as well.
AFAICT, the only thing that is running when the filesystem stalls
like this is memory reclaim.
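
For reference, the transaction rate above comes from the XFS stats; a
crude way to watch it is something like the following (assuming the
third field of the "trans" line in /proc/fs/xfs/stat is the async
transaction counter - the first sample prints an absolute count, not
a delta):

prev=0
while sleep 1; do
    cur=$(awk '/^trans/ { print $3 }' /proc/fs/xfs/stat)
    echo "trans/s: $((cur - prev))"
    prev=$cur
done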

Without the block throttling patches, the workload quickly finds a
steady state of around 7.5-8.5 million cached inodes, and it doesn't
vary much outside those bounds. With the block throttling patches,
on every transaction subsystem stall that occurs, the inode cache
gets 3-4 million inodes trimmed out of it (i.e. half the
cache), and in a 
