Re: [PATCHSET v3][RFC] Make background writeback not suck
On 04/01/16 03:01, Dave Chinner wrote:
> Can you go back to your original kernel, and lower nr_requests to 8?

Sure, did that and as expected it didn't help much. Under prolonged stress it was actually even a bit worse than writeback throttling. IMHO that's not really surprising either, since small queues now punish everyone, and in interactive mode I really want to e.g. start loading hundreds of small thumbnails at once, or du a directory.

Instead of randomized aka manual/interactive testing I created a simple stress tester:

#!/bin/sh
while true
do
	cp bigfile bigfile.out
done

(The original used `while [[ true ]]`, which is a bashism that breaks under a POSIX /bin/sh such as dash; plain `true` works everywhere.)

Running that in the background turns the system into a tar pit, which is laughable when you consider that I have 24G and 8 cores. With the writeback patchset and wb_percent=1 (yes, really!) it is almost unnoticeable, but according to nmon it still writes ~250-280 MB/s. This is with deadline on ext4 on an older SATA-3 SSD that can still do peak ~465 MB/s (with dd).

cheers,
Holger
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 04/01/2016 12:27 AM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:
>> On 03/31/2016 06:46 PM, Dave Chinner wrote:
>>>>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>>>
>>>> That has write back caching enabled on the guest, correct?
>>>
>>> No. It uses virtio,cache=none (that's the "XFS Direct IO" bit
>>> above). Sorry for not being clear about that.
>>
>> That's fine, it's one less worry if that's not the case. So if you
>> cat the 'write_cache' file in the virtioblk sysfs block queue/
>> directory, it says 'write through'? Just want to confirm that we got
>> that propagated correctly.
>
> No such file. But I did find:
>
> $ cat /sys/block/vdc/cache_type
> write back
>
> Which is what I'd expect it to say given the man page description of
> cache=none:
>
>	Note that this is considered a writeback mode and the guest OS
>	must handle the disk write cache correctly in order to avoid
>	data corruption on host crashes.
>
> To make it say "write through" I need to use cache=directsync, but I
> have no need for such integrity guarantees on a volatile test
> device...

I wasn't as concerned about the integrity side, more that if it's flagged as write back then we induce further throttling. But I'll see if I can get your test case reproduced, then I don't see why it can't get fixed. I'm off all of next week though, so probably won't be until the week after...

-- 
Jens Axboe
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 04/01/2016 12:16 AM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
>> On 03/31/2016 09:29 PM, Jens Axboe wrote:
>>>>> I can't seem to reproduce this at all. On an nvme device, I get a
>>>>> fairly steady 60K/sec file creation rate, and we're nowhere near
>>>>> being IO bound. So the throttling has no effect at all.
>>>>
>>>> That's too slow to show the stalls - you're likely concurrency
>>>> bound in allocation by the default AG count (4) from mkfs. Use
>>>> mkfs.xfs -d agcount=32 so that every thread works in its own AG.
>>>
>>> That's the key, with that I get 300-400K ops/sec instead. I'll run
>>> some testing with this tomorrow and see what I can find, it did one
>>> full run now and I didn't see any issues, but I need to run it at
>>> various settings and see if I can find the issue.
>>
>> No stalls seen, I get the same performance with it disabled and with
>> it enabled, at both default settings, and lower ones (wb_percent=20).
>> Looking at iostat, we don't drive a lot of depth, so it makes sense,
>> even with the throttling we're doing essentially the same amount of
>> IO.
>
> Try appending numa=fake=4 to your guest's kernel command line. (that's
> what I'm using)

Sure, I can give that a go.

>> What does 'nr_requests' say for your virtio_blk device? Looks like
>> virtio_blk has a queue_depth setting, but it's not set by default,
>> and then it uses the free entries in the ring. But I don't know what
>> that is...
>
> $ cat /sys/block/vdc/queue/nr_requests
> 128

OK, so that would put you in the 16/32/64 category for idle/normal/high priority writeback. Which fits with the iostat below, which is in the ~16 range. So the META thing should help, it'll bump it up a bit. But we're also seeing smaller requests, and I think that could be because after we do throttle, we could potentially have a merge candidate. The code doesn't check post-sleeping, it'll allow any merges before though.
Though that part is a little harder to read from the iostat numbers, there does seem to be a correlation between your higher depths and bigger request sizes.

> I'll try the "don't throttle REQ_META" patch, but this seems like a
> fragile way to solve this problem - it shuts up the messenger, but
> doesn't solve the problem for any other subsystem that might have a
> similar issue. e.g. next we're going to have to make sure direct IO
> (which is also REQ_WRITE dispatch) does not get throttled, and so on

I don't think there's anything wrong with the REQ_META patch. Sure, we could have better classifications (like discussed below), but that's mainly tweaking. As long as we get the same answers, it's fine. There's no throttling of O_DIRECT writes in the current code, it specifically doesn't include those. It's only for the unbounded writes, which writeback tends to be.

> It seems to me that the right thing to do here is add a separate
> classification flag for IO that can be throttled. e.g. as
> REQ_WRITEBACK, and only background writeback work sets this flag.
> That would ensure that when the IO is being dispatched from other
> sources (e.g. fsync, sync_file_range(), direct IO, filesystem
> metadata, etc) it is clear that it is not a target for throttling.
> This would also allow us to easily switch off throttling if writeback
> is occurring for memory reclaim reasons, and so on. Throttling policy
> decisions belong above the block layer, even though the throttle
> mechanism itself is in the block layer.

We're already doing all of that, it just doesn't include a specific REQ_WRITEBACK flag. And yeah, that would clean up the checking for request type, but functionally it should be the same as it is now. It'll be a bit more robust and easier to read if we just have a REQ_WRITEBACK; right now it's WRITE_SYNC vs WRITE for important vs not-important, with a check for write vs O_DIRECT write as well.

-- 
Jens Axboe
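[Editor's sketch: Jens maps nr_requests=128 to depth tiers of 16/32/64 for idle/normal/high priority writeback. The divisors below (/8, /4, /2) are an inference from those numbers, not taken from the patchset source.]

```shell
# Hedged sketch of how the writeback throttle's depth tiers appear to
# scale with nr_requests, inferred from the 128 -> 16/32/64 figures
# quoted in the thread. The /8, /4, /2 divisors are an assumption.
wb_depths() {
    nr=$1
    # idle/normal/high priority writeback depth
    echo "$((nr / 8))/$((nr / 4))/$((nr / 2))"
}

wb_depths 128   # prints 16/32/64
```

With Dave's suggested nr_requests=8, the same scaling would give depths of 1/2/4, which is why a very shallow queue approximates the throttle's behaviour.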
Re: [PATCHSET v3][RFC] Make background writeback not suck
On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:46 PM, Dave Chinner wrote:
> >>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
> >>
> >> That has write back caching enabled on the guest, correct?
> >
> > No. It uses virtio,cache=none (that's the "XFS Direct IO" bit
> > above). Sorry for not being clear about that.
>
> That's fine, it's one less worry if that's not the case. So if you
> cat the 'write_cache' file in the virtioblk sysfs block queue/
> directory, it says 'write through'? Just want to confirm that we got
> that propagated correctly.

No such file. But I did find:

$ cat /sys/block/vdc/cache_type
write back

Which is what I'd expect it to say given the man page description of cache=none:

	Note that this is considered a writeback mode and the guest OS
	must handle the disk write cache correctly in order to avoid
	data corruption on host crashes.

To make it say "write through" I need to use cache=directsync, but I have no need for such integrity guarantees on a volatile test device...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCHSET v3][RFC] Make background writeback not suck
On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
> On 03/31/2016 09:29 PM, Jens Axboe wrote:
> >>> I can't seem to reproduce this at all. On an nvme device, I get a
> >>> fairly steady 60K/sec file creation rate, and we're nowhere near
> >>> being IO bound. So the throttling has no effect at all.
> >>
> >> That's too slow to show the stalls - you're likely concurrency
> >> bound in allocation by the default AG count (4) from mkfs. Use
> >> mkfs.xfs -d agcount=32 so that every thread works in its own AG.
> >
> > That's the key, with that I get 300-400K ops/sec instead. I'll run
> > some testing with this tomorrow and see what I can find, it did one
> > full run now and I didn't see any issues, but I need to run it at
> > various settings and see if I can find the issue.
>
> No stalls seen, I get the same performance with it disabled and with
> it enabled, at both default settings, and lower ones (wb_percent=20).
> Looking at iostat, we don't drive a lot of depth, so it makes sense,
> even with the throttling we're doing essentially the same amount of
> IO.

Try appending numa=fake=4 to your guest's kernel command line. (that's what I'm using)

> What does 'nr_requests' say for your virtio_blk device? Looks like
> virtio_blk has a queue_depth setting, but it's not set by default,
> and then it uses the free entries in the ring. But I don't know what
> that is...
$ cat /sys/block/vdc/queue/nr_requests
128
$

Without the block throttling, guest IO (measured within the guest) looks like this over a fair proportion of the test (5s sample time):

# iostat -d -x -m 5 /dev/vdc
Device: rrqm/s   wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
vdc       0.00 20443.00   6.20  436.60    0.05  269.89  1248.48    73.83 146.11  486.58  141.27   1.64  72.40
vdc       0.00 11567.60  19.20  161.40    0.05  146.08  1657.12   119.17 704.57  707.25  704.25   5.34  96.48
vdc       0.00 12723.20   3.20  437.40    0.05  193.65   900.38    29.46  57.12    1.75   57.52   0.78  34.56
vdc       0.00  1739.80  22.40  426.80    0.05  123.62   563.86    23.44  62.51   79.89   61.59   1.01  45.28
vdc       0.00 12553.80   0.00  521.20    0.00  210.86   828.54    34.38  65.96    0.00   65.96   0.97  50.80
vdc       0.00 12523.60  25.60  529.60    0.10  201.94   745.29    52.24  77.73    0.41   81.47   1.14  63.20
vdc       0.00  5419.80  22.40  502.60    0.05  158.34   617.90    24.42  63.81   30.96   65.27   1.31  68.80
vdc       0.00 12059.00   0.00  439.60    0.00  174.85   814.59    30.91  70.27    0.00   70.27   0.72  31.76
vdc       0.00  7578.00  25.60  397.00    0.10  139.18   675.00    15.72  37.26   61.19   35.72   0.73  30.72
vdc       0.00  9156.00   0.00  537.40    0.00  173.57   661.45    17.08  29.62    0.00   29.62   0.53  28.72
vdc       0.00  5274.80  22.40  377.60    0.05  136.42   698.77    26.17  68.33  186.96   61.30   1.53  61.36
vdc       0.00  9407.00   3.20  541.00    0.05  174.28   656.05    36.10  66.33    3.00   66.71   0.87  47.60
vdc       0.00  8687.20  22.40  410.40    0.05  150.98   714.70    39.91  92.21   93.82   92.12   1.39  60.32
vdc       0.00  8872.80   0.00  422.60    0.00  139.28   674.96    25.01  33.03    0.00   33.03   0.91  38.40
vdc       0.00  1081.60  22.40  241.00    0.05   68.88   535.97    10.78  82.89  137.86   77.79   2.25  59.20
vdc       0.00  9826.80   0.00  445.00    0.00  167.42   770.49    45.16 101.49    0.00  101.49   1.80  79.92
vdc       0.00  7394.00  22.40  447.60    0.05  157.34   685.83    18.06  38.42   77.64   36.46   1.46  68.48
vdc       0.00  9984.80   3.20  252.00    0.05  108.46   870.82    85.68 293.73   16.75  297.24   3.00  76.64
vdc       0.00     0.00  22.40  454.20    0.05  117.67   505.86     8.11  39.51   35.71   39.70   1.17  55.76
vdc       0.00 10273.20   0.00  418.80    0.00  156.76   766.57    90.52 179.40    0.00  179.40   1.85  77.52
vdc       0.00  5650.00  22.40  185.00    0.05   84.12   831.20   103.90 575.15   60.82  637.42   4.21  87.36
vdc       0.00  7193.00   0.00  308.80    0.00  120.71   800.56    63.77 194.35    0.00  194.35   2.24  69.12
vdc       0.00  4460.80   9.80  211.00    0.03   69.52   645.07    72.35 154.81  269.39  149.49   4.42  97.60
vdc       0.00   683.00  14.00  374.60    0.05   99.13   522.69    25.38 167.61  603.14  151.33   1.45  56.24
vdc       0.00  7140.20   1.80  275.20    0.03  104.53   773.06    85.25 202.67   32.44  203.79   2.80  77.68
vdc       0.00  6916.00   0.00  164.00    0.00   82.59  1031.33   126.20 813.60    0.00  813.60   6.10 100.00
vdc       0.00  2255.60  22.40  359.00    0.05  107.41  577.06
Re: [PATCHSET v3][RFC] Make background writeback not suck
On Thu, Mar 31, 2016 at 09:29:30PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:56 PM, Dave Chinner wrote:
> > I'm not changing the host kernels - it's a production machine and so
> > it runs long uptime testing of stable kernels. (e.g. catch slow
> > memory leaks, etc). So if you've disabled throttling in the guest, I
> > can't test the throttling changes.
>
> Right, that'd definitely hide the problem for you. I'll see if I can
> get it in a reproducible state and take it from there.
>
> On your host, you said it's SCSI backed, but what does the device
> look like?

HW RAID 0 w/ 1GB FBWC (dell h710, IIRC) of 2x200GB SATA SSDs (actually 256GB, but 25% of each is left as spare, unused space). Sustains about 35,000 random 4k write IOPS, up to 70k read IOPS.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 03/31/2016 09:29 PM, Jens Axboe wrote:
>>> I can't seem to reproduce this at all. On an nvme device, I get a
>>> fairly steady 60K/sec file creation rate, and we're nowhere near
>>> being IO bound. So the throttling has no effect at all.
>>
>> That's too slow to show the stalls - you're likely concurrency bound
>> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
>> agcount=32 so that every thread works in its own AG.
>
> That's the key, with that I get 300-400K ops/sec instead. I'll run
> some testing with this tomorrow and see what I can find, it did one
> full run now and I didn't see any issues, but I need to run it at
> various settings and see if I can find the issue.

No stalls seen, I get the same performance with it disabled and with it enabled, at both default settings, and lower ones (wb_percent=20). Looking at iostat, we don't drive a lot of depth, so it makes sense, even with the throttling we're doing essentially the same amount of IO.

What does 'nr_requests' say for your virtio_blk device? Looks like virtio_blk has a queue_depth setting, but it's not set by default, and then it uses the free entries in the ring. But I don't know what that is...

-- 
Jens Axboe
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 03/31/2016 09:29 PM, Jens Axboe wrote:
>> I'm not changing the host kernels - it's a production machine and so
>> it runs long uptime testing of stable kernels. (e.g. catch slow
>> memory leaks, etc). So if you've disabled throttling in the guest, I
>> can't test the throttling changes.
>
> Right, that'd definitely hide the problem for you. I'll see if I can
> get it in a reproducible state and take it from there.

Though on the guest, if you could try with just this one applied:

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=f21fb0e42c7347bd639a17341dcd3f72c1a30d29

I'd appreciate it. It won't disable the throttling in the guest, just treat META and PRIO a bit differently.

-- 
Jens Axboe
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 03/31/2016 06:56 PM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:
>> On 03/31/2016 08:29 AM, Jens Axboe wrote:
>>>> What I see in these performance dips is the XFS transaction
>>>> subsystem stalling *completely* - instead of running at a steady
>>>> state of around 350,000 transactions/s, there are *zero*
>>>> transactions running for periods of up to ten seconds. This
>>>> co-incides with the CPU usage falling to almost zero as well.
>>>> AFAICT, the only thing that is running when the filesystem stalls
>>>> like this is memory reclaim.
>>>
>>> I'll take a look at this, stalls should definitely not be
>>> occurring. How much memory does the box have?
>>
>> I can't seem to reproduce this at all. On an nvme device, I get a
>> fairly steady 60K/sec file creation rate, and we're nowhere near
>> being IO bound. So the throttling has no effect at all.
>
> That's too slow to show the stalls - you're likely concurrency bound
> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
> agcount=32 so that every thread works in its own AG.

That's the key, with that I get 300-400K ops/sec instead. I'll run some testing with this tomorrow and see what I can find, it did one full run now and I didn't see any issues, but I need to run it at various settings and see if I can find the issue.

>> On a raid0 on 4 flash devices, I get something that looks more IO
>> bound, for some reason. Still no impact of the throttling, however.
>> But given that your setup is this:
>>
>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>
>> we do potentially have two throttling points, which we don't want.
>> Is both the guest and the host running the new code, or just the
>> guest?
>
> Just the guest. Host is running a 4.2.x kernel, IIRC.

OK.

>> In any case, can I talk you into trying with two patches on top of
>> the current code? It's the two newest patches here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle
>>
>> The first treats REQ_META|REQ_PRIO like they should be treated, like
>> high priority IO. The second disables throttling for virtual
>> devices, so we only throttle on the backend. The latter should
>> probably be the other way around, but we need some way of conveying
>> that information to the backend.
>
> I'm not changing the host kernels - it's a production machine and so
> it runs long uptime testing of stable kernels. (e.g. catch slow
> memory leaks, etc). So if you've disabled throttling in the guest, I
> can't test the throttling changes.

Right, that'd definitely hide the problem for you. I'll see if I can get it in a reproducible state and take it from there.

On your host, you said it's SCSI backed, but what does the device look like?

-- 
Jens Axboe
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 03/31/2016 06:46 PM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:
>> On 03/31/2016 02:24 AM, Dave Chinner wrote:
>>> On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> This patchset isn't as much a final solution, as it's a
>>>> demonstration of what I believe is a huge issue. Since the dawn of
>>>> time, our background buffered writeback has sucked. When we do
>>>> background buffered writeback, it should have little impact on
>>>> foreground activity. That's the definition of background
>>>> activity... But for as long as I can remember, heavy buffered
>>>> writers have not behaved like that. For instance, if I do
>>>> something like this:
>>>>
>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>
>>>> on my laptop, and then try and start chrome, it basically won't
>>>> start before the buffered writeback is done. Or, for server
>>>> oriented workloads, where installation of a big RPM (or similar)
>>>> adversely impacts database reads or sync writes. When that
>>>> happens, I get people yelling at me.
>>>>
>>>> Last time I posted this, I used flash storage as the example. But
>>>> this works equally well on rotating storage. Let's run a test case
>>>> that writes a lot. This test writes 50 files, each 100M, on XFS on
>>>> a regular hard drive. While this happens, we attempt to read
>>>> another file with fio.
>>>>
>>>> Writers:
>>>>
>>>> $ time (./write-files ; sync)
>>>> real    1m6.304s
>>>> user    0m0.020s
>>>> sys     0m12.210s
>>>
>>> Great. So a basic IO test looks good - let's throw something more
>>> complex at it. Say, a benchmark I've been using for years to stress
>>> the IO subsystem, the filesystem and memory reclaim all at the same
>>> time: a concurrent fsmark inode creation test.
>>> (first google hit https://lkml.org/lkml/2013/9/10/46)
>>
>> Is that how you are invoking it as well, same arguments?
>
> Yes. And the VM is exactly the same, too - 16p/16GB RAM.
>
> Cut down version of the script I use:
>
> #!/bin/bash
>
> QUOTA=
> MKFSOPTS=
> NFILES=10
> DEV=/dev/vdc
> LOGBSIZE=256k
> FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
> MNT=/mnt/scratch
>
> while [ $# -gt 0 ]; do
>         case "$1" in
>         -q)     QUOTA="uquota,gquota,pquota" ;;
>         -N)     NFILES=$2 ; shift ;;
>         -d)     DEV=$2 ; shift ;;
>         -l)     LOGBSIZE=$2 ; shift ;;
>         --)     shift ; break ;;
>         esac
>         shift
> done
> MKFSOPTS="$MKFSOPTS $*"
>
> echo QUOTA=$QUOTA
> echo MKFSOPTS=$MKFSOPTS
> echo DEV=$DEV
>
> sudo umount $MNT > /dev/null 2>&1
> sudo mkfs.xfs -f $MKFSOPTS $DEV
> sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
> sudo chmod 777 $MNT
> sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
> time $FSMARK -D 1 -S0 -n $NFILES -s 0 -L 32 \
>         -d $MNT/0 -d $MNT/1 \
>         -d $MNT/2 -d $MNT/3 \
>         -d $MNT/4 -d $MNT/5 \
>         -d $MNT/6 -d $MNT/7 \
>         -d $MNT/8 -d $MNT/9 \
>         -d $MNT/10 -d $MNT/11 \
>         -d $MNT/12 -d $MNT/13 \
>         -d $MNT/14 -d $MNT/15 \
>         | tee >(stats --trim-outliers | tail -1 1>&2)
> sync
> sudo umount /mnt/scratch

Perfect, thanks!

>>>> The above was run without scsi-mq, and with using the deadline
>>>> scheduler; results with CFQ are similarly depressing for this
>>>> test. So IO scheduling is in place for this test, it's not pure
>>>> blk-mq without scheduling.
>>>
>>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>
>> That has write back caching enabled on the guest, correct?
>
> No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
> Sorry for not being clear about that.

That's fine, it's one less worry if that's not the case. So if you cat the 'write_cache' file in the virtioblk sysfs block queue/ directory, it says 'write through'? Just want to confirm that we got that propagated correctly.

-- 
Jens Axboe
Re: [PATCHSET v3][RFC] Make background writeback not suck
On Thu, Mar 31, 2016 at 10:09:56PM +, Holger Hoffstätte wrote:
> Hi,
>
> Jens mentioned on Twitter I should post my experience here as well,
> so here we go.
>
> I've backported this series (incl. updates) to stable-4.4.x - not too
> difficult, minus the NVM part which I don't need anyway - and have
> been running it for the past few days without any problem whatsoever,
> with GREAT success.
>
> My use case is primarily larger amounts of stuff (transcoded movies,
> finished downloads, built Gentoo packages) that gets copied from
> tmpfs to SSD (or disk) and every time that happens, the system
> noticeably strangles readers (desktop, interactive shell). It does
> not really matter how I tune writeback via the
> write_expire/dirty_bytes knobs or the scheduler (and yes, I
> understand how they work); lowering the writeback limits helped a bit
> but the system is still overwhelmed. Jacking up deadline's
> writes_starved to unreasonable levels helps a bit, but in turn makes
> all writes suffer. Anything else - even tried BFQ for a while, which
> has its own unrelated problems - didn't really help either.

Can you go back to your original kernel, and lower nr_requests to 8? Essentially all I see the block throttle doing is keeping the request queue depth to somewhere between 8-12 requests, rather than letting it blow out to near nr_requests (around 105-115), so it would be interesting to note whether the block throttling has any noticeable difference in behaviour when compared to just having a very shallow request queue.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
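[Editor's sketch of how Dave's nr_requests experiment would be applied. The device name below is a placeholder, and the helper falls back to a dry-run message when the sysfs file isn't writable (e.g. when not running as root), so the snippet is safe to run anywhere.]

```shell
# Shrink the block layer's request queue to 8 entries, per Dave's
# suggestion. /sys/block/sda is a placeholder device -- substitute
# your own. nr_requests is a standard queue sysfs attribute.
set_nr_requests() {
    # $1 = path to a queue/nr_requests file, $2 = new queue depth
    if [ -w "$1" ]; then
        echo "$2" > "$1"
    else
        echo "dry-run: would write $2 to $1"
    fi
}

set_nr_requests /sys/block/sda/queue/nr_requests 8
```

Reverting is just writing the old value (typically 128) back to the same file.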
Re: [PATCHSET v3][RFC] Make background writeback not suck
On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:
> On 03/31/2016 08:29 AM, Jens Axboe wrote:
> >> What I see in these performance dips is the XFS transaction
> >> subsystem stalling *completely* - instead of running at a steady
> >> state of around 350,000 transactions/s, there are *zero*
> >> transactions running for periods of up to ten seconds. This
> >> co-incides with the CPU usage falling to almost zero as well.
> >> AFAICT, the only thing that is running when the filesystem stalls
> >> like this is memory reclaim.
> >
> > I'll take a look at this, stalls should definitely not be
> > occurring. How much memory does the box have?
>
> I can't seem to reproduce this at all. On an nvme device, I get a
> fairly steady 60K/sec file creation rate, and we're nowhere near
> being IO bound. So the throttling has no effect at all.

That's too slow to show the stalls - you're likely concurrency bound in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d agcount=32 so that every thread works in its own AG.

> On a raid0 on 4 flash devices, I get something that looks more IO
> bound, for some reason. Still no impact of the throttling, however.
> But given that your setup is this:
>
> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>
> we do potentially have two throttling points, which we don't want.
> Is both the guest and the host running the new code, or just the
> guest?

Just the guest. Host is running a 4.2.x kernel, IIRC.

> In any case, can I talk you into trying with two patches on top of
> the current code? It's the two newest patches here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle
>
> The first treats REQ_META|REQ_PRIO like they should be treated, like
> high priority IO. The second disables throttling for virtual
> devices, so we only throttle on the backend. The latter should
> probably be the other way around, but we need some way of conveying
> that information to the backend.

I'm not changing the host kernels - it's a production machine and so it runs long uptime testing of stable kernels. (e.g. catch slow memory leaks, etc). So if you've disabled throttling in the guest, I can't test the throttling changes.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCHSET v3][RFC] Make background writeback not suck
On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:
> On 03/31/2016 02:24 AM, Dave Chinner wrote:
> > On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> >> Hi,
> >>
> >> This patchset isn't as much a final solution, as it's a
> >> demonstration of what I believe is a huge issue. Since the dawn of
> >> time, our background buffered writeback has sucked. When we do
> >> background buffered writeback, it should have little impact on
> >> foreground activity. That's the definition of background
> >> activity... But for as long as I can remember, heavy buffered
> >> writers have not behaved like that. For instance, if I do
> >> something like this:
> >>
> >> $ dd if=/dev/zero of=foo bs=1M count=10k
> >>
> >> on my laptop, and then try and start chrome, it basically won't
> >> start before the buffered writeback is done. Or, for server
> >> oriented workloads, where installation of a big RPM (or similar)
> >> adversely impacts database reads or sync writes. When that
> >> happens, I get people yelling at me.
> >>
> >> Last time I posted this, I used flash storage as the example. But
> >> this works equally well on rotating storage. Let's run a test case
> >> that writes a lot. This test writes 50 files, each 100M, on XFS on
> >> a regular hard drive. While this happens, we attempt to read
> >> another file with fio.
> >>
> >> Writers:
> >>
> >> $ time (./write-files ; sync)
> >> real    1m6.304s
> >> user    0m0.020s
> >> sys     0m12.210s
> >
> > Great. So a basic IO test looks good - let's throw something more
> > complex at it. Say, a benchmark I've been using for years to stress
> > the IO subsystem, the filesystem and memory reclaim all at the same
> > time: a concurrent fsmark inode creation test.
> > (first google hit https://lkml.org/lkml/2013/9/10/46)
>
> Is that how you are invoking it as well, same arguments?

Yes. And the VM is exactly the same, too - 16p/16GB RAM.

Cut down version of the script I use:

#!/bin/bash

QUOTA=
MKFSOPTS=
NFILES=10
DEV=/dev/vdc
LOGBSIZE=256k
FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
MNT=/mnt/scratch

while [ $# -gt 0 ]; do
        case "$1" in
        -q)     QUOTA="uquota,gquota,pquota" ;;
        -N)     NFILES=$2 ; shift ;;
        -d)     DEV=$2 ; shift ;;
        -l)     LOGBSIZE=$2 ; shift ;;
        --)     shift ; break ;;
        esac
        shift
done
MKFSOPTS="$MKFSOPTS $*"

echo QUOTA=$QUOTA
echo MKFSOPTS=$MKFSOPTS
echo DEV=$DEV

sudo umount $MNT > /dev/null 2>&1
sudo mkfs.xfs -f $MKFSOPTS $DEV
sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
sudo chmod 777 $MNT
sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
time $FSMARK -D 1 -S0 -n $NFILES -s 0 -L 32 \
        -d $MNT/0 -d $MNT/1 \
        -d $MNT/2 -d $MNT/3 \
        -d $MNT/4 -d $MNT/5 \
        -d $MNT/6 -d $MNT/7 \
        -d $MNT/8 -d $MNT/9 \
        -d $MNT/10 -d $MNT/11 \
        -d $MNT/12 -d $MNT/13 \
        -d $MNT/14 -d $MNT/15 \
        | tee >(stats --trim-outliers | tail -1 1>&2)
sync
sudo umount /mnt/scratch

> >> The above was run without scsi-mq, and with using the deadline
> >> scheduler; results with CFQ are similarly depressing for this
> >> test. So IO scheduling is in place for this test, it's not pure
> >> blk-mq without scheduling.
> >
> > virtio in guest, XFS direct IO -> no-op -> scsi in host.
>
> That has write back caching enabled on the guest, correct?

No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above). Sorry for not being clear about that.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCHSET v3][RFC] Make background writeback not suck
Hi,

Jens mentioned on Twitter I should post my experience here as well, so here we go.

I've backported this series (incl. updates) to stable-4.4.x - not too difficult, minus the NVM part which I don't need anyway - and have been running it for the past few days without any problem whatsoever, with GREAT success.

My use case is primarily larger amounts of stuff (transcoded movies, finished downloads, built Gentoo packages) that gets copied from tmpfs to SSD (or disk) and every time that happens, the system noticeably strangles readers (desktop, interactive shell). It does not really matter how I tune writeback via the write_expire/dirty_bytes knobs or the scheduler (and yes, I understand how they work); lowering the writeback limits helped a bit but the system is still overwhelmed. Jacking up deadline's writes_starved to unreasonable levels helps a bit, but in turn makes all writes suffer. Anything else - even tried BFQ for a while, which has its own unrelated problems - didn't really help either.

With this patchset the buffered writeback in these situations is much improved, and copying several GBs at once to a SATA-3 SSD (or even an external USB-2 disk with measly 40 MB/s) doodles along in the background like it always should have, and desktop work is not noticeably affected. I guess the effect will be even more noticeable on slower block devices (laptops, old SSDs or disks).

So: +1, would apply again!

cheers
Holger
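[Editor's sketch of the arithmetic behind the dirty_bytes/dirty_ratio knobs Holger is tuning: a percentage-based dirty limit on a large-memory box allows a lot of dirty data to accumulate before writeback throttles anything. The 20% figure is the common kernel default for vm.dirty_ratio, assumed here; 24GB matches the machine Holger describes elsewhere in the thread.]

```shell
# How much dirty data a percentage-based writeback limit permits.
# Assumes the stock vm.dirty_ratio default of 20%; on Holger's 24GB
# box that is roughly 4.8GB of dirty pages before foreground
# throttling kicks in, which explains why the knobs alone don't save
# interactivity.
dirty_limit_bytes() {
    # $1 = total memory in bytes, $2 = dirty ratio in percent
    echo $(( $1 * $2 / 100 ))
}

dirty_limit_bytes $((24 * 1024 * 1024 * 1024)) 20   # ~4.8GB
```

Setting vm.dirty_bytes directly (as Holger does) sidesteps the percentage scaling, but as he notes, even low absolute limits don't fix the reader starvation that the patchset addresses.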
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 03/31/2016 08:29 AM, Jens Axboe wrote:
>> What I see in these performance dips is the XFS transaction
>> subsystem stalling *completely* - instead of running at a steady
>> state of around 350,000 transactions/s, there are *zero*
>> transactions running for periods of up to ten seconds. This
>> co-incides with the CPU usage falling to almost zero as well.
>> AFAICT, the only thing that is running when the filesystem stalls
>> like this is memory reclaim.
>
> I'll take a look at this, stalls should definitely not be occurring.
> How much memory does the box have?

I can't seem to reproduce this at all. On an nvme device, I get a fairly steady 60K/sec file creation rate, and we're nowhere near being IO bound. So the throttling has no effect at all.

On a raid0 on 4 flash devices, I get something that looks more IO bound, for some reason. Still no impact of the throttling, however. But given that your setup is this:

virtio in guest, XFS direct IO -> no-op -> scsi in host.

we do potentially have two throttling points, which we don't want. Is both the guest and the host running the new code, or just the guest?

In any case, can I talk you into trying with two patches on top of the current code? It's the two newest patches here:

http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

The first treats REQ_META|REQ_PRIO like they should be treated, like high priority IO. The second disables throttling for virtual devices, so we only throttle on the backend. The latter should probably be the other way around, but we need some way of conveying that information to the backend.

-- 
Jens Axboe
Re: [PATCHSET v3][RFC] Make background writeback not suck
On 03/31/2016 02:24 AM, Dave Chinner wrote: On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote: Hi, This patchset isn't as much a final solution, as it's demonstration of what I believe is a huge issue. Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers has not behaved like that. For instance, if I do something like this: $ dd if=/dev/zero of=foo bs=1M count=10k on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done. Or, for server oriented workloads, where installation of a big RPM (or similar) adversely impacts data base reads or sync writes. When that happens, I get people yelling at me. Last time I posted this, I used flash storage as the example. But this works equally well on rotating storage. Let's run a test case that writes a lot. This test writes 50 files, each 100M, on XFS on a regular hard drive. While this happens, we attempt to read another file with fio. Writers: $ time (./write-files ; sync) real1m6.304s user0m0.020s sys 0m12.210s Great. So a basic IO tests looks good - let's through something more complex at it. Say, a benchmark I've been using for years to stress the Io subsystem, the filesystem and memory reclaim all at the same time: a concurent fsmark inode creation test. (first google hit https://lkml.org/lkml/2013/9/10/46) Is that how you are invoking it as well same arguments? This generates thousands of REQ_WRITE metadata IOs every second, so iif I understand how the throttle works correctly, these would be classified as background writeback by the block layer throttle. 
> And:
>
> FSUse%    Count    Files/sec    App Overhead
>      0     1600     255845.0        10796891
>      0     3200     261348.8        10842349
>      0     4800     249172.3        14121232
>      0     6400     245172.8        12453759
>      0     8000     201249.5        14293100
>      0     9600     200417.5        29496551
>      0    11200      90399.6        40665397
>      0    12800     212265.6        21839031
>      0    14400     206398.8        32598378
>      0    16000     197589.7        26266552
>      0    17600     206405.2        16447795
>      0    19200      99189.6        87650540
>      0    20800     249720.8        12294862
>      0    22400     138523.8        47330007
>      0    24000      85486.2        14271096
>      0    25600     157538.1        64430611
>      0    27200     109677.8        47835961
>      0    28800     207230.5        31301031
>      0    30400     188739.6        33750424
>      0    32000     174197.9        41402526
>      0    33600     139152.0       100838085
>      0    35200     203729.7        34833764
>      0    36800     228277.4        12459062
>      0    38400      94962.0        30189182
>      0    40000     166221.9        40564922
>      0    41600      62902.5        80098461
>      0    43200     217932.6        22539354
>      0    44800     189594.6        24692209
>      0    46400     137834.1        39822038
>      0    48000     240043.8        12779453
>      0    49600     176830.8        16604133
>      0    51200     180771.8        32860221
>
> real    5m35.967s
> user    3m57.054s
> sys     48m53.332s
>
> In those highlighted report points, the performance has dropped
> significantly. The typical range I expect to see once memory has
> filled (a bit over 8m inodes) is 180k-220k. Runtime on a vanilla
> kernel was 4m40s and there were no performance drops, so this
> workload runs almost a minute slower with the block layer
> throttling code.
>
> What I see in these performance dips is the XFS transaction
> subsystem stalling *completely* - instead of running at a steady
> state of around 350,000 transactions/s, there are *zero*
> transactions running for periods of up to ten seconds. This
> coincides with the CPU usage falling to almost zero as well.
> AFAICT, the only thing that is running when the filesystem stalls
> like this is memory reclaim.

I'll take a look at this, stalls should definitely not be occurring.
How much memory does the box have?

> Without the block throttling patches, the workload quickly finds a
> steady state of around 7.5-8.5 million cached inodes, and it
> doesn't vary much outside those bounds. Wi
Re: [PATCHSET v3][RFC] Make background writeback not suck
On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> Hi,
>
> This patchset isn't as much a final solution as it's a demonstration
> of what I believe is a huge issue. Since the dawn of time, our
> background buffered writeback has sucked. When we do background
> buffered writeback, it should have little impact on foreground
> activity. That's the definition of background activity... But for as
> long as I can remember, heavy buffered writers have not behaved like
> that. For instance, if I do something like this:
>
> $ dd if=/dev/zero of=foo bs=1M count=10k
>
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts database reads or sync writes. When that happens, I get
> people yelling at me.
>
> Last time I posted this, I used flash storage as the example. But
> this works equally well on rotating storage. Let's run a test case
> that writes a lot. This test writes 50 files, each 100M, on XFS on
> a regular hard drive. While this happens, we attempt to read
> another file with fio.
>
> Writers:
>
> $ time (./write-files ; sync)
> real    1m6.304s
> user    0m0.020s
> sys     0m12.210s

Great. So a basic IO test looks good - let's throw something more
complex at it. Say, a benchmark I've been using for years to stress
the IO subsystem, the filesystem and memory reclaim all at the same
time: a concurrent fsmark inode creation test. (first google hit
https://lkml.org/lkml/2013/9/10/46)

This generates thousands of REQ_WRITE metadata IOs every second, so
if I understand how the throttle works correctly, these would be
classified as background writeback by the block layer throttle.
And:

FSUse%    Count    Files/sec    App Overhead
     0     1600     255845.0        10796891
     0     3200     261348.8        10842349
     0     4800     249172.3        14121232
     0     6400     245172.8        12453759
     0     8000     201249.5        14293100
     0     9600     200417.5        29496551
     0    11200      90399.6        40665397
     0    12800     212265.6        21839031
     0    14400     206398.8        32598378
     0    16000     197589.7        26266552
     0    17600     206405.2        16447795
     0    19200      99189.6        87650540
     0    20800     249720.8        12294862
     0    22400     138523.8        47330007
     0    24000      85486.2        14271096
     0    25600     157538.1        64430611
     0    27200     109677.8        47835961
     0    28800     207230.5        31301031
     0    30400     188739.6        33750424
     0    32000     174197.9        41402526
     0    33600     139152.0       100838085
     0    35200     203729.7        34833764
     0    36800     228277.4        12459062
     0    38400      94962.0        30189182
     0    40000     166221.9        40564922
     0    41600      62902.5        80098461
     0    43200     217932.6        22539354
     0    44800     189594.6        24692209
     0    46400     137834.1        39822038
     0    48000     240043.8        12779453
     0    49600     176830.8        16604133
     0    51200     180771.8        32860221

real    5m35.967s
user    3m57.054s
sys     48m53.332s

In those highlighted report points, the performance has dropped
significantly. The typical range I expect to see once memory has
filled (a bit over 8m inodes) is 180k-220k. Runtime on a vanilla
kernel was 4m40s and there were no performance drops, so this
workload runs almost a minute slower with the block layer throttling
code.

What I see in these performance dips is the XFS transaction subsystem
stalling *completely* - instead of running at a steady state of
around 350,000 transactions/s, there are *zero* transactions running
for periods of up to ten seconds. This coincides with the CPU usage
falling to almost zero as well. AFAICT, the only thing that is
running when the filesystem stalls like this is memory reclaim.

Without the block throttling patches, the workload quickly finds a
steady state of around 7.5-8.5 million cached inodes, and it doesn't
vary much outside those bounds. With the block throttling patches, on
every transaction subsystem stall that occurs, the inode cache gets
3-4 million inodes trimmed out of it (i.e. half the cache), and in a c
[PATCHSET v3][RFC] Make background writeback not suck
Hi,

This patchset isn't as much a final solution as it's a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get
people yelling at me.

Last time I posted this, I used flash storage as the example. But
this works equally well on rotating storage. Let's run a test case
that writes a lot. This test writes 50 files, each 100M, on XFS on a
regular hard drive. While this happens, we attempt to read another
file with fio.
Writers:

$ time (./write-files ; sync)
real    1m6.304s
user    0m0.020s
sys     0m12.210s

Fio reader:

  read : io=35580KB, bw=550868B/s, iops=134, runt= 66139msec
    clat (usec): min=40, max=654204, avg=7432.37, stdev=43872.83
     lat (usec): min=40, max=654204, avg=7432.70, stdev=43872.83
    clat percentiles (usec):
     |  1.00th=[    41],  5.00th=[    41], 10.00th=[    41], 20.00th=[    42],
     | 30.00th=[    42], 40.00th=[    42], 50.00th=[    43], 60.00th=[    52],
     | 70.00th=[    59], 80.00th=[    65], 90.00th=[    87], 95.00th=[  1192],
     | 99.00th=[254976], 99.50th=[358400], 99.90th=[16], 99.95th=[468992],
     | 99.99th=[651264]

Let's run the same test, but with the patches applied, and wb_percent
set to 10%:

Writers:

$ time (./write-files ; sync)
real    1m29.384s
user    0m0.040s
sys     0m10.810s

Fio reader:

  read : io=1024.0MB, bw=18640KB/s, iops=4660, runt= 56254msec
    clat (usec): min=39, max=408400, avg=212.05, stdev=2982.44
     lat (usec): min=39, max=408400, avg=212.30, stdev=2982.44
    clat percentiles (usec):
     |  1.00th=[    40],  5.00th=[    41], 10.00th=[    41], 20.00th=[    41],
     | 30.00th=[    42], 40.00th=[    42], 50.00th=[    42], 60.00th=[    42],
     | 70.00th=[    43], 80.00th=[    45], 90.00th=[    56], 95.00th=[    60],
     | 99.00th=[   454], 99.50th=[  8768], 99.90th=[ 36608], 99.95th=[ 43264],
     | 99.99th=[ 69120]

Much better, looking at the P99.x percentiles, and of course on the
bandwidth front as well. It's the difference between this:

---io---- -system-- ------cpu-----
   bi    bo   in    cs us sy id wa st
20636 45056 5593 10833  0  0 94  6  0
16416 46080 4484  8666  0  0 94  6  0
16960 47104 5183  8936  0  0 94  6  0

and this:

---io---- -system-- ------cpu-----
   bi    bo  in  cs us sy id wa st
  384 73728 571 558  0  0 95  5  0
  384 73728 548 545  0  0 95  5  0
  388 73728 575 763  0  0 96  4  0

in the vmstat output. It's not quite as bad as deeper queue depth
devices, where we have hugely bursty IO, but it's still very slow.
If we don't run the competing reader, the dirty data writeback
proceeds at normal rates:

# time (./write-files ; sync)
real    1m6.919s
user    0m0.010s
sys     0m10.900s

The above was run without scsi-mq, and using the deadline scheduler;
results with CFQ are similarly depressing for this test. So IO
scheduling is in place for this test, it's not pure blk-mq without
scheduling.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of
things, so we get nice and big extents on the file system end. But we
don't need to flood the device with THOUSANDS of requests for
background writeback. For most devices, we don't need a whole lot to
get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases
that need to clean memory at or near device speeds get to do that. We
still don't need thousands of requests to accomplish that. And for
the cases where we don't need to be near device limits, we can clean
at a more reasonable pace. See the last patch in the series for a
more detailed description of the change, and the tunable.

I welcome testing. If you are sick of Linux bogging down when
buffered writes are happening, then this is for you, laptop or
server. The patchset is fully stable, I have not observed problems.
It passes full xfstest runs, and a variety of benchmarks as well. It
works equally well on blk-mq/scsi-mq, and "classic" setups. You can
also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I c