Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file https://bugzilla.kernel.org/show_bug.cgi?id=90401
see below. This is reproducable with fio, no need for Windows XP in Virtualbox for reproducing the issue. Next I will try to reproduce with a freshly creating filesystem. Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills: > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote: > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White: > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote: > > > > Hello! > > > > > > > > First: Have a merry christmas and enjoy a quiet time in these days. > > > > > > > > Second: At a time you feel like it, here is a little rant, but also a > > > > bug > > > > report: > > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with > > > > space_cache, skinny meta data extents – are these a problem? – and > > > > > > > compress=lzo: > > > (there is no known problem with skinny metadata, it's actually more > > > efficient than the older format. There has been some anecdotes about > > > mixing the skinny and fat metadata but nothing has ever been > > > demonstrated problematic.) > > > > > > > merkaba:~> btrfs fi sh /home > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a > > > > > > > > Total devices 2 FS bytes used 144.41GiB > > > > devid 1 size 160.00GiB used 160.00GiB path > > > > /dev/mapper/msata-home > > > > devid 2 size 160.00GiB used 160.00GiB path > > > > /dev/mapper/sata-home > > > > > > > > Btrfs v3.17 > > > > merkaba:~> btrfs fi df /home > > > > Data, RAID1: total=154.97GiB, used=141.12GiB > > > > System, RAID1: total=32.00MiB, used=48.00KiB > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB > > > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > > > This filesystem, at the allocation level, is "very full" (see below). > > > > > > > And I had hangs with BTRFS again. This time as I wanted to install tax > > > > return software in Virtualbox´d Windows XP VM (which I use once a year > > > > cause I know no tax return software for Linux which would be suitable > > > > for > > > > Germany and I frankly don´t care about the end of security cause all > > > > surfing and other network access I will do from the Linux box and I > > > > only > > > > run the VM behind a firewall). > > > > > > > And thus I try the balance dance again: > > > ITEM: Balance... it doesn't do what you think it does... 8-) > > > > > > "Balancing" is something you should almost never need to do. It is only > > > for cases of changing geometry (adding disks, switching RAID levels, > > > etc.) of for cases when you've radically changed allocation behaviors > > > (like you decided to remove all your VM's or you've decided to remove a > > > mail spool directory full of thousands of tiny files). > > > > > > People run balance all the time because they think they should. They are > > > _usually_ incorrect in that belief. > > > > I only see the lockups of BTRFS is the trees *occupy* all space on the > > device. > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata > space. What's more, balance does *not* balance the metadata trees. The > remaining space -- 154.97 GiB -- is unstructured storage for file > data, and you have some 13 GiB of that available for use. > > Now, since you're seeing lockups when the space on your disks is > all allocated I'd say that's a bug. However, you're the *only* person > who's reported this as a regular occurrence. Does this happen with all > filesystems you have, or just this one? > > > I *never* so far saw it lockup if there is still space BTRFS can allocate > > from to *extend* a tree. > > It's not a tree. It's simply space allocation. It's not even space > *usage* you're talking about here -- it's just allocation (i.e. the FS > saying "I'm going to use this piece of disk for this purpose"). > > > This may be a bug, but this is what I see. > > > > And no amount of "you should not balance a BTRFS" will make that > > perception go away. > > > > See, I see the sun coming out on a morning and you tell me "no, it > > doesn´t". Simply that is not going to match my perception. > > Duncan's assertion is correct in its detail. Looking at your space Robert's :) > usage, I would not suggest that running a balance is something you > need to do. Now, since you have these lockups that seem quite > repeatable, there's probably a lurking bug in there, but hacking > around with balance every time you hit it isn't going to get the > problem solved properly. > > I think I would suggest the following: > > - make sure you have some way of logging your dmesg permanently (use > a different filesystem for /var/log, or a serial console, or a > netconsole) > > - when the lockup happens, hit Alt-SysRq-t a few times > > - send the dmesg output here, or post to bugzilla.kernel.org > > That's probably going to give enough information to the developers > to work out where the lockup is happening, and is clearly the way > forward here. And I got it reproduced. *Perfectly* reproduced, I´d say. But let me run the whole story: 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again. Which gave me: merkaba:~> btrfs fi sh /home Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a Total devices 2 FS bytes used 144.19GiB devid 1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~> btrfs fi df /home Data, RAID1: total=144.98GiB, used=140.95GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.24GiB GlobalReserve, single: total=512.00MiB, used=0.00B 2) I run the Virtualbox machine again and defragmented the NTFS filesystem in the VDI image file. And: It worked *just* fine. Fine as in *fine*. No issues whatsoever. I got this during the run: ATOP - merkaba 2014/12/27 12:58:42 ----------- 10s elapsed PRC | sys 10.41s | user 1.08s | #proc 357 | #trun 4 | #tslpi 694 | #tslpu 0 | #zombie 0 | no procacct | CPU | sys 107% | user 11% | irq 0% | idle 259% | wait 23% | guest 0% | curf 3.01GHz | curscal 93% | cpu | sys 29% | user 3% | irq 0% | idle 63% | cpu002 w 5% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 27% | user 3% | irq 0% | idle 65% | cpu000 w 5% | guest 0% | curf 3.03GHz | curscal 94% | cpu | sys 26% | user 3% | irq 0% | idle 63% | cpu003 w 8% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 24% | user 2% | irq 0% | idle 68% | cpu001 w 6% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 1.92 | avg5 1.01 | avg15 0.56 | | csw 501619 | intr 129279 | | numcpu 4 | MEM | tot 15.5G | free 610.1M | cache 9.1G | buff 0.1M | slab 1.0G | shmem 183.5M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.6G | | | | | vmcom 7.1G | vmlim 19.7G | PAG | scan 219141 | steal 215577 | stall 936 | | | | swin 0 | swout 940 | LVM | sata-home | busy 53% | read 181413 | write 0 | KiB/w 0 | MBr/s 70.86 | MBw/s 0.00 | avio 0.03 ms | LVM | sata-swap | busy 2% | read 0 | write 940 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.37 | avio 0.17 ms | LVM | sata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 1.00 ms | LVM | msata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | DSK | sda | busy 53% | read 181413 | write 477 | KiB/w 7 | MBr/s 70.86 | MBw/s 0.37 | avio 0.03 ms | DSK | sdb | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 9650 - martin martin 22 7.89s 0.65s 0K 128K 705.5M 382.1M -- - S 2 87% VirtualBox 9911 - root root 1 0.69s 0.01s 0K 0K 0K 0K -- - S 3 7% watch 9598 - root root 1 0.38s 0.00s 0K 0K 0K 20K -- - S 0 4% kworker/u8:9 9892 - root root 1 0.36s 0.00s 0K 0K 0K 0K -- - S 1 4% kworker/u8:17 9428 - root root 1 0.30s 0.00s 0K 0K 0K 0K -- - R 0 3% kworker/u8:3 9589 - root root 1 0.23s 0.00s 0K 0K 0K 0K -- - S 1 2% kworker/u8:6 4746 - martin martin 2 0.04s 0.13s 0K -16K 0K 0K -- - R 2 2% konsole Every 1,0s: cat /proc/meminfo Sat Dec 27 12:59:23 2014 MemTotal: 16210512 kB MemFree: 786632 kB MemAvailable: 10271500 kB Buffers: 52 kB Cached: 9564340 kB SwapCached: 70268 kB Active: 6847560 kB Inactive: 5257956 kB Active(anon): 2016412 kB Inactive(anon): 703076 kB Active(file): 4831148 kB Inactive(file): 4554880 kB Unevictable: 9068 kB Mlocked: 9068 kB SwapTotal: 12582908 kB SwapFree: 12186680 kB Dirty: 972324 kB Writeback: 0 kB AnonPages: 2526340 kB Mapped: 2457096 kB Shmem: 173564 kB Slab: 918128 kB SReclaimable: 848816 kB SUnreclaim: 69312 kB KernelStack: 11200 kB PageTables: 64556 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20688164 kB Committed_AS: 7438348 kB I am not seeing more than one GiB of dirty here during regular usage and it is no problem. And kworker thread CPU usage just fine. So no, the dirty_background_ratio isn´t an issue with this 16 GiB ThinkPad T520. Please note: I do Linux performance analysis and tuning courses for about 7 years or so meanwhile. I *know* these knobs. I may have used wrong terms regarding BTRFS, and my understanding of BTRFS space allocation probably can be more accurate, but I do think that I am onto something here. This is no rotating disk, it can handle the write burst just fine and I generally do not tune where there is no need to tune. Here there isn´t. And it wouldn´t be much more than a fine tuning. With slow devices or with rsync over NFS by all means reduce it. But here it simply isn´t an issue as you can see with the low kworker thread CPU usage and the general SSD usage above. So defragmentation completed just nice, no issue so far. But I am close to full device space reservation already: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home Sa 27. Dez 13:02:40 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 151.58GiB devid 1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home I thought I can trigger it again by defragmenting in Windows XP again, but mind you, its defragmented already so it doesn´t to much. I did the sdelete dance just to trigger something and well I saw kworker a bit higher, but not much. But finally I got to: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home Sa 27. Dez 13:26:39 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 152.83GiB devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.97GiB, used=149.58GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.26GiB GlobalReserve, single: total=512.00MiB, used=0.00B So I did, if Virtualbox can write randomly in a file, I can too. So I did: martin@merkaba:~> cat ssd-test.fio [global] bs=4k #ioengine=libaio #iodepth=4 size=4g #direct=1 runtime=120 filename=ssd.test.file [seq-write] rw=write stonewall [rand-write] rw=randwrite stonewall And got: ATOP - merkaba 2014/12/27 13:41:02 ----------- 10s elapsed PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct | CPU | sys 102% | user 4% | irq 0% | idle 295% | wait 0% | guest 0% | curf 3.10GHz | curscal 96% | cpu | sys 76% | user 0% | irq 0% | idle 24% | cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% | cpu | sys 24% | user 1% | irq 0% | idle 75% | cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% | cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 1% | user 1% | irq 0% | idle 98% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw 6233 | intr 12023 | | numcpu 4 | MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G | LVM | sata-home | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms | DSK | sda | busy 0% | read 8 | write 0 | KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms | NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2 18079 - martin martin 2 9.99s 0.00s 0K 0K 0K 16K -- - R 1 100% fio 4746 - martin martin 2 0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU for 10 seconds while allocatiing a 4 GiB file on a filesystem like: martin@merkaba:~> LANG=C df -hT /home Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home where a 4 GiB file should easily fit, no? (And this output is with the 4 GiB file. So it was even 4 GiB more free before.) But it gets even more visible: martin@merkaba:~> fio ssd-test.fio seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 2 processes Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s] 0$ zsh 1$ zsh 2$ zsh 3-$ zsh 4$ zsh 5$* zsh yes, thats 0 IOPS. 0 IOPS and in zero IOPS. For minutes. And here is why: ATOP - merkaba 2014/12/27 13:46:52 ----------- 10s elapsed PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct | CPU | sys 108% | user 3% | irq 0% | idle 286% | wait 2% | guest 0% | curf 3.08GHz | curscal 96% | cpu | sys 72% | user 1% | irq 0% | idle 28% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 19% | user 0% | irq 0% | idle 81% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 11% | user 1% | irq 0% | idle 87% | cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% | cpu | sys 6% | user 1% | irq 0% | idle 91% | cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% | CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw 50192 | intr 32379 | | numcpu 4 | MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G | LVM | sata-home | busy 5% | read 160 | write 11177 | KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms | LVM | msata-home | busy 4% | read 28 | write 11177 | KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms | LVM | sata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms | LVM | msata-debian | busy 0% | read 0 | write 844 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms | DSK | sda | busy 5% | read 160 | write 10200 | KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms | DSK | sdb | busy 4% | read 28 | write 10558 | KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms | NET | transport | tcpi 35 | tcpo 33 | udpi 3 | udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 | NET | network | ipi 38 | ipo 36 | ipfrw 0 | deliv 38 | | icmpi 0 | icmpo 0 | NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 14973 - root root 1 8.92s 0.00s 0K 0K 0K 144K -- - S 0 89% kworker/u8:14 17450 - root root 1 0.86s 0.00s 0K 0K 0K 32K -- - R 3 9% kworker/u8:5 788 - root root 1 0.25s 0.00s 0K 0K 128K 18880K -- - S 3 3% btrfs-transact 12254 - root root 1 0.14s 0.00s 0K 0K 64K 576K -- - S 2 1% kworker/u8:3 17332 - root root 1 0.11s 0.00s 0K 0K 112K 1348K -- - S 2 1% kworker/u8:4 3291 - martin martin 4 0.01s 0.09s 0K 0K 0K 0K -- - S 1 1% plasma-deskto ATOP - merkaba 2014/12/27 13:47:12 ----------- 10s elapsed PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct | CPU | sys 106% | user 4% | irq 0% | idle 288% | wait 1% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 93% | user 0% | irq 0% | idle 7% | cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 7% | user 0% | irq 0% | idle 93% | cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% | cpu | sys 3% | user 2% | irq 0% | idle 94% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw 38253 | intr 23104 | | numcpu 4 | MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G | LVM | msata-home | busy 2% | read 0 | write 2337 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms | LVM | sata-home | busy 2% | read 36 | write 2337 | KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms | LVM | msata-debian | busy 1% | read 1 | write 1630 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms | LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms | DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms | DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms | NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 | NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 | NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop So BTRFS is basically busy with itself and nothing else. Look at the SSD usage. They are *idling* around. Heck 2400 write accesses in 10 seconds. Thats a joke with SSDs that can do 40000 IOPS (depending on how and what you measure of course, like request size, read, write, iodepth and so). Its kworker/u8:5 utilizing 100% of one core for minutes. Its the random write case it seems. Here are values from fio job: martin@merkaba:~> fio ssd-test.fio seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 2 processes Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s] seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014 write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec clat (usec): min=3, max=38048, avg=10.52, stdev=205.25 lat (usec): min=3, max=38048, avg=10.66, stdev=205.43 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4], | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5], | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9], | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128], | 99.99th=[10304] bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20 lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12% lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Seems fine. But: rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014 write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61 lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5], | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6], | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10], | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116], | 99.99th=[16711680] bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02 lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02% lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02% cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec Run status group 1 (all jobs): WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec What? 254 IOPS? With a Dual SSD BTRFS RAID 1? What? Ey, *what*? Repeating with the random write case. Its a different kworker now, but similar result: ATOP - merkaba 2014/12/27 13:51:48 ----------- 10s elapsed PRC | sys 10.66s | user 0.25s | #proc 330 | #trun 2 | #tslpi 545 | #tslpu 2 | #zombie 0 | no procacct | CPU | sys 105% | user 3% | irq 0% | idle 292% | wait 0% | guest 0% | curf 3.07GHz | curscal 95% | cpu | sys 92% | user 0% | irq 0% | idle 8% | cpu002 w 0% | guest 0% | curf 3.19GHz | curscal 99% | cpu | sys 8% | user 0% | irq 0% | idle 92% | cpu003 w 0% | guest 0% | curf 3.09GHz | curscal 96% | cpu | sys 3% | user 2% | irq 0% | idle 95% | cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 2% | user 1% | irq 0% | idle 97% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 1.00 | avg5 1.32 | avg15 1.23 | | csw 34484 | intr 23182 | | numcpu 4 | MEM | tot 15.5G | free 5.4G | cache 8.3G | buff 0.0M | slab 334.8M | shmem 207.2M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.7G | | | | | vmcom 3.4G | vmlim 19.7G | LVM | sata-home | busy 1% | read 36 | write 2502 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms | LVM | msata-home | busy 1% | read 48 | write 2502 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms | LVM | msata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 1.33 ms | LVM | sata-debian | busy 0% | read 0 | write 6 | KiB/w 7 | MBr/s 0.00 | MBw/s 0.00 | avio 0.17 ms | DSK | sda | busy 1% | read 36 | write 2494 | KiB/w 4 | MBr/s 0.01 | MBw/s 0.98 | avio 0.06 ms | DSK | sdb | busy 1% | read 48 | write 2494 | KiB/w 4 | MBr/s 0.02 | MBw/s 0.98 | avio 0.04 ms | NET | transport | tcpi 32 | tcpo 30 | udpi 2 | udpo 2 | tcpao 2 | tcppo 1 | tcprs 0 | NET | network | ipi 35 | ipo 32 | ipfrw 0 | deliv 35 | | icmpi 0 | icmpo 0 | NET | eth0 0% | pcki 19 | pcko 16 | si 9 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2 11746 - root root 1 10.00s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:0 12254 - root root 1 0.16s 0.00s 0K 0K 112K 1712K -- - S 3 2% kworker/u8:3 17517 - root root 1 0.16s 0.00s 0K 0K 144K 1764K -- - S 1 2% kworker/u8:8 And now the graphical environemnt is locked. Continuining on TTY1. Doing another fio job with tee so I can get output easily. Wow! I wonder whether this is reproducable with a fresh BTRFS with fio stressing it. Like a 10 GiB BTRFS with 5 GiB fio test file and just letting it run. Okay, I let the final fio job complete and include the output here. Okay, and there we are and I do have sysrq-t figures. Okay, this is 1,2 MiB xz packed. So I better start a bug report about this and attach it there. Is dislike cloud URLs that may disappear at some time. Now please finally acknowledge that there is an issue. Maybe I was not using the correct terms at the beginning, but there is a real issue. I do performance things for half a decade at least, I know that there is an issue when I see it. There we go: Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file https://bugzilla.kernel.org/show_bug.cgi?id=90401 Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
signature.asc
Description: This is a digitally signed message part.