Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald: > Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald: > > Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald: > > > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald: > > > > Summarized at > > > > > > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for > > > > minutes on random write into big file > > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401 > > > > > > > > see below. This is reproducable with fio, no need for Windows XP in > > > > Virtualbox for reproducing the issue. Next I will try to reproduce with > > > > a freshly creating filesystem. > > > > > > > > > > > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills: > > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote: > > > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White: > > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote: > > > > > > > > Hello! > > > > > > > > > > > > > > > > First: Have a merry christmas and enjoy a quiet time in these > > > > > > > > days. > > > > > > > > > > > > > > > > Second: At a time you feel like it, here is a little rant, but > > > > > > > > also a > > > > > > > > bug > > > > > > > > report: > > > > > > > > > > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD > > > > > > > > RAID with > > > > > > > > space_cache, skinny meta data extents – are these a problem? – > > > > > > > > and > > > > > > > > > > > > > > > compress=lzo: > > > > > > > (there is no known problem with skinny metadata, it's actually > > > > > > > more > > > > > > > efficient than the older format. There has been some anecdotes > > > > > > > about > > > > > > > mixing the skinny and fat metadata but nothing has ever been > > > > > > > demonstrated problematic.) > > > > > > > > > > > > > > > merkaba:~> btrfs fi sh /home > > > > > > > > Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a > > > > > > > > > > > > > > > > Total devices 2 FS bytes used 144.41GiB > > > > > > > > devid 1 size 160.00GiB used 160.00GiB path > > > > > > > > /dev/mapper/msata-home > > > > > > > > devid 2 size 160.00GiB used 160.00GiB path > > > > > > > > /dev/mapper/sata-home > > > > > > > > > > > > > > > > Btrfs v3.17 > > > > > > > > merkaba:~> btrfs fi df /home > > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB > > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB > > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB > > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > > > > > > > > > > > This filesystem, at the allocation level, is "very full" (see > > > > > > > below). > > > > > > > > > > > > > > > And I had hangs with BTRFS again. This time as I wanted to > > > > > > > > install tax > > > > > > > > return software in Virtualbox´d Windows XP VM (which I use once > > > > > > > > a year > > > > > > > > cause I know no tax return software for Linux which would be > > > > > > > > suitable > > > > > > > > for > > > > > > > > Germany and I frankly don´t care about the end of security > > > > > > > > cause all > > > > > > > > surfing and other network access I will do from the Linux box > > > > > > > > and I > > > > > > > > only > > > > > > > > run the VM behind a firewall). > > > > > > > > > > > > > > > And thus I try the balance dance again: > > > > > > > ITEM: Balance... it doesn't do what you think it does... > > > > > > > > > > > > > > "Balancing" is something you should almost never need to do. It > > > > > > > is only > > > > > > > for cases of changing geometry (adding disks, switching RAID > > > > > > > levels, > > > > > > > etc.) of for cases when you've radically changed allocation > > > > > > > behaviors > > > > > > > (like you decided to remove all your VM's or you've decided to > > > > > > > remove a > > > > > > > mail spool directory full of thousands of tiny files). > > > > > > > > > > > > > > People run balance all the time because they think they should. > > > > > > > They are > > > > > > > _usually_ incorrect in that belief. > > > > > > > > > > > > I only see the lockups of BTRFS is the trees *occupy* all space on > > > > > > the > > > > > > device. > > > > > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata > > > > > space. What's more, balance does *not* balance the metadata trees. The > > > > > remaining space -- 154.97 GiB -- is unstructured storage for file > > > > > data, and you have some 13 GiB of that available for use. > > > > > > > > > > Now, since you're seeing lockups when the space on your disks is > > > > > all allocated I'd say that's a bug. However, you're the *only* person > > > > > who's reported this as a regular occurrence. Does this happen with all > > > > > filesystems you have, or just this one? > > > > > > > > > > > I *never* so far saw it lockup if there is still space BTRFS can > > > > > > allocate > > > > > > from to *extend* a tree. > > > > > > > > > > It's not a tree. It's simply space allocation. It's not even space > > > > > *usage* you're talking about here -- it's just allocation (i.e. the FS > > > > > saying "I'm going to use this piece of disk for this purpose"). > > > > > > > > > > > This may be a bug, but this is what I see. > > > > > > > > > > > > And no amount of "you should not balance a BTRFS" will make that > > > > > > perception go away. > > > > > > > > > > > > See, I see the sun coming out on a morning and you tell me "no, it > > > > > > doesn´t". Simply that is not going to match my perception. > > > > > > > > > > Duncan's assertion is correct in its detail. Looking at your space > > > > > > > > Robert's > > > > > > > > > usage, I would not suggest that running a balance is something you > > > > > need to do. Now, since you have these lockups that seem quite > > > > > repeatable, there's probably a lurking bug in there, but hacking > > > > > around with balance every time you hit it isn't going to get the > > > > > problem solved properly. > > > > > > > > > > I think I would suggest the following: > > > > > > > > > > - make sure you have some way of logging your dmesg permanently (use > > > > > a different filesystem for /var/log, or a serial console, or a > > > > > netconsole) > > > > > > > > > > - when the lockup happens, hit Alt-SysRq-t a few times > > > > > > > > > > - send the dmesg output here, or post to bugzilla.kernel.org > > > > > > > > > > That's probably going to give enough information to the developers > > > > > to work out where the lockup is happening, and is clearly the way > > > > > forward here. > > > > > > > > And I got it reproduced. *Perfectly* reproduced, I´d say. > > > > > > > > But let me run the whole story: > > > > > > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again. > > > > > > [… story of trying to reproduce with Windows XP defragmenting which was > > > unsuccessful as BTRFS still had free device space to allocate new chunks > > > from …] > > > > > > > But finally I got to: > > > > > > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home > > > > Sa 27. Dez 13:26:39 CET 2014 > > > > Label: 'home' uuid: [some UUID] > > > > Total devices 2 FS bytes used 152.83GiB > > > > devid 1 size 160.00GiB used 160.00GiB path > > > > /dev/mapper/msata-home > > > > devid 2 size 160.00GiB used 160.00GiB path > > > > /dev/mapper/sata-home > > > > > > > > Btrfs v3.17 > > > > Data, RAID1: total=154.97GiB, used=149.58GiB > > > > System, RAID1: total=32.00MiB, used=48.00KiB > > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB > > > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > > > > > > > > > > > > > So I did, if Virtualbox can write randomly in a file, I can too. > > > > > > > > So I did: > > > > > > > > > > > > martin@merkaba:~> cat ssd-test.fio > > > > [global] > > > > bs=4k > > > > #ioengine=libaio > > > > #iodepth=4 > > > > size=4g > > > > #direct=1 > > > > runtime=120 > > > > filename=ssd.test.file > > > > > > > > [seq-write] > > > > rw=write > > > > stonewall > > > > > > > > [rand-write] > > > > rw=randwrite > > > > stonewall > > > > > > > > > > > > > > > > And got: > > > > > > > > ATOP - merkaba 2014/12/27 13:41:02 > > > > ----------- 10s elapsed > > > > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | > > > > #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct | > > > > CPU | sys 102% | user 4% | irq 0% | idle 295% | > > > > wait 0% | guest 0% | curf 3.10GHz | curscal 96% | > > > > cpu | sys 76% | user 0% | irq 0% | idle 24% | > > > > cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% | > > > > cpu | sys 24% | user 1% | irq 0% | idle 75% | > > > > cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% | > > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | > > > > cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > > > cpu | sys 1% | user 1% | irq 0% | idle 98% | > > > > cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > > > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | > > > > csw 6233 | intr 12023 | | numcpu 4 | > > > > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | > > > > slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M | > > > > SWP | tot 12.0G | free 11.7G | | | > > > > | | vmcom 3.4G | vmlim 19.7G | > > > > LVM | sata-home | busy 0% | read 8 | write 0 | > > > > KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms | > > > > DSK | sda | busy 0% | read 8 | write 0 | > > > > KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms | > > > > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | > > > > udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | > > > > NET | network | ipi 16 | ipo 16 | ipfrw 0 | > > > > deliv 16 | | icmpi 0 | icmpo 0 | > > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | > > > > so 2 Kbps | erri 0 | erro 0 | drpo 0 | > > > > > > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW > > > > RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2 > > > > 18079 - martin martin 2 9.99s 0.00s 0K > > > > 0K 0K 16K -- - R 1 100% fio > > > > 4746 - martin martin 2 0.01s 0.14s 0K > > > > 0K 0K 0K -- - S 2 2% konsole > > > > 3291 - martin martin 4 0.01s 0.11s 0K > > > > 0K 0K 0K -- - S 0 1% plasma-desktop > > > > 1488 - root root 1 0.03s 0.04s 0K > > > > 0K 0K 0K -- - S 0 1% Xorg > > > > 10036 - root root 1 0.04s 0.02s 0K > > > > 0K 0K 0K -- - R 2 1% atop > > > > > > > > while fio was just *laying* out the 4 GiB file. Yes, thats 100% system > > > > CPU > > > > for 10 seconds while allocatiing a 4 GiB file on a filesystem like: > > > > > > > > martin@merkaba:~> LANG=C df -hT /home > > > > Filesystem Type Size Used Avail Use% Mounted on > > > > /dev/mapper/msata-home btrfs 170G 156G 17G 91% /home > > > > > > > > where a 4 GiB file should easily fit, no? (And this output is with the 4 > > > > GiB file. So it was even 4 GiB more free before.) > > > > > > > > > > > > But it gets even more visible: > > > > > > > > martin@merkaba:~> fio ssd-test.fio > > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, > > > > iodepth=1 > > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, > > > > iodepth=1 > > > > fio-2.1.11 > > > > Starting 2 processes > > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] > > > > [eta 01m:57s] > > > > 0$ zsh 1$ zsh 2$ zsh 3-$ zsh 4$ zsh 5$* zsh > > > > > > > > > > > > > > > > yes, thats 0 IOPS. > > > > > > > > 0 IOPS and in zero IOPS. For minutes. > > > > > > > > > > > > > > > > And here is why: > > > > > > > > ATOP - merkaba 2014/12/27 13:46:52 > > > > ----------- 10s elapsed > > > > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | > > > > #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct | > > > > CPU | sys 108% | user 3% | irq 0% | idle 286% | > > > > wait 2% | guest 0% | curf 3.08GHz | curscal 96% | > > > > cpu | sys 72% | user 1% | irq 0% | idle 28% | > > > > cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > > > cpu | sys 19% | user 0% | irq 0% | idle 81% | > > > > cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > > > cpu | sys 11% | user 1% | irq 0% | idle 87% | > > > > cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% | > > > > cpu | sys 6% | user 1% | irq 0% | idle 91% | > > > > cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% | > > > > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | > > > > csw 50192 | intr 32379 | | numcpu 4 | > > > > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | > > > > slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M | > > > > SWP | tot 12.0G | free 11.7G | | | > > > > | | vmcom 3.4G | vmlim 19.7G | > > > > LVM | sata-home | busy 5% | read 160 | write 11177 | > > > > KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms | > > > > LVM | msata-home | busy 4% | read 28 | write 11177 | > > > > KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms | > > > > LVM | sata-debian | busy 0% | read 0 | write 844 | > > > > KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms | > > > > LVM | msata-debian | busy 0% | read 0 | write 844 | > > > > KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms | > > > > DSK | sda | busy 5% | read 160 | write 10200 | > > > > KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms | > > > > DSK | sdb | busy 4% | read 28 | write 10558 | > > > > KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms | > > > > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | > > > > udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 | > > > > NET | network | ipi 38 | ipo 36 | ipfrw 0 | > > > > deliv 38 | | icmpi 0 | icmpo 0 | > > > > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | > > > > so 2 Kbps | erri 0 | erro 0 | drpo 0 | > > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | > > > > so 2 Kbps | erri 0 | erro 0 | drpo 0 | > > > > > > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW > > > > RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 > > > > 14973 - root root 1 8.92s 0.00s 0K > > > > 0K 0K 144K -- - S 0 89% kworker/u8:14 > > > > 17450 - root root 1 0.86s 0.00s 0K > > > > 0K 0K 32K -- - R 3 9% kworker/u8:5 > > > > 788 - root root 1 0.25s 0.00s 0K > > > > 0K 128K 18880K -- - S 3 3% btrfs-transact > > > > 12254 - root root 1 0.14s 0.00s 0K > > > > 0K 64K 576K -- - S 2 1% kworker/u8:3 > > > > 17332 - root root 1 0.11s 0.00s 0K > > > > 0K 112K 1348K -- - S 2 1% kworker/u8:4 > > > > 3291 - martin martin 4 0.01s 0.09s 0K > > > > 0K 0K 0K -- - S 1 1% plasma-deskto > > > > > > > > > > > > > > > > > > > > ATOP - merkaba 2014/12/27 13:47:12 > > > > ----------- 10s elapsed > > > > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | > > > > #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct | > > > > CPU | sys 106% | user 4% | irq 0% | idle 288% | > > > > wait 1% | guest 0% | curf 3.00GHz | curscal 93% | > > > > cpu | sys 93% | user 0% | irq 0% | idle 7% | > > > > cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > > > cpu | sys 7% | user 0% | irq 0% | idle 93% | > > > > cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% | > > > > cpu | sys 3% | user 2% | irq 0% | idle 94% | > > > > cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > > > cpu | sys 3% | user 2% | irq 0% | idle 95% | > > > > cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > > > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | > > > > csw 38253 | intr 23104 | | numcpu 4 | > > > > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | > > > > slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M | > > > > SWP | tot 12.0G | free 11.7G | | | > > > > | | vmcom 3.4G | vmlim 19.7G | > > > > LVM | msata-home | busy 2% | read 0 | write 2337 | > > > > KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms | > > > > LVM | sata-home | busy 2% | read 36 | write 2337 | > > > > KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms | > > > > LVM | msata-debian | busy 1% | read 1 | write 1630 | > > > > KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms | > > > > LVM | sata-debian | busy 0% | read 0 | write 1019 | > > > > KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms | > > > > DSK | sdb | busy 2% | read 1 | write 2545 | > > > > KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms | > > > > DSK | sda | busy 1% | read 36 | write 2461 | > > > > KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms | > > > > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | > > > > udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 | > > > > NET | network | ipi 21 | ipo 21 | ipfrw 0 | > > > > deliv 21 | | icmpi 0 | icmpo 0 | > > > > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | > > > > so 0 Kbps | erri 0 | erro 0 | drpo 0 | > > > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | > > > > so 2 Kbps | erri 0 | erro 0 | drpo 0 | > > > > > > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW > > > > RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 > > > > 17450 - root root 1 9.96s 0.00s 0K > > > > 0K 0K 0K -- - R 2 100% kworker/u8:5 > > > > 4746 - martin martin 2 0.06s 0.15s 0K > > > > 0K 0K 0K -- - S 1 2% konsole > > > > 10508 - root root 1 0.13s 0.00s 0K > > > > 0K 96K 4048K -- - S 1 1% kworker/u8:18 > > > > 1488 - root root 1 0.06s 0.06s 0K > > > > 0K 0K 0K -- - S 0 1% Xorg > > > > 17332 - root root 1 0.12s 0.00s 0K > > > > 0K 96K 580K -- - R 3 1% kworker/u8:4 > > > > 17454 - root root 1 0.11s 0.00s 0K > > > > 0K 32K 4416K -- - D 1 1% kworker/u8:6 > > > > 17516 - root root 1 0.09s 0.00s 0K > > > > 0K 16K 136K -- - S 3 1% kworker/u8:7 > > > > 3268 - martin martin 3 0.02s 0.05s 0K > > > > 0K 0K 0K -- - S 1 1% kwin > > > > 10036 - root root 1 0.05s 0.02s 0K > > > > 0K 0K 0K -- - R 0 1% atop > > > > > > > > > > > > > > > > So BTRFS is basically busy with itself and nothing else. Look at the SSD > > > > usage. They are *idling* around. Heck 2400 write accesses in 10 seconds. > > > > Thats a joke with SSDs that can do 40000 IOPS (depending on how and what > > > > you measure of course, like request size, read, write, iodepth and so). > > > > > > > > Its kworker/u8:5 utilizing 100% of one core for minutes. > > > > > > > > > > > > > > > > Its the random write case it seems. Here are values from fio job: > > > > > > > > martin@merkaba:~> fio ssd-test.fio > > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, > > > > iodepth=1 > > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, > > > > iodepth=1 > > > > fio-2.1.11 > > > > Starting 2 processes > > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] > > > > [eta 01h:06m:26s] > > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 > > > > 2014 > > > > write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec > > > > clat (usec): min=3, max=38048, avg=10.52, stdev=205.25 > > > > lat (usec): min=3, max=38048, avg=10.66, stdev=205.43 > > > > clat percentiles (usec): > > > > | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ > > > > 4], > > > > | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ > > > > 5], > > > > | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ > > > > 9], > > > > | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ > > > > 2128], > > > > | 99.99th=[10304] > > > > bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, > > > > stdev=201695.20 > > > > lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12% > > > > lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01% > > > > lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% > > > > cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6 > > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > > > > >=64=0.0% > > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > > > > >=64=0.0% > > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > > > > >=64=0.0% > > > > issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 > > > > latency : target=0, window=0, percentile=100.00%, depth=1 > > > > > > > > Seems fine. > > > > > > > > > > > > But: > > > > > > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 > > > > 2014 > > > > write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec > > > > clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61 > > > > lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61 > > > > clat percentiles (usec): > > > > | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ > > > > 5], > > > > | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ > > > > 6], > > > > | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ > > > > 10], > > > > | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ > > > > 116], > > > > | 99.99th=[16711680] > > > > bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, > > > > stdev=938.02 > > > > lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02% > > > > lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02% > > > > cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7 > > > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > > > > >=64=0.0% > > > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > > > > >=64=0.0% > > > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > > > > >=64=0.0% > > > > issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0 > > > > latency : target=0, window=0, percentile=100.00%, depth=1 > > > > > > > > Run status group 0 (all jobs): > > > > WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, > > > > maxb=343682KB/s, mint=12204msec, maxt=12204msec > > > > > > > > Run status group 1 (all jobs): > > > > WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, > > > > mint=137803msec, maxt=137803msec > > > > > > > > > > > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1? > > > > > > > > What? > > > > > > > > Ey, *what*? > > […] > > > > There we go: > > > > > > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for > > > > minutes on random write into big file > > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401 > > > > > > I have done more tests. > > > > > > This is on the same /home after extending it to 170 GiB and balancing it > > > to > > > btrfs balance start -dusage=80 > > > > > > It has plenty of free space free. I updated the bug report and hope it can > > > give an easy enough to comprehend summary. The new tests are in: > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6 > > > > > > > > > > > > Pasting below for discussion on list. Summary: I easily get 38000 (!) > > > IOPS. It may be an idea to reduce to 160 GiB, but right now this does > > > not work as it says no free space on device when trying to downsize it. > > > I may try with 165 or 162GiB. > > > > > > So now we have three IOPS figures: > > > > > > - 256 IOPS in worst case scenario > > > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and > > > small > > > BTRFS > > > - 38000 IOPS when /home has unused device space to allocate chunks from > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8 > > > > > > > > > This is another test. > > > > > > Okay, and this is the last series of tests for today. > > > > Conclusion: > > > > I cannot manage to get it down to the knees as before, but I come near to > > it. > > > > Still its 8000 IOPS, instead of 250 IOPS, in an according to btrfs fi sh > > even *worse* situation than before. > > > > That hints me at the need to look at the free space fragmentation, as in the > > beginning the problem started appearing with: > > > > merkaba:~> btrfs fi sh /home > > Label: 'home' uuid: […] > > Total devices 2 FS bytes used 144.41GiB > > devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home > > devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home > > > > Btrfs v3.17 > > merkaba:~> btrfs fi df /home > > Data, RAID1: total=154.97GiB, used=141.12GiB > > System, RAID1: total=32.00MiB, used=48.00KiB > > Metadata, RAID1: total=5.00GiB, used=3.29GiB > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > > > > > > Yes, thats 13 GiB of free space *within* the chunks. > > > > So while I can get it down in IOPS by bringing it to a situation where it > > can not reserve additional data chunks again, I cannot recreate the > > abysmal 250 IOPS figure by this. Not even with my /home filesystem. > > > > So there is more to it. I think its important to look into free space > > fragmentation. It seems it needs an *aged* filesystem to recreate. At > > it seems the balances really helped. As I am not able to recreate the > > issue to that extent right now. > > > > So this shows my original idea about free device space to allocate from > > also doesn´t explain it fully. It seems to be something thats going on > > within the chunks that explains the worst case <300 IOPS, kworker using > > one core for minutes and desktop locked scenario. > > > > Is there a way to view free space fragmentation in BTRFS? > > So to rephrase that: > > From what I perceive the worst case issue happens when > > 1) BTRFS cannot reserve any new chunks from unused device space anymore. > > 2) The free space in the existing chunks is highly fragmented. > > Only one of those conditions is not sufficient to trigger it. > > Thats at least my current idea about it.
One more note about the IOPS. I currently let fio run with: martin@merkaba:~> cat ssd-test.fio [global] bs=4k #ioengine=libaio #iodepth=4 size=4g #direct=1 runtime=120 filename=ssd.test.file #[seq-write] #rw=write #stonewall [rand-write] rw=randwrite stonewall This is using buffered I/O write read()/write() system calls. So these IOPS are not regarding the device raw capabilities. I specifically wanted to test through the page cache to simulate what I see with Virtualbox writing to the VDI file (i.e. dirty piling up and dirty_background_ratio in effect). Just like with a real app. But that also means that IOPS may be higher cause fio ends before all of the writes have been completed to disk. That means when I reach <300IOPS with buffered writes, that means that through the pagecache BTRFS was not able to yield a higher IOPS. But it also means that I measure write requests like an application would be doing (unless it uses fsync() or direct I/O which it seems to me Virtualbox doesn´t at least not with every request). Just wanted to make that explicit. Its basically visible in the job file from what I commented out in there, but still, I thought I mention it. I just tested the effect by reducing the test file to 500MiB and the runtime to 10 seconds and I got 98000 IOPS for that. So the larger test file size, but specifically the runtime forces the kernel to do actual writes due to: merkaba:~> grep . /proc/sys/vm/dirty_* /proc/sys/vm/dirty_background_bytes:0 /proc/sys/vm/dirty_background_ratio:10 /proc/sys/vm/dirty_bytes:0 /proc/sys/vm/dirty_expire_centisecs:3000 /proc/sys/vm/dirty_ratio:20 /proc/sys/vm/dirty_writeback_centisecs:500 (standard values, I still see no need to optimize anything in here with those SSDs, not even with the 16 GiB of RAM the laptop has, as the SSDs usually easily can keep up, and I´d rather wait for a change in the default value unless I am convinced of a benefit in manually adapting it in *this* case) Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
signature.asc
Description: This is a digitally signed message part.