On Sunday, 28 December 2014, 14:00:19, Martin Steigerwald wrote:
> On Saturday, 27 December 2014, 14:55:58, Martin Steigerwald wrote:
> > Summarized at
> > 
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for
> > minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > 
> > see below. This is reproducible with fio, no need for Windows XP in
> > Virtualbox to reproduce the issue. Next I will try to reproduce with
> > a freshly created filesystem.
> > 
> > On Saturday, 27 December 2014, 09:30:43, Hugo Mills wrote:
> > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > On Friday, 26 December 2014, 14:48:38, Robert White wrote:
> > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > Hello!
> > > > > > 
> > > > > > First: Have a merry Christmas and enjoy a quiet time in these days.
> > > > > > 
> > > > > > Second: At a time you feel like it, here is a little rant, but also
> > > > > > a bug report:
> > > > > > 
> > > > > > I have this on a 3.18 kernel on Debian Sid with a BTRFS dual-SSD
> > > > > > RAID with space_cache, skinny metadata extents – are these a
> > > > > > problem? – and compress=lzo:
> > > > > 
> > > > > (There is no known problem with skinny metadata; it's actually more
> > > > > efficient than the older format. There have been some anecdotal
> > > > > reports about mixing skinny and fat metadata, but nothing has ever
> > > > > been demonstrated to be problematic.)
> > > > > > merkaba:~> btrfs fi sh /home
> > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > 	Total devices 2 FS bytes used 144.41GiB
> > > > > > 	devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > > > > 	devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > > > > 
> > > > > > Btrfs v3.17
> > > > > > merkaba:~> btrfs fi df /home
> > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > 
> > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > 
> > > > > > And I had hangs with BTRFS again. This time as I wanted to install
> > > > > > tax return software in a Virtualbox'd Windows XP VM (which I use
> > > > > > once a year because I know no tax return software for Linux that
> > > > > > would be suitable for Germany, and I frankly don't care about the
> > > > > > end of security, because all surfing and other network access I
> > > > > > will do from the Linux box and I only run the VM behind a firewall).
> > > > > 
> > > > > > And thus I try the balance dance again:
> > > > > 
> > > > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > > > 
> > > > > "Balancing" is something you should almost never need to do. It is
> > > > > only for cases of changing geometry (adding disks, switching RAID
> > > > > levels, etc.) or for cases when you've radically changed allocation
> > > > > behaviors (like you decided to remove all your VMs, or you've decided
> > > > > to remove a mail spool directory full of thousands of tiny files).
> > > > > 
> > > > > People run balance all the time because they think they should. They
> > > > > are _usually_ incorrect in that belief.
> > > > I only see the lockups of BTRFS if the trees *occupy* all space on the
> > > > device.
> > > 
> > > No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > space. What's more, balance does *not* balance the metadata trees. The
> > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > data, and you have some 13 GiB of that available for use.
> > > 
> > > Now, since you're seeing lockups when the space on your disks is
> > > all allocated, I'd say that's a bug. However, you're the *only* person
> > > who's reported this as a regular occurrence. Does this happen with all
> > > filesystems you have, or just this one?
> > > 
> > > > I *never* so far saw it lock up if there is still space BTRFS can
> > > > allocate from to *extend* a tree.
> > > 
> > > It's not a tree. It's simply space allocation. It's not even space
> > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > saying "I'm going to use this piece of disk for this purpose").
> > > 
> > > > This may be a bug, but this is what I see.
> > > > 
> > > > And no amount of "you should not balance a BTRFS" will make that
> > > > perception go away.
> > > > 
> > > > See, I see the sun coming out in the morning and you tell me "no, it
> > > > doesn't". That is simply not going to match my perception.
> > > 
> > > Duncan's assertion is correct in its detail. Looking at your space
> > 
> > Robert's :)
> > 
> > > usage, I would not suggest that running a balance is something you
> > > need to do. Now, since you have these lockups that seem quite
> > > repeatable, there's probably a lurking bug in there, but hacking
> > > around with balance every time you hit it isn't going to get the
> > > problem solved properly.
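As a side note, Hugo's "some 13 GiB available" figure follows directly from the btrfs fi df output quoted earlier in the thread; a quick check of the arithmetic (values copied from that output, computed here only for illustration):

```python
# Values from "btrfs fi df /home" quoted above (RAID1, per-copy figures).
data_total_gib = 154.97   # Data, RAID1: total
data_used_gib = 141.12    # Data, RAID1: used
meta_total_gib = 5.00     # Metadata, RAID1: total
meta_used_gib = 3.29      # Metadata, RAID1: used

# Free space *inside* already-allocated chunks:
data_slack = data_total_gib - data_used_gib
meta_slack = meta_total_gib - meta_used_gib

print(f"data slack:     {data_slack:.2f} GiB")  # the ~13 GiB Hugo refers to
print(f"metadata slack: {meta_slack:.2f} GiB")
```

So even with every byte of the devices allocated to chunks, there is still almost 14 GiB of unwritten room inside the data chunks themselves.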
> > > I think I would suggest the following:
> > > 
> > > - make sure you have some way of logging your dmesg permanently (use
> > >   a different filesystem for /var/log, or a serial console, or a
> > >   netconsole)
> > > 
> > > - when the lockup happens, hit Alt-SysRq-t a few times
> > > 
> > > - send the dmesg output here, or post it to bugzilla.kernel.org
> > > 
> > > That's probably going to give enough information to the developers
> > > to work out where the lockup is happening, and is clearly the way
> > > forward here.
> > 
> > And I got it reproduced. *Perfectly* reproduced, I'd say.
> > 
> > But let me run through the whole story:
> > 
> > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> 
> [… story of trying to reproduce with Windows XP defragmenting, which was
> unsuccessful as BTRFS still had free device space to allocate new chunks
> from …]
> 
> > But finally I got to:
> > 
> > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > Sa 27. Dez 13:26:39 CET 2014
> > Label: 'home'  uuid: [some UUID]
> > 	Total devices 2 FS bytes used 152.83GiB
> > 	devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > 	devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > 
> > Btrfs v3.17
> > Data, RAID1: total=154.97GiB, used=149.58GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > So I thought: if Virtualbox can write randomly into a file, I can too.
> > > > So I did: > > > > > > martin@merkaba:~> cat ssd-test.fio > > [global] > > bs=4k > > #ioengine=libaio > > #iodepth=4 > > size=4g > > #direct=1 > > runtime=120 > > filename=ssd.test.file > > > > [seq-write] > > rw=write > > stonewall > > > > [rand-write] > > rw=randwrite > > stonewall > > > > > > > > And got: > > > > ATOP - merkaba 2014/12/27 13:41:02 > > ----------- 10s elapsed > > PRC | sys 10.14s | user 0.38s | #proc 332 | #trun 2 | > > #tslpi 548 | #tslpu 0 | #zombie 0 | no procacct | > > CPU | sys 102% | user 4% | irq 0% | idle 295% | > > wait 0% | guest 0% | curf 3.10GHz | curscal 96% | > > cpu | sys 76% | user 0% | irq 0% | idle 24% | > > cpu001 w 0% | guest 0% | curf 3.20GHz | curscal 99% | > > cpu | sys 24% | user 1% | irq 0% | idle 75% | > > cpu000 w 0% | guest 0% | curf 3.19GHz | curscal 99% | > > cpu | sys 1% | user 1% | irq 0% | idle 98% | > > cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > cpu | sys 1% | user 1% | irq 0% | idle 98% | > > cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > CPL | avg1 0.82 | avg5 0.78 | avg15 0.99 | | csw > > 6233 | intr 12023 | | numcpu 4 | > > MEM | tot 15.5G | free 4.0G | cache 9.7G | buff 0.0M | > > slab 333.1M | shmem 206.6M | vmbal 0.0M | hptot 0.0M | > > SWP | tot 12.0G | free 11.7G | | | > > | | vmcom 3.4G | vmlim 19.7G | > > LVM | sata-home | busy 0% | read 8 | write 0 | > > KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms | > > DSK | sda | busy 0% | read 8 | write 0 | > > KiB/w 0 | MBr/s 0.00 | MBw/s 0.00 | avio 0.12 ms | > > NET | transport | tcpi 16 | tcpo 16 | udpi 0 | > > udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | > > NET | network | ipi 16 | ipo 16 | ipfrw 0 | > > deliv 16 | | icmpi 0 | icmpo 0 | > > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so > > 2 Kbps | erri 0 | erro 0 | drpo 0 | > > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW > > RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2 > > 18079 - martin martin 2 9.99s 0.00s 0K 0K > > 0K 16K -- - R 1 100% fio > > 4746 - martin martin 2 
0.01s 0.14s 0K 0K 0K 0K -- - S 2 2% konsole
> > 3291 - martin martin 4 0.01s 0.11s 0K 0K 0K 0K -- - S 0 1% plasma-desktop
> > 1488 - root root 1 0.03s 0.04s 0K 0K 0K 0K -- - S 0 1% Xorg
> > 10036 - root root 1 0.04s 0.02s 0K 0K 0K 0K -- - R 2 1% atop
> > 
> > while fio was just *laying* out the 4 GiB file. Yes, that's 100% system
> > CPU for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > 
> > martin@merkaba:~> LANG=C df -hT /home
> > Filesystem             Type   Size  Used Avail Use% Mounted on
> > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > 
> > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > GiB file, so there was even 4 GiB more free before.)
> > 
> > But it gets even more visible:
> > 
> > martin@merkaba:~> fio ssd-test.fio
> > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > fio-2.1.11
> > Starting 2 processes
> > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01m:57s]
> > 
> > Yes, that's 0 IOPS. 0 IOPS, as in zero IOPS. For minutes.
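As an aside, one way to narrow such a stall down would be a direct-I/O variant of the job file, which simply enables the options already present (commented out) in ssd-test.fio above. With direct=1 the page cache is bypassed, so fio's reported IOPS reflect what actually reaches BTRFS rather than memory writes. A sketch, not a job that was run in this thread:

```ini
[global]
bs=4k
ioengine=libaio
iodepth=4
direct=1
size=4g
runtime=120
filename=ssd.test.file

[rand-write]
rw=randwrite
```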
> > > > > > > > And here is why: > > > > ATOP - merkaba 2014/12/27 13:46:52 > > ----------- 10s elapsed > > PRC | sys 10.77s | user 0.31s | #proc 334 | #trun 2 | > > #tslpi 548 | #tslpu 3 | #zombie 0 | no procacct | > > CPU | sys 108% | user 3% | irq 0% | idle 286% | > > wait 2% | guest 0% | curf 3.08GHz | curscal 96% | > > cpu | sys 72% | user 1% | irq 0% | idle 28% | > > cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > cpu | sys 19% | user 0% | irq 0% | idle 81% | > > cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > cpu | sys 11% | user 1% | irq 0% | idle 87% | > > cpu003 w 1% | guest 0% | curf 3.19GHz | curscal 99% | > > cpu | sys 6% | user 1% | irq 0% | idle 91% | > > cpu002 w 1% | guest 0% | curf 3.11GHz | curscal 97% | > > CPL | avg1 2.78 | avg5 1.34 | avg15 1.12 | | csw > > 50192 | intr 32379 | | numcpu 4 | > > MEM | tot 15.5G | free 5.0G | cache 8.7G | buff 0.0M | > > slab 332.6M | shmem 207.2M | vmbal 0.0M | hptot 0.0M | > > SWP | tot 12.0G | free 11.7G | | | > > | | vmcom 3.4G | vmlim 19.7G | > > LVM | sata-home | busy 5% | read 160 | write 11177 | > > KiB/w 3 | MBr/s 0.06 | MBw/s 4.36 | avio 0.05 ms | > > LVM | msata-home | busy 4% | read 28 | write 11177 | > > KiB/w 3 | MBr/s 0.01 | MBw/s 4.36 | avio 0.04 ms | > > LVM | sata-debian | busy 0% | read 0 | write 844 | > > KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms | > > LVM | msata-debian | busy 0% | read 0 | write 844 | > > KiB/w 4 | MBr/s 0.00 | MBw/s 0.33 | avio 0.02 ms | > > DSK | sda | busy 5% | read 160 | write 10200 | > > KiB/w 4 | MBr/s 0.06 | MBw/s 4.69 | avio 0.05 ms | > > DSK | sdb | busy 4% | read 28 | write 10558 | > > KiB/w 4 | MBr/s 0.01 | MBw/s 4.69 | avio 0.04 ms | > > NET | transport | tcpi 35 | tcpo 33 | udpi 3 | > > udpo 3 | tcpao 2 | tcppo 1 | tcprs 0 | > > NET | network | ipi 38 | ipo 36 | ipfrw 0 | > > deliv 38 | | icmpi 0 | icmpo 0 | > > NET | eth0 0% | pcki 22 | pcko 20 | si 9 Kbps | so > > 2 Kbps | erri 0 | erro 0 | drpo 0 | > > NET | lo ---- | pcki 16 | 
pcko 16 | si 2 Kbps | so > > 2 Kbps | erri 0 | erro 0 | drpo 0 | > > > > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW > > RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 > > 14973 - root root 1 8.92s 0.00s 0K 0K > > 0K 144K -- - S 0 89% kworker/u8:14 > > 17450 - root root 1 0.86s 0.00s 0K 0K > > 0K 32K -- - R 3 9% kworker/u8:5 > > 788 - root root 1 0.25s 0.00s 0K 0K > > 128K 18880K -- - S 3 3% btrfs-transact > > 12254 - root root 1 0.14s 0.00s 0K 0K > > 64K 576K -- - S 2 1% kworker/u8:3 > > 17332 - root root 1 0.11s 0.00s 0K 0K > > 112K 1348K -- - S 2 1% kworker/u8:4 > > 3291 - martin martin 4 0.01s 0.09s 0K 0K > > 0K 0K -- - S 1 1% plasma-deskto > > > > > > > > > > ATOP - merkaba 2014/12/27 13:47:12 > > ----------- 10s elapsed > > PRC | sys 10.78s | user 0.44s | #proc 334 | #trun 3 | > > #tslpi 547 | #tslpu 3 | #zombie 0 | no procacct | > > CPU | sys 106% | user 4% | irq 0% | idle 288% | > > wait 1% | guest 0% | curf 3.00GHz | curscal 93% | > > cpu | sys 93% | user 0% | irq 0% | idle 7% | > > cpu002 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > cpu | sys 7% | user 0% | irq 0% | idle 93% | > > cpu003 w 0% | guest 0% | curf 3.01GHz | curscal 94% | > > cpu | sys 3% | user 2% | irq 0% | idle 94% | > > cpu000 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > cpu | sys 3% | user 2% | irq 0% | idle 95% | > > cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | > > CPL | avg1 3.33 | avg5 1.56 | avg15 1.20 | | csw > > 38253 | intr 23104 | | numcpu 4 | > > MEM | tot 15.5G | free 4.9G | cache 8.7G | buff 0.0M | > > slab 336.5M | shmem 207.2M | vmbal 0.0M | hptot 0.0M | > > SWP | tot 12.0G | free 11.7G | | | > > | | vmcom 3.4G | vmlim 19.7G | > > LVM | msata-home | busy 2% | read 0 | write 2337 | > > KiB/w 3 | MBr/s 0.00 | MBw/s 0.91 | avio 0.07 ms | > > LVM | sata-home | busy 2% | read 36 | write 2337 | > > KiB/w 3 | MBr/s 0.01 | MBw/s 0.91 | avio 0.07 ms | > > LVM | msata-debian | busy 1% | read 1 | write 1630 | > > KiB/w 4 | MBr/s 0.00 | MBw/s 0.65 | avio 0.03 ms | > > 
LVM | sata-debian | busy 0% | read 0 | write 1019 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.41 | avio 0.02 ms |
> > DSK | sdb | busy 2% | read 1 | write 2545 | KiB/w 5 | MBr/s 0.00 | MBw/s 1.45 | avio 0.07 ms |
> > DSK | sda | busy 1% | read 36 | write 2461 | KiB/w 5 | MBr/s 0.01 | MBw/s 1.28 | avio 0.06 ms |
> > NET | transport | tcpi 20 | tcpo 20 | udpi 1 | udpo 1 | tcpao 1 | tcppo 1 | tcprs 0 |
> > NET | network | ipi 21 | ipo 21 | ipfrw 0 | deliv 21 | | icmpi 0 | icmpo 0 |
> > NET | eth0 0% | pcki 5 | pcko 5 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
> > NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |
> > 
> > PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
> > 17450 - root root 1 9.96s 0.00s 0K 0K 0K 0K -- - R 2 100% kworker/u8:5
> > 4746 - martin martin 2 0.06s 0.15s 0K 0K 0K 0K -- - S 1 2% konsole
> > 10508 - root root 1 0.13s 0.00s 0K 0K 96K 4048K -- - S 1 1% kworker/u8:18
> > 1488 - root root 1 0.06s 0.06s 0K 0K 0K 0K -- - S 0 1% Xorg
> > 17332 - root root 1 0.12s 0.00s 0K 0K 96K 580K -- - R 3 1% kworker/u8:4
> > 17454 - root root 1 0.11s 0.00s 0K 0K 32K 4416K -- - D 1 1% kworker/u8:6
> > 17516 - root root 1 0.09s 0.00s 0K 0K 16K 136K -- - S 3 1% kworker/u8:7
> > 3268 - martin martin 3 0.02s 0.05s 0K 0K 0K 0K -- - S 1 1% kwin
> > 10036 - root root 1 0.05s 0.02s 0K 0K 0K 0K -- - R 0 1% atop
> > 
> > So BTRFS is basically busy with itself and doing little else. Look at the
> > SSD usage: they are *idling* around. Heck, 2400 write accesses in 10
> > seconds. That's a joke with SSDs that can do 40000 IOPS (depending on how
> > and what you measure, of course: request size, read vs. write, iodepth
> > and so on).
> > 
> > It's kworker/u8:5 utilizing 100% of one core for minutes.
> > 
> > It's the random write case, it seems.
Here are values from fio job: > > > > martin@merkaba:~> fio ssd-test.fio > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, > > iodepth=1 > > fio-2.1.11 > > Starting 2 processes > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta > > 01h:06m:26s] > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014 > > write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec > > clat (usec): min=3, max=38048, avg=10.52, stdev=205.25 > > lat (usec): min=3, max=38048, avg=10.66, stdev=205.43 > > clat percentiles (usec): > > | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4], > > | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5], > > | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9], > > | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128], > > | 99.99th=[10304] > > bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, > > stdev=201695.20 > > lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12% > > lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01% > > lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% > > cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6 > > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > > >=64=0.0% > > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > > >=64=0.0% > > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > > >=64=0.0% > > issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 > > latency : target=0, window=0, percentile=100.00%, depth=1 > > > > Seems fine. 
> > But:
> > 
> > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> >     clat percentiles (usec):
> >      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
> >      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
> >      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
> >      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
> >      | 99.99th=[16711680]
> >     bw (KB /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
> >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > 
> > Run status group 0 (all jobs):
> >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > 
> > Run status group 1 (all jobs):
> >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec
> > 
> > What? 254 IOPS? With a dual-SSD BTRFS RAID 1?
> > 
> > What?
> > 
> > Ey, *what*?

[…]

> > There we go:
> > 
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for
> > minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> I have done more tests.
> 
> This is on the same /home after extending it to 170 GiB and balancing it
> with btrfs balance start -dusage=80
> 
> It has plenty of free space.
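A quick cross-check shows the two fio runs' figures are internally consistent for 4 KiB blocks, so the 254 IOPS is real throughput, not a reporting glitch (back-of-the-envelope arithmetic on the numbers above, not from the original mail):

```python
# Recompute IOPS and bandwidth from fio's totals (4 KiB per I/O).
def recompute(total_kb, runtime_ms):
    seconds = runtime_ms / 1000
    iops = (total_kb / 4) / seconds      # I/Os completed per second
    bw_kbs = total_kb / seconds          # bandwidth in KB/s
    return round(iops), round(bw_kbs)

# seq-write: io=4096.0MB (= 4194304 KB), runt=12204msec -> fio: iops=85920, bw=343683KB/s
seq = recompute(4096 * 1024, 12204)
# rand-write: io=140336KB, runt=137803msec -> fio: iops=254, bw=1018.4KB/s
rand = recompute(140336, 137803)

print(seq, rand)  # matches fio's reported figures to within rounding
```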
I updated the bug report and hope it gives a summary that is easy enough to
comprehend. The new tests are in:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> 
> Pasting below for discussion on the list. Summary: I easily get 38000 (!)
> IOPS. It may be an idea to reduce to 160 GiB, but right now this does not
> work, as it says no free space on device when trying to downsize it. I may
> try with 165 or 162 GiB.
> 
> So now we have three IOPS figures:
> 
> - 256 IOPS in the worst case scenario
> - 4700 IOPS when trying to reproduce the worst case scenario with a fresh
>   and small BTRFS
> - 38000 IOPS when /home has unused device space to allocate chunks from
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> 
> This is another test.
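The spread between those three scenarios is worth quantifying (simple arithmetic on the figures above, added here for illustration):

```python
# The three IOPS figures from the tests above.
worst = 256               # fully allocated, aged /home
fresh_small = 4700        # fresh, small filesystem, same condition
unallocated_free = 38000  # /home with unallocated device space left

print(f"fresh/small vs. worst:        {fresh_small / worst:.0f}x")
print(f"unallocated space vs. worst:  {unallocated_free / worst:.0f}x")
```

So merely having unallocated device space to draw chunks from accounts for roughly two orders of magnitude in this workload.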
Okay, and this is the last series of tests for today. Conclusion: I cannot
manage to bring it to its knees as before, but I got close. Still, it's 8000
IOPS instead of 250 IOPS, in a situation that, according to btrfs fi sh, is
even *worse* than before.

That hints at the need to look at free space fragmentation, as in the
beginning the problem started appearing with:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
	Total devices 2 FS bytes used 144.41GiB
	devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
	devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17

merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Yes, that's 13 GiB of free space *within* the chunks.

So while I can push the IOPS down by bringing it into a situation where it
cannot reserve additional data chunks again, I cannot recreate the abysmal
250 IOPS figure this way. Not even with my /home filesystem. So there is more
to it. I think it's important to look into free space fragmentation. It seems
it needs an *aged* filesystem to recreate. And it seems the balances really
helped, as I am not able to recreate the issue to that extent right now.

So this shows my original idea about free device space to allocate from also
doesn't explain it fully. It seems to be something going on *within* the
chunks that explains the worst case <300 IOPS, kworker using one core for
minutes, desktop-locked scenario.

Is there a way to view free space fragmentation in BTRFS?

Test log follows, also added to the bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=90401#c9

Okay, retesting with

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28.
Dez 14:01:05 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.15GiB devid 1 size 163.00GiB used 159.92GiB path /dev/mapper/msata-home devid 2 size 163.00GiB used 159.92GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.95GiB, used=151.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B Thats is just 3 GiB to reserve new data chunks from from. First run – all good: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/134.2MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10483: Sun Dec 28 14:02:59 2014 write: io=4096.0MB, bw=218101KB/s, iops=54525, runt= 19231msec clat (usec): min=4, max=20056, avg=14.87, stdev=143.15 lat (usec): min=4, max=20056, avg=15.09, stdev=143.26 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 52], 99.50th=[ 99], 99.90th=[ 434], 99.95th=[ 980], | 99.99th=[ 7968] bw (KB /s): min=62600, max=424456, per=100.00%, avg=218821.63, stdev=93695.28 lat (usec) : 10=38.19%, 20=58.83%, 50=1.90%, 100=0.59%, 250=0.33% lat (usec) : 500=0.09%, 750=0.03%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01% cpu : usr=15.50%, sys=61.86%, ctx=93432, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=218101KB/s, minb=218101KB/s, 
maxb=218101KB/s, mint=19231msec, maxt=19231msec Second run: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:04:01 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.23GiB devid 1 size 163.00GiB used 160.95GiB path /dev/mapper/msata-home devid 2 size 163.00GiB used 160.95GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.98GiB, used=151.91GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.32GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/171.3MB/0KB /s] [0/43.9K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10501: Sun Dec 28 14:05:03 2014 write: io=4096.0MB, bw=220637KB/s, iops=55159, runt= 19010msec clat (usec): min=4, max=20578, avg=14.45, stdev=160.84 lat (usec): min=4, max=20578, avg=14.65, stdev=160.88 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 7], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 42], 99.50th=[ 79], 99.90th=[ 278], 99.95th=[ 620], | 99.99th=[ 9792] bw (KB /s): min= 5, max=454816, per=100.00%, avg=224700.32, stdev=100763.29 lat (usec) : 10=38.15%, 20=58.73%, 50=2.28%, 100=0.47%, 250=0.26% lat (usec) : 500=0.06%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01% cpu : usr=15.83%, sys=63.17%, ctx=74934, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, 
aggrb=220636KB/s, minb=220636KB/s, maxb=220636KB/s, mint=19010msec, maxt=19010msec Okay, now try the same without space for a free chunk to allocate. The testfile is still there, fio doesn´t delete and recreate it on every attempt, but just writes into it: martin@merkaba:~> ls -l ssd.test.file -rw-r--r-- 1 martin martin 4294967296 Dez 28 14:05 ssd.test.file Okay – with still one chunk to allocate: merkaba:~> btrfs filesystem resize 1:161G /home Resize '/home' of '1:161G' merkaba:~> btrfs filesystem resize 2:161G /home Resize '/home' of '2:161G' merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:08:45 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.15GiB devid 1 size 161.00GiB used 159.92GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 159.92GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.95GiB, used=151.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B I would like to have it allocate the chunks by other means, but it frees them eventually afterwards again, so I did it this way. Note, we still have the original file there. The space it currently occupies is already taken into account. 
Next test: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/130.5MB/0KB /s] [0/33.5K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10563: Sun Dec 28 14:10:34 2014 write: io=4096.0MB, bw=210526KB/s, iops=52631, runt= 19923msec clat (usec): min=4, max=21820, avg=14.78, stdev=119.40 lat (usec): min=4, max=21821, avg=15.03, stdev=120.26 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 14], 95.00th=[ 17], | 99.00th=[ 62], 99.50th=[ 131], 99.90th=[ 490], 99.95th=[ 964], | 99.99th=[ 6816] bw (KB /s): min= 3, max=410480, per=100.00%, avg=216892.84, stdev=95620.33 lat (usec) : 10=33.20%, 20=63.71%, 50=1.86%, 100=0.59%, 250=0.42% lat (usec) : 500=0.12%, 750=0.03%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% cpu : usr=15.13%, sys=62.74%, ctx=94346, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=210525KB/s, minb=210525KB/s, maxb=210525KB/s, mint=19923msec, maxt=19923msec Okay, this is still good. merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. 
Dez 14:11:18 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.17GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=151.86GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Now there is no space left to reserve additional chunks.

Another test:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/152.3MB/0KB /s] [0/38.1K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10580: Sun Dec 28 14:13:26 2014
  write: io=4096.0MB, bw=225804KB/s, iops=56450, runt= 18575msec
    clat (usec): min=4, max=16669, avg=13.66, stdev=72.88
     lat (usec): min=4, max=16669, avg=13.89, stdev=73.06
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   20],
     | 99.00th=[   65], 99.50th=[  113], 99.90th=[  314], 99.95th=[  506],
     | 99.99th=[ 2768]
    bw (KB  /s): min=    4, max=444568, per=100.00%, avg=231326.97, stdev=93374.31
    lat (usec) : 10=36.50%, 20=58.44%, 50=3.73%, 100=0.76%, 250=0.44%
    lat (usec) : 500=0.09%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=16.35%, sys=68.39%, ctx=127221, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=225803KB/s, minb=225803KB/s, maxb=225803KB/s, mint=18575msec, maxt=18575msec

Okay, this still does not trigger it. Another test; it even freed a chunk:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:14:21 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.28GiB
        devid    1 size 161.00GiB used 160.85GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.85GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.89GiB, used=151.97GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Still good:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/156.5MB/0KB /s] [0/40.6K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10589: Sun Dec 28 14:14:37 2014
  write: io=4096.0MB, bw=161121KB/s, iops=40280, runt= 26032msec
    clat (usec): min=4, max=1228.9K, avg=15.69, stdev=1205.88
     lat (usec): min=4, max=1228.9K, avg=15.92, stdev=1205.90
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   19],
     | 99.00th=[   53], 99.50th=[   96], 99.90th=[  366], 99.95th=[  764],
     | 99.99th=[ 7776]
    bw (KB  /s): min=    0, max=431680, per=100.00%, avg=219856.30, stdev=98172.64
    lat (usec) : 10=39.24%, 20=55.83%, 50=3.81%, 100=0.63%, 250=0.33%
    lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 2000=0.01%
  cpu          : usr=11.50%, sys=61.08%, ctx=123428, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=161121KB/s, minb=161121KB/s, maxb=161121KB/s, mint=26032msec, maxt=26032msec

Okay, let's allocate one GiB with fallocate to make free space tighter:

martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-1
0.00user 0.09system 0:00.11elapsed 86%CPU (0avgtext+0avgdata 1752maxresident)k
112inputs+64outputs (1major+89minor)pagefaults 0swaps
martin@merkaba:~> ls -lh 1g-1
-rw-r--r-- 1 martin martin 1,0G Dez 28 14:16 1g

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:16:24 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 156.15GiB
        devid    1 size 161.00GiB used 160.94GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.94GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.97GiB, used=152.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Still not:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/132.1MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10632: Sun Dec 28 14:17:12 2014
  write: io=4096.0MB, bw=198773KB/s, iops=49693, runt= 21101msec
    clat (usec): min=4, max=543255, avg=16.27, stdev=563.85
     lat (usec): min=4, max=543255, avg=16.48, stdev=563.91
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   14], 95.00th=[   17],
     | 99.00th=[   49], 99.50th=[   98], 99.90th=[  386], 99.95th=[  828],
     | 99.99th=[10816]
    bw (KB  /s): min=    4, max=444848, per=100.00%, avg=203909.07, stdev=109502.11
    lat (usec) : 10=33.97%, 20=62.99%, 50=2.05%, 100=0.51%, 250=0.33%
    lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 750=0.01%
  cpu          : usr=14.21%, sys=60.44%, ctx=70273, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=198772KB/s, minb=198772KB/s, maxb=198772KB/s, mint=21101msec, maxt=21101msec

Another 1G file. Got it:

ATOP - merkaba    2014/12/28  14:18:14    ----------- 10s elapsed
PRC | sys 21.74s | user 2.48s | #proc 382 | #trun 8 | #tslpi 698 | #tslpu 1 | #zombie 0 | no procacct |
CPU | sys 218% | user 24% | irq 1% | idle 155% | wait 2% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 75% | user 5% | irq 0% | idle 19% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 59% | user 3% | irq 0% | idle 37% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 48% | user 6% | irq 1% | idle 45% | cpu000 w 1% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 36% | user 9% | irq 0% | idle 54% | cpu002 w 1% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 2.13 | avg5 2.37 | avg15 1.92 | csw 67473 | intr 59152 | numcpu 4 |
MEM | tot 15.5G | free 1.1G | cache 11.0G | buff 0.1M | slab 740.2M | shmem 190.9M | vmbal 0.0M | hptot 0.0M |
SWP | tot 12.0G | free 11.4G | vmcom 5.4G | vmlim 19.7G |
PAG | scan 0 | steal 0 | stall 1 | swin 19 | swout 0 |
LVM | sata-home | busy 8% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.03 ms |
LVM | msata-home | busy 5% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.02 ms |
LVM | sata-swap | busy 0% | read 19 | write 0 | KiB/w 0 | MBr/s 0.01 | MBw/s 0.00 | avio 0.05 ms |
LVM | msata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
LVM | sata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms |
DSK | sda | busy 8% | read 23 | write 13239 | KiB/w 7 | MBr/s 0.01 | MBw/s 10.18 | avio 0.06 ms |
DSK | sdb | busy 5% | read 4 | write 14360 | KiB/w 7 | MBr/s 0.00 | MBw/s 10.18 | avio 0.04 ms |
NET | transport | tcpi 18 | tcpo 18 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 |
NET | network | ipi 18 | ipo 18 | ipfrw 0 | deliv 18 | icmpi 0 | icmpo 0 |
NET | eth0 0% | pcki 2 | pcko 2 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 |

  PID TID RUID   EUID   THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR  CPU CMD        1/5
10657   - martin martin   1  9.88s  0.00s    0K    0K    0K   48K --   - R     1  99% fallocate
 9685   - root   root     1  9.84s  0.00s    0K    0K    0K    0K --   - D     0  99% kworker/u8:10

martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-2 ; ls -l 1g*
0.00user 59.28system 1:00.21elapsed 98%CPU (0avgtext+0avgdata 1756maxresident)k
0inputs+416outputs (0major+90minor)pagefaults 0swaps
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:16 1g-1
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:17 1g-2

One minute of system CPU time for this. Okay, so now another test:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:19:30 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 157.18GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=153.87GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I admit this really isn't nice to the filesystem anymore, but I want to see where it starts to become an issue.
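For reference: the ssd-test.fio job file itself was never posted in this thread. Judging from the fio banner (rw=randwrite, bs=4K, ioengine=sync, iodepth=1), the 4096 MB total write size, and the 120 s runtime cap visible in the last run, it was presumably something close to this sketch; the size, runtime, and filename are my guesses, not the original file:

```ini
; ssd-test.fio -- reconstructed from the fio output above, NOT the
; original job file; size/runtime/filename are assumptions.
[rand-write]
rw=randwrite
ioengine=sync
iodepth=1
bs=4k
size=4g
runtime=120
filename=fio-testfile
```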
ATOP - merkaba    2014/12/28  14:21:18    ----------- 1s elapsed
PRC | sys 1.15s | user 0.16s | #proc 382 | #trun 2 | #tslpi 707 | #tslpu 1 | #zombie 0 | clones 0 | no procacct |
CPU | sys 163% | user 24% | irq 1% | idle 189% | wait 26% | steal 0% | guest 0% | curf 3.01GHz | curscal 94% |
cpu | sys 72% | user 1% | irq 0% | idle 25% | cpu001 w 1% | steal 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 41% | user 9% | irq 0% | idle 32% | cpu002 w 19% | steal 0% | guest 0% | curf 3.00GHz | curscal 93% |
cpu | sys 34% | user 10% | irq 0% | idle 53% | cpu003 w 3% | steal 0% | guest 0% | curf 3.03GHz | curscal 94% |
cpu | sys 16% | user 4% | irq 0% | idle 77% | cpu000 w 3% | steal 0% | guest 0% | curf 3.00GHz | curscal 93% |
CPL | avg1 2.37 | avg5 2.64 | avg15 2.13 | csw 18687 | intr 12435 | numcpu 4 |
MEM | tot 15.5G | free 2.5G | cache 9.5G | buff 0.1M | slab 742.6M | shmem 242.8M | shrss 115.5M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 12.0G | free 11.4G | vmcom 5.5G | vmlim 19.7G |
LVM | msata-home | busy 71% | read 28 | write 8134 | KiB/r 4 | KiB/w 3 | MBr/s 0.11 | MBw/s 31.68 | avq 13.21 | avio 0.06 ms |
LVM | sata-home | busy 40% | read 72 | write 8135 | KiB/r 4 | KiB/w 3 | MBr/s 0.28 | MBw/s 31.69 | avq 41.67 | avio 0.03 ms |
DSK | sdb | busy 71% | read 24 | write 6049 | KiB/r 4 | KiB/w 5 | MBr/s 0.11 | MBw/s 31.68 | avq 5.64 | avio 0.08 ms |
DSK | sda | busy 38% | read 60 | write 5987 | KiB/r 4 | KiB/w 5 | MBr/s 0.28 | MBw/s 31.69 | avq 20.40 | avio 0.04 ms |
NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | tcpie 0 | udpie 0 |
NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | icmpi 0 | icmpo 0 |
NET | lo ---- | pcki 16 | pcko 16 | si 20 Kbps | so 20 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |

  PID TID RUID   EUID   THR SYSCPU USRCPU VGROW RGROW RDDSK  WRDSK ST EXC S CPUNR  CPU CMD        1/2
10459   - root   root     1  0.70s  0.00s    0K    0K    0K     0K --   - D     3 100% kworker/u8:17
10674   - martin martin   1  0.20s  0.01s    0K    0K    0K 27504K --   - R     0  30% fio

Okay. It is jumping between 0 IOPS and 40000 IOPS, with one kworker thread hogging 100% of a core. Still quite okay in terms of IOPS, though:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/126.8MB/0KB /s] [0/32.5K/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10674: Sun Dec 28 14:22:16 2014
  write: io=3801.3MB, bw=32415KB/s, iops=8103, runt=120083msec
    clat (usec): min=4, max=1809.9K, avg=83.88, stdev=3615.98
     lat (usec): min=4, max=1809.9K, avg=84.10, stdev=3616.00
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   11],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   15], 95.00th=[   18],
     | 99.00th=[   52], 99.50th=[  124], 99.90th=[24704], 99.95th=[30592],
     | 99.99th=[57088]
    bw (KB  /s): min=    0, max=417544, per=100.00%, avg=48302.16, stdev=89108.07
    lat (usec) : 10=35.61%, 20=60.17%, 50=3.17%, 100=0.47%, 250=0.27%
    lat (usec) : 500=0.05%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.04%, 50=0.16%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=2.37%, sys=29.74%, ctx=202984, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=973128/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=3801.3MB, aggrb=32415KB/s, minb=32415KB/s, maxb=32415KB/s, mint=120083msec, maxt=120083msec

I'll stop this here now. It's interesting to see that even with:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:23:11 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 157.89GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=154.59GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I am not able to fully reproduce it. I can partly reproduce it, but it behaves way better than before. I think to go further one needs to look at the free space fragmentation inside the chunks. Because in the beginning, as reported, I had this on a 3.18 kernel on Debian Sid with a BTRFS dual SSD RAID 1 with space_cache, skinny metadata extents – are these a problem? – and compress=lzo:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

So I had quite some free space *within* the chunks and it still was a problem.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
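P.S.: The per-chunk headroom I am talking about can be read straight off the btrfs fi df output: total minus used per chunk type is the space that is allocated but free, and thus possibly badly fragmented. A minimal sketch to make those numbers explicit; the function name and the parsing (targeting the btrfs-progs 3.17 output format shown above) are mine, not part of any btrfs tool:

```python
import re

# Unit factors to GiB; plain "B" shows up in GlobalReserve lines.
UNIT = {"B": 1 / 1024**3, "KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0}

def chunk_slack(fi_df_output):
    """Parse `btrfs fi df` output (btrfs-progs 3.17 format) and return
    {chunk type: (total GiB, used GiB, slack GiB)}."""
    result = {}
    for line in fi_df_output.splitlines():
        m = re.match(r"(\w+), (\w+): total=([\d.]+)(\w+), used=([\d.]+)(\w+)",
                     line.strip())
        if m:
            kind, _profile, total, t_unit, used, u_unit = m.groups()
            t = float(total) * UNIT[t_unit]
            u = float(used) * UNIT[u_unit]
            result[kind] = (t, u, t - u)
    return result

# The last `btrfs fi df /home` output quoted above.
sample = """\
Data, RAID1: total=155.94GiB, used=154.59GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

for kind, (total, used, slack) in chunk_slack(sample).items():
    print(f"{kind}: {slack:.2f} GiB free inside allocated chunks")
```

For the Data chunks this reports roughly 1.35 GiB of free space inside allocated chunks, which says nothing about how scattered that space is; for that one would still need to walk the extent tree.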