On Sunday 28 December 2014 14:00:19 Martin Steigerwald wrote:
> On Saturday 27 December 2014 14:55:58 Martin Steigerwald wrote:
> > Summarized at
> > 
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
> > minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > 
> > see below. This is reproducible with fio, no need for Windows XP in
> > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > a freshly created filesystem.
> > 
> > 
> > On Saturday 27 December 2014 09:30:43 Hugo Mills wrote:
> > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > On Friday 26 December 2014 14:48:38 Robert White wrote:
> > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > Hello!
> > > > > > 
> > > > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > > > 
> > > > > > Second: At a time you feel like it, here is a little rant, but
> > > > > > also a bug report:
> > > > > > 
> > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID
> > > > > > with space_cache, skinny metadata extents – are these a problem? –
> > > > > > and compress=lzo:
> > > > > (there is no known problem with skinny metadata, it's actually more
> > > > > efficient than the older format. There has been some anecdotes about
> > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > demonstrated problematic.)
> > > > > 
> > > > > > merkaba:~> btrfs fi sh /home
> > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > 
> > > > > >          Total devices 2 FS bytes used 144.41GiB
> > > > > >          devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> > > > > >          devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > > > > > 
> > > > > > Btrfs v3.17
> > > > > > merkaba:~> btrfs fi df /home
> > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > 
> > > > > This filesystem, at the allocation level, is "very full" (see below).
> > > > > 
> > > > > > And I had hangs with BTRFS again. This time as I wanted to install
> > > > > > tax return software in a Virtualbox'd Windows XP VM (which I use
> > > > > > once a year because I know of no tax return software for Linux that
> > > > > > would be suitable for Germany, and I frankly don't care about the
> > > > > > end of security updates, since all surfing and other network access
> > > > > > I will do from the Linux box and I only run the VM behind a
> > > > > > firewall).
> > > > > 
> > > > > > And thus I try the balance dance again:
> > > > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > > > 
> > > > > "Balancing" is something you should almost never need to do. It is
> > > > > only for cases of changing geometry (adding disks, switching RAID
> > > > > levels, etc.) or for cases when you've radically changed allocation
> > > > > behaviors (like you decided to remove all your VMs or you've decided
> > > > > to remove a mail spool directory full of thousands of tiny files).
> > > > > 
> > > > > People run balance all the time because they think they should. They
> > > > > are _usually_ incorrect in that belief.
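
(For reference, when one of those reshaping cases does apply, a filtered
balance keeps the work bounded. A minimal sketch – the usage threshold is just
an example value:

  # only rewrite data chunks that are at most 50% used, then check progress
  btrfs balance start -dusage=50 /home
  btrfs balance status /home
)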
> > > > 
> > > > I only see the lockups of BTRFS if the trees *occupy* all space on the
> > > > device.
> > >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > space. What's more, balance does *not* balance the metadata trees. The
> > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > data, and you have some 13 GiB of that available for use.
> > > 
> > >    Now, since you're seeing lockups when the space on your disks is
> > > all allocated I'd say that's a bug. However, you're the *only* person
> > > who's reported this as a regular occurrence. Does this happen with all
> > > filesystems you have, or just this one?
> > > 
> > > > I *never* so far saw it lock up if there is still space BTRFS can
> > > > allocate from to *extend* a tree.
> > > 
> > >    It's not a tree. It's simply space allocation. It's not even space
> > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > saying "I'm going to use this piece of disk for this purpose").
> > > 
> > > > This may be a bug, but this is what I see.
> > > > 
> > > > And no amount of "you should not balance a BTRFS" will make that
> > > > perception go away.
> > > > 
> > > > See, I see the sun coming out in the morning and you tell me "no, it
> > > > doesn't". Simply, that is not going to match my perception.
> > > 
> > >    Duncan's assertion is correct in its detail. Looking at your space
> > 
> > Robert's :)
> > 
> > > usage, I would not suggest that running a balance is something you
> > > need to do. Now, since you have these lockups that seem quite
> > > repeatable, there's probably a lurking bug in there, but hacking
> > > around with balance every time you hit it isn't going to get the
> > > problem solved properly.
> > > 
> > >    I think I would suggest the following:
> > > 
> > >  - make sure you have some way of logging your dmesg permanently (use
> > >    a different filesystem for /var/log, or a serial console, or a
> > >    netconsole)
> > > 
> > >  - when the lockup happens, hit Alt-SysRq-t a few times
> > > 
> > >  - send the dmesg output here, or post to bugzilla.kernel.org
> > > 
> > >    That's probably going to give enough information to the developers
> > > to work out where the lockup is happening, and is clearly the way
> > > forward here.
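
(A minimal sketch of such a setup – the addresses and interface name for
netconsole are just examples, and writing 't' to /proc/sysrq-trigger is the
non-keyboard equivalent of Alt-SysRq-t:

  # send kernel messages to another machine over UDP
  modprobe netconsole netconsole=@192.168.0.2/eth0,6666@192.168.0.10/
  # on the receiving machine: nc -u -l 6666   (or nc -u -l -p 6666,
  # depending on the netcat flavour), redirected into a log file

  # when the lockup happens, dump all task states a few times
  echo t > /proc/sysrq-trigger
)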
> > 
> > And I got it reproduced. *Perfectly* reproduced, I'd say.
> > 
> > But let me run the whole story:
> > 
> > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> 
> [… story of trying to reproduce with Windows XP defragmenting which was
> unsuccessful as BTRFS still had free device space to allocate new chunks
> from …]
> 
> > But finally I got to:
> > 
> > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > Sa 27. Dez 13:26:39 CET 2014
> > Label: 'home'  uuid: [some UUID]
> >         Total devices 2 FS bytes used 152.83GiB
> >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > 
> > Btrfs v3.17
> > Data, RAID1: total=154.97GiB, used=149.58GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > 
> > 
> > So I figured: if Virtualbox can write randomly into a file, I can too.
> > 
> > So I did:
> > 
> > 
> > martin@merkaba:~> cat ssd-test.fio 
> > [global]
> > bs=4k
> > #ioengine=libaio
> > #iodepth=4
> > size=4g
> > #direct=1
> > runtime=120
> > filename=ssd.test.file
> > 
> > [seq-write]
> > rw=write
> > stonewall
> > 
> > [rand-write]
> > rw=randwrite
> > stonewall
> > 
> > 
> > 
> > And got:
> > 
> > ATOP - merkaba                          2014/12/27  13:41:02                
> >           -----------                           10s elapsed
> > PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  
> > #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> > CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  
> > wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> > cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  
> > cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> > cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  
> > cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  
> > cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  
> > cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw 
> >     6233 |  intr   12023 |                | numcpu     4  |
> > MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  
> > slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> > SWP |  tot    12.0G |  free   11.7G |                |               |      
> >          |               |  vmcom   3.4G  | vmlim  19.7G  |
> > LVM |     sata-home |  busy      0% |  read       8  | write      0  |  
> > KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > DSK |           sda |  busy      0% |  read       8  | write      0  |  
> > KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  
> > udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  
> > deliv     16 |               |  icmpi      0  | icmpo      0  |
> > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so  
> >   2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > 
> >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW  
> >  RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> > 18079      -   martin    martin        2   9.99s    0.00s      0K       0K  
> >     0K      16K  --     -  R       1  100%   fio
> >  4746      -   martin    martin        2   0.01s    0.14s      0K       0K  
> >     0K       0K  --     -  S       2    2%   konsole
> >  3291      -   martin    martin        4   0.01s    0.11s      0K       0K  
> >     0K       0K  --     -  S       0    1%   plasma-desktop
> >  1488      -   root      root          1   0.03s    0.04s      0K       0K  
> >     0K       0K  --     -  S       0    1%   Xorg
> > 10036      -   root      root          1   0.04s    0.02s      0K       0K  
> >     0K       0K  --     -  R       2    1%   atop
> > 
> > while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU
> > for 10 seconds while allocating a 4 GiB file on a filesystem like:
> > 
> > martin@merkaba:~> LANG=C df -hT /home
> > Filesystem             Type   Size  Used Avail Use% Mounted on
> > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > 
> > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > GiB file. So there was even 4 GiB more free before.)
> > 
> > 
> > But it gets even more visible:
> > 
> > martin@merkaba:~> fio ssd-test.fio
> > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, 
> > iodepth=1
> > fio-2.1.11
> > Starting 2 processes
> > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
> > 01m:57s]       
> > 
> > 
> > yes, that's 0 IOPS.
> > 
> > 0 IOPS, as in zero IOPS. For minutes.
> > 
> > 
> > 
> > And here is why:
> > 
> > ATOP - merkaba                          2014/12/27  13:46:52                
> >           -----------                           10s elapsed
> > PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  
> > #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  
> > wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> > cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  
> > cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  
> > cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  
> > cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  
> > cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> > CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw 
> >    50192 |  intr   32379 |                | numcpu     4  |
> > MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  
> > slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > SWP |  tot    12.0G |  free   11.7G |                |               |      
> >          |               |  vmcom   3.4G  | vmlim  19.7G  |
> > LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  
> > KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> > LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  
> > KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> > LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  
> > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  
> > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > DSK |           sda |  busy      5% |  read     160  | write  10200  |  
> > KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> > DSK |           sdb |  busy      4% |  read      28  | write  10558  |  
> > KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> > NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  
> > udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> > NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  
> > deliv     38 |               |  icmpi      0  | icmpo      0  |
> > NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so  
> >   2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so  
> >   2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > 
> >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW  
> >  RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > 14973      -   root      root          1   8.92s    0.00s      0K       0K  
> >     0K     144K  --     -  S       0   89%   kworker/u8:14
> > 17450      -   root      root          1   0.86s    0.00s      0K       0K  
> >     0K      32K  --     -  R       3    9%   kworker/u8:5
> >   788      -   root      root          1   0.25s    0.00s      0K       0K  
> >   128K   18880K  --     -  S       3    3%   btrfs-transact
> > 12254      -   root      root          1   0.14s    0.00s      0K       0K  
> >    64K     576K  --     -  S       2    1%   kworker/u8:3
> > 17332      -   root      root          1   0.11s    0.00s      0K       0K  
> >   112K    1348K  --     -  S       2    1%   kworker/u8:4
> >  3291      -   martin    martin        4   0.01s    0.09s      0K       0K  
> >     0K       0K  --     -  S       1    1%   plasma-deskto
> > 
> > 
> > 
> > 
> > ATOP - merkaba                          2014/12/27  13:47:12                
> >           -----------                           10s elapsed
> > PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  
> > #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  
> > wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  
> > cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  
> > cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> > cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  
> > cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  
> > cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw 
> >    38253 |  intr   23104 |                | numcpu     4  |
> > MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  
> > slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > SWP |  tot    12.0G |  free   11.7G |                |               |      
> >          |               |  vmcom   3.4G  | vmlim  19.7G  |
> > LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  
> > KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> > LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  
> > KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> > LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  
> > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> > LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  
> > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> > DSK |           sdb |  busy      2% |  read       1  | write   2545  |  
> > KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> > DSK |           sda |  busy      1% |  read      36  | write   2461  |  
> > KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> > NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  
> > udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  
> > deliv     21 |               |  icmpi      0  | icmpo      0  |
> > NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so  
> >   0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so  
> >   2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > 
> >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW  
> >  RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > 17450      -   root      root          1   9.96s    0.00s      0K       0K  
> >     0K       0K  --     -  R       2  100%   kworker/u8:5
> >  4746      -   martin    martin        2   0.06s    0.15s      0K       0K  
> >     0K       0K  --     -  S       1    2%   konsole
> > 10508      -   root      root          1   0.13s    0.00s      0K       0K  
> >    96K    4048K  --     -  S       1    1%   kworker/u8:18
> >  1488      -   root      root          1   0.06s    0.06s      0K       0K  
> >     0K       0K  --     -  S       0    1%   Xorg
> > 17332      -   root      root          1   0.12s    0.00s      0K       0K  
> >    96K     580K  --     -  R       3    1%   kworker/u8:4
> > 17454      -   root      root          1   0.11s    0.00s      0K       0K  
> >    32K    4416K  --     -  D       1    1%   kworker/u8:6
> > 17516      -   root      root          1   0.09s    0.00s      0K       0K  
> >    16K     136K  --     -  S       3    1%   kworker/u8:7
> >  3268      -   martin    martin        3   0.02s    0.05s      0K       0K  
> >     0K       0K  --     -  S       1    1%   kwin
> > 10036      -   root      root          1   0.05s    0.02s      0K       0K  
> >     0K       0K  --     -  R       0    1%   atop
> > 
> > 
> > 
> > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > usage: the SSDs are *idling* around. Heck, 2400 write accesses in 10 seconds.
> > That's a joke with SSDs that can do 40000 IOPS (depending on how and what
> > you measure, of course, like request size, read, write, iodepth and so on).
> > 
> > It's kworker/u8:5 utilizing 100% of one core for minutes.
> > 
> > 
> > 
> > It's the random write case, it seems. Here are the values from the fio job:
> > 
> > martin@merkaba:~> fio ssd-test.fio
> > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, 
> > iodepth=1
> > fio-2.1.11
> > Starting 2 processes
> > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
> > 01h:06m:26s]
> > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
> >   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> >     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> >      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> >     clat percentiles (usec):
> >      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
> >      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
> >      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
> >      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
> >      | 99.99th=[10304]
> >     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, 
> > stdev=201695.20
> >     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> >     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> >     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> >   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
> > >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > >=64=0.0%
> >      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > 
> > Seems fine.
> > 
> > 
> > But:
> > 
> > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
> >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> >     clat percentiles (usec):
> >      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
> >      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
> >      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
> >      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
> >      | 99.99th=[16711680]
> >     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, 
> > stdev=938.02
> >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
> > >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > >=64=0.0%
> >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > 
> > Run status group 0 (all jobs):
> >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, 
> > mint=12204msec, maxt=12204msec
> > 
> > Run status group 1 (all jobs):
> >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, 
> > mint=137803msec, maxt=137803msec
> > 
> > 
> > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > 
> > What?
> > 
> > Ey, *what*?
[…] 
> > There we go:
> > 
> > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
> > minutes on random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> I have done more tests.
> 
> This is on the same /home after extending it to 170 GiB and balancing it with
> btrfs balance start -dusage=80
> 
> It has plenty of free space. I updated the bug report and hope it gives an
> easy enough to comprehend summary. The new tests are in:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> 
> 
> 
> Pasting below for discussion on the list. Summary: I easily get 38000 (!)
> IOPS. It may be an idea to reduce to 160 GiB again, but right now this does
> not work, as it says there is no free space on the device when trying to
> downsize it. I may try with 165 or 162 GiB.
> 
> So now we have three IOPS figures:
> 
> - 256 IOPS in worst case scenario
> - 4700 IOPS when trying to reproduce worst case scenario with a fresh and
>   small BTRFS
> - 38000 IOPS when /home has unused device space to allocate chunks from
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> 
> 
> This is another test.


Okay, and this is the last series of tests for today.

Conclusion:

I cannot manage to bring it to its knees as before, but I come close to it.

Still it's 8000 IOPS instead of 250 IOPS, in a situation that is, according to
btrfs fi sh, even *worse* than before.

That hints at the need to look at free space fragmentation, as in the
beginning the problem started appearing with:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



Yes, that's 13 GiB of free space *within* the chunks.

So while I can push the IOPS down by bringing it into a situation where it
cannot reserve additional data chunks again, I cannot recreate the
abysmal 250 IOPS figure this way. Not even with my /home filesystem.

So there is more to it. I think it's important to look into free space
fragmentation. It seems it needs an *aged* filesystem to recreate. And it
seems the balances really helped, as I am not able to recreate the
issue to that extent right now.

So this shows that my original idea about free device space to allocate from
doesn't explain it fully either. It seems to be something going on within the
chunks that explains the worst case scenario: <300 IOPS, a kworker using one
core for minutes, and a locked-up desktop.

Is there a way to view free space fragmentation in BTRFS?
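
(I am not aware of a ready-made command for this. As a rough idea – not a
recipe, since the dump format differs between btrfs-progs versions and this
also counts gaps between chunks – one could look at the gaps between the
allocated extents in the extent tree, e.g. with btrfs-debug-tree and GNU awk:

  # print sizes of gaps between allocated data extents, largest last
  # (best taken from an unmounted or otherwise quiesced filesystem)
  btrfs-debug-tree -t 2 /dev/mapper/msata-home | \
    gawk 'match($0, /key \(([0-9]+) EXTENT_ITEM ([0-9]+)\)/, m) {
            start = m[1] + 0; len = m[2] + 0
            if (end && start > end) print start - end
            end = start + len
          }' | sort -n | tail
)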




Test log follows, also added to bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=90401#c9


Okay, retesting with 

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:01:05 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.15GiB
        devid    1 size 163.00GiB used 159.92GiB path /dev/mapper/msata-home
        devid    2 size 163.00GiB used 159.92GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=154.95GiB, used=151.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


That is just 3 GiB left to reserve new data chunks from.
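
(That number is simply size minus used per device from the fi show output
above, 163.00 - 159.92 ≈ 3.08 GiB. A throwaway one-liner to print it, assuming
the GiB-only output format shown here:

  btrfs fi show /home | awk '/devid/ { s=$4; u=$6; gsub(/GiB/,"",s); gsub(/GiB/,"",u);
    printf "%s: %.2f GiB unallocated\n", $8, s-u }'
)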

First run – all good:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/134.2MB/0KB /s] [0/34.4K/0 iops] [eta 
00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10483: Sun Dec 28 14:02:59 2014
  write: io=4096.0MB, bw=218101KB/s, iops=54525, runt= 19231msec
    clat (usec): min=4, max=20056, avg=14.87, stdev=143.15
     lat (usec): min=4, max=20056, avg=15.09, stdev=143.26
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   17],
     | 99.00th=[   52], 99.50th=[   99], 99.90th=[  434], 99.95th=[  980],
     | 99.99th=[ 7968]
    bw (KB  /s): min=62600, max=424456, per=100.00%, avg=218821.63, 
stdev=93695.28
    lat (usec) : 10=38.19%, 20=58.83%, 50=1.90%, 100=0.59%, 250=0.33%
    lat (usec) : 500=0.09%, 750=0.03%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=15.50%, sys=61.86%, ctx=93432, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=218101KB/s, minb=218101KB/s, maxb=218101KB/s, 
mint=19231msec, maxt=19231msec




Second run:


merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:04:01 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.23GiB
        devid    1 size 163.00GiB used 160.95GiB path /dev/mapper/msata-home
        devid    2 size 163.00GiB used 160.95GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.98GiB, used=151.91GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.32GiB
GlobalReserve, single: total=512.00MiB, used=0.00B




martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/171.3MB/0KB /s] [0/43.9K/0 iops] [eta 
00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10501: Sun Dec 28 14:05:03 2014
  write: io=4096.0MB, bw=220637KB/s, iops=55159, runt= 19010msec
    clat (usec): min=4, max=20578, avg=14.45, stdev=160.84
     lat (usec): min=4, max=20578, avg=14.65, stdev=160.88
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   15], 95.00th=[   17],
     | 99.00th=[   42], 99.50th=[   79], 99.90th=[  278], 99.95th=[  620],
     | 99.99th=[ 9792]
    bw (KB  /s): min=    5, max=454816, per=100.00%, avg=224700.32, 
stdev=100763.29
    lat (usec) : 10=38.15%, 20=58.73%, 50=2.28%, 100=0.47%, 250=0.26%
    lat (usec) : 500=0.06%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=15.83%, sys=63.17%, ctx=74934, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=220636KB/s, minb=220636KB/s, maxb=220636KB/s, 
mint=19010msec, maxt=19010msec



Okay, now try the same without free space left to allocate another chunk from.

The testfile is still there; fio doesn't delete and recreate it on
every attempt, but just writes into it:

martin@merkaba:~> ls -l ssd.test.file 
-rw-r--r-- 1 martin martin 4294967296 Dez 28 14:05 ssd.test.file
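
(Side note: filefrag from e2fsprogs works on btrfs via FIEMAP, so it can show
how fragmented that reused test file already is – keeping in mind that with
compress=lzo every compressed piece of up to 128 KiB shows up as its own
extent, so the absolute count is inflated, but the trend between runs is still
telling:

  filefrag ssd.test.file            # just the total extent count
  filefrag -v ssd.test.file | head  # first few extents with physical offsets
)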


Okay – with still one chunk left to allocate:

merkaba:~> btrfs filesystem resize 1:161G /home
Resize '/home' of '1:161G'
merkaba:~> btrfs filesystem resize 2:161G /home
Resize '/home' of '2:161G'
merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:08:45 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.15GiB
        devid    1 size 161.00GiB used 159.92GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 159.92GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=154.95GiB, used=151.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



I would have liked to make it allocate the chunks by other means, but it
eventually frees them again afterwards, so I did it this way.

Note, we still have the original file there. The space it currently
occupies is already taken into account.


Next test:

martin@merkaba:~> fio ssd-test.fio   
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/130.5MB/0KB /s] [0/33.5K/0 iops] [eta 
00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10563: Sun Dec 28 14:10:34 2014
  write: io=4096.0MB, bw=210526KB/s, iops=52631, runt= 19923msec
    clat (usec): min=4, max=21820, avg=14.78, stdev=119.40
     lat (usec): min=4, max=21821, avg=15.03, stdev=120.26
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   11],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   14], 95.00th=[   17],
     | 99.00th=[   62], 99.50th=[  131], 99.90th=[  490], 99.95th=[  964],
     | 99.99th=[ 6816]
    bw (KB  /s): min=    3, max=410480, per=100.00%, avg=216892.84, 
stdev=95620.33
    lat (usec) : 10=33.20%, 20=63.71%, 50=1.86%, 100=0.59%, 250=0.42%
    lat (usec) : 500=0.12%, 750=0.03%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
  cpu          : usr=15.13%, sys=62.74%, ctx=94346, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=210525KB/s, minb=210525KB/s, maxb=210525KB/s, 
mint=19923msec, maxt=19923msec


Okay, this is still good.

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:11:18 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.17GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=151.86GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B




Now there is no space left to reserve additional chunks. Another test:


martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/152.3MB/0KB /s] [0/38.1K/0 iops] [eta 
00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10580: Sun Dec 28 14:13:26 2014
  write: io=4096.0MB, bw=225804KB/s, iops=56450, runt= 18575msec
    clat (usec): min=4, max=16669, avg=13.66, stdev=72.88
     lat (usec): min=4, max=16669, avg=13.89, stdev=73.06
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   20],
     | 99.00th=[   65], 99.50th=[  113], 99.90th=[  314], 99.95th=[  506],
     | 99.99th=[ 2768]
    bw (KB  /s): min=    4, max=444568, per=100.00%, avg=231326.97, 
stdev=93374.31
    lat (usec) : 10=36.50%, 20=58.44%, 50=3.73%, 100=0.76%, 250=0.44%
    lat (usec) : 500=0.09%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=16.35%, sys=68.39%, ctx=127221, majf=0, minf=5
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=225803KB/s, minb=225803KB/s, maxb=225803KB/s, 
mint=18575msec, maxt=18575msec



Okay, this still does not trigger it.


Another test; it even freed a chunk:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:14:21 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.28GiB
        devid    1 size 161.00GiB used 160.85GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.85GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.89GiB, used=151.97GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



Still good:

martin@merkaba:~> fio ssd-test.fio
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/156.5MB/0KB /s] [0/40.6K/0 iops] [eta 
00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10589: Sun Dec 28 14:14:37 2014
  write: io=4096.0MB, bw=161121KB/s, iops=40280, runt= 26032msec
    clat (usec): min=4, max=1228.9K, avg=15.69, stdev=1205.88
     lat (usec): min=4, max=1228.9K, avg=15.92, stdev=1205.90
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    8], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   13], 80.00th=[   14], 90.00th=[   15], 95.00th=[   19],
     | 99.00th=[   53], 99.50th=[   96], 99.90th=[  366], 99.95th=[  764],
     | 99.99th=[ 7776]
    bw (KB  /s): min=    0, max=431680, per=100.00%, avg=219856.30, 
stdev=98172.64
    lat (usec) : 10=39.24%, 20=55.83%, 50=3.81%, 100=0.63%, 250=0.33%
    lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 2000=0.01%
  cpu          : usr=11.50%, sys=61.08%, ctx=123428, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=161121KB/s, minb=161121KB/s, maxb=161121KB/s, 
mint=26032msec, maxt=26032msec



Okay, let's allocate one GiB with fallocate to make free space tighter:

martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-1
0.00user 0.09system 0:00.11elapsed 86%CPU (0avgtext+0avgdata 1752maxresident)k
112inputs+64outputs (1major+89minor)pagefaults 0swaps
martin@merkaba:~> ls -lh 1g-1
-rw-r--r-- 1 martin martin 1,0G Dez 28 14:16 1g


merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:16:24 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 156.15GiB
        devid    1 size 161.00GiB used 160.94GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.94GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.97GiB, used=152.84GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


Still not triggering it:

martin@merkaba:~> fio ssd-test.fio                  
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/132.1MB/0KB /s] [0/34.4K/0 iops] [eta 
00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10632: Sun Dec 28 14:17:12 2014
  write: io=4096.0MB, bw=198773KB/s, iops=49693, runt= 21101msec
    clat (usec): min=4, max=543255, avg=16.27, stdev=563.85
     lat (usec): min=4, max=543255, avg=16.48, stdev=563.91
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   12],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   14], 95.00th=[   17],
     | 99.00th=[   49], 99.50th=[   98], 99.90th=[  386], 99.95th=[  828],
     | 99.99th=[10816]
    bw (KB  /s): min=    4, max=444848, per=100.00%, avg=203909.07, 
stdev=109502.11
    lat (usec) : 10=33.97%, 20=62.99%, 50=2.05%, 100=0.51%, 250=0.33%
    lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 750=0.01%
  cpu          : usr=14.21%, sys=60.44%, ctx=70273, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=198772KB/s, minb=198772KB/s, maxb=198772KB/s, 
mint=21101msec, maxt=21101msec




Another 1G file.

Got it:

ATOP - merkaba                       2014/12/28  14:18:14                       
-----------                        10s elapsed
PRC | sys   21.74s | user   2.48s | #proc    382 | #trun      8 |  #tslpi   698 
| #tslpu     1 | #zombie    0 | no  procacct |
CPU | sys     218% | user     24% | irq       1% | idle    155% |  wait      2% 
| guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      75% | user      5% | irq       0% | idle     19% |  cpu003 w  0% 
| guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      59% | user      3% | irq       0% | idle     37% |  cpu001 w  0% 
| guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      48% | user      6% | irq       1% | idle     45% |  cpu000 w  1% 
| guest     0% | curf 3.00GHz | curscal  93% |
cpu | sys      36% | user      9% | irq       0% | idle     54% |  cpu002 w  1% 
| guest     0% | curf 3.00GHz | curscal  93% |
CPL | avg1    2.13 | avg5    2.37 | avg15   1.92 |              |  csw    67473 
| intr   59152 |              | numcpu     4 |
MEM | tot    15.5G | free    1.1G | cache  11.0G | buff    0.1M |  slab  740.2M 
| shmem 190.9M | vmbal   0.0M | hptot   0.0M |
SWP | tot    12.0G | free   11.4G |              |              |               
|              | vmcom   5.4G | vmlim  19.7G |
PAG | scan       0 | steal      0 | stall      1 |              |               
|              | swin      19 | swout      0 |
LVM |    sata-home | busy      8% | read       4 | write  26062 |  KiB/w      3 
| MBr/s   0.00 | MBw/s  10.18 | avio 0.03 ms |
LVM |   msata-home | busy      5% | read       4 | write  26062 |  KiB/w      3 
| MBr/s   0.00 | MBw/s  10.18 | avio 0.02 ms |
LVM |    sata-swap | busy      0% | read      19 | write      0 |  KiB/w      0 
| MBr/s   0.01 | MBw/s   0.00 | avio 0.05 ms |
LVM | msata-debian | busy      0% | read       0 | write      4 |  KiB/w      4 
| MBr/s   0.00 | MBw/s   0.00 | avio 0.00 ms |
LVM |  sata-debian | busy      0% | read       0 | write      4 |  KiB/w      4 
| MBr/s   0.00 | MBw/s   0.00 | avio 0.00 ms |
DSK |          sda | busy      8% | read      23 | write  13239 |  KiB/w      7 
| MBr/s   0.01 | MBw/s  10.18 | avio 0.06 ms |
DSK |          sdb | busy      5% | read       4 | write  14360 |  KiB/w      7 
| MBr/s   0.00 | MBw/s  10.18 | avio 0.04 ms |
NET | transport    | tcpi      18 | tcpo      18 | udpi       0 |  udpo       0 
| tcpao      1 | tcppo      1 | tcprs      0 |
NET | network      | ipi       18 | ipo       18 | ipfrw      0 |  deliv     18 
|              | icmpi      0 | icmpo      0 |
NET | eth0      0% | pcki       2 | pcko       2 | si    0 Kbps |  so    0 Kbps 
| erri       0 | erro       0 | drpo       0 |
NET | lo      ---- | pcki      16 | pcko      16 | si    2 Kbps |  so    2 Kbps 
| erri       0 | erro       0 | drpo       0 |

  PID   TID  RUID      EUID       THR  SYSCPU  USRCPU   VGROW   RGROW   RDDSK   
WRDSK  ST  EXC  S  CPUNR   CPU  CMD        1/5
10657     -  martin    martin       1   9.88s   0.00s      0K      0K      0K   
  48K  --    -  R      1   99%  fallocate
 9685     -  root      root         1   9.84s   0.00s      0K      0K      0K   
   0K  --    -  D      0   99%  kworker/u8:10



martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-2 ; ls -l 1g*
0.00user 59.28system 1:00.21elapsed 98%CPU (0avgtext+0avgdata 1756maxresident)k
0inputs+416outputs (0major+90minor)pagefaults 0swaps
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:16 1g-1
-rw-r--r-- 1 martin martin 1073741824 Dez 28 14:17 1g-2


One minute of system CPU for this.



Okay, so now another test:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:19:30 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 157.18GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=153.87GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



I admit, this now really isn't nice to it anymore, but I want to see
where it starts to become an issue.



ATOP - merkaba                                  2014/12/28  14:21:18            
                      -----------                                  1s elapsed
PRC | sys    1.15s | user   0.16s | #proc    382  | #trun      2 | #tslpi   707 
| #tslpu     1 | #zombie    0 | clones     0  |              | no  procacct |
CPU | sys     163% | user     24% | irq       1%  | idle    189% | wait     26% 
|              | steal     0% | guest     0%  | curf 3.01GHz | curscal  94% |
cpu | sys      72% | user      1% | irq       0%  | idle     25% | cpu001 w  1% 
|              | steal     0% | guest     0%  | curf 3.00GHz | curscal  93% |
cpu | sys      41% | user      9% | irq       0%  | idle     32% | cpu002 w 19% 
|              | steal     0% | guest     0%  | curf 3.00GHz | curscal  93% |
cpu | sys      34% | user     10% | irq       0%  | idle     53% | cpu003 w  3% 
|              | steal     0% | guest     0%  | curf 3.03GHz | curscal  94% |
cpu | sys      16% | user      4% | irq       0%  | idle     77% | cpu000 w  3% 
|              | steal     0% | guest     0%  | curf 3.00GHz | curscal  93% |
CPL | avg1    2.37 | avg5    2.64 | avg15   2.13  |              |              
| csw    18687 | intr   12435 |               |              | numcpu     4 |
MEM | tot    15.5G | free    2.5G | cache   9.5G  | buff    0.1M | slab  742.6M 
| shmem 242.8M | shrss 115.5M | vmbal   0.0M  | hptot   0.0M | hpuse   0.0M |
SWP | tot    12.0G | free   11.4G |               |              |              
|              |              |               | vmcom   5.5G | vmlim  19.7G |
LVM |   msata-home | busy     71% | read      28  | write   8134 | KiB/r      4 
| KiB/w      3 | MBr/s   0.11 | MBw/s  31.68  | avq    13.21 | avio 0.06 ms |
LVM |    sata-home | busy     40% | read      72  | write   8135 | KiB/r      4 
| KiB/w      3 | MBr/s   0.28 | MBw/s  31.69  | avq    41.67 | avio 0.03 ms |
DSK |          sdb | busy     71% | read      24  | write   6049 | KiB/r      4 
| KiB/w      5 | MBr/s   0.11 | MBw/s  31.68  | avq     5.64 | avio 0.08 ms |
DSK |          sda | busy     38% | read      60  | write   5987 | KiB/r      4 
| KiB/w      5 | MBr/s   0.28 | MBw/s  31.69  | avq    20.40 | avio 0.04 ms |
NET | transport    | tcpi      16 | tcpo      16  | udpi       0 | udpo       0 
| tcpao      1 | tcppo      1 | tcprs      0  | tcpie      0 | udpie      0 |
NET | network      | ipi       16 | ipo       16  | ipfrw      0 | deliv     16 
|              |              |               | icmpi      0 | icmpo      0 |
NET | lo      ---- | pcki      16 | pcko      16  | si   20 Kbps | so   20 Kbps 
| coll       0 | erri       0 | erro       0  | drpi       0 | drpo       0 |

  PID     TID    RUID        EUID         THR    SYSCPU    USRCPU     VGROW     
RGROW    RDDSK     WRDSK    ST    EXC    S    CPUNR     CPU    CMD        1/2
10459       -    root        root           1     0.70s     0.00s        0K     
   0K       0K        0K    --      -    D        3    100%    kworker/u8:17
10674       -    martin      martin         1     0.20s     0.01s        0K     
   0K       0K    27504K    --      -    R        0     30%    fio



Okay.

It is jumping between 0 IOPS and 40000 IOPS, with one kworker hogging 100% of
one core.
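
(If I dig further here, the obvious next step is to sample where that kworker
burns its CPU, e.g. with perf – the PID is whatever atop shows at that moment,
10459 in the snapshot above:

  # sample the busy kworker for 10 seconds, then show the hottest call chains
  perf record -g -p 10459 -- sleep 10
  perf report --stdio | head -40
)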



Still quite okay in terms of IOPS, though:

martin@merkaba:~> fio ssd-test.fio                              
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/126.8MB/0KB /s] [0/32.5K/0 iops] [eta 
00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10674: Sun Dec 28 14:22:16 2014
  write: io=3801.3MB, bw=32415KB/s, iops=8103, runt=120083msec
    clat (usec): min=4, max=1809.9K, avg=83.88, stdev=3615.98
     lat (usec): min=4, max=1809.9K, avg=84.10, stdev=3616.00
    clat percentiles (usec):
     |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
     | 30.00th=[    9], 40.00th=[   10], 50.00th=[   11], 60.00th=[   11],
     | 70.00th=[   12], 80.00th=[   13], 90.00th=[   15], 95.00th=[   18],
     | 99.00th=[   52], 99.50th=[  124], 99.90th=[24704], 99.95th=[30592],
     | 99.99th=[57088]
    bw (KB  /s): min=    0, max=417544, per=100.00%, avg=48302.16, 
stdev=89108.07
    lat (usec) : 10=35.61%, 20=60.17%, 50=3.17%, 100=0.47%, 250=0.27%
    lat (usec) : 500=0.05%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.04%, 50=0.16%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu          : usr=2.37%, sys=29.74%, ctx=202984, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=973128/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=3801.3MB, aggrb=32415KB/s, minb=32415KB/s, maxb=32415KB/s, 
mint=120083msec, maxt=120083msec



I stop this here now.


It's interesting to see that even with:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 14:23:11 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 157.89GiB
        devid    1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home
        devid    2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=155.94GiB, used=154.59GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.94GiB, used=3.30GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I am not able to fully reproduce it.


I can partly reproduce it, but it behaves way better than before.



I think that to go further, one needs to have a look at the free space
fragmentation inside the chunks.

As in the beginning I had:

I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



So I had quite some free space *within* the chunks and it still was a
problem.










-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
