Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)

Martin Steigerwald Sun, 28 Dec 2014 07:01:02 -0800

Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
> Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
> > Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
> > > Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:
> > > > Summarized at
> > > > 
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
> > > > minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > > 
> > > > see below. This is reproducable with fio, no need for Windows XP in
> > > > Virtualbox for reproducing the issue. Next I will try to reproduce with
> > > > a freshly creating filesystem.
> > > > 
> > > > 
> > > > Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> > > > > On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > > > > > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > > > > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > > > > > Hello!
> > > > > > > > 
> > > > > > > > First: Have a merry christmas and enjoy a quiet time in these 
> > > > > > > > days.
> > > > > > > > 
> > > > > > > > Second: At a time you feel like it, here is a little rant, but 
> > > > > > > > also a
> > > > > > > > bug
> > > > > > > > report:
> > > > > > > > 
> > > > > > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD 
> > > > > > > > RAID with
> > > > > > > > space_cache, skinny meta data extents – are these a problem? – 
> > > > > > > > and
> > > > > > > 
> > > > > > > > compress=lzo:
> > > > > > > (there is no known problem with skinny metadata, it's actually 
> > > > > > > more
> > > > > > > efficient than the older format. There has been some anecdotes 
> > > > > > > about
> > > > > > > mixing the skinny and fat metadata but nothing has ever been
> > > > > > > demonstrated problematic.)
> > > > > > > 
> > > > > > > > merkaba:~> btrfs fi sh /home
> > > > > > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > > > > > 
> > > > > > > >          Total devices 2 FS bytes used 144.41GiB
> > > > > > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > > > > > >          /dev/mapper/msata-home
> > > > > > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > > > > > >          /dev/mapper/sata-home
> > > > > > > > 
> > > > > > > > Btrfs v3.17
> > > > > > > > merkaba:~> btrfs fi df /home
> > > > > > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > > > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > > > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > > > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > > > > 
> > > > > > > This filesystem, at the allocation level, is "very full" (see 
> > > > > > > below).
> > > > > > > 
> > > > > > > > And I had hangs with BTRFS again. This time as I wanted to 
> > > > > > > > install tax
> > > > > > > > return software in Virtualbox´d Windows XP VM (which I use once 
> > > > > > > > a year
> > > > > > > > cause I know no tax return software for Linux which would be 
> > > > > > > > suitable
> > > > > > > > for
> > > > > > > > Germany and I frankly don´t care about the end of security 
> > > > > > > > cause all
> > > > > > > > surfing and other network access I will do from the Linux box 
> > > > > > > > and I
> > > > > > > > only
> > > > > > > > run the VM behind a firewall).
> > > > > > > 
> > > > > > > > And thus I try the balance dance again:
> > > > > > > ITEM: Balance... it doesn't do what you think it does... 
> > > > > > > 
> > > > > > > "Balancing" is something you should almost never need to do. It 
> > > > > > > is only
> > > > > > > for cases of changing geometry (adding disks, switching RAID 
> > > > > > > levels,
> > > > > > > etc.) of for cases when you've radically changed allocation 
> > > > > > > behaviors
> > > > > > > (like you decided to remove all your VM's or you've decided to 
> > > > > > > remove a
> > > > > > > mail spool directory full of thousands of tiny files).
> > > > > > > 
> > > > > > > People run balance all the time because they think they should. 
> > > > > > > They are
> > > > > > > _usually_ incorrect in that belief.
> > > > > > 
> > > > > > I only see the lockups of BTRFS is the trees *occupy* all space on 
> > > > > > the
> > > > > > device.
> > > > >    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> > > > > space. What's more, balance does *not* balance the metadata trees. The
> > > > > remaining space -- 154.97 GiB -- is unstructured storage for file
> > > > > data, and you have some 13 GiB of that available for use.
> > > > > 
> > > > >    Now, since you're seeing lockups when the space on your disks is
> > > > > all allocated I'd say that's a bug. However, you're the *only* person
> > > > > who's reported this as a regular occurrence. Does this happen with all
> > > > > filesystems you have, or just this one?
> > > > > 
> > > > > > I *never* so far saw it lockup if there is still space BTRFS can 
> > > > > > allocate
> > > > > > from to *extend* a tree.
> > > > > 
> > > > >    It's not a tree. It's simply space allocation. It's not even space
> > > > > *usage* you're talking about here -- it's just allocation (i.e. the FS
> > > > > saying "I'm going to use this piece of disk for this purpose").
> > > > > 
> > > > > > This may be a bug, but this is what I see.
> > > > > > 
> > > > > > And no amount of "you should not balance a BTRFS" will make that
> > > > > > perception go away.
> > > > > > 
> > > > > > See, I see the sun coming out on a morning and you tell me "no, it
> > > > > > doesn´t". Simply that is not going to match my perception.
> > > > > 
> > > > >    Duncan's assertion is correct in its detail. Looking at your space
> > > > 
> > > > Robert's 
> > > > 
> > > > > usage, I would not suggest that running a balance is something you
> > > > > need to do. Now, since you have these lockups that seem quite
> > > > > repeatable, there's probably a lurking bug in there, but hacking
> > > > > around with balance every time you hit it isn't going to get the
> > > > > problem solved properly.
> > > > > 
> > > > >    I think I would suggest the following:
> > > > > 
> > > > >  - make sure you have some way of logging your dmesg permanently (use
> > > > >    a different filesystem for /var/log, or a serial console, or a
> > > > >    netconsole)
> > > > > 
> > > > >  - when the lockup happens, hit Alt-SysRq-t a few times
> > > > > 
> > > > >  - send the dmesg output here, or post to bugzilla.kernel.org
> > > > > 
> > > > >    That's probably going to give enough information to the developers
> > > > > to work out where the lockup is happening, and is clearly the way
> > > > > forward here.
> > > > 
> > > > And I got it reproduced. *Perfectly* reproduced, I´d say.
> > > > 
> > > > But let me run the whole story:
> > > > 
> > > > 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.
> > > 
> > > [… story of trying to reproduce with Windows XP defragmenting which was
> > > unsuccessful as BTRFS still had free device space to allocate new chunks
> > > from …]
> > > 
> > > > But finally I got to:
> > > > 
> > > > merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
> > > > Sa 27. Dez 13:26:39 CET 2014
> > > > Label: 'home'  uuid: [some UUID]
> > > >         Total devices 2 FS bytes used 152.83GiB
> > > >         devid    1 size 160.00GiB used 160.00GiB path 
> > > > /dev/mapper/msata-home
> > > >         devid    2 size 160.00GiB used 160.00GiB path 
> > > > /dev/mapper/sata-home
> > > > 
> > > > Btrfs v3.17
> > > > Data, RAID1: total=154.97GiB, used=149.58GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.26GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > 
> > > > 
> > > > 
> > > > So I did, if Virtualbox can write randomly in a file, I can too.
> > > > 
> > > > So I did:
> > > > 
> > > > 
> > > > martin@merkaba:~> cat ssd-test.fio 
> > > > [global]
> > > > bs=4k
> > > > #ioengine=libaio
> > > > #iodepth=4
> > > > size=4g
> > > > #direct=1
> > > > runtime=120
> > > > filename=ssd.test.file
> > > > 
> > > > [seq-write]
> > > > rw=write
> > > > stonewall
> > > > 
> > > > [rand-write]
> > > > rw=randwrite
> > > > stonewall
> > > > 
> > > > 
> > > > 
> > > > And got:
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:41:02            
> > > >               -----------                           10s elapsed
> > > > PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  
> > > > #tslpi   548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  
> > > > wait      0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
> > > > cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  
> > > > cpu001 w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
> > > > cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  
> > > > cpu000 w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  
> > > > cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  
> > > > cpu003 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  
> > > > csw     6233 |  intr   12023 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  
> > > > slab  333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |  
> > > >              |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |     sata-home |  busy      0% |  read       8  | write      0  |  
> > > > KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > > DSK |           sda |  busy      0% |  read       8  | write      0  |  
> > > > KiB/w      0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
> > > > NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  
> > > > udpo       0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  
> > > > deliv     16 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  
> > > > so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    
> > > > RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
> > > > 18079      -   martin    martin        2   9.99s    0.00s      0K       
> > > > 0K      0K      16K  --     -  R       1  100%   fio
> > > >  4746      -   martin    martin        2   0.01s    0.14s      0K       
> > > > 0K      0K       0K  --     -  S       2    2%   konsole
> > > >  3291      -   martin    martin        4   0.01s    0.11s      0K       
> > > > 0K      0K       0K  --     -  S       0    1%   plasma-desktop
> > > >  1488      -   root      root          1   0.03s    0.04s      0K       
> > > > 0K      0K       0K  --     -  S       0    1%   Xorg
> > > > 10036      -   root      root          1   0.04s    0.02s      0K       
> > > > 0K      0K       0K  --     -  R       2    1%   atop
> > > > 
> > > > while fio was just *laying* out the 4 GiB file. Yes, thats 100% system 
> > > > CPU
> > > > for 10 seconds while allocatiing a 4 GiB file on a filesystem like:
> > > > 
> > > > martin@merkaba:~> LANG=C df -hT /home
> > > > Filesystem             Type   Size  Used Avail Use% Mounted on
> > > > /dev/mapper/msata-home btrfs  170G  156G   17G  91% /home
> > > > 
> > > > where a 4 GiB file should easily fit, no? (And this output is with the 4
> > > > GiB file. So it was even 4 GiB more free before.)
> > > > 
> > > > 
> > > > But it gets even more visible:
> > > > 
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, 
> > > > iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, 
> > > > iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] 
> > > > [eta 01m:57s]       
> > > > 0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh                        
> > > >            
> > > > 
> > > > 
> > > > yes, thats 0 IOPS.
> > > > 
> > > > 0 IOPS and in zero IOPS. For minutes.
> > > > 
> > > > 
> > > > 
> > > > And here is why:
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:46:52            
> > > >               -----------                           10s elapsed
> > > > PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  
> > > > #tslpi   548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  
> > > > wait      2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
> > > > cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  
> > > > cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  
> > > > cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  
> > > > cpu003 w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
> > > > cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  
> > > > cpu002 w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
> > > > CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  
> > > > csw    50192 |  intr   32379 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  
> > > > slab  332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |  
> > > >              |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  
> > > > KiB/w      3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
> > > > LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  
> > > > KiB/w      3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
> > > > LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  
> > > > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > > LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  
> > > > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
> > > > DSK |           sda |  busy      5% |  read     160  | write  10200  |  
> > > > KiB/w      4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
> > > > DSK |           sdb |  busy      4% |  read      28  | write  10558  |  
> > > > KiB/w      4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
> > > > NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  
> > > > udpo       3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  
> > > > deliv     38 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  
> > > > so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  
> > > > so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    
> > > > RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > > 14973      -   root      root          1   8.92s    0.00s      0K       
> > > > 0K      0K     144K  --     -  S       0   89%   kworker/u8:14
> > > > 17450      -   root      root          1   0.86s    0.00s      0K       
> > > > 0K      0K      32K  --     -  R       3    9%   kworker/u8:5
> > > >   788      -   root      root          1   0.25s    0.00s      0K       
> > > > 0K    128K   18880K  --     -  S       3    3%   btrfs-transact
> > > > 12254      -   root      root          1   0.14s    0.00s      0K       
> > > > 0K     64K     576K  --     -  S       2    1%   kworker/u8:3
> > > > 17332      -   root      root          1   0.11s    0.00s      0K       
> > > > 0K    112K    1348K  --     -  S       2    1%   kworker/u8:4
> > > >  3291      -   martin    martin        4   0.01s    0.09s      0K       
> > > > 0K      0K       0K  --     -  S       1    1%   plasma-deskto
> > > > 
> > > > 
> > > > 
> > > > 
> > > > ATOP - merkaba                          2014/12/27  13:47:12            
> > > >               -----------                           10s elapsed
> > > > PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  
> > > > #tslpi   547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
> > > > CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  
> > > > wait      1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  
> > > > cpu002 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  
> > > > cpu003 w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
> > > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  
> > > > cpu000 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  
> > > > cpu001 w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
> > > > CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  
> > > > csw    38253 |  intr   23104 |                | numcpu     4  |
> > > > MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  
> > > > slab  336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
> > > > SWP |  tot    12.0G |  free   11.7G |                |               |  
> > > >              |               |  vmcom   3.4G  | vmlim  19.7G  |
> > > > LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  
> > > > KiB/w      3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
> > > > LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  
> > > > KiB/w      3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
> > > > LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  
> > > > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
> > > > LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  
> > > > KiB/w      4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
> > > > DSK |           sdb |  busy      2% |  read       1  | write   2545  |  
> > > > KiB/w      5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
> > > > DSK |           sda |  busy      1% |  read      36  | write   2461  |  
> > > > KiB/w      5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
> > > > NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  
> > > > udpo       1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
> > > > NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  
> > > > deliv     21 |               |  icmpi      0  | icmpo      0  |
> > > > NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  
> > > > so    0 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  
> > > > so    2 Kbps |  erri       0 |  erro       0  | drpo       0  |
> > > > 
> > > >   PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    
> > > > RGROW   RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
> > > > 17450      -   root      root          1   9.96s    0.00s      0K       
> > > > 0K      0K       0K  --     -  R       2  100%   kworker/u8:5
> > > >  4746      -   martin    martin        2   0.06s    0.15s      0K       
> > > > 0K      0K       0K  --     -  S       1    2%   konsole
> > > > 10508      -   root      root          1   0.13s    0.00s      0K       
> > > > 0K     96K    4048K  --     -  S       1    1%   kworker/u8:18
> > > >  1488      -   root      root          1   0.06s    0.06s      0K       
> > > > 0K      0K       0K  --     -  S       0    1%   Xorg
> > > > 17332      -   root      root          1   0.12s    0.00s      0K       
> > > > 0K     96K     580K  --     -  R       3    1%   kworker/u8:4
> > > > 17454      -   root      root          1   0.11s    0.00s      0K       
> > > > 0K     32K    4416K  --     -  D       1    1%   kworker/u8:6
> > > > 17516      -   root      root          1   0.09s    0.00s      0K       
> > > > 0K     16K     136K  --     -  S       3    1%   kworker/u8:7
> > > >  3268      -   martin    martin        3   0.02s    0.05s      0K       
> > > > 0K      0K       0K  --     -  S       1    1%   kwin
> > > > 10036      -   root      root          1   0.05s    0.02s      0K       
> > > > 0K      0K       0K  --     -  R       0    1%   atop
> > > > 
> > > > 
> > > > 
> > > > So BTRFS is basically busy with itself and nothing else. Look at the SSD
> > > > usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
> > > > Thats a joke with SSDs that can do 40000 IOPS (depending on how and what
> > > > you measure of course, like request size, read, write, iodepth and so).
> > > > 
> > > > Its kworker/u8:5 utilizing 100% of one core for minutes.
> > > > 
> > > > 
> > > > 
> > > > Its the random write case it seems. Here are values from fio job:
> > > > 
> > > > martin@merkaba:~> fio ssd-test.fio
> > > > seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, 
> > > > iodepth=1
> > > > rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, 
> > > > iodepth=1
> > > > fio-2.1.11
> > > > Starting 2 processes
> > > > Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] 
> > > > [eta 01h:06m:26s]
> > > > seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 
> > > > 2014
> > > >   write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
> > > >     clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
> > > >      lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    
> > > > 4],
> > > >      | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    
> > > > 5],
> > > >      | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    
> > > > 9],
> > > >      | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 
> > > > 2128],
> > > >      | 99.99th=[10304]
> > > >     bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, 
> > > > stdev=201695.20
> > > >     lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
> > > >     lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
> > > >     lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
> > > >   cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
> > > > >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > > > >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > > > >=64=0.0%
> > > >      issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > 
> > > > Seems fine.
> > > > 
> > > > 
> > > > But:
> > > > 
> > > > rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 
> > > > 2014
> > > >   write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
> > > >     clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
> > > >      lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
> > > >     clat percentiles (usec):
> > > >      |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    
> > > > 5],
> > > >      | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    
> > > > 6],
> > > >      | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   
> > > > 10],
> > > >      | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  
> > > > 116],
> > > >      | 99.99th=[16711680]
> > > >     bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, 
> > > > stdev=938.02
> > > >     lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
> > > >     lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
> > > >   cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
> > > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
> > > > >=64=0.0%
> > > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > > > >=64=0.0%
> > > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> > > > >=64=0.0%
> > > >      issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
> > > >      latency   : target=0, window=0, percentile=100.00%, depth=1
> > > > 
> > > > Run status group 0 (all jobs):
> > > >   WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, 
> > > > maxb=343682KB/s, mint=12204msec, maxt=12204msec
> > > > 
> > > > Run status group 1 (all jobs):
> > > >   WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, 
> > > > mint=137803msec, maxt=137803msec
> > > > 
> > > > 
> > > > What? 254 IOPS? With a Dual SSD BTRFS RAID 1?
> > > > 
> > > > What?
> > > > 
> > > > Ey, *what*?
> > […] 
> > > > There we go:
> > > > 
> > > > Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for 
> > > > minutes on random write into big file
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > > 
> > > I have done more tests.
> > > 
> > > This is on the same /home after extending it to 170 GiB and balancing it 
> > > to
> > > btrfs balance start -dusage=80
> > > 
> > > It has plenty of free space free. I updated the bug report and hope it can
> > > give an easy enough to comprehend summary. The new tests are in:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c6
> > > 
> > > 
> > > 
> > > Pasting below for discussion on list. Summary: I easily get 38000 (!)
> > > IOPS. It may be an idea to reduce to 160 GiB, but right now this does
> > > not work as it says no free space on device when trying to downsize it.
> > > I may try with 165 or 162GiB.
> > > 
> > > So now we have three IOPS figures:
> > > 
> > > - 256 IOPS in worst case scenario
> > > - 4700 IOPS when trying to reproduce worst case scenario with a fresh and 
> > > small
> > > BTRFS
> > > - 38000 IOPS when /home has unused device space to allocate chunks from
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=90401#c8
> > > 
> > > 
> > > This is another test.
> > 
> > 
> > Okay, and this is the last series of tests for today.
> > 
> > Conclusion:
> > 
> > I cannot manage to get it down to the knees as before, but I come near to 
> > it.
> > 
> > Still its 8000 IOPS, instead of 250 IOPS, in an according to btrfs fi sh
> > even *worse* situation than before.
> > 
> > That hints me at the need to look at the free space fragmentation, as in the
> > beginning the problem started appearing with:
> > 
> > merkaba:~> btrfs fi sh /home
> > Label: 'home'  uuid: […]
> >         Total devices 2 FS bytes used 144.41GiB
> >         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> >         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
> > 
> > Btrfs v3.17
> > merkaba:~> btrfs fi df /home
> > Data, RAID1: total=154.97GiB, used=141.12GiB
> > System, RAID1: total=32.00MiB, used=48.00KiB
> > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > 
> > 
> > 
> > Yes, thats 13 GiB of free space *within* the chunks.
> > 
> > So while I can get it down in IOPS by bringing it to a situation where it
> > can not reserve additional data chunks again, I cannot recreate the
> > abysmal 250 IOPS figure by this. Not even with my /home filesystem.
> > 
> > So there is more to it. I think its important to look into free space
> > fragmentation. It seems it needs an *aged* filesystem to recreate. At
> > it seems the balances really helped. As I am not able to recreate the
> > issue to that extent right now.
> > 
> > So this shows my original idea about free device space to allocate from
> > also doesn´t explain it fully. It seems to be something thats going on
> > within the chunks that explains the worst case <300 IOPS, kworker using
> > one core for minutes and desktop locked scenario.
> > 
> > Is there a way to view free space fragmentation in BTRFS?
> 
> So to rephrase that:
> 
> From what I perceive the worst case issue happens when
> 
> 1) BTRFS cannot reserve any new chunks from unused device space anymore.
> 
> 2) The free space in the existing chunks is highly fragmented.
> 
> Only one of those conditions is not sufficient to trigger it.
> 
> Thats at least my current idea about it.


One more note about the IOPS. I currently let fio run with:

martin@merkaba:~> cat ssd-test.fio 
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall


This is using buffered I/O write read()/write() system calls. So these
IOPS are not regarding the device raw capabilities. I specifically wanted
to test through the page cache to simulate what I see with Virtualbox
writing to the VDI file (i.e. dirty piling up and dirty_background_ratio
in effect). Just like with a real app.

But that also means that IOPS may be higher cause fio ends before all of the
writes have been completed to disk.

That means when I reach <300IOPS with buffered writes, that means that
through the pagecache BTRFS was not able to yield a higher IOPS.

But it also means that I measure write requests like an application would
be doing (unless it uses fsync() or direct I/O which it seems to me
Virtualbox doesn´t at least not with every request).

Just wanted to make that explicit. Its basically visible in the job file
from what I commented out in there, but still, I thought I mention it.

I just tested the effect by reducing the test file to 500MiB and the runtime
to 10 seconds and I got 98000 IOPS for that. So the larger test file size,
but specifically the runtime forces the kernel to do actual writes due to:

merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

(standard values, I still see no need to optimize anything in here with
those SSDs, not even with the 16 GiB of RAM the laptop has, as the SSDs
usually easily can keep up, and I´d rather wait for a change in the default
value unless I am convinced of a benefit in manually adapting it in *this*
case)

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

signature.asc
Description: This is a digitally signed message part.

Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)

Reply via email to