Re: BTRFS free space handling still needs more work: Hangs again

Martin Steigerwald Sat, 27 Dec 2014 05:56:44 -0800

Summarized at

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes 
on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401


see below. This is reproducable with fio, no need for Windows XP in
Virtualbox for reproducing the issue. Next I will try to reproduce with
a freshly creating filesystem.


Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
> > Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
> > > On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
> > > > Hello!
> > > > 
> > > > First: Have a merry christmas and enjoy a quiet time in these days.
> > > > 
> > > > Second: At a time you feel like it, here is a little rant, but also a
> > > > bug
> > > > report:
> > > > 
> > > > I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with
> > > > space_cache, skinny meta data extents – are these a problem? – and
> > > 
> > > > compress=lzo:
> > > (there is no known problem with skinny metadata, it's actually more
> > > efficient than the older format. There has been some anecdotes about
> > > mixing the skinny and fat metadata but nothing has ever been
> > > demonstrated problematic.)
> > > 
> > > > merkaba:~> btrfs fi sh /home
> > > > Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
> > > > 
> > > >          Total devices 2 FS bytes used 144.41GiB
> > > >          devid    1 size 160.00GiB used 160.00GiB path
> > > >          /dev/mapper/msata-home
> > > >          devid    2 size 160.00GiB used 160.00GiB path
> > > >          /dev/mapper/sata-home
> > > > 
> > > > Btrfs v3.17
> > > > merkaba:~> btrfs fi df /home
> > > > Data, RAID1: total=154.97GiB, used=141.12GiB
> > > > System, RAID1: total=32.00MiB, used=48.00KiB
> > > > Metadata, RAID1: total=5.00GiB, used=3.29GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > 
> > > This filesystem, at the allocation level, is "very full" (see below).
> > > 
> > > > And I had hangs with BTRFS again. This time as I wanted to install tax
> > > > return software in Virtualbox´d Windows XP VM (which I use once a year
> > > > cause I know no tax return software for Linux which would be suitable
> > > > for
> > > > Germany and I frankly don´t care about the end of security cause all
> > > > surfing and other network access I will do from the Linux box and I
> > > > only
> > > > run the VM behind a firewall).
> > > 
> > > > And thus I try the balance dance again:
> > > ITEM: Balance... it doesn't do what you think it does... 8-)
> > > 
> > > "Balancing" is something you should almost never need to do. It is only
> > > for cases of changing geometry (adding disks, switching RAID levels,
> > > etc.) of for cases when you've radically changed allocation behaviors
> > > (like you decided to remove all your VM's or you've decided to remove a
> > > mail spool directory full of thousands of tiny files).
> > > 
> > > People run balance all the time because they think they should. They are
> > > _usually_ incorrect in that belief.
> > 
> > I only see the lockups of BTRFS is the trees *occupy* all space on the
> > device.
>    No, "the trees" occupy 3.29 GiB of your 5 GiB of mirrored metadata
> space. What's more, balance does *not* balance the metadata trees. The
> remaining space -- 154.97 GiB -- is unstructured storage for file
> data, and you have some 13 GiB of that available for use.
> 
>    Now, since you're seeing lockups when the space on your disks is
> all allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?
> 
> > I *never* so far saw it lockup if there is still space BTRFS can allocate
> > from to *extend* a tree.
> 
>    It's not a tree. It's simply space allocation. It's not even space
> *usage* you're talking about here -- it's just allocation (i.e. the FS
> saying "I'm going to use this piece of disk for this purpose").
> 
> > This may be a bug, but this is what I see.
> > 
> > And no amount of "you should not balance a BTRFS" will make that
> > perception go away.
> > 
> > See, I see the sun coming out on a morning and you tell me "no, it
> > doesn´t". Simply that is not going to match my perception.
> 
>    Duncan's assertion is correct in its detail. Looking at your space

Robert's :)

> usage, I would not suggest that running a balance is something you
> need to do. Now, since you have these lockups that seem quite
> repeatable, there's probably a lurking bug in there, but hacking
> around with balance every time you hit it isn't going to get the
> problem solved properly.
> 
>    I think I would suggest the following:
> 
>  - make sure you have some way of logging your dmesg permanently (use
>    a different filesystem for /var/log, or a serial console, or a
>    netconsole)
> 
>  - when the lockup happens, hit Alt-SysRq-t a few times
> 
>  - send the dmesg output here, or post to bugzilla.kernel.org
> 
>    That's probably going to give enough information to the developers
> to work out where the lockup is happening, and is clearly the way
> forward here.

And I got it reproduced. *Perfectly* reproduced, I´d say.

But let me run the whole story:

1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again.

Which gave me:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
        Total devices 2 FS bytes used 144.19GiB
        devid    1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home

Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


2) I run the Virtualbox machine again and defragmented the NTFS filesystem
in the VDI image file. And: It worked *just* fine. Fine as in *fine*. No issues
whatsoever.


I got this during the run:

ATOP - merkaba                          2014/12/27  12:58:42                    
      -----------                           10s elapsed
PRC |  sys   10.41s |  user   1.08s |  #proc    357  | #trun      4  |  #tslpi  
 694 |  #tslpu     0 |  #zombie    0  | no  procacct  |
CPU |  sys     107% |  user     11% |  irq       0%  | idle    259%  |  wait    
 23% |  guest     0% |  curf 3.01GHz  | curscal  93%  |
cpu |  sys      29% |  user      3% |  irq       0%  | idle     63%  |  cpu002 
w  5% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      27% |  user      3% |  irq       0%  | idle     65%  |  cpu000 
w  5% |  guest     0% |  curf 3.03GHz  | curscal  94%  |
cpu |  sys      26% |  user      3% |  irq       0%  | idle     63%  |  cpu003 
w  8% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      24% |  user      2% |  irq       0%  | idle     68%  |  cpu001 
w  6% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    1.92 |  avg5    1.01 |  avg15   0.56  |               |  csw   
501619 |  intr  129279 |                | numcpu     4  |
MEM |  tot    15.5G |  free  610.1M |  cache   9.1G  | buff    0.1M  |  slab    
1.0G |  shmem 183.5M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.6G |                |               |          
     |               |  vmcom   7.1G  | vmlim  19.7G  |
PAG |  scan  219141 |  steal 215577 |  stall    936  |               |          
     |               |  swin       0  | swout    940  |
LVM |     sata-home |  busy     53% |  read  181413  | write      0  |  KiB/w   
   0 |  MBr/s  70.86 |  MBw/s   0.00  | avio 0.03 ms  |
LVM |     sata-swap |  busy      2% |  read       0  | write    940  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.37  | avio 0.17 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write      1  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 1.00 ms  |
LVM |  msata-debian |  busy      0% |  read       0  | write      1  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.00 ms  |
DSK |           sda |  busy     53% |  read  181413  | write    477  |  KiB/w   
   7 |  MBr/s  70.86 |  MBw/s   0.37  | avio 0.03 ms  |
DSK |           sdb |  busy      0% |  read       0  | write      1  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.00 ms  |
NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo    
   0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv   
  16 |               |  icmpi      0  | icmpo      0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 
Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   
RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
 9650      -   martin    martin       22   7.89s    0.65s      0K     128K  
705.5M   382.1M  --     -  S       2   87%   VirtualBox
 9911      -   root      root          1   0.69s    0.01s      0K       0K      
0K       0K  --     -  S       3    7%   watch
 9598      -   root      root          1   0.38s    0.00s      0K       0K      
0K      20K  --     -  S       0    4%   kworker/u8:9
 9892      -   root      root          1   0.36s    0.00s      0K       0K      
0K       0K  --     -  S       1    4%   kworker/u8:17
 9428      -   root      root          1   0.30s    0.00s      0K       0K      
0K       0K  --     -  R       0    3%   kworker/u8:3
 9589      -   root      root          1   0.23s    0.00s      0K       0K      
0K       0K  --     -  S       1    2%   kworker/u8:6 
 4746      -   martin    martin        2   0.04s    0.13s      0K     -16K      
0K       0K  --     -  R       2    2%   konsole



Every 1,0s: cat /proc/meminfo                                                   
                               Sat Dec 27 12:59:23 2014

MemTotal:       16210512 kB
MemFree:          786632 kB
MemAvailable:   10271500 kB
Buffers:              52 kB
Cached:          9564340 kB
SwapCached:        70268 kB
Active:          6847560 kB
Inactive:        5257956 kB
Active(anon):    2016412 kB
Inactive(anon):   703076 kB
Active(file):    4831148 kB
Inactive(file):  4554880 kB
Unevictable:        9068 kB
Mlocked:            9068 kB
SwapTotal:      12582908 kB
SwapFree:       12186680 kB
Dirty:            972324 kB
Writeback:             0 kB
AnonPages:       2526340 kB
Mapped:          2457096 kB
Shmem:            173564 kB
Slab:             918128 kB
SReclaimable:     848816 kB
SUnreclaim:        69312 kB
KernelStack:       11200 kB
PageTables:        64556 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    20688164 kB
Committed_AS:    7438348 kB



I am not seeing more than one GiB of dirty here during regular usage and
it is no problem.

And kworker thread CPU usage just fine. So no, the dirty_background_ratio
isn´t an issue with this 16 GiB ThinkPad T520. Please note: I do Linux
performance analysis and tuning courses for about 7 years or so meanwhile.

I *know* these knobs. I may have used wrong terms regarding BTRFS, and my
understanding of BTRFS space allocation probably can be more accurate, but
I do think that I am onto something here. This is no rotating disk, it can 
handle
the write burst just fine and I generally do not tune where there is no need to
tune. Here there isn´t. And it wouldn´t be much more than a fine tuning.

With slow devices or with rsync over NFS by all means reduce it. But here it
simply isn´t an issue as you can see with the low kworker thread CPU usage
and the general SSD usage above.


So defragmentation completed just nice, no issue so far.

But I am close to full device space reservation already:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:02:40 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 151.58GiB
        devid    1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home



I thought I can trigger it again by defragmenting in Windows XP again, but
mind you, its defragmented already so it doesn´t to much. I did the sdelete
dance just to trigger something and well I saw kworker a bit higher, but not
much.

But finally I got to:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B



So I did, if Virtualbox can write randomly in a file, I can too.

So I did:


martin@merkaba:~> cat ssd-test.fio 
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall



And got:

ATOP - merkaba                          2014/12/27  13:41:02                    
      -----------                           10s elapsed
PRC |  sys   10.14s |  user   0.38s |  #proc    332  | #trun      2  |  #tslpi  
 548 |  #tslpu     0 |  #zombie    0  | no  procacct  |
CPU |  sys     102% |  user      4% |  irq       0%  | idle    295%  |  wait    
  0% |  guest     0% |  curf 3.10GHz  | curscal  96%  |
cpu |  sys      76% |  user      0% |  irq       0%  | idle     24%  |  cpu001 
w  0% |  guest     0% |  curf 3.20GHz  | curscal  99%  |
cpu |  sys      24% |  user      1% |  irq       0%  | idle     75%  |  cpu000 
w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu002 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       1% |  user      1% |  irq       0%  | idle     98%  |  cpu003 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    0.82 |  avg5    0.78 |  avg15   0.99  |               |  csw     
6233 |  intr   12023 |                | numcpu     4  |
MEM |  tot    15.5G |  free    4.0G |  cache   9.7G  | buff    0.0M  |  slab  
333.1M |  shmem 206.6M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |          
     |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |     sata-home |  busy      0% |  read       8  | write      0  |  KiB/w   
   0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
DSK |           sda |  busy      0% |  read       8  | write      0  |  KiB/w   
   0 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.12 ms  |
NET |  transport    |  tcpi      16 |  tcpo      16  | udpi       0  |  udpo    
   0 |  tcpao      1 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       16 |  ipo       16  | ipfrw      0  |  deliv   
  16 |               |  icmpi      0  | icmpo      0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 
Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   
RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
18079      -   martin    martin        2   9.99s    0.00s      0K       0K      
0K      16K  --     -  R       1  100%   fio
 4746      -   martin    martin        2   0.01s    0.14s      0K       0K      
0K       0K  --     -  S       2    2%   konsole
 3291      -   martin    martin        4   0.01s    0.11s      0K       0K      
0K       0K  --     -  S       0    1%   plasma-desktop
 1488      -   root      root          1   0.03s    0.04s      0K       0K      
0K       0K  --     -  S       0    1%   Xorg
10036      -   root      root          1   0.04s    0.02s      0K       0K      
0K       0K  --     -  R       2    1%   atop

while fio was just *laying* out the 4 GiB file. Yes, thats 100% system CPU
for 10 seconds while allocatiing a 4 GiB file on a filesystem like:

martin@merkaba:~> LANG=C df -hT /home
Filesystem             Type   Size  Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs  170G  156G   17G  91% /home

where a 4 GiB file should easily fit, no? (And this output is with the 4
GiB file. So it was even 4 GiB more free before.)


But it gets even more visible:

martin@merkaba:~> fio ssd-test.fio
seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [19.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
01m:57s]       
0$ zsh  1$ zsh  2$ zsh  3-$ zsh  4$ zsh  5$* zsh                                
   


yes, thats 0 IOPS.

0 IOPS and in zero IOPS. For minutes.



And here is why:

ATOP - merkaba                          2014/12/27  13:46:52                    
      -----------                           10s elapsed
PRC |  sys   10.77s |  user   0.31s |  #proc    334  | #trun      2  |  #tslpi  
 548 |  #tslpu     3 |  #zombie    0  | no  procacct  |
CPU |  sys     108% |  user      3% |  irq       0%  | idle    286%  |  wait    
  2% |  guest     0% |  curf 3.08GHz  | curscal  96%  |
cpu |  sys      72% |  user      1% |  irq       0%  | idle     28%  |  cpu000 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      19% |  user      0% |  irq       0%  | idle     81%  |  cpu001 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      11% |  user      1% |  irq       0%  | idle     87%  |  cpu003 
w  1% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
cpu |  sys       6% |  user      1% |  irq       0%  | idle     91%  |  cpu002 
w  1% |  guest     0% |  curf 3.11GHz  | curscal  97%  |
CPL |  avg1    2.78 |  avg5    1.34 |  avg15   1.12  |               |  csw    
50192 |  intr   32379 |                | numcpu     4  |
MEM |  tot    15.5G |  free    5.0G |  cache   8.7G  | buff    0.0M  |  slab  
332.6M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |          
     |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |     sata-home |  busy      5% |  read     160  | write  11177  |  KiB/w   
   3 |  MBr/s   0.06 |  MBw/s   4.36  | avio 0.05 ms  |
LVM |    msata-home |  busy      4% |  read      28  | write  11177  |  KiB/w   
   3 |  MBr/s   0.01 |  MBw/s   4.36  | avio 0.04 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write    844  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
LVM |  msata-debian |  busy      0% |  read       0  | write    844  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.33  | avio 0.02 ms  |
DSK |           sda |  busy      5% |  read     160  | write  10200  |  KiB/w   
   4 |  MBr/s   0.06 |  MBw/s   4.69  | avio 0.05 ms  |
DSK |           sdb |  busy      4% |  read      28  | write  10558  |  KiB/w   
   4 |  MBr/s   0.01 |  MBw/s   4.69  | avio 0.04 ms  |
NET |  transport    |  tcpi      35 |  tcpo      33  | udpi       3  |  udpo    
   3 |  tcpao      2 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       38 |  ipo       36  | ipfrw      0  |  deliv   
  38 |               |  icmpi      0  | icmpo      0  |
NET |  eth0      0% |  pcki      22 |  pcko      20  | si    9 Kbps  |  so    2 
Kbps |  erri       0 |  erro       0  | drpo       0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 
Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   
RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
14973      -   root      root          1   8.92s    0.00s      0K       0K      
0K     144K  --     -  S       0   89%   kworker/u8:14
17450      -   root      root          1   0.86s    0.00s      0K       0K      
0K      32K  --     -  R       3    9%   kworker/u8:5
  788      -   root      root          1   0.25s    0.00s      0K       0K    
128K   18880K  --     -  S       3    3%   btrfs-transact
12254      -   root      root          1   0.14s    0.00s      0K       0K     
64K     576K  --     -  S       2    1%   kworker/u8:3
17332      -   root      root          1   0.11s    0.00s      0K       0K    
112K    1348K  --     -  S       2    1%   kworker/u8:4
 3291      -   martin    martin        4   0.01s    0.09s      0K       0K      
0K       0K  --     -  S       1    1%   plasma-deskto




ATOP - merkaba                          2014/12/27  13:47:12                    
      -----------                           10s elapsed
PRC |  sys   10.78s |  user   0.44s |  #proc    334  | #trun      3  |  #tslpi  
 547 |  #tslpu     3 |  #zombie    0  | no  procacct  |
CPU |  sys     106% |  user      4% |  irq       0%  | idle    288%  |  wait    
  1% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys      93% |  user      0% |  irq       0%  | idle      7%  |  cpu002 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       7% |  user      0% |  irq       0%  | idle     93%  |  cpu003 
w  0% |  guest     0% |  curf 3.01GHz  | curscal  94%  |
cpu |  sys       3% |  user      2% |  irq       0%  | idle     94%  |  cpu000 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu001 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    3.33 |  avg5    1.56 |  avg15   1.20  |               |  csw    
38253 |  intr   23104 |                | numcpu     4  |
MEM |  tot    15.5G |  free    4.9G |  cache   8.7G  | buff    0.0M  |  slab  
336.5M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |          
     |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |    msata-home |  busy      2% |  read       0  | write   2337  |  KiB/w   
   3 |  MBr/s   0.00 |  MBw/s   0.91  | avio 0.07 ms  |
LVM |     sata-home |  busy      2% |  read      36  | write   2337  |  KiB/w   
   3 |  MBr/s   0.01 |  MBw/s   0.91  | avio 0.07 ms  |
LVM |  msata-debian |  busy      1% |  read       1  | write   1630  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.65  | avio 0.03 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write   1019  |  KiB/w   
   4 |  MBr/s   0.00 |  MBw/s   0.41  | avio 0.02 ms  |
DSK |           sdb |  busy      2% |  read       1  | write   2545  |  KiB/w   
   5 |  MBr/s   0.00 |  MBw/s   1.45  | avio 0.07 ms  |
DSK |           sda |  busy      1% |  read      36  | write   2461  |  KiB/w   
   5 |  MBr/s   0.01 |  MBw/s   1.28  | avio 0.06 ms  |
NET |  transport    |  tcpi      20 |  tcpo      20  | udpi       1  |  udpo    
   1 |  tcpao      1 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       21 |  ipo       21  | ipfrw      0  |  deliv   
  21 |               |  icmpi      0  | icmpo      0  |
NET |  eth0      0% |  pcki       5 |  pcko       5  | si    0 Kbps  |  so    0 
Kbps |  erri       0 |  erro       0  | drpo       0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 
Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   
RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/3
17450      -   root      root          1   9.96s    0.00s      0K       0K      
0K       0K  --     -  R       2  100%   kworker/u8:5
 4746      -   martin    martin        2   0.06s    0.15s      0K       0K      
0K       0K  --     -  S       1    2%   konsole
10508      -   root      root          1   0.13s    0.00s      0K       0K     
96K    4048K  --     -  S       1    1%   kworker/u8:18
 1488      -   root      root          1   0.06s    0.06s      0K       0K      
0K       0K  --     -  S       0    1%   Xorg
17332      -   root      root          1   0.12s    0.00s      0K       0K     
96K     580K  --     -  R       3    1%   kworker/u8:4
17454      -   root      root          1   0.11s    0.00s      0K       0K     
32K    4416K  --     -  D       1    1%   kworker/u8:6
17516      -   root      root          1   0.09s    0.00s      0K       0K     
16K     136K  --     -  S       3    1%   kworker/u8:7
 3268      -   martin    martin        3   0.02s    0.05s      0K       0K      
0K       0K  --     -  S       1    1%   kwin
10036      -   root      root          1   0.05s    0.02s      0K       0K      
0K       0K  --     -  R       0    1%   atop



So BTRFS is basically busy with itself and nothing else. Look at the SSD
usage. They are *idling* around. Heck 2400 write accesses in 10 seconds.
Thats a joke with SSDs that can do 40000 IOPS (depending on how and what
you measure of course, like request size, read, write, iodepth and so).

Its kworker/u8:5 utilizing 100% of one core for minutes.



Its the random write case it seems. Here are values from fio job:

martin@merkaba:~> fio ssd-test.fio
seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.11
Starting 2 processes
Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
01h:06m:26s]
seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014
  write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec
    clat (usec): min=3, max=38048, avg=10.52, stdev=205.25
     lat (usec): min=3, max=38048, avg=10.66, stdev=205.43
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    7], 80.00th=[    8], 90.00th=[    8], 95.00th=[    9],
     | 99.00th=[   14], 99.50th=[   20], 99.90th=[  211], 99.95th=[ 2128],
     | 99.99th=[10304]
    bw (KB  /s): min=164328, max=812984, per=100.00%, avg=345585.75, 
stdev=201695.20
    lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12%
    lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
  cpu          : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Seems fine.


But:

rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014
  write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec
    clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61
     lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
     | 30.00th=[    6], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    7], 80.00th=[    7], 90.00th=[    9], 95.00th=[   10],
     | 99.00th=[   18], 99.50th=[   19], 99.90th=[   28], 99.95th=[  116],
     | 99.99th=[16711680]
    bw (KB  /s): min=    0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02
    lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02%
    lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02%
  cpu          : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=35084/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, 
mint=12204msec, maxt=12204msec

Run status group 1 (all jobs):
  WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, 
mint=137803msec, maxt=137803msec


What? 254 IOPS? With a Dual SSD BTRFS RAID 1?

What?

Ey, *what*?



Repeating with the random write case.

Its a different kworker now, but similar result:

ATOP - merkaba                          2014/12/27  13:51:48                    
      -----------                           10s elapsed
PRC |  sys   10.66s |  user   0.25s |  #proc    330  | #trun      2  |  #tslpi  
 545 |  #tslpu     2 |  #zombie    0  | no  procacct  |
CPU |  sys     105% |  user      3% |  irq       0%  | idle    292%  |  wait    
  0% |  guest     0% |  curf 3.07GHz  | curscal  95%  |
cpu |  sys      92% |  user      0% |  irq       0%  | idle      8%  |  cpu002 
w  0% |  guest     0% |  curf 3.19GHz  | curscal  99%  |
cpu |  sys       8% |  user      0% |  irq       0%  | idle     92%  |  cpu003 
w  0% |  guest     0% |  curf 3.09GHz  | curscal  96%  |
cpu |  sys       3% |  user      2% |  irq       0%  | idle     95%  |  cpu000 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
cpu |  sys       2% |  user      1% |  irq       0%  | idle     97%  |  cpu001 
w  0% |  guest     0% |  curf 3.00GHz  | curscal  93%  |
CPL |  avg1    1.00 |  avg5    1.32 |  avg15   1.23  |               |  csw    
34484 |  intr   23182 |                | numcpu     4  |
MEM |  tot    15.5G |  free    5.4G |  cache   8.3G  | buff    0.0M  |  slab  
334.8M |  shmem 207.2M |  vmbal   0.0M  | hptot   0.0M  |
SWP |  tot    12.0G |  free   11.7G |                |               |          
     |               |  vmcom   3.4G  | vmlim  19.7G  |
LVM |     sata-home |  busy      1% |  read      36  | write   2502  |  KiB/w   
   4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
LVM |    msata-home |  busy      1% |  read      48  | write   2502  |  KiB/w   
   4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
LVM |  msata-debian |  busy      0% |  read       0  | write      6  |  KiB/w   
   7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 1.33 ms  |
LVM |   sata-debian |  busy      0% |  read       0  | write      6  |  KiB/w   
   7 |  MBr/s   0.00 |  MBw/s   0.00  | avio 0.17 ms  |
DSK |           sda |  busy      1% |  read      36  | write   2494  |  KiB/w   
   4 |  MBr/s   0.01 |  MBw/s   0.98  | avio 0.06 ms  |
DSK |           sdb |  busy      1% |  read      48  | write   2494  |  KiB/w   
   4 |  MBr/s   0.02 |  MBw/s   0.98  | avio 0.04 ms  |
NET |  transport    |  tcpi      32 |  tcpo      30  | udpi       2  |  udpo    
   2 |  tcpao      2 |  tcppo      1  | tcprs      0  |
NET |  network      |  ipi       35 |  ipo       32  | ipfrw      0  |  deliv   
  35 |               |  icmpi      0  | icmpo      0  |
NET |  eth0      0% |  pcki      19 |  pcko      16  | si    9 Kbps  |  so    2 
Kbps |  erri       0 |  erro       0  | drpo       0  |
NET |  lo      ---- |  pcki      16 |  pcko      16  | si    2 Kbps  |  so    2 
Kbps |  erri       0 |  erro       0  | drpo       0  |

  PID    TID   RUID      EUID        THR  SYSCPU   USRCPU   VGROW    RGROW   
RDDSK    WRDSK  ST   EXC  S   CPUNR   CPU   CMD        1/2
11746      -   root      root          1  10.00s    0.00s      0K       0K      
0K       0K  --     -  R       2  100%   kworker/u8:0
12254      -   root      root          1   0.16s    0.00s      0K       0K    
112K    1712K  --     -  S       3    2%   kworker/u8:3  
17517      -   root      root          1   0.16s    0.00s      0K       0K    
144K    1764K  --     -  S       1    2%   kworker/u8:8



And now the graphical environemnt is locked. Continuining on TTY1.

Doing another fio job with tee so I can get output easily.

Wow! I wonder whether this is reproducable with a fresh BTRFS with fio 
stressing it.

Like a 10 GiB BTRFS with 5 GiB fio test file and just letting it run.


Okay, I let the final fio job complete and include the output here.


Okay, and there we are and I do have sysrq-t figures. 

Okay, this is 1,2 MiB xz packed. So I better start a bug report about this
and attach it there. Is dislike cloud URLs that may disappear at some time.



Now please finally acknowledge that there is an issue. Maybe I was not
using the correct terms at the beginning, but there is a real issue. I do
performance things for half a decade at least, I know that there is an issue
when I see it.




There we go:

Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes 
on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

signature.asc
Description: This is a digitally signed message part.

Re: BTRFS free space handling still needs more work: Hangs again

Reply via email to