On Sunday, 28 December 2014, 16:42:20, Martin Steigerwald wrote:
> On Sunday, 28 December 2014, 06:52:41, Robert White wrote:
> > On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
> > > On Saturday, 27 December 2014, 20:03:09, Robert White wrote:
> > >> Now:
> > >>
> > >> The complaining party has verified the minimum, repeatable case of
> > >> simple file allocation on a very fragmented system and the responding
> > >> party and several others have understood and supported the bug.
> > >
> > > I haven't provided such a test case yet.
> > 
> > My bad.
> > 
> > >
> > > At the moment I can only reproduce this case of a kworker thread using a
> > > CPU for minutes with my /home filesystem.
> > >
> > > A minimal test case for me would be one that reproduces it on a fresh
> > > BTRFS filesystem. But so far, with my testcase on a fresh BTRFS, I get
> > > 4800 instead of 270 IOPS.
> > >
> > 
> > A version of the test case to demonstrate absolutely system-clogging 
> > loads is pretty easy to construct.
> > 
> > Make a raid1 filesystem.
> > Balance it once to make sure the seed filesystem is fully integrated.
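For reference, the raid1 setup could look roughly like this (a sketch only; the device names and mount point are placeholders, not taken from the thread):

mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY
mount /dev/sdX /mnt/Work
# rewrite the existing chunks so both devices carry raid1 copies
btrfs balance start /mnt/Work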
> > 
> > Create a bunch of small files that are at least 4K in size, but are 
> > randomly sized. Fill the entire filesystem with them.
> > 
> > BASH Script:
> > typeset -i counter=0
> > while
> >   dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) count=1 2>/dev/null
> > do
> >   echo $counter >/dev/null # basically a noop
> > done
> >
> > The while will exit when the dd encounters a full filesystem.
> > 
> > Then delete ~10% of the files with
> > rm *0
> > 
> > Run the while loop again, then delete a different 10% with "rm *1".
> > 
> > Then again with rm *2, etc...
> > 
> > Do this a few times and with each iteration the CPU usage gets worse and 
> > worse. You'll easily get system-wide stalls on all IO tasks lasting ten 
> > or more seconds.
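If it helps, the whole fill-and-delete cycle can be scripted along those lines. A sketch, assuming /mnt/Work is the mounted raid1 btrfs from above and that five rounds are enough:

#!/bin/bash
# Sketch of the fill/delete cycle described above.
typeset -i counter=0

fill() {
        # create randomly sized small files until dd fails on a full filesystem
        while dd if=/dev/urandom of=/mnt/Work/$((++counter)) \
                 bs=$((4096 + $RANDOM)) count=1 2>/dev/null
        do
                :
        done
}

for digit in 0 1 2 3 4; do
        fill
        rm -f /mnt/Work/*$digit          # drop roughly 10% of the files
        btrfs filesystem df /mnt/Work    # optional: watch data/metadata usage per round
done
fill                                     # final fill into the now fragmented free space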
> 
> Thanks, Robert. That's wonderful.
> 
> I had already wondered about such a test case and thought about reproducing
> it with fallocate calls instead, to reduce the amount of actual writes done.
> I.e. a workload that just fallocates files, truncates them, writes only some
> parts with dd seek, and removes things again.
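Such a workload could be sketched roughly like this (sizes, offsets and the test directory are arbitrary choices for illustration, not anything from the thread):

#!/bin/bash
# Hypothetical fallocate / truncate / partial-write / remove workload as
# described above. All names and numbers are arbitrary.
TESTDIR="./test"
mkdir -p "$TESTDIR"

typeset -i i=0
while true; do
        f="$TESTDIR/$((++i))"
        fallocate -l $((65536 + RANDOM)) "$f" || break    # reserve space, stop when full
        truncate -s $((32768 + RANDOM)) "$f"              # change the file size again
        dd if=/dev/zero of="$f" bs=4096 count=1 \
           seek=$((RANDOM % 8)) conv=notrunc 2>/dev/null   # rewrite one 4 KiB block
        if (( i % 10 == 0 )); then
                rm -f "$TESTDIR/$((i - 5))"               # now and then remove an older file
        fi
done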
> 
> Feel free to add your testcase to the bug report:
> 
> [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for 
> minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Because anything that helps a BTRFS developer to reproduce this will make it
> easier to find and fix the root cause.
> 
> I think I will try with this little critter:
> 
> merkaba:/mnt/btrfsraid1> cat freespracefragment.sh 
> #!/bin/bash
> 
> TESTDIR="./test"
> mkdir -p "$TESTDIR"
> 
> typeset -i counter=0
> # loop until fallocate fails, i.e. the filesystem is full
> while fallocate -l $((4096 + $RANDOM)) "$TESTDIR/$((++counter))"; do
>         echo $counter >/dev/null # basically a noop
> done
> 
> It takes a while. The script itself uses only a few percent of one core,
> while keeping the SSDs busier than I thought it would. I see up to 12000
> writes per 10 seconds – that is only about 1200 IOPS and roughly 4 MiB/s, so
> not that much, yet it keeps one SSD about 80% busy:
> 
> ATOP - merkaba    2014/12/28  16:40:57    -----------    10s elapsed
> PRC | sys    1.50s | user   3.47s | #proc    367 | #trun      1 | #tslpi   649 | #tslpu     0 | #zombie    0 | clones   839 | no procacct |
> CPU | sys      30% | user     38% | irq       1% | idle    293% | wait     37% | steal     0% | guest     0% | curf 1.63GHz | curscal  50% |
> cpu | sys       7% | user     11% | irq       1% | idle     75% | cpu000 w   6% | steal     0% | guest     0% | curf 1.25GHz | curscal  39% |
> cpu | sys       8% | user     11% | irq       0% | idle     76% | cpu002 w   4% | steal     0% | guest     0% | curf 1.55GHz | curscal  48% |
> cpu | sys       7% | user      9% | irq       0% | idle     71% | cpu001 w  13% | steal     0% | guest     0% | curf 1.75GHz | curscal  54% |
> cpu | sys       8% | user      7% | irq       0% | idle     71% | cpu003 w  14% | steal     0% | guest     0% | curf 1.96GHz | curscal  61% |
> CPL | avg1    1.69 | avg5    1.30 | avg15   0.94 | csw    68387 | intr   36928 | numcpu     4 |
> MEM | tot    15.5G | free    3.1G | cache   8.8G | buff    4.2M | slab    1.0G | shmem 210.3M | shrss  79.1M | vmbal   0.0M | hptot   0.0M | hpuse   0.0M |
> SWP | tot    12.0G | free   11.5G | vmcom   4.9G | vmlim  19.7G |
> LVM | a-btrfsraid1 | busy     80% | read       0 | write  11873 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.31 | avq     1.11 | avio 0.67 ms |
> LVM | a-btrfsraid1 | busy      5% | read       0 | write  11873 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.31 | avq     2.45 | avio 0.04 ms |
> LVM |   msata-home | busy      3% | read       0 | write    175 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   0.06 | avq     1.71 | avio 1.43 ms |
> LVM | msata-debian | busy      0% | read       0 | write     10 | KiB/r      0 | KiB/w      8 | MBr/s   0.00 | MBw/s   0.01 | avq     1.15 | avio 3.40 ms |
> LVM |    sata-home | busy      0% | read       0 | write    175 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   0.06 | avq     1.71 | avio 0.04 ms |
> LVM |  sata-debian | busy      0% | read       0 | write     10 | KiB/r      0 | KiB/w      8 | MBr/s   0.00 | MBw/s   0.01 | avq     1.00 | avio 0.10 ms |
> DSK |          sdb | busy     80% | read       0 | write  11880 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.38 | avq     1.11 | avio 0.67 ms |
> DSK |          sda | busy      5% | read       0 | write  12069 | KiB/r      0 | KiB/w      3 | MBr/s   0.00 | MBw/s   4.38 | avq     2.51 | avio 0.04 ms |
> NET | transport    | tcpi      26 | tcpo      26 | udpi       0 | udpo       0 | tcpao      2 | tcppo      1 | tcprs      0 | tcpie      0 | udpie      0 |
> NET | network      | ipi       26 | ipo       26 | ipfrw      0 | deliv     26 | icmpi      0 | icmpo      0 |
> NET | eth0      0% | pcki      10 | pcko      10 | si    5 Kbps | so    1 Kbps | coll       0 | erri       0 | erro       0 | drpi       0 | drpo       0 |
> NET | lo      ---- | pcki      16 | pcko      16 | si    2 Kbps | so    2 Kbps | coll       0 | erri       0 | erro       0 | drpi       0 | drpo       0 |
> 
>   PID     TID    RUID      EUID      THR   SYSCPU   USRCPU   VGROW   RGROW   RDDSK    WRDSK   ST   EXC   S   CPUNR   CPU   CMD        1/4
>  9169       -    martin    martin     14    0.22s    1.53s      0K      0K      0K       4K   --     -   S       1   18%   amarok
>  1488       -    root      root        1    0.34s    0.27s    220K      0K      0K       0K   --     -   S       2    6%   Xorg
>  6816       -    martin    martin      7    0.05s    0.44s      0K      0K      0K       0K   --     -   S       1    5%   kmail
> 24390       -    root      root        1    0.20s    0.25s     24K     24K      0K   40800K   --     -   S       0    5%   freespracefrag
>  3268       -    martin    martin      3    0.08s    0.34s      0K      0K      0K      24K   --     -   S       0    4%   kwin
> 
> 
> 
> But only with a low amount of writes:
> 
> merkaba:/mnt/btrfsraid1> vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  2  0 538424 3326248   4304 9202576    6   11  1968  4029  273  207 15 10 72  3  0
>  1  0 538424 3325244   4304 9202836    0    0     0  6456 3498 7635 11  8 72 10  0
>  0  0 538424 3325168   4304 9202932    0    0     0  9032 3719 6764  9  9 74  9  0
>  0  0 538424 3334508   4304 9202932    0    0     0  8936 3548 6035  7  8 76  9  0
>  0  0 538424 3334144   4304 9202876    0    0     0  9008 3335 5635  7  7 76 10  0
>  0  0 538424 3332724   4304 9202728    0    0     0 11240 3555 5699  7  8 76 10  0
>  2  0 538424 3333328   4304 9202876    0    0     0  9080 3724 6542  8  8 75  9  0
>  0  0 538424 3333328   4304 9202876    0    0     0  6968 2951 5015  7  7 76 10  0
>  0  1 538424 3332832   4304 9202584    0    0     0  9160 3663 6772  8  8 76  9  0

Let me rephrase that.

On one hand that is rather low, but for this kind of workload, which only
*fallocates* new files, it is actually quite a lot. I only tell it to *reserve*
the space for the files, I do not tell it to write to them. And yet it is about
6-12 MiB/s.
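A quick way to double-check that the writes do not come from the fallocate tool itself but from the filesystem would be something like this (path and size are just examples):

strace -f -e trace=fallocate,write,pwrite64 \
    fallocate -l $((4096 + 12345)) /mnt/btrfsraid1/test/strace-probe
# expected: a single fallocate() syscall and no data writes for the file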

> Still it keeps one of the two SSDs about 80% busy:
> 
> iostat -xz 1
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            7,04    0,00    7,04    9,80    0,00   76,13
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0,00     0,00    0,00 1220,00     0,00  4556,00     7,47     0,12    0,10    0,00    0,10   0,04   5,10
> sdb               0,00    10,00    0,00 1210,00     0,00  4556,00     7,53     0,85    0,70    0,00    0,70   0,66  79,90
> dm-2              0,00     0,00    0,00    4,00     0,00    36,00    18,00     0,02    5,00    0,00    5,00   4,25   1,70
> dm-5              0,00     0,00    0,00    4,00     0,00    36,00    18,00     0,00    0,25    0,00    0,25   0,25   0,10
> dm-6              0,00     0,00    0,00 1216,00     0,00  4520,00     7,43     0,12    0,10    0,00    0,10   0,04   5,00
> dm-7              0,00     0,00    0,00 1216,00     0,00  4520,00     7,43     0,84    0,69    0,00    0,69   0,66  79,70
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            6,55    0,00    7,81    9,32    0,00   76,32
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,09    0,07    0,00    0,07   0,03   3,80
> sdb               0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,79    0,66    0,00    0,66   0,64  77,10
> dm-6              0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,09    0,07    0,00    0,07   0,03   4,00
> dm-7              0,00     0,00    0,00 1203,00     0,00  4472,00     7,43     0,79    0,66    0,00    0,66   0,64  77,10
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            7,79    0,00    7,79    9,30    0,00   75,13
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,09    0,07    0,00    0,07   0,04   4,70
> sdb               0,00     0,00    4,00 1202,00  2048,00  4468,00    10,81     0,86    0,71    4,75    0,70   0,65  78,10
> dm-1              0,00     0,00    4,00    0,00  2048,00     0,00  1024,00     0,02    4,75    4,75    0,00   2,00   0,80
> dm-6              0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,08    0,07    0,00    0,07   0,04   4,60
> dm-7              0,00     0,00    0,00 1202,00     0,00  4468,00     7,43     0,84    0,70    0,00    0,70   0,65  77,80
> 
> 
> Yet I hit neither full CPU usage nor full SSD usage (just 80%), so this is
> yet another interesting case.
[…]
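
As for pinning down what a busy kworker actually does while this happens, something along these lines might help (a sketch; root privileges assumed for /proc/*/stack and perf):

# pick the kworker that currently burns the most CPU
pid=$(ps -eo pid,pcpu,comm --sort=-pcpu | awk '/kworker/ {print $1; exit}')
cat /proc/$pid/stack     # its current kernel stack
perf top -p $pid         # or profile where it spends its CPU time
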
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
