RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-24 Thread David Woodhouse
On Wed, 2010-06-23 at 20:43 -0700, Daniel Taylor wrote:
 There is also the issue of btrfs over RAID (which I know is not
 entirely sensible, but which will happen). 

Well, we could discourage that by merging the RAID support that's been
pending for a while, but I suspect Chris is a bit busy right now :)

-- 
dwmw2



Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-24 Thread Chris Mason
On Thu, Jun 24, 2010 at 12:37:59AM +0100, Jamie Lokier wrote:
 Edward Shishkin wrote:
  Chris Mason wrote:
  On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:

  Chris Mason wrote:
  
  On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:

  I'll reproduce from your test case and provide a fix.  mount -o
  max_inline=1500 would give us 50% usage in the worst case
  
  This is a very strange statement: how did you calculate this lower bound?
  
  
  We want room for the extent and the inode item and the inode backref.
  It's a rough number that leaves some extra room.  But even with a 2K
  item we're getting very close to 50% usage of the metadata area.
  

  (minus the
  balancing bug you hit).
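  As a rough illustration of where that figure comes from (the sizes
  below are ballpark assumptions for a 4K leaf, not the exact on-disk
  btrfs structures), a single file stored as a 2K inline extent already
  ties up more than half of a leaf, so a leaf that ends up holding only
  one such file cannot do much better than roughly 50-60% utilization:

  /* Hedged sketch of the leaf-utilization math above.  All sizes are
   * ballpark assumptions for illustration, not the exact on-disk
   * btrfs structure sizes. */
  #include <stdio.h>

  int main(void)
  {
          const int leaf_size   = 4096; /* assumed 4K metadata leaf        */
          const int leaf_header = 100;  /* assumed leaf header overhead    */
          const int item_header = 25;   /* assumed per-item header         */
          const int inode_item  = 160;  /* assumed inode item payload      */
          const int inode_ref   = 20;   /* assumed inode backref payload   */
          const int inline_data = 2048; /* 2K file stored as inline extent */

          /* one small file = inode item + inode backref + inline extent,
           * each of which also carries an item header in the leaf */
          int per_file = 3 * item_header + inode_item + inode_ref + inline_data;
          int usable   = leaf_size - leaf_header;

          printf("per-file metadata: %d of %d usable bytes (%.0f%%)\n",
                 per_file, usable, 100.0 * per_file / usable);
          printf("a second such file would need %d more bytes: does not fit\n",
                 per_file);
          return 0;
  }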
  
  Ok, the balancing bug was interesting.  What happens is we create all
  the inodes and directory items nicely packed into the leaves.
  
  Then delayed allocation kicks in and starts inserting the big fat inline
  extents.  This often forces the balancing code to split a leaf twice in
  order to make room for the new inline extent.  The double split code
  wasn't balancing the bits that were left over on either side of our
  desired slot.
  
  The patch below fixes the problem for me, reducing the number of leaves
  that have more than 2K of free space down from 6% of the total to about
  74 leaves.  It could be zero, but the balancing code doesn't push
  items around if our leaf is in the first or last slot of the parent
  node (complexity vs benefit tradeoff).
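  A toy model of that failure mode (made-up item sizes, not the real
  kernel code): the numbers are chosen so that neither half of the leaf
  can absorb the new fat inline extent, forcing the double split, and
  without a follow-up push the small items on either side of the
  insertion slot are stranded in two half-empty leaves:

  /* Toy model of the double-split problem (not the real btrfs code):
   * a leaf is just a running byte count and the item sizes are made up,
   * chosen so that neither half of the split leaf can absorb the new
   * fat inline extent. */
  #include <stdio.h>

  #define LEAF_SIZE 4096

  static void report(const char *name, int used)
  {
          printf("%-32s %4d bytes (%3.0f%% full)\n",
                 name, used, 100.0 * used / LEAF_SIZE);
  }

  int main(void)
  {
          int left_of_slot  = 2000; /* packed small items before the slot */
          int right_of_slot = 2000; /* packed small items after the slot  */
          int fat_extent    = 2100; /* new inline extent being inserted   */

          /* 2100 bytes fit next to neither 2000-byte side, so the leaf is
           * split twice and the new extent ends up alone in its own leaf */
          report("left leftover leaf", left_of_slot);
          report("new extent leaf", fat_extent);
          report("right leftover leaf", right_of_slot);

          /* the fix: after the double split, push the leftovers together
           * instead of leaving two half-empty leaves behind */
          if (left_of_slot + right_of_slot <= LEAF_SIZE)
                  report("leftovers merged, one leaf freed",
                         left_of_slot + right_of_slot);
          return 0;
  }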

  Nevertheless, I see leaves which are not in the first or last slot
  but are mergeable with their neighbors (test case is the same):
  
  
  Correct, but it was in the first or last slot when it was balanced (I
  confirmed this with printk).
  
  Then the higher layers were balanced and our leaves were no longer in
  the first/last slot.  We don't rebalance leaves when we balance level 1.
  

  leaf 269557760 items 22 free space 2323 generation 25 owner 2
  fs uuid 614fb921-cfa9-403d-9feb-940021e72382
  chunk uuid b1674882-a445-45f9-b250-0985e483d231
  
  leaf 280027136 items 18 free space 2627 generation 25 owner 2
  fs uuid 614fb921-cfa9-403d-9feb-940021e72382
  chunk uuid b1674882-a445-45f9-b250-0985e483d231
  
  node 269549568 level 1 items 60 free 61 generation 25 owner 2
  fs uuid 614fb921-cfa9-403d-9feb-940021e72382
  chunk uuid b1674882-a445-45f9-b250-0985e483d231
 key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
 key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
 key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
 key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
 key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
 key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
 key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
 key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
 key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
 key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
 key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
 key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
 key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
 key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
 key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
 key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
  [...]
  
  
  With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
  That doesn't sound like a huge improvement, but the default from
  mkfs.btrfs is to duplicate metadata.  After duplication, that's about
  417MB or about 40% of the overall space.
  
  When you factor in the space that we reserve to avoid exploding on
  enospc and the space that we have allocated for data extents, that's not
  a very surprising number.
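  For what it's worth, the 417MB / ~40% figures fall straight out of the
  arithmetic, assuming "1GB" means 1024 MiB and that every 2K file is
  stored entirely in metadata, which is duplicated:

  /* Back-of-the-envelope check of the numbers above.  Assumptions:
   * "1GB" means 1024 MiB, and every 2K file lives entirely in
   * metadata, which mkfs.btrfs duplicates by default. */
  #include <stdio.h>

  int main(void)
  {
          long long files  = 106894;
          long long stored = files * 2048 * 2;      /* 2K per file, DUP */
          double    mib    = stored / (1024.0 * 1024.0);

          printf("%lld files * 2K * 2 copies = %.1f MiB\n", files, mib);
          printf("that is %.0f%% of a 1024 MiB filesystem\n",
                 100.0 * mib / 1024.0);
          return 0;
  }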
  
  I'm still putting this patch through more testing; the double split code
  is used in some difficult corners and I need to make sure I've tried
  all of them.

  Chris, for further code review we need documents that reflect
  the original ideas behind the balancing, etc. Could you please provide them?
  Obviously, I cannot write them instead of you, as I don't know those ideas.
  
  
  
  Which part are you most interested in?

  
  Balancing.
  
  What conditions do we want to provide?
  Why won't we get a utilization slump in the future?
  How do we fix things when something goes wrong?
  Whereas, according to the single existing paper, the Btrfs
  design looks really broken, because:
 
  1. the balancing condition n <= number_of_keys <= 2n+1
    is not satisfied (indeed, when number_of_keys is 1
    we have n <= 1 -- false for any n > 1);
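  Spelled out, with n taken as the minimum fill factor, the invariant
  being appealed to looks like this (an illustrative check only, not
  btrfs code), and a node carrying a single key breaks its lower bound
  whenever n > 1:

  /* Illustrative check of the classic B-tree fill invariant referred
   * to above, with n taken as the minimum fill factor. */
  #include <stdio.h>

  static int balanced(int keys, int n)
  {
          return keys >= n && keys <= 2 * n + 1;
  }

  int main(void)
  {
          int n = 3; /* example minimum fill factor, chosen arbitrarily */

          printf("keys=1, n=%d: %s\n", n,
                 balanced(1, n) ? "ok" : "violates the lower bound");
          printf("keys=5, n=%d: %s\n", n,
                 balanced(5, n) ? "ok" : "violates the lower bound");
          return 0;
  }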
 
 That doesn't matter.  It is not necessary (or desirable) to follow the
 

RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-24 Thread Daniel Taylor
 

 -Original Message-
 From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On 
 Behalf Of Mike Fedyk
 Sent: Wednesday, June 23, 2010 9:51 PM
 To: Daniel Taylor
 Cc: Daniel J Blueman; Mat; LKML; 
 linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler; 
 Andrew Morton; Linus Torvalds; The development of BTRFS
 Subject: Re: Btrfs: broken file system design (was Unbound(?) 
 internal fragmentation in Btrfs)
 
 On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor 
 daniel.tay...@wdc.com wrote:
  Just an FYI reminder.  The original test (2K files) is utterly
  pathological for disk drives with 4K physical sectors, such as
  those now shipping from WD, Seagate, and others.  Some of the
 SSDs have larger (16K) or smaller blocks (2K).  There is also
  the issue of btrfs over RAID (which I know is not entirely
  sensible, but which will happen).
 
  The absolute minimum allocation size for data should be the same
  as, and aligned with, the underlying disk block size.  If that
  results in underutilization, I think that's a good thing for
  performance, compared to read-modify-write cycles to update
  partial disk blocks.
 
 Block size = 4k
 
 Btrfs packs smaller objects into the blocks in certain cases.
 

As long as no object smaller than the disk block size is ever
flushed to media, and all flushed objects are aligned to the disk
blocks, there should be no real performance hit from that.
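A minimal sketch of that rule, assuming a 4K physical sector (the
helper below is illustrative, not existing btrfs code): every data
allocation is rounded up to, and starts on, a physical-block boundary,
so no write ever touches a partial sector:

/* Hedged sketch of the alignment rule above, assuming a 4K physical
 * sector.  Every data allocation is rounded up to a whole number of
 * physical blocks so nothing smaller than a sector is written in place. */
#include <stdio.h>

#define PHYS_BLOCK 4096ULL /* assumed physical sector size */

static unsigned long long align_up(unsigned long long bytes)
{
        return (bytes + PHYS_BLOCK - 1) & ~(PHYS_BLOCK - 1);
}

int main(void)
{
        unsigned long long sizes[] = { 100, 2048, 4096, 6000 };

        for (int i = 0; i < 4; i++)
                printf("request %5llu bytes -> allocate %5llu bytes\n",
                       sizes[i], align_up(sizes[i]));
        return 0;
}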

Otherwise we end up with the same damage as in the ext[234] family, where
the file blocks can be aligned, but the 1K inode updates cause
read-modify-write (RMW) cycles and cost a 10% performance
hit for creation/update of large numbers of files.

An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
drive), which is painful.
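For reference, the 11 msec is simply one full platter rotation at
5400 RPM:

/* Where the 11 msec comes from: one rotation at 5400 RPM. */
#include <stdio.h>

int main(void)
{
        printf("rotation period = %.1f ms\n", 60.0 * 1000.0 / 5400.0);
        return 0;
}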