Re: Unbound(?) internal fragmentation in Btrfs

2010-06-18 Thread Christian Stroetmann
 key (321 INODE_REF 256) itemoff 3819 itemsize 16
   inode ref index 66 namelen 6 name: file65
   item 2 key (321 XATTR_ITEM 3817753667) itemoff 3741 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
leaf 29995008 items 3 free space 1675 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (321 EXTENT_DATA 0) itemoff 1926 itemsize 2069
   inline extent data size 2048 ram 2048 compress 0
   item 1 key (322 INODE_ITEM 0) itemoff 1766 itemsize 160
   inode generation 8 size 2048 block group 29360128 mode 100644
links 1
   item 2 key (322 INODE_REF 256) itemoff 1750 itemsize 16
   inode ref index 67 namelen 6 name: file66
...

Appendix C.
---

D.E. Knuth, The Art of Computer Programming, vol. 3 (Sorting and Searching),
Addison-Wesley, Reading, MA, 1973.

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

 

Hi to all,

First of let me say: Btrfs really has matured a lot in the last months
and this is thanks to you guys (the developers !)

More and more people are making it their dedicated filesystem (MeeGo)
or an option (Ubuntu, Fedora)

So thank you very very much for your on-going efforts on making this
more and more a viable (and usable !) alternative/competition to zfs
:)

The problems Edward mentioned sound like some very important points
(issues ?) to investigate

I find it somewhat surprising that none of you guys commented on his mail
   

Me too.

I'm in no way a developer (in fact a power-user) but would
nevertheless be interested in your opinions on this matter -
especially Chris'

Thanks !

Mat

   

With all the best
Christian Stroetmann

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
   

Mat wrote:
 

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkinedw...@redhat.com  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.
 

There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
   
But then the benefits of splitting must be high, because it solves this 
problem if inline extents are turned on.
   

It must be a highly unexpected and difficult question for file system
developers: how efficiently does your file system manage disk space?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property scaling in their ability to
address large storage.

This is not a large storage, this is a scalable sieve: you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is not a crazy man. Plus a number of
papers which admire with the Btrfs phenomena. Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.
 

Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.
   
If we have to split everywhere, hasn't it then some (dramatic) impact on 
the performance of the Btrfs filesystem?

As it was said above: splitting has a high complexity.

-chris
--
   

Have fun
Christian Stroetmann
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Daniel J Blueman wrote:

On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
edward.shish...@gmail.com  wrote:
   

Mat wrote:
 

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkinedw...@redhat.com  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
inline extents, xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs inline
extents), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.
 

Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
   

I find this is somehow off-topic, but:
For sure, it isn't impossible. History showed and present shows that 
there are exceptions.

If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
   Daniel
   

Have fun
Christian Stroetmann
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann
 there are no deep problems, then I will believe you for now.


-- Jamie
   

With all the best
Christian Stroetmann


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html