Re: Unbound(?) internal fragmentation in Btrfs

2010-06-18 Thread Christian Stroetmann

Aloha Mat:

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkinedw...@redhat.com  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
inline extents, xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs inline
extents), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.

Any technical comments are welcome.

Thanks,
Edward.


Appendix A.
---
Glossary

1. Utilization of data and(or) metadata storage.

The fraction A/B, where
A is total size in bytes of stored data and(or) metadata.
B = N * S, where
N is number of blocks occupied by stored data and(or) metadata.
S is block size in bytes.

2. Internal fragmentation of data and(or) metadata storage.

difference (1 - U), where U is utilization.


Appendix B.
---
a period in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)

...

leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
   item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
   inline extent data size 2048 ram 2048 compress 0
   item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
   inode generation 8 size 2048 block group 29360128 mode 100644
links 1
   item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
   inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
   inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160
   inode generation 8 size 2048 block group 29360128 mode 100644
links 1
   item 1 

Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Mat wrote:

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize = 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
inline extents, xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs inline
extents), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.

Any technical comments are welcome.

Thanks,
Edward.


Appendix A.
---
Glossary

1. Utilization of data and(or) metadata storage.

The fraction A/B, where
A is total size in bytes of stored data and(or) metadata.
B = N * S, where
N is number of blocks occupied by stored data and(or) metadata.
S is block size in bytes.

2. Internal fragmentation of data and(or) metadata storage.

difference (1 - U), where U is utilization.


Appendix B.
---
a period in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)

...

leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
  location key (0 UNKNOWN 0) type 8
  namelen 16 datalen 32 name: security.selinux
  item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
  inline extent data size 2048 ram 2048 compress 0
  item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
  inode generation 8 size 2048 block group 29360128 mode 100644
links 1
  item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
  inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
  location key (0 UNKNOWN 0) type 8
  namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
  inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160
  inode generation 8 size 2048 block group 29360128 mode 100644
links 1
  item 1 key (321 

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Daniel J Blueman
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
edward.shish...@gmail.com wrote:
 Mat wrote:

 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:

 Hello everyone.

 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).

 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:

 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting No space left on device reports).

 # ls /mnt | wc -l
 59480

 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is hey, where are other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.

 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note, that this free space (3892) is _dead_: any
 attempts to write to the file system will result in No space left
 on device.

 Internal fragmentation (see Appendix A) of those 5 leafs is
 (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
 ext4 and xfs: The last ones in this example will show fragmentation
 near zero with blocksize = 2K. Even with 4K blocksize they will
 show better utilization 0.50 (against 0.38 in btrfs)!

 I have a small question for btrfs developers: Why do you folks put
 inline extents, xattr, etc items of variable size to the B-tree
 in spite of the fact that B-tree is a data structure NOT for variable
 sized records? This disadvantage of B-trees was widely discussed.
 For example, maestro D. Knuth warned about this issue long time
 ago (see Appendix C).

 It is a well known fact that internal fragmentation of classic Bayer's
 B-trees is restricted by the value 0.50 (see Appendix C). However it
 takes place only if your tree contains records of the _same_ length
 (for example, extent pointers). Once you put to your B-tree records
 of variable length (restricted only by leaf size, like btrfs inline
 extents), your tree LOSES this boundary. Moreover, even worse:
 it is clear, that in this case utilization of B-tree scales as zero(!).
 That said, for every small E and for every amount of data N we
 can construct a consistent B-tree, which contains data N and has
 utilization worse then E. I.e. from the standpoint of utilization
 such trees can be completely degenerated.

 That said, the very important property of B-trees, which guarantees
 non-zero utilization, has been lost, and I don't see in Btrfs code any
 substitution for this property. In other words, where is a formal
 guarantee that all disk space of our users won't be eaten by internal
 fragmentation? I consider such guarantee as a *necessary* condition
 for putting a file system to production.

Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.

If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
  Daniel
-- 
Daniel J Blueman
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting No space left on device reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is hey, where are other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note, that this free space (3892) is _dead_: any
 attempts to write to the file system will result in No space left
 on device.

There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).

 
 It must be a highly unexpected and difficult question for file system
 developers: how efficiently does your file system manage disk space?
 
 In the meanwhile I confirm that Btrfs design is completely broken:
 records stored in the B-tree differ greatly from each other (it is
 unacceptable!), and the balancing algorithms have been modified in
 insane manner. All these factors has led to loss of *all* boundaries
 holding internal fragmentation and to exhaustive waste of disk space
 (and memory!) in spite of the property scaling in their ability to
 address large storage.
 
 This is not a large storage, this is a scalable sieve: you can not
 rely on finding there some given amount of water even after infinite
 increasing the size of the sieve (read escalating the pool of Btrfs
 devices).
 
 It seems that nobody have reviewed Btrfs before its inclusion to the
 mainline. I have only found a pair of recommendations with a common
 idea that Btrfs maintainer is not a crazy man. Plus a number of
 papers which admire with the Btrfs phenomena. Sigh.
 
 Well, let's decide what can we do in current situation..
 The first obvious point here is that we *can not* put such file system
 to production. Just because it doesn't provide any guarantees for our
 users regarding disk space utilization.

Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.



There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  


Hello, Chris. Thanks for response!
I afraid that both ways won't fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There is no inline extents, and what are you going to split here?
All leafs must be at least a half filled, otherwise we loose all
boundaries, which provides non-zero utilization..

Any ideas?

Thanks,
Edward.

  

It must be a highly unexpected and difficult question for file system
developers: how efficiently does your file system manage disk space?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property scaling in their ability to
address large storage.

This is not a large storage, this is a scalable sieve: you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is not a crazy man. Plus a number of
papers which admire with the Btrfs phenomena. Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.



Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.

-chris

  


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting No space left on device reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is hey, where are other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note, that this free space (3892) is _dead_: any
 attempts to write to the file system will result in No space left
 on device.
 
 There are two easy ways to fix this problem.  Turn off the inline
 extents (max_inline=0) or allow splitting of the inline extents.  I
 didn't put in the splitting simply because the complexity was high while
 the benefits were low (in comparison with just turning off the inline
 extents).
 
 Hello, Chris. Thanks for response!
 I afraid that both ways won't fix the problem. Look at this leaf:
 
 [...]
 leaf 29425664 items 1 free space 3892 generation 8 owner 5
 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
 [...]
 
 There is no inline extents, and what are you going to split here?
 All leafs must be at least a half filled, otherwise we loose all
 boundaries, which provides non-zero utilization..

Right, there is no inline extent because we require them to fit entirely
in the leaf.  So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
   

Mat wrote:
 

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkinedw...@redhat.com  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.
 

There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
   
But then the benefits of splitting must be high, because it solves this 
problem if inline extents are turned on.
   

It must be a highly unexpected and difficult question for file system
developers: how efficiently does your file system manage disk space?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property scaling in their ability to
address large storage.

This is not a large storage, this is a scalable sieve: you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is not a crazy man. Plus a number of
papers which admire with the Btrfs phenomena. Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.
 

Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.
   
If we have to split everywhere, hasn't it then some (dramatic) impact on 
the performance of the Btrfs filesystem?

As it was said above: splitting has a high complexity.

-chris
--
   

Have fun
Christian Stroetmann
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 05:21:21PM +0200, Christian Stroetmann wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkinedw...@redhat.com  wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting No space left on device reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is hey, where are other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note, that this free space (3892) is _dead_: any
 attempts to write to the file system will result in No space left
 on device.
 There are two easy ways to fix this problem.  Turn off the inline
 extents (max_inline=0) or allow splitting of the inline extents.  I
 didn't put in the splitting simply because the complexity was high while
 the benefits were low (in comparison with just turning off the inline
 extents).
 But then the benefits of splitting must be high, because it solves
 this problem if inline extents are turned on.

It depends, we might also argue that for larger inline extents like this
we are better off with much larger leaves or with using max_inline=0
instead.

 It must be a highly unexpected and difficult question for file system
 developers: how efficiently does your file system manage disk space?
 
 In the meanwhile I confirm that Btrfs design is completely broken:
 records stored in the B-tree differ greatly from each other (it is
 unacceptable!), and the balancing algorithms have been modified in
 insane manner. All these factors has led to loss of *all* boundaries
 holding internal fragmentation and to exhaustive waste of disk space
 (and memory!) in spite of the property scaling in their ability to
 address large storage.
 
 This is not a large storage, this is a scalable sieve: you can not
 rely on finding there some given amount of water even after infinite
 increasing the size of the sieve (read escalating the pool of Btrfs
 devices).
 
 It seems that nobody have reviewed Btrfs before its inclusion to the
 mainline. I have only found a pair of recommendations with a common
 idea that Btrfs maintainer is not a crazy man. Plus a number of
 papers which admire with the Btrfs phenomena. Sigh.
 
 Well, let's decide what can we do in current situation..
 The first obvious point here is that we *can not* put such file system
 to production. Just because it doesn't provide any guarantees for our
 users regarding disk space utilization.
 Are you basing all of this on inline extents?  The other extents of
 variable size are more flexible (taking up the room in the leaf), but
 they can also easy be split during balancing.
 If we have to split everywhere, hasn't it then some (dramatic)
 impact on the performance of the Btrfs filesystem?
 As it was said above: splitting has a high complexity.

Yes.  Both Edward and I have worked on the reiserfs sources where inline
extents were split during balancing.  It made for a few surprises in the
file read/write code.  They aren't impossible by any means, but I wanted
to avoid them.

-chris

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Jamie Lokier
Edward Shishkin wrote:
 If you decide to base your file system on some algorithms then please
 use the original ones from proper academic papers. DO NOT modify the
 algorithms in solitude: this is very fragile thing! All such
 modifications must be reviewed by specialists in the theory of
 algorithms. Such review can be done in various scientific magazines of
 proper level.
 
 Personally I don't see any way to improve the situation with Btrfs
 except full redesigning the last one. If you want to base your file
 system on the paper of Ohad Rodeh, then please, use *exactly* the
 Bayer's B-trees that he refers to. That said, make sure that all
 records you put to the tree has equal length and all non-root nodes of
 your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records (key -
string) as a range of fixed length records ([key,index] - byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a compressed representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

 As to current Btrfs code: *NOT ACK*!!! I don't think we need such
 file systems.

Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.

If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.

Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.

[*] For example when filling a tree by inserting contiguously
ascending keys, the classic split into two when full heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
  

Chris Mason wrote:


On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.


There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  

Hello, Chris. Thanks for response!
I afraid that both ways won't fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There is no inline extents, and what are you going to split here?
All leafs must be at least a half filled, otherwise we loose all
boundaries, which provides non-zero utilization..



Right, there is no inline extent because we require them to fit entirely
in the leaf.  So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.
  


How about left and right neighbors? They contain a lot of
free space (1572 and 1901 respectively).
I am not happy with the very fact of such shallow leafs which
contain only one small (xattr) item..
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Daniel J Blueman wrote:

On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
edward.shish...@gmail.com wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize = 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
inline extents, xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs inline
extents), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.



Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
  


Sure it is impossible. I believe in division of labour:
academics writes algorithms, and we (engineers) encode them.

I have noticed that events in Btrfs develop by scenario not predicted
by the paper of academic Ohad Rodeh (in spite of the announce that
Btrfs is based on this paper). This is why I have started to grumble..

Thanks.


If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
  Daniel
  


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 06:22:39PM +0200, Edward Shishkin wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
 Chris Mason wrote:
 On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
 Mat wrote:
 On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin edw...@redhat.com 
 wrote:
 Hello everyone.
 
 I was asked to review/evaluate Btrfs for using in enterprise
 systems and the below are my first impressions (linux-2.6.33).
 
 The first test I have made was filling an empty 659M (/dev/sdb2)
 btrfs partition (mounted to /mnt) with 2K files:
 
 # for i in $(seq 100); \
 do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
 (terminated after getting No space left on device reports).
 
 # ls /mnt | wc -l
 59480
 
 So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
 and the first obvious question is hey, where are other 83% of my
 disk space??? I looked at the btrfs storage tree (fs_tree) and was
 shocked with the situation on the leaf level. The Appendix B shows
 5 adjacent btrfs leafs, which have the same parent.
 
 For example, look at the leaf 29425664: items 1 free space 3892
 (of 4096!!). Note, that this free space (3892) is _dead_: any
 attempts to write to the file system will result in No space left
 on device.
 There are two easy ways to fix this problem.  Turn off the inline
 extents (max_inline=0) or allow splitting of the inline extents.  I
 didn't put in the splitting simply because the complexity was high while
 the benefits were low (in comparison with just turning off the inline
 extents).
 Hello, Chris. Thanks for response!
 I afraid that both ways won't fix the problem. Look at this leaf:
 
 [...]
 leaf 29425664 items 1 free space 3892 generation 8 owner 5
 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
location key (0 UNKNOWN 0) type 8
namelen 16 datalen 32 name: security.selinux
 [...]
 
 There is no inline extents, and what are you going to split here?
 All leafs must be at least a half filled, otherwise we loose all
 boundaries, which provides non-zero utilization..
 
 Right, there is no inline extent because we require them to fit entirely
 in the leaf.  So we end up with mostly empty leaves because the inline
 item is large enough to make it difficult to push around but not large
 enough to fill the leaf.
 
 How about left and right neighbors? They contain a lot of
 free space (1572 and 1901 respectively).
 I am not happy with the very fact of such shallow leafs which
 contain only one small (xattr) item..

Sure, the balancing can also be made more aggressive.  This should be
very easy to fix.

-chris

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Daniel J Blueman wrote:

On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
edward.shish...@gmail.com  wrote:
   

Mat wrote:
 

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkinedw...@redhat.com  wrote:
   

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting No space left on device reports).

# ls /mnt | wc -l
59480

So, I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is hey, where are other 83% of my
disk space??? I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: items 1 free space 3892
(of 4096!!). Note, that this free space (3892) is _dead_: any
attempts to write to the file system will result in No space left
on device.

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
inline extents, xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs inline
extents), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.
 

Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
   

I find this is somehow off-topic, but:
For sure, it isn't impossible. History showed and present shows that 
there are exceptions.

If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
   Daniel
   

Have fun
Christian Stroetmann
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Christian Stroetmann

Jamie Lokier wrote:

Edward Shishkin wrote:
   

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.
 

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
   


This is the point: Which kind of modified B-tree data structure is best 
suited?



Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records (key -
string) as a range of fixed length records ([key,index] -  byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a compressed representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

   

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
file systems.
 

Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.
   


I think this is the most important design question in relation with 
filesystems that use some kind of B-trees, which means, if the wrong 
modified B-tree as the fundamental data structure was chosen, then this 
is a deep design bug.



If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.
   


This is a general problem of filesystem design, especially the packing 
and balancing heurisitcs, and a special problem of the Btrfs filesystem. 
You can't simply say do it in this or that way. That's why another 
filesystem uses something exotic like a B*-tree in conjunction with 
dancing trees as fundamental data structure, which leads back to the 
deep design question/problem/decision/bug/ And after I followed the 
explanations of this exotic B-tree version by the main developer I knew 
just right from the start of the development of the Btrfs filesystem 
that it wasn't chosen the right modified B-tree data structure, because 
it was too simple and too general. And since some days I have the 
impression that there wasn't made a design decision at all with the only 
exception that there has to be some kind of a B-tree algorithm/data 
structure in the Btrfs filesystem.


And I also think that such a deep desgin decision can't simply be 
corrected in general (subjective opinion).



Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
   


The found problem is the first problem with the chosen modified B-tree 
data structure. I wouldn't call it only a problem in a special case.



[*] For example when filling a tree by inserting contiguously
ascending keys, the classic split into two when full heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...
   


Yes, but 
Firstly, maybe you are too focused on the classic B-tree algorithm here.
Secondly, a trick here, a split there, turning off a feature and then? 
Then we have complexity at then end, which brings us back to the start, 
the design decision.


But if you say there are no 

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Jamie Lokier wrote:

Edward Shishkin wrote:
  

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.



First, thanks Edward for identifying a specific problem with the
current btrfs implementation.
  


Hello Jamie.


I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
  


Which property is robust?


Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.
  


I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length 
records don't
have any sane boundary for internal fragmentation. The common idea is 
that if we
don't want Btrfs to be in infinite development stage, then we should 
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly 
adhere this in

future.


This is easily shown by modelling variable-length records (key -
string) as a range of fixed length records ([key,index] - byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a compressed representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

  

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
file systems.



Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.
  


Frankly, I would like to believe to such end, taking into account amount 
of my
contributions to the Btrfs project. At least to make sure I didn't do 
useless

work..


If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well,


Because of top-down balancing. It doesn't like clever things like 
splitting

and merging. Currently top-down works properly only with stupid classic
Bayer's B-trees.


 and then propose a fix.
  


I'll try to help, but I am rather pessimistic here: working out 
algorithms is

something, which doesn't like timelines..


Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]


Ok, let's violate this, but not in the detriment of utilisation:
Internal fragmentation is a horror for file systems, the enemy #1
(which is completely defeated in the last century BTW).


  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
  


IMHO this is a problem, as we can not estimate number of such nodes.
Do you have any sane upper boundary for this number? I don't know such one.
Maybe I have missed something?


[*] For example when filling a tree by inserting contiguously
ascending keys, the classic split into two when full heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).


At the end of what? I hope you don't speak about on-line defragmentation?
Can you point me to the paper (if any)?

Thanks!


  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  



--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the 

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Chris Mason
On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
 Jamie Lokier wrote:
 Edward Shishkin wrote:
 If you decide to base your file system on some algorithms then please
 use the original ones from proper academic papers. DO NOT modify the
 algorithms in solitude: this is very fragile thing! All such
 modifications must be reviewed by specialists in the theory of
 algorithms. Such review can be done in various scientific magazines of
 proper level.
 
 Personally I don't see any way to improve the situation with Btrfs
 except full redesigning the last one. If you want to base your file
 system on the paper of Ohad Rodeh, then please, use *exactly* the
 Bayer's B-trees that he refers to. That said, make sure that all
 records you put to the tree has equal length and all non-root nodes of
 your tree are at least half filled.
 
 First, thanks Edward for identifying a specific problem with the
 current btrfs implementation.
 
 Hello Jamie.
 
 I've studied modified B-trees quite a lot and know enough to be sure
 that they are quite robust when you modify them in all sorts of ways.
 
 Which property is robust?
 
 Moreover, you are incorrect to say there's an intrinsic algorithmic
 problem with variable-length records.  It is not true; if Knuth said
 so, Knuth was mistaken.
 
 I didn't say about intrinsic algorithmic problems :)
 I just repeat (after Knuth et al) that B-trees with variable-length
 records don't
 have any sane boundary for internal fragmentation. The common idea
 is that if we
 don't want Btrfs to be in infinite development stage, then we should
 choose some
 *sane* strategy (for example the paper of Ohad Rodeh) and strictly
 adhere this in
 future.

Again, other than the inline file data, what exactly do you believe
needs to change?  Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves.  The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: default subvolume abilities/restrictions

2010-06-18 Thread C Anthony Risinger
On Sun, Jun 13, 2010 at 12:47 PM, C Anthony Risinger anth...@extof.me wrote:
 On Sat, Jun 12, 2010 at 8:06 PM, C Anthony Risinger anth...@extof.me wrote:
 On Sat, Jun 12, 2010 at 7:22 PM, David Brown bt...@davidb.org wrote:
 On Sat, Jun 12, 2010 at 06:06:23PM -0500, C Anthony Risinger wrote:

 # btrfs subvolume create new_root
 # mv . new_root/old_root

 can i at least get confirmation that the above is possible?

 I've had no problem with

  # btrfs subvolume snapshot . new_root
  # mkdir old_root
  # mv * old_root
  # rm -rf old_root

 Make sure the 'mv' fails fo move new_root, and I'd look at the
 new_root before removing everything.

 David

 heh, yeah i as i was writing the last email i realized that all i
 really wanted was to:

 # mv * new_root

 for some reason i was convinced that i must snapshot the old_root (.)
 to new_root... and then remove the erroneous stuff from old_root (.).
 thus a way to parent the default subvol (old_root/.) seemed a better
 solution...

 but alas, a snapshot isn't necessary.  i can create an empty subvol
 new_root, and then mv * new_root.

 i don't know how that escaped me :-), sorry for all the noise.
 however, there probably is a legitimate use case for wanting to
 replace the default subvolume, but this isn't it.

 C Anthony

 ok i take it all back, i DO need this...

 i rewrote my initramfs hook to do the following operations:

 # btrfs subvolume create /new_root
 # mv /* /new_root

 instead of what i had:

 # btrfs subvolume snapshot / /new_root

 and it resulted in scarily COPYING my entire system... several gigs
 worth... to the newly created subvolume, which took forever, and
 grinded on my HD for awhile.  i don't know how long because i went to
 bed.

 this is why i need a way to parent the default subvolume.

 a snapshot is nice and quick, but it leaves / full of erroneous
 folders (dev/etc/usr/lib), an entire system, that will no longer be
 used.  this space will in time become dead wasted space unless my
 users manually rm -rf themselves.

 so... any input on this?  how can i effectively, and efficiently, move
 a users installation into a dedicated subvolume, when they have
 already installed into the default subvolume?

 i think the best way is what i originally suggested; make an empty
 subvolume the new top-level subvol, and place the old top-level subvol
 INTO it with a new name.

 thoughts?

can i get a little feedback on this problem?  i choke slightly when
telling users the only way to clean their old / is by rm -rf'ing
everything

i need the ability to move the default subvolume into a new, empty
subvolume.  the empty subvolume then becomes the new default.

the end result is moving the users installation into a dedicated
subvolume, cleanly and efficiently, so the system can do other things
with the subroot structure.

the way i am doing it now is snapshotting the users / to /__active...
however, the side effect is an entire system worth of files that will
in time become dead space.

moving the users install via mv into an empty subvol does not work.
everything is actually copied = slow,slow,slow.

ideas???  how can i parent the default subvol with an empty subvol?
this seems a legitimate operation.

thanks,
C Anthony
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] current btrfs-progs-unstable fails to compile

2010-06-18 Thread Bernhard Rosenkraenzer
Hi,
current btrfs-progs-unstable fails to compile for me (Ark Linux devel
branch, using gcc 4.5 and eglibc 2.12):

/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:323:
error: undefined reference to 'S_ISDIR'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:340:
error: undefined reference to 'S_ISREG'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:328:
error: undefined reference to 'S_ISREG'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:340:
error: undefined reference to 'S_ISLNK'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:328:
error: undefined reference to 'S_ISLNK'
collect2: ld returned 1 exit status

Patch attached.

ttyl
bero--- btrfs-progs-unstable/btrfsck.c.ark	2010-06-18 22:52:52.592535586 +0200
+++ btrfs-progs-unstable/btrfsck.c	2010-06-18 22:52:59.540417164 +0200
@@ -18,6 +18,7 @@
 
 #define _XOPEN_SOURCE 500
 #define _GNU_SOURCE 1
+#include sys/stat.h
 #include stdio.h
 #include stdlib.h
 #include fcntl.h


Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
  

Jamie Lokier wrote:


Edward Shishkin wrote:
  

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.


First, thanks Edward for identifying a specific problem with the
current btrfs implementation.
  

Hello Jamie.



I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
  

Which property is robust?



Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.
  

I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length
records don't
have any sane boundary for internal fragmentation. The common idea
is that if we
don't want Btrfs to be in infinite development stage, then we should
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly
adhere this in
future.



Again, other than the inline file data, what exactly do you believe
needs to change?


1. getting rid of inline extents;
2. new formats for directory and xattr items to not look like a train,
  which is able to occupy the whole leaf;
3. make sure we do pro-active balancing like it is described in the paper.

Sorry, I don't see other ways for now..


  Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves.  The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.
  


How are you going to balance leaves when walking from top to down?
Suppose 1) and 2) above are not satisfied and having arrived to the leaf
level we see a number of items of variable length. What will we do to
keep leaves full?

Could you please provide a sketch of the algorithm?

Thanks!

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-18 Thread Ric Wheeler

On 06/18/2010 06:04 PM, Edward Shishkin wrote:

Chris Mason wrote:

On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:

Jamie Lokier wrote:

Edward Shishkin wrote:

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

Hello Jamie.


I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Which property is robust?


Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records. It is not true; if Knuth said
so, Knuth was mistaken.

I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length
records don't
have any sane boundary for internal fragmentation. The common idea
is that if we
don't want Btrfs to be in infinite development stage, then we should
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly
adhere this in
future.


Again, other than the inline file data, what exactly do you believe
needs to change?


1. getting rid of inline extents;
2. new formats for directory and xattr items to not look like a train,
which is able to occupy the whole leaf;
3. make sure we do pro-active balancing like it is described in the paper.

Sorry, I don't see other ways for now..


Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves. The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.


How are you going to balance leaves when walking from top to down?
Suppose 1) and 2) above are not satisfied and having arrived to the leaf
level we see a number of items of variable length. What will we do to
keep leaves full?

Could you please provide a sketch of the algorithm?

Thanks!



Hi Edward,

Is it really a requirement to have 100% full leaves? Most DB's (assuming I 
remember correctly) have deliberate strategies around this kind of thing. You 
might want to leave room in leaf nodes so that future insertions can be 
contiguous on the media with older data.


Regards,

Ric


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-18 Thread Edward Shishkin

Ric Wheeler wrote:

On 06/18/2010 06:04 PM, Edward Shishkin wrote:

Chris Mason wrote:

On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:

Jamie Lokier wrote:

Edward Shishkin wrote:
If you decide to base your file system on some algorithms then 
please

use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific 
magazines of

proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root 
nodes of

your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

Hello Jamie.


I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Which property is robust?


Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records. It is not true; if Knuth said
so, Knuth was mistaken.

I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length
records don't
have any sane boundary for internal fragmentation. The common idea
is that if we
don't want Btrfs to be in infinite development stage, then we should
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly
adhere this in
future.


Again, other than the inline file data, what exactly do you believe
needs to change?


1. getting rid of inline extents;
2. new formats for directory and xattr items to not look like a train,
which is able to occupy the whole leaf;
3. make sure we do pro-active balancing like it is described in the 
paper.


Sorry, I don't see other ways for now..


Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves. The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.


How are you going to balance leaves when walking from top to down?
Suppose 1) and 2) above are not satisfied and having arrived to the leaf
level we see a number of items of variable length. What will we do to
keep leaves full?

Could you please provide a sketch of the algorithm?

Thanks!



Hi Edward,

Is it really a requirement to have 100% full leaves? Most DB's 
(assuming I remember correctly) have deliberate strategies around this 
kind of thing. You might want to leave room in leaf nodes so that 
future insertions can be contiguous on the media with older data.


Regards,

Ric


Hello Ric.

No, every leaf shouldn't be necessarily 100% full.
We may require every L-vicinity(*) of every leaf to be full in
some sense. And this local condition sometimes brings the (global)
boundary for utilization of the whole tree.

In the classic Bayer's B-trees we have so-called S(0) balancing
condition implicitly satisfied, which requires every leaf to be at
least half full. It provides the low boundary 0.5 for utilization
of the whole tree.

Next example is ReiserFS file system, which uses so-called S(1)
balancing condition on the leaf level, which requires every 1-vicinity
of every leaf to be incompressible (i.e. two leaves can not be
squeezed to a single one). This local incompressibility brings the
sweet low utilization boundary 0.5-E for the whole tree (E is a small
number, which is back proportional to the tree branching).

(*) any set of L+1 neighboring nodes, which contains the leaf.

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html