Re: Unbound(?) internal fragmentation in Btrfs
Aloha Mat:

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edw...@redhat.com> wrote:

Hello everyone.

I was asked to review/evaluate Btrfs for use in enterprise systems, and below are my first impressions (linux-2.6.33).

The first test I made was filling an empty 659M btrfs partition (/dev/sdb2, mounted on /mnt) with 2K files:

# for i in $(seq 100); \
  do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done

(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So I got the dirty utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is: hey, where is the other 83% of my disk space???

I looked at the btrfs storage tree (fs_tree) and was shocked by the situation on the leaf level. Appendix B shows 5 adjacent btrfs leaves which have the same parent. For example, look at leaf 29425664: items 1, free space 3892 (of 4096!!). Note that this free space (3892) is _dead_: any attempt to write to the file system will result in "No space left on device".

Internal fragmentation (see Appendix A) of those 5 leaves is (1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than ext4 and xfs: the latter two, in this example, will show fragmentation near zero with blocksize = 2K. Even with a 4K blocksize they will show better utilization, 0.50 (against 0.38 in btrfs)!

I have a small question for the btrfs developers: why do you folks put inline extents, xattrs, and other items of variable size into the B-tree, in spite of the fact that a B-tree is a data structure NOT meant for variable-sized records? This disadvantage of B-trees has been widely discussed. For example, maestro D. Knuth warned about this issue long ago (see Appendix C).

It is a well-known fact that internal fragmentation of classic Bayer B-trees is bounded by 0.50 (see Appendix C). However, this holds only if your tree contains records of the _same_ length (for example, extent pointers).

Once you put records of variable length into your B-tree (bounded only by the leaf size, like btrfs inline extents), your tree LOSES this bound. Moreover, it is worse than that: it is clear that in this case the utilization of the B-tree scales to zero(!). That is, for every small E and for every amount of data N we can construct a consistent B-tree which contains the data N and has utilization worse than E. I.e., from the standpoint of utilization such trees can be completely degenerate.

That said, the very important property of B-trees, the one that guarantees non-zero utilization, has been lost, and I don't see in the Btrfs code any substitute for this property. In other words, where is a formal guarantee that all the disk space of our users won't be eaten by internal fragmentation? I consider such a guarantee a *necessary* condition for putting a file system into production.

Any technical comments are welcome.

Thanks,
Edward.

Appendix A.
---
Glossary

1. Utilization of data and/or metadata storage: the fraction A/B, where A is the total size in bytes of the stored data and/or metadata, and B = N * S, where N is the number of blocks occupied by the stored data and/or metadata, and S is the block size in bytes.

2. Internal fragmentation of data and/or metadata storage: the difference (1 - U), where U is utilization.

Appendix B.
---
a period in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)
...
leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
	item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
		location key (0 UNKNOWN 0) type 8
		namelen 16 datalen 32 name: security.selinux
	item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
		inline extent data size 2048 ram 2048 compress 0
	item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
		inode generation 8 size 2048 block group 29360128 mode 100644 links 1
	item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
		inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
	item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
		location key (0 UNKNOWN 0) type 8
		namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
	item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
		inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
	item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160
		inode generation 8 size 2048 block group 29360128 mode 100644 links 1
	item 1
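[Editor's note: the arithmetic in the report above can be checked with a few lines of Python. This is just a sketch reproducing the reported figures; the 4096-byte leaf size and the per-leaf free-space values are taken from the dump, not measured independently.]

```python
# Recompute the utilization figures from Edward's report.

# Overall "dirty" utilization: 59480 files of 2048 bytes each
# on a 659M partition.
files, file_size = 59480, 2048
partition = 659 * 1024 * 1024
utilization = files * file_size / partition
print(f"dirty utilization: {utilization:.3f}")       # 0.176, i.e. ~17%

# Internal fragmentation of the 5 adjacent leaves (per Appendix A,
# fragmentation = 1 - utilization = free bytes / total bytes).
leaf_size = 4096
free_space = [1572, 3892, 1901, 3666, 1675]          # from the dump
fragmentation = sum(free_space) / (leaf_size * len(free_space))
print(f"leaf-level fragmentation: {fragmentation:.3f}")  # 0.620
```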
Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Mat wrote:
> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edw...@redhat.com> wrote:
> [original report and fs_tree dump quoted in full; trimmed]
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin <edward.shish...@gmail.com> wrote:
> Mat wrote:
>> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edw...@redhat.com> wrote:
>> [original report quoted in full; trimmed]
> In other words, where is a formal guarantee that all disk space of our
> users won't be eaten by internal fragmentation? I consider such a
> guarantee a *necessary* condition for putting a file system into
> production.

Wow... a small part of me says "well said", on the basis that your assertions are true, but I do think such critique needs to be more constructive; it is almost impossible to be a great engineer and a great academic at once in a time-pressured environment. If you can produce some specific suggestions with code references, I'm sure we'll get some good discussion with potential to improve from where we are.

Thanks,
Daniel
--
Daniel J Blueman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
> Mat wrote:
>> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin <edw...@redhat.com> wrote:
>> [original report quoted in full; trimmed]
>> Note that this free space (3892) is _dead_: any attempt to write to
>> the file system will result in "No space left on device".

There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents).

> It must be a highly unexpected and difficult question for file system
> developers: how efficiently does your file system manage disk space?
>
> In the meanwhile I confirm that the Btrfs design is completely broken:
> records stored in the B-tree differ greatly from each other in size
> (it is unacceptable!), and the balancing algorithms have been modified
> in an insane manner. All these factors have led to the loss of *all*
> bounds on internal fragmentation and to exhaustive waste of disk space
> (and memory!), in spite of the claimed ability to scale to large
> storage.
>
> This is not large storage, this is a scalable sieve: you cannot rely
> on finding some given amount of water in it, even after infinitely
> increasing the size of the sieve (read: escalating the pool of Btrfs
> devices).
>
> It seems that nobody reviewed Btrfs before its inclusion in mainline.
> I have only found a couple of recommendations with the common idea
> that the Btrfs maintainer is not a crazy man. Plus a number of papers
> which admire the Btrfs phenomenon. Sigh.
>
> Well, let's decide what we can do in the current situation.. The first
> obvious point here is that we *can not* put such a file system into
> production. Just because it doesn't provide any guarantees for our
> users regarding disk space utilization.

Are you basing all of this on inline extents? The other extents of variable size are more flexible (taking up the room in the leaf), but they can also easily be split during balancing.

-chris
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote:
> On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
>> [original report quoted in full; trimmed]
>
> There are two easy ways to fix this problem. Turn off the inline
> extents (max_inline=0) or allow splitting of the inline extents. I
> didn't put in the splitting simply because the complexity was high
> while the benefits were low (in comparison with just turning off the
> inline extents).

Hello, Chris. Thanks for the response!

I'm afraid that neither way will fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
	item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
		location key (0 UNKNOWN 0) type 8
		namelen 16 datalen 32 name: security.selinux
[...]

There are no inline extents here, so what are you going to split? All leaves must be at least half filled, otherwise we lose all the bounds that provide non-zero utilization.. Any ideas?

Thanks,
Edward.

[remainder of the quoted exchange trimmed]
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
> Chris Mason wrote:
>> [earlier exchange quoted in full; trimmed]
>>
>> There are two easy ways to fix this problem. Turn off the inline
>> extents (max_inline=0) or allow splitting of the inline extents.
>
> There are no inline extents here, so what are you going to split?
> All leaves must be at least half filled, otherwise we lose all the
> bounds that provide non-zero utilization..

Right, there is no inline extent because we require them to fit entirely in the leaf. So we end up with mostly empty leaves because the inline item is large enough to make it difficult to push around but not large enough to fill the leaf.

-chris
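[Editor's note: Chris's point can be illustrated with a toy model. This is a sketch, not btrfs code: the 4096-byte leaf and the item sizes are taken from the dump earlier in the thread, and the greedy no-split packer only mimics the constraint that a tree item may not straddle two leaves.]

```python
# Toy model: pack variable-size items into fixed-size leaves,
# never splitting an item across leaves (as btrfs does for its
# tree items).  Item sizes mimic the dump: a 2069-byte inline
# extent item, a 160-byte inode item, a 78-byte xattr item, and
# a 16-byte inode ref per small file.
LEAF = 4096

def pack(items, leaf_size=LEAF):
    """Greedy packing: open a new leaf whenever the next item does
    not fit.  Returns the list of per-leaf used byte counts."""
    leaves, used = [], 0
    for size in items:
        if used + size > leaf_size:
            leaves.append(used)
            used = 0
        used += size
    if used:
        leaves.append(used)
    return leaves

# One small file = inode + ref + xattr + inline extent, repeated.
per_file = [160, 16, 78, 2069]
leaves = pack(per_file * 1000)
util = sum(leaves) / (LEAF * len(leaves))
print(f"{len(leaves)} leaves, utilization {util:.2f}")
```

Even this perfectly sequential packing comes out at roughly 0.57 utilization, because a ~2K inline item leaves a hole that no second inline item can fill. In the real file system, balancing then pushes small items around, which is how leaves like 29425664 end up almost empty.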
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote:
> On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
>> [original report quoted in full; trimmed]
>
> There are two easy ways to fix this problem. Turn off the inline
> extents (max_inline=0) or allow splitting of the inline extents. I
> didn't put in the splitting simply because the complexity was high
> while the benefits were low (in comparison with just turning off the
> inline extents).

But then the benefits of splitting must be high, because it solves this problem when inline extents are turned on.

> Are you basing all of this on inline extents? The other extents of
> variable size are more flexible (taking up the room in the leaf), but
> they can also easily be split during balancing.

If we have to split everywhere, wouldn't that have some (dramatic) impact on the performance of the Btrfs filesystem? As was said above: splitting has a high complexity.

--
Have fun
Christian Stroetmann
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 05:21:21PM +0200, Christian Stroetmann wrote:
> Chris Mason wrote:
>> [earlier exchange quoted in full; trimmed]
>
> But then the benefits of splitting must be high, because it solves
> this problem when inline extents are turned on.

It depends; we might also argue that for larger inline extents like this we are better off with much larger leaves, or with using max_inline=0 instead.

> If we have to split everywhere, wouldn't that have some (dramatic)
> impact on the performance of the Btrfs filesystem? As was said above:
> splitting has a high complexity.

Yes. Both Edward and I have worked on the reiserfs sources, where inline extents were split during balancing. It made for a few surprises in the file read/write code. They aren't impossible by any means, but I wanted to avoid them.

-chris
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote: If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level. Personally I don't see any way to improve the situation with Btrfs except full redesigning the last one. If you want to base your file system on the paper of Ohad Rodeh, then please, use *exactly* the Bayer's B-trees that he refers to. That said, make sure that all records you put to the tree has equal length and all non-root nodes of your tree are at least half filled. First, thanks Edward for identifying a specific problem with the current btrfs implementation. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. This is easily shown by modelling variable-length records (key - string) as a range of fixed length records ([key,index] - byte) and apply the standard B-tree algorithms to that, which guarantees algorithm properties such as space utilisation and time; then you can substitute a compressed representation of contiguous index runs, which amounts to nothing more than just storing the strings (split where the algorithm says to do so) and endpoint indexes , and because this compression does not expand (in any way that matters), classic algorithmic properties are still guaranteed. Variable-length keys are a different business. Those are trickier, but as far as I know, btrfs doesn't use them. As to current Btrfs code: *NOT ACK*!!! I don't think we need such file systems. Btrfs provides many useful features that other filesystems don't. 
We definitely need it, or something like it. You have identified a bug. It's not a corruption bug, but it's definitely a bug, and probably affects performance as well as space utilisation. It is not a deep design bug; it is just a result of the packing and balancing heuristics. If you are still interested, please apply your knowledge of B-tree algorithms to understanding why btrfs fails to balance the tree sufficiently well, and then propose a fix. Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance.[*] The problem you've found is only a real problem when there are _too many_ nodes with low utilisation. [*] For example, when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space). It's also faster. The trick is to make sure there's just the right number of underfilled nodes... -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote: On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote: Chris Mason wrote: On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: Mat wrote: [...] There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents). Hello, Chris. Thanks for the response! I'm afraid that neither way will fix the problem. Look at this leaf:
[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
	item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
		location key (0 UNKNOWN 0) type 8
		namelen 16 datalen 32 name: security.selinux
[...]
There are no inline extents here, so what are you going to split? All leaves must be at least half filled, otherwise we lose the boundary which provides non-zero utilization.. Right, there is no inline extent because we require them to fit entirely in the leaf. So we end up with mostly empty leaves, because the inline item is large enough to make it difficult to push around but not large enough to fill the leaf. How about the left and right neighbors? They contain a lot of free space (1572 and 1901 bytes respectively). I am not happy with the very fact of such shallow leaves which contain only one small (xattr) item..
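The 3892 figure in the dump can be reconstructed from the leaf layout. The header and per-item sizes below are assumptions about the btrfs disk format of that era (101-byte block header, 25-byte item header, item data growing down from the end of the leaf); treat the constants as illustrative rather than authoritative.

```python
# Where the "free space 3892" in leaf 29425664 comes from.
# Assumed on-disk sizes: 101-byte leaf header, 25-byte per-item header.
HEADER, ITEM_HDR = 101, 25

def leaf_free(leafsize, item_sizes):
    # free bytes = leaf minus header, item headers, and item payloads
    return leafsize - HEADER - len(item_sizes) * ITEM_HDR - sum(item_sizes)

# One 78-byte xattr item in a 4096-byte leaf:
assert leaf_free(4096, [78]) == 3892        # matches "free space 3892"
# itemoff is relative to the start of the item data area (after the header):
assert (4096 - HEADER) - 78 == 3917         # matches "itemoff 3917"
```

So a single 78-byte xattr item really does leave 95% of the leaf as dead space, exactly as the dump shows.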
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Daniel J Blueman wrote: On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin edward.shish...@gmail.com wrote: [...] Internal fragmentation (see Appendix A) of those 5 leaves is (1572+3892+1901+3666+1675)/(4096*5) = 0.62. This is even worse than ext4 and xfs: the latter, in this example, will show fragmentation near zero with blocksize = 2K. Even with a 4K blocksize they will show better utilization, 0.50 (against 0.38 in btrfs)! [...] In other words, where is a formal guarantee that all disk space of our users won't be eaten by internal fragmentation? I consider such a guarantee a *necessary* condition for putting a file system into production. Wow... a small part of me says 'well said', on the basis that your assertions are true, but I do think there needs to be more constructivity in such critique; it is almost impossible to be a great engineer and a great academic at once in a time-pressured environment. Sure, it is impossible. I believe in division of labour: academics write the algorithms, and we (engineers) implement them. I have noticed that events in Btrfs develop by a scenario not predicted in the paper of academic Ohad Rodeh (in spite of the announcement that Btrfs is based on this paper). This is why I have started to grumble.. Thanks. If you can produce some specific suggestions with code references, I'm sure we'll get some good discussion with potential to improve from where we are. Thanks, Daniel
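The utilization figures from the test report quoted above check out; a few lines reproduce the arithmetic.

```python
# Reproducing the numbers from Edward's 2K-file test.
files, fsize = 59480, 2048               # 2K files created before ENOSPC
partition = 659 * 1024 * 1024            # 659M /dev/sdb2
util = files * fsize / partition
assert 0.17 < util < 0.18                # "dirty utilization ... = 0.17"

free = [1572, 3892, 1901, 3666, 1675]    # free bytes in the 5 dumped leaves
leafsize = 4096
frag = sum(free) / (leafsize * len(free))
assert round(frag, 2) == 0.62            # internal fragmentation of the leaves
assert round(1 - frag, 2) == 0.38        # i.e. the 0.38 utilization cited
```

Note the formula needs the parenthesization (sum of free space) / (4096 * 5); the five leaves together carry under 8 KB of payload in 20 KB of space.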
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 06:22:39PM +0200, Edward Shishkin wrote: Chris Mason wrote: [...] There are no inline extents here, so what are you going to split? All leaves must be at least half filled, otherwise we lose the boundary which provides non-zero utilization.. Right, there is no inline extent because we require them to fit entirely in the leaf. So we end up with mostly empty leaves, because the inline item is large enough to make it difficult to push around but not large enough to fill the leaf. How about the left and right neighbors? They contain a lot of free space (1572 and 1901 bytes respectively). I am not happy with the very fact of such shallow leaves which contain only one small (xattr) item.. Sure, the balancing can also be made more aggressive. This should be very easy to fix. -chris
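The "more aggressive balancing" Chris mentions can be sketched with a toy model over the dumped leaves: greedily push items into the left neighbour whenever the combined payload fits in one leaf. This is an illustration of the idea, not btrfs code.

```python
# Greedy left-merge over the payload sizes of the 5 dumped leaves.
LEAF = 4096
free = [1572, 3892, 1901, 3666, 1675]      # from the fs_tree dump
used = [LEAF - f for f in free]            # [2524, 204, 2195, 430, 2421]

def merge_adjacent(used, cap):
    out = []
    for u in used:
        if out and out[-1] + u <= cap:
            out[-1] += u                   # merge into the left neighbour
        else:
            out.append(u)
    return out

merged = merge_adjacent(used, LEAF)
assert merged == [2728, 2625, 2421]        # 5 leaves collapse to 3
assert sum(merged) == sum(used)            # no payload lost
# utilization rises from ~0.38 to ~0.63 for this run of leaves
assert round(sum(used) / (5 * LEAF), 2) == 0.38
assert round(sum(merged) / (len(merged) * LEAF), 2) == 0.63
```

Even this naive pass restores better-than-half fill for these five leaves; real balancing would also have to respect key ordering and COW costs.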
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Daniel J Blueman wrote: On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin edward.shish...@gmail.com wrote: [...] Wow... a small part of me says 'well said', on the basis that your assertions are true, but I do think there needs to be more constructivity in such critique; it is almost impossible to be a great engineer and a great academic at once in a time-pressured environment. I find this somehow off-topic, but: it surely isn't impossible. History has shown, and the present shows, that there are exceptions. If you can produce some specific suggestions with code references, I'm sure we'll get some good discussion with potential to improve from where we are. Thanks, Daniel Have fun Christian Stroetmann
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Jamie Lokier wrote: Edward Shishkin wrote: [...] First, thanks Edward for identifying a specific problem with the current btrfs implementation. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. This is the point: which kind of modified B-tree data structure is best suited? Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. [...] As to the current Btrfs code: *NOT ACK*!!! I don't think we need such file systems. Btrfs provides many useful features that other filesystems don't. We definitely need it, or something like it. You have identified a bug. It's not a corruption bug, but it's definitely a bug, and probably affects performance as well as space utilisation. It is not a deep design bug; it is just a result of the packing and balancing heuristics. I think this is the most important design question for filesystems that use some kind of B-tree: if the wrong modified B-tree was chosen as the fundamental data structure, then this is a deep design bug. If you are still interested, please apply your knowledge of B-tree algorithms to understanding why btrfs fails to balance the tree sufficiently well, and then propose a fix. This is a general problem of filesystem design, especially the packing and balancing heuristics, and a special problem of the Btrfs filesystem. You can't simply say do it in this or that way. That's why another filesystem uses something exotic like a B*-tree in conjunction with dancing trees as its fundamental data structure, which leads back to the deep design question/problem/decision/bug. And after I followed the explanations of this exotic B-tree version by its main developer, I knew right from the start of the development of the Btrfs filesystem that the right modified B-tree data structure hadn't been chosen, because it was too simple and too general. And for some days now I have had the impression that no design decision was made at all, except that there had to be some kind of B-tree algorithm/data structure in the Btrfs filesystem. And I also think that such a deep design decision can't simply be corrected later (subjective opinion). Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance.[*] The problem you've found is only a real problem when there are _too many_ nodes with low utilisation. The problem found here is the first problem with the chosen modified B-tree data structure. I wouldn't call it only a problem in a special case. [*] For example, when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space). It's also faster. The trick is to make sure there's just the right number of underfilled nodes... Yes, but firstly, maybe you are too focused on the classic B-tree algorithm here. Secondly, a trick here, a split there, turning off a feature, and then? Then we have complexity at the end, which brings us back to the start, the design decision. But if you say there are no
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Jamie Lokier wrote: Edward Shishkin wrote: [...] First, thanks Edward for identifying a specific problem with the current btrfs implementation. Hello Jamie. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. Which property is robust? Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. I didn't speak about intrinsic algorithmic problems :) I just repeat (after Knuth et al) that B-trees with variable-length records don't have any sane boundary for internal fragmentation. The common idea is that if we don't want Btrfs to be in an infinite development stage, then we should choose some *sane* strategy (for example, the paper of Ohad Rodeh) and strictly adhere to it in the future. This is easily shown by modelling variable-length records (key -> string) as a range of fixed-length records ([key,index] -> byte) and applying the standard B-tree algorithms to that, which guarantees algorithmic properties such as space utilisation and time; then you can substitute a compressed representation of contiguous index runs, which amounts to nothing more than storing the strings (split where the algorithm says to do so) and endpoint indexes, and because this compression does not expand (in any way that matters), the classic algorithmic properties are still guaranteed. Variable-length keys are a different business. Those are trickier, but as far as I know, btrfs doesn't use them. As to the current Btrfs code: *NOT ACK*!!! I don't think we need such file systems. Btrfs provides many useful features that other filesystems don't. We definitely need it, or something like it. You have identified a bug. It's not a corruption bug, but it's definitely a bug, and probably affects performance as well as space utilisation. It is not a deep design bug; it is just a result of the packing and balancing heuristics. Frankly, I would like to believe that, taking into account the amount of my contributions to the Btrfs project. At least to make sure I didn't do useless work.. If you are still interested, please apply your knowledge of B-tree algorithms to understanding why btrfs fails to balance the tree sufficiently well, Because of top-down balancing. It doesn't like clever things like splitting and merging. Currently top-down works properly only with stupid classic Bayer's B-trees. and then propose a fix. I'll try to help, but I am rather pessimistic here: working out algorithms is something which doesn't like timelines.. Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance.[*] Ok, let's violate this, but not to the detriment of utilisation: internal fragmentation is a horror for file systems, the enemy #1 (which was completely defeated in the last century, BTW). The problem you've found is only a real problem when there are _too many_ nodes with low utilisation. IMHO this is a problem, as we cannot estimate the number of such nodes. Do you have any sane upper boundary for this number? I don't know of one. Maybe I have missed something? [*] For example, when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space). At the end of what? I hope you don't speak about on-line defragmentation? Can you point me to the paper (if any)? Thanks! It's also faster. The trick is to make sure there's just the right number of underfilled nodes... -- Jamie -- Edward O. Shishkin Principal Software Engineer Red Hat Czech
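Jamie's fixed-length modelling argument, quoted above, can be sketched in a few lines (a hypothetical illustration, not btrfs code): a variable-length record (key -> string) becomes a run of fixed-length records ((key, index) -> byte), and any contiguous run compresses back to (key, start_index, bytes) without loss.

```python
# Variable-length records modelled as runs of fixed-length records.
def explode(key, value):
    # one fixed-length record per byte of the variable-length value
    return [((key, i), b) for i, b in enumerate(value)]

def compress(run):
    # a contiguous index run collapses back to (key, start, bytes)
    (key, start), _ = run[0]
    return key, start, bytes(b for _, b in run)

recs = explode(7, b"hello world")
assert compress(recs) == (7, 0, b"hello world")

# A B-tree split can fall anywhere inside the run; both halves still
# compress, so the fixed-length utilization guarantee carries over.
left, right = recs[:6], recs[6:]
assert compress(left) == (7, 0, b"hello ")
assert compress(right) == (7, 6, b"world")
```

The two positions in the thread are compatible: Jamie's scheme keeps the guarantee precisely because the tree is allowed to split a value mid-run, while Edward's objection applies to trees that must keep each variable-length item whole inside one leaf.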
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote: Jamie Lokier wrote: [...] I just repeat (after Knuth et al) that B-trees with variable-length records don't have any sane boundary for internal fragmentation. The common idea is that if we don't want Btrfs to be in an infinite development stage, then we should choose some *sane* strategy (for example, the paper of Ohad Rodeh) and strictly adhere to it in the future. Again, other than the inline file data, what exactly do you believe needs to change? Top-down balancing vs balancing on insertion doesn't impact our ability to maintain full leaves. The current code is clearly choosing not to merge two leaves that it should have merged, which is just a plain old bug. -chris
Re: default subvolume abilities/restrictions
On Sun, Jun 13, 2010 at 12:47 PM, C Anthony Risinger anth...@extof.me wrote: On Sat, Jun 12, 2010 at 8:06 PM, C Anthony Risinger anth...@extof.me wrote: On Sat, Jun 12, 2010 at 7:22 PM, David Brown bt...@davidb.org wrote: On Sat, Jun 12, 2010 at 06:06:23PM -0500, C Anthony Risinger wrote: # btrfs subvolume create new_root # mv . new_root/old_root can i at least get confirmation that the above is possible? I've had no problem with # btrfs subvolume snapshot . new_root # mkdir old_root # mv * old_root # rm -rf old_root Make sure the 'mv' fails to move new_root, and I'd look at the new_root before removing everything. David heh, yeah, as i was writing the last email i realized that all i really wanted was to: # mv * new_root for some reason i was convinced that i must snapshot the old_root (.) to new_root... and then remove the erroneous stuff from old_root (.). thus a way to parent the default subvol (old_root/.) seemed a better solution... but alas, a snapshot isn't necessary. i can create an empty subvol new_root, and then mv * new_root. i don't know how that escaped me :-), sorry for all the noise. however, there probably is a legitimate use case for wanting to replace the default subvolume, but this isn't it. C Anthony ok i take it all back, i DO need this... i rewrote my initramfs hook to do the following operations: # btrfs subvolume create /new_root # mv /* /new_root instead of what i had: # btrfs subvolume snapshot / /new_root and it resulted in scarily COPYING my entire system... several gigs worth... to the newly created subvolume, which took forever, and ground on my HD for a while. i don't know how long because i went to bed. this is why i need a way to parent the default subvolume. a snapshot is nice and quick, but it leaves / full of erroneous folders (dev/etc/usr/lib), an entire system, that will no longer be used. this space will in time become dead wasted space unless my users manually rm -rf themselves. so... any input on this?
how can i effectively, and efficiently, move a user's installation into a dedicated subvolume, when they have already installed into the default subvolume? i think the best way is what i originally suggested: make an empty subvolume the new top-level subvol, and place the old top-level subvol INTO it with a new name. thoughts? can i get a little feedback on this problem? i choke slightly when telling users the only way to clean their old / is by rm -rf'ing everything. i need the ability to move the default subvolume into a new, empty subvolume. the empty subvolume then becomes the new default. the end result is moving the user's installation into a dedicated subvolume, cleanly and efficiently, so the system can do other things with the subroot structure. the way i am doing it now is snapshotting the user's / to /__active... however, the side effect is an entire system's worth of files that will in time become dead space. moving the user's install via mv into an empty subvol does not work: everything is actually copied = slow, slow, slow. ideas??? how can i parent the default subvol with an empty subvol? this seems a legitimate operation. thanks, C Anthony
[PATCH] current btrfs-progs-unstable fails to compile
Hi, current btrfs-progs-unstable fails to compile for me (Ark Linux devel branch, using gcc 4.5 and eglibc 2.12):

/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:323: error: undefined reference to 'S_ISDIR'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:340: error: undefined reference to 'S_ISREG'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:328: error: undefined reference to 'S_ISREG'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:340: error: undefined reference to 'S_ISLNK'
/usr/bin/ld: btrfsck.o: in function maybe_free_inode_rec:btrfsck.c:328: error: undefined reference to 'S_ISLNK'
collect2: ld returned 1 exit status

Patch attached. ttyl bero

--- btrfs-progs-unstable/btrfsck.c.ark	2010-06-18 22:52:52.592535586 +0200
+++ btrfs-progs-unstable/btrfsck.c	2010-06-18 22:52:59.540417164 +0200
@@ -18,6 +18,7 @@
 #define _XOPEN_SOURCE 500
 #define _GNU_SOURCE 1
+#include <sys/stat.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <fcntl.h>
Balancing leaves when walking from top to down (was Btrfs:...)
Chris Mason wrote:

On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:

Jamie Lokier wrote:

Edward Shishkin wrote:

If you decide to base your file system on some algorithms, then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is a very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms; such review can be done in scientific journals of a proper level. Personally I don't see any way to improve the situation with Btrfs short of a full redesign. If you want to base your file system on the paper of Ohad Rodeh, then please use *exactly* the Bayer B-trees that he refers to. That is, make sure that all records you put into the tree have equal length and that all non-root nodes of your tree are at least half full.

First, thanks Edward for identifying a specific problem with the current btrfs implementation.

Hello Jamie.

I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways.

Which property is robust?

Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken.

I didn't say anything about intrinsic algorithmic problems :) I just repeat (after Knuth et al.) that B-trees with variable-length records have no sane bound on internal fragmentation. The common idea is that if we don't want Btrfs to stay in an infinite development stage, then we should choose some *sane* strategy (for example, the paper of Ohad Rodeh) and strictly adhere to it in the future.

Again, other than the inline file data, what exactly do you believe needs to change?

1. Get rid of inline extents;
2. introduce new formats for directory and xattr items, so that a single item does not look like a train able to occupy the whole leaf;
3. make sure we do pro-active balancing as described in the paper.

Sorry, I don't see other ways for now..

Top-down balancing vs. balancing on insertion doesn't impact our ability to maintain full leaves. The current code is clearly choosing not to merge two leaves that it should have merged, which is just a plain old bug.

How are you going to balance leaves when walking from top to down? Suppose 1) and 2) above are not satisfied, and having arrived at the leaf level we see a number of items of variable length. What will we do to keep the leaves full? Could you please provide a sketch of the algorithm?

Thanks!

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
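The pro-active, top-down rebalancing Edward asks for can be illustrated in miniature. The sketch below is hypothetical (not btrfs code; all names invented), assuming leaves are lists of variable-length byte strings: before descending into a child, merge it with a sibling whenever the two together fit into one leaf, so the later insert or delete at the leaf never needs to walk back up the tree.

```python
LEAF_CAPACITY = 4096  # usable bytes per leaf (illustrative figure)

def leaf_used(leaf):
    """Total bytes occupied by the (variable-length) items in a leaf."""
    return sum(len(item) for item in leaf)

def fix_child_before_descent(parent, idx):
    """Before descending into parent[idx], merge it with a neighboring
    sibling if the two fit into a single leaf.  Doing this on the way
    *down* is what makes the balancing pro-active: no upward pass is
    needed after the leaf-level modification.  Returns the index of the
    leaf to descend into."""
    child = parent[idx]
    for sib_idx in (idx - 1, idx + 1):
        if 0 <= sib_idx < len(parent):
            sib = parent[sib_idx]
            # With variable-length items the merge test must use byte
            # occupancy, not item counts.
            if leaf_used(child) + leaf_used(sib) <= LEAF_CAPACITY:
                lo, hi = sorted((idx, sib_idx))
                parent[lo] = parent[lo] + parent[hi]
                del parent[hi]
                return lo      # descend into the merged leaf
    return idx                 # nothing fits; descend unchanged
```

For example, with leaves of 1000, 500, and 4000 bytes, descending toward the first leaf merges it with its 500-byte neighbor (1500 <= 4096), while the 4000-byte leaf is left alone because no merge fits.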
Re: Balancing leaves when walking from top to down (was Btrfs:...)
On 06/18/2010 06:04 PM, Edward Shishkin wrote:

[snip]

How are you going to balance leaves when walking from top to down? Suppose 1) and 2) above are not satisfied, and having arrived at the leaf level we see a number of items of variable length. What will we do to keep the leaves full? Could you please provide a sketch of the algorithm? Thanks!

Hi Edward,

Is it really a requirement to have 100% full leaves? Most DBs (assuming I remember correctly) have deliberate strategies around this kind of thing. You might want to leave room in leaf nodes so that future insertions can be contiguous on the media with older data.

Regards,

Ric
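Ric's point about deliberate slack is the strategy many databases apply at page-split time: fill the left page only to some target below 100%, so that near-future insertions land in place instead of forcing another split. A toy sketch of that idea (my own illustration, with invented names, not taken from any real DB engine):

```python
LEAF_CAPACITY = 4096  # usable bytes per leaf (illustrative figure)

def split_leaf(items, fill_target=0.9):
    """Split a sequence of variable-length items into two leaves,
    filling the left leaf only up to fill_target * LEAF_CAPACITY.
    The slack left behind absorbs later nearby insertions."""
    budget = fill_target * LEAF_CAPACITY
    left, used = [], 0
    for i, item in enumerate(items):
        # Stop filling once the budget would be exceeded, but never
        # return an empty left leaf.
        if used + len(item) > budget and left:
            return left, items[i:]
        left.append(item)
        used += len(item)
    return left, []
```

With four 1000-byte items and a 0.9 fill target, the split keeps three items (3000 bytes) on the left, leaving roughly 1 KB of slack, so a subsequent 500-byte insertion into that leaf fits without another split.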
Re: Balancing leaves when walking from top to down (was Btrfs:...)
Ric Wheeler wrote:

[snip]

Is it really a requirement to have 100% full leaves? Most DBs (assuming I remember correctly) have deliberate strategies around this kind of thing. You might want to leave room in leaf nodes so that future insertions can be contiguous on the media with older data.

Hello Ric.

No, not every leaf has to be 100% full. We may require every L-vicinity(*) of every leaf to be full in some sense, and such a local condition can yield a (global) lower bound on the utilization of the whole tree.

In classic Bayer B-trees the so-called S(0) balancing condition is implicitly satisfied: every leaf must be at least half full. This provides the lower bound 0.5 for the utilization of the whole tree.

The next example is the ReiserFS file system, which uses the so-called S(1) balancing condition on the leaf level: every 1-vicinity of every leaf must be incompressible (i.e. no two neighboring leaves can be squeezed into a single one). This local incompressibility yields the sweet lower utilization bound 0.5 - E for the whole tree (E is a small number, inversely proportional to the branching of the tree).

(*) Any set of L+1 neighboring nodes which contains the leaf.

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech
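Edward's S(1) argument can be checked numerically in a simplified, leaf-level form (my own sketch; names invented, not ReiserFS code). If every adjacent pair of leaves together holds more than one leaf's capacity C, then the roughly n/2 disjoint adjacent pairs force total usage above (n/2) * C, i.e. utilization above about 0.5 - 1/(2n), even for adversarial fill patterns:

```python
CAPACITY = 4096  # usable bytes per leaf (illustrative figure)

def s1_holds(leaf_usage):
    """S(1): no two neighboring leaves can be squeezed into one."""
    return all(a + b > CAPACITY for a, b in zip(leaf_usage, leaf_usage[1:]))

def utilization(leaf_usage):
    """Fraction of allocated leaf space that holds data."""
    return sum(leaf_usage) / (len(leaf_usage) * CAPACITY)

# Adversarial alternation that still satisfies S(1): a nearly empty
# leaf is allowed, but only next to nearly full neighbors.
usage = [100, 4000] * 8
assert s1_holds(usage)
assert utilization(usage) > 0.5   # the pairwise bound in action

# Without S(1) there is no such floor: uniformly slack leaves pass no
# incompressibility test and utilization sinks with no lower bound.
bad = [500] * 16
assert not s1_holds(bad)
```

The 0.5 - E figure Edward quotes is tighter than this pair-counting argument because the real condition also constrains internal nodes; the sketch only shows the leaf-level mechanism.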