Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On 06/26/2010 08:34 AM, Daniel Shiels wrote: 25.06.2010 22:58, Ric Wheeler wrote: On 06/24/2010 06:06 PM, Daniel Taylor wrote: [] On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor wrote: Just an FYI reminder. The original test (2K files) is utterly pathological for disk drives with 4K physical sectors, such as those now shipping from WD, Seagate, and others. Some of the SSDs have larger (16K0 or smaller blocks (2K). There is also the issue of btrfs over RAID (which I know is not entirely sensible, but which will happen). Why it is not sensible to use btrfs on raid devices? Nowadays raid is just everywhere, from 'fakeraid' on AHCI to large external arrays on iSCSI-attached storage. Sometimes it is nearly imposisble to _not_ use RAID, -- many servers comes with a built-in RAID card which can't be turned off or disabled. And hardware raid is faster (at least in theory) at least because it puts less load on various system busses. To many "enterprise folks" a statement "we don't need hw raid, we have better solution" sounds like "we're just a toy, don't use". Hmm? ;) /mjt, who always used and preferred _software_ raid due to multiple reasons, and never used btrfs so far. Its not that you shouldn't use it on raid it's just it looses some value from the file system. Two nice features that btrfs provides are checksums and mirroring. If a disk corrupts a block then btrfs will realize due to the strong checksum and use the mirrored block. If you are using a raid system the raid won't know the data is corrupted and raid doesn't provide a way for the file system to get to the redundant block. I read a paper from Sun a while back about the undetected read failure rates for modern disks having not changed for many years. Disks are so large now that undetected failures are unacceptably likely for many systems. Hence zfs doing similar in file system raid schemes. In my lab I used dd to clobber data in some of my mirrors. Btrfs logs lots of checksum errors but never corrupted a file. Doing the same on a classic raid with classic filesystem (solaris with veritas volume manager) silently gave me bad data depending on what disk it felt like reading from. Daniel. I was (one of many) people who worked at EMC on designing storage arrays. If you are using any high end, external hardware array, it will detect data corruption pro-actively for you. Most arrays do continual scans for latent errors and have internal data integrity checks that are used for this. Note that DIF/DIX adds an extra 8 bytes of data integrity to newer standards disks. We don't do anything with that today in btrfs, but you could imagine ways to get even better data integrity protection. If you are using software RAID (MD), you should also use its internal checks to do this kind of proactive detection of latent errors on a regular basis (say once every week or two). Ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On 06/26/2010 01:18 AM, Michael Tokarev wrote: 25.06.2010 22:58, Ric Wheeler wrote: On 06/24/2010 06:06 PM, Daniel Taylor wrote: [] On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor wrote: Just an FYI reminder. The original test (2K files) is utterly pathological for disk drives with 4K physical sectors, such as those now shipping from WD, Seagate, and others. Some of the SSDs have larger (16K0 or smaller blocks (2K). There is also the issue of btrfs over RAID (which I know is not entirely sensible, but which will happen). Why it is not sensible to use btrfs on raid devices? Nowadays raid is just everywhere, from 'fakeraid' on AHCI to large external arrays on iSCSI-attached storage. Sometimes it is nearly imposisble to _not_ use RAID, -- many servers comes with a built-in RAID card which can't be turned off or disabled. And hardware raid is faster (at least in theory) at least because it puts less load on various system busses. To many "enterprise folks" a statement "we don't need hw raid, we have better solution" sounds like "we're just a toy, don't use". Hmm? ;) /mjt, who always used and preferred _software_ raid due to multiple reasons, and never used btrfs so far. Absolutely no reason that you would not use btrfs on hardware raid volumes (or software RAID for that matter). Ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
25.06.2010 22:58, Ric Wheeler wrote: > On 06/24/2010 06:06 PM, Daniel Taylor wrote: [] >>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor >>> wrote: >>> Just an FYI reminder. The original test (2K files) is utterly pathological for disk drives with 4K physical sectors, such as those now shipping from WD, Seagate, and others. Some of the SSDs have larger (16K0 or smaller blocks (2K). There is also the issue of btrfs over RAID (which I know is not entirely sensible, but which will happen). Why it is not sensible to use btrfs on raid devices? Nowadays raid is just everywhere, from 'fakeraid' on AHCI to large external arrays on iSCSI-attached storage. Sometimes it is nearly imposisble to _not_ use RAID, -- many servers comes with a built-in RAID card which can't be turned off or disabled. And hardware raid is faster (at least in theory) at least because it puts less load on various system busses. To many "enterprise folks" a statement "we don't need hw raid, we have better solution" sounds like "we're just a toy, don't use". Hmm? ;) /mjt, who always used and preferred _software_ raid due to multiple reasons, and never used btrfs so far. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On 06/24/2010 06:06 PM, Daniel Taylor wrote: -Original Message- From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On Behalf Of Mike Fedyk Sent: Wednesday, June 23, 2010 9:51 PM To: Daniel Taylor Cc: Daniel J Blueman; Mat; LKML; linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler; Andrew Morton; Linus Torvalds; The development of BTRFS Subject: Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs) On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor wrote: Just an FYI reminder. The original test (2K files) is utterly pathological for disk drives with 4K physical sectors, such as those now shipping from WD, Seagate, and others. Some of the SSDs have larger (16K0 or smaller blocks (2K). There is also the issue of btrfs over RAID (which I know is not entirely sensible, but which will happen). The absolute minimum allocation size for data should be the same as, and aligned with, the underlying disk block size. If that results in underutilization, I think that's a good thing for performance, compared to read-modify-write cycles to update partial disk blocks. Block size = 4k Btrfs packs smaller objects into the blocks in certain cases. As long as no object smaller than the disk block size is ever flushed to media, and all flushed objects are aligned to the disk blocks, there should be no real performance hit from that. Otherwise we end up with the damage for the ext[234] family, where the file blocks can be aligned, but the 1K inode updates cause the read-modify-write (RMW) cycles and and cost>10% performance hit for creation/update of large numbers of files. An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM drive), which is painful. Also interesting is to note that you can get a significant overheard even with 0 byte length files. Path names, metadata overhead, etc can consume (depending on the pathname length) quite a bit of space per file. Ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
> -Original Message- > From: mikefe...@gmail.com [mailto:mikefe...@gmail.com] On > Behalf Of Mike Fedyk > Sent: Wednesday, June 23, 2010 9:51 PM > To: Daniel Taylor > Cc: Daniel J Blueman; Mat; LKML; > linux-fsde...@vger.kernel.org; Chris Mason; Ric Wheeler; > Andrew Morton; Linus Torvalds; The development of BTRFS > Subject: Re: Btrfs: broken file system design (was Unbound(?) > internal fragmentation in Btrfs) > > On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor > wrote: > > Just an FYI reminder. The original test (2K files) is utterly > > pathological for disk drives with 4K physical sectors, such as > > those now shipping from WD, Seagate, and others. Some of the > > SSDs have larger (16K0 or smaller blocks (2K). There is also > > the issue of btrfs over RAID (which I know is not entirely > > sensible, but which will happen). > > > > The absolute minimum allocation size for data should be the same > > as, and aligned with, the underlying disk block size. If that > > results in underutilization, I think that's a good thing for > > performance, compared to read-modify-write cycles to update > > partial disk blocks. > > Block size = 4k > > Btrfs packs smaller objects into the blocks in certain cases. > As long as no object smaller than the disk block size is ever flushed to media, and all flushed objects are aligned to the disk blocks, there should be no real performance hit from that. Otherwise we end up with the damage for the ext[234] family, where the file blocks can be aligned, but the 1K inode updates cause the read-modify-write (RMW) cycles and and cost >10% performance hit for creation/update of large numbers of files. An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM drive), which is painful. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Wed, 2010-06-23 at 20:43 -0700, Daniel Taylor wrote: > There is also the issue of btrfs over RAID (which I know is not > entirely sensible, but which will happen). Well, we could discourage that by merging the RAID support that's been pending for a while but I suspect Chris is a bit busy right now :) -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor wrote: > Just an FYI reminder. The original test (2K files) is utterly > pathological for disk drives with 4K physical sectors, such as > those now shipping from WD, Seagate, and others. Some of the > SSDs have larger (16K0 or smaller blocks (2K). There is also > the issue of btrfs over RAID (which I know is not entirely > sensible, but which will happen). > > The absolute minimum allocation size for data should be the same > as, and aligned with, the underlying disk block size. If that > results in underutilization, I think that's a good thing for > performance, compared to read-modify-write cycles to update > partial disk blocks. Block size = 4k Btrfs packs smaller objects into the blocks in certain cases. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Just an FYI reminder. The original test (2K files) is utterly pathological for disk drives with 4K physical sectors, such as those now shipping from WD, Seagate, and others. Some of the SSDs have larger (16K0 or smaller blocks (2K). There is also the issue of btrfs over RAID (which I know is not entirely sensible, but which will happen). The absolute minimum allocation size for data should be the same as, and aligned with, the underlying disk block size. If that results in underutilization, I think that's a good thing for performance, compared to read-modify-write cycles to update partial disk blocks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote: > I'll try to help, but I am rather pessimistic here: working out > algorithms is something, which doesn't like timelines.. Nonsense. Working out algorithms is just work to an algorithm designer, just like programming is work to a programmer. Sure, some things are harder than others, and there's an element of creativity. But lots of it is just cranking the handle, even in algorithm design. > >Note that it's not necessarily a problem to have a few nodes with low > >utilisation. Deliberate violation of the classic balancing heuristic > >is often useful for performance.[*] > > Ok, let's violate this, but not in the detriment of utilisation: > Internal fragmentation is a horror for file systems, the enemy #1 > (which is completely defeated in the last century BTW). > > > The problem you've found is only a > >real problem when there are _too many_ nodes with low utilisation. > > IMHO this is a problem, as we can not estimate number of such nodes. > Do you have any sane upper boundary for this number? I don't know such one. > Maybe I have missed something? Well, I don't know if btrfs maintains enough statistics or other implicit processes to guarantee a sane upper boundary for this. The impression I'm getting from the thread is that it relies on heuristic behaviour which is usually sufficient (and perhaps a bit faster than updating statistics on the disk would be), but that it does not provide a hard upper boundary. I'm really not sure, though. I haven't read the code. > >[*] For example when filling a tree by inserting contiguously > >ascending keys, the classic "split into two when full" heuristic gives > >the worst possible results (50% lost space), and deliberately > >underfilling the most actively updated nodes, which is not permitted > >at all by the classic algorithm, gives denser packing in the end > >(almost zero lost space). > > At the end of what? At the end of inserting keys in ascending order. Just think about the classic B-tree when that is done: Node[1] fills to 100% utilisation, then is split into two nodes at 50% (call them node[1] and node[2]). Node[2] fills to 100%, then splits into node[2] at 50% and node[3]. Etc etc: Result is a million nodes at 50% utilisation, except the last one. If instead you fill node[1], then *don't* split it but permit node[2] to have 0% to start with, let that fill to 100%, then don't split node[2] and let node[3] start with 0%, etc. etc: Result is half a million nodes at 100% utilisation, except the last one. Both fit the desired bounds, but the second is more desirable in practice, especially as it's common behaviour to fill with contiguous, ascending keys in a filesystem (or blob-filled database) where data extents are part of the tree :-) To handle the cases of multiple independent ascending runs of keys being written in parallel, you generalise to maintain more than one below-50% block, near the "active insertion heads". The above describes classic B-trees where updates are assumed to be written immediately. Things are different when there's a memory cache and delayed allocation/writing, and packing can be reorganised in memory before sending to storage. > I hope you don't speak about on-line defragmentation? No, not at all. > Can you point me to the paper (if any)? Sorry, no, I haven't seen this described in a paper. Imho it's obvious when you think about it. But maybe it's not obvious to everyone - perhaps this is even the heuristic modification missing from btrfs which it would need to avoid the scenario which started this thread? -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote: > I have noticed that events in Btrfs develop by scenario not predicted > by the paper of academic Ohad Rodeh (in spite of the announce that > Btrfs is based on this paper). This is why I have started to grumble.. In btrfs, "based on" means "started with the algorithm and ideas of", not "uses the same algorithm as". -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote: > Jamie Lokier wrote: > >Edward Shishkin wrote: > >>If you decide to base your file system on some algorithms then please > >>use the original ones from proper academic papers. DO NOT modify the > >>algorithms in solitude: this is very fragile thing! All such > >>modifications must be reviewed by specialists in the theory of > >>algorithms. Such review can be done in various scientific magazines of > >>proper level. > >> > >>Personally I don't see any way to improve the situation with Btrfs > >>except full redesigning the last one. If you want to base your file > >>system on the paper of Ohad Rodeh, then please, use *exactly* the > >>Bayer's B-trees that he refers to. That said, make sure that all > >>records you put to the tree has equal length and all non-root nodes of > >>your tree are at least half filled. > > > >First, thanks Edward for identifying a specific problem with the > >current btrfs implementation. > > Hello Jamie. > > >I've studied modified B-trees quite a lot and know enough to be sure > >that they are quite robust when you modify them in all sorts of ways. > > Which property is robust? > > >Moreover, you are incorrect to say there's an intrinsic algorithmic > >problem with variable-length records. It is not true; if Knuth said > >so, Knuth was mistaken. > > I didn't say about intrinsic algorithmic problems :) > I just repeat (after Knuth et al) that B-trees with variable-length > records don't > have any sane boundary for internal fragmentation. The common idea > is that if we > don't want Btrfs to be in infinite development stage, then we should > choose some > *sane* strategy (for example the paper of Ohad Rodeh) and strictly > adhere this in > future. Again, other than the inline file data, what exactly do you believe needs to change? Top down balancing vs balancing on insertion doesn't impact our ability to maintain full leaves. The current code is clearly choosing not to merge two leaves that it should have merged, which is just a plain old bug. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Jamie Lokier wrote: Edward Shishkin wrote: If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level. Personally I don't see any way to improve the situation with Btrfs except full redesigning the last one. If you want to base your file system on the paper of Ohad Rodeh, then please, use *exactly* the Bayer's B-trees that he refers to. That said, make sure that all records you put to the tree has equal length and all non-root nodes of your tree are at least half filled. First, thanks Edward for identifying a specific problem with the current btrfs implementation. Hello Jamie. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. Which property is robust? Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. I didn't say about intrinsic algorithmic problems :) I just repeat (after Knuth et al) that B-trees with variable-length records don't have any sane boundary for internal fragmentation. The common idea is that if we don't want Btrfs to be in infinite development stage, then we should choose some *sane* strategy (for example the paper of Ohad Rodeh) and strictly adhere this in future. This is easily shown by modelling variable-length records (key -> string) as a range of fixed length records ([key,index] -> byte) and apply the standard B-tree algorithms to that, which guarantees algorithm properties such as space utilisation and time; then you can substitute a "compressed" representation of contiguous index runs, which amounts to nothing more than just storing the strings (split where the algorithm says to do so) and endpoint indexes , and because this compression does not expand (in any way that matters), classic algorithmic properties are still guaranteed. Variable-length keys are a different business. Those are trickier, but as far as I know, btrfs doesn't use them. As to current Btrfs code: *NOT ACK*!!! I don't think we need such "file systems". Btrfs provides many useful features that other filesystems don't. We definitely need it, or something like it. You have identified a bug. It's not a corruption bug, but it's definitely a bug, and probably affects performance as well as space utilisation. It is not deep design bug; it is just a result of the packing and balancing heuristics. Frankly, I would like to believe to such end, taking into account amount of my contributions to the Btrfs project. At least to make sure I didn't do useless work.. If you are still interested, please apply your knowledge of B-tree algorithms to understanding why btrfs fails to balance the tree sufficiently well, Because of top-down balancing. It doesn't like "clever" things like "splitting" and "merging". Currently top-down works properly only with stupid classic Bayer's B-trees. and then propose a fix. I'll try to help, but I am rather pessimistic here: working out algorithms is something, which doesn't like timelines.. Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance.[*] Ok, let's violate this, but not in the detriment of utilisation: Internal fragmentation is a horror for file systems, the enemy #1 (which is completely defeated in the last century BTW). The problem you've found is only a real problem when there are _too many_ nodes with low utilisation. IMHO this is a problem, as we can not estimate number of such nodes. Do you have any sane upper boundary for this number? I don't know such one. Maybe I have missed something? [*] For example when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space). At the end of what? I hope you don't speak about on-line defragmentation? Can you point me to the paper (if any)? Thanks! It's also faster. The trick is to make sure there's just the right number of underfilled nodes... -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Edward O. Shishkin Principal Software Engineer Red Hat Czech -- To unsubscribe from this list: send the line "unsubscribe linux
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Jamie Lokier wrote: Edward Shishkin wrote: If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level. Personally I don't see any way to improve the situation with Btrfs except full redesigning the last one. If you want to base your file system on the paper of Ohad Rodeh, then please, use *exactly* the Bayer's B-trees that he refers to. That said, make sure that all records you put to the tree has equal length and all non-root nodes of your tree are at least half filled. First, thanks Edward for identifying a specific problem with the current btrfs implementation. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. This is the point: Which kind of modified B-tree data structure is best suited? Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. This is easily shown by modelling variable-length records (key -> string) as a range of fixed length records ([key,index] -> byte) and apply the standard B-tree algorithms to that, which guarantees algorithm properties such as space utilisation and time; then you can substitute a "compressed" representation of contiguous index runs, which amounts to nothing more than just storing the strings (split where the algorithm says to do so) and endpoint indexes , and because this compression does not expand (in any way that matters), classic algorithmic properties are still guaranteed. Variable-length keys are a different business. Those are trickier, but as far as I know, btrfs doesn't use them. As to current Btrfs code: *NOT ACK*!!! I don't think we need such "file systems". Btrfs provides many useful features that other filesystems don't. We definitely need it, or something like it. You have identified a bug. It's not a corruption bug, but it's definitely a bug, and probably affects performance as well as space utilisation. It is not deep design bug; it is just a result of the packing and balancing heuristics. I think this is the most important design question in relation with filesystems that use some kind of B-trees, which means, if the wrong modified B-tree as the fundamental data structure was chosen, then this is a deep design bug. If you are still interested, please apply your knowledge of B-tree algorithms to understanding why btrfs fails to balance the tree sufficiently well, and then propose a fix. This is a general problem of filesystem design, especially the packing and balancing heurisitcs, and a special problem of the Btrfs filesystem. You can't simply say do it in this or that way. That's why another filesystem uses something exotic like a B*-tree in conjunction with dancing trees as fundamental data structure, which leads back to the deep design question/problem/decision/bug/ And after I followed the explanations of this exotic B-tree version by the main developer I knew just right from the start of the development of the Btrfs filesystem that it wasn't chosen the right modified B-tree data structure, because it was too simple and too general. And since some days I have the impression that there wasn't made a design decision at all with the only exception that there has to be some kind of a B-tree algorithm/data structure in the Btrfs filesystem. And I also think that such a deep desgin decision can't simply be corrected in general (subjective opinion). Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance.[*] The problem you've found is only a real problem when there are _too many_ nodes with low utilisation. The found problem is the first problem with the chosen modified B-tree data structure. I wouldn't call it only a problem in a special case. [*] For example when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space). It's also faster. The trick is to make sure there's just the right number of underfilled nodes... Yes, but Firstly, maybe you are too focused on the classic B-tree algorithm here. Secondly, a trick here, a split there, turning off a feature and then? Then we have complexity at then end, which brings us back to the start, the design decision. But if you say there
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Daniel J Blueman wrote: On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". Internal fragmentation (see Appendix A) of those 5 leafs is (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then ext4 and xfs: The last ones in this example will show fragmentation near zero with blocksize<= 2K. Even with 4K blocksize they will show better utilization 0.50 (against 0.38 in btrfs)! I have a small question for btrfs developers: Why do you folks put "inline extents", xattr, etc items of variable size to the B-tree in spite of the fact that B-tree is a data structure NOT for variable sized records? This disadvantage of B-trees was widely discussed. For example, maestro D. Knuth warned about this issue long time ago (see Appendix C). It is a well known fact that internal fragmentation of classic Bayer's B-trees is restricted by the value 0.50 (see Appendix C). However it takes place only if your tree contains records of the _same_ length (for example, extent pointers). Once you put to your B-tree records of variable length (restricted only by leaf size, like btrfs "inline extents"), your tree LOSES this boundary. Moreover, even worse: it is clear, that in this case utilization of B-tree scales as zero(!). That said, for every small E and for every amount of data N we can construct a consistent B-tree, which contains data N and has utilization worse then E. I.e. from the standpoint of utilization such trees can be completely degenerated. That said, the very important property of B-trees, which guarantees non-zero utilization, has been lost, and I don't see in Btrfs code any substitution for this property. In other words, where is a formal guarantee that all disk space of our users won't be eaten by internal fragmentation? I consider such guarantee as a *necessary* condition for putting a file system to production. Wow...a small part of me says 'well said', on the basis that your assertions are true, but I do think there needs to be more constructivity in such critique; it is almost impossible to be a great engineer and a great academic at once in a time-pressured environment. I find this is somehow off-topic, but: For sure, it isn't impossible. History showed and present shows that there are exceptions. If you can produce some specific and suggestions with code references, I'm sure we'll get some good discussion with potential to improve from where we are. Thanks, Daniel Have fun Christian Stroetmann -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 06:22:39PM +0200, Edward Shishkin wrote: > Chris Mason wrote: > >On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote: > >>Chris Mason wrote: > >>>On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: > Mat wrote: > >On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin > >wrote: > >>Hello everyone. > >> > >>I was asked to review/evaluate Btrfs for using in enterprise > >>systems and the below are my first impressions (linux-2.6.33). > >> > >>The first test I have made was filling an empty 659M (/dev/sdb2) > >>btrfs partition (mounted to /mnt) with 2K files: > >> > >># for i in $(seq 100); \ > >>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done > >>(terminated after getting "No space left on device" reports). > >> > >># ls /mnt | wc -l > >>59480 > >> > >>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, > >>and the first obvious question is "hey, where are other 83% of my > >>disk space???" I looked at the btrfs storage tree (fs_tree) and was > >>shocked with the situation on the leaf level. The Appendix B shows > >>5 adjacent btrfs leafs, which have the same parent. > >> > >>For example, look at the leaf 29425664: "items 1 free space 3892" > >>(of 4096!!). Note, that this "free" space (3892) is _dead_: any > >>attempts to write to the file system will result in "No space left > >>on device". > >>>There are two easy ways to fix this problem. Turn off the inline > >>>extents (max_inline=0) or allow splitting of the inline extents. I > >>>didn't put in the splitting simply because the complexity was high while > >>>the benefits were low (in comparison with just turning off the inline > >>>extents). > >>Hello, Chris. Thanks for response! > >>I afraid that both ways won't fix the problem. Look at this leaf: > >> > >>[...] > >>leaf 29425664 items 1 free space 3892 generation 8 owner 5 > >>fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 > >>chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 > >> item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 > >> location key (0 UNKNOWN 0) type 8 > >> namelen 16 datalen 32 name: security.selinux > >>[...] > >> > >>There is no inline extents, and what are you going to split here? > >>All leafs must be at least a half filled, otherwise we loose all > >>boundaries, which provides non-zero utilization.. > > > >Right, there is no inline extent because we require them to fit entirely > >in the leaf. So we end up with mostly empty leaves because the inline > >item is large enough to make it difficult to push around but not large > >enough to fill the leaf. > > How about left and right neighbors? They contain a lot of > free space (1572 and 1901 respectively). > I am not happy with the very fact of such shallow leafs which > contain only one small (xattr) item.. Sure, the balancing can also be made more aggressive. This should be very easy to fix. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Daniel J Blueman wrote: On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". Internal fragmentation (see Appendix A) of those 5 leafs is (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then ext4 and xfs: The last ones in this example will show fragmentation near zero with blocksize <= 2K. Even with 4K blocksize they will show better utilization 0.50 (against 0.38 in btrfs)! I have a small question for btrfs developers: Why do you folks put "inline extents", xattr, etc items of variable size to the B-tree in spite of the fact that B-tree is a data structure NOT for variable sized records? This disadvantage of B-trees was widely discussed. For example, maestro D. Knuth warned about this issue long time ago (see Appendix C). It is a well known fact that internal fragmentation of classic Bayer's B-trees is restricted by the value 0.50 (see Appendix C). However it takes place only if your tree contains records of the _same_ length (for example, extent pointers). Once you put to your B-tree records of variable length (restricted only by leaf size, like btrfs "inline extents"), your tree LOSES this boundary. Moreover, even worse: it is clear, that in this case utilization of B-tree scales as zero(!). That said, for every small E and for every amount of data N we can construct a consistent B-tree, which contains data N and has utilization worse then E. I.e. from the standpoint of utilization such trees can be completely degenerated. That said, the very important property of B-trees, which guarantees non-zero utilization, has been lost, and I don't see in Btrfs code any substitution for this property. In other words, where is a formal guarantee that all disk space of our users won't be eaten by internal fragmentation? I consider such guarantee as a *necessary* condition for putting a file system to production. Wow...a small part of me says 'well said', on the basis that your assertions are true, but I do think there needs to be more constructivity in such critique; it is almost impossible to be a great engineer and a great academic at once in a time-pressured environment. Sure it is impossible. I believe in division of labour: academics writes algorithms, and we (engineers) encode them. I have noticed that events in Btrfs develop by scenario not predicted by the paper of academic Ohad Rodeh (in spite of the announce that Btrfs is based on this paper). This is why I have started to grumble.. Thanks. If you can produce some specific and suggestions with code references, I'm sure we'll get some good discussion with potential to improve from where we are. Thanks, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote: On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote: Chris Mason wrote: On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents). Hello, Chris. Thanks for response! I afraid that both ways won't fix the problem. Look at this leaf: [...] leaf 29425664 items 1 free space 3892 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux [...] There is no inline extents, and what are you going to split here? All leafs must be at least a half filled, otherwise we loose all boundaries, which provides non-zero utilization.. Right, there is no inline extent because we require them to fit entirely in the leaf. So we end up with mostly empty leaves because the inline item is large enough to make it difficult to push around but not large enough to fill the leaf. How about left and right neighbors? They contain a lot of free space (1572 and 1901 respectively). I am not happy with the very fact of such shallow leafs which contain only one small (xattr) item.. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote: > If you decide to base your file system on some algorithms then please > use the original ones from proper academic papers. DO NOT modify the > algorithms in solitude: this is very fragile thing! All such > modifications must be reviewed by specialists in the theory of > algorithms. Such review can be done in various scientific magazines of > proper level. > > Personally I don't see any way to improve the situation with Btrfs > except full redesigning the last one. If you want to base your file > system on the paper of Ohad Rodeh, then please, use *exactly* the > Bayer's B-trees that he refers to. That said, make sure that all > records you put to the tree has equal length and all non-root nodes of > your tree are at least half filled. First, thanks Edward for identifying a specific problem with the current btrfs implementation. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. This is easily shown by modelling variable-length records (key -> string) as a range of fixed length records ([key,index] -> byte) and apply the standard B-tree algorithms to that, which guarantees algorithm properties such as space utilisation and time; then you can substitute a "compressed" representation of contiguous index runs, which amounts to nothing more than just storing the strings (split where the algorithm says to do so) and endpoint indexes , and because this compression does not expand (in any way that matters), classic algorithmic properties are still guaranteed. Variable-length keys are a different business. Those are trickier, but as far as I know, btrfs doesn't use them. > As to current Btrfs code: *NOT ACK*!!! I don't think we need such > "file systems". Btrfs provides many useful features that other filesystems don't. We definitely need it, or something like it. You have identified a bug. It's not a corruption bug, but it's definitely a bug, and probably affects performance as well as space utilisation. It is not deep design bug; it is just a result of the packing and balancing heuristics. If you are still interested, please apply your knowledge of B-tree algorithms to understanding why btrfs fails to balance the tree sufficiently well, and then propose a fix. Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance.[*] The problem you've found is only a real problem when there are _too many_ nodes with low utilisation. [*] For example when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space). It's also faster. The trick is to make sure there's just the right number of underfilled nodes... -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 05:21:21PM +0200, Christian Stroetmann wrote: > Chris Mason wrote: > >On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: > >>Mat wrote: > >>>On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: > Hello everyone. > > I was asked to review/evaluate Btrfs for using in enterprise > systems and the below are my first impressions (linux-2.6.33). > > The first test I have made was filling an empty 659M (/dev/sdb2) > btrfs partition (mounted to /mnt) with 2K files: > > # for i in $(seq 100); \ > do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done > (terminated after getting "No space left on device" reports). > > # ls /mnt | wc -l > 59480 > > So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, > and the first obvious question is "hey, where are other 83% of my > disk space???" I looked at the btrfs storage tree (fs_tree) and was > shocked with the situation on the leaf level. The Appendix B shows > 5 adjacent btrfs leafs, which have the same parent. > > For example, look at the leaf 29425664: "items 1 free space 3892" > (of 4096!!). Note, that this "free" space (3892) is _dead_: any > attempts to write to the file system will result in "No space left > on device". > >There are two easy ways to fix this problem. Turn off the inline > >extents (max_inline=0) or allow splitting of the inline extents. I > >didn't put in the splitting simply because the complexity was high while > >the benefits were low (in comparison with just turning off the inline > >extents). > But then the benefits of splitting must be high, because it solves > this problem if inline extents are turned on. It depends, we might also argue that for larger inline extents like this we are better off with much larger leaves or with using max_inline=0 instead. > >>It must be a highly unexpected and difficult question for file system > >>developers: "how efficiently does your file system manage disk space"? > >> > >>In the meanwhile I confirm that Btrfs design is completely broken: > >>records stored in the B-tree differ greatly from each other (it is > >>unacceptable!), and the balancing algorithms have been modified in > >>insane manner. All these factors has led to loss of *all* boundaries > >>holding internal fragmentation and to exhaustive waste of disk space > >>(and memory!) in spite of the property "scaling in their ability to > >>address large storage". > >> > >>This is not a large storage, this is a "scalable sieve": you can not > >>rely on finding there some given amount of water even after infinite > >>increasing the size of the sieve (read escalating the pool of Btrfs > >>devices). > >> > >>It seems that nobody have reviewed Btrfs before its inclusion to the > >>mainline. I have only found a pair of recommendations with a common > >>idea that Btrfs maintainer is "not a crazy man". Plus a number of > >>papers which admire with the "Btrfs phenomena". Sigh. > >> > >>Well, let's decide what can we do in current situation.. > >>The first obvious point here is that we *can not* put such file system > >>to production. Just because it doesn't provide any guarantees for our > >>users regarding disk space utilization. > >Are you basing all of this on inline extents? The other extents of > >variable size are more flexible (taking up the room in the leaf), but > >they can also easy be split during balancing. > If we have to split everywhere, hasn't it then some (dramatic) > impact on the performance of the Btrfs filesystem? > As it was said above: splitting has a high complexity. Yes. Both Edward and I have worked on the reiserfs sources where inline extents were split during balancing. It made for a few surprises in the file read/write code. They aren't impossible by any means, but I wanted to avoid them. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote: On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents). But then the benefits of splitting must be high, because it solves this problem if inline extents are turned on. It must be a highly unexpected and difficult question for file system developers: "how efficiently does your file system manage disk space"? In the meanwhile I confirm that Btrfs design is completely broken: records stored in the B-tree differ greatly from each other (it is unacceptable!), and the balancing algorithms have been modified in insane manner. All these factors has led to loss of *all* boundaries holding internal fragmentation and to exhaustive waste of disk space (and memory!) in spite of the property "scaling in their ability to address large storage". This is not a large storage, this is a "scalable sieve": you can not rely on finding there some given amount of water even after infinite increasing the size of the sieve (read escalating the pool of Btrfs devices). It seems that nobody have reviewed Btrfs before its inclusion to the mainline. I have only found a pair of recommendations with a common idea that Btrfs maintainer is "not a crazy man". Plus a number of papers which admire with the "Btrfs phenomena". Sigh. Well, let's decide what can we do in current situation.. The first obvious point here is that we *can not* put such file system to production. Just because it doesn't provide any guarantees for our users regarding disk space utilization. Are you basing all of this on inline extents? The other extents of variable size are more flexible (taking up the room in the leaf), but they can also easy be split during balancing. If we have to split everywhere, hasn't it then some (dramatic) impact on the performance of the Btrfs filesystem? As it was said above: splitting has a high complexity. -chris -- Have fun Christian Stroetmann -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote: > Chris Mason wrote: > >On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: > >>Mat wrote: > >>>On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: > Hello everyone. > > I was asked to review/evaluate Btrfs for using in enterprise > systems and the below are my first impressions (linux-2.6.33). > > The first test I have made was filling an empty 659M (/dev/sdb2) > btrfs partition (mounted to /mnt) with 2K files: > > # for i in $(seq 100); \ > do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done > (terminated after getting "No space left on device" reports). > > # ls /mnt | wc -l > 59480 > > So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, > and the first obvious question is "hey, where are other 83% of my > disk space???" I looked at the btrfs storage tree (fs_tree) and was > shocked with the situation on the leaf level. The Appendix B shows > 5 adjacent btrfs leafs, which have the same parent. > > For example, look at the leaf 29425664: "items 1 free space 3892" > (of 4096!!). Note, that this "free" space (3892) is _dead_: any > attempts to write to the file system will result in "No space left > on device". > > > >There are two easy ways to fix this problem. Turn off the inline > >extents (max_inline=0) or allow splitting of the inline extents. I > >didn't put in the splitting simply because the complexity was high while > >the benefits were low (in comparison with just turning off the inline > >extents). > > Hello, Chris. Thanks for response! > I afraid that both ways won't fix the problem. Look at this leaf: > > [...] > leaf 29425664 items 1 free space 3892 generation 8 owner 5 > fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 > chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 >item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 >location key (0 UNKNOWN 0) type 8 >namelen 16 datalen 32 name: security.selinux > [...] > > There is no inline extents, and what are you going to split here? > All leafs must be at least a half filled, otherwise we loose all > boundaries, which provides non-zero utilization.. Right, there is no inline extent because we require them to fit entirely in the leaf. So we end up with mostly empty leaves because the inline item is large enough to make it difficult to push around but not large enough to fill the leaf. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote: On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents). Hello, Chris. Thanks for response! I afraid that both ways won't fix the problem. Look at this leaf: [...] leaf 29425664 items 1 free space 3892 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux [...] There is no inline extents, and what are you going to split here? All leafs must be at least a half filled, otherwise we loose all boundaries, which provides non-zero utilization.. Any ideas? Thanks, Edward. It must be a highly unexpected and difficult question for file system developers: "how efficiently does your file system manage disk space"? In the meanwhile I confirm that Btrfs design is completely broken: records stored in the B-tree differ greatly from each other (it is unacceptable!), and the balancing algorithms have been modified in insane manner. All these factors has led to loss of *all* boundaries holding internal fragmentation and to exhaustive waste of disk space (and memory!) in spite of the property "scaling in their ability to address large storage". This is not a large storage, this is a "scalable sieve": you can not rely on finding there some given amount of water even after infinite increasing the size of the sieve (read escalating the pool of Btrfs devices). It seems that nobody have reviewed Btrfs before its inclusion to the mainline. I have only found a pair of recommendations with a common idea that Btrfs maintainer is "not a crazy man". Plus a number of papers which admire with the "Btrfs phenomena". Sigh. Well, let's decide what can we do in current situation.. The first obvious point here is that we *can not* put such file system to production. Just because it doesn't provide any guarantees for our users regarding disk space utilization. Are you basing all of this on inline extents? The other extents of variable size are more flexible (taking up the room in the leaf), but they can also easy be split during balancing. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: > Mat wrote: > >On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: > >>Hello everyone. > >> > >>I was asked to review/evaluate Btrfs for using in enterprise > >>systems and the below are my first impressions (linux-2.6.33). > >> > >>The first test I have made was filling an empty 659M (/dev/sdb2) > >>btrfs partition (mounted to /mnt) with 2K files: > >> > >># for i in $(seq 100); \ > >>do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done > >>(terminated after getting "No space left on device" reports). > >> > >># ls /mnt | wc -l > >>59480 > >> > >>So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, > >>and the first obvious question is "hey, where are other 83% of my > >>disk space???" I looked at the btrfs storage tree (fs_tree) and was > >>shocked with the situation on the leaf level. The Appendix B shows > >>5 adjacent btrfs leafs, which have the same parent. > >> > >>For example, look at the leaf 29425664: "items 1 free space 3892" > >>(of 4096!!). Note, that this "free" space (3892) is _dead_: any > >>attempts to write to the file system will result in "No space left > >>on device". There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents). > > It must be a highly unexpected and difficult question for file system > developers: "how efficiently does your file system manage disk space"? > > In the meanwhile I confirm that Btrfs design is completely broken: > records stored in the B-tree differ greatly from each other (it is > unacceptable!), and the balancing algorithms have been modified in > insane manner. All these factors has led to loss of *all* boundaries > holding internal fragmentation and to exhaustive waste of disk space > (and memory!) in spite of the property "scaling in their ability to > address large storage". > > This is not a large storage, this is a "scalable sieve": you can not > rely on finding there some given amount of water even after infinite > increasing the size of the sieve (read escalating the pool of Btrfs > devices). > > It seems that nobody have reviewed Btrfs before its inclusion to the > mainline. I have only found a pair of recommendations with a common > idea that Btrfs maintainer is "not a crazy man". Plus a number of > papers which admire with the "Btrfs phenomena". Sigh. > > Well, let's decide what can we do in current situation.. > The first obvious point here is that we *can not* put such file system > to production. Just because it doesn't provide any guarantees for our > users regarding disk space utilization. Are you basing all of this on inline extents? The other extents of variable size are more flexible (taking up the room in the leaf), but they can also easy be split during balancing. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin wrote: > Mat wrote: >> >> On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: >>> >>> Hello everyone. >>> >>> I was asked to review/evaluate Btrfs for using in enterprise >>> systems and the below are my first impressions (linux-2.6.33). >>> >>> The first test I have made was filling an empty 659M (/dev/sdb2) >>> btrfs partition (mounted to /mnt) with 2K files: >>> >>> # for i in $(seq 100); \ >>> do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done >>> (terminated after getting "No space left on device" reports). >>> >>> # ls /mnt | wc -l >>> 59480 >>> >>> So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, >>> and the first obvious question is "hey, where are other 83% of my >>> disk space???" I looked at the btrfs storage tree (fs_tree) and was >>> shocked with the situation on the leaf level. The Appendix B shows >>> 5 adjacent btrfs leafs, which have the same parent. >>> >>> For example, look at the leaf 29425664: "items 1 free space 3892" >>> (of 4096!!). Note, that this "free" space (3892) is _dead_: any >>> attempts to write to the file system will result in "No space left >>> on device". >>> >>> Internal fragmentation (see Appendix A) of those 5 leafs is >>> (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then >>> ext4 and xfs: The last ones in this example will show fragmentation >>> near zero with blocksize <= 2K. Even with 4K blocksize they will >>> show better utilization 0.50 (against 0.38 in btrfs)! >>> >>> I have a small question for btrfs developers: Why do you folks put >>> "inline extents", xattr, etc items of variable size to the B-tree >>> in spite of the fact that B-tree is a data structure NOT for variable >>> sized records? This disadvantage of B-trees was widely discussed. >>> For example, maestro D. Knuth warned about this issue long time >>> ago (see Appendix C). >>> >>> It is a well known fact that internal fragmentation of classic Bayer's >>> B-trees is restricted by the value 0.50 (see Appendix C). However it >>> takes place only if your tree contains records of the _same_ length >>> (for example, extent pointers). Once you put to your B-tree records >>> of variable length (restricted only by leaf size, like btrfs "inline >>> extents"), your tree LOSES this boundary. Moreover, even worse: >>> it is clear, that in this case utilization of B-tree scales as zero(!). >>> That said, for every small E and for every amount of data N we >>> can construct a consistent B-tree, which contains data N and has >>> utilization worse then E. I.e. from the standpoint of utilization >>> such trees can be completely degenerated. >>> >>> That said, the very important property of B-trees, which guarantees >>> non-zero utilization, has been lost, and I don't see in Btrfs code any >>> substitution for this property. In other words, where is a formal >>> guarantee that all disk space of our users won't be eaten by internal >>> fragmentation? I consider such guarantee as a *necessary* condition >>> for putting a file system to production. Wow...a small part of me says 'well said', on the basis that your assertions are true, but I do think there needs to be more constructivity in such critique; it is almost impossible to be a great engineer and a great academic at once in a time-pressured environment. If you can produce some specific and suggestions with code references, I'm sure we'll get some good discussion with potential to improve from where we are. Thanks, Daniel -- Daniel J Blueman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". Internal fragmentation (see Appendix A) of those 5 leafs is (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then ext4 and xfs: The last ones in this example will show fragmentation near zero with blocksize <= 2K. Even with 4K blocksize they will show better utilization 0.50 (against 0.38 in btrfs)! I have a small question for btrfs developers: Why do you folks put "inline extents", xattr, etc items of variable size to the B-tree in spite of the fact that B-tree is a data structure NOT for variable sized records? This disadvantage of B-trees was widely discussed. For example, maestro D. Knuth warned about this issue long time ago (see Appendix C). It is a well known fact that internal fragmentation of classic Bayer's B-trees is restricted by the value 0.50 (see Appendix C). However it takes place only if your tree contains records of the _same_ length (for example, extent pointers). Once you put to your B-tree records of variable length (restricted only by leaf size, like btrfs "inline extents"), your tree LOSES this boundary. Moreover, even worse: it is clear, that in this case utilization of B-tree scales as zero(!). That said, for every small E and for every amount of data N we can construct a consistent B-tree, which contains data N and has utilization worse then E. I.e. from the standpoint of utilization such trees can be completely degenerated. That said, the very important property of B-trees, which guarantees non-zero utilization, has been lost, and I don't see in Btrfs code any substitution for this property. In other words, where is a formal guarantee that all disk space of our users won't be eaten by internal fragmentation? I consider such guarantee as a *necessary* condition for putting a file system to production. Any technical comments are welcome. Thanks, Edward. Appendix A. --- Glossary 1. Utilization of data and(or) metadata storage. The fraction A/B, where A is total size in bytes of stored data and(or) metadata. B = N * S, where N is number of blocks occupied by stored data and(or) metadata. S is block size in bytes. 2. Internal fragmentation of data and(or) metadata storage. difference (1 - U), where U is utilization. Appendix B. --- a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2) ... leaf 29982720 items 4 free space 1572 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069 inline extent data size 2048 ram 2048 compress 0 item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160 inode generation 8 size 2048 block group 29360128 mode 100644 links 1 item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16 inode ref index 65 namelen 6 name: file64 leaf 29425664 items 1 free space 3892 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux leaf 29990912 items 1 free space 1901 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069 inline extent data size 2048 ram 2048 compress 0 leaf 29986816 items 3 free space 3666 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160 inode generation 8 size 2048 block group 29360128 mode 100644 links 1 item 1 key (321 INODE_