Re: Balancing leaves when walking from top to down (was Btrfs:...)
Chris Mason wrote: [...] 1. the balancing condition n <= number_of_keys <= 2n+1 is not satisfied (indeed, when number_of_keys is 1 we have 1 <= 2n+1 -- false); That doesn't matter. It is not necessary (or desirable) to follow the classical B-tree algorithms to the letter to get sufficiently good properties. sure, but we need something instead of classic balancing condition. We can not just refuse it: this is your lifejacket during performing tree operations. How will you proof utilization bounds without any balancing conditions? B-trees are quite robust to changing the details, as long as you follow the generalised principles which make them work. There are lots of different perspectives here, I wouldn't be so optimistic here (see below) but it is important to step back and look at what we're using the btree for. We're making an index to filesystem metadata. The index is there so that we can find that metadata with a reasonable number of IOs and in a reasonable amount of time. Because btrfs is COW, and because of the reference counted COW used to maintain snapshots, the cost of updating the btree is different from traditional btrees. We do try to avoid balancing more and intentionally allow the btree leaves to be less compact simply because it allows us to avoid the higher cost of those operations. I can understand reducing low bound, say, from 1/2 to 1/3 for performance reasons, but again, bound 0.00 is unacceptable.. So, please, make sure you have a sane proved bound for every such "compromise". The btree nodes are fairly strict. They contain fixed records and are balanced in a simple fashion. When we are inserting a new key into a node, if the node is completely full we split it. When we are deleting a key, if the node is less than half full we balance it. This works only if insertion/deletion on the leaf level won't result in insertion/deletion more then one key on upper levels. Calculating full on the btree nodes is a factor of the number of keys in them. The leaves have fixed length keys that describe items of variable size. On deletion merging is done when we're less than 1/3rd full and on insertion we split when the item we're trying to shove in doesn't fit. Calculating full on the btree leaves is a factor of the total size of the items stored. So, with all of that said, the nodes point to leaves and the leaves contain inodes, directory entries and sometimes file data. The most important part of the index is the higher levels of the tree. The index of most filesystems points to but does not contain inodes and file data, and so the higher levels of the btrfs btree are more like a traditional FS index. There's a knob here for the max inline item size and if someone wants to fill their whole FS with 2K files, the default settings are not at all suitable. Filling with 3K files actually works better because we end up with inode and file data in the same block much of the time. Chris, thanks for the description. I think that such balancing is totally incorrect. In accordance with your strategy insertions can spawn shallow leaves, and deletions will result in shallow leaves (because of merging with shallow leaves). This will lead to unbound utilization slump. In particular, your patch is just a workaround, which doesn't fix the core problem. Knobs for cases is a nonsense. Are you kidding? Ext4, xfs, reiserfs don't use any knobs, so why should we accept this regress on the background of common progress? Everything should be packed fine without any knobs... Your strategy slightly resembles balancing in Reiserfs file system, so I have two comments here. 1. Algorithms used in Reiserfs are "precise" (or "minimal"). That means there is no "superfluous" balancing there. It is impossible to save on balancing operations and slightly decrease utilization bound: you will lose everything (this is unacceptable). Only classic Bayer's B-trees allow to reduce bound 1/2 via replacing usual balancing condition q <= N <= 2q with more liberal one (say, q <= N <= 3q). 2. If you want to base on COW-friendly Reiserfs, then it's algorithms should be properly modified for the case of top-down balancing and reviewed (better by the authors of the original algorithms). I recommend to consider the following approach: Balancing the leaf level A. Inserting a key of length I(*) 01Suppose we traversed the tree and have arrived to the position 02pos in the leaf A. 03 04At first we try to make space by shifting all possible items at the 05left and right from the pos to the neighbors B and C respectively: 06 07B A C 08 09Let's denote via L and R total length of items on leaf A at left 10and right from the pos respectively. 11 12If after shifting there is enough space in A, then we perform 13insertion to A. After the insertion pairs (BA) and (AC) are 14incompressible for obvious reasons. 15 16Else
Re: Balancing leaves when walking from top to down (was Btrfs:...)
Chris Mason wrote: On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote: Chris Mason wrote: On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote: I'll reproduce from your test case and provide a fix. mount -o max_inline=1500 would give us 50% usage in the worst case This is a very strange statement: how did you calculate this lower bound? We want room for the extent and the inode item and the inode backref. It's a rough number that leaves some extra room. But even with a 2K item we're getting very close to 50% usage of the metadata area. (minus the balancing bug you hit). Ok, the balancing bug was interesting. What happens is we create all the inodes and directory items nicely packed into the leaves. Then delayed allocation kicks in and starts inserting the big fat inline extents. This often forces the balancing code to split a leaf twice in order to make room for the new inline extent. The double split code wasn't balancing the bits that were left over on either side of our desired slot. The patch below fixes the problem for me, reducing the number of leaves that have more than 2K of free space down from 6% of the total to about 74 leaves. It could be zero, but the balancing code doesn't push items around if our leaf is in the first or last slot of the parent node (complexity vs benefit tradeoff). Nevertheless, I see leaves, which are not in the first or last slot, but mergeable with their neighbors (test case is the same): Correct, but it was in the first or last slot when it was balanced (I confirmed this with printk). Then the higher layers were balanced and our leaves were no longer in the first/last slot. We don't rebalance leaves when we balance level 1. leaf 269557760 items 22 free space 2323 generation 25 owner 2 fs uuid 614fb921-cfa9-403d-9feb-940021e72382 chunk uuid b1674882-a445-45f9-b250-0985e483d231 leaf 280027136 items 18 free space 2627 generation 25 owner 2 fs uuid 614fb921-cfa9-403d-9feb-940021e72382 chunk uuid b1674882-a445-45f9-b250-0985e483d231 node 269549568 level 1 items 60 free 61 generation 25 owner 2 fs uuid 614fb921-cfa9-403d-9feb-940021e72382 chunk uuid b1674882-a445-45f9-b250-0985e483d231 key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15 key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15 key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15 key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17 key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17 key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17 key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25 key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25 key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25 key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25 key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25 key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25 key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25 key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16 key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16 key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16 [...] With the patch, I'm able to create 106894 files (2K each) on a 1GB FS. That doesn't sound like a huge improvement, but the default from mkfs.btrfs is to duplicate metadata. After duplication, that's about 417MB or about 40% of the overall space. When you factor in the space that we reserve to avoid exploding on enospc and the space that we have allocated for data extents, that's not a very surprising number. I'm still putting this patch through more testing, the double split code is used in some difficult corners and I need to make sure I've tried all of them. Chris, for the further code review we need documents, which reflect the original ideas of the balancing, etc. Could you please provide them? Obviously, I can not do it instead of you, as I don't know those ideas. Which part are you most interested in? Balancing. What conditions do we want to provide? Why won't we get utilization slump in future? How do we need to fix things when something goes wrong? Whereas in accordance with the single existing paper Btrfs design looks really broken, because: 1. the balancing condition n <= number_of_keys <= 2n+1 is not satisfied (indeed, when number_of_keys is 1 we have 1 <= 2n+1 -- false); 2. the trees mentioned in this paper don't allow keys of variable length. Thanks, Edward. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Balancing leaves when walking from top to down (was Btrfs:...)
Chris Mason wrote: On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote: I'll reproduce from your test case and provide a fix. mount -o max_inline=1500 would give us 50% usage in the worst case This is a very strange statement: how did you calculate this lower bound? (minus the balancing bug you hit). Ok, the balancing bug was interesting. What happens is we create all the inodes and directory items nicely packed into the leaves. Then delayed allocation kicks in and starts inserting the big fat inline extents. This often forces the balancing code to split a leaf twice in order to make room for the new inline extent. The double split code wasn't balancing the bits that were left over on either side of our desired slot. The patch below fixes the problem for me, reducing the number of leaves that have more than 2K of free space down from 6% of the total to about 74 leaves. It could be zero, but the balancing code doesn't push items around if our leaf is in the first or last slot of the parent node (complexity vs benefit tradeoff). Nevertheless, I see leaves, which are not in the first or last slot, but mergeable with their neighbors (test case is the same): leaf 269557760 items 22 free space 2323 generation 25 owner 2 fs uuid 614fb921-cfa9-403d-9feb-940021e72382 chunk uuid b1674882-a445-45f9-b250-0985e483d231 leaf 280027136 items 18 free space 2627 generation 25 owner 2 fs uuid 614fb921-cfa9-403d-9feb-940021e72382 chunk uuid b1674882-a445-45f9-b250-0985e483d231 node 269549568 level 1 items 60 free 61 generation 25 owner 2 fs uuid 614fb921-cfa9-403d-9feb-940021e72382 chunk uuid b1674882-a445-45f9-b250-0985e483d231 key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15 key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15 key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15 key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17 key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17 key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17 key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25 key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25 key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25 key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25 key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25 key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25 key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25 key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16 key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16 key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16 [...] With the patch, I'm able to create 106894 files (2K each) on a 1GB FS. That doesn't sound like a huge improvement, but the default from mkfs.btrfs is to duplicate metadata. After duplication, that's about 417MB or about 40% of the overall space. When you factor in the space that we reserve to avoid exploding on enospc and the space that we have allocated for data extents, that's not a very surprising number. I'm still putting this patch through more testing, the double split code is used in some difficult corners and I need to make sure I've tried all of them. Chris, for the further code review we need documents, which reflect the original ideas of the balancing, etc. Could you please provide them? Obviously, I can not do it instead of you, as I don't know those ideas. Thanks, Edward. -chris diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 0d1d966..1f393b0 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -2304,12 +2304,17 @@ noinline int btrfs_leaf_free_space(struct btrfs_root *root, return ret; } +/* + * min slot controls the lowest index we're willing to push to the + * right. We'll push up to and including min_slot, but no lower + */ static noinline int __push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_path *path, int data_size, int empty, struct extent_buffer *right, - int free_space, u32 left_nritems) + int free_space, u32 left_nritems, + u32 min_slot) { struct extent_buffer *left = path->nodes[0]; struct extent_buffer *upper = path->nodes[1]; @@ -2327,7 +2332,7 @@ static noinline int __push_leaf_right(struct btrfs_trans_handle *trans, if (empty) nr = 0; else - nr = 1; + nr = max_t(u32, 1, min_slot); if (path->slots[0] >= left_nritems) push_space += data_size; @@ -2469,10 +2474,13 @@ out_unlock: * *
Re: Balancing leaves when walking from top to down (was Btrfs:...)
Ric Wheeler wrote: On 06/18/2010 06:04 PM, Edward Shishkin wrote: Chris Mason wrote: On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote: Jamie Lokier wrote: Edward Shishkin wrote: If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level. Personally I don't see any way to improve the situation with Btrfs except full redesigning the last one. If you want to base your file system on the paper of Ohad Rodeh, then please, use *exactly* the Bayer's B-trees that he refers to. That said, make sure that all records you put to the tree has equal length and all non-root nodes of your tree are at least half filled. First, thanks Edward for identifying a specific problem with the current btrfs implementation. Hello Jamie. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. Which property is robust? Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. I didn't say about intrinsic algorithmic problems :) I just repeat (after Knuth et al) that B-trees with variable-length records don't have any sane boundary for internal fragmentation. The common idea is that if we don't want Btrfs to be in infinite development stage, then we should choose some *sane* strategy (for example the paper of Ohad Rodeh) and strictly adhere this in future. Again, other than the inline file data, what exactly do you believe needs to change? 1. getting rid of inline extents; 2. new formats for directory and xattr items to not look like a train, which is able to occupy the whole leaf; 3. make sure we do pro-active balancing like it is described in the paper. Sorry, I don't see other ways for now.. Top down balancing vs balancing on insertion doesn't impact our ability to maintain full leaves. The current code is clearly choosing not to merge two leaves that it should have merged, which is just a plain old bug. How are you going to balance leaves when walking from top to down? Suppose 1) and 2) above are not satisfied and having arrived to the leaf level we see a number of items of variable length. What will we do to keep leaves full? Could you please provide a sketch of the algorithm? Thanks! Hi Edward, Is it really a requirement to have 100% full leaves? Most DB's (assuming I remember correctly) have deliberate strategies around this kind of thing. You might want to leave room in leaf nodes so that future insertions can be contiguous on the media with older data. Regards, Ric Hello Ric. No, every leaf shouldn't be necessarily 100% full. We may require every L-vicinity(*) of every leaf to be full in some sense. And this local condition sometimes brings the (global) boundary for utilization of the whole tree. In the classic Bayer's B-trees we have so-called "S(0)" balancing condition implicitly satisfied, which requires every leaf to be at least half full. It provides the low boundary 0.5 for utilization of the whole tree. Next example is ReiserFS file system, which uses so-called "S(1)" balancing condition on the leaf level, which requires every 1-vicinity of every leaf to be "incompressible" (i.e. two leaves can not be squeezed to a single one). This local incompressibility brings the sweet low utilization boundary 0.5-E for the whole tree (E is a small number, which is back proportional to the tree branching). (*) any set of L+1 neighboring nodes, which contains the leaf. -- Edward O. Shishkin Principal Software Engineer Red Hat Czech -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Balancing leaves when walking from top to down (was Btrfs:...)
Chris Mason wrote: On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote: Jamie Lokier wrote: Edward Shishkin wrote: If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level. Personally I don't see any way to improve the situation with Btrfs except full redesigning the last one. If you want to base your file system on the paper of Ohad Rodeh, then please, use *exactly* the Bayer's B-trees that he refers to. That said, make sure that all records you put to the tree has equal length and all non-root nodes of your tree are at least half filled. First, thanks Edward for identifying a specific problem with the current btrfs implementation. Hello Jamie. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. Which property is robust? Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. I didn't say about intrinsic algorithmic problems :) I just repeat (after Knuth et al) that B-trees with variable-length records don't have any sane boundary for internal fragmentation. The common idea is that if we don't want Btrfs to be in infinite development stage, then we should choose some *sane* strategy (for example the paper of Ohad Rodeh) and strictly adhere this in future. Again, other than the inline file data, what exactly do you believe needs to change? 1. getting rid of inline extents; 2. new formats for directory and xattr items to not look like a train, which is able to occupy the whole leaf; 3. make sure we do pro-active balancing like it is described in the paper. Sorry, I don't see other ways for now.. Top down balancing vs balancing on insertion doesn't impact our ability to maintain full leaves. The current code is clearly choosing not to merge two leaves that it should have merged, which is just a plain old bug. How are you going to balance leaves when walking from top to down? Suppose 1) and 2) above are not satisfied and having arrived to the leaf level we see a number of items of variable length. What will we do to keep leaves full? Could you please provide a sketch of the algorithm? Thanks! -- Edward O. Shishkin Principal Software Engineer Red Hat Czech -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Jamie Lokier wrote: Edward Shishkin wrote: If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level. Personally I don't see any way to improve the situation with Btrfs except full redesigning the last one. If you want to base your file system on the paper of Ohad Rodeh, then please, use *exactly* the Bayer's B-trees that he refers to. That said, make sure that all records you put to the tree has equal length and all non-root nodes of your tree are at least half filled. First, thanks Edward for identifying a specific problem with the current btrfs implementation. Hello Jamie. I've studied modified B-trees quite a lot and know enough to be sure that they are quite robust when you modify them in all sorts of ways. Which property is robust? Moreover, you are incorrect to say there's an intrinsic algorithmic problem with variable-length records. It is not true; if Knuth said so, Knuth was mistaken. I didn't say about intrinsic algorithmic problems :) I just repeat (after Knuth et al) that B-trees with variable-length records don't have any sane boundary for internal fragmentation. The common idea is that if we don't want Btrfs to be in infinite development stage, then we should choose some *sane* strategy (for example the paper of Ohad Rodeh) and strictly adhere this in future. This is easily shown by modelling variable-length records (key -> string) as a range of fixed length records ([key,index] -> byte) and apply the standard B-tree algorithms to that, which guarantees algorithm properties such as space utilisation and time; then you can substitute a "compressed" representation of contiguous index runs, which amounts to nothing more than just storing the strings (split where the algorithm says to do so) and endpoint indexes , and because this compression does not expand (in any way that matters), classic algorithmic properties are still guaranteed. Variable-length keys are a different business. Those are trickier, but as far as I know, btrfs doesn't use them. As to current Btrfs code: *NOT ACK*!!! I don't think we need such "file systems". Btrfs provides many useful features that other filesystems don't. We definitely need it, or something like it. You have identified a bug. It's not a corruption bug, but it's definitely a bug, and probably affects performance as well as space utilisation. It is not deep design bug; it is just a result of the packing and balancing heuristics. Frankly, I would like to believe to such end, taking into account amount of my contributions to the Btrfs project. At least to make sure I didn't do useless work.. If you are still interested, please apply your knowledge of B-tree algorithms to understanding why btrfs fails to balance the tree sufficiently well, Because of top-down balancing. It doesn't like "clever" things like "splitting" and "merging". Currently top-down works properly only with stupid classic Bayer's B-trees. and then propose a fix. I'll try to help, but I am rather pessimistic here: working out algorithms is something, which doesn't like timelines.. Note that it's not necessarily a problem to have a few nodes with low utilisation. Deliberate violation of the classic balancing heuristic is often useful for performance.[*] Ok, let's violate this, but not in the detriment of utilisation: Internal fragmentation is a horror for file systems, the enemy #1 (which is completely defeated in the last century BTW). The problem you've found is only a real problem when there are _too many_ nodes with low utilisation. IMHO this is a problem, as we can not estimate number of such nodes. Do you have any sane upper boundary for this number? I don't know such one. Maybe I have missed something? [*] For example when filling a tree by inserting contiguously ascending keys, the classic "split into two when full" heuristic gives the worst possible results (50% lost space), and deliberately underfilling the most actively updated nodes, which is not permitted at all by the classic algorithm, gives denser packing in the end (almost zero lost space). At the end of what? I hope you don't speak about on-line defragmentation? Can you point me to the paper (if any)? Thanks! It's also faster. The trick is to make sure there's just the right number of underfilled nodes... -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org Mo
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Daniel J Blueman wrote: On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". Internal fragmentation (see Appendix A) of those 5 leafs is (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then ext4 and xfs: The last ones in this example will show fragmentation near zero with blocksize <= 2K. Even with 4K blocksize they will show better utilization 0.50 (against 0.38 in btrfs)! I have a small question for btrfs developers: Why do you folks put "inline extents", xattr, etc items of variable size to the B-tree in spite of the fact that B-tree is a data structure NOT for variable sized records? This disadvantage of B-trees was widely discussed. For example, maestro D. Knuth warned about this issue long time ago (see Appendix C). It is a well known fact that internal fragmentation of classic Bayer's B-trees is restricted by the value 0.50 (see Appendix C). However it takes place only if your tree contains records of the _same_ length (for example, extent pointers). Once you put to your B-tree records of variable length (restricted only by leaf size, like btrfs "inline extents"), your tree LOSES this boundary. Moreover, even worse: it is clear, that in this case utilization of B-tree scales as zero(!). That said, for every small E and for every amount of data N we can construct a consistent B-tree, which contains data N and has utilization worse then E. I.e. from the standpoint of utilization such trees can be completely degenerated. That said, the very important property of B-trees, which guarantees non-zero utilization, has been lost, and I don't see in Btrfs code any substitution for this property. In other words, where is a formal guarantee that all disk space of our users won't be eaten by internal fragmentation? I consider such guarantee as a *necessary* condition for putting a file system to production. Wow...a small part of me says 'well said', on the basis that your assertions are true, but I do think there needs to be more constructivity in such critique; it is almost impossible to be a great engineer and a great academic at once in a time-pressured environment. Sure it is impossible. I believe in division of labour: academics writes algorithms, and we (engineers) encode them. I have noticed that events in Btrfs develop by scenario not predicted by the paper of academic Ohad Rodeh (in spite of the announce that Btrfs is based on this paper). This is why I have started to grumble.. Thanks. If you can produce some specific and suggestions with code references, I'm sure we'll get some good discussion with potential to improve from where we are. Thanks, Daniel -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote: On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote: Chris Mason wrote: On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents). Hello, Chris. Thanks for response! I afraid that both ways won't fix the problem. Look at this leaf: [...] leaf 29425664 items 1 free space 3892 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux [...] There is no inline extents, and what are you going to split here? All leafs must be at least a half filled, otherwise we loose all boundaries, which provides non-zero utilization.. Right, there is no inline extent because we require them to fit entirely in the leaf. So we end up with mostly empty leaves because the inline item is large enough to make it difficult to push around but not large enough to fill the leaf. How about left and right neighbors? They contain a lot of free space (1572 and 1901 respectively). I am not happy with the very fact of such shallow leafs which contain only one small (xattr) item.. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Chris Mason wrote: On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote: Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". There are two easy ways to fix this problem. Turn off the inline extents (max_inline=0) or allow splitting of the inline extents. I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents). Hello, Chris. Thanks for response! I afraid that both ways won't fix the problem. Look at this leaf: [...] leaf 29425664 items 1 free space 3892 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux [...] There is no inline extents, and what are you going to split here? All leafs must be at least a half filled, otherwise we loose all boundaries, which provides non-zero utilization.. Any ideas? Thanks, Edward. It must be a highly unexpected and difficult question for file system developers: "how efficiently does your file system manage disk space"? In the meanwhile I confirm that Btrfs design is completely broken: records stored in the B-tree differ greatly from each other (it is unacceptable!), and the balancing algorithms have been modified in insane manner. All these factors has led to loss of *all* boundaries holding internal fragmentation and to exhaustive waste of disk space (and memory!) in spite of the property "scaling in their ability to address large storage". This is not a large storage, this is a "scalable sieve": you can not rely on finding there some given amount of water even after infinite increasing the size of the sieve (read escalating the pool of Btrfs devices). It seems that nobody have reviewed Btrfs before its inclusion to the mainline. I have only found a pair of recommendations with a common idea that Btrfs maintainer is "not a crazy man". Plus a number of papers which admire with the "Btrfs phenomena". Sigh. Well, let's decide what can we do in current situation.. The first obvious point here is that we *can not* put such file system to production. Just because it doesn't provide any guarantees for our users regarding disk space utilization. Are you basing all of this on inline extents? The other extents of variable size are more flexible (taking up the room in the leaf), but they can also easy be split during balancing. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Mat wrote: On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin wrote: Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". Internal fragmentation (see Appendix A) of those 5 leafs is (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then ext4 and xfs: The last ones in this example will show fragmentation near zero with blocksize <= 2K. Even with 4K blocksize they will show better utilization 0.50 (against 0.38 in btrfs)! I have a small question for btrfs developers: Why do you folks put "inline extents", xattr, etc items of variable size to the B-tree in spite of the fact that B-tree is a data structure NOT for variable sized records? This disadvantage of B-trees was widely discussed. For example, maestro D. Knuth warned about this issue long time ago (see Appendix C). It is a well known fact that internal fragmentation of classic Bayer's B-trees is restricted by the value 0.50 (see Appendix C). However it takes place only if your tree contains records of the _same_ length (for example, extent pointers). Once you put to your B-tree records of variable length (restricted only by leaf size, like btrfs "inline extents"), your tree LOSES this boundary. Moreover, even worse: it is clear, that in this case utilization of B-tree scales as zero(!). That said, for every small E and for every amount of data N we can construct a consistent B-tree, which contains data N and has utilization worse then E. I.e. from the standpoint of utilization such trees can be completely degenerated. That said, the very important property of B-trees, which guarantees non-zero utilization, has been lost, and I don't see in Btrfs code any substitution for this property. In other words, where is a formal guarantee that all disk space of our users won't be eaten by internal fragmentation? I consider such guarantee as a *necessary* condition for putting a file system to production. Any technical comments are welcome. Thanks, Edward. Appendix A. --- Glossary 1. Utilization of data and(or) metadata storage. The fraction A/B, where A is total size in bytes of stored data and(or) metadata. B = N * S, where N is number of blocks occupied by stored data and(or) metadata. S is block size in bytes. 2. Internal fragmentation of data and(or) metadata storage. difference (1 - U), where U is utilization. Appendix B. --- a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2) ... leaf 29982720 items 4 free space 1572 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069 inline extent data size 2048 ram 2048 compress 0 item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160 inode generation 8 size 2048 block group 29360128 mode 100644 links 1 item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16 inode ref index 65 namelen 6 name: file64 leaf 29425664 items 1 free space 3892 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux leaf 29990912 items 1 free space 1901 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069 inline extent data size 2048 ram 2048 compress 0 leaf 29986816 items 3 free space 3666 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 1
Unbound(?) internal fragmentation in Btrfs
Hello everyone. I was asked to review/evaluate Btrfs for using in enterprise systems and the below are my first impressions (linux-2.6.33). The first test I have made was filling an empty 659M (/dev/sdb2) btrfs partition (mounted to /mnt) with 2K files: # for i in $(seq 100); \ do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done (terminated after getting "No space left on device" reports). # ls /mnt | wc -l 59480 So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17, and the first obvious question is "hey, where are other 83% of my disk space???" I looked at the btrfs storage tree (fs_tree) and was shocked with the situation on the leaf level. The Appendix B shows 5 adjacent btrfs leafs, which have the same parent. For example, look at the leaf 29425664: "items 1 free space 3892" (of 4096!!). Note, that this "free" space (3892) is _dead_: any attempts to write to the file system will result in "No space left on device". Internal fragmentation (see Appendix A) of those 5 leafs is (1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then ext4 and xfs: The last ones in this example will show fragmentation near zero with blocksize <= 2K. Even with 4K blocksize they will show better utilization 0.50 (against 0.38 in btrfs)! I have a small question for btrfs developers: Why do you folks put "inline extents", xattr, etc items of variable size to the B-tree in spite of the fact that B-tree is a data structure NOT for variable sized records? This disadvantage of B-trees was widely discussed. For example, maestro D. Knuth warned about this issue long time ago (see Appendix C). It is a well known fact that internal fragmentation of classic Bayer's B-trees is restricted by the value 0.50 (see Appendix C). However it takes place only if your tree contains records of the _same_ length (for example, extent pointers). Once you put to your B-tree records of variable length (restricted only by leaf size, like btrfs "inline extents"), your tree LOSES this boundary. Moreover, even worse: it is clear, that in this case utilization of B-tree scales as zero(!). That said, for every small E and for every amount of data N we can construct a consistent B-tree, which contains data N and has utilization worse then E. I.e. from the standpoint of utilization such trees can be completely degenerated. That said, the very important property of B-trees, which guarantees non-zero utilization, has been lost, and I don't see in Btrfs code any substitution for this property. In other words, where is a formal guarantee that all disk space of our users won't be eaten by internal fragmentation? I consider such guarantee as a *necessary* condition for putting a file system to production. Any technical comments are welcome. Thanks, Edward. Appendix A. --- Glossary 1. Utilization of data and(or) metadata storage. The fraction A/B, where A is total size in bytes of stored data and(or) metadata. B = N * S, where N is number of blocks occupied by stored data and(or) metadata. S is block size in bytes. 2. Internal fragmentation of data and(or) metadata storage. difference (1 - U), where U is utilization. Appendix B. --- a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2) ... leaf 29982720 items 4 free space 1572 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069 inline extent data size 2048 ram 2048 compress 0 item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160 inode generation 8 size 2048 block group 29360128 mode 100644 links 1 item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16 inode ref index 65 namelen 6 name: file64 leaf 29425664 items 1 free space 3892 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78 location key (0 UNKNOWN 0) type 8 namelen 16 datalen 32 name: security.selinux leaf 29990912 items 1 free space 1901 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069 inline extent data size 2048 ram 2048 compress 0 leaf 29986816 items 3 free space 3666 generation 8 owner 5 fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956 chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6 item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160 inode generation 8 size 2048 block group 29360128 mode 100644 links 1 item 1 key (321 INODE_REF 256) itemoff 3819 itemsize 16 in
Re: [Jfs-discussion] benchmark results
When things didn't match up that was a clue that either - the benchmark was broken - the code was broken [...] I would carry out an object-oriented dualism here. [1] methods (kernel module) [2] objects (formatted partition) || || [3] benchmarks - [4] user-space utilities (fsck) User-space utilities investigate "object corruptions", whereas benchmarks investigate "software corruptions" (including bugs in source code, broken design, etc, etc..) It is clear that "software" can be "corrupted" by a larger number of ways than "objects". Indeed, it is known that dual space V* (of all linear functions over V) is a much more complex object than V. So benchmark is a process which takes a set of methods (we consider only "software" benchmarks) and puts numerical values populated with a special (the worst) value CRASH. Three main categories of benchmarks using: 1) Internal testing An engineer makes optimizations in a file system (e.g. for a customer) via choosing functions or plugins as winners in a set of internal (local) "nominations". 2) Business plans A system administrator chooses a "winner" in some (global) "nomination" of file systems in accordance with internal business-plans. 3) Flame and politics Someone presents a "nomination" (usually with the "winner" among restricted number of nominated members) to the public while nobody asked him to do it. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] grub-0.97: btrfs support
Johannes Hirte wrote: Am Freitag 11 Dezember 2009 16:27:54 schrieb Edward Shishkin: Johannes Hirte wrote: Am Freitag 11 Dezember 2009 12:17:29 schrieb Edward Shishkin: Johannes Hirte wrote: Am Freitag 11 Dezember 2009 00:15:46 schrieb Johannes Hirte: Am Freitag 25 September 2009 00:06:23 schrieb Edward Shishkin: Hello everyone. ... The following patches are for Fedora 10(**). The distro-independent package will be put to kernel.org a bit later. All comments, bugreports, etc. are welcome as usual. Ok, I have another comment/bugreport *g*. I'm testing this patch with gentoo, so the grub sources are not identicaly the same. With this patches applied, grub is unable to detect JFS or XFS filesystems. XFS is reported as unknown, JFS is reported as btrfs. Reiserfs and ext2/3 are detected as expected. Yes, this patch is for Fedora. For other distros some issues are possible, so please be careful.. I've also tested now with the fedora sources. There is the same bug. The btrfs patch breaks the filesystem detection. All filesystems after btrfs in fsys_table aren't detected. Moving btrfs to the end of fsys_table is a workaround but will interfere with FFS. So this should better be fixed in the btrfs-part of grub, so that it: a) doesn't missdetect a JFS filesystem as btrfs b) doesn't break the detection for remaining filesystems in the array. Hello. Yes, I confirm that xfs, etc. file systems are not detected, but missdetection jfs as btrfs looks rather fantastic :) Please, try the attached patch. Report if any problems. The patch works, but the problem with misdetected JFS filesystem still persists. It happens if the device contained a btrfs filesystem before. I assume that the JFS super block starts later on the device as the btrfs one do and jfs_mkfs doesn't clean the space ahead of the JFS super block. So if a JFS filesystem is created on a device that contained a btrfs before, btrfs_mount still detects the beginning of the old btrfs super block and reads crap later on. Yup, sticky thing.. To avoid this, btrfs detection could be placed after JFS. Are there any objections against this? If so, we might want to put if after XFS, which is also mistreated in such manner.. Thanks, Edward. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] grub-0.97: btrfs support
Johannes Hirte wrote: Am Freitag 11 Dezember 2009 12:17:29 schrieb Edward Shishkin: Johannes Hirte wrote: Am Freitag 11 Dezember 2009 00:15:46 schrieb Johannes Hirte: Am Freitag 25 September 2009 00:06:23 schrieb Edward Shishkin: Hello everyone. ... The following patches are for Fedora 10(**). The distro-independent package will be put to kernel.org a bit later. All comments, bugreports, etc. are welcome as usual. Ok, I have another comment/bugreport *g*. I'm testing this patch with gentoo, so the grub sources are not identicaly the same. With this patches applied, grub is unable to detect JFS or XFS filesystems. XFS is reported as unknown, JFS is reported as btrfs. Reiserfs and ext2/3 are detected as expected. Yes, this patch is for Fedora. For other distros some issues are possible, so please be careful.. I've also tested now with the fedora sources. There is the same bug. The btrfs patch breaks the filesystem detection. All filesystems after btrfs in fsys_table aren't detected. Moving btrfs to the end of fsys_table is a workaround but will interfere with FFS. So this should better be fixed in the btrfs-part of grub, so that it: a) doesn't missdetect a JFS filesystem as btrfs b) doesn't break the detection for remaining filesystems in the array. Hello. Yes, I confirm that xfs, etc. file systems are not detected, but missdetection jfs as btrfs looks rather fantastic :) Please, try the attached patch. Report if any problems. Thanks, Edward. Problem: XFS, JFS, etc. file systems of the fsys_table are not detected by grub with grub-0.97-btrfs.patch applied. BUG: btrfs_mount() sets ERR_FSYS_MOUNT to the global variable errnum if no btrfs metadata were found on a partition. As the result all next calls of devread() (and, hence, attempts to find metadata of other file systems) failed. Solution: Don't set ERR_FSYS_MOUNT, if btrfs metadata were found, just return 0. Signed-off-by: Edward Shishkin --- stage2/fsys_btrfs.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- grub-0.97.orig/stage2/fsys_btrfs.c +++ grub-0.97/stage2/fsys_btrfs.c @@ -638,7 +638,7 @@ int btrfs_mount(void) if (ret) { btrfs_msg("Drive %lu, partition %lu: no Btrfs metadata\n", current_drive, part_start); - goto error; + return 0; } ret = btrfs_uptodate_super_copy(BTRFS_FS_INFO); if (ret)
Re: [patch 2/2] grub-0.97: btrfs multidevice configuration support
Johannes Hirte wrote: Am Dienstag 03 November 2009 01:59:39 schrieb Edward Shishkin: Johannes Hirte wrote: Why is the btrfs code dealing with network devices at all? Why not? :) I don't see the possiblity to get a btrfs filesystem this way. So as far as I understand this, it's complete useless. The CD support doesn't look very usefull too to me. It's possible to put a btrfs filesystem on a CD or DVD. But that seems rather theoretical. Ok, let's keep this theoretical possibility, I don't see wrong things here.. Thanks, Edward. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] grub-0.97: btrfs support
Johannes Hirte wrote: Am Freitag 11 Dezember 2009 00:15:46 schrieb Johannes Hirte: Am Freitag 25 September 2009 00:06:23 schrieb Edward Shishkin: Hello everyone. ... The following patches are for Fedora 10(**). The distro-independent package will be put to kernel.org a bit later. All comments, bugreports, etc. are welcome as usual. Ok, I have another comment/bugreport *g*. I'm testing this patch with gentoo, so the grub sources are not identicaly the same. With this patches applied, grub is unable to detect JFS or XFS filesystems. XFS is reported as unknown, JFS is reported as btrfs. Reiserfs and ext2/3 are detected as expected. Yes, this patch is for Fedora. For other distros some issues are possible, so please be careful.. Thanks, Edward. A possible solution is to put FSYS_BTRFS on the end of struct fsys_entry fsys_table. I've tested with FSYS_BTFS as the second last entry, the last is still FFS. diff -Nru grub-0.97-r9/stage2/disk_io.c grub-0.97-r10/stage2/disk_io.c --- grub-0.97-r9/stage2/disk_io.c 2009-12-10 23:41:37.0 +0100 +++ grub-0.97-r10/stage2/disk_io.c 2009-12-11 00:50:51.555007247 +0100 @@ -79,6 +79,9 @@ # ifdef FSYS_ISO9660 {"iso9660", iso9660_mount, iso9660_read, iso9660_dir, 0, 0}, # endif +# ifdef FSYS_BTRFS + {"btrfs", btrfs_mount, btrfs_read, btrfs_dir, 0, btrfs_embed}, +# endif /* XX FFS should come last as it's superblock is commonly crossing tracks on floppies from track 1 to 2, while others only use 1. */ # ifdef FSYS_FFS With this order, XFS and JFS filesystems are identified correct. But I think, this is just a workaround. regards, Johannes -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/2] grub-0.97: btrfs multidevice configuration support
Johannes Hirte wrote: Am Freitag 25 September 2009 00:06:40 schrieb Edward Shishkin: Hi Edward, I was pointed to a problem with this patch. +static u64 scan_grub_devices(struct btrfs_device *dev, +int (*discerner)(struct btrfs_device **, int), +int lookup) +{ ... +#ifdef SUPPORT_NETBOOT + errnum = ERR_NONE; + if (network_ready && + !get_diskinfo(NETWORK_DRIVE, &geom)) { + dev->drive = NETWORK_DRIVE; + dev->part = 0; + dev->length = geom.total_sectors; + if (discerner(&dev, lookup)) { + count++; + if (lookup) + goto exit; + } + } +#endif /* SUPPORT_NETBOOT */ + exit: + return count; +} This won't compile since network_ready is undeclared. Yup, indeed.. Why is the btrfs code dealing with network devices at all? Why not? :) Well, would you please disable it for now with the attached patch? Thanks, Edward. Disable booting from network devices. Signed-off-by: Edward Shishkin --- stage2/fsys_btrfs.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- grub-0.97.orig/stage2/fsys_btrfs.c +++ grub-0.97/stage2/fsys_btrfs.c @@ -452,7 +452,7 @@ static u64 scan_grub_devices(struct btrf goto exit; } } -#ifdef SUPPORT_NETBOOT +#if 0 errnum = ERR_NONE; if (network_ready && !get_diskinfo(NETWORK_DRIVE, &geom)) { @@ -465,7 +465,7 @@ static u64 scan_grub_devices(struct btrf goto exit; } } -#endif /* SUPPORT_NETBOOT */ +#endif /* 0 */ exit: return count; }
Re: [patch 1/2] grub-0.97: btrfs support for a singe device configuration
Johannes Hirte wrote: [...] This compiles and works, but only with stage2. When using stage1_5, grub hangs on boot with GRUB loading stage1.5 and blinking cursor. Is the multidevice patch also necessary for single devices? Yes, please use both patches! I have split the stuff because the mailing list refuse attachments >100K. Thanks, Edward. I've omitted this patch, since my root-fs (with /boot) lies on a single partition. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/2] grub-0.97: btrfs support for a singe device configuration
Johannes Hirte wrote: Am Freitag 25 September 2009 00:06:32 schrieb Edward Shishkin: I was trying the patch and got a little confused. How did you get this work without linking against libgcc? Hmm.. Actually my Fedora stuff does link it: [...] +btrfs_stage1_5_exec_LDADD = @LIBGCC@ [...] I guess you have missed it when porting to Gentoo.. Thanks, Edward. I've tested it with the gentoo patches for grub and get /usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:551: undefined reference to `__udivdi3' /usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:571: undefined reference to `__umoddi3' /usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:575: undefined reference to `__umoddi3' /usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:581: undefined reference to `__umoddi3' /usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:583: undefined reference to `__udivdi3' /usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:588: undefined reference to `__umoddi3' /usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:589: undefined reference to `__udivdi3' collect2: ld returned 1 exit status make[2]: *** [pre_stage2.exec] Error 1 make[2]: Leaving directory `/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97' make: *** [all] Error 2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how does grub exactly work?
Andreas Jellinghaus wrote: Hi Edward, Hello. I saw your mail on btrfs ml with the grub patches and the notes how to deal with btrfs. can you explain how grub and btrfs work exactly? I read the grub manual at http://www.gnu.org/software/grub/manual/html_node/Bootstrap- tricks.html#Bootstrap-tricks so I wonder: does btrfs provide a "boot loader area" similar to ffs and reiserfs, where grub places the stage 1.5 code and stage 1 can read it? Yes. or does grub find out the sectors of the stage 1.5 file and put the list of those into the stage 1 file (first sector address of stage 1.5 file) and stage 1.5 first sector (list of all other sectors)? Grub doesn't make the blocklist for stage1_5 Optionally grub makes the blocklist for stage2 (in particular when stage1.5 can not be embedded for some reasons). Note, that it is important to embed and use btrfs_stage1_5. First, because btrfs has defragmentator, which can make the blocklist out of date (so that you'll need to reinstall grub). Second, I am not sure, if the blocklist will be composed correctly in the case when stage2 locates in btrfs volume (I didn't look at the blocklist specification: it can happen that grub installer makes an assumption that all sectors of stage2 locate on the same device). and what would be the proper way to make boot from a raid1 device? so that if one disk fails the other can boot? device (hd0) /dev/sda root (hd0,X) setup (hd0) device (hd0) /dev/sdb root (hd0,X) setup (hd0) Perhaps, it will work. But officially grub doesn't understand software raid, and I am not familiar with raid1 specifications, so... ;) if data is written to mbr only and the sectors between mbr and the first partition, this could work. but if data is written to mbr and to btrfs (either a "boot loader area" or changes to the stage 1.5 file), then that data can only contain the valid block lists for one of the two hard disks - which shouldn't be a problem if drives are the same and have 100% the same geometry and partitioning. so I wonder, and it would be great if thise fine details would be documented for btrfs somewhere, as there is very little information about them (the situation isn't better with other filesystems either). thanks for your help! Regards, Andreas -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 2/2] grub-0.97: btrfs multidevice configuration support
Signed-off-by: Edward Shishkin --- grub/Makefile.am|2 stage2/btrfs.h | 55 ++-- stage2/builtins.c | 10 stage2/disk_io.c|2 stage2/filesys.h|4 stage2/fsys_btrfs.c | 693 +--- 6 files changed, 595 insertions(+), 171 deletions(-) --- grub-0.97.orig/stage2/btrfs.h +++ grub-0.97/stage2/btrfs.h @@ -124,6 +124,7 @@ static int btrfs_csum_sizes[] = { 4, 0 } #define BTRFS_DEFAULT_NUM_DEVICES 1 #define BTRFS_DEFAULT_NODE_SIZE 4096 #define BTRFS_DEFAULT_LEAF_SIZE 4096 +#define BTRFS_NUM_CACHED_DEVICES 128 #define WARN_ON(c) #define cassert(cond) ({ switch (-1) { case (cond): case 0: break; } }) @@ -315,13 +316,22 @@ struct btrfs_node { struct btrfs_key_ptr ptrs[]; } __attribute__ ((__packed__)); +struct btrfs_device { + /* the internal btrfs device id */ + u64 devid; + /* the internal grub device representation */ + unsigned long drive; + unsigned long part; + unsigned long length; +}; + struct extent_buffer { /* metadata */ + struct btrfs_device dev; u64 start; u64 dev_bytenr; u32 len; - int refs; - int flags; + /* data */ char *data; }; @@ -555,12 +565,8 @@ struct btrfs_block_group_item { struct btrfs_root { struct extent_buffer node; char data[4096]; - struct extent_buffer *commit_root; struct btrfs_root_item root_item; - struct btrfs_key root_key; - struct btrfs_fs_info *fs_info; u64 objectid; - u64 last_trans; /* data allocations are done in sectorsize units */ u32 sectorsize; @@ -573,42 +579,31 @@ struct btrfs_root { /* leaf allocations are done in leafsize units */ u32 stripesize; +}; - int ref_cows; - int track_dirty; - - - u32 type; - u64 highest_inode; - u64 last_inode_alloc; +struct btrfs_file_info { + struct btrfs_key key; }; struct btrfs_root; struct btrfs_fs_devices; struct btrfs_fs_info { u8 fsid[BTRFS_FSID_SIZE]; - u8 chunk_tree_uuid[BTRFS_UUID_SIZE]; struct btrfs_root fs_root; struct btrfs_root tree_root; struct btrfs_root chunk_root; - struct btrfs_key file_info; /* currently opened file */ + struct btrfs_file_info file_info; /* currently opened file */ struct btrfs_path paths [LAST_LOOKUP_POOL]; - u64 generation; - u64 last_trans_committed; + char mbr[SECTOR_SIZE]; - u64 system_alloc_profile; - u64 alloc_start; + int sb_mirror; + u64 sb_transid; + struct btrfs_device sb_dev; + struct btrfs_super_block sb_copy; - struct btrfs_super_block super_temp; - struct btrfs_super_block super_copy; - - u64 super_bytenr; - u64 total_pinned; - - int system_allocs; - int readonly; + struct btrfs_device devices[BTRFS_NUM_CACHED_DEVICES + 1]; }; /* @@ -1129,6 +1124,11 @@ static inline void btrfs_set_key_type(st key->type = val; } +static inline u64 btrfs_super_devid(struct btrfs_super_block *disk_super) +{ + return le64_to_cpu(disk_super->dev_item.devid); +} + /* struct btrfs_header */ BTRFS_SETGET_HEADER_FUNCS(header_bytenr, struct btrfs_header, bytenr, 64); BTRFS_SETGET_HEADER_FUNCS(header_generation, struct btrfs_header, @@ -1317,6 +1317,7 @@ struct btrfs_fs_devices { }; struct btrfs_bio_stripe { + struct btrfs_device dev; u64 physical; }; --- grub-0.97.orig/stage2/fsys_btrfs.c +++ grub-0.97/stage2/fsys_btrfs.c @@ -31,15 +31,21 @@ #define BTRFS_FS_INFO \ ((struct btrfs_fs_info *)((unsigned long)FSYS_BUF + \ LOOKUP_CACHE_SIZE)) -#define BTRFS_CACHE_SIZE(sizeof(struct btrfs_fs_info) +\ -LOOKUP_CACHE_SIZE) -#define BTRFS_FILE_INFO (&BTRFS_FS_INFO->file_info) -#define BTRFS_TREE_ROOT (&BTRFS_FS_INFO->tree_root) -#define BTRFS_CHUNK_ROOT(&BTRFS_FS_INFO->chunk_root) -#define BTRFS_FS_ROOT (&BTRFS_FS_INFO->fs_root) -#define BTRFS_SUPER (&BTRFS_FS_INFO->super_copy) -#define LOOKUP_CACHE_BUF(id) ((char *)((unsigned long)FSYS_BUF + \ - id * LOOKUP_CACHE_BUF_SIZE)) +#define BTRFS_CACHE_SIZE (sizeof(struct btrfs_fs_info) + \ + LOOKUP_CACHE_SIZE) +#define BTRFS_TREE_ROOT (&BTRFS_FS_INFO->tree_root) +#define BTRFS_CHUNK_ROOT (&BTRFS_FS_INFO->chunk_root) +#define BTRFS_FS_ROOT(&BTRFS_FS_INFO->fs_root) +#define BTRFS_SUPER (&BTRFS_FS_INFO->sb_copy) +#define BTRFS_DEVICES(&BTRFS_FS_INFO->devices[0]) +#define BTRFS
[patch 0/2] grub-0.97: btrfs support
Hello everyone. The following patches are for Fedora 10(**). The distro-independent package will be put to kernel.org a bit later. I. Loading kernels from btrfs volumes Now you can load kernels and initrds from btrfs volumes composed of many devices. WARNING!!! Make sure that all components of your loading btrfs volume(*) are visible to grub. Otherwise, you'll end with unbootable system. The list of available grub devices can be obtained, for example, using tab completion in grub shell. Number of components of a loading volume is not restricted, however if it is larger then 128, then the boot process will be slowed down because of expensive translations (btrfs-device-id -> grub-device-id) which issue a large number of IOs. We cache only 128 such translations in grub-0.97 because of high memory pressure. II. Installing grub from btrfs volumes You can install grub from a btrfs image volume(*) composed of many devices (see above about restrictions). Also you can setup any component of a btrfs boot(*) volume as grub root device. NOTE!!! Make sure that all components of image and boot volumes(*) are visible to grub, otherwise grub installer will return error. TECHNICAL NOTE (for grub developers): The unpleasant surprise was that grub installer overwrites (by default!) the file (stage2), bypassing the file system driver. I can not understand this: it looks like stepping to the clean water with dirty shoe. Hope that grub2 won't afford such things. In order to install grub from a btrfs image volume use special option (--stage2). This option makes grub installer to rewrite the file with a help of the OS's file system (i.e, via write (2)). Any attempts to install without this option will fail with an error (wrong argument). The example of possible installation scenario. Suppose image volume = root volume = loading volume is composed of devices (hd0,4), (hd0,5), (hd1,5), (hd1,7) and is not an OS's root. We want to setup (hd0,4) as grub root device and install grub to the mbr of (hd0). . build and install grub with btrfs support; . mount your the "3 in 1" btrfs volume to /mnt; . create a directory /mnt/grub; . put the built files stage1, stage2, btrfs_stage1_5, grub.conf, etc. to /mnt/grub; . run grub shell; . grub> root (hd0,4) . grub> setup --stage2=/mnt/grub/stage2 (hd0) . have a fun. Use info(1) grub for more details. (*) Glossary: . loading volume: a btrfs volume that contains kernel image and initrd; . image volume: a btrfs volume that contains stage1, stage2, btrfs_stage_1_5, and grub.conf files needed for grub installer; . boot volume: a btrfs volume where grub will look for stage2 and grub.conf files in boot time. (**) Link to the Fedora's grub package: http://ucho.ignum.cz/fedora/linux/releases/10/Fedora/source/SRPMS/grub-0.97-38.fc10.src.rpm All comments, bugreports, etc. are welcome as usual. Thanks, Edward. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: grub-0.97/btrfs: the files fsys_btrfs.c, btrfs.h
Vladimir 'phcoder' Serbinenko wrote: On Wed, Jul 22, 2009 at 7:04 PM, Felix Zielcke wrote: Am Mittwoch, den 22.07.2009, 18:41 +0200 schrieb Edward Shishkin: Hi Edward, Hello. (CC linux-btrfs mailing list) Uhm there is no CC? I'm unsure now if I should CC it or not. Edward had a problem sending to our list because grub-devel is subscriber-only list. Let's CC linux-btrfs Anyway, nice to see that you work on btrfs for grub. But grub-legacy is totally dead now, it would be better if you would put your efforts into grub2. Mm.. I am not an individual contributor. I work for RHEL and Fedora, which is on grub-0.97 for now.. I would be happy to assist with this. I already have such experience porting zfs from grub-solaris to grub2 It would be fine.. Thanks, Edward. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
grub-0.97/VFS
Hello everyone. Grub-0.97 filesystem interface (read_func, dir_func) seems to be poor. Instead of this I would prefer to have something like the following: /* * * .init_root() // set index of root dir * .lookup_begin()// get index by name * .lookup_end() // get inode by index * .read_file() * .read_dir() * .read_symlink() * * so that dir_func in grub_open will look like the * following: */ int path_walk() { ops->init_root(); while (1) { int mode; ops->lookup_end(&mode, ...); switch (type_of_file(mode)) { case SYMLINK_FILE: follow_symlink(read_symlink, ...); ...; continue; case REGULAR_FILE: /* * the end of path walk: * normally we want to exit here */ ...; return 1; case DIRECTORY_FILE: /* * optionally this will * print possibilities */ ops->lookup_begin(mode, read_dir, ...); ...; continue; default: errnum = ERR_BAD_FILETYPE; return 0; } } } Just my 2 cents in the (grub-2?) development process.. Thanks, Edward. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch] btrfs-progs: fix btrfs_read_dev_super
Don't break in the middle of search, taste also other super mirrors. Signed-off-by: Edward Shishkin --- disk-io.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- btrfs-progs-0.18.orig/disk-io.c +++ btrfs-progs-0.18/disk-io.c @@ -716,7 +716,7 @@ int btrfs_read_dev_super(int fd, struct bytenr = btrfs_sb_offset(i); ret = pread64(fd, &buf, sizeof(buf), bytenr); if (ret < sizeof(buf)) - break; + continue; if (btrfs_super_bytenr(&buf) != bytenr || strncmp((char *)(&buf.magic), BTRFS_MAGIC,
Re: Some basic benchmarking
Jaime sanchez wrote: Hi list, I am tired of "syntetic" benchmarks and i wrote a script to totally automate benchmarking for various file system (command line option) through the same partition, collect the data and show results in a nice way. There are 4 basic benchmarks performed, copy, compress, uncompress and delete. The data folder i used is the linus-source-2.6.30-rc5 (uncompressed). The script can be downloaded here: http://rapidshare.com/files/232159335/abench.tar.gz (use abench -h for help) Here are some results for core2duo, Western digital green power hard disk (WD5000AACS): ┌───┬┬───┬──┬──┐ │ │ Copy │ Compress │ Uncompre │ Delete │ │ │ time │ time │ time │ time │ ├───┼┼───┼──┼──┤ │ ext4 │ 0m33.283s │ 0m26.834s │ 0m7.229s │ 0m0.866s │ │ noatime,async ││ │ │ │ ├───┼┼───┼──┼──┤ │ ext3 │ 0m46.947s │ 0m27.003s │ 0m7.953s │ 0m0.786s │ │ noatime,async ││ │ │ │ ├───┼┼───┼──┼──┤ │ ext2 │ 0m19.624s │ 0m26.716s │ 0m7.259s │ 0m0.477s │ │ noatime,async ││ │ │ │ ├───┼┼───┼──┼──┤ │ reiserfs │ 0m38.056s │ 0m27.996s │ 0m12.264 │ 0m1.474s │ │ noatime,async ││ │ │ │ ├───┼┼───┼──┼──┤ │ btrfs │ 0m20.411s │ 0m30.793s │ 0m11.618 │ 0m5.599s │ │ defaults ││ │ │ │ ├───┼┼───┼──┼──┤ │ btrfs │ 0m21.630s │ 0m27.891s │ 0m11.926 │ 0m8.473s │ │ async ││ │ │ │ ├───┼┼───┼──┼──┤ │ btrfs │ 0m19.277s │ 0m27.603s │ 0m11.238 │ 0m10.468s│ │ async,noatime ││ │ │ │ ├───┼┼───┼──┼──┤ │ btrfs │ 0m17.086s │ 0m27.520s │ 0m10.062 │ 0m11.688s│ │ nodatacsum,nodatacow,n│atime,async │ │ │ │ ├───┼┼───┼──┼──┤ │ btrfs │ 0m23.708s │ 0m28.663s │ 0m15.326 │ 0m7.318s │ │ compress,async,atime ││ │ │ │ ├───┼┼───┼──┼──┤ Btw, it might be roaches in the code: transparent compression should win 20-25%... │ btrfs │ 0m20.418s │ 0m27.484s │ 0m9.871s │ 0m8.989s │ │ compress,async,atime,n│datacsum,nod│tacow │ │ │ ├───┼┼───┼──┼──┤ │ reiser4 │ 0m19.750s │ 0m27.058s │ 0m10.196 │ 0m2.709s │ │ noatime,async ││ │ │ │ ├───┼┼───┼──┼──┤ │ reiser4 │ 0m16.121s │ 0m27.924s │ 0m9.798s │ 0m8.582s │ │ noatime,async,compress││ │ │ │ └───┴┴───┴──┴──┘ It seems that btrfs has good performance, except for delete times. Regards Antonio -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] fallocate behavior when crossing end-of-file
Michael Raskin wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Edward Shishkin wrote: Desired: something (probably, one block) is allocated/reserved for the file. File length is set to 1 byte. Where is it documented? IMHO we should only guarantee that writes to [0, 1] won't fail because of lack of disk space. Other things (including file size) are up to implementation. POSIX, SUS. http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html <<< If the offset+ len is beyond the current file size, then posix_fallocate() shall adjust the file size to offset+ len. Otherwise, the file size shall not be changed. fallocate (2) is something different from posix_fallocate: "...This default behavior closely resembles the behavior of the posix_fallocate(3) library function, and is intended as a method of optimally implementing that function." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] fallocate behavior when crossing end-of-file
Raskin Michael wrote: Hello. Hello. I found the following bug in BtrFS: 1. Create and open an empty file 2. fallocate (fd, 0, 1) Desired: something (probably, one block) is allocated/reserved for the file. File length is set to 1 byte. Where is it documented? IMHO we should only guarantee that writes to [0, 1] won't fail because of lack of disk space. Other things (including file size) are up to implementation. Actual: A block is allocated. File length is set to 1 block (4096 bytes). The rest of the file is filled with zeros. last_byte = min(extent_map_end(em), alloc_end); last_byte = (last_byte + mask) & ~mask; if (em->block_start == EXTENT_MAP_HOLE) { ret = prealloc_file_range(trans, inode, cur_offset, last_byte, locked_end + 1, alloc_hint, mode); That part seems strange to me. You make an effort for block size to divide last_byte. But last_byte should be max(i_size_read(inode), offset+len) - without any rounding. Michael Raskin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Data Deduplication with the help of an online filesystem check
Tomasz Chmielewski wrote: Thomas Glanzmann schrieb: 300 Gbyte of used storage of several productive VMs with the following Operatings systems running: \begin{itemize} \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5) \item SuSE Linux 32 and 64 Bit (SLES 9 and 10) \item Windows 2003 Std. Edition 32 Bit \item Windows 2003 Enterprise Edition 64 Bit \end{itemize} \begin{tabular}{r|r|r|l} blocksize & Deduplicated Data \\ \hline 128k & 29.9 G \\ 64k & 41.3 G \\ 32k & 59.2 G \\ 16k & 82 G \\ 8k & 112 G \\ \ Bottom line with 8 K blocksize you can get more than 33% of deduped data running a productive set of VMs. Did you just compare checksums, I wouldn't rely on crc32: it is not a strong hash, Such deduplication can lead to various problems, including security ones. or did you also compare the data "bit after bit" if the checksums matched? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Christoph Hellwig wrote: On Fri, Apr 17, 2009 at 12:11:55PM +0200, Edward Shishkin wrote: Christoph Hellwig wrote: Add support for the standard attributes set via chattr and read vis lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. Imho, since inode items are of fixed size, is won't be possible to avoid such workarounds like conversion between formats. No? While the inode format is fixed it has 256 spare bits for expansion. Ah, I meant the case when the spare bits are exhausted.. But what I mean with the above is to split the current 64bit flags value into a a 32 bit internal flags and a 32bit user visible flags value and store the ioctl flags in the latter. OTOH every filesystem but extN seem to need some conversion so btrfs wouldn't be unusual at that. Not sure about extN, but one of the techniques is to represent inode item as a set of (optional) "extensions", so that every such extension includes it's version number (think of it as release date). If file system driver is older, then some encountered extension, init_inode() returns error. OTOH, if some extension is missed, then some featured operations can be undefined (i.e. read(), or write(), etc.. will return error). No conversion is needed, however such approach requires more sophisticated fsck. And the GETFLAGS/SETFLAGS flags value are pretty ugly anyway as they mix up flags for user visible behaviour with extN implementation details that shouldn't really need to be exposed to userspace. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Christoph Hellwig wrote: Add support for the standard attributes set via chattr and read vis lsattr. Currently we store the attributes in the flags value in the btrfs inode, but I wonder whether we should split it into two so that we don't have to keep converting between the two formats. Imho, since inode items are of fixed size, is won't be possible to avoid such workarounds like conversion between formats. No? Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros as they were obsfuction the existing code and got in the way of the new additions. Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's trivial. Btw, any idea what the BTRFS_INODE_REDONLY flag is for? It's a subset of the immutable flag, but can't actually be set anywhere from the filesystem code. Signed-off-by: Christoph Hellwig Index: linux-2.6/fs/btrfs/ioctl.c === --- linux-2.6.orig/fs/btrfs/ioctl.c 2009-04-17 10:08:11.758948607 +0200 +++ linux-2.6/fs/btrfs/ioctl.c 2009-04-17 10:33:21.555076930 +0200 @@ -50,7 +50,172 @@ #include "volumes.h" #include "locking.h" +/* Mask out flags that are inappropriate for the given type of inode. */ +static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags) +{ + if (S_ISDIR(mode)) + return flags; + else if (S_ISREG(mode)) + return flags & ~FS_DIRSYNC_FL; + else + return flags & (FS_NODUMP_FL | FS_NOATIME_FL); +} + +/* + * Export inode flags to the format expected by the FS_IOC_GETFLAGS ioctl. + */ +static unsigned int btrfs_flags_to_ioctl(unsigned int flags) +{ + unsigned int iflags = 0; + + if (flags & BTRFS_INODE_SYNC) + iflags |= FS_SYNC_FL; + if (flags & BTRFS_INODE_IMMUTABLE) + iflags |= FS_IMMUTABLE_FL; + if (flags & BTRFS_INODE_APPEND) + iflags |= FS_APPEND_FL; + if (flags & BTRFS_INODE_NODUMP) + iflags |= FS_NODUMP_FL; + if (flags & BTRFS_INODE_NOATIME) + iflags |= FS_NOATIME_FL; + if (flags & BTRFS_INODE_DIRSYNC) + iflags |= FS_DIRSYNC_FL; + + return iflags; +} + +/* + * Update inode->i_flags based on the btrfs internal flags. + */ +void btrfs_update_iflags(struct inode *inode) +{ + struct btrfs_inode *ip = BTRFS_I(inode); + + inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); + + if (ip->flags & BTRFS_INODE_SYNC) + inode->i_flags |= S_SYNC; + if (ip->flags & BTRFS_INODE_IMMUTABLE) + inode->i_flags |= S_IMMUTABLE; + if (ip->flags & BTRFS_INODE_APPEND) + inode->i_flags |= S_APPEND; + if (ip->flags & BTRFS_INODE_NOATIME) + inode->i_flags |= S_NOATIME; + if (ip->flags & BTRFS_INODE_DIRSYNC) + inode->i_flags |= S_DIRSYNC; +} + +/* + * Inherit flags from the parent inode. + * + * Unlike extN we don't have any flags we don't want to inherit currently. + */ +void btrfs_inherit_iflags(struct inode *inode, struct inode *dir) +{ + unsigned int flags = BTRFS_I(dir)->flags; + + if (S_ISREG(inode->i_mode)) + flags &= ~BTRFS_INODE_DIRSYNC; + else if (!S_ISDIR(inode->i_mode)) + flags &= (BTRFS_INODE_NODUMP | BTRFS_INODE_NOATIME); + + BTRFS_I(inode)->flags = flags; + btrfs_update_iflags(inode); +} + +static int btrfs_ioctl_getflags(struct file *file, void __user *arg) +{ + struct btrfs_inode *ip = BTRFS_I(file->f_path.dentry->d_inode); + unsigned int flags = btrfs_flags_to_ioctl(ip->flags); + + if (copy_to_user(arg, &flags, sizeof(flags))) + return -EFAULT; + return 0; +} + +static int btrfs_ioctl_setflags(struct file *file, void __user *arg) +{ + struct inode *inode = file->f_path.dentry->d_inode; + struct btrfs_inode *ip = BTRFS_I(inode); + struct btrfs_root *root = ip->root; + struct btrfs_trans_handle *trans; + unsigned int flags, oldflags; + int ret; + + if (copy_from_user(&flags, arg, sizeof(flags))) + return -EFAULT; + if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL | \ + FS_NOATIME_FL | FS_NODUMP_FL | \ + FS_SYNC_FL | FS_DIRSYNC_FL)) + return -EOPNOTSUPP; + + if (!is_owner_or_cap(inode)) + return -EACCES; + + mutex_lock(&inode->i_mutex); + + flags = btrfs_mask_flags(inode->i_mode, flags); + oldflags = btrfs_flags_to_ioctl(ip->flags); + if ((flags ^ oldflags) & (FS_APPEND_FL | FS_IMMUTABLE_FL)) { + if (!capable(CAP_LINUX_IMMUTABLE)) { + ret = -EPERM; + goto out_unlock; + } + } + + ret = mnt_want_write(file->f_path.mnt); + if (ret) + goto out_unlock; + + if (flags & FS_SYNC_FL) + ip->flags |= BTRFS_INO