Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-30 Thread Edward Shishkin

Chris Mason wrote:
[...]

1. the balancing condition n <= number_of_keys <= 2n+1
  is not satisfied (indeed, when number_of_keys is 1
  we have 1 <= 2n+1 -- false);
  

That doesn't matter.  It is not necessary (or desirable) to follow the
classical B-tree algorithms to the letter to get sufficiently good
properties. 


sure, but we need something instead of classic balancing condition.
We can not just refuse it: this is your lifejacket during performing
tree operations. How will you proof utilization bounds without any
balancing conditions?



 B-trees are quite robust to changing the details, as long
as you follow the generalised principles which make them work.



There are lots of different perspectives here,


I wouldn't be so optimistic here (see below)


 but it is important to
step back and look at what we're using the btree for.  We're making an
index to filesystem metadata.  The index is there so that we can find
that metadata with a reasonable number of IOs and in a reasonable amount
of time.

Because btrfs is COW, and because of the reference counted COW used to
maintain snapshots, the cost of updating the btree is different from
traditional btrees.  We do try to avoid balancing more and
intentionally allow the btree leaves to be less compact simply because
it allows us to avoid the higher cost of those operations.
  


I can understand reducing low bound, say, from 1/2 to 1/3 for performance
reasons, but again, bound 0.00 is unacceptable.. So, please, make sure you
have a sane proved bound for every such "compromise".


The btree nodes are fairly strict.  They contain fixed records and are
balanced in a simple fashion.  When we are inserting a new key into a
node, if the node is completely full we split it.  When we are deleting
a key, if the node is less than half full we balance it.


This works only if insertion/deletion on the leaf level won't result in
insertion/deletion more then one key on upper levels.


  Calculating
full on the btree nodes is a factor of the number of keys in them.

The leaves have fixed length keys that describe items of variable size.
On deletion merging is done when we're less than 1/3rd full and on
insertion we split when the item we're trying to shove in doesn't fit.
Calculating full on the btree leaves is a factor of the total size of
the items stored.

So, with all of that said, the nodes point to leaves and the leaves
contain inodes, directory entries and sometimes file data.

The most important part of the index is the higher levels of the tree.
The index of most filesystems points to but does not contain
inodes and file data, and so the higher levels of the btrfs btree are
more like a traditional FS index.

There's a knob here for the max inline item size and if someone wants to
fill their whole FS with 2K files, the default settings are not at all
suitable.  Filling with 3K files actually works better because we end
up with inode and file data in the same block much of the time.
  


Chris, thanks for the description.

I think that such balancing is totally incorrect. In accordance with
your strategy insertions can spawn shallow leaves, and deletions will
result in shallow leaves (because of merging with shallow leaves).
This will lead to unbound utilization slump.

In particular, your patch is just a workaround, which doesn't fix the
core problem. Knobs for cases is a nonsense. Are you kidding? Ext4,
xfs, reiserfs don't use any knobs, so why should we accept this
regress on the background of common progress? Everything should be
packed fine without any knobs...

Your strategy slightly resembles balancing in Reiserfs file system,
so I have two comments here.

1. Algorithms used in Reiserfs are "precise" (or "minimal"). That
means there is no "superfluous" balancing there. It is impossible to
save on balancing operations and slightly decrease utilization bound:
you will lose everything (this is unacceptable). Only classic Bayer's
B-trees allow to reduce bound 1/2 via replacing usual balancing
condition q <= N <= 2q with more liberal one (say, q <= N <= 3q).

2. If you want to base on COW-friendly Reiserfs, then it's algorithms
should be properly modified for the case of top-down balancing and
reviewed (better by the authors of the original algorithms).
I recommend to consider the following approach:


Balancing the leaf level


A. Inserting a key of length I(*)


01Suppose we traversed the tree and have arrived to the position
02pos in the leaf A.
03
04At first we try to make space by shifting all possible items at the
05left and right from the pos to the neighbors B and C respectively:
06
07B   A   C
08   
09Let's denote via L and R total length of items on leaf A at left

10and right from the pos respectively.
11
12If after shifting there is enough space in A, then we perform
13insertion to A. After the insertion pairs (BA) and (AC) are
14incompressible for obvious reasons.
15
16Else

Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-23 Thread Edward Shishkin

Chris Mason wrote:

On Tue, Jun 22, 2010 at 04:12:57PM +0200, Edward Shishkin wrote:
  

Chris Mason wrote:


On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
  

I'll reproduce from your test case and provide a fix.  mount -o
max_inline=1500 would give us 50% usage in the worst case


This is a very strange statement: how did you calculate this lower bound?



We want room for the extent and the inode item and the inode backref.
It's a rough number that leaves some extra room.  But even with a 2K
item we're getting very close to 50% usage of the metadata area.

  

(minus the
balancing bug you hit).


Ok, the balancing bug was interesting.  What happens is we create all
the inodes and directory items nicely packed into the leaves.

Then delayed allocation kicks in and starts inserting the big fat inline
extents.  This often forces the balancing code to split a leaf twice in
order to make room for the new inline extent.  The double split code
wasn't balancing the bits that were left over on either side of our
desired slot.

The patch below fixes the problem for me, reducing the number of leaves
that have more than 2K of free space down from 6% of the total to about
74 leaves.  It could be zero, but the balancing code doesn't push
items around if our leaf is in the first or last slot of the parent
node (complexity vs benefit tradeoff).
  

Nevertheless, I see leaves, which are not in the first or last slot,
but mergeable with their neighbors (test case is the same):



Correct, but it was in the first or last slot when it was balanced (I
confirmed this with printk).

Then the higher layers were balanced and our leaves were no longer in
the first/last slot.  We don't rebalance leaves when we balance level 1.

  

leaf 269557760 items 22 free space 2323 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

leaf 280027136 items 18 free space 2627 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

node 269549568 level 1 items 60 free 61 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
   key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
   key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
   key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
   key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
   key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
   key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
   key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
   key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
   key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
   key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
   key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
   key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
   key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
   key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
   key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
   key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
[...]



With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
That doesn't sound like a huge improvement, but the default from
mkfs.btrfs is to duplicate metadata.  After duplication, that's about
417MB or about 40% of the overall space.

When you factor in the space that we reserve to avoid exploding on
enospc and the space that we have allocated for data extents, that's not
a very surprising number.

I'm still putting this patch through more testing, the double split code
is used in some difficult corners and I need to make sure I've tried
all of them.
  

Chris, for the further code review we need documents, which reflect
the original ideas of the balancing, etc. Could you please provide them?
Obviously, I can not do it instead of you, as I don't know those ideas.




Which part are you most interested in?
  


Balancing.

What conditions do we want to provide?
Why won't we get utilization slump in future?
How do we need to fix things when something goes wrong?
Whereas in accordance with the single existing paper Btrfs
design looks really broken, because:
1. the balancing condition n <= number_of_keys <= 2n+1
  is not satisfied (indeed, when number_of_keys is 1
  we have 1 <= 2n+1 -- false);
2. the trees mentioned in this paper don't allow keys of
  variable length.

Thanks,
Edward.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-22 Thread Edward Shishkin

Chris Mason wrote:

On Mon, Jun 21, 2010 at 09:15:28AM -0400, Chris Mason wrote:
  

I'll reproduce from your test case and provide a fix.  mount -o
max_inline=1500 would give us 50% usage in the worst case


This is a very strange statement: how did you calculate this lower bound?


 (minus the
balancing bug you hit).



Ok, the balancing bug was interesting.  What happens is we create all
the inodes and directory items nicely packed into the leaves.

Then delayed allocation kicks in and starts inserting the big fat inline
extents.  This often forces the balancing code to split a leaf twice in
order to make room for the new inline extent.  The double split code
wasn't balancing the bits that were left over on either side of our
desired slot.

The patch below fixes the problem for me, reducing the number of leaves
that have more than 2K of free space down from 6% of the total to about
74 leaves.  It could be zero, but the balancing code doesn't push
items around if our leaf is in the first or last slot of the parent
node (complexity vs benefit tradeoff).
  


Nevertheless, I see leaves, which are not in the first or last slot,
but mergeable with their neighbors (test case is the same):

leaf 269557760 items 22 free space 2323 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

leaf 280027136 items 18 free space 2627 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231

node 269549568 level 1 items 60 free 61 generation 25 owner 2
fs uuid 614fb921-cfa9-403d-9feb-940021e72382
chunk uuid b1674882-a445-45f9-b250-0985e483d231
   key (175812608 EXTENT_ITEM 4096) block 175828992 (42927) gen 15
   key (176025600 EXTENT_ITEM 4096) block 176111616 (42996) gen 15
   key (176238592 EXTENT_ITEM 4096) block 176300032 (43042) gen 15
   key (176451584 EXTENT_ITEM 4096) block 216248320 (52795) gen 17
   key (176672768 EXTENT_ITEM 4096) block 216236032 (52792) gen 17
   key (176783360 EXTENT_ITEM 4096) block 216252416 (52796) gen 17
   key (176955392 EXTENT_ITEM 4096) block 138854400 (33900) gen 25
   key (177131520 EXTENT_ITEM 4096) block 280289280 (68430) gen 25
   key (177348608 EXTENT_ITEM 4096) block 280285184 (68429) gen 25
   key (177561600 EXTENT_ITEM 4096) block 269557760 (65810) gen 25
   key (177795072 EXTENT_ITEM 4096) block 280027136 (68366) gen 25
   key (178008064 EXTENT_ITEM 4096) block 280064000 (68375) gen 25
   key (178233344 EXTENT_ITEM 4096) block 285061120 (69595) gen 25
   key (178442240 EXTENT_ITEM 4096) block 178442240 (43565) gen 16
   key (178655232 EXTENT_ITEM 4096) block 178655232 (43617) gen 16
   key (178868224 EXTENT_ITEM 4096) block 178868224 (43669) gen 16
[...]


With the patch, I'm able to create 106894 files (2K each) on a 1GB FS.
That doesn't sound like a huge improvement, but the default from
mkfs.btrfs is to duplicate metadata.  After duplication, that's about
417MB or about 40% of the overall space.

When you factor in the space that we reserve to avoid exploding on
enospc and the space that we have allocated for data extents, that's not
a very surprising number.

I'm still putting this patch through more testing, the double split code
is used in some difficult corners and I need to make sure I've tried
all of them.
  


Chris, for the further code review we need documents, which reflect
the original ideas of the balancing, etc. Could you please provide them?
Obviously, I can not do it instead of you, as I don't know those ideas.

Thanks,
Edward.


-chris

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0d1d966..1f393b0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2304,12 +2304,17 @@ noinline int btrfs_leaf_free_space(struct btrfs_root 
*root,
return ret;
 }
 
+/*

+ * min slot controls the lowest index we're willing to push to the
+ * right.  We'll push up to and including min_slot, but no lower
+ */
 static noinline int __push_leaf_right(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct btrfs_path *path,
  int data_size, int empty,
  struct extent_buffer *right,
- int free_space, u32 left_nritems)
+ int free_space, u32 left_nritems,
+ u32 min_slot)
 {
struct extent_buffer *left = path->nodes[0];
struct extent_buffer *upper = path->nodes[1];
@@ -2327,7 +2332,7 @@ static noinline int __push_leaf_right(struct 
btrfs_trans_handle *trans,
if (empty)
nr = 0;
else
-   nr = 1;
+   nr = max_t(u32, 1, min_slot);
 
 	if (path->slots[0] >= left_nritems)

push_space += data_size;
@@ -2469,10 +2474,13 @@ out_unlock:
  *
  * 

Re: Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-18 Thread Edward Shishkin

Ric Wheeler wrote:

On 06/18/2010 06:04 PM, Edward Shishkin wrote:

Chris Mason wrote:

On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:

Jamie Lokier wrote:

Edward Shishkin wrote:
If you decide to base your file system on some algorithms then 
please

use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific 
magazines of

proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root 
nodes of

your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

Hello Jamie.


I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Which property is robust?


Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records. It is not true; if Knuth said
so, Knuth was mistaken.

I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length
records don't
have any sane boundary for internal fragmentation. The common idea
is that if we
don't want Btrfs to be in infinite development stage, then we should
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly
adhere this in
future.


Again, other than the inline file data, what exactly do you believe
needs to change?


1. getting rid of inline extents;
2. new formats for directory and xattr items to not look like a train,
which is able to occupy the whole leaf;
3. make sure we do pro-active balancing like it is described in the 
paper.


Sorry, I don't see other ways for now..


Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves. The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.


How are you going to balance leaves when walking from top to down?
Suppose 1) and 2) above are not satisfied and having arrived to the leaf
level we see a number of items of variable length. What will we do to
keep leaves full?

Could you please provide a sketch of the algorithm?

Thanks!



Hi Edward,

Is it really a requirement to have 100% full leaves? Most DB's 
(assuming I remember correctly) have deliberate strategies around this 
kind of thing. You might want to leave room in leaf nodes so that 
future insertions can be contiguous on the media with older data.


Regards,

Ric


Hello Ric.

No, every leaf shouldn't be necessarily 100% full.
We may require every L-vicinity(*) of every leaf to be full in
some sense. And this local condition sometimes brings the (global)
boundary for utilization of the whole tree.

In the classic Bayer's B-trees we have so-called "S(0)" balancing
condition implicitly satisfied, which requires every leaf to be at
least half full. It provides the low boundary 0.5 for utilization
of the whole tree.

Next example is ReiserFS file system, which uses so-called "S(1)"
balancing condition on the leaf level, which requires every 1-vicinity
of every leaf to be "incompressible" (i.e. two leaves can not be
squeezed to a single one). This local incompressibility brings the
sweet low utilization boundary 0.5-E for the whole tree (E is a small
number, which is back proportional to the tree branching).

(*) any set of L+1 neighboring nodes, which contains the leaf.

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Balancing leaves when walking from top to down (was Btrfs:...)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 09:29:40PM +0200, Edward Shishkin wrote:
  

Jamie Lokier wrote:


Edward Shishkin wrote:
  

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.


First, thanks Edward for identifying a specific problem with the
current btrfs implementation.
  

Hello Jamie.



I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
  

Which property is robust?



Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.
  

I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length
records don't
have any sane boundary for internal fragmentation. The common idea
is that if we
don't want Btrfs to be in infinite development stage, then we should
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly
adhere this in
future.



Again, other than the inline file data, what exactly do you believe
needs to change?


1. getting rid of inline extents;
2. new formats for directory and xattr items to not look like a train,
  which is able to occupy the whole leaf;
3. make sure we do pro-active balancing like it is described in the paper.

Sorry, I don't see other ways for now..


  Top down balancing vs balancing on insertion doesn't
impact our ability to maintain full leaves.  The current code is clearly
choosing not to merge two leaves that it should have merged, which is
just a plain old bug.
  


How are you going to balance leaves when walking from top to down?
Suppose 1) and 2) above are not satisfied and having arrived to the leaf
level we see a number of items of variable length. What will we do to
keep leaves full?

Could you please provide a sketch of the algorithm?

Thanks!

--
Edward O. Shishkin
Principal Software Engineer
Red Hat Czech

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Jamie Lokier wrote:

Edward Shishkin wrote:
  

If you decide to base your file system on some algorithms then please
use the original ones from proper academic papers. DO NOT modify the
algorithms in solitude: this is very fragile thing! All such
modifications must be reviewed by specialists in the theory of
algorithms. Such review can be done in various scientific magazines of
proper level.

Personally I don't see any way to improve the situation with Btrfs
except full redesigning the last one. If you want to base your file
system on the paper of Ohad Rodeh, then please, use *exactly* the
Bayer's B-trees that he refers to. That said, make sure that all
records you put to the tree has equal length and all non-root nodes of
your tree are at least half filled.



First, thanks Edward for identifying a specific problem with the
current btrfs implementation.
  


Hello Jamie.


I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.
  


Which property is robust?


Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.
  


I didn't say about intrinsic algorithmic problems :)
I just repeat (after Knuth et al) that B-trees with variable-length 
records don't
have any sane boundary for internal fragmentation. The common idea is 
that if we
don't want Btrfs to be in infinite development stage, then we should 
choose some
*sane* strategy (for example the paper of Ohad Rodeh) and strictly 
adhere this in

future.


This is easily shown by modelling variable-length records (key ->
string) as a range of fixed length records ([key,index] -> byte) and
apply the standard B-tree algorithms to that, which guarantees
algorithm properties such as space utilisation and time; then you can
substitute a "compressed" representation of contiguous index runs,
which amounts to nothing more than just storing the strings (split
where the algorithm says to do so) and endpoint indexes , and because
this compression does not expand (in any way that matters), classic
algorithmic properties are still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

  

As to current Btrfs code: *NOT ACK*!!! I don't think we need such
"file systems".



Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not deep design bug; it is just a result of the packing and
balancing heuristics.
  


Frankly, I would like to believe to such end, taking into account amount 
of my
contributions to the Btrfs project. At least to make sure I didn't do 
useless

work..


If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well,


Because of top-down balancing. It doesn't like "clever" things like 
"splitting"

and "merging". Currently top-down works properly only with stupid classic
Bayer's B-trees.


 and then propose a fix.
  


I'll try to help, but I am rather pessimistic here: working out 
algorithms is

something, which doesn't like timelines..


Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]


Ok, let's violate this, but not in the detriment of utilisation:
Internal fragmentation is a horror for file systems, the enemy #1
(which is completely defeated in the last century BTW).


  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.
  


IMHO this is a problem, as we can not estimate number of such nodes.
Do you have any sane upper boundary for this number? I don't know such one.
Maybe I have missed something?


[*] For example when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).


At the end of what? I hope you don't speak about on-line defragmentation?
Can you point me to the paper (if any)?

Thanks!


  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
Mo

Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Daniel J Blueman wrote:

On Fri, Jun 18, 2010 at 1:32 PM, Edward Shishkin
 wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize <= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
"inline extents", xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs "inline
extents"), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.



Wow...a small part of me says 'well said', on the basis that your
assertions are true, but I do think there needs to be more
constructivity in such critique; it is almost impossible to be a great
engineer and a great academic at once in a time-pressured environment.
  


Sure it is impossible. I believe in division of labour:
academics writes algorithms, and we (engineers) encode them.

I have noticed that events in Btrfs develop by scenario not predicted
by the paper of academic Ohad Rodeh (in spite of the announce that
Btrfs is based on this paper). This is why I have started to grumble..

Thanks.


If you can produce some specific and suggestions with code references,
I'm sure we'll get some good discussion with potential to improve from
where we are.

Thanks,
  Daniel
  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 05:05:46PM +0200, Edward Shishkin wrote:
  

Chris Mason wrote:


On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".


There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  

Hello, Chris. Thanks for response!
I afraid that both ways won't fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There is no inline extents, and what are you going to split here?
All leafs must be at least a half filled, otherwise we loose all
boundaries, which provides non-zero utilization..



Right, there is no inline extent because we require them to fit entirely
in the leaf.  So we end up with mostly empty leaves because the inline
item is large enough to make it difficult to push around but not large
enough to fill the leaf.
  


How about left and right neighbors? They contain a lot of
free space (1572 and 1901 respectively).
I am not happy with the very fact of such shallow leafs which
contain only one small (xattr) item..
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Chris Mason wrote:

On Fri, Jun 18, 2010 at 03:32:16PM +0200, Edward Shishkin wrote:
  

Mat wrote:


On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".



There are two easy ways to fix this problem.  Turn off the inline
extents (max_inline=0) or allow splitting of the inline extents.  I
didn't put in the splitting simply because the complexity was high while
the benefits were low (in comparison with just turning off the inline
extents).
  


Hello, Chris. Thanks for response!
I afraid that both ways won't fix the problem. Look at this leaf:

[...]
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
[...]

There is no inline extents, and what are you going to split here?
All leafs must be at least a half filled, otherwise we loose all
boundaries, which provides non-zero utilization..

Any ideas?

Thanks,
Edward.

  

It must be a highly unexpected and difficult question for file system
developers: "how efficiently does your file system manage disk space"?

In the meanwhile I confirm that Btrfs design is completely broken:
records stored in the B-tree differ greatly from each other (it is
unacceptable!), and the balancing algorithms have been modified in
insane manner. All these factors has led to loss of *all* boundaries
holding internal fragmentation and to exhaustive waste of disk space
(and memory!) in spite of the property "scaling in their ability to
address large storage".

This is not a large storage, this is a "scalable sieve": you can not
rely on finding there some given amount of water even after infinite
increasing the size of the sieve (read escalating the pool of Btrfs
devices).

It seems that nobody have reviewed Btrfs before its inclusion to the
mainline. I have only found a pair of recommendations with a common
idea that Btrfs maintainer is "not a crazy man". Plus a number of
papers which admire with the "Btrfs phenomena". Sigh.

Well, let's decide what can we do in current situation..
The first obvious point here is that we *can not* put such file system
to production. Just because it doesn't provide any guarantees for our
users regarding disk space utilization.



Are you basing all of this on inline extents?  The other extents of
variable size are more flexible (taking up the room in the leaf), but
they can also easy be split during balancing.

-chris

  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Edward Shishkin

Mat wrote:

On Thu, Jun 3, 2010 at 4:58 PM, Edward Shishkin  wrote:
  

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize <= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
"inline extents", xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs "inline
extents"), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.

Any technical comments are welcome.

Thanks,
Edward.


Appendix A.
---
Glossary

1. Utilization of data and(or) metadata storage.

The fraction A/B, where
A is total size in bytes of stored data and(or) metadata.
B = N * S, where
N is number of blocks occupied by stored data and(or) metadata.
S is block size in bytes.

2. Internal fragmentation of data and(or) metadata storage.

difference (1 - U), where U is utilization.


Appendix B.
---
a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)

...

leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
  location key (0 UNKNOWN 0) type 8
  namelen 16 datalen 32 name: security.selinux
  item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
  inline extent data size 2048 ram 2048 compress 0
  item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
  inode generation 8 size 2048 block group 29360128 mode 100644
links 1
  item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
  inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
  location key (0 UNKNOWN 0) type 8
  namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
  inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
  item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 1

Unbound(?) internal fragmentation in Btrfs

2010-06-03 Thread Edward Shishkin

Hello everyone.

I was asked to review/evaluate Btrfs for using in enterprise
systems and the below are my first impressions (linux-2.6.33).

The first test I have made was filling an empty 659M (/dev/sdb2)
btrfs partition (mounted to /mnt) with 2K files:

# for i in $(seq 100); \
do dd if=/dev/zero of=/mnt/file_$i bs=2048 count=1; done
(terminated after getting "No space left on device" reports).

# ls /mnt | wc -l
59480

So, I got the "dirty" utilization 59480*2048 / (659*1024*1024) = 0.17,
and the first obvious question is "hey, where are other 83% of my
disk space???" I looked at the btrfs storage tree (fs_tree) and was
shocked with the situation on the leaf level. The Appendix B shows
5 adjacent btrfs leafs, which have the same parent.

For example, look at the leaf 29425664: "items 1 free space 3892"
(of 4096!!). Note, that this "free" space (3892) is _dead_: any
attempts to write to the file system will result in "No space left
on device".

Internal fragmentation (see Appendix A) of those 5 leafs is
(1572+3892+1901+3666+1675)/4096*5 = 0.62. This is even worse then
ext4 and xfs: The last ones in this example will show fragmentation
near zero with blocksize <= 2K. Even with 4K blocksize they will
show better utilization 0.50 (against 0.38 in btrfs)!

I have a small question for btrfs developers: Why do you folks put
"inline extents", xattr, etc items of variable size to the B-tree
in spite of the fact that B-tree is a data structure NOT for variable
sized records? This disadvantage of B-trees was widely discussed.
For example, maestro D. Knuth warned about this issue long time
ago (see Appendix C).

It is a well known fact that internal fragmentation of classic Bayer's
B-trees is restricted by the value 0.50 (see Appendix C). However it
takes place only if your tree contains records of the _same_ length
(for example, extent pointers). Once you put to your B-tree records
of variable length (restricted only by leaf size, like btrfs "inline
extents"), your tree LOSES this boundary. Moreover, even worse:
it is clear, that in this case utilization of B-tree scales as zero(!).
That said, for every small E and for every amount of data N we
can construct a consistent B-tree, which contains data N and has
utilization worse then E. I.e. from the standpoint of utilization
such trees can be completely degenerated.

That said, the very important property of B-trees, which guarantees
non-zero utilization, has been lost, and I don't see in Btrfs code any
substitution for this property. In other words, where is a formal
guarantee that all disk space of our users won't be eaten by internal
fragmentation? I consider such guarantee as a *necessary* condition
for putting a file system to production.

Any technical comments are welcome.

Thanks,
Edward.


Appendix A.
---
Glossary

1. Utilization of data and(or) metadata storage.

The fraction A/B, where
A is total size in bytes of stored data and(or) metadata.
B = N * S, where
N is number of blocks occupied by stored data and(or) metadata.
S is block size in bytes.

2. Internal fragmentation of data and(or) metadata storage.

difference (1 - U), where U is utilization.


Appendix B.
---
a "period" in the dump of the fs_tree (btrfs-debug-tree /dev/sdb2)

...

leaf 29982720 items 4 free space 1572 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (319 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
   item 1 key (319 EXTENT_DATA 0) itemoff 1848 itemsize 2069
   inline extent data size 2048 ram 2048 compress 0
   item 2 key (320 INODE_ITEM 0) itemoff 1688 itemsize 160
   inode generation 8 size 2048 block group 29360128 mode 
100644 links 1

   item 3 key (320 INODE_REF 256) itemoff 1672 itemsize 16
   inode ref index 65 namelen 6 name: file64
leaf 29425664 items 1 free space 3892 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 XATTR_ITEM 3817753667) itemoff 3917 itemsize 78
   location key (0 UNKNOWN 0) type 8
   namelen 16 datalen 32 name: security.selinux
leaf 29990912 items 1 free space 1901 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (320 EXTENT_DATA 0) itemoff 1926 itemsize 2069
   inline extent data size 2048 ram 2048 compress 0
leaf 29986816 items 3 free space 3666 generation 8 owner 5
fs uuid 50268d9d-2a53-4f4d-b3a3-4fbff74dd956
chunk uuid 963ba49a-bb2b-48a3-9b35-520d857aade6
   item 0 key (321 INODE_ITEM 0) itemoff 3835 itemsize 160
   inode generation 8 size 2048 block group 29360128 mode 
100644 links 1

   item 1 key (321 INODE_REF 256) itemoff 3819 itemsize 16
   in

Re: [Jfs-discussion] benchmark results

2010-01-15 Thread Edward Shishkin



When things didn't match up that was a clue that either

- the benchmark was broken
- the code was broken
  

[...]

I would carry out an object-oriented dualism here.


[1] methods (kernel module)  [2] objects (formatted partition)

||
||

[3] benchmarks - [4] user-space utilities (fsck)


User-space utilities investigate "object corruptions",
whereas benchmarks investigate "software corruptions"
(including bugs in source code, broken design, etc, etc..)

It is clear that "software" can be "corrupted" by a larger
number of ways than "objects". Indeed, it is known that
dual space V* (of all linear functions over V) is a much
more complex object than V.

So benchmark is a process which takes a set of methods
(we consider only "software" benchmarks) and puts numerical
values populated with a special (the worst) value CRASH.

Three main categories of benchmarks using:

1) Internal testing

An engineer makes optimizations in a file system (e.g. for a
customer) via choosing functions or plugins as winners in
a set of internal (local) "nominations".

2) Business plans

A system administrator chooses a "winner" in some (global)
"nomination" of file systems in accordance with internal
business-plans.

3) Flame and politics

Someone presents a "nomination" (usually with the "winner"
among restricted number of nominated members) to the public
while nobody asked him to do it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/2] grub-0.97: btrfs support

2009-12-11 Thread Edward Shishkin

Johannes Hirte wrote:

Am Freitag 11 Dezember 2009 16:27:54 schrieb Edward Shishkin:
  

Johannes Hirte wrote:


Am Freitag 11 Dezember 2009 12:17:29 schrieb Edward Shishkin:
  

Johannes Hirte wrote:


Am Freitag 11 Dezember 2009 00:15:46 schrieb Johannes Hirte:
  

Am Freitag 25 September 2009 00:06:23 schrieb Edward Shishkin:


Hello everyone.
  

...



The following patches are for Fedora 10(**).
The distro-independent package will be put to kernel.org a bit later.


All comments, bugreports, etc. are welcome as usual.
  

Ok, I have another comment/bugreport *g*.

I'm testing this patch with gentoo, so the grub sources are not
identicaly the same. With this patches applied, grub is unable to
detect JFS or XFS filesystems. XFS is reported as unknown, JFS is
reported as btrfs. Reiserfs and ext2/3 are detected as expected.


Yes, this patch is for Fedora. For other distros
some issues are possible, so please be careful..


I've also tested now with the fedora sources. There is the same bug. The
btrfs patch breaks the filesystem detection. All filesystems after btrfs
in fsys_table aren't detected. Moving btrfs to the end of fsys_table is a
workaround but will interfere with FFS. So this should better be fixed in
the btrfs-part of grub, so that it:

a) doesn't missdetect a JFS filesystem as btrfs
b) doesn't break the detection for remaining filesystems in the array.
  

Hello.

Yes, I confirm that xfs, etc. file systems are not detected,
but missdetection jfs as btrfs looks rather fantastic :)

Please, try the attached patch. Report if any problems.



The patch works, but the problem with misdetected JFS filesystem still 
persists. It happens if the device contained a btrfs filesystem before. I 
assume that the JFS super block starts later on the device as the btrfs one do 
and jfs_mkfs doesn't clean the space ahead of the JFS super block. So if a JFS 
filesystem is created on a device that contained a btrfs before, btrfs_mount 
still detects the beginning of the old btrfs super block and reads crap later 
on.
  


Yup, sticky thing..

To avoid this, btrfs detection could be placed after JFS. Are there any 
objections against this?
  


If so, we might want to put if after XFS,
which is also mistreated in such manner..

Thanks,
Edward.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/2] grub-0.97: btrfs support

2009-12-11 Thread Edward Shishkin

Johannes Hirte wrote:

Am Freitag 11 Dezember 2009 12:17:29 schrieb Edward Shishkin:
  

Johannes Hirte wrote:


Am Freitag 11 Dezember 2009 00:15:46 schrieb Johannes Hirte:
  
  

Am Freitag 25 September 2009 00:06:23 schrieb Edward Shishkin:



Hello everyone.
  
  

...




The following patches are for Fedora 10(**).
The distro-independent package will be put to kernel.org a bit later.


All comments, bugreports, etc. are welcome as usual.
  
  

Ok, I have another comment/bugreport *g*.

I'm testing this patch with gentoo, so the grub sources are not identicaly
 the same. With this patches applied, grub is unable to detect JFS or XFS
 filesystems. XFS is reported as unknown, JFS is reported as btrfs.
 Reiserfs and ext2/3 are detected as expected.



Yes, this patch is for Fedora. For other distros
some issues are possible, so please be careful..



I've also tested now with the fedora sources. There is the same bug. The btrfs
patch breaks the filesystem detection. All filesystems after btrfs in fsys_table 
aren't detected. Moving btrfs to the end of fsys_table is a workaround but will

interfere with FFS. So this should better be fixed in the btrfs-part of grub,
so that it:

a) doesn't missdetect a JFS filesystem as btrfs
b) doesn't break the detection for remaining filesystems in the array.
  


Hello.

Yes, I confirm that xfs, etc. file systems are not detected,
but missdetection jfs as btrfs looks rather fantastic :)

Please, try the attached patch. Report if any problems.

Thanks,
Edward.
Problem:
XFS, JFS, etc. file systems of the fsys_table are not
detected by grub with grub-0.97-btrfs.patch applied.

BUG:
btrfs_mount() sets ERR_FSYS_MOUNT to the global variable
errnum if no btrfs metadata were found on a partition.
As the result all next calls of devread() (and, hence,
attempts to find metadata of other file systems) failed.

Solution:
Don't set ERR_FSYS_MOUNT, if btrfs metadata were found,
just return 0.

Signed-off-by: Edward Shishkin 
---
 stage2/fsys_btrfs.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- grub-0.97.orig/stage2/fsys_btrfs.c
+++ grub-0.97/stage2/fsys_btrfs.c
@@ -638,7 +638,7 @@ int btrfs_mount(void)
if (ret) {
btrfs_msg("Drive %lu, partition %lu: no Btrfs metadata\n",
  current_drive, part_start);
-   goto error;
+   return 0;
}
ret = btrfs_uptodate_super_copy(BTRFS_FS_INFO);
if (ret)


Re: [patch 2/2] grub-0.97: btrfs multidevice configuration support

2009-12-11 Thread Edward Shishkin

Johannes Hirte wrote:

Am Dienstag 03 November 2009 01:59:39 schrieb Edward Shishkin:
  

Johannes Hirte wrote:


 Why is the btrfs code
dealing with network devices at all?
  

Why not? :)



I don't see the possiblity to get a btrfs filesystem this way. So as far as I 
understand this, it's complete useless. The CD support doesn't look very 
usefull too to me. It's possible to put a btrfs filesystem on a CD or DVD. But 
that seems rather theoretical.
  


Ok, let's keep this theoretical possibility,
I don't see wrong things here..

Thanks,
Edward.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/2] grub-0.97: btrfs support

2009-12-11 Thread Edward Shishkin

Johannes Hirte wrote:

Am Freitag 11 Dezember 2009 00:15:46 schrieb Johannes Hirte:
  

Am Freitag 25 September 2009 00:06:23 schrieb Edward Shishkin:


Hello everyone.
  

...



The following patches are for Fedora 10(**).
The distro-independent package will be put to kernel.org a bit later.


All comments, bugreports, etc. are welcome as usual.
  

Ok, I have another comment/bugreport *g*.

I'm testing this patch with gentoo, so the grub sources are not identicaly
 the same. With this patches applied, grub is unable to detect JFS or XFS
 filesystems. XFS is reported as unknown, JFS is reported as btrfs.
 Reiserfs and ext2/3 are detected as expected.



Yes, this patch is for Fedora. For other distros
some issues are possible, so please be careful..

Thanks,
Edward.



A possible solution is to put FSYS_BTRFS on the end of struct fsys_entry 
fsys_table. I've tested with FSYS_BTFS as the second last entry, the last is 
still FFS.


diff -Nru grub-0.97-r9/stage2/disk_io.c grub-0.97-r10/stage2/disk_io.c
--- grub-0.97-r9/stage2/disk_io.c   2009-12-10 23:41:37.0 +0100
+++ grub-0.97-r10/stage2/disk_io.c  2009-12-11 00:50:51.555007247 +0100
@@ -79,6 +79,9 @@
 # ifdef FSYS_ISO9660
   {"iso9660", iso9660_mount, iso9660_read, iso9660_dir, 0, 0},
 # endif
+# ifdef FSYS_BTRFS
+  {"btrfs", btrfs_mount, btrfs_read, btrfs_dir, 0, btrfs_embed},
+# endif
   /* XX FFS should come last as it's superblock is commonly crossing tracks
  on floppies from track 1 to 2, while others only use 1.  */
 # ifdef FSYS_FFS

With this order, XFS and JFS filesystems are identified correct. But I think, 
this is just a workaround.



regards,
  Johannes

  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/2] grub-0.97: btrfs multidevice configuration support

2009-11-02 Thread Edward Shishkin

Johannes Hirte wrote:

Am Freitag 25 September 2009 00:06:40 schrieb Edward Shishkin:
  


Hi Edward,

I was pointed to a problem with this patch.

+static u64 scan_grub_devices(struct btrfs_device *dev,
+int (*discerner)(struct btrfs_device **, int),
+int lookup)
+{
...
+#ifdef SUPPORT_NETBOOT
+   errnum = ERR_NONE;
+   if (network_ready &&
+   !get_diskinfo(NETWORK_DRIVE, &geom)) {
+   dev->drive = NETWORK_DRIVE;
+   dev->part = 0;
+   dev->length = geom.total_sectors;
+   if (discerner(&dev, lookup)) {
+   count++;
+   if (lookup)
+   goto exit;
+   }
+   }
+#endif /* SUPPORT_NETBOOT */
+ exit:
+   return count;
+}

This won't compile since network_ready is undeclared.


Yup, indeed..

 Why is the btrfs code 
dealing with network devices at all?
  


Why not? :)
Well, would you please disable it for now with the attached patch?

Thanks,
Edward.
Disable booting from network devices.

Signed-off-by: Edward Shishkin 
---
 stage2/fsys_btrfs.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- grub-0.97.orig/stage2/fsys_btrfs.c
+++ grub-0.97/stage2/fsys_btrfs.c
@@ -452,7 +452,7 @@ static u64 scan_grub_devices(struct btrf
goto exit;
}
}
-#ifdef SUPPORT_NETBOOT
+#if 0
errnum = ERR_NONE;
if (network_ready &&
!get_diskinfo(NETWORK_DRIVE, &geom)) {
@@ -465,7 +465,7 @@ static u64 scan_grub_devices(struct btrf
goto exit;
}
}
-#endif /* SUPPORT_NETBOOT */
+#endif /* 0 */
  exit:
return count;
 }


Re: [patch 1/2] grub-0.97: btrfs support for a singe device configuration

2009-10-28 Thread Edward Shishkin

Johannes Hirte wrote:
[...]

This compiles and works, but only with stage2. When using stage1_5, grub hangs
on boot with

GRUB loading stage1.5 

and blinking cursor. Is the multidevice patch also necessary for single 
devices?


Yes, please use both patches!
I have split the stuff because the mailing list refuse attachments >100K.

Thanks,
Edward.

 I've omitted this patch, since my root-fs (with /boot) lies on a 
single partition.


  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/2] grub-0.97: btrfs support for a singe device configuration

2009-10-27 Thread Edward Shishkin

Johannes Hirte wrote:

Am Freitag 25 September 2009 00:06:32 schrieb Edward Shishkin:
  


I was trying the patch and got a little confused. How did you get this work
without linking against libgcc?


Hmm.. Actually my Fedora stuff does link it:
[...]

+btrfs_stage1_5_exec_LDADD = @LIBGCC@

[...]
I guess you have missed it when porting to Gentoo..

Thanks,
Edward.


 I've tested it with the gentoo patches for
grub and get

/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:551: undefined reference to `__udivdi3'
/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:571: undefined reference to `__umoddi3'
/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:575: undefined reference to `__umoddi3'
/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:581: undefined reference to `__umoddi3'
/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:583: undefined reference to `__udivdi3'
/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:588: undefined reference to `__umoddi3'
/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2/fsys_btrfs.c:589: undefined reference to `__udivdi3'
collect2: ld returned 1 exit status
make[2]: *** [pre_stage2.exec] Error 1
make[2]: Leaving directory `/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97/stage2'
make[1]: *** [all-recursive] Error 1  
make[1]: Leaving directory `/usr/src/portage/portage/sys-boot/grub-0.97-r10/work/grub-0.97'   
make: *** [all] Error 2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how does grub exactly work?

2009-09-25 Thread Edward Shishkin

Andreas Jellinghaus wrote:

Hi Edward,
  


Hello.


I saw your mail on btrfs ml with the grub patches and the notes
how to deal with btrfs.

can you explain how grub and btrfs work exactly?
I read the grub manual at
http://www.gnu.org/software/grub/manual/html_node/Bootstrap-
tricks.html#Bootstrap-tricks

so I wonder: does btrfs provide a "boot loader area" similar
to ffs and reiserfs, where grub places the stage 1.5 code
and stage 1 can read it?

  


Yes.


or does grub find out the sectors of the stage 1.5 file and
put the list of those into the stage 1 file (first sector address
of stage 1.5 file) and stage 1.5 first sector (list of all
other sectors)?
  


Grub doesn't make the blocklist for stage1_5

Optionally grub makes the blocklist for stage2 (in particular
when stage1.5 can not be embedded for some reasons).

Note, that it is important to embed and use btrfs_stage1_5.
First, because btrfs has defragmentator, which can make the
blocklist out of date (so that you'll need to reinstall grub).
Second, I am not sure, if the blocklist will be composed
correctly in the case when stage2 locates in btrfs volume
(I didn't look at the blocklist specification: it can happen that
grub installer makes an assumption that all sectors of stage2
locate on the same device).


and what would be the proper way to make boot from a raid1 device?
so that if one disk fails the other can boot?
device (hd0) /dev/sda
root (hd0,X) 
setup (hd0)


device (hd0) /dev/sdb
root (hd0,X)
setup (hd0)
  


Perhaps, it will work. But officially grub doesn't
understand software raid, and I am not familiar
with raid1 specifications, so...  ;)


if data is written to mbr only and the sectors between mbr and
the first partition, this could work. but if data is written to
mbr and to btrfs (either a "boot loader area" or changes to the
stage 1.5 file), then that data can only contain the valid block
lists for one of the two hard disks - which shouldn't be a problem
if drives are the same and have 100% the same geometry and partitioning.

so I wonder, and it would be great if thise fine details would be
documented for btrfs somewhere, as there is very little information
about them (the situation isn't better with other filesystems either).

thanks for your help!

Regards, Andreas

  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2/2] grub-0.97: btrfs multidevice configuration support

2009-09-24 Thread Edward Shishkin


Signed-off-by: Edward Shishkin 

---
 grub/Makefile.am|2 
 stage2/btrfs.h  |   55 ++--
 stage2/builtins.c   |   10 
 stage2/disk_io.c|2 
 stage2/filesys.h|4 
 stage2/fsys_btrfs.c |  693 +---
 6 files changed, 595 insertions(+), 171 deletions(-)

--- grub-0.97.orig/stage2/btrfs.h
+++ grub-0.97/stage2/btrfs.h
@@ -124,6 +124,7 @@ static int btrfs_csum_sizes[] = { 4, 0 }
 #define BTRFS_DEFAULT_NUM_DEVICES 1
 #define BTRFS_DEFAULT_NODE_SIZE   4096
 #define BTRFS_DEFAULT_LEAF_SIZE   4096
+#define BTRFS_NUM_CACHED_DEVICES  128
 
 #define WARN_ON(c)
 #define cassert(cond) ({ switch (-1) { case (cond): case 0: break; } })
@@ -315,13 +316,22 @@ struct btrfs_node {
struct btrfs_key_ptr ptrs[];
 } __attribute__ ((__packed__));
 
+struct btrfs_device {
+   /* the internal btrfs device id */
+   u64 devid;
+   /* the internal grub device representation */
+   unsigned long drive;
+   unsigned long part;
+   unsigned long length;
+};
+
 struct extent_buffer {
/* metadata */
+   struct btrfs_device dev;
u64 start;
u64 dev_bytenr;
u32 len;
-   int refs;
-   int flags;
+   /* data */
char *data;
 };
 
@@ -555,12 +565,8 @@ struct btrfs_block_group_item {
 struct btrfs_root {
struct extent_buffer   node;
char   data[4096];
-   struct extent_buffer   *commit_root;
struct btrfs_root_item root_item;
-   struct btrfs_key   root_key;
-   struct btrfs_fs_info   *fs_info;
u64 objectid;
-   u64 last_trans;
 
/* data allocations are done in sectorsize units */
u32 sectorsize;
@@ -573,42 +579,31 @@ struct btrfs_root {
 
/* leaf allocations are done in leafsize units */
u32 stripesize;
+};
 
-   int ref_cows;
-   int track_dirty;
-
-
-   u32 type;
-   u64 highest_inode;
-   u64 last_inode_alloc;
+struct btrfs_file_info {
+   struct btrfs_key key;
 };
 
 struct btrfs_root;
 struct btrfs_fs_devices;
 struct btrfs_fs_info {
u8 fsid[BTRFS_FSID_SIZE];
-   u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
struct btrfs_root fs_root;
struct btrfs_root tree_root;
struct btrfs_root chunk_root;
 
-   struct btrfs_key  file_info; /* currently opened file */
+   struct btrfs_file_info file_info; /* currently opened file */
struct btrfs_path paths [LAST_LOOKUP_POOL];
 
-   u64 generation;
-   u64 last_trans_committed;
+   char mbr[SECTOR_SIZE];
 
-   u64 system_alloc_profile;
-   u64 alloc_start;
+   int sb_mirror;
+   u64 sb_transid;
+   struct btrfs_device sb_dev;
+   struct btrfs_super_block sb_copy;
 
-   struct btrfs_super_block super_temp;
-   struct btrfs_super_block super_copy;
-
-   u64 super_bytenr;
-   u64 total_pinned;
-
-   int system_allocs;
-   int readonly;
+   struct btrfs_device devices[BTRFS_NUM_CACHED_DEVICES + 1];
 };
 
 /*
@@ -1129,6 +1124,11 @@ static inline void btrfs_set_key_type(st
key->type = val;
 }
 
+static inline u64 btrfs_super_devid(struct btrfs_super_block *disk_super)
+{
+   return le64_to_cpu(disk_super->dev_item.devid);
+}
+
 /* struct btrfs_header */
 BTRFS_SETGET_HEADER_FUNCS(header_bytenr, struct btrfs_header, bytenr, 64);
 BTRFS_SETGET_HEADER_FUNCS(header_generation, struct btrfs_header,
@@ -1317,6 +1317,7 @@ struct btrfs_fs_devices {
 };
 
 struct btrfs_bio_stripe {
+   struct btrfs_device dev;
u64 physical;
 };
 
--- grub-0.97.orig/stage2/fsys_btrfs.c
+++ grub-0.97/stage2/fsys_btrfs.c
@@ -31,15 +31,21 @@
 #define BTRFS_FS_INFO  \
((struct btrfs_fs_info *)((unsigned long)FSYS_BUF + \
  LOOKUP_CACHE_SIZE))
-#define BTRFS_CACHE_SIZE(sizeof(struct btrfs_fs_info) +\
-LOOKUP_CACHE_SIZE)
-#define BTRFS_FILE_INFO (&BTRFS_FS_INFO->file_info)
-#define BTRFS_TREE_ROOT (&BTRFS_FS_INFO->tree_root)
-#define BTRFS_CHUNK_ROOT(&BTRFS_FS_INFO->chunk_root)
-#define BTRFS_FS_ROOT   (&BTRFS_FS_INFO->fs_root)
-#define BTRFS_SUPER (&BTRFS_FS_INFO->super_copy)
-#define LOOKUP_CACHE_BUF(id)   ((char *)((unsigned long)FSYS_BUF + \
- id * LOOKUP_CACHE_BUF_SIZE))
+#define BTRFS_CACHE_SIZE (sizeof(struct btrfs_fs_info) +   \
+ LOOKUP_CACHE_SIZE)
+#define BTRFS_TREE_ROOT  (&BTRFS_FS_INFO->tree_root)
+#define BTRFS_CHUNK_ROOT (&BTRFS_FS_INFO->chunk_root)
+#define BTRFS_FS_ROOT(&BTRFS_FS_INFO->fs_root)
+#define BTRFS_SUPER  (&BTRFS_FS_INFO->sb_copy)
+#define BTRFS_DEVICES(&BTRFS_FS_INFO->devices[0])
+#define BTRFS

[patch 0/2] grub-0.97: btrfs support

2009-09-24 Thread Edward Shishkin

Hello everyone.

The following patches are for Fedora 10(**).
The distro-independent package will be put to kernel.org a bit later.


  I. Loading kernels from btrfs volumes


Now you can load kernels and initrds from btrfs volumes composed of
many devices.

WARNING!!!
Make sure that all components of your loading btrfs volume(*) are
visible to grub. Otherwise, you'll end with unbootable system.
The list of available grub devices can be obtained, for example,
using tab completion in grub shell.

Number of components of a loading volume is not restricted, however
if it is larger then 128, then the boot process will be slowed down
because of expensive translations (btrfs-device-id -> grub-device-id)
which issue a large number of IOs. We cache only 128 such translations
in grub-0.97 because of high memory pressure.


  II. Installing grub from btrfs volumes


You can install grub from a btrfs image volume(*) composed of many
devices (see above about restrictions). Also you can setup any
component of a btrfs boot(*) volume as grub root device.

NOTE!!! Make sure that all components of image and boot volumes(*)
are visible to grub, otherwise grub installer will return error.

TECHNICAL NOTE (for grub developers):
The unpleasant surprise was that grub installer overwrites
(by default!) the file (stage2), bypassing the file system driver.
I can not understand this: it looks like stepping to the clean water
with dirty shoe. Hope that grub2 won't afford such things.

In order to install grub from a btrfs image volume use special
option (--stage2). This option makes grub installer to rewrite
the file with a help of the OS's file system (i.e, via write (2)).
Any attempts to install without this option will fail with an error
(wrong argument).

The example of possible installation scenario.

Suppose image volume = root volume = loading volume is composed
of devices (hd0,4), (hd0,5), (hd1,5), (hd1,7) and is not an OS's
root. We want to setup (hd0,4) as grub root device and install
grub to the mbr of (hd0).

. build and install grub with btrfs support;
. mount your the "3 in 1" btrfs volume to /mnt;
. create a directory /mnt/grub;
. put the built files stage1, stage2, btrfs_stage1_5, grub.conf,
etc. to /mnt/grub;
. run grub shell;
. grub> root (hd0,4)
. grub> setup --stage2=/mnt/grub/stage2 (hd0)
. have a fun.

Use info(1) grub for more details.

(*) Glossary:
. loading volume:
a btrfs volume that contains kernel image and initrd;

. image volume:
a btrfs volume that contains stage1, stage2, btrfs_stage_1_5,
and grub.conf files needed for grub installer;

. boot volume:
a btrfs volume where grub will look for stage2 and grub.conf
files in boot time.

(**) Link to the Fedora's grub package:
http://ucho.ignum.cz/fedora/linux/releases/10/Fedora/source/SRPMS/grub-0.97-38.fc10.src.rpm 



All comments, bugreports, etc. are welcome as usual.

Thanks,
Edward.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: grub-0.97/btrfs: the files fsys_btrfs.c, btrfs.h

2009-07-22 Thread Edward Shishkin

Vladimir 'phcoder' Serbinenko wrote:

On Wed, Jul 22, 2009 at 7:04 PM, Felix Zielcke wrote:
  

Am Mittwoch, den 22.07.2009, 18:41 +0200 schrieb Edward Shishkin:

Hi Edward,




Hello.


(CC linux-btrfs mailing list)
  

Uhm there is no CC?
I'm unsure now if I should CC it or not.


Edward had a problem sending to our list because grub-devel is
subscriber-only list. Let's CC linux-btrfs
  

Anyway, nice to see that you work on btrfs for grub.
But grub-legacy is totally dead now, it would be better if you would put
your efforts into grub2.



Mm.. I am not an individual contributor.
I work for RHEL and Fedora, which is on grub-0.97 for now..


I would be happy to assist with this. I already have such experience
porting zfs from grub-solaris to grub2
  


It would be fine..

Thanks,
Edward.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


grub-0.97/VFS

2009-07-22 Thread Edward Shishkin

Hello everyone.

Grub-0.97 filesystem interface (read_func, dir_func)
seems to be poor. Instead of this I would prefer to
have something like the following:

/*
*
* .init_root()   // set index of root dir
* .lookup_begin()// get index by name
* .lookup_end()  // get inode by index
* .read_file()
* .read_dir()
* .read_symlink()
*
* so that dir_func in grub_open will look like the
* following:
*/
int path_walk()
{
   ops->init_root();
   while (1) {
   int mode;
   ops->lookup_end(&mode, ...);

   switch (type_of_file(mode)) {
   case SYMLINK_FILE:
   follow_symlink(read_symlink, ...);
   ...;
   continue;
   case REGULAR_FILE:
   /*
* the end of path walk:
* normally we want to exit here
*/
   ...;
   return 1;
   case DIRECTORY_FILE:
   /*
* optionally this will
* print possibilities
*/
   ops->lookup_begin(mode, read_dir, ...);
   ...;
   continue;
   default:
   errnum = ERR_BAD_FILETYPE;
   return 0;
   }
   }
}

Just my 2 cents in the (grub-2?) development process..

Thanks,
Edward.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch] btrfs-progs: fix btrfs_read_dev_super

2009-05-27 Thread Edward Shishkin


Don't break in the middle of search,
taste also other super mirrors.

Signed-off-by: Edward Shishkin 
---
 disk-io.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- btrfs-progs-0.18.orig/disk-io.c
+++ btrfs-progs-0.18/disk-io.c
@@ -716,7 +716,7 @@ int btrfs_read_dev_super(int fd, struct 
bytenr = btrfs_sb_offset(i);
ret = pread64(fd, &buf, sizeof(buf), bytenr);
if (ret < sizeof(buf))
-   break;
+   continue;
 
if (btrfs_super_bytenr(&buf) != bytenr ||
strncmp((char *)(&buf.magic), BTRFS_MAGIC,


Re: Some basic benchmarking

2009-05-13 Thread Edward Shishkin

Jaime sanchez wrote:

Hi list,

I am tired of "syntetic" benchmarks and i wrote a script to totally
automate benchmarking for various file system (command line option)
through the same partition, collect the data and show results in a
nice way.

There are 4 basic benchmarks performed, copy, compress, uncompress and
delete. The data folder i used is the linus-source-2.6.30-rc5
(uncompressed).

The script can be downloaded here:
http://rapidshare.com/files/232159335/abench.tar.gz (use abench -h for
help)

Here are some results for core2duo, Western digital green power hard
disk (WD5000AACS):

┌───┬┬───┬──┬──┐
│   │  Copy  │ Compress  │ Uncompre │ Delete   │
│   │  time  │ time  │ time │ time │
├───┼┼───┼──┼──┤
│ ext4  │ 0m33.283s  │ 0m26.834s │ 0m7.229s │ 0m0.866s │
│ noatime,async ││   │  │  │
├───┼┼───┼──┼──┤
│ ext3  │ 0m46.947s  │ 0m27.003s │ 0m7.953s │ 0m0.786s │
│ noatime,async ││   │  │  │
├───┼┼───┼──┼──┤
│ ext2  │ 0m19.624s  │ 0m26.716s │ 0m7.259s │ 0m0.477s │
│ noatime,async ││   │  │  │
├───┼┼───┼──┼──┤
│ reiserfs  │ 0m38.056s  │ 0m27.996s │ 0m12.264 │ 0m1.474s │
│ noatime,async ││   │  │  │
├───┼┼───┼──┼──┤
│ btrfs │ 0m20.411s  │ 0m30.793s │ 0m11.618 │ 0m5.599s │
│ defaults  ││   │  │  │
├───┼┼───┼──┼──┤
│ btrfs │ 0m21.630s  │ 0m27.891s │ 0m11.926 │ 0m8.473s │
│ async ││   │  │  │
├───┼┼───┼──┼──┤
│ btrfs │ 0m19.277s  │ 0m27.603s │ 0m11.238 │ 0m10.468s│
│ async,noatime ││   │  │  │
├───┼┼───┼──┼──┤
│ btrfs │ 0m17.086s  │ 0m27.520s │ 0m10.062 │ 0m11.688s│
│ nodatacsum,nodatacow,n│atime,async │   │  │  │
├───┼┼───┼──┼──┤
│ btrfs │ 0m23.708s  │ 0m28.663s │ 0m15.326 │ 0m7.318s │
│ compress,async,atime  ││   │  │  │
├───┼┼───┼──┼──┤
  


Btw, it might be roaches in the code:
transparent compression should win 20-25%...


│ btrfs │ 0m20.418s  │ 0m27.484s │ 0m9.871s │ 0m8.989s │
│ compress,async,atime,n│datacsum,nod│tacow  │  │  │
├───┼┼───┼──┼──┤
│ reiser4   │ 0m19.750s  │ 0m27.058s │ 0m10.196 │ 0m2.709s │
│ noatime,async ││   │  │  │
├───┼┼───┼──┼──┤
│ reiser4   │ 0m16.121s  │ 0m27.924s │ 0m9.798s │ 0m8.582s │
│ noatime,async,compress││   │  │  │
└───┴┴───┴──┴──┘

It seems that btrfs has good performance, except for delete times.

Regards
Antonio

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] fallocate behavior when crossing end-of-file

2009-05-06 Thread Edward Shishkin

Michael Raskin wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Edward Shishkin wrote:
  

Desired: something (probably, one block) is allocated/reserved for the
file. File length is set to 1 byte.
  
 
Where is it documented?


IMHO we should only guarantee that writes
to [0, 1] won't fail because of lack of disk space.
Other things (including file size) are up to implementation.



POSIX, SUS.

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html

<<<
If the offset+ len is beyond the current file size, then
posix_fallocate() shall adjust the file size to offset+ len. Otherwise,
the file size shall not be changed.
  


fallocate (2) is something different from posix_fallocate:

"...This default behavior closely resembles the behavior
of the posix_fallocate(3) library function, and is intended
as a method of optimally implementing that function."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] fallocate behavior when crossing end-of-file

2009-05-06 Thread Edward Shishkin

Raskin Michael wrote:

Hello.
  


Hello.


I found the following bug in BtrFS:

1. Create and open an empty file
2. fallocate (fd, 0, 1)

Desired: something (probably, one block) is allocated/reserved for the
file. File length is set to 1 byte.
  

Where is it documented?


IMHO we should only guarantee that writes
to [0, 1] won't fail because of lack of disk space.
Other things (including file size) are up to implementation.


Actual: A block is allocated. File length is set to 1 block (4096
bytes). The rest of the file is filled with zeros.

last_byte = min(extent_map_end(em), alloc_end);
last_byte = (last_byte + mask) & ~mask;
if (em->block_start == EXTENT_MAP_HOLE) {
ret = prealloc_file_range(trans, inode, cur_offset,
last_byte, locked_end + 1,
alloc_hint, mode);

That part seems strange to me. You make an effort for block size to
divide last_byte. But last_byte should be
max(i_size_read(inode), offset+len)
- without any rounding.

Michael Raskin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Edward Shishkin

Tomasz Chmielewski wrote:

Thomas Glanzmann schrieb:


300 Gbyte of used storage of several productive VMs with the following
Operatings systems running:
\begin{itemize}
\item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5)
\item SuSE Linux 32 and 64 Bit (SLES 9 and 10)
\item Windows 2003 Std. Edition 32 Bit
\item Windows 2003 Enterprise Edition 64 Bit
\end{itemize}
\begin{tabular}{r|r|r|l}
blocksize & Deduplicated Data \\
\hline
128k  &  29.9 G \\
 64k  &  41.3 G \\
 32k  &  59.2 G \\
 16k  &  82   G \\
  8k  & 112   G \\
\

Bottom line with 8 K blocksize you can get more than 33% of deduped data
running a productive set of VMs.


Did you just compare checksums, 


I wouldn't rely on crc32: it is not a strong hash,
Such deduplication can lead to various problems,
including security ones.

or did you also compare the data "bit after bit" if the checksums 
matched?


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION

2009-04-21 Thread Edward Shishkin

Christoph Hellwig wrote:

On Fri, Apr 17, 2009 at 12:11:55PM +0200, Edward Shishkin wrote:
  

Christoph Hellwig wrote:


Add support for the standard attributes set via chattr and read vis
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.
 
  

Imho, since inode items are of fixed size, is won't be possible
to avoid such workarounds like conversion between formats.
No?



While the inode format is fixed it has 256 spare bits for expansion.
  


Ah, I meant the case when the spare bits are exhausted..


But what I mean with the above is to split the current 64bit flags value
into a a 32 bit internal flags and a 32bit user visible flags value
and store the ioctl flags in the latter.

OTOH every filesystem but extN seem to need some conversion so btrfs
wouldn't be unusual at that.


Not sure about extN, but one of the techniques is to represent
inode item as a set of (optional) "extensions", so that every
such extension includes it's version number (think of it as
release date). If file system driver is older, then some encountered
extension, init_inode() returns error. OTOH, if some extension
is missed, then some featured operations can be undefined (i.e.
read(), or write(), etc.. will return error). No conversion is
needed, however such approach requires more sophisticated fsck.


  And the GETFLAGS/SETFLAGS flags value
are pretty ugly anyway as they mix up flags for user visible behaviour
with extN implementation details that shouldn't really need to be
exposed to userspace.


  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION

2009-04-17 Thread Edward Shishkin

Christoph Hellwig wrote:

Add support for the standard attributes set via chattr and read vis
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.
  


Imho, since inode items are of fixed size, is won't be possible
to avoid such workarounds like conversion between formats.
No?


Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were obsfuction the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Btw, any idea what the BTRFS_INODE_REDONLY flag is for?  It's a subset
of the immutable flag, but can't actually be set anywhere from the
filesystem code.


Signed-off-by: Christoph Hellwig 

Index: linux-2.6/fs/btrfs/ioctl.c
===
--- linux-2.6.orig/fs/btrfs/ioctl.c 2009-04-17 10:08:11.758948607 +0200
+++ linux-2.6/fs/btrfs/ioctl.c  2009-04-17 10:33:21.555076930 +0200
@@ -50,7 +50,172 @@
 #include "volumes.h"
 #include "locking.h"
 
+/* Mask out flags that are inappropriate for the given type of inode. */

+static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
+{
+   if (S_ISDIR(mode))
+   return flags;
+   else if (S_ISREG(mode))
+   return flags & ~FS_DIRSYNC_FL;
+   else
+   return flags & (FS_NODUMP_FL | FS_NOATIME_FL);
+}
+
+/*
+ * Export inode flags to the format expected by the FS_IOC_GETFLAGS ioctl.
+ */
+static unsigned int btrfs_flags_to_ioctl(unsigned int flags)
+{
+   unsigned int iflags = 0;
+
+   if (flags & BTRFS_INODE_SYNC)
+   iflags |= FS_SYNC_FL;
+   if (flags & BTRFS_INODE_IMMUTABLE)
+   iflags |= FS_IMMUTABLE_FL;
+   if (flags & BTRFS_INODE_APPEND)
+   iflags |= FS_APPEND_FL;
+   if (flags & BTRFS_INODE_NODUMP)
+   iflags |= FS_NODUMP_FL;
+   if (flags & BTRFS_INODE_NOATIME)
+   iflags |= FS_NOATIME_FL;
+   if (flags & BTRFS_INODE_DIRSYNC)
+   iflags |= FS_DIRSYNC_FL;
+
+   return iflags;
+}
+
+/*
+ * Update inode->i_flags based on the btrfs internal flags.
+ */
+void btrfs_update_iflags(struct inode *inode)
+{
+   struct btrfs_inode *ip = BTRFS_I(inode);
+
+   inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+
+   if (ip->flags & BTRFS_INODE_SYNC)
+   inode->i_flags |= S_SYNC;
+   if (ip->flags & BTRFS_INODE_IMMUTABLE)
+   inode->i_flags |= S_IMMUTABLE;
+   if (ip->flags & BTRFS_INODE_APPEND)
+   inode->i_flags |= S_APPEND;
+   if (ip->flags & BTRFS_INODE_NOATIME)
+   inode->i_flags |= S_NOATIME;
+   if (ip->flags & BTRFS_INODE_DIRSYNC)
+   inode->i_flags |= S_DIRSYNC;
+}
+
+/*
+ * Inherit flags from the parent inode.
+ *
+ * Unlike extN we don't have any flags we don't want to inherit currently.
+ */
+void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
+{
+   unsigned int flags = BTRFS_I(dir)->flags;
+
+   if (S_ISREG(inode->i_mode))
+   flags &= ~BTRFS_INODE_DIRSYNC;
+   else if (!S_ISDIR(inode->i_mode))
+   flags &= (BTRFS_INODE_NODUMP | BTRFS_INODE_NOATIME);
+
+   BTRFS_I(inode)->flags = flags;
+   btrfs_update_iflags(inode);
+}
+
+static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
+{
+   struct btrfs_inode *ip = BTRFS_I(file->f_path.dentry->d_inode);
+   unsigned int flags = btrfs_flags_to_ioctl(ip->flags);
+
+   if (copy_to_user(arg, &flags, sizeof(flags)))
+   return -EFAULT;
+   return 0;
+}
+
+static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
+{
+   struct inode *inode = file->f_path.dentry->d_inode;
+   struct btrfs_inode *ip = BTRFS_I(inode);
+   struct btrfs_root *root = ip->root;
+   struct btrfs_trans_handle *trans;
+   unsigned int flags, oldflags;
+   int ret;
+
+   if (copy_from_user(&flags, arg, sizeof(flags)))
+   return -EFAULT;
 
+	if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL | \

+ FS_NOATIME_FL | FS_NODUMP_FL | \
+ FS_SYNC_FL | FS_DIRSYNC_FL))
+   return -EOPNOTSUPP;
+
+   if (!is_owner_or_cap(inode))
+   return -EACCES;
+
+   mutex_lock(&inode->i_mutex);
+
+   flags = btrfs_mask_flags(inode->i_mode, flags);
+   oldflags = btrfs_flags_to_ioctl(ip->flags);
+   if ((flags ^ oldflags) & (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+   if (!capable(CAP_LINUX_IMMUTABLE)) {
+   ret = -EPERM;
+   goto out_unlock;
+   }
+   }
+
+   ret = mnt_want_write(file->f_path.mnt);
+   if (ret)
+   goto out_unlock;
+
+   if (flags & FS_SYNC_FL)
+   ip->flags |= BTRFS_INO