Re: [Ext2-devel] [UPDATE] Directory index for ext2

2001-06-02 Thread Daniel Phillips

On Thursday 31 May 2001 21:44, Andreas Dilger wrote:
> I noticed something interesting when running "mongo" with debugging
> on. It is adding filenames which are only sequential numbers, and the
> hash code is basically adding to only two blocks until those blocks
> are full, at which point (I guess) the blocks split and we add to two
> other blocks.
>
> I haven't looked at it closely, but for _example_ it's something like:
>
> 65531 to block 113
> 65532 to block 51
> 65533 to block 51
> 65534 to block 113
> 65535 to block 113
> (repeats)
> 65600 to block 21
> 65601 to block 96
> 65602 to block 96
> 65603 to block 21
> 65604 to block 21
> (repeats)
>
> I will have to recompile and run with debugging on again to get
> actual output.
>
> To me this would _seem_ bad, as it indicates the hash is not
> uniformly distributing the files across the hash space.  However,
> skewed hashing may not necessarily be that bad for performance.  It
> may even be that, because we never have to rebalance the hash index
> structure, as long as we don't get large numbers of identical
> hashes it is just fine if similar filenames produce similar hash
> values.  We just keep splitting the leaf blocks until the hash moves
> over to a different "range".  For a balanced tree-based structure a
> skewed hash would be bad, because you would have to do full-tree
> rebalancing very often then.

A skewed hash doesn't hurt the fixed-depth tree in the obvious way - it 
just tends to leave a lot of half-full buckets around, which wastes 
about 1/3 of the leaf space.  The reason for this behaviour is that when 
you create a lot of sequentially-related names, a skewed hash will tend 
to dump a lot of them into a tiny sliver of the hash space, and after 
splitting that little sliver it's quite unlikely that later entries will 
hit the same small range.  The only good protection against this is to 
make the hash function vary wildly if even one bit in the string 
changes.  This is what crc does, and that's why I'm interested in it.
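
To make this concrete, here is a little user-space sketch - illustration 
only, none of this is in the patch - that hashes a run of sequential 
numeric names with a no-mixing additive hash and with a simple 
multiply-and-fold hash, then reports how much of the 32-bit hash space 
each one covers.  Since the index splits leaves by hash range, the 
additive hash, which lands every name in a tiny sliver, is exactly the 
bad case described above:

/*
 * Illustration only (not from the patch): hash the names "65000" ..
 * "65999" two ways and report how much of the 32-bit hash space the
 * results cover.  The additive hash has no mixing, so sequential names
 * end up crammed into a narrow range; the mixing hash spreads them out
 * because flipping one input bit changes many output bits.
 */
#include <stdio.h>

static unsigned additive_hash(const char *s)
{
        unsigned h = 0;
        while (*s)
                h += (unsigned char)*s++;       /* no mixing at all */
        return h;
}

static unsigned mixing_hash(const char *s)
{
        unsigned h = 0x12a3fe2d;        /* arbitrary seed for the sketch */
        while (*s) {
                h ^= (unsigned char)*s++;
                h *= 0x01000193u;       /* multiply smears the changed bit upward */
                h ^= h >> 15;           /* fold high bits back into the low ones */
        }
        return h;
}

int main(void)
{
        unsigned lo[2] = { ~0u, ~0u }, hi[2] = { 0, 0 };
        char name[16];
        int i;

        for (i = 65000; i < 66000; i++) {
                unsigned h;
                sprintf(name, "%d", i);
                h = additive_hash(name);
                if (h < lo[0]) lo[0] = h;
                if (h > hi[0]) hi[0] = h;
                h = mixing_hash(name);
                if (h < lo[1]) lo[1] = h;
                if (h > hi[1]) hi[1] = h;
        }
        printf("additive: spans %.6f%% of the 32-bit space\n",
               100.0 * (hi[0] - lo[0]) / 4294967295.0);
        printf("mixing:   spans %.6f%% of the 32-bit space\n",
               100.0 * (hi[1] - lo[1]) / 4294967295.0);
        return 0;
}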

I've rehabilitated the hack_show_dir code, which is enabled by 
#defining DX_HACK.  To kprint a dump of hash buckets and statistics do:

cat /test_partition/indexed_dir

This dump is extremely helpful in judging the effectiveness of hash 
functions; just take a look at the range of hash values that gets 
mapped into each leaf block.  Ideally, there should not be too much 
variance.

The format of the dump is:

   bucketnumber:blocknumber hashstart/range (entrycount)

Yusuf Goolamabbas sent me a pointer to some new work on hash functions:

   http://www.isthe.com/chongo/tech/comp/fnv/

I coded up the fnv_hash and included it in today's patch - there is a 
define to select which to use; dx_hack_hash is still the default.
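
For reference, the 32-bit FNV-1 recipe from that page is tiny.  Roughly 
this - my transcription as a sketch, so check the actual fnv_hash in the 
patch rather than trusting it:

/* 32-bit FNV-1 sketch: multiply by the FNV prime, then xor in the byte. */
static unsigned fnv_hash_sketch(const char *name, int len)
{
        unsigned hash = 2166136261u;            /* FNV offset basis */

        while (len--) {
                hash *= 16777619u;              /* FNV prime */
                hash ^= (unsigned char)*name++; /* fold in the next byte */
        }
        return hash;
}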

fnv_hash is only a little worse than dx_hack_hash, which still produces 
the most uniform distribution of all the hash functions I've tested.  
But I can see from the dumps that it's still not optimal; it's just 
that all the others are worse.

I still have quite a few leads to follow up on the hashing question.
Next week I hope I'll get time to try crc32 as a hashing function.  I 
hope it doesn't win because I'll need a 1K table to evaluate it, and 
that would be 20% of the whole code size.
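
If the table does turn out to be the sticking point, a bit-at-a-time 
crc32 needs no table at all, at the cost of eight shifts per byte.  A 
rough sketch (standard IEEE polynomial in reflected form; this is not 
code from the patch):

static unsigned crc32_hash(const char *name, int len)
{
        unsigned crc = 0xffffffffu;

        while (len--) {
                int i;
                crc ^= (unsigned char)*name++;
                for (i = 0; i < 8; i++)         /* one bit at a time, no table */
                        crc = (crc >> 1) ^ (0xedb88320u & -(crc & 1));
        }
        return ~crc;
}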

The patch:

http://nl.linux.org/~phillips/htree/dx.pcache-2.4.5-2

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [UPDATE] Directory index for ext2

2001-05-31 Thread Andreas Dilger

Daniel writes:
> On Thursday 31 May 2001 21:44, Andreas Dilger wrote:
> > I think Al's idea of doing the validation once on the initial
> > read is a good one.
> 
> I'm doing that in the current patch for leaf blocks; look at 
> ext2_bread.  For index blocks, ext2_bread needs help to know that a 
> block is supposed to be an index block.  Add a parameter?

I think we should just get rid of the misconception that ext2_bread()
is a block read function.  It is only used by directory functions.
Instead we should have separate ext2_bread_leaf(), ext2_bread_root(),
ext2_bread_node() which do the appropriate validation for each type
of block.

In ext2_bread_dir() if we really think directory block prealloc is a win
(in addition to the existing in-memory contiguous block prealloc), we
may as well do it each time we split a leaf block, and make them valid
in-use leaf blocks instead of just wasting space on disk (i.e. each split
block has the hash space split into 1/N of the existing space, and we
distribute existing entries across all N blocks).

This way we don't have to split each directory block so many times.
For indexed directories this is (probably) a net win because we avoid
N extra block splits (i.e. extra copying of leaf blocks), and make the
leaf search space smaller.  On non-indexed ext2 it would be a net loss
because we would still have to read and search each directory block,
even if they are empty.
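
Roughly the arithmetic I have in mind for the N-way split, as a 
hypothetical sketch (the names are made up, nothing like this is in the 
patch yet): the leaf being split covers the hash range [start, start + 
range), and we carve that into N equal slices, routing each existing 
entry to the slice its hash falls in.

#define NWAY 4

/* Hypothetical: which of the NWAY new leaves does this hash go to? */
static int dx_slice_for_hash(unsigned hash, unsigned start, unsigned range)
{
        unsigned width = range / NWAY;          /* width of one slice */
        int slice;

        if (!width)                             /* degenerate tiny range */
                width = 1;
        slice = (hash - start) / width;
        return slice < NWAY ? slice : NWAY - 1; /* clamp the top edge */
}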

> It's normal for it to start by putting all the entries into the first 
> two blocks, but after those are split it should be pretty uniform 
> across the resulting 4, and so on.  Can you confirm it's unbalanced?

I don't think that is what I was seeing, because the hash block numbers
were not "->1" and "->2" (which would be the case right after a split),
but rather 30's, 40's, etc.

> > Running mongo has shown up another bug, I see, but haven't had a
> > chance to look into yet.  It involves not being able to delete files
> > from an indexed directory:
> >
> > rm: cannot remove `/mnt/tmp/testdir1-0-0/d0/d1/d2/d3/509.r':
> > Input/output error
> >
> > This is after the files had been renamed (.r suffix).  Do we re-hash
> > directory entries if the file is renamed?  If not, then that would
> > explain this problem.  It _looks_ like we do the right thing, but the
> > mongo testing wipes out the filesystem after each test, and the above
> > message is from a logfile only.
> 
> The rename creates the new entry via ext2_add_entry so the hash must be 
> correct.  Time to get out the bug swatter.  I'll get mongo and try it.

One other point of information.  In the test I was running, it was
always the file "509.r" which had the I/O error (new filesystem each
test run, btw, and no IDE errors in the log).

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [UPDATE] Directory index for ext2

2001-05-31 Thread Daniel Phillips

On Thursday 31 May 2001 21:44, Andreas Dilger wrote:
> Daniel, you write:
> >   - Fall back to linear search in case of corrupted index
>
> OK, I have _some_ of the code needed to do this, but in one case I'm
> not sure of what your intention was - in dx_probe() you check for
> "unused_flags & 1" to signal a bad/unsupported index.  Why only check
> the low bit instead of the whole thing?

I guess it really means I've allocated the low bit to mean 
'incompatible change here, old code should give up now', so 
"unused_flags" is a misnomer.

> I currently have:

[code that kmail fsck'd completely in the quote]

I'll incorporate it.

> On lookup it is OK to fall back to linear search, but if we add an
> entry via linear we would then overwrite the root index.  We probably
> want extra logic so that if we have a bad interior node we overwrite
> that (or another leaf instead of killing the root index).  We
> probably also want to make a distinction between I/O errors and bad
> data (currently I just return NULL for both).  I think Al's idea of
> doing the validation once on the initial read is a good one.

I'm doing that in the current patch for leaf blocks; look at 
ext2_bread.  For index blocks, ext2_bread needs help to know that a 
block is supposed to be an index block.  Add a parameter?

The checks we're doing now aren't that expensive so I decided to take 
the lazy approach and just do them every time.  It's not the same as 
the leaf block checks, which otherwise would need to be in the inner 
loop.
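
Something along these lines is what I mean by adding a parameter - a 
purely hypothetical sketch with made-up names, not what is in the patch 
today:

/* Hypothetical: caller says what kind of block it expects, so the
 * appropriate validation can happen once, when the buffer comes in. */
enum dx_block_kind { DX_LEAF, DX_ROOT, DX_NODE };

static struct buffer_head *dx_bread(struct inode *dir, unsigned block,
                                    enum dx_block_kind kind)
{
        struct buffer_head *bh = ext2_bread(dir, block, 0);

        if (IS_ERR(bh))
                return bh;
        switch (kind) {
        case DX_LEAF:
                /* the leaf checks ext2_bread does today */
                break;
        case DX_ROOT:
        case DX_NODE:
                /* index checks: limit, count, indirect_levels, magic */
                break;
        }
        return bh;
}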

[I'm still thinking about your comments on magic numbers and hash 
versions]

> >   - Finalize hash function
>
> I noticed something interesting when running "mongo" with debugging
> on. It is adding filenames which are only sequential numbers, and the
> hash code is basically adding to only two blocks until those blocks
> are full, at which point (I guess) the blocks split and we add to two
> other blocks.

It's normal for it to start by putting all the entries into the first 
two blocks, but after those are split it should be pretty uniform 
across the resulting 4, and so on.  Can you confirm it's unbalanced?

I used to have a handy hash bucket-dumping function (dx_show_buckets) 
hooked into dir->read, which normally just returns an error.  To get a 
dump you'd cat the directory.  A hokey interface, for debugging only, 
but convenient to code and use.  This is gone from the page 
cache version and I should hook it back in somehow.

> I will have to recompile and run with debugging on again to get
> actual output.
>
> To me this would _seem_ bad, as it indicates the hash is not
> uniformly distributing the files across the hash space.  However,
> skewed hashing may not necessarily be that bad for performance.  It
> may even be that, because we never have to rebalance the hash index
> structure, as long as we don't get large numbers of identical
> hashes it is just fine if similar filenames produce similar hash
> values.  We just keep splitting the leaf blocks until the hash moves
> over to a different "range".  For a balanced tree-based structure a
> skewed hash would be bad, because you would have to do full-tree
> rebalancing very often then.
>
> > No known bugs, please test, thanks in advance.

So much for that ;-)

> Running mongo has shown up another bug, I see, but haven't had a
> chance to look into yet.  It involves not being able to delete files
> from an indexed directory:
>
> rm: cannot remove `/mnt/tmp/testdir1-0-0/d0/d1/d2/d3/509.r':
> Input/output error
>
> This is after the files had been renamed (.r suffix).  Do we re-hash
> directory entries if the file is renamed?  If not, then that would
> explain this problem.  It _looks_ like we do the right thing, but the
> mongo testing wipes out the filesystem after each test, and the above
> message is from a logfile only.

The rename creates the new entry via ext2_add_entry so the hash must be 
correct.  Time to get out the bug swatter.  I'll get mongo and try it.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [UPDATE] Directory index for ext2

2001-05-31 Thread Andreas Dilger

Daniel, you write:
>   - Fall back to linear search in case of corrupted index

OK, I have _some_ of the code needed to do this, but in one case I'm
not sure of what your intention was - in dx_probe() you check for
"unused_flags & 1" to signal a bad/unsupported index.  Why only check
the low bit instead of the whole thing?  I currently have:

if (dir->i_size < dir->i_sb->s_blocksize * 2)
        goto fail;  // bad EXT2_INDEX_FL set, clear?
if (IS_ERR(bh = ext2_bread (dir, 0, 0)))
        goto fail;  // FIXME error message?
root = (struct dx_root *) bh->b_data;

// use the following for production once hash_version is 1
// if ((root->info.hash_version & 3) == 0 || root->info.unused_flags)
if (root->info.hash_version > 0 || root->info.unused_flags & 1)
        goto fail2; // unsupported hash format
if ((indirect = root->info.indirect_levels) > 1)
        goto fail2; // unsupported hash format
if (root->info.reserved_zero.v ||
    root->info.info_length < sizeof(root->info))
        goto fail2; // bad root node data
...
if (dx_get_limit(entries) != dx_root_limit(dir, root->info.info_length))
        goto fail2; // bad root node data
...
if (dx_get_limit(entries) != dx_node_limit(dir))
        goto fail3; // bad index node data

On lookup it is OK to fall back to linear search, but if we add an entry
via linear we would then overwrite the root index.  We probably want extra
logic so that if we have a bad interior node we overwrite that (or another
leaf instead of killing the root index).  We probably also want to make a
distinction between I/O errors and bad data (currently I just return NULL
for both).  I think Al's idea of doing the validation once on the initial
read is a good one.

Instead of keeping reserved_zero as zero and using it to detect if an
inode number was written there, we could write a magic number there to
signal a valid hash.  Alternately, instead of using hash_version to mark
the hash type, we could leave that unused and make the above magic
number the hash value, which is the hash of some fixed string, e.g.

HASH_V0 := dx_hack_hash("Daniel Phillips", 15) // constant value
HASH_V1 := dx_new_hash("Daniel Phillips", 15)  // constant value

struct dx_root
{
        struct fake_dirent dot;
        char dot_name[4];
        struct fake_dirent dotdot;
        char dotdot_name[4];
        struct dx_root_info {
                le_u32 hash_magic;
                u8 unused;
                u8 info_length; /* 8 */
                u8 indirect_levels;
                u8 unused_flags;
        } info;
        struct dx_entry entries[0];
};

/*
 * Hash value depends on existing hash type, so it is calculated here.
 * For new directories we never use this function, and simply pick the
 * default hash function when creating the new dx_root.
 */
static inline dx_frame *dx_probe(inode *dir, dentry *dentry, u32 *hash,
                                 dx_frame *frame)
{
        struct buffer_head *bh;
        struct dx_root *root;

        if (IS_ERR(bh = ext2_bread (dir, 0, 0)))
                goto fail;  // return error code
        root = (struct dx_root *) bh->b_data;

        switch (le32_to_cpu(root->info.hash_magic.v)) {
        case HASH_V0:
                // hash-specific dx_root validation here
                *hash = dx_hack_hash(dentry->d_name.name, dentry->d_name.len);
                return dx_probe_hash(dir, hash, frame, bh);
        case HASH_V1:
                // hash-specific dx_root validation here
                *hash = dx_new_hash(dentry->d_name.name, dentry->d_name.len);
                return dx_probe_hash(dir, hash, frame, bh);
        default:
                printk("unsupported hash %u in directory %lu\n",
                       le32_to_cpu(root->info.hash_magic.v), dir->i_ino);
                brelse(bh);
                *hash = 0;
        }
fail:
        return NULL;
}



>   - Finalize hash function

I noticed something interesting when running "mongo" with debugging on.
It is adding filenames which are only sequential numbers, and the hash
code is basically adding to only two blocks until those blocks are full,
at which point (I guess) the blocks split and we add to two other blocks.

I haven't looked at it closely, but for _example_ it's something like:

65531 to block 113
65532 to block 51
65533 to block 51
65534 to block 113
65535 to block 113
(repeats)
65600 to block 21
65601 to block 96
65602 to block 96
65603 to block 21
65604 to block 21
(repeats)

I will have to recompile and run with debugging on again to get actual
output.

To me this would _seem_ bad, as it indicates the hash is not uniformly
distributing the files across the hash space.  However, skewed hashing
may not necessarily be that bad for performance.  It may even be that,
because we never have to rebalance the hash index structure, as long
as we don't get large numbers of identical hashes it is just fine if
similar filenames produce similar hash values.  We just keep splitting
the leaf blocks until the hash moves over to a different "range".  For
a balanced tree-based structure a skewed hash would be bad, because you
would have to do full-tree rebalancing very often then.
