Re: [Ext2-devel] [UPDATE] Directory index for ext2
On Thursday 31 May 2001 21:44, Andreas Dilger wrote:
> I noticed something interesting when running "mongo" with debugging
> on.  It is adding filenames which are only sequential numbers, and the
> hash code is basically adding to only two blocks until those blocks
> are full, at which point (I guess) the blocks split and we add to two
> other blocks.
>
> I haven't looked at it closely, but for _example_ it's something like:
>
>    65531 to block 113
>    65532 to block 51
>    65533 to block 51
>    65534 to block 113
>    65535 to block 113
>    (repeats)
>    65600 to block 21
>    65601 to block 96
>    65602 to block 96
>    65603 to block 21
>    65604 to block 21
>    (repeats)
>
> I will have to recompile and run with debugging on again to get
> actual output.
>
> To me this would _seem_ bad, as it indicates the hash is not
> uniformly distributing the files across the hash space.  However,
> skewed hashing may not necessarily be that bad for performance.  It
> may even be that, because we never have to rebalance the hash index
> structure, as long as we don't get large numbers of identical
> hashes it is just fine if similar filenames produce similar hash
> values.  We just keep splitting the leaf blocks until the hash moves
> over to a different "range".  For a balanced tree-based structure a
> skewed hash would be bad, because you would have to do full-tree
> rebalancing very often then.

A skewed hash doesn't hurt the fixed-depth tree in the obvious way - it
just tends to leave a lot of half-full buckets around, which wastes
about 1/3 of the leaf space.  The reason for this behaviour is that
when you create a lot of sequentially-related names, a skewed hash will
tend to dump a lot of them into a tiny sliver of the hash space, and
after splitting that little sliver it's quite unlikely that later
entries will hit the same small range.  The only good protection
against this is to make the hash function vary wildly if even one bit
in the string changes.  This is what crc does, and that's why I'm
interested in it.

I've rehabilitated the hack_show_dir code, which is enabled by
#defining DX_HACK.  To kprint a dump of hash buckets and statistics do:

    cat /test_partition/indexed_dir

This dump is extremely helpful in judging the effectiveness of hash
functions: just take a look at the range of hash values that gets
mapped into each leaf block.  Ideally, there should not be too much
variance.  The format of the dump is:

    bucketnumber:blocknumber hashstart/range (entrycount)

Yusuf Goolamabbas sent me a pointer to some new work on hash functions:

    http://www.isthe.com/chongo/tech/comp/fnv/

I coded up the fnv_hash and included it in today's patch - there is a
define to select which to use; dx_hack_hash is still the default.
fnv_hash is only a little worse than dx_hack_hash, which still produces
the most uniform distribution of all the hash functions I've tested.
But I can see from the dumps that it's still not optimal, it's just
that all the others are worse.

I still have quite a few leads to follow up on the hashing question.
Next week I hope I'll get time to try crc32 as a hashing function.  I
hope it doesn't win, because I'd need a 1K table to evaluate it, and
that would be 20% of the whole code size.

The patch:

    http://nl.linux.org/~phillips/htree/dx.pcache-2.4.5-2

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [Ext2-devel] [UPDATE] Directory index for ext2
Daniel writes:
> On Thursday 31 May 2001 21:44, Andreas Dilger wrote:
> > I think Al's idea of doing the validation once on the initial
> > read is a good one.
>
> I'm doing that in the current patch, for leaf blocks, look at
> ext2_bread.  For index blocks, ext2_bread needs help to know that a
> block is supposed to be an index block.  Add a parameter?

I think we should just get rid of the misconception that ext2_bread()
is a block read function.  It is only used by directory functions.
Instead we should have separate ext2_bread_leaf(), ext2_bread_root(),
and ext2_bread_node() which do the appropriate validation for each
type of block.

In ext2_bread_dir(), if we really think directory block prealloc is a
win (in addition to the existing in-memory contiguous block prealloc),
we may as well do it each time we split a leaf block, and make the
preallocated blocks valid in-use leaf blocks instead of just wasting
space on disk (i.e. each split block gets 1/N of the existing hash
space, and we distribute the existing entries across all N blocks).
This way we don't have to split each directory block so many times.
For indexed directories this is (probably) a net win because we avoid
N extra block splits (i.e. extra copying of leaf blocks), and make the
leaf search space smaller.  On non-indexed ext2 it would be a net loss
because we would still have to read and search each directory block,
even if they are empty.

> It's normal for it to start by putting all the entries into the first
> two blocks, but after those are split it should be pretty uniform
> across the resulting 4, and so on.  Can you confirm it's unbalanced?

I don't think that is what I was seeing, because the hash block numbers
were not "->1" and "->2" (which would be the case right after a split),
but rather 30's, 40's, etc.

> > Running mongo has shown up another bug, I see, but haven't had a
> > chance to look into yet.  It involves not being able to delete
> > files from an indexed directory:
> >
> >    rm: cannot remove `/mnt/tmp/testdir1-0-0/d0/d1/d2/d3/509.r':
> >    Input/output error
> >
> > This is after the files had been renamed (.r suffix).  Do we re-hash
> > directory entries if the file is renamed?  If not, then that would
> > explain this problem.  It _looks_ like we do the right thing, but
> > the mongo testing wipes out the filesystem after each test, and the
> > above message is from a logfile only.
>
> The rename creates the new entry via ext2_add_entry so the hash must
> be correct.  Time to get out the bug swatter.  I'll get mongo and try
> it.

One other point of information: in the test I was running, it was
always the file "509.r" which had the I/O error (new filesystem each
test run, btw, and no IDE errors in the log).

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
Re: [Ext2-devel] [UPDATE] Directory index for ext2
On Thursday 31 May 2001 21:44, Andreas Dilger wrote:
> Daniel, you write:
> > - Fall back to linear search in case of corrupted index
>
> OK, I have _some_ of the code needed to do this, but in one case I'm
> not sure of what your intention was - in dx_probe() you check for
> "unused_flags & 1" to signal a bad/unsupported index.  Why only check
> the low bit instead of the whole thing?

I guess it really means I've allocated the low bit to mean
'incompatible change here, old code should give up now', so
"unused_flags" is a misnomer.

> I currently have: [code that kmail fsck'd completely in the quote]

I'll incorporate it.

> On lookup it is OK to fall back to linear search, but if we add an
> entry via linear we would then overwrite the root index.  We probably
> want extra logic so that if we have a bad interior node we overwrite
> that (or another leaf instead of killing the root index).  We
> probably also want to make a distinction between I/O errors and bad
> data (currently I just return NULL for both).  I think Al's idea of
> doing the validation once on the initial read is a good one.

I'm doing that in the current patch, for leaf blocks, look at
ext2_bread.  For index blocks, ext2_bread needs help to know that a
block is supposed to be an index block.  Add a parameter?

The checks we're doing now aren't that expensive so I decided to take
the lazy approach and just do them every time.  It's not the same as
the leaf block checks, which otherwise would need to be in the inner
loop.

[I'm still thinking about your comments on magic numbers and hash
versions]

> > - Finalize hash function
>
> I noticed something interesting when running "mongo" with debugging
> on.  It is adding filenames which are only sequential numbers, and
> the hash code is basically adding to only two blocks until those
> blocks are full, at which point (I guess) the blocks split and we
> add to two other blocks.

It's normal for it to start by putting all the entries into the first
two blocks, but after those are split it should be pretty uniform
across the resulting 4, and so on.  Can you confirm it's unbalanced?

I used to have a handy hash bucket-dumping function (dx_show_buckets)
hooked into dir->read, which normally just returns an error.  To get a
dump you'd cat the directory.  A hokey interface, for debugging only,
but convenient for coding and using.  This is gone from the page cache
version and I should hook it back in somehow.

> I will have to recompile and run with debugging on again to get
> actual output.
>
> To me this would _seem_ bad, as it indicates the hash is not
> uniformly distributing the files across the hash space.  However,
> skewed hashing may not necessarily be that bad for performance.  It
> may even be that, because we never have to rebalance the hash index
> structure, as long as we don't get large numbers of identical
> hashes it is just fine if similar filenames produce similar hash
> values.  We just keep splitting the leaf blocks until the hash moves
> over to a different "range".  For a balanced tree-based structure a
> skewed hash would be bad, because you would have to do full-tree
> rebalancing very often then.
>
> > No known bugs, please test, thanks in advance.

So much for that ;-)

> Running mongo has shown up another bug, I see, but haven't had a
> chance to look into yet.  It involves not being able to delete files
> from an indexed directory:
>
>    rm: cannot remove `/mnt/tmp/testdir1-0-0/d0/d1/d2/d3/509.r':
>    Input/output error
>
> This is after the files had been renamed (.r suffix).  Do we re-hash
> directory entries if the file is renamed?  If not, then that would
> explain this problem.  It _looks_ like we do the right thing, but the
> mongo testing wipes out the filesystem after each test, and the above
> message is from a logfile only.

The rename creates the new entry via ext2_add_entry so the hash must be
correct.  Time to get out the bug swatter.  I'll get mongo and try it.

--
Daniel
Re: [Ext2-devel] [UPDATE] Directory index for ext2
Daniel, you write:
> - Fall back to linear search in case of corrupted index

OK, I have _some_ of the code needed to do this, but in one case I'm
not sure of what your intention was - in dx_probe() you check for
"unused_flags & 1" to signal a bad/unsupported index.  Why only check
the low bit instead of the whole thing?  I currently have:

    if (dir->i_size < dir->i_sb->s_blocksize * 2)
        goto fail;      // bad EXT2_INDEX_FL set, clear?
    if (IS_ERR(bh = ext2_bread (dir, 0, 0)))
        goto fail;      // FIXME error message?
    root = (struct dx_root *) bh->b_data;
    // use the following for production once hash_version is 1
    // if ((root->info.hash_version & 3) == 0 || root->info.unused_flags)
    if (root->info.hash_version > 0 || root->info.unused_flags & 1)
        goto fail2;     // unsupported hash format
    if ((indirect = root->info.indirect_levels) > 1)
        goto fail2;     // unsupported hash format
    if (root->info.reserved_zero.v ||
        root->info.info_length < sizeof(root->info))
        goto fail2;     // bad root node data
    ...
    if (dx_get_limit(entries) != dx_root_limit(dir, root->info.info_length))
        goto fail2;     // bad root node data
    ...
    if (dx_get_limit(entries) != dx_node_limit (dir))
        goto fail3;     // bad index node data

On lookup it is OK to fall back to linear search, but if we add an
entry via linear we would then overwrite the root index.  We probably
want extra logic so that if we have a bad interior node we overwrite
that (or another leaf instead of killing the root index).  We probably
also want to make a distinction between I/O errors and bad data
(currently I just return NULL for both).  I think Al's idea of doing
the validation once on the initial read is a good one.

Instead of keeping reserved_zero as zero and using it to detect if an
inode number was written there, we could write a magic number there to
signal a valid hash.  Alternately, instead of using hash_version to
mark the hash type, we could leave that unused and make the above
magic number the hash value, which is the hash of some fixed string,
e.g.:

    HASH_V0 := dx_hack_hash("Daniel Phillips", 15)  // constant value
    HASH_V1 := dx_new_hash("Daniel Phillips", 15)   // constant value

    struct dx_root
    {
        struct fake_dirent dot;
        char dot_name[4];
        struct fake_dirent dotdot;
        char dotdot_name[4];
        struct dx_root_info
        {
            le_u32 hash_magic;
            u8 unused;
            u8 info_length;     /* 8 */
            u8 indirect_levels;
            u8 unused_flags;
        }
        info;
        struct dx_entry entries[0];
    };

    /*
     * Hash value depends on existing hash type, so it is calculated
     * here.  For new directories we never use this function, and
     * simply pick the default hash function when creating the new
     * dx_root.
     */
    static inline dx_frame *dx_probe(inode *dir, dentry *dentry,
                                     u32 *hash, dx_frame *frame)
    {
        struct dx_root *root;
        struct buffer_head *bh;

        if (IS_ERR(bh = ext2_bread (dir, 0, 0)))
            goto fail;  // return error code
        root = (struct dx_root *) bh->b_data;
        switch (le32_to_cpu(root->info.hash_magic.v)) {
        case HASH_V0:
            // hash-specific dx_root validation here
            *hash = dx_hack_hash(dentry->d_name.name, dentry->d_name.len);
            return dx_probe_hash(dir, hash, frame, bh);
        case HASH_V1:
            // hash-specific dx_root validation here
            *hash = dx_new_hash(dentry->d_name.name, dentry->d_name.len);
            return dx_probe_hash(dir, hash, frame, bh);
        default:
            printk("unsupported hash %u in directory %lu\n",
                   root->info.hash_magic.v, dir->i_ino);
            brelse(bh);
            *hash = 0;
        }
    fail:
        return NULL;
    }

> - Finalize hash function

I noticed something interesting when running "mongo" with debugging
on.  It is adding filenames which are only sequential numbers, and the
hash code is basically adding to only two blocks until those blocks
are full, at which point (I guess) the blocks split and we add to two
other blocks.

I haven't looked at it closely, but for _example_ it's something like:

    65531 to block 113
    65532 to block 51
    65533 to block 51
    65534 to block 113
    65535 to block 113
    (repeats)
    65600 to block 21
    65601 to block 96
    65602 to block 96
    65603 to block 21
    65604 to block 21
    (repeats)

I will have to recompile and run with debugging on again to get actual
output.

To me this would _seem_ bad, as it indicates the hash is not uniformly
distributing the files across the hash space.  However, skewed hashing
may not necessarily be that bad for performance.  It may even be that,
because we never have to rebalance the hash index structure, as long
as we don't get large numbers of identical hashes it is just fine if
similar filenames produce similar hash values.  We just keep splitting
the leaf blocks until the hash moves over to a different "range".  For
a balanced tree-based structure a skewed hash would be bad, because
you would have to do full-tree rebalancing very often then.