Re: Enabling h-trees too early?
> On Thu, Sep 20, 2007 at 06:19:04PM +0200, Jan Kara wrote: > > if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) && > > ((EXT4_I(inode)->i_flags & EXT4_INDEX_FL) || > > ((inode->i_size >> sb->s_blocksize_bits) == 1))) { > > error = ext4_dx_readdir(filp, dirent, filldir); > > if (error != ERR_BAD_DX_DIR) { > > ret = error; > > goto out; > > } > > /* > > * We don't set the inode dirty flag since it's not > > * critical that it get flushed back to the disk. > > */ > > EXT4_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT4_INDEX_FL; > > } > > It calls ext4_dx_readdir() for *every* directory with 1 block (we have > > 1326 of them in the kernel tree). Now ext4_dx_readdir() calls > > ext4_htree_fill_tree() which finds out the directory is not h-tree and > > and calls htree_dirblock_to_tree(). So even for 4KB directories we end up > > deleting inodes in hash order! And as a bonus we burn some cycles building > > trees etc. What is the point of this? > > That was added so we wouldn't get screwed when a directory that was > previously non htree became an htree directory while the directory fd > is open. So the failure case is one where you do opendir(), readdir() > on 25% of the directory, sleep for 2 hours, and in the meantime, 200 > files are added to the directory and it gets converted into a htree > index, causing all of the previously returned readdir() results in > directory order to be completely screwed up now that the directory has > been converted into an htree. (All of the readdir/telldir/seekdir > POSIX requirements cause filesystem designers to tear their hair out.) Oh, yes. Thanks for explanation. > What we would need to do to avoid needing this is to read in the > entire directory leaf page into the rbtree, sorted by inode number, > and then to keep that rbtree for the entire life of the open directory > file descriptor. We would also have to change telldir/seekdir to use > something else as a telldir cookie, and readdir would have to be set > up to *only* use the rbtree, and never look at the on-disk directory. > This would also mean that all of the files created or deleted after > the initial opendir() would never be reflected in results returned by > readdir(), but that's allowed by POSIX. And if we do this for a > single block 4k directory, we might as well do it for a 32k or 64k > HTREE directory as well. Yes, this makes sence... Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Fri, Sep 21, 2007 at 11:02:58AM +0200, Andi Kleen wrote: > I assume you mean sort by inode, because sort by htree key would > be as bad as htrees. > > But wouldn't that break parallel readdir for a directory that just grows > from <32/64K to over it? e.g. if the sort moves already read > entries to after the cursor readdir would return some entries twice. No, it wouldn't break it because after snapshotting the tree, we would only use the snapshot and not refer to the on-disk format until the directory file handle is closed or there is an explicit lseek to 0. To quote from the spec: If a file is removed from or added to the directory after the most recent call to the opendir() or rewinddir() function, whether a subsequent call to the readdir() function returns an entry for that file is unspecified. So, if some program does the stupid thing and opens a directory, doesn't use it for five hours, and in the meantime 2,000 files are created, it's OK if we only return the set of files that were in existence when the opendir() was originally called. This makes us OK in terms of spec conformance, and given that real life programs don't do stuff this stupid, we should be fine. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
Theodore Tso <[EMAIL PROTECTED]> writes: > > Certainly one of the things that we could consider is for small > directories to do an in-memory sort of all of the directory entries at > opendir() time, and keeping that list until it is closed. We can't do > this for really big directories, but we could easily do it for > directories under 32k or 64k. I assume you mean sort by inode, because sort by htree key would be as bad as htrees. But wouldn't that break parallel readdir for a directory that just grows from <32/64K to over it? e.g. if the sort moves already read entries to after the cursor readdir would return some entries twice. I suspect you would need to keep it always sorted after that no matter how big it gets. So the 32/64k boundary seems useless and you would need a sorted potentially partial in memory rbtree anyways. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Thu, Sep 20, 2007 at 06:19:04PM +0200, Jan Kara wrote: > if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) && > ((EXT4_I(inode)->i_flags & EXT4_INDEX_FL) || > ((inode->i_size >> sb->s_blocksize_bits) == 1))) { > error = ext4_dx_readdir(filp, dirent, filldir); > if (error != ERR_BAD_DX_DIR) { > ret = error; > goto out; > } > /* > * We don't set the inode dirty flag since it's not > * critical that it get flushed back to the disk. > */ > EXT4_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT4_INDEX_FL; > } > It calls ext4_dx_readdir() for *every* directory with 1 block (we have > 1326 of them in the kernel tree). Now ext4_dx_readdir() calls > ext4_htree_fill_tree() which finds out the directory is not h-tree and > and calls htree_dirblock_to_tree(). So even for 4KB directories we end up > deleting inodes in hash order! And as a bonus we burn some cycles building > trees etc. What is the point of this? That was added so we wouldn't get screwed when a directory that was previously non htree became an htree directory while the directory fd is open. So the failure case is one where you do opendir(), readdir() on 25% of the directory, sleep for 2 hours, and in the meantime, 200 files are added to the directory and it gets converted into a htree index, causing all of the previously returned readdir() results in directory order to be completely screwed up now that the directory has been converted into an htree. (All of the readdir/telldir/seekdir POSIX requirements cause filesystem designers to tear their hair out.) What we would need to do to avoid needing this is to read in the entire directory leaf page into the rbtree, sorted by inode number, and then to keep that rbtree for the entire life of the open directory file descriptor. We would also have to change telldir/seekdir to use something else as a telldir cookie, and readdir would have to be set up to *only* use the rbtree, and never look at the on-disk directory. This would also mean that all of the files created or deleted after the initial opendir() would never be reflected in results returned by readdir(), but that's allowed by POSIX. And if we do this for a single block 4k directory, we might as well do it for a 32k or 64k HTREE directory as well. Does that make sense? - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Thu 20-09-07 11:14:40, Theodore Tso wrote: > On Thu, Sep 20, 2007 at 04:58:39PM +0200, Jan Kara wrote: > > Hmm, strange - I've just looked at my computer and dir_index is set > > just for 5 directories in my tree. > > I looked at a tree that had object files, which is probably why I had > 8 directories; I'm guessing you probably just had kernel sources and > no build files. > > > If I try deleting just them, I also > > see some performance decrease but it's less than if I try deleting the > > whole tree (and that result seems to be quite consistent)... There's > > something > > fishy there. Maybe I could try seekwatcher or something similar to see > > what's really happening. > > That is very strange. Just a guess: Can't the culprit be the following test in ext3/4_readdir()? if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) && ((EXT4_I(inode)->i_flags & EXT4_INDEX_FL) || ((inode->i_size >> sb->s_blocksize_bits) == 1))) { error = ext4_dx_readdir(filp, dirent, filldir); if (error != ERR_BAD_DX_DIR) { ret = error; goto out; } /* * We don't set the inode dirty flag since it's not * critical that it get flushed back to the disk. */ EXT4_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT4_INDEX_FL; } It calls ext4_dx_readdir() for *every* directory with 1 block (we have 1326 of them in the kernel tree). Now ext4_dx_readdir() calls ext4_htree_fill_tree() which finds out the directory is not h-tree and and calls htree_dirblock_to_tree(). So even for 4KB directories we end up deleting inodes in hash order! And as a bonus we burn some cycles building trees etc. What is the point of this? Honza -- Jan Kara <[EMAIL PROTECTED]> SUSE Labs, CR - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Thu, Sep 20, 2007 at 04:58:39PM +0200, Jan Kara wrote: > Hmm, strange - I've just looked at my computer and dir_index is set > just for 5 directories in my tree. I looked at a tree that had object files, which is probably why I had 8 directories; I'm guessing you probably just had kernel sources and no build files. > If I try deleting just them, I also > see some performance decrease but it's less than if I try deleting the > whole tree (and that result seems to be quite consistent)... There's something > fishy there. Maybe I could try seekwatcher or something similar to see > what's really happening. That is very strange. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
> On Thu, Sep 20, 2007 at 03:33:50PM +0200, Jan Kara wrote: > > So for example deleting kernel tree on my computer takes ~14 seconds with > > h-trees and less than 9 without them. Also doing 'cp -lr' of the kernel > > tree takes 8 seconds with h-trees and 6.3s without them... So I think the > > performance difference is quite measurable. > > This is in a completely cold cache state? (i.e. mounting and > unmounting the filesystem before doing the rm -rf?) Yes. > On my kernel tree, using the command: "lsattr -R | grep -- -I-" shows > that only 8 directories are htree indexed, and they're not that big: > > 12 drwxr-xr-x 12 tytso tytso 12288 2007-09-14 16:25 ./drivers/char > 24 drwxr-xr-x 30 tytso tytso 24576 2007-09-14 16:25 ./drivers/net > 20 drwxr-xr-x 2 tytso tytso 20480 2007-09-14 16:25 ./drivers/usb/serial > 32 drwxr-xr-x 24 tytso tytso 32768 2007-09-14 16:10 ./include/linux > 12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/bridge/netfilter > 24 drwxr-xr-x 2 tytso tytso 24576 2007-09-14 16:25 ./net/ipv4/netfilter > 12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/ipv6/netfilter > 32 drwxr-xr-x 2 tytso tytso 32768 2007-09-14 16:25 ./net/netfilter > > ... which means if the benchmark only focused on deleting these files, > then presumably the percentage increase would be even worse. Hmm, strange - I've just looked at my computer and dir_index is set just for 5 directories in my tree. If I try deleting just them, I also see some performance decrease but it's less than if I try deleting the whole tree (and that result seems to be quite consistent)... There's something fishy there. Maybe I could try seekwatcher or something similar to see what's really happening. > > > Certainly one of the things that we could consider is for small > > > directories to do an in-memory sort of all of the directory entries at > > > opendir() time, and keeping that list until it is closed. We can't do > > > this for really big directories, but we could easily do it for > > > directories under 32k or 64k. > > > > Umm, yes. That would be probably feasible. But converting to htrees only > > when directories grow larger would avoid the problem also. It also does not > > seem *that* hard but maybe I miss some nasty details... > > The reason why I mentioned the caching idea is we already have code to > manage and return directories stored in an rbtree in the kernel, > albeit for a slightly different purpose. So hacking it up to cache > all of the directory entries for directories < 64k and to index them > by inode number instead of hash key would be pretty easy. > > What's nasty about converting to htrees after the directories become > larger is that we need to reserve extra space in the journal for each > block that we need to modify, and then just the fact that we have to > keep track of the multiple buffers. Basically, not impossible but > just a pain in the *ss. I see :). Honza -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Thu, Sep 20, 2007 at 03:33:50PM +0200, Jan Kara wrote: > So for example deleting kernel tree on my computer takes ~14 seconds with > h-trees and less than 9 without them. Also doing 'cp -lr' of the kernel > tree takes 8 seconds with h-trees and 6.3s without them... So I think the > performance difference is quite measurable. This is in a completely cold cache state? (i.e. mounting and unmounting the filesystem before doing the rm -rf?) On my kernel tree, using the command: "lsattr -R | grep -- -I-" shows that only 8 directories are htree indexed, and they're not that big: 12 drwxr-xr-x 12 tytso tytso 12288 2007-09-14 16:25 ./drivers/char 24 drwxr-xr-x 30 tytso tytso 24576 2007-09-14 16:25 ./drivers/net 20 drwxr-xr-x 2 tytso tytso 20480 2007-09-14 16:25 ./drivers/usb/serial 32 drwxr-xr-x 24 tytso tytso 32768 2007-09-14 16:10 ./include/linux 12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/bridge/netfilter 24 drwxr-xr-x 2 tytso tytso 24576 2007-09-14 16:25 ./net/ipv4/netfilter 12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/ipv6/netfilter 32 drwxr-xr-x 2 tytso tytso 32768 2007-09-14 16:25 ./net/netfilter ... which means if the benchmark only focused on deleting these files, then presumably the percentage increase would be even worse. > > Certainly one of the things that we could consider is for small > > directories to do an in-memory sort of all of the directory entries at > > opendir() time, and keeping that list until it is closed. We can't do > > this for really big directories, but we could easily do it for > > directories under 32k or 64k. > > Umm, yes. That would be probably feasible. But converting to htrees only > when directories grow larger would avoid the problem also. It also does not > seem *that* hard but maybe I miss some nasty details... The reason why I mentioned the caching idea is we already have code to manage and return directories stored in an rbtree in the kernel, albeit for a slightly different purpose. So hacking it up to cache all of the directory entries for directories < 64k and to index them by inode number instead of hash key would be pretty easy. What's nasty about converting to htrees after the directories become larger is that we need to reserve extra space in the journal for each block that we need to modify, and then just the fact that we have to keep track of the multiple buffers. Basically, not impossible but just a pain in the *ss. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Wed 19-09-07 14:24:50, Theodore Tso wrote: > On Wed, Sep 19, 2007 at 05:07:15PM +0200, Jan Kara wrote: > > > > I was just wondering: Currently we start to build h-tree in a directory > > already when the size of directory exceeds one block. But honestly, it does > > not seem to make much sence to use this feature until the directory is much > > larger (I'd say at least 16 or 32 KB). It actually slows down some > > operations like deleting the whole directory, etc. So what is the reason > > for starting building the tree so early? Just the simplicity of building it > > when the directory is just one block large? > > How much is it slowing down operations such as rm -rf? For a small > directory (< 32k), I would assume that the difference would be > relatively small. What sort of differences have you measured, and is > this a common case problem? So for example deleting kernel tree on my computer takes ~14 seconds with h-trees and less than 9 without them. Also doing 'cp -lr' of the kernel tree takes 8 seconds with h-trees and 6.3s without them... So I think the performance difference is quite measurable. > Certainly one of the things that we could consider is for small > directories to do an in-memory sort of all of the directory entries at > opendir() time, and keeping that list until it is closed. We can't do > this for really big directories, but we could easily do it for > directories under 32k or 64k. Umm, yes. That would be probably feasible. But converting to htrees only when directories grow larger would avoid the problem also. It also does not seem *that* hard but maybe I miss some nasty details... Honza -- Jan Kara <[EMAIL PROTECTED]> SUSE Labs, CR - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Wed, Sep 19, 2007 at 05:07:15PM +0200, Jan Kara wrote: > > I was just wondering: Currently we start to build h-tree in a directory > already when the size of directory exceeds one block. But honestly, it does > not seem to make much sence to use this feature until the directory is much > larger (I'd say at least 16 or 32 KB). It actually slows down some > operations like deleting the whole directory, etc. So what is the reason > for starting building the tree so early? Just the simplicity of building it > when the directory is just one block large? How much is it slowing down operations such as rm -rf? For a small directory (< 32k), I would assume that the difference would be relatively small. What sort of differences have you measured, and is this a common case problem? Certainly one of the things that we could consider is for small directories to do an in-memory sort of all of the directory entries at opendir() time, and keeping that list until it is closed. We can't do this for really big directories, but we could easily do it for directories under 32k or 64k. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enabling h-trees too early?
On Sep 19, 2007 17:07 +0200, Jan Kara wrote: > I was just wondering: Currently we start to build h-tree in a directory > already when the size of directory exceeds one block. But honestly, it does > not seem to make much sence to use this feature until the directory is much > larger (I'd say at least 16 or 32 KB). It actually slows down some > operations like deleting the whole directory, etc. So what is the reason > for starting building the tree so early? Just the simplicity of building it > when the directory is just one block large? Yes, doing it at one block is easy, doing it with more blocks is complex. At the time we added this feature, tests showed no performance difference between 2 unhashed blocks and 2 hashed blocks so the easier code was used. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Enabling h-trees too early?
Hi, I was just wondering: Currently we start to build h-tree in a directory already when the size of directory exceeds one block. But honestly, it does not seem to make much sence to use this feature until the directory is much larger (I'd say at least 16 or 32 KB). It actually slows down some operations like deleting the whole directory, etc. So what is the reason for starting building the tree so early? Just the simplicity of building it when the directory is just one block large? Honza -- Jan Kara <[EMAIL PROTECTED]> SUSE Labs, CR - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html