Re: [RFC] Ext2 Directory Index - File Structure
>Daniel Phillips wrote: > > Jamie Lokier wrote: > > > Daniel Phillips wrote: > > > > ls already can't handle the directories I'm working with on a regular > > > > basis. It's broken and needs to be fixed. A merge sort using log n > > > > temporary files is not hard to write. > > > > > > ls -U | sort > > > > > > should do the trick. > > > > Um, yep. Now ls should do that itself instead of giving up with an error. > > Sorting 1 million 30-character strings does not require temporary files > on a machine with > 35 MB anyway, and that can be virtual, so if anyone > cares about ls sorting huge directories I suggest improving the > in-memory sort. I got this today: ls -U ls: Memory exhausted Since this occured while nothing else was active, it's probably a MM bug and I will chase it further. However, it also shows that ls is well and truly borked. If anybody is going to work on it I'd suggest not only improving the in-memory sort but adding a custom file-based sort or exec sort as a fallback. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Ext2 Directory Index - File Structure
Daniel Phillips wrote: > Jamie Lokier wrote: > > Daniel Phillips wrote: > > > ls already can't handle the directories I'm working with on a regular > > > basis. It's broken and needs to be fixed. A merge sort using log n > > > temporary files is not hard to write. > > > > ls -U | sort > > > > should do the trick. > > Um, yep. Now ls should do that itself instead of giving up with an error. Sorting 1 million 30-character strings does not require temporary files on a machine with > 35 MB anyway, and that can be virtual, so if anyone cares about ls sorting huge directories I suggest improving the in-memory sort. cheers, -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Ext2 Directory Index - File Structure
Jamie Lokier wrote: > Daniel Phillips wrote: > > ls already can't handle the directories I'm working with on a regular > > basis. It's broken and needs to be fixed. A merge sort using log n > > temporary files is not hard to write. > > ls -U | sort > > should do the trick. Um, yep. Now ls should do that itself instead of giving up with an error. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Ext2 Directory Index - File Structure
Daniel Phillips wrote: > ls already can't handle the directories I'm working with on a regular > basis. It's broken and needs to be fixed. A merge sort using log n > temporary files is not hard to write. ls -U | sort should do the trick. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Ext2 Directory Index - File Structure
On Tuesday 10 April 2001 18:37, you wrote: > Daniel Phillips writes: > > The zeroth block of an indexed directory is the index root. Initially > > the index has only one block. The following blocks are normal ext2 > > directory entry blocks. When the directory grows large enough to fill > > all the available entries in the root index block (around 80-90,000 > > entries on a 4K blocksize filesystem) a second level is added to the > > index tree in the form of an internal index block appended to the > > directory. As the directory expands, new index blocks are appended as > > needed so that the directory consists of normal directory blocks with > > index nodes interspersed every 200 blocks or so. > > It looks like you end up jumping back and forth to read the > index blocks. Only after the index has grown past 80-90,000 entries, and only when you hit the directory once in isolation so that the root index node isn't in cache - a concurrence of events likely only under light load, and then you don't care about the slight extra head movement. I think you'd have a hard time measuring the performance impact. > (but maybe I need more sleep) It might be better > to allocate 1, 2, 4, 8, ... index blocks at once, instead of > always allocating just one. Other than the requirement that the root be in the 0th block the allocation of index blocks is completely unrestricted. If there turns out to be a better way to lay it out, allocation policy can be changed at any time without losing compatibility with preexisting directories. > > The high four bits of the block pointer field are reserved for use by > > a coalesce-on-delete feature which may be implemented in the future. > > The remaining 28 bits are capable of indexing up to 1TB per directory > > file. (Perhaps I should coopt 8 bits instead of 4.) > > Doing a 1 TB directory means you must give up on i_size, which is > too small. You may instead calculate what you need from the block count. > If you don't give up on i_size, you might as well coopt 11 bits. Directories of this size are pretty theoretical at this point, but who knows? Suppose Yahoo wants to have one file for each freemail account and have them all in one directory. I guess they'd already need tens of millions of entries. One thing you can be sure of is that you'll always be surprised, looking back, at just how much you underestimated the amount of storage that people will use. Platitudes aside, I'll be happy with 8 bits reserved, probably more than needed to support an efficient coalescing policy. This limits directory size to 64 gig with 4K blocks, and I'm not going to be the one to say that nobody would ever want one that big. i_size is a separate problem. It's been expanded in the VFS and Ext2 for regular files but not for directories. See Andreas Dilger's comment re reuse of the dir acl field in another thread. > Oh, just grab 12 or 16 bits. It isn't at all OK to make directories > that are pretty much impossible to read on a 32-bit system. Think > about what /bin/ls must do to sort a 1 TB directory. ls already can't handle the directories I'm working with on a regular basis. It's broken and needs to be fixed. A merge sort using log n temporary files is not hard to write. > > The first kind of forward compatibility is addressed by hiding all the > > new index structures inside what appears to earlier versions of Ext2 to > > be free space. This is accomplished by placing an empty Ext2 dirent > > structure at the beginning of each index node which marks the entire > > block as empty, from the point of view of non-index-aware versions of > > Ext2. > > Well, it looks better than vfat. Next you will be wanting to > increase the inode size and switch to 64-bit block numbers. > You could write such a wonderful NEW filesystem. Thanks. We haven't quite filled up our 128 byte inodes yet, at least if we treat the unused fragment fields as available. As far as the total Ext2 filesystem capacity goes, we can increase it by increasing the block size, up to 64K (and then the only limit we hit is the 16 bit fields in the directory blocks). The only thing that degenerates seriously is the internal fragmentation, a problem I've done some work on as you know. With 64K blocks we get a 64 TB filesystem size and by the time we hit that limit (we *will* hit it) I guess we'll have had time to figure out what to do about it. Getting back to mundane reality. Lots of users, particularly ISPs, are suffering now from poor performance of Ext2's directory operations and have been for some time. It's a safe bet that most of them would prefer that we quietly solve such problems instead of making them reformat their disks. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Ext2 Directory Index - File Structure
cat calahan.reply.txt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Ext2 Directory Index - File Structure
Daniel Phillips writes: > The zeroth block of an indexed directory is the index root. Initially > the index has only one block. The following blocks are normal ext2 > directory entry blocks. When the directory grows large enough to fill > all the available entries in the root index block (around 80-90,000 > entries on a 4K blocksize filesystem) a second level is added to the > index tree in the form of an internal index block appended to the > directory. As the directory expands, new index blocks are appended as > needed so that the directory consists of normal directory blocks with > index nodes interspersed every 200 blocks or so. It looks like you end up jumping back and forth to read the index blocks. (but maybe I need more sleep) It might be better to allocate 1, 2, 4, 8, ... index blocks at once, instead of always allocating just one. > The high four bits of the block pointer field are reserved for use by > a coalesce-on-delete feature which may be implemented in the future. > The remaining 28 bits are capable of indexing up to 1TB per directory > file. (Perhaps I should coopt 8 bits instead of 4.) Doing a 1 TB directory means you must give up on i_size, which is too small. You may instead calculate what you need from the block count. If you don't give up on i_size, you might as well coopt 11 bits. Oh, just grab 12 or 16 bits. It isn't at all OK to make directories that are pretty much impossible to read on a 32-bit system. Think about what /bin/ls must do to sort a 1 TB directory. > The first kind of forward compatibility is addressed by hiding all the > new index structures inside what appears to earlier versions of Ext2 to > be free space. This is accomplished by placing an empty Ext2 dirent > structure at the beginning of each index node which marks the entire > block as empty, from the point of view of non-index-aware versions of > Ext2. Well, it looks better than vfat. Next you will be wanting to increase the inode size and switch to 64-bit block numbers. You could write such a wonderful NEW filesystem. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Ext2 Directory Index - File Structure
Daniel, you write: > For the past several weeks I have been developing a directory index > facility for Ext2, with good results so far. This note describes the > on-disk format of the new index. Finally starting to test your last release, and you make a new one... ;-) > Needless to say, the new code no longer clears this bit but uses it as > it was intended, to flag the presence of an index on a directory. If a > partition has been mounted with the -o index[2] option enabled then each > new directory is created with the EXT2_INDEX_FL set. Are you going to go to a COMPAT flag before final release? This is pretty much needed for e2fsck to be able to detect/correct indexes. > struct dx_root > { > struct dx_root_info > { > struct fake_dirent fake1; > char dot1[4]; > struct fake_dirent fake2; > char dot2[4]; > u8 info_length; /* 8 */ > u8 hash_version; /* 0 */ > u8 indirect_levels > u8 unused_flags; > le_u32 reserved_zero; > } > info; > struct dx_entry entries[0]; > }; One thing I was thinking about the root entry - for compatibility with 2.0 and earlier kernels (which don't clear INDEX_FL) there is a simple solution: - deleted dirents do not pose any problems, because a hash index is no guarantee that a specific dirent exists, only a range of possible entries - adding a dirent will _always_ go into the first empty block (which is the index root block), and overwrite the dx_root_info data after "." and ".." In the latter case (and also in the case of data corruption), it is useful to have some sort of integrity check within dx_root_info. If a dirent overwrote the root index block, the first 4 bytes of the dirent after "." and ".." would be an inode number, this could be anything and is not very useful for detecting if it is a valid number or not. However, if you put the "reserved_zero" field first and swap the order of "hash_version" and "info_length", then "info_length" and "hash_version" will overlap with the "rec_len" field of a new dirent. If (hash_version & 3) is always true, a valid "rec_len" is always a multiple of 4, so it is possible to detect if the dx_root_info has been overwritten by an old version of ext2 code. So: struct dx_root_info { struct fake_dirent fake1; char dot1[4]; struct fake_dirent fake2; char dot2[4]; /* __u32 inode; */ le_u32 reserved_zero; /* __u16 rec_len lo */ u8 hash_version; /* 1, or 2, or 3, but not 0 or 4 */ /* __u16 rec_len hi */ u8 info_length; /* 8 */ u8 indirect_levels u8 unused_flags; } info; It will also be easy to detect if reserved_zero is changed, because zero is not a valid inode number. We can't rely on fake2's rec_len in case a dir entry was added and then deleted (corrupting dx_root_info, but leaving fake1 and fake2 as they should be). > The high four bits of the block pointer field are reserved for use by > a coalesce-on-delete feature which may be implemented in the future. > The remaining 28 bits are capable of indexing up to 1TB per directory > file. (Perhaps I should coopt 8 bits instead of 4.) Might be worthwhile, even if you don't use all of the bits for coalesce flags. Nothing better than having free space for future expansion. Even for a 1kB block filesystem, this would give us directories up to 16GB, at worst case about 32M 256-character entries, average case 595M 9-to-12-character names. Having 24 bits for 4kB block (64GB) directories gives us worst-case 117M 256-character entries, and average case 2.2B 9-to-12 character names. Assumes 50% full leaf blocks for worst case, 70% full for average case. For that matter, I'm not even sure if ext2 supports directory files larger than 2GB? Anyone? I'm not 100% sure, but it may be that e2fsck considers directories >2GB as invalid. > Thus, directory leafs will average no more than 75% full and typically > a few points less than that since the hash function never produces a > perfectly uniform distribution. (The better the hash function, the > closer we get to 75%; we are currently at 71%.) Have you been testing this with different kinds of input for filenames, or only synthetic input data? One interesting dataset would be to get the filename generation code from sendmail (for /var/spool/mqueue) and/or inn and/or squid to see how those names are hashed. These are legitimate applications which tend to store lots of files in a single directory. > It is quite likely that the hash function will be improved in the > future, therefore a hash function version field is included in the root > header. If the current code se