Re: [RFC] Ext2 Directory Index - File Structure

2001-04-15 Thread Daniel Phillips

>Daniel Phillips wrote:
> > Jamie Lokier wrote:
> > > Daniel Phillips wrote:
> > > > ls already can't handle the directories I'm working with on a regular
> > > > basis.  It's broken and needs to be fixed.  A merge sort using log n
> > > > temporary files is not hard to write.
> > >
> > > ls -U | sort
> > >
> > > should do the trick.
> >
> > Um, yep.  Now ls should do that itself instead of giving up with an error.
>
> Sorting 1 million 30-character strings does not require temporary files
> on a machine with > 35 MB anyway, and that can be virtual, so if anyone
> cares about ls sorting huge directories I suggest improving the
> in-memory sort.

I got this today:

ls -U  
ls: Memory exhausted

Since this occured while nothing else was active, it's probably a MM bug
and I will chase it further.  However, it also shows that ls is well and
truly borked.  If anybody is going to work on it I'd suggest not only
improving the in-memory sort but adding a custom file-based sort or exec
sort as a fallback.
--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Ext2 Directory Index - File Structure

2001-04-13 Thread Jamie Lokier

Daniel Phillips wrote:
> Jamie Lokier wrote:
> > Daniel Phillips wrote:
> > > ls already can't handle the directories I'm working with on a regular
> > > basis.  It's broken and needs to be fixed.  A merge sort using log n
> > > temporary files is not hard to write.
> > 
> > ls -U | sort
> > 
> > should do the trick.
>  
> Um, yep.  Now ls should do that itself instead of giving up with an error.

Sorting 1 million 30-character strings does not require temporary files
on a machine with > 35 MB anyway, and that can be virtual, so if anyone
cares about ls sorting huge directories I suggest improving the
in-memory sort.

cheers,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Ext2 Directory Index - File Structure

2001-04-13 Thread Daniel Phillips

Jamie Lokier wrote:
> Daniel Phillips wrote:
> > ls already can't handle the directories I'm working with on a regular
> > basis.  It's broken and needs to be fixed.  A merge sort using log n
> > temporary files is not hard to write.
> 
> ls -U | sort
> 
> should do the trick.
 
Um, yep.  Now ls should do that itself instead of giving up with an error.
--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Ext2 Directory Index - File Structure

2001-04-11 Thread Jamie Lokier

Daniel Phillips wrote:
> ls already can't handle the directories I'm working with on a regular
> basis.  It's broken and needs to be fixed.  A merge sort using log n
> temporary files is not hard to write.

ls -U | sort

should do the trick.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Ext2 Directory Index - File Structure

2001-04-11 Thread Daniel Phillips

On Tuesday 10 April 2001 18:37, you wrote:
> Daniel Phillips writes:
> > The zeroth block of an indexed directory is the index root.  Initially
> > the index has only one block.  The following blocks are normal ext2
> > directory entry blocks.  When the directory grows large enough to fill
> > all the available entries in the root index block (around 80-90,000
> > entries on a 4K blocksize filesystem) a second level is added to the
> > index tree in the form of an internal index block appended to the
> > directory.  As the directory expands, new index blocks are appended as
> > needed so that the directory consists of normal directory blocks with
> > index nodes interspersed every 200 blocks or so.
>
> It looks like you end up jumping back and forth to read the
> index blocks.

Only after the index has grown past 80-90,000 entries, and only when you
hit the directory once in isolation so that the root index node isn't
in cache - a concurrence of events likely only under light load, and
then you don't care about the slight extra head movement.  I think you'd
have a hard time measuring the performance impact. 

> (but maybe I need more sleep) It might be better
> to allocate 1, 2, 4, 8, ... index blocks at once, instead of
> always allocating just one.

Other than the requirement that the root be in the 0th block the
allocation of index blocks is completely unrestricted.  If there turns
out to be a better way to lay it out, allocation policy can be changed
at any time without losing compatibility with preexisting directories.

> > The high four bits of the block pointer field are reserved for use by
> > a coalesce-on-delete feature which may be implemented in the future.
> > The remaining 28 bits are capable of indexing up to 1TB per directory
> > file.  (Perhaps I should coopt 8 bits instead of 4.)
>
> Doing a 1 TB directory means you must give up on i_size, which is
> too small. You may instead calculate what you need from the block count.
> If you don't give up on i_size, you might as well coopt 11 bits.

Directories of this size are pretty theoretical at this point, but who
knows?  Suppose Yahoo wants to have one file for each freemail account
and have them all in one directory.  I guess they'd already need tens of
millions of entries.  One thing you can be sure of is that you'll always
be surprised, looking back, at just how much you underestimated the
amount of storage that people will use.

Platitudes aside, I'll be happy with 8 bits reserved, probably more than
needed to support an efficient coalescing policy.  This limits directory
size to 64 gig with 4K blocks, and I'm not going to be the one to say
that nobody would ever want one that big.

i_size is a separate problem.  It's been expanded in the VFS and Ext2
for regular files but not for directories.  See Andreas Dilger's comment
re reuse of the dir acl field in another thread.

> Oh, just grab 12 or 16 bits. It isn't at all OK to make directories
> that are pretty much impossible to read on a 32-bit system. Think
> about what /bin/ls must do to sort a 1 TB directory.

ls already can't handle the directories I'm working with on a regular
basis.  It's broken and needs to be fixed.  A merge sort using log n
temporary files is not hard to write.

> > The first kind of forward compatibility is addressed by hiding all the
> > new index structures inside what appears to earlier versions of Ext2 to
> > be free space.  This is accomplished by placing an empty Ext2 dirent
> > structure at the beginning of each index node which marks the entire
> > block as empty, from the point of view of non-index-aware versions of
> > Ext2.
>
> Well, it looks better than vfat. Next you will be wanting to
> increase the inode size and switch to 64-bit block numbers.
> You could write such a wonderful NEW filesystem.

Thanks.  We haven't quite filled up our 128 byte inodes yet, at least if
we treat the unused fragment fields as available.  As far as the total
Ext2 filesystem capacity goes, we can increase it by increasing the
block size, up to 64K (and then the only limit we hit is the 16 bit
fields in the directory blocks).  The only thing that degenerates
seriously is the internal fragmentation, a problem I've done some work
on as you know.  With 64K blocks we get a 64 TB filesystem size and by
the time we hit that limit (we *will* hit it) I guess we'll have had
time to figure out what to do about it.

Getting back to mundane reality.  Lots of users, particularly ISPs, are
suffering now from poor performance of Ext2's directory operations and
have been for some time.  It's a safe bet that most of them would prefer
that we quietly solve such problems instead of making them reformat
their disks.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Ext2 Directory Index - File Structure

2001-04-11 Thread Daniel Phillips

cat calahan.reply.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Ext2 Directory Index - File Structure

2001-04-10 Thread Albert D. Cahalan

Daniel Phillips writes:

> The zeroth block of an indexed directory is the index root.  Initially
> the index has only one block.  The following blocks are normal ext2
> directory entry blocks.  When the directory grows large enough to fill
> all the available entries in the root index block (around 80-90,000
> entries on a 4K blocksize filesystem) a second level is added to the
> index tree in the form of an internal index block appended to the
> directory.  As the directory expands, new index blocks are appended as
> needed so that the directory consists of normal directory blocks with
> index nodes interspersed every 200 blocks or so.

It looks like you end up jumping back and forth to read the
index blocks. (but maybe I need more sleep) It might be better
to allocate 1, 2, 4, 8, ... index blocks at once, instead of
always allocating just one.

> The high four bits of the block pointer field are reserved for use by
> a coalesce-on-delete feature which may be implemented in the future. 
> The remaining 28 bits are capable of indexing up to 1TB per directory
> file.  (Perhaps I should coopt 8 bits instead of 4.)

Doing a 1 TB directory means you must give up on i_size, which is
too small. You may instead calculate what you need from the block count.
If you don't give up on i_size, you might as well coopt 11 bits.

Oh, just grab 12 or 16 bits. It isn't at all OK to make directories
that are pretty much impossible to read on a 32-bit system. Think
about what /bin/ls must do to sort a 1 TB directory.

> The first kind of forward compatibility is addressed by hiding all the
> new index structures inside what appears to earlier versions of Ext2 to
> be free space.  This is accomplished by placing an empty Ext2 dirent
> structure at the beginning of each index node which marks the entire
> block as empty, from the point of view of non-index-aware versions of
> Ext2.

Well, it looks better than vfat. Next you will be wanting to
increase the inode size and switch to 64-bit block numbers.
You could write such a wonderful NEW filesystem.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Ext2 Directory Index - File Structure

2001-04-09 Thread Andreas Dilger

Daniel, you write:
> For the past several weeks I have been developing a directory index
> facility for Ext2, with good results so far.  This note describes the
> on-disk format of the new index.

Finally starting to test your last release, and you make a new one... ;-)

> Needless to say, the new code no longer clears this bit but uses it as
> it was intended, to flag the presence of an index on a directory.  If a
> partition has been mounted with the -o index[2] option enabled then each
> new directory is created with the EXT2_INDEX_FL set.

Are you going to go to a COMPAT flag before final release?  This is
pretty much needed for e2fsck to be able to detect/correct indexes.

>   struct dx_root
>   {
>   struct dx_root_info
>   {
>   struct fake_dirent fake1;
>   char dot1[4];
>   struct fake_dirent fake2;
>   char dot2[4];
>   u8 info_length; /* 8 */
>   u8 hash_version; /* 0 */
>   u8 indirect_levels
>   u8 unused_flags;
>   le_u32 reserved_zero;
>   }
>   info;
>   struct dx_entry entries[0];
>   };

One thing I was thinking about the root entry - for compatibility with 2.0
and earlier kernels (which don't clear INDEX_FL) there is a simple solution:

- deleted dirents do not pose any problems, because a hash index is no
  guarantee that a specific dirent exists, only a range of possible entries
- adding a dirent will _always_ go into the first empty block (which is the
  index root block), and overwrite the dx_root_info data after "." and ".."

In the latter case (and also in the case of data corruption), it is useful
to have some sort of integrity check within dx_root_info.  If a dirent
overwrote the root index block, the first 4 bytes of the dirent after
"." and ".." would be an inode number, this could be anything and is not
very useful for detecting if it is a valid number or not.  However, if you
put the "reserved_zero" field first and swap the order of "hash_version"
and "info_length", then "info_length" and "hash_version" will overlap with
the "rec_len" field of a new dirent.  If (hash_version & 3) is always true,
a valid "rec_len" is always a multiple of 4, so it is possible to detect if
the dx_root_info has been overwritten by an old version of ext2 code.  So:

struct dx_root_info {
struct fake_dirent fake1;
char dot1[4];
struct fake_dirent fake2;
char dot2[4];
/* __u32 inode; */  le_u32 reserved_zero;
/* __u16 rec_len lo */  u8 hash_version; /* 1, or 2, or 3, but not 0 or 4 */
/* __u16 rec_len hi */  u8 info_length;  /* 8 */
u8 indirect_levels
u8 unused_flags;
} info;

It will also be easy to detect if reserved_zero is changed, because zero is
not a valid inode number.  We can't rely on fake2's rec_len in case a dir
entry was added and then deleted (corrupting dx_root_info, but leaving
fake1 and fake2 as they should be).

> The high four bits of the block pointer field are reserved for use by
> a coalesce-on-delete feature which may be implemented in the future. 
> The remaining 28 bits are capable of indexing up to 1TB per directory
> file.  (Perhaps I should coopt 8 bits instead of 4.)

Might be worthwhile, even if you don't use all of the bits for coalesce
flags.  Nothing better than having free space for future expansion.  Even
for a 1kB block filesystem, this would give us directories up to 16GB, at
worst case about 32M 256-character entries, average case 595M 9-to-12-character
names.  Having 24 bits for 4kB block (64GB) directories gives us worst-case
117M 256-character entries, and average case 2.2B 9-to-12 character names.
Assumes 50% full leaf blocks for worst case, 70% full for average case.

For that matter, I'm not even sure if ext2 supports directory files larger
than 2GB?  Anyone?  I'm not 100% sure, but it may be that e2fsck considers
directories >2GB as invalid.

> Thus, directory leafs will average no more than 75% full and typically
> a few points less than that since the hash function never produces a
> perfectly uniform distribution.  (The better the hash function, the
> closer we get to 75%; we are currently at 71%.) 

Have you been testing this with different kinds of input for filenames,
or only synthetic input data?  One interesting dataset would be to get
the filename generation code from sendmail (for /var/spool/mqueue) and/or
inn and/or squid to see how those names are hashed.  These are legitimate
applications which tend to store lots of files in a single directory.

> It is quite likely that the hash function will be improved in the
> future, therefore a hash function version field is included in the root
> header.  If the current code se