On Mon, May 2, 2011 at 8:03 AM, AJ ONeal <coola...@gmail.com> wrote:
> Does anyone have a good guesstimate (and perhaps some reference material) as
> to what the sweet spot is for access efficiency of folder depth vs number of
> files per folder?
>
>
> I've got a file database where all files are stored by their md5 such as
> ./a1/d5/83/d2/7e38/f96d/1641dead/f4d89ee2/a1d583d27e38f96d1641deadf4d89ee2.m4a
> However, iterating over the files with this much folder depth is
> significantly slower than a depth of 3.
>
>
> I was thinking that perhaps I could use a scheme more like this:
> ./a1d/583/a1d583d27e38f96d1641deadf4d89ee2.m4a
> in which case each folder will have up to 4096 sub-nodes
> (presently I can't foresee having more than 1 million files in the db)
>
> Or I could convert the md5sum to base36 and use a similar scheme with 1296
> sub-nodes.

Just do the math:

1 million files; we'll call it 2^20 (1,048,576) to keep the math easy.
Also, we'll assume even distribution of the most significant digits in
your md5. If that is not the case, then you may need to recalculate
and modify accordingly.

Two hex digits for the first dir name yields 256 dirs and uses up 8 of
the 20 bits, leaving 2^12 files in each (4096).
Two hex digits for the second dir name yields 256 more in each, using
up 8 bits again, leaving 2^4 in each: just 16 files on average.
Any more dir depth is simply wasteful. A third level would mean 2^24
possible dirs for 2^20 files, so each dir that actually gets created
winds up with about one file on average.
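
If you want to try other numbers, here is a rough sketch of that
arithmetic in Python. The function name and its parameters are made up
for illustration, and it assumes the hash prefix is evenly distributed,
as above.

```python
def avg_files_per_leaf_dir(total_files, hex_digits_per_level, depth):
    """Average files per deepest-level dir, assuming a uniform hash prefix."""
    # Each hex digit is 4 bits, so each level fans out into 16**digits dirs.
    leaf_dirs = 16 ** (hex_digits_per_level * depth)
    return total_files / leaf_dirs


total = 2 ** 20  # roughly 1 million files
for depth in (1, 2, 3):
    print(depth, avg_files_per_leaf_dir(total, 2, depth))
```

That prints 4096.0, 16.0 and 0.0625 for depths 1, 2 and 3. The last
figure is below one file per possible dir, which is why the extra level
buys nothing.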

How much the number of files per dir matters depends on your file
system. Reiser v3 uses your choice of one of, I think, three hashes,
any one of which can handle tens of thousands of files per dir. Ext2
gets slow once you exceed a few thousand because internally it uses a
linear list. I believe Ext3 and above can use a hashed index (the
dir_index feature), but it has to be enabled on the file system, for
example with tune2fs. Either way, 4096 files/dir is probably not a huge
stretch (unless you intend to use ls on it).

If you did 3 hex digits with a single dir depth (4096 dirs), you'd
have just 256 files/dir on average.
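
For what it's worth, building the path for a layout like that could
look something like the sketch below. shard_path and its parameters are
hypothetical names, not anything from your code, and the .m4a default
is only taken from your example.

```python
from pathlib import PurePosixPath


def shard_path(digest, hex_digits_per_level=3, depth=1, ext=".m4a"):
    """Build a relative path like a1d/<digest>.m4a from a hex digest."""
    digest = digest.lower()
    width = hex_digits_per_level
    if len(digest) < width * depth:
        raise ValueError("digest too short for this layout")
    # Slice the leading hex digits into one dir name per level.
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return PurePosixPath(*parts, digest + ext)


print(shard_path("a1d583d27e38f96d1641deadf4d89ee2"))
print(shard_path("a1d583d27e38f96d1641deadf4d89ee2", 2, 2))
```

The first call gives a1d/a1d583d27e38f96d1641deadf4d89ee2.m4a (one level
of 3 hex digits) and the second gives
a1/d5/a1d583d27e38f96d1641deadf4d89ee2.m4a (two levels of 2 hex digits),
so switching layouts is just a matter of changing two numbers.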

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
