Hi, I've been vaguely following DFly for a while, and in response to the comment in the HAMMER update thread about wanting use cases for huge directories, I thought I'd share my experience.
I do research in image processing. Although there are many formats that store video as a stream in one big file, in the kind of research I do it's often easier to work with one file per individual frame, partly because many video formats don't have good support for random access (particularly fast random access), and partly because it generally simplifies the code to have the input in the same format as the output (e.g. doing frame comparisons, running shell scripts, etc., which you can't do on a full video stream). The "results" are often written out as one image per file as well. Obviously this wastes a significant amount of disc space because there's no temporal compression, but it's better for the particular work we do. (We're working on Linux currently because it's a mainstream Unix-y option.)

We go up to about a million files per directory, deliberately splitting stuff up at larger sizes, partly because of OS directory limitations but partly because lots of other things, like tar-ing stuff up, get really tricky. (There's also the problem that we are image researchers rather than Unix gurus, so we tend to do simple things like passing filenames directly from shell globs, which hits the argument-length limits, rather than sophisticated xargs stuff, and many programs can't deal with excessively large numbers of arguments anyhow, which again argues against ultra-huge directories.)

As regards names, the original data is often in the form base_xxxxxx.ext, where xxxxxx is a frame number that increases sequentially, with generally just one sequence per directory. In the output directories there can be several sequences, which often have names like some choice from base1_xxxxxx.ext, base2_xxxxxx_yyyyyy.ext, base3_xxxxxx_yyyyyy_zzzzzz.ext. Some choices for the xxxxxx, yyyyyy, zzzzzz are the pid (when running the same stochastic method on the same dataset at the same time on SMP to check they converge to the same thing), the frame number, and a "sub-frame iteration number" (so you get frame_000001_000001.jpg, ...., frame_000001_000100.jpg for frame 1 before moving on to frame 2). I don't think we've ever had a directory with many files whose names weren't numerically ordered by one or more indexes; they're never just completely unique, unrelated names.

(Incidentally, lots of Unix tools become really annoying with large numbers of files. For example, I've got an inefficient but user-centred script that replaces ls so that it gives output like base1_[000000-097832].jpg base1_[097835-100000].jpg base2_[000000-000999]_[000000-000010].jpg, where human beings can spot problems, rather than screenfuls of stuff that's essentially useless for reading; a rough sketch of the idea is at the end of this mail. And tab completion really should learn that if there are more than 50 possibilities it shouldn't ask whether to show me all of them, just tell me there are too many.)

I don't know how all this affects the directory hashing scheme, and I realise it's not a finished product, but hopefully it's a data point about how large numbers of files get named.

--
cheers, dave tweed

__________________________
[EMAIL PROTECTED]
Rm 124, School of Systems Engineering, University of Reading.
"while having code so boring anyone can maintain it, use Python." -- attempted insult seen on slashdot
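
P.S. For concreteness, here's a minimal sketch of the range-collapsing listing idea mentioned above. It's not the actual script: the filename pattern, the single-index assumption, and the output format are simplifications (it ignores the multi-index base2_xxxxxx_yyyyyy names), but it shows the general shape.

    #!/usr/bin/env python
    # Hypothetical sketch of a range-collapsing "ls" replacement, not the real
    # script described above.  Assumes names of the form <base>_<digits>.<ext>;
    # anything else is printed unchanged.
    import os
    import re
    import sys

    PATTERN = re.compile(r'^(.*_)(\d+)(\.[^.]+)$')

    def collapse(names):
        """Yield summary strings, merging consecutive frame numbers into ranges."""
        groups = {}       # (prefix, suffix, width) -> list of frame numbers
        passthrough = []  # names that don't match the <base>_<digits>.<ext> shape
        for name in names:
            m = PATTERN.match(name)
            if m:
                prefix, digits, suffix = m.groups()
                groups.setdefault((prefix, suffix, len(digits)), []).append(int(digits))
            else:
                passthrough.append(name)
        for (prefix, suffix, width), numbers in sorted(groups.items()):
            numbers.sort()
            start = prev = numbers[0]
            for n in numbers[1:] + [None]:
                if n is not None and n == prev + 1:
                    prev = n
                    continue
                # emit the run that just ended, either as a single name or a range
                if start == prev:
                    yield '%s%0*d%s' % (prefix, width, start, suffix)
                else:
                    yield '%s[%0*d-%0*d]%s' % (prefix, width, start, width, prev, suffix)
                if n is not None:
                    start = prev = n
        for name in sorted(passthrough):
            yield name

    if __name__ == '__main__':
        directory = sys.argv[1] if len(sys.argv) > 1 else '.'
        for line in collapse(os.listdir(directory)):
            print(line)

Run against a directory of frame_000001.jpg ... frame_097832.jpg it prints one line, frame_[000001-097832].jpg, and any gap in the numbering shows up as a break between ranges, which is exactly the sort of problem a human can spot at a glance.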
