Matthew Dillon wrote: > 64-bit directory hash encoding (for smaller filenames out of bounds > indices just store a 0). > > aaaaa name[0] & 0x1F > bbbbb name[1] & 0x1F > ccccc name[2] & 0x1F > mmmmmm crc32(name + 3, len - 5) and some xor magic -> 6 bits > yyyyy name[len-2] & 0x1F > zzzzz name[len-1] & 0x1F > h[31] crc32(name, len) (entire filename) > > 0aaaaabbbbbccccc mmmmmmyyyyyzzzzz hhhhhhhhhhhhhhhh hhhhhhhhhhhhhhh0 > [...] > There are several issues but most of the trade-offs seem to be ok. The > one issue I am looking at closely is perhaps to try to do a better job > sorting files that have both a prefix and a postfix. In the above scheme > any large prefix or postfix covers the a[5], b[5], c[5], y[5], and z[5] > bits and devolves us back to an effectively random hash code.
You already mentioned it. That's exactly the problem that I'm seeing ... I'm not sure whether a[], b[], c[], y[] and z[] buy you anything in practice. If a single directory contains a huge number of files, it is likely they are all of the same type, e.g. it could be a collection of images or whatever. That means they all have the same extension (e.g. .jpg), so y[] and z[] are useless. Furthermore, it isn't completely unlikely that they even begin with the same prefix. For example, all of my digital camera pics are named "img%05d.jpg". Admittedly those aren't millions (but more than 10k anyway), and I'm not stupid enough to collect them in a single directory. ;-) Another example: The cache directory of my Opera browser. It contains several thousands of files all beginning with "opr*". It might be a good idea to make a small survey, i.e. find people who actually _do_ have directories with a huge number of files in them (and I mean more than just a few thousands), and ask them what the filenames typically look like. An obvious improvement would be to store name[d-2] and name[d-1] in y[] and z[], respectively, where d is the location of the last dot in the filename, if any, or the location of the terminating zero if there is no dot. In other words: Ignore the extension when identifying y[] and z[]. Finding the last dot shouldn't be more computationally expensive than strlen(name), so this shouldn't be a problem. Best regards Oliver -- Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M. Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung: secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün- chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd
