Hi!

2008/11/14 Oliver Fromme <[EMAIL PROTECTED]>:
> Matthew Dillon wrote:
> > 64-bit directory hash encoding (for smaller filenames out of bounds
> > indices just store a 0).
> >
> > aaaaa   name[0] & 0x1F
> > bbbbb   name[1] & 0x1F
> > ccccc   name[2] & 0x1F
> > mmmmmm  crc32(name + 3, len - 5) and some xor magic -> 6 bits
> > yyyyy   name[len-2] & 0x1F
> > zzzzz   name[len-1] & 0x1F
> > h[31]   crc32(name, len) (entire filename)
> >
> > 0aaaaabbbbbccccc mmmmmmyyyyyzzzzz hhhhhhhhhhhhhhhh hhhhhhhhhhhhhhh0
> > [...] .....
>
> You already mentioned it. That's exactly the problem
> that I'm seeing ... I'm not sure whether a[], b[], c[],
> y[] and z[] buy you anything in practice.
>
> If a single directory contains a huge number of files,
> it is likely they are all of the same type, e.g. it could
> be a collection of images or whatever. That means they
> all have the same extension (e.g. .jpg), so y[] and z[]
> are useless.
>
> Furthermore, it isn't completely unlikely that they even
> begin with the same prefix. For example, all of my
> digital camera pics are named "img%05d.jpg". Admittedly
> those aren't millions (but more than 10k anyway), and
> I'm not stupid enough to collect them in a single
> directory. ;-)
>
> Another example: The cache directory of my Opera browser.
> It contains several thousands of files all beginning
> with "opr*".
>
> It might be a good idea to make a small survey, i.e. find
> people who actually _do_ have directories with a huge
> number of files in them (and I mean more than just a few
> thousands), and ask them what the filenames typically look
> like.
>
> An obvious improvement would be to store name[d-2] and
> name[d-1] in y[] and z[], respectively, where d is the
> location of the last dot in the filename, if any, or the
> location of the terminating zero if there is no dot.
> In other words: Ignore the extension when identifying
> y[] and z[]. Finding the last dot shouldn't be more
> computationally expensive than strlen(name), so this
> shouldn't be a problem.

I agree with Oliver, but I have another proposal.

I doubt that there are usually more than 1-2 affected directories per
host, and the file names in them usually follow a very similar pattern.
A sysctl (or some other tunable) with some kind of mask would be great
for fine-tuning (and simply useless for 90% of users).
Something like:

  sysctl.hammer.dirhash.hashmask.prefix=1
      (starting at the first filename byte, 3 bytes fixed length)
  sysctl.hammer.dirhash.hashmask.suffix=-1
      (starting at the last byte, 2 bytes length)

This way, admins would be able to re-tune it to their particular needs.

Just my 0.02UAH

--
Dennis Melentyev
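
For illustration only, here is a minimal C sketch of the y[]/z[] selection discussed above: the quoted layout takes the two fields from the last two bytes of the name, while Oliver's suggestion takes them from the two bytes just before the last dot. The helper names (stem_end, field5, yz_tail, yz_stem) are hypothetical and are not taken from HAMMER; the rest of the 64-bit key (a/b/c, the crc32 parts, the "xor magic") is left out because the thread does not spell it out.

```c
#include <stddef.h>
#include <string.h>

/* Index d of the last dot in name, or strlen(name) if there is no dot. */
static size_t stem_end(const char *name)
{
    const char *dot = strrchr(name, '.');
    return dot != NULL ? (size_t)(dot - name) : strlen(name);
}

/*
 * Low 5 bits of name[end - back]. Out-of-bounds indices yield 0,
 * following the "just store a 0" rule from the quoted description.
 */
static unsigned field5(const char *name, size_t end, size_t back)
{
    if (back > end)
        return 0;
    return (unsigned char)name[end - back] & 0x1F;
}

/* y/z as in the quoted layout: name[len-2] and name[len-1]. */
static unsigned yz_tail(const char *name)
{
    size_t len = strlen(name);
    return (field5(name, len, 2) << 5) | field5(name, len, 1);
}

/* y/z as Oliver suggests: name[d-2] and name[d-1], ignoring the extension. */
static unsigned yz_stem(const char *name)
{
    size_t d = stem_end(name);
    return (field5(name, d, 2) << 5) | field5(name, d, 1);
}
```

With names like "img00012.jpg" and "img00013.jpg", yz_tail() returns the same 10 bits for both (they come from "pg"), while yz_stem() differs because it looks at "12" versus "13". For a name without a dot, both functions agree.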
