On Tue, 2009-01-13 at 12:52 +1100, Paul Wankadia wrote: > On Mon, Jan 12, 2009 at 3:02 PM, Ian Kent <ra...@themaw.net> wrote: > > Here are a couple of patches we can use to check what is going > on (my > math may be wrong so it would be good if folks could check). > They should > apply to the git repo source. > > The first patch hash-report-stats.patch modifies > cache_dump_cache() to > report some statistics. > > The second patch hash-oaat-algo.patch makes the cache use the > Vals > proposed oaat algorithm without changing the hash table size > for > comparison. > > I used an indirect map with 35000 entries as a quick check. > > There are a couple of problems with this, first the map keys > are a bit > random and so are not representative of actual maps and since > we use > multiple caches in v5 the number of entries in a given cache > may be > quite different from the total number of entries. > > But, if my math is correct, we can get some marginally useful > information from the experiment. > > The results: > > Current algorithm: > cache: entries 35000 table size 77 ideal chain length 454.55 > chain length: empty 0 min 381 max 501 avg 454.55 std 22.78 > > oaat algorithm: > cache: entries 35000 table size 77 ideal chain length 454.55 > chain length: empty 0 min 402 max 532 avg 454.55 std 22.27 > > This looks a bit odd, my math is probably broken somehow. > > Perhaps you could calculate the median instead of the mean?
I was thinking that the min and max looked a bit odd. To calculate the median we would have to assume a normal distribution (or find some way to work out what type of distribution it is, too hard) and then it's (min + max)/2. > > > I have to conclude that while the oaat algorithm looks worse > some how, > if the math is OK, then it is likely that there are more > chains closer > to the average length with less outliers (smaller variance in > chain > length) than there are with the current algorithm. > > It would be good if we could get some real world numbers for > both > indirect and direct maps (mainly because direct maps have > somewhat > different keys, full path compared to directory component). > > I will do some more checks with the proposed large hash table > size to > see if the small table size is hiding any bias in the > distribution. > > On Mon, Jan 12, 2009 at 3:58 PM, Ian Kent <ra...@themaw.net> wrote: > > Here are some more stats. > > cache: entries 35000 table size 977 ideal chain length 35.82 > chain length: empty 627 min 0 max 284 avg 35.82 std 72.14 > > cache: entries 35000 table size 977 ideal chain length 35.82 > chain length: empty 0 min 17 max 59 avg 35.82 std 6.05 > > > cache: entries 35000 table size 2039 ideal chain length 17.17 > chain length: empty 1689 min 0 max 284 avg 17.17 std 53.05 > > cache: entries 35000 table size 2039 ideal chain length 17.17 > chain length: empty 0 min 6 max 32 avg 17.17 std 4.18 > > > cache: entries 35000 table size 4093 ideal chain length 8.55 > chain length: empty 3743 min 0 max 284 avg 8.55 std 38.41 > > cache: entries 35000 table size 4093 ideal chain length 8.55 > chain length: empty 2 min 0 max 21 avg 8.55 std 2.99 > > > There's clearly something wrong with my min and probably max > calculation > but the stats still show some interesting information. > > Why do you think that you've miscalculated the minimum? Well, maybe I haven't but I thought min should be > 0. But, it's not worth worrying about as I think we've managed to show a clear improvement. > > > It looks like the additive algorithm is biased toward a small > range of > hash indexes which becomes more significant as the table size > increases. > So the distribution is significantly better as table size > increases. > > I'm not clear on what the graphs (Pauls) are displaying but my > impression is that they also show this same distribution bias > toward > clustering for the additive hash. Assuming they were done > using actual > map keys that would indicate that the benefit is likely fairly > general. > > I plotted the number of elements in each bucket. > > As you now see, the additive hash function doesn't scale. Sure and now we can say why it doesn't scale in the patch description. > > > The only question remaining is how large should we make the > table. > A size of around 4k will lead to a lot of wasted space for > many common > configurations (including configurations that have many map > entries > spread over several individual maps) but a size around 1k > still gives > fairly long chains for large maps. > > Suggestions? > > Perhaps you could make it configurable? Perhaps, but I don't think this is the sort of thing that should be configurable so I'd prefer to make it a compile time option rather than a run time configuration setting. > > > PS. I'll need to make a couple of other changes as well, eg. > the null > map entry cache is typically very small. > _______________________________________________ autofs mailing list autofs@linux.kernel.org http://linux.kernel.org/mailman/listinfo/autofs