According to Geoff Hutchison:
> Well, it's pretty obvious what's happening once you get to the profile
> results. I compiled with profiling on and indexed David's spp2.txt file,
> which has 103,717 unique "words" (about 36 minutes on my machine):
[snip]
> I first thought the problem was simply filling the hashtable--in 3.1.x and
> before, we keep track of each unique word while we're indexing a given
> document and send it to db.wordlist when we're done with that doc. So I
> thought with 100,000+ unique "words," we were just hitting a lot of
> rehashes. Nope, we have 2 in each case.
>
> No, it turns out Gilles' hunch was correct--we're spending a lot of time
> on each call to the Dictionary::Find and Dictionary::Add methods. So the only
> conclusion is that we have long lists on each hash element--we're getting
> bad hashes.
>
> So here's the question: Right now we use the numbers themselves for a hash
> value. What can we do that's better? Perhaps use them as the seed for a
> random number function?
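
To make Geoff's observation concrete, here's a toy C++ program -- not
the actual ht://Dig Dictionary code; the table size and the key
pattern are made-up assumptions -- showing how an identity hash
collapses when every key shares a factor with the table size:

#include <cstdio>
#include <vector>

int main() {
    const unsigned long tableSize = 1024;    // made-up bucket count
    std::vector<int> bucketLen(tableSize, 0);

    // Suppose the 100,000 numeric keys all happen to be multiples of
    // the table size (an assumed pattern, for illustration only).
    for (unsigned long id = 0; id < 100000UL * tableSize; id += tableSize)
        ++bucketLen[id % tableSize];         // identity hash: h(id) = id

    // Every key lands in bucket 0, so each Find or Add degrades to a
    // linear scan of one 100,000-element chain.
    std::printf("chain length in bucket 0: %d\n", bucketLen[0]);
    return 0;
}

Chained that badly, every lookup walks one long list, which would show
up in a profile exactly as described above.
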
I guess the question I have is what are these lists of numbers, and
is there some sort of pattern to them? If they're all a multiple of
the hash table size, I guess that would make the hashing degenerate.
Theoretically, you could always wind up with something that would do
this, but in practice there are some hashing functions that are less
likely to fail than others. I like the seed-for-a-random-number idea;
it would be worth a try (see the sketch below). I wonder, though, if
there wouldn't be some
other list of numbers that would have the same effect on this function.
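
As a hedged sketch of the random-number idea: one multiplicative
mixing step with a golden-ratio constant (0x9E3779B1), keeping the
high bits, acts like seeding a quick generator with the key. The
program below is illustrative, not a proposed patch, and reuses the
made-up keys from the toy above:

#include <cstdint>
#include <cstdio>
#include <vector>

// One multiplicative mixing step. We keep the HIGH 10 bits, because
// for keys that are multiples of 1024 the low bits stay zero even
// after multiplication.
static uint32_t mixedBucket(uint32_t key) {
    return (key * 2654435761u) >> 22;    // top 10 bits -> 0..1023
}

int main() {
    const uint32_t tableSize = 1024;
    std::vector<int> ident(tableSize, 0), mixed(tableSize, 0);

    // The same pathological keys as before: all multiples of 1024.
    for (uint32_t id = 0; id < 100000u * tableSize; id += tableSize) {
        ++ident[id % tableSize];         // identity hash
        ++mixed[mixedBucket(id)];        // mixed hash
    }

    int worstIdent = 0, worstMixed = 0;
    for (uint32_t i = 0; i < tableSize; ++i) {
        if (ident[i] > worstIdent) worstIdent = ident[i];
        if (mixed[i] > worstMixed) worstMixed = mixed[i];
    }

    // Identity hashing chains all 100,000 keys in one bucket; the
    // mixed buckets should land near the ideal average of ~98 each.
    std::printf("longest chain -- identity: %d, mixed: %d\n",
                worstIdent, worstMixed);
    return 0;
}

Of course, any fixed mixing function still has some input pattern that
defeats it; the point is only to make the naturally occurring patterns
far less likely to line up with the table size.
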
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930