Well, it's pretty obvious what's happening once you get to the profile
results. I compiled with profiling on and indexed David's spp2.txt file,
which has 103,717 unique "words" (about 36 minutes on my machine):
  %   cumulative    self              self     total
 time    seconds   seconds   calls  ms/call  ms/call  name
50.03     443.55    443.55  103788     4.27     4.27  Dictionary::Find(char *)
49.67     883.92    440.37  103988     4.23     4.24  Dictionary::Add(char *, Object *)
For comparison, here's the same run on a list of 113,000 unique words from
/usr/dict/words (about 20 seconds):
 8.04       0.09      0.09  113447     0.79     4.48  Retriever::got_word(char *, int, int)
 8.04       0.18      0.09                            ostream::operator<<(char)
 6.25       0.25      0.07  120320     0.58     2.47  WordList::Word(char *, int, int, double)
 5.36       0.31      0.06  127197     0.47     0.47  String::remove(char *)
 5.36       0.37      0.06  120391     0.50     0.67  Dictionary::Find(char *)
 4.46       0.42      0.05  114847     0.44     0.69  Dictionary::Add(char *, Object *)
I first thought the problem was simply filling the hashtable--in 3.1.x and
before, we keep track of each unique word while we're indexing a given
document and send the whole batch to db.wordlist when we're done with that
doc. So I figured that with 100,000+ unique "words" we were just hitting a
lot of rehashes. Nope: there are only 2 rehashes in each case.
Instead, it turns out Gilles' hunch was correct--we're spending a lot of
time per call inside the Dictionary::Find and Dictionary::Add methods
themselves. The only conclusion is that the chains on each hash bucket are
long--we're getting a bad hash distribution.
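If we want to back that up with numbers before touching the hash, the
quickest check is to look at the bucket distribution directly. Here's a
minimal standalone sketch--not htdig code; the default bucket count and the
idea of dumping the computed hash values one per line are just assumptions
for the test--that reads hash codes on stdin and reports how lopsided the
chains are:

/* Sketch only: reads one hash value per line on stdin and reports the
 * bucket distribution for a given bucket count. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Bucket count is an assumption; pass the real one on the command line. */
    int nbuckets = argc > 1 ? atoi(argv[1]) : 127;
    long *count = (long *) calloc(nbuckets, sizeof(long));
    unsigned long h, total = 0;

    while (scanf("%lu", &h) == 1)
    {
        count[h % nbuckets]++;
        total++;
    }

    long longest = 0, empty = 0;
    for (int b = 0; b < nbuckets; b++)
    {
        if (count[b] == 0)
            empty++;
        if (count[b] > longest)
            longest = count[b];
    }
    printf("%lu keys in %d buckets: %ld empty, longest chain %ld, "
           "ideal %.1f per bucket\n",
           total, nbuckets, empty, longest, (double) total / nbuckets);
    free(count);
    return 0;
}

If the longest chain comes out orders of magnitude above the ideal average,
that confirms the distribution is the problem rather than the table size.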
So here's the question: right now we use the numbers themselves as the hash
value. What can we do that's better? Perhaps use them as the seed for a
random-number function?
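One cheap way to do roughly that, if we go this route, is multiplicative
hashing a la Knuth: multiply the key by 2^32/phi (2654435761) and keep the
high bits of the product, which amounts to a one-step pseudo-random
scramble of the key. A sketch--again not htdig code, and the 65536-bucket
table is just an assumption for the demo:

/* Sketch only; assumes 32-bit unsigned int. */
#include <stdio.h>

/* Multiplicative hashing: multiply by 2^32/phi and keep the HIGH bits.
 * The low bits of the product inherit any alignment the key had. */
static unsigned int mixed_bucket(unsigned int key, int log2_buckets)
{
    return (key * 2654435761u) >> (32 - log2_buckets);
}

int main()
{
    const int          log2_buckets = 16;           /* 65536 buckets -- assumed */
    const unsigned int nbuckets = 1u << log2_buckets;

    /* Keys that are all multiples of the bucket count collapse into
     * bucket 0 under an identity hash but spread out once mixed. */
    for (unsigned int k = 0; k < 6; k++)
    {
        unsigned int key = k * nbuckets;
        printf("key %10u  identity bucket %6u  mixed bucket %6u\n",
               key, key % nbuckets, mixed_bucket(key, log2_buckets));
    }
    return 0;
}

The detail that matters is keeping the high bits of the product: the low
bits keep whatever structure the key had, which is exactly what hurts an
identity hash when the keys aren't spread evenly.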
-Geoff