Well, it's pretty obvious what's happening once you get to the profile
results. I compiled with profiling on and indexed David's spp2.txt file,
which has 103,717 unique "words" (about 36 minutes on my machine):


  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 50.03    443.55   443.55   103788     4.27     4.27  Dictionary::Find(char *)
 49.67    883.92   440.37   103988     4.23     4.24  Dictionary::Add(char *, Object *)

For comparison, here's the same profile run over a list of 113,000 unique
words from /usr/dict/words (about 20 seconds):

  8.04      0.09     0.09   113447     0.79     4.48  Retriever::got_word(char *, int, int)
  8.04      0.18     0.09                             ostream::operator<<(char)
  6.25      0.25     0.07   120320     0.58     2.47  WordList::Word(char *, int, int, double)
  5.36      0.31     0.06   127197     0.47     0.47  String::remove(char *)
  5.36      0.37     0.06   120391     0.50     0.67  Dictionary::Find(char *)
  4.46      0.42     0.05   114847     0.44     0.69  Dictionary::Add(char *, Object *)

I first thought the problem was simply filling the hashtable--in 3.1.x and
before, we keep track of each unique word while we're indexing a given
document and send it to db.wordlist when we're done with that doc. So I
thought that with 100,000+ unique "words," we were just hitting a lot of
rehashes. Nope--there are only 2 rehashes in each case.
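
For reference, the per-document flow I mean looks roughly like this. This is
just an illustration, not ht://Dig code: std::map stands in for the real
Dictionary, and flush_to_wordlist() is a made-up placeholder for the
db.wordlist write.

// Rough sketch of the 3.1.x per-document flow described above.
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> WordCounts;   // stand-in for Dictionary

static void flush_to_wordlist(const WordCounts &words)
{
    // placeholder: this is where the per-document entries would be
    // appended to db.wordlist
    (void)words;
}

static void index_document(const std::vector<std::string> &doc_words)
{
    WordCounts seen;                        // one table per document
    for (size_t i = 0; i < doc_words.size(); ++i)
        seen[doc_words[i]]++;               // the Find/Add hot path
    flush_to_wordlist(seen);                // flushed when the doc is done
}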

No, it turns out Gilles' hunch was correct--we're spending a lot of time
per call in the Dictionary::Find and Dictionary::Add methods. So the only
conclusion is that the chains hanging off each hash bucket are long--we're
getting bad hashes.
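
To double-check that, a quick throwaway like this (not htdig code--the
bucket count and the one-word-per-line input are just assumptions) shows
how lopsided the buckets get for a given hash function:

// chain_check.cc -- standalone sketch, not part of ht://Dig.
// Reads one word per line on stdin, hashes it the way the current scheme
// seems to (the numeric value itself), and reports how the buckets fill up.
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for "use the numbers themselves for a hash value".
static unsigned long hash_numeric(const std::string &w)
{
    return std::strtoul(w.c_str(), 0, 10);
}

int main()
{
    const size_t nbuckets = 16384;            // assumed table size
    std::vector<size_t> bucket(nbuckets, 0);

    std::string word;
    while (std::getline(std::cin, word))
        bucket[hash_numeric(word) % nbuckets]++;

    std::cout << "longest chain: "
              << *std::max_element(bucket.begin(), bucket.end())
              << ", empty buckets: "
              << std::count(bucket.begin(), bucket.end(), (size_t)0)
              << " of " << nbuckets << "\n";
    return 0;
}

Swapping in a candidate replacement hash makes it easy to compare the
longest chain before and after.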

So here's the question: right now we use the numbers themselves as the
hash value. What can we do that's better? Perhaps use them as the seed for
a random-number function?
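
To make that concrete, here are two sketches--neither is what Dictionary
does today, and the constants are just the standard published ones. The
first seeds one step of a linear congruential generator with the parsed
number (the "random number function" idea); the second is a plain
byte-mixing string hash (FNV-1a) that works for non-numeric words too:

#include <cstdlib>

// Idea from above: parse the numeric word, then push it through one LCG
// step so that nearby numbers scatter. Multiplier/increment are the
// classic ANSI C example rand() constants.
static unsigned int hash_lcg_seed(const char *word)
{
    unsigned int seed = (unsigned int)std::strtoul(word, 0, 10);
    return seed * 1103515245U + 12345U;
}

// More general alternative: 32-bit FNV-1a over the raw bytes.
static unsigned int hash_fnv1a(const char *word)
{
    unsigned int h = 2166136261U;             // FNV offset basis
    for (const char *p = word; *p; ++p) {
        h ^= (unsigned char)*p;
        h *= 16777619U;                       // FNV prime
    }
    return h;
}

One caveat on the LCG idea: if the table size is a power of two, the
low-order bits of a single LCG step are still weak, so the byte-mixing
route is probably the safer starting point.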

-Geoff

