Re: Sort Performance Problems across large dataset

Matt Quail Mon, 24 Jan 2005 17:42:15 -0800

Peter,

> Currently we can issue a simple search query and expect a response back in about 0.2 seconds (~3,000 results)

You may want to try something like the following (I do this in FishEye, and it seems to perform well for moderately large field-spaces).

Use a custom HitCollector, and store all the matching doc-ids in a java.util.BitSet. This will still give you your 0.2-second performance.
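
As a rough illustration of that collector step, here is a minimal sketch that compiles without Lucene on the classpath. The nested HitCollector class and the search() helper are stand-ins written for this example (assumptions, not Lucene code); in Lucene itself you would subclass org.apache.lucene.search.HitCollector and pass it to Searcher.search(Query, HitCollector). The point is only that the callback records each doc id and ignores the score.

```java
import java.util.BitSet;

class BitSetCollectorSketch {
    // Stand-in for Lucene's HitCollector callback: one call per matching document.
    abstract static class HitCollector {
        abstract void collect(int doc, float score);
    }

    // Hypothetical stand-in for a search: reports each matching doc id to the collector.
    static void search(int[] matchingDocs, HitCollector collector) {
        for (int doc : matchingDocs) {
            collector.collect(doc, 1.0f);
        }
    }

    // Collects every matching doc id into a BitSet; scores are not needed here.
    static BitSet collectDocIds(int[] matchingDocs, int maxDoc) {
        final BitSet docs = new BitSet(maxDoc);
        search(matchingDocs, new HitCollector() {
            void collect(int doc, float score) {
                docs.set(doc);
            }
        });
        return docs;
    }

    public static void main(String[] args) {
        BitSet docs = collectDocIds(new int[] {3, 7, 42}, 100);
        System.out.println(docs); // prints {3, 7, 42}
    }
}
```

Because nothing is sorted or scored, the cost stays close to that of the raw query, and the BitSet needs only one bit per document in the index.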

Then, use a TermEnum to visit each term in your "species name" field, and a TermDocs iterator to visit the documents for each term, "printing out" (or whatever) each species name that occurs in at least one docid in your bitset. Something like this pseudocode:


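BitSet docs = doSearch(query); // custom HitCollector fills the BitSet, ~0.2 seconds

// Position the enumeration at the first term of the "species-name" field.
TermEnum te = reader.terms(new Term("species-name", ""));
TermDocs td = reader.termDocs();
try {
  Term t = te.term();
  // Stop once the enumeration runs into the next field.
  while (t != null && t.field().equals("species-name")) {
    td.seek(te); // documents that contain this species name
    while (td.next()) {
      int docid = td.doc();
      if (docs.get(docid)) {
        System.out.println("match: " + t.text() + " (doc " + docid + ")");
        break; // one hit is enough, try next term
      }
    }

    if (!te.next()) {
      break; // no more terms in the index
    }
    t = te.term();
  }
} finally {
  te.close();
  td.close();
}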

Now, with 2.3 million (or 4 million!) species names, I'm not sure how fast it will be to iterate through all the "species-name" termdocs. But I would be interested to find out; if you give this code a try, could you report back your results?
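
To reason about that term walk without an index at hand, here is a small plain-Java model of it. It is a sketch under stated assumptions, not Lucene code: a TreeMap stands in for the sorted term dictionary of the species-name field, each int[] stands in for a term's postings, and the species names and doc ids are made-up sample data.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermWalkSketch {
    // Returns, in sorted order, the terms that occur in at least one doc in 'docs'.
    // 'postings' models one field's term dictionary: term text -> doc ids.
    static List<String> matchingTerms(TreeMap<String, int[]> postings, BitSet docs) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int docid : e.getValue()) {
                if (docs.get(docid)) {
                    result.add(e.getKey());
                    break; // one hit is enough; move on to the next term
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, for illustration only.
        TreeMap<String, int[]> postings = new TreeMap<String, int[]>();
        postings.put("Canis lupus", new int[] {1, 5});
        postings.put("Felis catus", new int[] {2});
        postings.put("Ursus arctos", new int[] {5, 9});

        BitSet docs = new BitSet();
        docs.set(5);
        docs.set(9);

        System.out.println(matchingTerms(postings, docs)); // prints [Canis lupus, Ursus arctos]
    }
}
```

The model makes the cost visible: every term in the field is visited once regardless of how many results the query returned, and the early break only saves work on terms that do match.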

=Matt

