Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

just_adam Mon, 25 Jan 2010 15:00:31 -0800

Hi - Thought I'd follow up with this thread and share what I did with a bit
more specifics.

Background

For our use case, scoring the results is not as important.  We mainly want
to just answer whether the document is in the user's input list of
identifiers (large: 1M tokens), and then rely on faceting to help the user
navigate their results.  

We're using a 64bit Linux host with 26GB of RAM.  Our index is ~2.5GB, with
roughly 20 million docs.  Each doc has a unique identifier that we will
allow for this large scale querying by our users.

In addition to this Nabble thread, I also cite a cross reference to
lucidimanination's forum about using Filters on large queries:

http://www.lucidimagination.com/search/document/ea4de429e62aa05f/efficient_filtering_advise#5d103a8412eaa34
http://www.lucidimagination.com/search/document/ea4de429e62aa05f/efficient_filtering_advise#5d103a8412eaa34

Approach

So, for our implementation, I use a QParserPlugin which basically parses the
input into a list of Terms.  These terms are stored in a java.util.TreeSet
for reasons forthcoming.  The QParser then  just returns a
ConstantScoreQuery wrapped around a Filter subclass that I wrote which uses
these Terms.  The Filter subclass does most of the "work".

When processing the query, this Filter overrides getDocIdList(IndexReader)
method, where it obtains a TermDocs instance on the IndexReader, and employs
the seek() method while iterating over the TreeSet of input Terms.  It uses
an OpenBitSet to hold results.  

Correct me if I'm wrong, but it seemed important to have my input terms in
natural order of a TreeSet in order to take advantage of the seek() approach
to TermDocs (presuming it is sort of like a database cursor?).

I've tried breaking up the processing of getDocIdList() into multiple Java
threads, where each thread is handed a subset of the input Terms and a
separate OpenBitSet that the main thread manages at the end of processing
with the union() call.  Each thread obtains its own TermDocs instance on the
Reader and iterates over its subset of the input terms and stores in the
component bitset.  

The difference between using a single thread and multiple (10) threads does
not seem to matter much in my tests.  Perhaps there's some blocking at the
IndexReader level that I do not understand which forces things to go
serially no matter what.

In any event, we're getting rougly 2-3 second query times, with an
additional 1-2 seconds parsing input from the request.  so our local client
app sees about a 6-8 second roundtrip on it's queries, with faceting turned
on. For such a large query: not bad!

Anyway, I'm posting this in hopes that someone will have any input pro/cons
to share, and some code snippets below should you be interested.  Thanks,

/**
 *  MyQueryFilter.getDocIdSet()
*/

public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  long start = System.currentTimeMillis();
  OpenBitSet[] results = new OpenBitSet[SUBQUERY];
  Thread[] threads = new Thread[SUBQUERY];
  for (int i = 0; i < SUBQUERY; ++i) {
    OpenBitSet result = new OpenBitSet(reader.maxDoc());
    results[i] = result;
    SubQ s = new SubQ(subqueries[i], reader, result);
    Thread t =  new Thread(s);
    threads[i] = t;
    t.start();
  }
  try {
    for (int i = 0; i < SUBQUERY; ++i) {
      threads[i].join();
    }    
  } catch (InterruptedException ie) {
    log.error(this.getClass().toString() + " Thread Error ", ie);
  }
  OpenBitSet result = results[0];
  log.info(this.getClass().toString() 
      + ".getDocIDSet() terms: " + this.all_terms.size()
      + " miliseconds: " + (System.currentTimeMillis() - start)
      + " threads: " + SUBQUERY);
  for (int i = 1; i < SUBQUERY; ++ i) {
    result.union(results[i]);
  }
  return result; 

}

/** 
 * Here's my SubQ class
 */

private static class SubQ implements Runnable {
  TreeSet subterms;
  IndexReader my_reader;
  OpenBitSet my_results;
  private SubQ() {}
  protected SubQ(TreeSet terms, IndexReader reader, OpenBitSet bitset) {
    this();
    this.my_results = bitset;
    this.subterms = terms;
    this.my_reader = reader;
  }
  public void run() {
   TermDocs td = null;
   try {
    td = my_reader.termDocs();
      for (Iterator iter = this.subterms.iterator(); iter.hasNext();) {
        Term term = (Term) iter.next();
        td.seek(term);
        while (td.next()) {
          my_results.set(td.doc());
        }
      }
    } catch (IOException ioe) {
      log.error(this.getClass().toString() + " TermDoc seek Error ", ioe);
    } finally {
      if (td != null) try {
        td.close();
      } catch (IOException ioe) { 
        log.error(this.getClass().toString() + " TermDoc close Error ",
ioe);
      }
    }
  }
}

PS to Jeff - Thanks for the hints

just_adam wrote:
> 
> I'm attempting a very similar large OR boolean query here and wish to
> clarify what behavior that the custom query needs to be overriding.  
> 
> Do you mean to have the custom query actually run a bunch of subqueries as
> Java Threads?  Where in the message traffic between the Searcher
> IndexReader and Query would this type of bifurcation logic take place?
> 
> Thank you,
> Adam
> 
> 
> Yonik Seeley-2 wrote:
>> 
>> 
>> A custom query type that does this big parallel OR, and a
>> QParserPlugin that creates that query.  No changes to any search
>> components should be needed.
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
>> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/large-OR-boolean-query-tp25584607p27315341.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

Reply via email to