Hi - Thought I'd follow up with this thread and share what I did with a bit more specifics.
Background For our use case, scoring the results is not as important. We mainly want to just answer whether the document is in the user's input list of identifiers (large: 1M tokens), and then rely on faceting to help the user navigate their results. We're using a 64bit Linux host with 26GB of RAM. Our index is ~2.5GB, with roughly 20 million docs. Each doc has a unique identifier that we will allow for this large scale querying by our users. In addition to this Nabble thread, I also cite a cross reference to lucidimanination's forum about using Filters on large queries: http://www.lucidimagination.com/search/document/ea4de429e62aa05f/efficient_filtering_advise#5d103a8412eaa34 http://www.lucidimagination.com/search/document/ea4de429e62aa05f/efficient_filtering_advise#5d103a8412eaa34 Approach So, for our implementation, I use a QParserPlugin which basically parses the input into a list of Terms. These terms are stored in a java.util.TreeSet for reasons forthcoming. The QParser then just returns a ConstantScoreQuery wrapped around a Filter subclass that I wrote which uses these Terms. The Filter subclass does most of the "work". When processing the query, this Filter overrides getDocIdList(IndexReader) method, where it obtains a TermDocs instance on the IndexReader, and employs the seek() method while iterating over the TreeSet of input Terms. It uses an OpenBitSet to hold results. Correct me if I'm wrong, but it seemed important to have my input terms in natural order of a TreeSet in order to take advantage of the seek() approach to TermDocs (presuming it is sort of like a database cursor?). I've tried breaking up the processing of getDocIdList() into multiple Java threads, where each thread is handed a subset of the input Terms and a separate OpenBitSet that the main thread manages at the end of processing with the union() call. Each thread obtains its own TermDocs instance on the Reader and iterates over its subset of the input terms and stores in the component bitset. The difference between using a single thread and multiple (10) threads does not seem to matter much in my tests. Perhaps there's some blocking at the IndexReader level that I do not understand which forces things to go serially no matter what. In any event, we're getting rougly 2-3 second query times, with an additional 1-2 seconds parsing input from the request. so our local client app sees about a 6-8 second roundtrip on it's queries, with faceting turned on. For such a large query: not bad! Anyway, I'm posting this in hopes that someone will have any input pro/cons to share, and some code snippets below should you be interested. Thanks, /** * MyQueryFilter.getDocIdSet() */ public DocIdSet getDocIdSet(IndexReader reader) throws IOException { long start = System.currentTimeMillis(); OpenBitSet[] results = new OpenBitSet[SUBQUERY]; Thread[] threads = new Thread[SUBQUERY]; for (int i = 0; i < SUBQUERY; ++i) { OpenBitSet result = new OpenBitSet(reader.maxDoc()); results[i] = result; SubQ s = new SubQ(subqueries[i], reader, result); Thread t = new Thread(s); threads[i] = t; t.start(); } try { for (int i = 0; i < SUBQUERY; ++i) { threads[i].join(); } } catch (InterruptedException ie) { log.error(this.getClass().toString() + " Thread Error ", ie); } OpenBitSet result = results[0]; log.info(this.getClass().toString() + ".getDocIDSet() terms: " + this.all_terms.size() + " miliseconds: " + (System.currentTimeMillis() - start) + " threads: " + SUBQUERY); for (int i = 1; i < SUBQUERY; ++ i) { result.union(results[i]); } return result; } /** * Here's my SubQ class */ private static class SubQ implements Runnable { TreeSet subterms; IndexReader my_reader; OpenBitSet my_results; private SubQ() {} protected SubQ(TreeSet terms, IndexReader reader, OpenBitSet bitset) { this(); this.my_results = bitset; this.subterms = terms; this.my_reader = reader; } public void run() { TermDocs td = null; try { td = my_reader.termDocs(); for (Iterator iter = this.subterms.iterator(); iter.hasNext();) { Term term = (Term) iter.next(); td.seek(term); while (td.next()) { my_results.set(td.doc()); } } } catch (IOException ioe) { log.error(this.getClass().toString() + " TermDoc seek Error ", ioe); } finally { if (td != null) try { td.close(); } catch (IOException ioe) { log.error(this.getClass().toString() + " TermDoc close Error ", ioe); } } } } PS to Jeff - Thanks for the hints just_adam wrote: > > I'm attempting a very similar large OR boolean query here and wish to > clarify what behavior that the custom query needs to be overriding. > > Do you mean to have the custom query actually run a bunch of subqueries as > Java Threads? Where in the message traffic between the Searcher > IndexReader and Query would this type of bifurcation logic take place? > > Thank you, > Adam > > > Yonik Seeley-2 wrote: >> >> >> A custom query type that does this big parallel OR, and a >> QParserPlugin that creates that query. No changes to any search >> components should be needed. >> >> -Yonik >> http://www.lucidimagination.com >> >> > > -- View this message in context: http://old.nabble.com/large-OR-boolean-query-tp25584607p27315341.html Sent from the Solr - Dev mailing list archive at Nabble.com.