[ https://issues.apache.org/jira/browse/LUCENE-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1594:
---------------------------------------

    Attachment: LUCENE-1594.patch

Initial patch.

> Use source code specialization to maximize search performance
> --------------------------------------------------------------
>
>                 Key: LUCENE-1594
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1594
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: FastSearchTask.java, LUCENE-1594.patch
>
>
> Towards eking out the absolute best search performance, and after seeing
> the Java ghosts in LUCENE-1575, I decided to build a simple prototype
> source code specializer for Lucene's searches.
> The idea is to write dynamic Java code, specialized to run one very
> specific query context (e.g. TermQuery, collecting top N by field, no
> filter, no deletions), compile that Java code, and run it.
> Here are the performance gains compared to trunk:
> ||Query||Sort||Filter||Deletes||Scoring||Hits||QPS (base)||QPS (new)||% gain||
> |1|Date (long)|no|no|Track,Max|2561886|6.8|10.6|{color:green}55.9%{color}|
> |1|Date (long)|no|5%|Track,Max|2433472|6.3|10.5|{color:green}66.7%{color}|
> |1|Date (long)|25%|no|Track,Max|640022|5.2|9.9|{color:green}90.4%{color}|
> |1|Date (long)|25%|5%|Track,Max|607949|5.3|10.3|{color:green}94.3%{color}|
> |1|Date (long)|10%|no|Track,Max|256300|6.7|12.3|{color:green}83.6%{color}|
> |1|Date (long)|10%|5%|Track,Max|243317|6.6|12.6|{color:green}90.9%{color}|
> |1|Relevance|no|no|Track,Max|2561886|11.2|17.3|{color:green}54.5%{color}|
> |1|Relevance|no|5%|Track,Max|2433472|10.1|15.7|{color:green}55.4%{color}|
> |1|Relevance|25%|no|Track,Max|640022|6.1|14.1|{color:green}131.1%{color}|
> |1|Relevance|25%|5%|Track,Max|607949|6.2|14.4|{color:green}132.3%{color}|
> |1|Relevance|10%|no|Track,Max|256300|7.7|15.6|{color:green}102.6%{color}|
> |1|Relevance|10%|5%|Track,Max|243317|7.6|15.9|{color:green}109.2%{color}|
> |1|Title (string)|no|no|Track,Max|2561886|7.8|12.5|{color:green}60.3%{color}|
> |1|Title (string)|no|5%|Track,Max|2433472|7.5|11.1|{color:green}48.0%{color}|
> |1|Title (string)|25%|no|Track,Max|640022|5.7|11.2|{color:green}96.5%{color}|
> |1|Title (string)|25%|5%|Track,Max|607949|5.5|11.3|{color:green}105.5%{color}|
> |1|Title (string)|10%|no|Track,Max|256300|7.0|12.7|{color:green}81.4%{color}|
> |1|Title (string)|10%|5%|Track,Max|243317|6.7|13.2|{color:green}97.0%{color}|
> Those tests were run on a 19M-doc Wikipedia index (splitting each
> Wikipedia doc at ~1024 chars), on Linux, Java 1.6.0_10.
> But: it only works with TermQuery for now; it's just a start.
> It should be easy for others to run this test:
> * apply the patch
> * cd contrib/benchmark
> * run python -u bench.py -delindex </path/to/index/with/deletes> -nodelindex </path/to/index/without/deletes>
> (You can leave off one of -delindex or -nodelindex and it'll skip
> those tests.)
> For each test, bench.py generates a single Java source file that runs
> that one query; you can open
> contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/FastSearchTask.java
> to see it. I'll attach an example. It writes "results.txt", in Jira
> table format, which you should be able to copy/paste back here.
> The specializer uses pretty much every search speedup I can think of
> -- the ones from LUCENE-1575 (to score or not, to maxScore or not),
> the ones suggested in the spinoff LUCENE-1593 (pre-fill the queue with
> sentinels, don't use docID for tie breaking), and LUCENE-1536
> (random-access filters).
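To make the combined effect of these speedups concrete, here is a minimal, self-contained sketch of the kind of specialized inner loop they add up to for a single TermQuery. This is not the generated FastSearchTask and none of the names below are Lucene APIs; int[] arrays and java.util.BitSet stand in for the real postings stream (IndexInput) and BitVector, and the scoring formula is a toy stand-in.

{code:java}
// Sketch only: hypothetical names, toy scoring. Postings are walked directly,
// deletions and the random-access filter are plain bit tests, and the top-N
// queue is pre-filled with sentinel scores so the hot loop never has to check
// whether the queue is full (the LUCENE-1593 ideas).
import java.util.Arrays;
import java.util.BitSet;

public class SpecializedTermScorerSketch {

  private final float[] topScore;   // min-heap over scores, pre-filled with sentinels
  private final int[] topDoc;

  public SpecializedTermScorerSketch(int topN) {
    topScore = new float[topN];
    topDoc = new int[topN];
    Arrays.fill(topScore, Float.NEGATIVE_INFINITY);   // sentinel pre-fill
  }

  /**
   * @param docDeltas postings as docID deltas (stand-in for decoding the IndexInput)
   * @param freqs     term frequency per posting
   * @param deleted   deleted docs (stand-in for BitVector); may be null
   * @param filter    random-access filter, assumed to already exclude deleted docs; may be null
   * @param weight    pre-computed query weight folded into one float
   */
  public void score(int[] docDeltas, int[] freqs, BitSet deleted, BitSet filter, float weight) {
    int doc = 0;
    for (int i = 0; i < docDeltas.length; i++) {
      doc += docDeltas[i];
      if (filter != null) {
        if (!filter.get(doc)) continue;                // filter pre-multiplies in deletions
      } else if (deleted != null && deleted.get(doc)) {
        continue;                                      // no filter: test deletions directly
      }
      float score = (float) Math.sqrt(freqs[i]) * weight;   // toy tf * weight scoring
      // Sentinel pre-fill means topScore[0] is always the current bottom of the
      // queue; ties are not broken by docID.
      if (score > topScore[0]) {
        topScore[0] = score;
        topDoc[0] = doc;
        downHeap();
      }
    }
  }

  // Standard sift-down to restore the min-heap after replacing the bottom entry.
  private void downHeap() {
    int i = 0;
    final float score = topScore[0];
    final int doc = topDoc[0];
    final int size = topScore.length;
    while (true) {
      int left = 2 * i + 1, right = left + 1, smallest = i;
      if (left < size && topScore[left] < topScore[smallest]) smallest = left;
      if (right < size && topScore[right] < topScore[smallest]) smallest = right;
      if (smallest == i) break;
      topScore[i] = topScore[smallest];
      topDoc[i] = topDoc[smallest];
      topScore[smallest] = score;
      topDoc[smallest] = doc;
      i = smallest;
    }
  }
}
{code}

The generated code presumably also fixes the LUCENE-1575 choices (to score or not, to track maxScore or not) at generation time, which is the same branch-elimination idea applied to the collector.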
> It bypasses TermDocs and interacts directly with the IndexInput, and
> with BitVector for deletions. It directly folds in the collector, if
> possible. A filter, if used, must be random access, and is assumed to
> pre-multiply in the deleted docs.
> Current status:
> * I only handle TermQuery. I'd like to add others over time...
> * It can collect by score, or by a single field (with the 3 scoring
> options in LUCENE-1575). It can't do reverse field sort or
> multi-field sort yet.
> * The auto-gen code (gen.py) is rather hideous. It could use some
> serious refactoring; I think we could get it to the point where
> each Query can generate its own specialized code, maybe. It also
> eventually needs to be ported to Java.
> * The script runs the old code, then the new code, then checks that
> the top-N results are identical, and aborts if not. So I'm pretty
> sure the specialized code is working correctly, for the cases I'm
> testing.
> * The patch includes a few small changes to core, mostly to open up
> package-protected APIs so I can access internals.
> I think this is an interesting effort for several reasons:
> * It gives us a best-case upper bound on the performance we can
> expect from Lucene's normal search classes (minus algorithmic
> improvements, e.g. PFOR), because it makes life as easy as possible
> for the compiler/JRE to convert the search into tight machine code.
> * We can spin optimization ideas from this back into the core
> (e.g. LUCENE-1593 already has one example) and prioritize them. For
> example, given these results, I think optimizing for filters that
> support a random-access API is important. As we fold speedups back
> into the core, the gains from specialization will naturally decrease.
> * Eventually (maybe as a future "experimental" module) this could be
> used in production as a simple "search wrapper": for a given query,
> the specializer is consulted; if the query matches what the
> specializer can handle, the specialized code is run, else we fall
> back to Lucene core. Likely one would pre-compile the space of all
> specializations, or we could compile Java on the fly (the way a JSP
> page is recompiled when its source changes), but I'm not sure how
> costly/portable that is (see the sketch below).
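As a rough illustration of the compile-Java-on-the-fly option, here is a small sketch using the JDK 6 javax.tools API to compile a generated source file at runtime and load the resulting class. The directory, file name, class name, and search() entry point are all made up for the example; a production version would need caching per query shape, real error handling, and a fallback to Lucene core when no compiler is available.

{code:java}
// Sketch only: hypothetical generated-source names and entry point.
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class OnTheFlyCompileSketch {

  public static void main(String[] args) throws Exception {
    File dir = new File("specialized");                    // where the generator wrote the source
    File source = new File(dir, "FastSearchTask1.java");   // hypothetical generated specialization

    // Requires a JDK, not just a JRE: getSystemJavaCompiler() returns null on a bare JRE.
    JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
    if (javac == null) {
      throw new IllegalStateException("no compiler available; fall back to Lucene core");
    }
    int rc = javac.run(null, null, null, "-d", dir.getPath(), source.getPath());
    if (rc != 0) {
      throw new IllegalStateException("specialized code failed to compile; fall back to Lucene core");
    }

    // Load the freshly compiled class and invoke its (hypothetical) entry point reflectively.
    URLClassLoader loader = new URLClassLoader(new URL[] { dir.toURI().toURL() });
    Class<?> clazz = loader.loadClass("FastSearchTask1");
    Method search = clazz.getMethod("search", String.class, int.class);
    Object topDocs = search.invoke(clazz.newInstance(), "body:term", 10);
    System.out.println(topDocs);
  }
}
{code}

Pre-compiling the space of specializations instead would avoid the JDK dependency and the per-query compile latency, at the cost of shipping many generated classes.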