[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Jiang updated LUCENE-3069:
------------------------------
    Attachment: LUCENE-5152.patch

The previous design put a lot of stress on decoding Outputs. This becomes a disaster for wildcard queries: for f*nd, for example, we usually have to walk to the last character in the FST, only to find that it is not 'd' and the automaton doesn't accept the term. In this case, TempFST actually iterates over all the results of f* and decodes all the metadata for them...

So I'm trying another approach. The main idea is to load metadata & stats as lazily as possible. Here I use FST<Long> as the term index, and leave all the other stuff in a single term block. The term index FST holds the <Term, Ord> mapping, and in the term block we can maintain a skip list for finding the related metadata & stats.

It is a little similar to BTTR now, and someday we can control how much data to keep memory resident (e.g. keep stats in memory but metadata on disk; however, this should be another issue). Another good part is that it naturally supports seek by ord (ah, actually I don't understand where that is used).

Tests pass, and intersect is not implemented yet.

Perf based on 1M wiki data, between non-intersect TempFST and TempFSTOrd:

{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev    Pct diff
PKLookup      373.80  (0.0%)      320.30  (0.0%)    -14.3% ( -14% -  -14%)
  Fuzzy1       43.82  (0.0%)       47.10  (0.0%)      7.5% (   7% -    7%)
 Prefix3      399.62  (0.0%)      433.95  (0.0%)      8.6% (   8% -    8%)
  Fuzzy2       14.26  (0.0%)       15.95  (0.0%)     11.9% (  11% -   11%)
 Respell       40.69  (0.0%)       46.29  (0.0%)     13.8% (  13% -   13%)
Wildcard       83.44  (0.0%)       96.54  (0.0%)     15.7% (  15% -   15%)
{noformat}

The perf hit on PKLookup is to be expected, since I haven't optimized the skip list yet.

I'll update intersect() later, and after that we'll cut over to PagedBytes & PackedLongBuffer.
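
To make the lazy-loading idea concrete, here is a minimal, self-contained sketch of the layout described in the comment: a term index that maps each term to an ord, and a single term block whose entries are located through skip points and decoded only on demand. It is not the code in the patch and does not use Lucene classes. The class name LazyTermDictSketch, the TreeMap standing in for FST<Long>, the skip interval of 4, the one-byte length prefix, and the prefix/suffix check standing in for automaton intersection are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model: term index (term -> ord) plus a lazily decoded term block. */
final class LazyTermDictSketch {
    // Assumed value: one skip entry for every 4 terms in the block.
    static final int SKIP_INTERVAL = 4;

    // Stand-in for FST<Long>: sorted mapping from term to ord.
    private final TreeMap<String, Long> termIndex = new TreeMap<>();
    // Term block: entries in ord order, each a 1-byte length plus payload.
    private final byte[] block;
    // skipOffsets[i] is the block offset of the entry with ord i * SKIP_INTERVAL.
    private final int[] skipOffsets;
    // Number of metadata entries that were actually materialized.
    int decodedEntries = 0;

    LazyTermDictSketch(SortedMap<String, String> termToMetadata) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<Integer> skips = new ArrayList<>();
        long ord = 0;
        for (Map.Entry<String, String> e : termToMetadata.entrySet()) {
            if (ord % SKIP_INTERVAL == 0) {
                skips.add(out.size());
            }
            byte[] payload = e.getValue().getBytes(StandardCharsets.UTF_8);
            if (payload.length > 255) {
                throw new IllegalArgumentException("payload too long for this sketch");
            }
            out.write(payload.length);
            out.write(payload, 0, payload.length);
            termIndex.put(e.getKey(), ord++);
        }
        block = out.toByteArray();
        skipOffsets = new int[skips.size()];
        for (int i = 0; i < skipOffsets.length; i++) {
            skipOffsets[i] = skips.get(i);
        }
    }

    /** Term index lookup only; the term block is not touched. Returns -1 if absent. */
    long ord(String term) {
        Long ord = termIndex.get(term);
        return ord == null ? -1L : ord;
    }

    /** Seek by ord: jump to the nearest skip point, hop over lengths, decode one entry. */
    String metadata(long ord) {
        int offset = skipOffsets[(int) (ord / SKIP_INTERVAL)];
        for (long i = 0; i < ord % SKIP_INTERVAL; i++) {
            offset += 1 + (block[offset] & 0xFF); // skip without decoding the payload
        }
        int length = block[offset] & 0xFF;
        decodedEntries++;
        return new String(block, offset + 1, length, StandardCharsets.UTF_8);
    }

    /** Simplified wildcard prefix*suffix: filter on terms first, decode matches only. */
    List<String> prefixSuffixMatches(String prefix, String suffix) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : termIndex.tailMap(prefix, true).entrySet()) {
            String term = e.getKey();
            if (!term.startsWith(prefix)) {
                break; // left the prefix range of the sorted index
            }
            if (term.length() >= prefix.length() + suffix.length() && term.endsWith(suffix)) {
                result.add(metadata(e.getValue()));
            }
        }
        return result;
    }
}
```

In this sketch, a query shaped like f*nd still visits every term under the f prefix in the index, but metadata is decoded only for the terms that end up accepted, which is the cost the comment is trying to avoid paying for rejected terms. The linear hop inside a skip interval is also where an unoptimized skip list would add overhead to exact lookups.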
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-5152.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.