[ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
------------------------------

    Attachment: LUCENE-3069.patch

Patch from last commit, and summary:

Previously, our term dictionaries were both block-based: 

* BlockTerms dict breaks the term list into several blocks, as a linear 
  structure with skip points. 

* BlockTreeTerms dict uses a trie-like structure to decide how terms are 
  assigned to different blocks, and uses an FST index to optimize seeking 
  performance.

However, neither kind of term dictionary holds all the term data in 
memory. In the worst case a lookup needs at least two seeks: one in the 
in-memory index, another in the file on disk. And we already have many 
complicated optimizations for this...

If a term dictionary is memory resident by design, the data structure 
will be simpler (after all, we don't need to maintain extra file pointers 
for a second seek, and we don't have to decide on a heuristic for how 
terms are clustered). This is why the two FST-based implementations are 
introduced.
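
To make the two-seek vs. single-lookup difference concrete, here is a 
minimal sketch. It is not the code in the patch: the class and method 
names are hypothetical, a TreeMap stands in for both the in-memory block 
index and the FST, and plain arrays stand in for the on-disk blocks.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: names and layout are illustrative, not Lucene's API. */
public class TermLookupSketch {

    /** Block-based lookup: the in-memory index picks a block, then the block is scanned. */
    static long blockLookup(TreeMap<String, Integer> index,
                            String[][] blockTerms, long[][] blockMeta, String term) {
        // Step 1: find the last block whose first term is <= the target.
        Map.Entry<String, Integer> entry = index.floorEntry(term);
        if (entry == null) {
            return -1; // target sorts before every block
        }
        int block = entry.getValue();
        // Step 2: scan that block; in a real dictionary this is the read from disk.
        for (int i = 0; i < blockTerms[block].length; i++) {
            if (blockTerms[block][i].equals(term)) {
                return blockMeta[block][i];
            }
        }
        return -1; // term not present
    }

    /** Memory-resident lookup: one structure maps each term straight to its metadata. */
    static long residentLookup(Map<String, Long> dict, String term) {
        Long meta = dict.get(term);
        return meta == null ? -1 : meta;
    }

    public static void main(String[] args) {
        // Two "blocks" of sorted terms with one metadata value per term.
        String[][] blockTerms = { {"apple", "bear"}, {"cat", "dog"} };
        long[][] blockMeta = { {10, 20}, {30, 40} };

        // The block index holds only the first term of each block.
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("apple", 0);
        index.put("cat", 1);

        // The memory-resident dictionary holds every term.
        TreeMap<String, Long> resident = new TreeMap<>();
        for (int b = 0; b < blockTerms.length; b++) {
            for (int i = 0; i < blockTerms[b].length; i++) {
                resident.put(blockTerms[b][i], blockMeta[b][i]);
            }
        }

        System.out.println(blockLookup(index, blockTerms, blockMeta, "dog"));
        System.out.println(residentLookup(resident, "dog"));
    }
}
```

The resident variant needs no block offsets and no clustering heuristic, 
at the cost of keeping every term in memory.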

Another big change in the code: since our term dictionaries were both 
block-based, the previous API was also limited. It was the postings writer 
that collected term metadata, and the term dictionary that told the 
postings writer which range of terms to flush to a block. However, the 
encoding of term data should be decided by the term dictionary, since the 
postings writer doesn't always know how terms are structured in the term 
dictionary... The previous API had some tricky code for this, e.g. 
PulsingPostingsWriter had to use a term's ordinal within its block to 
decide how to write metadata, which is unnecessary.

To make the API between term dict and postings list more 'pluggable' and 
'general', I refactored the PostingsReader/WriterBase. For example, the 
postings writer should provide some information to the term dictionary, 
like how many metadata values are strictly monotonic, so that the term 
dictionary can optimize delta-encoding itself. And since the term 
dictionary now fully decides how metadata is written, it gains the ability 
to use intblock-based metadata encoding.
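
As a rough illustration of that contract (a sketch only, not the refactored 
API itself): assume the postings writer reports that the first 
`monotonicCount` longs of every term's metadata are monotonic across terms, 
e.g. file pointers. The term dictionary can then delta-encode exactly those 
columns on its own. The class name, method names and the row layout below 
are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of dictionary-side delta-encoding; not Lucene's actual API. */
public class MonotonicMetadataSketch {

    /**
     * Delta-encodes the first monotonicCount columns of each row against the
     * previous row. Remaining columns are copied unchanged.
     */
    static long[][] deltaEncode(long[][] rows, int monotonicCount) {
        long[] previous = new long[monotonicCount]; // implicit base of zero
        long[][] encoded = new long[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            long[] out = rows[i].clone();
            for (int j = 0; j < monotonicCount; j++) {
                out[j] = rows[i][j] - previous[j];
                previous[j] = rows[i][j];
            }
            encoded[i] = out;
        }
        return encoded;
    }

    /** Inverse of deltaEncode: rebuilds absolute values by running sums. */
    static long[][] deltaDecode(long[][] encoded, int monotonicCount) {
        long[] previous = new long[monotonicCount];
        long[][] rows = new long[encoded.length][];
        for (int i = 0; i < encoded.length; i++) {
            long[] out = encoded[i].clone();
            for (int j = 0; j < monotonicCount; j++) {
                out[j] = previous[j] + encoded[i][j];
                previous[j] = out[j];
            }
            rows[i] = out;
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed row layout: {docFilePointer, posFilePointer, docFreq}.
        // Only the first two columns are monotonic across terms.
        long[][] rows = { {100, 40, 3}, {180, 95, 1}, {260, 130, 7} };
        long[][] encoded = deltaEncode(rows, 2);
        System.out.println(Arrays.deepToString(encoded));
        System.out.println(Arrays.deepToString(deltaDecode(encoded, 2)));
    }
}
```

The point of the split is that the postings writer only declares how many 
values are monotonic; how the deltas are laid out (per term, per block, or 
in an int block) stays a term dictionary decision.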

Now the two term dictionary implementations can easily be plugged into the 
current postings formats, like:
* FST41 = 
    FSTTermdict + Lucene41PostingsBaseFormat,
* FSTOrd41 = 
    FSTOrdTermdict + Lucene41PostingsBaseFormat,
* FSTOrdPulsing41 = 
    FSTOrdTermsdict + PulsingPostingsWrapper + Lucene41PostingsFormat.

As for performance, as shown before, the two term dicts improve primary 
key lookup, but still have overhead on wildcard queries (both term dicts 
hold only prefix information, and the term dictionary cannot work well 
with this...). I'll try to hack on this later.
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

