[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Han Jiang (JIRA) Tue, 30 Jul 2013 09:32:41 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Han Jiang updated LUCENE-3069:
------------------------------

    Attachment: LUCENE-5152.patch

Previous design put much stress on decoding of Outputs. 
This becomes disaster for wildcard queries: like for f*nd, 
we usually have to walk to the last character of FST, then
find that it is not 'd' and automaton doesn't accept this.
In this case, TempFST is actually iterating all the result
of f*, which decodes all the metadata for them...

So I'm trying another approach, the main idea is to load 
metadata & stats as lazily as possible. 
Here I use FST<Long> as term index, and leave all other stuff 
in a single term block. The term index FST holds the relationship 
between <Term, Ord>, and in the term block we can maintain a skip list
for find related metadata & stats.

It is a little similar to BTTR now, and we can someday control how much
data to keep memory resident (e.g. keep stats in memory but metadata on 
disk, however this should be another issue).
Another good part is, it naturally supports seek by ord.(ah, 
actually I don't understand where it is used).

Tests pass, and intersect is not implemented yet.
perf based on 1M wiki data, between non-intersect TempFST and TempFSTOrd:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev        
        Pct diff
                PKLookup      373.80      (0.0%)      320.30      (0.0%)  
-14.3% ( -14% -  -14%)
                  Fuzzy1       43.82      (0.0%)       47.10      (0.0%)    
7.5% (   7% -    7%)
                 Prefix3      399.62      (0.0%)      433.95      (0.0%)    
8.6% (   8% -    8%)
                  Fuzzy2       14.26      (0.0%)       15.95      (0.0%)   
11.9% (  11% -   11%)
                 Respell       40.69      (0.0%)       46.29      (0.0%)   
13.8% (  13% -   13%)
                Wildcard       83.44      (0.0%)       96.54      (0.0%)   
15.7% (  15% -   15%)
{noformat}

perf hit on pklookup should be sane, since I haven't optimize the skip list.

I'll update intersect() later, and later we'll cutover to 
PagedBytes & PackedLongBuffer.

                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-5152.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Reply via email to