[
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Han Jiang updated LUCENE-3069:
------------------------------
Attachment: example.png
LUCENE-3069.patch
Uploaded patch; it is the main part of the changes I committed to branch3069.
The picture shows the current implementation of outputs (fetched from one field
in wikimedium5k):
* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)
A single-byte flag indicates which of these fields the current outputs
maintain; for PBF with a short byte[], this should be enough. Also, for
long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the
body field in wikimedium1m, 85.8% of terms have df == ttf).
Since the TermsEnum is entirely based on FSTEnum, term dictionary performance
should be similar to MemoryPF. However, for PK tasks we have to pull the
docsEnum from MMap, which hurts.
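A minimal sketch (assumed, not the branch's actual class) of how such a TermsEnum walks the dictionary via BytesRefFSTEnum:
{code:java}
import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.fst.BytesRefFSTEnum;
import org.apache.lucene.util.fst.FST;

// Assumed sketch: iteration/seek goes straight through BytesRefFSTEnum, so the
// term dict behaves like MemoryPF; only the postings still come from MMap'd files.
final class FSTTermsEnumSketch<T> {
  private final BytesRefFSTEnum<T> fstEnum;

  FSTTermsEnumSketch(FST<T> fst) {
    this.fstEnum = new BytesRefFSTEnum<T>(fst);
  }

  /** Next term in the dictionary, or null when exhausted. */
  BytesRef next() throws IOException {
    BytesRefFSTEnum.InputOutput<T> io = fstEnum.next();
    return io == null ? null : io.input;
  }

  /** Exact seek; the returned output carries the per-term metadata. */
  BytesRefFSTEnum.InputOutput<T> seekExact(BytesRef term) throws IOException {
    return fstEnum.seekExact(term);
  }
}
{code}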
Here is the performance comparison:
{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall
            Task    QPS base    StdDev    QPS comp    StdDev   Pct diff
         Respell       48.13    (4.4%)       15.38    (1.0%)   -68.0% ( -70% - -65%)
          Fuzzy2       51.30    (5.3%)       17.47    (1.3%)   -65.9% ( -68% - -62%)
          Fuzzy1       52.24    (4.0%)       18.50    (1.2%)   -64.6% ( -67% - -61%)
        Wildcard        9.31    (1.7%)        6.16    (2.2%)   -33.8% ( -37% - -30%)
         Prefix3       23.25    (1.8%)       19.00    (2.2%)   -18.3% ( -21% - -14%)
        PKLookup      244.92    (3.6%)      225.42    (2.3%)    -8.0% ( -13% -  -2%)
         LowTerm      295.88    (5.5%)      293.27    (4.8%)    -0.9% ( -10% -   9%)
      HighPhrase       13.62    (6.5%)       13.54    (7.4%)    -0.6% ( -13% -  14%)
         MedTerm       99.51    (7.8%)       99.19    (7.7%)    -0.3% ( -14% -  16%)
       MedPhrase      154.63    (9.4%)      154.38   (10.1%)    -0.2% ( -17% -  21%)
        HighTerm       28.25   (10.7%)       28.25   (10.0%)    -0.0% ( -18% -  23%)
      OrHighHigh       16.83   (13.3%)       16.86   (13.1%)     0.2% ( -23% -  30%)
HighSloppyPhrase        9.02    (4.4%)        9.03    (4.5%)     0.2% (  -8% -   9%)
       LowPhrase        6.26    (3.4%)        6.27    (4.1%)     0.2% (  -7% -   8%)
       OrHighMed       13.73   (13.2%)       13.77   (12.8%)     0.3% ( -22% -  30%)
       OrHighLow       25.65   (13.2%)       25.73   (13.0%)     0.3% ( -22% -  30%)
 MedSloppyPhrase        6.63    (2.7%)        6.66    (2.7%)     0.5% (  -4% -   6%)
      AndHighMed       42.77    (1.8%)       43.13    (1.5%)     0.8% (  -2% -   4%)
 LowSloppyPhrase       32.68    (3.0%)       32.96    (2.8%)     0.8% (  -4% -   6%)
     AndHighHigh       22.90    (1.2%)       23.18    (0.7%)     1.2% (   0% -   3%)
     LowSpanNear       29.30    (2.0%)       29.83    (2.2%)     1.8% (  -2% -   6%)
     MedSpanNear        8.39    (2.7%)        8.56    (2.9%)     2.0% (  -3% -   7%)
          IntNRQ        3.12    (1.9%)        3.18    (6.7%)     2.1% (  -6% -  10%)
      AndHighLow      507.01    (2.4%)      522.10    (2.8%)     3.0% (  -2% -   8%)
    HighSpanNear        5.43    (1.8%)        5.60    (2.6%)     3.1% (  -1% -   7%)
{noformat}
{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall
            Task    QPS base    StdDev    QPS comp    StdDev   Pct diff
         Respell       49.24    (2.7%)       15.51    (1.0%)   -68.5% ( -70% - -66%)
          Fuzzy2       52.01    (4.8%)       17.61    (1.4%)   -66.1% ( -68% - -63%)
          Fuzzy1       53.00    (4.0%)       18.62    (1.3%)   -64.9% ( -67% - -62%)
        Wildcard        9.37    (1.3%)        6.15    (2.1%)   -34.4% ( -37% - -31%)
         Prefix3       23.36    (0.8%)       18.96    (2.1%)   -18.8% ( -21% - -16%)
       MedPhrase      155.86    (9.8%)      152.34    (9.7%)    -2.3% ( -19% -  19%)
       LowPhrase        6.33    (3.7%)        6.23    (4.0%)    -1.6% (  -8% -   6%)
      HighPhrase       13.68    (7.2%)       13.49    (6.8%)    -1.4% ( -14% -  13%)
       OrHighMed       13.78   (13.0%)       13.68   (12.7%)    -0.8% ( -23% -  28%)
HighSloppyPhrase        9.14    (5.2%)        9.07    (3.7%)    -0.7% (  -9% -   8%)
      OrHighHigh       16.87   (13.3%)       16.76   (12.9%)    -0.6% ( -23% -  29%)
       OrHighLow       25.71   (13.1%)       25.58   (12.8%)    -0.5% ( -23% -  29%)
 MedSloppyPhrase        6.69    (2.7%)        6.67    (2.4%)    -0.3% (  -5% -   4%)
 LowSloppyPhrase       33.01    (3.2%)       32.99    (2.6%)    -0.1% (  -5% -   5%)
         MedTerm       99.64    (8.0%)       99.67   (10.9%)     0.0% ( -17% -  20%)
         LowTerm      294.52    (5.5%)      295.72    (7.2%)     0.4% ( -11% -  13%)
     LowSpanNear       29.61    (2.6%)       29.76    (2.7%)     0.5% (  -4% -   5%)
          IntNRQ        3.13    (1.8%)        3.16    (7.8%)     0.8% (  -8% -  10%)
     MedSpanNear        8.49    (3.0%)        8.57    (3.4%)     0.9% (  -5% -   7%)
      AndHighMed       42.86    (1.4%)       43.35    (1.4%)     1.1% (  -1% -   3%)
     AndHighHigh       22.98    (0.6%)       23.26    (0.5%)     1.2% (   0% -   2%)
    HighSpanNear        5.51    (3.4%)        5.58    (3.4%)     1.3% (  -5% -   8%)
        HighTerm       28.32   (10.5%)       28.76   (15.0%)     1.6% ( -21% -  30%)
      AndHighLow      509.60    (2.2%)      526.17    (1.9%)     3.3% (   0% -   7%)
        PKLookup      156.59    (2.2%)      225.47    (2.8%)    44.0% (  38% -  50%)
{noformat}
To recover the performance on automaton queries, the intersect methods still
need to be implemented.
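This is the hook involved; a placeholder sketch (assumed, not from the patch) of what would be overridden:
{code:java}
import java.io.IOException;

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.CompiledAutomaton;

// Assumed sketch: the automaton queries above (Respell/Fuzzy/Wildcard/Prefix3)
// go through Terms.intersect; until it walks the term FST and the query
// automaton together, they fall back to the generic implementation.
abstract class TempFSTTermsSketch extends Terms {
  @Override
  public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throws IOException {
    // TODO: intersect the query automaton with the term FST directly
    return super.intersect(compiled, startTerm);
  }
}
{code}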
And the index size comparison (actually, after LUCENE-5029, TempBlock has a
slightly larger (5%) index size than Lucene41):
{noformat}
            wikimedium1m    wikimediumall
Memory         2,212,352                /
Lucene41         448,164       12,104,520
TempFST          525,888       12,770,700
{noformat}
As for the term dict size:
{noformat}
                       wikimedium1m    wikimediumall
Lucene41(.tim+.tip)          157776          2059744
TempFST(.tmp)                233636          2779784
increase                        48%              35%
{noformat}
Some unresolved problems:
* Currently, TempFST uses the default option to build the FST (i.e. doPacked =
false). When this option is switched on, the index size on wikimedium1m becomes
smaller, but on wikimediumall it becomes larger; why? (See the sketch below.)
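For reference, a sketch of the two build paths being compared; this assumes the Lucene 4.x FST Builder API, and the expanded constructor's parameter list (in particular doPackFST) is an assumption to double-check against trunk:
{code:java}
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.NoOutputs;
import org.apache.lucene.util.fst.Outputs;
import org.apache.lucene.util.packed.PackedInts;

// Assumed sketch of the two build paths; parameter names/order of the expanded
// Builder constructor are recalled from the 4.x API, not verified against the branch.
final class PackedFSTExperiment {
  static Builder<Object> newBuilder(boolean doPackFST) {
    Outputs<Object> outputs = NoOutputs.getSingleton();
    if (!doPackFST) {
      // current default in TempFST: no packing
      return new Builder<Object>(FST.INPUT_TYPE.BYTE1, outputs);
    }
    // packing switched on via the expanded constructor
    return new Builder<Object>(FST.INPUT_TYPE.BYTE1, 0, 0, true, true,
        Integer.MAX_VALUE, outputs, null, true /* doPackFST */,
        PackedInts.COMPACT, true, 15);
  }
}
{code}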
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.