[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552364#comment-13552364 ]

Michael McCandless commented on LUCENE-3298:
--------------------------------------------

Search perf looks fine ... maybe a bit slower for the terms dict/FST
heavy queries (PKLookup, Fuzzy1/2, Respell):

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
              AndHighMed       66.76      (1.8%)       64.53      (0.8%)   -3.3% (  -5% -    0%)
                PKLookup      300.07      (1.1%)      295.77      (2.3%)   -1.4% (  -4% -    2%)
                 Respell       71.30      (3.0%)       70.35      (3.2%)   -1.3% (  -7% -    4%)
                  Fuzzy2       78.05      (2.6%)       77.14      (2.3%)   -1.2% (  -5% -    3%)
        HighSloppyPhrase       35.17      (4.6%)       34.82      (4.4%)   -1.0% (  -9% -    8%)
                  Fuzzy1       87.15      (3.2%)       86.36      (2.2%)   -0.9% (  -6% -    4%)
         LowSloppyPhrase      198.02      (4.5%)      196.62      (4.4%)   -0.7% (  -9% -    8%)
              AndHighLow     2344.92      (4.0%)     2328.77      (5.0%)   -0.7% (  -9% -    8%)
                 Prefix3      146.38      (1.6%)      145.83      (1.7%)   -0.4% (  -3% -    2%)
             MedSpanNear      125.96      (4.3%)      125.65      (4.4%)   -0.2% (  -8% -    8%)
             LowSpanNear       88.16      (2.2%)       87.97      (2.0%)   -0.2% (  -4% -    4%)
                  IntNRQ       15.10      (2.5%)       15.07      (2.3%)   -0.2% (  -4% -    4%)
              HighPhrase       17.05      (4.5%)       17.03      (5.4%)   -0.1% (  -9% -   10%)
            HighSpanNear       11.97      (4.3%)       11.96      (4.0%)   -0.1% (  -8% -    8%)
             AndHighHigh       71.79      (2.0%)       71.80      (0.9%)    0.0% (  -2% -    2%)
                Wildcard       41.93      (1.5%)       41.98      (1.3%)    0.1% (  -2% -    2%)
               MedPhrase       41.43      (1.7%)       41.48      (1.8%)    0.1% (  -3% -    3%)
                 MedTerm      199.42      (6.6%)      200.15      (6.5%)    0.4% ( -11% -   14%)
                HighTerm      142.32      (6.9%)      142.89      (6.6%)    0.4% ( -12% -   14%)
         MedSloppyPhrase       25.56      (5.9%)       25.67      (6.4%)    0.4% ( -11% -   13%)
                 LowTerm     1016.02      (3.3%)     1021.04      (3.2%)    0.5% (  -5% -    7%)
               LowPhrase       67.43      (2.1%)       67.80      (2.6%)    0.5% (  -4% -    5%)
              OrHighHigh       22.58      (5.0%)       22.89      (5.3%)    1.4% (  -8% -   12%)
               OrHighMed       52.47      (5.2%)       53.31      (5.2%)    1.6% (  -8% -   12%)
               OrHighLow       24.74      (5.4%)       25.18      (5.3%)    1.8% (  -8% -   13%)
{noformat}

I also tested building FST from all Wikipedia terms:

  * trunk takes 7.9 seconds to build; the patch takes 9.0 seconds.  I
    suspect this is from the cutover in NodeHash from int[] ->
    GrowableWriter.  I think this slowdown is acceptable.

  * trunk takes 545 nsec per lookup; the patch takes 578 nsec per
    lookup.  A bit slower, but I think it's OK.
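The int[] -> GrowableWriter trade-off behind that build slowdown can be
illustrated with a small sketch (hypothetical code, not Lucene's actual
GrowableWriter or NodeHash): a packed array that starts at a small
bits-per-value and pays a widening copy whenever a larger value shows up,
which is extra work a plain int[] never does.

```java
/**
 * Hypothetical sketch of a growable packed array: a fixed-size array of
 * non-negative longs stored at the minimum bits-per-value seen so far.
 * Writing a larger value triggers a widen-and-copy, trading CPU on write
 * for memory vs a plain int[] (the NodeHash trade-off described above).
 */
final class GrowablePacked {
  private long[] blocks;       // packed storage, 64 bits per block
  private int bitsPerValue;
  private final int size;

  GrowablePacked(int size, int startBitsPerValue) {
    this.size = size;
    this.bitsPerValue = startBitsPerValue;
    this.blocks = new long[blocksNeeded(size, startBitsPerValue)];
  }

  private static int blocksNeeded(int size, int bits) {
    return (int) (((long) size * bits + 63) >>> 6);
  }

  private static long get(long[] blocks, int bits, int index) {
    long bitPos = (long) index * bits;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long mask = bits == 64 ? -1L : (1L << bits) - 1;
    long value = blocks[block] >>> shift;
    if (shift + bits > 64) {                 // value straddles two blocks
      value |= blocks[block + 1] << (64 - shift);
    }
    return value & mask;
  }

  private static void set(long[] blocks, int bits, int index, long value) {
    long bitPos = (long) index * bits;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long mask = bits == 64 ? -1L : (1L << bits) - 1;
    long v = value & mask;
    blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
    if (shift + bits > 64) {                 // spill into the next block
      long spillMask = (1L << (shift + bits - 64)) - 1;
      blocks[block + 1] = (blocks[block + 1] & ~spillMask) | (v >>> (64 - shift));
    }
  }

  long get(int index) {
    return get(blocks, bitsPerValue, index);
  }

  void set(int index, long value) {
    int needed = 64 - Long.numberOfLeadingZeros(value);
    if (needed > bitsPerValue) {             // the step a plain int[] never pays
      long[] wider = new long[blocksNeeded(size, needed)];
      for (int i = 0; i < size; i++) {
        set(wider, needed, i, get(i));       // copy everything at the new width
      }
      blocks = wider;
      bitsPerValue = needed;
    }
    set(blocks, bitsPerValue, index, value);
  }
}
```

Starting narrow keeps the hash small while most nodes have small
addresses; the widening copies are what could account for the extra
second of build time.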

I also tested tokenizing the first 100K Japanese Wikipedia docs w/
Kuromoji:

  * trunk took 156.4 sec

  * patch took 157.1 sec

Only a wee bit slower (could easily be noise).

                
> FST has hard limit max size of 2.1 GB
> -------------------------------------
>
>                 Key: LUCENE-3298
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3298
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in Java is 
> indexed by int, so we cannot grow it over Integer.MAX_VALUE.  It also 
> internally encodes references into this array as vInt.
> We could switch this to a paged byte[] and make the FST far larger.
> But I think this is low priority... I'm not going to work on it any time soon.
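The paged byte[] idea in the description could look roughly like this (a
hypothetical illustration, not the actual patch): store the bytes in
fixed-size pages and address them with a long, so total size is no longer
capped at Integer.MAX_VALUE the way a single contiguous byte[] is.

```java
import java.util.ArrayList;

/**
 * Hypothetical sketch of a paged byte store (not the actual patch):
 * bytes live in fixed-size byte[] pages and are addressed by a long,
 * so the total length can exceed Integer.MAX_VALUE.
 */
final class PagedBytes {
  private static final int BLOCK_BITS = 15;            // 32 KB pages
  private static final int BLOCK_SIZE = 1 << BLOCK_BITS;
  private static final int BLOCK_MASK = BLOCK_SIZE - 1;

  private final ArrayList<byte[]> pages = new ArrayList<>();
  private long length;

  byte get(long pos) {
    // page index and in-page offset are cheap shift/mask ops
    return pages.get((int) (pos >>> BLOCK_BITS))[(int) (pos & BLOCK_MASK)];
  }

  void append(byte b) {
    int offset = (int) (length & BLOCK_MASK);
    if (offset == 0) {                                 // current page is full
      pages.add(new byte[BLOCK_SIZE]);
    }
    pages.get(pages.size() - 1)[offset] = b;
    length++;
  }

  long length() {
    return length;
  }
}
```

Keeping the page size a power of two means the long address splits into
page index and offset with a shift and a mask, so random reads stay
cheap; node references would still need to grow from vInt to vLong.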

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
