[
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552364#comment-13552364
]
Michael McCandless commented on LUCENE-3298:
--------------------------------------------
Search perf looks fine ... maybe a bit slower for the terms-dict/FST-heavy
queries (PKLookup, Fuzzy1/2, Respell):
{noformat}
            Task    QPS base  StdDev    QPS comp  StdDev            Pct diff
      AndHighMed       66.76  (1.8%)       64.53  (0.8%)   -3.3% ( -5% -  0%)
        PKLookup      300.07  (1.1%)      295.77  (2.3%)   -1.4% ( -4% -  2%)
         Respell       71.30  (3.0%)       70.35  (3.2%)   -1.3% ( -7% -  4%)
          Fuzzy2       78.05  (2.6%)       77.14  (2.3%)   -1.2% ( -5% -  3%)
HighSloppyPhrase       35.17  (4.6%)       34.82  (4.4%)   -1.0% ( -9% -  8%)
          Fuzzy1       87.15  (3.2%)       86.36  (2.2%)   -0.9% ( -6% -  4%)
 LowSloppyPhrase      198.02  (4.5%)      196.62  (4.4%)   -0.7% ( -9% -  8%)
      AndHighLow     2344.92  (4.0%)     2328.77  (5.0%)   -0.7% ( -9% -  8%)
         Prefix3      146.38  (1.6%)      145.83  (1.7%)   -0.4% ( -3% -  2%)
     MedSpanNear      125.96  (4.3%)      125.65  (4.4%)   -0.2% ( -8% -  8%)
     LowSpanNear       88.16  (2.2%)       87.97  (2.0%)   -0.2% ( -4% -  4%)
          IntNRQ       15.10  (2.5%)       15.07  (2.3%)   -0.2% ( -4% -  4%)
      HighPhrase       17.05  (4.5%)       17.03  (5.4%)   -0.1% ( -9% - 10%)
    HighSpanNear       11.97  (4.3%)       11.96  (4.0%)   -0.1% ( -8% -  8%)
     AndHighHigh       71.79  (2.0%)       71.80  (0.9%)    0.0% ( -2% -  2%)
        Wildcard       41.93  (1.5%)       41.98  (1.3%)    0.1% ( -2% -  2%)
       MedPhrase       41.43  (1.7%)       41.48  (1.8%)    0.1% ( -3% -  3%)
         MedTerm      199.42  (6.6%)      200.15  (6.5%)    0.4% (-11% - 14%)
        HighTerm      142.32  (6.9%)      142.89  (6.6%)    0.4% (-12% - 14%)
 MedSloppyPhrase       25.56  (5.9%)       25.67  (6.4%)    0.4% (-11% - 13%)
         LowTerm     1016.02  (3.3%)     1021.04  (3.2%)    0.5% ( -5% -  7%)
       LowPhrase       67.43  (2.1%)       67.80  (2.6%)    0.5% ( -4% -  5%)
      OrHighHigh       22.58  (5.0%)       22.89  (5.3%)    1.4% ( -8% - 12%)
       OrHighMed       52.47  (5.2%)       53.31  (5.2%)    1.6% ( -8% - 12%)
       OrHighLow       24.74  (5.4%)       25.18  (5.3%)    1.8% ( -8% - 13%)
{noformat}
I also tested building an FST from all Wikipedia terms:
* trunk took 7.9 seconds to build; the patch took 9.0 seconds. I
suspect this is from the cutover in NodeHash from int[] ->
GrowableWriter. I think this slowdown is acceptable.
* trunk took 545 nsec per lookup; the patch took 578 nsec per lookup. A
bit slower, but I think it's OK.
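To illustrate why that cutover costs a little CPU: GrowableWriter-style storage packs values at the smallest bits-per-value that fits and re-packs wider when a larger value arrives, trading bit math and occasional re-packs for a smaller footprint than a plain int[]. Here's a minimal self-contained sketch of that idea (hypothetical code, not Lucene's actual GrowableWriter; widths are restricted to 8/16/32/64 bits and values are assumed non-negative):

```java
// Sketch of a packed array that widens its bits-per-value on demand.
class GrowableWriterSketch {
    private long[] packed;
    private int bits = 8;      // current bits per value: 8, 16, 32, or 64
    private final int size;

    GrowableWriterSketch(int size) {
        this.size = size;
        this.packed = new long[(size + perWord() - 1) / perWord()];
    }

    private int perWord() { return 64 / bits; }

    private long mask() { return bits == 64 ? ~0L : (1L << bits) - 1; }

    long get(int index) {
        int shift = (index % perWord()) * bits;
        return (packed[index / perWord()] >>> shift) & mask();
    }

    void set(int index, long value) {
        while (bits < 64 && value > mask()) grow();
        int per = perWord();
        int shift = (index % per) * bits;
        long cleared = packed[index / per] & ~(mask() << shift);
        packed[index / per] = cleared | (value << shift);
    }

    // Re-pack every value at double the width; old values always fit.
    private void grow() {
        long[] values = new long[size];
        for (int i = 0; i < size; i++) values[i] = get(i);
        bits *= 2;
        packed = new long[(size + perWord() - 1) / perWord()];
        for (int i = 0; i < size; i++) set(i, values[i]);
    }

    public static void main(String[] args) {
        GrowableWriterSketch addrs = new GrowableWriterSketch(10);
        addrs.set(3, 200L);     // fits in the initial 8 bits
        addrs.set(4, 70_000L);  // forces widening to 32 bits per value
        System.out.println(addrs.get(3));  // prints 200: survives the re-pack
        System.out.println(addrs.get(4));  // prints 70000
    }
}
```

The re-pack on growth and the shift/mask arithmetic on every access are the kind of overhead consistent with the small build-time slowdown above.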
I also tested tokenizing the first 100K Japanese Wikipedia docs w/
Kuromoji:
* trunk took 156.4 sec
* patch took 157.1 sec
Only a wee bit slower (could easily be noise).
> FST has hard limit max size of 2.1 GB
> -------------------------------------
>
> Key: LUCENE-3298
> URL: https://issues.apache.org/jira/browse/LUCENE-3298
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch,
> LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in Java is
> indexed by int, so we cannot grow this over Integer.MAX_VALUE. It also
> internally encodes references to this array as vInt.
> We could switch this to a paged byte[] and make the FST far larger.
> But I think this is low priority... I'm not going to work on it any time soon.
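The paged byte[] the description mentions can be sketched as a store addressed by long, where the high bits of the address pick a fixed-size page and the low bits pick an offset within it, sidestepping the Integer.MAX_VALUE cap on a single byte[]. This is a hypothetical illustration of the idea, not Lucene's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// A byte store addressed by long: pages are allocated lazily, and the
// address space is limited only by memory, not by int array indexing.
class PagedBytes {
    private static final int PAGE_BITS = 15;           // 32 KB pages
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final List<byte[]> pages = new ArrayList<>();
    private long length;

    void writeByte(byte b) {
        int page = (int) (length >>> PAGE_BITS);
        if (page == pages.size()) {
            pages.add(new byte[PAGE_SIZE]);            // grow one page at a time
        }
        pages.get(page)[(int) (length & PAGE_MASK)] = b;
        length++;
    }

    byte readByte(long pos) {
        // High bits select the page, low bits the offset within it.
        return pages.get((int) (pos >>> PAGE_BITS))[(int) (pos & PAGE_MASK)];
    }

    long length() { return length; }

    public static void main(String[] args) {
        PagedBytes bytes = new PagedBytes();
        for (int i = 0; i < 100_000; i++) {
            bytes.writeByte((byte) (i & 0xFF));
        }
        System.out.println(bytes.readByte(65_537L));   // prints 1 (65537 & 0xFF)
        System.out.println(bytes.length());            // prints 100000
    }
}
```

The remaining work the issue implies is in the references: internal arc targets encoded as vInt would also need to be widened (e.g. to vLong) before the store could actually exceed 2.1 GB.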
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]