[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552184#comment-13552184 ]
Michael McCandless commented on LUCENE-4682:
--------------------------------------------
OK I ran several performance tests, comparing trunk (lots of waste),
the "up to 25% waste", and 0 waste (no array arcs).
First, I tested luceneutil on all of English Wikipedia, on an optimized index:
Base = trunk (lots of waste), comp = up to 25% waste:
{noformat}
            Task   QPS base  StdDev   QPS comp  StdDev   Pct diff
         Respell     167.37  (3.3%)     128.36  (3.2%)   -23.3% ( -28% - -17%)
          Fuzzy2      83.54  (8.0%)      74.17  (7.2%)   -11.2% ( -24% - 4%)
        PKLookup     359.62  (2.9%)     346.88  (4.5%)    -3.5% ( -10% - 4%)
        Wildcard      28.48  (4.3%)      28.25  (3.9%)    -0.8% ( -8% - 7%)
          Fuzzy1      82.77  (7.1%)      82.36  (7.7%)    -0.5% ( -14% - 15%)
HighSloppyPhrase       0.78  (4.6%)       0.78  (4.5%)    -0.4% ( -9% - 9%)
          IntNRQ       3.38  (8.1%)       3.37  (8.1%)    -0.2% ( -15% - 17%)
 MedSloppyPhrase      28.13  (2.4%)      28.13  (2.4%)    -0.0% ( -4% - 4%)
 LowSloppyPhrase      25.10  (1.8%)      25.11  (1.8%)     0.0% ( -3% - 3%)
         Prefix3      18.40  (4.8%)      18.41  (5.3%)     0.1% ( -9% - 10%)
         MedTerm      65.42 (18.2%)      65.46 (17.7%)     0.1% ( -30% - 43%)
         LowTerm     316.34  (6.9%)     317.45  (7.1%)     0.4% ( -12% - 15%)
       LowPhrase      23.30  (2.1%)      23.39  (1.8%)     0.4% ( -3% - 4%)
        HighTerm      52.11 (18.6%)      52.33 (18.5%)     0.4% ( -30% - 46%)
       OrHighMed      25.65  (6.7%)      25.76  (7.2%)     0.4% ( -12% - 15%)
     AndHighHigh      19.42  (1.8%)      19.52  (2.0%)     0.5% ( -3% - 4%)
       OrHighLow      24.19  (7.1%)      24.36  (7.4%)     0.7% ( -12% - 16%)
      OrHighHigh      14.32  (6.6%)      14.45  (7.3%)     0.9% ( -12% - 15%)
    HighSpanNear       3.30  (3.9%)       3.33  (3.1%)     0.9% ( -5% - 8%)
     LowSpanNear       8.94  (4.3%)       9.04  (3.4%)     1.1% ( -6% - 9%)
      AndHighMed      78.66  (1.6%)      79.82  (1.4%)     1.5% ( -1% - 4%)
     MedSpanNear      30.45  (3.1%)      30.91  (2.8%)     1.5% ( -4% - 7%)
       MedPhrase     130.74  (5.8%)     133.03  (5.9%)     1.8% ( -9% - 14%)
      AndHighLow     571.07  (3.5%)     582.83  (2.8%)     2.1% ( -4% - 8%)
      HighPhrase      11.62  (6.1%)      11.88  (6.4%)     2.2% ( -9% - 15%)
{noformat}
trunk .tip file was 23928963 bytes and comp was 17125654 bytes (~28% smaller!).
Then, base=trunk, comp=0 waste (no array arcs):
{noformat}
            Task   QPS base  StdDev   QPS comp  StdDev   Pct diff
        PKLookup     359.26  (2.3%)     152.91  (5.3%)   -57.4% ( -63% - -51%)
         Respell     168.82  (3.4%)     128.85  (2.2%)   -23.7% ( -28% - -18%)
          Fuzzy2      85.59  (8.2%)      75.03  (6.6%)   -12.3% ( -25% - 2%)
         LowTerm     331.01  (7.9%)     324.30  (7.2%)    -2.0% ( -15% - 14%)
        HighTerm      54.75 (19.2%)      54.00 (19.2%)    -1.4% ( -33% - 45%)
         MedTerm      68.68 (18.5%)      68.04 (19.0%)    -0.9% ( -32% - 44%)
      AndHighMed      80.50  (1.4%)      79.78  (1.4%)    -0.9% ( -3% - 1%)
        Wildcard      29.13  (4.5%)      28.89  (4.0%)    -0.8% ( -8% - 8%)
       LowPhrase      23.88  (2.3%)      23.69  (1.5%)    -0.8% ( -4% - 3%)
      AndHighLow     584.30  (2.9%)     582.17  (2.9%)    -0.4% ( -6% - 5%)
          Fuzzy1      83.84  (6.9%)      83.62  (6.7%)    -0.3% ( -12% - 14%)
     AndHighHigh      19.88  (1.8%)      19.84  (1.5%)    -0.2% ( -3% - 3%)
 LowSloppyPhrase      25.61  (1.9%)      25.56  (2.0%)    -0.2% ( -4% - 3%)
     LowSpanNear       9.20  (3.6%)       9.20  (3.9%)     0.0% ( -7% - 7%)
     MedSpanNear      31.32  (2.7%)      31.36  (2.9%)     0.1% ( -5% - 5%)
 MedSloppyPhrase      28.67  (2.3%)      28.73  (2.2%)     0.2% ( -4% - 4%)
         Prefix3      18.83  (4.9%)      18.88  (5.4%)     0.3% ( -9% - 11%)
      OrHighHigh      14.71  (7.5%)      14.76  (8.2%)     0.3% ( -14% - 17%)
    HighSpanNear       3.39  (3.0%)       3.40  (2.9%)     0.5% ( -5% - 6%)
       OrHighMed      26.28  (7.3%)      26.47  (8.1%)     0.7% ( -13% - 17%)
          IntNRQ       3.44  (8.6%)       3.47  (9.3%)     0.9% ( -15% - 20%)
       OrHighLow      24.69  (7.3%)      24.97  (8.3%)     1.1% ( -13% - 18%)
HighSloppyPhrase       0.80  (4.8%)       0.81  (4.7%)     1.3% ( -7% - 11%)
      HighPhrase      11.89  (6.1%)      12.20  (7.4%)     2.6% ( -10% - 17%)
       MedPhrase     134.04  (5.5%)     138.07  (6.4%)     3.0% ( -8% - 15%)
{noformat}
comp .tip file was 16806594 (just a bit smaller than "up to 25% waste").
Curious how PK was barely affected by the "up to 25% waste" change but
heavily affected by "0 waste", and how Respell/Fuzzy2 lost performance
going to "up to 25% waste" and then didn't lose much more going to "0
waste". I suspect PKLookup will be sensitive to the key structure
(luceneutil uses base-36 keys)...
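My guess at the mechanism, as a toy sketch (the byte layout below is made up for illustration, not Lucene's actual FST format): with fixed-width array arcs, arc i starts at offset i * bytesPerArc, so a reader can binary-search a node's arcs by label, which point lookups like PKLookup lean on heavily; with no array arcs every node's arcs must be decoded and scanned linearly:

```java
// Toy sketch, not Lucene's real encoding: fixed-width arcs are
// addressable (arc i starts at i * bytesPerArc), so lookup can
// binary-search by label; variable-width arcs must be walked one by one.
public class FixedWidthArcs {
    // Fixed-width encoding: the first byte of each slot is the arc label.
    static int binarySearch(byte[] encoded, int bytesPerArc, int label) {
        int lo = 0, hi = encoded.length / bytesPerArc - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int midLabel = encoded[mid * bytesPerArc] & 0xFF;
            if (midLabel < label) lo = mid + 1;
            else if (midLabel > label) hi = mid - 1;
            else return mid;          // arc index within the node
        }
        return -1;                    // no arc with this label
    }

    // Variable-width encoding: each arc is [totalLen][label][payload...],
    // so there is no way to jump to arc i without decoding arcs 0..i-1.
    static int linearScan(byte[] encoded, int label) {
        int arc = 0;
        for (int pos = 0; pos < encoded.length; arc++) {
            int len = encoded[pos] & 0xFF;
            if ((encoded[pos + 1] & 0xFF) == label) return arc;
            pos += len;
        }
        return -1;
    }
}
```

Both return the same answers; the difference is O(log n) vs O(n) probes per node, which compounds over the multi-node walk a primary-key lookup does.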
Next I tested Kuromoji, just tokenizing first 100K docs from jawiki.
* trunk (lots of waste): TokenInfoFST was 1535643 bytes; tokenizing 100K docs took 163.3 sec
* up to 25% waste: TokenInfoFST was 1309108 bytes; tokenizing 100K docs took 215.3 sec
* 0 waste: TokenInfoFST was 1228847 bytes; tokenizing 100K docs took 218.0 sec
Looks like Kuromoji sees good speedups from the arcs wasting more than
25%... but to keep this in perspective, in a "real" app you index after
tokenizing, and that indexing time will dwarf the tokenization time.
But then on the flip side, we are "only" talking about ~221 KB of savings ...
Finally, I tested building a no-outputs FST for all Wikipedia English
terms (9.8 M terms):
* trunk (lots of waste): 59267794 bytes, 566 nsec per lookup
* up to 25% waste: 58509011 bytes, 567 nsec per lookup
* 0 waste: 56413148 bytes, 1808 nsec per lookup
For this case it looks like you get all the lookup benefit by allowing
only 25% waste (though you also keep most of the byte increase?).
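For reference, the waste being traded off here comes from padding every arc of an "array arc" node to max(numBytesPerArc). A small sketch of that accounting, plus the "skip the expansion when the padding is too high" heuristic the different runs approximate (hypothetical helper names, not the actual addNode code):

```java
// Toy accounting of array-arc waste (hypothetical names, not Lucene's
// actual addNode code).  In the array encoding every arc is padded to
// max(bytesPerArc), so the waste is the padding added to each smaller
// arc; the heuristic skips the expansion when padding exceeds some
// fraction (e.g. 25%) of the expanded size.
public class ArcWaste {
    // Padding bytes if these arcs were written at fixed width.
    static int wastedBytes(int[] bytesPerArc) {
        int max = 0, total = 0;
        for (int b : bytesPerArc) {
            max = Math.max(max, b);
            total += b;
        }
        return max * bytesPerArc.length - total;
    }

    // Expand to an array only when the waste stays under maxWasteRatio.
    static boolean shouldExpand(int[] bytesPerArc, double maxWasteRatio) {
        int max = 0;
        for (int b : bytesPerArc) max = Math.max(max, b);
        int expandedSize = max * bytesPerArc.length;
        return wastedBytes(bytesPerArc) <= maxWasteRatio * expandedSize;
    }
}
```

One outlier arc dominates: per-arc sizes {2, 2, 10} expand to 3 * 10 = 30 bytes, of which 16 are padding, so a 25% threshold rejects the expansion, while near-uniform sizes like {9, 10, 10} (1 wasted byte out of 30) pass.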
> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
> Key: LUCENE-4682
> URL: https://issues.apache.org/jira/browse/LUCENE-4682
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST
> writes the arcs as an array (each arc gets N bytes), so we can e.g.
> binary-search on lookup.
> The problem is N is set to the max(numBytesPerArc), so if you have an outlier
> arc e.g. with a big output, you can waste many bytes for all the other arcs
> that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size
> 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass
> about which nodes wasted the most, and then in the second pass (pack) we
> could set the threshold based on the top X% nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is
> more uniform ...