[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552364#comment-13552364 ]
Michael McCandless commented on LUCENE-3298:
--------------------------------------------

Search perf looks fine ... maybe a bit slower for the terms dict/FST heavy queries (PKLookup, Fuzzy1/2, Respell):

{noformat}
                Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
          AndHighMed       66.76      (1.8%)       64.53      (0.8%)   -3.3% (  -5% -    0%)
            PKLookup      300.07      (1.1%)      295.77      (2.3%)   -1.4% (  -4% -    2%)
             Respell       71.30      (3.0%)       70.35      (3.2%)   -1.3% (  -7% -    4%)
              Fuzzy2       78.05      (2.6%)       77.14      (2.3%)   -1.2% (  -5% -    3%)
    HighSloppyPhrase       35.17      (4.6%)       34.82      (4.4%)   -1.0% (  -9% -    8%)
              Fuzzy1       87.15      (3.2%)       86.36      (2.2%)   -0.9% (  -6% -    4%)
     LowSloppyPhrase      198.02      (4.5%)      196.62      (4.4%)   -0.7% (  -9% -    8%)
          AndHighLow     2344.92      (4.0%)     2328.77      (5.0%)   -0.7% (  -9% -    8%)
             Prefix3      146.38      (1.6%)      145.83      (1.7%)   -0.4% (  -3% -    2%)
         MedSpanNear      125.96      (4.3%)      125.65      (4.4%)   -0.2% (  -8% -    8%)
         LowSpanNear       88.16      (2.2%)       87.97      (2.0%)   -0.2% (  -4% -    4%)
              IntNRQ       15.10      (2.5%)       15.07      (2.3%)   -0.2% (  -4% -    4%)
          HighPhrase       17.05      (4.5%)       17.03      (5.4%)   -0.1% (  -9% -   10%)
        HighSpanNear       11.97      (4.3%)       11.96      (4.0%)   -0.1% (  -8% -    8%)
         AndHighHigh       71.79      (2.0%)       71.80      (0.9%)    0.0% (  -2% -    2%)
            Wildcard       41.93      (1.5%)       41.98      (1.3%)    0.1% (  -2% -    2%)
           MedPhrase       41.43      (1.7%)       41.48      (1.8%)    0.1% (  -3% -    3%)
             MedTerm      199.42      (6.6%)      200.15      (6.5%)    0.4% ( -11% -   14%)
            HighTerm      142.32      (6.9%)      142.89      (6.6%)    0.4% ( -12% -   14%)
     MedSloppyPhrase       25.56      (5.9%)       25.67      (6.4%)    0.4% ( -11% -   13%)
             LowTerm     1016.02      (3.3%)     1021.04      (3.2%)    0.5% (  -5% -    7%)
           LowPhrase       67.43      (2.1%)       67.80      (2.6%)    0.5% (  -4% -    5%)
          OrHighHigh       22.58      (5.0%)       22.89      (5.3%)    1.4% (  -8% -   12%)
           OrHighMed       52.47      (5.2%)       53.31      (5.2%)    1.6% (  -8% -   12%)
           OrHighLow       24.74      (5.4%)       25.18      (5.3%)    1.8% (  -8% -   13%)
{noformat}

I also tested building an FST from all Wikipedia terms:
* trunk took 7.9 sec to build, the patch took 9.0 sec; I suspect this is from the cutover in NodeHash from int[] -> GrowableWriter. I think this slowdown is acceptable.
* trunk has 545 nsec per lookup, the patch has 578 nsec per lookup; a bit slower but I think it's OK.
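The NodeHash cutover mentioned above replaces a plain int[] with a packed structure that grows its per-value width on demand. A minimal sketch of that idea (illustrative only — class and method names here are made up, and Lucene's GrowableWriter supports arbitrary bits-per-value rather than the two fixed widths shown):

```java
// Sketch of a "growable" value array: start narrow (8 bits per value) and
// re-pack into a wider representation the first time a value doesn't fit.
// This keeps the hash table small while the FST's node addresses are small.
public class GrowableValues {
  private byte[] narrow;   // 8 bits per value while that suffices
  private long[] wide;     // upgraded storage once a value needs more bits

  public GrowableValues(int size) {
    narrow = new byte[size];
  }

  public long get(int index) {
    return wide != null ? wide[index] : (narrow[index] & 0xFFL);
  }

  public void set(int index, long value) {
    if (wide == null) {
      if (value <= 0xFF) {
        narrow[index] = (byte) value;
        return;
      }
      // re-pack: copy every existing value into the wider representation
      wide = new long[narrow.length];
      for (int i = 0; i < narrow.length; i++) {
        wide[i] = narrow[i] & 0xFFL;
      }
      narrow = null;
    }
    wide[index] = value;
  }
}
```

The occasional re-pack is what shows up as the small build-time slowdown: writes are slightly more expensive than a direct int[] store, in exchange for much lower memory when addresses are small.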
I also tested tokenizing the first 100K Japanese Wikipedia docs w/ Kuromoji:
* trunk took 156.4 sec
* patch took 157.1 sec

Only a wee bit slower (could easily be noise).

> FST has hard limit max size of 2.1 GB
> -------------------------------------
>
>                 Key: LUCENE-3298
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3298
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt.
> We could switch this to a paged byte[] and make the max size far larger.
> But I think this is low priority... I'm not going to work on it any time soon.
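The paged byte[] approach from the quoted description can be sketched as follows (a minimal illustration, not Lucene's actual implementation — the class name and page size are made up): a long address is split into a page index and an in-page offset via shift and mask, so each page stays safely int-indexable while the total size can exceed 2 GB.

```java
// Sketch: address more than Integer.MAX_VALUE bytes by splitting a long
// position into (page, offset). Each page is a small int-indexed byte[].
public class PagedBytes {
  private static final int PAGE_BITS = 15;            // 32 KB pages
  private static final int PAGE_SIZE = 1 << PAGE_BITS;
  private static final int PAGE_MASK = PAGE_SIZE - 1;

  private byte[][] pages = new byte[1][PAGE_SIZE];

  public byte get(long pos) {
    return pages[(int) (pos >>> PAGE_BITS)][(int) (pos & PAGE_MASK)];
  }

  public void set(long pos, byte b) {
    int page = (int) (pos >>> PAGE_BITS);
    if (page >= pages.length) {
      // grow the page table; only the table of references is copied,
      // never the payload bytes themselves
      byte[][] newPages = new byte[page + 1][];
      System.arraycopy(pages, 0, newPages, 0, pages.length);
      for (int i = pages.length; i <= page; i++) {
        newPages[i] = new byte[PAGE_SIZE];
      }
      pages = newPages;
    }
    pages[page][(int) (pos & PAGE_MASK)] = b;
  }
}
```

With a power-of-two page size the shift/mask per access is cheap, which is consistent with the small (rather than dramatic) lookup slowdown measured above; node references also have to become vLong instead of vInt once positions can exceed the int range.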