[
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434935#comment-15434935
]
Ferenczi Jim edited comment on LUCENE-7423 at 8/25/16 9:05 AM:
---------------------------------------------------------------
(Edited: the autoprefix results were wrong because of a bug in the code that
generates the prefixes.)
I've added a small benchmark, AutoPrefixPerf.java (adapted from [~mikemccand]'s
utils).
For the benchmark I indexed the English Wikipedia titles with a standard analyzer:
{panel:title=Standard analyzer|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
A single field in this test:
* "field": standard analyzer
{noformat}
Indexed 12600000: 33.756 sec
Final Indexed 12696047: 33.9 sec
Optimize...
After force merge: 37.794 sec
Close...
After close: 37.798 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0
id=ex11gzoft89z21le5c93bpett
1 of 1: name=_j maxDoc=12696047
version=7.0.0
id=ex11gzoft89z21le5c93bpets
codec=Lucene62
compound=false
numFiles=7
size (MB)=78.562
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation,
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0,
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03,
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472043738648}
no deletions
test: open reader.........OK [took 0.002 sec]
test: check integrity.....OK [took 0.046 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [1 fields] [took 0.000 sec]
test: field norms.........OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...OK [2513966 terms; 34713220 terms/docs pairs; 0
tokens] [took 2.321 sec]
field "field":
index FST:
699982 bytes
terms:
2513966 terms
20843092 bytes (8.3 bytes/term)
blocks:
80953 blocks
59384 terms-only blocks
10 sub-block-only blocks
21559 mixed blocks
18273 floor blocks
25611 non-floor blocks
55342 floor sub-blocks
13294379 term suffix bytes (164.2 suffix-bytes/block)
2538232 term stats bytes (31.4 stats-bytes/block)
8829391 other bytes (109.1 other-bytes/block)
by prefix length:
0: 5
1: 421
2: 5620
3: 18794
4: 31598
5: 16630
6: 5322
7: 1709
8: 443
9: 138
10: 249
11: 14
12: 2
13: 6
14: 2
test: stored fields.......OK [0 total field count; avg 0.0 fields per doc]
[took 0.257 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq
vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
test: points..............OK [0 fields, 0 points] [took 0.000 sec]
detailed segment RAM usage:
_j(7.0.0):C12696047: 741.9 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 683.8 KB
|-- format 'Lucene50_0'
[BlockTreeTermsReader(fields=1,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
683.8 KB
|-- field 'field'
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]:
683.7 KB
|-- term index
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
|-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]:
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]:
58.1 KB
|-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]:
58.1 KB
|-- doc base deltas: 29.1 KB
|-- start pointer deltas: 26.6 KB
No problems were detected with this index.
{noformat}
{panel}
{panel:title=EdgeNgram analyzer min=2 max=5|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
Two fields for this test:
* "field": standard analyzer
* "field-edge": edge ngram analyzer (min=2, max=5) on top of a standard analyzer.
{noformat}
Indexed 12600000: 70.831 sec
Final Indexed 12696047: 71.484 sec
Optimize...
After force merge: 80.344 sec
Close...
After close: 80.347 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0
id=8bm8xy2peb5wo3td0ptgwv036
1 of 1: name=_19 maxDoc=12696047
version=7.0.0
id=8bm8xy2peb5wo3td0ptgwv035
codec=Lucene62
compound=false
numFiles=7
size (MB)=224.803
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation,
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0,
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03,
source=merge, mergeFactor=15, os.version=10.11.4, timestamp=1472044255056}
no deletions
test: open reader.........OK [took 0.002 sec]
test: check integrity.....OK [took 0.130 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [2 fields] [took 0.000 sec]
test: field norms.........OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...OK [3459987 terms; 155467747 terms/docs pairs; 0
tokens] [took 3.736 sec]
field "field":
index FST:
699967 bytes
terms:
2513966 terms
20843092 bytes (8.3 bytes/term)
blocks:
80953 blocks
59384 terms-only blocks
10 sub-block-only blocks
21559 mixed blocks
18273 floor blocks
25611 non-floor blocks
55342 floor sub-blocks
13294377 term suffix bytes (164.2 suffix-bytes/block)
2538232 term stats bytes (31.4 stats-bytes/block)
8836971 other bytes (109.2 other-bytes/block)
by prefix length:
0: 5
1: 421
2: 5620
3: 18794
4: 31598
5: 16630
6: 5322
7: 1709
8: 443
9: 138
10: 249
11: 14
12: 2
13: 6
14: 2
field "field-edge":
index FST:
265903 bytes
terms:
946021 terms
4693480 bytes (5.0 bytes/term)
blocks:
30830 blocks
26448 terms-only blocks
16 sub-block-only blocks
4366 mixed blocks
6054 floor blocks
5852 non-floor blocks
24978 floor sub-blocks
2954296 term suffix bytes (95.8 suffix-bytes/block)
990273 term stats bytes (32.1 stats-bytes/block)
2750060 other bytes (89.2 other-bytes/block)
by prefix length:
0: 5
1: 313
2: 6051
3: 21746
4: 2272
5: 396
6: 28
7: 16
8: 3
test: stored fields.......OK [0 total field count; avg 0.0 fields per doc]
[took 0.319 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq
vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
test: points..............OK [0 fields, 0 points] [took 0.000 sec]
detailed segment RAM usage:
_19(7.0.0):C12696047: 1 MB
|-- postings [PerFieldPostings(segment=_19 formats=1)]: 943.6 KB
|-- format 'Lucene50_0'
[BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
943.6 KB
|-- field 'field'
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]:
683.7 KB
|-- term index
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
|-- field 'field-edge'
[BlockTreeTerms(terms=946021,postings=120754527,positions=-1,docs=12645321)]:
259.8 KB
|-- term index
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 259.7 KB
|-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]:
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]:
95.2 KB
|-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]:
95.2 KB
|-- doc base deltas: 47.5 KB
|-- start pointer deltas: 45.3 KB
No problems were detected with this index.
Took 4.209 sec total.
Total index size: 235722542 bytes
{noformat}
{panel}
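To make the "field-edge" setup concrete, here is a minimal plain-Java sketch of what an edge ngram with min=2 and max=5 emits for a single token. It does not use Lucene and is not taken from AutoPrefixPerf.java; the class and method names are hypothetical, and it works on Java chars rather than on analyzed term bytes.

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNgramSketch {

    // Returns the leading substrings of `token` whose length is between
    // minGram and maxGram (inclusive). Tokens shorter than minGram yield nothing.
    public static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // One token produces up to (maxGram - minGram + 1) indexed terms.
        System.out.println(edgeNgrams("lucene", 2, 5));
    }
}
```

Every token contributes up to four extra terms with these settings, which is consistent with the much larger number of terms/docs pairs reported for the edge ngram index above.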
For the results of the AutoPrefix PostingsFormat please check the next comment.
was (Author: jim.ferenczi):
Another iteration. I fixed the prefix selection (the term "aa" should not
increment the number of terms accounted for the term "a"). This reduces the
index size greatly.
{panel:title=AutoPrefix minPrefixTerms=2|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
Two indexed fields:
* "field": standard analyzer
* "field-autoprefix": the auto-prefix terms of the field "field", with
minPrefixTerms set to 2.
{noformat}
Indexed 12600000: 52.49 sec
Final Indexed 12696047: 52.717 sec
Optimize...
After force merge: 68.699 sec
Close...
After close: 68.704 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0
id=1gb0m3msddxzckhpfj9lzsneq
1 of 1: name=_j maxDoc=12696047
version=7.0.0
id=1gb0m3msddxzckhpfj9lzsnep
codec=Lucene62
compound=false
numFiles=7
size (MB)=120.032
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation,
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0,
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03,
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472044414055}
no deletions
test: open reader.........OK [took 0.002 sec]
test: check integrity.....OK [took 0.067 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [2 fields] [took 0.000 sec]
test: field norms.........OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...OK [3034551 terms; 60351742 terms/docs pairs; 0
tokens] [took 2.566 sec]
field "field-autoprefix":
index FST:
152510 bytes
terms:
520585 terms
3436438 bytes (6.6 bytes/term)
blocks:
16779 blocks
12264 terms-only blocks
1 sub-block-only blocks
4514 mixed blocks
3880 floor blocks
5187 non-floor blocks
11592 floor sub-blocks
2140329 term suffix bytes (127.6 suffix-bytes/block)
539804 term stats bytes (32.2 stats-bytes/block)
729244 other bytes (43.5 other-bytes/block)
by prefix length:
0: 9
1: 286
2: 1746
3: 6942
4: 5237
5: 1722
6: 577
7: 191
8: 31
9: 18
10: 19
11: 1
field "field":
index FST:
699987 bytes
terms:
2513966 terms
20843092 bytes (8.3 bytes/term)
blocks:
80953 blocks
59384 terms-only blocks
10 sub-block-only blocks
21559 mixed blocks
18273 floor blocks
25611 non-floor blocks
55342 floor sub-blocks
13294384 term suffix bytes (164.2 suffix-bytes/block)
2538232 term stats bytes (31.4 stats-bytes/block)
8847612 other bytes (109.3 other-bytes/block)
by prefix length:
0: 5
1: 421
2: 5620
3: 18794
4: 31598
5: 16630
6: 5322
7: 1709
8: 443
9: 138
10: 249
11: 14
12: 2
13: 6
14: 2
test: stored fields.......OK [0 total field count; avg 0.0 fields per doc]
[took 0.281 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq
vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
test: points..............OK [0 fields, 0 points] [took 0.000 sec]
detailed segment RAM usage:
_j(7.0.0):C12696047: 894.8 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 832.9 KB
|-- format 'AutoPrefix_0'
[BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
832.9 KB
|-- field 'field'
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]:
683.7 KB
|-- term index
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
|-- field 'field-autoprefix'
[BlockTreeTerms(terms=520585,postings=25638522,positions=-1,docs=9493306)]:
149.1 KB
|-- term index
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 148.9 KB
|-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]:
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]:
61.9 KB
|-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]:
61.9 KB
|-- doc base deltas: 30.5 KB
|-- start pointer deltas: 29.1 KB
No problems were detected with this index.
Took 2.933 sec total.
Total index size: 125862986 bytes
{noformat}
{panel}
In this benchmark the autoprefix format performs better than the 2-5 edge ngram
solution. It produces 520,585 terms, about 45% fewer than the 2-5 edge ngram
field (946,021 terms), is faster to index (52.717 sec vs 71.484 sec before the
force merge) and produces a smaller index (120 MB vs 225 MB).
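For anyone who wants to check the comparison, the ratios can be recomputed from the figures reported in the CheckIndex output above. This is only a small arithmetic sketch; the class and constant names are made up for illustration.

```java
public class BenchmarkRatios {

    // Figures copied from the panels above (edge ngram 2-5 vs autoprefix).
    public static final double EDGE_TERMS = 946_021;        // terms in "field-edge"
    public static final double AUTOPREFIX_TERMS = 520_585;  // terms in "field-autoprefix"
    public static final double EDGE_INDEX_SEC = 71.484;     // "Final Indexed" time
    public static final double AUTOPREFIX_INDEX_SEC = 52.717;
    public static final double EDGE_SIZE_MB = 224.803;      // "size (MB)" of the segment
    public static final double AUTOPREFIX_SIZE_MB = 120.032;

    // How many times larger the edge ngram figure is than the autoprefix one.
    public static double ratio(double edgeNgram, double autoPrefix) {
        return edgeNgram / autoPrefix;
    }

    public static void main(String[] args) {
        System.out.printf("terms:      %.2fx%n", ratio(EDGE_TERMS, AUTOPREFIX_TERMS));
        System.out.printf("index time: %.2fx%n", ratio(EDGE_INDEX_SEC, AUTOPREFIX_INDEX_SEC));
        System.out.printf("index size: %.2fx%n", ratio(EDGE_SIZE_MB, AUTOPREFIX_SIZE_MB));
    }
}
```

The ratios come out at roughly 1.8x for terms, 1.4x for indexing time and 1.9x for index size. Note that both index sizes include the regular "field" as well, so the size ratio between the two extra fields alone is larger.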
> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on
> text fields.
> ---------------------------------------------------------------------------------------
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/sandbox
> Reporter: Ferenczi Jim
> Priority: Minor
> Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used for efficient range queries, but the
> replacement for prefix queries on strings is unclear. Edge ngrams could be
> used instead, but they have a lot of drawbacks and are hard to configure
> correctly. The completion postings format is also a good replacement, but it
> requires keeping a big FST in RAM and it cannot be intersected with other
> fields.
> This patch is a proposal for a new PostingsFormat optimized for prefix queries
> on string fields. It detects prefixes that match "enough" terms and writes
> auto-prefix terms into their own virtual field.
> At search time the virtual field is used to speed up prefix queries that
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted,
> the prefixes are flushed on the fly depending on the input. For each prefix
> we build its corresponding inverted list using a DocIdSetBuilder. The first
> pass visits each term of the field's TermsEnum only once. When a prefix is
> flushed from the prefix tree, its inverted list is dumped into a temporary
> file for further use. This is necessary since the prefixes are not sorted
> when they are removed from the tree. The selected auto-prefixes are sorted at
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes, and the temporary file is
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization, but the first
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with
> the KeywordAnalyzer and compared the index/merge time and the size of the
> indices.
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes
> 572M on disk and it took 130s to index and optimize the 11M titles.
> The auto prefix index takes 287M on disk and took 70s to index and optimize
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix
> fields and the rest is for the regular keyword field. All the auto prefixes
> were generated for this test (at least 2 terms per auto-prefix).
> The queries have similar performance since we are sure on both sides that one
> inverted list can answer any prefix query.
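The prefix selection described in the quoted issue can be sketched roughly as follows. This is a simplified illustration, not the code from the patch: it only selects the prefixes and leaves out the inverted lists, the DocIdSetBuilder and the temporary file. It assumes the input terms are distinct and already sorted, that a term equal to a prefix counts as a match for it, and that comparing Java chars is good enough (Lucene sorts terms by bytes). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AutoPrefixSketch {

    // Length of the longest common prefix of a and b.
    static int commonPrefixLength(String a, String b) {
        int max = Math.min(a.length(), b.length());
        int i = 0;
        while (i < max && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Single scan over sorted, distinct terms. counts.get(i) holds the number
    // of terms seen so far that start with the first i+1 chars of `previous`.
    public static List<String> selectPrefixes(List<String> sortedTerms, int minPrefixTerms) {
        List<String> flushed = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int common = commonPrefixLength(previous, term);
            // Prefixes of `previous` longer than the shared part cannot match
            // any later term, so they are flushed now (deepest first).
            for (int len = counts.size(); len > common; len--) {
                if (counts.get(len - 1) >= minPrefixTerms) {
                    flushed.add(previous.substring(0, len));
                }
                counts.remove(len - 1);
            }
            // The shared prefixes match one more term.
            for (int i = 0; i < common; i++) {
                counts.set(i, counts.get(i) + 1);
            }
            // Prefixes that are new with this term start with one match.
            for (int len = common + 1; len <= term.length(); len++) {
                counts.add(1);
            }
            previous = term;
        }
        // Flush whatever is still open after the last term.
        for (int len = counts.size(); len > 0; len--) {
            if (counts.get(len - 1) >= minPrefixTerms) {
                flushed.add(previous.substring(0, len));
            }
        }
        // Flushed prefixes come out unsorted, hence the sort at the end.
        Collections.sort(flushed);
        return flushed;
    }

    public static void main(String[] args) {
        List<String> terms = java.util.Arrays.asList("lucene", "lucid", "luck", "solr");
        System.out.println(selectPrefixes(terms, 2));
    }
}
```

With the four terms in `main` and a threshold of 2, the prefixes are flushed in the order "luc", "lu", "l" once "solr" is reached. That reversed order is the reason a sort is needed before the prefixes can be written as terms.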