[jira] [Comment Edited] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

Ferenczi Jim (JIRA) Thu, 25 Aug 2016 02:05:50 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434935#comment-15434935
 ]


Ferenczi Jim edited comment on LUCENE-7423 at 8/25/16 9:05 AM:
---------------------------------------------------------------

(Edited: the autoprefix results were wrong because of a bug in the code that 
generates the prefixes.)

I've added a small benchmark, AutoPrefixPerf.java (modified from [~mikemccand]'s 
utils).

For the benchmark I used the English Wikipedia titles and a standard analyzer:

{panel:title=Standard 
analyzer|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
A single field in this test:
* "field": standard analyzer 

{noformat}
Indexed 12600000: 33.756 sec
Final Indexed 12696047: 33.9 sec
Optimize...
After force merge: 37.794 sec
Close...
After close: 37.798 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=ex11gzoft89z21le5c93bpett
  1 of 1: name=_j maxDoc=12696047
    version=7.0.0
    id=ex11gzoft89z21le5c93bpets
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=78.562
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472043738648}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.046 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [1 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [2513966 terms; 34713220 terms/docs pairs; 0 
tokens] [took 2.321 sec]
      field "field":
        index FST:
          699982 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294379 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8829391 other bytes (109.1 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] 
[took 0.257 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 741.9 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 683.8 KB
    |-- format 'Lucene50_0' 
[BlockTreeTermsReader(fields=1,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 683.8 KB
        |-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
58.1 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
58.1 KB
        |-- doc base deltas: 29.1 KB
        |-- start pointer deltas: 26.6 KB

No problems were detected with this index.
{noformat}
{panel}

{panel:title=EdgeNgram analyzer  min=2 max=5 
|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}

Two fields for this test:
* "field": standard analyzer
* "field-edge": edge ngram analyzer (min=2, max=5) on top of a standard analyzer.

{noformat}
Indexed 12600000: 70.831 sec
Final Indexed 12696047: 71.484 sec
Optimize...
After force merge: 80.344 sec
Close...
After close: 80.347 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=8bm8xy2peb5wo3td0ptgwv036
  1 of 1: name=_19 maxDoc=12696047
    version=7.0.0
    id=8bm8xy2peb5wo3td0ptgwv035
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=224.803
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=15, os.version=10.11.4, timestamp=1472044255056}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.130 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [3459987 terms; 155467747 terms/docs pairs; 0 
tokens] [took 3.736 sec]
      field "field":
        index FST:
          699967 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294377 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8836971 other bytes (109.2 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
      field "field-edge":
        index FST:
          265903 bytes
        terms:
          946021 terms
          4693480 bytes (5.0 bytes/term)
        blocks:
          30830 blocks
          26448 terms-only blocks
          16 sub-block-only blocks
          4366 mixed blocks
          6054 floor blocks
          5852 non-floor blocks
          24978 floor sub-blocks
          2954296 term suffix bytes (95.8 suffix-bytes/block)
          990273 term stats bytes (32.1 stats-bytes/block)
          2750060 other bytes (89.2 other-bytes/block)
          by prefix length:
             0: 5
             1: 313
             2: 6051
             3: 21746
             4: 2272
             5: 396
             6: 28
             7: 16
             8: 3
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] 
[took 0.319 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_19(7.0.0):C12696047: 1 MB
|-- postings [PerFieldPostings(segment=_19 formats=1)]: 943.6 KB
    |-- format 'Lucene50_0' 
[BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 943.6 KB
        |-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- field 'field-edge' 
[BlockTreeTerms(terms=946021,postings=120754527,positions=-1,docs=12645321)]: 
259.8 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 259.7 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
95.2 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
95.2 KB
        |-- doc base deltas: 47.5 KB
        |-- start pointer deltas: 45.3 KB

No problems were detected with this index.

Took 4.209 sec total.


Total index size: 235722542 bytes
{noformat}
{panel}
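
To make the "field-edge" setup concrete, here is a minimal sketch of what an edge ngram filter with min=2 and max=5 emits for a single token. The function name `edge_ngrams` and the sample tokens are hypothetical, and the sketch assumes tokens shorter than the minimum are dropped; it is not the Lucene implementation.

```python
def edge_ngrams(token, min_gram=2, max_gram=5):
    """Return the leading substrings of `token` with lengths min_gram..max_gram.

    Assumption for this sketch: tokens shorter than min_gram produce nothing.
    """
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical tokens, as a standard analyzer might produce from a title.
for tok in ["lucene", "of", "a"]:
    print(tok, edge_ngrams(tok))
```

Every token of five or more characters contributes four extra postings, which is consistent with the much larger number of terms/docs pairs reported for "field-edge" above, and a prefix query longer than five characters cannot be answered from this field alone.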

For the results of the AutoPrefix PostingsFormat please check the next comment.


was (Author: jim.ferenczi):
Another iteration. I fixed the prefix selection (the term "aa" should not 
increment the number of terms accounted for the term "a"). This reduces the 
index size greatly.
I've added a small benchmark, AutoPrefixPerf.java (modified from [~mikemccand]'s 
utils).

For the benchmark I used the English Wikipedia titles and a standard analyzer:


{panel:title=AutoPrefix 
minPrefixTerms=2|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
Two indexed fields:
* "field": standard analyzer
* "field-autoprefix": the autoprefix of the field "field" with a minPrefixTerms 
set to 2.
{noformat}
Indexed 12600000: 52.49 sec
Final Indexed 12696047: 52.717 sec
Optimize...
After force merge: 68.699 sec
Close...
After close: 68.704 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=1gb0m3msddxzckhpfj9lzsneq
  1 of 1: name=_j maxDoc=12696047
    version=7.0.0
    id=1gb0m3msddxzckhpfj9lzsnep
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=120.032
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472044414055}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.067 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [3034551 terms; 60351742 terms/docs pairs; 0 
tokens] [took 2.566 sec]
      field "field-autoprefix":
        index FST:
          152510 bytes
        terms:
          520585 terms
          3436438 bytes (6.6 bytes/term)
        blocks:
          16779 blocks
          12264 terms-only blocks
          1 sub-block-only blocks
          4514 mixed blocks
          3880 floor blocks
          5187 non-floor blocks
          11592 floor sub-blocks
          2140329 term suffix bytes (127.6 suffix-bytes/block)
          539804 term stats bytes (32.2 stats-bytes/block)
          729244 other bytes (43.5 other-bytes/block)
          by prefix length:
             0: 9
             1: 286
             2: 1746
             3: 6942
             4: 5237
             5: 1722
             6: 577
             7: 191
             8: 31
             9: 18
            10: 19
            11: 1
      
      field "field":
        index FST:
          699987 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294384 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8847612 other bytes (109.3 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] 
[took 0.281 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 894.8 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 832.9 KB
    |-- format 'AutoPrefix_0' 
[BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 832.9 KB
        |-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- field 'field-autoprefix' 
[BlockTreeTerms(terms=520585,postings=25638522,positions=-1,docs=9493306)]: 
149.1 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 148.9 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
61.9 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
61.9 KB
        |-- doc base deltas: 30.5 KB
        |-- start pointer deltas: 29.1 KB

No problems were detected with this index.

Took 2.933 sec total.


Total index size: 125862986 bytes

{noformat}
{panel}

The autoprefix format performs better than the 2-5 edge ngram solution. 
It produces 520,585 terms, about half as many as the 2-5 edge ngram field 
(946,021 terms), it is faster to build (52.717 sec vs 71.484 sec), and the 
index is smaller (120M vs 225M).
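
As a quick check of the comparison, the ratios can be computed directly from the figures reported in the logs above. This is only arithmetic on the logged numbers; the dictionary names are made up for the example, and the sizes are whole-index sizes, so both include the regular "field".

```python
# Figures copied from the CheckIndex output above.
edge = {"terms": 946_021, "build_sec": 71.484, "size_mb": 224.803}
auto = {"terms": 520_585, "build_sec": 52.717, "size_mb": 120.032}

# Ratio of autoprefix to edge ngram for each metric (lower is better).
ratios = {key: auto[key] / edge[key] for key in edge}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}")
```

The ratios come out at roughly 0.55 for the number of prefix terms, 0.74 for the indexing time and 0.53 for the index size.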


> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7423
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7423
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/sandbox
>            Reporter: Ferenczi Jim
>            Priority: Minor
>         Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
> At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted list using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.
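
The two passes described in the issue can be illustrated with a small in-memory sketch. It is a simplification under stated assumptions, not the code from the patch: a plain stack stands in for the compact prefix tree, Python sets stand in for DocIdSetBuilder, a list stands in for the temporary file, and the names `build_auto_prefixes` and `min_prefix_terms` are hypothetical. Terms are assumed to be unique and already sorted.

```python
from os.path import commonprefix


def build_auto_prefixes(sorted_postings, min_prefix_terms=2):
    """sorted_postings: list of (term, doc_ids), sorted by unique term.

    Returns a sorted list of (prefix, sorted doc ids) for every prefix that
    matches at least `min_prefix_terms` terms.
    """
    open_prefixes = []  # stack of [prefix, term_count, doc_id_set]
    flushed = []        # stand-in for the temporary file, in flush order
    previous = ""

    def flush(keep):
        # Pop prefixes that no longer match the current term.
        while len(open_prefixes) > keep:
            prefix, count, docs = open_prefixes.pop()
            if count >= min_prefix_terms:
                flushed.append((prefix, sorted(docs)))

    # First pass: visit each term once, flushing prefixes on the fly.
    for term, doc_ids in sorted_postings:
        shared = len(commonprefix([previous, term]))
        flush(shared)
        for entry in open_prefixes:          # prefixes still matching
            entry[1] += 1
            entry[2].update(doc_ids)
        for length in range(shared + 1, len(term) + 1):
            open_prefixes.append([term[:length], 1, set(doc_ids)])
        previous = term
    flush(0)

    # Flush order is not sorted, so sort before the second (sorted) pass.
    return sorted(flushed)


postings = [("lucene", [1]), ("lucid", [2]), ("luck", [2, 3]), ("solr", [4])]
auto_prefixes = build_auto_prefixes(postings)
```

In this example the prefixes are flushed in the order "luc", "lu", "l", which is why a sort is needed before they can be written as terms. The sketch keeps "l" and "lu" even though they match exactly the same terms as "luc"; a compact prefix tree may avoid that redundancy.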


