[ 
https://issues.apache.org/jira/browse/LUCENE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426601#comment-15426601
 ] 

Ferenczi Jim commented on LUCENE-7317:
--------------------------------------

I wanted to see what we're losing with the removal of AutoPrefix, so I ran 
a small test with English Wikipedia titles.

I indexed the 12M titles in three indices:
* *default*: keyword analyzer and the default postings format
* *auto_prefix*: keyword analyzer and the AutoPrefixPostings format with 
minAutoPrefix=24, maxAutoPrefix=Integer.MAX_VALUE
* *edge*: edge ngram analyzer with minGram=1, maxGram=Integer.MAX_VALUE and the 
default postings format.
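
To make the *edge* configuration concrete, here is a minimal plain-Java sketch 
of the tokens an edge ngram analyzer with minGram=1 and an unbounded maxGram 
would emit for a single title. It is only an illustration: the class and 
method names are made up, it works on Java chars rather than code points, and 
it is not the Lucene filter itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not Lucene code.
class EdgeNGramSketch {

    // Returns every leading substring of the title whose length is at least
    // minGram, with no upper bound (the maxGram=Integer.MAX_VALUE case).
    static List<String> edgeNGrams(String title, int minGram) {
        List<String> grams = new ArrayList<>();
        for (int end = minGram; end <= title.length(); end++) {
            grams.add(title.substring(0, end));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [L, Lu, Luc, Luce, Lucen, Lucene]
        System.out.println(edgeNGrams("Lucene", 1));
    }
}
```

With these settings a title of n characters is indexed as n terms instead of 
one, which is where the extra index size of the *edge* index comes from.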

||index||default||auto_prefix||edge||
|size in MB|231|274|1600|

This table shows the size that each index takes on disk, in megabytes. As you 
can see, the auto_prefix index is very close in size to the default one even 
though we compute all the prefixes with more than 24 terms. Compared to the 
edge_ngram index, which multiplies the index size by a factor of about 7, the 
auto prefix seems to be a good trade-off for fields where prefix queries are 
the norm. I didn't compare query times, but any prefix with more than 24 terms 
could be resolved by a single inverted list in the auto_prefix index, so for 
those prefixes it should be equivalent to the edge_ngram index.

The downside of the auto_prefix index seems to be the merge: it takes more 
than 1 minute to optimize, which is 10 times slower than the default index, 
though this is expected since the default index uses a keyword analyzer.
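
As a rough illustration of the "prefix with more than 24 terms" idea, the 
sketch below counts how many terms share each prefix and keeps only the 
prefixes above a threshold. This is a deliberately naive, stdlib-only 
simplification under my own assumptions (unique terms, threshold meaning 
"more than minTerms"); the names are hypothetical and it is not how the 
AutoPrefix postings format is actually implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only, not Lucene code.
class AutoPrefixSketch {

    // Counts, for every prefix, how many terms start with it, then keeps only
    // the prefixes shared by more than minTerms terms. Assumes unique terms.
    static Map<String, Integer> frequentPrefixes(List<String> terms, int minTerms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            for (int end = 1; end <= term.length(); end++) {
                counts.merge(term.substring(0, end), 1, Integer::sum);
            }
        }
        // Drop prefixes that do not match enough terms to be worth a dedicated list.
        counts.values().removeIf(c -> c <= minTerms);
        return counts;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "apply", "apt", "bar");
        // prints {a=3, ap=3, app=2, appl=2}
        System.out.println(frequentPrefixes(terms, 1));
    }
}
```

Only the surviving prefixes would get their own inverted list in this toy 
model; rarer prefixes would still be resolved by visiting the matching terms 
one by one, as in the default index.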

I understand that the new points API is better for numeric prefix/range 
queries, but the auto prefix seems to be a good fit for string prefix queries. 
It saves a lot of space compared to edge ngram and indexing is faster. I 
am not saying we should restore the functionality inside the default 
BlockTreeTerms, but maybe we could create a separate postings format that 
exposes this feature?


> Remove auto prefix terms
> ------------------------
>
>                 Key: LUCENE-7317
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7317
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>             Fix For: master (7.0), 6.2
>
>         Attachments: LUCENE-7317.patch
>
>
> This was mostly superseded by the new points API so should we remove 
> auto-prefix terms?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
