[ https://issues.apache.org/jira/browse/LUCENE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426601#comment-15426601 ]
Ferenczi Jim commented on LUCENE-7317:
--------------------------------------

I wanted to see what we're losing with the removal of AutoPrefix, so I ran a small test with English Wikipedia titles. I indexed the 12M titles in three indices:

* *default*: keyword analyzer and the default postings format
* *auto_prefix*: keyword analyzer and the AutoPrefixPostings format with minAutoPrefix=24, maxAutoPrefix=Integer.MAX
* *edge*: edge ngram analyzer with minGram=1, maxGram=Integer.MAX and the default postings format

||index||default||auto_prefix||edge||
|size in MB|231|274|1600|

This table shows the size each index takes on disk, in megabytes. As you can see, the auto_prefix index is very close in size to the default one, even though we compute all the prefixes with more than 24 terms. Compared to the edge ngram index, which multiplies the index size by a factor of about 7, auto prefix seems to be a good trade-off for fields where prefix queries are the norm.

I didn't compare query time, but any prefix with more than 24 terms can be resolved by a single inverted list in the auto_prefix index, so for those prefixes it is equivalent to the edge ngram index.

The downside of auto_prefix seems to be the merge: it takes more than 1 minute to optimize, which is 10 times slower than the default index. This is expected, though, since the default index uses a keyword analyzer.

I understand that the new points API is better for numeric prefix/range queries, but auto prefix seems to be a good fit for prefix queries on strings. It saves a lot of space compared to edge ngram, and indexing is faster. I am not saying we should restore the functionality inside the default BlockTreeTerms, but maybe we could create a separate postings format that exposes this feature?
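To make the size difference above concrete, here is a toy sketch in Python. It is not how Lucene's postings formats work internally; it only counts how many distinct terms each approach would have to store for a handful of made-up titles. The function names, the sample titles and the threshold of 3 (standing in for minAutoPrefix=24) are all assumptions for illustration.

```python
from collections import Counter


def edge_ngrams(term, min_gram=1):
    """Every leading substring of `term`, as an edge ngram analyzer
    with an unbounded maxGram would emit."""
    return [term[:n] for n in range(min_gram, len(term) + 1)]


def frequent_prefixes(terms, min_terms):
    """Prefixes that are matched by at least `min_terms` distinct terms.
    A crude stand-in for the prefixes an auto-prefix scheme would keep."""
    counts = Counter()
    for term in set(terms):
        for n in range(1, len(term) + 1):
            counts[term[:n]] += 1
    return {prefix for prefix, count in counts.items() if count >= min_terms}


# Hypothetical sample data, not the Wikipedia titles from the test.
titles = ["lucene", "lucid", "lucky", "luck", "solr"]

# Edge ngram index: every prefix of every title becomes a term.
edge_terms = {gram for title in titles for gram in edge_ngrams(title)}

# Auto-prefix style index: the original terms plus only the shared prefixes.
auto_terms = set(titles) | frequent_prefixes(titles, min_terms=3)

print(len(titles), len(auto_terms), len(edge_terms))
```

The edge ngram approach stores every prefix, including the rare ones that a prefix query could cheaply resolve by visiting a few terms, while the thresholded approach only adds entries for prefixes that many terms share.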
> Remove auto prefix terms
> ------------------------
>
>                 Key: LUCENE-7317
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7317
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>             Fix For: master (7.0), 6.2
>
>         Attachments: LUCENE-7317.patch
>
>
> This was mostly superseded by the new points API so should we remove
> auto-prefix terms?