[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferenczi Jim closed LUCENE-7423. -------------------------------- Resolution: Not A Problem > AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on > text fields. > --------------------------------------------------------------------------------------- > > Key: LUCENE-7423 > URL: https://issues.apache.org/jira/browse/LUCENE-7423 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/sandbox > Reporter: Ferenczi Jim > Priority: Minor > Attachments: LUCENE-7423.patch > > > The autoprefix terms dict added in > https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with > https://issues.apache.org/jira/browse/LUCENE-7317. > The new points API is now used to do efficient range queries but the > replacement for prefix string queries is unclear. The edge ngrams could be > used instead but they have a lot of drawbacks and are hard to configure > correctly. The completion postings format is also a good replacement but it > requires to have a big FST in RAM and it cannot be intersected with other > fields. > This patch is a proposal for a new PostingsFormat optimized for prefix query > on string fields. It detects prefixes that match "enough" terms and writes > auto-prefix terms into their own virtual field. > At search time the virtual field is used to speed up prefix queries that > match "enough" terms. > The auto-prefix terms are built in two pass: > * The first pass builds a compact prefix tree. Since the terms enum is sorted > the prefixes are flushed on the fly depending on the input. For each prefix > we build its corresponding inverted lists using a DocIdSetBuilder. The first > pass visits each term of the field TermsEnum only once. When a prefix is > flushed from the prefix tree its inverted lists is dumped into a temporary > file for further use. This is necessary since the prefixes are not sorted > when they are removed from the tree. The selected auto prefixes are sorted at > the end of the first pass. > * The second pass is a sorted scan of the prefixes and the temporary file is > used to read the corresponding inverted lists. > The patch is just a POC and there are rooms for optimizations but the first > results are promising: > I tested the patch with the geonames dataset. I indexed all the titles with > the KeywordAnalyzer and compared the index/merge time and the size of the > indices. > The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes > 572M on disk and it took 130s to index and optimize the 11M titles. > The auto prefix index takes 287M on disk and took 70s to index and optimize > the same 11M titles. Among the 287M, only 170M are used for the auto prefix > fields and the rest is for the regular keyword field. All the auto prefixes > were generated for this test (at least 2 terms per auto-prefix). > The queries have similar performance since we are sure on both sides that one > inverted list can answer any prefix query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org