[jira] [Closed] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

Ferenczi Jim (JIRA) Thu, 25 Aug 2016 02:06:44 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ferenczi Jim closed LUCENE-7423.
--------------------------------
    Resolution: Not A Problem

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7423
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7423
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/sandbox
>            Reporter: Ferenczi Jim
>            Priority: Minor
>         Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix query 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two pass:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted lists using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted lists is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there are rooms for optimizations but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Closed] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

Reply via email to