[
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434056#comment-15434056
]
Robert Muir commented on LUCENE-7423:
-------------------------------------
i still am not sure about the benchmark. indexing prefixes of 2-20, come on, of
course that's gonna waste disk! What strings will you really be speeding up?
Something like 2-5, now we're talking. Of course, autoprefix has the advantage
that it can magically determine something better than "2-5", but still we know
a static case can be competitive, our numerics were very decent for years based
on those facts. I'd like to see more fair numbers, as the analysis chain can
handle this case, too, and its also more user-friendly.
My concern is the cost of maintenance here vs the actual tradeoff of
performance we get for a relatively obscure case (prefixes). we have to
maintain a large amount of code for a corner-case. Instead, If we want to spend
extra disk on accelerating wildcards, i would honestly rather it be helping the
true worst cases (e.g. suffix trie or similar datastructures).
> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on
> text fields.
> ---------------------------------------------------------------------------------------
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/sandbox
> Reporter: Ferenczi Jim
> Priority: Minor
> Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the
> replacement for prefix string queries is unclear. The edge ngrams could be
> used instead but they have a lot of drawbacks and are hard to configure
> correctly. The completion postings format is also a good replacement but it
> requires to have a big FST in RAM and it cannot be intersected with other
> fields.
> This patch is a proposal for a new PostingsFormat optimized for prefix query
> on string fields. It detects prefixes that match "enough" terms and writes
> auto-prefix terms into their own virtual field.
> At search time the virtual field is used to speed up prefix queries that
> match "enough" terms.
> The auto-prefix terms are built in two pass:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted
> the prefixes are flushed on the fly depending on the input. For each prefix
> we build its corresponding inverted lists using a DocIdSetBuilder. The first
> pass visits each term of the field TermsEnum only once. When a prefix is
> flushed from the prefix tree its inverted lists is dumped into a temporary
> file for further use. This is necessary since the prefixes are not sorted
> when they are removed from the tree. The selected auto prefixes are sorted at
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is
> used to read the corresponding inverted lists.
> The patch is just a POC and there are rooms for optimizations but the first
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with
> the KeywordAnalyzer and compared the index/merge time and the size of the
> indices.
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes
> 572M on disk and it took 130s to index and optimize the 11M titles.
> The auto prefix index takes 287M on disk and took 70s to index and optimize
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix
> fields and the rest is for the regular keyword field. All the auto prefixes
> were generated for this test (at least 2 terms per auto-prefix).
> The queries have similar performance since we are sure on both sides that one
> inverted list can answer any prefix query.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]