[jira] [Commented] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Fri, 20 Mar 2015 13:53:09 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372077#comment-14372077
 ]


Michael McCandless commented on LUCENE-5879:
--------------------------------------------

{quote}
We cannot continue writing code in this way.

Please let intersect take care of how to intersect and get this shit out of the 
Query. The default Terms.intersect() method can specialize the PREFIX case with 
a PrefixTermsEnum if it is faster.
{quote}

Can you maybe be more specific?  I'm having trouble following exactly what 
you're objecting to.

The default Terms.intersect impl already specializes to PrefixTermsEnum in the 
patch.
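
To make the kind of specialization under discussion concrete, here is a minimal sketch using only the JDK. It is not Lucene code and not what the patch does: IntersectSketch, Matcher and Type are hypothetical names, and a sorted set of strings stands in for the terms dict. The prefix case seeks to the first candidate and stops at the first term that no longer matches, while the general case has to test every term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.function.Predicate;

// Toy stand-in for a terms dictionary; all names here are hypothetical, not Lucene APIs.
final class IntersectSketch {
    // Loosely mirrors the idea of a PREFIX type next to the general case.
    enum Type { PREFIX, NORMAL }

    static final class Matcher {
        final Type type;
        final String prefix;             // only set when type == PREFIX
        final Predicate<String> accepts; // general acceptance test

        private Matcher(Type type, String prefix, Predicate<String> accepts) {
            this.type = type;
            this.prefix = prefix;
            this.accepts = accepts;
        }

        static Matcher prefix(String p) {
            return new Matcher(Type.PREFIX, p, t -> t.startsWith(p));
        }

        static Matcher general(Predicate<String> accepts) {
            return new Matcher(Type.NORMAL, null, accepts);
        }
    }

    // "Default intersect": specialize the prefix case, otherwise scan everything.
    static List<String> intersect(NavigableSet<String> terms, Matcher m) {
        List<String> hits = new ArrayList<>();
        if (m.type == Type.PREFIX) {
            // Terms sharing a prefix are contiguous in sorted order:
            // seek to the prefix, then stop at the first non-matching term.
            for (String t : terms.tailSet(m.prefix, true)) {
                if (!t.startsWith(m.prefix)) {
                    break;
                }
                hits.add(t);
            }
        } else {
            // General case: every term has to be tested.
            for (String t : terms) {
                if (m.accepts.test(t)) {
                    hits.add(t);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
            new TreeSet<>(Arrays.asList("bar", "foo", "foobar", "food", "fox"));
        System.out.println(intersect(terms, Matcher.prefix("foo")));
        System.out.println(intersect(terms, Matcher.general(t -> t.endsWith("r"))));
    }
}
```

In this toy the caller only ever hands over a Matcher; whether the prefix fast path is taken is decided inside intersect, which is the separation of concerns being argued about.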

You don't want the added ctor that takes a prefix term in CompiledAutomaton 
(CA), but you are OK with PREFIX/RANGE in CA.AUTOMATON_TYPE?

If I 1) remove the added ctor that takes the prefix term in CA, and 2) fix 
PrefixQuery to subclass AutomatonQuery (meaning CA must "autodetect" when it 
receives a prefix automaton), would that address your concerns?  Or something 
else...?

I still wonder if just using AutomatonTermsEnum for prefix/range will be fine.  
Then we need neither PREFIX nor RANGE in CA.AUTOMATON_TYPE.

I'll open a separate issue for this...

> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
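
The density-based idea in the issue description above can be illustrated with a small sketch. This is a toy under stated assumptions, not the block tree algorithm: AutoPrefixSketch, densePrefixes, prefixLen and minTerms are hypothetical names, terms are plain strings, and only a single fixed prefix length is considered. A prefix is kept only when enough terms actually share it, instead of indexing every prefix unconditionally as a static precisionStep does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only; names and thresholds are hypothetical, not Lucene's.
final class AutoPrefixSketch {
    // Return, in sorted order, each prefix of length prefixLen that is shared
    // by at least minTerms terms. Sparse regions of term space get no prefix term.
    static List<String> densePrefixes(List<String> terms, int prefixLen, int minTerms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : terms) {
            // Only count terms strictly longer than the prefix itself.
            if (t.length() > prefixLen) {
                counts.merge(t.substring(0, prefixLen), 1, Integer::sum);
            }
        }
        List<String> prefixes = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTerms) {
                prefixes.add(e.getKey());
            }
        }
        return prefixes;
    }

    public static void main(String[] args) {
        List<String> terms =
            Arrays.asList("1001", "1002", "1003", "1004", "2001", "3001");
        // Only the "10" region is dense enough to earn a prefix term here;
        // a static scheme would also have indexed "20" and "30".
        System.out.println(densePrefixes(terms, 2, 3));
    }
}
```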



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

