[jira] [Commented] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Fri, 19 Sep 2014 12:04:04 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141096#comment-14141096
 ]


Michael McCandless commented on LUCENE-5879:
--------------------------------------------

bq. Wow, awesome work Mike! And fantastic idea Adrien!

Thanks [~dsmiley]

bq. I mean, are the intervals that are computed from the data determined and 
fixed within a given segment, or is it variable throughout the segment?

It's per-segment: each segment looks at how its own terms are distributed and 
finds "good" places to insert the auto-prefix terms.
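
As a rough illustration of the per-segment idea (not the code in the patch), 
here is a minimal sketch that counts how many of a segment's terms share each 
fixed-length prefix and keeps the prefixes that are dense enough.  The class 
name, the fixed prefixLen and the minTerms threshold are all hypothetical; the 
real terms dict makes this decision while it builds its blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Lucene code: pick "dense" prefixes for one segment.
final class AutoPrefixSketch {

    /**
     * Returns, in sorted order, every prefix of length prefixLen that is
     * shared by at least minTerms of the given terms.
     */
    static List<String> choosePrefixes(List<String> terms, int prefixLen, int minTerms) {
        // Count how many terms fall under each candidate prefix.
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            if (term.length() >= prefixLen) {
                counts.merge(term.substring(0, prefixLen), 1, Integer::sum);
            }
        }
        // Keep only the prefixes that cover enough terms to be worth indexing.
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTerms) {
                chosen.add(e.getKey());
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Two segments with different term distributions get different prefixes.
        List<String> segmentA = Arrays.asList("bar1", "foo1", "foo2", "foo3");
        List<String> segmentB = Arrays.asList("bar1", "bar2", "bar3", "foo1");
        System.out.println(choosePrefixes(segmentA, 3, 3));
        System.out.println(choosePrefixes(segmentB, 3, 3));
    }
}
```

The point of the sketch is only that the choice depends on the terms actually 
present in the segment, so two segments of the same field can end up with 
different auto-prefix terms.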

bq. Is this applicable to variable-length String fields that you might want to 
do range queries on for whatever reason? Such as... A*, B*, C* or A-G, H-P, ... 
etc. ? It appears this is applicable.

I don't quite understand the question ... the indexed terms can be of any 
length.

bq. Would any CompiledAutomaton (e.g. a wildcard query) that has a leading 
prefix benefit from this or is it strictly Prefix & Range queries? Mike's 
comments suggest it will sometime but not yet. Can you create an issue for it, 
Mike? This would be especially useful in Lucene-spatial; I'm excited at the 
prospects!

Currently auto-prefix terms are only used for PrefixQuery and TermRangeQuery, 
or for any automaton query that "becomes" a PrefixQuery on rewrite (e.g. 
WildcardQuery("foo*")).

Enabling them for WildcardQuery and RegexpQuery should be fairly easy, but they 
will only kick in for somewhat exotic cases, where some portion of the term 
space accepted by the automaton "suddenly" accepts any suffix.  E.g. foo*bar 
will never use auto-prefix terms, but foo?b* will.

I'll open an issue!

bq. When you iterate a TermsEnum, will the prefix terms be exposed or is it 
internal to the Codec?

No, these auto-prefix terms are invisible in all APIs, except when you call 
Terms.intersect.

> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up both term range queries (e.g. infix
> suggester) and numeric range queries, and it should also let us use
> less index space.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
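
For comparison, the static scheme the description refers to can be sketched 
roughly as follows: for a 64-bit value, one term is produced per shift of 
precisionStep bits, no matter how the values are distributed.  This is a 
simplified illustration with hypothetical names; it leaves out the actual 
term encoding that NumericTokenStream produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: static prefix terms for a long value.
final class StaticPrefixSketch {

    /**
     * Returns one {shift, value >>> shift} pair per precision level.
     * Shift 0 is the full-precision term; larger shifts are coarser prefixes.
     */
    static List<long[]> staticPrefixTerms(long value, int precisionStep) {
        List<long[]> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(new long[] {shift, value >>> shift});
        }
        return terms;
    }

    public static void main(String[] args) {
        for (long[] t : staticPrefixTerms(0x12345678L, 16)) {
            System.out.println("shift=" + t[0] + " prefix=0x" + Long.toHexString(t[1]));
        }
    }
}
```

With precisionStep=16 every value yields 4 terms, and with precisionStep=4 it 
yields 16, whether or not the neighbouring region of term space is dense 
enough for the extra prefixes to pay off.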



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
