[jira] [Commented] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Robert Muir (JIRA) Fri, 19 Sep 2014 10:26:04 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140916#comment-14140916 ]


Robert Muir commented on LUCENE-5879:
-------------------------------------

Should we think about a migration plan for numerics?

This should be a followup issue.

Here are some thoughts.
1. Keep the current trie "encoding" for terms; it just uses precisionStep=Inf 
and lets the term dictionary add the prefix terms automatically.
2. Create a FilterAtomicReader that, for a previously trie-encoded field, 
removes the "fake" (prefix) terms on merge.

Users could continue to use NumericRangeQuery, just with the infinite precision 
step, and it will always work. It will just execute more slowly on old segments, 
as it doesn't take advantage of the trie terms that are not yet merged away.
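
To make the precision step point concrete, here is a minimal sketch in plain 
Java (no Lucene dependency). It is a simplified model and not Lucene's actual 
byte encoding: the class and method names are hypothetical, and each term is 
represented only as a (shift, value >>> shift) pair. It shows that a finite 
precision step yields several prefix terms per value, while a step of 
Integer.MAX_VALUE (standing in for "Inf" here) yields only the full-precision 
term.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of trie-encoded numeric terms; hypothetical, not Lucene code. */
final class TrieTermsSketch {

    /** One indexed term: the value with its lowest `shift` bits dropped. */
    static final class TrieTerm {
        final int shift;
        final long prefix;

        TrieTerm(int shift, long prefix) {
            this.shift = shift;
            this.prefix = prefix;
        }
    }

    /**
     * Lists the terms indexed for one long value: one term for each
     * shift = 0, precisionStep, 2 * precisionStep, ... while shift < 64.
     */
    static List<TrieTerm> termsFor(long value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<TrieTerm> terms = new ArrayList<>();
        // A long loop counter avoids overflow when precisionStep is Integer.MAX_VALUE.
        for (long shift = 0; shift < 64; shift += precisionStep) {
            terms.add(new TrieTerm((int) shift, value >>> shift));
        }
        return terms;
    }
}
```

In this model, termsFor(1234L, 16) produces four terms (shifts 0, 16, 32, 48), 
while termsFor(1234L, Integer.MAX_VALUE) produces just the shift-0 term.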

One obstacle to making this really nice is that Lucene doesn't know for sure 
that a field is numeric, so it cannot be "full-auto". Apps would have to use 
their schema (or whatever) to wrap with this reader in their merge policy.

Maybe we could provide some sugar for this, such as a wrapping merge policy 
that takes a list of field names that are numeric, or sugar to pass this to IWC 
in IndexUpgrader to force it, and so on.
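
As a rough illustration of the "list of numeric field names" idea, here is a 
sketch of just the term-filtering predicate, again in plain Java. Everything 
in it is hypothetical: the class name, the constructor, and the assumption that 
the first byte of a prefix-coded term carries the shift (0x20 + shift for 
longs, 0x60 + shift for ints, which is how I understand NumericUtils in 4.x). 
A real version would wrap the reader and filter its terms enum; this only 
models which terms would be kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: decides which terms survive a merge for known numeric fields. */
final class FakeTermFilterSketch {

    // Assumed to mirror NumericUtils.SHIFT_START_LONG / SHIFT_START_INT (Lucene 4.x).
    static final byte SHIFT_START_LONG = 0x20;
    static final byte SHIFT_START_INT = 0x60;

    private final Set<String> numericFields;

    /** numericFields comes from the application's schema; Lucene cannot infer it. */
    FakeTermFilterSketch(Set<String> numericFields) {
        this.numericFields = numericFields;
    }

    /** Keeps every term of non-numeric fields, and only shift-0 terms of numeric ones. */
    boolean keep(String field, byte[] term) {
        if (!numericFields.contains(field) || term.length == 0) {
            return true;
        }
        return term[0] == SHIFT_START_LONG || term[0] == SHIFT_START_INT;
    }

    /** Returns the terms that would remain for the field after filtering. */
    List<byte[]> filter(String field, List<byte[]> terms) {
        List<byte[]> kept = new ArrayList<>();
        for (byte[] term : terms) {
            if (keep(field, term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

The field-name set is the piece the app has to supply, which is why this 
cannot be fully automatic.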

I think it's complicated enough for a followup issue though.

> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
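
A toy model of the density idea in the description above, in plain Java. This 
is not the algorithm from the patch (which is not shown in this message): the 
class name, the fixed prefix length, and the minTermsPerPrefix threshold are 
all assumptions made for illustration. It only shows the principle that a 
prefix term is added where enough terms share that prefix, rather than 
everywhere at a static precision step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: pick prefix terms only where the term space is dense. */
final class AutoPrefixSketch {

    /**
     * Returns each prefix of the given length that is shared by at least
     * minTermsPerPrefix terms. Terms no longer than the prefix are ignored.
     */
    static List<String> choosePrefixes(List<String> sortedTerms,
                                       int prefixLength,
                                       int minTermsPerPrefix) {
        // LinkedHashMap preserves the (sorted) order in which prefixes first appear.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : sortedTerms) {
            if (term.length() > prefixLength) {
                counts.merge(term.substring(0, prefixLength), 1, Integer::sum);
            }
        }
        List<String> prefixes = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minTermsPerPrefix) {
                prefixes.add(entry.getKey());
            }
        }
        return prefixes;
    }
}
```

With the terms aa1, aa2, aa3, ab1, b1, a prefix length of 2 and a threshold of 
3, only the prefix "aa" is chosen; the sparse regions get no prefix term.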



