[jira] [Commented] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

David Smiley (JIRA) Sun, 21 Sep 2014 12:33:53 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142590#comment-14142590
 ]


David Smiley commented on LUCENE-5879:
--------------------------------------

Some more questions:

bq. It's per-segment, so each segment will look at how its terms fall and find 
"good" places to insert the auto-prefix terms.

So for the whole segment, does it decide to insert auto-prefixes at specific 
byte lengths (e.g. 3, 5, and 7)?  Or does it vary based on specific terms?  
I'm hoping it's smart enough to vary based on specific terms.  For example, if 
hypothetically there were lots of terms sharing the common prefix "BGA", then 
it might decide "BGA" makes a good auto-prefix, but not necessarily every other 
length-3 prefix, since many of those might not make good prefixes.  Make sense?
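
To make the "BGA" question concrete, here is a minimal sketch of a per-term 
(density-based) choice as opposed to a fixed-length one.  It is plain Java, not 
block tree code: the class PrefixPicker, the method densePrefixes and the 
minTerms threshold are all hypothetical names for illustration, and the patch 
may select prefixes quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not how the LUCENE-5879 patch is implemented.
class PrefixPicker {

    /**
     * Returns the prefixes of exactly {@code len} characters that are shared
     * by at least {@code minTerms} terms. Terms no longer than {@code len}
     * are skipped, so only proper prefixes are counted.
     */
    static List<String> densePrefixes(List<String> terms, int len, int minTerms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            if (term.length() > len) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        List<String> picked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTerms) {
                picked.add(e.getKey());
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> terms =
            Arrays.asList("BGA1", "BGA2", "BGA3", "BGA4", "XYZ1", "QRS2");
        // "BGA" covers four terms; "XYZ" and "QRS" cover one each.
        System.out.println(densePrefixes(terms, 3, 3)); // prints [BGA]
    }
}
```

With a threshold of 3, only "BGA" is kept even though "XYZ" and "QRS" have the 
same length, which is the behavior I'm hoping for.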

At a low level, do I take advantage of this the same way I would at a high 
level: create a PrefixQuery, get its Weight, then get the Scorer and iterate 
docIds?  Or is there a lower-level path?  Although there is some elegance to 
not introducing new APIs, I think it's worth exploring having prefix & range 
capabilities be on the TermsEnum in some way.
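
For reference, the lower-level pattern I have in mind is "seek to the prefix, 
then step forward while terms still match", which is what 
TermsEnum.seekCeil() followed by TermsEnum.next() gives you today.  The sketch 
below only mimics that access pattern with a java.util.TreeSet standing in for 
the terms dict; PrefixScan and termsWithPrefix are hypothetical names, and no 
Lucene API is used or implied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Stand-in for a sorted terms dictionary; illustrative only, not Lucene code.
class PrefixScan {

    /** Collects every term in {@code dict} that starts with {@code prefix}. */
    static List<String> termsWithPrefix(NavigableSet<String> dict, String prefix) {
        List<String> hits = new ArrayList<>();
        // tailSet(prefix, true) starts at the first term >= prefix: the "seek".
        for (String term : dict.tailSet(prefix, true)) {
            // Terms are sorted, so the first non-matching term ends the run.
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term); // the "next()" steps
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = new TreeSet<>(
            Arrays.asList("BFZ", "BGA", "BGA1", "BGA2", "BGB", "C"));
        System.out.println(termsWithPrefix(dict, "BGA")); // prints [BGA, BGA1, BGA2]
    }
}
```

An auto-prefix term would let the terms dict answer the whole run from one 
entry instead of visiting each term, which is why I'm asking whether that 
could be surfaced at the TermsEnum level.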

Do you envision other posting formats being able to re-use the logic here?  
That would be nice.

In your future tuning, I suggest you give the ability to vary the conservative 
vs. aggressive prefixing based on the very beginning and very end (assuming 
known common lengths).  In the FlexPrefixTree Varun (GSOC) worked on, the 
number of leaves per level is configurable at each level (i.e. prefix 
length)... and it's better to have little prefixing at the very top and little 
at the bottom too.  At the top, prefixes only help for queries that span 
massive portions of the possible term space (which in spatial is rare; likely 
in other apps too).  And at the bottom (long prefixes), just shy of the maximum 
length (say 7 bytes out of 8 for a double), there is marginal value because in 
the spatial search algorithm the bottom detail is scanned over (e.g. 
TermsEnum.next()) instead of seeked to, because the data is less dense and it's 
adjacent.  This principle may apply to numeric-range queries depending on how 
they are coded; I'm not sure.
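
As a rough illustration of the tuning knob I mean, the sketch below takes a 
separate threshold per prefix length, so the top and bottom levels can be made 
conservative (or switched off) while the middle levels stay aggressive.  
Everything here is hypothetical: LevelTunedPrefixes, pick and minTermsByLen 
are made-up names, and this is neither the FlexPrefixTree code nor the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-level tuning sketch; not FlexPrefixTree or block tree code.
class LevelTunedPrefixes {

    /**
     * minTermsByLen[len] is the minimum number of terms that must share a
     * prefix of length len for that prefix to be kept. Index 0 is unused;
     * Integer.MAX_VALUE effectively disables a level.
     */
    static List<String> pick(List<String> terms, int[] minTermsByLen) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            // Count every proper prefix whose length has a configured threshold.
            for (int len = 1; len < minTermsByLen.length && len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        List<String> picked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTermsByLen[e.getKey().length()]) {
                picked.add(e.getKey());
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> terms =
            Arrays.asList("0000", "0001", "0010", "0011", "0100", "0101");
        int off = Integer.MAX_VALUE;
        // Length 1 (top) and length 3 (bottom) disabled; length 2 needs 2 terms.
        int[] minTermsByLen = {off, off, 2, off};
        System.out.println(pick(terms, minTermsByLen)); // prints [00, 01]
    }
}
```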

> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optos like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
