[ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143219#comment-14143219 ]

Michael McCandless commented on LUCENE-5879:
--------------------------------------------


bq. So for the whole segment, does it decide to insert auto-prefixes at 
specific byte lengths (e.g. 3, 5, and 7)? Or does it vary based on specific 
terms? I'm hoping it's smart enough to vary based on specific terms. For 
example, if hypothetically there were lots of terms sharing the common 
prefix "BGA", then it might decide "BGA" makes a good auto-prefix, but not 
necessarily all terms at length 3, since many others might not make good 
prefixes. Make sense?

It's dynamic, based on how terms occupy the space.

Today (and we can change this: it's an impl. detail) it assigns
prefixes just like it assigns terms to blocks.  I.e., when it sees a
given prefix matches 25 - 48 terms, it inserts a new auto-prefix
term.  That auto-prefix term replaces those 25 - 48 other terms with 1
term, and then the process "recurses", i.e. you can then have a
shorter auto-prefix term matching 25 - 48 other normal and/or
auto-prefix terms.
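
To make that concrete, here's a toy of the assignment (not the
patch's code: the 25 - 48 window is shrunk to 2 - 4 so the output
stays readable, and the "removal" below only models the
counting/recursion step, not what actually ends up in the terms
dict, which still contains the original terms):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class AutoPrefixToy {
  // The patch uses 25 - 48; shrunk here so the toy stays readable.
  static final int MIN = 2, MAX = 4;

  public static void main(String[] args) {
    TreeSet<String> entries = new TreeSet<>(Arrays.asList(
        "BGA1", "BGA2", "BGA3", "BGB1", "BGB2", "BGB3", "CX"));
    // Work from longer prefixes to shorter ones, so a short auto-prefix
    // can later swallow the auto-prefix entries created below it.
    for (int prefixLen = 3; prefixLen >= 1; prefixLen--) {
      Map<String, List<String>> byPrefix = new TreeMap<>();
      for (String e : entries) {
        if (e.length() > prefixLen) {
          byPrefix.computeIfAbsent(e.substring(0, prefixLen),
                                   k -> new ArrayList<>()).add(e);
        }
      }
      for (Map.Entry<String, List<String>> group : byPrefix.entrySet()) {
        int count = group.getValue().size();
        if (count >= MIN && count <= MAX) {
          // Emit an auto-prefix term covering this group; for the purpose
          // of the recursion it stands in for the whole group.
          entries.removeAll(group.getValue());
          entries.add(group.getKey() + "*");
        }
      }
    }
    System.out.println(entries);   // prints [BG*, CX]
  }
}
{code}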

bq. At a low level, do I take advantage of this in the same way that I might 
do so at a high level using PrefixQuery, getting the Weight, and then getting 
the Scorer to iterate docIDs? Or is there a lower-level path? Although there 
is some elegance to not introducing new APIs, I think it's worth exploring 
having prefix & range capabilities be on the TermsEnum in some way.

What this patch does is generalize/relax Terms.intersect: that method
no longer guarantees that you see "true" terms.  Rather, it now
guarantees only that the docIDs you visit will be the same.

So to take advantage of it, you pass an Automaton to Terms.intersect
and then don't care which terms you see; you only care about the
docIDs you collect by visiting the DocsEnum of every term it returns.
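
E.g., a minimal sketch of doing just that (not from the patch, and
untested; class names are per current trunk, building the
CompiledAutomaton is left out, and deleted docs are ignored):

{code:java}
import java.io.IOException;

import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.automaton.CompiledAutomaton;

public class IntersectSketch {

  /** Collect every docID whose terms match the automaton. */
  public static FixedBitSet collectMatches(LeafReader reader, String field,
                                           CompiledAutomaton automaton) throws IOException {
    FixedBitSet matched = new FixedBitSet(reader.maxDoc());
    Terms terms = reader.terms(field);
    if (terms == null) {
      return matched;
    }
    // The enum may hand back auto-prefix terms instead of the "true" terms;
    // the only guarantee is that the union of their docIDs is correct.
    TermsEnum termsEnum = terms.intersect(automaton, null);
    DocsEnum docsEnum = null;
    while (termsEnum.next() != null) {
      docsEnum = termsEnum.docs(null, docsEnum, DocsEnum.FLAG_NONE);
      int docID;
      while ((docID = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        matched.set(docID);
      }
    }
    return matched;
  }
}
{code}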

bq. Do you envision other posting formats being able to re-use the logic here?  
That would be nice.

I agree it would be nice ... and the index-time logic that identifies
the auto-prefix terms is quite standalone, so e.g. we could pull it
out and have it "wrap" the incoming Fields to insert the auto-prefix
terms.  This way it's nearly transparent to any postings format ...
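
Very roughly, something like this hypothetical wrapper (not in the
patch; AutoPrefixFields is imagined and shown as a pass-through
stub, and the delegate format name is just illustrative):

{code:java}
import java.io.IOException;

import org.apache.lucene.codecs.FieldsConsumer;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public final class AutoPrefixWrapperPostingsFormat extends PostingsFormat {

  // Delegate to whichever format you like; the name here is illustrative.
  private final PostingsFormat delegate = PostingsFormat.forName("Lucene41");

  public AutoPrefixWrapperPostingsFormat() {
    super("AutoPrefixWrapper");
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    final FieldsConsumer delegateConsumer = delegate.fieldsConsumer(state);
    return new FieldsConsumer() {
      @Override
      public void write(Fields fields) throws IOException {
        // The delegate never knows the extra auto-prefix terms aren't "real".
        delegateConsumer.write(new AutoPrefixFields(fields));
      }

      @Override
      public void close() throws IOException {
        delegateConsumer.close();
      }
    };
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    // Read side is untouched; the hard part (using the prefix terms in
    // intersect()) is what doesn't factor out so easily.
    return delegate.fieldsProducer(state);
  }

  /** Imagined wrapper: the real version would override terms(String) to
   *  splice the computed auto-prefix terms into each field's TermsEnum.
   *  Left as a pass-through stub here. */
  private static final class AutoPrefixFields extends FilterLeafReader.FilterFields {
    AutoPrefixFields(Fields in) {
      super(in);
    }
  }
}
{code}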

But the problem is, at search time, there's tricky logic in
intersect() to use these prefix terms ... factoring that out so other
formats can use it is trickier I think... though maybe we could fold
it into the default Terms.intersect() impl...

bq. In your future tuning, I suggest you give the ability to vary how 
conservative vs. aggressive the prefixing is based on the very beginning and 
very end (assuming known common lengths).  In the FlexPrefixTree Varun (GSOC) 
worked on, the leaves per level are configurable at each level (i.e. prefix 
length)... and it's better to have little prefixing at the very top and 
little at the bottom too.  At the top, prefixes only help for queries that 
span massive portions of the possible term space (which is rare in spatial; 
likely in other apps too).  And at the bottom (long prefixes just shy of the 
maximum length, say 7 bytes out of 8 for a double), there is marginal value 
because in the spatial search algorithm the bottom detail is scanned over 
(e.g. TermsEnum.next()) instead of seeked, because the data is less dense 
and adjacent.  This principle may apply to numeric-range queries depending 
on how they are coded; I'm not sure.

I agree this (how auto-prefix terms are assigned) needs more control /
experimenting.  Really the best prefixes are a function not only of
how the terms were distributed, but also of how queries will "likely"
ask for ranges.

I think it's similar to postings skip lists, where we use a different
skip-pointer frequency on the "leaf" level vs. the "upper" skip
levels.


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").


