[jira] [Commented] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Fri, 20 Mar 2015 07:53:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371401#comment-14371401
 ]


Michael McCandless commented on LUCENE-5879:
--------------------------------------------

bq. I don't understand why these need to be tied to FixedBitSet

I'll cutover to BitSet.

bq. Alternatively, code could stay where it is

I'll leave it as is (dark addition to BlockTree).

bq. in codecs/ we could have AutoPrefixPF that exposes it and make it 
experimental or something?

Good idea!  I'll do this... this way Lucene50PF is unchanged.

bq. I can't parse this. if you use a scoring rewrite, it still "works" right? 

It does "work" (you get the right hits) ... TestPrefixQuery/TestTermRangeQuery 
randomly use SCORING_BOOLEAN_REWRITE and
CONSTANT_SCORE_BOOLEAN_REWRITE.

bq. It's just that the generated term queries will contain pseudo-terms, but 
their statistics etc. are all correct?

Right: they will use the auto-prefix terms, which have "correct" stats
(i.e. docFreq is number of docs containing any term with this
prefix).  Is this too weird?  It means you get different scores than
you get today...
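
To make the docFreq point concrete, here is a minimal sketch in plain Java (not Lucene code; the class name, the helper names, and the toy postings are made up for illustration). It treats an auto-prefix term's docFreq as the size of the union of the postings of every term sharing that prefix, which can be smaller than the sum of the per-term docFreqs:

```java
import java.util.BitSet;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from the patch.
class AutoPrefixDocFreq {
    /** Number of docs containing any term that starts with the given prefix. */
    static int prefixDocFreq(SortedMap<String, BitSet> postings, String prefix) {
        BitSet union = new BitSet();
        // Terms sharing a prefix are contiguous in sorted order. The upper
        // bound prefix + '\uffff' is a simplification good enough for a toy.
        for (BitSet docs : postings.subMap(prefix, prefix + Character.MAX_VALUE).values()) {
            union.or(docs);
        }
        return union.cardinality();
    }

    /** Builds a doc-id set from the given ids. */
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        SortedMap<String, BitSet> postings = new TreeMap<>();
        postings.put("foo", docs(0, 1));
        postings.put("foobar", docs(1, 2));
        postings.put("fox", docs(3));
        // Doc 1 contains both "foo" and "foobar" but is counted once.
        System.out.println(prefixDocFreq(postings, "foo")); // prints 3
        System.out.println(prefixDocFreq(postings, "fo"));  // prints 4
    }
}
```

A scoring rewrite that sees one pseudo-term with docFreq 3 will score differently from one that sees two real terms with docFreq 2 each, which is the difference in scores mentioned above.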

We could maybe turn off auto-prefix if you use these rewrite methods?
But this would need an API change to Terms, e.g. a new boolean
allowAutoPrefix to Terms.intersect.

bq. I definitely understand the RANGE case, it's difficult to make the equiv 
automaton.

It's not so bad; I added Automata.makeBinaryInterval in the patch
for this.  It's like the decimal ranges that Automata.makeInterval already does.

bq. Why not just make PrefixQuery subclass AutomatonQuery?

I explored this, but it turns out to be tricky for those PFs that
don't have auto-prefix terms (only block tree has them)...

I.e., with the patch as it is now, PFs like SimpleText will use a
PrefixTermsEnum for PrefixQuery, but if I fix PrefixQuery to subclass
AutomatonQuery (and remove AUTOMATON_TYPE.PREFIX) then SimpleText
would use AutomatonTermsEnum (on a prefix automaton) which I think
will be somewhat less efficient?  Maybe it's not so bad in practice?  ATE
would realize it's in a "linear" part of the automaton...
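
For reference, the dedicated prefix enumeration amounts to one seek plus a linear scan. The sketch below is plain Java over a sorted set, not Lucene's PrefixTermsEnum; the class and method names are hypothetical. An automaton-driven enum reaches the same terms but also has to step the automaton for each candidate, which is where the extra cost would come from:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; this is not how any Lucene class is written.
class PrefixScanSketch {
    /** Returns all terms starting with prefix, in sorted order. */
    static List<String> termsWithPrefix(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        // Seek to the first term >= prefix, then scan forward until the
        // first term that no longer starts with the prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms =
            new TreeSet<>(Arrays.asList("bar", "foo", "foobar", "fox", "zap"));
        System.out.println(termsWithPrefix(terms, "foo")); // prints [foo, foobar]
    }
}
```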

Maybe we can somehow simplify things here ... I agree both PrefixQuery
and TermRangeQuery should ideally just subclass AutomatonQuery.


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should also let us use less
> index space.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
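
The density idea in the description above can be sketched in a few lines of plain Java. This is only a toy under stated assumptions, not the algorithm in the patch: the class name, the method name, and the minTerms threshold are all hypothetical, terms are assumed distinct, and real block tree code works on blocks of pending terms rather than a map of counts. The point it illustrates is that a prefix gets indexed only where enough terms share it, instead of at every fixed precisionStep:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the selection logic from the patch.
class DensityPrefixSketch {
    /** Returns, in sorted order, every proper prefix shared by at least minTerms terms. */
    static List<String> choosePrefixes(List<String> terms, int minTerms) {
        // Count how many terms fall under each proper prefix.
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        // Keep only the prefixes covering a dense enough region of term space.
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTerms) {
                chosen.add(e.getKey());
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("100", "101", "102", "110", "250", "900");
        // "1" covers four terms and "10" covers three; the sparse regions
        // around "250" and "900" get no prefix term at all.
        System.out.println(choosePrefixes(terms, 3)); // prints [1, 10]
    }
}
```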



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
