[ 
https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-5879:
---------------------------------------
    Attachment: LUCENE-5879.patch

New patch, folding in feedback from Rob.

First off, nothing (!!) was testing that you could use one of the
BOOLEAN_QUERY_REWRITE methods in MTQ when auto-prefix terms were
enumerated.

When I added such a test, it tripped this assert in ScoringRewrite:

{noformat}
        assert reader.docFreq(term) == termStates[pos].docFreq();
{noformat}

This trips when the term returned by IntersectTermsEnum is an
auto-prefix term, in which case reader.docFreq cannot find the term
since auto-prefix terms are invisible to all APIs except intersect.

I fixed this by adding a hackish boolean isRealTerm to BlockTermState,
and changing the assert to compare docFreq only when it knows it is
looking at a real term.
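
To make the guard concrete, here is a minimal self-contained sketch. It
is a toy model, not the code in the patch: ToyTermState, the Map standing
in for the reader, and both method names are hypothetical; only the
isRealTerm flag comes from the description above. The toy reader reports
0 for any term it cannot find.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model, not Lucene code: every name here is hypothetical except the
// isRealTerm flag, which mirrors the boolean described above.
class RealTermAssertSketch {

    /** Stand-in for BlockTermState: cached docFreq plus the isRealTerm flag. */
    static final class ToyTermState {
        final int docFreq;
        final boolean isRealTerm;

        ToyTermState(int docFreq, boolean isRealTerm) {
            this.docFreq = docFreq;
            this.isRealTerm = isRealTerm;
        }
    }

    /** Stand-in for reader.docFreq(term): terms the reader cannot see report 0. */
    static int docFreq(Map<String, Integer> visibleTerms, String term) {
        return visibleTerms.getOrDefault(term, 0);
    }

    /** Unguarded check: always compares the reader's docFreq to the state's. */
    static boolean uncheckedConsistent(Map<String, Integer> visibleTerms,
                                       String term, ToyTermState state) {
        return docFreq(visibleTerms, term) == state.docFreq;
    }

    /** Guarded check: only compares when the state is known to be a real term. */
    static boolean guardedConsistent(Map<String, Integer> visibleTerms,
                                     String term, ToyTermState state) {
        return !state.isRealTerm || docFreq(visibleTerms, term) == state.docFreq;
    }

    /** Two real terms; the auto-prefix term "fo*" is deliberately absent. */
    static Map<String, Integer> sampleTerms() {
        Map<String, Integer> terms = new HashMap<>();
        terms.put("foo", 3);
        terms.put("foobar", 2);
        return terms;
    }
}
```

In this model an auto-prefix state such as new ToyTermState(5, false) for
"fo*" fails the unguarded check but passes the guarded one, while real
terms are still verified exactly as before.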

However, I also had to fix AssertingAtomicReader.AssertingTermsEnum,
to override seekExact(BytesRef, TermState) and termState() to delegate
to the delegate (in) instead of to super ... super will fail because
it falls back to seeking by term, which cannot work here because you
can't seek by an auto-prefix term.
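
The difference between delegating to in and relying on super can be
sketched as follows. This is a toy model under stated assumptions, not
the Lucene classes: ToyTermsEnum, ToyCodecEnum, SuperFilter,
DelegatingFilter and ToyState are all hypothetical names, and the
default state-based seek is modeled as a plain seek by term that throws
when the term is not found.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model, not Lucene code: all class names here are hypothetical.
class FilterSeekSketch {

    /** Opaque stand-in for a TermState previously obtained from termState(). */
    static final class ToyState {
        final String term;

        ToyState(String term) {
            this.term = term;
        }
    }

    /** Stand-in for the base enum: the state-based seek defaults to seek-by-term. */
    abstract static class ToyTermsEnum {
        abstract boolean seekExact(String term);

        void seekExact(String term, ToyState state) {
            if (!seekExact(term)) {
                throw new IllegalArgumentException("term=" + term + " does not exist");
            }
        }
    }

    /** Stand-in for the codec's enum: it can position directly from a saved state. */
    static final class ToyCodecEnum extends ToyTermsEnum {
        private final Set<String> realTerms;
        String current; // term the enum is positioned on, or null

        ToyCodecEnum(String... realTerms) {
            this.realTerms = new HashSet<>(Arrays.asList(realTerms));
        }

        @Override
        boolean seekExact(String term) {
            // Auto-prefix terms are not in realTerms, so seek-by-term cannot find them.
            if (!realTerms.contains(term)) {
                return false;
            }
            current = term;
            return true;
        }

        @Override
        void seekExact(String term, ToyState state) {
            // No lookup by term: position straight from the saved state.
            current = state.term;
        }
    }

    /** Filter that inherits the default state-based seek (the trap). */
    static class SuperFilter extends ToyTermsEnum {
        final ToyTermsEnum in;

        SuperFilter(ToyTermsEnum in) {
            this.in = in;
        }

        @Override
        boolean seekExact(String term) {
            return in.seekExact(term);
        }
    }

    /** Filter that forwards the state-based seek to the wrapped enum. */
    static final class DelegatingFilter extends SuperFilter {
        DelegatingFilter(ToyTermsEnum in) {
            super(in);
        }

        @Override
        void seekExact(String term, ToyState state) {
            in.seekExact(term, state);
        }
    }
}
```

With a saved state for the auto-prefix term "fo*", SuperFilter throws
because the inherited method looks the term up by name, while
DelegatingFilter positions the wrapped enum without any lookup.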

This is a little scary ... because you are allowed to
seekExact(BytesRef, TermState) for an auto-prefix term, meaning
auto-prefix terms are in fact NOT invisible in this API.  This is a
possibly nasty trap for any FilterAtomicReaders out there that rely
on super ... not sure what to do.

Also, Rob noticed that CheckIndex no longer checks all bits ... this
is pretty bad ... I put a nocommit to think about what to do ...


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860
> become feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").


