[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Thu, 11 Sep 2014 16:17:03 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-5879:
---------------------------------------
    Attachment: LUCENE-5879.patch

Another patch, with all real nocommits removed (the only ones that
remain are for DEBUG prints... I'll remove those before committing).
I think this is close.

I picked a "gentle" (does not add too many auto-prefix terms) default
of minItemsInPrefix=100 and maxItemsInPrefix=Integer.MAX_VALUE.

With these defaults, on the simple "random longs" test, the increase
in index size for DOCS_ONLY fields is small (137.69 MB vs 123.4 MB
without auto-prefix terms: ~12% larger).  The index is also much
smaller than with the current default precStep=16 for long
NumericField (137.69 MB vs 363.19 MB: ~62% smaller), and search time
is faster (6.0 sec vs 10.5 sec: ~43% faster).  Indexing time is ~2X
faster (29.96 sec vs 61.13 sec) than using default NumericField, but
~48% slower (29.96 sec vs 20.31 sec) than indexing simple binary
terms.

I think it's OK to just turn this on by default for all DOCS_ONLY
fields... apps can always disable by passing minItemsInPrefix=0 to
Lucene41PF or BlockTreeTermsWriter if necessary.

Note that this optimization currently only kicks in for TermRangeQuery
and PrefixQuery.  It's also possible to enable it for certain wildcard
(e.g. foo?x*) and regexp (e.g. foo[m-p]*) queries but this can wait.

There is plenty more to explore here in how auto-prefix terms are
assigned, e.g. like the multi-level postings skip lists, we could have
different thresholds for the "level 1" (the prefix matches real terms)
skip lists vs "level 2+" (the prefix also matches other prefix terms),
but we can iterate later / get feedback from users.
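
To make the role of the threshold concrete, here is a deliberately
simplified sketch of the idea: count how many terms share each prefix
and keep only the prefixes that cover at least minItemsInPrefix terms.
This is not the code in the patch (the real logic lives inside the
block tree writer and works on blocks as terms are written); the class
name PrefixDensitySketch and the method densePrefixes are hypothetical,
and treating minItemsInPrefix=0 as "disabled" simply mirrors the
convention described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: NOT the BlockTreeTermsWriter logic.
public class PrefixDensitySketch {

    /**
     * Returns, in sorted order, every proper prefix shared by at least
     * minItemsInPrefix of the given (assumed unique) terms.
     * A threshold of 0 or less is treated as "auto-prefix disabled".
     */
    public static List<String> densePrefixes(List<String> terms, int minItemsInPrefix) {
        List<String> result = new ArrayList<>();
        if (minItemsInPrefix <= 0) {
            return result; // assumption: 0 disables the feature
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            // Proper prefixes only: the full term is already indexed as itself.
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minItemsInPrefix) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms =
            Arrays.asList("000041", "000042", "000043", "000900", "123456");
        // Only prefixes in the dense "0000.." region reach the threshold of 3.
        System.out.println(densePrefixes(terms, 3));
    }
}
```

A sparse region such as "1....." gets no extra terms in this sketch,
while the dense region does, which is the property a static
precisionStep cannot give you.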

With this change an application can simply index numbers as binary
terms (carefully flipping the sign bit so negative numbers sort
before positive ones) or as fixed-width terms (zero or space
left-filled, e.g. 000042) and then run "ordinary" TermRangeQuery on
them, and should see good improvements in index size, indexing time,
and search performance vs NumericRangeFilter.
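
As a rough sketch of the two encodings mentioned above, the following
plain-Java example (no Lucene dependency) turns a long into 8
big-endian bytes with the sign bit flipped, so that unsigned
byte-by-byte comparison, which is how binary terms sort, agrees with
numeric order.  It also shows the zero-padded fixed-width form for
non-negative values.  The class and method names are hypothetical and
are not part of the patch.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only.
public class SortableLongTerms {

    /** Encodes a long as 8 big-endian bytes with the sign bit flipped. */
    public static byte[] encode(long value) {
        // XOR with Long.MIN_VALUE flips only the top (sign) bit.
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    /** Inverse of encode. */
    public static long decode(byte[] term) {
        return ByteBuffer.wrap(term).getLong() ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Fixed-width, zero left-filled decimal form of a non-negative value. */
    public static String zeroPad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different scheme");
        }
        String padded = String.format("%0" + width + "d", value);
        if (padded.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        return padded;
    }

    public static void main(String[] args) {
        // Negative values sort before positive ones after encoding.
        System.out.println(compareUnsigned(encode(-5L), encode(3L)) < 0);
        System.out.println(zeroPad(42, 6)); // prints 000042
    }
}
```

Without the sign-bit flip, two's-complement negatives would start with
a 1 bit and sort after every positive value, which would break range
queries that span zero.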


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.10, 5.0
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860
> become feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

