[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Tue, 16 Sep 2014 02:30:08 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-5879:
---------------------------------------
    Attachment: LUCENE-5879.patch

New patch: I added an option (default off) to FieldType
for the application to "hint" that the field should be
indexed for range querying/filtering if possible.
It's experimental and only allowed if the field is DOCS_ONLY, and the
javadocs explain that this is just a hint (the postings format can
ignore it entirely).

This value is propagated to the "auto write-once schema" in FieldInfo
(FieldInfo.getIndexRanges()), which has the same semantics as omitNorms
("once off, always off").  This means that if you want to try out this
new feature on a field you must fully re-index (or use
FilterAtomicReader to reprocess your entire index).
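
To make the "once off, always off" rule concrete, here is a minimal
sketch.  RangeHintSchema and update() are hypothetical names used only
for illustration; this is not Lucene's FieldInfo code, just an assumed
model of how a write-once flag could behave:

```java
// Hypothetical stand-in for a per-field write-once schema flag.
// Not Lucene's FieldInfo; it only illustrates "once off, always off".
final class RangeHintSchema {
    private boolean seen = false;         // has any document used this field yet?
    private boolean indexRanges = false;  // effective value for the whole field

    // Called once for each document that contains the field.
    void update(boolean docWantsRanges) {
        if (!seen) {
            // The first occurrence sets the starting value.
            indexRanges = docWantsRanges;
            seen = true;
        } else if (!docWantsRanges) {
            // Any later "off" turns the flag off for good.
            indexRanges = false;
        }
        // A later "on" never re-enables a flag that is already off.
    }

    boolean getIndexRanges() {
        return indexRanges;
    }

    public static void main(String[] args) {
        RangeHintSchema schema = new RangeHintSchema();
        schema.update(true);   // first document enables the hint
        schema.update(false);  // one document without it disables it
        schema.update(true);   // too late: it stays off
        System.out.println("indexRanges = " + schema.getIndexRanges());
    }
}
```

Under this model the only way to get the flag back on for an existing
field is to start from an empty schema, which is why a full re-index is
needed.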

Since this is now an "opt-in" feature, I also made the defaults more
aggressive (25/48 min/max items per auto-prefix term: the same defaults
block tree uses for pinching off a block of terms, i.e. one prefix term
per leaf term block).

At these defaults the index is 52% larger for random longs (187.08 MB
vs the 123.4 MB baseline) but 48% smaller than the default
(precStep=16) numeric field (NF), and search time is 45% faster
(5790 msec vs 10466 msec with NF precStep=16).  Indexing speed is about
the same as NF.
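
For readers who want to re-derive the percentages from the raw figures,
a small sketch follows.  The class and method names are made up for
this note, and only numbers quoted in this message are used:

```java
// Recomputes the quoted percentages from the raw numbers in this message.
// Class and method names are illustrative only.
final class PatchNumbers {
    // Percent by which value exceeds baseline.
    static double percentLarger(double value, double baseline) {
        return (value / baseline - 1.0) * 100.0;
    }

    // Percent saved when going from before to after.
    static double percentSaved(double after, double before) {
        return (1.0 - after / before) * 100.0;
    }

    public static void main(String[] args) {
        // Index size for random longs: 187.08 MB vs 123.4 MB baseline.
        System.out.printf("index size growth: %.1f%%%n",
                percentLarger(187.08, 123.4));
        // Search time: 5790 msec vs 10466 msec with NF precStep=16.
        System.out.printf("search time saved: %.1f%%%n",
                percentSaved(5790, 10466));
    }
}
```

The growth figure rounds to the 52% quoted above, and the search time
saving rounds to the quoted 45%.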

Separately I also tested the "sequential IDs" case, indexing 100M ids in
left-zero-prefixed sequential form (0001, 0002, ...); normally one
wouldn't enable indexRanges for such a field (the "normal" use case is
just PK lookup) but I still wanted to see how auto-prefix terms behave
on such a "dense" term space:

  * Index was 14.1% larger (874 MB to 997 MB)

  * Indexing time was 2.5X slower (227 sec to 562 sec)

Net/net I think sequential IDs / densely packed terms are an "easy"
case since block tree can easily hit the max (default now 48) terms in
each auto-prefix term.  Also the postings compress very well since the
docIDs are all adjacent (I indexed with a single thread).

Tests pass, "ant precommit" passes, I think it's ready.


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.10, 5.0
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
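
The density idea in the description above can be illustrated with a toy
sketch.  This is not the algorithm in the patch: it ignores the max
limit, floor blocks and the bottom-up way block tree builds blocks, and
DensityPrefixSketch, choosePrefixes and minTerms are hypothetical names.
It only shows prefix terms being chosen where the term space is dense
and skipped where it is sparse:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: pick prefixes that cover "enough" terms.
// Assumes the input terms are distinct.
final class DensityPrefixSketch {
    // Returns, in sorted order, every proper prefix shared by at least
    // minTerms of the given terms.
    static List<String> choosePrefixes(List<String> terms, int minTerms) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            // Count the term once under each of its proper prefixes.
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTerms) {
                chosen.add(e.getKey());
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        // Dense region: 0100 .. 0129 (30 terms).
        for (int i = 100; i < 130; i++) {
            terms.add(String.format("%04d", i));
        }
        // Sparse region: a few isolated terms.
        terms.add("5000");
        terms.add("7310");
        terms.add("9999");
        System.out.println(choosePrefixes(terms, 25));
    }
}
```

With minTerms=25 only the prefixes "0" and "01" qualify, because the 30
dense terms share them; the three isolated terms get no prefix term at
all.  A static precisionStep, by contrast, would add prefix terms for
every value regardless of how its neighbors are distributed.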



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

