[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Fri, 03 Oct 2014 02:58:30 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-5879:
---------------------------------------
    Attachment: LUCENE-5879.patch

bq. I think we shouldn't add the FI option at this time? 

New patch with FieldType.setIndexRanges removed, but I don't think we
should commit this approach: the feature is basically so ridiculously
expert to use that only like 3 people in the world will figure out
how.

Sure, the servers built on top of Lucene can expose a simple API,
since they "know" the schema and can open up an "index for range
searching" boolean on a field and validate you are using a
PostingsFormat (PF) that supports that... but I don't think it's
right/fair to make new, strong features of Lucene ridiculously hard
to use by direct Lucene users.

It's wonderful Lucene has such pluggable codecs now, letting users
explore all sorts of custom formats, etc., but the nasty downside of
all this freedom is that new, complex features like this one, which
offer powerful improvements to the default codec that 99% of Lucene
users would have used, must either be implemented across the board for
all codecs (a very tall order) in order to have an intuitive API, or
must be exposed only via ridiculously expert codec-specific APIs.

I don't think either choice is acceptable.

So ... I tried exploring an uber helper/utility class, that lets you
add "optimized for range/prefix" fields to docs, and "spies" on you to
determine which fields should then use a customized PF, and then gives
you sugar APIs to build range/prefix/equals queries... but even as
simple as I can make this class, it still feels way too
weird/heavy/external/uncommittable.

Maybe we should just go back to the "always index auto-prefix terms on
DOCS_ONLY fields" approach, even though 1) I then had to choose "weaker"
defaults (less added index size; smaller performance gains), and 2) it's
a total waste to add such terms to NumericFields and probably spatial
fields, which already build their own prefixes outside of Lucene.  This
is not a great solution either...


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up both term ranges (e.g. infix suggester) and
> numeric ranges at query time, and it should also let us use less
> index space.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
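
To make the contrast in the description above concrete, here is a
rough, self-contained sketch (plain Java, no Lucene APIs).  It is not
the code in the patch: the class and method names (AutoPrefixSketch,
staticPrefixTerms, densePrefixes) and the minTermsInPrefix threshold
are made up for illustration, and the density rule is a toy stand-in
for whatever the terms dict would really do.  The first method mimics
a static precisionStep, which always emits one shifted term per step
no matter what the data looks like; the second only keeps prefixes
that enough distinct terms actually share.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class AutoPrefixSketch {

    // Static scheme: one shifted term per precisionStep bits of a 64-bit
    // value, emitted regardless of how the indexed values are distributed.
    static List<Long> staticPrefixTerms(long value, int precisionStep) {
        List<Long> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(value >>> shift);
        }
        return terms;
    }

    // Toy data-driven scheme (an assumption, not the patch's algorithm):
    // keep a prefix only if at least minTermsInPrefix distinct terms share
    // it, and drop a prefix when a one-character-longer prefix matches
    // exactly the same number of terms.
    static List<String> densePrefixes(List<String> distinctTerms, int minTermsInPrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : distinctTerms) {
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        // A prefix is redundant if some child prefix has the same count.
        Set<String> redundant = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String prefix = e.getKey();
            if (prefix.length() > 1) {
                String parent = prefix.substring(0, prefix.length() - 1);
                if (counts.get(parent).equals(e.getValue())) {
                    redundant.add(parent);
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTermsInPrefix && !redundant.contains(e.getKey())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Always 4 terms for a long with precisionStep=16.
        System.out.println(staticPrefixTerms(0x123456789ABCDEF0L, 16).size());
        // Only the prefixes that are actually dense in this data set.
        System.out.println(densePrefixes(
            List.of("2014-10-01", "2014-10-02", "2014-10-03", "2015-01-01"), 3));
    }
}
```

The point of the second method is only that the set of prefix terms
depends on the data: a sparse region of term space gets no extra
terms, while a dense one does.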



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

