[ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157857#comment-14157857 ]

Robert Muir commented on LUCENE-5879:
-------------------------------------

{quote}
but the nasty downside of
all this freedom is that new, complex features like this one, which
offer powerful improvements to the default codec that 99% of Lucene
users would have used, must either be implemented across the board for
all codecs (a very tall order) in order to have an intuitive API, or
must be exposed only via ridiculously expert codec-specific APIs.
{quote}

I don't think it's a downside of the freedom; those are just other problems.

However, there are way, way too many experimental codecs. In some ways these are 
even more costly to maintain than the backwards ones: they are rotated in all 
tests! For many recent changes I have spent just as much time fixing them as I 
have fixing backwards codecs. And if we ever want to provide backwards 
compatibility for experimental codecs (users are constantly confused that we 
can't do this), then we have to tone them down anyway.

The existing trie encoding is difficult to use, too. I don't think it should 
serve as your example for this feature. Remember that simple numeric range 
queries don't work with the QueryParser unless the user subclasses it, and 
numerics don't really work well with the parser at all, because the analyzer is 
completely unaware of them (for some crazy reason they are implemented as a 
TokenStream/special fields rather than as a more "ordinary" analysis-chain 
integration).
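
To make that concrete, this is roughly the subclassing a user is forced into 
today with the classic QueryParser (a sketch against the 4.x APIs; the numeric 
field name "price" is a made-up example):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class NumericAwareQueryParser extends QueryParser {
  public NumericAwareQueryParser(Version version, String defaultField, Analyzer analyzer) {
    super(version, defaultField, analyzer);
  }

  @Override
  protected Query getRangeQuery(String field, String part1, String part2,
                                boolean startInclusive, boolean endInclusive)
      throws ParseException {
    // The parser has no schema, so the user must hardcode which fields
    // are trie-encoded ("price" is a hypothetical example field):
    if ("price".equals(field)) {
      return NumericRangeQuery.newLongRange(field,
          part1 == null ? null : Long.valueOf(part1),
          part2 == null ? null : Long.valueOf(part2),
          startInclusive, endInclusive);
    }
    // Everything else falls back to the default term range, which would
    // silently return garbage for a numeric field.
    return super.getRangeQuery(field, part1, part2, startInclusive, endInclusive);
  }
}
{code}

Note how the parser itself has no idea which fields are numeric; get the 
hardcoded list wrong and you silently get a TermRangeQuery over trie terms.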

The .document API is overengineered. I don't understand why it needs to be so 
complicated. Because it has already bitten off more than it can chew, it's 
impossible to even think about how it could work with the codec API, and I 
think this is causing a lot of your frustration.

The whole "Lucene is schemaless" line is fucking bogus; it only means that it's 
on you, the user, to record and manage and track all this stuff yourself. 
That's no freedom for anyone, just pain. For example, we don't even know which 
fields have trie-encoded terms here, so we can't do any kind of nice migration 
strategy from "old numerics" to this at all. That's really sad and will just 
cause users more pain and confusion.

FieldInfo is a hardcore place to add an experimental option when we aren't even 
sure yet how it should behave (e.g. should it really be limited to DOCS_ONLY? 
Who knows?).

I can keep complaining, and we can keep ranting about this stuff on this issue, 
but maybe you should commit what you have (yes, with the crappy, hard-to-use 
codec option) so we can try to do something about it on another issue instead.

> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream), outside of the indexer/codec, which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real-world applications
> the terms have a non-random distribution.
> So it would be better if instead the terms dict decided where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs. today, where the externally computed prefix terms are
> placed after the full-precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").
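
To make the inefficiency described above concrete, here is a minimal sketch, 
in plain Java rather than the real NumericTokenStream/NumericUtils code, of 
what the static-precisionStep scheme precomputes for every indexed value:

{code:java}
public class TriePrefixSketch {
  // For a fixed precisionStep, every value emits the same ladder of prefix
  // terms, each with the low `shift` bits zeroed. The real encoding also
  // prepends the shift to each term so different precisions don't collide.
  static long[] prefixTerms(long value, int precisionStep) {
    long[] terms = new long[(64 + precisionStep - 1) / precisionStep];
    int i = 0;
    for (int shift = 0; shift < 64; shift += precisionStep) {
      terms[i++] = value & ~((1L << shift) - 1); // zero the low `shift` bits
    }
    return terms;
  }

  public static void main(String[] args) {
    // precisionStep=16: four terms per value (the full-precision term plus
    // three prefixes), always, even in regions of term space so sparse that
    // the extra prefix terms never pay off
    for (long term : prefixTerms(0x123456789ABCDEF0L, 16)) {
      System.out.println(Long.toHexString(term));
    }
  }
}
{code}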
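By contrast, here is a conceptual sketch of the density-driven idea, my own 
illustration rather than anything in the patch: insert an auto-prefix term only 
where the number of terms sharing a prefix falls inside a hypothetical 
[minItems, maxItems] window, recursing when a prefix is too dense. It also 
shows the over-constrained adversary case the description ends with:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AutoPrefixSketch {
  // `terms` stands in for the sorted terms the terms dict sees for one
  // field; minItems/maxItems are made-up tuning knobs, analogous in spirit
  // to block tree's min/max block sizes.
  static void pick(List<String> terms, String prefix,
                   int minItems, int maxItems, List<String> out) {
    List<String> matching = new ArrayList<>();
    for (String t : terms) {
      if (t.startsWith(prefix)) matching.add(t);
    }
    if (matching.size() < minItems) return;  // too sparse: a prefix term won't pay off
    if (matching.size() <= maxItems) {       // dense enough: insert one auto-prefix term
      out.add(prefix + "*");
      return;
    }
    for (char c = 'a'; c <= 'z'; c++) {      // too dense: descend to finer prefixes
      pick(matching, prefix + c, minItems, maxItems, out);
    }
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("aaa", "aab", "aac", "aad", "zzz");
    List<String> out = new ArrayList<>();
    pick(terms, "", 2, 4, out);
    System.out.println(out); // [a*]: only the dense region earns a prefix term
    // With maxItems=3 instead, "a" is too dense but every "aa..." child is
    // too sparse, so nothing gets emitted: that is the over-constrained case
    // that block tree's "floor term blocks" are built to handle.
  }
}
{code}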


