RE: [jira] Commented: (LUCENE-1673) Move TrieRange to core

Uwe Schindler Tue, 09 Jun 2009 14:53:46 -0700

No, we do not have such an issue, as far as I know. Storing some
version/field type info would be great. In that case we could maybe extend
TrieRange in the future to use a different encoding, or e.g. CSF for the
highest precision (as Michael Busch suggested in Amsterdam).

Because TrieRange has been in contrib until now, I did not want to
modify the index internals and file formats for a contrib extension. But if
it moves to core, I could create a subclass of AbstractField for numeric
values. The type would be stored in FieldInfos, which makes it possible to
autodetect the SortField/FieldCache type and to recreate the AbstractField
subtype for stored fields. We may even encode the stored field contents using
the prefix encoding, which is good for floats/doubles because the
human-readable transformation from/to string may lose information.
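To illustrate the point about floats/doubles: below is a minimal sketch of one common way to map a double to a long whose signed order matches the numeric order, so the value round-trips bit for bit. The class and method names are hypothetical and not taken from TrieUtils, and the "%.6f" text format is only an assumed example of a human-readable form that drops digits.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
final class SortableDoubleSketch {
    private SortableDoubleSketch() {}

    // Maps a double to a long whose signed order matches numeric order.
    static long encode(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles have the sign bit set. Flipping the other 63 bits
        // reverses their order so that more-negative values sort first.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    // Inverse of encode(): the sign bit is untouched, so the same test applies.
    static double decode(long sortable) {
        if (sortable < 0) {
            sortable ^= 0x7fffffffffffffffL;
        }
        return Double.longBitsToDouble(sortable);
    }

    public static void main(String[] args) {
        double[] values = {3.25, -1.5, 0.1 + 0.2, -1e300, 0.0};
        long[] encoded = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = encode(values[i]);
        }
        // Sorting the encoded longs sorts the original doubles.
        Arrays.sort(encoded);
        for (long e : encoded) {
            System.out.println(decode(e));
        }

        // A fixed-precision text form (assumed example) drops low-order digits,
        // while the binary form returns exactly the same double.
        double x = 0.1 + 0.2;
        double viaText = Double.parseDouble(String.format(Locale.ROOT, "%.6f", x));
        System.out.println("text round trip exact:   " + (x == viaText));
        System.out.println("binary round trip exact: " + (x == decode(encode(x))));
    }
}
```

Whether a string form actually loses information depends on how it is produced; the sketch only shows that the binary form never has to.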

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

  _____  

From: Jason Rutherglen [mailto:[email protected]] 
Sent: Tuesday, June 09, 2009 8:48 PM
To: [email protected]
Subject: Re: [jira] Commented: (LUCENE-1673) Move TrieRange to core

> I wonder if we could handle this by adding a setting in FieldInfo?

Do we have an issue open that allows any metadata on a per field basis?
This seems like something flexible indexing will require?

On Tue, Jun 9, 2009 at 10:15 AM, Michael McCandless (JIRA) <[email protected]>
wrote:

   [ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717754#action_12717754 ]

Michael McCandless commented on LUCENE-1673:
--------------------------------------------

{quote}

In Solr there are three different impls:

Trie (of course)
Text-only numbers (do not work with range queries, but can be used for
sorting etc.)
A binary encoding (also used by LocalLucene at the moment), that is
sortable. This can be used for RangeQueries, but sorting is slow (because
they have no parser, and at the time it was implemented, SortField had no
parser support)

{quote}

Ahh OK, this is just Solr's pre-existing numeric field support.  (I
had thought you meant Solr had a different impl for Trie).

bq. The problem: because of backwards compatibility, they need to be
preserved (so that old indexes can still be read).

This is indeed quite a challenge.  Actually, is there anything in Trie
that encodes which version of the format is indexed in a given
segment?  (So that if we do ever change the indexed format, we can
bump a version somewhere to keep back compat).

bq. Maybe we use a static factory instead of the same ctor. That way the
names differ, but each factory just creates the correct instance of the same
class: NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and
so on. Same for the TokenStreams (and the Field?)

That sounds like a good approach?
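
A minimal sketch of the static-factory idea quoted above, assuming a single class with a private constructor. The class name, field layout and parameter order are hypothetical and only illustrate the pattern; they are not the API proposed in the issue.

```java
// Hypothetical sketch of the static-factory pattern; not Lucene code.
final class NumericRangeSketch {
    enum Type { INT, LONG, FLOAT, DOUBLE }

    final String field;
    final Type type;
    final Number min;        // null means an open lower bound
    final Number max;        // null means an open upper bound
    final int precisionStep;

    // Private: callers must go through a factory whose name states the type.
    private NumericRangeSketch(String field, Type type, Number min, Number max,
                               int precisionStep) {
        this.field = field;
        this.type = type;
        this.min = min;
        this.max = max;
        this.precisionStep = precisionStep;
    }

    static NumericRangeSketch newIntRange(String field, Integer min, Integer max,
                                          int precisionStep) {
        return new NumericRangeSketch(field, Type.INT, min, max, precisionStep);
    }

    static NumericRangeSketch newLongRange(String field, Long min, Long max,
                                           int precisionStep) {
        return new NumericRangeSketch(field, Type.LONG, min, max, precisionStep);
    }

    static NumericRangeSketch newFloatRange(String field, Float min, Float max,
                                            int precisionStep) {
        return new NumericRangeSketch(field, Type.FLOAT, min, max, precisionStep);
    }

    static NumericRangeSketch newDoubleRange(String field, Double min, Double max,
                                             int precisionStep) {
        return new NumericRangeSketch(field, Type.DOUBLE, min, max, precisionStep);
    }

    @Override
    public String toString() {
        return type + " " + field + ":[" + (min == null ? "*" : min)
                + " TO " + (max == null ? "*" : max) + "] step=" + precisionStep;
    }

    public static void main(String[] args) {
        // Half-open range: no upper bound.
        System.out.println(newFloatRange("price", 1.5f, null, 4));
        System.out.println(newLongRange("timestamp", 0L, 1000L, 8));
    }
}
```

With boxed parameters, a call such as newLongRange("timestamp", 0, 1000, 8) does not compile, because Java will not box an int literal to Long. A forgotten L then becomes a compile error instead of silently selecting another numeric type.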

{quote}
> When you want to sort, pass the TrieUtils.FIELD_CACHE_LONG_PARSER
> to your SortField

Or add new SortField types.

The problem with all this: For old indexes, we need some backwards
compatibility. Ideally we would just create numeric fields in the new way
and reuse e.g. SortField.INT for this. But this cannot be done. Or even,
replace the FieldCache parsers by the trie ones. But this cannot be done at
the moment.
{quote}

I wonder if we could handle this by adding a setting in FieldInfo?
Ie, to record that "this numeric field was indexed as a trie".  Then,
when we need to get the parser for SortField.INT, we'd check the
FieldInfo to see which parser to use.  This could also handle
back-compat, ie if we change the trie format being written we'd change
the setting and segment merging would gradually upgrade previously
indexed fields.
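
A rough sketch of that lookup, assuming a per-field setting that names the encoding. Everything here is hypothetical: the enum, the registry class and the toy "trie" term format (8 hex digits with the sign bit flipped) are stand-ins, not Lucene's FieldInfo or its prefix encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: choose a FieldCache-style parser from per-field metadata.
final class FieldEncodingSketch {
    // Stand-in for a setting recorded per field; the names are illustrative.
    enum NumericEncoding { PLAIN_TEXT, TOY_TRIE_V1 }

    interface IntParser {
        int parseInt(String term);
    }

    // Parser for numbers indexed as plain decimal text.
    static final IntParser TEXT_PARSER = Integer::parseInt;

    // Parser for the toy format below; NOT Lucene's real prefix encoding.
    static final IntParser TOY_TRIE_PARSER =
            term -> ((int) Long.parseLong(term, 16)) ^ 0x80000000;

    // Toy encoder: flip the sign bit and write 8 hex digits, so that the
    // string order of terms matches the numeric order of the ints.
    static String toyTrieEncode(int value) {
        return String.format("%08x", value ^ 0x80000000);
    }

    private final Map<String, NumericEncoding> encodingByField = new HashMap<>();

    void record(String field, NumericEncoding encoding) {
        encodingByField.put(field, encoding);
    }

    // Fields without a recorded setting fall back to the old text parser,
    // which is what keeps previously written indexes readable.
    IntParser parserFor(String field) {
        NumericEncoding encoding =
                encodingByField.getOrDefault(field, NumericEncoding.PLAIN_TEXT);
        switch (encoding) {
            case TOY_TRIE_V1:
                return TOY_TRIE_PARSER;
            default:
                return TEXT_PARSER;
        }
    }

    public static void main(String[] args) {
        FieldEncodingSketch infos = new FieldEncodingSketch();
        infos.record("price", NumericEncoding.TOY_TRIE_V1);

        System.out.println(infos.parserFor("price").parseInt(toyTrieEncode(-42)));
        System.out.println(infos.parserFor("legacyCount").parseInt("17"));
    }
}
```

A later format change would add another enum constant and another parser, so old and new segments could be read side by side until merging rewrites the old ones.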

{quote}
> I'd also like to rename RangeQuery to something else, with this
> change. EG TermRangeQuery... to emphasize that you use it for
> non-numbers. The javadocs of TermRangeQuery should point to
> Int/LongRangeQuery as strongly preferred for numeric ranges.

Cool. For the others, too (FieldCacheRangeQuery).

{quote}

Yes.

> Move TrieRange to core
> ----------------------
>
>                 Key: LUCENE-1673
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1673
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>
>
> TrieRange was iterated many times and seems stable now (LUCENE-1470,
LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to
its default FieldTypes (SOLR-940) and if possible I want to move it to core
before release of 2.9.
> Before this can be done, there are some things to think about:
> # There are now classes called LongTrieRangeQuery and IntTrieRangeQuery; what
should they be called in core? I would suggest leaving them as they are. On the
other hand, if this remains our only numeric query implementation, we could
call them LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below for the
problems with that). Same for the TokenStreams and Filters.
> # Maybe the pairs of classes for indexing and searching should be merged
into one class each: NumericTokenStream, NumericRangeQuery, NumericRangeFilter.
The problem here: the ctors must accept int, long, double and float as
range parameters. For the end user, mixing these 4 types in one class is
hard to handle. If somebody forgets to add an L to a long, it suddenly
instantiates an int version of the range query, hitting no results, and so on.
Same with the other types. Maybe accept java.lang.Number as the parameter
(because it is nullable for half-open bounds) plus one enum for the type.
> # TrieUtils move into o.a.l.util? or document or?
> # Move TokenStreams into o.a.l.analysis, ShiftAttribute into
o.a.l.analysis.tokenattributes? Somewhere else?
> # If we rename the classes, should Solr stay with Trie (because there are
different impls)?
> # Maybe add a subclass of AbstractField that automatically creates these
TokenStreams and omits norms/tf by default, for easier addition to Document
instances?
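
The overload pitfall from the second point of the description can be reproduced in a few lines. This is only a sketch with hypothetical names: two overloads that differ by primitive type, next to the Number-plus-enum variant suggested there.

```java
// Hypothetical sketch of the overload pitfall; not Lucene code.
final class OverloadPitfallSketch {
    enum NumericType { INT, LONG, FLOAT, DOUBLE }

    // Overloads that differ only in primitive type: literals pick the variant.
    static String range(int min, int max) {
        return "int";
    }

    static String range(long min, long max) {
        return "long";
    }

    // Alternative: the caller states the type; null marks an open bound.
    static String range(NumericType type, Number min, Number max) {
        return type + "[" + (min == null ? "*" : min)
                + " TO " + (max == null ? "*" : max) + "]";
    }

    public static void main(String[] args) {
        // Without the L suffix the int overload is chosen silently.
        System.out.println(range(10, 20));
        // One long argument is enough to select the long overload.
        System.out.println(range(10L, 20));
        // The explicit enum keeps the intended type, whatever the literal is.
        System.out.println(range(NumericType.LONG, 10, null));
    }
}
```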

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
