[jira] Issue Comment Edited: (LUCENE-1470) Add TrieRangeQuery to contrib

Uwe Schindler (JIRA) Thu, 27 Nov 2008 07:11:15 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651367#action_12651367 ]


thetaphi edited comment on LUCENE-1470 at 11/27/08 7:09 AM:
-----------------------------------------------------------------

Here is a new version of the patch, supporting 8bit, 4bit and 2bit trie
variants. The API is similar, but instead of using TrieUtils as a static class,
you can choose among 3 singleton converters, e.g.
TrieUtils.VARIANT_8BIT.convertSomething(). TrieRangeQuery and TrieRangeFilter
both accept a parameter for choosing the variant. A default can be set with the
static TrieRangeFilter.setDefaultTrieVariant(...) and read with the
corresponding getter.
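
The shape of such an API might look roughly like the sketch below. It is only
an illustration of "singleton variants plus a static default": the class name
TrieVariantSketch, the enum, the field bitsPerLevel, the method
precisionLevels() and the 8bit default are all assumptions and are not taken
from the patch.

```java
import java.util.Objects;

// Hypothetical sketch: none of these names are taken from the patch itself.
class TrieVariantSketch {

    /** Three singleton variants, each fixing how many bits one trie level consumes. */
    enum Variant {
        VARIANT_8BIT(8), VARIANT_4BIT(4), VARIANT_2BIT(2);

        final int bitsPerLevel;

        Variant(int bitsPerLevel) {
            this.bitsPerLevel = bitsPerLevel;
        }

        /** Number of precision levels needed to cover a 64-bit long. */
        int precisionLevels() {
            return Long.SIZE / bitsPerLevel;
        }
    }

    // Assumed default; the comment does not say which variant the patch defaults to.
    private static volatile Variant defaultVariant = Variant.VARIANT_8BIT;

    static void setDefaultTrieVariant(Variant variant) {
        defaultVariant = Objects.requireNonNull(variant);
    }

    static Variant getDefaultTrieVariant() {
        return defaultVariant;
    }

    public static void main(String[] args) {
        setDefaultTrieVariant(Variant.VARIANT_4BIT);
        Variant v = getDefaultTrieVariant();
        System.out.println(v + " uses " + v.precisionLevels() + " precision levels");
    }
}
```

With a design like this, indexing code and query code can both read the same
default, which makes it harder to mix variants by accident.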

To Paul's suggestion: You could put the trie variant into the field name, but
the main field name, where the full-precision value is indexed/stored/whatever,
has no suffix. This would make it inconsistent.

In my opinion, one should choose the same trie variant when indexing values and
when doing range queries. Using different variants would be like choosing
different analyzers for indexing and searching. Setting the default trie
variant with this static method may be better hosted in TrieUtils, but this is
open for discussion.

Now it is time to do some performance tests with really big indexes, comparing
the trie variants :-) Does anybody have a synthetic benchmark that can be
easily used for this? The test case is not so interesting (it uses a
RAMDirectory). I could reindex our PANGAEA index for performance testing in the
real world (doubles, dates).

The test case (if you uncomment the printout of the number of terms in
TrieRangeFilter) clearly shows the lower number of terms visited for 4bit or
2bit. Formulas are in the package description.

In my opinion 4bit is a good alternative (about 3 times the space requirement
for about 8.5x fewer terms), but the gain from 2bit is too low for the about
6 times larger space.
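
The term-count side of this trade-off can be checked roughly with a short
calculation. The sketch below uses the worst-case bound that the documentation
of Lucene's later NumericRangeQuery gives for the number of terms one range
query visits. That this is the same formula as in the package description of
this patch is an assumption, and the class and method names are made up for
illustration.

```java
// Hypothetical helper; the bound is assumed to match the package description.
class TrieTermBound {

    /**
     * Worst-case number of distinct terms visited by one range query:
     * (levels - 1) * (2^bitsPerLevel - 1) * 2 + (2^bitsPerLevel - 1).
     * Both edges of the range can need up to (2^bitsPerLevel - 1) terms on
     * every level except the coarsest, which needs them only once.
     */
    static long maxTermsVisited(int bitsPerValue, int bitsPerLevel) {
        long levels = bitsPerValue / bitsPerLevel;
        long perLevel = (1L << bitsPerLevel) - 1;
        return (levels - 1) * perLevel * 2 + perLevel;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {8, 4, 2}) {
            System.out.println(bits + "bit: at most "
                + maxTermsVisited(Long.SIZE, bits) + " terms visited");
        }
    }
}
```

Under this bound the 8bit variant visits at most 3825 terms, the 4bit variant
465 and the 2bit variant 189, i.e. roughly 8x fewer for 4bit but only about
2.5x fewer again for 2bit. The space figures quoted in the comment are not
derived here.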

> Add TrieRangeQuery to contrib
> -----------------------------
>
>                 Key: LUCENE-1470
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1470
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Uwe Schindler
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch
>
>
> According to the thread in java-dev
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to
> include my fast numerical range query implementation in Lucene
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in the index in a special format.
> The idea behind this is to store the longs in different precisions in the
> index and partition the query range in such a way that the outer boundaries
> are searched using terms from the highest precision, but the center of the
> search range with lower precision. The implementation stores the longs in 8
> different precisions (using a class called TrieUtils). It also has support
> for doubles, using the IEEE 754 floating-point "double format" bit layout
> with some bit mappings to make them binary sortable. The approach is used in
> rather big indexes; query times are <<100 ms (!), even on low-performance
> desktop computers, for very big ranges on indexes with 500000 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known trie structures (it is not identical to
> real tries, but the algorithms are related).
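
The range partitioning described in the issue can be sketched as follows. This
is a simplified illustration, not the code from the patch: it assumes small
non-negative values (16 bits in the demo) so that no overflow handling is
needed, and the class and method names are hypothetical. Each sub-range is
returned as {shift, lo, hi}, where a sub-range at a given shift would be
matched by terms that drop the lowest "shift" bits of the value.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical and this is not the patch's code.
class TrieRangeSplitSketch {

    /**
     * Splits [min, max] into sub-ranges {shift, lo, hi}, finest precision at the edges.
     * Assumes 0 <= min, max < 2^valueBits, valueBits <= 62 and a multiple of step.
     */
    static List<long[]> split(long min, long max, int valueBits, int step) {
        List<long[]> out = new ArrayList<>();
        if (min > max) return out;
        for (int shift = 0; ; shift += step) {
            long diff = 1L << (shift + step);          // span of one term on the next level
            long mask = ((1L << step) - 1L) << shift;  // bits that belong to this level
            boolean hasLower = (min & mask) != 0L;     // lower edge not aligned to next level
            boolean hasUpper = (max & mask) != mask;   // upper edge not aligned to next level
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + step >= valueBits || nextMin > nextMax) {
                add(out, shift, min, max);             // nothing left for a coarser level
                return out;
            }
            if (hasLower) add(out, shift, min, min | mask);
            if (hasUpper) add(out, shift, max & ~mask, max);
            min = nextMin;
            max = nextMax;
        }
    }

    private static void add(List<long[]> out, int shift, long lo, long hi) {
        // Fill the bits dropped on this level so hi is the last value actually covered.
        out.add(new long[] {shift, lo, hi | ((1L << shift) - 1L)});
    }

    /** Number of index terms needed to match all sub-ranges. */
    static long termCount(List<long[]> ranges) {
        long n = 0;
        for (long[] r : ranges) n += ((r[2] - r[1]) >> r[0]) + 1;
        return n;
    }

    /** Number of values covered, given that the sub-ranges do not overlap. */
    static long coveredValues(List<long[]> ranges) {
        long n = 0;
        for (long[] r : ranges) n += r[2] - r[1] + 1;
        return n;
    }

    public static void main(String[] args) {
        List<long[]> ranges = split(0x0123, 0x4567, 16, 4);
        for (long[] r : ranges) {
            System.out.printf("shift=%2d  [0x%04X, 0x%04X]%n", r[0], r[1], r[2]);
        }
        System.out.println(termCount(ranges) + " terms cover "
            + coveredValues(ranges) + " values");
    }
}
```

For the range 0x0123 to 0x4567 with 4 bits per level, this sketch needs 62
terms to cover 17477 values, because only the edges of the range are matched
at full precision.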

