[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib

Uwe Schindler (JIRA) Wed, 03 Dec 2008 05:28:49 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652780#action_12652780
 ]


Uwe Schindler commented on LUCENE-1470:
---------------------------------------

I have some first "benchmarks" on the three trie implementations. They were 
done using three indexes with same content, but encoded in 8bit, 4bit and 2bit, 
containing real-world data in 575,000 documents of the PANGAEA repository (see 
www.pangaea.de as mentioned before). The same range queries were executed after 
some warmup and time was measured until TopDocs returned the first 10 documents.

The indexes each contain 13 numeric, tree encoded fields (doubles and Dates). 
Index size (including the "normal" fields) was:
- 8bit: 4.8 GiB
- 4bit: 5.1 GiB
- 2bit: 5.7 GiB

So the addition 13*8*575,000 extra trie terms (with longer) size took 300 MB 
going from 8 to 4bit. Going from 4 to 2bit, the additional 13*16*575,000 extra 
terms took 600 MB (additional to the step from 8 to 4 bit, but its linear to 
the number of additional terms).

The performance impact was neglectible (or better the standard deviation was 
bigger than the performance plus), so I think still, 8bit is a good idea and 
index size is the smallest.

My idea, why this is so: Using fewer bits increases number of terms in index (I 
do not know how this decreases performance), needs more TermEnum seeks (for 
each trie precision, 2 seeking operations are needed). These term enum seeks 
are faster for 8bit trie varaint, because the 2 chars prefix can be used 
extensively (first prefix=precision, second char = first byte of numeric 
value). With 4bit and 2bit you get only 4bit or 2bits in addition to precision 
marker. So seeks in term enum are slower.

In comparison a ConstantScoreRangeQuery on the full precision trie-coded values 
took about 10-20 times longer. Here the numbers:
For PANGAEA, all queries are half open (we have the problem, that data sets 
have ittself bounding boxes - minlatitude, maxlatitude, minlongitude, 
maxlongitude) and the user searches for bounding boxes, hits are datasets that 
*intersect*  the search bounding box. So for a complete lat/lon constraint you 
have 4 half open ranges (with the current Trie implementation, this is not a 
problem, because two ANDed filters just withit half of the number of terms 
needed for a full range. The only performance impact is ANDing the two 
OpenBitSets). So 4 such half-open ranges ANDed together take the following 
times on the given index after some warmup, inkl fetching the first 10 
documents:

ConstantScoreRange: 1500-4000 ms
TrieRangeQuery 8bit: 40ms-100ms
TrieRangeQuery 4bit: 30ms-80ms
TrieRangeQuery 2bit: 20ms-80ms

The old RangeQuery was not tested, but that hits the MaxClauseCount constraint, 
and if this is raised to Integer.MAX_VALUE, it tooks approx 1 minute to 
finish). The numbers are pure rnage queries, if you additionally add some 
search terms, time goes up a little bit. Used was only TrieRangeQuery (so 
rewrite to constant score), no docs were filtered in the classical way: 
termqueries + filtering of results.

More benchmark results follow, when I finish the benchmarker. But these are 
real world examples. The search engine machine was a Sun X4600 machine with 16 
opteron cores, 64bit Java 1.5, Sun Java System Web Server, -Xmx 8192M (our 
current configuration). On lower cost machnies like desktop pcs, the 
ConstantScoreRangeQuery takes much more time, but TrieRangeQuery is approx the 
same time, as disk io seeks is more the limiting factor. Processor speed and 
number of processors is not the limiting factor (if only one query is run in 
parallel).

> Add TrieRangeQuery to contrib
> -----------------------------
>
>                 Key: LUCENE-1470
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1470
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Uwe Schindler
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, LUCENE-1470.patch
>
>
> According to the thread in java-dev 
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to 
> include my fast numerical range query implementation into lucene 
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in index in a special format.
> The idea behind this is to store the longs in different precision in index
> and partition the query range in such a way, that the outer boundaries are
> search using terms from the highest precision, but the center of the search
> Range with lower precision. The implementation stores the longs in 8
> different precisions (using a class called TrieUtils). It also has support
> for Doubles, using the IEEE 754 floating-point "double format" bit layout
> with some bit mappings to make them binary sortable. The approach is used in
> rather big indexes, query times are even on low performance desktop
> computers <<100 ms (!) for very big ranges on indexes with 500000 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known Trie structures (but it is not identical
> to real tries, but algorithms are related to it).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib

Reply via email to