RE: TrieRange

Uwe Schindler Fri, 06 Feb 2009 15:19:26 -0800

Hi Yonik,

> I've taken a quick peek at TrieUtils/TrieRangeQuery - nice work Uwe!


Thank you :-)

Some words before: I made the original "TrieRangeQuery" about middle of
2005, when Lucene 1.4.3 was on the market, closed source for PANGAEA
(www.pangaea.de). At this time there was no ConstantScoreQuery around. The
special Rangequery was was just a work-around, that splitted a Range into
smaller RangeQueries connected by a BooleanQuery in rewrite() with different
precision. I was now able, to set the static clause limitation in
BooleanQuery to about 4000 and all numerical ranges worked. At this time, I
used fixed-point decimals, that I indexed using these trie methods.

During the work for the project panFMP (PANGAEA Framework for Metadata
Portals, www.panfmp.org), I released it open source. And the I read a lot of
discussions about range queries, geographical searches and so on, colleagues
also using Lucene asked me, to give it directly to Lucene, so it can be used
outside of panFMP. Older versions of this query can be still be retrieved
from the panFMP SVN. So there was a lot of optimization work involved,
because we do very often such type of queries at PANGAEA.

> One general comment is that it seems like TrieUtils tries to do a
> little too much for users (and in the process covers up some
> functionallity).

The question came from Otis after new year, too. The encoding of the values
into two different field names does the trick for the whole range query.
Removing the code that generates the field in exactly that way would remove
the idea behind TrieRangeFilter. The only thing that could be changed would
be to make the suffix on the helper field variable. There is lot of
optimization behind this (see TrieRangeFilter comments in the central
splitRange method). This depends on the order and so the names of the field
and its helper. Very early versions of the trie algorithm was using a extra
field name for each slice, as you call it, I removed it completely later (no
helper field at all), but then trie fields could not be sorted any more. The
extra helper field is the way in the middle.

Maybe there should be more documentation about that by a more native
speaker.

> For example, whether to encode values in two fields and exactly what
> those fields are named, seems like it should be under developer
> control.  Likewise, developers should be in control of creating and
> adding fields to documents and setting other properties like
> omitTerms, omitNorms, etc.

In my opinion there is no sense in having norms or such things on trie
fields. They should only be queried using TrieRangeFilter/Query, for that
norms and TF are not needed (as they are numerical values, for what do you
need the norms?).

> Encoding a slice per character makes the code simpler, but increases
> the size of the index... but perhaps not enough to worry about in
> practice?

This is correct. For 2bit and 4bit there is a lot of overhead by this, but
there is no way round (any ideas how to fix this?). But 8bit is the most
compact one. There needs to be more testing and benchmarking.

> What's the synchronization for... a mistaken remnant from a previous
> draft?

You mean the synchronization around StringBuffer? The original code was Java
1.5 and used StringBuilder. As StringBuffer synchronizes on each method
call, it is more performant to synchronize the whole block that uses the
StringBuffer (or use Java 1.5). By this, the mutex is locked and the method
calls do not need to be synchronized on each call (because it is already
synchronized). See e.g.
http://allu.wordpress.com/2006/08/20/string-buffer-is-thread-safe/, but
there is more of this around. When we switch to Java 1.5, we can remove all
usage of StringBuffer and replace by StringBuilder. There may be other
places in Lucene that could be optimized like this (a lot of toString() and
+ operator, but toString() is mostly used for debugging so no use).

To conclude:
If some time in future, the idea behind TrieRangeFilter would make its way
into the core, the whole term "Term" must be changed. The paper mentioned in
the original JIRA issue (LUCENE-1470) handles this
[http://webcourse.cs.technion.ac.il/236620/Winter2006-2007/hw/WCFiles/params
earch.pdf] - I did not know it. But Lucene is Lucene, if you want to change
it and make this issue more compact, many parts of Lucene's core must be
rewritten.

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

RE: TrieRange

Reply via email to