Re: TrieRangeQuery for contrib?

Michael McCandless Tue, 25 Nov 2008 11:48:49 -0800


Uwe Schindler wrote:

Hi Mike, hi Paul,
Mike: You are right, the algorithm has one advantage and one disadvantage:
- There is not only a logarithmic bound, there is a hard upper bound. In my
algorithm with 8 precisions (so key lengths from 1 to 8 bytes) the
maximum number of terms to be visited is limited to 3825 (see the
Javadocs or the cited paper). The upper limit only applies if you make a
very big range and have a lot of different, homogeneously distributed
terms inside the range (without the optimization). Testing with a
500,000 document index (6 GB) with numeric values (doubles) has shown
that, even with large ranges, you mostly only get about 300 terms to
be visited. This is not related to index size!


Awesome!
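
A minimal sketch of the bound discussed above, assuming 8-bit steps over a
64-bit key and non-negative values only. The class and method names are
hypothetical and this is not the TrieRangeQuery source; it only shows how a
range can be covered by coarse prefixes in the middle and finer ones at the
two edges, which is one way to arrive at (2 * 7 + 1) * 255 = 3825.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the TrieRangeQuery implementation.
final class TrieRangeSketch {

    // Worst case: every level except the coarsest contributes a lower and
    // an upper edge of at most 255 terms; the coarsest adds one middle run.
    static int maxTerms(int levels) {
        return (2 * (levels - 1) + 1) * 255;
    }

    // Splits [min, max] (assumed 0 <= min <= max) into sub-ranges
    // {shift, lo, hi}; each one is matched at the precision given by shift.
    static List<long[]> split(long min, long max) {
        if (min < 0 || min > max) {
            throw new IllegalArgumentException("need 0 <= min <= max");
        }
        List<long[]> out = new ArrayList<long[]>();
        for (int shift = 0; ; shift += 8) {
            long mask = 0xFFL << shift;      // byte examined at this level
            long diff = 1L << (shift + 8);   // one block of the next level (unused at shift 56)
            boolean hasLower = (min & mask) != 0L;   // min starts inside a coarser block
            boolean hasUpper = (max & mask) != mask; // max ends inside a coarser block
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean last = shift + 8 >= 64 || nextMin > nextMax
                    || nextMin < min || nextMax > max; // last two catch overflow
            if (last) {
                add(out, shift, min, max);
                return out;
            }
            if (hasLower) add(out, shift, min, min | mask);
            if (hasUpper) add(out, shift, max & ~mask, max);
            min = nextMin;
            max = nextMax;
        }
    }

    private static void add(List<long[]> out, int shift, long lo, long hi) {
        // Fill the low bits of hi so the range is inclusive in value space.
        out.add(new long[] { shift, lo, hi | ((1L << shift) - 1) });
    }

    // Number of prefix terms a query would have to visit for these sub-ranges.
    static long totalTerms(List<long[]> ranges) {
        long n = 0;
        for (long[] r : ranges) {
            n += ((r[2] - r[1]) >>> (int) r[0]) + 1;
        }
        return n;
    }
}
```

For example, split(0x0105, 0x1234) covers 4400 distinct values with 3
sub-ranges and 320 prefix terms in this sketch.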

- The index size is bigger: You store for each numeric field not only one
term in the index, but eight, so index size increases. But this is
negligible in my opinion, and for large indexes the speed increase is great.

I agree this should be a very small cost, especially because it's only
the range-tokens, often one per doc per ranged-field, that have
increased storage. I would imagine in practice it's quite low.
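
To make the "eight terms instead of one" cost concrete, here is a rough
sketch that derives one term per precision (1 to 8 bytes) from a single
64-bit key. The "length:hex" term format and the names are assumptions for
illustration only; the real encoding lives in TrieUtils and may look quite
different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of per-precision terms; not the TrieUtils encoding.
final class TriePrefixSketch {

    // Returns eight terms for one value: its 1-byte prefix, 2-byte prefix,
    // and so on up to the full 8-byte key.
    static List<String> prefixTerms(long value) {
        List<String> terms = new ArrayList<String>(8);
        for (int bytes = 1; bytes <= 8; bytes++) {
            long prefix = value >>> (64 - 8 * bytes); // keep the top `bytes` bytes
            // The length marker keeps terms of different precisions apart.
            terms.add(bytes + ":" + String.format("%0" + (2 * bytes) + "x", prefix));
        }
        return terms;
    }
}
```

Only the full-precision term is unique per value; the shorter prefixes are
shared by many values, which fits the expectation that the extra storage
stays low.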

- Inside Luke, the values of such "Trie" fields are not human readable
(because of the encoding). Even when stored, the current implementation
uses the special encoding to store the field. For displaying the field
you have to use the decoder from the TrieUtils class. But this is the
same with current DateUtils from Lucene (but they are more readable :-) )

This seems OK, for starters. Eventually maybe such a "range field"
could provide an interface that knows how to "subdivide" intervals on
its space of all terms, assigning more human readable labels to these
subdivisions, instead of always casting to unsigned long.

Comparisons with the above 500,000 doc index showed that the old
RangeQuery (with raised BooleanQuery clause count) took about 30-40 secs
to complete, ConstantScoreRangeQuery took 5 secs and TrieRangeQuery took
<100ms to complete (on an Opteron64 machine, Java 1.5). You can test a
little bit on http://www.pangaea.de/advanced/advsearch.php by entering
something into the geographic bounding box or temporal coverage. As you
can see, the usage of this range query type is optimal for geographic
searches using doubles (not fixed decimals!), longs or dates as keys.

Wow it's very fast! I first searched for "water", which returned
~428K docs, then bounded it roughly around Africa and it returned ~78K
docs, very quickly. Now I'd really love to get this into Lucene!
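
For doubles to work as range keys, they first have to be mapped to integers
that sort in the same order. The sketch below uses the common IEEE 754 bit
trick for that; whether TrieUtils does it exactly this way is an assumption,
and the class and method names are made up for illustration.

```java
// Hypothetical illustration of an order-preserving double-to-long mapping.
final class SortableDoubleSketch {

    // Maps a double to a long whose signed ordering matches numeric ordering.
    static long toSortableLong(double d) {
        long bits = Double.doubleToLongBits(d);
        // Negative doubles sort backwards as raw bits, so flip their
        // 63 value bits and leave the sign bit alone.
        return bits < 0 ? bits ^ 0x7fffffffffffffffL : bits;
    }

    // Inverse of toSortableLong.
    static double fromSortableLong(long s) {
        return Double.longBitsToDouble(s < 0 ? s ^ 0x7fffffffffffffffL : s);
    }
}
```

A latitude/longitude bounding box then becomes two ordinary long ranges,
one per coordinate field.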

I have no problem with including it into Lucene contrib. I just have some
questions/comments:
- Code is Apache 2.0 licensed, so it is simple to include. I would change
the package prefix, update the JavaDocs and create a contrib patch out of
it. References to commons logging can be removed (they are just for
debugging). Code is Java 1.5 (using StringBuilder), but this could easily
be changed.


Contrib code can be Java 1.5.

- I want to be able to develop the code further once in contrib, is this
possible? What would be the best way to handle this? Let the code stay in
my SVN and you update it, or let me commit to the contrib folder in Lucene?
Currently the code is in the SVN of panFMP (www.panfmp.org), which uses it.
When donated to Apache, I would put a dependency into panFMP to the contrib
package, once released, and remove it from my tree. I do not want to get
the code into a dead end or start a fork of it inside contrib, because I
want to actively maintain it.

I think for starters open an issue, attach a patch, and then we
iterate from there? Probably having the code in Apache's SVN, with
the eventual goal of giving you commit rights to contrib, is what we
should aim for?

My reason for giving the code to Lucene was that other projects (from
geographic information systems) asked about using the optimized range
queries for this type of geo query; e.g. GeoNetwork Opensource, which
also uses Lucene, is interested. Maybe Solr wants to make use of it (using
another field data type). Instead of distributing the code to different
projects, I wanted to make it available as a plugin for everybody from
Lucene itself.


I agree, this should be in Lucene.

I would start an issue in JIRA and attach the patch.


Excellent!

Mike

