Hi Yonik, > I've taken a quick peek at TrieUtils/TrieRangeQuery - nice work Uwe!
Thank you :-) Some words before: I made the original "TrieRangeQuery" about middle of 2005, when Lucene 1.4.3 was on the market, closed source for PANGAEA (www.pangaea.de). At this time there was no ConstantScoreQuery around. The special Rangequery was was just a work-around, that splitted a Range into smaller RangeQueries connected by a BooleanQuery in rewrite() with different precision. I was now able, to set the static clause limitation in BooleanQuery to about 4000 and all numerical ranges worked. At this time, I used fixed-point decimals, that I indexed using these trie methods. During the work for the project panFMP (PANGAEA Framework for Metadata Portals, www.panfmp.org), I released it open source. And the I read a lot of discussions about range queries, geographical searches and so on, colleagues also using Lucene asked me, to give it directly to Lucene, so it can be used outside of panFMP. Older versions of this query can be still be retrieved from the panFMP SVN. So there was a lot of optimization work involved, because we do very often such type of queries at PANGAEA. > One general comment is that it seems like TrieUtils tries to do a > little too much for users (and in the process covers up some > functionallity). The question came from Otis after new year, too. The encoding of the values into two different field names does the trick for the whole range query. Removing the code that generates the field in exactly that way would remove the idea behind TrieRangeFilter. The only thing that could be changed would be to make the suffix on the helper field variable. There is lot of optimization behind this (see TrieRangeFilter comments in the central splitRange method). This depends on the order and so the names of the field and its helper. Very early versions of the trie algorithm was using a extra field name for each slice, as you call it, I removed it completely later (no helper field at all), but then trie fields could not be sorted any more. The extra helper field is the way in the middle. Maybe there should be more documentation about that by a more native speaker. > For example, whether to encode values in two fields and exactly what > those fields are named, seems like it should be under developer > control. Likewise, developers should be in control of creating and > adding fields to documents and setting other properties like > omitTerms, omitNorms, etc. In my opinion there is no sense in having norms or such things on trie fields. They should only be queried using TrieRangeFilter/Query, for that norms and TF are not needed (as they are numerical values, for what do you need the norms?). > Encoding a slice per character makes the code simpler, but increases > the size of the index... but perhaps not enough to worry about in > practice? This is correct. For 2bit and 4bit there is a lot of overhead by this, but there is no way round (any ideas how to fix this?). But 8bit is the most compact one. There needs to be more testing and benchmarking. > What's the synchronization for... a mistaken remnant from a previous > draft? You mean the synchronization around StringBuffer? The original code was Java 1.5 and used StringBuilder. As StringBuffer synchronizes on each method call, it is more performant to synchronize the whole block that uses the StringBuffer (or use Java 1.5). By this, the mutex is locked and the method calls do not need to be synchronized on each call (because it is already synchronized). See e.g. http://allu.wordpress.com/2006/08/20/string-buffer-is-thread-safe/, but there is more of this around. When we switch to Java 1.5, we can remove all usage of StringBuffer and replace by StringBuilder. There may be other places in Lucene that could be optimized like this (a lot of toString() and + operator, but toString() is mostly used for debugging so no use). To conclude: If some time in future, the idea behind TrieRangeFilter would make its way into the core, the whole term "Term" must be changed. The paper mentioned in the original JIRA issue (LUCENE-1470) handles this [http://webcourse.cs.technion.ac.il/236620/Winter2006-2007/hw/WCFiles/params earch.pdf] - I did not know it. But Lucene is Lucene, if you want to change it and make this issue more compact, many parts of Lucene's core must be rewritten. Uwe --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org