Michael - Great thoughts, and thanks for the feedback.
Following on the Range Query approach, how is performance? I found the range approach (albeit with the exact values) to be slower than the parsed-string approach I posited. On the custom scoring, is the distance element for sorting or as a component of relevance? We have a need for distance sorting, but I'm trying to slay that beast at a later stage. -- jeff On 2/28/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote: > > Jeff - > > This is an interesting approach. On our end, we have experimented with > two variants: > > Variant 1: Use Range Query > > Rather than precomputing the boolean clauses yourself, index encoded > latitude and longitude values and use a Range Query. We encode by > adding 1000 to each of the values. Note: We only deal with zip codes > in the US for which this encoding works fine, but is worth another > look if you have a broader range of coordinates. > > Example: Compute the box as you describe, then encode each value and > use a RangeQuery. Using the box you provided and lucene's query > parser: > > North latitude = 47.77704 > West longitude = -122.41909 > South latitude = 47.34827 > East longitude = -122.22031 > > gives us this query: > > latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969] > > Lucene then handles expanding this query into the appropriate set of > Boolean Queries. > > > Variant 2: Combine the above w/ a custom scorer > > For us, boxing works well to retrieve relevant documents, but our > users want those documents sorted by distance. We modified the custom > scoring class that Hatcher provides in Lucene in Action to compute the > distance between two points for only those documents which actually > match. Thus, we do our search using a Range Query, then for all > matching documents we compute an actual score that incorporates the > distance from the actual location from which the user is searching. > > On our data set, we can still end up with 1000s of matching documents > after boxing. Thus, we still see a bottleneck computing the score for > even this smaller set of documents which we are still working through. > > -Mike > > > -----Original Message----- > From: Jeff Rodenburg [mailto:[EMAIL PROTECTED] > Sent: Tue 2/28/06 3:10 PM > To: java-user@lucene.apache.org > Cc: > Subject: Hacking proximity search: looking for feedback > > I've been wrestling with a way to index and search data with a > geo-positional aspect. By a geo-positional search, I want to constrain > search results within a given location range. Furthermore, I want to > allow > the user to set/change the geo-positional boundaries as needed for their > search. This isn't a foreign concept to anyone who's needed to do the > same. > > There's been some work done in this area publicly (or at least what I > could > find via the list archives and google), but not a lot. The guys at > eventax.de have done some very impressive work. Their implementation is > within the constructs of nutch; there's more here at > http://wiki.apache.org/nutch/GeoPosition. Their approach is very > interesting and is predicated by having data indexed in a certain format. > I've considered building something based on the FunctionQuery class as > well. Within this class, I could actually do the mathematical > calculations > (Haversine formula, anyone?) for geo-positional filtering. Range queries > on > this data set were an option as well. > > I've hit a performance wall with these approaches. The geo-positional > calculations are proving to be intensive. With the combination of my data > set, hardware, and OS, this just wasn't getting it done. > > So, I reversed my thinking. Instead of getting Lucene to do geo-math, > what > if I made geo-math do Lucene? Lucene is exceptional at string lookups; > how > could I do geo-positional search in that framework? I seized upon an > approach that I've validated in testing, but wanted to get more feedback > from the community. > > > ******************************************************************************************************** > > Data structure: > Latitude & longitude values in decimal format, i.e. latitude=47.480227, > longitude=-122.333670 (btw, a tire shop I recently visited). > > Geo definition: > Boxing around a center point. It's not critical to do a radius search > with > a given circle. A boxed approach allows for taller or wider frames of > reference, which are applicable for our use. > > > How I'm solving: > Treat the data as strings and formulate boolean query lookups based on > positional comparison. Here are the steps: > > Indexing: > 1. Break up lat/long values to individual characters, and store each field > in progression. Field storage type = Keyword. The tire shop example: > Lat1 = [4] > Lat2 = [7] > Lat3 = [.] > Lat4 = [4] > Lat5 = [8] > ... > Long1 = [-] > Long2 = [1] > Long3 = [2] > ... > > Searching: > 1. Start with box coordinates - North/West corner, South/East corner. For > example, a sample box around Seattle, WA: > North latitude = 47.77704 > West longitude = -122.41909 > South latitude = 47.34827 > East longitude = -122.22031 > 2. Break up lat/long values in a manner similar to indexing. > 3. Begin to build boolean query. Query will contain two required clauses: > the North/South query, and the West/East query. Both queries are built > using the same progressional evaluation of characters by position. > 4. Compare North/South (top/bottom) values. Here's a pseudo-graphical > chart: > > North = [4][7][.][7][7][7][0][4] > South = [4][7][.][3][4][8][2][7] > > Qualifying records will have latitude values between 47.77704 and 47.34827 > . > With lexicographical ordering in mind, I can safely include this phrase in > my North/South query: > > (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6) > > The logic is that any latitude with the first three values matching, and > the > fourth position containing either 4 or 5 or 6 will yield a qualifying > match. > > What other queries are needed? Ones that match 7 or 3 in the fourth > position should be considered, but they need further qualification. The > further qualification is based on values from the fifth position. I can > safely included the following phrases in my North/South query: > > (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2 > Lat5:1 Lat5:0) > (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 > Lat5:9) > > The logic here is simply an extension of the first query, extended to next > position in the latitude range. In the Northern latitude, with the first > four positions matching those values exactly, anything less than 7 in the > fifth position will yield a matching latitude. Same goes for the South > (bottom) query. > > > ******************************************************************************************* > > This approach yields a higher number of boolean query clauses, the more > granular you get. In this scenario, I've estimated approximately 145 > clauses within the final constructed query. In validation testing, this > approach has proven to be: > > 1) Accurate. > 2) Performant (thus far). > > > At last, my question to everyone who cares to respond (and read this far): > feedback? > > Thanks, > -- jeff > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >