Re: Hacking proximity search: looking for feedback

Jeff Rodenburg Tue, 28 Feb 2006 13:32:56 -0800

Michael -

Great thoughts, and thanks for the feedback.


Following on the Range Query approach, how is performance?  I found the
range approach (albeit with the exact values) to be slower than the
parsed-string approach I posited.

On the custom scoring, is the distance element for sorting or as a component
of relevance?  We have a need for distance sorting, but I'm trying to slay
that beast at a later stage.

-- jeff

On 2/28/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
>
> Jeff -
>
> This is an interesting approach. On our end, we have experimented with
> two variants:
>
> Variant 1: Use Range Query
>
> Rather than precomputing the boolean clauses yourself, index encoded
> latitude and longitude values and use a Range Query. We encode by
> adding 1000 to each of the values. Note: We only deal with zip codes
> in the US for which this encoding works fine, but is worth another
> look if you have a broader range of coordinates.
>
> Example: Compute the box as you describe, then encode each value and
> use a RangeQuery. Using the box you provided and lucene's query
> parser:
>
>        North latitude = 47.77704
>        West longitude = -122.41909
>        South latitude = 47.34827
>        East longitude = -122.22031
>
> gives us this query:
>
> latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]
>
> Lucene then handles expanding this query into the appropriate set of
> Boolean Queries.
>
>
> Variant 2: Combine the above w/ a custom scorer
>
> For us, boxing works well to retrieve relevant documents, but our
> users want those documents sorted by distance. We modified the custom
> scoring class that Hatcher provides in Lucene in Action to compute the
> distance between two points for only those documents which actually
> match. Thus, we do our search using a Range Query, then for all
> matching documents we compute an actual score that incorporates the
> distance from the actual location from which the user is searching.
>
> On our data set, we can still end up with 1000s of matching documents
> after boxing. Thus, we still see a bottleneck computing the score for
> even this smaller set of documents which we are still working through.
>
> -Mike
>
>
> -----Original Message-----
> From:   Jeff Rodenburg [mailto:[EMAIL PROTECTED]
> Sent:   Tue 2/28/06 3:10 PM
> To:     java-user@lucene.apache.org
> Cc:
> Subject:        Hacking proximity search: looking for feedback
>
> I've been wrestling with a way to index and search data with a
> geo-positional aspect.  By a geo-positional search, I want to constrain
> search results within a given location range.  Furthermore, I want to
> allow
> the user to set/change the geo-positional boundaries as needed for their
> search.  This isn't a foreign concept to anyone who's needed to do the
> same.
>
> There's been some work done in this area publicly (or at least what I
> could
> find via the list archives and google), but not a lot.  The guys at
> eventax.de have done some very impressive work.  Their implementation is
> within the constructs of nutch; there's more here at
> http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
> interesting and is predicated by having data indexed in a certain format.
> I've considered building something based on the FunctionQuery class as
> well.  Within this class, I could actually do the mathematical
> calculations
> (Haversine formula, anyone?) for geo-positional filtering.  Range queries
> on
> this data set were an option as well.
>
> I've hit a performance wall with these approaches.  The geo-positional
> calculations are proving to be intensive.  With the combination of my data
> set, hardware, and OS, this just wasn't getting it done.
>
> So, I reversed my thinking.  Instead of getting Lucene to do geo-math,
> what
> if I made geo-math do Lucene?  Lucene is exceptional at string lookups;
> how
> could I do geo-positional search in that framework?  I seized upon an
> approach that I've validated in testing, but wanted to get more feedback
> from the community.
>
>
> ********************************************************************************************************
>
> Data structure:
> Latitude & longitude values in decimal format, i.e. latitude=47.480227,
> longitude=-122.333670 (btw, a tire shop I recently visited).
>
> Geo definition:
> Boxing around a center point.  It's not critical to do a radius search
> with
> a given circle.  A boxed approach allows for taller or wider frames of
> reference, which are applicable for our use.
>
>
> How I'm solving:
> Treat the data as strings and formulate boolean query lookups based on
> positional comparison.  Here are the steps:
>
> Indexing:
> 1. Break up lat/long values to individual characters, and store each field
> in progression.  Field storage type = Keyword.  The tire shop example:
> Lat1 = [4]
> Lat2 = [7]
> Lat3 = [.]
> Lat4 = [4]
> Lat5 = [8]
> ...
> Long1 = [-]
> Long2 = [1]
> Long3 = [2]
> ...
>
> Searching:
> 1. Start with box coordinates - North/West corner, South/East corner.  For
> example, a sample box around Seattle, WA:
> North latitude = 47.77704
> West longitude = -122.41909
> South latitude = 47.34827
> East longitude = -122.22031
> 2. Break up lat/long values in a manner similar to indexing.
> 3. Begin to build boolean query.  Query will contain two required clauses:
> the North/South query, and the West/East query.  Both queries are built
> using the same progressional evaluation of characters by position.
> 4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
> chart:
>
> North = [4][7][.][7][7][7][0][4]
> South = [4][7][.][3][4][8][2][7]
>
> Qualifying records will have latitude values between 47.77704 and 47.34827
> .
> With lexicographical ordering in mind, I can safely include this phrase in
> my North/South query:
>
>    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)
>
> The logic is that any latitude with the first three values matching, and
> the
> fourth position containing either 4 or 5 or 6 will yield a qualifying
> match.
>
> What other queries are needed?  Ones that match 7 or 3 in the fourth
> position should be considered, but they need further qualification.  The
> further qualification is based on values from the fifth position.  I can
> safely included the following phrases in my North/South query:
>
>     (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
> Lat5:1 Lat5:0)
>     (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
> Lat5:9)
>
> The logic here is simply an extension of the first query, extended to next
> position in the latitude range.  In the Northern latitude, with the first
> four positions matching those values exactly, anything less than 7 in the
> fifth position will yield a matching latitude.  Same goes for the South
> (bottom) query.
>
>
> *******************************************************************************************
>
> This approach yields a higher number of boolean query clauses, the more
> granular you get.  In this scenario, I've estimated approximately 145
> clauses within the final constructed query.  In validation testing, this
> approach has proven to be:
>
> 1) Accurate.
> 2) Performant (thus far).
>
>
> At last, my question to everyone who cares to respond (and read this far):
> feedback?
>
> Thanks,
> -- jeff
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: Hacking proximity search: looking for feedback

Reply via email to