[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-2155:
-------------------------------

    Attachment: GeoHashPrefixFilter.patch

Here is another patch.  By the way, I'm using revision 1055285 of trunk.
* Removed @author tags.
* Introduced a constant, GRIDLEN_SCAN_THRESHOLD, for the threshold at which a 
term scan is done instead of divide & conquer.  The value used to be 2, meaning 
that if maxlen is 9, then once we reach grid level 7 the remaining leaves are 
scanned manually instead of making more boxes.  I should make this 
configurable, but it is not at this time.
* By setting GRIDLEN_SCAN_THRESHOLD to 4, I found the performance to be 
superior for the geonames data when the query shape was more complex than a 
bbox.  I haven't truly tuned this though.
* Added polygon search based on JTS that will handle any "WKT" (well-known 
text) query string!  The JTS library (LGPL licensed) is downloaded similarly to 
how the "bdb" contrib module downloads sleepycat.  The only limitation is that 
I don't do any special world boundary processing, which mainly matters at the 
dateline.  That's a TODO.
* Added SpatialGeohashFilterQParser.  I don't like SpatialFilterQParser.  This 
one handles point-radius, bounding box, polygon, and WKT geometry inputs.  The 
arguments and inputs were designed to be easily compatible with the geo 
extension to the OpenSearch spec.  If JTS is not on the classpath then this 
query parser should still work provided you don't do polygon or WKT (not 
verified, but it should work in theory).
* Added a test for doing a polygon search.  I also made the existing lat-lon 
test run for both the geohash and latlon types.
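
To make the threshold item above concrete, here is a minimal sketch of the 
scan-versus-subdivide decision.  It is not code from the patch: the class name, 
the method shouldScan, and its parameters are hypothetical, and the exact 
comparison is inferred only from the example given (threshold 2, maxlen 9, 
scanning from grid level 7).

```java
// Hypothetical sketch; not taken from GeoHashPrefixFilter.patch.
class ScanThresholdSketch {

    /**
     * Returns true when the remaining geohash levels below the current grid
     * level are few enough that the terms should be scanned directly rather
     * than split into smaller boxes.
     *
     * Assumption: scanning starts once (maxLen - gridLevel) <= threshold,
     * which matches the example "threshold 2, maxlen 9, scan at level 7".
     */
    static boolean shouldScan(int gridLevel, int maxLen, int threshold) {
        return maxLen - gridLevel <= threshold;
    }

    public static void main(String[] args) {
        int maxLen = 9;
        for (int threshold : new int[] {2, 4}) {
            for (int level = 1; level <= maxLen; level++) {
                if (shouldScan(level, maxLen, threshold)) {
                    System.out.println("threshold " + threshold
                            + ": scanning starts at grid level " + level);
                    break;
                }
            }
        }
    }
}
```

Under this assumption, a threshold of 4 with maxlen 9 means scanning would 
begin at grid level 5 rather than 7, trading fewer boxes for longer scans.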

Here is an updated benchmark.  I'm using geohashes of length 9, this time with 
the threshold mentioned above set to 4.  The query is a *circle* (not a bbox).  
This triggers the LatLonType field to use a completely different algorithm in 
which it _loads every value into memory_ via the field cache and does a brute 
force match.  This GeoHash prefix filter has never used the field cache!  It 
uses Lucene's index.  The "places/query" (which is an average) actually 
differed by one between the two implementations.  That could indicate a bug or 
some math rounding issue at the edge.  Another point is that these benchmarks 
almost certainly resulted in my OS disk cache putting the relevant index files 
into memory.

||km||places/query||ms/query (LatLon)||ms/query (geohash)||
|11|587|10.0|4.8|
|44|3,404|11.5|4.3|
|230|45,536|21.8|24.0|
|1800|1,319,692|288.5|142.3|

I'm pretty happy with it at this point and I'll sit on it for a while, 
gathering feedback.

> Geospatial search using geohash prefixes
> ----------------------------------------
>
>                 Key: SOLR-2155
>                 URL: https://issues.apache.org/jira/browse/SOLR-2155
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: David Smiley
>         Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
> GeoHashPrefixFilter.patch
>
>
> There currently isn't a solution in Solr for doing geospatial filtering on 
> documents that have a variable number of points.  This scenario occurs when 
> there is location extraction (e.g. via a "gazetteer") occurring on free text.  
> None, one, or many geospatial locations might be extracted from any given 
> document and users want to limit their search results to those occurring in a 
> user-specified area.
> I've implemented this by furthering the GeoHash based work in Lucene/Solr 
> with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
> earth.  Each successive character added further subdivides the box into a 4x8 
> (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
> step in this scheme is figuring out which geohash grid squares cover the 
> user's search query.  I've added various extra methods to GeoHashUtils (and 
> added tests) to assist with this.  The next step is an actual Lucene 
> Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
> TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
> matching geohash grid is found, the points therein are compared against the 
> user's query to see whether they match.  I created an abstraction, GeoShape, 
> extended by subclasses named PointDistance... and CartesianBox.... to support 
> different queried shapes so that the filter need not care about these details.
> This work was presented at LuceneRevolution in Boston on October 8th.
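
As an illustration of the prefix idea in the description above, here is a 
self-contained sketch.  It is not code from the patch: the class and method 
names are hypothetical, the encoder is the standard public geohash algorithm 
rather than GeoHashUtils, and a sorted TreeSet stands in for the Lucene term 
index (tailSet plays the role of a seek to the first term at or after the 
prefix).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch; a TreeSet stands in for the sorted Lucene term index.
class GeohashPrefixSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Standard geohash encoding: interleave lon/lat bits, 5 bits per character. */
    static String encode(double lat, double lon, int length) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder(length);
        boolean lonBit = true; // the first bit refines longitude
        int bits = 0, value = 0;
        while (hash.length() < length) {
            if (lonBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value <<= 1;              lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value <<= 1;              latMax = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // 32 cells per character: the 4x8 or 8x4 grid
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    /** Jump to the first term >= prefix, then collect terms until the prefix stops matching. */
    static List<String> scanPrefix(NavigableSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>();
        terms.add(encode(57.64911, 10.40744, 9));
        terms.add(encode(57.64900, 10.40700, 9));
        terms.add(encode(40.7128, -74.0060, 9));
        String cell = encode(57.64911, 10.40744, 5); // a coarser grid square
        System.out.println(cell + " -> " + scanPrefix(terms, cell));
    }
}
```

Because a shorter geohash is always a prefix of the longer geohashes of points 
inside its box, one seek per covering grid square is enough to find candidate 
points; each candidate would still need to be checked against the actual query 
shape.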

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

