Re: Geospatial search in Lucene/Solr

Smiley, David W. Tue, 28 Dec 2010 07:48:23 -0800

(I was emailing Grant RE geospatial search and it occurred to me I should be 
doing this on the Lucene dev list so I've taken it there)


On Dec 28, 2010, at 8:31 AM, Grant Ingersoll wrote:
On Dec 27, 2010, at 11:52 PM, Smiley, David W. wrote:

Hi Grant.

I saw your latest blog post at Lucid Imagination in which you mentioned you 
were going to work on adding polygon search.  FWIW, I finished polygon search 
last week to my code base based 
onhttps://issues.apache.org/jira/browse/SOLR-2155 (geohash prefix 
implementation).

I've scanned this, but haven't had time to look in depth.

I used the JTS library to do the heavy lifting.  I’d be happy to release the 
code.  I’ve iterated more on SOLR-2155 on my local project code and plan to 
re-integrate it with the patch at some point.  There was almost 2 months of 
time in-between me releasing SOLR-2155 and me having other priorities but I’m 
back at it.

Cool.  The problem w/ JTS is it is LGPL.

At least it's not GPL.  Can you simply use JTS and make it an optional library, 
like Solr does for some other libs?  There's a lot of expertise in that library 
that's been refined over the last 10 years.  Even if you refuse to touch 
anything non-Apache licensed, I highly recommend you look through its code to 
see how geospatial point-in-polygon is efficiently done.  It has a concept of a 
"prepared geometry" object that is optimized to make large numbers of 
point-in-polygon queries more efficient.  It is implemented by putting the line 
segments of the polygon in an in-memory R-tree index.  If you'd like me to 
point you at specific classes, I'd be happy to.  Better yet, I could release an 
update to SOLR-2155 and you could debug step through.  FWIW  I used a separate 
class for the polygon search that implements my GeoShape interface.  If a user 
doesn't need to do a polygon search (which is not a common requirement of 
geospatial search), then the JTS library need not ship with Lucene/Solr.

Presently, I’m working on Lucene’s benchmark contrib module to evaluate the 
performance of SOLR-2155 compared to the LatLon type (i.e. a pair of lat-lon 
range queries), and then I’ll work on a more efficient probably non-geohash 
implementation but based on the same underlying concept of a hierarchical grid. 
 I’m using the geonames.org<http://geonames.org/> data set.  Unfortunately, the 
benchmark code seems very oriented to a generic title-body document whereas I’m 
looking to create lat-lon pairs… and furthermore to create documents containing 
multiple lat-lon pairs, and even furthermore a query generator that generates 
random box queries centered on a random location from the data set.  I seem to 
be stretching the benchmark framework beyond the use-case it was designed for 
and so perhaps it won’t be committable but at least I’ll have a patch for other 
geospatial birds-of-a-feather like you to use.

Stretch away.  The Title/Body orientation is just a relic of what we have done 
in the past, it doesn't have to stay that way.

I am interested in your thoughts on evaluating the performance of geospatial 
queries.  Since reading LIA2, the Lucene's benchmark contrib module seems like 
the ideal way to test Lucene. I thought about programmatically generating 
points but I've warmed to the idea of using geonames.org<http://geonames.org> 
data as the space of possible points.  Geonames has 7.5M unique lat-lon 
pairs[1], including population.  In SOLR-2155, there are multiple points per 
document as a key feature.  For testing, I want to be able to configure 1 point 
per document for comparison to algorithms that only support that but I must 
support a random variable number of them too.  Consequently, each place does 
NOT correspond 1-1 to a document.  The number of documents indexed should be 
configurable which will be randomly generated based on randomly picking one or 
more points from the input data set.  The number of points available from the 
input data set should be configurable too (i.e. < 7.5M).   Assuming that a more 
populace place is more likely to be referenced than a less populace one, I want 
to skew choosing a place weighted by the place's population.  This creates a 
much more realistic skew of documents mapped to points than an evenly 
distributed one, which is important.  On the query side, I'd like to generate 
random queries centered on a random one of these points, with a radius that is 
either 1km, 10km, 100km, 1000km, or 10000km, in order to match a wide variety 
of documents.  For reporting, I'd like to see a chart of response time vs 
number of documents matched.

I'm perhaps half-done with implementing all this. Because I need to randomly 
choose points in the data set, I can't stream it in, I need to read it all into 
memory as a singleton object used by both the indexing side, and the query side 
(since the queries need to pick a random point).

[1] http://www.geonames.org/about.html

~ David Smiley

Re: Geospatial search in Lucene/Solr

Reply via email to