(I was emailing Grant RE geospatial search and it occurred to me I should be doing this on the Lucene dev list so I've taken it there)
On Dec 28, 2010, at 8:31 AM, Grant Ingersoll wrote: On Dec 27, 2010, at 11:52 PM, Smiley, David W. wrote: Hi Grant. I saw your latest blog post at Lucid Imagination in which you mentioned you were going to work on adding polygon search. FWIW, I finished polygon search last week to my code base based onhttps://issues.apache.org/jira/browse/SOLR-2155 (geohash prefix implementation). I've scanned this, but haven't had time to look in depth. I used the JTS library to do the heavy lifting. I’d be happy to release the code. I’ve iterated more on SOLR-2155 on my local project code and plan to re-integrate it with the patch at some point. There was almost 2 months of time in-between me releasing SOLR-2155 and me having other priorities but I’m back at it. Cool. The problem w/ JTS is it is LGPL. At least it's not GPL. Can you simply use JTS and make it an optional library, like Solr does for some other libs? There's a lot of expertise in that library that's been refined over the last 10 years. Even if you refuse to touch anything non-Apache licensed, I highly recommend you look through its code to see how geospatial point-in-polygon is efficiently done. It has a concept of a "prepared geometry" object that is optimized to make large numbers of point-in-polygon queries more efficient. It is implemented by putting the line segments of the polygon in an in-memory R-tree index. If you'd like me to point you at specific classes, I'd be happy to. Better yet, I could release an update to SOLR-2155 and you could debug step through. FWIW I used a separate class for the polygon search that implements my GeoShape interface. If a user doesn't need to do a polygon search (which is not a common requirement of geospatial search), then the JTS library need not ship with Lucene/Solr. Presently, I’m working on Lucene’s benchmark contrib module to evaluate the performance of SOLR-2155 compared to the LatLon type (i.e. a pair of lat-lon range queries), and then I’ll work on a more efficient probably non-geohash implementation but based on the same underlying concept of a hierarchical grid. I’m using the geonames.org<http://geonames.org/> data set. Unfortunately, the benchmark code seems very oriented to a generic title-body document whereas I’m looking to create lat-lon pairs… and furthermore to create documents containing multiple lat-lon pairs, and even furthermore a query generator that generates random box queries centered on a random location from the data set. I seem to be stretching the benchmark framework beyond the use-case it was designed for and so perhaps it won’t be committable but at least I’ll have a patch for other geospatial birds-of-a-feather like you to use. Stretch away. The Title/Body orientation is just a relic of what we have done in the past, it doesn't have to stay that way. I am interested in your thoughts on evaluating the performance of geospatial queries. Since reading LIA2, the Lucene's benchmark contrib module seems like the ideal way to test Lucene. I thought about programmatically generating points but I've warmed to the idea of using geonames.org<http://geonames.org> data as the space of possible points. Geonames has 7.5M unique lat-lon pairs[1], including population. In SOLR-2155, there are multiple points per document as a key feature. For testing, I want to be able to configure 1 point per document for comparison to algorithms that only support that but I must support a random variable number of them too. Consequently, each place does NOT correspond 1-1 to a document. The number of documents indexed should be configurable which will be randomly generated based on randomly picking one or more points from the input data set. The number of points available from the input data set should be configurable too (i.e. < 7.5M). Assuming that a more populace place is more likely to be referenced than a less populace one, I want to skew choosing a place weighted by the place's population. This creates a much more realistic skew of documents mapped to points than an evenly distributed one, which is important. On the query side, I'd like to generate random queries centered on a random one of these points, with a radius that is either 1km, 10km, 100km, 1000km, or 10000km, in order to match a wide variety of documents. For reporting, I'd like to see a chart of response time vs number of documents matched. I'm perhaps half-done with implementing all this. Because I need to randomly choose points in the data set, I can't stream it in, I need to read it all into memory as a singleton object used by both the indexing side, and the query side (since the queries need to pick a random point). [1] http://www.geonames.org/about.html ~ David Smiley