Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
Hey solr-user, are you by chance indexing LineStrings? That is something I never tried with this spatial index.

Depending on which iteration of LSP you are using, I figure you'd end up in one of two situations. Either you index a vast number of points along the line, which would be slow to index and make the index quite big, or you end up with a geohash granularity that gives a very blocky (i.e. pixelated) approximation of the line. That approximation is much coarser than the line itself, so searches that are merely "near" the line will match it.

I don't have this use-case in my work so I haven't put that much thought into handling lines -- I just do points & polygons & circles & rects.

~ David
-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4001486.html
Sent from the Solr - User mailing list archive at Nabble.com.
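To make the blockiness point concrete, here is a minimal sketch using the standard public geohash encoding. It is not LSP's indexing code, and the line endpoints, sample count, and precisions are made-up values for illustration. It samples points along a line, collects the geohash cells they fall into, and checks whether a point that is clearly off the line still lands in one of those cells.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision):
    """Standard geohash: alternate longitude/latitude bisection, 5 bits per character."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    even = True  # even bits refine longitude, odd bits refine latitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def cells_along_line(start, end, samples, precision):
    """Geohash cells touched by evenly spaced sample points on a straight segment."""
    (lat0, lon0), (lat1, lon1) = start, end
    cells = set()
    for i in range(samples + 1):
        t = i / samples
        lat = lat0 + t * (lat1 - lat0)
        lon = lon0 + t * (lon1 - lon0)
        cells.add(geohash_encode(lat, lon, precision))
    return cells


# Hypothetical line and a point roughly 0.025 degrees away from it.
line_start, line_end = (47.90, -97.10), (47.95, -97.00)
off_line_point = (47.93, -97.09)

coarse = cells_along_line(line_start, line_end, samples=200, precision=4)
fine = cells_along_line(line_start, line_end, samples=200, precision=7)

print("coarse cells:", len(coarse), "fine cells:", len(fine))
print("off-line point matches coarse cells:", geohash_encode(*off_line_point, 4) in coarse)
print("off-line point matches fine cells:", geohash_encode(*off_line_point, 7) in fine)
```

With the coarse precision, the line collapses into a handful of large cells and the off-line point shares one of them; with the fine precision it no longer matches, but many more cells are needed to cover the same line.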
Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
On Aug 9, 2012, at 4:16 PM, solr-user [via Lucene] wrote:

> I didn't know how the cache got triggered and the "needScore=false" now allows some of my problem queries to finally work, and well within 2gb of mem.

needScore is an unfortunate hack in the Solr adapter to the Lucene spatial module. It works around the fact that Solr only knows how to get queries from a field type, not filters. Unlike filters, queries have scores. For spatial, scores are expensive (lots of RAM) and you may not even want them! Consider voting for this issue: https://issues.apache.org/jira/browse/SOLR-2883

~ David
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4000294.html
Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
Thanks David. You are a life saver.

I didn't know how the cache got triggered and the "needScore=false" now allows some of my problem queries to finally work, and well within 2gb of mem.

Will look at your other suggestion when I can. MANY thanks again.
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4000286.html
Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
solr-user wrote
> Thanks David. No worries about the delay; am always happy and appreciative when someone responds.
>
> I don't understand what you mean by "All center points get cached into memory upon first use in a score" in question 2 about the Java OOM errors I am seeing.

The underlying field type receives one internal Shape instance per WKT string that is handed to it, no matter whether that WKT is a MultiGeometry or not. The center point of that shape is indexed in such a way that it can be read into a cache later. It doesn't matter how many vertices/coordinates your geometries have, or how many shapes exist in a single WKT string; one WKT string value results in one point. Just wanted to be clear on that. STNumPoints is the wrong statistic since it counts internal coordinates, from my reading of its documentation just now. STNumGeometries isn't right either if your WKT uses any of the Multi* geometry types.

solr-user wrote
> The Solr instance I have setup for testing has around 200k docs, with one WKT field per doc (indexed and stored and set to multivalue).
>
> I did a count of the number of points that get indexed in Solr (computed in MS SQL by counting the number of points (using STNumPoints) for each geometry (using STNumGeometries) in the WKT data I am indexing), and I have around 35M points total.
>
> If only the center points for 190K docs get cached, wouldn't that easily fit in 7GB of heap?
>
> Even if Solr was caching 35M points, that still doesn't sound like 7GB worth of data.

Yeah... the memory cache may be pig-ish but not that bad. There's something about the implementation that tells me there could be a bug if any of your polygon shapes are small and/or you index at a high resolution. Given that you have multi-valued spatial data per document, you can't simply use solr.LatLonType.

Try this -- create a new field called centerPoints or something like that, and use the same field type as the geohash one you are already using. But for this one, hand Solr the center points of your shape data. Hopefully it's straightforward for you to calculate these. Then when you sort by distance or need to retrieve the distance via a dist:query(...) etc., be sure to use this field and NOT the main shape one that has the full shape indexed. To be sure the spatial module doesn't load the center points for the main shape field, pass needScore=false as a Solr local-param in your filter query for it. Hopefully that fixes it. If it does, there is a bug and I know what it is.

~ David
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4000276.html
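A rough sketch of what the request parameters for this two-field setup might look like, modeled on the query earlier in the thread. The field names wkt_search and school_name come from that query and centerPoints from the suggestion above; the distq parameter name, the sort_d radius, and the exact position of needScore=false inside the local-params braces are assumptions for illustration and have not been run against a Solr instance. In particular, the radius used for the distance query would presumably need to be wide enough that the center points of the documents of interest fall inside it.

```python
from urllib.parse import urlencode


def build_select_params(shape_field, center_field, lon, lat, filter_d, sort_d):
    """Build /select parameters: filter on full shapes, take distance from center points."""
    return {
        "q": "*:*",
        # Filter against the full shapes. needScore=false is passed as a local-param
        # so that no score (and no center-point cache) is needed for this field.
        "fq": "{!v=$geoq cache=false needScore=false}",
        "geoq": f'{shape_field}:"Intersects(Circle({lon} {lat} d={filter_d}))"',
        # Hypothetical second query: distance for sorting and display is taken
        # from the center-point field only.
        "distq": f'{center_field}:"Intersects(Circle({lon} {lat} d={sort_d}))"',
        "sort": "query($distq) asc",
        "fl": "school_name,dist:query($distq,-1)",
    }


params = build_select_params("wkt_search", "centerPoints", -97.057, 47.924, 0.01, 1.0)
print("select?" + urlencode(params))
```

Keeping the filter and the distance query in separate parameters means the expensive shape field is only ever used for matching, which is the point of the suggestion.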
Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
Thanks David. No worries about the delay; am always happy and appreciative when someone responds.

I don't understand what you mean by "All center points get cached into memory upon first use in a score" in question 2 about the Java OOM errors I am seeing.

The Solr instance I have set up for testing has around 200k docs, with one WKT field per doc (indexed and stored and set to multivalue).

I did a count of the number of points that get indexed in Solr (computed in MS SQL by counting the number of points (using STNumPoints) for each geometry (using STNumGeometries) in the WKT data I am indexing), and I have around 35M points total.

If only the center points for 190K docs get cached, wouldn't that easily fit in 7GB of heap?

Even if Solr was caching 35M points, that still doesn't sound like 7GB worth of data.
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4000268.html
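The back-of-the-envelope reasoning here can be written out as a small calculation. This is only a rough estimate: the per-object overhead figures (a 16-byte object header plus an 8-byte reference per point) are assumed typical values for a 64-bit JVM, not measurements of the actual cache, and list overhead is ignored.

```python
def raw_point_bytes(num_points, bytes_per_coord=8, coords_per_point=2):
    """Bytes needed for the coordinates alone (8 = double, 4 = float)."""
    return num_points * coords_per_point * bytes_per_coord


def estimated_point_bytes(num_points, header_bytes=16, reference_bytes=8):
    """Coordinates plus an ASSUMED per-object header and one reference per point."""
    return raw_point_bytes(num_points) + num_points * (header_bytes + reference_bytes)


HEAP_BYTES = 7 * 1024**3  # the 7GB heap mentioned above

for label, n in [("one center point per doc", 200_000), ("every vertex", 35_000_000)]:
    raw_mb = raw_point_bytes(n) / 1e6
    est_mb = estimated_point_bytes(n) / 1e6
    print(f"{label}: raw {raw_mb:.1f} MB, with overhead {est_mb:.1f} MB, "
          f"fits in heap: {estimated_point_bytes(n) < HEAP_BYTES}")
```

Under these assumptions both cases come out far below 7GB, which is consistent with the suspicion that something other than the raw point count is driving the OOM errors.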
Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
Hi! Sorry for the belated response; my google alerts didn't kick in for some weird reason until you posted to the lucene dev list.

solr-user wrote
> hopefully someone is using the lucene spatial toolkit aka LSP aka spatial4j, and can answer this question
>
> we are using this spatial tool for doing searches. overall, it seems to work very well. however, finding documentation is difficult.

I'm using it ;-) The current in-progress documentation is here: http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

solr-user wrote
> I have a couple of questions:
>
> 1. I have a geohash field in my solr schema that contains indexed geographic polygon data. I want to find all docs where that polygon intersects a given lat/long. I was experimenting with returning distance in the resultset and with sorting by distance and found that the following query works. However, I don't know what distance means in the query. i.e. is it distance from point to the polygon centroid, to the closest outer edge of the polygon, it's a useless random value, etc. Does anyone know??
>
> http://solrserver:solrport/solr/core0/select?q=*:*&fq={!v=$geoq%20cache=false}&geoq=wkt_search:%22Intersects(Circle(-97.057%2047.924%20d=0.01))%22&sort=query($geoq)+asc&fl=catchment_wkt1_trimmed,school_name,latitude,longitude,dist:query($geoq,-1),loc_city,loc_state

It's from the center of the indexed shape to the center of the query shape. At some point soonish, the score of a geo query is going to be similar to the inverted distance, so that it's a better relevancy metric, which is what scores should be. I expect some alternative means to show up to actually get the distance in search results -- perhaps a special Solr function query.

solr-user wrote
> 2. some of the polygons, being geographic representations, are very big (i.e. state/province polygons). when solr starts processing a spatial query (like the one above), I can see ("INFO: Building Cache [xx]") it fills in some sort of memory cache (org.apache.lucene.spatial.strategy.util.ShapeFieldCache) of the indexed polygon data. We are encountering Java OOM issues when this occurs (even when we boosted the mem to 7GB). I know that some of the polygons can have more than 2300 points, but heavy trimming isn't really an option due to level of detail issues. Can we control this caching, or the indexing of the polygons, in any way to reduce the memory requirements??

All center points get cached into memory upon first use in a score. I'm unsatisfied with the current details of this: it is not real-time-search friendly, and it is a memory pig since it's an ArrayList of ArrayList of PointImpl objects. If you have a single shape value per field, then I suggest indexing the center point into a solr.LatLonType field for sorting, which uses the Lucene FieldCache and will use much less memory. Consider making it float based to halve your memory requirements further.

p.s. I suggest "watching" this JIRA issue: https://issues.apache.org/jira/browse/SOLR-3304

~ David Smiley
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p424.html
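To illustrate what "center to center" means in practice, here is a small sketch that is not the spatial module's own computation. It assumes the "center" is the center of the shape's bounding box and measures distance with the haversine formula in kilometers; both choices, and the rectangle used as a polygon, are assumptions made purely for illustration. The point is that a query location can sit well inside a large polygon and still be reported at a sizeable distance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def bbox_center(vertices):
    """Center of the bounding box of (lat, lon) vertices (ASSUMED notion of 'center')."""
    lats = [lat for lat, _ in vertices]
    lons = [lon for _, lon in vertices]
    return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)


# Hypothetical large rectangular polygon and a query point that lies inside it.
polygon = [(47.0, -98.0), (47.0, -96.0), (49.0, -96.0), (49.0, -98.0)]
query_point = (47.1, -97.9)

center = bbox_center(polygon)
center_distance = haversine_km(*center, *query_point)
nearest_vertex_distance = min(haversine_km(lat, lon, *query_point) for lat, lon in polygon)

print("shape center:", center)
print(f"center-to-query distance: {center_distance:.1f} km")
print(f"nearest-vertex distance:  {nearest_vertex_distance:.1f} km")
```

For big shapes such as state or province polygons, the two numbers can differ a lot, so a distance derived from shape centers should not be read as distance to the polygon's edge.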
question(s) re lucene spatial toolkit aka LSP aka spatial4j
Hopefully someone is using the lucene spatial toolkit aka LSP aka spatial4j, and can answer this question.

We are using this spatial tool for doing searches. Overall, it seems to work very well. However, finding documentation is difficult.

I have a couple of questions:

1. I have a geohash field in my solr schema that contains indexed geographic polygon data. I want to find all docs where that polygon intersects a given lat/long. I was experimenting with returning distance in the resultset and with sorting by distance and found that the following query works. However, I don't know what distance means in the query. i.e. is it distance from point to the polygon centroid, to the closest outer edge of the polygon, it's a useless random value, etc. Does anyone know??

http://solrserver:solrport/solr/core0/select?q=*:*&fq={!v=$geoq%20cache=false}&geoq=wkt_search:%22Intersects(Circle(-97.057%2047.924%20d=0.01))%22&sort=query($geoq)+asc&fl=catchment_wkt1_trimmed,school_name,latitude,longitude,dist:query($geoq,-1),loc_city,loc_state

2. Some of the polygons, being geographic representations, are very big (i.e. state/province polygons). When solr starts processing a spatial query (like the one above), I can see ("INFO: Building Cache [xx]") it fills in some sort of memory cache (org.apache.lucene.spatial.strategy.util.ShapeFieldCache) of the indexed polygon data. We are encountering Java OOM issues when this occurs (even when we boosted the mem to 7GB). I know that some of the polygons can have more than 2300 points, but heavy trimming isn't really an option due to level of detail issues. Can we control this caching, or the indexing of the polygons, in any way to reduce the memory requirements??
--
View this message in context: http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757.html