I compared LocalLucene with Solr's spatial search and saw a performance degradation, even with geohash queries, though perhaps I indexed things wrong? LocalLucene across 6 machines handles 150 queries per second fine, but with geofilt and geohash I got lots of timeouts even at only 50 queries per second. Has anybody done a formal comparison of LocalLucene against spatial search with LatLonType, PointType and geohash?

On 2/8/12 2:20 PM, "Ryan McKinley" <ryan...@gmail.com> wrote:

>Hi Matthias-
>
>I'm trying to understand how you have your data indexed so we can give
>reasonable direction.
>
>What field type are you using for your locations? Is it using the
>solr spatial field types? What do you see when you look at the debug
>information from &debugQuery=true?
>
>From my experience, there is no single best practice for spatial
>queries -- it will depend on your data density and distribution.
>
>You may also want to look at:
>http://code.google.com/p/lucene-spatial-playground/
>but note this is off lucene trunk -- the geohash queries are super fast
>though
>
>ryan
>
>
>2012/2/8 Matthias Käppler <matth...@qype.com>:
>> Hi Erick,
>>
>> if we're not doing geo searches, we filter by "location tags" that we
>> attach to places. This is simply a hierarchical regional id, which is
>> simple to filter for, but much less flexible. We use that on Web a
>> lot, but not on mobile, where we want to perform searches in
>> arbitrary radii around arbitrary positions. For those location tag
>> kind of queries, the average time spent in SOLR is 43msec (I'm looking
>> at the New Relic snapshot of the last 12 hours). I have disabled our
>> "optimization" again just yesterday, so for the bbox queries we're now
>> at an avg of 220ms (same time window). That's a 5-fold increase in
>> response time, and in peak hours it's worse than that.
>>
>> I've also found a blog post from 3 years ago which outlines the inner
>> workings of the SOLR spatial indexing and searching:
>> http://www.searchworkings.org/blog/-/blogs/23842
>> From that it seems as if SOLR already performs a similar optimization
>> we had in mind during the index step, so if I understand correctly, it
>> doesn't even search over all records, only those that were mapped to
>> the grid box identified during indexing.
>>
>> What I would love to see is what the suggested way is to perform a geo
>> query on SOLR, considering that they're so difficult to cache and
>> expensive to run. Is the best approach to restrict the candidate set
>> as much as possible using cheap filter queries, so that SOLR merely
>> has to do the geo search against these subsets? How does the query
>> planner work here? I see there's a cost attached to a filter query,
>> but one can only set it when cache is set to false? Are cached geo
>> queries executed last when there are cheaper filter queries to cut
>> down on documents? If you have a real world practical setup to share,
>> one that performs well in a production environment that serves
>> requests in the Millions per day, that would be great.
>>
>> I'd love to contribute documentation by the way, if you knew me you'd
>> know I'm an avid open source contributor and actually run several open
>> source projects myself. But tell me, how can I possibly contribute
>> answers to questions I don't have an answer to? That's why I'm here,
>> remember :) So please, these kinds of snippy replies are not helping
>> anyone.
>>
>> Thanks
>> -Matthias
>>
>> On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> So the obvious question is "what is your
>>> performance like without the distance filters?"
>>>
>>> Without that knowledge, we have no clue whether
>>> the modifications you've made had any hope of
>>> speeding up your response times....
>>>
>>> As for the docs, any improvements you'd like to
>>> contribute would be happily received
>>>
>>> Best
>>> Erick
>>>
>>> 2012/2/6 Matthias Käppler <matth...@qype.com>:
>>>> Hi,
>>>>
>>>> we need to perform fast geo lookups on an index of ~13M places, and
>>>> were running into performance problems here with SOLR. We haven't done
>>>> a lot of query optimization / SOLR tuning up until now so there's
>>>> probably a lot of things we're missing.
>>>> I was wondering if you could give me some feedback on the way we do
>>>> things, whether they make sense, and especially why a supposed
>>>> optimization we implemented recently seems to have no effect, when
>>>> we actually thought it would help a lot.
>>>>
>>>> What we do is this: our API is built on a Rails stack and talks to
>>>> SOLR via a Ruby wrapper. We have a few filters that almost always
>>>> apply, which we put in filter queries. Filter cache hit rate is
>>>> excellent, about 97%, and cache size caps at 10k filters (max size is
>>>> 32k, but it never seems to reach that many, probably because we
>>>> replicate / delta update every few minutes). Still, geo queries are
>>>> slow, about 250-500msec on average. We send them with cache=false, so
>>>> as to not flood the fq cache and cause undesirable evictions.
>>>>
>>>> Now our idea was this: while the actual geo queries are poorly
>>>> cacheable, we could clearly identify geographical regions which are
>>>> more often queried than others (naturally, since we're a user driven
>>>> service). Therefore, we dynamically partition Earth into a static grid
>>>> of overlapping boxes, where the grid size (the distance of the nodes)
>>>> depends on the maximum allowed search radius. That way, for every user
>>>> query, we would always be able to identify a single bounding box that
>>>> covers it. This larger bounding box (200km edge length) we would send
>>>> to SOLR as a cached filter query, along with the actual user query
>>>> which would still be sent uncached.
>>>> Ex:
>>>>
>>>> User asks for places in 10km around 49.14839,8.5691, then what we will
>>>> send to SOLR is something like this:
>>>>
>>>> fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
>>>> fq={!bbox cache=true d=100.0 sfield=location_ll pt=49.4684836290799,8.31165802979391}  <-- this one we derive automatically
>>>>
>>>> That way SOLR would intersect the two filters and return the same
>>>> results as when only looking at the smaller bounding box, but keep the
>>>> larger box in cache and speed up subsequent geo queries in the same
>>>> regions. Or so we thought; unfortunately this approach did not help
>>>> query execution times get better, at all.
>>>>
>>>> Question is: why does it not help? Shouldn't it be faster to search on
>>>> a cached bbox with only a few hundred thousand places? Is it a good
>>>> idea to make these kinds of optimizations in the app layer (we do this
>>>> as part of resolving the SOLR query in Ruby), and does it make sense
>>>> at all? We're not sure what kind of optimizations SOLR already does in
>>>> its query planner. The documentation is (sorry) miserable, and
>>>> debugQuery yields no insight into which optimizations are performed.
>>>> So this has been a hit and miss game for us, which is very ineffective
>>>> considering that it takes considerable time to build these kinds of
>>>> optimizations in the app layer.
>>>>
>>>> Would be glad to hear your opinions / experience around this.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Matthias Käppler
>>>> Lead Developer API & Mobile
>>>>
>>>> Qype GmbH
>>>> Großer Burstah 50-52
>>>> 20457 Hamburg
>>>> Telephone: +49 (0)40 - 219 019 2 - 160
>>>> Skype: m_kaeppler
>>>> Email: matth...@qype.com
>>>>
>>>> Managing Director: Ian Brotherston
>>>> Amtsgericht Hamburg
>>>> HRB 95913
>>>>
>>>> This e-mail and its attachments may contain confidential and/or
>>>> privileged information.
>>>> If you are not the intended recipient (or have
>>>> received this e-mail in error) please notify the sender immediately
>>>> and destroy this e-mail and its attachments. Any unauthorized copying,
>>>> disclosure or distribution of this e-mail and its attachments is
>>>> strictly forbidden. This notice also applies to future messages.
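
For anyone trying to reproduce the setup from the original message, here is a minimal Python sketch of the grid idea as I read it: snap the user's point to the nearest node of a fixed grid, then send the large box around that node as a cached fq next to the small uncached one. The function names, the 50 km maximum radius and the km-per-degree constant are my own assumptions, not Qype's actual code (theirs is in Ruby). It uses a flat-Earth approximation, so coverage is only approximate and it ignores the poles and the date line.

```python
import math

# Assumption: spherical Earth, roughly 111.195 km per degree of latitude.
KM_PER_DEG_LAT = 111.195


def snap_to_grid(lat, lon, spacing_km):
    """Return the (lat, lon) of the grid node nearest to the given point."""
    lat_step = spacing_km / KM_PER_DEG_LAT
    node_lat = round(lat / lat_step) * lat_step
    # A degree of longitude shrinks with latitude, so the longitude step
    # is computed for the latitude of the node's grid row.
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(node_lat))
    lon_step = spacing_km / km_per_deg_lon
    node_lon = round(lon / lon_step) * lon_step
    return node_lat, node_lon


def geo_filters(lat, lon, radius_km, max_radius_km=50.0,
                box_half_edge_km=100.0, sfield="location_ll"):
    """Build the two fq values: the uncached user box and the cached grid box."""
    if radius_km > max_radius_km:
        raise ValueError("radius exceeds the maximum the grid was sized for")
    # The nearest node is at most spacing/2 away per axis, so choosing
    # spacing/2 + max_radius <= half edge keeps the user box inside the
    # cached box (in the flat approximation).
    spacing_km = 2.0 * (box_half_edge_km - max_radius_km)
    node_lat, node_lon = snap_to_grid(lat, lon, spacing_km)
    user_fq = f"{{!bbox cache=false d={radius_km:g} sfield={sfield} pt={lat},{lon}}}"
    grid_fq = (f"{{!bbox cache=true d={box_half_edge_km:g} sfield={sfield} "
               f"pt={node_lat},{node_lon}}}")
    return [user_fq, grid_fq]
```

Calling `geo_filters(49.14839, 8.5691, 10)` gives the uncached 10 km filter from the example plus a cached 100 km filter centered on a grid node. Every query that snaps to the same node produces an identical cached fq string, which is what lets the filter cache reuse it.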