Re: Improving performance for SOLR geo queries?

Nicolas Flacco Wed, 08 Feb 2012 15:22:40 -0800

I compared locallucene to spatial search and saw a performance
degradation, even using geohash queries, though perhaps I indexed things
wrong? Locallucene across 6 machines handles 150 queries per second fine,
but using geofilt and geohash I got lots of timeouts even when I was doing
only 50 queries per second. Has anybody done a formal comparison of
locallucene with spatial search and latlontype, pointtype and geohash?


On 2/8/12 2:20 PM, "Ryan McKinley" <ryan...@gmail.com> wrote:

>Hi Matthias-
>
>I'm trying to understand how you have your data indexed so we can give
>reasonable direction.
>
>What field type are you using for your locations?  Is it using the
>solr spatial field types?  What do you see when you look at the debug
>information from &debugQuery=true?
>
>From my experience, there is no single best practice for spatial
>queries -- it will depend on your data density and distribution.
>
>You may also want to look at:
>http://code.google.com/p/lucene-spatial-playground/
>but note this is off lucene trunk -- the geohash queries are super fast
>though
>
>ryan
>
>
>
>
>2012/2/8 Matthias Käppler <matth...@qype.com>:
>> Hi Erick,
>>
>> if we're not doing geo searches, we filter by "location tags" that we
>> attach to places. This is simply a hierarchical regional id, which is
>> simple to filter for, but much less flexible. We use that on Web a
>> lot, but not on mobile, where we want to perform searches in
>> arbitrary radii around arbitrary positions. For those location tag
>> kind of queries, the average time spent in SOLR is 43msec (I'm looking
>> at the New Relic snapshot of the last 12 hours). I have disabled our
>> "optimization" again just yesterday, so for the bbox queries we're now
>> at an avg of 220ms (same time window). That's a 5 fold increase in
>> response time, and in peak hours it's worse than that.
>>
>> I've also found a blog post from 3 years ago which outlines the inner
>> workings of the SOLR spatial indexing and searching:
>> http://www.searchworkings.org/blog/-/blogs/23842
>> From that it seems as if SOLR already performs, during the index step,
>> an optimization similar to the one we had in mind, so if I understand
>> correctly, it doesn't even search over all records, only those that
>> were mapped to the grid box identified during indexing.
>>
>> What I would love to see is what the suggested way is to perform a geo
>> query on SOLR, considering that they're so difficult to cache and
>> expensive to run. Is the best approach to restrict the candidate set
>> as much as possible using cheap filter queries, so that SOLR merely
>> has to do the geo search against these subsets? How does the query
>> planner work here? I see there's a cost attached to a filter query,
>> but one can only set it when cache is set to false? Are cached geo
>> queries executed last when there are cheaper filter queries to cut
>> down on documents? If you have a real world practical setup to share,
>> one that performs well in a production environment that serves
>> requests in the Millions per day, that would be great.
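
A minimal sketch of how the `cache` and `cost` local params mentioned above can be attached to a filter query. As far as the Solr documentation for these params goes, `cost` only orders filters that are sent with `cache=false` (lower cost runs first), and a cost of 100 or more requests post-filtering only for query types that support it; whether that applies to a given spatial field type is not confirmed here. The field name `location_tag`, its value, the cost of 50 and both helper functions are assumptions made for illustration.

```python
from urllib.parse import urlencode


def local_params(parser, **params):
    """Render Solr local params, e.g. {!bbox cache=false cost=50 ...}."""
    def render(value):
        # Solr expects lowercase booleans; Python would print True/False.
        return str(value).lower() if isinstance(value, bool) else str(value)

    parts = [parser] + [f"{key}={render(val)}" for key, val in params.items()]
    return "{!" + " ".join(parts) + "}"


def build_query_string(q, filter_queries):
    """Encode q plus any number of fq parameters for a select request."""
    pairs = [("q", q)] + [("fq", fq) for fq in filter_queries]
    return urlencode(pairs)


# Hypothetical cheap filter; sent without local params, so it is cached by default.
cheap_fq = "location_tag:de-hamburg"

# Uncached geo filter with an explicit (illustrative) cost.
geo_fq = local_params(
    "bbox", cache=False, cost=50, d=10,
    sfield="location_ll", pt="49.14839,8.5691",
)

query_string = build_query_string("*:*", [cheap_fq, geo_fq])
print(geo_fq)
print(query_string)
```

The sketch only shows the request shape; measuring with and without the extra filter is still needed to see whether the ordering helps on a given index.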
>>
>> I'd love to contribute documentation by the way, if you knew me you'd
>> know I'm an avid open source contributor and actually run several open
>> source projects myself. But tell me, how can I possibly contribute
>> answers to questions I don't have an answer to? That's why I'm here,
>> remember :) So please, these kinds of snippy replies are not helping
>> anyone.
>>
>> Thanks
>> -Matthias
>>
>> On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson
>><erickerick...@gmail.com> wrote:
>>> So the obvious question is "what is your
>>> performance like without the distance filters?"
>>>
>>> Without that knowledge, we have no clue whether
>>> the modifications you've made had any hope of
>>> speeding up your response times....
>>>
>>> As for the docs, any improvements you'd like to
>>> contribute would be happily received
>>>
>>> Best
>>> Erick
>>>
>>> 2012/2/6 Matthias Käppler <matth...@qype.com>:
>>>> Hi,
>>>>
>>>> we need to perform fast geo lookups on an index of ~13M places, and
>>>> we're running into performance problems here with SOLR. We haven't done
>>>> a lot of query optimization / SOLR tuning up until now so there's
>>>> probably a lot of things we're missing. I was wondering if you could
>>>> give me some feedback on the way we do things, whether they make
>>>> sense, and especially why a supposed optimization we implemented
>>>> recently seems to have no effect, when we actually thought it would
>>>> help a lot.
>>>>
>>>> What we do is this: our API is built on a Rails stack and talks to
>>>> SOLR via a Ruby wrapper. We have a few filters that almost always
>>>> apply, which we put in filter queries. Filter cache hit rate is
>>>> excellent, about 97%, and cache size caps at 10k filters (max size is
>>>> 32k, but it never seems to reach that many, probably because we
>>>> replicate / delta update every few minutes). Still, geo queries are
>>>> slow, about 250-500msec on average. We send them with cache=false, so
>>>> as to not flood the fq cache and cause undesirable evictions.
>>>>
>>>> Now our idea was this: while the actual geo queries are poorly
>>>> cacheable, we could clearly identify geographical regions which are
>>>> more often queried than others (naturally, since we're a user driven
>>>> service). Therefore, we dynamically partition Earth into a static grid
>>>> of overlapping boxes, where the grid size (the distance of the nodes)
>>>> depends on the maximum allowed search radius. That way, for every user
>>>> query, we would always be able to identify a single bounding box that
>>>> covers it. This larger bounding box (200km edge length) we would send
>>>> to SOLR as a cached filter query, along with the actual user query
>>>> which would still be sent uncached. Ex:
>>>>
>>>> User asks for places in 10km around 49.14839,8.5691, then what we will
>>>> send to SOLR is something like this:
>>>>
>>>> fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
>>>> fq={!bbox cache=true d=100.0 sfield=location_ll
>>>> pt=49.4684836290799,8.31165802979391} <-- this one we derive
>>>> automatically
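
The derivation of the cached box can be sketched roughly as follows. This is not the code from the message: the snapping rule, the 111.2 km-per-degree approximation, the 100 km node spacing and all function names are assumptions, and the sketch ignores the poles and the date line. It only illustrates the idea of mapping every user position to a fixed grid node so that the larger `{!bbox}` filter string is identical for nearby queries.

```python
import math

KM_PER_DEG_LAT = 111.2  # rough length of one degree of latitude in km


def snap_to_grid(lat, lon, spacing_km):
    """Return the grid node nearest to (lat, lon), nodes ~spacing_km apart."""
    lat_step = spacing_km / KM_PER_DEG_LAT
    node_lat = round(lat / lat_step) * lat_step
    # A degree of longitude shrinks with cos(latitude). Using the node row's
    # latitude keeps the longitude step identical for every point in that row.
    cos_lat = max(math.cos(math.radians(node_lat)), 0.01)
    lon_step = spacing_km / (KM_PER_DEG_LAT * cos_lat)
    node_lon = round(lon / lon_step) * lon_step
    return node_lat, node_lon


def bbox_fq(lat, lon, d_km, sfield, cache):
    """Format a {!bbox} filter query in the style used in the message."""
    flag = "true" if cache else "false"
    return f"{{!bbox cache={flag} d={d_km} sfield={sfield} pt={lat},{lon}}}"


def geo_filters(lat, lon, radius_km, sfield="location_ll",
                spacing_km=100.0, box_d_km=100.0):
    """Return [uncached user filter, cached grid-box filter]."""
    node_lat, node_lon = snap_to_grid(lat, lon, spacing_km)
    # Round the node so equal nodes always yield byte-identical cache keys.
    cached = bbox_fq(round(node_lat, 6), round(node_lon, 6),
                     box_d_km, sfield, cache=True)
    user = bbox_fq(lat, lon, radius_km, sfield, cache=False)
    return [user, cached]


for fq in geo_filters(49.14839, 8.5691, 10):
    print(fq)
```

With these assumed values the node lies at most about 50 km from the query point on each axis, so a box with d=100 around the node leaves room for search radii up to roughly 50 km.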
>>>>
>>>> That way SOLR would intersect the two filters and return the same
>>>> results as when only looking at the smaller bounding box, but keep the
>>>> larger box in cache and speed up subsequent geo queries in the same
>>>> regions. Or so we thought; unfortunately this approach did not
>>>> improve query execution times at all.
>>>>
>>>> Question is: why does it not help? Shouldn't it be faster to search on
>>>> a cached bbox with only a few hundred thousand places? Is it a good
>>>> idea to make these kinds of optimizations in the app layer (we do this
>>>> as part of resolving the SOLR query in Ruby), and does it make sense
>>>> at all? We're not sure what kind of optimizations SOLR already does in
>>>> its query planner. The documentation is (sorry) miserable, and
>>>> debugQuery yields no insight into which optimizations are performed.
>>>> So this has been a hit-and-miss game for us, which is very inefficient
>>>> considering that it takes considerable time to build these kinds of
>>>> optimizations in the app layer.
>>>>
>>>> Would be glad to hear your opinions / experience around this.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Matthias Käppler
>>>> Lead Developer API & Mobile
>>>>
>>>> Qype GmbH
>>>> Großer Burstah 50-52
>>>> 20457 Hamburg
>>>> Telephone: +49 (0)40 - 219 019 2 - 160
>>>> Skype: m_kaeppler
>>>> Email: matth...@qype.com
>>>>
>>>> Managing Director: Ian Brotherston
>>>> Amtsgericht Hamburg
>>>> HRB 95913
>>>>
>>>> This e-mail and its attachments may contain confidential and/or
>>>> privileged information. If you are not the intended recipient (or have
>>>> received this e-mail in error) please notify the sender immediately
>>>> and destroy this e-mail and its attachments. Any unauthorized copying,
>>>> disclosure or distribution of this e-mail and  its attachments is
>>>> strictly forbidden. This notice also applies to future messages.
>
