Re: Improving performance for SOLR geo queries?

Matthias Käppler Wed, 08 Feb 2012 02:12:25 -0800

Hi Erick,

if we're not doing geo searches, we filter by "location tags" that we
attach to places. A location tag is simply a hierarchical regional ID,
which is simple to filter on, but much less flexible. We use that on Web
a lot, but not on mobile, where we want to perform searches in
arbitrary radii around arbitrary positions. For those location-tag
queries, the average time spent in SOLR is 43 msec (I'm looking
at the New Relic snapshot of the last 12 hours). I disabled our
"optimization" again just yesterday, so for the bbox queries we're now
at an average of 220 msec (same time window). That's a 5-fold increase
in response time, and in peak hours it's worse than that.


I've also found a blog post from 3 years ago which outlines the inner
workings of the SOLR spatial indexing and searching:
http://www.searchworkings.org/blog/-/blogs/23842
From that it seems as if SOLR already performs an optimization similar
to the one we had in mind, during the index step. So if I understand
correctly, it doesn't even search over all records, only those that
were mapped to the grid box identified during indexing.

What I would love to know is the suggested way to perform a geo
query on SOLR, considering that such queries are so difficult to cache
and expensive to run. Is the best approach to restrict the candidate
set as much as possible using cheap filter queries, so that SOLR merely
has to do the geo search against these subsets? How does the query
planner work here? I see there's a cost attached to a filter query,
but one can only set it when cache is set to false? Are cached geo
queries executed last when there are cheaper filter queries to cut
down on documents? If you have a real-world, practical setup to share,
one that performs well in a production environment serving millions of
requests per day, that would be great.
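
For illustration, the kind of request asked about here could be built
as in the minimal Python sketch below. It is not the actual Ruby wrapper
code: `build_geo_params` is a hypothetical helper, and the `location_tag`
field name and its value are made up. It assumes that SOLR honors `cost`
only together with `cache=false` and evaluates non-cached filters in
ascending order of cost; whether the bbox filter then really runs after
the cheaper filters is exactly the open question.

```python
from urllib.parse import urlencode


def build_geo_params(lat, lon, radius_km, cheap_filters,
                     sfield="location_ll", cost=100):
    """Return SOLR request parameters as a list of (name, value) pairs.

    cheap_filters: filter query strings that repeat often and should stay
    in the filter cache (fq is cached by default).
    """
    params = [("q", "*:*")]
    # Cheap, frequently repeated filters: left cacheable.
    params += [("fq", fq) for fq in cheap_filters]
    # Geo filter: sent uncached so it does not evict other cache entries.
    # The cost value is an assumption; higher is meant to run later.
    geo = "{{!bbox cache=false cost={cost} sfield={sfield} pt={lat},{lon} d={d}}}".format(
        cost=cost, sfield=sfield, lat=lat, lon=lon, d=radius_km)
    params.append(("fq", geo))
    return params


# Hypothetical location tag value, for illustration only.
params = build_geo_params(49.14839, 8.5691, 10, ["location_tag:de_bw_ka"])
query_string = urlencode(params)  # ready to append to the select URL
```

Each filter is passed as its own fq parameter, so the cheap filters can
be cached independently of the geo filter.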

I'd love to contribute documentation, by the way. If you knew me you'd
know I'm an avid open source contributor and actually run several open
source projects myself. But tell me, how can I possibly contribute
answers to questions I don't have an answer to? That's why I'm here,
remember :) So please, these kinds of snippy replies are not helping
anyone.

Thanks
-Matthias

On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> So the obvious question is "what is your
> performance like without the distance filters?"
>
> Without that knowledge, we have no clue whether
> the modifications you've made had any hope of
> speeding up your response times....
>
> As for the docs, any improvements you'd like to
> contribute would be happily received
>
> Best
> Erick
>
> 2012/2/6 Matthias Käppler <matth...@qype.com>:
>> Hi,
>>
>> we need to perform fast geo lookups on an index of ~13M places, and
>> were running into performance problems here with SOLR. We haven't done
>> a lot of query optimization / SOLR tuning up until now so there's
>> probably a lot of things we're missing. I was wondering if you could
>> give me some feedback on the way we do things, whether they make
>> sense, and especially why a supposed optimization we implemented
>> recently seems to have no effect, when we actually thought it would
>> help a lot.
>>
>> What we do is this: our API is built on a Rails stack and talks to
>> SOLR via a Ruby wrapper. We have a few filters that almost always
>> apply, which we put in filter queries. Filter cache hit rate is
>> excellent, about 97%, and cache size caps at 10k filters (max size is
>> 32k, but it never seems to reach that many, probably because we
>> replicate / delta update every few minutes). Still, geo queries are
>> slow, about 250-500msec on average. We send them with cache=false, so
>> as to not flood the fq cache and cause undesirable evictions.
>>
>> Now our idea was this: while the actual geo queries are poorly
>> cacheable, we could clearly identify geographical regions which are
>> more often queried than others (naturally, since we're a user driven
>> service). Therefore, we dynamically partition Earth into a static grid
>> of overlapping boxes, where the grid size (the distance of the nodes)
>> depends on the maximum allowed search radius. That way, for every user
>> query, we would always be able to identify a single bounding box that
>> covers it. This larger bounding box (200km edge length) we would send
>> to SOLR as a cached filter query, along with the actual user query
>> which would still be sent uncached. Ex:
>>
>> User asks for places in 10km around 49.14839,8.5691, then what we will
>> send to SOLR is something like this:
>>
>> fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
>> fq={!bbox cache=true d=100.0 sfield=location_ll pt=49.4684836290799,8.31165802979391}
>>   <-- this second one we derive automatically
>>
>> That way SOLR would intersect the two filters and return the same
>> results as when only looking at the smaller bounding box, but keep the
>> larger box in cache and speed up subsequent geo queries in the same
>> regions. Or so we thought; unfortunately this approach did not help
>> query execution times get better, at all.
>>
>> Question is: why does it not help? Shouldn't it be faster to search on
>> a cached bbox with only a few hundred thousand places? Is it a good
>> idea to make these kinds of optimizations in the app layer (we do this
>> as part of resolving the SOLR query in Ruby), and does it make sense
>> at all? We're not sure what kind of optimizations SOLR already does in
>> its query planner. The documentation is (sorry) miserable, and
>> debugQuery yields no insight into which optimizations are performed.
>> So this has been a hit and miss game for us, which is very ineffective
>> considering that it takes considerable time to build these kinds of
>> optimizations in the app layer.
>>
>> Would be glad to hear your opinions / experience around this.
>>
>> Thanks!
>>
>> --
>> Matthias Käppler
>> Lead Developer API & Mobile
>>
>> Qype GmbH
>> Großer Burstah 50-52
>> 20457 Hamburg
>> Telephone: +49 (0)40 - 219 019 2 - 160
>> Skype: m_kaeppler
>> Email: matth...@qype.com
>>
>> Managing Director: Ian Brotherston
>> Amtsgericht Hamburg
>> HRB 95913
>>
>> This e-mail and its attachments may contain confidential and/or
>> privileged information. If you are not the intended recipient (or have
>> received this e-mail in error) please notify the sender immediately
>> and destroy this e-mail and its attachments. Any unauthorized copying,
>> disclosure or distribution of this e-mail and  its attachments is
>> strictly forbidden. This notice also applies to future messages.

