Please find my answers inlined below.

On Thu, Feb 16, 2012 at 10:30 AM, Alexey Verkhovsky <
alexey.verkhov...@gmail.com> wrote:

> Hi, all,
>
> I'm new here. Used Solr on a couple of projects before, but didn't need to
> dive deep into anything until now. These days, I'm doing a spike for a
> "yellow pages" type search server with the following technical
> requirements:
>
> ~10 mln listings in the database. A listing has a name, address,
> description, coordinates and a number of tags / filtering fields; no more
> than a kilobyte all told; i.e. theoretically the whole thing should fit in
> RAM without sharding. A typical query is either "all text matches on name
> and/or description within a bounded box", or "some combination of tag
> matches within a bounded box". Bounded boxes are 1 to 50 km wide, and
> contain up to 10^5 unfiltered listings (the average is more like 10^3).
> More than 50% of all the listings are in the frequently requested bounding
> boxes, however a vast majority of listings are almost never displayed
> (because they don't match the other filters).
>
> Data "never changes" (i.e., a daily batch update; rebuild of the entire
> index and restart of all search servers is feasible, as long as it takes
> minutes, not hours).

Everybody starts with a daily bounce, but ends up with an UPDATED_AT column
and delta updates; just consider the "urgent content fix" use case. I don't
think it's worth relying on a daily bounce as a cornerstone of the
architecture.
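If the data sits in a relational database, DataImportHandler's delta import covers exactly this. A minimal sketch of a data-config.xml entity; the LISTINGS table and its ID/UPDATED_AT columns are assumptions for illustration:

```xml
<!-- sketch only: table and column names are assumptions -->
<entity name="listing" pk="ID"
        query="SELECT * FROM LISTINGS"
        deltaQuery="SELECT ID FROM LISTINGS
                    WHERE UPDATED_AT &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM LISTINGS
                          WHERE ID='${dataimporter.delta.ID}'"/>
```

Then hit /dataimport?command=delta-import on whatever schedule (or urgency) requires.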


> This thing ideally should serve up to 10^3 requests
> per second on a small (as in, "less than 10 commodity boxes") cluster. In
> other words, a typical request should be CPU bound and take ~100-200 msec
> to process. Because of coordinates (that are almost never the same),
> caching of queries makes no sense;

You can snap coordinates to a grid to reduce their entropy; and if you
filter by bounding box, the filter argument is the box, not the raw
coordinates. Either way, use post-filtering and cache=false for such
filters:
http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/
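To illustrate the grid idea: if the client snaps each bounding box outward to, say, a 0.1-degree grid before building the filter query, nearby requests produce byte-identical fq strings and can therefore share a filterCache entry. A sketch (the field name "coords" is an assumption):

```python
import math

def snap_bbox(min_lat, min_lon, max_lat, max_lon, step=0.1):
    """Snap a bounding box outward to a coarse grid, so that nearby
    requests yield identical (hence cacheable) filter queries."""
    down = lambda v: math.floor(v / step) * step
    up = lambda v: math.ceil(v / step) * step
    return (down(min_lat), down(min_lon), up(max_lat), up(max_lon))

def bbox_fq(field, bbox):
    """Render the snapped box as a Solr range filter on a LatLonType
    field, e.g. coords:[45.6,-123.1 TO 45.8,-122.9]."""
    return "%s:[%.1f,%.1f TO %.1f,%.1f]" % ((field,) + bbox)
```

The snapped box is a superset of the one requested, so exact distance filtering or sorting by geodist() can still be applied afterwards.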


> from what little I understand about
> Lucene internals, caching of filters probably doesn't make sense either.
>
But Solr does cache filters: http://wiki.apache.org/solr/SolrCaching#filterCache
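For reference, the filterCache is configured in solrconfig.xml; a typical entry looks like the following (the sizes here are placeholders to tune, not recommendations):

```xml
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>
```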

>
> After perusing documentation and some googling (but almost no source code
> exploring yet), I understand how the schema and the queries will look,
> and now have to figure out a specific configuration that fits the
> performance/scalability requirements. Here is what I'm thinking:
>
> 1. Search server is an internal service that uses embedded Solr for the
> indexing part. RAMDirectoryFactory as index storage.
>
Bad idea. RAMDirectoryFactory is intended mostly for tests; the closest
production-oriented analogue is
org.apache.lucene.store.instantiated.InstantiatedIndex


> 2. All data is in some sort of persistent storage on a file system, and is
> loaded into the memory when a search server starts up.
>
AFAIK the state of the art is to use a file-based directory (MMap or
similar) and rely on the Linux filesystem cache to keep it in RAM. Solr
(and, in part, Lucene) also caches some structures on the heap itself:
http://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration.
So this is mostly done for you already.
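A minimal sketch of the corresponding solrconfig.xml setting, letting the OS page cache do the work:

```xml
<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>
```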


> 3. Data updates are handled as "update the persistent storage, start
> another cluster, load the world into RAM, flip the load balancer, kill the
> old cluster"
>
No, again. Lucene has a pretty cool model of segments and generations
built for incremental updates, and Solr does a lot to keep searching the
old generation while warming up the new one simultaneously (it just takes
some memory, you know, up to two times as much during the switch). I don't
think a manual A/B scheme is needed here. If you want, you can (but don't
really need to) play with the replication facilities, e.g. disable traffic
to half of the nodes, push the new index there, let them warm up, then
re-enable traffic; such machinery rarely works smoothly, due to the number
of moving parts.
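If you do go the replication route, the built-in ReplicationHandler already does the "pull new index, warm it, swap it in" dance for you. A sketch of a master/slave pair in solrconfig.xml; the master URL and poll interval are placeholders:

```xml
<!-- on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- on each slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```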


> 4. Solr returns IDs with relevance scores; actual presentations of listings
> (as JSON documents) are constructed outside of Solr and cached in
> Memcached, as a mostly static content with a few templated bits, like
> <distance><%=DISTANCE_TO(-123.0123, 45.6789) %>.
>
Using separate nodes for search and other nodes to stream the content
sounds good (it's mentioned in every book). Also, besides the score, you
can return the distance to the user, i.e. there is no need for
<%=DISTANCE_TO(-123.0123, 45.6789) %>; just <%=doc.DISTANCE%>. See
http://wiki.apache.org/solr/SpatialSearch?#Returning_the_distance
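The common trick from that wiki page (pre-4.0, before there is a pseudo-field for it) is to make geodist() the score, so the distance comes back with every hit. A sketch, line-wrapped for readability; "coords" is an assumed field name:

```
q={!func}geodist()&sfield=coords&pt=45.6789,-123.0123
  &sort=score asc&fl=*,score
```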



> 5. All Solr caching is switched off.
>
But why?



>
> Obviously, we are not the first people to do something like this with Solr,
> so I'm hoping for some collective wisdom on the following:
>
> Does this sounds like a feasible set of requirements in terms of
> performance and scalability for Solr? Are we on the right path to solving
> this problem well? If not, what should we be doing instead? What nasty
> technical/architectural gotchas are we probably missing at this stage?
>
> One particular advice I'd be really happy to hear is "you may not need
> RAMDirectoryFactory if you use <some combination of fast distributed file system
> and caching> instead".
>
> Also, is there a blog, wiki page or a mailing list thread where a similar
> problem is discussed? Yes, we have seen
> http://www.ibm.com/developerworks/opensource/library/j-spatial, it's a
> good
> introduction that is outdated and doesn't go into the nasty bits, anyway.
>
Btw, if you need a multivalued geo field, please vote for SOLR-2155


> Many thanks in advance,
> -- Alex Verkhovsky
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
