OK, here is a brief on our sharded setup.

We have 10 shards, 3 per high-end Amazon machine. The majority of searches
hit at most 2 shards, the ones holding the latest data in their indices. We
use logical sharding, not hash-based sharding. Together these two facts lead
to a situation where, for a user query that *will for sure* hit only the 2
newest (or time-adjacent) shards, the remaining Solr shards would search in
vain. Therefore we have implemented a query router, which is essentially
Solr itself with modifications to the QueryComponent. Before implementing
the router, it was nearly impossible to run the system.
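
To give a rough idea of how such routing can hook into Solr, here is a
simplified sketch (not the production code); the shard addresses and the
newest_only parameter are only placeholders:

import java.io.IOException;

import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class RoutingQueryComponent extends QueryComponent {

  // Placeholder addresses of the two shards that hold the newest data.
  private static final String[] LATEST_SHARDS = {
      "shard-host:8983/solr/shard09",
      "shard-host:8983/solr/shard10"
  };

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    super.prepare(rb); // usual query/filter preparation
    // "newest_only" is an illustrative flag the front end would set when
    // the query is known to touch only the latest data.
    if (rb.req.getParams().getBool("newest_only", false)) {
      rb.shards = LATEST_SHARDS; // distribute the search only to these shards
    }
  }
}

The sketch only shows where the shard list can be narrowed; deciding which
shards are actually relevant for a given query is the router's job.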

Why did we do the sharding?

Simply because we started to see a lot of OOM exceptions and various other
instability issues. We also had to rebuild the index very often due to
changes in the preceding pipeline, so distributing over shards was another
asset for us in the sense that reindexing could be carried out in parallel.
On top of that, and certainly not least, the slimmer we kept the shards, the
faster our searches became.

We don't yet have a master/slave architecture; that will come when the user
base grows. We started with growing amounts of data, so horizontal scaling
came first.

Regards,
Dmitry Kan
twitter.com/DmitryKan

On Wed, Aug 3, 2011 at 12:24 PM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

>
> On 02.08.2011 21:00, Shawn Heisey wrote:
>
>> ...
>>
>> I did try some early tests with a single large index. Performance was
>> pretty decent once it got warmed up, but I was worried about how it would
>> perform under a heavy load, and how it would cope with frequent updates. I
>> never really got very far with testing those fears, because the full
>> rebuild time was unacceptable - at least 8 hours. The source database can
>> keep up with six DIH instances reindexing at once, which completes
>> much quicker than a single machine grabbing the entire database. I may
>> increase the number of shards after I remove virtualization, but I'll
>> need to fix a few limitations in my build system.
>> ...
>>
>
> First of all, thanks a lot for all the answers, and here is my setup.
>
> I know that it is very difficult to give specific recommendations about
> this.
> Having changed from FAST Search to Solr, I can state that Solr performs
> very well, if not excellently.
>
> To show that I am comparing apples and oranges, here is my previous FAST
> Search setup:
> - one master server (controlling, logging, search dispatcher)
> - six index servers (4.25 million docs per server, 5 slices per index)
>  (searching and indexing at the same time, indexing once per week during
> the weekend)
> - each server has 4GB RAM, all servers are physical, on separate machines
> - RAM usage controlled by the processes
> - total of 25.5 million docs (mainly metadata) from 1500 databases worldwide
> - index size is about 67GB per indexer --> about 402GB total
> - about 3 qps at peak times
> - with an average search time of 0.05 seconds at peak times
>
> And here is now my current Solr setup:
> - one master server (indexing only)
> - two slave servers (search only), but only one is online; the second is a
> fallback
> - each server has 32GB RAM, all servers are virtual
>  (master on a separate physical machine, both slaves together on one
> physical machine)
> - RAM usage is currently 20GB for the java heap
> - total of 31 million docs (all metadata) from 2000 databases worldwide
> - index size is 156GB total
> - the search handler statistics report 0.6 average requests per second
> - average time per request 39.5 (is that seconds?)
> - building the index from scratch takes about 20 hours
>
> The good thing is that I am able to compare a commercial enterprise system
> to open source.
>
> I started with my simple Solr setup because of "KISS" (keep it simple and
> stupid).
> Actually it is doing an excellent job as a single index on a single virtual
> server.
> But the average time per request should be reduced now, and that's why I
> started this discussion.
> While searches with a smaller Solr index (3 million docs) showed that Solr
> can stand up to FAST Search, it is now clear that it's time to go with
> sharding.
> I think we are already far past the search performance crossover point.
>
> What I hope to get with sharding:
> - reduce time for building the index
> - reduce average time per request
>
> What I fear with sharding:
> - I currently have master/slave; do I then have e.g. 3 masters and 3 slaves?
> - the query changes because of sharding (is there a search distributor?)
> - how do I distribute the content to the indexers with DIH on 3 servers?
> - anything else to think about while changing to sharding?
>
> Conclusion:
> - Solr can handle much more than 30 million docs of metadata in a single
>  index if the java heap size is large enough. Have an eye on Lucene's
>  fieldCache and sorted fields, especially title (string) fields.
> - The crossover in my case is somewhere between 3 million and 10 million
>  docs per index for Solr (compared to FAST Search). FAST recommends about
>  3 to 6 million docs per 4GB RAM server for their system.
>
> Anyone able to reduce my fears about sharding?
> Thanks again for all your answers.
>
> Regards
> Bernd
>
> --
> *****************************************************************
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *****************************************************************
>
