Re: [scottchu] What kind of configuration to use for this size of news data?

Charlie Hull Wed, 11 May 2016 01:21:19 -0700

On 11/05/2016 04:27, scott.chu wrote:

Fix some typos, add some words and resend same question =>


I want to build a Solr engine for over 60-year news articles. My
requests are (I use Solr 5.4.1):


Hi Scott,

We've actually done something very similar for our client NLA Media
Access in the UK, who handle licensing of most UK newspaper content.
They have over 45m docs going back to 2006.


1> Currently over 10M docs.
2> Currently over 60GB total data size.
3> The number of docs and the data size will keep growing at a rate of
about 1000 docs (or 8MB) per day.
4> There are 5-6 different newspaper types in total.

My questions are: 1> Is it workable enough just to use a master-slave
model? Or should I turn to SolrCloud? (I ask this because our system
management group has never managed a distributed system before and they
also have no knowledge of Zookeeper, shards, etc. Also they don't know
how to back up/restore distributed data.)

Workable yes, advisable no. You should get much better reliability &
performance with SolrCloud once it's set up. Also, if you have
replication set up correctly the need for backup/restore will be
significantly reduced and may be unnecessary.

We used master-slave for News UK's Solr setup (articles from The Times
and other papers) but this was before SolrCloud had properly arrived.
We'd only use master-slave rarely now.

2> Say I choose SolrCloud anyway. I wish to keep one shard owning
one specific year of data. Can it be done?

Yes it can, but it may not be a good idea. If a large proportion of your
queries hit recent news you may find one shard dealing with more queries
than the others and becoming overloaded. Here's a blog post we wrote a
long time ago about this - ignore the name Xapian, this applies to Solr
as well:
http://www.flax.co.uk/blog/2009/04/25/distributed-search-and-partition-functions/
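
To make the overload risk concrete, here is a rough Python sketch that
counts how many query hits each shard would serve under year-based
routing versus hash-based routing. Everything in it is an assumption
for illustration: the corpus, the 90% share of traffic going to 2016
articles, the shard names, and the use of crc32 as a stand-in hash
(it is not the hash function Solr itself uses).

```python
import zlib
from collections import Counter

def shard_by_year(doc):
    # One shard per publication year, as proposed in the question.
    return "year_%d" % doc["year"]

def shard_by_hash(doc, num_shards=4):
    # Stand-in for hash routing. crc32 is used only for illustration;
    # it is NOT the hash function Solr uses internally.
    return "shard_%d" % (zlib.crc32(doc["id"].encode("utf-8")) % num_shards)

def query_load(docs, hits, route):
    # Count how many query hits land on each shard for a routing function.
    load = Counter()
    for doc_id in hits:
        load[route(docs[doc_id])] += 1
    return load

# Hypothetical corpus: 100 articles per year for 2013-2016.
docs = {}
for year in range(2013, 2017):
    for n in range(100):
        doc_id = "%d-%03d" % (year, n)
        docs[doc_id] = {"id": doc_id, "year": year}

# Hypothetical traffic: 900 of 1000 hits are on 2016 articles.
hits = ["2016-%03d" % (i % 100) for i in range(900)]
hits += ["%d-%03d" % (2013 + i % 3, i % 100) for i in range(100)]

by_year = query_load(docs, hits, shard_by_year)
by_hash = query_load(docs, hits, shard_by_hash)

print("year routing:", dict(by_year))
print("hash routing:", dict(by_hash))
```

With these made-up numbers the 2016 shard serves 900 of the 1000 hits
under year routing, while hash routing spreads the same hits over
several shards.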


What configuration should I do? (AFAIK, SolrCloud distributes data
based on some intrinsic routing algorithm.)


You can choose how to route data at indexing time:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
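
As a minimal sketch of what that could look like for one shard per
year, the Python below only builds a Collections API CREATE request
that uses the implicit router with named shards and a routing field.
The base URL, the collection name "news", the shard names and the
field name "pub_year_shard" are all assumptions for illustration, and
other settings you would normally pass (config set, shards per node
and so on) are left out. Check the parameters against the page above
for your Solr version before using them.

```python
from urllib.parse import urlencode

def create_collection_url(base_url, name, shard_names, router_field):
    # Build (but do not send) a Collections API CREATE request that uses
    # the implicit router, so shards are named explicitly rather than
    # derived from a hash range.
    params = [
        ("action", "CREATE"),
        ("name", name),
        ("router.name", "implicit"),
        ("shards", ",".join(shard_names)),
        # Hypothetical field: each document would carry the name of the
        # shard it belongs to in this field.
        ("router.field", router_field),
        ("replicationFactor", "2"),
    ]
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)

# Assumed values, for illustration only.
url = create_collection_url(
    "http://localhost:8983/solr",
    "news",
    ["year_2015", "year_2016"],
    "pub_year_shard",
)
print(url)
```

A new year would then need a new shard added to the collection before
its articles are indexed, which is one more thing for your system
management group to look after.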

3> Say I wish to create another Solr engine with one or two
particular paper types. Is it possible to copy their index data
directly from the big central Solr engine? Or do I have to rebuild the
index from the raw article data? (Our business may need this.)

Yes, I guess so, but why copy it when you could just search it with a
filter for the paper types?
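
For example, a filter query (fq) on a paper-type field restricts any
search to the papers you want. The sketch below just builds such a
request URL in Python. The field name "paper_type", its values, the
collection name "news" and the base URL are assumptions, since I don't
know your schema.

```python
from urllib.parse import urlencode

def paper_filter_query(base_url, collection, text, paper_types):
    # Restrict the main query to the given paper types with a filter
    # query. "paper_type" is a hypothetical field name.
    fq = "paper_type:(%s)" % " OR ".join('"%s"' % p for p in paper_types)
    params = [("q", text), ("fq", fq), ("wt", "json")]
    return "%s/%s/select?%s" % (base_url.rstrip("/"), collection,
                                urlencode(params))

# Assumed values, for illustration only.
url = paper_filter_query(
    "http://localhost:8983/solr", "news", "election", ["daily", "evening"]
)
print(url)
```

Filter queries are cached separately from the main query, so repeating
the same paper-type filter across many searches should be cheap.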


I'd like to hear and use some good suggestions and experiences.

Thanks in advance and best regards.

Scott Chu @ 2016/5/11  11:26 GMT+8


Hope this helps!

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
