On 11/05/2016 04:27, scott.chu wrote:
Fix some typos, add some words and resend same question =>

I want to build a Solr engine for over 60 years of news articles. My
requirements are (I use Solr 5.4.1):

Hi Scott,

We've actually done something very similar for our client NLA Media
Access in the UK, who handle licensing of most UK newspaper content.
They have over 45m docs going back to 2006.

1> Currently over 10M docs. 2> Currently over 60GB total data
size. 3> The number of docs and the data size will keep growing at a
rate of 1000 docs (or 8MB) per day. 4> There are 5-6 different
newspaper types in total.

My questions are: 1> Is it workable enough just to use the master-slave
model? Or should I turn to SolrCloud? (I ask because our system
management group has never managed a distributed system before and
has no knowledge of ZooKeeper, shards, etc. They also don't know
how to back up/restore distributed data.)

Workable yes, advisable no. You should get much better reliability & performance with SolrCloud once it's set up. Also, if you have replication set up correctly the need for backup/restore will be significantly reduced and may be unnecessary.

We used master-slave for News UK's Solr setup (articles from The Times and other papers) but this was before SolrCloud had properly arrived. We'd only use master-slave rarely now.

2> Say I choose SolrCloud anyway. I wish to have each shard own
one specific year of data. Can it be done?

Yes it can, but it may not be a good idea. If a large proportion of your queries hit recent news you may find one shard dealing with more queries than the others and becoming overloaded. Here's a blog post we wrote a long time ago about this - ignore the name Xapian, this applies to Solr as well: http://www.flax.co.uk/blog/2009/04/25/distributed-search-and-partition-functions/
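
To make the hot-shard concern concrete, here is a toy sketch in Python. It is not Solr code and does not measure real Solr load: the corpus, the ids and the "every lookup targets 2016" workload are all made up for illustration. It only counts which partition owns the documents being asked for under a per-year scheme versus a hash-of-id scheme.

```python
import zlib
from collections import Counter


def year_shard(doc):
    """Partition by publication year: one shard per year."""
    return "y%d" % doc["year"]


def hash_shard(doc, num_shards):
    """Partition by a stable hash of the document id."""
    return "s%d" % (zlib.crc32(doc["id"].encode("utf-8")) % num_shards)


# Hypothetical corpus: 100 articles for each year 2011..2016.
docs = [{"id": "art-%d-%03d" % (year, n), "year": year}
        for year in range(2011, 2017) for n in range(100)]

# Hypothetical workload: every lookup targets an article from 2016.
recent = [d for d in docs if d["year"] == 2016]

# Count how many of those lookups land on each shard.
load_by_year = Counter(year_shard(d) for d in recent)
load_by_hash = Counter(hash_shard(d, 6) for d in recent)

print("year partitioning:", dict(load_by_year))
print("hash partitioning:", dict(load_by_hash))
```

With per-year partitioning all of the recent lookups land on the single 2016 shard, while the hash scheme spreads them over several shards.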

What configuration would
I need? (AFAIK, SolrCloud distributes data based on some intrinsic
routing algorithm.)

You can choose how to route data at indexing time:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
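
As a rough sketch of what that could look like for one shard per year: the Collections API CREATE action accepts router.name=implicit together with an explicit list of shard names, and router.field names the document field whose value selects the shard. The Python below only builds the request URL and sends nothing. The host, the collection name "news", the shard names and the field name "pub_year" are all assumptions for illustration, and other parameters you may need (config set, replication factor and so on) are left out.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, shard_names, route_field):
    """Build a Collections API CREATE URL that uses the implicit router."""
    params = {
        "action": "CREATE",
        "name": name,
        # implicit router: you name the shards and choose the target yourself
        "router.name": "implicit",
        "shards": ",".join(shard_names),
        # assumed field; its value in each document should match a shard name
        "router.field": route_field,
    }
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


# Hypothetical host, collection, shard names and field name.
url = create_collection_url(
    "http://localhost:8983/solr", "news", ["y2015", "y2016"], "pub_year"
)
print(url)
```

Under these assumptions a document with pub_year set to y2016 would be indexed into the shard named y2016. Check the page above for the exact behaviour in your Solr version before relying on this.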

3> If I wish to create another Solr engine with
one or two particular paper types, is it possible to copy their index
data directly from the big central Solr engine? Or do I have to rebuild
the index from the raw article data? (Our business may need
this.)

Yes, I guess so, but why copy it when you could just search it with a filter for the paper types?
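
For example, a filter query (the fq parameter) restricts a search to the chosen paper types without a second index. The Python below is a minimal sketch that only builds the select URL. The field name "paper_type", the collection name "news", the host and the paper names are assumptions, not something from your schema.

```python
from urllib.parse import urlencode


def paper_type_filter(field, values):
    """Build an fq clause matching any of the given values in one field."""
    # Quote each value so names containing spaces stay a single term.
    quoted = ['"%s"' % v.replace('"', '\\"') for v in values]
    return "%s:(%s)" % (field, " OR ".join(quoted))


def select_url(base_url, collection, query, filter_query, rows=10):
    """Build a /select URL with a main query and one filter query."""
    params = [("q", query), ("fq", filter_query), ("rows", rows)]
    return "%s/%s/select?%s" % (
        base_url.rstrip("/"), collection, urlencode(params)
    )


# Hypothetical field, collection, host and paper names.
fq = paper_type_filter("paper_type", ["times", "sunday times"])
url = select_url("http://localhost:8983/solr", "news", "election", fq)
print(fq)
print(url)
```

Solr caches filter queries separately from the main query, so a filter that is reused often, such as a fixed set of paper types, tends to be cheap after the first request.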

I'd like to hear and use some good suggestions and experiences.

Thanks in advance and best regards.

Scott Chu @ 2016/5/11  11:26 GMT+8


Hope this helps!

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
