When indexing 3.6 billion documents, you will run into bottlenecks not only
in Solr but also in other places (data acquisition, Solr document object
creation, and submitting documents to Solr in bulk/batches).
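
To put the target in perspective, here is a rough back-of-the-envelope
calculation using the figures from Troy's original question (quoted below).
The per-node rate passed to nodesNeeded is a made-up placeholder, not a
measurement; you would replace it with whatever a single node or setup
sustains in your own tests.

```java
public class IndexingTargets {
    // Figures taken from the original question in this thread.
    static final long TOTAL_DOCS = 3_600_000_000L;
    static final long DOCS_PER_SECOND = 250_000L;
    static final long AVG_DOC_BYTES = 400L;

    /** Seconds needed to index everything at the target rate. */
    static long totalSeconds() {
        return TOTAL_DOCS / DOCS_PER_SECOND;
    }

    /** Raw document payload in bytes per second, before any overhead. */
    static long payloadBytesPerSecond() {
        return DOCS_PER_SECOND * AVG_DOC_BYTES;
    }

    /** Nodes (or independent setups) needed for a measured per-node rate. */
    static long nodesNeeded(long measuredDocsPerSecondPerNode) {
        // Ceiling division: a fractional node still means one more machine.
        return (DOCS_PER_SECOND + measuredDocsPerSecondPerNode - 1)
                / measuredDocsPerSecondPerNode;
    }

    public static void main(String[] args) {
        System.out.println("Hours at target rate: " + totalSeconds() / 3600.0);
        System.out.println("Payload MB/s: " + payloadBytesPerSecond() / 1_000_000.0);
        // 20_000 docs/s per node is a hypothetical figure for illustration.
        System.out.println("Nodes at 20k docs/s each: " + nodesNeeded(20_000));
    }
}
```

At the stated rate the whole load is about four hours of sustained indexing
at roughly 100 MB/s of raw payload, so every stage feeding Solr has to keep
up with that, not just Solr itself.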

To get maximum throughput, you will need to parallelize each of the steps
above. A multi-threaded Java indexer based on SolrJ with CloudSolrClient is
required, as Shawn described. I have used ConcurrentUpdateSolrClient
(ConcurrentUpdateSolrServer before Solr 5) in the past, but with
CloudSolrClient it is worth trying setParallelUpdates.
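
A minimal sketch of the multi-threaded batching idea follows. It uses only
the JDK so that it runs on its own: documents are plain Strings standing in
for SolrInputDocument, and the "sink" is a hypothetical callback that, in a
real indexer, would wrap a call such as CloudSolrClient.add(batch). The
class name, method name, and parameters are my own for illustration, and
retry and error reporting are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class ParallelBatchIndexer {
    // Marker batch telling a worker to stop; compared by identity.
    private static final List<String> POISON = new ArrayList<>();

    /**
     * Groups docs from source into batches and lets `threads` workers send
     * them through `sink`. Returns the number of docs the sink accepted.
     */
    static long index(Iterable<String> source, int batchSize, int threads,
                      Consumer<List<String>> sink) {
        // Bounded queue: the producer blocks instead of buffering everything.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(threads * 2);
        AtomicLong sent = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        List<String> batch = queue.take();
                        if (batch == POISON) return;
                        try {
                            sink.accept(batch); // real code: send batch to Solr
                            sent.addAndGet(batch.size());
                        } catch (RuntimeException e) {
                            // Sketch only: log and continue, no retry.
                            System.err.println("batch failed: " + e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            List<String> batch = new ArrayList<>(batchSize);
            for (String doc : source) {
                batch.add(doc);
                if (batch.size() == batchSize) {
                    queue.put(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) queue.put(batch);           // final partial batch
            for (int i = 0; i < threads; i++) queue.put(POISON); // one per worker
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return sent.get();
    }
}
```

Batch size and thread count are the knobs to measure; the same pattern can
be applied to the data acquisition and document creation stages so that no
single stage starves the others.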

Thanks,
Susheel

On Wed, Aug 19, 2015 at 2:41 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> If you're sitting on HDFS anyway, you could use MapReduceIndexerTool. I'm
> not sure that'll hit your rate; it spends some time copying things around.
> If you're not on HDFS, though, it's not an option.
>
> Best,
> Erick
>
> On Wed, Aug 19, 2015 at 11:36 AM, Upayavira <u...@odoko.co.uk> wrote:
> >
> >
> > On Wed, Aug 19, 2015, at 07:13 PM, Toke Eskildsen wrote:
> >> Troy Edwards <tedwards415...@gmail.com> wrote:
> >> > My average document size is 400 bytes
> >> > Number of documents that need to be inserted 250000/second
> >> > (for a total of about 3.6 Billion documents)
> >>
> >> > Any ideas/suggestions on how that can be done? (use a client
> >> > or uploadcsv or stream or data import handler)
> >>
> >> Use more than one cloud. Make them fully independent. As I suggested
> >> when you asked 4 days ago. That would also make it easy to scale: Just
> >> measure how much a single setup can take and do the math.
> >
> > Yes - work out how much each node can handle, then you can work out how
> > many nodes you need.
> >
> > You could consider using implicit routing rather than compositeId, which
> > means that you take on responsibility for hashing your ID to push
> > content to the right node. (Or, if you use compositeId, you could use
> > the same algorithm, and be sure that you send docs directly to the
> > correct shard.)
> >
> > At the moment, if you push five documents to a five shard collection,
> > the node you send them to could end up doing four HTTP requests to the
> > other nodes in the collection. This means you don't need to worry about
> > where to post your content - it is just handled for you. However, there
> > is a performance hit there. Push content direct to the correct node
> > (either using implicit routing, or by replicating the compositeId hash
> > calculation in your client) and you'd increase your indexing throughput
> > significantly, I would theorise.
> >
> > Upayavira
>
