On 8/19/2015 11:09 AM, Troy Edwards wrote:
> I have a requirement where I have to bulk insert a lot of documents in
> SolrCloud.
>
> My average document size is 400 bytes
> Number of documents that need to be inserted 250000/second (for a total of
> about 3.6 Billion documents)
>
> Any ideas/suggestions on how that can be done? (use a client or uploadcsv
> or stream or data import handler)
>
> How can SolrCloud be configured to allow this fast bulk insert?
>
> Any thoughts on what the SolrCloud configuration would probably look like?

I think this is an unrealistic goal, unless you're planning on a couple
hundred shards, with only one or two shards per server.  This would
also require a very large number of very fast servers with a fair
amount of RAM.  The more shards you have on each server, the more
likely it is that you'll need SSD storage.  This will get very expensive.
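
To give a rough idea of the collection layout that implies, here is a
SolrJ sketch using the Collections API.  The ZooKeeper string, the
collection and configset names, and the shard/replica numbers are all
placeholders, and the exact CollectionAdminRequest methods differ
between SolrJ versions, so treat this as an outline rather than a
recipe:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateBigCollection {
  public static void main(String[] args) throws Exception {
    // Placeholder ZooKeeper ensemble string.
    try (CloudSolrClient client =
        new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
      CollectionAdminRequest.Create create = new CollectionAdminRequest.Create();
      create.setCollectionName("bigcollection");   // placeholder name
      create.setConfigName("bigconfig");           // placeholder configset
      create.setNumShards(200);                    // illustrative shard count
      create.setReplicationFactor(1);
      create.setMaxShardsPerNode(2);               // keep shards per server low
      create.process(client);
    }
  }
}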

It is likely going to take a lot longer than 4 hours (which is what
250,000 documents per second works out to for 3.6 billion documents) to
rebuild your entire index.  Your small document size will help
keep the rebuild time lower than I would otherwise expect, but 3.6
billion is a VERY large number.  I can achieve about 6000 docs per
second on my largest index, which means that each of my cold shards
indexes at about 1000 docs per second.  I'm not sure how large my
documents are, but a few kilobytes is probably about right.  The entire
rebuild takes over 9 hours for a little more than 200 million documents.

The best performance is likely to come from a heavily multi-threaded
SolrJ 5.2.1 or later application using CloudSolrClient, with at least
version 5.2.1 on your servers.  Even if you build the hardware
infrastructure I described above, it won't perform to your expectations
unless you've got someone with considerable Java programming skills.
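
A very rough sketch of such an application is below.  The ZooKeeper
string, collection and field names, thread count, and batch size are
all made-up placeholders, and CloudSolrClient construction details
change between SolrJ releases, so this only shows the general shape:
one shared CloudSolrClient, a thread pool, and batched add() calls.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  // Placeholder values: adjust for your cluster and data source.
  private static final String ZK_HOST = "zk1:2181,zk2:2181,zk3:2181/solr";
  private static final int THREADS = 16;       // tune to CPU and network capacity
  private static final int BATCH_SIZE = 1000;  // documents per add() request

  public static void main(String[] args) throws Exception {
    // A single CloudSolrClient is thread-safe and routes each document
    // directly to the correct shard leader.  (This constructor exists in
    // 5.2.1; newer SolrJ releases use a Builder instead.)
    CloudSolrClient client = new CloudSolrClient(ZK_HOST);
    client.setDefaultCollection("bigcollection");

    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    for (int t = 0; t < THREADS; t++) {
      pool.submit(() -> {
        // In a real indexer, each thread would pull documents from the
        // source system in a loop; this just builds one example batch.
        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        for (int i = 0; i < BATCH_SIZE; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", java.util.UUID.randomUUID().toString());
          doc.addField("body_s", "example document");
          batch.add(doc);
        }
        try {
          client.add(batch);   // one request per batch, never per document
        } catch (Exception e) {
          e.printStackTrace(); // real code needs retry and error handling
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    client.commit();           // or rely on autoCommit in solrconfig.xml
    client.close();
  }
}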

Thanks,
Shawn
