On 8/19/2015 11:09 AM, Troy Edwards wrote:
> I have a requirement where I have to bulk insert a lot of documents in
> SolrCloud.
>
> My average document size is 400 bytes.
> Number of documents that need to be inserted: 250000/second (for a total
> of about 3.6 billion documents).
>
> Any ideas/suggestions on how that can be done? (use a client or uploadcsv
> or stream or data import handler)
>
> How can SolrCloud be configured to allow this fast bulk insert?
>
> Any thoughts on what the SolrCloud configuration would probably look like?
I think this is an unrealistic goal, unless you're planning on a couple hundred shards, with a very small number of shards (1 or 2) per server. This would also require a very large number of very fast servers with a fair amount of RAM. The more shards you have on each server, the more likely it is that you'll need SSD storage. This will get very expensive.

It is likely going to take a lot longer than 4 hours to rebuild your entire 3.6 billion document index. Your small document size will help keep the rebuild time lower than I would otherwise expect, but 3.6 billion is a VERY large number.

I can achieve about 6000 docs per second on my largest index, which means that each of my cold shards indexes at about 1000 docs per second. I'm not sure how large my documents are, but a few kilobytes is probably about right. The entire rebuild takes over 9 hours for a little more than 200 million documents.

The best performance is likely to come from a heavily multi-threaded SolrJ 5.2.1 or later application using CloudSolrClient, with at least version 5.2.1 on your servers. Even if you build the hardware infrastructure I described above, it won't perform to your expectations unless you've got someone with considerable Java programming skills writing the indexing application.

Thanks,
Shawn
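P.S. To put rough numbers on why I call the goal unrealistic, here is a back-of-the-envelope calculation using only the figures from this thread: the 250000 docs/sec target (which implies a 4-hour rebuild of 3.6 billion docs) and the ~1000 docs/sec per cold shard I see on my own index. The class and method names are just for illustration.

```java
public class SolrBulkEstimate {
    /** Seconds to index totalDocs at a sustained docsPerSecond rate. */
    static long secondsToIndex(long totalDocs, long docsPerSecond) {
        return totalDocs / docsPerSecond;
    }

    /** Shards needed to reach a target aggregate rate, rounded up. */
    static long shardsNeeded(long targetDocsPerSecond, long perShardDocsPerSecond) {
        return (targetDocsPerSecond + perShardDocsPerSecond - 1) / perShardDocsPerSecond;
    }

    public static void main(String[] args) {
        long totalDocs = 3_600_000_000L;  // 3.6 billion, from the question
        long target    = 250_000L;        // docs/sec the poster asked for
        long perShard  = 1_000L;          // rough per-shard rate from my setup

        // 3.6B docs at 250k/sec -> 14400 seconds, i.e. the implied 4-hour window
        System.out.println("seconds at target rate: " + secondsToIndex(totalDocs, target));
        // Hitting 250k/sec at ~1000 docs/sec per shard -> about 250 shards
        System.out.println("shards needed: " + shardsNeeded(target, perShard));
    }
}
```

That shard count is where the "couple hundred shards" above comes from; it assumes per-shard indexing rates add up linearly, which is optimistic.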
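P.P.S. A minimal sketch of the kind of multi-threaded SolrJ application I mean is below. It assumes SolrJ 5.2.1+ on the classpath and a running SolrCloud cluster; the ZooKeeper hosts, collection name, field names, thread count, and batch size are all placeholders, not recommendations, and the document-building loop is where your real data source would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // CloudSolrClient is ZooKeeper-aware and routes each batch to the
        // correct shard leader, which is why it outperforms plain HTTP posts.
        CloudSolrClient client = new CloudSolrClient("zkhost1:2181,zkhost2:2181");
        client.setDefaultCollection("mycollection");

        int threads = 16;  // tune to your hardware
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int worker = t;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 100_000; i++) {  // stand-in for your data source
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", worker + "-" + i);
                    doc.addField("body_s", "payload");
                    batch.add(doc);
                    if (batch.size() == 1000) {  // send batches, never one doc at a time
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) client.add(batch);
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.commit();  // one commit at the end, not per batch
        client.close();
    }
}
```

Batching the adds and committing once at the end matter as much as the thread count; per-document adds and per-batch commits will destroy throughput.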