Hi, 

I am looking for advice on handling a large volume of documents with a very 
high incoming rate. Each document is about 0.5 KB, the incoming rate could 
exceed 20K documents per second, and we want to keep about one year's worth 
of documents in Solr for near real-time searching. The goal is to achieve 
acceptable indexing and querying performance.
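For context, here is a quick back-of-envelope check of the stated numbers (assuming a sustained 20K docs/sec at 0.5 KB per raw document, one-year retention):

```python
# Back-of-envelope sizing from the figures above (assumptions: sustained
# 20K docs/sec, 0.5 KB per raw document, one-year retention).
docs_per_sec = 20_000
doc_size_kb = 0.5

ingest_mb_per_sec = docs_per_sec * doc_size_kb / 1024
docs_per_half_hour = docs_per_sec * 30 * 60
docs_per_year = docs_per_sec * 86_400 * 365
raw_tb_per_year = docs_per_year * doc_size_kb / 1024**3  # KB -> TB

print(f"ingest: ~{ingest_mb_per_sec:.1f} MB/s")
print(f"docs per half hour: {docs_per_half_hour:,}")  # matches the 36M figure below
print(f"docs per year: {docs_per_year:,}")
print(f"raw data per year: ~{raw_tb_per_year:.0f} TB (before index overhead)")
```

So one half-hour window is 36M documents, and a year of raw data alone is roughly 300 TB before any index overhead, which is worth keeping in mind when sizing hardware.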

We will use techniques like soft commits, dedicated indexing servers, etc. My 
main question is how to structure the collections/shards/cores to achieve 
these goals. Since the incoming rate is very high, we do not want the incoming 
documents to affect the existing older indexes. One thought is to create a 
"latest" index to hold the incoming documents (say, the latest half hour's 
data, about 36M docs), so queries on older data would be faster since the old 
indexes are not touched. There seem to be three ways to grow along the time 
dimension, by adding/splitting/creating one of the objects below every half 
hour:

collection
shard
core
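For the "new collection per window" variant, a hypothetical sketch of what the half-hourly rollover could look like against the Solr Collections API (CREATE a collection for the new window, then CREATEALIAS to repoint a "latest" alias). The base URL, window name, and shard/replica counts are illustrative assumptions, not a recommendation; the calls are printed rather than sent so the sketch is self-contained:

```python
# Hypothetical rollover sketch: create a fresh collection for the new
# half-hour window, then repoint a "latest" alias at it via the Solr
# Collections API. All names and counts here are made-up examples.
solr = "http://localhost:8983/solr"          # assumed base URL
window = "docs_2014_06_01_0030"              # e.g. derived from the current half hour

create_url = (f"{solr}/admin/collections?action=CREATE"
              f"&name={window}&numShards=2&replicationFactor=2")
alias_url = (f"{solr}/admin/collections?action=CREATEALIAS"
             f"&name=latest&collections={window}")

# Print the calls instead of issuing them (no live cluster assumed):
print(f"curl '{create_url}'")
print(f"curl '{alias_url}'")
```

CREATEALIAS on an existing alias name repoints it, so indexing clients can always write to "latest" while queries over older windows hit the per-window collections directly (or a second alias spanning them).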

Which is the best way to grow along the time dimension? Are there limitations 
in that direction? Or is there a better approach?

As an example, I am thinking about setting up a SolrCloud cluster of 4 nodes, 
each with the following configuration:

Memory: 128 GB
Storage: 4 TB
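Assuming those figures are per node, a rough check of how the 4-node cluster compares with a year of data at the stated rate (raw size only, ignoring index overhead and replication):

```python
# Rough hardware check (assumptions: 4 nodes, the figures above are per
# node, one year of 0.5 KB docs at a sustained 20K/sec; index overhead
# and replication are ignored).
nodes = 4
storage_tb_per_node = 4

raw_tb_per_year = 20_000 * 86_400 * 365 * 0.5 / 1024**3  # KB -> TB
total_storage_tb = nodes * storage_tb_per_node

print(f"cluster storage: {total_storage_tb} TB")
print(f"one year of raw docs: ~{raw_tb_per_year:.0f} TB")
print(f"shortfall factor: ~{raw_tb_per_year / total_storage_tb:.0f}x")
```

By this estimate 16 TB of total storage is roughly an order of magnitude short of a year's raw data, so either the retention window, the node count, or the per-node storage would need to change.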

How should the collections/shards/cores be set up to handle this use case?

Thanks in advance.

Shushuai 
