(apologies in advance for yet another sizing post)

We are indexing documents of approximately 2KB each and ingesting about 50 
million documents daily. The index size ends up being about 75GB per day for 
the primary shards (with replication = 1, that's 150GB/day). In our use case, 
after 1 month we throw away 95% of the data but need to keep the rest 
indefinitely. We are planning to use the "time data flow" approach mentioned 
in Shay's presentations and are currently deciding what time period to use 
for each index. With a shorter period, the current month's index may behave 
better, but we'll end up accumulating lots of smaller indices after the 
1 month period. 

We currently have a 4 node setup, each with 12 cores, 96GB of RAM, and 2TB 
of disk space spread over 4 disks. By my calculations, to hold one year of 
data with r=1, we would need 150GB/day * 31 = 4.65TB for the initial month, 
then 150GB/day * 31 * 0.05 ≈ 233GB for each of the 11 historical months 
≈ 2.56TB, so roughly 7.2TB for 1 year of data. This seems pretty tight to 
me, considering additional space may be needed for merges, etc. 
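For what it's worth, here is the back-of-the-envelope math as a quick script (the 150GB/day, 31-day hot month, 5% retention, and 11 historical months are the numbers from above):

```python
# Capacity estimate: one year of data with replication = 1.
gb_per_day = 150                            # primaries + replicas per day
hot_month_gb = gb_per_day * 31              # current month kept in full
historical_month_gb = hot_month_gb * 0.05   # 95% dropped after 1 month
year_total_gb = hot_month_gb + historical_month_gb * 11

print("current month: %.2f TB" % (hot_month_gb / 1000.0))
print("one year:      %.2f TB" % (year_total_gb / 1000.0))
```

That comes out to about 4.65TB for the current month and a bit over 7TB for the year, before any headroom for merges.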

   1. Is accumulating a lot of indices per node a concern here? If we did a 
   daily index with 4 shards and r=1, that would be over 700 shards per node 
   for 1 year. I know that there is a memory limitation on the number of 
   shards that can be managed by a node. 
   2. If we did a monthly index, that would be better for the historical 
   indices, but the current month index would be huge, over 2TB.
   3. Is there any difference here between doing a daily index with fewer 
   shards vs. a monthly index with more primary shards?
   4. How would having this many shards affect query performance? I assume 
   there is some sweet spot of shards per node that must be found empirically? 
   I would guess it's somewhat related to the number of disks/cores per node?
   5. I am also wondering about the RAM to data ratio and whether we'll get 
   decent query performance. Due to our use case, we can't use routing. Is 
   there any rule of thumb here?
   6. Another option we are considering is to do a daily index for the 
   first month, and then have periodic jobs to combine the historical daily 
   indices into larger ones. So, for example, the first month = 31 daily 
   indices, and each following month gets rolled up into 1 index per month. 
   But we only want to do this extra work if it's needed.
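To make the shard count in question 1 concrete, here is the arithmetic (assuming shards end up evenly distributed across the 4 nodes):

```python
# Shard count for a daily index with 4 primaries and replication = 1.
primaries_per_index = 4
replicas = 1
shards_per_day = primaries_per_index * (1 + replicas)  # 8 shards/day cluster-wide
days = 365
nodes = 4

total_shards = shards_per_day * days       # shards in the cluster after 1 year
shards_per_node = total_shards // nodes    # assuming even distribution

print("total: %d, per node: %d" % (total_shards, shards_per_node))
```

That gives 2920 shards cluster-wide, or 730 per node, which is where the "over 700 shards per node" figure comes from.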

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/b3b6e634-7184-4f7e-ac46-da453917721b%40googlegroups.com.