(apologies in advance for yet another sizing post) We are indexing documents of approximately 2KB each, ingesting about 50 million documents daily. The index ends up at about 75GB per day for the primary shards (with replication = 1, that's 150GB/day on disk). In our use case, after 1 month we throw away 95% of the data but need to keep the rest indefinitely. We are planning to use the "time data flow" pattern mentioned in Shay's presentations and are currently deciding what time period to use for each index. With a shorter period, the current-month indices may behave better, but we'll end up accumulating lots of smaller indices once the 1-month period has passed.
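To make the trade-off concrete, here are the per-index sizes implied by the figures above for a few candidate rollover periods (a back-of-the-envelope sketch using the post's own assumptions: ~75GB of primary index per day, replication = 1, a 31-day month):

```python
# Per-index sizes for different rollover periods.
# Figures from the post: ~75 GB primary per day, one replica (r=1).
PRIMARY_GB_PER_DAY = 75
REPLICATION = 1  # one replica copy per primary

total_gb_per_day = PRIMARY_GB_PER_DAY * (1 + REPLICATION)  # 150 GB/day on disk

for period_name, days in [("daily", 1), ("weekly", 7), ("monthly", 31)]:
    primary_gb = PRIMARY_GB_PER_DAY * days
    on_disk_gb = total_gb_per_day * days
    print(f"{period_name:>8}: {primary_gb:>5} GB primary, {on_disk_gb:>5} GB with replicas")
# daily indices stay small (150 GB on disk) but pile up;
# a monthly index is 2325 GB primary / 4650 GB on disk.
```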
We currently have a 4-node setup, each node with 12 cores, 96GB of RAM, and 2TB of disk spread over 4 disks. By my calculations, to hold one year of data with r=1 we would need 150GB/day * 31 for the initial month, plus 150GB/day * 31 * 0.05 per historical month, which comes to 4.65TB + 2.5TB = over 7TB for 1 year of data. That seems pretty tight to me, considering additional space may be needed for merges, etc.

1. Is accumulating a lot of indices per node a concern here? If we did a daily index with 4 shards and r=1, that would be over 700 shards per node for 1 year. I know there is a memory limit on the number of shards a node can manage.
2. If we did a monthly index, that would be better for the historical indices, but the current-month index would be huge, over 2TB.
3. Is there any difference between doing a daily index with fewer shards vs. a monthly index with more primary shards?
4. How would having this many shards affect query performance? I assume there is some sweet spot of shards per node that must be found empirically? I would guess it's somewhat related to the number of disks/cores per node?
5. I am also wondering about the RAM-to-data ratio and whether we'll get decent query performance. Due to our use case, we can't use routing. Is there any rule of thumb here?
6. Another option we are considering is a daily index for the first month, with periodic jobs that combine the historical daily indices into larger ones. For example, the first month would be 31 daily indices, and each following month would get rolled up into 1 index per month. But we only want to do this extra work if it's needed.
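For what it's worth, here is the capacity and shard-count arithmetic spelled out as a quick script (same assumptions as above: 150GB/day on disk with r=1, 31-day months, 95% of data dropped after the first month, 4 data nodes, daily indices with 4 primaries):

```python
# Year-one capacity and per-node shard count, using the post's figures.
GB_PER_DAY = 150        # primary + 1 replica on disk
DAYS_PER_MONTH = 31     # the post's approximation
RETENTION = 0.05        # 5% of data kept past the first month
NODES = 4

current_month_gb = GB_PER_DAY * DAYS_PER_MONTH                 # 4650 GB
historical_gb = GB_PER_DAY * DAYS_PER_MONTH * RETENTION * 11   # ~2557 GB for months 2-12
total_tb = (current_month_gb + historical_gb) / 1000           # ~7.2 TB
print(f"year one: {total_tb:.1f} TB total, {total_tb / NODES:.2f} TB per node")

# Shard count for daily indices: 4 primaries + 4 replicas per day.
shards_per_day = 4 * 2
shards_per_node = shards_per_day * 365 / NODES
print(f"daily indices: {shards_per_node:.0f} shards per node after a year")
```

So each 2TB node would be carrying roughly 1.8TB of index plus merge headroom, and about 730 shards, which matches the "pretty tight" and "over 700" figures above.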