(apologies in advance for yet another sizing post)

We are indexing documents of approximately 2KB each and ingesting about 50 
million documents daily. The index size ends up being about 75GB per day for 
the primary shards (with replication = 1, that's 150GB/day). In our use case, 
after 1 month we throw away 95% of the data but need to keep the rest 
indefinitely. We are planning to use the "time data flow" approach mentioned 
in Shay's presentations and are currently deciding what time period to use 
for each index. With a shorter period, the current month's index may behave 
better, but we'll end up accumulating lots of smaller indices after the 
1 month period. 

We currently have a 4 node setup, each with 12 cores, 96GB of RAM, and 2TB 
of disk space spread over 4 disks. By my calculations, to hold one year of 
data with r=1, we would need 150GB/day * 31 = 4.65TB for the initial month, 
then 150GB/day * 31 * 0.05 ≈ 233GB for each of the 11 historical months 
≈ 2.56TB, so roughly 7.2TB for 1 year of data. This seems pretty tight to 
me, considering additional space may be needed for merges, etc. 
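For what it's worth, here is the back-of-the-envelope math as a quick script (the 150GB/day, 31-day hot month, 5% retention, and 11 historical months are the numbers from above):

```python
# Capacity estimate: one year of data with replication = 1.
gb_per_day = 150                            # primaries + replicas per day
hot_month_gb = gb_per_day * 31              # current month kept in full
historical_month_gb = hot_month_gb * 0.05   # 95% dropped after 1 month
year_total_gb = hot_month_gb + historical_month_gb * 11

print("current month: %.2f TB" % (hot_month_gb / 1000.0))
print("one year:      %.2f TB" % (year_total_gb / 1000.0))
```

That comes out to about 4.65TB for the current month and a bit over 7TB for the year, before any headroom for merges.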

   1. Is accumulating a lot of indices per node a concern here? If we did a 
   daily index with 4 shards and r=1, that would be over 700 shards per node 
   for 1 year. I know that there is a memory limitation on the number of 
   shards that can be managed by a node. 
   2. If we did a monthly index, that would be better for the historical 
   indices, but the current month index would be huge, over 2TB.
   3. Is there any difference here between doing a daily index with fewer 
   shards vs. a monthly index with more primary shards?
   4. How would having this many shards affect query performance? I assume 
   there is some sweet spot of shards per node that must be found empirically? 
   I would guess it's somewhat related to the number of disks/cores per node?
   5. I am also wondering about the RAM to data ratio and whether we'll get 
   decent query performance. Due to our use case, we can't use routing. Is 
   there any rule of thumb here?
   6. Another option we are considering is to do a daily index for the 
   first month, and then have periodic jobs to combine the historical daily 
   indices into larger ones. So, for example, the first month = 31 daily 
   indices, and each following month gets rolled up into 1 index per month. 
   But we only want to do this extra work if it's needed.
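To make the shard count in question 1 concrete, here is the arithmetic (assuming shards end up evenly distributed across the 4 nodes):

```python
# Shard count for a daily index with 4 primaries and replication = 1.
primaries_per_index = 4
replicas = 1
shards_per_day = primaries_per_index * (1 + replicas)  # 8 shards/day cluster-wide
days = 365
nodes = 4

total_shards = shards_per_day * days       # shards in the cluster after 1 year
shards_per_node = total_shards // nodes    # assuming even distribution

print("total: %d, per node: %d" % (total_shards, shards_per_node))
```

That gives 2920 shards cluster-wide, or 730 per node, which is where the "over 700 shards per node" figure comes from.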

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/b3b6e634-7184-4f7e-ac46-da453917721b%40googlegroups.com.