Hey Guys,
I'm working on an analytics dashboard project where we collect events into
Elasticsearch for clients. Each client could have millions of events per month.
We are thinking of using one index with one shard and one replica per client.
Looking at Logstash, it seems like Logstash creates
Drew,
The Elasticsearch default is to create 5 shards for each index. I would start
with this. Typically it is best to actually over-shard, which is to say have
more than 1 shard per node per index. There is not really any measurable cost
to this and it gives you flexibility in your design as
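To make the over-sharding idea concrete, here's a minimal sketch of the settings body you'd send when creating an index. The index name and the numbers are just examples, not a recommendation for your setup: the point is that you can pick a shard count higher than your current node count (say 6 shards on 2 nodes) so shards can be rebalanced onto new nodes later without reindexing.

```python
import json

# Shard/replica counts are chosen at index creation time; if omitted,
# the defaults (5 shards, 1 replica) apply. "events" and the counts
# below are hypothetical example values.
create_body = {
    "settings": {
        "number_of_shards": 6,    # more than 1 shard per node per index
        "number_of_replicas": 1
    }
}

# Roughly equivalent request:
#   curl -XPUT 'http://localhost:9200/events' -d '<create_body as JSON>'
print(json.dumps(create_body, indent=2))
```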
Hi Andrew,
Not sure if you read my original question. The question is about having a
separate index per customer since we are going to have 1000 customers but
each would have a lot of data. Each shard comes with its own overhead since
it's an instance of Lucene. I was going with the 1 shard
Pretty sure he read it as I'd have offered the same advice :)
You cannot change the sharding of an index after creation; you need to
completely reindex the data to do so. This may not be a major issue for you,
but it's something to take into account when you have hundreds or thousands
of customers,
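Since the shard count is fixed at creation, one common way to keep the "create a new index and reindex" operation invisible to clients is to point an alias at the index and swap it atomically once reindexing finishes. The index and alias names below are hypothetical; this just sketches the `_aliases` request body.

```python
# A new index with the desired shard count is created and reindexed
# into first; then the alias is moved in one atomic request, so readers
# never see a half-migrated state. Names here are made up for the sketch.
old_index = "customer-42-v1"
new_index = "customer-42-v2"

swap_aliases = {
    "actions": [
        {"remove": {"index": old_index, "alias": "customer-42"}},
        {"add":    {"index": new_index, "alias": "customer-42"}},
    ]
}

# Sent after reindexing completes:
#   curl -XPOST 'http://localhost:9200/_aliases' -d '<swap_aliases as JSON>'
```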
Hi Mark,
The problem that we have is that each customer could generate 60-80 million
docs/month on average. In addition, when a customer leaves, we would need to
delete all their data. Hence it makes sense to have an index per customer
(or even multiple indexes per customer). Another issue
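The deletion argument is worth spelling out: with one index per customer, removing a departed customer is a single index delete, which is far cheaper than a delete-by-query sweeping millions of docs out of a shared index. A tiny sketch of a per-customer naming scheme (the `events-<id>` pattern is hypothetical):

```python
def customer_index(customer_id):
    # Hypothetical naming scheme: one index per customer. With very
    # large customers you might go further, e.g. one index per
    # customer per month.
    return "events-%s" % customer_id

# Removing a departed customer is then one cheap operation:
#   curl -XDELETE 'http://localhost:9200/events-42'
# instead of a delete-by-query over a shared index.
```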
Ahh ok, knowing this extra info is good as it helps us help you :)
Logstash doesn't define how many shards to use, at least not that I can see
here -
https://github.com/elasticsearch/logstash/blob/master/lib/logstash/outputs/elasticsearch/elasticsearch-template.json
- or through some quick tests.
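Since Logstash leaves the shard count to Elasticsearch, one way to control it for Logstash's daily indexes is to register your own index template matching `logstash-*`. A sketch of what that template body might look like (values are examples, not recommendations):

```python
# An index template applies its settings to every index whose name
# matches the pattern at creation time. Counts below are examples.
template_body = {
    "template": "logstash-*",
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
    }
}

# Registered roughly like:
#   curl -XPUT 'http://localhost:9200/_template/logstash_shards' \
#        -d '<template_body as JSON>'
```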