Joe Obernberger <joseph.obernber...@gmail.com> wrote:

[3 billion docs / 16TB / 27 shards on HDFS times 3 for replication]

> Each shard is then hosting about 610GBytes of index.  The HDFS cache
> size is very low at about 8GBytes.  Suffice it to say, performance isn't
> very good, but again, this is for experimentation.

We are running a setup on local SSDs where each shard is 900GB with ~6GB left 
free for disk cache. But the shards are static and fully optimized. If your 
data are non-changing time-series, you might want to consider a model with 
dedicated search-only and shard-build-only nodes to lower the hardware requirements.
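
As a rough, untested sketch of that split (the node and collection names are 
placeholders, not from your setup): the Collections API's createNodeSet can pin 
the initial replicas to a build node, and ADDREPLICA can later expose a 
finished, optimized shard on a search-only node.

    # placeholder hosts: build-node1 / search-node1
    import requests

    SOLR = "http://build-node1:8983/solr/admin/collections"

    # Build the collection on the build node only.
    requests.get(SOLR, params={
        "action": "CREATE",
        "name": "timeseries",                 # placeholder collection name
        "numShards": 27,
        "replicationFactor": 1,
        "createNodeSet": "build-node1:8983_solr",
    }).raise_for_status()

    # Once a shard is fully built and optimized, add a search copy:
    requests.get(SOLR, params={
        "action": "ADDREPLICA",
        "collection": "timeseries",
        "shard": "shard1",                    # repeat per finished shard
        "node": "search-node1:8983_solr",     # search-only node
    }).raise_for_status()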

> If we were to redo this, would it be better to create many shards -
> maybe 200 with 3 replicas each (600 in all) with the goal being to
> withstand a server going out, and future expansion as more hardware is
> added?  I know this is very general question.  Thanks very much in advance!

As Erick says, you are in the fortunate position that it is reasonably easy 
for you to prototype and extrapolate, since you are scaling out. I will add that 
you should keep in mind that under the constraint of constant hardware, you 
might win latency by sharding, but you will lose maximum throughput (due to 
duplicated index structures across shards and the extra coordination per 
request). If you can simply add hardware, then do so. If you have under-utilized 
CPUs and a need for lower latency, then try more shards on the existing hardware 
(the Scott@FullStory slides that Susheel mentions seem fitting here).
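
A quick way to measure that latency/throughput tradeoff while prototyping 
(placeholder collection name and query list; adjust to your own data and queries):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    SOLR = "http://localhost:8983/solr/proto/select"   # placeholder collection
    QUERIES = ["*:*", "text:example", "text:benchmark"] * 100  # stand-in queries

    def timed_query(q):
        t0 = time.time()
        requests.get(SOLR, params={"q": q, "rows": 10}).raise_for_status()
        return time.time() - t0

    # Latency: sequential requests, report the median.
    lat = sorted(timed_query(q) for q in QUERIES)
    print("median latency: %.1f ms" % (lat[len(lat) // 2] * 1000))

    # Maximum throughput: saturate the nodes with concurrent clients.
    start = time.time()
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(timed_query, QUERIES))
    print("throughput: %.1f queries/s" % (len(QUERIES) / (time.time() - start)))

Run it against the same test corpus with different shard counts and compare 
the two numbers; that is usually enough to see where sharding stops paying off.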

- Toke Eskildsen
