> On Sep 29, 2019, at 12:30 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
> Thank you Jeff for the hints
>
> We are targeting 20 TB/machine using TWCS and 8 vnodes (using the
> new token allocation algorithm). We will also try the new zstd
> compression.
I’d probably still be inclined to run two instances per machine for 20 TB
machines unless you’re planning on using 4.0
>
> About transient replication: the underlying trade-offs and semantics
> are hard for most people to understand (for example, reading at CL
> ONE after losing both full replicas leads to an unavailable
> exception, unlike normal replication), so we will leave it out for
> the moment
Yeah, with transient replication you’d be restoring from backup in this case,
but to be fair, you’d have violated consistency / lost data written at quorum
if two replicas failed even without transient replication at RF=3
>
> Regards
>
>> On Sun, Sep 29, 2019 at 3:50 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> A few random thoughts here
>>
>> 1) 90 nodes / 900 TB in a cluster isn’t that big. A petabyte per cluster
>> is a manageable size.
>>
>> 2) The 2 TB guidance is old and irrelevant for most people; what you
>> really care about is how fast you can replace a failed machine.
>>
>> You’d likely be ok going significantly larger than that if you use a few
>> vnodes, since that’ll help rebuild faster (you’ll stream from more sources
>> on rebuild)
>>
>> If you don’t want to use vnodes, buy big machines and run multiple Cassandra
>> instances on each one - it’s not hard to run 3-4 TB per instance and
>> 12-16 TB of SSD per machine
>>
>> 3) Transient replication in 4.0 could potentially be worth trying out,
>> depending on your risk tolerance. Running two full and one transient
>> replica may save you ~30% storage
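The ~30% figure above can be checked with quick arithmetic. This is a hypothetical back-of-envelope sketch, not an exact model: it assumes each transient replica only holds data that has not yet been incrementally repaired, and the 5% steady-state fill fraction is an assumed illustrative value, not a measured one.

```python
# Back-of-envelope check of the ~30% savings claim for RF=3 run as
# 2 full + 1 transient replica instead of 3 full replicas.
def replicated_size_tb(raw_tb, full_replicas, transient_replicas=0,
                       transient_fill=0.05):
    """Total on-disk size in TB.

    transient_fill is an ASSUMED fraction of not-yet-repaired data
    held by each transient replica (hypothetical 5% here).
    """
    return raw_tb * (full_replicas + transient_replicas * transient_fill)

raw = 300  # TB before replication, from the sizing below
classic = replicated_size_tb(raw, full_replicas=3)             # 900 TB
transient = replicated_size_tb(raw, full_replicas=2,
                               transient_replicas=1)           # 615 TB
print(classic, transient, 1 - transient / classic)  # roughly 32% saved
```

Under these assumptions the savings approach 1/3 as the transient fill fraction approaches zero, which is where the "may save you 30%" rule of thumb comes from.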
>>
>> 4) Note that you’re not factoring in compression, and some of the recent
>> zstd work may go a long way if your sensor data is similar / compressible.
>>
>>>> On Sep 28, 2019, at 1:23 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>> Hello users
>>>
>>> I'm facing a very challenging exercise: sizing a cluster for a huge
>>> dataset.
>>>
>>> Use-case = IoT
>>>
>>> Number of sensors: 30 million
>>> Frequency of data: every 10 minutes
>>> Estimated size of a data point: 100 bytes (including clustering columns)
>>> Data retention: 2 years
>>> Replication factor: 3 (pretty standard)
>>>
>>> A very quick calculation gives me:
>>>
>>> 6 data points/hour * 24 * 365 ≈ 50,000 data points/year/sensor
>>>
>>> In terms of size, that is 50,000 * 100 bytes = 5 MB of data/year/sensor
>>>
>>> Now the big problem is that we have 30 million sensors, so the disk
>>> requirement adds up pretty fast: 5 MB * 30,000,000 = 150 TB of
>>> data/year
>>>
>>> We want to store data for 2 years => 300 TB
>>>
>>> With RF=3 ==> 900 TB !!!!
>>>
>>> Now, according to the commonly recommended density (with SSDs), one
>>> should not exceed 2 TB of data per node, which gives us a rough sizing
>>> of a 450-node cluster !!!
>>>
>>> Even if we push the limit up to 10 TB using TWCS (has anyone tried
>>> this?), we would still need 90 beefy nodes to support this.
>>>
>>> Any thoughts/ideas to reduce the nodes count or increase density and
>>> keep the cluster manageable ?
>>>
>>> Regards
>>>
>>> Duy Hai DOAN
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>