> On Sep 29, 2019, at 12:30 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
> Thank you Jeff for the hints
>
> We are targeting 20 TB/machine using TWCS and 8 vnodes (using the
> new token allocation algorithm). We will also try the new zstd
> compression.
I’d probably still be inclined to run two instances per machine for 20 TB
machines unless you’re planning on using 4.0
>
> About transient replication: the underlying trade-offs and semantics
> are hard for most people to understand (for example, reading at CL
> ONE after losing both full replicas leads to an unavailable
> exception, unlike normal replication), so we will leave it out for
> the moment
Yeah, with transient replication you’d be restoring from backup in this case,
but to be fair, you’d have violated consistency / lost data written at quorum
if two replicas failed even without transient replication at RF=3
>
> Regards
>
>> On Sun, Sep 29, 2019 at 3:50 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> A few random thoughts here
>>
>> 1) 90 nodes / 900 TB in a cluster isn’t that big. A petabyte per cluster
>> is a manageable size.
>>
>> 2) The 2 TB guidance is old and irrelevant for most people; what you
>> really care about is how fast you can replace a failed machine.
>>
>> You’d likely be ok going significantly larger than that if you use a few
>> vnodes, since that’ll help rebuild faster (you’ll stream from more sources
>> on rebuild)
>>
>> If you don’t want to use vnodes, buy big machines and run multiple Cassandra
>> instances on each one - it’s not hard to run 3-4 TB per instance and
>> 12-16 TB of SSD per machine
>>
>> 3) Transient replication in 4.0 could potentially be worth trying out,
>> depending on your risk tolerance. Running two full and one transient
>> replica may save you ~30% storage
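The ~30% figure above can be checked with quick arithmetic. This is a hypothetical back-of-envelope sketch, not an exact model: it assumes each transient replica only holds data that has not yet been incrementally repaired, and the 5% steady-state fill fraction is an assumed illustrative value, not a measured one.

```python
# Back-of-envelope check of the ~30% savings claim for RF=3 run as
# 2 full + 1 transient replica instead of 3 full replicas.
def replicated_size_tb(raw_tb, full_replicas, transient_replicas=0,
                       transient_fill=0.05):
    """Total on-disk size in TB.

    transient_fill is an ASSUMED fraction of not-yet-repaired data
    held by each transient replica (hypothetical 5% here).
    """
    return raw_tb * (full_replicas + transient_replicas * transient_fill)

raw = 300  # TB before replication, from the sizing below
classic = replicated_size_tb(raw, full_replicas=3)             # 900 TB
transient = replicated_size_tb(raw, full_replicas=2,
                               transient_replicas=1)           # 615 TB
print(classic, transient, 1 - transient / classic)  # roughly 32% saved
```

Under these assumptions the savings approach 1/3 as the transient fill fraction approaches zero, which is where the "may save you 30%" rule of thumb comes from.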
>>
>> 4) Note that you’re not factoring in compression, and some of the recent
>> zstd work may go a long way if your sensor data is similar / compressible.
>>
>>>> On Sep 28, 2019, at 1:23 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>> Hello users
>>>
>>> I'm facing a very challenging exercise: sizing a cluster for a huge
>>> dataset.
>>>
>>> Use-case = IoT
>>>
>>> Number of sensors: 30 million
>>> Frequency of data: every 10 minutes
>>> Estimated size of a data point: 100 bytes (including clustering columns)
>>> Data retention: 2 years
>>> Replication factor: 3 (pretty standard)
>>>
>>> A very quick calculation gives me:
>>>
>>> 6 data points/hour * 24 * 365 ≈ 50,000 data points/year/sensor
>>>
>>> In terms of size, that is 50,000 * 100 bytes = 5 MB of data/year/sensor
>>>
>>> Now the big problem is that we have 30 million sensors, so the disk
>>> requirement adds up pretty fast: 5 MB * 30,000,000 = 150 TB of
>>> data/year
>>>
>>> We want to store data for 2 years => 300 TB
>>>
>>> With RF=3 ==> 900 TB !!!!
>>>
>>> Now, according to the commonly recommended density (with SSDs), one
>>> should not exceed 2 TB of data per node, which gives us a rough sizing
>>> of a 450-node cluster !!!
>>>
>>> Even if we push the limit up to 10 TB using TWCS (has anyone tried
>>> this?), we would still need 90 beefy nodes to support this.
>>>
>>> Any thoughts/ideas to reduce the nodes count or increase density and
>>> keep the cluster manageable ?
>>>
>>> Regards
>>>
>>> Duy Hai DOAN
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>