Re: Big Data Question

Joe Obernberger Thu, 17 Aug 2023 07:46:37 -0700

Thanks for this - yeah - duh - forgot about replication in my example!

So - is 2TBytes per Cassandra instance advisable? Better to use more/less? Modern 2U servers can be had with 24 3.8TByte SSDs; so assume 80TBytes per server, you could do: (1024*3)/80 = 39 servers, but you'd have to run 40 instances of Cassandra on each server; maybe 24G of heap per instance, so a server with 1TByte of RAM would work.
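The arithmetic above can be checked with a small sketch. All figures are the ones quoted in this message; the variable names are only illustrative, and the RAM figure counts heap alone (off-heap memory and page cache are not included).

```python
import math

# Figures from the message above; variable names are illustrative only.
replicated_data_tb = 1024 * 3   # 1 PByte of data at RF=3, in TBytes
usable_tb_per_server = 80       # assumed usable capacity of a 24 x 3.8 TByte server
tb_per_instance = 2             # rule-of-thumb data per Cassandra instance
heap_gb_per_instance = 24       # proposed heap per instance

# Round up: a fraction of a server still means buying a whole one.
servers = math.ceil(replicated_data_tb / usable_tb_per_server)
instances_per_server = usable_tb_per_server // tb_per_instance
heap_gb_per_server = instances_per_server * heap_gb_per_instance

print(servers, instances_per_server, heap_gb_per_server)  # 39 40 960
```

The 960G of heap is what makes a 1TByte server look plausible, though it leaves little room for anything besides the heaps.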

Is this what folks would do?


-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:

Just pointing out the obvious, for 1PB of data on nodes with 2TB of disk each, you will need far more than 500 nodes.
1, it is unwise to run Cassandra with replication factor 1. It usually makes sense to use RF=3, so 1PB of data will cost 3PB of storage space, a minimum of 1500 such nodes.
2, depending on the compaction strategy you use and the write access pattern, there's a disk space amplification to consider. For example, with STCS, the disk usage can be many times the actual live data size.
3, you will need some extra free disk space as temporary space for running compactions.
4, the data is rarely going to be perfectly evenly distributed among all nodes, and you need to take that into consideration and size the nodes based on the node with the most data.
5, enough bad news, here's some good news. Compression will save you (a lot of) disk space!
With all the above considered, you will probably end up with a lot more than the 500 nodes you initially thought. Your choice of compaction strategy and compression ratio can dramatically affect this calculation.
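As a rough illustration of how the five points combine, here is a minimal sketch. The function `estimate_nodes` and all of its parameters are hypothetical, and the numbers in the second call are placeholders rather than measurements; real values depend on the data, the compaction strategy and the compression settings. It takes 1PB as 1000TB, matching the 1500-node figure above.

```python
import math

def estimate_nodes(live_data_tb, disk_per_node_tb, rf=3,
                   compression_ratio=1.0, space_amplification=1.0,
                   compaction_headroom=0.0, skew=1.0):
    """Back-of-the-envelope node count (hypothetical helper, not a Cassandra tool).

    compression_ratio:    compressed size / uncompressed size (0.5 halves the data)
    space_amplification:  on-disk size / live data size caused by compaction
    compaction_headroom:  fraction of each disk kept free for running compactions
    skew:                 data on the fullest node / data on the average node
    """
    on_disk_tb = live_data_tb * rf * compression_ratio * space_amplification
    # Size for the fullest node: the average node may only use 1/skew of the budget.
    usable_per_node_tb = disk_per_node_tb * (1.0 - compaction_headroom) / skew
    return math.ceil(on_disk_tb / usable_per_node_tb)

# Replication only, as in point 1.
baseline = estimate_nodes(1000, 2)
print(baseline)  # 1500

# Placeholder values for points 2 to 5; substitute measured ones.
adjusted = estimate_nodes(1000, 2, compression_ratio=0.5,
                          space_amplification=1.5,
                          compaction_headroom=0.5, skew=1.25)
print(adjusted)  # 2813
```

Even with compression halving the data in this made-up case, the headroom and skew terms push the count well past the replication-only baseline.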
On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra. Say I have 1PByte of data to store. The general rule of thumb is that each node (or at least instance of Cassandra) shouldn't handle more than 2TBytes of disk. That means 500 instances of Cassandra.
Assuming you have very fast persistent storage (such as a NetApp, PorterWorx etc.), would using Kubernetes or some orchestration layer to handle those nodes be a viable approach? Perhaps the worker nodes would have enough RAM to run 4 instances (pods) of Cassandra, so you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD devices - one for the OS and four for Cassandra, one per instance running on that server. Then build some scripts/ansible/puppet that would manage Cassandra start/stops, and other maintenance items.
Where I think this runs into problems is with repairs, or sstablescrubs that can take days to run on a single instance. How is that handled 'in the real world'? With seed nodes, how many would you have in such a configuration?
Thanks for any thoughts!

-Joe


