I started to respond, then realized the other posters and I are not thinking about the same thing: what is the business case for availability and for data loss/reload/recoverability? You all argue for higher availability and damn the cost. But no one asked "can you lose access, for 20 minutes, to a portion of the data, 10 times a year, on a 250 node cluster in AWS, if the data itself is not lost?" Or is it worth the cost of a 500 node cluster holding the same data to only lose access 1-2 times a year?
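
To put illustrative numbers on that question, a minimal sketch (the outage counts, durations, and per-node cost below are assumptions for comparison, not measurements):

    # Illustrative only: availability vs. yearly cost for the two scenarios above.
    # Outage frequency, outage duration, and per-node cost are assumed figures.

    MINUTES_PER_YEAR = 365 * 24 * 60

    def availability(outages_per_year, minutes_per_outage):
        """Fraction of the year a given portion of the data is reachable."""
        downtime = outages_per_year * minutes_per_outage
        return 1 - downtime / MINUTES_PER_YEAR

    COST_PER_NODE_YEAR = 10_000  # assumed, only for relative comparison

    scenarios = {
        "250 node cluster": {"nodes": 250, "outages": 10, "minutes": 20},
        "500 node cluster": {"nodes": 500, "outages": 2,  "minutes": 20},  # "1-2 times a year"
    }

    for name, s in scenarios.items():
        avail = availability(s["outages"], s["minutes"])
        cost = s["nodes"] * COST_PER_NODE_YEAR
        print(f"{name}: {avail:.4%} availability of the affected data, ~${cost:,}/year")

The exact figures don't matter; the point is that the business has to decide whether those extra nines are worth roughly doubling the node count.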
Then we can discuss 32/64g JVM and SSDs.

*Arthur C. Clarke famously said that "technology sufficiently advanced is indistinguishable from magic." Magic is coming, and it's coming for all of us....*
*Daemeon Reiydelle*
*email: daeme...@gmail.com <daeme...@gmail.com>*
*LI: https://www.linkedin.com/in/daemeonreiydelle/ <https://www.linkedin.com/in/daemeonreiydelle/>*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*

On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> Was assuming reaper did incremental? That was probably a bad assumption.
>
> nodetool repair -pr
> I know it well now!
>
> :)
>
> -Joe
>
> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> > I don't have experience with Cassandra on Kubernetes, so I can't comment on that.
> >
> > For repairs, may I interest you with incremental repairs? They will make repairs a hell of a lot faster. Of course, an occasional full repair is still needed, but that's another story.
> >
> > On 17/08/2023 21:36, Joe Obernberger wrote:
> >> Thank you. Enjoying this conversation.
> >> Agree on blade servers, where each blade has a small number of SSDs.
> >> Yeh/Nah to a Kubernetes approach assuming fast persistent storage? I think that might be easier to manage.
> >>
> >> In my current benchmarks, the performance is excellent, but the repairs are painful. I come from the Hadoop world where it was all about large servers with lots of disk.
> >> Relatively small number of tables, but some have a high number of rows, 10 billion+ - we use Spark to run across all the data.
> >>
> >> -Joe
> >>
> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
> >>> The optimal node size largely depends on the table schema and read/write pattern. In some cases 500 GB per node is too large, but in some other cases 10 TB per node works totally fine. It's hard to estimate that without benchmarking.
> >>>
> >>> Again, just pointing out the obvious: you did not count the off-heap memory and page cache. 1 TB of RAM for a 24 GB heap * 40 instances is definitely not enough. You'll most likely need between 1.5 and 2 TB of memory for 40x 24 GB heap nodes. You may be better off with blade servers than a single server with gigantic memory and disk sizes.
> >>>
> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
> >>>> Thanks for this - yeah - duh - forgot about replication in my example!
> >>>> So - is 2 TBytes per Cassandra instance advisable? Better to use more/less? Modern 2U servers can be had with 24 3.8 TByte SSDs; so assume 80 TBytes per server, you could do:
> >>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of Cassandra on each server; maybe 24G of heap per instance, so a server with 1 TByte of RAM would work.
> >>>> Is this what folks would do?
> >>>>
> >>>> -Joe
> >>>>
> >>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
> >>>>> Just pointing out the obvious: for 1 PB of data on nodes with 2 TB of disk each, you will need far more than 500 nodes.
> >>>>>
> >>>>> 1. It is unwise to run Cassandra with replication factor 1. It usually makes sense to use RF=3, so 1 PB of data will cost 3 PB of storage space, a minimum of 1,500 such nodes.
> >>>>>
> >>>>> 2. Depending on the compaction strategy you use and the write access pattern, there's a disk space amplification to consider. For example, with STCS, the disk usage can be many times the actual live data size.
> >>>>>
> >>>>> 3. You will need some extra free disk space as temporary space for running compactions.
> >>>>>
> >>>>> 4. The data is rarely going to be perfectly evenly distributed among all nodes; you need to take that into consideration and size the nodes based on the node with the most data.
> >>>>>
> >>>>> 5. Enough of the bad news, here's a good one: compression will save you (a lot of) disk space!
> >>>>>
> >>>>> With all the above considered, you will probably end up with a lot more than the 500 nodes you initially thought. Your choice of compaction strategy and compression ratio can dramatically affect this calculation.
> >>>>>
> >>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
> >>>>>> General question on how to configure Cassandra. Say I have 1 PByte of data to store. The general rule of thumb is that each node (or at least each instance of Cassandra) shouldn't handle more than 2 TBytes of disk. That means 500 instances of Cassandra.
> >>>>>>
> >>>>>> Assuming you have very fast persistent storage (such as a NetApp, PorterWorx, etc.), would using Kubernetes or some orchestration layer to handle those nodes be a viable approach? Perhaps the worker nodes would have enough RAM to run 4 instances (pods) of Cassandra; you would need 125 servers.
> >>>>>> Another approach is to build your servers with 5 (or more) SSD devices - one for the OS, four for the instances of Cassandra running on that server. Then build some scripts/Ansible/Puppet that would manage Cassandra starts/stops and other maintenance items.
> >>>>>>
> >>>>>> Where I think this runs into problems is with repairs, or sstablescrubs, that can take days to run on a single instance. How is that handled 'in the real world'? With seed nodes, how many would you have in such a configuration?
> >>>>>> Thanks for any thoughts!
> >>>>>>
> >>>>>> -Joe
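
To put rough numbers on the sizing points in the quoted thread above, a minimal back-of-the-envelope sketch; the compression ratio, compaction headroom, and imbalance factor are illustrative assumptions you would replace with measured values for your own data and compaction strategy:

    # Back-of-the-envelope node count for ~1 PB of live data.
    # All ratios below are assumptions for illustration only.

    LIVE_DATA_TB        = 1024   # ~1 PB of logical data
    REPLICATION_FACTOR  = 3      # RF=3, as recommended in the thread
    COMPRESSION_RATIO   = 0.5    # assumed: on-disk size / logical size
    COMPACTION_HEADROOM = 1.5    # assumed: temp space for compactions / STCS amplification
    IMBALANCE_FACTOR    = 1.2    # assumed: size for the most-loaded node, not the average
    DISK_PER_NODE_TB    = 2      # the ~2 TB per instance rule of thumb

    raw_tb = (LIVE_DATA_TB * REPLICATION_FACTOR * COMPRESSION_RATIO
              * COMPACTION_HEADROOM * IMBALANCE_FACTOR)
    nodes = raw_tb / DISK_PER_NODE_TB
    print(f"~{raw_tb:.0f} TB of disk -> roughly {nodes:.0f} Cassandra instances")

With these assumed ratios it comes out to roughly 1,400 instances; with a worse compression ratio or heavier STCS amplification it climbs past the 1,500 figure mentioned in the thread.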
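
And the same kind of sketch for the memory question (40 instances at 24 GB heap on one box); the off-heap and page-cache figures are assumptions, but they show why 1 TB of RAM falls short:

    # Rough per-server memory estimate for 40 Cassandra instances with 24 GB heaps.
    # Off-heap, page-cache, and OS figures are assumed for illustration.

    INSTANCES           = 40
    HEAP_GB             = 24
    OFF_HEAP_GB         = 8     # assumed: memtables, bloom filters, index summaries per instance
    PAGE_CACHE_GB_TOTAL = 200   # assumed: OS page cache you actually want for read performance
    OS_OVERHEAD_GB      = 16    # assumed: OS and everything else on the box

    total_gb = INSTANCES * (HEAP_GB + OFF_HEAP_GB) + PAGE_CACHE_GB_TOTAL + OS_OVERHEAD_GB
    print(f"~{total_gb} GB of RAM per server")  # ~1.5 TB, in line with the 1.5-2 TB estimate above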