For our scenario, the goal is to minimize downtime for a single (at
least initially) data center system. Data loss is basically
unacceptable. I wouldn't say we have a "rusty slow data center" - we
can certainly use SSDs and have servers connected via 10G copper to a
fast backplane. For our specific use case with Cassandra (lots of
writes, small number of reads), the network load is usually pretty low.
I suspect that would change if we used Kubernetes + central persistent
storage.
Good discussion.
-Joe
On 8/17/2023 7:37 PM, daemeon reiydelle wrote:
I started to respond, then realized the other posters and I are not
thinking the same: what is the business case for availability and data
loss/reload/recoverability? You all argue for higher availability and
damn the cost. But no one asked "can you lose access to a portion of
the data for 20 minutes, 10 times a year, on a 250 node cluster in
AWS, if it is not lost?" Can you lose access only 1-2 times a year for
the cost of a 500 node cluster holding the same data?
Then we can discuss 32/64 GB JVM heaps and SSDs.
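(For illustration, a rough back-of-the-envelope for the availability
implied by those outage numbers - a sketch, not any particular SLA:)

    # ~20-minute partial outages, ~10 times a year (numbers from the
    # question above), compared against a full year of minutes.
    minutes_per_year = 365 * 24 * 60            # 525,600
    downtime_minutes = 20 * 10                  # 200
    availability = 1 - downtime_minutes / minutes_per_year
    print(f"{availability:.4%}")                # ~99.96% for the affected data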
Arthur C. Clarke famously said that "technology sufficiently advanced
is indistinguishable from magic." Magic is coming, and it's coming for
all of us....

Daemeon Reiydelle
email: daeme...@gmail.com
LI: https://www.linkedin.com/in/daemeonreiydelle/
San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle
On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
I was assuming Reaper did incremental? That was probably a bad
assumption.
nodetool repair -pr
I know it well now!
:)
-Joe
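(For reference, a minimal sketch of scripting the primary-range
repairs mentioned above. Assumptions: nodetool is on the PATH,
"my_keyspace" is a placeholder, and this is a recent Cassandra version
where a plain repair runs incrementally and --full forces a full one.)

    # Hypothetical wrapper around the nodetool commands discussed here.
    import subprocess

    for ks in ["my_keyspace"]:                  # placeholder keyspace name
        # Primary-range repair; incremental by default on recent versions.
        subprocess.run(["nodetool", "repair", "-pr", ks], check=True)
        # Occasionally (e.g. monthly) force a full repair instead:
        # subprocess.run(["nodetool", "repair", "--full", "-pr", ks], check=True)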
On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> I don't have experience with Cassandra on Kubernetes, so I can't
> comment on that.
>
> For repairs, may I interest you in incremental repairs? They will
> make repairs a hell of a lot faster. Of course, an occasional full
> repair is still needed, but that's another story.
>
>
> On 17/08/2023 21:36, Joe Obernberger wrote:
>> Thank you. Enjoying this conversation.
>> Agree on blade servers, where each blade has a small number of SSDs.
>> Yeh/Nah to a Kubernetes approach assuming fast persistent storage? I
>> think that might be easier to manage.
>>
>> In my current benchmarks, the performance is excellent, but the
>> repairs are painful. I come from the Hadoop world where it was all
>> about large servers with lots of disk.
>> Relatively small number of tables, but some have a high number of
>> rows (10 billion+) - we use Spark to run across all the data.
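(Since Spark over the whole dataset came up: a minimal PySpark sketch
using the spark-cassandra-connector. Assumes the connector package is
on the classpath; the host, keyspace, and table names are placeholders.)

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-full-scan")
             .config("spark.cassandra.connection.host", "10.0.0.10")  # placeholder
             .getOrCreate())

    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_ks", table="my_table")                # placeholders
          .load())
    print(df.count())    # scans every node's token ranges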
>>
>> -Joe
>>
>> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>> The optimal node size largely depends on the table schema and
>>> read/write pattern. In some cases 500 GB per node is too large, but
>>> in some other cases 10TB per node works totally fine. It's hard to
>>> estimate that without benchmarking.
>>>
>>> Again, just pointing out the obvious, you did not count the off-heap
>>> memory and page cache. 1TB of RAM for a 24GB heap * 40 instances is
>>> definitely not enough. You'll most likely need between 1.5 and 2 TB
>>> of memory for 40x 24GB heap nodes. You may be better off with blade
>>> servers than a single server with gigantic memory and disk sizes.
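(A rough per-server memory sketch along those lines. The off-heap and
page cache figures below are assumptions; they vary a lot with schema
and settings.)

    instances     = 40
    heap_gb       = 24
    off_heap_gb   = 12     # memtables, bloom filters, index summaries (assumed)
    page_cache_gb = 256    # OS page cache for hot SSTable data (assumed)

    total_gb = instances * (heap_gb + off_heap_gb) + page_cache_gb
    print(total_gb, "GB")  # ~1.7 TB, in line with the 1.5-2 TB estimate above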
>>>
>>>
>>> On 17/08/2023 15:46, Joe Obernberger wrote:
>>>> Thanks for this - yeah - duh - forgot about replication in my
>>>> example!
>>>> So - is 2TBytes per Cassandra instance advisable? Better to use
>>>> more/less? Modern 2U servers can be had with 24 3.8TByte SSDs; so
>>>> assuming 80TBytes per server, you could do:
>>>> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
>>>> Cassandra on each server; maybe 24G of heap per instance, so a
>>>> server with 1TByte of RAM would work.
>>>> Is this what folks would do?
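(Working through that arithmetic as a sketch, with the same
assumptions: RF=3, ~80 TB of usable SSD per server, ~2 TB per
Cassandra instance.)

    import math

    raw_tb          = 1024   # 1 PB of live data
    rf              = 3
    per_server_tb   = 80     # usable space assumed per server above
    per_instance_tb = 2      # rule-of-thumb density per instance

    servers   = math.ceil(raw_tb * rf / per_server_tb)   # 39
    instances = per_server_tb // per_instance_tb         # 40 per server
    print(servers, "servers with", instances, "instances each")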
>>>>
>>>> -Joe
>>>>
>>>> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>>>>> Just pointing out the obvious, for 1PB of data on nodes with 2TB
>>>>> disk each, you will need far more than 500 nodes.
>>>>>
>>>>> 1, it is unwise to run Cassandra with replication factor 1. It
>>>>> usually makes sense to use RF=3, so 1PB of data will cost 3PB of
>>>>> storage space, a minimum of 1500 such nodes.
>>>>>
>>>>> 2, depending on the compaction strategy you use and the write
>>>>> access pattern, there's a disk space amplification to consider.
>>>>> For example, with STCS, the disk usage can be many times the
>>>>> actual live data size.
>>>>>
>>>>> 3, you will need some extra free disk space as temporary space
>>>>> for running compactions.
>>>>>
>>>>> 4, the data is rarely going to be perfectly evenly distributed
>>>>> among all nodes, and you need to take that into consideration and
>>>>> size the nodes based on the node with the most data.
>>>>>
>>>>> 5, enough of the bad news, here's a good one. Compression will
>>>>> save you (a lot of) disk space!
>>>>>
>>>>> With all the above considered, you probably will end up with a
>>>>> lot more than the 500 nodes you initially thought. Your choice of
>>>>> compaction strategy and compression ratio can dramatically affect
>>>>> this calculation.
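(Putting those factors together, a back-of-the-envelope sketch. All
the ratios here are assumptions to be replaced with numbers from your
own benchmarks.)

    import math

    live_tb       = 1024   # 1 PB of live data
    rf            = 3      # replication factor
    compression   = 0.5    # ~2:1 compression ratio (assumed)
    amplification = 1.5    # compaction headroom + uneven distribution (assumed)
    per_node_tb   = 2      # usable disk per instance

    on_disk_tb = live_tb * rf * compression * amplification
    print(on_disk_tb, "TB on disk =>", math.ceil(on_disk_tb / per_node_tb), "nodes")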
>>>>>
>>>>>
>>>>> On 16/08/2023 16:33, Joe Obernberger wrote:
>>>>>> General question on how to configure Cassandra. Say I have
>>>>>> 1PByte of data to store. The general rule of thumb is that each
>>>>>> node (or at least each instance of Cassandra) shouldn't handle
>>>>>> more than 2TBytes of disk. That means 500 instances of Cassandra.
>>>>>>
>>>>>> Assuming you have very fast persistent storage (such as NetApp,
>>>>>> Portworx, etc.), would using Kubernetes or some orchestration
>>>>>> layer to handle those nodes be a viable approach? Perhaps if the
>>>>>> worker nodes had enough RAM to run 4 instances (pods) of
>>>>>> Cassandra each, you would need 125 servers.
>>>>>> Another approach is to build your servers with 5 (or more) SSD
>>>>>> devices - one for the OS, and one for each of the four instances
>>>>>> of Cassandra running on that server. Then build some
>>>>>> scripts/ansible/puppet that would manage Cassandra start/stops
>>>>>> and other maintenance items.
>>>>>>
>>>>>> Where I think this runs into problems is with repairs, or
>>>>>> sstablescrubs that can take days to run on a single instance.
>>>>>> How is that handled 'in the real world'? With seed nodes, how
>>>>>> many would you have in such a configuration?
>>>>>> Thanks for any thoughts!
>>>>>>
>>>>>> -Joe
>>>>>>
>>>>>>
>>>>
>>