Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Tolbert, Andy
In the context of what I think initially motivated this hot reloading
capability, a big win it provides is avoiding having to bounce your
cluster as your certificates near expiry.  If not watched closely, you
can end up in a state where the certificate on every node in the
cluster has expired, which is effectively an outage.

I see the appeal of draining connections on a change of trust,
although the necessity of being able to "do it live" (as opposed to
doing a bounce) seems less important than avoiding the outage
condition of your certificates expiring, especially since you can
already sort of do this without bouncing by toggling nodetool
disablebinary/enablebinary.  I agree with Dinesh that most operators
would prefer that it not do that, as interrupting connections can be
disruptive to applications that don't have retries configured, but I
also agree it'd be a nice improvement to support draining existing
connections in some way.
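
For reference, a rough sketch of that workaround, one node at a time
(hostnames and the sleep are placeholders; untested, adjust for your
environment):

    for node in node1 node2 node3; do
      ssh "$node" nodetool disablebinary   # close the native transport and its client connections
      ssh "$node" nodetool enablebinary    # reopen it so clients can reconnect
      sleep 60                             # give drivers time to reestablish before the next node
    done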

+1 on the idea of a "timed connection" capability brought up here,
implemented in a way that connection lifetimes can be dynamically
adjusted.  That way, on a truststore change Cassandra could simply
adjust the connection lifetimes and the connections would either be
disconnected immediately or drained over a time period like Josh
proposed.

Thanks,
Andy


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Tolbert, Andy
I should mention: when toggling disablebinary/enablebinary across
instances, you will probably want to leave some time between nodes so
connections can reestablish, and you will want to verify that they
actually do reestablish.  You also need to be mindful that this is
disruptive to in-flight queries (if your client is configured for
retries it will probably be fine).  Semantically, to your applications
it should look a lot like a rolling cluster bounce.
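
Something like the following can help confirm a node is taking client
connections again before you move on (4.x; output wording varies a bit
by version, so treat it as a sketch):

    nodetool info | grep 'Native Transport active'   # should report true after enablebinary
    nodetool clientstats                             # lists currently connected native protocol clients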

Thanks,
Andy

On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
 wrote:
>
> Thanks Andy for your reply . We will test the scenario you mentioned.
>
> Regards
> Avinash
>
> On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  
> wrote:
>>
>> Hi Avinash,
>>
>> As far as I understand it, if the underlying keystore(s)/truststore(s)
>> Cassandra is configured with are updated, this *will not* prompt
>> Cassandra to interrupt existing connections; the new stores will
>> simply be used for future TLS initialization.
>>
>> Via: 
>> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>>
>> > When the files are updated, Cassandra will reload them and use them for 
>> > subsequent connections
>>
>> I suppose one could do a rolling disablebinary/enablebinary (if it's
>> only client connections) after you roll out a keystore/truststore
>> change as a way of forcing the existing connections to reestablish.
>>
>> Thanks,
>> Andy
>>
>>
>> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>>  wrote:
>> >
>> > Dear Community,
>> >
>> > I hope this email finds you well. I am currently testing SSL certificate 
>> > hot reloading on a Cassandra cluster running version 4.1 and encountered a 
>> > situation that requires your expertise.
>> >
>> > Here's a summary of the process and issue:
>> >
>> > Reloading Process: We reloaded certificates signed by our in-house 
>> > certificate authority into the cluster, which was initially running with 
>> > self-signed certificates. The reload was done node by node.
>> >
>> > Truststore and Keystore: The truststore and keystore passwords are the 
>> > same across the cluster.
>> >
>> > Unexpected Behavior: Despite the different truststore configurations for 
>> > the self-signed and new CA certificates, we observed no breakdown in 
>> > server-to-server communication during the reload. We did not upload the 
>> > new CA cert into the old truststore. We anticipated interruptions due to
>> > the differing truststore configurations but did not encounter any.
>> >
>> > Post-Reload Changes: After reloading, we updated the cqlshrc file with the 
>> > new CA certificate and key to connect to cqlsh.
>> >
>> > server_encryption_options:
>> >   internode_encryption: all
>> >   keystore: ~/conf/server-keystore.jks
>> >   keystore_password: 
>> >   truststore: ~/conf/server-truststore.jks
>> >   truststore_password: 
>> >   protocol: TLS
>> >   algorithm: SunX509
>> >   store_type: JKS
>> >   cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>> >   require_client_auth: true
>> >
>> > client_encryption_options:
>> >   enabled: true
>> >   keystore: ~/conf/server-keystore.jks
>> >   keystore_password: 
>> >   require_client_auth: true
>> >   truststore: ~/conf/server-truststore.jks
>> >   truststore_password: 
>> >   protocol: TLS
>> >   algorithm: SunX509
>> >   store_type: JKS
>> >   cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>> >
>> > Given this situation, I have the following questions:
>> >
>> > Could there be a reason for the continuity of server-to-server 
>> > communication despite the different truststores?
>> > Is there a possibility that the old truststore remains cached even after 
>> > reloading the certificates on a node?
>> > Have others encountered similar issues, and if so, what were your 
>> > solutions?
>> >
>> > Any insights or suggestions would be greatly appreciated. Please let me 
>> > know if further information is needed.
>> >
>> > Thank you
>> >
>> > Best regards,
>> >
>> > Avinash


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Tolbert, Andy
Hi Avinash,

As far as I understand it, if the underlying keystore(s)/truststore(s)
Cassandra is configured with are updated, this *will not* prompt
Cassandra to interrupt existing connections; the new stores will
simply be used for future TLS initialization.

Via: 
https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading

> When the files are updated, Cassandra will reload them and use them for 
> subsequent connections
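
If you want to double check which certificate a node is actually
presenting after the files are updated, something like openssl's
s_client works (a sketch, assuming client encryption on the default
native port 9042; substitute your node's address):

    openssl s_client -connect <node-ip>:9042 -showcerts </dev/null \
      | openssl x509 -noout -subject -issuer -enddate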

I suppose one could do a rolling disablebinary/enablebinary (if it's
only client connections) after you roll out a keystore/truststore
change as a way of forcing the existing connections to reestablish.

Thanks,
Andy


On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
 wrote:
>
> Dear Community,
>
> I hope this email finds you well. I am currently testing SSL certificate hot 
> reloading on a Cassandra cluster running version 4.1 and encountered a 
> situation that requires your expertise.
>
> Here's a summary of the process and issue:
>
> Reloading Process: We reloaded certificates signed by our in-house 
> certificate authority into the cluster, which was initially running with 
> self-signed certificates. The reload was done node by node.
>
> Truststore and Keystore: The truststore and keystore passwords are the same 
> across the cluster.
>
> Unexpected Behavior: Despite the different truststore configurations for the 
> self-signed and new CA certificates, we observed no breakdown in 
> server-to-server communication during the reload. We did not upload the new 
> CA cert into the old truststore. We anticipated interruptions due to the
> differing truststore configurations but did not encounter any.
>
> Post-Reload Changes: After reloading, we updated the cqlshrc file with the 
> new CA certificate and key to connect to cqlsh.
>
> server_encryption_options:
>   internode_encryption: all
>   keystore: ~/conf/server-keystore.jks
>   keystore_password: 
>   truststore: ~/conf/server-truststore.jks
>   truststore_password: 
>   protocol: TLS
>   algorithm: SunX509
>   store_type: JKS
>   cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>   require_client_auth: true
>
> client_encryption_options:
>   enabled: true
>   keystore: ~/conf/server-keystore.jks
>   keystore_password: 
>   require_client_auth: true
>   truststore: ~/conf/server-truststore.jks
>   truststore_password: 
>   protocol: TLS
>   algorithm: SunX509
>   store_type: JKS
>   cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>
> Given this situation, I have the following questions:
>
> Could there be a reason for the continuity of server-to-server communication 
> despite the different truststores?
> Is there a possibility that the old truststore remains cached even after 
> reloading the certificates on a node?
> Have others encountered similar issues, and if so, what were your solutions?
>
> Any insights or suggestions would be greatly appreciated. Please let me know 
> if further information is needed.
>
> Thank you
>
> Best regards,
>
> Avinash


Re: Failed disks - correct procedure

2023-01-16 Thread Tolbert, Andy
Hi Joe,

Reading it back I realized I misunderstood that part of your email, so
you must be using data_file_directories with 16 drives?  That's a lot
of drives!  I imagine this may happen from time to time given that
disks like to fail.

That's a bit of an interesting scenario that I would have to think
about.  If you brought the node up without the bad drive, repairs are
probably going to do a ton of overstreaming if you aren't using 4.0
(https://issues.apache.org/jira/browse/CASSANDRA-3200), which may put
things into a really bad state (lots of streaming = lots of
compactions = slower reads), and you may be seeing some inconsistency
if repairs weren't regularly running beforehand.

How much data was on the drive that failed?  How much data do you
usually have per node?

Thanks,
Andy

On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
 wrote:
>
> Thank you Andy.
> Is there a way to just remove the drive from the cluster and replace it
> later?  Ordering replacement drives isn't a fast process...
> What I've done so far is:
> Stop node
> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
> Restart node
> Run repair
>
> Will that work?  Right now, it's showing all nodes as up.
>
> -Joe
>
> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
> > Hi Joe,
> >
> > I'd recommend just doing a replacement, bringing up a new node with
> > -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
> > described here:
> > https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
> >
> > Before you do that, you will want to make sure a cycle of repairs has
> > run on the replicas of the down node to ensure they are consistent
> > with each other.
> >
> > Make sure you also have 'auto_bootstrap: true' in the yaml of the node
> > you are replacing and that the initial_token matches the node you are
> > replacing (If you are not using vnodes) so the node doesn't skip
> > bootstrapping.  This is the default, but felt worth mentioning.
> >
> > You can also remove the dead node, which should stream data to
> > replicas that will pick up new ranges, but you also will want to do
> > repairs ahead of time too.  To be honest it's not something I've done
> > recently, so I'm not as confident on executing that procedure.
> >
> > Thanks,
> > Andy
> >
> >
> > On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
> >  wrote:
> >> Hi all - what is the correct procedure when handling a failed disk?
> >> Have a node in a 15 node cluster.  This node has 16 drives and cassandra
> >> data is split across them.  One drive is failing.  Can I just remove it
> >> from the list and cassandra will then replicate? If not - what?
> >> Thank you!
> >>
> >> -Joe
> >>
> >>


Re: Failed disks - correct procedure

2023-01-16 Thread Tolbert, Andy
Hi Joe,

I'd recommend just doing a replacement, bringing up a new node with
-Dcassandra.replace_address_first_boot=ip.you.are.replacing as
described here:
https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node

Before you do that, you will want to make sure a cycle of repairs has
run on the replicas of the down node to ensure they are consistent
with each other.
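
Roughly, that cycle looks like the following, run against each live
replica in turn (the keyspace is a placeholder; depending on version
you may need --force so repair skips the down node):

    nodetool repair -pr <keyspace>   # primary-range repair on each remaining replica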

Make sure you also have 'auto_bootstrap: true' in the yaml of the node
you are replacing and that the initial_token matches the node you are
replacing (If you are not using vnodes) so the node doesn't skip
bootstrapping.  This is the default, but felt worth mentioning.
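
As a rough sketch of what the replacement node's config looks like
(the IP and token are placeholders; the JVM flag usually goes in
cassandra-env.sh or jvm-server.options):

    # cassandra-env.sh on the replacement node, before its first start
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"

    # cassandra.yaml (initial_token only if you are not using vnodes)
    auto_bootstrap: true
    initial_token: <token of the node being replaced>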

You can also remove the dead node, which should stream data to
replicas that will pick up new ranges, but you also will want to do
repairs ahead of time too.  To be honest it's not something I've done
recently, so I'm not as confident on executing that procedure.

Thanks,
Andy


On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
 wrote:
>
> Hi all - what is the correct procedure when handling a failed disk?
> Have a node in a 15 node cluster.  This node has 16 drives and cassandra
> data is split across them.  One drive is failing.  Can I just remove it
> from the list and cassandra will then replicate? If not - what?
> Thank you!
>
> -Joe
>
>


Re: Wrong Consistency level seems to be used

2022-07-21 Thread Tolbert, Andy
I'd bet the JIRA that Paul is pointing to is likely what's happening
here.  I'd look for read repair errors in your system logs or in your
metrics (if you have easy access to them).

There are operations that can happen during the course of a query
being executed that may happen at different CLs: atomic batch log
timeouts (CL TWO, I think?) and read repair (especially for CL ALL)
come to mind, and these can make the timeout/unavailable exceptions
include a different CL.  I also remember some DSE features causing
this as well (rbac, auditing, graph and solr stuff).  In newer
versions of C* the errors may be more specific, or a warning may come
along with them depending on what is failing.
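
A rough place to start looking (log wording and stage names vary by
version, so treat this as a sketch rather than exact strings):

    grep -iE 'read.?repair' /var/log/cassandra/system.log | grep -iE 'error|timeout'
    nodetool tpstats   # check read repair / request stages and dropped message counts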

Thanks,
Andy


Re: Enabling SSL on a live cluster

2021-11-09 Thread Tolbert, Andy
Hi Shaurya,

On Tue, Nov 9, 2021 at 11:57 PM Shaurya Gupta 
wrote:

> Hi,
>
> We want to enable node-to-node SSL on a live cluster. Could it be done
> without any down time ?
>

Yup, this is definitely doable for both internode and client connections.
You will have to bounce your Cassandra nodes, but you should be able
to do this without any downtime.  See server_encryption_options
in cassandra.yaml (
https://cassandra.apache.org/doc/4.0/cassandra/configuration/cass_yaml_file.html#server_encryption_options
)


> Would the nodes which have been restarted be able to communicate with the
> nodes which have not yet come up and vice versa ?
>

The idea would be to:

1. Set optional to true in server_encryption_options and bounce the cluster
safely into it.  As nodes come up, they will first attempt to connect to
other nodes via ssl, and fallback on the unencrypted storage_port.
2. Once you have bounced the entire cluster once, switch optional to false
and then bounce the cluster again.
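
To make step 1 concrete, roughly (paths and passwords are
placeholders):

    server_encryption_options:
      internode_encryption: all
      optional: true        # accept unencrypted peers until the whole cluster has been bounced
      keystore: /path/to/server-keystore.jks
      keystore_password: <password>
      truststore: /path/to/server-truststore.jks
      truststore_password: <password>

Then flip optional to false and bounce again for step 2.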

Before 4.0, a separate port (ssl_storage_port) was used for internode
connections over ssl.  In 4.0, storage_port can be used for both
unencrypted and encrypted connections, and enable_legacy_ssl_storage_port
can be used to keep ssl_storage_port around.  Once the entire cluster
is on 4.0, you can set this option to false so storage_port is used
instead of ssl_storage_port.
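
While you're mid-upgrade and still have peers connecting over
ssl_storage_port, that looks roughly like:

    server_encryption_options:
      enable_legacy_ssl_storage_port: true   # keep listening on ssl_storage_port during the transition

and you can drop it back to false once every node is on 4.0.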

One important thing to point out is that prior to C* 4.0, Cassandra
does not hot reload keystore changes, so whenever you update the
certificates in your keystores (e.g. to avoid your certificates
expiring) you would need to bounce your Cassandra instances.  See:
https://cassandra.apache.org/doc/4.0/cassandra/operating/security.html#ssl-certificate-hot-reloading
for an explanation of how that works.

Thanks,
Andy


>
> Regards
> --
> Shaurya Gupta
>
>
>