Ioannis Stoltidis created CASSJAVA-106:
------------------------------------------
Summary: Gauge counters for open-connections not updated after
Cassandra pod recreation in geographical redundant setup
Key: CASSJAVA-106
URL: https://issues.apache.org/jira/browse/CASSJAVA-106
Project: Apache Cassandra Java driver
Issue Type: Bug
Reporter: Ioannis Stoltidis
We are running a containerized version of Cassandra in a geographically redundant
setup with 2 datacenters. Each datacenter contains three Cassandra pods, which
are managed as part of a Cassandra StatefulSet. Every pod has an associated
Kubernetes service with a load balancer IP address. This IP remains constant
and serves as the hostname for internode communication among all Cassandra
pods. Additionally, each datacenter includes a pod running our application,
which uses the Cassandra driver to communicate with the pool of Cassandra pods.
We utilize the DataStax Java driver configured as follows:
* Two contact points are specified, connecting to two hosts (the first 2 pods,
named cassandra-datacenter1_rack1-0 and cassandra-datacenter1_rack1-1).
* After all the endpoints are discovered, one connection per server in the
local DC is established, along with one control connection.
The mapping between host domains and IP addresses is as follows:
||domain||IP||
|cassandra-datacenter1_rack1-0|214.22.161.195|
|cassandra-datacenter1_rack1-1|214.22.161.196|
|cassandra-datacenter1_rack1-2|214.22.161.197|
While monitoring Cassandra connections using gauge counters exposed via the
Dropwizard exporter, we observed that some counters show domain names while
others display IP addresses, and at least one counter appears duplicated.
The following 4 gauge counters are being observed:
{noformat}
s0.nodes.214_22_161_196:9042.pool.open-connections → initial value: 1
s0.nodes.214_22_161_197:9042.pool.open-connections → initial value: 2
s0.nodes.cassandra-datacenter1-rack1-0_cassandra-datacenter1-rack1:9042.pool.open-connections → initial value: 1
s0.nodes.cassandra-datacenter1-rack1-1_cassandra-datacenter1-rack1:9042.pool.open-connections → initial value: 0
{noformat}
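The mix of IP-based and hostname-based names is consistent with each node's metric prefix being derived from the host string of its endpoint, with dots replaced by underscores. The sketch below reproduces two of the names above under that assumption. MetricPrefixSketch, metricPrefix and openConnectionsGauge are hypothetical names used for illustration, not driver API, and the dotted hostname is reconstructed from the gauge name, so it is an assumption as well.

```java
import java.net.InetSocketAddress;

// Illustrative sketch only: these helpers are hypothetical and not part of the driver API.
class MetricPrefixSketch {

    // Assumption: the node prefix is "<host string with '.' replaced by '_'>:<port>".
    public static String metricPrefix(InetSocketAddress address) {
        return address.getHostString().replace('.', '_') + ":" + address.getPort();
    }

    // Builds the full gauge name in the shape seen in the listing above.
    public static String openConnectionsGauge(String sessionName, InetSocketAddress address) {
        return sessionName + ".nodes." + metricPrefix(address) + ".pool.open-connections";
    }

    public static void main(String[] args) {
        // createUnresolved performs no DNS lookup, so the host string is kept as given.
        InetSocketAddress byIp = InetSocketAddress.createUnresolved("214.22.161.196", 9042);
        // Assumed hostname, inferred by turning the '_' in the gauge name back into '.'.
        InetSocketAddress byName = InetSocketAddress.createUnresolved(
            "cassandra-datacenter1-rack1-1.cassandra-datacenter1-rack1", 9042);

        System.out.println(openConnectionsGauge("s0", byIp));
        System.out.println(openConnectionsGauge("s0", byName));
    }
}
```

If 214.22.161.196 and the rack1-1 hostname refer to the same pod, as the mapping table suggests, these two names would be two gauges for a single node, which may explain the apparent duplicate.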
We then tested the following recovery procedure on 2 of the 3 pods in the local
datacenter:
* Halt the Cassandra container using:
echo STOPPED > /var/lib/cassandra/.cassandra.init && pkill java
* Remove the Persistent Volume Claim (PVC) associated with each of the two pods
* Run nodetool removenode on the cluster to clean up the old instances
* Restart the two pods and re-enable Cassandra using:
echo RUNNING > /var/lib/cassandra/.cassandra.init
After this procedure, the gauge counters are no longer updated accurately.
Specifically, they change to:
{noformat}
s0.nodes.214_22_161_196:9042.pool.open-connections → 0
s0.nodes.214_22_161_197:9042.pool.open-connections → 2
s0.nodes.cassandra-datacenter1-rack1-0_cassandra-datacenter1-rack1:9042.pool.open-connections → 0
s0.nodes.cassandra-datacenter1-rack1-1_cassandra-datacenter1-rack1:9042.pool.open-connections → 0
{noformat}
No other counters are created. These values remain stuck and do not reflect the
actual state of the connection pool: on the server side we can verify that all
expected connections are up again (one connection per server plus one control
connection). The values are corrected only when we manually restart the
application pod that uses the DataStax Java driver, which recreates the session.
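The mismatch can be made concrete by summing the gauges and comparing the total with the expected one connection per local node plus one control connection. The sketch below does this arithmetic with the values reported in this issue. It assumes that the control connection is counted in the gauge of whichever node hosts it, which would explain the initial value of 2 on 214_22_161_197. GaugeSumCheck and its methods are hypothetical names, not driver API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative arithmetic only: hypothetical helper, not part of the driver API.
class GaugeSumCheck {

    // Expected total from the report: one connection per local node plus one control connection.
    public static int expectedTotal(int localNodes) {
        return localNodes + 1;
    }

    // Sums the values of all open-connections gauges.
    public static int reportedTotal(Map<String, Integer> gauges) {
        int sum = 0;
        for (int value : gauges.values()) {
            sum += value;
        }
        return sum;
    }

    // Gauge values reported before the recovery procedure (keys shortened to the node prefix).
    public static Map<String, Integer> before() {
        Map<String, Integer> gauges = new LinkedHashMap<>();
        gauges.put("214_22_161_196:9042", 1);
        gauges.put("214_22_161_197:9042", 2);
        gauges.put("cassandra-datacenter1-rack1-0_cassandra-datacenter1-rack1:9042", 1);
        gauges.put("cassandra-datacenter1-rack1-1_cassandra-datacenter1-rack1:9042", 0);
        return gauges;
    }

    // Gauge values reported after the recovery procedure.
    public static Map<String, Integer> after() {
        Map<String, Integer> gauges = new LinkedHashMap<>();
        gauges.put("214_22_161_196:9042", 0);
        gauges.put("214_22_161_197:9042", 2);
        gauges.put("cassandra-datacenter1-rack1-0_cassandra-datacenter1-rack1:9042", 0);
        gauges.put("cassandra-datacenter1-rack1-1_cassandra-datacenter1-rack1:9042", 0);
        return gauges;
    }

    public static void main(String[] args) {
        System.out.println("expected = " + expectedTotal(3));        // expected = 4
        System.out.println("before   = " + reportedTotal(before())); // before   = 4
        System.out.println("after    = " + reportedTotal(after()));  // after    = 2
    }
}
```

Under these assumptions the initial gauges add up to the expected 4, while after the recovery procedure they add up to 2, so two connections that are live on the server side are not reflected in any gauge.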
*Expected behavior:*
Gauge counters should reflect the actual number of open connections even after
the Cassandra pods are deleted and recreated.
*Observed behavior:*
After pod recreation and node replacement, the counters stay at incorrect
values until the client session is forcibly reset by restarting the application.
*Environment:*
Cassandra: containerized
Java driver: DataStax Java driver (version 4.19.0)
Monitoring via: simpleclient_dropwizard of io.prometheus
Setup: Geo-redundant, 2 datacenters, 3 pods per datacenter
*Impact:*
This behavior results in stale monitoring data and obscures actual cluster
health and connectivity, particularly in automated or production setups.