[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Orlowski updated CASSANDRA-19598:

Attachment: (was: image-2024-04-29-20-40-53-382.png)




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug report against version 4.18.0 of the Java driver.

 

I am running in an environment with 3 Cassandra nodes. We have a use case that
requires redeploying the cluster from the ground up at midnight every day. This
means that all 3 nodes become unavailable for a short period of time, and 3 new
nodes with 3 new IP addresses get spun up and placed behind the contact-point
hostname. We provide a single hostname as the contact point. If you set
{{advanced.resolve-contact-points}} to {{false}}, the Java driver should
re-resolve the hostname for every new connection to that node. This works
before and during the first redeployment, but the unresolved hostname is
clobbered during the reconnection process and replaced with a resolved IP
address, making all further redeployments fail.
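
For reference, a minimal sketch of the configuration being described (the
contact-point hostname is a placeholder; both options are standard driver 4.x
settings):

{code}
# application.conf (DataStax Java driver 4.x)
datastax-java-driver {
  # Single hostname contact point (placeholder name).
  basic.contact-points = [ "cassandra.internal.example:9042" ]
  # Keep the contact point unresolved so the hostname is looked up again
  # for every new connection attempt.
  advanced.resolve-contact-points = false
}
{code}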

 

In our case, all 3 nodes become unavailable while our CI/CD process destroys
the existing cluster and replaces it with a new one. During the window of
unavailability, the Java driver attempts to reconnect to each node; internally
(within the driver), two of these nodes have resolved IP addresses and one
retains the unresolved hostname. Here is a screenshot that captures the
internal state of the 3 nodes within {{PoolManager}} before the redeployment of
the cluster has finished. Note that there are 2 resolved IP addresses and 1
unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This 2:1 ratio of resolved IP addresses to unresolved hostnames is the correct
internal state for a 3-node cluster when {{advanced.resolve-contact-points}} is
set to {{false}}.
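
At the JDK level, the difference between the two kinds of entries can be
sketched as follows (placeholder hostname; plain {{java.net}}, nothing
driver-specific). An unresolved {{InetSocketAddress}} keeps the name and can be
looked up again at connect time, while a resolved one pins the IP at creation:

{code:java}
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class ResolveDemo {
    public static void main(String[] args) throws Exception {
        // Unresolved: keeps the hostname; no DNS lookup has happened yet.
        InetSocketAddress byName =
                InetSocketAddress.createUnresolved("cassandra.internal.example", 9042);
        System.out.println(byName.isUnresolved()); // true

        // Resolved: the IP is pinned when the object is created. If DNS later
        // points the name at the replacement nodes, this entry never notices.
        InetSocketAddress pinned =
                new InetSocketAddress("cassandra.internal.example", 9042);

        // Re-resolving the kept hostname picks up the new DNS records.
        InetAddress fresh = InetAddress.getByName(byName.getHostString());
        System.out.println(fresh.getHostAddress());
    }
}
{code}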

Eventually, the hostname points to one of the 3 new valid nodes, and the Java
driver reconnects and discovers the new peers. However, as part of this
reconnection process, the internal Node that held the unresolved hostname is
overwritten with a Node that has a resolved IP address:
!image-2024-04-29-22-57-26-910.png|width=753,height=107!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname;
instead, we have 3 resolved IP addresses, which is an incorrect internal state
when {{advanced.resolve-contact-points}} is set to {{false}}. One of the nodes
should have retained the unresolved hostname.
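
One way to picture the reported transition (purely illustrative; the registry
and names below are invented for this sketch and are not the driver's actual
implementation) is a node map keyed by address, where re-registering the
reconnected contact point under the address it happened to resolve to displaces
the hostname entry:

{code:java}
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

public class ClobberSketch {
    public static void main(String[] args) {
        // Hypothetical node registry; not the driver's real data structure.
        Map<InetSocketAddress, String> nodes = new HashMap<>();
        InetSocketAddress hostnameEntry =
                InetSocketAddress.createUnresolved("cassandra.internal.example", 9042);
        nodes.put(hostnameEntry, "contact point (unresolved)");
        nodes.put(new InetSocketAddress("10.0.0.5", 9042), "peer 1");
        nodes.put(new InetSocketAddress("10.0.0.6", 9042), "peer 2");
        // Correct 2:1 state: two resolved peers, one unresolved contact point.

        // After the redeployment, reconnection re-registers the contact point
        // under whichever address the hostname resolved to this time...
        nodes.remove(hostnameEntry);
        nodes.put(new InetSocketAddress("10.0.1.7", 9042), "contact point (resolved)");
        // ...leaving 3 resolved entries and no hostname left to re-resolve on
        // the next redeployment.
        System.out.println(nodes);
    }
}
{code}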

At this stage, the Java driver no longer queries the hostname for new
connections, and further redeployments fail because the hostname is no longer
among the nodes queried for reconnection. This forces us to restart the
application.
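
As a diagnostic, the driver's public metadata API (4.x) can be used to watch
this state across a redeployment and confirm whether the hostname entry has
been lost, along the lines of:

{code:java}
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.metadata.Node;

public final class NodeStateLogger {
    // Prints each driver-tracked node's endpoint and state. Comparing the
    // output before and after a redeployment shows whether an endpoint with
    // an unresolved hostname is still among the tracked nodes.
    public static void dump(CqlSession session) {
        for (Node node : session.getMetadata().getNodes().values()) {
            System.out.println(node.getEndPoint() + " state=" + node.getState());
        }
    }
}
{code}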


[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Orlowski updated CASSANDRA-19598:

Attachment: image-2024-04-29-22-57-26-910.png


[jira] [Commented] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Bret McGuire (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842228#comment-17842228 ]

Bret McGuire commented on CASSANDRA-19598:
--

No worries [~shot_up], you're good... and thanks for bringing it to my
attention!

 



[jira] [Commented] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842214#comment-17842214 ]

Andrew Orlowski commented on CASSANDRA-19598:
-

cc [~absurdfarce] (sorry if I shouldn't have - just saw it was done on another
thread)

> advanced.resolve-contact-points: unresolved hostname being clobbered during 
> reconnection
> 
>
> Key: CASSANDRA-19598
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19598
> Project: Cassandra
>  Issue Type: Bug
>  Components: Client/java-driver
>Reporter: Andrew Orlowski
>Priority: Normal
> Attachments: image-2024-04-29-20-13-56-161.png, 
> image-2024-04-29-20-40-53-382.png
>
>
> Hello, this is a bug ticket for 4.18.0 of the Java driver.
>  
> I am running in an environment where I have 3 Cassandra nodes. We have a use 
> case to redeploy the cluster from the ground up at midnight every day. This 
> means that all 3 nodes become unavailable for a short period of time and 3 
> new nodes with 3 new ip addresses get spun up and placed behind the contact 
> point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the 
> java driver should re-resolve the hostname for every new connection to that 
> node. This occurs prior to and for the first redeployment, but the unresolved 
> hostname is clobbered during the reconnection process and replaced with a 
> resolved IP address, making additional redeployments fruitless.
>  
> In our case, what is happening is that all 3 nodes become unavailable while 
> our CICD process is destroying the existing cluster and replacing it with a 
> new one. During the window of unavailability, the Java driver attempts to 
> reconnect to each node, two of which internally (internal to the driver) have 
> resolved IP addresses and one of which retains the unresolved hostname. Here 
> is a screenshot that captures the internal state of the 3 nodes within 
> `PoolManager` prior to the finished redeployment of the cluster. Note that 
> there are 2 resolved IP addresses and 1 unresolved hostname.
> !image-2024-04-29-20-13-56-161.png|width=985,height=181!
> This ratio of resolved IP:unresolved hostname is the correct internal state 
> for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.
> Eventually, the hostname points to one of the 3 new valid nodes, and the java 
> driver reconnects and discovers the new peers. However, as part of this 
> reconnection process, the internal Node that held the unresolved hostname is 
> now overwritten with a Node that has the resolved IP address:
> !image-2024-04-29-20-40-53-382.png|width=1080,height=102!
> Note that we no longer have 2 resolved IP addresses and 1 unresolved 
> hostname; rather, we have 3 resolved IP addresses, which is an incorrect 
> internal state when `advanced.resolve-contact-points` is set to `FALSE`. One 
> of the nodes should have retained the unresolved hostname.
> At this stage, the Java driver no longer queries the hostname for new 
> connections, and further redeployments of ours result in failure because the 
> hostname is no longer amongst the list of nodes that are queried for 
> reconnection. This causes us to need to restart the application. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address, making additional redeployments fruitless.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and discovers the new peers. However, as part of this 
reconnection process, the internal Node that held the unresolved hostname is 
now overwritten with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png|width=1080,height=102!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`. One of the nodes 
should have retained the unresolved hostname.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address, making additional redeployments fruitless.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and discovers the new peers. However, as part of this 
reconnection process, the internal Node that held the unresolved hostname is 
now overwritten with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png|width=1080,height=102!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostn

[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address, making additional redeployments fruitless.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and discovers the new peers. However, as part of this 
reconnection process, the internal Node that held the unresolved hostname is 
now overwritten with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png|width=1080,height=102!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and discovers the new peers. However, as part of this 
reconnection process, the internal Node that held the unresolved hostname is 
now overwritten with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png|width=1080,height=102!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to ne

[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and discovers the new peers. However, as part of this 
reconnection process, the internal Node that held the unresolved hostname is 
now overwritten with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png|width=1080,height=102!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png|width=1080,height=102!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 


> advance

[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png|width=1080,height=102!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 


> advanced.resolve-contact-points: unresolved hostname being

[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. During the window of unavailability, the Java driver attempts to reconnect 
to each node, two of which internally (internal to the driver) have resolved IP 
addresses and one of which retains the unresolved hostname. Here is a 
screenshot that captures the internal state of the 3 nodes within `PoolManager` 
prior to the finished redeployment of the cluster. Note that there are 2 
resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster. Note that there are 2 resolved IP addresses and 1 unresolved 
hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 


> advanced.resolve-contact-points: unresolved hostname being clobbered during 
> reconnection
> -

[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the prior to and for the first redeployment, but the unresolved 
hostname is clobbered during the reconnection process and replaced with a 
resolved IP address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster. Note that there are 2 resolved IP addresses and 1 unresolved 
hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the first redeployment, but the unresolved hostname is 
clobbered during the reconnection process and replaced with a resolved IP 
address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster. Note that there are 2 resolved IP addresses and 1 unresolved 
hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 


> advanced.resolve-contact-points: unresolved hostname being clobbered during 
> reconnection
> 
>
> Key: CASSANDRA-19598

[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the first redeployment, but the unresolved hostname is 
clobbered during the reconnection process and replaced with a resolved IP 
address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster. Note that there are 2 resolved IP addresses and 1 unresolved 
hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the first redeployment, but the unresolved hostname is 
clobbered during the reconnection process and replaced with a resolved IP 
address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster. Note that there are 2 resolved IP addresses and 1 unresolved 
hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 resolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 


> advanced.resolve-contact-points: unresolved hostname being clobbered during 
> reconnection
> 
>
> Key: CASSANDRA-19598
>

[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Orlowski updated CASSANDRA-19598:

Description: 
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the first redeployment, but the unresolved hostname is 
clobbered during the reconnection process and replaced with a resolved IP 
address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster. Note that there are 2 resolved IP addresses and 1 unresolved 
hostname.

!image-2024-04-29-20-13-56-161.png!

This ratio of resolved IP:unresolved hostname is the correct internal state for 
a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 resolved hostname; 
rather, we have 3 resolved IP addresses, which is an incorrect internal state 
when `advanced.resolve-contact-points` is set to `FALSE`.

At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 

  was:
Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the first redeployment, but the unresolved hostname is 
clobbered during the reconnection process and replaced with a resolved IP 
address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster.

!image-2024-04-29-20-13-56-161.png!

Note that two of the nodes are dynamically discovered peers with resolved IP 
addresses, while one node is the unresolved contact point. This is the correct 
internal state when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 resolved hostname; 
rather, we have 3 resolved IP addresses.


At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 


> advanced.resolve-contact-points: unresolved hostname being clobbered during 
> reconnection
> 
>
> Key: CASSANDRA-19598
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19598
> Project: Cassandra

[jira] [Created] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection

2024-04-29 Thread Andrew Orlowski (Jira)
Andrew Orlowski created CASSANDRA-19598:
---

 Summary: advanced.resolve-contact-points: unresolved hostname 
being clobbered during reconnection
 Key: CASSANDRA-19598
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19598
 Project: Cassandra
  Issue Type: Bug
  Components: Client/java-driver
Reporter: Andrew Orlowski
 Attachments: image-2024-04-29-20-13-56-161.png, 
image-2024-04-29-20-40-53-382.png

Hello, this is a bug ticket for 4.18.0 of the Java driver.

 

I am running in an environment where I have 3 Cassandra nodes. We have a use 
case to redeploy the cluster from the ground up at midnight every day. This 
means that all 3 nodes become unavailable for a short period of time and 3 new 
nodes with 3 new ip addresses get spun up and placed behind the contact point 
hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java 
driver should re-resolve the hostname for every new connection to that node. 
This occurs for the first redeployment, but the unresolved hostname is 
clobbered during the reconnection process and replaced with a resolved IP 
address.

 

In our case, what is happening is that all 3 nodes become unavailable while our 
CICD process is destroying the existing cluster and replacing it with a new 
one. The Java driver attempts to reconnect to each node, two of which 
internally (internal to the driver) have resolved IP addresses and one of which 
retains the unresolved hostname. Here is a screenshot that captures the 
internal state of the 3 nodes within `PoolManager` prior to the redeployment of 
the cluster.

!image-2024-04-29-20-13-56-161.png!

Note that two of the nodes are dynamically discovered peers with resolved IP 
addresses, while one node is the unresolved contact point. This is the correct 
internal state when `advanced.resolve-contact-points` is set to `FALSE`.

Eventually, the hostname points to one of the 3 new valid nodes, and the java 
driver reconnects and resets the pool. However, as part of this reconnection 
process, the internal Node that held the unresolved hostname is now overwritten 
with a Node that has the resolved IP address:
!image-2024-04-29-20-40-53-382.png!
Note that we no longer have 2 resolved IP addresses and 1 resolved hostname; 
rather, we have 3 resolved IP addresses.


At this stage, the Java driver no longer queries the hostname for new 
connections, and further redeployments of ours result in failure because the 
hostname is no longer amongst the list of nodes that are queried for 
reconnection. This causes us to need to restart the application. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19556) Guardrail to block DDL/DCL queries

2024-04-29 Thread Yuqi Yan (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Yan updated CASSANDRA-19556:
-
Fix Version/s: 5.x
  Description: 
Sometimes we want to block DDL/DCL queries to stop new schemas or roles from 
being created (e.g. when doing a live upgrade).

For the DDL guardrail, the current implementation won't block the query if it 
is a no-op (e.g. CREATE TABLE ... IF NOT EXISTS when the table already exists; 
the guardrail check is added in apply() right after all the existence checks).

I don't have a preference between blocking every DDL query and checking 
whether it is a no-op here. It's just that we have some users who always run 
CREATE ... IF NOT EXISTS ... at startup, which is a no-op but would be blocked 
by this guardrail, causing startup to fail.
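
As an illustration of the pattern described above (a hedged sketch; the 
keyspace, table, and class names are placeholders, not from the ticket), this 
is the kind of idempotent startup DDL that is a no-op when the table already 
exists but would still be rejected if the guardrail blocks every DDL query:

{code:java}
import com.datastax.oss.driver.api.core.CqlSession;

public class SchemaInit
{
    // Typical application startup: idempotent schema creation.
    // If the table already exists this changes nothing, but a guardrail that
    // blocks every DDL statement would reject it and startup would fail.
    public static void ensureSchema(CqlSession session)
    {
        session.execute(
            "CREATE TABLE IF NOT EXISTS app_ks.user_events (" +
            "  user_id uuid," +
            "  event_time timestamp," +
            "  payload text," +
            "  PRIMARY KEY (user_id, event_time))");
    }
}
{code}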

 

4.1 PR: [https://github.com/apache/cassandra/pull/3248]

trunk PR: [https://github.com/apache/cassandra/pull/3275]

 

  was:
Sometimes we want to block DDL/DCL queries to stop new schemas being created or 
roles created. (e.g. when doing live-upgrade)

For DDL guardrail current implementation won't block the query if it's no-op 
(e.g. CREATE TABLE...IF NOT EXISTS, but table already exists, etc. The 
guardrail check is added in apply() right after all the existence check)

I don't have preference on either block every DDL query or check whether if 
it's no-op here. Just we have some users always run CREATE..IF NOT EXISTS.. at 
startup, which is no-op but will be blocked by this guardrail and failed to 
start.

 

4.1 PR: [https://github.com/apache/cassandra/pull/3248]

 


> Guardrail to block DDL/DCL queries
> --
>
> Key: CASSANDRA-19556
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19556
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Feature/Guardrails
>Reporter: Yuqi Yan
>Assignee: Yuqi Yan
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Sometimes we want to block DDL/DCL queries to stop new schemas or roles from 
> being created (e.g. when doing a live upgrade).
> For the DDL guardrail, the current implementation won't block the query if it 
> is a no-op (e.g. CREATE TABLE ... IF NOT EXISTS when the table already 
> exists; the guardrail check is added in apply() right after all the existence 
> checks).
> I don't have a preference between blocking every DDL query and checking 
> whether it is a no-op here. It's just that we have some users who always run 
> CREATE ... IF NOT EXISTS ... at startup, which is a no-op but would be 
> blocked by this guardrail, causing startup to fail.
>  
> 4.1 PR: [https://github.com/apache/cassandra/pull/3248]
> trunk PR: [https://github.com/apache/cassandra/pull/3275]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status

2024-04-29 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842182#comment-17842182
 ] 

Cameron Zemek commented on CASSANDRA-19580:
---

I don't understand why Gossiper::examineGossiper is implemented to iterate only 
over the digests in the SYN message. Why doesn't it also send back, in the 
delta, the entries missing from the digest list?

> Unable to contact any seeds with node in hibernate status
> -
>
> Key: CASSANDRA-19580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Cameron Zemek
>Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I 
> have been able to reproduce this issue if I kill Cassandra as it's joining, 
> which will put the node into hibernate status. Once a node is in hibernate it 
> will no longer receive any SYN messages from other nodes during startup, and 
> as it sends only itself as a digest in outbound SYN messages, it never 
> receives any states in any of the ACK replies. So once it gets to the 
> `seenAnySeed` check, it fails as the endpointStateMap is empty.
>  
> A workaround is copying the system.peers table from another node, but this is 
> less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
>     /* Possibly gossip to a seed for facilitating partition healing */
>     private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
>     {
>         int size = seeds.size();
>         if (size > 0)
>         {
>             if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>             {
>                 return;
>             }
>             if (liveEndpoints.size() == 0)
>             {
>                 List<GossipDigest> gDigests = prod.payload.gDigests;
>                 if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>                 {
>                     gDigests = new ArrayList<GossipDigest>();
>                     GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                            DatabaseDescriptor.getPartitionerName(),
>                                                                            gDigests);
>                     MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                           digestSynMessage,
>                                                                                           GossipDigestSyn.serializer);
>                     sendGossip(message, seeds);
>                 }
>                 else
>                 {
>                     sendGossip(prod, seeds);
>                 }
>             }
>             else
>             {
>                 /* Gossip with the seed with some probability. */
>                 double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
>             }
>         }
>     }
> {code}
> The only problem is that this is the same as the SYN from the shadow round. 
> It does resolve the issue, however, as the node then receives an ACK with all 
> the states.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19578) Concurrent equivalent schema updates lead to unresolved disagreement

2024-04-29 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842181#comment-17842181
 ] 

Jordan West commented on CASSANDRA-19578:
-

4.1 is when it was introduced, but afaik the issue wasn't fixed, so it's likely 
broken in 5.0.x as well -- or any code post-4.1 where TCM isn't merged.

> Concurrent equivalent schema updates lead to unresolved disagreement
> 
>
> Key: CASSANDRA-19578
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19578
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Chris Lohfink
>Priority: Normal
> Fix For: 4.1.x, 5.0.x
>
>
> As part of CASSANDRA-17819 a check for empty schema changes was added to the 
> updateSchema. This only looks at the _logical_ schema difference of the 
> schemas, but the changes made to the system_schema keyspace are the ones that 
> actually are involved in the digest.
> If two nodes issue the same CREATE statement, the difference from the 
> keyspace.diff would be empty, but the timestamps on the mutations would be 
> different, leading to a pseudo schema disagreement which will never resolve 
> until resetlocalschema is run or the nodes are bounced.
> Only impacts 4.1
> test and fix : 
> https://github.com/clohfink/cassandra/commit/ba915f839089006ac6d08494ef19dc010bcd6411
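
The timestamp sensitivity is easy to demonstrate outside Cassandra. Below is a 
self-contained sketch (not the actual schema digest code; the serialization 
here is invented) showing that hashing the same logical change with two 
different mutation timestamps yields two different digests:

{code:java}
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SchemaDigestDemo
{
    // Stand-in for serializing a system_schema mutation: the logical content
    // is identical, but the write timestamp is part of the hashed bytes.
    static String digestOf(String schemaChange, long writeTimestamp) throws Exception
    {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(schemaChange.getBytes(StandardCharsets.UTF_8));
        md5.update(Long.toString(writeTimestamp).getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, md5.digest()).toString(16);
    }

    public static void main(String[] args) throws Exception
    {
        String change = "CREATE TABLE ks.t (id int PRIMARY KEY)";
        // Two nodes apply the "same" statement a few microseconds apart:
        System.out.println(digestOf(change, 1714400000000001L));
        System.out.println(digestOf(change, 1714400000000042L)); // differs
    }
}
{code}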



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19341) Relation and Restriction hierarchies are too complex and error prone

2024-04-29 Thread Ekaterina Dimitrova (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekaterina Dimitrova updated CASSANDRA-19341:

Fix Version/s: 5.x

> Relation and Restriction hierarchies are too complex and error prone
> ---
>
> Key: CASSANDRA-19341
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19341
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL/Interpreter
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 21h 50m
>  Remaining Estimate: 0h
>
> The {{Relation}} and {{Restriction}} hierarchies were designed when C* 
> supported only a limited set of operators and column expressions (single 
> column, multi-column and token expressions). Over time they have grown in 
> complexity, making the code harder to understand and modify, and error prone. 
> Their design also results in unnecessary limitations that could easily be 
> lifted, like the ability to accept different predicates on the same column.
> Today, adding a new operator requires a lot of glue code and surgical changes 
> across the CQL layer, making patches for features such as CASSANDRA-18584 
> more complex than they should be.
> The goal of this ticket is to simplify the {{Relation}} and {{Restriction}} 
> hierarchies and modify the operator class so that adding new operators 
> requires changes only to the {{Operator}} class and the ANTLR file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19341) Relation and Restriction hierarchies are too complex and error prone

2024-04-29 Thread Ekaterina Dimitrova (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekaterina Dimitrova updated CASSANDRA-19341:

Reviewers: Berenguer Blasi, Ekaterina Dimitrova  (was: Ekaterina Dimitrova)

> Relation and Restriction hierarchies are too complex and error prone
> ---
>
> Key: CASSANDRA-19341
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19341
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL/Interpreter
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>Priority: Normal
>  Time Spent: 21h 50m
>  Remaining Estimate: 0h
>
> The {{Relation}} and {{Restriction}} hierarchies were designed when C* 
> supported only a limited set of operators and column expressions (single 
> column, multi-column and token expressions). Over time they have grown in 
> complexity, making the code harder to understand and modify, and error prone. 
> Their design also results in unnecessary limitations that could easily be 
> lifted, like the ability to accept different predicates on the same column.
> Today, adding a new operator requires a lot of glue code and surgical changes 
> across the CQL layer, making patches for features such as CASSANDRA-18584 
> more complex than they should be.
> The goal of this ticket is to simplify the {{Relation}} and {{Restriction}} 
> hierarchies and modify the operator class so that adding new operators 
> requires changes only to the {{Operator}} class and the ANTLR file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19182) IR may leak SSTables with pending repair when coming from streaming

2024-04-29 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842169#comment-17842169
 ] 

David Capwell commented on CASSANDRA-19182:
---

I just got back from vacation, I don't see this merged yet so going to restart 
the merge process.

> IR may leak SSTables with pending repair when coming from streaming
> ---
>
> Key: CASSANDRA-19182
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19182
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: 
> ci_summary-trunk-a1010f4101bf259de3f31077540e4f987d5df9c5.html
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> There is a race condition where SSTables from streaming may race with pending 
> repair cleanup in compaction causing us to cleanup the pending repair state 
> in compaction while the SSTables are being added to it; this leads to IR 
> failing in the future when those files get selected for repair.
> This problem was hard to track down as the in-memory state was wiped, so we 
> don’t have any details. To better aid these types of investigations, we 
> should make sure the repair vtables get updated when IR session failures are 
> submitted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842171#comment-17842171
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

This is great, thank you for testing!

My 100s timeout was erring (probably too far) on the side of sticking to the 
old behaviour. I was slightly concerned that people would see timeouts and 
conclude this is not something they want. But unfortunately there’s no way for 
us to produce a reasonable workload balance without shedding some load and 
timing out lagging requests. I will update the default to 12s.
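
As a generic illustration of the bounded-queue-plus-shedding idea (a sketch, 
not the actual patch; the pool size, queue length, and the 12s figure are just 
example numbers), a request can carry its enqueue deadline so that lagging 
work is dropped instead of processed:

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SheddingExecutor
{
    // Bounded queue: when full, submissions fail fast instead of piling up
    // hundreds of thousands of requests behind an unbounded queue.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            8, 8, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(1024),
            new ThreadPoolExecutor.AbortPolicy());

    private final long timeoutNanos = TimeUnit.SECONDS.toNanos(12);

    public void submit(Runnable request)
    {
        long deadline = System.nanoTime() + timeoutNanos;
        try
        {
            executor.execute(() -> {
                // Shed the request if it sat in the queue past its deadline,
                // surfacing a timeout to the client instead of doing late work.
                if (System.nanoTime() - deadline > 0)
                    return;
                request.run();
            });
        }
        catch (RejectedExecutionException e)
        {
            // Queue full: reject immediately (load shedding).
        }
    }
}
{code}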

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842164#comment-17842164
 ] 

Brandon Williams commented on CASSANDRA-19534:
--

bq. I'd suggest setting cql_start_time to REQUEST

This appears to be the default in the patch, so first I ran with no config 
changes. Here are the KeyValue ECS numbers while the random workload is also 
running with an increased rate of 300:
{noformat}
                 Writes                                  Reads                                  Deletes                       Errors
  Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |     Count  1min (errors/s)
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |  91260303          20005.4
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |  91320989         20007.02
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |  91380356         20007.02
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |  91441015         19976.79
{noformat}

We can see the 100s native transport timeout default, which is stable, and with 
the ECS rate set to 20k/s it is doing nothing but throwing errors at this 
point. There was also a good amount of GC pressure.

With the native transport timeout adjusted to 12s:
{noformat}
                 Writes                                  Reads                                  Deletes                       Errors
  Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |     Count  1min (errors/s)
6362953       12019.36       7602.56 | 6346212       12016.37       7581.98 |       0              0             0 |   1639458          4976.36
6384650       12016.84        7566.8 | 6367878       12023.32       7553.07 |       0              0             0 |   1655989          5033.01
6405461       12016.84        7566.8 | 6388707       12023.32       7553.07 |       0              0             0 |   1674127          5033.01
6426641       12016.84       7510.02 | 6409624       12021.76        7493.9 |       0              0             0 |   1693822          5158.58
{noformat}

We can see the timeout reflected again, but this time without heap pressure it 
continues to serve many requests.

Finally, here is cql_start_time set to QUEUE and the native transport timeout 
at 12s:
{noformat}
                 Writes                                  Reads                                  Deletes                       Errors
  Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |     Count  1min (errors/s)
 505121       11983.81         53.36 |  794926        6334.45        113.39 |       0              0             0 |   5350041          19782.8
 505123       11983.81         49.13 |  794926        6334.45        104.33 |       0              0             0 |   5410428         19815.76
 505137       11983.81         49.13 |  794926        6334.45        104.33 |       0              0             0 |   5468740         19815.76
 505145       11983.81         45.53 |  794926        6334.45         95.99 |       0              0             0 |   5528104         19848.02
{noformat}

This also ended up throwing errors but still respected the timeout.

This patch appears to solve the runaway latency growth, as requests never last 
beyond the native transport timeout. I still think the 100s default is too 
high; it's the closest to the unbounded behavior from before, but it is still 
detrimental and probably not what most people actually want, especially since 
it may exert additional GC pressure.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.

(cassandra-website) branch asf-staging updated (6b58d505 -> 836a863a)

2024-04-29 Thread git-site-role
This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git


 discard 6b58d505 generate docs for cc1c7113
 new 836a863a generate docs for cc1c7113

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (6b58d505)
\
 N -- N -- N   refs/heads/asf-staging (836a863a)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/search-index.js |   2 +-
 site-ui/build/ui-bundle.zip | Bin 4883646 -> 4883646 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction

2024-04-29 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842098#comment-17842098
 ] 

Benedict Elliott Smith edited comment on CASSANDRA-19597 at 4/29/24 8:02 PM:
-

Yes, exactly. If I remember correctly, this "queue" was originally intended to 
achieve two things:
1) ensure commit log records are invalidated correctly, as it used to only 
support essentially invalidations of a complete prefix;
2) serve as a kind of fsync so that when awaiting the completion of a flush on 
a particular table you can be certain all data written prior has made it to 
sstables

I'm not actually sure if any of this is necessary today though. Pretty sure we 
invalidate explicit ranges now, so the commit log semantics do not require 
this. I'm not sure off the top of my head why (except for non-durable 
tables/writes, or things that might want to read sstables prior to commit log 
replay) you would ever need to know all prior flushes had completed though, 
since the commit log will ensure they are re-written on restart.

But a low risk approach would be to just make this a per table queue.


was (Author: benedict):
Yes, exactly. If I remember correctly, this "queue" was originally intended to 
achieve two things:
1) ensure commit log records are invalidated correctly, as it used to only 
support essentially invalidations of a complete prefix;
2) serve as a kind of fsync so that when awaiting the completion of a flush on 
a particular table you can be certain all data written prior has made it to disk

I'm not actually sure if any of this is necessary today though. Pretty sure we 
invalidate explicit ranges now, so the commit log semantics do not require 
this. I'm not off the top of my head sure why (except for non-durable 
tables/writes) you would ever need to know all prior flushes had completed 
though, since the commit log will ensure they are re-written on restart.

But a low risk approach would be to just make this a per table queue.

> SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
> -
>
> Key: CASSANDRA-19597
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19597
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Normal
>
> There is a single post flush thread and that thread processes tasks in order 
> and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, 
> and that memtable flush can be blocked by slow IntervalTree building and 
> racing with compactors to try and build an interval tree.
> Unless there is a requirement for ordering, we probably want to loosen this 
> to the actual ordering requirement so that problems in one keyspace can’t 
> affect another.
> SystemKeyspace and Gossip in particular cause lots of weird problems like 
> nodes marking each other down because Gossip can’t process nodes being 
> removed (blocking flush each time in SystemKeyspace.removeNode)
> A very simple fix here might be to queue the post flush task at the same time 
> as the flush in a per CFS queue, and then submit the task only once the flush 
> is completed.
> If flushes complete out of order the queue will still ensure their 
> completions are processed in order.
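
A hedged sketch of that per-CFS queue idea (invented names, not the actual 
patch): each table keeps its own queue of (flush, post-flush) pairs, and 
completions are drained from the head only, so they run in submission order 
even when the flushes themselves finish out of order:

{code:java}
import java.util.ArrayDeque;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class PerTablePostFlush
{
    private static final class Pending
    {
        final CompletableFuture<?> flush;
        final Runnable postFlush;

        Pending(CompletableFuture<?> flush, Runnable postFlush)
        {
            this.flush = flush;
            this.postFlush = postFlush;
        }
    }

    // One queue per table, so a slow flush in one keyspace cannot block
    // post-flush work in another.
    private final Map<String, ArrayDeque<Pending>> queues = new ConcurrentHashMap<>();

    public void register(String table, CompletableFuture<?> flush, Runnable postFlush)
    {
        ArrayDeque<Pending> queue = queues.computeIfAbsent(table, t -> new ArrayDeque<>());
        synchronized (queue)
        {
            queue.add(new Pending(flush, postFlush));
        }
        // Whenever a flush finishes, drain completed entries from the head
        // only: out-of-order completions wait for everything ahead of them.
        flush.whenComplete((result, error) -> drain(queue));
    }

    private void drain(ArrayDeque<Pending> queue)
    {
        synchronized (queue)
        {
            while (!queue.isEmpty() && queue.peekFirst().flush.isDone())
                queue.pollFirst().postFlush.run();
        }
    }
}
{code}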



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Jon Haddad (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842114#comment-17842114
 ] 

Jon Haddad commented on CASSANDRA-19534:


I can fire it up this week.  

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19567) Minimize the heap consumption when registering metrics

2024-04-29 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19567:

Fix Version/s: 5.1
   (was: 5.x)

> Minimize the heap consumption when registering metrics
> --
>
> Key: CASSANDRA-19567
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19567
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Normal
> Fix For: 5.1
>
> Attachments: summary.png
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The problem is only reproducible on x86 machines; it is not reproducible on 
> arm64. A quick analysis showed a lot of MetricName objects stored in the 
> heap; although the real cause could be related to something else, the 
> MetricName object requires extra attention.
> To reproduce, run the command locally:
> {code}
> ant test-jvm-dtest-some 
> -Dtest.name=org.apache.cassandra.distributed.test.ReadRepairTest
> {code}
> The error:
> {code:java}
> [junit-timeout] Exception in thread "main" java.lang.OutOfMemoryError: Java 
> heap space
> [junit-timeout]     at 
> java.base/java.lang.StringLatin1.newString(StringLatin1.java:769)
> [junit-timeout]     at 
> java.base/java.lang.StringBuffer.toString(StringBuffer.java:716)
> [junit-timeout]     at 
> org.apache.cassandra.CassandraBriefJUnitResultFormatter.endTestSuite(CassandraBriefJUnitResultFormatter.java:191)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.fireEndTestSuite(JUnitTestRunner.java:854)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:578)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1197)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:1042)
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
>  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec
> [junit-timeout] 
> [junit-timeout] Testcase: 
> org.apache.cassandra.distributed.test.ReadRepairTest:readRepairRTRangeMovementTest-cassandra.testtag_IS_UNDEFINED:
>     Caused an ERROR
> [junit-timeout] Forked Java VM exited abnormally. Please note the time in the 
> report does not reflect the time until the VM exit.
> [junit-timeout] junit.framework.AssertionFailedError: Forked Java VM exited 
> abnormally. Please note the time in the report does not reflect the time 
> until the VM exit.
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Test org.apache.cassandra.distributed.test.ReadRepairTest 
> FAILED (crashed)BUILD FAILED
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842112#comment-17842112
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

[~brandon.williams] [~rustyrazorblade] would you be so kind to try running your 
tests? I suggest setting  {{native_transport_timeout_in_ms}} to about 10 (or 12 
max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really 
want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, 
but this is optional, as we will not roll it out with this setting enabled.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842112#comment-17842112
 ] 

Alex Petrov edited comment on CASSANDRA-19534 at 4/29/24 5:24 PM:
--

[~brandon.williams] [~rustyrazorblade] would you be so kind to try running your 
tests against the branch posted above? I suggest setting  
{{native_transport_timeout_in_ms}} to about 10 (or 12 max) seconds, and 
{{internode_timeout}} to {{true}} for starters. If you really want to push the 
limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is 
optional, as we will not roll it out with this setting enabled.


was (Author: ifesdjeen):
[~brandon.williams] [~rustyrazorblade] would you be so kind to try running your 
tests? I suggest setting  {{native_transport_timeout_in_ms}} to about 10 (or 12 
max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really 
want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, 
but this is optional, as we will not roll it out with this setting enabled.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.
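
A minimal sketch of the bounded-queue-plus-shedding idea from the description 
above (hypothetical names, not the actual native transport code): when the 
inbound queue is full, a request is rejected immediately instead of piling up 
behind hundreds of thousands of others.

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded inbound queue: offer() sheds load when the queue is full rather
// than queueing unboundedly; a shed request would be answered with an
// Overloaded error instead of sitting in the queue past its timeout.
final class SheddingInboundQueue<R>
{
    private final BlockingQueue<R> queue;

    SheddingInboundQueue(int capacity)
    {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the request is shed.
    boolean offer(R request)
    {
        return queue.offer(request);
    }

    R take() throws InterruptedException
    {
        return queue.take();
    }
}
{code}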



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19578) Concurrent equivalent schema updates lead to unresolved disagreement

2024-04-29 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842111#comment-17842111
 ] 

Brandon Williams commented on CASSANDRA-19578:
--

Hmm, I see in the description this says "Only impacts 4.1" but then [~jwest] 
added 5.0.  If I apply the added test to 5.0 (and make 
DefaultSchemaUpdateHandler.applyMutations public so it can run) then it fails 
there also.  Does this affect 5.0?

> Concurrent equivalent schema updates lead to unresolved disagreement
> 
>
> Key: CASSANDRA-19578
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19578
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Chris Lohfink
>Priority: Normal
> Fix For: 4.1.5, 5.0-beta2
>
>
> As part of CASSANDRA-17819, a check for empty schema changes was added to 
> updateSchema. This only looks at the _logical_ schema difference between the 
> schemas, but the changes made to the system_schema keyspace are the ones 
> actually involved in the digest.
> If two nodes issue the same CREATE statement, the difference from 
> keyspace.diff would be empty, but the timestamps on the mutations would be 
> different, leading to a pseudo schema disagreement which will never resolve 
> until resetlocalschema is run or nodes are bounced.
> Only impacts 4.1
> test and fix : 
> https://github.com/clohfink/cassandra/commit/ba915f839089006ac6d08494ef19dc010bcd6411
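
A toy illustration of the mechanism described above (hypothetical names, not 
the actual digest code): the digest covers the serialized system_schema 
mutations, timestamps included, so two logically identical schemas can still 
produce different digests.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Toy model: each node stamps the "same" CREATE with its own local
// timestamp, so a digest over the serialized mutation differs even though
// keyspace.diff between the resulting schemas is empty.
final class SchemaDigestToy
{
    static byte[] digest(String serializedMutation, long mutationTimestamp) throws Exception
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(serializedMutation.getBytes(StandardCharsets.UTF_8));
        md.update(Long.toString(mutationTimestamp).getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }

    public static void main(String[] args) throws Exception
    {
        String create = "CREATE TABLE ks.t (k int PRIMARY KEY)";
        byte[] nodeA = digest(create, 1714400000001L); // node A's mutation timestamp
        byte[] nodeB = digest(create, 1714400000002L); // node B's, a moment later
        System.out.println(Arrays.equals(nodeA, nodeB)); // false -> pseudo disagreement
    }
}
{code}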



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19578) Concurrent equivalent schema updates lead to unresolved disagreement

2024-04-29 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-19578:
-
Fix Version/s: 4.1.x
   5.0.x
   (was: 5.0-beta2)
   (was: 4.1.5)

> Concurrent equivalent schema updates lead to unresolved disagreement
> 
>
> Key: CASSANDRA-19578
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19578
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Chris Lohfink
>Priority: Normal
> Fix For: 4.1.x, 5.0.x
>
>
> As part of CASSANDRA-17819, a check for empty schema changes was added to 
> updateSchema. This only looks at the _logical_ schema difference between the 
> schemas, but the changes made to the system_schema keyspace are the ones 
> actually involved in the digest.
> If two nodes issue the same CREATE statement, the difference from 
> keyspace.diff would be empty, but the timestamps on the mutations would be 
> different, leading to a pseudo schema disagreement which will never resolve 
> until resetlocalschema is run or nodes are bounced.
> Only impacts 4.1
> test and fix : 
> https://github.com/clohfink/cassandra/commit/ba915f839089006ac6d08494ef19dc010bcd6411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Test and Documentation Plan: Includes tests, also was tested separately; 
screenshots and description attached
 Status: Patch Available  (was: Open)

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and using a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: ci_summary.html

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and using a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction

2024-04-29 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842098#comment-17842098
 ] 

Benedict Elliott Smith commented on CASSANDRA-19597:


Yes, exactly. If I remember correctly, this "queue" was originally intended to 
achieve two things:
1) ensure commit log records are invalidated correctly, as the commit log used 
to support essentially only invalidations of a complete prefix;
2) serve as a kind of fsync, so that when awaiting the completion of a flush on 
a particular table you can be certain all data written prior has made it to 
disk.

I'm not actually sure any of this is necessary today, though. I'm pretty sure 
we invalidate explicit ranges now, so the commit log semantics do not require 
this. Off the top of my head, I'm not sure why (except for non-durable 
tables/writes) you would ever need to know that all prior flushes had 
completed, since the commit log will ensure they are re-written on restart.

But a low-risk approach would be to just make this a per-table queue, as 
sketched below.
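
A minimal sketch of that per-table (per-CFS) queue, assuming the post-flush 
work can be modelled as a Runnable and the flush as a CompletableFuture 
(hypothetical names, not Cassandra's actual API): the post-flush task is 
enqueued when the flush starts and only submitted once its flush completes, so 
completions for a table are processed in enqueue order even if flushes finish 
out of order.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class PerTableFlushQueue
{
    private static final class Entry
    {
        final Runnable postFlushTask;
        volatile boolean flushDone;
        Entry(Runnable task) { this.postFlushTask = task; }
    }

    private final Queue<Entry> pending = new ArrayDeque<>();
    private final Executor postFlushExecutor;

    PerTableFlushQueue(Executor postFlushExecutor)
    {
        this.postFlushExecutor = postFlushExecutor;
    }

    // Called at flush start: queue the post-flush task now, and drain once
    // this flush's future completes.
    synchronized void onFlushStarted(CompletableFuture<?> flush, Runnable postFlushTask)
    {
        Entry e = new Entry(postFlushTask);
        pending.add(e);
        flush.whenComplete((result, failure) -> { e.flushDone = true; drain(); });
    }

    // Submit head-of-queue tasks whose flush has completed; a slow flush at
    // the head holds back later completions for this table only, not for
    // other tables.
    private synchronized void drain()
    {
        while (!pending.isEmpty() && pending.peek().flushDone)
            postFlushExecutor.execute(pending.poll().postFlushTask);
    }
}
{code}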

> SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
> -
>
> Key: CASSANDRA-19597
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19597
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Normal
>
> There is a single post flush thread and that thread processes tasks in order 
> and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, 
> and that memtable flush can be blocked by slow IntervalTree building and 
> racing with compactors to try and build an interval tree.
> Unless there is a requirement for ordering, we probably want to loosen this to 
> the actual ordering requirement so that problems in one keyspace can’t affect 
> another.
> SystemKeyspace and Gossip in particular cause lots of weird problems like 
> nodes marking each other down because Gossip can’t process nodes being 
> removed (blocking flush each time in SystemKeyspace.removeNode)
> A very simple fix here might be to queue the post flush task at the same time 
> as the flush in a per CFS queue, and then submit the task only once the flush 
> is completed.
> If flushes complete out of order the queue will still ensure their 
> completions are processed in order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction

2024-04-29 Thread Ariel Weisberg (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842091#comment-17842091
 ] 

Ariel Weisberg commented on CASSANDRA-19597:


[~benedict] is the requirement for post flush processing that it be done in 
order per CFS so a per CFS queue would actually address the problem of keeping 
the post flush processing in order?

> SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
> -
>
> Key: CASSANDRA-19597
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19597
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ariel Weisberg
>Priority: Normal
>
> There is a single post flush thread and that thread processes tasks in order 
> and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, 
> and that memtable flush can be blocked by slow IntervalTree building and 
> racing with compactors to try and build an interval tree.
> Unless there is a requirement for ordering, we probably want to loosen this to 
> the actual ordering requirement so that problems in one keyspace can’t affect 
> another.
> SystemKeyspace and Gossip in particular cause lots of weird problems like 
> nodes marking each other down because Gossip can’t process nodes being 
> removed (blocking flush each time in SystemKeyspace.removeNode)
> A very simple fix here might be to queue the post flush task at the same time 
> as the flush in a per CFS queue, and then submit the task only once the flush 
> is completed.
> If flushes complete out of order the queue will still ensure their 
> completions are processed in order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction

2024-04-29 Thread Ariel Weisberg (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg reassigned CASSANDRA-19597:
--

Assignee: Ariel Weisberg

> SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
> -
>
> Key: CASSANDRA-19597
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19597
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Normal
>
> There is a single post flush thread and that thread processes tasks in order 
> and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, 
> and that memtable flush can be blocked by slow IntervalTree building and 
> racing with compactors to try and build an interval tree.
> Unless there is a requirement for ordering, we probably want to loosen this to 
> the actual ordering requirement so that problems in one keyspace can’t affect 
> another.
> SystemKeyspace and Gossip in particular cause lots of weird problems like 
> nodes marking each other down because Gossip can’t process nodes being 
> removed (blocking flush each time in SystemKeyspace.removeNode)
> A very simple fix here might be to queue the post flush task at the same time 
> as the flush in a per CFS queue, and then submit the task only once the flush 
> is completed.
> If flushes complete out of order the queue will still ensure their 
> completions are processed in order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction

2024-04-29 Thread Ariel Weisberg (Jira)
Ariel Weisberg created CASSANDRA-19597:
--

 Summary: SystemKeyspace CFS flushing blocked by unrelated keyspace 
flushing/compaction
 Key: CASSANDRA-19597
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19597
 Project: Cassandra
  Issue Type: Bug
Reporter: Ariel Weisberg


There is a single post flush thread and that thread processes tasks in order 
and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, 
and that memtable flush can be blocked by slow IntervalTree building and racing 
with compactors to try and build an interval tree.

Unless there is a requirement for ordering, we probably want to loosen this to 
the actual ordering requirement so that problems in one keyspace can’t affect 
another.

SystemKeyspace and Gossip in particular cause lots of weird problems like nodes 
marking each other down because Gossip can’t process nodes being removed 
(blocking flush each time in SystemKeyspace.removeNode)

A very simple fix here might be to queue the post flush task at the same time 
as the flush in a per CFS queue, and then submit the task only once the flush 
is completed.

If flushes complete out of order the queue will still ensure their completions 
are processed in order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19596) IntervalTree build throughput is low enough to be a bottleneck

2024-04-29 Thread Ariel Weisberg (Jira)
Ariel Weisberg created CASSANDRA-19596:
--

 Summary: IntervalTree build throughput is low enough to be a 
bottleneck
 Key: CASSANDRA-19596
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19596
 Project: Cassandra
  Issue Type: Improvement
  Components: Local/Compaction, Local/SSTable
Reporter: Ariel Weisberg


With several terabytes of data and 8 compactors it’s possible for the 
compactors to spend a lot of time blocked waiting on IntervalTrees to be built.

There is also a lot of wasted CPU because the tree is rebuilt optimistically, 
so most of the rebuilt trees end up being thrown away.

This can end up being quite painful because it can block memtable flushing as 
well, and then a single slow CFS can block unrelated CFSs because the memtable 
post-flush executor is single-threaded and shared across all CFSs.
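
A toy model of that optimistic update pattern (hypothetical names; the real 
code maintains an IntervalTree over SSTables): every updater rebuilds the 
whole structure from the current snapshot and tries to compare-and-swap it in, 
so under contention most of the expensive builds lose the race and are 
discarded.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Each add() rebuilds the entire structure (standing in for an expensive
// IntervalTree build) and publishes it with a CAS; losing updaters retry,
// throwing away the build they just paid for.
final class OptimisticTreeHolder<T>
{
    private final AtomicReference<List<T>> tree = new AtomicReference<>(new ArrayList<>());

    void add(T interval)
    {
        while (true)
        {
            List<T> current = tree.get();
            List<T> rebuilt = new ArrayList<>(current); // the expensive rebuild
            rebuilt.add(interval);
            if (tree.compareAndSet(current, rebuilt))
                return; // winner publishes; every losing rebuild was wasted CPU
        }
    }
}
{code}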



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19490) Add foundation for independent parsing of junit based output for CI reporting to cassandra-builds

2024-04-29 Thread Josh McKenzie (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842087#comment-17842087
 ] 

Josh McKenzie commented on CASSANDRA-19490:
---

Nothing here. Was swamped just getting all the things together and working; 
right now it's the honor system.

Once we have CASSANDRA-18731 this _should_ be moot since the config would 
indicate resource limits. The resource allocation in the system I cobbled 
together on top of David's work is actually significantly more constrained in 
both CPU and RAM compared to what we discussed and what's available in ASF CI, 
but it's all in bespoke .yml files for the parallelizer David wrote.

> Add foundation for independent parsing of junit based output for CI reporting 
> to cassandra-builds
> -
>
> Key: CASSANDRA-19490
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19490
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CI
>Reporter: Josh McKenzie
>Assignee: Josh McKenzie
>Priority: Normal
>
> PR attached.
> For doing CI ourselves, it's useful to have a single pane of glass report 
> where you have a summary of results for all your suites as well as inlined 
> failures. This should be agnostic to any xunit based output; so long as we 
> co-locate the xunit data in directories adjacent to one another, the script 
> in the PR will generate an in-memory representation of the xunit results as 
> well as inline failures to an existing .html file.
> The contents will need to be tweaked a bit to generate the top level branch + 
> sha + checkstyle + summaries information, but the vast majority of that is 
> already parsed and easily available within the script and can be extended 
> pretty trivially.
> Opening up a pr to pull this into 
> [cassandra-builds](https://github.com/apache/cassandra-builds) since [~mck] 
> is actively working on that and needs these primitives. I'd expect the 
> contents in ci_parser to be massaged to become a more finalized, full 
> solution before we start to use it but no harm in the incremental merge.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19490) Add foundation for independent parsing of junit based output for CI reporting to cassandra-builds

2024-04-29 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842084#comment-17842084
 ] 

Michael Semb Wever commented on CASSANDRA-19490:


yes.

do we have a ticket for a script that validates a results_details.tar.xz meets 
pre-commit requirements …? 

> Add foundation for independent parsing of junit based output for CI reporting 
> to cassandra-builds
> -
>
> Key: CASSANDRA-19490
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19490
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CI
>Reporter: Josh McKenzie
>Assignee: Josh McKenzie
>Priority: Normal
>
> PR attached.
> For doing CI ourselves, it's useful to have a single pane of glass report 
> where you have a summary of results for all your suites as well as inlined 
> failures. This should be agnostic to any xunit based output; so long as we 
> co-locate the xunit data in directories adjacent to one another, the script 
> in the PR will generate an in-memory representation of the xunit results as 
> well as inline failures to an existing .html file.
> The contents will need to be tweaked a bit to generate the top level branch + 
> sha + checkstyle + summaries information, but the vast majority of that is 
> already parsed and easily available within the script and can be extended 
> pretty trivially.
> Opening up a pr to pull this into 
> [cassandra-builds](https://github.com/apache/cassandra-builds) since [~mck] 
> is actively working on that and needs these primitives. I'd expect the 
> contents in ci_parser to be massaged to become a more finalized, full 
> solution before we start to use it but no harm in the incremental merge.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19490) Add foundation for independent parsing of junit based output for CI reporting to cassandra-builds

2024-04-29 Thread Josh McKenzie (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842083#comment-17842083
 ] 

Josh McKenzie commented on CASSANDRA-19490:
---

You've integrated this now, right [~mck]? i.e. can we close this out?

> Add foundation for independent parsing of junit based output for CI reporting 
> to cassandra-builds
> -
>
> Key: CASSANDRA-19490
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19490
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CI
>Reporter: Josh McKenzie
>Assignee: Josh McKenzie
>Priority: Normal
>
> PR attached.
> For doing CI ourselves, it's useful to have a single pane of glass report 
> where you have a summary of results for all your suites as well as inlined 
> failures. This should be agnostic to any xunit based output; so long as we 
> co-locate the xunit data in directories adjacent to one another, the script 
> in the PR will generate an in-memory representation of the xunit results as 
> well as inline failures to an existing .html file.
> The contents will need to be tweaked a bit to generate the top level branch + 
> sha + checkstyle + summaries information, but the vast majority of that is 
> already parsed and easily available within the script and can be extended 
> pretty trivially.
> Opening up a pr to pull this into 
> [cassandra-builds](https://github.com/apache/cassandra-builds) since [~mck] 
> is actively working on that and needs these primitives. I'd expect the 
> contents in ci_parser to be massaged to become a more finalized, full 
> solution before we start to use it but no harm in the incremental merge.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19583) Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)

2024-04-29 Thread Ekaterina Dimitrova (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekaterina Dimitrova updated CASSANDRA-19583:

Summary: Make 0 work as 0+unit for all three config classes 
(DataStorageSpec, DurationSpec, DataRateSpec)  (was: enable Make 0 work as 
0+unit for all three config classes (DataStorageSpec, DurationSpec, 
DataRateSpec))

> Make 0 work as 0+unit for all three config classes (DataStorageSpec, 
> DurationSpec, DataRateSpec)
> 
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 without a unit attached for data storage, duration, and 
> data rate config parameters, as 0 is always 0 no matter the unit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)

2024-04-29 Thread Ekaterina Dimitrova (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekaterina Dimitrova updated CASSANDRA-19583:

Description: 
The inline docs say:
{noformat}
Setting this to 0 disables throttling.
{noformat}
However, on startup, we throw this error:
{noformat}
Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
{noformat}
We should allow 0 without a unit attached for data storage, duration, and 
data rate config parameters, as 0 is always 0 no matter the unit.

  was:
The inline docs say:


{noformat}
Setting this to 0 disables throttling.
{noformat}

However, on startup, we throw this error:


{noformat}
Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
{noformat}

We should allow 0 as per the inline doc.


> enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, 
> DurationSpec, DataRateSpec)
> ---
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 without a unit attached for data storage, duration, and 
> data rate config parameters, as 0 is always 0 no matter the unit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)

2024-04-29 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-19583:
-
  Workflow: Copy of Cassandra Default Workflow  (was: Copy of Cassandra Bug 
Workflow)
Issue Type: Improvement  (was: Bug)

> enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, 
> DurationSpec, DataRateSpec)
> ---
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 as per the inline doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)

2024-04-29 Thread Jon Haddad (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842053#comment-17842053
 ] 

Jon Haddad commented on CASSANDRA-19583:


Sounds good, I've changed the title.

> enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, 
> DurationSpec, DataRateSpec)
> ---
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 as per the inline doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)

2024-04-29 Thread Jon Haddad (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Haddad updated CASSANDRA-19583:
---
Summary: enable Make 0 work as 0+unit for all three config classes 
(DataStorageSpec, DurationSpec, DataRateSpec)  (was: setting compaction 
throughput to 0 throws a startup error)

> enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, 
> DurationSpec, DataRateSpec)
> ---
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 as per the inline doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-29 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842048#comment-17842048
 ] 

Ekaterina Dimitrova edited comment on CASSANDRA-19583 at 4/29/24 3:53 PM:
--

{quote}0 can mean "unlimited", but "0MiB/s" indicates actually zero.
{quote}
This confused me.
{quote}To be clear, I'm suggesting we make "0" work, without a unit. I'm not 
suggesting we change how 0MiB/s works. They can be equivalent. 
{quote}
Thanks for clarifying.

Then let's change this ticket to an improvement and change its description to 
"enable 0 to work as 0+unit for all three config classes (DataStorageSpec, 
DurationSpec, DataRateSpec)", rather than specifically the compaction 
throughput config?


was (Author: e.dimitrova):
{quote}0 can mean "unlimited", but "0MiB/s" indicates actually zero.
{quote}
This confused me.
{quote}
To be clear, I'm suggesting we make "0" work, without a unit. I'm not 
suggesting we change how 0MiB/s works. They can be equivalent. 
{quote}
Thanks for clarifying.

Then let's change this ticket to improvement and change its description to  
"enable 0 to work as 0+unit for all three cofig classes (DataStorageSpec, 
DurationSpec, DataRateSpec)" and not particularly for the compaction throughput 
config?

> setting compaction throughput to 0 throws a startup error
> -
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 as per the inline doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18987) Using counter column type in Accord transactions leads to Atomicity / Consistency violations

2024-04-29 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-18987:

Epic Link: CASSANDRA-17092  (was: CASSANDRA-17089)

> Using counter column type in Accord transactions leads to Atomicity / 
> Consistency violations
> 
>
> Key: CASSANDRA-18987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18987
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Luis E Fernandez
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: NA
>
> Attachments: ci_summary.html
>
>
> *System configuration and information:*
> Single-node Cassandra with Accord transactions enabled, running on Docker
> Built from commit: 
> [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b]
> CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native 
> protocol v5]
>  
> *Steps to reproduce in CQLSH:*
> {code:java}
> CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': '1'} AND durable_writes = true;{code}
> {code:java}
> CREATE TABLE accord.accounts (
>     partition text,
>     account_id int,
>     balance counter,
>     PRIMARY KEY (partition, account_id)
> ) WITH CLUSTERING ORDER BY (account_id ASC);
> {code}
> {code:java}
> BEGIN TRANSACTION
>   UPDATE accord.accounts
> SET balance += 100
>   WHERE
> partition = 'default'
> AND account_id = 0;
>   UPDATE accord.accounts
> SET balance += 100
>   WHERE
> partition = 'default'
> AND account_id = 1;
> COMMIT TRANSACTION;{code}
> The bug happens after executing the following statements:
> Based on [Cassandra 
> documentation|https://cassandra.apache.org/doc/trunk/cassandra/developing/cql/types.html#counters]
>  regarding the use of counters, I expect the following results:
> Transaction A: subtract 10 from the balance of account 1 (total ending 
> balance of 90) and add 10 to the balance of account 0 (total ending balance 
> of 110)
> {*}Bug A{*}: Neither account's balance is updated and the state of the rows 
> is left unchanged
> {code:java}
> /* Transaction A */
> BEGIN TRANSACTION
> UPDATE accord.accounts
> SET balance -= 10
> WHERE
>   partition = 'default'
>   AND account_id = 1;
> UPDATE accord.accounts
> SET balance += 10
> WHERE
>   partition = 'default'
>   AND account_id = 0;
> COMMIT TRANSACTION;{code}
> Transaction B: subtract 10 from the balance of account 1 (total ending 
> balance of 90) and add 10 to the balance of a new account 2 (total ending 
> balance of 10)
> {*}Bug B{*}: Only the new account 2 is created. The balance of account 1 is 
> left unchanged
> {code:java}
> /* Transaction B */
> BEGIN TRANSACTION
> UPDATE accord.accounts
> SET balance -= 10
> WHERE
>   partition = 'default'
>   AND account_id = 1;
> UPDATE accord.accounts
> SET balance += 10
> WHERE
>   partition = 'default'
>   AND account_id = 2;
> COMMIT TRANSACTION;{code}
> Bug / Error:
> ==
> The result of performing a table read after executing each buggy transaction 
> is:
> {code:java}
> /* Transaction / Bug A */
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |     100{code}
> {code:java}
> /* Transaction / Bug B */
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |     100
>    default |          2 |      10 {code}
> Note that performing the above statements without transaction blocks works as 
> expected.
> {color:#172b4d}This was found while testing Accord transactions with 
> [~henrik.ingo] and team.{color}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19595) RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out on QUORUM read

2024-04-29 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19595:

Fix Version/s: NA

> RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out 
> on QUORUM read
> ---
>
> Key: CASSANDRA-19595
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19595
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Caleb Rackliffe
>Priority: Normal
> Fix For: NA
>
>
> This test doesn't seem to have any trouble passing locally in trunk, but on 
> cep-15-accord, it reliably times out. From a couple minutes of debugging in 
> {{ReadCallback#onResponse()}}, I think the proximate cause looks like a 
> {{QUORUM}} read w/ 3 nodes getting back 2 responses, but both are digest 
> responses. It seems like one should actually have data. In any case, this 
> manifests as a timeout, because even though we have the right number of 
> responses, we don't signal before the timeout. Also, this still fails even w/ 
> speculative retries disabled (i.e. set to NEVER) in the test table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-29 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842048#comment-17842048
 ] 

Ekaterina Dimitrova commented on CASSANDRA-19583:
-

{quote}0 can mean "unlimited", but "0MiB/s" indicates actually zero.
{quote}
This confused me.
{quote}
To be clear, I'm suggesting we make "0" work, without a unit. I'm not 
suggesting we change how 0MiB/s works. They can be equivalent. 
{quote}
Thanks for clarifying.

Then let's change this ticket to an improvement and change its description to 
"enable 0 to work as 0+unit for all three config classes (DataStorageSpec, 
DurationSpec, DataRateSpec)", rather than specifically the compaction 
throughput config?

> setting compaction throughput to 0 throws a startup error
> -
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 as per the inline doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19595) RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out on QUORUM read

2024-04-29 Thread Caleb Rackliffe (Jira)
Caleb Rackliffe created CASSANDRA-19595:
---

 Summary: 
RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out on 
QUORUM read
 Key: CASSANDRA-19595
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19595
 Project: Cassandra
  Issue Type: Bug
  Components: Accord
Reporter: Caleb Rackliffe


This test doesn't seem to have any trouble passing locally in trunk, but on 
cep-15-accord, it reliably times out. From a couple minutes of debugging in 
{{ReadCallback#onResponse()}}, I think the proximate cause looks like a 
{{QUORUM}} read w/ 3 nodes getting back 2 responses, but both are digest 
responses. It seems like one should actually have data. In any case, this 
manifests as a timeout, because even though we have the right number of 
responses, we don't signal before the timeout. Also, this still fails even w/ 
speculative retries disabled (i.e. set to NEVER) in the test table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-29 Thread Jon Haddad (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842046#comment-17842046
 ] 

Jon Haddad commented on CASSANDRA-19583:


To be clear, I'm suggesting we make "0" work, without a unit.  I'm not 
suggesting we change how 0MiB/s works.  They can be equivalent.

Making it mandatory to supply a unit with 0 is a weird user experience. Zero 
is zero; the unit carries no meaning and is superfluous.
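
A minimal sketch of that behavior with a simplified parser (hypothetical 
names, not the actual DataRateSpec code): a bare "0" is accepted and treated 
as zero regardless of unit, while non-zero quantities still require one of the 
accepted units.

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser: "0" with no unit is accepted because zero is
// unit-independent; non-zero values still need MiB/s, KiB/s, or B/s.
final class DataRateParseSketch
{
    private static final Pattern RATE = Pattern.compile("(\\d+)\\s*(MiB/s|KiB/s|B/s)");

    static long parseBytesPerSecond(String value)
    {
        String v = value.trim();
        if (v.equals("0"))
            return 0L; // bare zero: equivalent to 0B/s, 0KiB/s, or 0MiB/s

        Matcher m = RATE.matcher(v);
        if (!m.matches())
            throw new IllegalArgumentException("Invalid data rate: " + value);

        long quantity = Long.parseLong(m.group(1));
        switch (m.group(2))
        {
            case "MiB/s": return quantity * 1024L * 1024L;
            case "KiB/s": return quantity * 1024L;
            default:      return quantity; // B/s
        }
    }
}
{code}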

> setting compaction throughput to 0 throws a startup error
> -
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
> {noformat}
> We should allow 0 as per the inline doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop

2024-04-29 Thread Bret McGuire (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842045#comment-17842045
 ] 

Bret McGuire commented on CASSANDRA-19579:
--

ACKed, [~brandon.williams]; thanks for letting me know. The 
"Client/java-driver" component will usually do the trick, but an explicit "cc" 
tag never hurts :)

> threads lingering after driver shutdown: session close starts thread and 
> doesn't await its stop
> ---
>
> Key: CASSANDRA-19579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Client/java-driver
>Reporter: Thomas Klambauer
>Priority: Normal
>
> We are checking for remaining/lingering threads during shutdown.
> We noticed some with the naming pattern/thread factory: 
> ""globalEventExecutor-1-2" Id=146 TIMED_WAITING"
> This one seems to be created during shutdown / session close and is not 
> awaited/shut down:
> {noformat}
> addTask:156, GlobalEventExecutor (io.netty.util.concurrent)
> execute0:225, GlobalEventExecutor (io.netty.util.concurrent)
> execute:221, GlobalEventExecutor (io.netty.util.concurrent)
> onClose:188, DefaultNettyOptions 
> (com.datastax.oss.driver.internal.core.context)
> onChildrenClosed:589, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$close$9:552, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> run:-1, 860270832 
> (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508)
> tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> claim:568, CompletableFuture$UniCompletion (java.util.concurrent)
> tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> <init>:767, CompletableFuture$UniRun (java.util.concurrent)
> uniRunStage:801, CompletableFuture (java.util.concurrent)
> thenRunAsync:2136, CompletableFuture (java.util.concurrent)
> thenRunAsync:143, CompletableFuture (java.util.concurrent)
> whenAllDone:75, CompletableFutures 
> (com.datastax.oss.driver.internal.core.util.concurrent)
> close:551, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> access$1000:300, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$closeAsync$1:272, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> runTask:98, PromiseTask (io.netty.util.concurrent)
> run:106, PromiseTask (io.netty.util.concurrent)
> runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent)
> runTask:-1, AbstractEventExecutor (io.netty.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> submit:118, AbstractExecutorService (java.util.concurrent)
> submit:118, AbstractEventExecutor (io.netty.util.concurrent)
> on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent)
> closeSafely:286, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session)
> close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core)
> -- custom shutdown code
> run:829, Thread (java.lang)
> {noformat}
> The initial close here is called on com.datastax.oss.driver.api.core.CqlSession.
> The Netty framework suggests calling io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity during shutdown to wait for the event thread to stop (slightly related Netty issue: [https://github.com/netty/netty/issues/2084]).
> Suggestion: add GlobalEventExecutor.INSTANCE.awaitInactivity with some timeout during close, around here:
> [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199]
> Note that this might slow down closing by up to 2 seconds, if the Netty issue comment is correct.
> This is on the latest DataStax Java driver version: 4.17.
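For illustration, a minimal application-level sketch of the awaitInactivity suggestion above, assuming a custom shutdown hook that closes the session first (the class name, logging, and the 2-second timeout are illustrative, not actual driver code):

{code:java}
import com.datastax.oss.driver.api.core.CqlSession;
import io.netty.util.concurrent.GlobalEventExecutor;
import java.util.concurrent.TimeUnit;

public final class DriverShutdown
{
    // Close the session, then wait for Netty's global event executor thread
    // to become inactive, as suggested above.
    public static void closeAndAwait(CqlSession session) throws InterruptedException
    {
        session.close();
        // awaitInactivity joins the executor thread and returns true if it
        // stopped within the timeout (the Netty issue suggests ~2 seconds).
        boolean stopped = GlobalEventExecutor.INSTANCE.awaitInactivity(2, TimeUnit.SECONDS);
        if (!stopped)
            System.err.println("globalEventExecutor thread still alive after driver shutdown");
    }
}
{code}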



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRA-18987) Using counter column type in Accord transactions leads to Atomicity / Consistency violations

2024-04-29 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-18987:

Fix Version/s: NA
   (was: 5.1)

> Using counter column type in Accord transactions leads to Atomicity / 
> Consistency violations
> 
>
> Key: CASSANDRA-18987
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18987
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Luis E Fernandez
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: NA
>
> Attachments: ci_summary.html
>
>
> *System configuration and information:*
> Single node Cassandra with Accord transactions enabled running on docker
> Built from commit: 
> [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b]
> CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native 
> protocol v5]
>  
> *Steps to reproduce in CQLSH:*
> {code:java}
> CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': '1'} AND durable_writes = true;{code}
> {code:java}
> CREATE TABLE accord.accounts (
>     partition text,
>     account_id int,
>     balance counter,
>     PRIMARY KEY (partition, account_id)
> ) WITH CLUSTERING ORDER BY (account_id ASC);
> {code}
> {code:java}
> BEGIN TRANSACTION
>   UPDATE accord.accounts
> SET balance += 100
>   WHERE
> partition = 'default'
> AND account_id = 0;
>   UPDATE accord.accounts
> SET balance += 100
>   WHERE
> partition = 'default'
> AND account_id = 1;
> COMMIT TRANSACTION;{code}
> The bug happens after executing the following statements. Based on the [Cassandra documentation|https://cassandra.apache.org/doc/trunk/cassandra/developing/cql/types.html#counters] regarding the use of counters, I expect the following results:
> Transaction A: subtract 10 from the balance of account 1 (total ending 
> balance of 90) and add 10 to the balance of account 0 (total ending balance 
> of 110)
> {*}Bug A{*}: Neither account's balance is updated and the state of the rows 
> is left unchanged
> {code:java}
> /* Transaction A */
> BEGIN TRANSACTION
> UPDATE accord.accounts
> SET balance -= 10
> WHERE
>   partition = 'default'
>   AND account_id = 1;
> UPDATE accord.accounts
> SET balance += 10
> WHERE
>   partition = 'default'
>   AND account_id = 0;
> COMMIT TRANSACTION;{code}
> Transaction B: subtract 10 from the balance of account 1 (total ending 
> balance of 90) and add 10 to the balance of a new account 2 (total ending 
> balance of 10)
> {*}Bug B{*}: Only the new account 2 is created. The balance of account 1 is 
> left unchanged
> {code:java}
> /* Transaction B */
> BEGIN TRANSACTION
> UPDATE accord.accounts
> SET balance -= 10
> WHERE
>   partition = 'default'
>   AND account_id = 1;
> UPDATE accord.accounts
> SET balance += 10
> WHERE
>   partition = 'default'
>   AND account_id = 2;
> COMMIT TRANSACTION;{code}
> Bug / Error:
> ==
> The result of performing a table read after executing each buggy transaction 
> is:
> {code:java}
> /* Transaction / Bug A */
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |     100{code}
> {code:java}
> /* Transaction / Bug B */
>  partition | account_id | balance
> -----------+------------+---------
>    default |          0 |     100
>    default |          1 |     100
>    default |          2 |      10 {code}
> Note that performing the above statements without transaction blocks works as 
> expected.
> {color:#172b4d}This was found while testing Accord transactions with 
> [~henrik.ingo] and team.{color}
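For comparison, the same counter updates as Transaction A without the transaction block, which (as noted above) behave as expected:

{code:java}
/* Same statements as Transaction A, minus BEGIN/COMMIT TRANSACTION */
UPDATE accord.accounts SET balance -= 10 WHERE partition = 'default' AND account_id = 1;
UPDATE accord.accounts SET balance += 10 WHERE partition = 'default' AND account_id = 0;
{code}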



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-29 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842041#comment-17842041
 ] 

Ekaterina Dimitrova commented on CASSANDRA-19583:
-

{quote} "0MiB/s" indicates actually zero.
{quote}
Indeed, this is something we saw with the old config pre-4.1 - there were cases 
where 0 did not mean 0, or we used negatives or -1 to special-case things, 
which was very confusing IMHO.

Post-4.1 we encourage people to use guardrails or null for any special case 
with the new config (as documented in 
[https://cassandra.apache.org/doc/4.1/cassandra/configuration/configuration.html], 
section {*}Notes for Cassandra Developers{*}). Unfortunately, we have to live 
with the realities of some old configurations, so that we do not change 
behavior or introduce regressions.

Compaction throughput is one of them - 0MiB/s has always meant unlimited; the 
unit was part of the old config name, so we had to preserve that behavior. It 
would also be confusing to special-case a bare 0 while 0 with a unit means 
something else; using null or a guardrail is more deterministic. Last but not 
least, I think it is too late to change the meaning of 0MiB/s for compaction 
throughput: 4.1 already shipped with it meaning "unlimited," and technically 
that changed nothing from before, when the unit was simply part of the 
parameter name. We never had 0MiB/s meaning 0 for that property.
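For illustration, how the two styles read in cassandra.yaml (values are made-up examples, not recommendations):

{noformat}
# New-style, unit-bearing value; for this particular option 0MiB/s
# historically means "unlimited":
compaction_throughput: 64MiB/s

# With new-style options, the preferred way to express a special case is
# null, i.e. leaving the parameter unset:
# compaction_throughput:
{noformat}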

> setting compaction throughput to 0 throws a startup error
> -
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames 
> omitted
> {noformat}
> We should allow 0 as per the inline doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19567) Minimize the heap consumption when registering metrics

2024-04-29 Thread Maxim Muzafarov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842035#comment-17842035
 ] 

Maxim Muzafarov commented on CASSANDRA-19567:
-

I've addressed all the review comments and the failed tests. The changes are here:
https://github.com/apache/cassandra/pull/3267

Additionally, I've prepared a branch that adds a new assertion on the metric 
remove operation to enforce the contract. As I previously mentioned, this 
assertion can be tricky: I assume some metrics with the same name can be 
removed in parallel. That shouldn't happen in normal operation, but it could, 
given the lack of tests and/or parallel removal of the same metrics (e.g. 
Memtables share the same metric instances). Anyway, the same changes as above, 
but with the assertion, are here:
https://github.com/Mmuzaf/cassandra/tree/cassandra-19567-assert

I'll check that. 
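As a rough illustration of the kind of remove-time assertion meant here (the method, map, and names are hypothetical, not the contents of the branch):

{code:java}
// Hypothetical sketch: make a remove that finds nothing fail loudly, so
// parallel removals of the same metric name surface as assertion errors.
void removeMetric(java.util.concurrent.ConcurrentMap<String, Object> metrics, String name)
{
    Object removed = metrics.remove(name);
    assert removed != null : "metric " + name + " was already removed (parallel removal?)";
}
{code}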

> Minimize the heap consumption when registering metrics
> --
>
> Key: CASSANDRA-19567
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19567
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Normal
> Fix For: 5.x
>
> Attachments: summary.png
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The problem is only reproducible on x86 machines; it is not reproducible on 
> arm64. A quick analysis showed a lot of MetricName objects stored in the 
> heap; although the real cause could be related to something else, the 
> MetricName object requires extra attention.
> To reproduce, run the following command locally:
> {code}
> ant test-jvm-dtest-some 
> -Dtest.name=org.apache.cassandra.distributed.test.ReadRepairTest
> {code}
> The error:
> {code:java}
> [junit-timeout] Exception in thread "main" java.lang.OutOfMemoryError: Java 
> heap space
> [junit-timeout]     at 
> java.base/java.lang.StringLatin1.newString(StringLatin1.java:769)
> [junit-timeout]     at 
> java.base/java.lang.StringBuffer.toString(StringBuffer.java:716)
> [junit-timeout]     at 
> org.apache.cassandra.CassandraBriefJUnitResultFormatter.endTestSuite(CassandraBriefJUnitResultFormatter.java:191)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.fireEndTestSuite(JUnitTestRunner.java:854)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:578)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1197)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:1042)
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
>  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec
> [junit-timeout] 
> [junit-timeout] Testcase: 
> org.apache.cassandra.distributed.test.ReadRepairTest:readRepairRTRangeMovementTest-cassandra.testtag_IS_UNDEFINED:
>     Caused an ERROR
> [junit-timeout] Forked Java VM exited abnormally. Please note the time in the 
> report does not reflect the time until the VM exit.
> [junit-timeout] junit.framework.AssertionFailedError: Forked Java VM exited 
> abnormally. Please note the time in the report does not reflect the time 
> until the VM exit.
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

[jira] [Commented] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop

2024-04-29 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841964#comment-17841964
 ] 

Brandon Williams commented on CASSANDRA-19579:
--

/cc [~absurdfarce] (sorry not sure how better to get these on the driver radar)

> threads lingering after driver shutdown: session close starts thread and 
> doesn't await its stop
> ---
>
> Key: CASSANDRA-19579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Client/java-driver
>Reporter: Thomas Klambauer
>Priority: Normal
>
> We are checking remaining/lingering threads during shutdown.
> We noticed some matching the naming pattern/thread factory "globalEventExecutor-1-2" Id=146 TIMED_WAITING.
> This one seems to be created during shutdown/session close and is not awaited/shut down:
> {noformat}
> addTask:156, GlobalEventExecutor (io.netty.util.concurrent)
> execute0:225, GlobalEventExecutor (io.netty.util.concurrent)
> execute:221, GlobalEventExecutor (io.netty.util.concurrent)
> onClose:188, DefaultNettyOptions 
> (com.datastax.oss.driver.internal.core.context)
> onChildrenClosed:589, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$close$9:552, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> run:-1, 860270832 
> (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508)
> tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> claim:568, CompletableFuture$UniCompletion (java.util.concurrent)
> tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> :767, CompletableFuture$UniRun (java.util.concurrent)
> uniRunStage:801, CompletableFuture (java.util.concurrent)
> thenRunAsync:2136, CompletableFuture (java.util.concurrent)
> thenRunAsync:143, CompletableFuture (java.util.concurrent)
> whenAllDone:75, CompletableFutures 
> (com.datastax.oss.driver.internal.core.util.concurrent)
> close:551, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> access$1000:300, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$closeAsync$1:272, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> runTask:98, PromiseTask (io.netty.util.concurrent)
> run:106, PromiseTask (io.netty.util.concurrent)
> runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent)
> runTask:-1, AbstractEventExecutor (io.netty.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> submit:118, AbstractExecutorService (java.util.concurrent)
> submit:118, AbstractEventExecutor (io.netty.util.concurrent)
> on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent)
> closeSafely:286, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session)
> close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core)
> -- custom shutdown code
> run:829, Thread (java.lang)
> {noformat}
> The initial close here is called on com.datastax.oss.driver.api.core.CqlSession.
> The Netty framework suggests calling io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity during shutdown to wait for the event thread to stop (slightly related Netty issue: [https://github.com/netty/netty/issues/2084]).
> Suggestion: add GlobalEventExecutor.INSTANCE.awaitInactivity with some timeout during close, around here:
> [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199]
> Note that this might slow down closing by up to 2 seconds, if the Netty issue comment is correct.
> This is on the latest DataStax Java driver version: 4.17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop

2024-04-29 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-19579:
-
 Bug Category: Parent values: Degradation(12984)
   Complexity: Normal
Discovered By: User Report
 Severity: Normal
   Status: Open  (was: Triage Needed)

> threads lingering after driver shutdown: session close starts thread and 
> doesn't await its stop
> ---
>
> Key: CASSANDRA-19579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Client/java-driver
>Reporter: Thomas Klambauer
>Priority: Normal
>
> We are checking remaining/lingering threads during shutdown.
> We noticed some matching the naming pattern/thread factory "globalEventExecutor-1-2" Id=146 TIMED_WAITING.
> This one seems to be created during shutdown/session close and is not awaited/shut down:
> {noformat}
> addTask:156, GlobalEventExecutor (io.netty.util.concurrent)
> execute0:225, GlobalEventExecutor (io.netty.util.concurrent)
> execute:221, GlobalEventExecutor (io.netty.util.concurrent)
> onClose:188, DefaultNettyOptions 
> (com.datastax.oss.driver.internal.core.context)
> onChildrenClosed:589, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$close$9:552, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> run:-1, 860270832 
> (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508)
> tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> claim:568, CompletableFuture$UniCompletion (java.util.concurrent)
> tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> :767, CompletableFuture$UniRun (java.util.concurrent)
> uniRunStage:801, CompletableFuture (java.util.concurrent)
> thenRunAsync:2136, CompletableFuture (java.util.concurrent)
> thenRunAsync:143, CompletableFuture (java.util.concurrent)
> whenAllDone:75, CompletableFutures 
> (com.datastax.oss.driver.internal.core.util.concurrent)
> close:551, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> access$1000:300, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$closeAsync$1:272, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> runTask:98, PromiseTask (io.netty.util.concurrent)
> run:106, PromiseTask (io.netty.util.concurrent)
> runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent)
> runTask:-1, AbstractEventExecutor (io.netty.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> submit:118, AbstractExecutorService (java.util.concurrent)
> submit:118, AbstractEventExecutor (io.netty.util.concurrent)
> on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent)
> closeSafely:286, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session)
> close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core)
> -- custom shutdown code
> run:829, Thread (java.lang)
> {noformat}
> The initial close here is called on com.datastax.oss.driver.api.core.CqlSession.
> The Netty framework suggests calling io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity during shutdown to wait for the event thread to stop (slightly related Netty issue: [https://github.com/netty/netty/issues/2084]).
> Suggestion: add GlobalEventExecutor.INSTANCE.awaitInactivity with some timeout during close, around here:
> [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199]
> Note that this might slow down closing by up to 2 seconds, if the Netty issue comment is correct.
> This is on the latest DataStax Java driver version: 4.17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Assigned] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop

2024-04-29 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams reassigned CASSANDRA-19579:


Assignee: (was: Henry Hughes)

> threads lingering after driver shutdown: session close starts thread and 
> doesn't await its stop
> ---
>
> Key: CASSANDRA-19579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Client/java-driver
>Reporter: Thomas Klambauer
>Priority: Normal
>
> We are checking remaining/lingering threads during shutdown.
> We noticed some matching the naming pattern/thread factory "globalEventExecutor-1-2" Id=146 TIMED_WAITING.
> This one seems to be created during shutdown/session close and is not awaited/shut down:
> {noformat}
> addTask:156, GlobalEventExecutor (io.netty.util.concurrent)
> execute0:225, GlobalEventExecutor (io.netty.util.concurrent)
> execute:221, GlobalEventExecutor (io.netty.util.concurrent)
> onClose:188, DefaultNettyOptions 
> (com.datastax.oss.driver.internal.core.context)
> onChildrenClosed:589, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$close$9:552, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> run:-1, 860270832 
> (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508)
> tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> claim:568, CompletableFuture$UniCompletion (java.util.concurrent)
> tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent)
> tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
>  - Async stack trace
> :767, CompletableFuture$UniRun (java.util.concurrent)
> uniRunStage:801, CompletableFuture (java.util.concurrent)
> thenRunAsync:2136, CompletableFuture (java.util.concurrent)
> thenRunAsync:143, CompletableFuture (java.util.concurrent)
> whenAllDone:75, CompletableFutures 
> (com.datastax.oss.driver.internal.core.util.concurrent)
> close:551, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> access$1000:300, DefaultSession$SingleThreaded 
> (com.datastax.oss.driver.internal.core.session)
> lambda$closeAsync$1:272, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> runTask:98, PromiseTask (io.netty.util.concurrent)
> run:106, PromiseTask (io.netty.util.concurrent)
> runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent)
> runTask:-1, AbstractEventExecutor (io.netty.util.concurrent)
>  - Async stack trace
> addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
> execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
> submit:118, AbstractExecutorService (java.util.concurrent)
> submit:118, AbstractEventExecutor (io.netty.util.concurrent)
> on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent)
> closeSafely:286, DefaultSession 
> (com.datastax.oss.driver.internal.core.session)
> closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session)
> close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core)
> -- custom shutdown code
> run:829, Thread (java.lang)
> {noformat}
> The initial close here is called on com.datastax.oss.driver.api.core.CqlSession.
> The Netty framework suggests calling io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity during shutdown to wait for the event thread to stop (slightly related Netty issue: [https://github.com/netty/netty/issues/2084]).
> Suggestion: add GlobalEventExecutor.INSTANCE.awaitInactivity with some timeout during close, around here:
> [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199]
> Note that this might slow down closing by up to 2 seconds, if the Netty issue comment is correct.
> This is on the latest DataStax Java driver version: 4.17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-04-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

Attachment: ci_summary.html

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html
>
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chained 
> calls, when we first attempt to catch up from a peer, and then from the CMS.
> First of all, we should only ever use the native-transport-timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations, such as trying to catch up from the CMS after an unsuccessful 
> attempt to catch up from a peer.
> This should significantly simplify the code and reduce the number of 
> blocked/waiting threads.
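A minimal sketch of the chaining idea with plain CompletableFuture; fetchFromPeer and fetchFromCms are hypothetical helpers standing in for the native-transport-driven futures described above:

{code:java}
// Try the peer first; on failure, fall back to the CMS without creating
// extra intermediate futures or blocked threads.
CompletableFuture<LogState> catchUp(InetAddressAndPort peer)
{
    return fetchFromPeer(peer)
           .handle((state, error) -> error == null
                                     ? CompletableFuture.completedFuture(state)
                                     : fetchFromCms())
           .thenCompose(f -> f); // flatten the nested future
}
{code}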



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19450) Hygiene updates for warnings and pytests

2024-04-29 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841901#comment-17841901
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19450 at 4/29/24 11:07 AM:
-

I don't understand your last sentence. I am running the j17 pre-commit in Circle as 
I write this, though.

edit: Ah, you mean one commit before, to see whether it fails there or not? I 
don't think it does; we would have detected that already. I think this is just 
flaky, but let's see. 


was (Author: smiklosovic):
I dont understand your last sentence. I am running j17 pre-commit in circle as 
I write this though.

> Hygiene updates for warnings and pytests
> 
>
> Key: CASSANDRA-19450
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19450
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL/Interpreter
>Reporter: Brad Schoening
>Assignee: Brad Schoening
>Priority: Low
> Fix For: 5.x
>
>
>  
>  * Update 'Warning' message to write to stderr
>  * -Replace TimeoutError Exception with builtin (since Python 3.3)-
>  * -Remove re.pattern_type (removed since Python 3.7)-
>  * Fix mutable arg [] in read_until()
>  * Remove redirect of stderr to stdout in pytest fixture with tty=false; 
> Deprecation warnings can otherwise fail unit tests when stdout & stderr 
> output is combined.
>  * Fix several pycodestyle issues



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19546) Add to_human_size and to_human_duration function

2024-04-29 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841942#comment-17841942
 ] 

Stefan Miklosovic commented on CASSANDRA-19546:
---

I added both to_human_size and to_human_duration (1), (2).

I'll try my luck asking for a reviewer. It is also tested, documented, etc.

(1) https://github.com/apache/cassandra/pull/3239/files
(2) 
https://github.com/apache/cassandra/blob/f35ed228145fae3edb4325d29464f0d950d13511/doc/modules/cassandra/pages/developing/cql/functions.adoc#human-helper-functions

> Add to_human_size and to_human_duration function
> 
>
> Key: CASSANDRA-19546
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19546
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Legacy/CQL
>Reporter: Stefan Miklosovic
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are cases (e.g. in our system_views tables, but it might be applicable 
> to user tables as well) where a column's type represents a number of bytes. 
> However, such a value is quite hard for a human to parse and estimate.
> I propose this:
> {code:java}
> cqlsh> select * from myks.mytb ;
>  id | col1 | col2 | col3 | col4
> ----+------+------+------+----------
>   1 |  100 |  200 |  300 | 32432423
> (1 rows)
> cqlsh> select to_human_size(col4) from myks.mytb where id = 1;
>  system.to_human_size(col4)
> ----------------------------
>                   30.93 MiB
> (1 rows)
> cqlsh> select to_human_size(col4,0) from myks.mytb where id = 1;
>  system.to_human_size(col4, 0)
> -------------------------------
>                        31 MiB
> (1 rows)
> cqlsh> select to_human_size(col4,1) from myks.mytb where id = 1;
>  system.to_human_size(col4, 1)
> -------------------------------
>                      30.9 MiB
> (1 rows)
> {code}
> The second argument is optional and represents the maximum number of decimal 
> places to use. Without it, the function defaults to FileUtils.df, which uses 
> the "#.##" format.
> {code}
> cqlsh> DESCRIBE myks.mytb ;
> CREATE TABLE myks.mytb (
> id int PRIMARY KEY,
> col1 int,
> col2 smallint,
> col3 bigint,
> col4 varint,
> )
> {code}
> I also propose that this to_human_size function (the name is just a 
> suggestion and can certainly be discussed) should only be applicable to the 
> int, smallint, bigint and varint types. I am not sure how it would apply to 
> e.g. "float" or similar. As I mentioned, it is meant to convert a number of 
> bytes into a string representation, and I do not think applying the function 
> to anything but these types makes sense.
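For reference, a small standalone sketch of the default "#.##" behaviour mentioned above, using plain java.text.DecimalFormat (the 30.93 value matches the cqlsh example):

{code:java}
import java.text.DecimalFormat;

public class HumanSizeDemo
{
    public static void main(String[] args)
    {
        // "#.##" keeps at most two decimal places: 32432423 bytes ~ 30.93 MiB
        DecimalFormat df = new DecimalFormat("#.##");
        System.out.println(df.format(32432423 / 1024.0 / 1024.0) + " MiB"); // prints "30.93 MiB"
    }
}
{code}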



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19546) Add to_human_size and to_human_duration function

2024-04-29 Thread Stefan Miklosovic (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic updated CASSANDRA-19546:
--
Status: Needs Committer  (was: Patch Available)

> Add to_human_size and to_human_duration function
> 
>
> Key: CASSANDRA-19546
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19546
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Legacy/CQL
>Reporter: Stefan Miklosovic
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are cases (e.g. in our system_views tables, but it might be applicable 
> to user tables as well) where a column's type represents a number of bytes. 
> However, such a value is quite hard for a human to parse and estimate.
> I propose this:
> {code:java}
> cqlsh> select * from myks.mytb ;
>  id | col1 | col2 | col3 | col4
> ----+------+------+------+----------
>   1 |  100 |  200 |  300 | 32432423
> (1 rows)
> cqlsh> select to_human_size(col4) from myks.mytb where id = 1;
>  system.to_human_size(col4)
> ----------------------------
>                   30.93 MiB
> (1 rows)
> cqlsh> select to_human_size(col4,0) from myks.mytb where id = 1;
>  system.to_human_size(col4, 0)
> -------------------------------
>                        31 MiB
> (1 rows)
> cqlsh> select to_human_size(col4,1) from myks.mytb where id = 1;
>  system.to_human_size(col4, 1)
> -------------------------------
>                      30.9 MiB
> (1 rows)
> {code}
> The second argument is optional and represents the maximum number of decimal 
> places to use. Without it, the function defaults to FileUtils.df, which uses 
> the "#.##" format.
> {code}
> cqlsh> DESCRIBE myks.mytb ;
> CREATE TABLE myks.mytb (
> id int PRIMARY KEY,
> col1 int,
> col2 smallint,
> col3 bigint,
> col4 varint,
> )
> {code}
> I also propose that this to_human_size function (the name is just a 
> suggestion and can certainly be discussed) should only be applicable to the 
> int, smallint, bigint and varint types. I am not sure how it would apply to 
> e.g. "float" or similar. As I mentioned, it is meant to convert a number of 
> bytes into a string representation, and I do not think applying the function 
> to anything but these types makes sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19546) Add to_human_size and to_human_duration function

2024-04-29 Thread Stefan Miklosovic (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic updated CASSANDRA-19546:
--
Test and Documentation Plan: ci
 Status: Patch Available  (was: In Progress)

> Add to_human_size and to_human_duration function
> 
>
> Key: CASSANDRA-19546
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19546
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Legacy/CQL
>Reporter: Stefan Miklosovic
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are cases (e.g. in our system_views tables, but it might be applicable 
> to user tables as well) where a column's type represents a number of bytes. 
> However, such a value is quite hard for a human to parse and estimate.
> I propose this:
> {code:java}
> cqlsh> select * from myks.mytb ;
>  id | col1 | col2 | col3 | col4
> ----+------+------+------+----------
>   1 |  100 |  200 |  300 | 32432423
> (1 rows)
> cqlsh> select to_human_size(col4) from myks.mytb where id = 1;
>  system.to_human_size(col4)
> ----------------------------
>                   30.93 MiB
> (1 rows)
> cqlsh> select to_human_size(col4,0) from myks.mytb where id = 1;
>  system.to_human_size(col4, 0)
> -------------------------------
>                        31 MiB
> (1 rows)
> cqlsh> select to_human_size(col4,1) from myks.mytb where id = 1;
>  system.to_human_size(col4, 1)
> -------------------------------
>                      30.9 MiB
> (1 rows)
> {code}
> The second argument is optional and represents the maximum number of decimal 
> places to use. Without it, the function defaults to FileUtils.df, which uses 
> the "#.##" format.
> {code}
> cqlsh> DESCRIBE myks.mytb ;
> CREATE TABLE myks.mytb (
> id int PRIMARY KEY,
> col1 int,
> col2 smallint,
> col3 bigint,
> col4 varint,
> )
> {code}
> I also propose that this to_human_size function (the name is just a 
> suggestion and can certainly be discussed) should only be applicable to the 
> int, smallint, bigint and varint types. I am not sure how it would apply to 
> e.g. "float" or similar. As I mentioned, it is meant to convert a number of 
> bytes into a string representation, and I do not think applying the function 
> to anything but these types makes sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop

2024-04-29 Thread Thomas Klambauer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Klambauer updated CASSANDRA-19579:
-
Description: 
We are checking remaining/lingering threads during shutdown.

We noticed some matching the naming pattern/thread factory "globalEventExecutor-1-2" Id=146 TIMED_WAITING.

This one seems to be created during shutdown/session close and is not awaited/shut down:
{noformat}
addTask:156, GlobalEventExecutor (io.netty.util.concurrent)
execute0:225, GlobalEventExecutor (io.netty.util.concurrent)
execute:221, GlobalEventExecutor (io.netty.util.concurrent)
onClose:188, DefaultNettyOptions (com.datastax.oss.driver.internal.core.context)
onChildrenClosed:589, DefaultSession$SingleThreaded 
(com.datastax.oss.driver.internal.core.session)
lambda$close$9:552, DefaultSession$SingleThreaded 
(com.datastax.oss.driver.internal.core.session)
run:-1, 860270832 
(com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508)
tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent)
tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
 - Async stack trace
addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
claim:568, CompletableFuture$UniCompletion (java.util.concurrent)
tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent)
tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)
 - Async stack trace
:767, CompletableFuture$UniRun (java.util.concurrent)
uniRunStage:801, CompletableFuture (java.util.concurrent)
thenRunAsync:2136, CompletableFuture (java.util.concurrent)
thenRunAsync:143, CompletableFuture (java.util.concurrent)
whenAllDone:75, CompletableFutures 
(com.datastax.oss.driver.internal.core.util.concurrent)
close:551, DefaultSession$SingleThreaded 
(com.datastax.oss.driver.internal.core.session)
access$1000:300, DefaultSession$SingleThreaded 
(com.datastax.oss.driver.internal.core.session)
lambda$closeAsync$1:272, DefaultSession 
(com.datastax.oss.driver.internal.core.session)
runTask:98, PromiseTask (io.netty.util.concurrent)
run:106, PromiseTask (io.netty.util.concurrent)
runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent)
runTask:-1, AbstractEventExecutor (io.netty.util.concurrent)
 - Async stack trace
addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:836, SingleThreadEventExecutor (io.netty.util.concurrent)
execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:817, SingleThreadEventExecutor (io.netty.util.concurrent)
submit:118, AbstractExecutorService (java.util.concurrent)
submit:118, AbstractEventExecutor (io.netty.util.concurrent)
on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent)
closeSafely:286, DefaultSession (com.datastax.oss.driver.internal.core.session)
closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session)
close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core)
-- custom shutdown code
run:829, Thread (java.lang)
{noformat}
The initial close here is called on com.datastax.oss.driver.api.core.CqlSession.

The Netty framework suggests calling io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity during shutdown to wait for the event thread to stop.

(Slightly related Netty issue: [https://github.com/netty/netty/issues/2084])

Suggestion: add GlobalEventExecutor.INSTANCE.awaitInactivity with some timeout during close, around here:
[https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199]

Note that this might slow down closing by up to 2 seconds, if the Netty issue comment is correct.

This is on the latest DataStax Java driver version: 4.17.

  was:
We are checking remaining/lingering threads during shutdown.

we noticed some with naming pattern/thread factory: ""globalEventExecutor-1-2" 
Id=146 TIMED_WAITING"

this one seems to be created during shutdown / session close and not 
awaited/shut down:
{noformat}
addTask:156, GlobalEventExecutor (io.netty.util.concurrent)
execute0:225, GlobalEventExecutor (io.netty.util.concurrent)
execute:221, GlobalEventExecutor (io.netty.util.concurrent)
onClose:188, DefaultNettyOptions (com.datastax.oss.driver.internal.core.context)
onChildrenClosed:589, DefaultSession$SingleThreaded 
(com.datastax.oss.driver.internal.core.session)
lambda$close$9:552, DefaultSession$SingleThreaded 
(com.datastax.oss.driver.internal.core.session)
run:-1, 860270832 
(com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508)
tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent)
tryFire:-1, CompletableFuture$UniRun (java.util.concurrent)

[jira] [Comment Edited] (CASSANDRA-19590) Unexpected error deserializing mutation when upgrade from 2.2.19 to 3.0.30/3.11.17

2024-04-29 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841916#comment-17841916
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19590 at 4/29/24 9:19 AM:


What about 2.2.19 -> 3.0.0?

What is the schema for ks.tb? Anything special there? 


was (Author: smiklosovic):
what about 2.2.19 -> 3.0.0?

> Unexpected error deserializing mutation when upgrade from 2.2.19 to 
> 3.0.30/3.11.17
> --
>
> Key: CASSANDRA-19590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19590
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Klay
>Priority: Normal
> Attachments: data.tar.gz, system.log
>
>
> I am trying to upgrade from 2.2.19 to 3.0.30/3.11.17. I encountered the 
> following exception during the upgrade process and the 3.0.30/3.11.17 node 
> cannot start up.
> {code:java}
> ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting 
> due to error while processing commit log during initialization.
> org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException:
>  Unexpected error deserializing mutation; saved to 
> /tmp/mutation8318204837345269856dat.  This may be caused by replaying a 
> mutation against a table with the same name but incompatible schema.  
> Exception follows: java.lang.AssertionError
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791) 
> {code}
> h1. Reproduce1 (Flush before upgrade)
> Upgrade fails when replaying the commit log.
> This can be reproduced deterministically by:
> 1. Start up cassandra-2.2.19; a single node is enough (using the default 
> configuration)
> 2. Execute the following commands in cqlsh
> {code:java}
> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> CREATE TABLE ks.tb (c1 INT, c0 INT, PRIMARY KEY (c1));
> INSERT INTO ks.tb (c1, c0) VALUES (0, 0);
> ALTER TABLE ks.tb DROP c0 ;
> ALTER TABLE ks.tb ADD c0 set ; 
> {code}
> 3. Stop the old version.
> {code:java}
> bin/nodetool -h :::127.0.0.1 flush
> bin/nodetool -h :::127.0.0.1 stopdaemon{code}
> 4. Copy the data and start up the new version node (3.0.30 or 3.11.17)
> Upgrade crashes with the following error
> {code:java}
> ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting 
> due to error while processing commit log during initialization.
> org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException:
>  Unexpected error deserializing mutation; saved to 
> /tmp/mutation8318204837345269856dat.  This may be caused by replaying a 
> mutation against a table with the same name but incompatible schema.  
> Exception follows: java.lang.AssertionError
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791){code}
> I have attached the system.log when starting up the 3.11.17 node.
> I also attached the data folder generated from 2.2.19; starting up 3.0.30 or 
> 3.11.17 with this data folder directly exposes the error.

[jira] [Commented] (CASSANDRA-19590) Unexpected error deserializing mutation when upgrade from 2.2.19 to 3.0.30/3.11.17

2024-04-29 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841916#comment-17841916
 ] 

Stefan Miklosovic commented on CASSANDRA-19590:
---

what about 2.2.11 -> 3.0.0?

> Unexpected error deserializing mutation when upgrade from 2.2.19 to 
> 3.0.30/3.11.17
> --
>
> Key: CASSANDRA-19590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19590
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Klay
>Priority: Normal
> Attachments: data.tar.gz, system.log
>
>
> I am trying to upgrade from 2.2.19 to 3.0.30/3.11.17. I encountered the 
> following exception during the upgrade process and the 3.0.30/3.11.17 node 
> cannot start up.
> {code:java}
> ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting 
> due to error while processing commit log during initialization.
> org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException:
>  Unexpected error deserializing mutation; saved to 
> /tmp/mutation8318204837345269856dat.  This may be caused by replaying a 
> mutation against a table with the same name but incompatible schema.  
> Exception follows: java.lang.AssertionError
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791) 
> {code}
> h1. Reproduce1 (Flush before upgrade)
> Upgrade fails when replaying the commit log.
> This can be reproduced deterministically by:
> 1. Start up cassandra-2.2.19; a single node is enough (using the default 
> configuration)
> 2. Execute the following commands in cqlsh
> {code:java}
> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> CREATE TABLE ks.tb (c1 INT, c0 INT, PRIMARY KEY (c1));
> INSERT INTO ks.tb (c1, c0) VALUES (0, 0);
> ALTER TABLE ks.tb DROP c0 ;
> ALTER TABLE ks.tb ADD c0 set ; 
> {code}
> 3. Stop the old version.
> {code:java}
> bin/nodetool -h :::127.0.0.1 flush
> bin/nodetool -h :::127.0.0.1 stopdaemon{code}
> 4. Copy the data and start up the new version node (3.0.30 or 3.11.17)
> Upgrade crashes with the following error
> {code:java}
> ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting 
> due to error while processing commit log during initialization.
> org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException:
>  Unexpected error deserializing mutation; saved to 
> /tmp/mutation8318204837345269856dat.  This may be caused by replaying a 
> mutation against a table with the same name but incompatible schema.  
> Exception follows: java.lang.AssertionError
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791){code}
> I have attached the system.log when starting up the 3.11.17 node.
> I also attached the data folder generated from 2.2.19; starting up 3.0.30 or 
> 3.11.17 with this data folder directly exposes the error.
> h1. Reproduce2 (Drain before upgrade)

[jira] [Comment Edited] (CASSANDRA-19590) Unexpected error deserializing mutation when upgrade from 2.2.19 to 3.0.30/3.11.17

2024-04-29 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841916#comment-17841916
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19590 at 4/29/24 9:18 AM:


what about 2.2.19 -> 3.0.0?


was (Author: smiklosovic):
what about 2.2.11 -> 3.0.0?

> Unexpected error deserializing mutation when upgrade from 2.2.19 to 
> 3.0.30/3.11.17
> --
>
> Key: CASSANDRA-19590
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19590
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Klay
>Priority: Normal
> Attachments: data.tar.gz, system.log
>
>
> I am trying to upgrade from 2.2.19 to 3.0.30/3.11.17. I encountered the 
> following exception during the upgrade process and the 3.0.30/3.11.17 node 
> cannot start up.
> {code:java}
> ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting 
> due to error while processing commit log during initialization.
> org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException:
>  Unexpected error deserializing mutation; saved to 
> /tmp/mutation8318204837345269856dat.  This may be caused by replaying a 
> mutation against a table with the same name but incompatible schema.  
> Exception follows: java.lang.AssertionError
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791) 
> {code}
> h1. Reproduce1 (Flush before upgrade)
> Upgrade fails when replaying the commit log.
> This can be reproduced deterministically by:
> 1. Start up cassandra-2.2.19; a single node is enough (using the default 
> configuration)
> 2. Execute the following commands in cqlsh
> {code:java}
> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> CREATE TABLE ks.tb (c1 INT, c0 INT, PRIMARY KEY (c1));
> INSERT INTO ks.tb (c1, c0) VALUES (0, 0);
> ALTER TABLE ks.tb DROP c0 ;
> ALTER TABLE ks.tb ADD c0 set ; 
> {code}
> 3. Stop the old version.
> {code:java}
> bin/nodetool -h :::127.0.0.1 flush
> bin/nodetool -h :::127.0.0.1 stopdaemon{code}
> 4. Copy the data and start up the new version node (3.0.30 or 3.11.17)
> Upgrade crashes with the following error
> {code:java}
> ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting 
> due to error while processing commit log during initialization.
> org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException:
>  Unexpected error deserializing mutation; saved to 
> /tmp/mutation8318204837345269856dat.  This may be caused by replaying a 
> mutation against a table with the same name but incompatible schema.  
> Exception follows: java.lang.AssertionError
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132)
>         at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791){code}
> I have attached the system.log when starting up the 3.11.17 node.
> I also attached the data folder generated from 2.2.19; starting up 3.0.30 or 
> 3.11.17 with this data folder directly exposes the error.

[jira] [Commented] (CASSANDRA-19565) SIGSEGV on Cassandra v4.1.4

2024-04-29 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841872#comment-17841872
 ] 

Berenguer Blasi commented on CASSANDRA-19565:
-

Good that dev-branch is back online. So, IIUC, packaging is all good and the CI 
failures are known offenders. We're only missing CI and a branch for trunk, but 
it should be the same as 5.0, so +1, LGTM.

> SIGSEGV on Cassandra v4.1.4
> ---
>
> Key: CASSANDRA-19565
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19565
> Project: Cassandra
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Thomas De Keulenaer
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
> Attachments: cassandra_57_debian_jdk11_amd64_attempt1.log.xz, 
> cassandra_57_redhat_jdk11_amd64_attempt1.log.xz, hs_err_pid1116450.log
>
>
> Hello,
> Since upgrading to v4.1 we cannot run Cassandra any more. Each start 
> immediately crashes:
> {{Apr 17 08:58:34 SVALD108 cassandra[1116450]: # A fatal error has been 
> detected by the Java Runtime Environment:
> Apr 17 08:58:34 SVALD108 cassandra[1116450]: #  SIGSEGV (0xb) at 
> pc=0x7fccaab4d152, pid=1116450, tid=1116451}}
> I have added the log from the core dump.
> This issue is perhaps related to 
> https://davecturner.github.io/2021/08/30/seven-year-old-segfault.html ?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org