[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Orlowski updated CASSANDRA-19598:
    Attachment: (was: image-2024-04-29-20-40-53-382.png)

         Key: CASSANDRA-19598
         URL: https://issues.apache.org/jira/browse/CASSANDRA-19598
     Project: Cassandra
  Issue Type: Bug
  Components: Client/java-driver
    Reporter: Andrew Orlowski
    Priority: Normal
 Attachments: image-2024-04-29-20-13-56-161.png, image-2024-04-29-22-57-26-910.png

Hello, this is a bug ticket for 4.18.0 of the Java driver.

I am running an environment with 3 Cassandra nodes. We have a use case that redeploys the cluster from the ground up at midnight every day. All 3 nodes therefore become unavailable for a short period of time, and 3 new nodes with 3 new IP addresses are spun up and placed behind the contact-point hostname. We provide a single hostname as the contact point. With {{advanced.resolve-contact-points}} set to {{false}}, the Java driver should re-resolve that hostname for every new connection to the corresponding node. This works before and during the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address, making any further redeployments fruitless.

In our case, all 3 nodes become unavailable while our CI/CD process destroys the existing cluster and replaces it with a new one. During the window of unavailability, the Java driver attempts to reconnect to each node; internally (inside the driver) two of them hold resolved IP addresses and one retains the unresolved hostname. Here is a screenshot of the internal state of the 3 nodes within {{PoolManager}} before the redeployment has finished. Note that there are 2 resolved IP addresses and 1 unresolved hostname.

!image-2024-04-29-20-13-56-161.png|width=985,height=181!

This 2:1 ratio of resolved IPs to unresolved hostname is the correct internal state for a 3-node cluster when {{advanced.resolve-contact-points}} is set to {{false}}.

Eventually the hostname points at one of the 3 new, valid nodes, and the Java driver reconnects and discovers the new peers. However, as part of this reconnection process, the internal Node that held the unresolved hostname is overwritten with a Node that carries a resolved IP address:

!image-2024-04-29-22-57-26-910.png|width=753,height=107!

Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when {{advanced.resolve-contact-points}} is set to {{false}}. One of the nodes should have retained the unresolved hostname.

At this stage the Java driver no longer queries the hostname for new connections, and further redeployments fail because the hostname is no longer among the nodes queried for reconnection. This forces us to restart the application.
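For context, here is a minimal sketch of the kind of setup described above (the hostname, port, and datacenter name are placeholders, not our real values): a single unresolved contact point combined with {{advanced.resolve-contact-points}} = false, set programmatically here for clarity, though the same option can equally live in application.conf.

{code:java}
// Minimal sketch of the configuration described in this ticket.
// "cassandra.example.com", 9042, and "datacenter1" are placeholders; the
// relevant parts are the unresolved contact point and
// advanced.resolve-contact-points = false.
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class UnresolvedContactPointExample {
  public static void main(String[] args) {
    // Programmatic equivalent of setting advanced.resolve-contact-points = false
    // in application.conf.
    DriverConfigLoader configLoader = DriverConfigLoader.programmaticBuilder()
        .withBoolean(DefaultDriverOption.RESOLVE_CONTACT_POINTS, false)
        .build();

    try (CqlSession session = CqlSession.builder()
        // A single hostname, passed unresolved so the driver keeps the name
        // rather than pinning it to one IP address at startup.
        .addContactPoint(InetSocketAddress.createUnresolved("cassandra.example.com", 9042))
        .withLocalDatacenter("datacenter1")
        .withConfigLoader(configLoader)
        .build()) {
      // The driver's node map is where the 2:1 resolved/unresolved split
      // shown in the screenshots can be observed from application code.
      session.getMetadata().getNodes()
          .forEach((uuid, node) -> System.out.println(node.getEndPoint()));
    }
  }
}
{code}

Dumping {{session.getMetadata().getNodes()}} like this is roughly how the 2:1 resolved/unresolved split in the screenshots can be observed from application code.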
[jira] [Commented] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842228#comment-17842228 ]

Bret McGuire commented on CASSANDRA-19598:

No worries [~shot_up], you're good... and thanks for bringing it to my attention!
[jira] [Commented] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842214#comment-17842214 ]

Andrew Orlowski commented on CASSANDRA-19598:

cc [~absurdfarce] (sorry if I shouldn't have - just saw it was done on another thread)
[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Orlowski updated CASSANDRA-19598: Description: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address, making additional redeployments fruitless. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png|width=985,height=181! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and discovers the new peers. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png|width=1080,height=102! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. One of the nodes should have retained the unresolved hostname. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. was: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address, making additional redeployments fruitless. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. 
During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png|width=985,height=181! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and discovers the new peers. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png|width=1080,height=102! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostn
[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Orlowski updated CASSANDRA-19598: Description: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address, making additional redeployments fruitless. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png|width=985,height=181! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and discovers the new peers. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png|width=1080,height=102! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. was: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs for the prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. 
During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png|width=985,height=181! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and discovers the new peers. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png|width=1080,height=102! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to ne
[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Orlowski updated CASSANDRA-19598: Description: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs for the prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png|width=985,height=181! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and discovers the new peers. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png|width=1080,height=102! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. was: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs for the prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. 
During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png|width=985,height=181! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and resets the pool. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png|width=1080,height=102! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. > advance
[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Orlowski updated CASSANDRA-19598: Description: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs for the prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png|width=985,height=181! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and resets the pool. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png|width=1080,height=102! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. was: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs for the prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. 
During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and resets the pool. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. > advanced.resolve-contact-points: unresolved hostname being
[jira] [Updated] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
[ https://issues.apache.org/jira/browse/CASSANDRA-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Orlowski updated CASSANDRA-19598: Description: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs for the prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. During the window of unavailability, the Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the finished redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and resets the pool. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. was: Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new ip addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the java driver should re-resolve the hostname for every new connection to that node. This occurs for the prior to and for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CICD process is destroying the existing cluster and replacing it with a new one. The Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. 
Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the redeployment of the cluster. Note that there are 2 resolved IP addresses and 1 unresolved hostname. !image-2024-04-29-20-13-56-161.png! This ratio of resolved IP:unresolved hostname is the correct internal state for a 3 node cluster when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the java driver reconnects and resets the pool. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses, which is an incorrect internal state when `advanced.resolve-contact-points` is set to `FALSE`. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. > advanced.resolve-contact-points: unresolved hostname being clobbered during > reconnection > -
[jira] [Created] (CASSANDRA-19598) advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection
Andrew Orlowski created CASSANDRA-19598: --- Summary: advanced.resolve-contact-points: unresolved hostname being clobbered during reconnection Key: CASSANDRA-19598 URL: https://issues.apache.org/jira/browse/CASSANDRA-19598 Project: Cassandra Issue Type: Bug Components: Client/java-driver Reporter: Andrew Orlowski Attachments: image-2024-04-29-20-13-56-161.png, image-2024-04-29-20-40-53-382.png Hello, this is a bug ticket for 4.18.0 of the Java driver. I am running in an environment where I have 3 Cassandra nodes. We have a use case to redeploy the cluster from the ground up at midnight every day. This means that all 3 nodes become unavailable for a short period of time and 3 new nodes with 3 new IP addresses get spun up and placed behind the contact point hostname. If you set {{advanced.resolve-contact-points}} to FALSE, the Java driver should re-resolve the hostname for every new connection to that node. This occurs for the first redeployment, but the unresolved hostname is clobbered during the reconnection process and replaced with a resolved IP address. In our case, what is happening is that all 3 nodes become unavailable while our CI/CD process is destroying the existing cluster and replacing it with a new one. The Java driver attempts to reconnect to each node, two of which internally (internal to the driver) have resolved IP addresses and one of which retains the unresolved hostname. Here is a screenshot that captures the internal state of the 3 nodes within `PoolManager` prior to the redeployment of the cluster. !image-2024-04-29-20-13-56-161.png! Note that two of the nodes are dynamically discovered peers with resolved IP addresses, while one node is the unresolved contact point. This is the correct internal state when `advanced.resolve-contact-points` is set to `FALSE`. Eventually, the hostname points to one of the 3 new valid nodes, and the Java driver reconnects and resets the pool. However, as part of this reconnection process, the internal Node that held the unresolved hostname is now overwritten with a Node that has the resolved IP address: !image-2024-04-29-20-40-53-382.png! Note that we no longer have 2 resolved IP addresses and 1 unresolved hostname; rather, we have 3 resolved IP addresses. At this stage, the Java driver no longer queries the hostname for new connections, and further redeployments of ours result in failure because the hostname is no longer amongst the list of nodes that are queried for reconnection. This causes us to need to restart the application. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
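For context, a minimal sketch of how a client can ask the 4.x Java driver to keep contact points unresolved, as the reporter describes. The hostname, port and datacenter below are placeholders, and this only illustrates the configuration being discussed; it is not a fix for the reconnection bug itself:

{code:java}
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class UnresolvedContactPointExample
{
    public static void main(String[] args)
    {
        // Programmatic equivalent of advanced.resolve-contact-points = false in application.conf
        DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withBoolean(DefaultDriverOption.RESOLVE_CONTACT_POINTS, false)
                .build();

        // Pass the contact point as an *unresolved* address so the driver can
        // re-resolve the hostname on each new connection attempt.
        try (CqlSession session = CqlSession.builder()
                .withConfigLoader(loader)
                .addContactPoint(InetSocketAddress.createUnresolved("cassandra.example.internal", 9042))
                .withLocalDatacenter("dc1")
                .build())
        {
            session.execute("SELECT release_version FROM system.local");
        }
    }
}
{code}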
[jira] [Updated] (CASSANDRA-19556) Guardrail to block DDL/DCL queries
[ https://issues.apache.org/jira/browse/CASSANDRA-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuqi Yan updated CASSANDRA-19556: - Fix Version/s: 5.x Description: Sometimes we want to block DDL/DCL queries to stop new schemas being created or roles created. (e.g. when doing live-upgrade) For the DDL guardrail, the current implementation won't block the query if it's a no-op (e.g. CREATE TABLE...IF NOT EXISTS, but the table already exists, etc. The guardrail check is added in apply() right after all the existence checks) I don't have a preference on whether to block every DDL query or to check whether it's a no-op here. It's just that we have some users who always run CREATE..IF NOT EXISTS.. at startup, which is a no-op but will be blocked by this guardrail and fail to start. 4.1 PR: [https://github.com/apache/cassandra/pull/3248] trunk PR: [https://github.com/apache/cassandra/pull/3275] was: Sometimes we want to block DDL/DCL queries to stop new schemas being created or roles created. (e.g. when doing live-upgrade) For DDL guardrail current implementation won't block the query if it's no-op (e.g. CREATE TABLE...IF NOT EXISTS, but table already exists, etc. The guardrail check is added in apply() right after all the existence check) I don't have preference on either block every DDL query or check whether if it's no-op here. Just we have some users always run CREATE..IF NOT EXISTS.. at startup, which is no-op but will be blocked by this guardrail and failed to start. 4.1 PR: [https://github.com/apache/cassandra/pull/3248] > Guardrail to block DDL/DCL queries > -- > > Key: CASSANDRA-19556 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19556 > Project: Cassandra > Issue Type: New Feature > Components: Feature/Guardrails >Reporter: Yuqi Yan >Assignee: Yuqi Yan >Priority: Normal > Fix For: 5.x > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Sometimes we want to block DDL/DCL queries to stop new schemas being created > or roles created. (e.g. when doing live-upgrade) > For the DDL guardrail, the current implementation won't block the query if it's a no-op > (e.g. CREATE TABLE...IF NOT EXISTS, but the table already exists, etc. The > guardrail check is added in apply() right after all the existence checks) > I don't have a preference on whether to block every DDL query or to check whether > it's a no-op here. It's just that we have some users who always run CREATE..IF NOT EXISTS.. > at startup, which is a no-op but will be blocked by this guardrail and fail > to start. > > 4.1 PR: [https://github.com/apache/cassandra/pull/3248] > trunk PR: [https://github.com/apache/cassandra/pull/3275] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
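A minimal, hypothetical sketch of the design question raised above (block every DDL statement, or let no-op statements such as CREATE TABLE ... IF NOT EXISTS on an existing table through). The class and method names are invented for illustration and are not Cassandra's actual Guardrails API:

{code:java}
// Hypothetical illustration only; not Cassandra's real Guardrails API.
public final class DdlGuardrailSketch
{
    private final boolean ddlBlocked;    // operator has enabled the "block DDL" guardrail
    private final boolean allowNoOpDdl;  // the open design question: let no-op DDL through?

    public DdlGuardrailSketch(boolean ddlBlocked, boolean allowNoOpDdl)
    {
        this.ddlBlocked = ddlBlocked;
        this.allowNoOpDdl = allowNoOpDdl;
    }

    /**
     * @param isNoOp true for statements like CREATE TABLE ... IF NOT EXISTS when the table
     *               already exists, i.e. the statement would not change the schema at all
     */
    public void checkDdl(boolean isNoOp)
    {
        if (!ddlBlocked)
            return;
        if (isNoOp && allowNoOpDdl)
            return; // lets applications that run CREATE ... IF NOT EXISTS at startup keep booting
        throw new IllegalStateException("DDL statements are currently blocked by a guardrail");
    }
}
{code}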
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842182#comment-17842182 ] Cameron Zemek commented on CASSANDRA-19580: --- I don't understand why Gossiper::examineGossiper is implemented to only iterate on the digests in the SYN message. Why doesn't it handle sending back in the delta the entries missing from the digest list?
> Unable to contact any seeds with node in hibernate status
> -
>
> Key: CASSANDRA-19580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
> Project: Cassandra
> Issue Type: Bug
>Reporter: Cameron Zemek
>Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I have been able to reproduce this issue if I kill Cassandra as it's joining, which will put the node into hibernate status. Once a node is in hibernate it will no longer receive any SYN messages from other nodes during startup, and as it sends only itself as a digest in outbound SYN messages it never receives any states in any of the ACK replies. So once it gets to the `seenAnySeed` check it fails, as the endpointStateMap is empty.
>
> A workaround is copying the system.peers table from another node, but this is less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
>     /* Possibly gossip to a seed for facilitating partition healing */
>     private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
>     {
>         int size = seeds.size();
>         if (size > 0)
>         {
>             if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>             {
>                 return;
>             }
>
>             if (liveEndpoints.size() == 0)
>             {
>                 List<GossipDigest> gDigests = prod.payload.gDigests;
>                 if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>                 {
>                     gDigests = new ArrayList<GossipDigest>();
>                     GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                            DatabaseDescriptor.getPartitionerName(),
>                                                                            gDigests);
>                     MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                           digestSynMessage,
>                                                                                           GossipDigestSyn.serializer);
>                     sendGossip(message, seeds);
>                 }
>                 else
>                 {
>                     sendGossip(prod, seeds);
>                 }
>             }
>             else
>             {
>                 /* Gossip with the seed with some probability. */
>                 double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
>             }
>         }
>     }
> {code}
> Only problem is this is the same as the SYN from the shadow round. It does resolve the issue, however, as it then receives an ACK with all the states.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
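To illustrate the behaviour the comment is asking about, here is a simplified, hypothetical sketch of how a SYN receiver could also return full state for local endpoints that are missing from the incoming digest list. The types below are stand-ins invented for illustration, not the actual Gossiper internals:

{code:java}
import java.net.InetAddress;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for gossip types; illustration only, not the real examineGossiper.
final class EndpointState { /* heartbeat and application states would live here */ }
record GossipDigest(InetAddress endpoint, int generation, int maxVersion) { }

public final class ExamineGossiperSketch
{
    private final Map<InetAddress, EndpointState> endpointStateMap = new HashMap<>();

    /**
     * In addition to iterating the digests sent by the SYN originator, also walk the local
     * endpointStateMap and return full state for any endpoint the originator did not mention,
     * so a node that only knows about itself still learns about the rest of the cluster.
     */
    public Map<InetAddress, EndpointState> statesMissingFromDigests(List<GossipDigest> gDigestList)
    {
        Map<InetAddress, EndpointState> deltaStateMap = new HashMap<>(endpointStateMap);
        for (GossipDigest digest : gDigestList)
            deltaStateMap.remove(digest.endpoint()); // already covered by normal digest handling
        return deltaStateMap; // would be added to the ACK alongside the usual deltas
    }
}
{code}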
[jira] [Commented] (CASSANDRA-19578) Concurrent equivalent schema updates lead to unresolved disagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842181#comment-17842181 ] Jordan West commented on CASSANDRA-19578: - 4.1 is when it was introduced but afaik the issue wasn't fixed so its likely broken in 5.0.x as well -- or any code where TCM isn't merged post 4.1 > Concurrent equivalent schema updates lead to unresolved disagreement > > > Key: CASSANDRA-19578 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19578 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Chris Lohfink >Priority: Normal > Fix For: 4.1.x, 5.0.x > > > As part of CASSANDRA-17819 a check for empty schema changes was added to the > updateSchema. This only looks at the _logical_ schema difference of the > schemas, but the changes made to the system_schema keyspace are the ones that > actually are involved in the digest. > If two nodes issue the same CREATE statement the difference from the > keyspace.diff would be empty but the timestamps on the mutations would be > different, leading to a pseudo schema disagreement which will never resolve > until resetlocalschema or nodes being bounced. > Only impacts 4.1 > test and fix : > https://github.com/clohfink/cassandra/commit/ba915f839089006ac6d08494ef19dc010bcd6411 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
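For readers unfamiliar with how "equivalent" schemas can still disagree: a toy sketch of why a digest computed over the stored system_schema mutations (which carry write timestamps) can differ even when the logical schema is identical. This is illustrative only and is not Cassandra's actual schema digest code:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

// Toy illustration of the disagreement described above; not Cassandra's actual digest computation.
public final class SchemaDigestSketch
{
    /** Digest over the logical definition only: identical CREATEs agree. */
    static UUID logicalDigest(String createStatement) throws Exception
    {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(createStatement.getBytes(StandardCharsets.UTF_8));
        return UUID.nameUUIDFromBytes(md5.digest());
    }

    /** Digest over the stored mutation, including its write timestamp: identical CREATEs can disagree. */
    static UUID mutationDigest(String createStatement, long writeTimestampMicros) throws Exception
    {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(createStatement.getBytes(StandardCharsets.UTF_8));
        md5.update(Long.toString(writeTimestampMicros).getBytes(StandardCharsets.UTF_8));
        return UUID.nameUUIDFromBytes(md5.digest());
    }

    public static void main(String[] args) throws Exception
    {
        String ddl = "CREATE TABLE ks.t (k int PRIMARY KEY)";
        // Two nodes apply the same statement at slightly different times:
        System.out.println(logicalDigest(ddl).equals(logicalDigest(ddl)));                 // true
        System.out.println(mutationDigest(ddl, 1000L).equals(mutationDigest(ddl, 1001L))); // false
    }
}
{code}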
[jira] [Updated] (CASSANDRA-19341) Relation and Restriction hierarchies are too complex and error prone
[ https://issues.apache.org/jira/browse/CASSANDRA-19341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-19341: Fix Version/s: 5.x > Relation and Restriction hierarchies are too complex and error prone > --- > > Key: CASSANDRA-19341 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19341 > Project: Cassandra > Issue Type: Improvement > Components: CQL/Interpreter >Reporter: Benjamin Lerer >Assignee: Benjamin Lerer >Priority: Normal > Fix For: 5.x > > Time Spent: 21h 50m > Remaining Estimate: 0h > > The {{Relation}} and {{Restriction}} hierarchies were designed when C* > supported only a limited number of operators and column expressions (single > column, multi-column and token expressions). Over time they have grown in > complexity, making the code harder to understand and modify, and more error prone. > Their design is also resulting in unnecessary limitations that could be > easily lifted, like the ability to accept different predicates on the same > column. > Today, adding a new operator requires a lot of glue code and > surgical changes across the CQL layer, making patches for features such as > CASSANDRA-18584 much more complex than they should be. > The goal of this ticket is to simplify the {{Relation}} and {{Restriction}} > hierarchies and modify the operator class so that adding new operators requires > only changes to the {{Operator}} class and the ANTLR file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19341) Relation and Restriction hierarchies are too complex and error prone
[ https://issues.apache.org/jira/browse/CASSANDRA-19341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-19341: Reviewers: Berenguer Blasi, Ekaterina Dimitrova (was: Ekaterina Dimitrova) > Relation and Restriction hierarchies are too complex and error prone > --- > > Key: CASSANDRA-19341 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19341 > Project: Cassandra > Issue Type: Improvement > Components: CQL/Interpreter >Reporter: Benjamin Lerer >Assignee: Benjamin Lerer >Priority: Normal > Time Spent: 21h 50m > Remaining Estimate: 0h > > The {{Relation}} and {{Restriction}} hierarchies were designed when C* > supported only a limited number of operators and column expressions (single > column, multi-column and token expressions). Over time they have grown in > complexity, making the code harder to understand and modify, and more error prone. > Their design is also resulting in unnecessary limitations that could be > easily lifted, like the ability to accept different predicates on the same > column. > Today, adding a new operator requires a lot of glue code and > surgical changes across the CQL layer, making patches for features such as > CASSANDRA-18584 much more complex than they should be. > The goal of this ticket is to simplify the {{Relation}} and {{Restriction}} > hierarchies and modify the operator class so that adding new operators requires > only changes to the {{Operator}} class and the ANTLR file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
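As a rough illustration of the refactoring goal described in the ticket (an operator that carries its own CQL token and evaluation logic, so adding one touches a single class plus the grammar), here is a hypothetical sketch; it is not the actual {{Operator}} class from the patch:

{code:java}
import java.util.function.BiPredicate;

// Hypothetical sketch of a self-contained operator definition; not Cassandra's actual Operator class.
public enum OperatorSketch
{
    EQ("=",   (a, b) -> a.compareTo(b) == 0),
    LT("<",   (a, b) -> a.compareTo(b) < 0),
    LTE("<=", (a, b) -> a.compareTo(b) <= 0),
    GT(">",   (a, b) -> a.compareTo(b) > 0),
    GTE(">=", (a, b) -> a.compareTo(b) >= 0);
    // Adding a new operator would mean adding one constant here plus one token in the grammar file.

    private final String cql;
    private final BiPredicate<Comparable<Object>, Object> predicate;

    OperatorSketch(String cql, BiPredicate<Comparable<Object>, Object> predicate)
    {
        this.cql = cql;
        this.predicate = predicate;
    }

    public String toCql()
    {
        return cql;
    }

    @SuppressWarnings("unchecked")
    public boolean isSatisfiedBy(Comparable<?> columnValue, Object restrictionValue)
    {
        return predicate.test((Comparable<Object>) columnValue, restrictionValue);
    }
}
{code}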
[jira] [Commented] (CASSANDRA-19182) IR may leak SSTables with pending repair when coming from streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-19182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842169#comment-17842169 ] David Capwell commented on CASSANDRA-19182: --- I just got back from vacation, I don't see this merged yet so going to restart the merge process. > IR may leak SSTables with pending repair when coming from streaming > --- > > Key: CASSANDRA-19182 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19182 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > Attachments: > ci_summary-trunk-a1010f4101bf259de3f31077540e4f987d5df9c5.html > > Time Spent: 1h 40m > Remaining Estimate: 0h > > There is a race condition where SSTables from streaming may race with pending > repair cleanup in compaction causing us to cleanup the pending repair state > in compaction while the SSTables are being added to it; this leads to IR > failing in the future when those files get selected for repair. > This problem was hard to track down as the in-memory state was wiped, so we > don’t have any details. To better aid these types of investigation we should > make sure the repair vtables get updated when IR session failures are > submitted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
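As a loose sketch of the kind of race described above and one way to close it (refusing to discard pending-repair state while streams are still adding SSTables to it), with all names invented for illustration rather than taken from the actual compaction code:

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Invented names; a sketch of guarding pending-repair cleanup against concurrent streaming, not the real fix.
public final class PendingRepairCleanupSketch
{
    // pending repair session -> sstables currently tracked for it
    private final Map<UUID, Set<String>> pendingRepairSstables = new ConcurrentHashMap<>();
    // pending repair session -> number of in-flight streams that may still add sstables
    private final Map<UUID, Integer> inflightStreams = new ConcurrentHashMap<>();

    public void streamStarted(UUID sessionId)
    {
        inflightStreams.merge(sessionId, 1, Integer::sum);
    }

    public void streamFinished(UUID sessionId)
    {
        inflightStreams.merge(sessionId, -1, Integer::sum);
    }

    /** Only drop the in-memory pending-repair state when no stream can still add sstables to it. */
    public boolean tryCleanup(UUID sessionId)
    {
        // A real implementation would need this check-and-remove to be atomic with respect to streamStarted.
        if (inflightStreams.getOrDefault(sessionId, 0) > 0)
            return false; // racing with streaming; retry later instead of wiping the state
        pendingRepairSstables.remove(sessionId);
        return true;
    }
}
{code}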
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842171#comment-17842171 ] Alex Petrov commented on CASSANDRA-19534: - This is great, thank you for testing! My 100s timeout was erring (probably too far) on the side of sticking to the old behaviour. I was slightly concerned that people will see timeouts and conclude this is not something they want. But unfortunately there’s no way for us to produce reasonable workload balance without shedding some load and timing out lagging requests. I will update a default to 12s. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842164#comment-17842164 ] Brandon Williams commented on CASSANDRA-19534: -- bq. I'd suggest setting cql_start_time to REQUEST This appears to be the default in the patch, so first I ran with no config changes. Here are the KeyValue ECS numbers while the random workload is also running with an increased rate of 300: {noformat} Writes Reads Deletes Errors Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | Count 1min (errors/s) 1035374 100254.13 0 | 828223 100252.6 0 | 0 0 0 | 91260303 20005.4 1035374 100254.13 0 | 828223 100252.6 0 | 0 0 0 | 91320989 20007.02 1035374 100254.13 0 | 828223 100252.6 0 | 0 0 0 | 91380356 20007.02 1035374 100254.13 0 | 828223 100252.6 0 | 0 0 0 | 91441015 19976.79 {noformat} We can see the 100ms native transport timeout default which is stable, and with the ECS rate set to 20k/s it is doing nothing but throwing errors at this point. There was also a good amount of GC pressure. With the native transport timeout adjusted to 12s: {noformat} Writes Reads Deletes Errors Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | Count 1min (errors/s) 6362953 12019.36 7602.56 | 6346212 12016.37 7581.98 | 0 0 0 | 1639458 4976.36 6384650 12016.847566.8 | 6367878 12023.32 7553.07 | 0 0 0 | 1655989 5033.01 6405461 12016.847566.8 | 6388707 12023.32 7553.07 | 0 0 0 | 1674127 5033.01 6426641 12016.84 7510.02 | 6409624 12021.767493.9 | 0 0 0 | 1693822 5158.58 {noformat} We can see the timeout reflected again, but this time without heap pressure it continues to serve many requests. Finally, here is cql_start_time set to QUEUE and the native transport timeout at 12s: {noformat} Writes Reads Deletes Errors Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | Count 1min (errors/s) 505121 11983.81 53.36 | 7949266334.45113.39 | 0 0 0 | 5350041 19782.8 505123 11983.81 49.13 | 7949266334.45104.33 | 0 0 0 | 5410428 19815.76 505137 11983.81 49.13 | 7949266334.45104.33 | 0 0 0 | 5468740 19815.76 505145 11983.81 45.53 | 7949266334.45 95.99 | 0 0 0 | 5528104 19848.02 {noformat} This also ended up throwing errors but still respected the timeout. This patch appears to solve the runaway latency growth as requests never last beyond the native transport timeout. I still think the 100s default is too high; it's the closest to the unbounded behavior from before but still detrimental and probably not what most people actually want especially since it may exert additional GC pressure. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configu
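A small sketch of the deadline-based shedding being exercised above: requests record when they entered the system, and workers drop anything whose age already exceeds the configured native transport timeout instead of processing it, with a bounded queue in front. The class and field names are illustrative and are not taken from the actual patch:

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only; names and structure are not from the actual CASSANDRA-19534 patch.
public final class DeadlineQueueSketch
{
    record Request(long enqueuedAtNanos, Runnable work) { }

    private final BlockingQueue<Request> queue = new ArrayBlockingQueue<>(1024); // bounded, unlike before
    private final long timeoutNanos;

    public DeadlineQueueSketch(long timeoutMillis)
    {
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    /** Producer side: reject immediately when the queue is full instead of buffering without bound. */
    public boolean offer(Runnable work)
    {
        return queue.offer(new Request(System.nanoTime(), work));
    }

    /** Consumer side: shed requests that have already waited longer than the timeout. */
    public void drainOnce() throws InterruptedException
    {
        Request request = queue.take();
        long age = System.nanoTime() - request.enqueuedAtNanos();
        if (age > timeoutNanos)
            return; // shed: the client has (or will) time out anyway, so don't burn resources on it
        request.work().run();
    }
}
{code}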
(cassandra-website) branch asf-staging updated (6b58d505 -> 836a863a)
This is an automated email from the ASF dual-hosted git repository. git-site-role pushed a change to branch asf-staging in repository https://gitbox.apache.org/repos/asf/cassandra-website.git discard 6b58d505 generate docs for cc1c7113 new 836a863a generate docs for cc1c7113 This update added new revisions after undoing existing revisions. That is to say, some revisions that were in the old version of the branch are not in the new version. This situation occurs when a user --force pushes a change and generates a repository containing something like this: * -- * -- B -- O -- O -- O (6b58d505) \ N -- N -- N refs/heads/asf-staging (836a863a) You should already have received notification emails for all of the O revisions, and so the following emails describe only the N revisions from the common base, B. Any revisions marked "omit" are not gone; other references still refer to them. Any revisions marked "discard" are gone forever. The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: content/search-index.js | 2 +- site-ui/build/ui-bundle.zip | Bin 4883646 -> 4883646 bytes 2 files changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842098#comment-17842098 ] Benedict Elliott Smith edited comment on CASSANDRA-19597 at 4/29/24 8:02 PM: - Yes, exactly. If I remember correctly, this "queue" was originally intended to achieve two things: 1) ensure commit log records are invalidated correctly, as it used to only support essentially invalidations of a complete prefix; 2) serve as a kind of fsync so that when awaiting the completion of a flush on a particular table you can be certain all data written prior has made it to sstables I'm not actually sure if any of this is necessary today though. Pretty sure we invalidate explicit ranges now, so the commit log semantics do not require this. I'm not sure off the top of my head why (except for non-durable tables/writes, or things that might want to read sstables prior to commit log replay) you would ever need to know all prior flushes had completed though, since the commit log will ensure they are re-written on restart. But a low risk approach would be to just make this a per table queue. was (Author: benedict): Yes, exactly. If I remember correctly, this "queue" was originally intended to achieve two things: 1) ensure commit log records are invalidated correctly, as it used to only support essentially invalidations of a complete prefix; 2) serve as a kind of fsync so that when awaiting the completion of a flush on a particular table you can be certain all data written prior has made it to disk I'm not actually sure if any of this is necessary today though. Pretty sure we invalidate explicit ranges now, so the commit log semantics do not require this. I'm not off the top of my head sure why (except for non-durable tables/writes) you would ever need to know all prior flushes had completed though, since the commit log will ensure they are re-written on restart. But a low risk approach would be to just make this a per table queue. > SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction > - > > Key: CASSANDRA-19597 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19597 > Project: Cassandra > Issue Type: Bug >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Normal > > There is a single post flush thread and that thread processes tasks in order > and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, > and that memtable flush can be blocked by slow IntervalTree building and > racing with compactors to try and build an interval tree. > Unless there is a requirement for ordering we probably want to loosen this to > the actual ordering requirement so that problems in one keyspace can’t effect > another. > SystemKeyspace and Gossip in particular cause lots of weird problems like > nodes marking each other down because Gossip can’t process nodes being > removed (blocking flush each time in SystemKeyspace.removeNode) > A very simple fix here might be to queue the post flush task at the same time > as the flush in a per CFS queue, and then submit the task only once the flush > is completed. > If flushes complete out of order the queue will still ensure their > completions are processed in order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
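A rough sketch of the "queue the post-flush task per table at submission time, run it once its flush completes" idea discussed above, using invented names and java.util.concurrent primitives rather than Cassandra's actual flush machinery:

{code:java}
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Invented names; a per-table post-flush ordering sketch, not Cassandra's actual implementation.
public final class PerTablePostFlushSketch
{
    // Tail of each table's chain; post-flush tasks for a table run in submission order.
    private final Map<String, CompletableFuture<Void>> perTableTail = new ConcurrentHashMap<>();

    /**
     * Called when a flush is submitted. The post-flush task runs once this flush *and* all
     * previously submitted flushes of the same table have completed, so completions are
     * processed in order per table even if the flushes themselves finish out of order,
     * while a slow flush in an unrelated keyspace no longer blocks anyone.
     */
    public CompletableFuture<Void> onFlushSubmitted(String table, CompletableFuture<Void> flush, Runnable postFlush)
    {
        return perTableTail.compute(table, (ignored, tail) ->
        {
            CompletableFuture<Void> previous = tail != null ? tail : CompletableFuture.completedFuture(null);
            return previous.thenCombine(flush, (a, b) -> (Void) null)
                           .thenRun(postFlush);
        });
    }
}
{code}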
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842114#comment-17842114 ] Jon Haddad commented on CASSANDRA-19534: I can fire it up this week. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19567) Minimize the heap consumption when registering metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-19567: Fix Version/s: 5.1 (was: 5.x) > Minimize the heap consumption when registering metrics > -- > > Key: CASSANDRA-19567 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19567 > Project: Cassandra > Issue Type: Bug > Components: Observability/Metrics >Reporter: Maxim Muzafarov >Assignee: Maxim Muzafarov >Priority: Normal > Fix For: 5.1 > > Attachments: summary.png > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The problem is only reproducible on the x86 machine, the problem is not > reproducible on the arm64. A quick analysis showed a lot of MetricName > objects stored in the heap, although the real cause could be related to > something else, the MetricName object requires extra attention. > To reproduce run the command run locally: > {code} > ant test-jvm-dtest-some > -Dtest.name=org.apache.cassandra.distributed.test.ReadRepairTest > {code} > The error: > {code:java} > [junit-timeout] Exception in thread "main" java.lang.OutOfMemoryError: Java > heap space > [junit-timeout] at > java.base/java.lang.StringLatin1.newString(StringLatin1.java:769) > [junit-timeout] at > java.base/java.lang.StringBuffer.toString(StringBuffer.java:716) > [junit-timeout] at > org.apache.cassandra.CassandraBriefJUnitResultFormatter.endTestSuite(CassandraBriefJUnitResultFormatter.java:191) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.fireEndTestSuite(JUnitTestRunner.java:854) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:578) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1197) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:1042) > [junit-timeout] Testsuite: > org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED > [junit-timeout] Testsuite: > org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec > [junit-timeout] > [junit-timeout] Testcase: > org.apache.cassandra.distributed.test.ReadRepairTest:readRepairRTRangeMovementTest-cassandra.testtag_IS_UNDEFINED: > Caused an ERROR > [junit-timeout] Forked Java VM exited abnormally. Please note the time in the > report does not reflect the time until the VM exit. > [junit-timeout] junit.framework.AssertionFailedError: Forked Java VM exited > abnormally. Please note the time in the report does not reflect the time > until the VM exit. 
> [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at java.base/java.util.Vector.forEach(Vector.java:1365) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at java.base/java.util.Vector.forEach(Vector.java:1365) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] > [junit-timeout] > [junit-timeout] Test org.apache.cassandra.distributed.test.ReadRepairTest > FAILED (crashed)BUILD FAILED > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
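As a generic illustration of one way to reduce the heap cost of many similar metric-name objects (deduplicating them in a cache so repeated registrations share a single instance), with invented names and no claim that this is the approach taken in the ticket's patch:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Generic interning sketch with invented names; not the code or approach from CASSANDRA-19567's patch.
public final class MetricNameInterner
{
    private static final ConcurrentMap<String, String> CACHE = new ConcurrentHashMap<>();

    /**
     * Metric names are built from the same group/type/scope fragments many thousands of times
     * (once per table, and per node instance in in-JVM dtests). Returning a canonical instance
     * means each distinct name string is kept on the heap only once.
     */
    public static String intern(String group, String type, String scope, String name)
    {
        String full = group + '.' + type + '.' + scope + '.' + name;
        String existing = CACHE.putIfAbsent(full, full);
        return existing != null ? existing : full;
    }
}
{code}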
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842112#comment-17842112 ] Alex Petrov commented on CASSANDRA-19534: - [~brandon.williams] [~rustyrazorblade] would you be so kind to try running your tests? I suggest setting {{native_transport_timeout_in_ms}} to about 10 (or 12 max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is optional, as we will not roll it out with this setting enabled. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842112#comment-17842112 ] Alex Petrov edited comment on CASSANDRA-19534 at 4/29/24 5:24 PM: -- [~brandon.williams] [~rustyrazorblade] would you be so kind to try running your tests against the branch posted above? I suggest setting {{native_transport_timeout_in_ms}} to about 10 (or 12 max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is optional, as we will not roll it out with this setting enabled. was (Author: ifesdjeen): [~brandon.williams] [~rustyrazorblade] would you be so kind to try running your tests? I suggest setting {{native_transport_timeout_in_ms}} to about 10 (or 12 max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is optional, as we will not roll it out with this setting enabled. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19578) Concurrent equivalent schema updates lead to unresolved disagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842111#comment-17842111 ] Brandon Williams commented on CASSANDRA-19578: -- Hmm, I see in the description this says "Only impacts 4.1" but then [~jwest] added 5.0. If I apply the added test to 5.0 (and make DefaultSchemaUpdateHandler.applyMutations public so it can run) then it fails there also. Does this affect 5.0? > Concurrent equivalent schema updates lead to unresolved disagreement > > > Key: CASSANDRA-19578 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19578 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Chris Lohfink >Priority: Normal > Fix For: 4.1.5, 5.0-beta2 > > > As part of CASSANDRA-17819 a check for empty schema changes was added to the > updateSchema. This only looks at the _logical_ schema difference of the > schemas, but the changes made to the system_schema keyspace are the ones that > actually are involved in the digest. > If two nodes issue the same CREATE statement the difference from the > keyspace.diff would be empty but the timestamps on the mutations would be > different, leading to a pseudo schema disagreement which will never resolve > until resetlocalschema or nodes being bounced. > Only impacts 4.1 > test and fix : > https://github.com/clohfink/cassandra/commit/ba915f839089006ac6d08494ef19dc010bcd6411 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19578) Concurrent equivalent schema updates lead to unresolved disagreement
[ https://issues.apache.org/jira/browse/CASSANDRA-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-19578: - Fix Version/s: 4.1.x 5.0.x (was: 5.0-beta2) (was: 4.1.5) > Concurrent equivalent schema updates lead to unresolved disagreement > > > Key: CASSANDRA-19578 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19578 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Chris Lohfink >Priority: Normal > Fix For: 4.1.x, 5.0.x > > > As part of CASSANDRA-17819 a check for empty schema changes was added to the > updateSchema. This only looks at the _logical_ schema difference of the > schemas, but the changes made to the system_schema keyspace are the ones that > actually are involved in the digest. > If two nodes issue the same CREATE statement the difference from the > keyspace.diff would be empty but the timestamps on the mutations would be > different, leading to a pseudo schema disagreement which will never resolve > until resetlocalschema or nodes being bounced. > Only impacts 4.1 > test and fix : > https://github.com/clohfink/cassandra/commit/ba915f839089006ac6d08494ef19dc010bcd6411 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Test and Documentation Plan: Includes tests, also was tested separately; screenshots and description attached Status: Patch Available (was: Open) > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Attachment: ci_summary.html > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842098#comment-17842098 ] Benedict Elliott Smith commented on CASSANDRA-19597: Yes, exactly. If I remember correctly, this "queue" was originally intended to achieve two things: 1) ensure commit log records are invalidated correctly, as it used to only support essentially invalidations of a complete prefix; 2) serve as a kind of fsync so that when awaiting the completion of a flush on a particular table you can be certain all data written prior has made it to disk I'm not actually sure if any of this is necessary today though. Pretty sure we invalidate explicit ranges now, so the commit log semantics do not require this. I'm not off the top of my head sure why (except for non-durable tables/writes) you would ever need to know all prior flushes had completed though, since the commit log will ensure they are re-written on restart. But a low risk approach would be to just make this a per table queue. > SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction > - > > Key: CASSANDRA-19597 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19597 > Project: Cassandra > Issue Type: Bug >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Normal > > There is a single post flush thread and that thread processes tasks in order > and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, > and that memtable flush can be blocked by slow IntervalTree building and > racing with compactors to try and build an interval tree. > Unless there is a requirement for ordering we probably want to loosen this to > the actual ordering requirement so that problems in one keyspace can’t effect > another. > SystemKeyspace and Gossip in particular cause lots of weird problems like > nodes marking each other down because Gossip can’t process nodes being > removed (blocking flush each time in SystemKeyspace.removeNode) > A very simple fix here might be to queue the post flush task at the same time > as the flush in a per CFS queue, and then submit the task only once the flush > is completed. > If flushes complete out of order the queue will still ensure their > completions are processed in order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
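A minimal sketch of the per-table (per-CFS) queue suggested above, with invented class names: the post-flush task is enqueued when the flush starts and only handed to the shared post-flush executor once its own flush future completes, so completions stay ordered per CFS even when flushes finish out of order, and a slow flush in one keyspace cannot block another.
{code:java}
// Hedged sketch only; these are not the actual Cassandra classes.
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

public class PerCfsPostFlushSketch
{
    private final Queue<CompletableFuture<Runnable>> pending = new ArrayDeque<>(); // one queue per CFS
    private final Executor postFlushExecutor;

    PerCfsPostFlushSketch(Executor postFlushExecutor) { this.postFlushExecutor = postFlushExecutor; }

    /** Called when a flush for this CFS is started. */
    synchronized void onFlushStarted(CompletableFuture<Void> flushFuture, Runnable postFlushTask)
    {
        pending.add(flushFuture.thenApply(ignored -> postFlushTask));
        flushFuture.thenRun(this::drainCompleted);
    }

    /** Submit completed post-flush tasks strictly in the order their flushes were started. */
    private synchronized void drainCompleted()
    {
        // Only ever drain from the head, so a flush that completes early still
        // waits for earlier flushes of this CFS before its completion is processed.
        while (!pending.isEmpty() && pending.peek().isDone())
            postFlushExecutor.execute(pending.poll().join());
    }
}
{code}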
[jira] [Commented] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842091#comment-17842091 ] Ariel Weisberg commented on CASSANDRA-19597: [~benedict] is the requirement for post flush processing that it be done in order per CFS so a per CFS queue would actually address the problem of keeping the post flush processing in order? > SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction > - > > Key: CASSANDRA-19597 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19597 > Project: Cassandra > Issue Type: Bug >Reporter: Ariel Weisberg >Priority: Normal > > There is a single post flush thread and that thread processes tasks in order > and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, > and that memtable flush can be blocked by slow IntervalTree building and > racing with compactors to try and build an interval tree. > Unless there is a requirement for ordering we probably want to loosen this to > the actual ordering requirement so that problems in one keyspace can’t effect > another. > SystemKeyspace and Gossip in particular cause lots of weird problems like > nodes marking each other down because Gossip can’t process nodes being > removed (blocking flush each time in SystemKeyspace.removeNode) > A very simple fix here might be to queue the post flush task at the same time > as the flush in a per CFS queue, and then submit the task only once the flush > is completed. > If flushes complete out of order the queue will still ensure their > completions are processed in order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ariel Weisberg reassigned CASSANDRA-19597: -- Assignee: Ariel Weisberg > SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction > - > > Key: CASSANDRA-19597 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19597 > Project: Cassandra > Issue Type: Bug >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Normal > > There is a single post flush thread and that thread processes tasks in order > and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, > and that memtable flush can be blocked by slow IntervalTree building and > racing with compactors to try and build an interval tree. > Unless there is a requirement for ordering we probably want to loosen this to > the actual ordering requirement so that problems in one keyspace can’t effect > another. > SystemKeyspace and Gossip in particular cause lots of weird problems like > nodes marking each other down because Gossip can’t process nodes being > removed (blocking flush each time in SystemKeyspace.removeNode) > A very simple fix here might be to queue the post flush task at the same time > as the flush in a per CFS queue, and then submit the task only once the flush > is completed. > If flushes complete out of order the queue will still ensure their > completions are processed in order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
Ariel Weisberg created CASSANDRA-19597: -- Summary: SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction Key: CASSANDRA-19597 URL: https://issues.apache.org/jira/browse/CASSANDRA-19597 Project: Cassandra Issue Type: Bug Reporter: Ariel Weisberg There is a single post flush thread and that thread processes tasks in order and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, and that memtable flush can be blocked by slow IntervalTree building and racing with compactors to try and build an interval tree. Unless there is a requirement for ordering we probably want to loosen this to the actual ordering requirement so that problems in one keyspace can’t effect another. SystemKeyspace and Gossip in particular cause lots of weird problems like nodes marking each other down because Gossip can’t process nodes being removed (blocking flush each time in SystemKeyspace.removeNode) A very simple fix here might be to queue the post flush task at the same time as the flush in a per CFS queue, and then submit the task only once the flush is completed. If flushes complete out of order the queue will still ensure their completions are processed in order. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19596) IntervalTree build throughput is low enough to be a bottleneck
Ariel Weisberg created CASSANDRA-19596: -- Summary: IntervalTree build throughput is low enough to be a bottleneck Key: CASSANDRA-19596 URL: https://issues.apache.org/jira/browse/CASSANDRA-19596 Project: Cassandra Issue Type: Improvement Components: Local/Compaction, Local/SSTable Reporter: Ariel Weisberg With several terabytes of data and 8 compactors it’s possible for the compactors to spend a lot of time blocked waiting on IntervalTrees to be built. There is also a lot of wasted CPU because it’s updated optimistically so most of them end up being thrown away. This can end up being quite painful because it can block memtable flushing as well and then a single slow CFS can block unrelated CFS because the memtable post flush executor is single threaded and shared across all CFS. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
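A rough illustration, with invented names, of the optimistic-update pattern the description blames for the wasted CPU: every writer rebuilds the full immutable structure and publishes it with a compare-and-set, so under contention most rebuilds lose the race and are discarded.
{code:java}
// Hedged sketch only, not the actual IntervalTree code: a list stands in for
// the expensive tree build so the waste pattern is visible.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class OptimisticRebuildSketch<T>
{
    private final AtomicReference<List<T>> view = new AtomicReference<>(List.of());

    void add(T sstable)
    {
        while (true)
        {
            List<T> current = view.get();
            List<T> rebuilt = new ArrayList<>(current); // stands in for the expensive tree build
            rebuilt.add(sstable);
            if (view.compareAndSet(current, List.copyOf(rebuilt)))
                return;                                  // winner: new view published
            // loser: the whole rebuild above is thrown away and repeated,
            // which is where compactors and flushes burn CPU and block
        }
    }
}
{code}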
[jira] [Commented] (CASSANDRA-19490) Add foundation for independent parsing of junit based output for CI reporting to cassandra-builds
[ https://issues.apache.org/jira/browse/CASSANDRA-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842087#comment-17842087 ] Josh McKenzie commented on CASSANDRA-19490: --- Nothing here. Was swamped just getting all the things together and working; right now it's honor system. Once we have CASSANDRA-18731 this _should_ be moot since the config would indicate resource limits. The resource allocation in the system I cobbled together on top of David's work is actually significantly more constrained in both CPU and RAM compared to what we discussed and what's available in ASF CI, but it's all in bespoke .yml files for the parallelizer David wrote. > Add foundation for independent parsing of junit based output for CI reporting > to cassandra-builds > - > > Key: CASSANDRA-19490 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19490 > Project: Cassandra > Issue Type: New Feature > Components: CI >Reporter: Josh McKenzie >Assignee: Josh McKenzie >Priority: Normal > > PR attached. > For doing CI ourselves, it's useful to have a single pane of glass report > where you have a summary of results for all your suites as well as inlined > failures. This should be agnostic to any xunit based output; so long as we > co-locate the xunit data in directories adjacent to one another, the script > in the PR will generate an in-memory representation of the xunit results as > well as inline failures to an existing .html file. > The contents will need to be tweaked a bit to generate the top level branch + > sha + checkstyle + summaries information, but the vast majority of that is > already parsed and easily available within the script and can be extended > pretty trivially. > Opening up a pr to pull this into > [cassandra-builds](https://github.com/apache/cassandra-builds) since [~mck] > is actively working on that and needs these primitives. I'd expect the > contents in ci_parser to be massaged to become a more finalized, full > solution before we start to use it but no harm in the incremental merge. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
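For context, a small hypothetical sketch of the kind of parsing the ticket describes, walking co-located directories of junit/xunit XML output and producing a per-suite summary; the actual script attached to the PR is more complete and also inlines failures into an existing .html file.
{code:java}
// Hedged sketch only: file layout and output format are assumptions, not the
// attached ci_parser script.
import java.nio.file.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class XunitSummarySketch
{
    public static void main(String[] args) throws Exception
    {
        Path root = Paths.get(args.length > 0 ? args[0] : "build/test/output");
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try (var files = Files.walk(root))
        {
            files.filter(p -> p.getFileName().toString().matches("TEST-.*\\.xml"))
                 .forEach(p -> {
                     try
                     {
                         Element suite = dbf.newDocumentBuilder().parse(p.toFile()).getDocumentElement();
                         System.out.printf("%s: tests=%s failures=%s errors=%s%n",
                                           suite.getAttribute("name"),
                                           suite.getAttribute("tests"),
                                           suite.getAttribute("failures"),
                                           suite.getAttribute("errors"));
                     }
                     catch (Exception e) { System.err.println("unparseable: " + p); }
                 });
        }
    }
}
{code}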
[jira] [Commented] (CASSANDRA-19490) Add foundation for independent parsing of junit based output for CI reporting to cassandra-builds
[ https://issues.apache.org/jira/browse/CASSANDRA-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842084#comment-17842084 ] Michael Semb Wever commented on CASSANDRA-19490: yes. do we have a ticket for a script that validates a results_details.tar.xz meets pre-commit requirements …? > Add foundation for independent parsing of junit based output for CI reporting > to cassandra-builds > - > > Key: CASSANDRA-19490 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19490 > Project: Cassandra > Issue Type: New Feature > Components: CI >Reporter: Josh McKenzie >Assignee: Josh McKenzie >Priority: Normal > > PR attached. > For doing CI ourselves, it's useful to have a single pane of glass report > where you have a summary of results for all your suites as well as inlined > failures. This should be agnostic to any xunit based output; so long as we > co-locate the xunit data in directories adjacent to one another, the script > in the PR will generate an in-memory representation of the xunit results as > well as inline failures to an existing .html file. > The contents will need to be tweaked a bit to generate the top level branch + > sha + checkstyle + summaries information, but the vast majority of that is > already parsed and easily available within the script and can be extended > pretty trivially. > Opening up a pr to pull this into > [cassandra-builds](https://github.com/apache/cassandra-builds) since [~mck] > is actively working on that and needs these primitives. I'd expect the > contents in ci_parser to be massaged to become a more finalized, full > solution before we start to use it but no harm in the incremental merge. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19490) Add foundation for independent parsing of junit based output for CI reporting to cassandra-builds
[ https://issues.apache.org/jira/browse/CASSANDRA-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842083#comment-17842083 ] Josh McKenzie commented on CASSANDRA-19490: --- You've integrated this now right [~mck] ? i.e. can we close this out? > Add foundation for independent parsing of junit based output for CI reporting > to cassandra-builds > - > > Key: CASSANDRA-19490 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19490 > Project: Cassandra > Issue Type: New Feature > Components: CI >Reporter: Josh McKenzie >Assignee: Josh McKenzie >Priority: Normal > > PR attached. > For doing CI ourselves, it's useful to have a single pane of glass report > where you have a summary of results for all your suites as well as inlined > failures. This should be agnostic to any xunit based output; so long as we > co-locate the xunit data in directories adjacent to one another, the script > in the PR will generate an in-memory representation of the xunit results as > well as inline failures to an existing .html file. > The contents will need to be tweaked a bit to generate the top level branch + > sha + checkstyle + summaries information, but the vast majority of that is > already parsed and easily available within the script and can be extended > pretty trivially. > Opening up a pr to pull this into > [cassandra-builds](https://github.com/apache/cassandra-builds) since [~mck] > is actively working on that and needs these primitives. I'd expect the > contents in ci_parser to be massaged to become a more finalized, full > solution before we start to use it but no harm in the incremental merge. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19583) Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-19583: Summary: Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec) (was: enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)) > Make 0 work as 0+unit for all three config classes (DataStorageSpec, > DurationSpec, DataRateSpec) > > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 without a unit attached for data, duration, and data spec > config parameters, as 0 is always 0 no matter the unit. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
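A minimal sketch, with invented method names, of the behaviour the new summary asks for: a bare "0" parses as zero without a unit, while any non-zero value still requires one of the accepted units.
{code:java}
// Hedged sketch only; the real DataRateSpec/DurationSpec/DataStorageSpec
// constructors and accepted units differ.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroSpecSketch
{
    private static final Pattern WITH_UNIT = Pattern.compile("(\\d+)\\s*(MiB/s|KiB/s|B/s)");

    static long parseBytesPerSecond(String spec)
    {
        if (spec.trim().equals("0"))
            return 0L;                       // 0 is 0 in every unit, so no suffix is needed
        Matcher m = WITH_UNIT.matcher(spec.trim());
        if (!m.matches())
            throw new IllegalArgumentException("Invalid data rate: " + spec);
        long value = Long.parseLong(m.group(1));
        return switch (m.group(2))
        {
            case "MiB/s" -> value * 1024 * 1024;
            case "KiB/s" -> value * 1024;
            default      -> value;           // B/s
        };
    }
}
{code}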
[jira] [Updated] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-19583: Description: The inline docs say: {noformat} Setting this to 0 disables throttling. {noformat} However, on startup, we throw this error: {noformat} Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted {noformat} We should allow 0 without a unit attached for data, duration, and data spec config parameters, as 0 is always 0 no matter the unit. was: The inline docs say: {noformat} Setting this to 0 disables throttling. {noformat} However, on startup, we throw this error: {noformat} Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) Apr 23 23:12:01 cassandra0 cassandra[3424]: at org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted {noformat} We should allow 0 as per the inline doc. > enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, > DurationSpec, DataRateSpec) > --- > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 without a unit attached for data, duration, and data spec > config parameters, as 0 is always 0 no matter the unit. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-19583: - Workflow: Copy of Cassandra Default Workflow (was: Copy of Cassandra Bug Workflow) Issue Type: Improvement (was: Bug) > enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, > DurationSpec, DataRateSpec) > --- > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 as per the inline doc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842053#comment-17842053 ] Jon Haddad commented on CASSANDRA-19583: Sounds good, I've changed the title. > enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, > DurationSpec, DataRateSpec) > --- > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 as per the inline doc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19583) enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jon Haddad updated CASSANDRA-19583: --- Summary: enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec) (was: setting compaction throughput to 0 throws a startup error) > enable Make 0 work as 0+unit for all three config classes (DataStorageSpec, > DurationSpec, DataRateSpec) > --- > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 as per the inline doc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842048#comment-17842048 ] Ekaterina Dimitrova edited comment on CASSANDRA-19583 at 4/29/24 3:53 PM: -- {quote}0 can mean "unlimited", but "0MiB/s" indicates actually zero. {quote} This confused me. {quote}To be clear, I'm suggesting we make "0" work, without a unit. I'm not suggesting we change how 0MiB/s works. They can be equivalent. {quote} Thanks for clarifying. Then let's change this ticket to improvement and change its description to "enable 0 to work as 0+unit for all three config classes (DataStorageSpec, DurationSpec, DataRateSpec)" and not particularly for the compaction throughput config? was (Author: e.dimitrova): {quote}0 can mean "unlimited", but "0MiB/s" indicates actually zero. {quote} This confused me. {quote} To be clear, I'm suggesting we make "0" work, without a unit. I'm not suggesting we change how 0MiB/s works. They can be equivalent. {quote} Thanks for clarifying. Then let's change this ticket to improvement and change its description to "enable 0 to work as 0+unit for all three cofig classes (DataStorageSpec, DurationSpec, DataRateSpec)" and not particularly for the compaction throughput config? > setting compaction throughput to 0 throws a startup error > - > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 as per the inline doc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18987) Using counter column type in Accord transactions leads to Atomicity / Consistency violations
[ https://issues.apache.org/jira/browse/CASSANDRA-18987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-18987: Epic Link: CASSANDRA-17092 (was: CASSANDRA-17089) > Using counter column type in Accord transactions leads to Atomicity / > Consistency violations > > > Key: CASSANDRA-18987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18987 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Luis E Fernandez >Assignee: Caleb Rackliffe >Priority: Normal > Fix For: NA > > Attachments: ci_summary.html > > > *System configuration and information:* > Single node Cassandra with Accord transactions enabled running on docker > Built from commit: > [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b] > CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native > protocol v5] > > *Steps to reproduce in CQLSH:* > {code:java} > CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true;{code} > {code:java} > CREATE TABLE accord.accounts ( > partition text, > account_id int, > balance counter, > PRIMARY KEY (partition, account_id) > ) WITH CLUSTERING ORDER BY (account_id ASC); > {code} > {code:java} > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id = 0; > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id =1; > COMMIT TRANSACTION;{code} > bug happens after executing the following statement: > Based on [Cassandra > documentation|https://cassandra.apache.org/doc/trunk/cassandra/developing/cql/types.html#counters] > regarding the use of counters, I expect the following results: > Transaction A: subtract 10 from the balance of account 1 (total ending > balance of 90) and add 10 to the balance of account 0 (total ending balance > of 110) > {*}Bug A{*}: Neither account's balance is updated and the state of the rows > is left unchanged > {code:java} > /* Transaction A */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 0; > COMMIT TRANSACTION;{code} > Transaction B: subtract 10 from the balance of account 1 (total ending > balance of 90) and add 10 to the balance of a new account 2 (total ending > balance of 10) > {*}Bug B{*}: Only the new account 2 is created. The balance of account 1 is > left unchanged > {code:java} > /* Transaction B */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 2; > COMMIT TRANSACTION;{code} > Bug / Error: > == > The result of performing a table read after executing each buggy transaction > is: > {code:java} > /* Transaction / Bug A */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100{code} > {code:java} > /* Transaction / Bug B */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100 > default | 2 | 10 {code} > Note that performing the above statements without transaction blocks works as > expected. 
> {color:#172b4d}This was found while testing Accord transactions with > [~henrik.ingo] and team.{color} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19595) RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out on QUORUM read
[ https://issues.apache.org/jira/browse/CASSANDRA-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-19595: Fix Version/s: NA > RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out > on QUORUM read > --- > > Key: CASSANDRA-19595 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19595 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Caleb Rackliffe >Priority: Normal > Fix For: NA > > > This test doesn't seem to have any trouble passing locally in trunk, but on > cep-15-accord, it reliably times out. From a couple minutes of debugging in > {{ReadCallback#onResponse()}}, I think the proximate cause looks like a > {{QUORUM}} read w/ 3 nodes getting back 2 responses, but both are digest > responses. It seems like one should actually have data. In any case, this > manifests as a timeout, because even though we have the right number of > responses, we don't signal before the timeout. Also, this still fails even w/ > speculative retries disabled (i.e. set to NEVER) in the test table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842048#comment-17842048 ] Ekaterina Dimitrova commented on CASSANDRA-19583: - {quote}0 can mean "unlimited", but "0MiB/s" indicates actually zero. {quote} This confused me. {quote} To be clear, I'm suggesting we make "0" work, without a unit. I'm not suggesting we change how 0MiB/s works. They can be equivalent. {quote} Thanks for clarifying. Then let's change this ticket to improvement and change its description to "enable 0 to work as 0+unit for all three cofig classes (DataStorageSpec, DurationSpec, DataRateSpec)" and not particularly for the compaction throughput config? > setting compaction throughput to 0 throws a startup error > - > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 as per the inline doc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19595) RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out on QUORUM read
Caleb Rackliffe created CASSANDRA-19595: --- Summary: RepairDigestTrackingTest#testLocalDataAndRemoteRequestConcurrency timing out on QUORUM read Key: CASSANDRA-19595 URL: https://issues.apache.org/jira/browse/CASSANDRA-19595 Project: Cassandra Issue Type: Bug Components: Accord Reporter: Caleb Rackliffe This test doesn't seem to have any trouble passing locally in trunk, but on cep-15-accord, it reliably times out. From a couple minutes of debugging in {{ReadCallback#onResponse()}}, I think the proximate cause looks like a {{QUORUM}} read w/ 3 nodes getting back 2 responses, but both are digest responses. It seems like one should actually have data. In any case, this manifests as a timeout, because even though we have the right number of responses, we don't signal before the timeout. Also, this still fails even w/ speculative retries disabled (i.e. set to NEVER) in the test table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842046#comment-17842046 ] Jon Haddad commented on CASSANDRA-19583: To be clear, I'm suggesting we make "0" work, without a unit. I'm not suggesting we change how 0MiB/s works. They can be equivalent. Making it mandatory to supply a unit with 0 is a weird user experience. Zero is zero, there's no meaning to the label, it's superfluous. > setting compaction throughput to 0 throws a startup error > - > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 as per the inline doc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop
[ https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842045#comment-17842045 ] Bret McGuire commented on CASSANDRA-19579: -- ACKed [~brandon.williams] ; thanks for letting me know. The "Client/java-driver" component will usually do the trick but an explicit "cc" tag never hurts :) > threads lingering after driver shutdown: session close starts thread and > doesn't await its stop > --- > > Key: CASSANDRA-19579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19579 > Project: Cassandra > Issue Type: Bug > Components: Client/java-driver >Reporter: Thomas Klambauer >Priority: Normal > > We are checking remaining/lingering threads during shutdown. > we noticed some with naming pattern/thread factory: > ""globalEventExecutor-1-2" Id=146 TIMED_WAITING" > this one seems to be created during shutdown / session close and not > awaited/shut down: > {noformat} > addTask:156, GlobalEventExecutor (io.netty.util.concurrent) > execute0:225, GlobalEventExecutor (io.netty.util.concurrent) > execute:221, GlobalEventExecutor (io.netty.util.concurrent) > onClose:188, DefaultNettyOptions > (com.datastax.oss.driver.internal.core.context) > onChildrenClosed:589, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$close$9:552, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > run:-1, 860270832 > (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508) > tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > claim:568, CompletableFuture$UniCompletion (java.util.concurrent) > tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > :767, CompletableFuture$UniRun (java.util.concurrent) > uniRunStage:801, CompletableFuture (java.util.concurrent) > thenRunAsync:2136, CompletableFuture (java.util.concurrent) > thenRunAsync:143, CompletableFuture (java.util.concurrent) > whenAllDone:75, CompletableFutures > (com.datastax.oss.driver.internal.core.util.concurrent) > close:551, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > access$1000:300, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$closeAsync$1:272, DefaultSession > (com.datastax.oss.driver.internal.core.session) > runTask:98, PromiseTask (io.netty.util.concurrent) > run:106, PromiseTask (io.netty.util.concurrent) > runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent) > runTask:-1, AbstractEventExecutor (io.netty.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > submit:118, AbstractExecutorService (java.util.concurrent) > submit:118, AbstractEventExecutor (io.netty.util.concurrent) > on:57, RunOrSchedule 
(com.datastax.oss.driver.internal.core.util.concurrent) > closeSafely:286, DefaultSession > (com.datastax.oss.driver.internal.core.session) > closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session) > close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core) > -- custom shutdown code > run:829, Thread (java.lang) > {noformat} > the initial close here is called on > com.datastax.oss.driver.api.core.CqlSession. > netty framework suggests to call > io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity > during shutdown to await event thread stopping > (slightly related issue in netty: > [https://github.com/netty/netty/issues/2084] ) > suggestion to add maybe GlobalEventExecutor.INSTANCE.awaitInactivity with > some timeout during close around here: > [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199] > noting that this might slow down closing for up to 2 seconds if the netty > issue comment is correct. > this is on latest datastax java driver version: 4.17, -- This message was sent by Atlassian Jira (v8.20.10#820010) - To
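A small sketch of the suggested workaround at application shutdown, assuming the driver and Netty APIs named in the description; whether DefaultNettyOptions itself should make this call, and the up-to-2-second cost noted above, is the open question.
{code:java}
// Hedged sketch of application-side shutdown code. GlobalEventExecutor.awaitInactivity
// is an existing Netty API; the timeout value here is an assumption.
import java.util.concurrent.TimeUnit;
import com.datastax.oss.driver.api.core.CqlSession;
import io.netty.util.concurrent.GlobalEventExecutor;

public class DriverShutdownSketch
{
    static void shutdown(CqlSession session) throws InterruptedException
    {
        session.close();                                                   // triggers the onClose task seen in the trace
        GlobalEventExecutor.INSTANCE.awaitInactivity(2, TimeUnit.SECONDS); // wait for the globalEventExecutor thread to stop
    }
}
{code}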
[jira] [Updated] (CASSANDRA-18987) Using counter column type in Accord transactions leads to Atomicity / Consistency violations
[ https://issues.apache.org/jira/browse/CASSANDRA-18987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-18987: Fix Version/s: NA (was: 5.1) > Using counter column type in Accord transactions leads to Atomicity / > Consistency violations > > > Key: CASSANDRA-18987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18987 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Luis E Fernandez >Assignee: Caleb Rackliffe >Priority: Normal > Fix For: NA > > Attachments: ci_summary.html > > > *System configuration and information:* > Single node Cassandra with Accord transactions enabled running on docker > Built from commit: > [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b] > CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native > protocol v5] > > *Steps to reproduce in CQLSH:* > {code:java} > CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true;{code} > {code:java} > CREATE TABLE accord.accounts ( > partition text, > account_id int, > balance counter, > PRIMARY KEY (partition, account_id) > ) WITH CLUSTERING ORDER BY (account_id ASC); > {code} > {code:java} > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id = 0; > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id =1; > COMMIT TRANSACTION;{code} > bug happens after executing the following statement: > Based on [Cassandra > documentation|https://cassandra.apache.org/doc/trunk/cassandra/developing/cql/types.html#counters] > regarding the use of counters, I expect the following results: > Transaction A: subtract 10 from the balance of account 1 (total ending > balance of 90) and add 10 to the balance of account 0 (total ending balance > of 110) > {*}Bug A{*}: Neither account's balance is updated and the state of the rows > is left unchanged > {code:java} > /* Transaction A */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 0; > COMMIT TRANSACTION;{code} > Transaction B: subtract 10 from the balance of account 1 (total ending > balance of 90) and add 10 to the balance of a new account 2 (total ending > balance of 10) > {*}Bug B{*}: Only the new account 2 is created. The balance of account 1 is > left unchanged > {code:java} > /* Transaction B */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 2; > COMMIT TRANSACTION;{code} > Bug / Error: > == > The result of performing a table read after executing each buggy transaction > is: > {code:java} > /* Transaction / Bug A */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100{code} > {code:java} > /* Transaction / Bug B */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100 > default | 2 | 10 {code} > Note that performing the above statements without transaction blocks works as > expected. 
> {color:#172b4d}This was found while testing Accord transactions with > [~henrik.ingo] and team.{color} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error
[ https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842041#comment-17842041 ] Ekaterina Dimitrova commented on CASSANDRA-19583: - {quote} "0MiB/s" indicates actually zero. {quote} Indeed, this is something we saw with the old config pre-4.1 - there were cases where 0 did not mean 0, or we used negatives or -1 to special case things, which was very confusing IMHO. Post-4.1 we encourage people to use guardrails or null for any special case with the new config (as documented here - [https://cassandra.apache.org/doc/4.1/cassandra/configuration/configuration.html,] section {*}Notes for Cassandra Developers{*}). Unfortunately, we have to live with the realities of some old configurations so we do not change behavior/introduce regressions. Compaction throughput is one of them - 0MiB/s always meant unlimited; the unit was part of the config name, so we had to preserve the behavior. Also, it could be unclear to special case 0 while having 0 with unit means something else. Using null or guardrail sounds more deterministic. Last but not least, I think it is too late to change the meaning of 0MiB/s for compaction throughput as it was already released in 4.1 to mean "unlimited," and technically, it did not change anything from before, where the unit was just in the name of the parameter. We never had 0MiB/s meaning 0 for that property. > setting compaction throughput to 0 throws a startup error > - > > Key: CASSANDRA-19583 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19583 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Jon Haddad >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > > The inline docs say: > {noformat} > Setting this to 0 disables throttling. > {noformat} > However, on startup, we throw this error: > {noformat} > Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted > units: MiB/s, KiB/s, B/s where case matters and only non-negative values a> > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:52) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec.(DataRateSpec.java:61) > Apr 23 23:12:01 cassandra0 cassandra[3424]: at > org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.(DataRateSpec.java:232) > Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames > omitted > {noformat} > We should allow 0 as per the inline doc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19567) Minimize the heap consumption when registering metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842035#comment-17842035 ] Maxim Muzafarov commented on CASSANDRA-19567: - I've fixed all the comments and the failed tests. The changes are here: https://github.com/apache/cassandra/pull/3267 Additionally, I've prepared a branch that contains a new assertion on a metric remove operation to ensure the contract. As I previously mentioned, this assertion can be tricky as I assume some of the metrics with the same name can be removed in parallel, it shouldn't happen in a normal way, but it could because of the lack of tests and/or parallel removal the same metrics e.g. Memtables share the same instances of metrics. Anyway, the same as above changes, but with an assertion are here: https://github.com/Mmuzaf/cassandra/tree/cassandra-19567-assert I'll check that. > Minimize the heap consumption when registering metrics > -- > > Key: CASSANDRA-19567 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19567 > Project: Cassandra > Issue Type: Bug > Components: Observability/Metrics >Reporter: Maxim Muzafarov >Assignee: Maxim Muzafarov >Priority: Normal > Fix For: 5.x > > Attachments: summary.png > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The problem is only reproducible on the x86 machine, the problem is not > reproducible on the arm64. A quick analysis showed a lot of MetricName > objects stored in the heap, although the real cause could be related to > something else, the MetricName object requires extra attention. > To reproduce run the command run locally: > {code} > ant test-jvm-dtest-some > -Dtest.name=org.apache.cassandra.distributed.test.ReadRepairTest > {code} > The error: > {code:java} > [junit-timeout] Exception in thread "main" java.lang.OutOfMemoryError: Java > heap space > [junit-timeout] at > java.base/java.lang.StringLatin1.newString(StringLatin1.java:769) > [junit-timeout] at > java.base/java.lang.StringBuffer.toString(StringBuffer.java:716) > [junit-timeout] at > org.apache.cassandra.CassandraBriefJUnitResultFormatter.endTestSuite(CassandraBriefJUnitResultFormatter.java:191) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.fireEndTestSuite(JUnitTestRunner.java:854) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:578) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1197) > [junit-timeout] at > org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:1042) > [junit-timeout] Testsuite: > org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED > [junit-timeout] Testsuite: > org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec > [junit-timeout] > [junit-timeout] Testcase: > org.apache.cassandra.distributed.test.ReadRepairTest:readRepairRTRangeMovementTest-cassandra.testtag_IS_UNDEFINED: > Caused an ERROR > [junit-timeout] Forked Java VM exited abnormally. Please note the time in the > report does not reflect the time until the VM exit. > [junit-timeout] junit.framework.AssertionFailedError: Forked Java VM exited > abnormally. Please note the time in the report does not reflect the time > until the VM exit. 
> [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at java.base/java.util.Vector.forEach(Vector.java:1365) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at java.base/java.util.Vector.forEach(Vector.java:1365) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [junit-timeout] at > jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > [junit-timeout] at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.
[jira] [Commented] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop
[ https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841964#comment-17841964 ] Brandon Williams commented on CASSANDRA-19579: -- /cc [~absurdfarce] (sorry not sure how better to get these on the driver radar) > threads lingering after driver shutdown: session close starts thread and > doesn't await its stop > --- > > Key: CASSANDRA-19579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19579 > Project: Cassandra > Issue Type: Bug > Components: Client/java-driver >Reporter: Thomas Klambauer >Priority: Normal > > We are checking remaining/lingering threads during shutdown. > we noticed some with naming pattern/thread factory: > ""globalEventExecutor-1-2" Id=146 TIMED_WAITING" > this one seems to be created during shutdown / session close and not > awaited/shut down: > {noformat} > addTask:156, GlobalEventExecutor (io.netty.util.concurrent) > execute0:225, GlobalEventExecutor (io.netty.util.concurrent) > execute:221, GlobalEventExecutor (io.netty.util.concurrent) > onClose:188, DefaultNettyOptions > (com.datastax.oss.driver.internal.core.context) > onChildrenClosed:589, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$close$9:552, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > run:-1, 860270832 > (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508) > tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > claim:568, CompletableFuture$UniCompletion (java.util.concurrent) > tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > :767, CompletableFuture$UniRun (java.util.concurrent) > uniRunStage:801, CompletableFuture (java.util.concurrent) > thenRunAsync:2136, CompletableFuture (java.util.concurrent) > thenRunAsync:143, CompletableFuture (java.util.concurrent) > whenAllDone:75, CompletableFutures > (com.datastax.oss.driver.internal.core.util.concurrent) > close:551, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > access$1000:300, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$closeAsync$1:272, DefaultSession > (com.datastax.oss.driver.internal.core.session) > runTask:98, PromiseTask (io.netty.util.concurrent) > run:106, PromiseTask (io.netty.util.concurrent) > runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent) > runTask:-1, AbstractEventExecutor (io.netty.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > submit:118, AbstractExecutorService (java.util.concurrent) > submit:118, AbstractEventExecutor (io.netty.util.concurrent) > on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent) > closeSafely:286, DefaultSession > 
(com.datastax.oss.driver.internal.core.session) > closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session) > close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core) > -- custom shutdown code > run:829, Thread (java.lang) > {noformat} > the initial close here is called on > com.datastax.oss.driver.api.core.CqlSession. > netty framework suggests to call > io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity > during shutdown to await event thread stopping > (slightly related issue in netty: > [https://github.com/netty/netty/issues/2084] ) > suggestion to add maybe GlobalEventExecutor.INSTANCE.awaitInactivity with > some timeout during close around here: > [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199] > noting that this might slow down closing for up to 2 seconds if the netty > issue comment is correct. > this is on latest datastax java driver version: 4.17, -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additi
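For reference, a minimal sketch of the awaitInactivity suggestion above, shown at the application level rather than inside DefaultNettyOptions#onClose. The 2-second timeout is only the value implied by the linked netty issue comment, not a recommendation:
{code:java}
import java.util.concurrent.TimeUnit;

import com.datastax.oss.driver.api.core.CqlSession;
import io.netty.util.concurrent.GlobalEventExecutor;

public final class DriverShutdownSketch
{
    public static void main(String[] args) throws InterruptedException
    {
        try (CqlSession session = CqlSession.builder().build())
        {
            // application work ...
        }
        // Sketch of the suggestion: after the session is closed, wait for the
        // globalEventExecutor thread to quiesce so it does not linger past shutdown.
        // This may block for up to the given timeout.
        GlobalEventExecutor.INSTANCE.awaitInactivity(2, TimeUnit.SECONDS);
    }
}
{code}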
[jira] [Updated] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop
[ https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-19579: - Bug Category: Parent values: Degradation(12984) Complexity: Normal Discovered By: User Report Severity: Normal Status: Open (was: Triage Needed) > threads lingering after driver shutdown: session close starts thread and > doesn't await its stop > --- > > Key: CASSANDRA-19579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19579 > Project: Cassandra > Issue Type: Bug > Components: Client/java-driver >Reporter: Thomas Klambauer >Priority: Normal > > We are checking remaining/lingering threads during shutdown. > we noticed some with naming pattern/thread factory: > ""globalEventExecutor-1-2" Id=146 TIMED_WAITING" > this one seems to be created during shutdown / session close and not > awaited/shut down: > {noformat} > addTask:156, GlobalEventExecutor (io.netty.util.concurrent) > execute0:225, GlobalEventExecutor (io.netty.util.concurrent) > execute:221, GlobalEventExecutor (io.netty.util.concurrent) > onClose:188, DefaultNettyOptions > (com.datastax.oss.driver.internal.core.context) > onChildrenClosed:589, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$close$9:552, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > run:-1, 860270832 > (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508) > tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > claim:568, CompletableFuture$UniCompletion (java.util.concurrent) > tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > :767, CompletableFuture$UniRun (java.util.concurrent) > uniRunStage:801, CompletableFuture (java.util.concurrent) > thenRunAsync:2136, CompletableFuture (java.util.concurrent) > thenRunAsync:143, CompletableFuture (java.util.concurrent) > whenAllDone:75, CompletableFutures > (com.datastax.oss.driver.internal.core.util.concurrent) > close:551, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > access$1000:300, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$closeAsync$1:272, DefaultSession > (com.datastax.oss.driver.internal.core.session) > runTask:98, PromiseTask (io.netty.util.concurrent) > run:106, PromiseTask (io.netty.util.concurrent) > runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent) > runTask:-1, AbstractEventExecutor (io.netty.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > submit:118, AbstractExecutorService (java.util.concurrent) > submit:118, AbstractEventExecutor (io.netty.util.concurrent) > on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent) > closeSafely:286, DefaultSession > 
(com.datastax.oss.driver.internal.core.session) > closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session) > close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core) > -- custom shutdown code > run:829, Thread (java.lang) > {noformat} > the initial close here is called on > com.datastax.oss.driver.api.core.CqlSession. > netty framework suggests to call > io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity > during shutdown to await event thread stopping > (slightly related issue in netty: > [https://github.com/netty/netty/issues/2084] ) > suggestion to add maybe GlobalEventExecutor.INSTANCE.awaitInactivity with > some timeout during close around here: > [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199] > noting that this might slow down closing for up to 2 seconds if the netty > issue comment is correct. > this is on latest datastax java driver version: 4.17, -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits
[jira] [Assigned] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop
[ https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams reassigned CASSANDRA-19579: Assignee: (was: Henry Hughes) > threads lingering after driver shutdown: session close starts thread and > doesn't await its stop > --- > > Key: CASSANDRA-19579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19579 > Project: Cassandra > Issue Type: Bug > Components: Client/java-driver >Reporter: Thomas Klambauer >Priority: Normal > > We are checking remaining/lingering threads during shutdown. > we noticed some with naming pattern/thread factory: > ""globalEventExecutor-1-2" Id=146 TIMED_WAITING" > this one seems to be created during shutdown / session close and not > awaited/shut down: > {noformat} > addTask:156, GlobalEventExecutor (io.netty.util.concurrent) > execute0:225, GlobalEventExecutor (io.netty.util.concurrent) > execute:221, GlobalEventExecutor (io.netty.util.concurrent) > onClose:188, DefaultNettyOptions > (com.datastax.oss.driver.internal.core.context) > onChildrenClosed:589, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$close$9:552, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > run:-1, 860270832 > (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508) > tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > claim:568, CompletableFuture$UniCompletion (java.util.concurrent) > tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent) > tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) > - Async stack trace > :767, CompletableFuture$UniRun (java.util.concurrent) > uniRunStage:801, CompletableFuture (java.util.concurrent) > thenRunAsync:2136, CompletableFuture (java.util.concurrent) > thenRunAsync:143, CompletableFuture (java.util.concurrent) > whenAllDone:75, CompletableFutures > (com.datastax.oss.driver.internal.core.util.concurrent) > close:551, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > access$1000:300, DefaultSession$SingleThreaded > (com.datastax.oss.driver.internal.core.session) > lambda$closeAsync$1:272, DefaultSession > (com.datastax.oss.driver.internal.core.session) > runTask:98, PromiseTask (io.netty.util.concurrent) > run:106, PromiseTask (io.netty.util.concurrent) > runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent) > runTask:-1, AbstractEventExecutor (io.netty.util.concurrent) > - Async stack trace > addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) > execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) > execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) > submit:118, AbstractExecutorService (java.util.concurrent) > submit:118, AbstractEventExecutor (io.netty.util.concurrent) > on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent) > closeSafely:286, DefaultSession > (com.datastax.oss.driver.internal.core.session) > closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session) > 
close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core) > -- custom shutdown code > run:829, Thread (java.lang) > {noformat} > the initial close here is called on > com.datastax.oss.driver.api.core.CqlSession. > netty framework suggests to call > io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity > during shutdown to await event thread stopping > (slightly related issue in netty: > [https://github.com/netty/netty/issues/2084] ) > suggestion to add maybe GlobalEventExecutor.INSTANCE.awaitInactivity with > some timeout during close around here: > [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199] > noting that this might slow down closing for up to 2 seconds if the netty > issue comment is correct. > this is on latest datastax java driver version: 4.17, -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19158: Attachment: ci_summary.html > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary.html > > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from peer, and then from CMS. > First of all, we should only ever use the native transport timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations such as trying to catch up from CMS after an unsuccessful attempt to > catch up from peer. > This should significantly simplify the code and reduce the number of blocked/waiting > threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
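To illustrate the fallback-chaining idea in the ticket, here is a sketch using plain CompletableFuture rather than the native transport-driven future types the ticket actually refers to; catchUpFromPeer and catchUpFromCMS are hypothetical placeholders, not real Cassandra methods:
{code:java}
import java.util.concurrent.CompletableFuture;

public final class DebounceFallbackSketch
{
    // Placeholder for the request that may fail, e.g. catching up from a peer.
    static CompletableFuture<String> catchUpFromPeer()
    {
        return CompletableFuture.failedFuture(new RuntimeException("peer unavailable"));
    }

    // Placeholder for the fallback request, e.g. catching up from the CMS.
    static CompletableFuture<String> catchUpFromCMS()
    {
        return CompletableFuture.completedFuture("state from CMS");
    }

    public static void main(String[] args)
    {
        // Chain the two attempts instead of blocking a thread on each future:
        // if the peer catch-up fails, fall back to the CMS catch-up.
        CompletableFuture<String> result = catchUpFromPeer()
                .thenApply(CompletableFuture::completedFuture)
                .exceptionally(t -> catchUpFromCMS())
                .thenCompose(f -> f);

        System.out.println(result.join());
    }
}
{code}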
[jira] [Comment Edited] (CASSANDRA-19450) Hygiene updates for warnings and pytests
[ https://issues.apache.org/jira/browse/CASSANDRA-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841901#comment-17841901 ] Stefan Miklosovic edited comment on CASSANDRA-19450 at 4/29/24 11:07 AM: - I don't understand your last sentence. I am running the j17 pre-commit in CircleCI as I write this though. edit: aha, you mean running one commit before to see if it fails there or not? I don't think it does; we would have detected that already. I think this is just flaky, but let's see was (Author: smiklosovic): I don't understand your last sentence. I am running the j17 pre-commit in CircleCI as I write this though. > Hygiene updates for warnings and pytests > > > Key: CASSANDRA-19450 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19450 > Project: Cassandra > Issue Type: Improvement > Components: CQL/Interpreter >Reporter: Brad Schoening >Assignee: Brad Schoening >Priority: Low > Fix For: 5.x > > > > * Update 'Warning' message to write to stderr > * -Replace TimeoutError Exception with builtin (since Python 3.3)- > * -Remove re.pattern_type (removed since Python 3.7)- > * Fix mutable arg [] in read_until() > * Remove redirect of stderr to stdout in pytest fixture with tty=false; > Deprecation warnings can otherwise fail unit tests when stdout & stderr > output is combined. > * Fix several pycodestyle issues -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19546) Add to_human_size and to_human_duration function
[ https://issues.apache.org/jira/browse/CASSANDRA-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841942#comment-17841942 ] Stefan Miklosovic commented on CASSANDRA-19546: --- I added both to_human_size and to_human_duration (1), (2). I try my luck with asking for a reviewer. It is also tested / documented etc. (1) https://github.com/apache/cassandra/pull/3239/files (2) https://github.com/apache/cassandra/blob/f35ed228145fae3edb4325d29464f0d950d13511/doc/modules/cassandra/pages/developing/cql/functions.adoc#human-helper-functions > Add to_human_size and to_human_duration function > > > Key: CASSANDRA-19546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19546 > Project: Cassandra > Issue Type: New Feature > Components: Legacy/CQL >Reporter: Stefan Miklosovic >Assignee: Stefan Miklosovic >Priority: Normal > Fix For: 5.x > > Time Spent: 10m > Remaining Estimate: 0h > > There are cases (e.g in our system_views tables but might be applicable for > user tables as well) when a column is of a type which represents number of > bytes. However, it is quite hard to parse a value for a human to have some > estimation what that value is. > I propose this: > {code:java} > cqlsh> select * from myks.mytb ; > id | col1 | col2 | col3 | col4 > +--+--+--+-- > 1 | 100 | 200 | 300 | 32432423 > (1 rows) > cqlsh> select to_human_size(col4) from myks.mytb where id = 1; > system.to_human_size(col4) > -- > 30.93 MiB > (1 rows) > cqlsh> select to_human_size(col4,0) from myks.mytb where id = 1; > system.to_human_size(col4, 0) > - > 31 MiB > (1 rows) > cqlsh> select to_human_size(col4,1) from myks.mytb where id = 1; > system.to_human_size(col4, 1) > - > 30.9 MiB > (1 rows) > {code} > The second argument is optional and represents the number of decimal places > (at most) to use. Without the second argument, it will default to > FileUtils.df which is "#.##" format. > {code} > cqlsh> DESCRIBE myks.mytb ; > CREATE TABLE myks.mytb ( > id int PRIMARY KEY, > col1 int, > col2 smallint, > col3 bigint, > col4 varint, > ) > {code} > I also propose that this to_human_size function (name of it might be indeed > discussed and it is just a suggestion) should be only applicable for int, > smallint, bigint and varint types. I am not sure how to apply this to e.g. > "float" or similar. As I mentioned, it is meant to convert just number of > bytes, which is just some number, to a string representation of that and I do > not think that applying that function to anything else but these types makes > sense. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
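As a rough, hypothetical sketch of the size-formatting behaviour described in the ticket (binary units with "#.##"-style rounding); the real implementation lives in the linked pull request (1) and reuses Cassandra's FileUtils formatting:
{code:java}
import java.text.DecimalFormat;

public final class HumanSizeSketch
{
    private static final String[] UNITS = { "B", "KiB", "MiB", "GiB", "TiB" };

    // Formats a byte count into a human-readable string, dividing by 1024 per unit step.
    static String toHumanSize(long bytes, DecimalFormat format)
    {
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1)
        {
            value /= 1024;
            unit++;
        }
        return format.format(value) + " " + UNITS[unit];
    }

    public static void main(String[] args)
    {
        // 32432423 bytes -> "30.93 MiB", matching the cqlsh output shown in the ticket.
        System.out.println(toHumanSize(32_432_423L, new DecimalFormat("#.##")));
    }
}
{code}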
[jira] [Updated] (CASSANDRA-19546) Add to_human_size and to_human_duration function
[ https://issues.apache.org/jira/browse/CASSANDRA-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Miklosovic updated CASSANDRA-19546: -- Status: Needs Committer (was: Patch Available) > Add to_human_size and to_human_duration function > > > Key: CASSANDRA-19546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19546 > Project: Cassandra > Issue Type: New Feature > Components: Legacy/CQL >Reporter: Stefan Miklosovic >Assignee: Stefan Miklosovic >Priority: Normal > Fix For: 5.x > > Time Spent: 10m > Remaining Estimate: 0h > > There are cases (e.g in our system_views tables but might be applicable for > user tables as well) when a column is of a type which represents number of > bytes. However, it is quite hard to parse a value for a human to have some > estimation what that value is. > I propose this: > {code:java} > cqlsh> select * from myks.mytb ; > id | col1 | col2 | col3 | col4 > +--+--+--+-- > 1 | 100 | 200 | 300 | 32432423 > (1 rows) > cqlsh> select to_human_size(col4) from myks.mytb where id = 1; > system.to_human_size(col4) > -- > 30.93 MiB > (1 rows) > cqlsh> select to_human_size(col4,0) from myks.mytb where id = 1; > system.to_human_size(col4, 0) > - > 31 MiB > (1 rows) > cqlsh> select to_human_size(col4,1) from myks.mytb where id = 1; > system.to_human_size(col4, 1) > - > 30.9 MiB > (1 rows) > {code} > The second argument is optional and represents the number of decimal places > (at most) to use. Without the second argument, it will default to > FileUtils.df which is "#.##" format. > {code} > cqlsh> DESCRIBE myks.mytb ; > CREATE TABLE myks.mytb ( > id int PRIMARY KEY, > col1 int, > col2 smallint, > col3 bigint, > col4 varint, > ) > {code} > I also propose that this to_human_size function (name of it might be indeed > discussed and it is just a suggestion) should be only applicable for int, > smallint, bigint and varint types. I am not sure how to apply this to e.g. > "float" or similar. As I mentioned, it is meant to convert just number of > bytes, which is just some number, to a string representation of that and I do > not think that applying that function to anything else but these types makes > sense. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19546) Add to_human_size and to_human_duration function
[ https://issues.apache.org/jira/browse/CASSANDRA-19546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Miklosovic updated CASSANDRA-19546: -- Test and Documentation Plan: ci Status: Patch Available (was: In Progress) > Add to_human_size and to_human_duration function > > > Key: CASSANDRA-19546 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19546 > Project: Cassandra > Issue Type: New Feature > Components: Legacy/CQL >Reporter: Stefan Miklosovic >Assignee: Stefan Miklosovic >Priority: Normal > Fix For: 5.x > > Time Spent: 10m > Remaining Estimate: 0h > > There are cases (e.g in our system_views tables but might be applicable for > user tables as well) when a column is of a type which represents number of > bytes. However, it is quite hard to parse a value for a human to have some > estimation what that value is. > I propose this: > {code:java} > cqlsh> select * from myks.mytb ; > id | col1 | col2 | col3 | col4 > +--+--+--+-- > 1 | 100 | 200 | 300 | 32432423 > (1 rows) > cqlsh> select to_human_size(col4) from myks.mytb where id = 1; > system.to_human_size(col4) > -- > 30.93 MiB > (1 rows) > cqlsh> select to_human_size(col4,0) from myks.mytb where id = 1; > system.to_human_size(col4, 0) > - > 31 MiB > (1 rows) > cqlsh> select to_human_size(col4,1) from myks.mytb where id = 1; > system.to_human_size(col4, 1) > - > 30.9 MiB > (1 rows) > {code} > The second argument is optional and represents the number of decimal places > (at most) to use. Without the second argument, it will default to > FileUtils.df which is "#.##" format. > {code} > cqlsh> DESCRIBE myks.mytb ; > CREATE TABLE myks.mytb ( > id int PRIMARY KEY, > col1 int, > col2 smallint, > col3 bigint, > col4 varint, > ) > {code} > I also propose that this to_human_size function (name of it might be indeed > discussed and it is just a suggestion) should be only applicable for int, > smallint, bigint and varint types. I am not sure how to apply this to e.g. > "float" or similar. As I mentioned, it is meant to convert just number of > bytes, which is just some number, to a string representation of that and I do > not think that applying that function to anything else but these types makes > sense. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19579) threads lingering after driver shutdown: session close starts thread and doesn't await its stop
[ https://issues.apache.org/jira/browse/CASSANDRA-19579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Klambauer updated CASSANDRA-19579: - Description: We are checking remaining/lingering threads during shutdown. we noticed some with naming pattern/thread factory: ""globalEventExecutor-1-2" Id=146 TIMED_WAITING" this one seems to be created during shutdown / session close and not awaited/shut down: {noformat} addTask:156, GlobalEventExecutor (io.netty.util.concurrent) execute0:225, GlobalEventExecutor (io.netty.util.concurrent) execute:221, GlobalEventExecutor (io.netty.util.concurrent) onClose:188, DefaultNettyOptions (com.datastax.oss.driver.internal.core.context) onChildrenClosed:589, DefaultSession$SingleThreaded (com.datastax.oss.driver.internal.core.session) lambda$close$9:552, DefaultSession$SingleThreaded (com.datastax.oss.driver.internal.core.session) run:-1, 860270832 (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508) tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent) tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) - Async stack trace addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) claim:568, CompletableFuture$UniCompletion (java.util.concurrent) tryFire$$$capture:780, CompletableFuture$UniRun (java.util.concurrent) tryFire:-1, CompletableFuture$UniRun (java.util.concurrent) - Async stack trace :767, CompletableFuture$UniRun (java.util.concurrent) uniRunStage:801, CompletableFuture (java.util.concurrent) thenRunAsync:2136, CompletableFuture (java.util.concurrent) thenRunAsync:143, CompletableFuture (java.util.concurrent) whenAllDone:75, CompletableFutures (com.datastax.oss.driver.internal.core.util.concurrent) close:551, DefaultSession$SingleThreaded (com.datastax.oss.driver.internal.core.session) access$1000:300, DefaultSession$SingleThreaded (com.datastax.oss.driver.internal.core.session) lambda$closeAsync$1:272, DefaultSession (com.datastax.oss.driver.internal.core.session) runTask:98, PromiseTask (io.netty.util.concurrent) run:106, PromiseTask (io.netty.util.concurrent) runTask$$$capture:174, AbstractEventExecutor (io.netty.util.concurrent) runTask:-1, AbstractEventExecutor (io.netty.util.concurrent) - Async stack trace addTask:-1, SingleThreadEventExecutor (io.netty.util.concurrent) execute:836, SingleThreadEventExecutor (io.netty.util.concurrent) execute0:827, SingleThreadEventExecutor (io.netty.util.concurrent) execute:817, SingleThreadEventExecutor (io.netty.util.concurrent) submit:118, AbstractExecutorService (java.util.concurrent) submit:118, AbstractEventExecutor (io.netty.util.concurrent) on:57, RunOrSchedule (com.datastax.oss.driver.internal.core.util.concurrent) closeSafely:286, DefaultSession (com.datastax.oss.driver.internal.core.session) closeAsync:272, DefaultSession (com.datastax.oss.driver.internal.core.session) close:76, AsyncAutoCloseable (com.datastax.oss.driver.api.core) -- custom shutdown code run:829, Thread (java.lang) {noformat} the initial close here is called on com.datastax.oss.driver.api.core.CqlSession. 
netty framework suggests to call io.netty.util.concurrent.GlobalEventExecutor#awaitInactivity during shutdown to await event thread stopping (slightly related issue in netty: [https://github.com/netty/netty/issues/2084] ) suggestion to add maybe GlobalEventExecutor.INSTANCE.awaitInactivity with some timeout during close around here: [https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/internal/core/context/DefaultNettyOptions.java#L199] noting that this might slow down closing for up to 2 seconds if the netty issue comment is correct. this is on latest datastax java driver version: 4.17, was: We are checking remaining/lingering threads during shutdown. we noticed some with naming pattern/thread factory: ""globalEventExecutor-1-2" Id=146 TIMED_WAITING" this one seems to be created during shutdown / session close and not awaited/shut down: {noformat} addTask:156, GlobalEventExecutor (io.netty.util.concurrent) execute0:225, GlobalEventExecutor (io.netty.util.concurrent) execute:221, GlobalEventExecutor (io.netty.util.concurrent) onClose:188, DefaultNettyOptions (com.datastax.oss.driver.internal.core.context) onChildrenClosed:589, DefaultSession$SingleThreaded (com.datastax.oss.driver.internal.core.session) lambda$close$9:552, DefaultSession$SingleThreaded (com.datastax.oss.driver.internal.core.session) run:-1, 860270832 (com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded$$Lambda$9508) tryFire$$$capture:783, CompletableFuture$UniRun (java.util.concurrent) tryFire:-1, CompletableFuture$UniRun (java.util
[jira] [Comment Edited] (CASSANDRA-19590) Unexpected error deserializing mutation when upgrade from 2.2.19 to 3.0.30/3.11.17
[ https://issues.apache.org/jira/browse/CASSANDRA-19590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841916#comment-17841916 ] Stefan Miklosovic edited comment on CASSANDRA-19590 at 4/29/24 9:19 AM: what about 2.2.19 -> 3.0.0? what is the schema for ks.tb? anything special there? was (Author: smiklosovic): what about 2.2.19 -> 3.0.0? > Unexpected error deserializing mutation when upgrade from 2.2.19 to > 3.0.30/3.11.17 > -- > > Key: CASSANDRA-19590 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19590 > Project: Cassandra > Issue Type: Bug >Reporter: Klay >Priority: Normal > Attachments: data.tar.gz, system.log > > > I am trying to upgrade from 2.2.19 to 3.0.30/3.11.17. I encountered the > following exception during the upgrade process and the 3.0.30/3.11.17 node > cannot start up. > {code:java} > ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting > due to error while processing commit log during initialization. > org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: > Unexpected error deserializing mutation; saved to > /tmp/mutation8318204837345269856dat. This may be caused by replaying a > mutation against a table with the same name but incompatible schema. > Exception follows: java.lang.AssertionError > at > org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132) > at > org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791) > {code} > h1. Reproduce1 (Flush before upgrade) > Upgrade fails when replaying the commit log. > This can be reproduced deterministically by > 1. Start up cassandra-2.2.19, singe node is enough (Using default > configuration) > 2. Execute the following commands in cqlsh > {code:java} > CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', > 'replication_factor' : 1 }; > CREATE TABLE ks.tb (c1 INT, c0 INT, PRIMARY KEY (c1)); > INSERT INTO ks.tb (c1, c0) VALUES (0, 0); > ALTER TABLE ks.tb DROP c0 ; > ALTER TABLE ks.tb ADD c0 set ; > {code} > 3. Stop the old version. > {code:java} > bin/nodetool -h :::127.0.0.1 flush > bin/nodetool -h :::127.0.0.1 stopdaemon{code} > 4. Copy the data and start up the new version node (3.0.30 or 3.11.17) > Upgrade crashes with the following error > {code:java} > ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting > due to error while processing commit log during initialization. > org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: > Unexpected error deserializing mutation; saved to > /tmp/mutation8318204837345269856dat. This may be caused by replaying a > mutation against a table with the same name but incompatible schema. 
> Exception follows: java.lang.AssertionError > at > org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132) > at > org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791){code} > I have attached the system.log when starting up the 3.11.17 node. > I also
[jira] [Commented] (CASSANDRA-19590) Unexpected error deserializing mutation when upgrade from 2.2.19 to 3.0.30/3.11.17
[ https://issues.apache.org/jira/browse/CASSANDRA-19590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841916#comment-17841916 ] Stefan Miklosovic commented on CASSANDRA-19590: --- what about 2.2.11 -> 3.0.0? > Unexpected error deserializing mutation when upgrade from 2.2.19 to > 3.0.30/3.11.17 > -- > > Key: CASSANDRA-19590 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19590 > Project: Cassandra > Issue Type: Bug >Reporter: Klay >Priority: Normal > Attachments: data.tar.gz, system.log > > > I am trying to upgrade from 2.2.19 to 3.0.30/3.11.17. I encountered the > following exception during the upgrade process and the 3.0.30/3.11.17 node > cannot start up. > {code:java} > ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting > due to error while processing commit log during initialization. > org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: > Unexpected error deserializing mutation; saved to > /tmp/mutation8318204837345269856dat. This may be caused by replaying a > mutation against a table with the same name but incompatible schema. > Exception follows: java.lang.AssertionError > at > org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132) > at > org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791) > {code} > h1. Reproduce1 (Flush before upgrade) > Upgrade fails when replaying the commit log. > This can be reproduced deterministically by > 1. Start up cassandra-2.2.19, singe node is enough (Using default > configuration) > 2. Execute the following commands in cqlsh > {code:java} > CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', > 'replication_factor' : 1 }; > CREATE TABLE ks.tb (c1 INT, c0 INT, PRIMARY KEY (c1)); > INSERT INTO ks.tb (c1, c0) VALUES (0, 0); > ALTER TABLE ks.tb DROP c0 ; > ALTER TABLE ks.tb ADD c0 set ; > {code} > 3. Stop the old version. > {code:java} > bin/nodetool -h :::127.0.0.1 flush > bin/nodetool -h :::127.0.0.1 stopdaemon{code} > 4. Copy the data and start up the new version node (3.0.30 or 3.11.17) > Upgrade crashes with the following error > {code:java} > ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting > due to error while processing commit log during initialization. > org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: > Unexpected error deserializing mutation; saved to > /tmp/mutation8318204837345269856dat. This may be caused by replaying a > mutation against a table with the same name but incompatible schema. 
> Exception follows: java.lang.AssertionError > at > org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132) > at > org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791){code} > I have attached the system.log when starting up the 3.11.17 node. > I also attached the data folder generated from the 2.2.19, start up 3.0.30 or > 3.11.17 with this data folder can directly expose the error. > h1. Reproduce2 (Drain be
[jira] [Comment Edited] (CASSANDRA-19590) Unexpected error deserializing mutation when upgrade from 2.2.19 to 3.0.30/3.11.17
[ https://issues.apache.org/jira/browse/CASSANDRA-19590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841916#comment-17841916 ] Stefan Miklosovic edited comment on CASSANDRA-19590 at 4/29/24 9:18 AM: what about 2.2.19 -> 3.0.0? was (Author: smiklosovic): what about 2.2.11 -> 3.0.0? > Unexpected error deserializing mutation when upgrade from 2.2.19 to > 3.0.30/3.11.17 > -- > > Key: CASSANDRA-19590 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19590 > Project: Cassandra > Issue Type: Bug >Reporter: Klay >Priority: Normal > Attachments: data.tar.gz, system.log > > > I am trying to upgrade from 2.2.19 to 3.0.30/3.11.17. I encountered the > following exception during the upgrade process and the 3.0.30/3.11.17 node > cannot start up. > {code:java} > ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting > due to error while processing commit log during initialization. > org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: > Unexpected error deserializing mutation; saved to > /tmp/mutation8318204837345269856dat. This may be caused by replaying a > mutation against a table with the same name but incompatible schema. > Exception follows: java.lang.AssertionError > at > org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132) > at > org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791) > {code} > h1. Reproduce1 (Flush before upgrade) > Upgrade fails when replaying the commit log. > This can be reproduced deterministically by > 1. Start up cassandra-2.2.19, singe node is enough (Using default > configuration) > 2. Execute the following commands in cqlsh > {code:java} > CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', > 'replication_factor' : 1 }; > CREATE TABLE ks.tb (c1 INT, c0 INT, PRIMARY KEY (c1)); > INSERT INTO ks.tb (c1, c0) VALUES (0, 0); > ALTER TABLE ks.tb DROP c0 ; > ALTER TABLE ks.tb ADD c0 set ; > {code} > 3. Stop the old version. > {code:java} > bin/nodetool -h :::127.0.0.1 flush > bin/nodetool -h :::127.0.0.1 stopdaemon{code} > 4. Copy the data and start up the new version node (3.0.30 or 3.11.17) > Upgrade crashes with the following error > {code:java} > ERROR [main] 2024-04-25 18:46:10,496 JVMStabilityInspector.java:124 - Exiting > due to error while processing commit log during initialization. > org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: > Unexpected error deserializing mutation; saved to > /tmp/mutation8318204837345269856dat. This may be caused by replaying a > mutation against a table with the same name but incompatible schema. 
> Exception follows: java.lang.AssertionError > at > org.apache.cassandra.db.commitlog.CommitLogReader.readMutation(CommitLogReader.java:471) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:404) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:251) > at > org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:132) > at > org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:137) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:189) > at > org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:170) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:331) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:791){code} > I have attached the system.log when starting up the 3.11.17 node. > I also attached the data folder generated from the 2.2.19, sta
[jira] [Commented] (CASSANDRA-19565) SIGSEGV on Cassandra v4.1.4
[ https://issues.apache.org/jira/browse/CASSANDRA-19565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841872#comment-17841872 ] Berenguer Blasi commented on CASSANDRA-19565: - Good that dev-branch is back online. So IIUC packaging is all good. CI failures are known offenders. We're only missing CI and a branch for trunk, but it should be the same as 5.0, so +1 LGTM > SIGSEGV on Cassandra v4.1.4 > --- > > Key: CASSANDRA-19565 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19565 > Project: Cassandra > Issue Type: Bug > Components: Packaging >Reporter: Thomas De Keulenaer >Assignee: Brandon Williams >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > Attachments: cassandra_57_debian_jdk11_amd64_attempt1.log.xz, > cassandra_57_redhat_jdk11_amd64_attempt1.log.xz, hs_err_pid1116450.log > > > Hello, > Since upgrading to v4.1 we cannot run Cassandra any more. Each start > immediately crashes: > {{Apr 17 08:58:34 SVALD108 cassandra[1116450]: # A fatal error has been > detected by the Java Runtime Environment: > Apr 17 08:58:34 SVALD108 cassandra[1116450]: # SIGSEGV (0xb) at > pc=0x7fccaab4d152, pid=1116450, tid=1116451}} > I have added the log from the core dump. > This issue is perhaps related to > https://davecturner.github.io/2021/08/30/seven-year-old-segfault.html ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org