[ https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809755#comment-17809755 ]
Long Pan commented on CASSANDRA-17401:
--------------------------------------

[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Reproducing it requires a high query rate and a large number of client connections. Here are the steps to reproduce:

*Server Setup:*
A 3-node Cassandra cluster. Each node has 64GB of memory (16GB heap) and 7 CPU cores ({{native_transport_max_threads = 1024}}).

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text);

*Client Setup:*
30 hosts. Each host runs the following pseudo-code, using the *GoCql* client:

cluster.CQLVersion = "3.4.0"
cluster.ProtoVersion = 4
cluster.Timeout = 5s
cluster.ConnectTimeout = 10s
cluster.NumConns = 3
cluster.Consistency = LocalQuorum
cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
cluster.SocketKeepalive = 20s
cluster.HostSelectionPolicy = RoundRobinHostPolicy

sessionCount = 30
qpsPerSession = 30
cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"

for (i = 0; i < sessionCount; i++) {
    session = cluster.createSession
    rateLimiter = NewRateLimiter(qpsPerSession)
    newGoRoutine.run( sendReads(session, rateLimiter) )
}

/*
sendReads(session, rateLimiter) {
    for {
        newGoRoutine.run (
            if (rateLimiter.allow) {
                session.execute(cqlQuery, randomString, randomString)
            }
        )
    }
}
*/

Traffic generated this way results in ~10K coordinator QPS and ~3k client connections per Cassandra node.

*Trigger Point:*
Manually issue a CQL query to add a column to the table:
ALTER TABLE test_ks.table1 ADD new_col text;

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes will show the number of native_transport threads reaching {{native_transport_max_threads}}, and the number of pending native transport tasks keeps growing without bound.
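As a rough sanity check of the traffic figures quoted in the comment above, the per-node load can be estimated from the client settings. This is only a back-of-the-envelope sketch: the {{Load}} type and its methods are hypothetical names introduced here, requests are assumed to spread evenly across the three nodes under round-robin, and retries and per-session control connections are ignored, which may account for part of the gap to the reported ~10K QPS and ~3k connections.

```go
package main

import "fmt"

// Load is a hypothetical summary of the client fleet from the reproduction steps.
type Load struct {
	Hosts           int // client hosts
	SessionsPerHost int // sessionCount in the pseudo-code
	QPSPerSession   int // qpsPerSession in the pseudo-code
	ConnsPerNode    int // cluster.NumConns: connections per Cassandra host, per session
	Nodes           int // Cassandra nodes in the cluster
}

// PerNodeQPS assumes round-robin host selection spreads requests evenly
// over all nodes and that no request is retried.
func (l Load) PerNodeQPS() int {
	return l.Hosts * l.SessionsPerHost * l.QPSPerSession / l.Nodes
}

// PerNodeConns counts data connections only: every session opens
// ConnsPerNode connections to each node. Control connections are ignored.
func (l Load) PerNodeConns() int {
	return l.Hosts * l.SessionsPerHost * l.ConnsPerNode
}

func main() {
	l := Load{Hosts: 30, SessionsPerHost: 30, QPSPerSession: 30, ConnsPerNode: 3, Nodes: 3}
	fmt.Println("coordinator QPS per node:", l.PerNodeQPS())      // 9000
	fmt.Println("client connections per node:", l.PerNodeConns()) // 2700
}
```

Both estimates land within roughly 10% of the observed numbers, so the reported load is consistent with the stated client configuration.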
> Race condition in QueryProcessor causes just prepared statement not to be in
> the prepared statements cache
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17401
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ivan Senic
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
> method that were introduced in versions *4.0.2* and *3.11.12* can cause a
> race condition between two threads trying to concurrently prepare the same
> statement. This race condition can cause a prepared statement to be removed
> from the cache after one of the threads has received the result of the
> prepare and later uses the MD5Digest to call
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
> * Thread1 enters the _prepare_ method and resolves _safeToReturnCached_ as false
> * Thread1 executes eviction of hashes
> * Thread2 enters the _prepare_ method and resolves _safeToReturnCached_ as false
> * Thread1 prepares the statement and caches it
> * Thread1 returns the result of the prepare
> * Thread2 executes eviction of hashes
> * Thread1 tries to execute the prepared statement with the received
> MD5Digest, but the statement is not in the cache as it was evicted by Thread2
>
> I tried to reproduce this by using a Java driver, but hitting this case from
> the client side is highly unlikely and I cannot simulate the needed race
> condition. However, we can easily reproduce this in Stargate (details
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this
> if needed.
> Note that the issue can occur only when safeToReturnCached is resolved as
> false.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
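The Thread1/Thread2 interleaving listed in the issue description can be replayed deterministically on a single goroutine to show why the final lookup misses. The sketch below is a toy model, not Cassandra's code: the real logic lives in the Java {{QueryProcessor}}, and the {{cache}} type, its methods, and {{replay}} are hypothetical names introduced here purely for illustration. It assumes the cache is keyed by the MD5 digest of the query string and that "eviction of hashes" removes that key.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// cache is a hypothetical stand-in for the prepared statements cache,
// keyed by the MD5 digest of the query string.
type cache map[[md5.Size]byte]string

func (c cache) evict(id [md5.Size]byte)            { delete(c, id) }
func (c cache) store(id [md5.Size]byte, q string)  { c[id] = q }
func (c cache) getPrepared(id [md5.Size]byte) bool { _, ok := c[id]; return ok }

// replay walks through the interleaving from the issue description in order
// and reports whether the statement is found right after Thread1 caches it
// and again after Thread2's late eviction.
func replay(query string) (hitBeforeT2Evict, hitAfterT2Evict bool) {
	c := cache{}
	id := md5.Sum([]byte(query))

	// Thread1 enters prepare, resolves safeToReturnCached as false, evicts.
	c.evict(id)
	// Thread2 enters prepare and also resolves safeToReturnCached as false;
	// it has not touched the cache yet.
	// Thread1 prepares the statement, caches it, and returns the digest.
	c.store(id, query)
	hitBeforeT2Evict = c.getPrepared(id)
	// Thread2 now executes its eviction, removing Thread1's fresh entry.
	c.evict(id)
	// Thread1 executes with the digest it was given: the lookup misses.
	hitAfterT2Evict = c.getPrepared(id)
	return
}

func main() {
	before, after := replay("SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?")
	fmt.Println("found after Thread1 caches:", before) // true
	fmt.Println("found after Thread2 evicts:", after)  // false
}
```

In this model Thread2 would go on to re-prepare and re-cache the statement, so the miss only occurs in the window between Thread2's eviction and its own caching step.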