[ https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809755#comment-17809755 ]

Long Pan commented on CASSANDRA-17401:
--------------------------------------

[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Basically, a high query rate and a large number of client connections are needed to trigger it. Here are the steps to reproduce:


*Server Setup:*
A 3-node Cassandra cluster. Each node has 64 GB of memory (16 GB heap) and 7 CPU cores, with {{native_transport_max_threads = 1024}}.

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text, PRIMARY KEY (p_id, c_id) );  -- primary key assumed from the read query below

*Client Setup:*
30 hosts. Each host runs the following pseudo-code, using the *GoCql* client; a runnable sketch follows right after it:
    cluster.CQLVersion = "3.4.0"
    cluster.ProtoVersion = 4
    cluster.Timeout = 5s
    cluster.ConnectTimeout = 10s
    cluster.NumConns = 3
    cluster.Consistency = LocalQuorum
    cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
    cluster.SocketKeepalive = 20s
    cluster.HostSelectionPolicy = RoundRobinHostPolicy

    sessionCount = 30
    qpsPerSession = 30
    cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id 
= ?"
    for (i = 0; i < sessionCount; i++) {
       session = cluster.createSession
       rateLimiter = NewRateLimiter(qpsPerSession)
       newGoRoutine.run( sendReads(session, rateLimiter) )
    }

    /*
      sendReads(session, rateLimiter) {
         for {
            newGoRoutine.run (
               if (rateLimiter.allow) {
                  session.execute(cqlQuery, randomString, randomString)
               }
            )
         }
      }
    */
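
For completeness, here is a minimal runnable sketch of the client above, using the gocql driver and golang.org/x/time/rate for the rate limiter. The contact points and the randString helper are placeholders (not part of our original harness), and error handling is kept to a bare minimum:

    package main

    import (
        "math/rand"
        "time"

        "github.com/gocql/gocql"
        "golang.org/x/time/rate"
    )

    const (
        sessionCount  = 30
        qpsPerSession = 30
        cqlQuery      = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"
    )

    // randString returns a random key value; placeholder for whatever key
    // distribution the real test uses.
    func randString() string {
        const letters = "abcdefghijklmnopqrstuvwxyz"
        b := make([]byte, 16)
        for i := range b {
            b[i] = letters[rand.Intn(len(letters))]
        }
        return string(b)
    }

    // sendReads mirrors the commented pseudo-code: a tight loop that fires one
    // goroutine per permitted request.
    func sendReads(session *gocql.Session, limiter *rate.Limiter) {
        for {
            if limiter.Allow() {
                go func() {
                    // Errors are ignored here; a real test would at least count them.
                    _ = session.Query(cqlQuery, randString(), randString()).Exec()
                }()
            }
        }
    }

    func main() {
        // Contact points are placeholders for the 3-node cluster.
        cluster := gocql.NewCluster("node1", "node2", "node3")
        cluster.CQLVersion = "3.4.0"
        cluster.ProtoVersion = 4
        cluster.Timeout = 5 * time.Second
        cluster.ConnectTimeout = 10 * time.Second
        cluster.NumConns = 3
        cluster.Consistency = gocql.LocalQuorum
        cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 1}
        cluster.SocketKeepalive = 20 * time.Second
        cluster.PoolConfig.HostSelectionPolicy = gocql.RoundRobinHostPolicy()

        for i := 0; i < sessionCount; i++ {
            session, err := cluster.CreateSession()
            if err != nil {
                panic(err)
            }
            limiter := rate.NewLimiter(rate.Limit(qpsPerSession), 1)
            go sendReads(session, limiter)
        }

        select {} // block forever while the read goroutines drive traffic
    }

The tight loop around limiter.Allow() is kept on purpose to match the pseudo-code; a production client would block on the limiter instead of spinning.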
Traffic generated this way results in ~10K coordinator QPS and ~3K client connections per Cassandra node (30 hosts × 30 sessions × 30 QPS ≈ 27K reads/s spread across 3 coordinators, and 30 hosts × 30 sessions × 3 connections per node ≈ 2.7K connections).

*Trigger Point:*
Manually issue a CQL query to add a column to the table: "ALTER TABLE test_ks.table1 ADD new_col text;"

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes show the number of native_transport threads reaching {{native_transport_max_threads}}, and the pending native transport tasks grow without bound.
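
As a side note, one way to watch the pile-up (assuming stock tooling) is the Native-Transport-Requests pool in nodetool's thread-pool stats, where Active sits at the configured maximum while Pending keeps growing:

    nodetool tpstats | grep -i native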

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17401
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ivan Senic
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause removing of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but statement is not in the cache as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> the client side is highly unlikely and I cannot simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when _safeToReturnCached_ is resolved as 
> false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
