[ 
https://issues.apache.org/jira/browse/CASSANDRA-19958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaydeepkumar Chovatia updated CASSANDRA-19958:
----------------------------------------------
    Description: 
Cassandra uses the same queue (Stage.MUTATION) to process local mutations as well as local hint writing. CASSANDRA-19534 added timeouts for local mutations, but local hint writing does not honor that timeout by design, since it honors a different timeout, i.e. _max_hint_window_in_ms_.

 

*The Problem*

Let's understand the problem with a five-node Cassandra cluster N1, N2, N3, N4, N5 and the following configuration:
 * concurrent_writes: 10
 * native_transport_timeout: 5s
 * write_request_timeout_in_ms: 2000 // 2 seconds


+StorageProxy.java snippet...+

 

!image-2024-09-26-15-28-20-435.png|height=200,width=600!

 

Let's assume N4 and N5 are slow, flapping, or down, and that N1 receives a flurry of mutations. This is what happens on N1:
 # Line no 1542: Append 100 hints to the Stage.MUTATION queue
 # Line no 1547: Append 100 local mutations to the Stage.MUTATION queue

 The Stage.MUTATION queue on N1 would look as follows:
{code:java}
hint1,hint2,hint3,....hint100,mutation1,mutation2,....mutation100 {code}
 * Assume each hint runnable takes 1 second; with 10 concurrent writers it then takes 10 seconds to process the 100 hints, and only after that are the local mutations processed.

 

So, in production, N1 appears inactive for almost 10 seconds: it is just writing hints locally and not participating in any quorum, etc.
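The head-of-line blocking described above can be reproduced in miniature with two plain executors. This is a minimal, self-contained simulation (class and method names are hypothetical, not Cassandra code): slow "hint" tasks and a fast "mutation" task share one fixed thread pool standing in for Stage.MUTATION, so the mutation cannot start until the hint backlog drains.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedQueueDemo
{
    // Measure how long a tiny "mutation" task waits when it is queued
    // behind slow "hint" tasks on one shared executor.
    static long measureMutationWaitMs() throws Exception
    {
        ExecutorService mutationStage = Executors.newFixedThreadPool(2);
        try
        {
            // Four slow local hint writes are enqueued first
            for (int i = 0; i < 4; i++)
                mutationStage.submit(() -> sleep(100));

            // The local mutation is enqueued last and must wait behind them
            long start = System.nanoTime();
            mutationStage.submit(() -> sleep(1)).get();
            return (System.nanoTime() - start) / 1_000_000;
        }
        finally
        {
            mutationStage.shutdown();
        }
    }

    static void sleep(long ms)
    {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println("mutation waited ~" + measureMutationWaitMs() + " ms");
    }
}
```

With 4 hints of 100 ms each on 2 workers, the 1 ms mutation completes only after roughly 200 ms, mirroring the 100-hints/10-writers arithmetic above.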

 

The problem is amplified under high load: if hints pile up to 1M, N1 chokes. The only remedy at that point is for an operator to restart N1 to drain all the piled-up hints from the Stage.MUTATION queue.

 

This happens because local hint writing and local mutations share the same queue, Stage.MUTATION.

Local mutation writing is on the hot path, whereas a slight delay in local hint writing causes little trouble.

 

*Reproducible steps*
 # Pull the latest 4.1.x release
 # Create a 5-node cluster
 # Set the following configuration
{code:java}
native_transport_timeout: 10s
write_request_timeout_in_ms: 2000
enforce_native_deadline_for_hints: true{code}

 # Inject 1s of latency inside the following API in _StorageProxy.java_ on all five nodes
 # 
{code:java}
private static void performLocally(Stage stage, Replica localReplica, final Runnable runnable, final RequestCallback<?> handler, Object description, Dispatcher.RequestTime requestTime)
{
    stage.maybeExecuteImmediately(new LocalMutationRunnable(localReplica, requestTime)
    {
        public void runMayThrow()
        {
            try
            {
                Thread.sleep(1000); // Inject latency here
                runnable.run();
                handler.onResponse(null);
            }
            catch (Exception ex)
            {
                if (!(ex instanceof WriteTimeoutException))
                    logger.error("Failed to apply mutation locally : ", ex);
                handler.onFailure(FBUtilities.getBroadcastAddressAndPort(), RequestFailureReason.forException(ex));
            }
        }

        @Override
        public String description()
        {
            // description is an Object and toString() is called so we do not have to evaluate Mutation.toString()
            // unless explicitly checked
            return description.toString();
        }

        @Override
        protected Verb verb()
        {
            return Verb.MUTATION_REQ;
        }
    });
} {code}

 # Run a write-only stress test for an hour or so
 # The Stage.MUTATION queue piles up to more than 1 million entries (visible via nodetool tpstats)
 # Stop the load
 # Stage.MUTATION is not cleared immediately, and new writes cannot be performed. At this point the Cassandra cluster has become inoperable for new mutations; only reads are served

 

*Solution*

The solution is to segregate the local mutation queue from the local hint-writing queue, which addresses the problem above.
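The effect of the proposed split can be sketched with the same kind of simulation (class and method names are hypothetical, not the actual Cassandra patch): once hints get their own executor, a hint backlog no longer delays a freshly submitted mutation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SeparateQueuesDemo
{
    // With hints on a dedicated executor, a hint backlog no longer
    // sits in front of local mutations.
    static long measureMutationWaitMs() throws Exception
    {
        ExecutorService hintStage = Executors.newFixedThreadPool(2);
        ExecutorService mutationStage = Executors.newFixedThreadPool(2);
        try
        {
            // Four slow local hint writes queue up on their own stage
            for (int i = 0; i < 4; i++)
                hintStage.submit(() -> sleep(100));

            // The local mutation runs immediately on the mutation stage
            long start = System.nanoTime();
            mutationStage.submit(() -> sleep(1)).get();
            return (System.nanoTime() - start) / 1_000_000;
        }
        finally
        {
            hintStage.shutdown();
            mutationStage.shutdown();
        }
    }

    static void sleep(long ms)
    {
        try { Thread.sleep(ms); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println("mutation waited ~" + measureMutationWaitMs() + " ms");
    }
}
```

Here the mutation completes in a few milliseconds even though 400 ms of hint work is still pending, which is the behavior the segregated queues are meant to restore.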

 

> Hints are stepping on online mutations
> --------------------------------------
>
>                 Key: CASSANDRA-19958
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19958
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jaydeepkumar Chovatia
>            Priority: Normal
>         Attachments: image-2024-09-26-15-28-20-435.png
>


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
