[ 
https://issues.apache.org/jira/browse/CASSANDRA-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904189#comment-17904189
 ] 

Josh McKenzie commented on CASSANDRA-20135:
-------------------------------------------

I've worked quite a bit recently with the {{StoppingTransformation in the 
paging across tombstones code, but haven't really delved into 
{{QueryCancellationChecker (CASSANDRA-17810 was an upstreaming of another 
author's work and it's been 2 years so... brain cache purged =/).}}}}

{{{{If neither [~marcuse] or [~aleksey] have immediate context on that to 
determine whether removal of that transformation application on the tail end is 
correct, I can take a look at this maybe next week before I step away for 
holiday.}}}}

> Assertion errors on CheckForAbort / QueryCancellationChecker on multiple 
> calls of applyToPartition
> --------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20135
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20135
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Stefan Miklosovic
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We see there are assertion errors thrown in 4.1 at least in 
> StoppingTransformation like these:
> {code}
> java.lang.RuntimeException: java.lang.AssertionError
>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:108)
>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>         at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
>         at 
> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.lang.AssertionError: null
>         at 
> org.apache.cassandra.db.transform.StoppingTransformation.attachTo(StoppingTransformation.java:72)
>         at org.apache.cassandra.db.transform.BaseRows.add(BaseRows.java:104)
>         at 
> org.apache.cassandra.db.transform.UnfilteredRows.add(UnfilteredRows.java:49)
>         at 
> org.apache.cassandra.db.transform.Transformation.add(Transformation.java:198)
>         at 
> org.apache.cassandra.db.transform.Transformation.apply(Transformation.java:140)
>         at 
> org.apache.cassandra.db.ReadCommand$CheckForAbort.applyToPartition(ReadCommand.java:616)
>         at 
> org.apache.cassandra.db.ReadCommand$CheckForAbort.applyToPartition(ReadCommand.java:604)
>         at 
> org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:97)
>         at 
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:303)
>         at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>         at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
>         at 
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>         at 
> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
>         at 
> org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
>         at 
> org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
>         ... 6 common frames omitted
> {code}
> This does not make sense at first sight and it is quite a rabbit hole to go 
> through. If you follow the stacktrace, you see that 
> {code}
> Caused by: java.lang.AssertionError: null
>         at 
> org.apache.cassandra.db.transform.StoppingTransformation.attachTo(StoppingTransformation.java:72)
> {code}
> but ... why? It means that this (1) was called twice because that is the only 
> place where "this.rows" are ever updated in that class (and this.rows is 
> private) which means that _something_ has to call this twice in a row. Once 
> it sets it just fine and another time it goes to set it again but it fails as 
> it is not null anymore. Hence, the question is why is that set twice?
> The reason is quite elaborative. "attachTo" which throws is ever called in 
> BaseRows#add(Transformation) (2) and just on the next line it calls 
> "super.add(transformation);" which adds that transformation at the end of a 
> stack in Stack class which BaseRows extends.
> {code}
>     void add(Transformation add)
>     {
>         if (length == stack.length)
>             stack = resize(stack);
>         stack[length++] = add;
>     }
> {code}
> Next thing we see from the stacktrace is that CheckForAbort.applyToPartition 
> is calling Transformation.apply (3) and what that ultimately does is that it 
> will add itself, again, at the end of the stack (4).
> When we look at that stacktrace as a whole, what it does is that it is 
> iterating over Unfiliteredpartition while building a local data response on a 
> read and as it does so, it calls "BasePartitions.hasNext". Now we are getting 
> to that ... (5). What "hasNext" is doing is that while this.next is null, it 
> will take the stack and it loops in while by taking "next" from "input" and 
> it applies all the transformations by calling "fs[i].applyToParition(next)".
> So, there is a stack of transformations and they are called just one after 
> another until some result of "applyToPartition" returns null or we iterated 
> over all transformations. The chain of transformations also include 
> "CheckForAbort" transformation which we added here (6) so what happens is 
> that when we call "applyToPartitions" for the first time on CheckForAbort, it 
> will run just fine, but when that while loop / for loop in BasePartitions is 
> called _again_ (e.g. we are calling "hasNext" upon iterating in 
> UnfilteredPartitionIterators), then "applyToPartition" for "CheckForAbort" 
> will be called again as well. But CheckForAbort is doing this (7).
> {code}
>         protected UnfilteredRowIterator 
> applyToPartition(UnfilteredRowIterator partition)
>         {
>             if (maybeAbort())
>             {
>                 partition.close();
>                 return null;
>             }
>             return Transformation.apply(partition, this);
>         }
> {code}
> Check the last line where it applies itself when it is not aborted:
> {code}
> Transformation.apply(partition, this)
> {code}
> The application of this stopping transformation to given partition means that 
> it will add that transformation at the end of the stack as we already showed. 
> Then, we will iterate over that stack again upon iterating in BasePartitions, 
> which eventually calls "attachTo" for the second time, hence the assertion 
> error.
> The stack might look like
> {code}
> stack[0] = transformation1
> stack[1] = transformation2
> stack[2] = CheckForAbort
> {code}
> then we call "fs[i].applyToParition(next)" which will modify the stack like 
> this:
> {code}
> stack[0] = transformation1
> stack[1] = transformation2
> stack[2] = CheckForAbort
> stack[3] = CheckForAbort 
> {code}
> Then we will loop over that again and if I am not mistaken, when we hit 
> stack[2], it will call applyToPartition on that and it will do 
> {code}
> stack[0] = transformation1
> stack[1] = transformation2
> stack[2] = CheckForAbort // this basically adds itself at the end again
> stack[3] = CheckForAbort 
> stack[4] = CheckForAbort 
> {code}
> but we actually never get this far because on adding itself to the stack, we 
> will hit that assertion error. 
> I also see that CheckForAbort was replaced by CASSANDRA-17810 (8) by 
> QueryCancellationChecker but except some "cosmetic" changes, the logic 
> remains the same. That stopping transformation is applying itself in 
> applyToPartition so I think that this problem is present in 5.0+ too. No 
> transformation is applying itself like that but this one. 
> I am not completely sure what we should do about this but two ideas are 
> obvious (with non-zero probability that both of them are wrong)
> 1) We apply _new instance_ of QueryCancellationChecker / CheckForAbort so we 
> will never call attachTo on the _same_ instance (but then we would end up 
> with a bunch of QueryCancellationChecker / CheckForAbort instances in the 
> stack, brrr)
> 2) We will remove "assert" in attachTo in StoppingTransformation so we will 
> enable this to be called twice so we will not throw at least ... 
> (1) 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/transform/StoppingTransformation.java#L72
> (2) 
> https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/transform/BaseRows.java#L104
> (3) 
> https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadCommand.java#L627
> (4) 
> https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/transform/Transformation.java#L198
> (5) 
> https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/transform/BasePartitions.java#L87-L109
> (6) 
> https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadCommand.java#L436
> (7) 
> https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadCommand.java#L619-L628
> (8) 
> https://github.com/apache/cassandra/commit/f4b69ba0e82bb051e56a92d792142034d9f617f0#diff-554e7dff38b500f5eaed0b9b651c7098c3f8a1bd4f6aca12063eab352e685b9fR690
> cc [~jmckenzie] [~marcuse] [~aleksey]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to