[jira] [Commented] (CASSANDRA-12236) RTE from new CDC column breaks in flight queries.

Joshua McKenzie (JIRA) Fri, 22 Jul 2016 05:58:49 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389445#comment-15389445
 ]


Joshua McKenzie commented on CASSANDRA-12236:
---------------------------------------------

Current status: Not entirely sure what to make of that [upgrade test 
run|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/].

65 failures out of ~1350 test runs, so if the driver was bailing on the null 
in cdc I'd expect we'd see far more than that. The errors I'm seeing are all 
over the map though, and I don't know how many upgrade test errors we "expect" 
at this point (I'm not seeing a 3.7 upgrade job on cassci to compare against):
* [Timeouts|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_0_x_To_indev_3_x/basic_paging_test/]
* [LegacyPagedRangeCommandSerializer.deserialize 
assertions|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_0_x_To_indev_3_x/basic_paging_test_2/]
 - see CASSANDRA-12249. I haven't dug deeply into that code but from initially 
looking into it, I'm not sure how an added column in schema would lead to us 
sending a deprecated PAGED_RANGE from a 3.8 to a 3.0.x node. That being said, I 
don't see any "guards" in general around a 
PartitionRangedReadCommand.createMessage with a paging data range, and that 
predated the changes in CASSANDRA-11393 so that would require more inspection 
to figure out what's going on.
* [Secondary index paging 
timeouts|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_x_To_indev_3_x/test_paging_using_secondary_indexes/]
* [Failure to find unrelated 
columns|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.cql_tests/TestCQLNodes2RF1_Upgrade_current_3_0_x_To_indev_3_x/select_with_alias_test/]

As for how we want to proceed from here: I'd say we a) re-run the upgrade jobs 
to see if the timeouts were a flaky environment (we had a lot of problems with 
that yesterday across a lot of jobs), b) commit this change to 3.8/3.9/trunk, 
and c) start working the CASSANDRA-12249 angle, since that error showed up 
considerably more frequently than any other single error in the upgrade test 
runs I saw.

As for what this means for the 3.8 release, my .02 is that I'd want to delta it 
against what upgrade tests looked like for 3.6, 3.4, 3.2. This is an even 
release, meaning we don't recommend rolling it out in production, and as long 
as our load of upgrade test failures for 3.8 isn't a regression from the load 
we had for 3.6, I'd say we move forward, potentially even before hammering out 
CASSANDRA-12249. Currently even releases are "feature" releases and odd are 
"stable", so there's no real need to hold up an even release for upgrade-only, 
mixed-version specific cluster problems in my opinion.
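
The "delta it against 3.6" comparison above could be sketched roughly as below. This is only an illustration: the function name `failure_delta` and the test names are hypothetical placeholders, not taken from any actual cassci run, and "regression" is assumed here to mean a higher total failure count than the baseline.

```python
def failure_delta(baseline_failures, candidate_failures):
    """Compare two collections of failing test names.

    Returns the failures that are new in the candidate, the ones that went
    away since the baseline, and the ones present in both.
    """
    baseline = set(baseline_failures)
    candidate = set(candidate_failures)
    return {
        "new": sorted(candidate - baseline),
        "fixed": sorted(baseline - candidate),
        "shared": sorted(candidate & baseline),
    }


# Hypothetical failure lists, for illustration only.
failures_3_6 = ["upgrade_tests.example_a", "upgrade_tests.example_b"]
failures_3_8 = ["upgrade_tests.example_b", "upgrade_tests.example_c"]

delta = failure_delta(failures_3_6, failures_3_8)

# Assumed definition: the candidate regresses only if its overall
# failure load is larger than the baseline's.
is_regression = len(set(failures_3_8)) > len(set(failures_3_6))
```

Looking at `delta["new"]` separately from the raw counts helps distinguish "same load, different failures" from "same known failures as before".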

> RTE from new CDC column breaks in flight queries.
> -------------------------------------------------
>
>                 Key: CASSANDRA-12236
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12236
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jeremiah Jordan
>            Assignee: Joshua McKenzie
>             Fix For: 3.x
>
>         Attachments: 12236.txt
>
>
> This RTE is not harmless. It will cause the internode connection to break 
> which will cause all in flight requests between these nodes to die/timeout.
> {noformat}
>     - Due to changes in schema migration handling and the storage format 
> after 3.0, you will
>       see error messages such as:
>          "java.lang.RuntimeException: Unknown column cdc during 
> deserialization"
>       in your system logs on a mixed-version cluster during upgrades. This 
> error message
>       is harmless and due to the 3.8 nodes having cdc added to their schema 
> tables while
>       the <3.8 nodes do not. This message should cease once all nodes are 
> upgraded to 3.8.
>       As always, refrain from schema changes during cluster upgrades.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
