[ https://issues.apache.org/jira/browse/CASSANDRA-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389445#comment-15389445 ]
Joshua McKenzie commented on CASSANDRA-12236:
---------------------------------------------

Current status:

Not entirely sure what to make of that [upgrade test run|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/]. There were 65 failures out of ~1350 test runs, so if the driver was bailing on the null in cdc I'd expect we'd see far more than that. The errors I'm seeing are all over the map though, and I don't know how many upgrade test errors we "expect" at this point (I'm not seeing a 3.7 upgrade job on cassci):

* [Timeouts|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_0_x_To_indev_3_x/basic_paging_test/]
* [LegacyPagedRangeCommandSerializer.deserialize assertions|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_0_x_To_indev_3_x/basic_paging_test_2/] - see CASSANDRA-12249. I haven't dug deeply into that code, but from an initial look I'm not sure how an added column in the schema would lead to us sending a deprecated PAGED_RANGE from a 3.8 node to a 3.0.x node. That being said, I don't see any "guards" in general around a PartitionRangeReadCommand.createMessage with a paging data range, and that predated the changes in CASSANDRA-11393, so it would require more inspection to figure out what's going on.
* [Secondary index paging timeouts|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_x_To_indev_3_x/test_paging_using_secondary_indexes/]
* [Failure to find unrelated columns|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.cql_tests/TestCQLNodes2RF1_Upgrade_current_3_0_x_To_indev_3_x/select_with_alias_test/]

As for how we want to proceed from here: I'd say we a) re-run the upgrade jobs to see whether the timeouts were due to a flaky environment (we had a lot of problems with that yesterday across a lot of jobs), b) commit this change to 3.8/3.9/trunk, and c) start working the CASSANDRA-12249 angle, since that error showed up considerably more frequently than any other single error in the upgrade test runs I saw.

As for what this means for the 3.8 release, my .02 is that I'd want to delta it against what the upgrade tests looked like for 3.6, 3.4, and 3.2. This is an even release, meaning we don't recommend rolling it out in production, and as long as our load of upgrade test failures for 3.8 isn't a regression from the load we had for 3.6, I'd say we move forward, potentially even before hammering out CASSANDRA-12249. Currently even releases are "feature" releases and odd are "stable", so in my opinion there's no real need to hold up an even release for upgrade-only problems specific to mixed-version clusters.

> RTE from new CDC column breaks in flight queries.
> -------------------------------------------------
>
>          Key: CASSANDRA-12236
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-12236
>      Project: Cassandra
>   Issue Type: Bug
>     Reporter: Jeremiah Jordan
>     Assignee: Joshua McKenzie
>      Fix For: 3.x
>
>  Attachments: 12236.txt
>
>
> This RTE is not harmless. It will cause the internode connection to break
> which will cause all in flight requests between these nodes to die/timeout.
> {noformat}
>    - Due to changes in schema migration handling and the storage format after 3.0, you will
>      see error messages such as:
>         "java.lang.RuntimeException: Unknown column cdc during deserialization"
>      in your system logs on a mixed-version cluster during upgrades. This error message
>      is harmless and due to the 3.8 nodes having cdc added to their schema tables while
>      the <3.8 nodes do not. This message should cease once all nodes are upgraded to 3.8.
>      As always, refrain from schema changes during cluster upgrades.
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)