[jira] [Commented] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138135#comment-17138135
 ] 

ZhaoYang commented on CASSANDRA-15861:
--

[~dcapwell] you are right. {{IndexSummary}} can definitely cause trouble for 
entire-sstable-streaming. In that case, the only option we have is to apply the 
first approach to {{IndexSummary}}, because we can't make {{IndexSummary}} use a 
fixed-length encoding.

> Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> ---
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, "nodetool repair" is executed on node1 and node2 is killed 
> during repair. At the end, node3 reports a checksum validation failure on the 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> This isn't a problem in legacy streaming as STATS file length didn't matter.
> Ideally it will be great 
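To make the race described in steps 1-7 concrete, here is a minimal, self-contained 
sketch of the failure mode. It is not Cassandra code: the file contents, class name, 
and checksum helper are invented for illustration, and the real failure surfaces in 
{{MetadataSerializer#maybeValidateChecksum}}, per the stack trace above.

{code:java|title=illustrative sketch of the manifest/mutation race (hypothetical names)}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class StatsRaceSketch
{
    static long crc32(byte[] bytes)
    {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException
    {
        Path stats = Files.createTempFile("na-2-big-Statistics", ".db");
        Files.write(stats, "repairedAt=0,pendingRepair=<session-id>".getBytes());

        // Step 2: the manifest captures the component's length (and implicitly
        // its contents) at the moment the transfer is planned.
        long manifestLength = Files.size(stats);
        long manifestCrc = crc32(Files.readAllBytes(stats));

        // Steps 4-5: background compaction mutates the STATS metadata in place,
        // changing the serialized bytes (and possibly the size).
        Files.write(stats, "repairedAt=0,pendingRepair=null".getBytes());

        // Steps 6-7: the receiver validates what was actually transferred against
        // what the manifest promised; the mismatch surfaces as a checksum failure.
        if (Files.size(stats) != manifestLength || crc32(Files.readAllBytes(stats)) != manifestCrc)
            throw new IOException("Checksums do not match for " + stats);
    }
}
{code}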

[jira] [Comment Edited] (CASSANDRA-15782) Compression test failure

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138117#comment-17138117
 ] 

Caleb Rackliffe edited comment on CASSANDRA-15782 at 6/17/20, 5:48 AM:
---

[~Bereng] [~jolynch] I think [~djoshi] and I are seeing this pop up again here: 
https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550/parallel-runs/2.
 The failure output seems to indicate the changes from this patch are present. 
(The C* branch in question should be [very close to 
trunk|https://github.com/dineshjoshi/cassandra/tree/CASSANDRA-14888].)


was (Author: maedhroz):
[~Bereng] [~jolynch] I think [~djoshi] and I are seeing this pop up again here: 
https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550/parallel-runs/2.
 The failure output seems to indicate the changes from this patch are present.

> Compression test failure
> 
>
> Key: CASSANDRA-15782
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15782
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Berenguer Blasi
>Assignee: Joey Lynch
>Priority: Normal
> Fix For: 4.0, 4.0-alpha5
>
>
> On CASSANDRA-15560 compression test failed. This was bisected to 
> [9c1bbf3|https://github.com/apache/cassandra/commit/9c1bbf3ac913f9bdf7a0e0922106804af42d2c1e]
>  from CASSANDRA-15379.
> Full details here
> CC/ [~jolynch] in case he can spot it quick.






[jira] [Comment Edited] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138119#comment-17138119
 ] 

Caleb Rackliffe edited comment on CASSANDRA-14888 at 6/17/20, 5:46 AM:
---

I've made a note in CASSANDRA-15782, but I think it's pretty safe to say this 
patch isn't the cause of any of the regressions we're seeing. I'd say we're 
ready to commit.


was (Author: maedhroz):
I've made a note in CASSANDRA-15782, but I think it's pretty safe to say this 
patch isn't the cause of any of the regressions we're seeing.

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.
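As a rough illustration of the "detect leftover mbeans" check the description asks 
for, the test can boil down to a JMX query against the metrics domain. This is a 
hedged sketch in Java (the actual dtest would live in the Python dtest suite): the 
{{org.apache.cassandra.metrics}} domain and the {{keyspace}} key property are the 
commonly used ObjectName layout, but that layout should be confirmed against the 
version under test.

{code:java|title=sketch: query for metric mbeans left behind after a DROP (assumed ObjectName pattern)}
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class LeftoverMBeanCheck
{
    // Returns all metric mbeans still registered for the given keyspace; after
    // DROP KEYSPACE this set is expected to be empty.
    public static Set<ObjectName> leftoverMetricBeans(String keyspace) throws MalformedObjectNameException
    {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName pattern = new ObjectName("org.apache.cassandra.metrics:keyspace=" + keyspace + ",*");
        return server.queryNames(pattern, null);
    }
}
{code}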






[jira] [Commented] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138119#comment-17138119
 ] 

Caleb Rackliffe commented on CASSANDRA-14888:
-

I've made a note in CASSANDRA-15782, but I think it's pretty safe to say this 
patch isn't the cause of any of the regressions we're seeing.

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.






[jira] [Commented] (CASSANDRA-15782) Compression test failure

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138117#comment-17138117
 ] 

Caleb Rackliffe commented on CASSANDRA-15782:
-

[~Bereng] [~jolynch] I think [~djoshi] and I are seeing this pop up again here: 
https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550/parallel-runs/2.
 The failure output seems to indicate the changes from this patch are present.

> Compression test failure
> 
>
> Key: CASSANDRA-15782
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15782
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Berenguer Blasi
>Assignee: Joey Lynch
>Priority: Normal
> Fix For: 4.0, 4.0-alpha5
>
>
> On CASSANDRA-15560 compression test failed. This was bisected to 
> [9c1bbf3|https://github.com/apache/cassandra/commit/9c1bbf3ac913f9bdf7a0e0922106804af42d2c1e]
>  from CASSANDRA-15379.
> Full details here
> CC/ [~jolynch] in case he can spot it quick.






[jira] [Comment Edited] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138058#comment-17138058
 ] 

Caleb Rackliffe edited comment on CASSANDRA-14888 at 6/17/20, 5:26 AM:
---

[~djoshi] After the known issues above, the only other failure appears to be 
[compression_test.TestCompression|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550],
 but CASSANDRA-15782 should have addressed that. Not exactly sure what's going 
on there, given that the fix was committed to {{cassandra-dtest}} at the 
[beginning of 
May|https://github.com/apache/cassandra-dtest/commit/da7fcefb16d16af8924cda35c0a6a63ad553694f].


was (Author: maedhroz):
[~djoshi] After the known issues above, the only other failure appears to be 
[compression_test.TestCompression|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550],
 but CASSANDRA-15782 should have addressed that. Not exactly sure what's going 
on there yet...

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.






[jira] [Commented] (CASSANDRA-15871) Cassandra driver first connection get the user's own schema information

2020-06-16 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138066#comment-17138066
 ] 

Brandon Williams commented on CASSANDRA-15871:
--

If they're both superusers, I'm not sure I understand the validity of the test.

> Cassandra driver first  connection get the user's own schema information
> 
>
> Key: CASSANDRA-15871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Schema, Messaging/Client
>Reporter: maxwellguo
>Priority: Normal
> Attachments: 1.jpg
>
>
> We know that when the cassandra driver makes a connection to the coordinator 
> node for the first time, the driver may select all the 
> keyspaces/tables/columns/types from the server and cache the data locally. 
> Different users may have different tables and types, so it is not appropriate 
> to cache all the metadata; it is fine to cache just the user's own schema 
> information rather than everything.
>  Doing this is also safe and saves resources on the first connection.






[jira] [Comment Edited] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138058#comment-17138058
 ] 

Caleb Rackliffe edited comment on CASSANDRA-14888 at 6/17/20, 3:18 AM:
---

[~djoshi] After the known issues above, the only other failure appears to be 
[compression_test.TestCompression|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550],
 but CASSANDRA-15782 should have addressed that. Not exactly sure what's going 
on there yet...


was (Author: maedhroz):
[~djoshi] After the known issues above, the only other failure appears to be 
[compression_test.TestCompression|
https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550],
 but CASSANDRA-15782 should have addressed that. Not exactly sure what's going 
on there yet...

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.






[jira] [Commented] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138058#comment-17138058
 ] 

Caleb Rackliffe commented on CASSANDRA-14888:
-

[~djoshi] After the known issues above, the only other failure appears to be 
[compression_test.TestCompression|
https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/48/workflows/23de1e8d-108e-4138-8ea6-a650965920a5/jobs/2550],
 but CASSANDRA-15782 should have addressed that. Not exactly sure what's going 
on there yet...

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.






[jira] [Commented] (CASSANDRA-15871) Cassandra driver first connection get the user's own schema information

2020-06-16 Thread maxwellguo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138040#comment-17138040
 ] 

maxwellguo commented on CASSANDRA-15871:


For my test, I used userA to create kscas and userB to create ksgc; both are 
superusers.

But when making the first connection as userA, all keyspaces, including 
userB's ksgc, are returned to the driver.

It seems user permissions do not work here.

> Cassandra driver first  connection get the user's own schema information
> 
>
> Key: CASSANDRA-15871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Schema, Messaging/Client
>Reporter: maxwellguo
>Priority: Normal
> Attachments: 1.jpg
>
>
> We know that when the cassandra driver makes a connection to the coordinator 
> node for the first time, the driver may select all the 
> keyspaces/tables/columns/types from the server and cache the data locally. 
> Different users may have different tables and types, so it is not appropriate 
> to cache all the metadata; it is fine to cache just the user's own schema 
> information rather than everything.
>  Doing this is also safe and saves resources on the first connection.






[jira] [Comment Edited] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137986#comment-17137986
 ] 

Caleb Rackliffe edited comment on CASSANDRA-15879 at 6/17/20, 2:53 AM:
---

I pushed up [a PR|https://github.com/apache/cassandra/pull/638] that 
approximates the 2.2 version. (There have been a number of other changes to 
{{CorruptedSSTablesCompactionsTest}} since then to fix other kinds of 
flakiness.) The only downside I can see to following through w/ this is that if 
we leave things as they are and there's another failure, we'd know exactly 
which seed broke things.


was (Author: maedhroz):
I pushed up [a PR|https://github.com/apache/cassandra/pull/638] that 
approximates the 2.2 version. (There have been a number of other changes to 
{{CorruptedSSTablesCompactionsTest}} since then to fix other kinds of 
flakiness.)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest

2020-06-16 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138038#comment-17138038
 ] 

Ekaterina Dimitrova commented on CASSANDRA-15685:
-

We might fix the IR, but it is not a blocker now, as there is no defect and it 
is a rare case. It will be taken care of in beta; it is not a blocker. That is 
what I meant.

> flaky testWithMismatchingPending - 
> org.apache.cassandra.distributed.test.PreviewRepairTest
> --
>
> Key: CASSANDRA-15685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Kevin Gallardo
>Assignee: Ekaterina Dimitrova
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: log-CASSANDRA-15685.txt, output
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Observed in: 
> https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests
> Failure:
> {noformat}
> testWithMismatchingPending - 
> org.apache.cassandra.distributed.test.PreviewRepairTest
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97)
> {noformat}
> [Circle 
> CI|https://circleci.com/gh/dcapwell/cassandra/tree/bug%2FCASSANDRA-15685]






[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest

2020-06-16 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138020#comment-17138020
 ] 

David Capwell commented on CASSANDRA-15685:
---

Sorry for the delay.

I am fine with the test getting fixed without changing IR, though this adds an 
unexpected edge case for users, even if it is expected to be rare in production.

> flaky testWithMismatchingPending - 
> org.apache.cassandra.distributed.test.PreviewRepairTest
> --
>
> Key: CASSANDRA-15685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Kevin Gallardo
>Assignee: Ekaterina Dimitrova
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: log-CASSANDRA-15685.txt, output
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Observed in: 
> https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests
> Failure:
> {noformat}
> testWithMismatchingPending - 
> org.apache.cassandra.distributed.test.PreviewRepairTest
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97)
> {noformat}
> [Circle 
> CI|https://circleci.com/gh/dcapwell/cassandra/tree/bug%2FCASSANDRA-15685]






[jira] [Updated] (CASSANDRA-15852) Handle errors in StreamSession#prepare

2020-06-16 Thread David Capwell (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Capwell updated CASSANDRA-15852:
--
Status: Changes Suggested  (was: Review In Progress)

> Handle errors in StreamSession#prepare
> --
>
> Key: CASSANDRA-15852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15852
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Streaming
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Since CASSANDRA-12229 we don't handle errors in {{StreamSession#prepare}} - 
> this makes a stream initiator hang forever if an error is thrown.






[jira] [Updated] (CASSANDRA-15851) Add bytebuddy support for in-jvm dtests

2020-06-16 Thread David Capwell (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Capwell updated CASSANDRA-15851:
--
Reviewers: Alex Petrov, David Capwell, David Capwell  (was: Alex Petrov, 
David Capwell)
   Alex Petrov, David Capwell, David Capwell  (was: Alex Petrov)
   Status: Review In Progress  (was: Patch Available)

* 
https://github.com/apache/cassandra-in-jvm-dtest-api/pull/11/files#diff-d58040416ac6fdf2482ed7100441b555R265
 this allows null, so we should add a null check here 
https://github.com/krummas/cassandra/commit/6899cfb970d5674e2f012371bd6ba23294cf882d#diff-e398a00672550f1911eb13e4d4aa86cbR159

Overall LGTM, only a small thing; +1

[~ifesdjeen] I have not been paying attention: are we planning to release .3, or 
are we going to start supporting snapshots?  I know I have been the one 
opposing snapshots, so I'm not sure if it got brought up again.

> Add bytebuddy support for in-jvm dtests
> ---
>
> Key: CASSANDRA-15851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15851
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
>  Labels: pull-request-available
>
> Old python dtests support byteman, but that is quite horrible to work with, 
> [bytebuddy|https://bytebuddy.net/#/] is much better, so we should add support 
> for that in the in-jvm dtests.
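For readers unfamiliar with the approach, the kind of interception involved typically 
looks like the sketch below. It follows ByteBuddy's documented redefine/load pattern; 
the {{Target}} and {{Failing}} classes are hypothetical stand-ins, not anything from 
this patch.

{code:java|title=sketch: ByteBuddy-style fault injection (hypothetical target class)}
import static net.bytebuddy.matcher.ElementMatchers.named;

import net.bytebuddy.ByteBuddy;
import net.bytebuddy.agent.ByteBuddyAgent;
import net.bytebuddy.dynamic.loading.ClassReloadingStrategy;
import net.bytebuddy.implementation.MethodDelegation;

public class FaultInjectionSketch
{
    // Hypothetical production class whose behaviour a test wants to sabotage.
    public static class Target
    {
        public String work() { return "ok"; }
    }

    // Static delegate that replaces Target#work with a thrown failure.
    public static class Failing
    {
        public static String work() { throw new RuntimeException("injected failure"); }
    }

    public static void install()
    {
        ByteBuddyAgent.install();
        new ByteBuddy()
            .redefine(Target.class)
            .method(named("work"))
            .intercept(MethodDelegation.to(Failing.class))
            .make()
            .load(Target.class.getClassLoader(), ClassReloadingStrategy.fromInstalledAgent());
    }
}
{code}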






[jira] [Updated] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-14888:

Status: Review In Progress  (was: Changes Suggested)

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.






[jira] [Commented] (CASSANDRA-15852) Handle errors in StreamSession#prepare

2020-06-16 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137987#comment-17137987
 ] 

David Capwell commented on CASSANDRA-15852:
---

I am +1 assuming the exception type is changed, or an exception message is 
checked.

> Handle errors in StreamSession#prepare
> --
>
> Key: CASSANDRA-15852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15852
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Streaming
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Since CASSANDRA-12229 we don't handle errors in {{StreamSession#prepare}} - 
> this makes a stream initiator hang forever if an error is thrown.






[jira] [Updated] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15879:

Test and Documentation Plan: CircleCI: TODO
 Status: Patch Available  (was: In Progress)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Commented] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137986#comment-17137986
 ] 

Caleb Rackliffe commented on CASSANDRA-15879:
-

I pushed up [a PR|https://github.com/apache/cassandra/pull/638] that 
approximates the 2.2 version. (There have been a number of other changes to 
{{CorruptedSSTablesCompactionsTest}} since then to fix other kinds of 
flakiness.)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Updated] (CASSANDRA-15852) Handle errors in StreamSession#prepare

2020-06-16 Thread David Capwell (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Capwell updated CASSANDRA-15852:
--
Reviewers: David Capwell, David Capwell  (was: David Capwell)
   David Capwell, David Capwell
   Status: Review In Progress  (was: Patch Available)

Mostly LGTM

* 
https://github.com/krummas/cassandra/commit/ee0a5f2a849b8a11d760ca2975a61fd4bbdc1735#diff-040c2f1cb2ba51b14c7e249412d6574eR62
 Can we use a custom exception here, or add a message and verify the message?  
RuntimeException can be thrown in many locations, so this test could pass 
without triggering this condition.
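To illustrate the point, an assertion pinned to a dedicated exception type (or at 
least to a known message) can only pass when the injected prepare failure is the one 
that actually propagated. The names below are hypothetical, not the ones in the 
patch, and the sketch assumes JUnit 4.13's {{assertThrows}}.

{code:java|title=sketch: asserting on a specific injected failure (hypothetical names)}
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThrows;

import org.junit.Test;

public class PrepareFailureAssertionSketch
{
    // Dedicated exception type for the injected fault, so an unrelated
    // RuntimeException thrown elsewhere cannot satisfy the assertion.
    static class InjectedPrepareFailure extends RuntimeException
    {
        InjectedPrepareFailure(String message) { super(message); }
    }

    // Hypothetical stand-in for the code path that fails during prepare.
    static void prepareThatFails()
    {
        throw new InjectedPrepareFailure("prepare failed on purpose");
    }

    @Test
    public void failedPrepareSurfacesInjectedError()
    {
        InjectedPrepareFailure thrown =
            assertThrows(InjectedPrepareFailure.class, PrepareFailureAssertionSketch::prepareThatFails);
        assertEquals("prepare failed on purpose", thrown.getMessage());
    }
}
{code}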

> Handle errors in StreamSession#prepare
> --
>
> Key: CASSANDRA-15852
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15852
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Streaming
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Since CASSANDRA-12229 we don't handle errors in {{StreamSession#prepare}} - 
> this makes a stream initiator hang forever if an error is thrown.






[jira] [Assigned] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe reassigned CASSANDRA-15879:
---

Assignee: Caleb Rackliffe

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Updated] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15879:

Source Control Link: https://github.com/apache/cassandra/pull/638 (3.0)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Assigned] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe reassigned CASSANDRA-15879:
---

Assignee: (was: Caleb Rackliffe)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Priority: Normal
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Commented] (CASSANDRA-13994) Remove COMPACT STORAGE internals before 4.0 release

2020-06-16 Thread Josh McKenzie (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137982#comment-17137982
 ] 

Josh McKenzie commented on CASSANDRA-13994:
---

[~djoshi] - are you reviewing this? Or Jordan, or Sylvain, or?  :)

 

Seems like we have a lot of hands on this one. Just want to clarify so I know 
who to -badger- follow up with about it.

> Remove COMPACT STORAGE internals before 4.0 release
> ---
>
> Key: CASSANDRA-13994
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13994
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Legacy/Local Write-Read Paths
>Reporter: Alex Petrov
>Assignee: Ekaterina Dimitrova
>Priority: Low
> Fix For: 4.0, 4.0-alpha
>
>
> 4.0 comes without thrift (after [CASSANDRA-5]) and COMPACT STORAGE (after 
> [CASSANDRA-10857]), and since Compact Storage flags are now disabled, all of 
> the related functionality is useless.
> There are still some things to consider:
> 1. One of the system tables (built indexes) was compact. For now, we just 
> added {{value}} column to it to make sure it's backwards-compatible, but we 
> might want to make sure it's just a "normal" table and doesn't have redundant 
> columns.
> 2. Compact Tables were building indexes in {{KEYS}} mode. Removing it is 
> trivial, but this would mean that all built indexes will be defunct. We could 
> log a warning for now and ask users to migrate off those for now and 
> completely remove it from future releases. It's just a couple of classes 
> though.






[jira] [Commented] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137979#comment-17137979
 ] 

David Capwell commented on CASSANDRA-15861:
---

If I'm reading this correctly, I wonder if this should also be an issue with 
org.apache.cassandra.io.sstable.format.SSTableReader#cloneWithNewSummarySamplingLevel,
 which is called by org.apache.cassandra.io.sstable.IndexSummaryRedistribution; 
this modifies the summary file in place.

> Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> ---
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, "nodetool repair" is executed on node1 and node2 is killed 
> during repair. At the end, node3 reports a checksum validation failure on the 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> This isn't a problem in legacy streaming as STATS file length didn't matter.

[jira] [Updated] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15879:

Status: Open  (was: Resolved)

It seems the test has been renamed to {{CorruptedSSTablesCompactionsTest}}.

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Issue Comment Deleted] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15879:

Comment: was deleted

(was: Sorry for the noise. The fork where the failure occurred has not been 
sync'd with the upstream repo in quite a while. {{BlacklistingCompactionsTest}} 
no longer exists.)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Updated] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15879:

Resolution: Invalid
Status: Resolved  (was: Open)

Sorry for the noise. The fork where the failure occurred has not been sync'd 
with the upstream repo in quite a while. {{BlacklistingCompactionsTest}} no 
longer exists.

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Updated] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15879:

Fix Version/s: (was: 4.0-beta)
   (was: 3.11.x)
   (was: 3.0.x)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.






[jira] [Updated] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-14888:

Test and Documentation Plan: 
Review and tests
CircleCI: https://circleci.com/gh/dineshjoshi/cassandra/2534

  was:
Review and tests
CircleCI: 
https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f


> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.






[jira] [Created] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)
Caleb Rackliffe created CASSANDRA-15879:
---

 Summary: Flaky unit test: 
BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
 Key: CASSANDRA-15879
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
 Project: Cassandra
  Issue Type: Bug
  Components: Test/unit
Reporter: Caleb Rackliffe


CASSANDRA-14238 addressed the failure in 
{{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
 but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
 on trunk.

It looks like this should be a clean merge forward.






[jira] [Assigned] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe reassigned CASSANDRA-15879:
---

Assignee: Caleb Rackliffe

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14238) Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137951#comment-17137951
 ] 

Caleb Rackliffe commented on CASSANDRA-14238:
-

Just created CASSANDRA-15879. Will put up the patch shortly...

> Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest
> --
>
> Key: CASSANDRA-14238
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14238
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Testing
>Reporter: Jay Zhuang
>Assignee: Marcus Eriksson
>Priority: Low
>  Labels: testing
> Fix For: 2.2.13
>
>
> The unittest is flaky
> {noformat}
> [junit] Testcase: 
> testBlacklistingWithSizeTieredCompactionStrategy(org.apache.cassandra.db.compaction.BlacklistingCompactionsTest):
>  FAILED
> [junit] expected:<8> but was:<25>
> [junit] junit.framework.AssertionFailedError: expected:<8> but was:<25>
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklisting(BlacklistingCompactionsTest.java:170)
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy(BlacklistingCompactionsTest.java:71)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15879) Flaky unit test: BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15879:

 Bug Category: Parent values: Correctness(12982)Level 1 values: Test 
Failure(12990)
   Complexity: Low Hanging Fruit
Discovered By: Unit Test
Fix Version/s: 4.0-beta
   3.11.x
   3.0.x
Reviewers: Dinesh Joshi, Marcus Eriksson
 Severity: Normal
   Status: Open  (was: Triage Needed)

> Flaky unit test: 
> BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy
> -
>
> Key: CASSANDRA-15879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15879
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>
> CASSANDRA-14238 addressed the failure in 
> {{BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy}},
>  but only on 2.2. While working on CASSANDRA-14888, we’ve reproduced [the 
> failure|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
>  on trunk.
> It looks like this should be a clean merge forward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137936#comment-17137936
 ] 

Caleb Rackliffe commented on CASSANDRA-14888:
-

...and {{hintedhandoff_test.TestHintedHandoffConfig}} appears to be covered by 
CASSANDRA-15865.

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137933#comment-17137933
 ] 

Caleb Rackliffe edited comment on CASSANDRA-14888 at 6/16/20, 9:23 PM:
---

The one failure in the unit tests appears to be CASSANDRA-14238 resurrected, 
which we've already noted.

In the dtests, the {{TestPushedNotifications}} failures have already been noted 
in CASSANDRA-15877.

There also appears to be some good evidence that 
{{test_simple_repair_order_preserving - repair_tests.repair_test.TestRepair}} 
is 
[flaky|https://issues.apache.org/jira/browse/CASSANDRA-15170?focusedCommentId=16909454&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16909454].


was (Author: maedhroz):
The one failure in the unit tests appears to be CASSANDRA-14238 resurrected, 
which we've already noted.

In the dtests, the {{pushed_notifications_test.TestPushedNotifications}} failures 
have already been noted in CASSANDRA-15877.

There also appears to be some good evidence that 
{{test_simple_repair_order_preserving - repair_tests.repair_test.TestRepair}} 
is 
[flaky|https://issues.apache.org/jira/browse/CASSANDRA-15170?focusedCommentId=16909454&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16909454].

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137933#comment-17137933
 ] 

Caleb Rackliffe commented on CASSANDRA-14888:
-

The one failure in the unit tests appears to be CASSANDRA-14238 resurrected, 
which we've already noted.

In the dtests, the {{pushed_notifications_test.TestPushedNotifications}} failures 
have already been noted in CASSANDRA-15877.

There also appears to be some good evidence that 
{{test_simple_repair_order_preserving - repair_tests.repair_test.TestRepair}} 
is 
[flaky|https://issues.apache.org/jira/browse/CASSANDRA-15170?focusedCommentId=16909454&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16909454].

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15874) Bootstrap completes Successfully without streaming all the data

2020-06-16 Thread Jai Bheemsen Rao Dhanwada (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136793#comment-17136793
 ] 

Jai Bheemsen Rao Dhanwada edited comment on CASSANDRA-15874 at 6/16/20, 9:03 PM:
-

Thanks [~brandon.williams], can you please describe the symptoms of this race 
condition? In my case I see that only some portion of the data is not 
bootstrapped, but the rest of the data bootstrapped without any issues. 


was (Author: jaid):
Thanks [~brandon.williams], can you please describe the symptoms of this race 
condition? In my case I see that only some portion of the data is bootstrapped, 
but the rest of the data bootstrapped without any issues. 

> Bootstrap completes Successfully without streaming all the data
> ---
>
> Key: CASSANDRA-15874
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15874
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: Jai Bheemsen Rao Dhanwada
>Priority: Normal
>
> I am seeing a strange issue where adding a new node with auto_bootstrap: true 
> does not stream all the data before it joins the cluster. I don't see any 
> information in the logs about bootstrap failures.
> Here is the sequence of logs
>  
> {code:java}
> INFO [main] 2020-06-12 01:41:49,642 StorageService.java:1446 - JOINING: 
> schema complete, ready to bootstrap
> INFO [main] 2020-06-12 01:41:49,642 StorageService.java:1446 - JOINING: 
> waiting for pending range calculation
> INFO [main] 2020-06-12 01:41:49,643 StorageService.java:1446 - JOINING: 
> calculation complete, ready to bootstrap
> INFO [main] 2020-06-12 01:41:49,643 StorageService.java:1446 - JOINING: 
> getting bootstrap token
> INFO [main] 2020-06-12 01:42:19,656 StorageService.java:1446 - JOINING: 
> Starting to bootstrap...
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
> cfId . If a table was just created, this is likely due to the schema 
> not being fully propagated. Please wait for schema agreement on table 
> creation.
> INFO [StreamReceiveTask:1] 2020-06-12 02:29:51,892 
> StreamResultFuture.java:219 - [Stream #f4224f444-a55d-154a-23e3-867899486f5f] 
> All sessions completed INFO [StreamReceiveTask:1] 2020-06-12 02:29:51,892 
> StorageService.java:1505 - Bootstrap completed! for the tokens
> {code}
> Cassandra Version: 3.11.3
> I am not able to reproduce this issue all the time, but it has happened a 
> couple of times. Is there any race condition or corner case that could cause 
> this issue?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Dinesh Joshi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Joshi updated CASSANDRA-14888:
-
Status: Changes Suggested  (was: Review In Progress)

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Dinesh Joshi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137924#comment-17137924
 ] 

Dinesh Joshi commented on CASSANDRA-14888:
--

I kicked off a [test 
run|https://circleci.com/workflow-run/de5f7cdb-06b6-4869-9d19-81a145e79f3f]. 
However, I see a couple failures. Could you both please ensure that they are 
indeed unrelated / flaky? 

> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15677) Topology events are not sent to clients if the nodes use the same network interface

2020-06-16 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15677:
-
Status: Open  (was: Resolved)

This broke two python dtests: test_restart_node_localhost - 
pushed_notifications_test.TestPushedNotifications,
test_move_single_node_localhost - 
pushed_notifications_test.TestPushedNotifications.

This is because those tests expect no notifications, whereas now some exist: 
https://app.circleci.com/pipelines/github/driftx/cassandra/29/workflows/e2a641fa-8100-49e6-8c5a-da46d3fcee5f/jobs/241

These both assert this way due to CASSANDRA-10052.  Are we sure this new 
behavior is correct?

> Topology events are not sent to clients if the nodes use the same network 
> interface
> ---
>
> Key: CASSANDRA-15677
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15677
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Alan Boudreault
>Assignee: Bryn Cooke
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 4.0-alpha5
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> *This bug only happens when the cassandra nodes are configured to use a 
> single network interface (ip) but different ports.  See CASSANDRA-7544.*
> Issue: The topology events aren't sent to clients. The problem is that the 
> port is not taken into account when determining if we send it or not:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/Server.java#L624
> To reproduce:
> {code}
> # I think the cassandra-test branch is required to get the -S option 
> (USE_SINGLE_INTERFACE)
> ccm create -n4 local40 -v 4.0-alpha2 -S
> {code}
>  
> Then run this small python driver script:
> {code}
> import time
> from cassandra.cluster import Cluster
> cluster = Cluster()
> session = cluster.connect()
> while True:
>     print(cluster.metadata.all_hosts())
>     print([h.is_up for h in cluster.metadata.all_hosts()])
>     time.sleep(5)
> {code}
> Then decommission a node:
> {code}
> ccm node2 nodetool disablebinary
> ccm node2 nodetool decommission
> {code}
>  
> You should see that the node is never removed from the client cluster 
> metadata and the reconnector started.
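
For illustration only (not part of the original report), here is a small 
extension of the script quoted above that makes the expected outcome explicit by 
flagging when the decommissioned host never disappears from the client metadata; 
the decommissioned address and the 120-second wait window are assumptions:

{code:python}
import time

from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()

decommissioned = '127.0.0.2'   # node2 in the ccm cluster above (assumed address)
deadline = time.time() + 120   # arbitrary wait window

while time.time() < deadline:
    addresses = [h.address for h in cluster.metadata.all_hosts()]
    if decommissioned not in addresses:
        print("ok: %s was removed from the client metadata" % decommissioned)
        break
    time.sleep(5)
else:
    # Loop exhausted without the host disappearing: the symptom described above.
    print("bug: %s is still present after decommission" % decommissioned)
{code}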



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15794) Upgraded C* (4.x) fail to start because of Compact Tables & dropping compact tables in downgraded C* (3.11.4) introduces non-existent columns

2020-06-16 Thread Zhuqi Jin (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134575#comment-17134575
 ] 

Zhuqi Jin edited comment on CASSANDRA-15794 at 6/16/20, 7:57 PM:
-

Hi, [~ifesdjeen].

I created patches for 3.0 and 3.11, according to the third method we discussed 
earlier.

[^CASSANDRA-15794-branch-3.0.patch]

I've attached the patches. Would you mind reviewing them? 

And I'd like to move on to the second method. Could you please do me a favor? 
We don't want to generate new commit logs before we hit the error in 4.x, so I 
need to know when and where the commit logs were written.


was (Author: zhuqi1108):
Hi, [~ifesdjeen].

I created patches for 3.0 and 3.11, according to the third method we discussed 
earlier.

[^CASSANDRA-15794-branch-3.0.patch]

I've attached the patches. Would you mind reviewing them? 

> Upgraded C* (4.x) fail to start because of Compact Tables & dropping compact 
> tables in downgraded C* (3.11.4) introduces non-existent columns
> -
>
> Key: CASSANDRA-15794
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15794
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Zhuqi Jin
>Priority: Normal
> Attachments: CASSANDRA-15794-branch-3.0.patch, 
> CASSANDRA-15794-branch-3.11.patch
>
>
> We tried to test upgrading a 3.11.4 C* cluster to 4.x and ran into the 
> following problems. 
>  * We started a single 3.11.4 C* node. 
>  * We ran cassandra-stress like this
> {code:java}
> ./cassandra-stress write n = 30 -rate threads = 10 -node  172.17.0.2 {code}
>  * We stopped this node, and started a C* node running C* compiled from trunk 
> (git commit: e394dc0bb32f612a476269010930c617dd1ed3cb)
>  * New C* failed to start with the following error message
> {code:java}
> ERROR [main] 2020-05-07 00:58:18,503 CassandraDaemon.java:245 - Error while 
> loading schema: ERROR [main] 2020-05-07 00:58:18,503 CassandraDaemon.java:245 
> - Error while loading schema: java.lang.IllegalArgumentException: Compact 
> Tables are not allowed in Cassandra starting with 4.0 version. Use `ALTER ... 
> DROP COMPACT STORAGE` command supplied in 3.x/3.11 Cassandra in order to 
> migrate off Compact Storage. at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:965)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:924)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:883)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:874)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:862)
>  at org.apache.cassandra.schema.Schema.loadFromDisk(Schema.java:102) at 
> org.apache.cassandra.schema.Schema.loadFromDisk(Schema.java:91) at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:241) 
> at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:653)
>  at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:770)Exception
>  (java.lang.IllegalArgumentException) encountered during startup: Compact 
> Tables are not allowed in Cassandra starting with 4.0 version. Use `ALTER ... 
> DROP COMPACT STORAGE` command supplied in 3.x/3.11 Cassandra in order to 
> migrate off Compact Storage.ERROR [main] 2020-05-07 00:58:18,520 
> CassandraDaemon.java:792 - Exception encountered during 
> startupjava.lang.IllegalArgumentException: Compact Tables are not allowed in 
> Cassandra starting with 4.0 version. Use `ALTER ... DROP COMPACT STORAGE` 
> command supplied in 3.x/3.11 Cassandra in order to migrate off Compact 
> Storage. at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:965)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:924)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:883)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:874)
>  at 
> org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:862)
>  at org.apache.cassandra.schema.Schema.loadFromDisk(Schema.java:102) at 
> org.apache.cassandra.schema.Schema.loadFromDisk(Schema.java:91) at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:241) 
> at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:653)
>  at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:770){code}
>  * We stopped the trunk version C* and started the 3.11.4 version C*. 
>  * 3.11.4 C* failed to start with the following error messages: 
> {code:

[jira] [Comment Edited] (CASSANDRA-15877) Followup on CASSANDRA-15600

2020-06-16 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137860#comment-17137860
 ] 

Ekaterina Dimitrova edited comment on CASSANDRA-15877 at 6/16/20, 7:11 PM:
---

Thank you for the review [~kornelpal]!

 After the change from random tokens to splits, 
TokenAllocatorDiagnostics.randomTokensGenerated does not seem to be used 
anymore. Could you please consider removing it, if not needed.

I think I should actually create a new method, probably for publishing an 
event. I will check what is needed a bit later or on Thursday (I'm off 
tomorrow). Good catch!

 I've noticed that you added a new NoReplicationTokenAllocatorTest.failed 
field with assertions, but it does not seem to be set to true anywhere. Could 
you please check whether it is needed.

I think this assertion is actually not needed anymore

 


was (Author: e.dimitrova):
Thank you for the review [~kornelpal]!

?? After the change from random tokens to splits, 
TokenAllocatorDiagnostics.randomTokensGenerated does not seem to be used 
anymore. Could you please consider removing it, if not needed.??

I think I should actually create a new method, probably for publishing an 
event. I will check what is needed a bit later or on Thursday (I'm off 
tomorrow). Good catch!

?? I've noticed that you added a new NoReplicationTokenAllocatorTest.failed 
field with assertions, but it does not seem to be set to true anywhere. Could 
you please check whether it is needed.??

I think this assertion is actually not needed anymore

 

> Followup on CASSANDRA-15600
> ---
>
> Key: CASSANDRA-15877
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15877
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Virtual Nodes
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0, 4.0-alpha
>
> Attachments: Screen Shot 2020-06-12 at 3.21.18 PM.png
>
>
> As part of CASSANDRA-15600, the generateSplits method replaced 
> generateRandomTokens for NoReplicationAwareTokenAllocator. generateSplits 
> should also be used in ReplicationAwareTokenAllocator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15877) Followup on CASSANDRA-15600

2020-06-16 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137860#comment-17137860
 ] 

Ekaterina Dimitrova edited comment on CASSANDRA-15877 at 6/16/20, 7:11 PM:
---

Thank you for the review [~kornelpal]!

_After the change from random tokens to splits, 
TokenAllocatorDiagnostics.randomTokensGenerated does not seem to be used 
anymore. Could you please consider removing it, if not needed._

I think I should actually create a new method, probably for publishing an 
event. I will check what is needed a bit later or on Thursday (I'm off 
tomorrow). Good catch!

_I've noticed that you added a new NoReplicationTokenAllocatorTest.failed field 
with assertions, but it does not seem to be set to true anywhere. Could you 
please check whether it is needed._

I think this assertion is actually not needed anymore

 


was (Author: e.dimitrova):
Thank you for the review [~kornelpal]!

 After the change from random tokens to splits, 
TokenAllocatorDiagnostics.randomTokensGenerated does not seem to be used 
anymore. Could you please consider removing it, if not needed.

I think I should actually create a new method, probably for publishing an 
event. I will check what is needed a bit later or on Thursday (I'm off 
tomorrow). Good catch!

 I've noticed that you added a new NoReplicationTokenAllocatorTest.failed 
field with assertions, but it does not seem to be set to true anywhere. Could 
you please check whether it is needed.

I think this assertion is actually not needed anymore

 

> Followup on CASSANDRA-15600
> ---
>
> Key: CASSANDRA-15877
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15877
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Virtual Nodes
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0, 4.0-alpha
>
> Attachments: Screen Shot 2020-06-12 at 3.21.18 PM.png
>
>
> As part of CASSANDRA-15600, the generateSplits method replaced 
> generateRandomTokens for NoReplicationAwareTokenAllocator. generateSplits 
> should also be used in ReplicationAwareTokenAllocator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15877) Followup on CASSANDRA-15600

2020-06-16 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137860#comment-17137860
 ] 

Ekaterina Dimitrova commented on CASSANDRA-15877:
-

Thank you for the review [~kornelpal]!

?? After the change from random tokens to splits, 
TokenAllocatorDiagnostics.randomTokensGenerated does not seem to be used 
anymore. Could you please consider removing it, if not needed.??

I think I should actually create a new method, probably for publishing an 
event. I will check what is needed a bit later or on Thursday (I'm off 
tomorrow). Good catch!

?? I've noticed that you added a new NoReplicationTokenAllocatorTest.failed 
field with assertions, but it does not seem to be set to true anywhere. Could 
you please check whether it is needed.??

I think this assertion is actually not needed anymore

 

> Followup on CASSANDRA-15600
> ---
>
> Key: CASSANDRA-15877
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15877
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Virtual Nodes
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0, 4.0-alpha
>
> Attachments: Screen Shot 2020-06-12 at 3.21.18 PM.png
>
>
> As part of CASSANDRA-15600, the generateSplits method replaced 
> generateRandomTokens for NoReplicationAwareTokenAllocator. generateSplits 
> should also be used in ReplicationAwareTokenAllocator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14238) Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest

2020-06-16 Thread Marcus Eriksson (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137841#comment-17137841
 ] 

Marcus Eriksson commented on CASSANDRA-14238:
-

[~maedhroz]/[~djoshi] could you open a new jira?

> Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest
> --
>
> Key: CASSANDRA-14238
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14238
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Testing
>Reporter: Jay Zhuang
>Assignee: Marcus Eriksson
>Priority: Low
>  Labels: testing
> Fix For: 2.2.13
>
>
> The unittest is flaky
> {noformat}
> [junit] Testcase: 
> testBlacklistingWithSizeTieredCompactionStrategy(org.apache.cassandra.db.compaction.BlacklistingCompactionsTest):
>  FAILED
> [junit] expected:<8> but was:<25>
> [junit] junit.framework.AssertionFailedError: expected:<8> but was:<25>
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklisting(BlacklistingCompactionsTest.java:170)
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy(BlacklistingCompactionsTest.java:71)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14238) Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest

2020-06-16 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137825#comment-17137825
 ] 

Caleb Rackliffe commented on CASSANDRA-14238:
-

[~marcuse] We've been able [to reproduce 
this|https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f/jobs/2516/tests]
 in a branch based on the current trunk while working on CASSANDRA-14888.

CC [~djoshi]

> Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest
> --
>
> Key: CASSANDRA-14238
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14238
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Testing
>Reporter: Jay Zhuang
>Assignee: Marcus Eriksson
>Priority: Low
>  Labels: testing
> Fix For: 2.2.13
>
>
> The unittest is flaky
> {noformat}
> [junit] Testcase: 
> testBlacklistingWithSizeTieredCompactionStrategy(org.apache.cassandra.db.compaction.BlacklistingCompactionsTest):
>  FAILED
> [junit] expected:<8> but was:<25>
> [junit] junit.framework.AssertionFailedError: expected:<8> but was:<25>
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklisting(BlacklistingCompactionsTest.java:170)
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy(BlacklistingCompactionsTest.java:71)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14238) Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest

2020-06-16 Thread Dinesh Joshi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137822#comment-17137822
 ] 

Dinesh Joshi commented on CASSANDRA-14238:
--

trunk has a similar failure: 
https://circleci.com/gh/dineshjoshi/cassandra/2516#tests/containers/36 
[~marcuse] can you please confirm?

> Flaky Unittest: org.apache.cassandra.db.compaction.BlacklistingCompactionsTest
> --
>
> Key: CASSANDRA-14238
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14238
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Testing
>Reporter: Jay Zhuang
>Assignee: Marcus Eriksson
>Priority: Low
>  Labels: testing
> Fix For: 2.2.13
>
>
> The unittest is flaky
> {noformat}
> [junit] Testcase: 
> testBlacklistingWithSizeTieredCompactionStrategy(org.apache.cassandra.db.compaction.BlacklistingCompactionsTest):
>  FAILED
> [junit] expected:<8> but was:<25>
> [junit] junit.framework.AssertionFailedError: expected:<8> but was:<25>
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklisting(BlacklistingCompactionsTest.java:170)
> [junit] at 
> org.apache.cassandra.db.compaction.BlacklistingCompactionsTest.testBlacklistingWithSizeTieredCompactionStrategy(BlacklistingCompactionsTest.java:71)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15871) Cassandra driver first connection get the user's own schema information

2020-06-16 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137803#comment-17137803
 ] 

Brandon Williams commented on CASSANDRA-15871:
--

bq. different users can get all the schemas information when first made a 
connection

Use permissions to prevent them from being able to do that.
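
As a purely illustrative sketch of that suggestion (the keyspace, role, and 
credentials below are hypothetical, and it assumes PasswordAuthenticator and 
CassandraAuthorizer are enabled on the cluster), access can be scoped per 
keyspace with CQL, for example via the Python driver:

{code:python}
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Connect as a superuser to manage roles and grants (hypothetical credentials).
cluster = Cluster(auth_provider=PlainTextAuthProvider('cassandra', 'cassandra'))
session = cluster.connect()

# Give the application role access only to its own keyspace.
session.execute("CREATE ROLE IF NOT EXISTS app_user WITH PASSWORD = 'secret' AND LOGIN = true")
session.execute("GRANT SELECT ON KEYSPACE app_ks TO app_user")
session.execute("GRANT MODIFY ON KEYSPACE app_ks TO app_user")
{code}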

> Cassandra driver first  connection get the user's own schema information
> 
>
> Key: CASSANDRA-15871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15871
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Schema, Messaging/Client
>Reporter: maxwellguo
>Priority: Normal
> Attachments: 1.jpg
>
>
> We know that when the Cassandra driver makes a connection to the coordinator 
> node for the first time, it may select all the keyspaces/tables/columns/types 
> from the server and cache the data locally. 
> Different users may have different tables and types, so it is not suitable to 
> cache all of the metadata; it is fine to cache only the user's own schema 
> information rather than everything.
> Doing this is safe and saves resources on the first connection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15874) Bootstrap completes Successfully without streaming all the data

2020-06-16 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137801#comment-17137801
 ] 

Brandon Williams commented on CASSANDRA-15874:
--

That is the symptom.

> Bootstrap completes Successfully without streaming all the data
> ---
>
> Key: CASSANDRA-15874
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15874
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: Jai Bheemsen Rao Dhanwada
>Priority: Normal
>
> I am seeing a strange issue where adding a new node with auto_bootstrap: true 
> does not stream all the data before it joins the cluster. I don't see any 
> information in the logs about bootstrap failures.
> Here is the sequence of logs
>  
> {code:java}
> INFO [main] 2020-06-12 01:41:49,642 StorageService.java:1446 - JOINING: 
> schema complete, ready to bootstrap
> INFO [main] 2020-06-12 01:41:49,642 StorageService.java:1446 - JOINING: 
> waiting for pending range calculation
> INFO [main] 2020-06-12 01:41:49,643 StorageService.java:1446 - JOINING: 
> calculation complete, ready to bootstrap
> INFO [main] 2020-06-12 01:41:49,643 StorageService.java:1446 - JOINING: 
> getting bootstrap token
> INFO [main] 2020-06-12 01:42:19,656 StorageService.java:1446 - JOINING: 
> Starting to bootstrap...
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
> cfId . If a table was just created, this is likely due to the schema 
> not being fully propagated. Please wait for schema agreement on table 
> creation.
> INFO [StreamReceiveTask:1] 2020-06-12 02:29:51,892 
> StreamResultFuture.java:219 - [Stream #f4224f444-a55d-154a-23e3-867899486f5f] 
> All sessions completed INFO [StreamReceiveTask:1] 2020-06-12 02:29:51,892 
> StorageService.java:1505 - Bootstrap completed! for the tokens
> {code}
> Cassandra Version: 3.11.3
> I am not able to reproduce this issue all the time, but it has happened a 
> couple of times. Is there any race condition or corner case that could cause 
> this issue?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-14888:

Test and Documentation Plan: 
Review and tests
CircleCI: 
https://app.circleci.com/pipelines/github/dineshjoshi/cassandra/47/workflows/de5f7cdb-06b6-4869-9d19-81a145e79f3f

  was:
Review and tests
CircleCI: 
https://app.circleci.com/pipelines/github/maedhroz/cassandra?branch=14888-maedhroz


> Several mbeans are not unregistered when dropping a keyspace and table
> --
>
> Key: CASSANDRA-14888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14888
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Ariel Weisberg
>Assignee: Alex Deparvu
>Priority: Urgent
>  Labels: patch-available
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: CASSANDRA-14888.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> CasCommit, CasPrepare, CasPropose, ReadRepairRequests, 
> ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, 
> PartitionsValidated, RepairPrepareTime, RepairSyncTime, 
> RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, 
> WriteFailedIdealCL
> Basically for 3 years people haven't known what they are doing because the 
> entire thing is kind of obscure. Fix it and also add a dtest that detects if 
> any mbeans are left behind after dropping a table and keyspace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15863) Bootstrap resume and TestReplaceAddress fixes

2020-06-16 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15863:
-
  Since Version: 3.0.0
Source Control Link: 
https://github.com/apache/cassandra/commit/eacdfc4978547b8e7be06c9ba9611c29963e6cc2
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

Thanks, committed. I will note, though, that your 3.11 branch and beyond would 
not compile because your call to finishJoiningRing did not include the boolean 
(which 3.0 does not have). I added that on commit.

> Bootstrap resume and TestReplaceAddress fixes
> -
>
> Key: CASSANDRA-15863
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15863
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission, Test/dtest
>Reporter: Berenguer Blasi
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-alpha
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This has been 
> [broken|https://ci-cassandra.apache.org/job/Cassandra-trunk/159/testReport/dtest-large.replace_address_test/TestReplaceAddress/test_restart_failed_replace/history/]
>  for ages



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15863) Bootstrap resume and TestReplaceAddress fixes

2020-06-16 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15863:
-
Status: Ready to Commit  (was: Review In Progress)

> Bootstrap resume and TestReplaceAddress fixes
> -
>
> Key: CASSANDRA-15863
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15863
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission, Test/dtest
>Reporter: Berenguer Blasi
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-alpha
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This has been 
> [broken|https://ci-cassandra.apache.org/job/Cassandra-trunk/159/testReport/dtest-large.replace_address_test/TestReplaceAddress/test_restart_failed_replace/history/]
>  for ages



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra-dtest] branch master updated: Fix flaky replace address tests and bootstrap resume fixes

2020-06-16 Thread brandonwilliams
This is an automated email from the ASF dual-hosted git repository.

brandonwilliams pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git


The following commit(s) were added to refs/heads/master by this push:
 new f6b79ab  Fix flaky replace address tests and bootstrap resume fixes
f6b79ab is described below

commit f6b79abec6f059add754d5cceaab1009089c962b
Author: Bereng 
AuthorDate: Wed Jun 10 14:59:34 2020 +0200

Fix flaky replace address tests and bootstrap resume fixes

Patch by Berenguer Blasi, reviewed by brandonwilliams for CASSANDRA-15863
---
 replace_address_test.py | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/replace_address_test.py b/replace_address_test.py
index bc122c7..0a34ac8 100644
--- a/replace_address_test.py
+++ b/replace_address_test.py
@@ -182,10 +182,13 @@ class BaseReplaceAddressTest(Tester):
 # a little hacky but grep_log returns the whole line...
 num_tokens = 
int(self.replacement_node.get_conf_option('num_tokens'))
 
-logger.debug("Verifying {} tokens migrated 
sucessfully".format(num_tokens))
-logs = self.replacement_node.grep_log(r"Token (.*?) changing ownership 
from /{} to /{}"
-  
.format(self.replaced_node.address(),
-  
self.replacement_node.address()))
+logger.debug("Verifying {} tokens migrated 
successfully".format(num_tokens))
+replmnt_address = ("/" + self.replacement_node.address()) if 
self.cluster.version() < '4.0' else self.replacement_node.address_and_port()
+repled_address = ("/" + self.replaced_node.address()) if 
self.cluster.version() < '4.0' else self.replaced_node.address_and_port()
+token_ownership_log = r"Token (.*?) changing ownership from {} to 
{}".format(repled_address,
+   
  replmnt_address)
+logs = self.replacement_node.grep_log(token_ownership_log)
+
 if (previous_log_size is not None):
 assert len(logs) == previous_log_size
 
@@ -321,7 +324,9 @@ class TestReplaceAddress(BaseReplaceAddressTest):
 self._do_replace(replace_address='127.0.0.5', 
wait_for_binary_proto=False)
 
 logger.debug("Waiting for replace to fail")
-self.replacement_node.watch_log_for("java.lang.RuntimeException: 
Cannot replace_address /127.0.0.5 because it doesn't exist in gossip")
+node_log_str = "/127.0.0.5" if self.cluster.version() < '4.0' else 
"127.0.0.5:7000"
+self.replacement_node.watch_log_for("java.lang.RuntimeException: 
Cannot replace_address "
++ node_log_str + " because it 
doesn't exist in gossip")
 assert_not_running(self.replacement_node)
 
 @since('3.6')
@@ -464,17 +469,23 @@ class TestReplaceAddress(BaseReplaceAddressTest):
 self._stop_node_to_replace()
 
 logger.debug("Submitting byteman script to make stream fail")
+btmmark = self.query_node.mark_log()
 
 if self.cluster.version() < '4.0':
 
self.query_node.byteman_submit(['./byteman/pre4.0/stream_failure.btm'])
 self._do_replace(jvm_option='replace_address_first_boot',
- opts={'streaming_socket_timeout_in_ms': 1000})
+ opts={'streaming_socket_timeout_in_ms': 1000},
+ wait_for_binary_proto=False,
+ wait_other_notice=True)
 else:
 
self.query_node.byteman_submit(['./byteman/4.0/stream_failure.btm'])
-self._do_replace(jvm_option='replace_address_first_boot')
+self._do_replace(jvm_option='replace_address_first_boot', 
wait_for_binary_proto=False, wait_other_notice=True)
 
 # Make sure bootstrap did not complete successfully
-assert_bootstrap_state(self, self.replacement_node, 'IN_PROGRESS')
+self.query_node.watch_log_for("Triggering network failure", 
from_mark=btmmark)
+self.query_node.watch_log_for("Stream failed", from_mark=btmmark)
+self.replacement_node.watch_log_for("Stream failed")
+self.replacement_node.watch_log_for("Some data streaming 
failed.*IN_PROGRESS$")
 
 if mode == 'reset_resume_state':
 mark = self.replacement_node.mark_log()
@@ -498,12 +509,14 @@ class TestReplaceAddress(BaseReplaceAddressTest):
 self.replacement_node.stop()
 
 logger.debug("Waiting other nodes to detect node stopped")
-self.query_node.watch_log_for("FatClient /{} has been silent for 
3ms, removing from gossip".format(self.replacement_node.address()), 
timeout=120)
-self.query_node.watch_log_for("Node /{} failed during 
replace.".format(self.replacement_node.address()), timeout=120, 
filename='debug.log')

[cassandra] branch trunk updated (7cdad3c -> eacdfc4)

2020-06-16 Thread brandonwilliams
This is an automated email from the ASF dual-hosted git repository.

brandonwilliams pushed a change to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git.


from 7cdad3c  Avoid overflow when bloom filter size exceeds 2GB
 new 4f50a67  Catch exception on bootstrap resume and init native transport
 new eacdfc4  Merge branch 'cassandra-3.11' into trunk

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt|  1 +
 .../apache/cassandra/service/CassandraDaemon.java  |  3 ++-
 .../apache/cassandra/service/StorageService.java   | 30 ++
 3 files changed, 23 insertions(+), 11 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra] branch cassandra-3.0 updated: Catch exception on bootstrap resume and init native transport

2020-06-16 Thread brandonwilliams
This is an automated email from the ASF dual-hosted git repository.

brandonwilliams pushed a commit to branch cassandra-3.0
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/cassandra-3.0 by this push:
 new 1843032  Catch exception on bootstrap resume and init native transport
1843032 is described below

commit 184303220b2995f411f51f675131007404372b3d
Author: Bereng 
AuthorDate: Wed Jun 10 12:34:34 2020 +0200

Catch exception on bootstrap resume and init native transport

Patch by Berenguer Blasi, reviewed by brandonwilliams for CASSANDRA-15863
---
 CHANGES.txt|  1 +
 .../apache/cassandra/service/CassandraDaemon.java  |  3 ++-
 .../apache/cassandra/service/StorageService.java   | 30 ++
 3 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index d506dc8..d1b1416 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 3.0.21
+ * Catch exception on bootstrap resume and init native transport 
(CASSANDRA-15863)
  * Fix replica-side filtering returning stale data with CL > ONE 
(CASSANDRA-8272, CASSANDRA-8273)
  * Fix duplicated row on 2.x upgrades when multi-rows range tombstones 
interact with collection ones (CASSANDRA-15805)
  * Rely on snapshotted session infos on StreamResultFuture.maybeComplete to 
avoid race conditions (CASSANDRA-15667)
diff --git a/src/java/org/apache/cassandra/service/CassandraDaemon.java 
b/src/java/org/apache/cassandra/service/CassandraDaemon.java
index 4a6e947..85a002f 100644
--- a/src/java/org/apache/cassandra/service/CassandraDaemon.java
+++ b/src/java/org/apache/cassandra/service/CassandraDaemon.java
@@ -421,7 +421,8 @@ public class CassandraDaemon
 public void initializeNativeTransport()
 {
 // Native transport
-nativeTransportService = new NativeTransportService();
+if (nativeTransportService == null)
+nativeTransportService = new NativeTransportService();
 }
 
 /*
diff --git a/src/java/org/apache/cassandra/service/StorageService.java 
b/src/java/org/apache/cassandra/service/StorageService.java
index f9efdb8..d287788 100644
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@ -1317,20 +1317,30 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 @Override
 public void onSuccess(StreamState streamState)
 {
-bootstrapFinished();
-if (isSurveyMode)
+try
 {
-logger.info("Startup complete, but write survey mode 
is active, not becoming an active ring member. Use JMX 
(StorageService->joinRing()) to finalize ring joining.");
+bootstrapFinished();
+if (isSurveyMode)
+{
+logger.info("Startup complete, but write survey 
mode is active, not becoming an active ring member. Use JMX 
(StorageService->joinRing()) to finalize ring joining.");
+}
+else
+{
+isSurveyMode = false;
+progressSupport.progress("bootstrap", 
ProgressEvent.createNotification("Joining ring..."));
+finishJoiningRing(bootstrapTokens);
+}
+progressSupport.progress("bootstrap", new 
ProgressEvent(ProgressEventType.COMPLETE, 1, 1, "Resume bootstrap complete"));
+if (!isNativeTransportRunning())
+daemon.initializeNativeTransport();
+daemon.start();
+logger.info("Resume complete");
 }
-else
+catch(Exception e)
 {
-isSurveyMode = false;
-progressSupport.progress("bootstrap", 
ProgressEvent.createNotification("Joining ring..."));
-finishJoiningRing(bootstrapTokens);
+onFailure(e);
+throw e;
 }
-progressSupport.progress("bootstrap", new 
ProgressEvent(ProgressEventType.COMPLETE, 1, 1, "Resume bootstrap complete"));
-daemon.start();
-logger.info("Resume complete");
 }
 
 @Override


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra] 01/01: Merge branch 'cassandra-3.11' into trunk

2020-06-16 Thread brandonwilliams
This is an automated email from the ASF dual-hosted git repository.

brandonwilliams pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit eacdfc4978547b8e7be06c9ba9611c29963e6cc2
Merge: 7cdad3c 4f50a67
Author: Brandon Williams 
AuthorDate: Tue Jun 16 12:37:15 2020 -0500

Merge branch 'cassandra-3.11' into trunk

 CHANGES.txt|  1 +
 .../apache/cassandra/service/CassandraDaemon.java  |  3 ++-
 .../apache/cassandra/service/StorageService.java   | 30 ++
 3 files changed, 23 insertions(+), 11 deletions(-)

diff --cc CHANGES.txt
index 5f09cdc,0f730d4..e6ecb42
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@@ -1,49 -1,10 +1,50 @@@
 -3.11.7
 +4.0-alpha5
 + * Fix missing topology events when running multiple nodes on the same 
network interface (CASSANDRA-15677)
 + * Create config.yml.MIDRES (CASSANDRA-15712)
 + * Fix handling of fully purged static rows in repaired data tracking 
(CASSANDRA-15848)
 + * Prevent validation request submission from blocking ANTI_ENTROPY stage 
(CASSANDRA-15812)
 + * Add fqltool and auditlogviewer to rpm and deb packages (CASSANDRA-14712)
 + * Include DROPPED_COLUMNS in schema digest computation (CASSANDRA-15843)
 + * Fix Cassandra restart from rpm install (CASSANDRA-15830)
 + * Improve handling of 2i initialization failures (CASSANDRA-13606)
 + * Add completion_ratio column to sstable_tasks virtual table (CASANDRA-15759)
 + * Add support for adding custom Verbs (CASSANDRA-15725)
 + * Speed up entire-file-streaming file containment check and allow 
entire-file-streaming for all compaction strategies 
(CASSANDRA-15657,CASSANDRA-15783)
 + * Provide ability to configure IAuditLogger (CASSANDRA-15748)
 + * Fix nodetool enablefullquerylog blocking param parsing (CASSANDRA-15819)
 + * Add isTransient to SSTableMetadataView (CASSANDRA-15806)
 + * Fix tools/bin/fqltool for all shells (CASSANDRA-15820)
 + * Fix clearing of legacy size_estimates (CASSANDRA-15776)
 + * Update port when reconnecting to pre-4.0 SSL storage (CASSANDRA-15727)
 + * Only calculate dynamicBadnessThreshold once per loop in 
DynamicEndpointSnitch (CASSANDRA-15798)
 + * Cleanup redundant nodetool commands added in 4.0 (CASSANDRA-15256)
 + * Update to Python driver 3.23 for cqlsh (CASSANDRA-15793)
 + * Add tunable initial size and growth factor to RangeTombstoneList 
(CASSANDRA-15763)
 + * Improve debug logging in SSTableReader for index summary (CASSANDRA-15755)
 + * bin/sstableverify should support user provided token ranges 
(CASSANDRA-15753)
 + * Improve logging when mutation passed to commit log is too large 
(CASSANDRA-14781)
 + * replace LZ4FastDecompressor with LZ4SafeDecompressor (CASSANDRA-15560)
 + * Fix buffer pool NPE with concurrent release due to in-progress tiny pool 
eviction (CASSANDRA-15726)
 + * Avoid race condition when completing stream sessions (CASSANDRA-15666)
 + * Flush with fast compressors by default (CASSANDRA-15379)
 + * Fix CqlInputFormat regression from the switch to system.size_estimates 
(CASSANDRA-15637)
 + * Allow sending Entire SSTables over SSL (CASSANDRA-15740)
 + * Fix CQLSH UTF-8 encoding issue for Python 2/3 compatibility 
(CASSANDRA-15739)
 + * Fix batch statement preparation when multiple tables and parameters are 
used (CASSANDRA-15730)
 + * Fix regression with traceOutgoingMessage printing message size 
(CASSANDRA-15687)
 + * Ensure repaired data tracking reads a consistent amount of data across 
replicas (CASSANDRA-15601)
 + * Fix CQLSH to avoid arguments being evaluated (CASSANDRA-15660)
 + * Correct Visibility and Improve Safety of Methods in LatencyMetrics 
(CASSANDRA-15597)
 + * Allow cqlsh to run with Python2.7/Python3.6+ 
(CASSANDRA-15659,CASSANDRA-15573)
 + * Improve logging around incremental repair (CASSANDRA-15599)
 + * Do not check cdc_raw_directory filesystem space if CDC disabled 
(CASSANDRA-15688)
 + * Replace array iterators with get by index (CASSANDRA-15394)
 + * Minimize BTree iterator allocations (CASSANDRA-15389)
 +Merged from 3.11:
   * Fix CQL formatting of read command restrictions for slow query log 
(CASSANDRA-15503)
 - * Allow sstableloader to use SSL on the native port (CASSANDRA-14904)
  Merged from 3.0:
+  * Catch exception on bootstrap resume and init native transport 
(CASSANDRA-15863)
   * Fix replica-side filtering returning stale data with CL > ONE 
(CASSANDRA-8272, CASSANDRA-8273)
 - * Fix duplicated row on 2.x upgrades when multi-rows range tombstones 
interact with collection ones (CASSANDRA-15805)
   * Rely on snapshotted session infos on StreamResultFuture.maybeComplete to 
avoid race conditions (CASSANDRA-15667)
   * EmptyType doesn't override writeValue so could attempt to write bytes when 
expected not to (CASSANDRA-15790)
   * Fix index queries on partition key columns when some partitions contains 
only static data (CASSANDRA-13666)
diff --cc src/java/org/apache/cassandra/service/CassandraDaemon.java
index 2dbe217,1f93262..c7591d5

[cassandra] branch cassandra-3.11 updated: Catch exception on bootstrap resume and init native transport

2020-06-16 Thread brandonwilliams
This is an automated email from the ASF dual-hosted git repository.

brandonwilliams pushed a commit to branch cassandra-3.11
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/cassandra-3.11 by this push:
 new 4f50a67  Catch exception on bootstrap resume and init native transport
4f50a67 is described below

commit 4f50a6712ada5c4298ec860836015ea15049cbda
Author: Bereng 
AuthorDate: Wed Jun 10 12:34:34 2020 +0200

Catch exception on bootstrap resume and init native transport

Patch by Berenguer Blasi, reviewed by brandonwilliams for CASSANDRA-15863
---
 CHANGES.txt|  1 +
 .../apache/cassandra/service/CassandraDaemon.java  |  3 +-
 .../apache/cassandra/service/StorageService.java   | 32 ++
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 7f54146..0f730d4 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -2,6 +2,7 @@
  * Fix CQL formatting of read command restrictions for slow query log 
(CASSANDRA-15503)
  * Allow sstableloader to use SSL on the native port (CASSANDRA-14904)
 Merged from 3.0:
+ * Catch exception on bootstrap resume and init native transport 
(CASSANDRA-15863)
  * Fix replica-side filtering returning stale data with CL > ONE 
(CASSANDRA-8272, CASSANDRA-8273)
  * Fix duplicated row on 2.x upgrades when multi-rows range tombstones 
interact with collection ones (CASSANDRA-15805)
  * Rely on snapshotted session infos on StreamResultFuture.maybeComplete to 
avoid race conditions (CASSANDRA-15667)
diff --git a/src/java/org/apache/cassandra/service/CassandraDaemon.java b/src/java/org/apache/cassandra/service/CassandraDaemon.java
index b80580a..1f93262 100644
--- a/src/java/org/apache/cassandra/service/CassandraDaemon.java
+++ b/src/java/org/apache/cassandra/service/CassandraDaemon.java
@@ -450,7 +450,8 @@ public class CassandraDaemon
     public void initializeNativeTransport()
     {
         // Native transport
-        nativeTransportService = new NativeTransportService();
+        if (nativeTransportService == null)
+            nativeTransportService = new NativeTransportService();
     }
 
     /*
diff --git a/src/java/org/apache/cassandra/service/StorageService.java b/src/java/org/apache/cassandra/service/StorageService.java
index 3d31596..d3c30a0 100644
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@ -1592,22 +1592,30 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
             @Override
             public void onSuccess(StreamState streamState)
             {
-                bootstrapFinished();
-                // start participating in the ring.
-                // pretend we are in survey mode so we can use joinRing() here
-                if (isSurveyMode)
+                try
                 {
-                    logger.info("Startup complete, but write survey mode is active, not becoming an active ring member. Use JMX (StorageService->joinRing()) to finalize ring joining.");
+                    bootstrapFinished();
+                    if (isSurveyMode)
+                    {
+                        logger.info("Startup complete, but write survey mode is active, not becoming an active ring member. Use JMX (StorageService->joinRing()) to finalize ring joining.");
+                    }
+                    else
+                    {
+                        isSurveyMode = false;
+                        progressSupport.progress("bootstrap", ProgressEvent.createNotification("Joining ring..."));
+                        finishJoiningRing(true, bootstrapTokens);
+                    }
+                    progressSupport.progress("bootstrap", new ProgressEvent(ProgressEventType.COMPLETE, 1, 1, "Resume bootstrap complete"));
+                    if (!isNativeTransportRunning())
+                        daemon.initializeNativeTransport();
+                    daemon.start();
+                    logger.info("Resume complete");
                 }
-                else
+                catch(Exception e)
                 {
-                    isSurveyMode = false;
-                    progressSupport.progress("bootstrap", ProgressEvent.createNotification("Joining ring..."));
-                    finishJoiningRing(true, bootstrapTokens);
+                    onFailure(e);
+                    throw e;
                 }
-                progressSupport.progress("bootstrap", new ProgressEvent(ProgressEventType.COMPLETE, 1, 1, "Resume bootstrap complete"));
-                daemon.start();
-                logger.info("Resume complete");
             }
 
             @Override



[jira] [Commented] (CASSANDRA-15833) Unresolvable false digest mismatch during upgrade due to CASSANDRA-10657

2020-06-16 Thread Jacek Lewandowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17137775#comment-17137775
 ] 

Jacek Lewandowski commented on CASSANDRA-15833:
---

Regarding your concern about using Gossiper: I was wondering whether we could 
just create a ColumnFilter factory and have different implementations, wdyt?
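
For illustration only, here is a minimal, self-contained sketch of what such a 
factory could look like (all names below are invented for the sketch and are not 
the real ColumnFilter API); the idea is to pick the legacy or current behaviour 
once, based on whether every node already understands the CASSANDRA-10657 
semantics, rather than consulting Gossiper on every read:
{code:java}
import java.util.Set;

// Toy types only; not org.apache.cassandra.db.filter.ColumnFilter.
public class ColumnFilterFactorySketch
{
    // Stand-in for a column filter: either "fetch everything" or "fetch only the queried set".
    record ToyFilter(boolean fetchAllColumns, Set<String> columns) {}

    interface Factory
    {
        ToyFilter make(Set<String> queriedColumns, Set<String> allColumns);
    }

    // Pre-CASSANDRA-10657 semantics: fetch all columns so digests agree with older nodes.
    static final Factory LEGACY = (queried, all) -> new ToyFilter(true, all);

    // Post-CASSANDRA-10657 semantics: fetch only what was queried.
    static final Factory CURRENT = (queried, all) -> new ToyFilter(false, queried);

    // Chosen once (or cached), e.g. from the minimum version seen in the cluster.
    static Factory choose(boolean allNodesUnderstand10657)
    {
        return allNodesUnderstand10657 ? CURRENT : LEGACY;
    }

    public static void main(String[] args)
    {
        Factory factory = choose(false); // mixed-version cluster -> legacy behaviour
        System.out.println(factory.make(Set.of("a"), Set.of("a", "b", "c")));
    }
}
{code}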

> Unresolvable false digest mismatch during upgrade due to CASSANDRA-10657
> 
>
> Key: CASSANDRA-15833
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15833
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Jacek Lewandowski
>Assignee: Jacek Lewandowski
>Priority: Normal
> Fix For: 3.11.x, 4.0-alpha
>
> Attachments: CASSANDRA-15833-3.11.patch, CASSANDRA-15833-4.0.patch
>
>
> CASSANDRA-10657 introduced changes in how the ColumnFilter is interpreted. 
> This results in a digest mismatch when querying an incomplete set of columns 
> from a table at a consistency level that requires reaching instances running a 
> pre-CASSANDRA-10657 version from nodes that include CASSANDRA-10657 (it was 
> introduced in Cassandra 3.4). 
> The fix is to bring back the previous behaviour until there are no instances 
> running a pre-CASSANDRA-10657 version. 






[jira] [Updated] (CASSANDRA-15459) Short read protection doesn't work on group-by queries

2020-06-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andres de la Peña updated CASSANDRA-15459:
--
 Bug Category: Parent values: Correctness(12982), Level 1 values: 
Consistency(12989)
   Complexity: Normal
Discovered By: Code Inspection
Fix Version/s: 3.11.7
 Severity: Normal
   Status: Open  (was: Triage Needed)

> Short read protection doesn't work on group-by queries
> --
>
> Key: CASSANDRA-15459
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15459
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Coordination
>Reporter: ZhaoYang
>Assignee: Andres de la Peña
>Priority: Normal
>  Labels: correctness
> Fix For: 3.11.7, 4.0-beta
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [DTest to 
> reproduce|https://github.com/apache/cassandra-dtest/compare/master...jasonstack:srp_group_by_trunk?expand=1]:
>  it affects all versions..
> {code}
> In a two-node cluster with RF = 2
> Execute only on Node1:
> * Insert pk=1 and ck=1 with timestamp 9
> * Delete pk=0 and ck=0 with timestamp 10
> * Insert pk=2 and ck=2 with timestamp 9
> Execute only on Node2:
> * Delete pk=1 and ck=1 with timestamp 10
> * Insert pk=0 and ck=0 with timestamp 9
> * Delete pk=2 and ck=2 with timestamp 10
> Query: "SELECT pk, c FROM %s GROUP BY pk LIMIT 1"
> * Expect no live data, but got [0, 0]
> {code}
> Note: for group-by queries, SRP should use "group counted" to calculate 
> limits used for SRP query, rather than "row counted".
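
As a self-contained illustration of that last note (plain Java, not Cassandra 
internals): for the same replica response, the number of rows consumed and the 
number of groups consumed differ, and sizing the short-read-protection follow-up 
query from the row count can wrongly conclude that a GROUP BY limit is already 
satisfied:
{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupCountVsRowCount
{
    record Row(int pk, int ck) {}

    public static void main(String[] args)
    {
        // Rows one replica returned, in partition order (several rows per group).
        List<Row> returned = List.of(new Row(0, 0), new Row(0, 1), new Row(0, 2),
                                     new Row(1, 0), new Row(1, 1));

        int rowsCounted = returned.size();                          // 5
        Map<Integer, Integer> rowsPerGroup = new LinkedHashMap<>();
        for (Row r : returned)
            rowsPerGroup.merge(r.pk, 1, Integer::sum);
        int groupsCounted = rowsPerGroup.size();                    // 2

        int limit = 3;                                              // as if "GROUP BY pk LIMIT 3"
        System.out.println("rows counted   = " + rowsCounted);
        System.out.println("groups counted = " + groupsCounted);
        // Using rowsCounted says the limit is already met (5 >= 3) and no re-query is needed;
        // using groupsCounted correctly shows more groups must still be fetched.
        System.out.println("need more data (by rows)?   " + (rowsCounted < limit));
        System.out.println("need more data (by groups)? " + (groupsCounted < limit));
    }
}
{code}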






[jira] [Updated] (CASSANDRA-15868) Update Netty version to 4.1.50 because there are security issues in 4.1.37

2020-06-16 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15868:
-
Status: In Progress  (was: Changes Suggested)

> Update Netty version to 4.1.50 because there are security issues in 4.1.37
> --
>
> Key: CASSANDRA-15868
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15868
> Project: Cassandra
>  Issue Type: Task
>  Components: Dependencies
>Reporter: Stefan Miklosovic
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 3.11.7, 4.0-beta
>
> Attachments: dependency-check-report.html
>
>
> Please see attached HTML report from OWASP dependency check.






[jira] [Updated] (CASSANDRA-15868) Update Netty version to 4.1.50 because there are security issues in 4.1.37

2020-06-16 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15868:
-
Status: Patch Available  (was: In Progress)

> Update Netty version to 4.1.50 because there are security issues in 4.1.37
> --
>
> Key: CASSANDRA-15868
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15868
> Project: Cassandra
>  Issue Type: Task
>  Components: Dependencies
>Reporter: Stefan Miklosovic
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 3.11.7, 4.0-beta
>
> Attachments: dependency-check-report.html
>
>
> Please see attached HTML report from OWASP dependency check.






[jira] [Updated] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Description: 
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
{code:java|title=stacktrace}
Unexpected error found in node logs (see stdout for full details). Errors: 
[ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
CassandraEntireSSTableStreamReader.java:145 - [Stream 
6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
for table = keyspace1.standard1
org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
/home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
at 
org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
at 
org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
at 
org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
at 
org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
at 
org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
at 
org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Checksums do not match for 
/home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
{code}
 

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}
This isn't a problem in legacy streaming as STATS file length didn't matter.

Ideally it would be great to make the sstable STATS metadata immutable, just 
like the other sstable components, so we don't have to worry about this special 
case.

I can think of 2 ways:
 # Make the STATS mutation a proper compaction that creates hard links to the 
compacting sstable's components under a new descriptor, except the STATS file, 
which is copied entirely. The mutation is then applied to the new STATS file and 
the old sstable is released at the end. This ensures all sstable components are 
immutable and shouldn't make these special compaction tasks slower.
 # Change the STATS metadata format to use fixed-length encoding for the repair 
info (see the sketch below)
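
A minimal sketch of option 2, assuming the repair info consists of a repairedAt 
timestamp plus a pending-repair id (this is not the actual StatsMetadata 
serializer): if those fields always occupy the same number of bytes, mutating 
them in place can never change the length of the Statistics.db component that 
was recorded in the streaming manifest.
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.UUID;

public class FixedLengthRepairInfo
{
    // Sentinel written when there is no pending repair, so "absent" costs the same bytes.
    static final UUID NO_PENDING_REPAIR = new UUID(0, 0);

    static byte[] serialize(long repairedAt, UUID pendingRepair) throws IOException
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes))
        {
            out.writeLong(repairedAt);                       // 8 bytes, always
            UUID id = pendingRepair == null ? NO_PENDING_REPAIR : pendingRepair;
            out.writeLong(id.getMostSignificantBits());      // 8 bytes, always
            out.writeLong(id.getLeastSignificantBits());     // 8 bytes, always
        }
        return bytes.toByteArray();                          // always 24 bytes
    }

    public static void main(String[] args) throws IOException
    {
        System.out.println(serialize(0L, null).length);                                      // 24
        System.out.println(serialize(System.currentTimeMillis(), UUID.randomUUID()).length); // 24
    }
}
{code}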

  was:
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

{code:title=stacktrace}
Unexpected error found in node logs (see stdout for full details). Errors: 
[ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 202

[jira] [Updated] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Description: 
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

{code:title=stacktrace}
Unexpected error found in node logs (see stdout for full details). Errors: 
[ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
CassandraEntireSSTableStreamReader.java:145 - [Stream 
6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
for table = keyspace1.standard1
org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
/home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
at 
org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
at 
org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
at 
org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
at 
org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
at 
org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
at 
org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
at 
org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Checksums do not match for 
/home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
{code} 

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}
This isn't a problem in legacy streaming as STATS file length didn't matter.

Ideally it would be great to make the sstable STATS metadata immutable, just 
like the other sstable components, so we don't have to worry about this special 
case.

I can think of 2 ways:
 # Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
{{SingleSSTableLCSTask}} to create hard link on the compacting sstable 
components with a new descriptor, except STATS files which will be copied 
entirely. Then mutation will be applied on the new STATS file. At the end, old 
sstable will be released. This ensures all sstable components are immutable and 
shouldn't make these special compaction tasks slower.
 # Change STATS metadata format to use fixed length encoding for repair info

  was:
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 report

[jira] [Updated] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode

2020-06-16 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-15878:
---
Fix Version/s: 4.0-beta

> Ec2Snitch fails on upgrade in legacy mode
> -
>
> Key: CASSANDRA-15878
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15878
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the 
> Ec2Snitch to match AWS conventions.
> The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and 
> keep the same naming as before (while the "standard" mode uses the new naming 
> convention).
> When performing an upgrade in the us-west-2 region, the second node failed to 
> start with the following exception:
>  
> {code:java}
> ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled 
> snitch appears to be using the legacy naming scheme for regions, but existing 
> nodes in cluster are using the opposite: region(s) = [us-west-2], 
> availability zone(s) = [2a]. Please check the ec2_naming_scheme property in 
> the cassandra-rackdc.properties configuration file for more details.
> ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception 
> encountered during startup
> java.lang.IllegalStateException: null
>   at 
> org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573)
>   at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530)
>   at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:659)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:610)
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767)
> {code}
>  
> The exception leads back to [this piece of 
> code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185].
> After adding some logging, it turned out the DC name of the first upgraded 
> node was considered invalid as a legacy one:
> {code:java}
> INFO  [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC 
> us-west-2
> INFO  [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - 
> dcUsesLegacyFormat=false / usingLegacyNaming=true
> ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name 
> us-west-2
> {code}
>  
> The problem is that the regex that's used to identify legacy dc names will 
> match both old and new names : 
> {code:java}
> boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
> {code}
> Knowing that some dc names didn't change between the two modes (us-west-2 for 
> example), I don't see how we can use the dc names to detect if the legacy 
> mode is being used by other nodes in the cluster.
>   
>  The rack names on the other hand are totally different in the legacy and 
> standard modes and can be used to detect mismatching settings.
>   
>  My go-to fix would be to drop the check on datacenters by removing the 
> following lines: 
> [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]






[jira] [Updated] (CASSANDRA-15459) Short read protection doesn't work on group-by queries

2020-06-16 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-15459:

Fix Version/s: 4.0-beta

> Short read protection doesn't work on group-by queries
> --
>
> Key: CASSANDRA-15459
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15459
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Coordination
>Reporter: ZhaoYang
>Assignee: Andres de la Peña
>Priority: Normal
>  Labels: correctness
> Fix For: 4.0-beta
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [DTest to 
> reproduce|https://github.com/apache/cassandra-dtest/compare/master...jasonstack:srp_group_by_trunk?expand=1]:
>  it affects all versions..
> {code}
> In a two-node cluster with RF = 2
> Execute only on Node1:
> * Insert pk=1 and ck=1 with timestamp 9
> * Delete pk=0 and ck=0 with timestamp 10
> * Insert pk=2 and ck=2 with timestamp 9
> Execute only on Node2:
> * Delete pk=1 and ck=1 with timestamp 10
> * Insert pk=0 and ck=0 with timestamp 9
> * Delete pk=2 and ck=2 with timestamp 10
> Query: "SELECT pk, c FROM %s GROUP BY pk LIMIT 1"
> * Expect no live data, but got [0, 0]
> {code}
> Note: for group-by queries, SRP should use "group counted" to calculate 
> limits used for SRP query, rather than "row counted".






[jira] [Updated] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Description: 
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}

This isn't a problem in legacy streaming as STATS file length didn't matter.

Ideally it would be great to make the sstable STATS metadata immutable, just 
like the other sstable components, so we don't have to worry about this special 
case.

I can think of 3 ways:
# Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
{{SingleSSTableLCSTask}} to create hard link on the compacting sstable 
components with a new descriptor, except STATS files which will be copied 
entirely. Then mutation will be applied on the new STATS file. At the end, old 
sstable will be released. This ensures all sstable components are immutable and 
shouldn't make these special compaction tasks slower.
# Change STATS metadata format to use fixed length encoding for repair info
# Hacky approach: load the small STATS file into memory when initializing 
{{CassandraOutgoingFile}} instead of relying on mutable on-disk STATS file.
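
A rough, self-contained sketch of option 3, with invented names (this is not the 
actual CassandraOutgoingFile code): snapshot the small Statistics.db component 
into memory at the moment the manifest is built and stream from that buffer, so 
a later in-place mutation of the on-disk file cannot change the transferred 
bytes or their length.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedStatsComponent
{
    private final byte[] statsBytes; // immutable snapshot taken when the manifest is created

    public BufferedStatsComponent(Path statisticsDb) throws IOException
    {
        this.statsBytes = Files.readAllBytes(statisticsDb);
    }

    // Length to record in the component manifest; never changes after construction.
    public long manifestLength()
    {
        return statsBytes.length;
    }

    // Bytes handed to the streaming layer instead of re-reading the mutable on-disk file.
    public byte[] streamBytes()
    {
        return statsBytes.clone();
    }
}
{code}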

  was:
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}

This isn't a problem in legacy streaming as STATS file length didn't matter.

Ideally it will be great to make sstable STATS metadata immutable, just like 
other sstable components, so we don't have to worry this special case. 

I can think of two ways:
# Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
{{SingleSSTableLCSTask}} to create hard link on the compacting sstable 
components with a new descriptor, except STATS files which will be copied 
entirely. Then mutation will be applied on the new STATS file. At the end, old 
sstable will be released. This ensures all sstable components are immutable and 
shouldn't make these special compaction tasks slower.
# Hacky approach: load the small STATS file into memory when initi

[jira] [Commented] (CASSANDRA-15874) Bootstrap completes Successfully without streaming all the data

2020-06-16 Thread Jai Bheemsen Rao Dhanwada (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136793#comment-17136793
 ] 

Jai Bheemsen Rao Dhanwada commented on CASSANDRA-15874:
---

Thanks [~brandon.williams], can you please describe the symptoms of this race 
condition? In my case I see only some portion of the data is bootstrapped, while 
the rest of the data bootstrapped without any issues. 

> Bootstrap completes Successfully without streaming all the data
> ---
>
> Key: CASSANDRA-15874
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15874
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: Jai Bheemsen Rao Dhanwada
>Priority: Normal
>
> I am seeing a strange issue where adding a new node with auto_bootstrap: 
> true does not stream all the data before it joins the cluster. I don't see any 
> information in the logs about bootstrap failures.
> Here is the sequence of logs
>  
> {code:java}
> INFO [main] 2020-06-12 01:41:49,642 StorageService.java:1446 - JOINING: 
> schema complete, ready to bootstrap
> INFO [main] 2020-06-12 01:41:49,642 StorageService.java:1446 - JOINING: 
> waiting for pending range calculation
> INFO [main] 2020-06-12 01:41:49,643 StorageService.java:1446 - JOINING: 
> calculation complete, ready to bootstrap
> INFO [main] 2020-06-12 01:41:49,643 StorageService.java:1446 - JOINING: 
> getting bootstrap token
> INFO [main] 2020-06-12 01:42:19,656 StorageService.java:1446 - JOINING: 
> Starting to bootstrap...
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
> cfId . If a table was just created, this is likely due to the schema 
> not being fully propagated. Please wait for schema agreement on table 
> creation.
> INFO [StreamReceiveTask:1] 2020-06-12 02:29:51,892 
> StreamResultFuture.java:219 - [Stream #f4224f444-a55d-154a-23e3-867899486f5f] 
> All sessions completed INFO [StreamReceiveTask:1] 2020-06-12 02:29:51,892 
> StorageService.java:1505 - Bootstrap completed! for the tokens
> {code}
> Cassandra Version: 3.11.3
> I am not able to reproduce this issue all the time, but it has happened a 
> couple of times. Is there any race condition or corner case that could cause 
> this issue?
>  






[jira] [Updated] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Description: 
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}

This isn't a problem in legacy streaming as STATS file length didn't matter.

Ideally it would be great to make the sstable STATS metadata immutable, just 
like the other sstable components, so we don't have to worry about this special 
case.

I can think of two ways:
# Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
{{SingleSSTableLCSTask}} to create hard link on the compacting sstable 
components with a new descriptor, except STATS files which will be copied 
entirely. Then mutation will be applied on the new STATS file. At the end, old 
sstable will be released. This ensures all sstable components are immutable and 
shouldn't make these special compaction tasks slower.
# Hacky approach: load the small STATS file into memory when initializing 
{{CassandraOutgoingFile}} instead of relying on mutable on-disk STATS file.

  was:
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}

I believe similar race may happen with level compaction where it may directly 
mutate a sstable's level if it doesn't overlap with sstables at next level. 
(Note: this isn't a problem in legacy streaming as STATS file length didn't 
matter.) Also it impacts snapshot as well because snapshotted STATS file is 
hard linked. 

Ideally it will be great to make sstable STATS metadata immutable, just like 
other sstable components, so we don't have to worry this special case. 

I can think of two ways:
# Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
{{SingleSSTableLCSTask}} to create hard link on the compacting sstable 
components with a new descriptor, except STATS files which will be copied 
entirely. Then mutation will be applied on the new STATS file. At the end, old 
sstable will be released. Th

[jira] [Updated] (CASSANDRA-15877) Followup on CASSANDRA-15600

2020-06-16 Thread Ekaterina Dimitrova (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekaterina Dimitrova updated CASSANDRA-15877:

Attachment: Screen Shot 2020-06-12 at 3.21.18 PM.png

> Followup on CASSANDRA-15600
> ---
>
> Key: CASSANDRA-15877
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15877
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Virtual Nodes
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0, 4.0-alpha
>
> Attachments: Screen Shot 2020-06-12 at 3.21.18 PM.png
>
>
> As part of CASSANDRA-15600, the generateSplits method replaced 
> generateRandomTokens for NoReplicationAwareTokenAllocator. generateSplits 
> should also be used in ReplicationAwareTokenAllocator.






[jira] [Updated] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Description: 
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}

I believe similar race may happen with level compaction where it may directly 
mutate a sstable's level if it doesn't overlap with sstables at next level. 
(Note: this isn't a problem in legacy streaming as STATS file length didn't 
matter.) Also it impacts snapshot as well because snapshotted STATS file is 
hard linked. 

Ideally it would be great to make the sstable STATS metadata immutable, just 
like the other sstable components, so we don't have to worry about this special 
case. 

I can think of two ways:
# Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
{{SingleSSTableLCSTask}} to create hard link on the compacting sstable 
components with a new descriptor, except STATS files which will be copied 
entirely. Then mutation will be applied on the new STATS file. At the end, old 
sstable will be released. This ensures all sstable components are immutable and 
shouldn't make these special compaction tasks slower.
# Hacky approach: load the small STATS file into memory when initializing 
{{CassandraOutgoingFile}} instead of relying on mutable on-disk STATS file.

  was:
Flaky dtest: [test_dead_sync_initiator - 
repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]

In the above test, it executes "nodetool repair" on node1 and kills node2 
during repair. At the end, node3 reports checksum validation failure on sstable 
transferred from node1.
{code:java|title=what happened}
1. When repair started on node1, it performs anti-compaction which modifies 
sstable's repairAt to 0 and pending repair id to session-id.
2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
transferred to node3.
3. Before node1 actually sends the files to node3, node2 is killed and node1 
starts to broadcast repair-failure-message to all participants in 
{{CoordinatorSession#fail}}
4. Node1 receives its own repair-failure-message and fails its local repair 
sessions at {{LocalSessions#failSession}} which triggers async background 
compaction.
5. Node1's background compaction will mutate sstable's repairAt to 0 and 
pending repair id to null via  
{{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
in-progress repair.
6. Node1 actually sends the sstable to node3 where the sstable's STATS 
component size is different from the original size recorded in the manifest.
7. At the end, node3 reports checksum validation failure when it tries to 
mutate sstable level and "isTransient" attribute in 
{{CassandraEntireSSTableStreamReader#read}}.
{code}

I believe similar race may happen with level compaction where it may directly 
mutate a sstable's level if it doesn't overlap with sstables at next level. 
(Note: this isn't a problem in legacy streaming as STATS file length didn't 
matter.)

Ideally it will be great to make sstable STATS metadata immutable, just like 
other sstable components, so we don't have to worry this special case. 

I can think of two ways:
# Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
{{SingleSSTableLCSTask}} to create hard link on the compacting sstable 
components with a 

[jira] [Created] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode

2020-06-16 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-15878:


 Summary: Ec2Snitch fails on upgrade in legacy mode
 Key: CASSANDRA-15878
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15878
 Project: Cassandra
  Issue Type: Bug
Reporter: Alexander Dejanovski


CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the 
Ec2Snitch to match AWS conventions.

The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and 
keep the same naming as before (while the "standard" mode uses the new naming 
convention).

When performing an upgrade in the us-west-2 region, the second node failed to 
start with the following exception:

 
{code:java}
ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled 
snitch appears to be using the legacy naming scheme for regions, but existing 
nodes in cluster are using the opposite: region(s) = [us-west-2], availability 
zone(s) = [2a]. Please check the ec2_naming_scheme property in the 
cassandra-rackdc.properties configuration file for more details.
ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception 
encountered during startup
java.lang.IllegalStateException: null
at 
org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573)
at 
org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530)
at 
org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800)
at 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:659)
at 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:610)
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650)
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767)
{code}
 

The exception leads back to [this piece of 
code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185].

After adding some logging, it turned out the DC name of the first upgraded node 
was considered invalid as a legacy one:
{code:java}
INFO  [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC us-west-2
INFO  [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - 
dcUsesLegacyFormat=false / usingLegacyNaming=true
ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name 
us-west-2
{code}
 

The problem is that the regex that's used to identify legacy dc names will 
match both old and new names : 
{code:java}
boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
{code}
Knowing that some dc names didn't change between the two modes (us-west-2 for 
example), I don't see how we can use the dc names to detect if the legacy mode 
is being used by other nodes in the cluster.
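
A quick standalone check of that regex (plain Java, using the expression quoted 
above) confirms the point: "us-west-2" is classified as the new format even 
though it is also a valid legacy DC name for that region, so the DC name alone 
cannot distinguish the two schemes.
{code:java}
public class Ec2NamingCheck
{
    public static void main(String[] args)
    {
        String[] dcs = { "us-west-2", "us-east", "us-east-1" };
        for (String dc : dcs)
        {
            // Same expression as quoted above: "true" is supposed to mean a legacy-format name.
            boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
            System.out.println(dc + " -> dcUsesLegacyFormat=" + dcUsesLegacyFormat);
        }
        // Prints: us-west-2 -> false, us-east -> true, us-east-1 -> false,
        // so a cluster legitimately using legacy naming in us-west-2 is flagged as non-legacy.
    }
}
{code}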
  
 The rack names on the other hand are totally different in the legacy and 
standard modes and can be used to detect mismatching settings.
  
 My go-to fix would be to drop the check on datacenters by removing the 
following lines: 
[https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]






[jira] [Commented] (CASSANDRA-14753) Document incremental repair session timeouts and repair_admin usage

2020-06-16 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136632#comment-17136632
 ] 

Berenguer Blasi commented on CASSANDRA-14753:
-

Hi [~bdeggleston], I was wondering if you'd mind me taking this one. Also, at 
the risk of being a bit cheeky: even after reading through CASSANDRA-14685, I 
can't seem to pin down exactly which error you are referring to with "The 
sstable acquisition error"...

> Document incremental repair session timeouts and repair_admin usage
> ---
>
> Key: CASSANDRA-14753
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14753
> Project: Cassandra
>  Issue Type: Task
>  Components: Legacy/Documentation and Website
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Low
> Fix For: 4.0
>
>
> As seen in CASSANDRA-14685, the behavior of incremental repair sessions with 
> failed streams is not obvious and appears to be a bug (although it's working 
> as expected). The incremental repair documentation should be updated to 
> describe what happens if an incremental repair session fails mid-stream, the 
> session timeouts, and how and when to use nodetool repair_admin. The sstable 
> acquisition error should also be updated to mention this as well.






[jira] [Commented] (CASSANDRA-15821) Metrics Documentation Enhancements

2020-06-16 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136482#comment-17136482
 ] 

Berenguer Blasi commented on CASSANDRA-15821:
-

Hi [~spmallette], I told you I would look into this, but I am afraid I can't 
help much beyond doing a sanity check. It looks good in that regard, but my 
metrics knowledge is no match for the requirements here. I will keep an eye on 
it, but you'll need somebody with deeper knowledge to chime in.
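
For anyone who wants to cross-check the spreadsheet against a live node, one 
possible approach (assuming the default local JMX endpoint on port 7199 with no 
authentication; adjust the URL and credentials for your environment) is to list 
every metric MBean the node actually registered:
{code:java}
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListCassandraMetrics
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // Cassandra exposes its metrics under the org.apache.cassandra.metrics domain.
            Set<ObjectName> names = connection.queryNames(new ObjectName("org.apache.cassandra.metrics:*"), null);
            for (ObjectName name : new TreeSet<>(names)) // sorted, so runs can be diffed
                System.out.println(name);
        }
    }
}
{code}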

> Metrics Documentation Enhancements
> --
>
> Key: CASSANDRA-15821
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15821
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Documentation/Website
>Reporter: Stephen Mallette
>Assignee: Stephen Mallette
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-15582 involves quality around metrics and it was mentioned that 
> reviewing and [improving 
> documentation|https://github.com/apache/cassandra/blob/trunk/doc/source/operating/metrics.rst]
>  around metrics would fall into that scope. Please consider some of this 
> analysis in determining what improvements to make here:
> Please see [this 
> spreadsheet|https://docs.google.com/spreadsheets/d/1iPWfCMIG75CI6LbYuDtCTjEOvZw-5dyH-e08bc63QnI/edit?usp=sharing]
>  that itemizes almost all of cassandra's metrics and whether they are 
> documented or not (and other notes).  That spreadsheet is "almost all" 
> because there are some metrics that don't seem to initialize as part of 
> Cassandra startup (i was able to trigger some to initialize, but all were not 
> immediately obvious). The missing metrics seem to be related to the following:
> * ThreadPool metrics - only some initialize at startup the list of which 
> follow below
> * Streaming Metrics
> * HintedHandoff Metrics
> * HintsService Metrics
> Here are the ThreadPool scopes that get listed:
> {code}
> AntiEntropyStage
> CacheCleanupExecutor
> CompactionExecutor
> GossipStage
> HintsDispatcher
> MemtableFlushWriter
> MemtablePostFlush
> MemtableReclaimMemory
> MigrationStage
> MutationStage
> Native-Transport-Requests
> PendingRangeCalculator
> PerDiskMemtableFlushWriter_0
> ReadStage
> Repair-Task
> RequestResponseStage
> Sampler
> SecondaryIndexManagement
> ValidationExecutor
> ViewBuildExecutor
> {code}
> I noticed that Keyspace Metrics have this note: "Most of these metrics are 
> the same as the Table Metrics above, only they are aggregated at the Keyspace 
> level." I think I've isolated those metrics on table that are not on keyspace 
> to specifically be:
> {code}
> BloomFilterFalsePositives
> BloomFilterFalseRatio
> BytesAnticompacted
> BytesFlushed
> BytesMutatedAnticompaction
> BytesPendingRepair
> BytesRepaired
> BytesUnrepaired
> CompactionBytesWritten
> CompressionRatio
> CoordinatorReadLatency
> CoordinatorScanLatency
> CoordinatorWriteLatency
> EstimatedColumnCountHistogram
> EstimatedPartitionCount
> EstimatedPartitionSizeHistogram
> KeyCacheHitRate
> LiveSSTableCount
> MaxPartitionSize
> MeanPartitionSize
> MinPartitionSize
> MutatedAnticompactionGauge
> PercentRepaired
> RowCacheHitOutOfRange
> RowCacheHit
> RowCacheMiss
> SpeculativeSampleLatencyNanos
> SyncTime
> WaitingOnFreeMemtableSpace
> DroppedMutations
> {code}
> Someone with greater knowledge of this area might consider it worth the 
> effort to see if any of these metrics should be aggregated to the keyspace 
> level in case they were inadvertently missed. In any case, perhaps the 
> documentation could easily now reflect which metric names could be expected 
> on Keyspace.
> The DroppedMessage metrics have a much larger body of scopes than just what 
> were documented:
> {code}
> ASYMMETRIC_SYNC_REQ
> BATCH_REMOVE_REQ
> BATCH_REMOVE_RSP
> BATCH_STORE_REQ
> BATCH_STORE_RSP
> CLEANUP_MSG
> COUNTER_MUTATION_REQ
> COUNTER_MUTATION_RSP
> ECHO_REQ
> ECHO_RSP
> FAILED_SESSION_MSG
> FAILURE_RSP
> FINALIZE_COMMIT_MSG
> FINALIZE_PROMISE_MSG
> FINALIZE_PROPOSE_MSG
> GOSSIP_DIGEST_ACK
> GOSSIP_DIGEST_ACK2
> GOSSIP_DIGEST_SYN
> GOSSIP_SHUTDOWN
> HINT_REQ
> HINT_RSP
> INTERNAL_RSP
> MUTATION_REQ
> MUTATION_RSP
> PAXOS_COMMIT_REQ
> PAXOS_COMMIT_RSP
> PAXOS_PREPARE_REQ
> PAXOS_PREPARE_RSP
> PAXOS_PROPOSE_REQ
> PAXOS_PROPOSE_RSP
> PING_REQ
> PING_RSP
> PREPARE_CONSISTENT_REQ
> PREPARE_CONSISTENT_RSP
> PREPARE_MSG
> RANGE_REQ
> RANGE_RSP
> READ_REPAIR_REQ
> READ_REPAIR_RSP
> READ_REQ
> READ_RSP
> REPAIR_RSP
> REPLICATION_DONE_REQ
> REPLICATION_DONE_RSP
> REQUEST_RSP
> SCHEMA_PULL_REQ
> SCHEMA_PULL_RSP
> SCHEMA_PUSH_REQ
> SCHEMA_PUSH_RSP
> SCHEMA_VERSION_REQ
> SCHEMA_VERSION_RSP
> SNAPSHOT_MSG
> SNAPSHOT_REQ
> SNAPSHOT_RSP
> STATUS_REQ
> STATUS_RSP
> SYNC_REQ
> SYNC_RSP
> TRUNCATE_REQ
> TRUNCATE_RSP
> VALIDATION_REQ
> VALIDATION_RSP
> _SAMPLE
> _TEST_1
> _TEST_2
> _TRACE
> {code}
> 

[jira] [Updated] (CASSANDRA-15821) Metrics Documentation Enhancements

2020-06-16 Thread Berenguer Blasi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Berenguer Blasi updated CASSANDRA-15821:

Reviewers: Berenguer Blasi

> Metrics Documentation Enhancements
> --
>
> Key: CASSANDRA-15821
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15821
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Documentation/Website
>Reporter: Stephen Mallette
>Assignee: Stephen Mallette
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-15582 involves quality around metrics and it was mentioned that 
> reviewing and [improving 
> documentation|https://github.com/apache/cassandra/blob/trunk/doc/source/operating/metrics.rst]
>  around metrics would fall into that scope. Please consider some of this 
> analysis in determining what improvements to make here:
> Please see [this 
> spreadsheet|https://docs.google.com/spreadsheets/d/1iPWfCMIG75CI6LbYuDtCTjEOvZw-5dyH-e08bc63QnI/edit?usp=sharing]
>  that itemizes almost all of Cassandra's metrics and whether they are 
> documented or not (and other notes). The spreadsheet covers "almost all" 
> because some metrics don't seem to initialize as part of Cassandra startup (I 
> was able to trigger some to initialize, but not all were immediately obvious). 
> The missing metrics seem to be related to the following:
> * ThreadPool metrics - only some initialize at startup; the list of those 
> that do follows below
> * Streaming Metrics
> * HintedHandoff Metrics
> * HintsService Metrics
> Here are the ThreadPool scopes that get listed:
> {code}
> AntiEntropyStage
> CacheCleanupExecutor
> CompactionExecutor
> GossipStage
> HintsDispatcher
> MemtableFlushWriter
> MemtablePostFlush
> MemtableReclaimMemory
> MigrationStage
> MutationStage
> Native-Transport-Requests
> PendingRangeCalculator
> PerDiskMemtableFlushWriter_0
> ReadStage
> Repair-Task
> RequestResponseStage
> Sampler
> SecondaryIndexManagement
> ValidationExecutor
> ViewBuildExecutor
> {code}
> I noticed that Keyspace Metrics have this note: "Most of these metrics are 
> the same as the Table Metrics above, only they are aggregated at the Keyspace 
> level." I think I've isolated the metrics that exist on Table but not on 
> Keyspace; specifically they are:
> {code}
> BloomFilterFalsePositives
> BloomFilterFalseRatio
> BytesAnticompacted
> BytesFlushed
> BytesMutatedAnticompaction
> BytesPendingRepair
> BytesRepaired
> BytesUnrepaired
> CompactionBytesWritten
> CompressionRatio
> CoordinatorReadLatency
> CoordinatorScanLatency
> CoordinatorWriteLatency
> EstimatedColumnCountHistogram
> EstimatedPartitionCount
> EstimatedPartitionSizeHistogram
> KeyCacheHitRate
> LiveSSTableCount
> MaxPartitionSize
> MeanPartitionSize
> MinPartitionSize
> MutatedAnticompactionGauge
> PercentRepaired
> RowCacheHitOutOfRange
> RowCacheHit
> RowCacheMiss
> SpeculativeSampleLatencyNanos
> SyncTime
> WaitingOnFreeMemtableSpace
> DroppedMutations
> {code}
> Someone with greater knowledge of this area might consider it worth the 
> effort to check whether any of these metrics should also be aggregated at the 
> keyspace level, in case they were inadvertently missed. In any case, the 
> documentation could now state which metric names to expect at the Keyspace 
> level.
> The DroppedMessage metrics have a much larger set of scopes than what was 
> documented:
> {code}
> ASYMMETRIC_SYNC_REQ
> BATCH_REMOVE_REQ
> BATCH_REMOVE_RSP
> BATCH_STORE_REQ
> BATCH_STORE_RSP
> CLEANUP_MSG
> COUNTER_MUTATION_REQ
> COUNTER_MUTATION_RSP
> ECHO_REQ
> ECHO_RSP
> FAILED_SESSION_MSG
> FAILURE_RSP
> FINALIZE_COMMIT_MSG
> FINALIZE_PROMISE_MSG
> FINALIZE_PROPOSE_MSG
> GOSSIP_DIGEST_ACK
> GOSSIP_DIGEST_ACK2
> GOSSIP_DIGEST_SYN
> GOSSIP_SHUTDOWN
> HINT_REQ
> HINT_RSP
> INTERNAL_RSP
> MUTATION_REQ
> MUTATION_RSP
> PAXOS_COMMIT_REQ
> PAXOS_COMMIT_RSP
> PAXOS_PREPARE_REQ
> PAXOS_PREPARE_RSP
> PAXOS_PROPOSE_REQ
> PAXOS_PROPOSE_RSP
> PING_REQ
> PING_RSP
> PREPARE_CONSISTENT_REQ
> PREPARE_CONSISTENT_RSP
> PREPARE_MSG
> RANGE_REQ
> RANGE_RSP
> READ_REPAIR_REQ
> READ_REPAIR_RSP
> READ_REQ
> READ_RSP
> REPAIR_RSP
> REPLICATION_DONE_REQ
> REPLICATION_DONE_RSP
> REQUEST_RSP
> SCHEMA_PULL_REQ
> SCHEMA_PULL_RSP
> SCHEMA_PUSH_REQ
> SCHEMA_PUSH_RSP
> SCHEMA_VERSION_REQ
> SCHEMA_VERSION_RSP
> SNAPSHOT_MSG
> SNAPSHOT_REQ
> SNAPSHOT_RSP
> STATUS_REQ
> STATUS_RSP
> SYNC_REQ
> SYNC_RSP
> TRUNCATE_REQ
> TRUNCATE_RSP
> VALIDATION_REQ
> VALIDATION_RSP
> _SAMPLE
> _TEST_1
> _TEST_2
> _TRACE
> {code}
> I suppose I may yet be missing some metrics, as my knowledge of what's 
> available is limited to what I can get from JMX after Cassandra 
> initialization (and some initial starting commands) and what's in the 
> documentation. If something is present but missing from both, then I won't 
> know it's there.  Anyway, pe
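
A minimal sketch of the kind of JMX survey described above, assuming a locally 
running node with JMX on the default port 7199 and no authentication; it lists 
every MBean registered under the org.apache.cassandra.metrics domain so the 
names can be compared against the documented ones:

{code:java}
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Lists the metric MBeans a running node actually registers so they can be
// diffed against doc/source/operating/metrics.rst.
// Assumes a local node with JMX on the default port 7199 and no authentication.
public class ListCassandraMetrics
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName pattern = new ObjectName("org.apache.cassandra.metrics:*");
            Set<String> names = new TreeSet<>();
            for (ObjectName name : mbs.queryNames(pattern, null))
                names.add(name.getCanonicalName());
            names.forEach(System.out::println);
        }
    }
}
{code}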

[jira] [Updated] (CASSANDRA-15861) Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-06-16 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Component/s: Local/Compaction

> Mutating sstable STATS metadata may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> ---
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> In the above test, "nodetool repair" is executed on node1 and node2 is killed 
> during repair. At the end, node3 reports a checksum validation failure on the 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair starts on node1, it performs anti-compaction, which mutates 
> the sstable's repairAt to 0 and its pending repair id to the session id.
> 2. Node1 then creates a {{ComponentManifest}} recording the component file 
> lengths to be transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast a repair-failure message to all participants in 
> {{CoordinatorSession#fail}}.
> 4. Node1 receives its own repair-failure message and fails its local repair 
> sessions in {{LocalSessions#failSession}}, which triggers an async background 
> compaction.
> 5. Node1's background compaction mutates the sstable's repairAt to 0 and its 
> pending repair id to null via 
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no longer an 
> in-progress repair.
> 6. Node1 then sends the sstable to node3, but the sstable's STATS component 
> size now differs from the original size recorded in the manifest.
> 7. Finally, node3 reports a checksum validation failure when it tries to 
> mutate the sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> I believe a similar race may happen with leveled compaction, which may 
> directly mutate an sstable's level if it doesn't overlap with sstables at the 
> next level. (Note: this isn't a problem in legacy streaming, as the STATS file 
> length didn't matter there.)
> Ideally it would be great to make sstable STATS metadata immutable, just like 
> the other sstable components, so we don't have to worry about this special 
> case. I can think of two ways:
> # Change {{RepairFinishedCompactionTask}}, {{AntiCompaction}} and 
> {{SingleSSTableLCSTask}} to create hard links to the compacting sstable's 
> components under a new descriptor, except the STATS file, which will be copied 
> entirely. The mutation will then be applied to the new STATS file, and at the 
> end the old sstable will be released. This keeps all sstable components 
> immutable and shouldn't make these special compaction tasks slower.
> # Hacky approach: load the small STATS file into memory when initializing 
> {{CassandraOutgoingFile}} instead of relying on the mutable on-disk STATS 
> file (a rough sketch of this idea follows below).
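
As a rough, hypothetical sketch of the second approach (the class and method 
names below are illustrative only, not Cassandra's actual API): the small STATS 
component is buffered in memory at the moment its length is recorded for the 
manifest, so a later in-place mutation of the on-disk file can no longer change 
the bytes or the length that get streamed:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of approach #2: capture the STATS component in memory
// when the outgoing file (and its manifest entry) is created, so a concurrent
// metadata mutation on disk cannot change the bytes or length that are streamed.
final class SnapshottedStatsComponent
{
    private final ByteBuffer contents;   // immutable copy taken at manifest time

    private SnapshottedStatsComponent(ByteBuffer contents)
    {
        this.contents = contents;
    }

    static SnapshottedStatsComponent capture(Path statsFile) throws IOException
    {
        // Statistics.db is small, so buffering it entirely is cheap.
        return new SnapshottedStatsComponent(ByteBuffer.wrap(Files.readAllBytes(statsFile)));
    }

    long manifestLength()
    {
        return contents.remaining();     // the length recorded in the manifest
    }

    void writeTo(WritableByteChannel out) throws IOException
    {
        ByteBuffer dup = contents.duplicate();  // keep the snapshot reusable
        while (dup.hasRemaining())
            out.write(dup);
    }
}
{code}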



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org