[jira] [Commented] (CASSANDRA-18111) Centralize all snapshot operations to SnapshotManager and cache snapshots

2024-09-14 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881819#comment-17881819
 ] 

Paulo Motta commented on CASSANDRA-18111:
-

Thanks for the update, the patch is looking good so far, but I think there is 
still some internal snapshot logic leaking into other classes (i.e. 
{{{}Keyspace{}}}/ColumnFamilyStore). It would be ideal if we could centralize 
most, if not all, internal snapshot logic in the package 
*org.apache.cassandra.service.snapshot* as part of this effort.

Added some review comments directly to the 
[PR|https://github.com/apache/cassandra/pull/3374#pullrequestreview-2305171293] 
and some other comments below.

I see some internal code/tests using old snapshot methods from 
{{StorageService}} (for example 
[StandaloneUpgraderOnSStablesTest#testUpgradeSnapshot|https://github.com/apache/cassandra/blob/trunk/test/unit/org/apache/cassandra/tools/StandaloneUpgraderOnSStablesTest.java#L96]).
 Should we deprecate the {{StorageService}} snapshot methods to discourage their 
use (similar to {{{}StorageServiceMBean{}}}) and update all callers to use 
{{SnapshotManager}} methods?
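
If we go that route, a minimal sketch of the deprecation could look like the 
following (the delegation and the {{SnapshotManager}} method shown are 
assumptions for illustration, not the actual patch):

{code:java}
// Hypothetical sketch inside StorageService: keep the legacy entry point but
// mark it deprecated and delegate to SnapshotManager, so internal callers and
// tests can migrate gradually.
/** @deprecated See CASSANDRA-18111. Use SnapshotManager instead. */
@Deprecated
public void takeSnapshot(String tag, String... keyspaceNames) throws IOException
{
    SnapshotManager.instance.takeSnapshot(tag, keyspaceNames);
}
{code}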

It looks like some internal code still calls the legacy {{ColumnFamilyStore}} 
snapshot methods (e.g. 
[SnapshotVerbHandler.doVerb|https://github.com/apache/cassandra/blob/fe025c7f79e76d99e0db347518a7872fd4a114bc/src/java/org/apache/cassandra/service/SnapshotVerbHandler.java#L49])
 - should we update all callers to use {{SnapshotManager}} and remove the 
{{ColumnFamilyStore}} snapshot methods in favor of {{SnapshotManager}} methods?
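
For illustration, {{doVerb}} could delegate to the manager roughly like this 
(the {{SnapshotManager}} method names are assumptions; only the 
{{SnapshotCommand}} fields are the existing ones):

{code:java}
// Hypothetical sketch of SnapshotVerbHandler.doVerb calling SnapshotManager
// instead of ColumnFamilyStore/Keyspace directly.
public void doVerb(Message<SnapshotCommand> message)
{
    SnapshotCommand command = message.payload;
    if (command.clear_snapshot)
        SnapshotManager.instance.clearSnapshot(command.snapshot_name, command.keyspace);
    else
        SnapshotManager.instance.takeSnapshot(command.snapshot_name,
                                              command.keyspace + '.' + command.column_family);
    MessagingService.instance().send(message.emptyResponse(), message.from());
}
{code}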

It looks like there are some legacy snapshot tests without assertions in 
[StorageServiceServerTest|https://github.com/apache/cassandra/blob/trunk/test/unit/org/apache/cassandra/service/StorageServiceServerTest.java#L158].
 I think we should try to add the missing assertions and move them to 
{{SnapshotManagerTest}} if they're not already covered somewhere else.
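
For example, the moved tests could assert against the manager rather than only 
exercising the code path (the {{SnapshotManager}} methods and the {{KEYSPACE}} 
constant below are assumptions for illustration):

{code:java}
// Sketch of a replacement test in SnapshotManagerTest with an actual assertion.
@Test
public void takenSnapshotIsListed() throws Exception
{
    String tag = "upgrade-test-" + System.nanoTime();
    SnapshotManager.instance.takeSnapshot(tag, KEYSPACE);
    assertTrue("snapshot " + tag + " should be listed after being taken",
               SnapshotManager.instance.getSnapshots().stream()
                                       .anyMatch(s -> tag.equals(s.getTag())));
}
{code}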

> Centralize all snapshot operations to SnapshotManager and cache snapshots
> -
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Snapshots
>Reporter: Paulo Motta
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are 
> scanned to find snapshots, which is inefficient.
> For example, fetching the 
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric 
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via 
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.
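
A rough sketch of how steps a)-d) could fit together (all class and method 
names below are illustrative assumptions, not the actual implementation):

{code:java}
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative-only sketch of an in-memory snapshot collection kept by SnapshotManager.
public class SnapshotCache
{
    private final Map<String, TableSnapshot> snapshots = new ConcurrentHashMap<>();

    // a) load all snapshots from disk once, during initialization
    public void loadFromDisk(Collection<TableSnapshot> loaded)
    {
        for (TableSnapshot snapshot : loaded)
            snapshots.put(snapshot.getId(), snapshot);
    }

    // c) keep the collection in sync when snapshots are taken or cleared
    public void onTaken(TableSnapshot snapshot) { snapshots.put(snapshot.getId(), snapshot); }
    public void onCleared(String id)            { snapshots.remove(id); }

    // b) listing reads the in-memory collection, with no data directory scan
    public Collection<TableSnapshot> list()     { return snapshots.values(); }

    // d) drop entries whose files were removed from disk out of band
    public void pruneRemoved()                  { snapshots.values().removeIf(s -> !s.existsOnDisk()); }
}
{code}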



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19902) Revert CASSANDRA-11537

2024-09-05 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19902:

Description: 
Looks like the seemingly harmless cosmetic patch from CASSANDRA-11537 causes 
the StorageServiceMBean not to be available during bootstrap. This causes 
commands like "nodetool netstats/status/etc" to be unavailable on the 
bootstrapping node, with the following error:

{code:none}
- StackTrace --
javax.management.InstanceNotFoundException: 
org.apache.cassandra.db:type=StorageService
at 
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1083)
at 
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:637)
{code}

This ticket is just to revert CASSANDRA-11537; we can re-add the improvement 
from that ticket later.

  was:
Looks like the seemingly innocent cosmetic patch from CASSANDRA-11537 causes 
the StorageServiceMBean to not be available during bootstrap. This causes 
commands like "nodetool nestats/status/etc" to not be available on the 
boostrapping node with the following error:

{code:none}
- StackTrace --
javax.management.InstanceNotFoundException: 
org.apache.cassandra.db:type=StorageService
at 
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1083)
at 
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:637)
{code}

This ticket is just to revert CASSANDRA-11537, we can re-add the improvement of 
that ticket later.


> Revert CASSANDRA-11537
> --
>
> Key: CASSANDRA-19902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/nodetool
>    Reporter: Paulo Motta
>Assignee: Paulo Motta
>Priority: Normal
>
> Looks like the seemingly harmless cosmetic patch from CASSANDRA-11537 causes 
> the StorageServiceMBean not to be available during bootstrap. This causes 
> commands like "nodetool netstats/status/etc" to be unavailable on the 
> bootstrapping node, with the following error:
> {code:none}
> - StackTrace --
> javax.management.InstanceNotFoundException: 
> org.apache.cassandra.db:type=StorageService
> at 
> java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1083)
> at 
> java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:637)
> {code}
> This ticket is just to revert CASSANDRA-11537; we can re-add the improvement 
> from that ticket later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18111) Centralize all snapshot operations to SnapshotManager and cache snapshots

2024-07-31 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870085#comment-17870085
 ] 

Paulo Motta edited comment on CASSANDRA-18111 at 8/1/24 5:24 AM:
-

(I structured this review into multiple sections to hopefully make it easier to 
discuss)

I'd like to restate and discuss the goals of this ticket to ensure we're on the 
same page:
 *  *✅ Goal 1: Improve performance of +nodetool listsnapshots+ / SELECT * FROM 
system_views.snapshots by avoiding expensive disk traversal when listing 
snapshots*

To validate that this goal is achieved with the proposed patch, I created a 
rough benchmark comparing listsnapshots performance across the following 
implementations:
 * {*}listsnapshots_disk{*}: Old implementation fetching snapshots from disk in 
every call to listsnapshots
 * {*}listsnapshots_cached{*}: New cached implementation
 * {*}listsnapshots_cached_checkexists{*}: New cached implementation + check 
manifest file exists during fetch

The benchmark consists of a simple junit test ([code 
here|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/test/unit/org/apache/cassandra/service/snapshot/SnapshotManagerTest.java])
 fetching 100 snapshots of 10 tables with 10 sstables each 1000 times for each 
implementation.

I got the following test execution times on my modest SSD laptop for each 
implementation:
 * {*}listsnapshots_disk{*}: 37 seconds
 * {*}listsnapshots_cached{*}: 36 milliseconds
 * {*}listsnapshots_cached_checkexists{*}: 4 seconds

The *listsnapshots_cached* results indicate that caching snapshots greatly 
improves *listsnapshots* speed compared to the current *listsnapshots_disk* 
implementation, as expected, which accomplishes *Goal 1* and justifies this patch.

The additional snapshot manifest existence check from 
*listsnapshots_cached_checkexists* adds considerable overhead in comparison to 
{*}listsnapshots_cached{*}, but it's still significantly faster than the 
previous *listsnapshots_disk* implementation.
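
For reference, the benchmark loop has roughly the following shape (the real 
test is linked above; this is a simplified illustration, and {{createSnapshots}} 
is a hypothetical helper):

{code:java}
// Simplified shape of the benchmark: create the snapshots once, then time 1000
// list calls against the implementation under test.
@Test
public void benchmarkListSnapshots() throws Exception
{
    createSnapshots(100 /* snapshots */, 10 /* tables */, 10 /* sstables per table */);

    long start = System.nanoTime();
    for (int i = 0; i < 1000; i++)
        SnapshotManager.instance.getSnapshots(); // cached: no per-call disk traversal
    System.out.printf("1000 listsnapshots calls took %d ms%n",
                      (System.nanoTime() - start) / 1_000_000);
}
{code}
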
 * *⚠️ Goal 2: Consolidate / centralize "as much as possible" snapshot logic in 
SnapshotManager (CASSANDRA-18271)*

While this patch makes progress towards this goal, there is still a considerable 
amount of snapshot logic in *StorageService* and {*}ColumnFamilyStore{*}. See the 
discussion for each subsystem below:

*A) StorageService:* there is some snapshot handling logic in at least the 
following methods:
 * 
[takeSnapshot|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/StorageService.java#L2735]
 * 
[getSnapshotDetails|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/StorageService.java#L3020]
 * 
[trueSnapshotsSize|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/StorageService.java#L3042]

I think we could simplify a great deal of code by moving the remaining snapshot 
logic from StorageService to SnapshotManager and creating a dedicated 
[SnapshotManagerMBean|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/snapshot/SnapshotManagerMBean.java]
 to expose snapshot methods via JMX moving forward. WDYT?

This would allow refactoring and simplifying some snapshot logic, for example 
unifying the implementations of 
[takeMultipleTableSnapshot|https://github.com/apache/cassandra/blob/f95c1b5bb3d6f8728a00e13ca81993e12a9b14cd/src/java/org/apache/cassandra/service/StorageService.java#L2914]
 and 
[takeSnapshot|https://github.com/apache/cassandra/blob/f95c1b5bb3d6f8728a00e13ca81993e12a9b14cd/src/java/org/apache/cassandra/service/StorageService.java#L2868].

The proposal above would help retire snapshot logic from StorageService and 
eventually remove the deprecated snapshot handling methods from 
StorageServiceMBean. I'm happy to take these suggestions to a follow-up ticket, 
but wanted to hear your thoughts on this refactoring proposal.
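
To make the proposal concrete, the dedicated MBean could expose roughly the 
snapshot operations that {{StorageServiceMBean}} has today (a sketch only; the 
exact signatures would follow whatever lands in the patch):

{code:java}
import java.io.IOException;
import java.util.Map;
import javax.management.openmbean.TabularData;

// Illustrative sketch of the proposed JMX surface for snapshots.
public interface SnapshotManagerMBean
{
    void takeSnapshot(String tag, Map<String, String> options, String... entities) throws IOException;
    void clearSnapshot(String tag, Map<String, String> options, String... keyspaceNames) throws IOException;
    Map<String, TabularData> listSnapshots(Map<String, String> options);
    long trueSnapshotsSize();
}
{code}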

*B) ColumnFamilyStore:* there is a fundamental coupling between 
ColumnFamilyStore and snapshot creation, since snapshot creation requires 
flushing and locking sstables while creating the hardlinks. I don't think we 
can fully remove this dependency but maybe there's room for further 
cleanup/improvement in a follow-up ticket.



*⚠️* *SnapshotWatcher*

I am a bit concerned by the additional complexity added by SnapshotWatcher and 
the reliance on the WatchService/inotify implementation to detect when a snapshot 
is manually removed from outside the process.

How about checking whether the manifest file exists periodically, or during fetch, 
if the user wants to enable this detection? This seems relatively cheap based on 
the *listsnapshots_cached_checkexists* results while being considerably simpler 
than the SnapshotWatcher approach.

[jira] [Commented] (CASSANDRA-18111) Centralize all snapshot operations to SnapshotManager and cache snapshots

2024-07-31 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870085#comment-17870085
 ] 

Paulo Motta commented on CASSANDRA-18111:
-

(I structured this review into multiple sections to hopefully make it easier to 
discuss)

I'd like to restate and discuss the goals of this ticket to ensure we're on the 
same page:
 *  *✅ Goal 1: Improve performance of +nodetool listsnapshots+ / SELECT * FROM 
system_views.snapshots by avoiding expensive disk traversal when listing 
snapshots*

To validate that this goal is achieved with the proposed patch, I created a 
rough benchmark comparing listsnapshots performance across the following 
implementations:
 * {*}listsnapshots_disk{*}: Old implementation fetching snapshots from disk in 
every call to listsnapshots
 * {*}listsnapshots_cached{*}: New cached implementation
 * {*}listsnapshots_cached_checkexists{*}: New cached implementation + check 
manifest file exists during fetch

The benchmark consists of a simple junit test ([code 
here|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/test/unit/org/apache/cassandra/service/snapshot/SnapshotManagerTest.java])
 fetching 100 snapshots of 10 tables with 10 sstables each 1000 times for each 
implementation.

I got the following test execution times on my modest SSD laptop for each 
implementation:
 * {*}listsnapshots_disk{*}: 37 seconds
 * {*}listsnapshots_cached{*}: 36 milliseconds
 * {*}listsnapshots_cached_checkexists{*}: 4 seconds

The *listsnapshots_cached* results indicate that caching snapshots greatly 
improves *listsnapshots* speed compared to the current *listsnapshots_disk* 
implementation, as expected, which accomplishes *Goal 1* and justifies this patch.

The additional snapshot manifest existence check from 
*listsnapshots_cached_checkexists* adds considerable overhead in comparison to 
{*}listsnapshots_cached{*}, but it's still significantly faster than the 
previous *listsnapshots_disk* implementation.
 * *⚠️ Goal 2: Consolidate / centralize "as much as possible" snapshot logic in 
SnapshotManager (CASSANDRA-18271)*

While this patch makes progress towards this goal, there is still a considerable 
amount of snapshot logic in *StorageService* and {*}ColumnFamilyStore{*}. See the 
discussion for each subsystem below:

*A) StorageService:* there is some snapshot handling logic in at least the 
following methods:
 * 
[takeSnapshot|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/StorageService.java#L2745]
 * 
[getSnapshotDetails|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/StorageService.java#L3062]
 * 
[trueSnapshotsSize|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/StorageService.java#L3062]

I think we could simplify a great deal of code by moving the remaining snapshot 
logic from StorageService to SnapshotManager and creating a dedicated 
[SnapshotManagerMBean|https://github.com/pauloricardomg/cassandra/blob/CASSANDRA-18111-review-2/src/java/org/apache/cassandra/service/snapshot/SnapshotManagerMBean.java]
 to expose snapshot methods via JMX moving forward. WDYT?

This would allow refactoring and simplifying some snapshot logic, for example 
unifying the implementations of 
[takeMultipleTableSnapshot|https://github.com/apache/cassandra/blob/f95c1b5bb3d6f8728a00e13ca81993e12a9b14cd/src/java/org/apache/cassandra/service/StorageService.java#L2914]
 and 
[takeSnapshot|https://github.com/apache/cassandra/blob/f95c1b5bb3d6f8728a00e13ca81993e12a9b14cd/src/java/org/apache/cassandra/service/StorageService.java#L2868].

The proposal above would help retire snapshot logic from StorageService and 
eventually remove the deprecated snapshot handling methods from 
StorageServiceMBean. I'm happy to take these suggestions to a follow-up ticket, 
but wanted to hear your thoughts on this refactoring proposal.

*B) ColumnFamilyStore:* there is a fundamental coupling between 
ColumnFamilyStore and snapshot creation, since snapshot creation requires 
flushing and locking sstables while creating the hardlinks. I don't think we 
can fully remove this dependency but maybe there's room for further 
cleanup/improvement in a follow-up ticket.



*⚠️* *SnapshotWatcher*

I am a bit concerned by the additional complexity added by SnapshotWatcher and 
the reliance on the WatchService/inotify implementation to detect when a snapshot 
is manually removed from outside the process.

How about checking whether the manifest file exists periodically, or during fetch, 
if the user wants to enable this detection? This seems relatively cheap based on 
the *listsnapshots_cached_checkexists* results while being considerably simpler 
than the SnapshotWatcher approach.

[jira] [Commented] (CASSANDRA-18111) Centralize all snapshot operations to SnapshotManager and cache snapshots

2024-07-21 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867638#comment-17867638
 ] 

Paulo Motta commented on CASSANDRA-18111:
-

While testing this I noticed that when a snapshot from a table is split across 
multiple data directories and one of the directories is manually removed, the 
cleanup mechanism removes the snapshot files from the other directories.

When a snapshot is spread across multiple data directories, I think the intent 
is to only stop tracking the snapshot in SnapshotManager when all snapshot 
subdirectories are removed? We don't want to clear additional snapshot 
directories if only one of the subdirectories was manually removed.

Alternatively, we can consider a snapshot valid as long as its "manifest.json" 
exists? This would create a requirement that all snapshots must contain a 
"manifest.json" to be tracked by SnapshotManager. I think this is a fair 
requirement, because without the manifest it's not possible to tell whether a 
snapshot was partially corrupted (i.e. some files were removed from it). A 
sketch of both options follows the example below.

See example:
{code:none}
$ cat 
data/data1/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test/manifest.json
 
{
  "files" : [ "oa-2-big-Data.db", "oa-1-big-Data.db" ],
  "created_at" : "2024-07-22T01:53:47.026Z",
  "expires_at" : null,
  "ephemeral" : false
}

$ls -ltra 
data/data1/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test/
 
total 48
-rw-rw-r-- 2 user user   16 Jul 21 21:53 oa-1-big-Filter.db
-rw-rw-r-- 2 user user   92 Jul 21 21:53 oa-1-big-Summary.db
-rw-rw-r-- 2 user user   20 Jul 21 21:53 oa-1-big-Index.db
-rw-rw-r-- 2 user user   10 Jul 21 21:53 oa-1-big-Digest.crc32
-rw-rw-r-- 2 user user  137 Jul 21 21:53 oa-1-big-Data.db
-rw-rw-r-- 2 user user   92 Jul 21 21:53 oa-1-big-TOC.txt
-rw-rw-r-- 2 user user 5429 Jul 21 21:53 oa-1-big-Statistics.db
-rw-rw-r-- 2 user user   47 Jul 21 21:53 oa-1-big-CompressionInfo.db
drwxrwxr-x 3 user user 4096 Jul 21 21:53 ..
drwxrwxr-x 2 user user 4096 Jul 21 21:53 .
-rw-rw-r-- 1 user user  149 Jul 21 21:53 manifest.json

$ ls -ltra 
data/data2/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test/
total 44
-rw-rw-r-- 2 user user   92 Jul 21 21:53 oa-2-big-Summary.db
-rw-rw-r-- 2 user user   20 Jul 21 21:53 oa-2-big-Index.db
-rw-rw-r-- 2 user user   16 Jul 21 21:53 oa-2-big-Filter.db
-rw-rw-r-- 2 user user  129 Jul 21 21:53 oa-2-big-Data.db
-rw-rw-r-- 2 user user   10 Jul 21 21:53 oa-2-big-Digest.crc32
-rw-rw-r-- 2 user user   47 Jul 21 21:53 oa-2-big-CompressionInfo.db
-rw-rw-r-- 2 user user   92 Jul 21 21:53 oa-2-big-TOC.txt
-rw-rw-r-- 2 user user 5430 Jul 21 21:53 oa-2-big-Statistics.db
drwxrwxr-x 3 user user 4096 Jul 21 21:53 ..
drwxrwxr-x 2 user user 4096 Jul 21 21:53 .

# Remove data2 manually, but keep data1
$ rm -rf 
data/data2/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test/

$ ls -ltra 
data/data1/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test/
total 48
-rw-rw-r-- 2 user user   16 Jul 21 21:53 oa-1-big-Filter.db
-rw-rw-r-- 2 user user   92 Jul 21 21:53 oa-1-big-Summary.db
-rw-rw-r-- 2 user user   20 Jul 21 21:53 oa-1-big-Index.db
-rw-rw-r-- 2 user user   10 Jul 21 21:53 oa-1-big-Digest.crc32
-rw-rw-r-- 2 user user  137 Jul 21 21:53 oa-1-big-Data.db
-rw-rw-r-- 2 user user   92 Jul 21 21:53 oa-1-big-TOC.txt
-rw-rw-r-- 2 user user 5429 Jul 21 21:53 oa-1-big-Statistics.db
-rw-rw-r-- 2 user user   47 Jul 21 21:53 oa-1-big-CompressionInfo.db
drwxrwxr-x 3 user user 4096 Jul 21 21:53 ..
drwxrwxr-x 2 user user 4096 Jul 21 21:53 .
-rw-rw-r-- 1 user user  149 Jul 21 21:53 manifest.json

[after some time]

INFO  [SnapshotCleanup:1] 2024-07-21 22:01:46,818 SnapshotManager.java:243 - 
Removing snapshot TableSnapshot{keyspaceName='system', 
tableName='compaction_history', tableId=b4dbb7b4-dc49-3fb5-b3bf-ce6e434832ca, 
tag='test', createdAt=2024-07-22T01:53:47.026Z, expiresAt=null, 
snapshotDirs=[/tmp/apache-cassandra-5.1-SNAPSHOT/data/data2/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test,
 
/tmp/apache-cassandra-5.1-SNAPSHOT/data/data1/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test],
 ephemeral=false}

$ ls -ltra 
data/data1/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test/
ls: cannot access 
'data/data1/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/snapshots/test/':
 No such file or directory <-- other snapshot subdirectory "data1" was removed
{code}
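
To illustrate the two options above, the check could look roughly like this 
(the {{TableSnapshot}} accessors used here are assumptions for illustration):

{code:java}
// Option 1: only stop tracking the snapshot once every one of its
// per-data-directory subdirectories has been removed.
private static boolean allDirectoriesRemoved(TableSnapshot snapshot)
{
    return snapshot.getDirectories().stream().noneMatch(File::exists);
}

// Option 2: treat the snapshot as valid for as long as its manifest.json exists,
// which makes the manifest a requirement for being tracked by SnapshotManager.
private static boolean hasManifest(TableSnapshot snapshot)
{
    return snapshot.getManifestFile().exists();
}
{code}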

> Centralize all snapshot operations to SnapshotManager and cache snapshots
> -
>
>

[jira] [Comment Edited] (CASSANDRA-18111) Cache snapshots in memory

2024-06-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856088#comment-17856088
 ] 

Paulo Motta edited comment on CASSANDRA-18111 at 6/19/24 12:49 AM:
---

I was thinking that since this is just a cache, perhaps we could have a 
{{snapshot_metadata_cache_size: 100MiB}} setting so the amount of memory used 
for snapshot metadata would be capped while providing the optimization by 
default? Users wishing to disable it could just set 
{{snapshot_metadata_cache_size: 0MiB}}.

It would be nice to validate how much this improves {{select * from 
system_views.snapshots}} performance for large snapshot × keyspace × table × 
sstable counts.


was (Author: paulo):
I was thinking that since this is just a cache, perhaps we could have a 
{{snapshot_metadata_cache_size: 100MiB }}setting so the amount of memory used 
for snapshot metadata would be capped while providing the optimization by 
default ? Users wishing to disable  could just set 
{{{}snapshot_metadata_cache_size: 0MiB{}}}.

It would be nice to validate how much this improves select * from 
system_views.snapshots performance for large snapshot * keyspace  * table * 
sstable counts.

> Cache snapshots in memory
> -
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Snapshots
>Reporter: Paulo Motta
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are 
> scanned to find snapshots, which is inefficient.
> For example, fetching the 
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric 
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via 
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18111) Cache snapshots in memory

2024-06-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856088#comment-17856088
 ] 

Paulo Motta edited comment on CASSANDRA-18111 at 6/19/24 12:48 AM:
---

I was thinking that since this is just a cache, perhaps we could have a 
{{snapshot_metadata_cache_size: 100MiB}} setting so the amount of memory used 
for snapshot metadata would be capped while providing the optimization by 
default? Users wishing to disable it could just set 
{{snapshot_metadata_cache_size: 0MiB}}.

It would be nice to validate how much this improves {{select * from 
system_views.snapshots}} performance for large snapshot × keyspace × table × 
sstable counts.


was (Author: paulo):
I was thinking that since this is just a cache, perhaps we could have a 
{{snapshot_metadata_cache_size: 100MiB}} setting so the amount of memory used 
for snapshot metadata would be capped while providing the optimization by 
default ? Users wishing to disable the  could just set 
{{snapshot_metadata_cache_size: 0MiB.
}}
It would be nice to validate how much this improves select * 
system_views.snapshots performance for large snapshot * keyspace  * table * 
sstable counts.

> Cache snapshots in memory
> -
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Snapshots
>Reporter: Paulo Motta
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are 
> scanned to find snapshots, which is inefficient.
> For example, fetching the 
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric 
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via 
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18111) Cache snapshots in memory

2024-06-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856088#comment-17856088
 ] 

Paulo Motta commented on CASSANDRA-18111:
-

I was thinking that since this is just a cache, perhaps we could have a 
{{snapshot_metadata_cache_size: 100MiB}} setting so the amount of memory used 
for snapshot metadata would be capped while providing the optimization by 
default? Users wishing to disable it could just set 
{{snapshot_metadata_cache_size: 0MiB}}.

It would be nice to validate how much this improves {{select * from 
system_views.snapshots}} performance for large snapshot × keyspace × table × 
sstable counts.

> Cache snapshots in memory
> -
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Snapshots
>Reporter: Paulo Motta
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are 
> scanned to find snapshots, which is inefficient.
> For example, fetching the 
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric 
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via 
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18111) Cache snapshots in memory

2024-06-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856039#comment-17856039
 ] 

Paulo Motta commented on CASSANDRA-18111:
-

{quote}Is there a way to disable this functionality? I briefly took a look at 
the implementation but didn't see anything that would allow you to disable it. 
My concern is that we've seen issues with the amount of snapshots on large 
clusters, so this can be problematic for some clusters by putting additional 
memory pressure on individual hosts.
{quote}
I would like to understand what kind of problems you encountered so we can 
try to address them here if possible. The goal of this ticket is precisely to 
optimize for a large number of snapshots by avoiding an expensive directory 
traversal when snapshots are listed, so I think it would be counterproductive 
to disable this. See CASSANDRA-13338, which is the original motivation for this 
ticket.

We have not considered the memory cost of keeping this snapshot metadata in 
memory, but perhaps this is something to consider for large amounts of 
snapshots. Do you have a ballpark number for a very large amount of snapshots 
per node in your experience? 10K, 100K, 1M?

> Cache snapshots in memory
> -
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Snapshots
>Reporter: Paulo Motta
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are 
> scanned to find snapshots, which is inefficient.
> For example, fetching the 
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric 
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via 
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18111) Cache snapshots in memory

2024-06-13 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18111:

Reviewers: Paulo Motta

> Cache snapshots in memory
> -
>
> Key: CASSANDRA-18111
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18111
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Snapshots
>Reporter: Paulo Motta
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Every time {{nodetool listsnapshots}} is called, all data directories are 
> scanned to find snapshots, which is inefficient.
> For example, fetching the 
> {{org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize}} metric 
> can take half a second (CASSANDRA-13338).
> This improvement will also allow snapshots to be efficiently queried via 
> virtual tables (CASSANDRA-18102).
> In order to do this, we should:
> a) load all snapshots from disk during initialization
> b) keep a collection of snapshots on {{SnapshotManager}}
> c) update the snapshots collection anytime a new snapshot is taken or cleared
> d) detect when a snapshot is manually removed from disk.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19512) Add startup flag to load a local snapshot from disk

2024-04-02 Thread Paulo Motta (Jira)
Paulo Motta created CASSANDRA-19512:
---

 Summary: Add startup flag to load a local snapshot from disk
 Key: CASSANDRA-19512
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19512
 Project: Cassandra
  Issue Type: Improvement
  Components: Local/Snapshots
Reporter: Paulo Motta
Assignee: Paulo Motta


Add a startup flag "cassandra.load_snapshot_unsafe=snapshot_id" that loads the 
snapshot with the specified ID into the sstable tracker in the initial startup 
phase.

The flag has the {{_unsafe}} suffix because it may cause data consistency 
issues if used incorrectly. For example, if a given snapshot is loaded on a 
single replica of a replicated keyspace, it may cause replicas to go out of 
sync. For this reason, this flag should only be accepted if the 
"allow_load_snapshot_unsafe" guardrail is enabled (it is disabled by default).

When the flag is detected during startup, snapshots with the given tag will be 
located. If no snapshot with the given tag exists, the startup should fail.

The snapshot loading mechanism should hard-link the existing sstables into a 
staging area to ensure the existing data is preserved. After this, it should 
replace the existing sstables with the snapshot data in the sstable tracker 
before proceeding normally with startup.
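
A very rough sketch of the startup hook (the flag and guardrail names come from 
this description; every helper method below is hypothetical):

{code:java}
// Hypothetical startup-time handling of cassandra.load_snapshot_unsafe.
String tag = System.getProperty("cassandra.load_snapshot_unsafe");
if (tag != null)
{
    if (!allowLoadSnapshotUnsafeGuardrailEnabled())              // hypothetical guardrail check
        throw new ConfigurationException("cassandra.load_snapshot_unsafe requires the " +
                                         "allow_load_snapshot_unsafe guardrail to be enabled");

    List<TableSnapshot> snapshots = findSnapshotsByTag(tag);     // hypothetical lookup
    if (snapshots.isEmpty())
        throw new ConfigurationException("No snapshot found for tag '" + tag + "', aborting startup");

    hardLinkCurrentSSTablesToStagingArea();                      // secure the existing data first
    loadSnapshotIntoSSTableTracker(snapshots);                   // then swap in the snapshot data
}
{code}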



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-02-27 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821475#comment-17821475
 ] 

Paulo Motta commented on CASSANDRA-17401:
-

Thanks for the detailed reports and repro steps. I've taken a look and this 
looks to me like a legitimate race condition that can cause a re-prepare storm 
under high concurrency and unlucky timing.

My understanding is that [these evict 
statements|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L735]
 are not required for the correctness of the upgrade compatibility logic and 
can be safely removed. Would you have some cycles to confirm this [~ifesdjeen] ?

In addition to this, I think there's a pending issue from CASSANDRA-17248 that 
can leak prepared statements between keyspaces during mixed upgrade mode. Since 
these issues are in a related area I think it makes sense to address them 
together (in separate commits) to ensure these changes are tested together.

I think the {{PreparedStatementCollisionTest}} suite from [this 
commit|https://github.com/apache/cassandra/pull/1872/commits/758bc4a89d7ca9d0bfe27e6f41000484724261bc]
 can help improve the validation coverage of this logic. That change looks 
correct to me but may need some cleanup. We should probably keep the metric 
changes out of this to keep the scope of this patch to a minimum.

After proper review and validation I think there's value in including these 
fixes in the final 3.x releases to address these outstanding issues, as users 
will still do upgrade cycles as the 5.x release approaches. This will make 
resolution more laborious, as we will need to provide patches for 3.x all the 
way up to trunk, plus CI for all branches. What do you think [~brandon.williams] 
[~stefan.miklosovic]?

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Ivan Senic
>Assignee: Jaydeepkumar Chovatia
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause removing of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but statement is not in the cache as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> a client side is highly unlikely and I can not simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when  safeToReturnCached is resolved as 
> false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-02-27 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-17401:

 Bug Category: Parent values: Correctness(12982), Level 1 values: Transient 
Incorrect Response(12987)
   Complexity: Normal
  Component/s: Messaging/Client
Discovered By: User Report
Reviewers: Paulo Motta
 Severity: Normal
 Assignee: Jaydeepkumar Chovatia
   Status: Open  (was: Triage Needed)

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Ivan Senic
>Assignee: Jaydeepkumar Chovatia
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause removing of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but statement is not in the cache as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> a client side is highly unlikely and I can not simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when  safeToReturnCached is resolved as 
> false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19389) Start UCS docs with examples and use cases

2024-02-12 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19389:

Change Category: Operability
 Complexity: Low Hanging Fruit
 Status: Open  (was: Triage Needed)

> Start UCS docs with examples and use cases
> --
>
> Key: CASSANDRA-19389
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19389
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jon Haddad
>Priority: Normal
>
> Users interested in UCS are primarily going to be interested in examples of 
> how UCS should be used for certain types of workloads.  We start the current 
> docs by saying it can replace every other compaction strategy, but leave it 
> up to the user to figure out exactly what that means for them.  
> Before the docs that explain how it works, let's describe how it should be 
> used.  Users interested in the nuts and bolts can scroll down to learn the 
> details, but that shouldn't be a requirement to switch from an existing 
> compaction strategy to UCS.
> A table showing examples of LCS, STCS, and TWCS converted to UCS would 
> suffice for 99% of people's needs.  
> More information in this Slack thread: 
> https://the-asf.slack.com/archives/CK23JSY2K/p1707700814330359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19389) Start UCS docs with examples and use cases

2024-02-12 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19389:

Labels: lhf  (was: )

> Start UCS docs with examples and use cases
> --
>
> Key: CASSANDRA-19389
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19389
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jon Haddad
>Priority: Normal
>  Labels: lhf
>
> Users interested in UCS are primarily going to be interested in examples of 
> how UCS should be used for certain types of workloads.  We start the current 
> docs by saying it can replace every other compaction strategy, but leave it 
> up to the user to figure out exactly what that means for them.  
> Before the docs that explain how it works, let's describe how it should be 
> used.  Users interested in the nuts and bolts can scroll down to learn the 
> details, but that shouldn't be a requirement to switch from an existing 
> compaction strategy to UCS.
> A table showing examples of LCS, STCS, and TWCS converted to UCS would 
> suffice for 99% of people's needs.  
> More information in this Slack thread: 
> https://the-asf.slack.com/archives/CK23JSY2K/p1707700814330359



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-17798) Flaky org.apache.cassandra.tools TopPartitionsTest testServiceTopPartitionsSingleTable

2024-02-08 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815902#comment-17815902
 ] 

Paulo Motta commented on CASSANDRA-17798:
-

FYI failed on https://ci-cassandra.apache.org/job/Cassandra-4.1/465/testReport/

> Flaky org.apache.cassandra.tools TopPartitionsTest 
> testServiceTopPartitionsSingleTable
> --
>
> Key: CASSANDRA-17798
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17798
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.1.x, 5.x
>
>
> h3.  
> {code:java}
> Error Message
> If this failed you probably have to raise the beginLocalSampling duration 
> expected:<1> but was:<0>
> Stacktrace
> junit.framework.AssertionFailedError: If this failed you probably have to 
> raise the beginLocalSampling duration expected:<1> but was:<0> at 
> org.apache.cassandra.tools.TopPartitionsTest.testServiceTopPartitionsSingleTable(TopPartitionsTest.java:83)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Standard Output
> INFO [main] 2022-08-02 01:49:49,333 YamlConfigurationLoader.java:104 - 
> Configuration location: 
> file:home/cassandra/cassandra/build/test/cassandra.cdc.yaml DEBUG [main] 
> 2022-08-02 01:49:49,339 YamlConfigurationLoader.java:124 - Loading settings 
> from file:home/cassandra/cassandra/build/test/cassandra.cdc.yaml INFO 
> [main] 2022-08-02 01:49:49,642 Config.java:1167 - Node 
> configuration:[allocate_tokens_for_keyspace=null; 
> allocate_tokens_for_local_replication_factor=null; allow_extra_insecure 
> ...[truncated 50809 chars]... lizing counter cache with capacity of 2 MiBs 
> INFO [MemtableFlushWriter:1] 2022-08-02 01:49:53,519 CacheService.java:163 - 
> Scheduling counter cache save to every 7200 seconds (going to save all keys). 
> DEBUG [MemtableFlushWriter:1] 2022-08-02 01:49:53,575 
> ColumnFamilyStore.java:1330 - Flushed to 
> [BigTableReader(path='/home/cassandra/cassandra/build/test/cassandra/data/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/nb-1-big-Data.db')]
>  (1 sstables, 4.915KiB), biggest 4.915KiB, smallest 4.915KiB
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-17298) Test Failure: org.apache.cassandra.cql3.MemtableSizeTest

2024-02-08 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815684#comment-17815684
 ] 

Paulo Motta commented on CASSANDRA-17298:
-

Looks like this is failing consistently in both 4.0/4.1:
* 
https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-4.0/cassandra-4.0
* 
[https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-4.0/cassandra-4.1]

[~e.dimitrova] I wonder if this should've been addressed by CASSANDRA-16684 or 
if it's a new issue. I am able to reproduce the failure locally on 
cassandra-4.1, even after increasing rerunsOnFailure from 2 to 4.


{noformat}
java.lang.AssertionError: Expected heap usage close to 75.335MiB, got 71.163MiB.
at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at 
org.apache.cassandra.cql3.MemtableSizeTest.testSizeFlaky(MemtableSizeTest.java:149)
    at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:696)
{noformat}

> Test Failure: org.apache.cassandra.cql3.MemtableSizeTest
> 
>
> Key: CASSANDRA-17298
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17298
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Josh McKenzie
>Priority: Normal
> Fix For: 4.0.x
>
>
> [https://ci-cassandra.apache.org/job/Cassandra-4.0/313/testReport/org.apache.cassandra.cql3/MemtableSizeTest/testTruncationReleasesLogSpace_2/]
>  Failed 4 times in the last 30 runs. Flakiness: 27%, Stability: 86%
> Error Message
> Expected heap usage close to 49.930MiB, got 41.542MiB.
> {code}
> Stacktrace
> junit.framework.AssertionFailedError: Expected heap usage close to 49.930MiB, 
> got 41.542MiB.
>   at 
> org.apache.cassandra.cql3.MemtableSizeTest.testSize(MemtableSizeTest.java:130)
>   at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:644)
>   at org.apache.cassandra.Util.flakyTest(Util.java:669)
>   at 
> org.apache.cassandra.cql3.MemtableSizeTest.testTruncationReleasesLogSpace(MemtableSizeTest.java:61)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-17298) Test Failure: org.apache.cassandra.cql3.MemtableSizeTest

2024-02-08 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-17298:

Summary: Test Failure: org.apache.cassandra.cql3.MemtableSizeTest  (was: 
Test Failure: 
org.apache.cassandra.cql3.MemtableSizeTest.testTruncationReleasesLogSpace)

> Test Failure: org.apache.cassandra.cql3.MemtableSizeTest
> 
>
> Key: CASSANDRA-17298
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17298
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Josh McKenzie
>Priority: Normal
> Fix For: 4.0.x
>
>
> [https://ci-cassandra.apache.org/job/Cassandra-4.0/313/testReport/org.apache.cassandra.cql3/MemtableSizeTest/testTruncationReleasesLogSpace_2/]
>  Failed 4 times in the last 30 runs. Flakiness: 27%, Stability: 86%
> Error Message
> Expected heap usage close to 49.930MiB, got 41.542MiB.
> {code}
> Stacktrace
> junit.framework.AssertionFailedError: Expected heap usage close to 49.930MiB, 
> got 41.542MiB.
>   at 
> org.apache.cassandra.cql3.MemtableSizeTest.testSize(MemtableSizeTest.java:130)
>   at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:644)
>   at org.apache.cassandra.Util.flakyTest(Util.java:669)
>   at 
> org.apache.cassandra.cql3.MemtableSizeTest.testTruncationReleasesLogSpace(MemtableSizeTest.java:61)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-29 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812033#comment-17812033
 ] 

Paulo Motta commented on CASSANDRA-17401:
-

[~chovatia.jayd...@gmail.com] I have not been able to review this yet; I will 
send an update when I get a chance. If anyone else subscribed wants to review 
this in the meantime, feel free to take it.

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause removing of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but statement is not in the cache as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> a client side is highly unlikely and I can not simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when  safeToReturnCached is resolved as 
> false.
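
To make the quoted interleaving concrete, below is a minimal, self-contained 
sketch of the failure mode. It is not Cassandra's actual QueryProcessor code; 
the class, method and cache names are made up for illustration only:

{code:java}
import java.util.concurrent.ConcurrentHashMap;

public class PrepareRaceSketch
{
    static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    // Stand-in for a prepare path that evicts stale hashes before caching.
    static String prepare(String query)
    {
        evictStaleHashes(query);                      // "executes eviction of hashes"
        String digest = "md5-" + query.hashCode();    // stand-in for MD5Digest
        cache.put(digest, query);                     // "prepares the statement and caches it"
        return digest;
    }

    static void evictStaleHashes(String query)
    {
        // The real code evicts only hashes related to the same query; clearing
        // everything here just makes the race easy to hit.
        cache.clear();
    }

    public static void main(String[] args) throws InterruptedException
    {
        Thread t1 = new Thread(() -> {
            String digest = prepare("SELECT * FROM ks.tbl");
            // Thread2's eviction may run between put() and this lookup,
            // which is the "statement not in the cache" failure described above.
            if (cache.get(digest) == null)
                System.out.println("prepared statement missing from cache (race hit)");
        });
        Thread t2 = new Thread(() -> prepare("SELECT * FROM ks.tbl"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
{code}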



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19291) Fix NEWS.txt Compact Storage section

2024-01-26 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811309#comment-17811309
 ] 

Paulo Motta edited comment on CASSANDRA-19291 at 1/26/24 3:00 PM:
--

Thanks Ekaterina and apologies for the delay. LGTM, feel free to merge it.


was (Author: paulo):
Thanks Ekaterina and apologies for the delay. Feel free to merge it.

> Fix NEWS.txt Compact Storage section
> 
>
> Key: CASSANDRA-19291
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19291
> Project: Cassandra
>  Issue Type: Task
>  Components: Documentation
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Low
> Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> In CASSANDRA-16733 we added a note that Compact Storage will no longer be 
> supported in 5.0. The idea was that drop_compact_storage would be pulled out 
> of the experimental version. 
> This did not happen, and compact storage is still around. 
> I think this will not be handled at least until 6.0 (major breaking changes) 
> and it is good to be corrected. More and more people are upgrading to 4.0+ 
> and they are confused. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19291) Fix NEWS.txt Compact Storage section

2024-01-26 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19291:

Status: Ready to Commit  (was: Review In Progress)

Thanks Ekaterina and apologies for the delay. Feel free to merge it.

> Fix NEWS.txt Compact Storage section
> 
>
> Key: CASSANDRA-19291
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19291
> Project: Cassandra
>  Issue Type: Task
>  Components: Documentation
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Low
> Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> In CASSANDRA-16733 we added a note that Compact Storage will no longer be 
> supported in 5.0. The idea was that drop_compact_storage would be pulled out 
> of the experimental version. 
> This did not happen, and compact storage is still around. 
> I think this will not be handled at least until 6.0 (major breaking changes) 
> and it is good to be corrected. More and more people are upgrading to 4.0+ 
> and they are confused. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19291) Fix NEWS.txt Compact Storage section

2024-01-23 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810101#comment-17810101
 ] 

Paulo Motta commented on CASSANDRA-19291:
-

Is there a ticket to take "DROP COMPACT STORAGE" out of experimental mode? If 
so it would probably be nice to link the Jira# in the message so people can 
track it.

Otherwise LGTM.

> Fix NEWS.txt Compact Storage section
> 
>
> Key: CASSANDRA-19291
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19291
> Project: Cassandra
>  Issue Type: Task
>  Components: Documentation
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Low
> Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> In CASSANDRA-16733 we added a note that Compact Storage will no longer be 
> supported in 5.0. The idea was that drop_compact_storage would be pulled out 
> of the experimental version. 
> This did not happen, and compact storage is still around. 
> I think this will not be handled at least until 6.0 (major breaking changes) 
> and it is good to be corrected. More and more people are upgrading to 4.0+ 
> and they are confused. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRASC-94) Reduce filesystem calls while streaming SSTables

2024-01-22 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRASC-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809735#comment-17809735
 ] 

Paulo Motta commented on CASSANDRASC-94:


Cool, thanks for clarifying! I can create a follow-up sidecar ticket if there's 
movement on CASSANDRA-18111.

> Reduce filesystem calls while streaming SSTables
> 
>
> Key: CASSANDRASC-94
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-94
> Project: Sidecar for Apache Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: pull-request-available
>
> When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will 
> perform multiple filesystem calls:
> - Traverse the data directories to determine the keyspace / table path
> - Once found determine if the SSTable file exists under the snapshots 
> directory
> - Read the filesystem to obtain the file type and file size
> - Read the requested range of the file and stream it
> The number of filesystem calls is manageable when streaming a single SSTable, 
> but when clients read multiple SSTables, for example in the case of Cassandra 
> Analytics bulk reads, hundreds to thousands of requests are performed, each 
> requiring the above system calls.
> This improvement proposes introducing several caches to reduce the number of 
> system calls while streaming SSTables.
> - *snapshot list cache*: to maintain a cache of recently listed snapshot 
> files under a snapshot directory. This cache avoids having to access the 
> filesystem every time a bulk read client lists the snapshot directory.
> - *table dir cache*: to maintain a cache of recently streamed table directory 
> paths. This cache helps avoid having to traverse the filesystem searching for 
> the table directory while running bulk reads, for example. Since bulk reads 
> can stream tens to hundreds of SSTable components from a snapshot directory, 
> this cache helps avoid having to resolve the table directory each time.
> - *snapshot path cache*: to maintain a cache of recently streamed snapshot 
> SSTable components. This cache avoids having to resolve the snapshot SSTable 
> component path during bulk reads. Since bulk reads stream sub-ranges of an 
> SSTable component, the resolution can happen multiple times during bulk reads 
> for a single SSTable component.
> - *file props cache*: to maintain a cache of FileProps of recently streamed 
> files. This cache avoids having to validate file properties during bulk reads 
> for example where sub-ranges of an SSTable component are streamed, therefore 
> reading the file properties can occur multiple times during bulk reads of the 
> same file.
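
As an illustration of the last item above, here is a minimal sketch of a 
memoizing file-properties cache; the FileProps type and method names are 
hypothetical and this is not the Sidecar implementation:

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;

public class FilePropsCache
{
    // Hypothetical value type; the real Sidecar type may differ.
    public record FileProps(long size, boolean regularFile) {}

    private final ConcurrentHashMap<Path, FileProps> cache = new ConcurrentHashMap<>();

    public FileProps get(Path file)
    {
        // computeIfAbsent issues at most one stat per cached path, so repeated
        // sub-range requests for the same component skip the filesystem.
        return cache.computeIfAbsent(file, p -> {
            try
            {
                return new FileProps(Files.size(p), Files.isRegularFile(p));
            }
            catch (IOException e)
            {
                throw new UncheckedIOException(e);
            }
        });
    }

    public void invalidate(Path file)
    {
        cache.remove(file);
    }
}
{code}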



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRASC-94) Reduce filesystem calls while streaming SSTables

2024-01-22 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRASC-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809601#comment-17809601
 ] 

Paulo Motta commented on CASSANDRASC-94:


I am planning to add support for caching snapshots in memory on the server as 
part of CASSANDRA-18111 (I have a draft patch but need to clean up/rebase/test 
it; it should take a couple of weeks to wrap up). Do you think caching 
snapshots in the sidecar will still be relevant with that in place?

One issue I see is that that functionality will probably land in 5.x, so 
sidecar caching will probably still be useful for 4.x.
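
For illustration only, a minimal sketch of that server-side idea: an in-memory 
snapshot map updated whenever a snapshot is taken or cleared, so listings are 
served from memory. The TableSnapshot type and method names below are made up 
and this is not the actual patch:

{code:java}
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotCacheSketch
{
    // Hypothetical snapshot descriptor; field names are made up.
    public record TableSnapshot(String keyspace, String table, String tag) {}

    // Keyed by "keyspace.table:tag"
    private final Map<String, TableSnapshot> snapshots = new ConcurrentHashMap<>();

    public void onSnapshotTaken(TableSnapshot snapshot)
    {
        snapshots.put(key(snapshot), snapshot);
    }

    public void onSnapshotCleared(TableSnapshot snapshot)
    {
        snapshots.remove(key(snapshot));
    }

    public Collection<TableSnapshot> listSnapshots()
    {
        return snapshots.values(); // served from memory, no directory scan
    }

    private static String key(TableSnapshot s)
    {
        return s.keyspace() + '.' + s.table() + ':' + s.tag();
    }
}
{code}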

> Reduce filesystem calls while streaming SSTables
> 
>
> Key: CASSANDRASC-94
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-94
> Project: Sidecar for Apache Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: pull-request-available
>
> When streaming snapshotted SSTables from Cassandra Sidecar, Sidecar will 
> perform multiple filesystem calls:
> - Traverse the data directories to determine the keyspace / table path
> - Once found determine if the SSTable file exists under the snapshots 
> directory
> - Read the filesystem to obtain the file type and file size
> - Read the requested range of the file and stream it
> The number of filesystem calls is manageable when streaming a single SSTable, 
> but when clients read multiple SSTables, for example in the case of Cassandra 
> Analytics bulk reads, hundreds to thousands of requests are performed, each 
> requiring the above system calls.
> This improvement proposes introducing several caches to reduce the number of 
> system calls while streaming SSTables.
> - *snapshot list cache*: to maintain a cache of recently listed snapshot 
> files under a snapshot directory. This cache avoids having to access the 
> filesystem every time a bulk read client lists the snapshot directory.
> - *table dir cache*: to maintain a cache of recently streamed table directory 
> paths. This cache helps avoid having to traverse the filesystem searching for 
> the table directory while running bulk reads, for example. Since bulk reads 
> can stream tens to hundreds of SSTable components from a snapshot directory, 
> this cache helps avoid having to resolve the table directory each time.
> - *snapshot path cache*: to maintain a cache of recently streamed snapshot 
> SSTable components. This cache avoids having to resolve the snapshot SSTable 
> component path during bulk reads. Since bulk reads stream sub-ranges of an 
> SSTable component, the resolution can happen multiple times during bulk reads 
> for a single SSTable component.
> - *file props cache*: to maintain a cache of FileProps of recently streamed 
> files. This cache avoids having to validate file properties during bulk reads 
> for example where sub-ranges of an SSTable component are streamed, therefore 
> reading the file properties can occur multiple times during bulk reads of the 
> same file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-21 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809211#comment-17809211
 ] 

Paulo Motta edited comment on CASSANDRA-17401 at 1/22/24 1:54 AM:
--

Ok thanks [~chovatia.jayd...@gmail.com]! I'm not familiar with this area but 
will try to look at it if I find cycles in the next few days and nobody beats 
me to it. :)

Btw did you observe a single occurrence of this issue, or is it recurrent?


was (Author: paulo):
Ok thanks [~chovatia.jayd...@gmail.com]! I'm not familiar with this area but 
will try to look at it if I find cycles in the next few days and nobody beats 
me to it. :)

Btw did you just observe a single occurrence of this issue, or is it recurrent?

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause removing of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but statement is not in the cache as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> a client side is highly unlikely and I can not simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when  safeToReturnCached is resolved as 
> false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-21 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809211#comment-17809211
 ] 

Paulo Motta commented on CASSANDRA-17401:
-

Ok thanks [~chovatia.jayd...@gmail.com]! I'm not familiar with this area but 
will try to look at it if I find cycles in the next few days and nobody beats 
me to it. :)

Btw did you just observe a single occurrence of this issue, or is it recurrent?

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause removing of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but statement is not in the cache as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> a client side is highly unlikely and I can not simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when  safeToReturnCached is resolved as 
> false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-21 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809139#comment-17809139
 ] 

Paulo Motta commented on CASSANDRA-17401:
-

Hi [~chovatia.jayd...@gmail.com] can you provide a regression test case 
reproducing this issue and a patch with a proposed fix ?

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause removing of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but statement is not in the cache as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> a client side is highly unlikely and I can not simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when  safeToReturnCached is resolved as 
> false.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-10821) OOM Killer terminates Cassandra when Compactions use too much memory then won't restart

2024-01-19 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-10821:

Description: 
 

We were writing to the DB from EC2 instances in us-east-1 at a rate of about 
3000 per second, replication us-east:2 us-west:2, LeveledCompaction and 
DeflateCompressor.

After about 48 hours some nodes had over 800 pending compactions and a few of 
them started getting killed for Linux OOM. Priam attempts to restart the nodes, 
but they fail because of corrupted saved_caches files.

Loading has finished, and the cluster is mostly idle, but 6 of the nodes were 
killed again last night by OOM.

This is the log message where the node won't restart:

ERROR [main] 2015-12-05 13:59:13,754 CassandraDaemon.java:635 - Detected 
unreadable sstables /media/ephemeral0/cassandra/saved_caches/KeyCache-ca.db, 
please check NEWS.txt and ensure that you have upgraded through all required 
intermediate versions, running upgradesstables

This is the dmesg where the node is terminated:

[360803.234422] Out of memory: Kill process 10809 (java) score 949 or sacrifice 
child
[360803.237544] Killed process 10809 (java) total-vm:438484092kB, 
anon-rss:29228012kB, file-rss:107576kB

This is what Compaction Stats look like currently:

pending tasks: 1096
id compaction type keyspace table completed total unit progress
93eb3200-9b58-11e5-b9f1-ffef1041ec45 Compaction overlordpreprod document 
8670748796 839129219651 bytes 1.03%
Compaction system hints 30 1921326518 bytes 0.00%
Active compaction remaining time : 27h33m47s

Only 6 of the 32 nodes have compactions pending, and all on the order of 1000.

  was:
We were writing to the DB from EC2 instances in us-east-1 at a rate of about 
3000 per second, replication us-east:2 us-west:2, LeveledCompaction and 
DeflateCompressor.

After about 48 hours some nodes had over 800 pending compactions and a few of 
them started getting killed for Linux OOM. Priam attempts to restart the nodes, 
but they fail because of corrupted saved_caches files.

Loading has finished, and the cluster is mostly idle, but 6 of the nodes were 
killed again last night by OOM.

This is the log message where the node won't restart:

ERROR [main] 2015-12-05 13:59:13,754 CassandraDaemon.java:635 - Detected 
unreadable sstables /media/ephemeral0/cassandra/saved_caches/KeyCache-ca.db, 
please check NEWS.txt and ensure that you have upgraded through all required 
intermediate versions, running upgradesstables

This is the dmesg where the node is terminated:

[360803.234422] Out of memory: Kill process 10809 (java) score 949 or sacrifice 
child
[360803.237544] Killed process 10809 (java) total-vm:438484092kB, 
anon-rss:29228012kB, file-rss:107576kB

This is what Compaction Stats look like currently:

pending tasks: 1096
 id   compaction type  keyspace 
 tablecompleted  totalunit   progress
   93eb3200-9b58-11e5-b9f1-ffef1041ec45Compaction   overlordpreprod   
document   8670748796   839129219651   bytes  1.03%
   Compactionsystem 
 hints   30 1921326518   bytes  0.00%
Active compaction remaining time :  27h33m47s

Only 6 of the 32 nodes have compactions pending, and all on the order of 1000.


> OOM Killer terminates Cassandra when Compactions use too much memory then 
> won't restart
> ---
>
> Key: CASSANDRA-10821
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10821
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Compaction
> Environment: EC2 32 x i2.xlarge split between us-east-1a,c and 
> us-west 2a,b
> Linux  4.1.10-17.31.amzn1.x86_64 #1 SMP Sat Oct 24 01:31:37 UTC 2015 x86_64 
> x86_64 x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
> Cassandra version: 2.2.3
>Reporter: tbartold
>Priority: Normal
>
>  
> We were writing to the DB from EC2 instances in us-east-1 at a rate of about 
> 3000 per second, replication us-east:2 us-west:2, LeveledCompaction and 
> DeflateCompressor.
> After about 48 hours some nodes had over 800 pending compactions and a few of 
> them started getting killed for Linux OOM. Priam attempts to restart the 
> nodes, but they fail because of corrupted saved_caches files.
> Loading has finished, and the cluster is mostly idle, but 6 of the nodes were 
> killed again last night by OOM.
> This is the log message where the node won't restart:
> ERROR [main] 2015-12-05 13:59:13,754 CassandraDaemon.java:635 - Detected 
> unreadable ss

[jira] [Commented] (CASSANDRASC-92) Add restore SSTables from S3 into Cassandra feature to Cassandra Sidecar

2024-01-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRASC-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808429#comment-17808429
 ] 

Paulo Motta commented on CASSANDRASC-92:


Feel free to merge [~frankgh] - I’ll follow up later if needed when I have a 
chance to test this feature. Thanks!

> Add restore SSTables from S3 into Cassandra feature to Cassandra Sidecar
> 
>
> Key: CASSANDRASC-92
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-92
> Project: Sidecar for Apache Cassandra
>  Issue Type: New Feature
>  Components: Rest API
>Reporter: Saranya Krishnakumar
>Assignee: Saranya Krishnakumar
>Priority: Normal
>
> Through this proposal we want to add restore capability to Sidecar, for 
> Sidecar to allow restoring data from S3. As part of this patch we want to add 
> APIs for creating, updating and getting information about the restore jobs. 
> We also want to add background tasks for managing these restore jobs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2024-01-17 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807855#comment-17807855
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

{quote}If JAVA_HOME is not defined, Cassandra checks what is in PATH first. Do 
we expect users to do more modifications to PATH to adhere? It sounds a bit 
risky to me; I hope I do not overengineer it. WDYT?
{quote}
As far as I understand there is no reliable way to detect whether a local JDK 
is present other than checking if javac exists in JAVA_HOME or PATH. So the 
only way to figure out if the user is running on a JDK is to check if javac 
exists in JAVA_HOME/bin first and, if not, check on PATH - this does not look 
like overengineering to me. What do you mean by "expect users to do more 
modifications to PATH to adhere"?
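
For illustration, a minimal Java sketch of the JAVA_HOME-then-PATH javac check 
described above; this is not the actual cassandra-env.sh / startup-script 
logic, and it ignores Windows (javac.exe) for brevity:

{code:java}
import java.io.File;

public class JdkCheckSketch
{
    // Returns true if a javac binary can be found, first under JAVA_HOME/bin,
    // otherwise on any PATH entry.
    public static boolean runningOnJdk()
    {
        String javaHome = System.getenv("JAVA_HOME");
        if (javaHome != null && !javaHome.isEmpty())
            return new File(javaHome, "bin" + File.separator + "javac").canExecute();

        String path = System.getenv("PATH");
        if (path == null)
            return false;
        for (String dir : path.split(File.pathSeparator))
            if (new File(dir, "javac").canExecute())
                return true;
        return false;
    }

    public static void main(String[] args)
    {
        System.out.println("running on JDK: " + runningOnJdk());
    }
}
{code}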

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRASC-92) Add restore SSTables from S3 into Cassandra feature to Cassandra Sidecar

2024-01-17 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRASC-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807853#comment-17807853
 ] 

Paulo Motta commented on CASSANDRASC-92:


Thanks for the context [~frankgh] 

I plan to test this functionality at some point but please don't block this 
review on me. I'll add any comments later if needed.

> Add restore SSTables from S3 into Cassandra feature to Cassandra Sidecar
> 
>
> Key: CASSANDRASC-92
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-92
> Project: Sidecar for Apache Cassandra
>  Issue Type: New Feature
>  Components: Rest API
>Reporter: Saranya Krishnakumar
>Assignee: Saranya Krishnakumar
>Priority: Normal
>
> Through this proposal we want to add restore capability to Sidecar, for 
> Sidecar to allow restoring data from S3. As part of this patch we want to add 
> APIs for creating, updating and getting information about the restore jobs. 
> We also want to add background tasks for managing these restore jobs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2024-01-16 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807408#comment-17807408
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

Thanks Ekaterina!

bq. 1) Decide whether we want to be checking for runtime or javac (considering 
the case I mentioned)

I think checking for 'javac' should be fine when JAVA_HOME is not defined. If 
JAVA_HOME is defined, then we check for the existence of 
"${JAVA_HOME}/bin/javac" to determine if it's running on a JDK. Would this fix 
your edge case?

bq. 2) IMHO, we should not prevent all sjk commands from running if JRE is 
detected

Sounds good to me; the warning from sjk itself ({{ERROR 14:04:02,644 Java home 
points to /Library/Java/JavaVirtualMachines/temurin-17.jre/Contents/Home make 
sure it is not a JRE path}}) should be sufficient.

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRASC-92) Add restore SSTables from S3 into Cassandra feature to Cassandra Sidecar

2024-01-12 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRASC-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806164#comment-17806164
 ] 

Paulo Motta commented on CASSANDRASC-92:


This looks interesting! I'll take a look at this patch.

Are there plans to support sstable export capability to S3, or just restore for 
the time being?

> Add restore SSTables from S3 into Cassandra feature to Cassandra Sidecar
> 
>
> Key: CASSANDRASC-92
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-92
> Project: Sidecar for Apache Cassandra
>  Issue Type: New Feature
>  Components: Rest API
>Reporter: Saranya Krishnakumar
>Assignee: Saranya Krishnakumar
>Priority: Normal
>
> Through this proposal we want to add restore capability to Sidecar, for 
> Sidecar to allow restoring data from S3. As part of this patch we want to add 
> APIs for creating, updating and getting information about the restore jobs. 
> We also want to add background tasks for managing these restore jobs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19259) upgrade_tests.upgrade_through_versions_test consistently failing on circleci

2024-01-10 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805218#comment-17805218
 ] 

Paulo Motta commented on CASSANDRA-19259:
-

Failing tests are:
* 
upgrade_tests.upgrade_through_versions_test.TestProtoV3Upgrade_AllVersions_EndsAt_Trunk_HEAD
* 
upgrade_tests.upgrade_through_versions_test.TestProtoV3Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD
* 
upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_EndsAt_Trunk_HEAD
* 
upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD
* 
upgrade_tests.upgrade_through_versions_test.TestProtoV5Upgrade_AllVersions_EndsAt_Trunk_HEAD
* 
upgrade_tests.upgrade_through_versions_test.TestProtoV5Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD

> upgrade_tests.upgrade_through_versions_test consistently failing on circleci
> 
>
> Key: CASSANDRA-19259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19259
> Project: Cassandra
>  Issue Type: Task
>  Components: Local/Other
>Reporter: Paulo Motta
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0-rc
>
>
> This suite is consistently failing in  
> [4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1454/workflows/0357136e-cee3-42e4-900b-3347fc8d42d3/jobs/71008/tests]
>  and 
> [4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1453/workflows/dd1732df-271c-43bc-bc5f-8577c605c746/jobs/71009/tests]
>  with the following stack trace:
> {noformat}
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> >   with open(pidfile, 'rb') as f:
> E   FileNotFoundError: [Errno 2] No such file or directory: 
> '/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2100: FileNotFoundError
> During handling of the above exception, another exception occurred:
> self = 
>   object at 0x7f4c01419438>
> def test_parallel_upgrade(self):
> """
> Test upgrading cluster all at once (requires cluster downtime).
> """
> >   self.upgrade_scenario()
> upgrade_tests/upgrade_through_versions_test.py:387: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> upgrade_tests/upgrade_through_versions_test.py:491: in upgrade_scenario
> self.upgrade_to_version(version_meta, internode_ssl=internode_ssl)
> upgrade_tests/upgrade_through_versions_test.py:580: in upgrade_to_version
> jvm_args=['-Dcassandra.disable_max_protocol_auto_override=true'])  # 
> prevent protocol capping in mixed version clusters
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:906: in start
> if not self._wait_for_running(process, timeout_s=7):
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:931: in _wait_for_running
> self._update_pid(process)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> with open(pidfile, 'rb') as f:
>

[jira] [Updated] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2024-01-10 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18999:

Source Control Link: 
https://github.com/apache/cassandra/commit/475c0035e6e04526eaf50805d33156ac9b828ab6
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

Thanks Brandon and Stefan.

I'm confident these failures are unrelated, so I optimistically committed this 
to 4.0+ on 
[475c0035e6e04526eaf50805d33156ac9b828ab6|https://github.com/apache/cassandra/commit/475c0035e6e04526eaf50805d33156ac9b828ab6]
 to avoid dragging this out any longer given the current CI restrictions.

I created CASSANDRA-19259 to address these failures separately. We should 
attempt a green upgrade CI run before the next 4.0/4.1 releases.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2024-01-10 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18999:

Fix Version/s: 4.0.12
   4.1.4
   5.0-beta2
   (was: 4.0.x)
   (was: 4.1.x)
   (was: 5.0.x)

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.12, 4.1.4, 5.0-beta2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra) 01/01: Merge branch 'cassandra-4.0' into cassandra-4.1

2024-01-10 Thread paulo
This is an automated email from the ASF dual-hosted git repository.

paulo pushed a commit to branch cassandra-4.1
in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit 3d1b981d8968635660eb015292891e57d1212c2e
Merge: 4dd69dc62d 475c0035e6
Author: Paulo Motta 
AuthorDate: Wed Jan 10 11:17:01 2024 -0500

Merge branch 'cassandra-4.0' into cassandra-4.1

Closes #2968

 CHANGES.txt|  1 +
 src/java/org/apache/cassandra/gms/Gossiper.java| 15 ---
 .../schema/SystemDistributedKeyspace.java  |  2 +-
 .../apache/cassandra/tracing/TraceKeyspace.java|  4 +-
 test/unit/org/apache/cassandra/Util.java   |  2 +-
 .../org/apache/cassandra/gms/GossiperTest.java | 50 +-
 6 files changed, 61 insertions(+), 13 deletions(-)

diff --cc CHANGES.txt
index 66144ce1e6,d944415f76..ec0e7c60d7
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@@ -1,13 -1,5 +1,14 @@@
 -4.0.12
 +4.1.4
 + * Memoize Cassandra verion and add a backoff interval for failed schema 
pulls (CASSANDRA-18902)
 + * Fix StackOverflowError on ALTER after many previous schema changes 
(CASSANDRA-19166)
 + * Fixed the inconsistency between distributedKeyspaces and 
distributedAndLocalKeyspaces (CASSANDRA-18747)
 + * Internode legacy SSL storage port certificate is not hot reloaded on 
update (CASSANDRA-18681)
 + * Nodetool paxos-only repair is no longer incremental (CASSANDRA-18466)
 + * Waiting indefinitely on ReceivedMessage response in 
StreamSession#receive() can cause deadlock (CASSANDRA-18733)
 + * Allow empty keystore_password in encryption_options (CASSANDRA-18778)
 + * Skip ColumnFamilyStore#topPartitions initialization when client or tool 
mode (CASSANDRA-18697)
 +Merged from 4.0:
+  * Fix Gossiper::hasMajorVersion3Nodes to return false during minor upgrade 
(CASSANDRA-18999)
   * Revert unnecessary read lock acquisition when reading ring version in 
TokenMetadata introduced in CASSANDRA-16286 (CASSANDRA-19107)
   * Support max SSTable size in sorted CQLSSTableWriter (CASSANDRA-18941)
   * Fix nodetool repair_admin summarize-pending command to not throw exception 
(CASSANDRA-19014)
diff --cc src/java/org/apache/cassandra/gms/Gossiper.java
index 0d5db5f81c,22595b299a..018e20542d
--- a/src/java/org/apache/cassandra/gms/Gossiper.java
+++ b/src/java/org/apache/cassandra/gms/Gossiper.java
@@@ -1614,33 -1555,11 +1615,33 @@@ public class Gossiper implements IFailu
  localState.addApplicationStates(updatedStates);
  
  // get rid of legacy fields once the cluster is not in mixed mode
- if (!hasMajorVersion3Nodes())
+ if (!hasMajorVersion3OrUnknownNodes())
  localState.removeMajorVersion3LegacyApplicationStates();
  
 +// need to run STATUS or STATUS_WITH_PORT first to handle 
BOOT_REPLACE correctly (else won't be a member, so TOKENS won't be processed)
 +for (Entry updatedEntry : 
updatedStates)
 +{
 +switch (updatedEntry.getKey())
 +{
 +default:
 +continue;
 +case STATUS:
 +if 
(localState.containsApplicationState(ApplicationState.STATUS_WITH_PORT))
 +continue;
 +case STATUS_WITH_PORT:
 +}
 +doOnChangeNotifications(addr, updatedEntry.getKey(), 
updatedEntry.getValue());
 +}
 +
  for (Entry updatedEntry : 
updatedStates)
  {
 +switch (updatedEntry.getKey())
 +{
 +// We should have alredy handled these two states above:
 +case STATUS_WITH_PORT:
 +case STATUS:
 +continue;
 +}
  // filters out legacy change notifications
  // only if local state already indicates that the peer has the 
new fields
  if ((ApplicationState.INTERNAL_IP == updatedEntry.getKey() && 
localState.containsApplicationState(ApplicationState.INTERNAL_ADDRESS_AND_PORT))
diff --cc src/java/org/apache/cassandra/schema/SystemDistributedKeyspace.java
index dc40093d4d,00..d63bbace79
mode 100644,00..100644
--- a/src/java/org/apache/cassandra/schema/SystemDistributedKeyspace.java
+++ b/src/java/org/apache/cassandra/schema/SystemDistributedKeyspace.java
@@@ -1,409 -1,0 +1,409 @@@
 +/*
 + * Licensed to the Apache Software Foundation (ASF) under one
 + * or more contributor license agreements.  See the NOTICE file
 + * distributed with this work for additional information
 + * regarding copyright ownership.  The ASF licenses this file
 + * to you under the Apache License, Version 2.0 (the
 + * "License"); you may not use this file except in compliance
 + * with the License.  You may obtain a copy of the License at
 + *
 + * http://www.apache.org/licenses/LICENSE-2.0
 + *
 + * Unless required by applicable law or agreed to in writing, software
 + * distributed unde

(cassandra) branch cassandra-5.0 updated (14c773d8bc -> e04a3176ff)

2024-01-10 Thread paulo
This is an automated email from the ASF dual-hosted git repository.

paulo pushed a change to branch cassandra-5.0
in repository https://gitbox.apache.org/repos/asf/cassandra.git


from 14c773d8bc Merge branch 'cassandra-4.1' into cassandra-5.0
 new 475c0035e6 [CASSANDRA-18999] Gossiper::hasMajorVersion3Nodes returns 
true when a cluster is upgrading patch version without Cassandra 3 nodes.
 new 3d1b981d89 Merge branch 'cassandra-4.0' into cassandra-4.1
 new e04a3176ff Merge branch 'cassandra-4.1' into cassandra-5.0

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt|  1 +
 src/java/org/apache/cassandra/gms/Gossiper.java| 14 +++---
 .../schema/SystemDistributedKeyspace.java  |  2 +-
 .../apache/cassandra/tracing/TraceKeyspace.java|  4 +-
 test/unit/org/apache/cassandra/Util.java   |  2 +-
 .../org/apache/cassandra/gms/GossiperTest.java | 50 +-
 6 files changed, 61 insertions(+), 12 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra) branch cassandra-4.0 updated: [CASSANDRA-18999] Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2024-01-10 Thread paulo
This is an automated email from the ASF dual-hosted git repository.

paulo pushed a commit to branch cassandra-4.0
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/cassandra-4.0 by this push:
 new 475c0035e6 [CASSANDRA-18999] Gossiper::hasMajorVersion3Nodes returns 
true when a cluster is upgrading patch version without Cassandra 3 nodes.
475c0035e6 is described below

commit 475c0035e6e04526eaf50805d33156ac9b828ab6
Author: Isaac Reath 
AuthorDate: Fri Jan 5 12:57:21 2024 -0500

[CASSANDRA-18999] Gossiper::hasMajorVersion3Nodes returns true when a 
cluster is upgrading patch version without Cassandra 3 nodes.

This commit fixes Gossiper::hasMajorVersion3Nodes so that it does not 
return true when all hosts have a known version, no hosts are on a version 
earlier than 4.0, and there is a 4.x minor version or patch version upgrade in 
progress. Additionally, this commit improves the clarity of 
Gossiper::hasMajorVersion3Nodes's name to indicate that it will return true 
when the cluster has 3.x nodes or if the cluster state is unknown, matching the 
description in the in-line comment.

patch by Isaac Reath; reviewed by Paulo Motta and Stefan Miklosovic for 
CASSANDRA-18999

Closes #2967
---
 CHANGES.txt|  1 +
 src/java/org/apache/cassandra/gms/Gossiper.java| 15 ---
 .../repair/SystemDistributedKeyspace.java  |  2 +-
 .../apache/cassandra/tracing/TraceKeyspace.java|  4 +-
 test/unit/org/apache/cassandra/Util.java   |  2 +-
 .../org/apache/cassandra/gms/GossiperTest.java | 50 +-
 6 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 0edb216735..d944415f76 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0.12
+ * Fix Gossiper::hasMajorVersion3Nodes to return false during minor upgrade 
(CASSANDRA-18999)
  * Revert unnecessary read lock acquisition when reading ring version in 
TokenMetadata introduced in CASSANDRA-16286 (CASSANDRA-19107)
  * Support max SSTable size in sorted CQLSSTableWriter (CASSANDRA-18941)
  * Fix nodetool repair_admin summarize-pending command to not throw exception 
(CASSANDRA-19014)
diff --git a/src/java/org/apache/cassandra/gms/Gossiper.java 
b/src/java/org/apache/cassandra/gms/Gossiper.java
index f88ee44edf..22595b299a 100644
--- a/src/java/org/apache/cassandra/gms/Gossiper.java
+++ b/src/java/org/apache/cassandra/gms/Gossiper.java
@@ -170,6 +170,7 @@ public class Gossiper implements 
IFailureDetectionEventListener, GossiperMBean
  * This property and anything that checks it should be removed in 5.0
  */
 private volatile boolean upgradeInProgressPossible = true;
+private volatile boolean hasNodeWithUnknownVersion = false;
 
 public void clearUnsafe()
 {
@@ -206,14 +207,14 @@ public class Gossiper implements 
IFailureDetectionEventListener, GossiperMBean
 }
 
 // Check the release version of all the peers it heard of. Not 
necessary the peer that it has/had contacted with.
-boolean allHostsHaveKnownVersion = true;
+hasNodeWithUnknownVersion = false;
 for (InetAddressAndPort host : endpointStateMap.keySet())
 {
 CassandraVersion version = getReleaseVersion(host);
 
 //Raced with changes to gossip state, wait until next iteration
 if (version == null)
-allHostsHaveKnownVersion = false;
+hasNodeWithUnknownVersion = true;
 else if (version.compareTo(minVersion) < 0)
 minVersion = version;
 }
@@ -221,7 +222,7 @@ public class Gossiper implements 
IFailureDetectionEventListener, GossiperMBean
 if (minVersion.compareTo(SystemKeyspace.CURRENT_VERSION) < 0)
 return new ExpiringMemoizingSupplier.Memoized<>(minVersion);
 
-if (!allHostsHaveKnownVersion)
+if (hasNodeWithUnknownVersion)
 return new ExpiringMemoizingSupplier.NotMemoized<>(minVersion);
 
 upgradeInProgressPossible = false;
@@ -1466,7 +1467,7 @@ public class Gossiper implements 
IFailureDetectionEventListener, GossiperMBean
 
 EndpointState localEpStatePtr = endpointStateMap.get(ep);
 EndpointState remoteState = entry.getValue();
-if (!hasMajorVersion3Nodes())
+if (!hasMajorVersion3OrUnknownNodes())
 remoteState.removeMajorVersion3LegacyApplicationStates();
 
 /*
@@ -1554,7 +1555,7 @@ public class Gossiper implements 
IFailureDetectionEventListener, GossiperMBean
 localState.addApplicationStates(updatedStates);
 
 // get rid of legacy fields once the cluster is not in mixed mode
-if (!hasMajorVersion3Nodes())
+if (!hasMajorVersion3OrUnknownNodes())
 localState.removeMajorVersion3LegacyApplicationStates();
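
For readers skimming the diff, here is a self-contained sketch of the rule the renamed check enforces. This is not the actual Gossiper code: the map of endpoint to release version below is an illustrative stand-in for gossip state, and the helper only demonstrates "any unknown-version peer, or any pre-4.0 peer, means the legacy 3.x application states must be kept". The real patch tracks the same two facts (an unknown-version peer and the minimum reported version) over Gossiper's endpointStateMap.

{code:java}
import java.util.Map;

// Illustrative only: endpoint -> release version as reported via gossip
// (null when the version is not yet known). Not the real Gossiper code.
final class VersionCheckSketch
{
    static boolean hasMajorVersion3OrUnknownNodes(Map<String, String> endpointVersions)
    {
        for (String version : endpointVersions.values())
        {
            if (version == null)
                return true;   // raced with gossip state, version unknown
            int major = Integer.parseInt(version.split("\\.")[0]);
            if (major < 4)
                return true;   // a pre-4.0 (i.e. 3.x) node is present
        }
        return false;          // every peer is known and on 4.0 or newer
    }
}
{code}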
 
 

(cassandra) branch cassandra-4.1 updated (4dd69dc62d -> 3d1b981d89)

2024-01-10 Thread paulo
This is an automated email from the ASF dual-hosted git repository.

paulo pushed a change to branch cassandra-4.1
in repository https://gitbox.apache.org/repos/asf/cassandra.git


from 4dd69dc62d Merge branch 'cassandra-4.0' into cassandra-4.1
 new 475c0035e6 [CASSANDRA-18999] Gossiper::hasMajorVersion3Nodes returns 
true when a cluster is upgrading patch version without Cassandra 3 nodes.
 new 3d1b981d89 Merge branch 'cassandra-4.0' into cassandra-4.1

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt|  1 +
 src/java/org/apache/cassandra/gms/Gossiper.java| 15 ---
 .../schema/SystemDistributedKeyspace.java  |  2 +-
 .../apache/cassandra/tracing/TraceKeyspace.java|  4 +-
 test/unit/org/apache/cassandra/Util.java   |  2 +-
 .../org/apache/cassandra/gms/GossiperTest.java | 50 +-
 6 files changed, 61 insertions(+), 13 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19259) upgrade_tests.upgrade_through_versions_test consistently failing on circleci

2024-01-10 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19259:

Description: 
This suite is consistently failing in  
[4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1454/workflows/0357136e-cee3-42e4-900b-3347fc8d42d3/jobs/71008/tests]
 and 
[4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1453/workflows/dd1732df-271c-43bc-bc5f-8577c605c746/jobs/71009/tests]
 with the following stack trace:
{noformat}
self = 
process = 

def _update_pid(self, process):
"""
Reads pid from cassandra.pid file and stores in the self.pid
After setting up pid updates status (UP, DOWN, etc) and node.conf
"""
pidfile = os.path.join(self.get_path(), 'cassandra.pid')

start = time.time()
while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
if (time.time() - start > 30.0):
common.error("Timed out waiting for pidfile to be filled 
(current time is {})".format(datetime.now()))
break
else:
time.sleep(0.1)

try:
>   with open(pidfile, 'rb') as f:
E   FileNotFoundError: [Errno 2] No such file or directory: 
'/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'

../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2100: FileNotFoundError

During handling of the above exception, another exception occurred:

self = 


def test_parallel_upgrade(self):
"""
Test upgrading cluster all at once (requires cluster downtime).
"""
>   self.upgrade_scenario()

upgrade_tests/upgrade_through_versions_test.py:387: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
upgrade_tests/upgrade_through_versions_test.py:491: in upgrade_scenario
self.upgrade_to_version(version_meta, internode_ssl=internode_ssl)
upgrade_tests/upgrade_through_versions_test.py:580: in upgrade_to_version
jvm_args=['-Dcassandra.disable_max_protocol_auto_override=true'])  # 
prevent protocol capping in mixed version clusters
../env3.6/lib/python3.6/site-packages/ccmlib/node.py:906: in start
if not self._wait_for_running(process, timeout_s=7):
../env3.6/lib/python3.6/site-packages/ccmlib/node.py:931: in _wait_for_running
self._update_pid(process)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
process = 

def _update_pid(self, process):
"""
Reads pid from cassandra.pid file and stores in the self.pid
After setting up pid updates status (UP, DOWN, etc) and node.conf
"""
pidfile = os.path.join(self.get_path(), 'cassandra.pid')

start = time.time()
while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
if (time.time() - start > 30.0):
common.error("Timed out waiting for pidfile to be filled 
(current time is {})".format(datetime.now()))
break
else:
time.sleep(0.1)

try:
with open(pidfile, 'rb') as f:
if 
common.is_modern_windows_install(self.get_base_cassandra_version()):
self.pid = 
int(f.readline().strip().decode('utf-16').strip())
else:
self.pid = int(f.readline().strip())
except IOError as e:
>   raise NodeError('Problem starting node %s due to %s' % (self.name, 
> e), process)
E   ccmlib.node.NodeError: Problem starting node node1 due to [Errno 2] 
No such file or directory: '/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'

../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2106: NodeError
{noformat}
It's not clear whether this reproduces locally or just on circleci.

We should address these failures before the next 4.0.12 and 4.1.4 releases.

  was:
This suite is consistently failing in  
[4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1454/workflows/0357136e-cee3-42e4-900b-3347fc8d42d3/jobs/71008/tests]
 and 
[4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1453/workflows/dd1732df-271c-43bc-bc5f-8577c605c746/jobs/71009/tests]
 with the following stack trace:

{noformat}
self = 
process = 

def _update_pid(self, process):
"""
Reads pid from cassandra.pid file and stores in the self.pid
After setting up pid updates status (UP, DOWN, etc) and node.conf
"""
pidfile = os.path.join(self.get_path(), 'cassandra.pid')

start = time.time()
while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):

(cassandra) branch trunk updated (2e7c0ee5c6 -> 7d6cc31b21)

2024-01-10 Thread paulo
This is an automated email from the ASF dual-hosted git repository.

paulo pushed a change to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git


from 2e7c0ee5c6 Merge branch 'cassandra-5.0' into trunk
 new 475c0035e6 [CASSANDRA-18999] Gossiper::hasMajorVersion3Nodes returns 
true when a cluster is upgrading patch version without Cassandra 3 nodes.
 new 3d1b981d89 Merge branch 'cassandra-4.0' into cassandra-4.1
 new e04a3176ff Merge branch 'cassandra-4.1' into cassandra-5.0
 new 7d6cc31b21 Merge branch 'cassandra-5.0' into trunk

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra) 01/01: Merge branch 'cassandra-4.1' into cassandra-5.0

2024-01-10 Thread paulo
This is an automated email from the ASF dual-hosted git repository.

paulo pushed a commit to branch cassandra-5.0
in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit e04a3176ffbcb1192ac9081cd980d1e4592ba3f5
Merge: 14c773d8bc 3d1b981d89
Author: Paulo Motta 
AuthorDate: Wed Jan 10 11:20:54 2024 -0500

Merge branch 'cassandra-4.1' into cassandra-5.0

Closes #3004

 CHANGES.txt|  1 +
 src/java/org/apache/cassandra/gms/Gossiper.java| 14 +++---
 .../schema/SystemDistributedKeyspace.java  |  2 +-
 .../apache/cassandra/tracing/TraceKeyspace.java|  4 +-
 test/unit/org/apache/cassandra/Util.java   |  2 +-
 .../org/apache/cassandra/gms/GossiperTest.java | 50 +-
 6 files changed, 61 insertions(+), 12 deletions(-)

diff --cc CHANGES.txt
index 0e2306dc68,ec0e7c60d7..95047150c0
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@@ -1,42 -1,15 +1,43 @@@
 -4.1.4
 +5.0-beta2
 + * Creating a SASI index after creating an SAI index does not break secondary 
index queries (CASSANDRA-18939)
 + * Optionally fail when a non-partition-restricted query is issued against an 
index (CASSANDRA-18796)
 + * Add a startup check to fail startup when using invalid configuration with 
certain Kernel and FS type (CASSANDRA-19196)
 + * UCS min_sstable_size should not be lower than target_sstable_size lower 
bound (CASSANDRA-19112)
 + * Fix the correspondingMessagingVersion of SSTable format and improve TTL 
overflow tests coverage (CASSANDRA-19197)
 + * Fix resource cleanup after SAI query timeouts (CASSANDRA-19177)
 + * Suppress CVE-2023-6481 (CASSANDRA-19184)
 +Merged from 4.1:
   * Memoize Cassandra verion and add a backoff interval for failed schema 
pulls (CASSANDRA-18902)
   * Fix StackOverflowError on ALTER after many previous schema changes 
(CASSANDRA-19166)
 - * Fixed the inconsistency between distributedKeyspaces and 
distributedAndLocalKeyspaces (CASSANDRA-18747)
 - * Internode legacy SSL storage port certificate is not hot reloaded on 
update (CASSANDRA-18681)
 - * Nodetool paxos-only repair is no longer incremental (CASSANDRA-18466)
 - * Waiting indefinitely on ReceivedMessage response in 
StreamSession#receive() can cause deadlock (CASSANDRA-18733)
 - * Allow empty keystore_password in encryption_options (CASSANDRA-18778)
 - * Skip ColumnFamilyStore#topPartitions initialization when client or tool 
mode (CASSANDRA-18697)
  Merged from 4.0:
+  * Fix Gossiper::hasMajorVersion3Nodes to return false during minor upgrade 
(CASSANDRA-18999)
   * Revert unnecessary read lock acquisition when reading ring version in 
TokenMetadata introduced in CASSANDRA-16286 (CASSANDRA-19107)
 +Merged from 3.11:
 +Merged from 3.0:
 +
 +
 +5.0-beta1
 + * Fix SAI intersection queries (CASSANDRA-19011)
 + * Clone EndpointState before sending GossipShutdown message (CASSANDRA-19115)
 + * SAI indexes are marked queryable during truncation (CASSANDRA-19032)
 + * Enable Direct-IO feature for CommitLog files using Java native API's. 
(CASSANDRA-18464)
 + * SAI fixes for composite partitions, and static and non-static rows 
intersections (CASSANDRA-19034)
 + * Improve SAI IndexContext handling of indexed and non-indexed columns in 
queries (CASSANDRA-18166)
 + * Fixed bug where UnifiedCompactionTask constructor was calling the wrong 
base constructor of CompactionTask (CASSANDRA-18757)
 + * Fix SAI unindexed contexts not considering CONTAINS KEY (CASSANDRA-19040)
 + * Ensure that empty SAI column indexes do not fail on validation after 
full-SSTable streaming (CASSANDRA-19017)
 + * SAI in-memory index should check max term size (CASSANDRA-18926)
 + * Set default disk_access_mode to mmap_index_only (CASSANDRA-19021)
 + * Exclude net.java.dev.jna:jna dependency from dependencies of 
org.caffinitas.ohc:ohc-core (CASSANDRA-18992)
 + * Add UCS sstable_growth and min_sstable_size options (CASSANDRA-18945)
 + * Make cqlsh's min required Python version 3.7+ instead of 3.6+ 
(CASSANDRA-18960)
 + * Fix incorrect seeking through the sstable iterator by IndexState 
(CASSANDRA-18932)
 + * Upgrade Python driver to 3.28.0 (CASSANDRA-18960)
 + * Add retries to IR messages (CASSANDRA-18962)
 + * Add metrics and logging to repair retries (CASSANDRA-18952)
 + * Remove deprecated code in Cassandra 1.x and 2.x (CASSANDRA-18959)
 + * ClientRequestSize metrics should not treat CONTAINS restrictions as being 
equality-based (CASSANDRA-18896)
 +Merged from 4.0:
   * Support max SSTable size in sorted CQLSSTableWriter (CASSANDRA-18941)
   * Fix nodetool repair_admin summarize-pending command to not throw exception 
(CASSANDRA-19014)
   * Fix cassandra-stress in simplenative mode with prepared statements 
(CASSANDRA-18744)
diff --cc src/java/org/apache/cassandra/gms/Gossiper.java
index b5b0caec77,018e20542d..5a616a4eae
--- a/src/java/org/apache/cassandra/gms/Gossiper.java
+++ b/src/java/org/apache/cassandra/gms/Gossiper.java
@@@ -229,15 -219,14 +230,15 @@@ p

(cassandra) 01/01: Merge branch 'cassandra-5.0' into trunk

2024-01-10 Thread paulo
This is an automated email from the ASF dual-hosted git repository.

paulo pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit 7d6cc31b216f464e404527eeb10c6f1ab97ab828
Merge: 2e7c0ee5c6 e04a3176ff
Author: Paulo Motta 
AuthorDate: Wed Jan 10 11:22:03 2024 -0500

Merge branch 'cassandra-5.0' into trunk



-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19259) upgrade_tests.upgrade_through_versions_test consistently failing on circleci

2024-01-10 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805209#comment-17805209
 ] 

Paulo Motta commented on CASSANDRA-19259:
-

[~stefan.miklosovic] can you try to reproduce this locally if you have a dtest 
setup? I can try but still need to setup my environment.

I want to check if this reproduces locally or if it's a CI issue.

> upgrade_tests.upgrade_through_versions_test consistently failing on circleci
> 
>
> Key: CASSANDRA-19259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19259
> Project: Cassandra
>  Issue Type: Task
>  Components: Local/Other
>Reporter: Paulo Motta
>Priority: Normal
> Fix For: 4.0.12, 4.1.4, 5.0-beta2
>
>
> This suite is consistently failing in  
> [4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1454/workflows/0357136e-cee3-42e4-900b-3347fc8d42d3/jobs/71008/tests]
>  and 
> [4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1453/workflows/dd1732df-271c-43bc-bc5f-8577c605c746/jobs/71009/tests]
>  with the following stack trace:
> {noformat}
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> >   with open(pidfile, 'rb') as f:
> E   FileNotFoundError: [Errno 2] No such file or directory: 
> '/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2100: FileNotFoundError
> During handling of the above exception, another exception occurred:
> self = 
>   object at 0x7f4c01419438>
> def test_parallel_upgrade(self):
> """
> Test upgrading cluster all at once (requires cluster downtime).
> """
> >   self.upgrade_scenario()
> upgrade_tests/upgrade_through_versions_test.py:387: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> upgrade_tests/upgrade_through_versions_test.py:491: in upgrade_scenario
> self.upgrade_to_version(version_meta, internode_ssl=internode_ssl)
> upgrade_tests/upgrade_through_versions_test.py:580: in upgrade_to_version
> jvm_args=['-Dcassandra.disable_max_protocol_auto_override=true'])  # 
> prevent protocol capping in mixed version clusters
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:906: in start
> if not self._wait_for_running(process, timeout_s=7):
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:931: in _wait_for_running
> self._update_pid(process)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> with open(pidfile, 'rb') as f:
> if 
> common.is_modern_windows_install(self.get_base_cassandra_version()):
> self.pid = 
> int(f.readline().strip().decode('utf-16').strip())
> else:
> self.pid = int(f.readline().strip())
> except IOError as e:
> >   raise NodeError('Problem starting node %

[jira] [Updated] (CASSANDRA-19259) upgrade_tests.upgrade_through_versions_test consistently failing on circleci

2024-01-10 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19259:

Change Category: Quality Assurance
 Complexity: Normal
Component/s: Local/Other
  Fix Version/s: 4.0.12
 4.1.4
 5.0-beta2
 Status: Open  (was: Triage Needed)

> upgrade_tests.upgrade_through_versions_test consistently failing on circleci
> 
>
> Key: CASSANDRA-19259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19259
> Project: Cassandra
>  Issue Type: Task
>  Components: Local/Other
>Reporter: Paulo Motta
>Priority: Normal
> Fix For: 4.0.12, 4.1.4, 5.0-beta2
>
>
> This suite is consistently failing in  
> [4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1454/workflows/0357136e-cee3-42e4-900b-3347fc8d42d3/jobs/71008/tests]
>  and 
> [4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1453/workflows/dd1732df-271c-43bc-bc5f-8577c605c746/jobs/71009/tests]
>  with the following stack trace:
> {noformat}
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> >   with open(pidfile, 'rb') as f:
> E   FileNotFoundError: [Errno 2] No such file or directory: 
> '/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2100: FileNotFoundError
> During handling of the above exception, another exception occurred:
> self = 
>   object at 0x7f4c01419438>
> def test_parallel_upgrade(self):
> """
> Test upgrading cluster all at once (requires cluster downtime).
> """
> >   self.upgrade_scenario()
> upgrade_tests/upgrade_through_versions_test.py:387: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> upgrade_tests/upgrade_through_versions_test.py:491: in upgrade_scenario
> self.upgrade_to_version(version_meta, internode_ssl=internode_ssl)
> upgrade_tests/upgrade_through_versions_test.py:580: in upgrade_to_version
> jvm_args=['-Dcassandra.disable_max_protocol_auto_override=true'])  # 
> prevent protocol capping in mixed version clusters
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:906: in start
> if not self._wait_for_running(process, timeout_s=7):
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:931: in _wait_for_running
> self._update_pid(process)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> with open(pidfile, 'rb') as f:
> if 
> common.is_modern_windows_install(self.get_base_cassandra_version()):
> self.pid = 
> int(f.readline().strip().decode('utf-16').strip())
> else:
> self.pid = int(f.readline().strip())
> except IOError as e:
> >   raise NodeError('Problem starting node %s due to %s' % 
> > (self.name, e), process)
> E   ccmlib.node

[jira] [Updated] (CASSANDRA-19259) upgrade_tests.upgrade_through_versions_test consistently failing on circleci

2024-01-10 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19259:

  Workflow: Copy of Cassandra Default Workflow  (was: Copy of Cassandra Bug 
Workflow)
Issue Type: Task  (was: Bug)

> upgrade_tests.upgrade_through_versions_test consistently failing on circleci
> 
>
> Key: CASSANDRA-19259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19259
> Project: Cassandra
>  Issue Type: Task
>    Reporter: Paulo Motta
>Priority: Normal
>
> This suite is consistently failing in  
> [4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1454/workflows/0357136e-cee3-42e4-900b-3347fc8d42d3/jobs/71008/tests]
>  and 
> [4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1453/workflows/dd1732df-271c-43bc-bc5f-8577c605c746/jobs/71009/tests]
>  with the following stack trace:
> {noformat}
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> >   with open(pidfile, 'rb') as f:
> E   FileNotFoundError: [Errno 2] No such file or directory: 
> '/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2100: FileNotFoundError
> During handling of the above exception, another exception occurred:
> self = 
>   object at 0x7f4c01419438>
> def test_parallel_upgrade(self):
> """
> Test upgrading cluster all at once (requires cluster downtime).
> """
> >   self.upgrade_scenario()
> upgrade_tests/upgrade_through_versions_test.py:387: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> upgrade_tests/upgrade_through_versions_test.py:491: in upgrade_scenario
> self.upgrade_to_version(version_meta, internode_ssl=internode_ssl)
> upgrade_tests/upgrade_through_versions_test.py:580: in upgrade_to_version
> jvm_args=['-Dcassandra.disable_max_protocol_auto_override=true'])  # 
> prevent protocol capping in mixed version clusters
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:906: in start
> if not self._wait_for_running(process, timeout_s=7):
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:931: in _wait_for_running
> self._update_pid(process)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = 
> process = 
> def _update_pid(self, process):
> """
> Reads pid from cassandra.pid file and stores in the self.pid
> After setting up pid updates status (UP, DOWN, etc) and node.conf
> """
> pidfile = os.path.join(self.get_path(), 'cassandra.pid')
> 
> start = time.time()
> while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
> if (time.time() - start > 30.0):
> common.error("Timed out waiting for pidfile to be filled 
> (current time is {})".format(datetime.now()))
> break
> else:
> time.sleep(0.1)
> 
> try:
> with open(pidfile, 'rb') as f:
> if 
> common.is_modern_windows_install(self.get_base_cassandra_version()):
> self.pid = 
> int(f.readline().strip().decode('utf-16').strip())
> else:
> self.pid = int(f.readline().strip())
> except IOError as e:
> >   raise NodeError('Problem starting node %s due to %s' % 
> > (self.name, e), process)
> E   ccmlib.node.NodeError: Problem starting node node1 due to [Errno 
> 2] No such file or directory: '/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2106: NodeError
> {noformat}
> It's not clear whether this reproduces locally or just on circleci.
> We should address these failures before the next 4.0.13 and 4.1.4 releases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19259) upgrade_tests.upgrade_through_versions_test consistently failing on circleci

2024-01-10 Thread Paulo Motta (Jira)
Paulo Motta created CASSANDRA-19259:
---

 Summary: upgrade_tests.upgrade_through_versions_test consistently 
failing on circleci
 Key: CASSANDRA-19259
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19259
 Project: Cassandra
  Issue Type: Bug
Reporter: Paulo Motta


This suite is consistently failing in  
[4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1454/workflows/0357136e-cee3-42e4-900b-3347fc8d42d3/jobs/71008/tests]
 and 
[4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1453/workflows/dd1732df-271c-43bc-bc5f-8577c605c746/jobs/71009/tests]
 with the following stack trace:

{noformat}
self = 
process = 

def _update_pid(self, process):
"""
Reads pid from cassandra.pid file and stores in the self.pid
After setting up pid updates status (UP, DOWN, etc) and node.conf
"""
pidfile = os.path.join(self.get_path(), 'cassandra.pid')

start = time.time()
while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
if (time.time() - start > 30.0):
common.error("Timed out waiting for pidfile to be filled 
(current time is {})".format(datetime.now()))
break
else:
time.sleep(0.1)

try:
>   with open(pidfile, 'rb') as f:
E   FileNotFoundError: [Errno 2] No such file or directory: 
'/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'

../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2100: FileNotFoundError

During handling of the above exception, another exception occurred:

self = 


def test_parallel_upgrade(self):
"""
Test upgrading cluster all at once (requires cluster downtime).
"""
>   self.upgrade_scenario()

upgrade_tests/upgrade_through_versions_test.py:387: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
upgrade_tests/upgrade_through_versions_test.py:491: in upgrade_scenario
self.upgrade_to_version(version_meta, internode_ssl=internode_ssl)
upgrade_tests/upgrade_through_versions_test.py:580: in upgrade_to_version
jvm_args=['-Dcassandra.disable_max_protocol_auto_override=true'])  # 
prevent protocol capping in mixed version clusters
../env3.6/lib/python3.6/site-packages/ccmlib/node.py:906: in start
if not self._wait_for_running(process, timeout_s=7):
../env3.6/lib/python3.6/site-packages/ccmlib/node.py:931: in _wait_for_running
self._update_pid(process)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
process = 

def _update_pid(self, process):
"""
Reads pid from cassandra.pid file and stores in the self.pid
After setting up pid updates status (UP, DOWN, etc) and node.conf
"""
pidfile = os.path.join(self.get_path(), 'cassandra.pid')

start = time.time()
while not (os.path.isfile(pidfile) and os.stat(pidfile).st_size > 0):
if (time.time() - start > 30.0):
common.error("Timed out waiting for pidfile to be filled 
(current time is {})".format(datetime.now()))
break
else:
time.sleep(0.1)

try:
with open(pidfile, 'rb') as f:
if 
common.is_modern_windows_install(self.get_base_cassandra_version()):
self.pid = 
int(f.readline().strip().decode('utf-16').strip())
else:
self.pid = int(f.readline().strip())
except IOError as e:
>   raise NodeError('Problem starting node %s due to %s' % (self.name, 
> e), process)
E   ccmlib.node.NodeError: Problem starting node node1 due to [Errno 2] 
No such file or directory: '/tmp/dtest-_8rdmjs0/test/node1/cassandra.pid'

../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2106: NodeError
{noformat}

It's not clear whether this reproduces locally or just on circleci.

We should address these failures before the next 4.0.13 and 4.1.4 releases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2024-01-07 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804104#comment-17804104
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

5.0 precommit tests are looking good.

I can't make much sense of the [upgrade dtest 
failures|https://app.circleci.com/pipelines/github/driftx/cassandra/1444/workflows/ddfe8a3c-4b36-4b9e-8f01-c85249fd8488/jobs/70142/tests]
 but they don't seem related to this ticket.

It looks like in both runs tests from {{upgrade_through_versions_test}} failed 
with:
{noformat}

Problem starting node node1 due to [Errno 2] No such file or directory: 
'/tmp/dtest-jbrcckw7/test/node1/cassandra.pid'
{noformat}
This looks like an environmental issue to me as I didn't find any open ticket 
for this particular issue. While the 
[4.1|https://app.circleci.com/pipelines/github/driftx/cassandra/1444/workflows/ddfe8a3c-4b36-4b9e-8f01-c85249fd8488]
 job completed, the 
[4.0|https://app.circleci.com/pipelines/github/driftx/cassandra/1445/workflows/d346af10-7b34-41a0-b2b7-c1c3290a6696]
 job seems to have gotten stuck.

I'm inclined to commit this to avoid dragging this ticket out any longer, and to 
re-run the upgrade dtests before the next 4.x release to catch any outstanding 
upgrade issues. WDYT?

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e. from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19097) Test Failure: bootstrap_test.TestBootstrap.*

2024-01-05 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803682#comment-17803682
 ] 

Paulo Motta commented on CASSANDRA-19097:
-

Seen {{test_read_from_bootstrapped_node}} failure in 
[5.0-18999-j17|https://app.circleci.com/pipelines/github/driftx/cassandra/1446/workflows/e0726e79-e517-4a82-828c-c7931fc9d99b/jobs/70130/tests]

> Test Failure: bootstrap_test.TestBootstrap.*
> 
>
> Key: CASSANDRA-19097
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19097
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Priority: Normal
> Fix For: 4.0.x, 5.0-rc
>
>
> test_killed_wiped_node_cannot_join
> test_read_from_bootstrapped_node
> test_shutdown_wiped_node_cannot_join
> Seen in dtests_offheap in CASSANDRA-19034
> https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/258/workflows/cea7d697-ca33-40bb-8914-fb9fc662820a/jobs/21162/parallel-runs/38
> {noformat}
> self = 
> def test_killed_wiped_node_cannot_join(self):
> >   self._wiped_node_cannot_join_test(gently=False)
> bootstrap_test.py:608: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = , gently = False
> def _wiped_node_cannot_join_test(self, gently):
> """
> @jira_ticket CASSANDRA-9765
> Test that if we stop a node and wipe its data then the node cannot 
> join
> when it is not a seed. Test both a nice shutdown or a forced 
> shutdown, via
> the gently parameter.
> """
> cluster = self.cluster
> 
> cluster.set_environment_variable('CASSANDRA_TOKEN_PREGENERATION_DISABLED', 
> 'True')
> cluster.populate(3)
> cluster.start()
> 
> stress_table = 'keyspace1.standard1'
> 
> # write some data
> node1 = cluster.nodelist()[0]
> node1.stress(['write', 'n=10K', 'no-warmup', '-rate', 'threads=8'])
> 
> session = self.patient_cql_connection(node1)
> original_rows = list(session.execute("SELECT * FROM 
> {}".format(stress_table,)))
> 
> # Add a new node, bootstrap=True ensures that it is not a seed
> node4 = new_node(cluster, bootstrap=True)
> node4.start(wait_for_binary_proto=True)
> 
> session = self.patient_cql_connection(node4)
> >   assert original_rows == list(session.execute("SELECT * FROM 
> > {}".format(stress_table,)))
> E   assert [Row(key=b'PP...e9\xbb'), ...] == [Row(key=b'PP...e9\xbb'), 
> ...]
> E At index 587 diff: Row(key=b'OP2656L630', 
> C0=b"E02\xd2\x8clBv\tr\n\xe3\x01\xdd\xf2\x8a\x91\x7f-\x9dm'\xa5\xe7PH\xef\xc1xlO\xab+d",
>  
> C1=b"\xb2\xc0j\xff\xcb'\xe3\xcc\x0b\x93?\x18@\xc4\xc7tV\xb7q\xeeF\x82\xa4\xd3\xdcFl\xd9\x87
>  \x9a\xde\xdc\xa3", 
> C2=b'\xed\xf8\x8d%\xa4\xa6LPs;\x98f\xdb\xca\x913\xba{M\x8d6XW\x01\xea-\xb5  
> C3=b'\x9ec\xcf\xc7\xec\xa5\x85Z]\xa6\x19\xeb\xc4W\x1d%lyZj\xb9\x94I\x90\xebZ\xdba\xdd\xdc\x9e\x82\x95\x1c',
>  
> C4=b'\xab\x9e\x13\x8b\xc6\x15D\x9b\xccl\xdcX\xb23\xd0\x8b\xa3\xba7\xc1c\xf7F\x1d\xf8e\xbd\x89\xcb\xd8\xd1)f\xdd')
>  != Row(key=b'4LN78NONP0', 
> C0=b"\xdf\x90\xb3/u\xc9/C\xcdOYG3\x070@#\xc3k\xaa$M'\x19\xfb\xab\xc0\x10]\xa6\xac\x1d\x81\xad",
>  
> C1=b'\x8a\xb7j\x95\xf9\xbd?&\x11\xaaH\xcd\x87\xaa\xd2\x85\x08X\xea9\x94\xae8U\x92\xad\xb0\x1b9\xff\x87Z\xe81',
>  
> C2=b'6\x1d\xa1-\xf77\xc7\xde+`\xb7\x89\xaa\xcd\xb5_\xe5\xb3\x04\xc7\xb1\x95e\x81s\t1\x8b\x16sc\x0eMm',
>  
> C3=b'\xfbi\x08;\xc9\x94\x15}r\xfe\x1b\xae5\xf6v\x83\xae\xff\x82\x9b`J\xc2D\xa6k+\xf3\xd3\xff{C\xd0;',
>  
> C4=b'\x8f\x87\x18\x0f\xfa\xadK"\x9e\x96\x87:tiu\xa5\x99\xe1_Ax\xa3\x12\xb4Z\xc9v\xa5\xad\xb8{\xc0\xa3\x93')
> E Left contains 2830 more items, first extra item: 
> Row(key=b'5N7N172K30', 
> C0=b'Y\x81\xa6\x02\x89\xa0hyp\x00O\xe9kFp$\x86u\xea\n\x7fK\x99\xe1\xf6G\xf77\xf7\xd7\xe1\xc7L\x...0\x87a\x03\xee',
>  
> C4=b'\xe8\xd8\x17\xf3\x14\x16Q\x9d\\jb\xde=\x81\xc1B\x9c;T\xb1\xa2O-\x87zF=\x04`\x04\xbd\xc9\x95\xad')
> E Full diff:
> E   [
> …
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2024-01-05 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803676#comment-17803676
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

Thanks Brandon! Looks like {{test_read_from_bootstrapped_node}} already failed 
in 
[5.0-j17|https://app.circleci.com/pipelines/github/driftx/cassandra/1446/workflows/e0726e79-e517-4a82-828c-c7931fc9d99b]
 but this is being tracked in CASSANDRA-19097.

I will check back when CI finishes and commit if it looks good.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e. from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2024-01-05 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803651#comment-17803651
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

Please find updated patches prepared for commit:

* [4.0-18999|https://github.com/pauloricardomg/cassandra/tree/4.0-18999]
* [4.1-18999|https://github.com/pauloricardomg/cassandra/tree/4.1-18999]
* [5.0-18999|https://github.com/pauloricardomg/cassandra/tree/5.0-18999]

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e. from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19248) "nodetool bootstrap resume" starts unnecessary streaming session on joining node

2024-01-04 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19248:

Description: 
Calling {{nodetool bootstrap resume}} triggers a new bootstrap streaming session 
on a joining node, even if there's a bootstrap streaming session currently 
running.

Each time this command is called, a new bootstrap streaming session is started, 
causing the same data to be needlessly streamed from peers.

It should only be possible to call {{nodetool bootstrap resume}} if a previous 
bootstrap attempt has failed.

An example of multiple invocations of {{nodetool bootstrap resume}} in a 
joining node is shown below:
{noformat}
$ nodetool netstats
Mode: JOINING
Bootstrap a1cf3bf0-ab3a-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
        Receiving 13 files, 14302312660 bytes total. Already received 2 files, 
52389676 bytes total
            ks1/tbl1 80/80 bytes(100%) received from idx:0/A.B.C.D
            ks2/tbl2 471/471 bytes(100%) received from idx:0/A.B.C.D
    /E.F.G.H
    /I.J.K.L
Bootstrap 7f1e7000-ab3d-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
        Receiving 13 files, 14302312660 bytes total. Already received 0 files, 
0 bytes total
    /E.F.G.H
    /I.J.K.L
Bootstrap 9ca42500-ab3a-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
    /E.F.G.H
        Receiving 13 files, 14302312660 bytes total. Already received 2 files, 
104838752 bytes total
            ks1/tbl1 80/80 bytes(100%) received from idx:0/E.F.G.H
            ks2/tbl2 471/471 bytes(100%) received from idx:0/E.F.G.H
    /I.J.K.L {noformat}

  was:
Calling {{nodetool bootstrap resume}} triggers a new bootstrap streaming session 
on a joining node, even if there's a bootstrap streaming session currently 
running.

Each time this command is called, a new bootstrap streaming session is started, 
causing the same data to be needlessly streamed from peers.

It should only be possible to call {{nodetool bootstrap resume}} if a previous 
bootstrap attempt has failed.

An example of multiple invocations of {{nodetool bootstrap resume}} in a 
joining node is shown below:
{noformat}
$ nodetool netstats
Mode: JOINING
Bootstrap a1cf3bf0-ab3a-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
        Receiving 13 files, 14302312660 bytes total. Already received 2 files, 
52389676 bytes total
            ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
            ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
    /E.F.G.H
    /I.J.K.L
Bootstrap 7f1e7000-ab3d-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
        Receiving 13 files, 14302312660 bytes total. Already received 0 files, 
0 bytes total
    /E.F.G.H
    /I.J.K.L
Bootstrap 9ca42500-ab3a-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
    /E.F.G.H
        Receiving 13 files, 14302312660 bytes total. Already received 2 files, 
104838752 bytes total
            ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
            ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
    /I.J.K.L {noformat}


> "nodetool bootstrap resume" starts unnecessary streaming session on joining 
> node
> 
>
> Key: CASSANDRA-19248
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19248
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Membership
>Reporter: Paulo Motta
>Priority: Normal
>  Labels: lhf
>
> Calling {{nodetool bootstrap resume}} triggers a new bootstrap streaming 
> session on a joining node, even if there's a bootstrap streaming session 
> currently running.
> Each time this command is called, a new bootstrap streaming session is 
> started, causing the same data to be needlessly streamed from peers.
> It should only be possible to call {{nodetool bootstrap resume}} if a 
> previous bootstrap attempt has failed.
> An example of multiple invocations of {{nodetool bootstrap resume}} in a 
> joining node is shown below:
> {noformat}
> $ nodetool netstats
> Mode: JOINING
> Bootstrap a1cf3bf0-ab3a-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>         Receiving 13 files, 14302312660 bytes total. Already received 2 
> files, 52389676 bytes total
>             ks1/tbl1 80/80 bytes(100%) received from idx:0/A.B.C.D
>             ks2/tbl2 471/471 bytes(100%) received from idx:0/A.B.C.D
>     /E.F.G.H
>     /I.J.K.L
> Bootstrap 7f1e7000-ab3d-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>         Receiving 13 files, 14302312660 bytes total. Already received 0 
> files, 0 bytes total
>     /E.F.G.H
>     /I.J.K.L
> Bootstrap 9ca42500-ab3a-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>     /E.F.G.H
>         Receiving 13 files, 14302312660 bytes total. Already received 2 
> fil

[jira] [Updated] (CASSANDRA-19248) "nodetool bootstrap resume" starts unnecessary streaming session on joining node

2024-01-04 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19248:

Since Version: 2.2.0 beta 1
   Labels: lhf  (was: )

> "nodetool bootstrap resume" starts unnecessary streaming session on joining 
> node
> 
>
> Key: CASSANDRA-19248
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19248
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Membership
>Reporter: Paulo Motta
>Priority: Normal
>  Labels: lhf
>
> Calling {{nodetool bootstrap resume}} triggers a new bootstrap streaming 
> session on a joining node, even if there's a bootstrap streaming session 
> currently running.
> Each time this command is called, a new bootstrap streaming session is 
> started, causing the same data to be needlessly streamed from peers.
> It should only be possible to call {{nodetool bootstrap resume}} if a 
> previous bootstrap attempt has failed.
> An example of multiple invocations of {{nodetool bootstrap resume}} in a 
> joining node is shown below:
> {noformat}
> $ nodetool netstats
> Mode: JOINING
> Bootstrap a1cf3bf0-ab3a-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>         Receiving 13 files, 14302312660 bytes total. Already received 2 
> files, 52389676 bytes total
>             ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
>             ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
>     /E.F.G.H
>     /I.J.K.L
> Bootstrap 7f1e7000-ab3d-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>         Receiving 13 files, 14302312660 bytes total. Already received 0 
> files, 0 bytes total
>     /E.F.G.H
>     /I.J.K.L
> Bootstrap 9ca42500-ab3a-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>     /E.F.G.H
>         Receiving 13 files, 14302312660 bytes total. Already received 2 
> files, 104838752 bytes total
>             ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
>             ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
>     /I.J.K.L {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19248) "nodetool bootstrap resume" starts unnecessary streaming session on joining node

2024-01-04 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19248:

 Bug Category: Parent values: Correctness(12982)Level 1 values: API / 
Semantic Implementation(12988)
   Complexity: Low Hanging Fruit
  Component/s: Cluster/Membership
Discovered By: User Report
 Severity: Low
   Status: Open  (was: Triage Needed)

> "nodetool bootstrap resume" starts unnecessary streaming session on joining 
> node
> 
>
> Key: CASSANDRA-19248
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19248
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Membership
>Reporter: Paulo Motta
>Priority: Normal
>
> Calling {{nodetool bootstrap resume}} triggers a new bootstrap streaming 
> session on a joining node, even if there's a bootstrap streaming session 
> currently running.
> Each time this command is called, a new bootstrap streaming session is 
> started, causing the same data to be needlessly streamed from peers.
> It should only be possible to call {{nodetool bootstrap resume}} if a 
> previous bootstrap attempt has failed.
> An example of multiple invocations of {{nodetool bootstrap resume}} in a 
> joining node is shown below:
> {noformat}
> $ nodetool netstats
> Mode: JOINING
> Bootstrap a1cf3bf0-ab3a-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>         Receiving 13 files, 14302312660 bytes total. Already received 2 
> files, 52389676 bytes total
>             ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
>             ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
>     /E.F.G.H
>     /I.J.K.L
> Bootstrap 7f1e7000-ab3d-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>         Receiving 13 files, 14302312660 bytes total. Already received 0 
> files, 0 bytes total
>     /E.F.G.H
>     /I.J.K.L
> Bootstrap 9ca42500-ab3a-11ee-9fcf-5746a7aee9ca
>     /A.B.C.D
>     /E.F.G.H
>         Receiving 13 files, 14302312660 bytes total. Already received 2 
> files, 104838752 bytes total
>             ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
>             ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
>     /I.J.K.L {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19248) "nodetool bootstrap resume" starts unnecessary streaming session on joining node

2024-01-04 Thread Paulo Motta (Jira)
Paulo Motta created CASSANDRA-19248:
---

 Summary: "nodetool bootstrap resume" starts unnecessary streaming 
session on joining node
 Key: CASSANDRA-19248
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19248
 Project: Cassandra
  Issue Type: Bug
Reporter: Paulo Motta


Calling {{nodetool bootstrap resume}} triggers a new bootstrap streaming session 
on a joining node, even if there's a bootstrap streaming session currently 
running.

Each time this command is called, a new bootstrap streaming session is started, 
causing the same data to be needlessly streamed from peers.

It should only be possible to call {{nodetool bootstrap resume}} if a previous 
bootstrap attempt has failed.

An example of multiple invocations of {{nodetool bootstrap resume}} in a 
joining node is shown below:
{noformat}
$ nodetool netstats
Mode: JOINING
Bootstrap a1cf3bf0-ab3a-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
        Receiving 13 files, 14302312660 bytes total. Already received 2 files, 
52389676 bytes total
            ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
            ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
    /E.F.G.H
    /I.J.K.L
Bootstrap 7f1e7000-ab3d-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
        Receiving 13 files, 14302312660 bytes total. Already received 0 files, 
0 bytes total
    /E.F.G.H
    /I.J.K.L
Bootstrap 9ca42500-ab3a-11ee-9fcf-5746a7aee9ca
    /A.B.C.D
    /E.F.G.H
        Receiving 13 files, 14302312660 bytes total. Already received 2 files, 
104838752 bytes total
            ks1/tbl1 80/80 bytes(100%) received from idx:0/10.34.194.220
            ks2/tbl2 471/471 bytes(100%) received from idx:0/10.34.194.220
    /I.J.K.L {noformat}
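A minimal sketch of the guard proposed above (only allow resume after a failed 
attempt, and never while a streaming session is already running). The class, 
field and method names are illustrative only, not the actual StorageService API:
{code:java}
// Illustrative only: names and state tracking are hypothetical.
public class BootstrapResumeGuard
{
    private volatile boolean streamingInProgress = false;
    private volatile boolean lastAttemptFailed = false;

    public synchronized void resumeBootstrap()
    {
        // Refuse to start another streaming session while one is already running.
        if (streamingInProgress)
            throw new IllegalStateException("Bootstrap streaming already in progress; refusing to start a new session");

        // Only allow resume after a failed attempt, as proposed in the ticket.
        if (!lastAttemptFailed)
            throw new IllegalStateException("No failed bootstrap attempt to resume");

        streamingInProgress = true;
        // ... start the actual streaming session here ...
    }
}
{code}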



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2024-01-02 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801825#comment-17801825
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

{quote}I have tested 5.0 already, links are in my last comments above. Have you 
done some changes to it? I think 5.0 is fully covered, the only thing we need 
is upgrade dtests for 4.0 and 4.1.
{quote}
The 5.0 version you submitted was based on the 
[CASSANDRA-18999-5.0-hasMajVer3removal|https://github.com/apache/cassandra/compare/trunk...instaclustr:cassandra:CASSANDRA-18999-5.0-hasMajVer3removal]
 branch which removes {{hasMajorVersion3Nodes}} from 5.0.

We need to submit CI for [this 
branch|https://github.com/pauloricardomg/cassandra/tree/cassandra-5.0] where 
{{hasMajorVersion3Nodes}} is fixed but not removed (the removal will be done on 
CASSANDRA-19243).

We also need to submit upgrade tests for 4.0/4.1/5.0. Can you do this in 
Circle? If not, I guess we'll have to wait until ASF CI is back.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-28 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18999:

Status: Ready to Commit  (was: Changes Suggested)

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-28 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801031#comment-17801031
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

I created CASSANDRA-19243 to have a wider review on removal of pre-4.0 
compatibility code.

For this ticket, let's just merge the original 4.0/4.1/5.0 PRs fixing 
{{Gossiper::hasMajorVersion3Nodes}} without removing pre-4.0 compatibility code 
from 5.0:

I have prepared the patches for commit:
 * 
[cassandra-4.0|https://github.com/pauloricardomg/cassandra/tree/cassandra-4.0]
 * 
[cassandra-4.1|https://github.com/pauloricardomg/cassandra/tree/cassandra-4.1]
 * 
[cassandra-5.0|https://github.com/pauloricardomg/cassandra/tree/cassandra-5.0]

I've submitted a [devbranch 
job|https://ci-cassandra.apache.org/job/Cassandra-devbranch/2649/] for the 
cassandra-5.0 branch but it seems ci-cassandra.a.o is unavailable.

I don't have a Circle environment set up, so I will wait until Jenkins is back 
or someone submits a Circle job for cassandra-5.0 before committing this.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19243) Remove pre-4.0 compatibility code for 5.0

2023-12-28 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801027#comment-17801027
 ] 

Paulo Motta edited comment on CASSANDRA-19243 at 12/28/23 4:14 PM:
---

It was identified on CASSANDRA-18999 that {{Gossiper::hasMajorVersion3Nodes}} 
was removed from trunk, effectively removing pre-4.0 compatibility from trunk.

This [PR|https://github.com/apache/cassandra/pull/3004] removes the method 
{{Gossiper::hasMajorVersion3Nodes}} from cassandra-5.0 branch, which removes 
pre-4.0 compatibility from 5.0.

In addition to reviewing the changes above, we need to ensure that no more 
pre-4.0 compatibility code remains in 5.0+

Since the backward compatibility code will be removed, I propose adding a new 
StartupCheck to prevent upgrade from version < 4.0 and a flag to override (if 
this is not already there).


was (Author: paulo):
It was identified on CASSANDRA-18999 that {{Gossiper::hasMajorVersion3Nodes}} 
was removed from trunk, effectively removing pre-4.0 compatibility from trunk.

This [PR|https://github.com/apache/cassandra/pull/3004] removes the method 
{{Gossiper::hasMajorVersion3Nodes}} from cassandra-5.0 branch, which removes 
pre-4.0 compatibility from 5.0.

In addition to reviewing the changes above, we need to ensure that no more 
pre-4.0 compatibility code remains in 5.0+

Since the backward compatibility code will be removed, I propose adding a new 
StartupCheck to prevent upgrade from version < 4.0 and a flag to override (if 
this is not already there).

> Remove pre-4.0 compatibility code for 5.0
> -
>
> Key: CASSANDRA-19243
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19243
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Paulo Motta
>Priority: Normal
>
> This is an umbrella ticket to discuss removing pre-4.0 compatibility code 
> from 5.0, similar to CASSANDRA-12716 for 4.x.
> A few considerations:
> - Discuss/ratify removal of pre-compatibility code on dev mailing list
> - What compatibility features are being removed?
> - What upgrade tests are being removed? Are they still relevant and can be 
> reused?
> - Should upgrade from 3.x to 5.X fail on startup with an override flag?
> - Can/should we make it easier to deprecate/remove compatibility code for 
> future major releases?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19243) Remove pre-4.0 compatibility code for 5.0

2023-12-28 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801027#comment-17801027
 ] 

Paulo Motta commented on CASSANDRA-19243:
-

It was identified on CASSANDRA-18999 that {{Gossiper::hasMajorVersion3Nodes}} 
was removed from trunk, effectively removing pre-4.0 compatibility from trunk.

This [PR|https://github.com/apache/cassandra/pull/3004] removes the method 
{{Gossiper::hasMajorVersion3Nodes}} from cassandra-5.0 branch, which removes 
pre-4.0 compatibility from 5.0.

In addition to reviewing the changes above, we need to ensure that no more 
pre-4.0 compatibility code remains in 5.0+

Since the backward compatibility code will be removed, I propose adding a new 
StartupCheck to prevent upgrade from version < 4.0 and a flag to override (if 
this is not already there).
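A rough sketch of what such a check could look like, kept deliberately 
self-contained; the class name, the override property and the way the previous 
version is obtained are all hypothetical, not the existing startup-check API:
{code:java}
// Illustrative only: names and the override flag are hypothetical.
public final class MinimumUpgradeVersionCheck
{
    private static final String OVERRIDE_FLAG = "cassandra.allow_upgrade_from_pre_4_0";

    /** @param previousReleaseVersion version recorded by the previously started release, e.g. "3.11.4" */
    public static void check(String previousReleaseVersion)
    {
        if (Boolean.getBoolean(OVERRIDE_FLAG))
            return; // operator explicitly accepted the risk

        int major = Integer.parseInt(previousReleaseVersion.split("\\.")[0]);
        if (major < 4)
            throw new IllegalStateException("Direct upgrade from " + previousReleaseVersion
                                            + " is not supported; upgrade to 4.x first or set -D"
                                            + OVERRIDE_FLAG + "=true to override");
    }
}
{code}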

> Remove pre-4.0 compatibility code for 5.0
> -
>
> Key: CASSANDRA-19243
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19243
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Paulo Motta
>Priority: Normal
>
> This is an umbrella ticket to discuss removing pre-4.0 compatibility code 
> from 5.0, similar to CASSANDRA-12716 for 4.x.
> A few considerations:
> - Discuss/ratify removal of pre-compatibility code on dev mailing list
> - What compatibility features are being removed?
> - What upgrade tests are being removed? Are they still relevant and can be 
> reused?
> - Should upgrade from 3.x to 5.X fail on startup with an override flag?
> - Can/should we make it easier to deprecate/remove compatibility code for 
> future major releases?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19243) Remove pre-4.0 compatibility code for 5.0

2023-12-28 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19243:

  Workflow: Copy of Cassandra Default Workflow  (was: Copy of Cassandra Bug 
Workflow)
Issue Type: Improvement  (was: Bug)

> Remove pre-4.0 compatibility code for 5.0
> -
>
> Key: CASSANDRA-19243
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19243
> Project: Cassandra
>  Issue Type: Improvement
>    Reporter: Paulo Motta
>Priority: Normal
>
> This is an umbrella ticket to discuss removing pre-4.0 compatibility code 
> from 5.0, similar to CASSANDRA-12716 for 4.x.
> A few considerations:
> - Discuss/ratify removal of pre-compatibility code on dev mailing list
> - What compatibility features are being removed?
> - What upgrade tests are being removed? Are they still relevant and can be 
> reused?
> - Should upgrade from 3.x to 5.X fail on startup with an override flag?
> - Can/should we make it easier to deprecate/remove compatibility code for 
> future major releases?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19243) Remove pre-4.0 compatibility code for 5.0

2023-12-28 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19243:

Description: 
This is an umbrella ticket to discuss removing pre-4.0 compatibility code from 
5.0, similar to CASSANDRA-12716 for 4.x.

A few considerations:
- Discuss/ratify removal of pre-compatibility code on dev mailing list
- What compatibility features are being removed?
- What upgrade tests are being removed? Are they still relevant and can be 
reused?
- Should upgrade from 3.x to 5.X fail on startup with an override flag?
- Can/should we make it easier to deprecate/remove compatibility code for 
future major releases?

  was:TBD


> Remove pre-4.0 compatibility code for 5.0
> -
>
> Key: CASSANDRA-19243
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19243
> Project: Cassandra
>  Issue Type: Bug
>    Reporter: Paulo Motta
>Priority: Normal
>
> This is an umbrella ticket to discuss removing pre-4.0 compatibility code 
> from 5.0, similar to CASSANDRA-12716 for 4.x.
> A few considerations:
> - Discuss/ratify removal of pre-compatibility code on dev mailing list
> - What compatibility features are being removed?
> - What upgrade tests are being removed? Are they still relevant and can be 
> reused?
> - Should upgrade from 3.x to 5.X fail on startup with an override flag?
> - Can/should we make it easier to deprecate/remove compatibility code for 
> future major releases?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19243) Remove pre-4.0 compatibility code for 5.0

2023-12-28 Thread Paulo Motta (Jira)
Paulo Motta created CASSANDRA-19243:
---

 Summary: Remove pre-4.0 compatibility code for 5.0
 Key: CASSANDRA-19243
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19243
 Project: Cassandra
  Issue Type: Bug
Reporter: Paulo Motta


TBD



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-12-27 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800879#comment-17800879
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

Added [this 
commit|https://github.com/pauloricardomg/cassandra/commit/cdc4124873f2b29c4d42e3265a9c7f408bcd98c4]
 to 
[pauloricardomg/19001-5.0-patch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:cassandra:19001-5.0-patch]
 to fail "nodetool sjk" with a nicer message when a JDK is not found:

"nodetool sjk jps" output with JDK17:
{noformat}
$ bin/nodetool sjk jps
28270   org.apache.cassandra.tools.NodeTool -p 7199 sjk jps
{noformat}
"nodetool sjk jps" output with JRE17:
{noformat}
$ docker run --rm -it cassandra-test:5.0-19001 nodetool sjk jps | cat | head 
-n10
ERROR: JDK not detected and nodetool sjk requires JDK to work.
{noformat}

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-12-27 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19001:

Reviewers: Paulo Motta, Paulo Motta
   Paulo Motta, Paulo Motta  (was: Paulo Motta)
   Status: Review In Progress  (was: Patch Available)

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-12-27 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19001:

Status: Changes Suggested  (was: Review In Progress)

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-12-27 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19001:

Status: Patch Available  (was: In Progress)

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-12-27 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19001:

Status: Open  (was: Patch Available)

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-12-27 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800870#comment-17800870
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

I finally got a chance to take a look at this, apologies for the delay. 

It looks like the [JDK detection 
check|https://github.com/ekaterinadimitrova2/cassandra/blob/613bb6d2cbc40924479eac044f78e0c4e584521b/bin/cassandra.in.sh#L153]
 does not work when the JRE is on {{/opt/java/openjdk/bin/java}}, which is the 
case for the official docker image. I updated the check [on this 
commit|https://github.com/pauloricardomg/cassandra/commit/97472afcc4f63291ebbbcc6aab476b0ccf12ce06]
 to check for the presence of the {{javac}} executable on the {{$PATH}} or 
{{$JAVA_HOME}} to detect whether a JDK is present. Let me know what you think.
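As an aside, the same JRE-vs-JDK distinction can also be made from inside the 
JVM with a standard API; this is not what the patch does (the patch is a shell 
check in cassandra.in.sh), just an illustration of the same idea:
{code:java}
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public final class JdkDetector
{
    // ToolProvider.getSystemJavaCompiler() returns null on a JRE, because the
    // compiler is only bundled with a JDK.
    public static boolean runningOnJdk()
    {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    public static void main(String[] args)
    {
        if (!runningOnJdk())
            System.err.println("JDK not detected; tools that rely on the attach/compiler modules will not work.");
    }
}
{code}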

I checked that no more warnings "Unknown module: jdk.attach specified to 
--add-exports" are logged during server initialization, nor when calling 
nodetool commands when using JRE17:
*BEFORE:*
{noformat}
$ docker run --rm -it cassandra:5 nodetool help | cat | head -n10
WARNING: Unknown module: jdk.attach specified to --add-exports
WARNING: Unknown module: jdk.compiler specified to --add-exports
WARNING: Unknown module: jdk.compiler specified to --add-opens
usage: nodetool [(-p  | --port )]
[(-u  | --username )]
[(-pw  | --password )]
[(-pwf  | --password-file )]
[(-pp | --print-port)] [(-h  | --host )]  []
{noformat}
*AFTER:*
{noformat}
$ docker run --rm -it cassandra-test:5.0-19001 nodetool help | cat | head -n10
usage: nodetool [(-pw  | --password )]
[(-p  | --port )]
[(-pwf  | --password-file )]
[(-pp | --print-port)] [(-h  | --host )]
[(-u  | --username )]  []
{noformat}
I also checked that nodetool sjk fails with this message on JRE17:
{noformat}
$ docker run --rm -it cassandra-test:5.0-19001 nodetool sjk jps | head -n10
ERROR 17:22:29,631 Java home points to /opt/java/openjdk make sure it is not a 
JRE path
ERROR 17:22:29,632 Failed to add tools.jar to classpath
java.lang.ClassNotFoundException: com.sun.tools.attach.VirtualMachine
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown 
Source)
at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown 
Source)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at org.gridkit.lab.jvm.attach.AttachAPI.(AttachAPI.java:52)
{noformat}
But it works when a JDK17 is present:
{noformat}
$ bin/nodetool sjk jps
22825   org.apache.cassandra.tools.NodeTool -p 7199 sjk jps
{noformat}
I checked that all commands above have the same output on JRE11.

I briefly tested the full query logger on a JRE17 with the patch above and it 
seems to be working:
{noformat}
root@6c9f22a89594:/# nodetool enablefullquerylog --path /tmp/bla

root@6c9f22a89594:/# cqlsh
Connected to Test Cluster at 127.0.0.1:9042
[cqlsh 6.2.0 | Cassandra 5.0-beta1-SNAPSHOT | CQL spec 3.4.7 | Native protocol 
v5]
Use HELP for help.
cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 1};
cqlsh> exit

root@6c9f22a89594:/# /opt/cassandra/tools/bin/fqltool dump /tmp/bla
INFO  [main] 2023-12-27 16:56:34,673 DatabaseDescriptor.java:1557 - Supported 
sstable formats are: big -> 
org.apache.cassandra.io.sstable.format.big.BigFormat with singleton components: 
[Data.db, Index.db, Statistics.db, CompressionInfo.db, Filter.db, Summary.db, 
Digest.crc32, CRC.db, TOC.txt], bti -> 
org.apache.cassandra.io.sstable.format.bti.BtiFormat with singleton components: 
[Data.db, Partitions.db, Rows.db, Statistics.db, CompressionInfo.db, Filter.db, 
Digest.crc32, CRC.db, TOC.txt]
INFO  [main] 2023-12-27 16:56:34,723 Jvm.java:174 - Chronicle core loaded from 
file:/opt/cassandra/lib/chronicle-core-2.23.36.jar
INFO  [main] 2023-12-27 16:56:34,817 Slf4jExceptionHandler.java:44 - Took 6 ms 
to add mapping for /tmp/bla/metadata.cq4t
INFO  [main] 2023-12-27 16:56:34,859 Slf4jExceptionHandler.java:44 - Running 
under OpenJDK Runtime Environment 17.0.9+9 with 16 processors reported.
INFO  [main] 2023-12-27 16:56:34,860 Slf4jExceptionHandler.java:44 - Leave your 
e-mail to get information about the latest releases and patches at 
https://chronicle.software/release-notes/
INFO  [main] 2023-12-27 16:56:34,861 Slf4jExceptionHandler.java:44 - Process 
id: 1015 :: Chronicle Queue (5.23.37)
Type: single-query
Query start time: 1703696157539
Protocol version: 5
Generated timestamp:-9223372036854775808
Generated nowInSeconds:1703696157
Query: SELECT * FROM system.peers_v2
Values:

Type: single-query
Query start time: 1703696157544
Protocol version: 5
Generated timestamp:-9223372036854775808
Generated nowInSeconds:1703696157
Query: SELECT * FROM system.local WHERE key='local'
Values:
{noformat}
I inspecte

[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-12-20 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799057#comment-17799057
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

I'll take a look at this today, will get back soon.

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18999:

Status: Changes Suggested  (was: Review In Progress)

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798298#comment-17798298
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

{quote}I think we should keep some version of hasMajorVersion3Nodes still 
around, something like this:
{quote}
Where will this method be ever used if we're removing 
{{Gossiper::hasMajorVersion3Nodes}} and all references to it? There is no place 
in the code that requires checking if there are unknown nodes in gossip, except 
inside 
[upgradeFromVersionSupplier|https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/gms/Gossiper.java#L219],
 where the check will be kept.


{quote}I just dont understand that when the original version was dealing with 
unknown versions and it could evaluate that method as returning true, then us 
removing the unknown check will change behavior in 5.0 as well.
{quote}
As far as I understand, the objective of the hasMajorVersion3Nodes method is to 
*not* do things when a cluster node is identified to be on version 3.x. It was 
not possible to know whether a node with an unknown version was on 3.x or not, 
so hasMajorVersion3Nodes returned true if a node version was not known (since 
it could potentially be a 3.x node).

On 5.x we no longer need to identify if a node is on version 3.x since direct 
upgrade from 3.x is not supported, so there is no reason to keep 
hasMajorVersion3Nodes or hasUnknownNodes around.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798298#comment-17798298
 ] 

Paulo Motta edited comment on CASSANDRA-18999 at 12/18/23 5:50 PM:
---

{quote}I think we should keep some version of hasMajorVersion3Nodes still 
around, something like this:
{quote}
Where will this method be ever used if we're removing 
{{Gossiper::hasMajorVersion3Nodes}} and all references to it? There is no place 
in the code that requires checking if there are unknown version nodes in 
gossip, except inside 
[upgradeFromVersionSupplier|https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/gms/Gossiper.java#L219],
 where the check will be kept.
{quote}I just dont understand that when the original version was dealing with 
unknown versions and it could evaluate that method as returning true, then us 
removing the unknown check will change behavior in 5.0 as well.
{quote}
As far as I understand, the objective of the hasMajorVersion3Nodes method is to 
*not* do things when a cluster node is identified to be on version 3.x. It was 
not possible to know whether a node with an unknown version was on 3.x or not, 
so hasMajorVersion3Nodes returned true if a node version was not known (since 
it could potentially be a 3.x node).

On 5.x we no longer need to identify if a node is on version 3.x since direct 
upgrade from 3.x is not supported, so there is no reason to keep 
hasMajorVersion3Nodes or hasUnknownNodes around.


was (Author: paulo):
{quote}I think we should keep some version of hasMajorVersion3Nodes still 
around, something like this:
{quote}
Where will this method be ever used if we're removing 
{{Gossiper::hasMajorVersion3Nodes}} and all references to it? There is no place 
in the code that requires checking if there are unknown nodes in gossip, except 
inside 
[upgradeFromVersionSupplier|https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/gms/Gossiper.java#L219],
 where the check will be kept.


{quote}I just dont understand that when the original version was dealing with 
unknown versions and it could evaluate that method as returning true, then us 
removing the unknown check will change behavior in 5.0 as well.
{quote}
As far as I understand, the objective of the hasMajorVersion3Nodes method is to 
*not* do things when a cluster node is identified to be on version 3.x. It was 
not possible to know whether a node with an unknown version was on 3.x or not, 
so hasMajorVersion3Nodes returned true if a node version was not known (since 
it could potentially be a 3.x node).

On 5.x we no longer need to identify if a node is on version 3.x since direct 
upgrade from 3.x is not supported, so there is no reason to keep 
hasMajorVersion3Nodes or hasUnknownNodes around.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' 

[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798294#comment-17798294
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

{quote}So, if this is removed in 5.0, that also means that the places where 
that method is called are not relevant anymore - as you showed its usage in 
your first comment to this ticket. That means that we would need a little bit 
more refactoring in 5.0 around that.
{quote}
Yes, [~isaacreath] I think we need to update the 5.0 patch to remove 
{{Gossiper::hasMajorVersion3Nodes}} and any references to it.
{quote}check this, that comment in particular (1). It seems to me that unknown 
version can happen in 4.0+ as well.
{quote}
We shouldn't remove this variable from {{upgradeFromVersionSupplier}} since it 
will still be needed there. We just don't need the method 
{{hasMajorVersion3OrUnknownNodes}} nor {{{}hasNodeWithUnknownVersion{}}}, since 
these will no longer be required anywhere in 5.x.

For the 5.x patch, after removing {{Gossiper::hasMajorVersion3Nodes}}, we can 
keep the 
[upgradeFromVersionSupplier|https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/gms/Gossiper.java#L219]
 in its original form, where {{allHostsHaveKnownVersion}} remains a local 
variable within that method, used 
[here|https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/gms/Gossiper.java#L253].
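
As a rough sketch of that shape (hypothetical names and simplified types, not 
the real {{upgradeFromVersionSupplier}}), the "all hosts have a known version" 
information can stay a local variable inside the supplier instead of being 
exposed through a separate method:

{code:java}
import java.util.Map;
import java.util.Optional;

// Simplified sketch: the unknown-version handling remains local to the supplier,
// so no hasNodeWithUnknownVersion()/hasMajorVersion3Nodes() method is needed.
final class UpgradeFromVersionSketch
{
    static Optional<Integer> minClusterMajorVersion(Map<String, String> releaseVersionByEndpoint)
    {
        boolean allHostsHaveKnownVersion = true;
        int minMajor = Integer.MAX_VALUE;
        for (String version : releaseVersionByEndpoint.values())
        {
            if (version == null || version.isEmpty())
            {
                allHostsHaveKnownVersion = false;   // stays a purely local concern
                continue;
            }
            minMajor = Math.min(minMajor, Integer.parseInt(version.split("\\.")[0]));
        }
        // Only report a cluster-wide minimum once every host's version is known.
        return allHostsHaveKnownVersion && minMajor != Integer.MAX_VALUE
               ? Optional.of(minMajor)
               : Optional.empty();
    }
}
{code}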
 

In summary:
 - 4.0/4.1 patches: LGTM
 - 5.0 patch: only remove Gossiper::hasMajorVersion3Nodes and any references to 
it.
 - Trunk (no change)

WDYT?

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apach

[jira] [Updated] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18999:

Reviewers: Paulo Motta, Stefan Miklosovic  (was: Stefan Miklosovic)

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798276#comment-17798276
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

Fwiw I'm +1 on the patch, but let's wait a bit to see if Mick/Brandon have any 
input. If you're good, can you trigger CI, [~smiklosovic]? I need to set up my 
CircleCI stuff to be able to submit; I will set this up soon.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798274#comment-17798274
 ] 

Paulo Motta edited comment on CASSANDRA-18999 at 12/18/23 4:27 PM:
---

{quote}So I can see an argument for completely removing this in 5.0, but on the 
other hand, there is also this "or unknown nodes" and that is still valid 
question to ask. Hence, would not it be more appropriate to remove 
"isUpgradingFromVersionLowerThan" and base this method just on 
"hasNodeWithUnknownVersion" ?
{quote}
Upgrade from 3.x to 5.x is not supported, so this method should be removed. The 
unknown version check is a pessimistic guard against a 3.x node possibly not 
having its version propagated via gossip. Since upgrade from 3.x is no longer 
supported on 5.x, the unknown version check should no longer exist.


was (Author: paulo):
{quote}So I can see an argument for completely removing this in 5.0, but on the 
other hand, there is also this "or unknown nodes" and that is still valid 
question to ask. Hence, would not it be more appropriate to remove 
"isUpgradingFromVersionLowerThan" and base this method just on 
"hasNodeWithUnknownVersion" ?
{quote}
Upgrade from 3.x to 5.x is not supported, so this method should be removed. The 
unknown version check is a pessimistic guard against a 3.x node possibly not 
having its version propagated via gossip. Since upgrade from 3.x is no longer 
supported on 3.x, this should no longer be guarded against.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18999) Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading patch version without Cassandra 3 nodes.

2023-12-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798274#comment-17798274
 ] 

Paulo Motta commented on CASSANDRA-18999:
-

{quote}So I can see an argument for completely removing this in 5.0, but on the 
other hand, there is also this "or unknown nodes" and that is still valid 
question to ask. Hence, would not it be more appropriate to remove 
"isUpgradingFromVersionLowerThan" and base this method just on 
"hasNodeWithUnknownVersion" ?
{quote}
Upgrade from 3.x to 5.x is not supported, so this method should be removed. The 
unknown version check is a pessimistic guard against a 3.x node possibly not 
having its version propagated via gossip. Since upgrade from 3.x is no longer 
supported on 5.x, this should no longer be guarded against.

> Gossiper::hasMajorVersion3Nodes returns true when a cluster is upgrading 
> patch version without Cassandra 3 nodes.
> -
>
> Key: CASSANDRA-18999
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18999
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Isaac Reath
>Assignee: Isaac Reath
>Priority: Low
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When working on https://issues.apache.org/jira/browse/CASSANDRA-18968 we 
> found that {{Gossiper::hasMajorVersion3Nodes}} will return true when the 
> cluster is undergoing an upgrade from a patch version even if the cluster has 
> no Cassandra 3 nodes in it.
> This can be reproduced by running this Gossiper test:
> {code:java}
> @Test
> public void 
> testHasVersion3NodesShouldReturnFalseWhenNoVersion3NodesDetectedAndCassandra4UpgradeInProgress()
>  throws Exception
> {
> Gossiper.instance.start(0);
> Gossiper.instance.expireUpgradeFromVersion();
> VersionedValue.VersionedValueFactory factory = new 
> VersionedValue.VersionedValueFactory(null);
> EndpointState es = new EndpointState((HeartBeatState) null);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(CURRENT_VERSION.toString()));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.1"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.1"));
> es = new EndpointState((HeartBeatState) null);
> String previousPatchVersion = String.valueOf(CURRENT_VERSION.major) + 
> '.' + (CURRENT_VERSION.minor) + '.' + (CURRENT_VERSION.patch - 1);
> es.addApplicationState(ApplicationState.RELEASE_VERSION, 
> factory.releaseVersion(previousPatchVersion));
> 
> Gossiper.instance.endpointStateMap.put(InetAddressAndPort.getByName("127.0.0.2"),
>  es);
> 
> Gossiper.instance.liveEndpoints.add(InetAddressAndPort.getByName("127.0.0.2"));
> assertFalse(Gossiper.instance.hasMajorVersion3Nodes());
> }
> {code}
> This seems to be because of 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L2360],
>  where an upgrade in progress is possible but we are not upgrading from a 
> lower family version (i.e from 4.1.1 to 4.1.2).
> From the comment in this function, it seems instead of the existing check, we 
> would want to iterate over all known endpoints in gossip and return true if 
> any of them do not have a version (similar to 
> [https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236)
>  
> |https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/gms/Gossiper.java#L227-L236).]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-06 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793739#comment-17793739
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

bq. However, from the API pov CompactionManager.performCleanup can be now 
called anytime - I think it was important precondition for that method - 
wouldn't be good to keep it there, just changing the condition to check pending 
ranges rather than joining status?

Good point, this was overlooked during review - I suggested removing that just 
as a cleanup, but looking back I think there is value in keeping it for safety 
in case this API is used elsewhere. Feel free to create a new ticket to add it 
back or piggyback on some other ticket; I'd be glad to review.

To me it'd be nice if the CompactionManager API were a dumb local API, unaware 
of token ranges/membership status, since it's just a local operation; but in 
practice these concerns are mixed across the codebase, so developers expect that 
any local API is also safe from a distributed standpoint.
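
For illustration, here is a minimal sketch of the kind of precondition being 
discussed (hypothetical names, not the actual CompactionManager/StorageService 
API): refuse to start cleanup while the node still has pending ranges for the 
keyspace.

{code:java}
import java.util.Collection;

// Sketch only: a guard that rejects cleanup while pending ranges exist for the
// keyspace (e.g. while the node is bootstrapping or a neighbour is decommissioning).
final class CleanupGuardSketch
{
    static void ensureSafeToCleanup(String keyspace, Collection<?> pendingRangesForKeyspace)
    {
        if (!pendingRangesForKeyspace.isEmpty())
            throw new IllegalStateException("Cannot run cleanup on keyspace " + keyspace
                                            + " while it has pending ranges");
    }
}
{code}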

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.8, 4.1.1, 5.0-alpha1, 5.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> What we expected: Running a cleanup is a safe operation; the result of 
> running a query after a cleanup should be the same as the result of running a 
> query before a cleanup.
> What actually happened: We ran a cleanup during a decommission. All the 
> streamed data was silently deleted, the bootstrap did not fail, the cluster's 
> data after the decommission was very different to the state before.
> Why: Cleanups do not take into account pending ranges and so the cleanup 
> thought that all the data that had just been streamed was redundant and so 
> deleted it. We think that this is symmetric with bootstraps, though have not 
> verified.
> Not sure if this is technically a bug but it was very surprising (and 
> seemingly undocumented) behaviour.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-04 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792847#comment-17792847
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

{quote}Why that check in CompactionManager was removed? Was it needed for tests 
to make them run? I'm afraid that the check could have been legit for 
production use.
{quote}
I think that check was deemed unnecessary after a new check was added to 
[StorageService.forceKeyspaceCleanup|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/service/StorageService.java#L3907]
 to prevent starting cleanup when there are pending ranges (ie. when a node is 
joining).

It's not clear to me why this latter check is not present in 
[trunk|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L2524]
 (while it's present in 4.0/4.1).

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.8, 4.1.1, 5.0-alpha1, 5.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> What we expected: Running a cleanup is a safe operation; the result of 
> running a query after a cleanup should be the same as the result of running a 
> query before a cleanup.
> What actually happened: We ran a cleanup during a decommission. All the 
> streamed data was silently deleted, the bootstrap did not fail, the cluster's 
> data after the decommission was very different to the state before.
> Why: Cleanups do not take into account pending ranges and so the cleanup 
> thought that all the data that had just been streamed was redundant and so 
> deleted it. We think that this is symmetric with bootstraps, though have not 
> verified.
> Not sure if this is technically a bug but it was very surprising (and 
> seemingly undocumented) behaviour.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-11-30 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791828#comment-17791828
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

[~e.dimitrova] thanks for the patch! I'll take a look ASAP, hopefully tomorrow.

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19033) Add virtual table with GC pause history

2023-11-16 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786927#comment-17786927
 ] 

Paulo Motta commented on CASSANDRA-19033:
-

Seems like it could be useful to expose formatted gc info via a vtable for 
troubleshooting/tuning. If GC logging is not enabled I think it's fine to error 
out or perhaps not even load the virtual table.

Would a specific GC logging format be required? Would this support just 
gc.log.current or compressed rolled over files?

Do you have an idea on what the table schema would look like and possible 
queries?

> Add virtual table with GC pause history
> ---
>
> Key: CASSANDRA-19033
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19033
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Feature/Virtual Tables
>Reporter: Jon Haddad
>Priority: Normal
>
> We should be able to view GC pause history in a virtual table. 
> I think the best approach here is to read from the GC logs.  The format was 
> unified in Java 9, and we've dropped older JVM support so I think this is 
> reasonable.  The benefits of using logs are that we can preserve it across 
> restarts and we enable GC logs by default.  
> The downside is people might not have GC logs configured and it seems weird 
> that a feature would just stop working because logs aren't enabled.   Maybe 
> that's OK if we call it out, or error if people try to read from it and the 
> logs aren't enabled.  I think if someone disables -Xlog:gc then an error 
> might be fine as I don't expect it to happen often.  I think I lean towards 
> this from a usability perspective, and Microsoft has a 
> [project|https://github.com/microsoft/gctoolkit] to parse them, but I haven't 
> used it so I'm not sure if it's suitable for us.  
> At a minimum, pause time should be it's own field so we can query for pauses 
> over a specific threshold, but there may be other data we want to explicitly 
> split out as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-11-16 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786885#comment-17786885
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

{quote}A warning would not "break" it. It would inform users of the docker 
image of known limitations. This buys us time to then deal with the issue 
properly as we wish. (And the docker image maintainers may notice and change to 
JDK anyway…)
{quote}
It would deprecate JRE support, which was previously provided (ie. JRE_HOME is 
mentioned 
[here|https://cassandra.apache.org/doc/latest/cassandra/reference/java17.html] 
and in other places). If there are no hard dependencies on the JDK for core 
features, I would prefer to require it only for optional features like SJK and 
audit log. WDYT?

One question that arises is whether we want to continue JRE support for core 
features. The benefits I can think of are a smaller image size and fewer runtime 
dependencies.

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18762:

Resolution: (was: Fixed)
Status: Open  (was: Resolved)

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202 which moved merkel trees off-heap in 4.0.   Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834){noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+Parallel

[jira] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


[ https://issues.apache.org/jira/browse/CASSANDRA-18762 ]


Paulo Motta deleted comment on CASSANDRA-18762:
-

was (Author: paulo):
Thanks for the follow-up. I will close this for now, please re-open if you 
observe the issue after 4.0.10.

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202 which moved merkel trees off-heap in 4.0.   Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834){noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDu

[jira] [Updated] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18762:

Resolution: Cannot Reproduce
Status: Resolved  (was: Open)

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202 which moved merkel trees off-heap in 4.0.   Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834){noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+Parallel

[jira] [Updated] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18762:

Fix Version/s: (was: 5.x)
   (was: 4.0.x)
   (was: 4.1.x)
   (was: 5.0.x)

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202 which moved merkel trees off-heap in 4.0.   Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834){noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+Exi

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786596#comment-17786596
 ] 

Paulo Motta commented on CASSANDRA-18762:
-

Thanks for the follow-up. I will close this for now, please re-open if you 
observe the issue after 4.0.10.

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202 which moved merkel trees off-heap in 4.0.   Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834){noformat}

[jira] [Updated] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18762:

Resolution: Fixed
Status: Resolved  (was: Open)

Thanks for the follow-up. I will close this for now, please re-open if you 
observe the issue after 4.0.10.

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202, which moved Merkle trees off-heap in 4.0.  Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)no* further _formatting_ is 
> don

[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

2023-11-15 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786588#comment-17786588
 ] 

Paulo Motta commented on CASSANDRA-18762:
-

[~bschoeni] Did you confirm CASSANDRA-16681 fixes this issue?

> Repair triggers OOM with direct buffer memory
> -
>
> Key: CASSANDRA-18762
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: OutOfMemoryError
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: Cluster-dm-metrics-1.PNG
>
>
> We are seeing repeated failures of nodes with 16GB of heap and the same size 
> (16GB) for direct memory (derived from -Xms).  This seems to be related to 
> CASSANDRA-15202, which moved Merkle trees off-heap in 4.0.  Using Cassandra 
> 4.0.6.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)no* further _formatting_ is 
> done here{noformat}
>  
> -XX:+AlwaysP

[jira] [Updated] (CASSANDRA-18661) Update to cassandra-stress to use Apache Commons CLI

2023-11-15 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-18661:

Labels: lhf  (was: )

> Update to cassandra-stress to use Apache Commons CLI
> 
>
> Key: CASSANDRA-18661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tool/stress
>Reporter: Brad Schoening
>Priority: Normal
>  Labels: lhf
>
> The Apache Commons CLI library provides an API for parsing command line 
> options with the package org.apache.commons.cli and this is already used by a 
> dozen of existing Cassandra utilities including:
> {quote}SSTableMetadataViewer, StandaloneScrubber, StandaloneSplitter, 
> SSTableExport, BulkLoader, and others.
> {quote}
> However, cassandra-stress is an outlier which uses its own custom classes to 
> parse command line options with classes such as OptionsSimple.  In addition, 
> the options syntax for username, password, and others are not aligned with 
> the format used by CQLSH.
> Currently, there are > 5K lines of code in 'settings' which appears to just 
> process command line args.
> This suggestion is to:
>  
> a) Upgrade cassandra-stress to use Apache Commons CLI (no new dependencies 
> are required as this library is already used by the project)
>  
> b) Align the cassandra-stress CLI options with those in CQLSH, 
>  
> For example, using the new syntax like CQLSH:
> cassandra-stress -username foo -password bar
> and replacing the old syntax:
> cassandra-stress -mode username=foo and password=bar
>  
> This will simplify and unify the code base, eliminate code and reduce the 
> confusion between similarly named classes such as 
> org.apache.cassandra.stress.settings.\{Option, OptionsMulti, OptionsSimple} 
> and org.apache.commons.cli.\{Option, OptionGroup, Options}
>  
> Note: documentation will need to be updated as well
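
As a rough illustration of the suggestion above, here is a minimal Apache Commons CLI sketch; the option names and descriptions are illustrative, not the final cassandra-stress flags:

{code:java}
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class StressCliSketch
{
    public static void main(String[] args)
    {
        Options options = new Options();
        options.addOption(Option.builder("u").longOpt("username").hasArg().desc("user to authenticate as").build());
        options.addOption(Option.builder("p").longOpt("password").hasArg().desc("password to authenticate with").build());

        CommandLineParser parser = new DefaultParser();
        try
        {
            CommandLine cmd = parser.parse(options, args);
            System.out.println("username=" + cmd.getOptionValue("username"));
        }
        catch (ParseException e)
        {
            // Commons CLI provides usage/help output for free, unlike the custom OptionsSimple classes.
            new HelpFormatter().printHelp("cassandra-stress", options);
        }
    }
}
{code}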



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19021) Update default disk_access_mode to mmap_index_only

2023-11-15 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786483#comment-17786483
 ] 

Paulo Motta commented on CASSANDRA-19021:
-

LGTM, feel free to merge if tests look good and nobody objects in the ML 
thread by tomorrow.

Please include this [NEWS.txt 
entry|https://github.com/pauloricardomg/cassandra/commit/f8d08719712c895ee0684fd5e9aa4a911dd33ed3]
 on commit.

> Update default disk_access_mode to mmap_index_only
> --
>
> Key: CASSANDRA-19021
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19021
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Paulo Motta
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 5.0-beta, 5.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://lists.apache.org/thread/nhp6vftc4kc3dxskngxy5rpo1lp19drw



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-11-14 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785848#comment-17785848
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

bq. Our quickest fix is to just fail or warn that using JRE is not recommended 
(and that some features like audit logging and sjk may not work)

This would break the official Cassandra [docker image| 
https://github.com/docker-library/cassandra/blob/master/5.0/Dockerfile] that is 
built on top of JRE. Do we want to drop the unintended JRE support that has 
been proven to work over the years on this image ?

I see the following options:
a) Make the JDK dependency optional, failing-fast if features that strictly 
require it are enabled. Add testing with JRE, preferably with the official 
docker image.
b) Make the JDK dependency strictly required, properly document this and work 
with the official docker image maintainers to update the image to use JDK 
instead.

Wdyt?
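
To make option (a) concrete, a minimal fail-fast sketch is below; the module names match the warnings quoted in the description, but the audit-logging flag is a hypothetical stand-in for whatever configuration actually gates the feature:

{code:java}
public class JdkFeatureCheck
{
    static boolean hasModule(String name)
    {
        // JDK-only modules such as jdk.attach/jdk.compiler are absent on a plain JRE.
        return ModuleLayer.boot().findModule(name).isPresent();
    }

    public static void main(String[] args)
    {
        boolean auditLoggingEnabled = Boolean.getBoolean("example.audit_logging_enabled"); // hypothetical flag
        boolean jdkPresent = hasModule("jdk.attach") && hasModule("jdk.compiler");

        if (auditLoggingEnabled && !jdkPresent)
            throw new IllegalStateException("Audit logging requires a JDK; jdk.attach/jdk.compiler not found");

        if (!jdkPresent)
            System.out.println("JRE detected: JDK-only features will be unavailable");
    }
}
{code}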

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0-beta, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-11-13 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785691#comment-17785691
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

bq. As I explained, the exports/opens for Chronicle have unclear impact on 
audit logging. 

Audit logging is an optional functionality as far as I understand. We can 
prevent startup if audit logging is enabled and a JDK is not detected.

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0-beta, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19021) Update default disk_access_mode to mmap_index_only

2023-11-13 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-19021:

  Workflow: Copy of Cassandra Default Workflow  (was: Copy of Cassandra Bug 
Workflow)
Issue Type: Improvement  (was: Bug)

> Update default disk_access_mode to mmap_index_only
> --
>
> Key: CASSANDRA-19021
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19021
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Paulo Motta
>Priority: Normal
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19021) Update default disk_access_mode to mmap_index_only

2023-11-13 Thread Paulo Motta (Jira)
Paulo Motta created CASSANDRA-19021:
---

 Summary: Update default disk_access_mode to mmap_index_only
 Key: CASSANDRA-19021
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19021
 Project: Cassandra
  Issue Type: Bug
  Components: Local/Config
Reporter: Paulo Motta






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19020) cqlsh should allow failure to import cqlshlib.serverversion

2023-11-13 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785530#comment-17785530
 ] 

Paulo Motta commented on CASSANDRA-19020:
-

+1 after CI looks good.

> cqlsh should allow failure to import cqlshlib.serverversion
> ---
>
> Key: CASSANDRA-19020
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19020
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL/Interpreter
>Reporter: Brandon Williams
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> cqlshlib.serverversion is created by ant, recording the server's version so 
> that Python can check whether it matches cqlsh later.  This can create extra work 
> for other things that need to be aware of it, like CASSANDRA-18594, so we should relax 
> it a bit since this really has no value outside of warning humans that they have a 
> mismatch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18968) StartupClusterConnectivityChecker fails on upgrade from 3.X

2023-11-08 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784225#comment-17784225
 ] 

Paulo Motta edited comment on CASSANDRA-18968 at 11/8/23 10:26 PM:
---

bq. Whole "waiting for gossip to settle" machinery is ... not ideal. Yes, it 
works in most of the situations but there are edge cases when it does not, e.g. 
when there are large clusters, it may happen that it may evaluate that gossip 
is "settled" falsely because it took so much time to detect any changes that it 
was thinking it is settled.

I'm aware waitToSettle is not reliable. Nevertheless I think having a 
"best-effort" skipping of this check when 3.X nodes are detected in gossip is 
valuable. This will mostly work as long as gossip with a single node was 
successful, since it will get the latest known versions of the other nodes. 

In the case where the gossip information is absent and there are 3.X nodes 
present in the cluster, it's not a big deal - the check will just be executed 
and the timeout warning above will be unnecessarily emitted.

We just don't want to skip this check when *all nodes are upgraded to 4.x* but 
I don't think this would happen if waitToSettle fails.

bq. I think it would make a lot of sense to run the upgrade tests here.

Good call! Thanks


was (Author: paulo):
bq. Whole "waiting for gossip to settle" machinery is ... not ideal. Yes, it 
works in most of the situations but there are edge cases when it does not, e.g. 
when there are large clusters, it may happen that it may evaluate that gossip 
is "settled" falsely because it took so much time to detect any changes that it 
was thinking it is settled.

I'm aware waitToSettle is not reliable. Nevertheless I think having a 
"best-effort" skipping of this check when 3.X nodes are detected in gossip is 
valuable. This will mostly work as long as gossip with a single node was 
successful, since it will get the latest known versions of the other nodes. 

In the case where the gossip information is absent and there are 3.X nodes 
present in the cluster, it's not a big deal - the check will just be executed 
and the timeout warning above will be unnecessarily emitted.

We just don't want to skip this check when *all nodes are upgraded to 4.x* but 
I don't think this would happen if waitToSettle fails.

> StartupClusterConnectivityChecker fails on upgrade from 3.X
> ---
>
> Key: CASSANDRA-18968
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18968
>     Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Startup and Shutdown
>Reporter: Paulo Motta
>Assignee: Isaac Reath
>Priority: Normal
>  Labels: lhf
> Fix For: 4.0.x, 4.1.x
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Starting up a new 4.X node on a 3.x cluster throws the following warning:
> {noformat}
> WARN  [main] 2023-10-27 15:58:22,234 
> StartupClusterConnectivityChecker.java:183 - Timed out after 10002 
> milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, 
> A.B.C.D]}
> {noformat}
> I think this is because the PING messages used by the startup check are not 
> available on 3.X.
> To provide a smoother upgrade experience we should probably disable this 
> check on a mixed version clusters, or skip peers on versions < 4.x when doing 
> the connectivity check.
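
A minimal sketch of the second option (skip peers below 4.x), assuming the peer release versions are already known from gossip; the peer-to-version map below is illustrative and not an actual Cassandra API:

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ConnectivityCheckFilter
{
    // Keep only peers new enough to answer the PING messages used by the startup check.
    static Set<String> peersToWaitFor(Map<String, String> peerReleaseVersions)
    {
        return peerReleaseVersions.entrySet().stream()
                                  .filter(e -> e.getValue() != null && !e.getValue().startsWith("3."))
                                  .map(Map.Entry::getKey)
                                  .collect(Collectors.toSet());
    }

    public static void main(String[] args)
    {
        Map<String, String> versions = Map.of("10.0.0.1", "3.11.15",
                                              "10.0.0.2", "4.0.10",
                                              "10.0.0.3", "4.1.3");
        // Only the two 4.x peers are waited for; the 3.11 peer is skipped.
        System.out.println(peersToWaitFor(versions));
    }
}
{code}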



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org


