[jira] [Commented] (CASSANDRA-19776) Spinning trying to capture readers
[ https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871180#comment-17871180 ]

Cameron Zemek commented on CASSANDRA-19776:
-------------------------------------------

Note there are other places in the code that call selectAndReference with the CANONICAL set that would hit the same issue if there is a compaction ongoing. In fact, I have blacklisted the EstimatedPartitionCount metric as a workaround but still see this spinning occur (I have yet to trace the origin of those calls). Another interesting data point: all the occurrences of this I have seen are with TimeWindowCompactionStrategy.

> Spinning trying to capture readers
> ----------------------------------
>
>                 Key: CASSANDRA-19776
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: extract.log
>
> On a handful of clusters we are noticing spin locks occurring. I traced back all the calls to the EstimatedPartitionCount metric (e.g. org.apache.cassandra.metrics:type=Table,keyspace=testks,scope=testcf,name=EstimatedPartitionCount) using the following patched function:
> {code:java}
> public RefViewFragment selectAndReference(Function<View, Iterable<SSTableReader>> filter)
> {
>     long failingSince = -1L;
>     boolean first = true;
>     while (true)
>     {
>         ViewFragment view = select(filter);
>         Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
>         if (refs != null)
>             return new RefViewFragment(view.sstables, view.memtables, refs);
>         if (failingSince <= 0)
>         {
>             failingSince = System.nanoTime();
>         }
>         else if (System.nanoTime() - failingSince > TimeUnit.MILLISECONDS.toNanos(100))
>         {
>             List<SSTableReader> released = new ArrayList<>();
>             for (SSTableReader reader : view.sstables)
>                 if (reader.selfRef().globalCount() == 0)
>                     released.add(reader);
>             NoSpamLogger.log(logger, NoSpamLogger.Level.WARN, 1, TimeUnit.SECONDS,
>                              "Spinning trying to capture readers {}, released: {}", view.sstables, released);
>             if (first)
>             {
>                 first = false;
>                 try {
>                     throw new RuntimeException("Spinning trying to capture readers");
>                 } catch (Exception e) {
>                     logger.warn("Spin lock stacktrace", e);
>                 }
>             }
>             failingSince = System.nanoTime();
>         }
>     }
> }
> {code}
> Digging into this code I found it will fail if any of the sstables are in a released state (i.e. reader.selfRef().globalCount() == 0).
> See the attached extract.log for an example of one of these spin lock occurrences. Sometimes these spin locks last over 5 minutes. Across the worst cluster with this issue, I ran a log processing script that, every time the 'Spinning trying to capture readers' message differed from the previous one, output whether the released tables were in the Compacting state. Every single occurrence has it spin locking with released listing an sstable that is compacting.
> In the extract.log example it is spin locking saying that nb-320533-big-Data.db has been released, but you can see that prior to the spinning that sstable is involved in a compaction. The compaction completes at 01:03:36 and the spinning stops. nb-320533-big-Data.db is deleted at 01:03:49 along with the other 9 sstables involved in the compaction.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
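The all-or-nothing capture that selectAndReference relies on can be sketched as follows. This is a simplified, illustrative model (the class and method names are invented here, not Cassandra's actual Ref/Refs implementation): a negative count marks a released resource, and a single released member makes the whole group capture fail and roll back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of reference capture: count < 0 means released,
// and releasing one member of a group blocks capturing the whole group.
class TryRefDemo
{
    static class Resource
    {
        // count < 0 means released; >= 0 is the number of live references
        final AtomicInteger counts = new AtomicInteger(0);

        boolean ref()
        {
            while (true)
            {
                int cur = counts.get();
                if (cur < 0)
                    return false; // released: can never be re-referenced
                if (counts.compareAndSet(cur, cur + 1))
                    return true;
            }
        }

        void release()
        {
            counts.decrementAndGet();
        }
    }

    // Returns true only if EVERY resource could be referenced; otherwise
    // rolls back the references already taken (all-or-nothing capture).
    static boolean tryRefAll(List<Resource> resources)
    {
        List<Resource> taken = new ArrayList<>();
        for (Resource r : resources)
        {
            if (!r.ref())
            {
                for (Resource t : taken)
                    t.release();
                return false;
            }
            taken.add(r);
        }
        return true;
    }
}
```

With this shape, a caller that retries tryRefAll in a loop spins for as long as any member of the view stays released, which matches the reported behaviour.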
[jira] [Commented] (CASSANDRA-18543) Waiting for gossip to settle does not wait for live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870881#comment-17870881 ]

Cameron Zemek commented on CASSANDRA-18543:
-------------------------------------------

[~Aburadeh] can you refer to https://issues.apache.org/jira/browse/CASSANDRA-19580 to see if that is what you are running into?

> Waiting for gossip to settle does not wait for live endpoints
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-18543
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18543
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 3.11.16, 4.0.11, 4.1.3, 5.0-alpha1, 5.0
>
>         Attachments: gossip.patch, gossip4.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When a node starts it will get endpoint states (via the shadow round) but have all nodes marked as down. The problem is that the wait for gossip to settle only checks that the size of the endpoint state map is stable before starting the native transport. Once the native transport starts, the node will receive queries and fail consistency levels such as LOCAL_QUORUM, since it still thinks the other nodes are down.
> This is a problem for a number of our customers' large clusters. The cluster has quorum, but due to this issue a node restart causes a bunch of query errors.
> My initial solution was to check the live endpoint count in addition to the size of the endpoint state map. This worked, but while testing the fix I noticed there is also a lot of duplicated checking of the same node's liveness (via Echo messages). So the patch also removes this duplication of checking that a node is UP in markAlive.
> The final problem I found while testing is that sometimes a change in live endpoints could still be missed due to the 1 second polling interval, so the patch allows the settle parameters to be overridden. I could not reliably reproduce this, but I think it's worth providing a way to override these hardcoded values.
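The settle check described in the ticket (wait until both the endpoint-state count and the live-endpoint count are stable across several polls) can be sketched like this. All names and parameters here are illustrative, not Cassandra's actual StorageService code:

```java
// Sketch of a gossip settle loop that requires BOTH the endpoint-state
// count and the live-endpoint count to be unchanged for several polls.
class GossipSettler
{
    interface ClusterView
    {
        int endpointStateCount();
        int liveEndpointCount();
    }

    // Returns the number of polls it took to settle, or -1 if maxPolls ran out.
    static int waitToSettle(ClusterView view, int requiredStablePolls, int maxPolls)
    {
        int stable = 0;
        int lastStates = -1, lastLive = -1;
        for (int poll = 1; poll <= maxPolls; poll++)
        {
            int states = view.endpointStateCount();
            int live = view.liveEndpointCount();
            // both counts must be unchanged to count as a stable poll
            if (states == lastStates && live == lastLive)
                stable++;
            else
                stable = 0;
            lastStates = states;
            lastLive = live;
            if (stable >= requiredStablePolls)
                return poll;
        }
        return -1;
    }
}
```

Checking only `endpointStateCount()` (the pre-patch behaviour) would declare the cluster settled while `liveEndpointCount()` is still climbing, which is exactly the window in which LOCAL_QUORUM queries fail.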
[jira] [Commented] (CASSANDRA-19776) Spinning trying to capture readers
[ https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868832#comment-17868832 ]

Cameron Zemek commented on CASSANDRA-19776:
-------------------------------------------

Okay, I have found the cause of this. The EstimatedPartitionCount metric asks for a reference to the CANONICAL sstables:
{code:java}
ViewFragment view = select(filter);
Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
{code}
Meanwhile there is a compaction running that includes a fully expired sstable:
{code:java}
Set<SSTableReader> actuallyCompact = Sets.difference(transaction.originals(), fullyExpiredSSTables);
// ...
try (Refs<SSTableReader> refs = Refs.ref(actuallyCompact))
{code}
But the compaction doesn't take a reference on the fully expired sstable. So the selectAndReference call by EstimatedPartitionCount is stuck looping, trying to take a reference to the fully expired sstable; that sstable has no references, so the attempt fails the counts check:
{code:java}
boolean ref()
{
    while (true)
    {
        int cur = counts.get();
        if (cur < 0)
            return false;
        if (counts.compareAndSet(cur, cur + 1))
            return true;
    }
}
{code}
It spins until the compaction completes, at which point the fully expired sstable is removed from the CANONICAL set of sstables.
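The interplay described above can be replayed as a small sequential model. This is purely illustrative (the names and the step-wise "compaction" are invented for the sketch): the metric's capture attempt keeps failing while a released sstable is still a member of the canonical view, and succeeds only once the finished compaction removes it.

```java
import java.util.List;
import java.util.Set;

// Sequential replay of the spin: capture of the view fails while any
// member is released, and compaction progress eventually shrinks the view.
class SpinTimeline
{
    // capture succeeds only when no member of the view is already released
    static boolean tryCapture(Set<String> view, Set<String> released)
    {
        for (String sstable : view)
            if (released.contains(sstable))
                return false;
        return true;
    }

    // runs one "compaction step" after each failed attempt;
    // returns how many attempts the metric needed to capture the view
    static int metricAttempts(List<Runnable> compactionSteps, Set<String> view, Set<String> released)
    {
        int attempts = 0;
        int step = 0;
        while (true)
        {
            attempts++;
            if (tryCapture(view, released))
                return attempts;
            if (step < compactionSteps.size())
                compactionSteps.get(step++).run();
        }
    }
}
```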
[jira] [Created] (CASSANDRA-19776) Spinning trying to capture readers
Cameron Zemek created CASSANDRA-19776:
-----------------------------------------

             Summary: Spinning trying to capture readers
                 Key: CASSANDRA-19776
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
             Project: Cassandra
          Issue Type: Bug
            Reporter: Cameron Zemek
         Attachments: extract.log
[jira] [Assigned] (CASSANDRA-19703) Newly inserted prepared statements got evicted too early from cache that leads to race condition
[ https://issues.apache.org/jira/browse/CASSANDRA-19703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek reassigned CASSANDRA-19703:
-----------------------------------------

    Assignee: Cameron Zemek

> Newly inserted prepared statements got evicted too early from cache that leads to race condition
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19703
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19703
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Yuqi Yan
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.1.x
>
> We're upgrading from Cassandra 4.0 to Cassandra 4.1.3, and the system.prepared_statements table started growing to GB size after the upgrade. This slows down node startup significantly when it is doing preloadPreparedStatements.
> I can't share the exact log, but it's a race condition like this:
> # [Thread 1] Receives a prepare request for S1. Attempts to get S1 from the cache
> # [Thread 1] Cache miss, puts S1 into the cache
> # [Thread 1] Attempts to write S1 into the local table
> # [Thread 2] Receives a prepare request for S2. Attempts to get S2 from the cache
> # [Thread 2] Cache miss, puts S2 into the cache
> # [Thread 2] Cache is full, evicts S1 from the cache
> # [Thread 2] Attempts to delete S1 from the local table
> # [Thread 2] Tombstone inserted for S1, delete finished
> # [Thread 1] Record inserted for S1, write finished
> Thread 2 inserted a tombstone for S1 earlier than Thread 1 was able to insert the record into the table. Hence the data will not be removed, because the later insert has a newer write time than the tombstone.
> Whether this happens or not depends on how the cache decides what the next entry to evict is when it's full. We noticed that in 4.1.3 Caffeine was upgraded to 2.9.2 (CASSANDRA-15153).
> I did some research in the Caffeine commits. It seems this commit caused the entry to be evicted too early: "Eagerly evict an entry if it is too large to fit in the cache" (Feb 2021), available after 2.9.0:
> [https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> It was later fixed in: "Improve eviction when overflow or the weight is oversized" (Aug 2022), available after 3.1.2:
> [https://github.com/ben-manes/caffeine/commit/25b7d17b1a246a63e4991d4902a2ecf24e86d234]
> {quote}Previously an attempt to centralize evictions into one code path led to a suboptimal approach ([464bc19|https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]). This tried to move those entries into the LRU position for early eviction, but was confusing and could too aggressively evict something that is desirable to keep.
> {quote}
> I upgraded Caffeine to 3.1.8 (same as the 5.0 trunk) and this issue is gone. But I think that version is not compatible with Java 8.
> I'm not 100% sure if this is the root cause or what the correct fix is here. Would appreciate it if anyone can have a look, thanks.
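The eviction/write race enumerated above can be replayed sequentially under last-write-wins reconciliation. This is an illustrative model only (the map, the timestamp encoding, and the helper names are invented here; this is not Caffeine's or Cassandra's actual API): the tombstone from the premature eviction lands first, so the insert that finishes later shadows it, and the row is never really deleted.

```java
import java.util.HashMap;
import java.util.Map;

// Last-write-wins replay of the prepared-statement race: a tombstone
// written before the original insert completes is shadowed by that insert.
class PreparedStatementRace
{
    // value = write timestamp; a negative timestamp marks a tombstone
    static final Map<String, Long> table = new HashMap<>();

    static void write(String key, long ts)
    {
        // the cell with the larger absolute timestamp wins reconciliation
        table.merge(key, ts, (old, now) -> Math.abs(now) >= Math.abs(old) ? now : old);
    }

    static void delete(String key, long ts)
    {
        write(key, -ts); // tombstone, reconciled by the same last-write-wins rule
    }

    static boolean isLive(String key)
    {
        Long ts = table.get(key);
        return ts != null && ts > 0;
    }
}
```

Replaying the ticket's interleaving (delete at t=2, insert at t=3) leaves S1 live, which is how the table keeps growing despite the evictions.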
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845757#comment-17845757 ]

Cameron Zemek commented on CASSANDRA-18866:
-------------------------------------------

The Gossip Stage has only 1 thread, so this doesn't have a race condition. So the full patch is [^CASSANDRA-18866-4.0.patch]

> Node sends multiple inflight echos
> ----------------------------------
>
>                 Key: CASSANDRA-18866
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 18866-regression.patch, CASSANDRA-18866-4.0.patch, duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to allow only one in-flight ECHO request at a time. As per 18854, some tests had an error rate due to this change. I am creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests in flight at the same time, so it is less likely that all ECHOs will time out or get lost.
> With the change from 18845 plus some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from a node and also see it retrying ECHOs when it doesn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but this is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive when other nodes consider the node UP but it is marked DOWN by a node.
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek updated CASSANDRA-18866:
--------------------------------------

    Attachment: CASSANDRA-18866-4.0.patch
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845360#comment-17845360 ]

Cameron Zemek commented on CASSANDRA-18866:
-------------------------------------------

Found a bug with this patch:
{code:java}
private void handleMajorStateChange(InetAddressAndPort ep, EndpointState epState)
{
    // omitted for brevity
    endpointStateMap.put(ep, epState);
    if (localEpState != null)
    {
        // the node restarted: it is up to the subscriber to take whatever action is necessary
        for (IEndpointStateChangeSubscriber subscriber : subscribers)
            subscriber.onRestart(ep, localEpState);
    }
    if (!isDeadState(epState))
        markAlive(ep, epState);
{code}
markAlive is passed the remote epState that just got put into the endpointStateMap, which has isAlive = true on it.
{code:java}
private void markAlive(final InetAddressAndPort addr, final EndpointState localState)
{
    if (inflightEcho.contains(addr))
    {
        return;
    }
    inflightEcho.add(addr);
    localState.markDead();
{code}
But we don't enter markAlive when there is already an in-flight echo request. So endpointStateMap now has an entry with isAlive = true, but unreachableEndpoints has the down node. So now `nodetool status` and the down endpoint count do not match.
The fix is to have the onResponse to the ECHO update the entry currently in the map, and to always mark the passed-in state dead:
{code:java}
private void markAlive(final InetAddressAndPort addr, final EndpointState localState)
{
    localState.markDead();
    if (!inflightEcho.add(addr))
    {
        return;
    }
    Message echoMessage = Message.out(ECHO_REQ, noPayload);
    logger.trace("Sending ECHO_REQ to {}", addr);
    RequestCallback echoHandler = new RequestCallback()
    {
        @Override
        public void onResponse(Message msg)
        {
            // force processing of the echo response onto the gossip stage, as it comes in on the REQUEST_RESPONSE stage
            runInGossipStageBlocking(() -> {
                try
                {
                    EndpointState localEpStatePtr = endpointStateMap.get(addr);
                    realMarkAlive(addr, localEpStatePtr);
                }
                finally
                {
                    inflightEcho.remove(addr);
                }
            });
        }
    };
{code}
Not sure if this allows for a race condition around endpointStateMap (e.g. a call to handleMajorStateChange putting a new entry that gets marked dead after the call to get localEpStatePtr in the onResponse callback).
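The shape of the fix discussed above can be sketched with plain collections. This is a hypothetical, simplified model (String endpoints, a boolean alive flag, invented class names; not the Gossiper's actual types): always mark the passed-in state dead, suppress duplicate echoes via the set's `add()` return value, and have the response handler re-read the CURRENT map entry rather than the possibly stale captured one.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of single-in-flight echo tracking with a current-entry lookup
// in the response handler.
class EchoTracker
{
    static class State { boolean alive = true; }

    final Map<String, State> endpointStateMap = new ConcurrentHashMap<>();
    final Set<String> inflightEcho = ConcurrentHashMap.newKeySet();

    // returns true if an echo was actually sent (i.e. none already in flight)
    boolean markAlive(String addr, State incoming)
    {
        incoming.alive = false;       // always mark dead, even when skipping
        if (!inflightEcho.add(addr))  // add() is false when one is already in flight
            return false;
        // ... send ECHO_REQ here ...
        return true;
    }

    void onEchoResponse(String addr)
    {
        try
        {
            State current = endpointStateMap.get(addr); // re-read the current entry
            if (current != null)
                current.alive = true;
        }
        finally
        {
            inflightEcho.remove(addr);
        }
    }
}
```

The point of re-reading the map in onEchoResponse is that handleMajorStateChange may have replaced the entry while the echo was in flight; reviving the stale captured state would reintroduce the `nodetool status` mismatch.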
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842799#comment-17842799 ]

Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------

> Most of what you've described here are implementation details of how replace works, like how hibernate is handled, so I'm not sure if anything is wrong.

I do not follow what you mean by "not sure if anything is wrong". The problem is that you can't do the replacement if for any reason the node ends up in the hibernate state. It is forever stuck on the 'Unable to contact any seeds!' error; every attempt at replacement results in that error. This is a long-running issue that we have seen many times over the years but never managed to figure out the cause of.

I do not know what the correct solution to this is. There seem to be many possible approaches to a fix, and I am unaware of the reasons behind the current implementation, so I can't decide which would be the preferred method. For example, I don't understand why responses to a SYN do not include state for nodes that are not in the digest list. Gossip has been like this for a long time, so that seems a rather major thing to change. Another approach would be to no longer use hibernate, i.e. CASSANDRA-12344.

> Unable to contact any seeds with node in hibernate status
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-19580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Cameron Zemek
>            Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I have been able to reproduce this issue if I kill Cassandra as it is joining, which puts the node into hibernate status. Once a node is in hibernate it will no longer receive any SYN messages from other nodes during startup, and as it sends only itself as a digest in outbound SYN messages, it never receives any states in any of the ACK replies. So once it gets to the `seenAnySeed` check, it fails as the endpointStateMap is empty.
>
> A workaround is copying the system.peers table from another node, but this is less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
> /* Possibly gossip to a seed for facilitating partition healing */
> private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
> {
>     int size = seeds.size();
>     if (size > 0)
>     {
>         if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>         {
>             return;
>         }
>         if (liveEndpoints.size() == 0)
>         {
>             List<GossipDigest> gDigests = prod.payload.gDigests;
>             if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>             {
>                 gDigests = new ArrayList<GossipDigest>();
>                 GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                        DatabaseDescriptor.getPartitionerName(),
>                                                                        gDigests);
>                 MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                      digestSynMessage,
>                                                                                      GossipDigestSyn.serializer);
>                 sendGossip(message, seeds);
>             }
>             else
>             {
>                 sendGossip(prod, seeds);
>             }
>         }
>         else
>         {
>             /* Gossip with the seed with some probability. */
>             double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>             double randDbl = random.nextDouble();
>             if (randDbl <= probability)
>                 sendGossip(prod, seeds);
>         }
>     }
> }
> {code}
> The only problem is that this is the same as the SYN from the shadow round. It does resolve the issue, however, as the node then receives an ACK with all the states.
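Why the empty digest list in the workaround changes the outcome can be shown with a small model of the digest exchange. This is illustrative only (string endpoints and a map of states, not Cassandra's wire format): an ACK only carries state for endpoints named in the SYN's digest list, so a SYN listing only the sender learns nothing new, while an empty digest list (shadow-round style) gets every known state back.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Model of a seed answering a gossip SYN: states are returned only for the
// endpoints named in the digest list, except an empty list returns everything.
class DigestExchange
{
    static Map<String, String> ackFor(List<String> synDigests, Map<String, String> known)
    {
        Map<String, String> ack = new HashMap<>();
        if (synDigests.isEmpty())
        {
            ack.putAll(known); // shadow-round style: send every known state
            return ack;
        }
        for (String endpoint : synDigests) // normal round: only requested endpoints
            if (known.containsKey(endpoint))
                ack.put(endpoint, known.get(endpoint));
        return ack;
    }
}
```

Under this model the hibernating node, which puts only its own digest in the SYN, can never populate its endpointStateMap with the rest of the cluster, matching the 'Unable to contact any seeds!' failure.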
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842182#comment-17842182 ]

Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------

I don't understand why Gossiper::examineGossiper is implemented to only iterate over the digests in the SYN message. Why doesn't it also send back, in the delta, the entries missing from the digest list?
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841016#comment-17841016 ]

Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------

> Set compression to all so there are no special cases and test again.

My test was with all.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840266#comment-17840266 ] Cameron Zemek commented on CASSANDRA-19580:
---
> If you have internode_compression=dc then replacement with the same IP will not work, you need to use a different IP because the compression has already been negotiated on the other nodes.

Not to get too off topic from the issue at hand, but I am able to do a replacement with the same IP with internode compression enabled. So what doesn't work about this?
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840241#comment-17840241 ] Cameron Zemek commented on CASSANDRA-19580:
---
Yeah, so what breaks if we use the same state as when replacing with a different address? I looked through CASSANDRA-8523 and didn't understand what is different about replacing when reusing the same IP address. Why isn't the node in UJ state when doing replacements, that is, receiving writes but not reads?

What do you think would be the correct fix here? Is sending an empty SYN like the shadow round okay? Why does examineGossiper not send back states for missing digests (it only compares the digests listed in the SYN)? Considering that SYN messages are sent randomly, it seems like we could also end up on this 'Unable to contact any seeds!' path if none of the nodes randomly picks the replacement node to send a SYN to.
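On the last point, a rough back-of-envelope (purely illustrative, not the actual Gossiper selection logic; the class name is made up) shows the chance that one specific peer is missed by every node in a single round:

```java
public class SynMissProbability {
    // Chance that, in one gossip round, none of k gossiping nodes picks one
    // specific peer, when each node chooses 1 target uniformly at random
    // from m candidate endpoints.
    static double missProbability(int k, int m) {
        return Math.pow((m - 1) / (double) m, k);
    }

    public static void main(String[] args) {
        // e.g. a hypothetical 50-node cluster: 49 live nodes, each gossiping
        // to 1 of its 49 peers; the per-round miss chance is roughly 1/e.
        System.out.printf("per-round miss probability: %.3f%n", missProbability(49, 49));
    }
}
```

Because a round runs every second, the per-round miss probability compounds away within a few rounds, so a sustained 'Unable to contact any seeds!' points at the dead-state filtering rather than random target selection alone.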
[jira] [Comment Edited] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839901#comment-17839901 ] Cameron Zemek edited comment on CASSANDRA-19580 at 4/23/24 1:03 AM:
---
[~brandon.williams] do you know why it needs to use hibernate for a replacement with the same address? CASSANDRA-8523 added the BOOT_REPLACE status. I am not sure what I am breaking by doing this:
{code:java}
public void prepareToJoin() throws ConfigurationException
{
    // omitted for brevity
    else if (isReplacingSameAddress())
    {
        // only go into hibernate state if replacing the same address (CASSANDRA-8523)
        logger.warn("Writes will not be forwarded to this node during replacement because it has the same address as " +
                    "the node to be replaced ({}). If the previous node has been down for longer than max_hint_window_in_ms, " +
                    "repair must be run after the replacement process in order to make this node consistent.",
                    DatabaseDescriptor.getReplaceAddress());
        appStates.put(ApplicationState.STATUS, valueFactory.bootReplacing(DatabaseDescriptor.getReplaceAddress()));
    }
{code}
This stops the issue, as the node is no longer put into hibernate during replacement. So if the replacement fails, it is not left in a dead state.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839901#comment-17839901 ] Cameron Zemek commented on CASSANDRA-19580:
---
[~brandon.williams] do you know why it needs to use hibernate for a replacement with the same address? CASSANDRA-8523 added the BOOT_REPLACE status. I am not sure what I am breaking by doing this (same prepareToJoin change as in the edited version of this comment above): switching the STATUS app state from hibernate to valueFactory.bootReplacing(DatabaseDescriptor.getReplaceAddress()). This stops the issue, as the node is no longer put into hibernate during replacement. So if the replacement fails, it is not left in a dead state.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839887#comment-17839887 ] Cameron Zemek commented on CASSANDRA-19580:
---
PS: the customer is not doing step 2; that is just my reliable way to reproduce the issue. I have seen this 'Unable to contact any seeds!' in the past but never had enough information to go on. It seems to happen on larger clusters.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839886#comment-17839886 ] Cameron Zemek commented on CASSANDRA-19580:
---
The node trying to replace. So in my reproduction steps:
# replace a node using '-Dcassandra.replace_address=44.239.237.152'
# while it is replacing, kill off Cassandra
# wipe the Cassandra folders
# start Cassandra again, still using the replace address flag

After step 2, if I check 'nodetool gossipinfo', the node being replaced (44.239.237.152 in this example) has a status of hibernate. During step 4 the other nodes will say 'Not marking /44.239.237.152 alive due to dead state'.

I did a whole bunch of testing of this yesterday and this is the key issue as far as I can tell. Because the replacing node is in hibernate, the other nodes won't send it a SYN (see maybeGossipToUnreachableMember, which filters out endpoints in a dead state). And without a SYN message the replacing node never gets the gossip state of the cluster, as its own SYN messages only contain itself as a digest, so the ACK replies to those don't include other nodes.
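The filtering behaviour described above can be sketched as follows. This is a simplified stand-in for maybeGossipToUnreachableMember with made-up names and an assumed set of dead states; it only illustrates why an endpoint whose status is a dead state such as hibernate never becomes a SYN target, while a shutdown endpoint still does.

```java
import java.util.*;

public class DeadStateFilterSketch {
    // Gossip statuses treated as "dead" for SYN purposes; hibernate is the
    // one at issue here. The exact membership of this set is an assumption.
    static final Set<String> DEAD_STATES =
        new HashSet<>(Arrays.asList("hibernate", "removing", "removed", "left"));

    // Only unreachable endpoints NOT in a dead state remain SYN candidates.
    static List<String> synCandidates(Map<String, String> unreachableStatus) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, String> e : unreachableStatus.entrySet())
            if (!DEAD_STATES.contains(e.getValue()))
                candidates.add(e.getKey());
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, String> unreachable = new LinkedHashMap<>();
        unreachable.put("/44.239.237.152", "hibernate"); // replacing node: filtered out
        unreachable.put("/10.120.156.99", "shutdown");   // still eligible for a SYN
        System.out.println(synCandidates(unreachable)); // prints [/10.120.156.99]
    }
}
```

The hibernating replacement node is therefore invisible to everyone else's gossip until it learns the cluster state some other way.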
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839882#comment-17839882 ] Cameron Zemek commented on CASSANDRA-19580:
---
Customer cluster has:
commitlog_compression=LZ4Compressor
hints_compression=null
internode_compression=dc

So it happens both with and without compression.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839881#comment-17839881 ] Cameron Zemek commented on CASSANDRA-19580:
---
[~brandon.williams]
> Is compression enabled on this cluster?

Not sure which setting you are referring to. I just replicated the issue on a test cluster where I have:
commitlog_compression=null
internode_compression=none
hints_compression=null
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839879#comment-17839879 ] Cameron Zemek commented on CASSANDRA-19580:
---
Here is an extract of logs showing the issue:
{noformat}
INFO [main] 2024-04-17 17:57:45,766 MessagingService.java:750 - Starting Messaging Service on /10.120.156.42:7000 (eth0)
INFO [main] 2024-04-17 17:57:45,775 StorageService.java:681 - Gathering node replacement information for /10.120.156.42
TRACE [main] 2024-04-17 17:57:45,781 Gossiper.java:1613 - Sending shadow round GOSSIP DIGEST SYN to seeds [/10.120.156.17, /10.120.156.21, /10.120.156.9]
INFO [main] 2024-04-17 17:57:45,788 OutboundTcpConnection.java:108 - OutboundTcpConnection using coalescing strategy DISABLED
INFO [HANDSHAKE-/10.120.156.9] 2024-04-17 17:57:45,802 OutboundTcpConnection.java:561 - Handshaking version with /10.120.156.9
INFO [HANDSHAKE-/10.120.156.17] 2024-04-17 17:57:45,803 OutboundTcpConnection.java:561 - Handshaking version with /10.120.156.17
INFO [HANDSHAKE-/10.120.156.21] 2024-04-17 17:57:45,803 OutboundTcpConnection.java:561 - Handshaking version with /10.120.156.21
TRACE [GossipStage:1] 2024-04-17 17:57:45,875 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.9
TRACE [GossipStage:1] 2024-04-17 17:57:45,875 GossipDigestAckVerbHandler.java:52 - Received ack with 0 digests and 48 states
DEBUG [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:57 - Received an ack from /10.120.156.9, which may trigger exit from shadow round
DEBUG [GossipStage:1] 2024-04-17 17:57:45,876 Gossiper.java:1802 - Received a regular ack from /10.120.156.9, can now exit shadow round
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.21
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:45 - Ignoring GossipDigestAckMessage because gossip is disabled
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.17
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:45 - Ignoring GossipDigestAckMessage because gossip is disabled
WARN [main] 2024-04-17 17:57:46,825 StorageService.java:970 - Writes will not be forwarded to this node during replacement because it has the same address as the node to be replaced (/10.120.156.42). If the previous node has been down for longer than max_hint_window_in_ms, repair must be run after the replacement process in order to make this node consistent.
INFO [main] 2024-04-17 17:57:46,827 StorageService.java:877 - Loading persisted ring state
INFO [main] 2024-04-17 17:57:46,829 StorageService.java:1008 - Starting up server gossip
TRACE [main] 2024-04-17 17:57:46,854 Gossiper.java:1550 - gossip started with generation 171337
WARN [main] 2024-04-17 17:57:46,883 StorageService.java:1099 - Detected previous bootstrap failure; retrying
INFO [main] 2024-04-17 17:57:46,883 StorageService.java:1679 - JOINING: waiting for ring information
TRACE [GossipTasks:1] 2024-04-17 17:57:47,855 Gossiper.java:215 - My heartbeat is now 16
TRACE [GossipTasks:1] 2024-04-17 17:57:47,856 Gossiper.java:633 - Gossip Digests are : /10.120.156.42:171337:16
TRACE [GossipTasks:1] 2024-04-17 17:57:47,857 Gossiper.java:782 - Sending a GossipDigestSyn to /10.120.156.17 ...
TRACE [GossipTasks:1] 2024-04-17 17:57:47,857 Gossiper.java:911 - Performing status check ...
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.17
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0 states
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1048 - local heartbeat version 16 greater than 0 for /10.120.156.42
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state STATUS: hibernate,true
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state SCHEMA: 59adb24e-f3cd-3e02-97f0-5b395827453f
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state DC: us-west2
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state RACK: c
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state RELEASE_VERSION: 3.11.16
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state INTERNAL_IP: 10.120.156.42
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state RPC_ADDRESS: 10.120.156.42
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state NET_VERSION: 11
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state HOST_ID: 4477-a899-4cc1-a9f9-2
{noformat}
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839870#comment-17839870 ] Cameron Zemek commented on CASSANDRA-19580: --- [~brandon.williams] sorry I did not clarify that exactly what doing, node replacements. In particular for same IP address. If I kill off the node during node replacement the other nodes in cluster will have that replacing node in hibernate status. At which point you will always get 'Unable to contact any seeds!' as SYN are not sent by other nodes to the replacing node when they have it in HIBERNATE status since that is a dead state. In a working replacement the other nodes have it in SHUTDOWN state. Then as part of bootstrap the node gets marked as alive and then one of the nodes end up sending a SYN. That is if there some failure during a node replacement end up in unrecoverable state. > Unable to contact any seeds with node in hibernate status > - > > Key: CASSANDRA-19580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19580 > Project: Cassandra > Issue Type: Bug >Reporter: Cameron Zemek >Priority: Normal > > We have customer running into the error 'Unable to contact any seeds!' . I > have been able to reproduce this issue if I kill Cassandra as its joining > which will put the node into hibernate status. Once a node is in hibernate it > will no longer receive any SYN messages from other nodes during startup and > as it sends only itself as digest in outbound SYN messages it never receives > any states in any of the ACK replies. So once it gets to the check > `seenAnySeed` in it fails as the endpointStateMap is empty. > > A workaround is copying the system.peers table from other node but this is > less than ideal. 
> I tested modifying maybeGossipToSeed as follows:
> {code:java}
> /* Possibly gossip to a seed for facilitating partition healing */
> private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
> {
>     int size = seeds.size();
>     if (size > 0)
>     {
>         if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>         {
>             return;
>         }
>         if (liveEndpoints.size() == 0)
>         {
>             List<GossipDigest> gDigests = prod.payload.gDigests;
>             if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>             {
>                 gDigests = new ArrayList<>();
>                 GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                        DatabaseDescriptor.getPartitionerName(),
>                                                                        gDigests);
>                 MessageOut<GossipDigestSyn> message = new MessageOut<>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                        digestSynMessage,
>                                                                        GossipDigestSyn.serializer);
>                 sendGossip(message, seeds);
>             }
>             else
>             {
>                 sendGossip(prod, seeds);
>             }
>         }
>         else
>         {
>             /* Gossip with the seed with some probability. */
>             double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>             double randDbl = random.nextDouble();
>             if (randDbl <= probability)
>                 sendGossip(prod, seeds);
>         }
>     }
> }
> {code}
> The only problem is this is the same as the SYN from the shadow round. It does resolve the issue, however, as the node then receives an ACK with all the states.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
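The gating described in the comment above (peers never SYN a node they hold in a dead state such as HIBERNATE, while SHUTDOWN still receives gossip) can be sketched as a toy model. The class, method, and the exact state list below are illustrative stand-ins, not Cassandra's actual Gossiper code:

```java
// Simplified model of why a node stuck in HIBERNATE never receives gossip SYNs:
// peers skip endpoints whose status is one of the "dead" states when choosing
// SYN targets, and hibernate is a dead state while shutdown is not.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SynTargetSketch
{
    // Illustrative set of gossip status strings treated as dead states.
    static final Set<String> DEAD_STATES =
        new HashSet<>(Arrays.asList("hibernate", "removing", "removed", "LEFT"));

    /** A peer only gossips SYNs to endpoints it does not consider dead. */
    static boolean willSendSynTo(String peerStatus)
    {
        return !DEAD_STATES.contains(peerStatus);
    }

    public static void main(String[] args)
    {
        // Working replacement: the old node is seen in shutdown state, SYNs still flow.
        System.out.println("shutdown  -> " + willSendSynTo("shutdown"));  // true
        // Failed replacement: node killed mid-replace is left in hibernate, no SYNs.
        System.out.println("hibernate -> " + willSendSynTo("hibernate")); // false
    }
}
```

This is why the maybeGossipToSeed change above helps: it gives the hibernating node a way to pull states from seeds itself instead of waiting for a SYN that will never arrive.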
[jira] [Updated] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-19580: -- Description: We have customer running into the error 'Unable to contact any seeds!' . I have been able to reproduce this issue if I kill Cassandra as its joining which will put the node into hibernate status. Once a node is in hibernate it will no longer receive any SYN messages from other nodes during startup and as it sends only itself as digest in outbound SYN messages it never receives any states in any of the ACK replies. So once it gets to the check `seenAnySeed` in it fails as the endpointStateMap is empty. A workaround is copying the system.peers table from other node but this is less than ideal. I tested modifying maybeGossipToSeed as follows: {code:java} /* Possibly gossip to a seed for facilitating partition healing */ private void maybeGossipToSeed(MessageOut prod) { int size = seeds.size(); if (size > 0) { if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress())) { return; } if (liveEndpoints.size() == 0) { List gDigests = prod.payload.gDigests; if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress())) { gDigests = new ArrayList(); GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(), DatabaseDescriptor.getPartitionerName(), gDigests); MessageOut message = new MessageOut(MessagingService.Verb.GOSSIP_DIGEST_SYN, digestSynMessage, GossipDigestSyn.serializer); sendGossip(message, seeds); } else { sendGossip(prod, seeds); } } else { /* Gossip with the seed with some probability. */ double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size()); double randDbl = random.nextDouble(); if (randDbl <= probability) sendGossip(prod, seeds); } } } {code} Only problem is this is the same as SYN from shadow round. 
It does resolve the issue, however, as the node then receives an ACK with all the states.
[jira] [Created] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
Cameron Zemek created CASSANDRA-19580: - Summary: Unable to contact any seeds with node in hibernate status Key: CASSANDRA-19580 URL: https://issues.apache.org/jira/browse/CASSANDRA-19580 Project: Cassandra Issue Type: Bug Reporter: Cameron Zemek We have customer running into the error 'Unable to contact any seeds!' . I have been able to reproduce this issue if I kill Cassandra as its joining which will put the node into hibernate status. Once a node is in hibernate it will no longer receive any SYN messages from other nodes during startup and as it sends only itself as digest in outbound SYN messages it never receives any states in any of the ACK replies. So once it gets to the check `seenAnySeed` in it fails as the endpointStateMap is empty. A workaround is copying the system.peers table from other node but this is less than ideal. I tested modifying maybeGossipToSeed as follows: {code:java} /* Possibly gossip to a seed for facilitating partition healing */ private void maybeGossipToSeed(MessageOut prod) { int size = seeds.size(); if (size > 0) { if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress())) { return; } if (liveEndpoints.size() == 0) { List gDigests = prod.payload.gDigests; if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress())) { gDigests = new ArrayList(); GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(), DatabaseDescriptor.getPartitionerName(), gDigests); MessageOut message = new MessageOut(MessagingService.Verb.GOSSIP_DIGEST_SYN, digestSynMessage, GossipDigestSyn.serializer); sendGossip(message, seeds); } else { sendGossip(prod, seeds); } } else { /* Gossip with the seed with some probability. */ double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size()); double randDbl = random.nextDouble(); if (randDbl <= probability) sendGossip(prod, seeds); } } } {code} Only problem is this is the same as SYN from shadow round. 
It does resolve the issue, however, as the node then receives an ACK with all the states.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836370#comment-17836370 ] Cameron Zemek commented on CASSANDRA-18845: --- I have reworked the patch so that it is now a new method instead of modifying the existing waitToSettle, keeping the change to existing behavior as small as possible. It is called directly in MigrationCoordinator::awaitSchemaRequests to handle a bootstrapping node (since nodes need to be in UP state in order to get the schema and stream sstables from them), and just before enabling native transport. https://issues.apache.org/jira/secure/attachment/13068153/CASSANDRA-18845-4_0_12.patch
> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>
> This is a follow up to CASSANDRA-18543
> Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we just observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP.
> The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I am proposing that (outside a single node cluster) we wait for an UP message from another node before considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836370#comment-17836370 ] Cameron Zemek edited comment on CASSANDRA-18845 at 4/11/24 10:18 PM: - I have reworked the [patch| [^CASSANDRA-18845-4_0_12.patch]] more so it a new method instead of modifying the existing waitToSettle, so it has the least change to any existing behavior. It directly called in MigrationCoordinator::awaitSchemaRequests to handle if node bootstrapping (since need nodes in UP state in order to get schema and stream sstables from). And just before enabling native transport. was (Author: cam1982): I have reworked the patch more so it a new method instead of modifying the existing waitToSettle. So it has the least change to any existing behavior. It directly called in MigrationCoordinator::awaitSchemaRequests to handle if node bootstrapping (since need nodes in UP state in order to get schema and stream sstables from). And just before enabling native transport. https://issues.apache.org/jira/secure/attachment/13068153/CASSANDRA-18845-4_0_12.patch > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, > delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, > test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms > this is tedious and error prone. On a node just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP. 
> The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I am proposing that (outside a single node cluster) we wait for an UP message from another node before considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: CASSANDRA-18845-4_0_12.patch > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, > delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, > test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms > this is tedious and error prone. On a node just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP. > The problem being that do not want to start Native Transport until gossip > settles otherwise queries can fail consistency such as LOCAL_QUORUM as it > thinks the replicas are still in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that > (outside single node cluster) wait for UP message from another node before > considering gossip as settled. Eg. > {code:java} > if (currentSize == epSize && currentLive == liveSize && liveSize > > 1) > { > logger.debug("Gossip looks settled."); > numOkay++; > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19473) Latency Spike on NTR startup
Cameron Zemek created CASSANDRA-19473: - Summary: Latency Spike on NTR startup Key: CASSANDRA-19473 URL: https://issues.apache.org/jira/browse/CASSANDRA-19473 Project: Cassandra Issue Type: Improvement Reporter: Cameron Zemek
Firstly, you need the patch from https://issues.apache.org/jira/browse/CASSANDRA-18845 to solve consistency query errors on startup. With that patch there is still a further issue we see on some clusters, where latency spikes too high when initially starting. I see the pending compactions and hints metrics increase during this time. I tried lowering the hint delivery threshold across the cluster, thinking it was overloading the node starting up, but this didn't resolve the issue. So at this time I am not sure what the root cause is (I still think it is a combination of the compactions and hints). As a workaround I have this small code change:
{code:java}
int START_NATIVE_DELAY = Integer.getInteger("cassandra.start_native_transport_delay_secs", 120);
if (START_NATIVE_DELAY > 0)
{
    logger.info("Waiting an extra {} seconds before enabling NTR", START_NATIVE_DELAY);
    Uninterruptibles.sleepUninterruptibly(START_NATIVE_DELAY, TimeUnit.SECONDS);
}
startNativeTransport();
{code}
where we wait a configurable time before starting native transport. Delaying NTR startup resolved the issue. A better solution would be to wait for the hints/compactions, or whatever the root cause is, to complete.
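The "better solution" suggested at the end of this ticket, waiting for the backlog to drain rather than sleeping a fixed time, could look roughly like the following sketch. The gate and its `IntSupplier` are hypothetical stand-ins for whatever pending-work metric (compactions, hints) turns out to be the root cause; this is not the patch itself:

```java
// Sketch: poll a supplied measure of outstanding work and only proceed to
// startNativeTransport() once it has drained or a deadline passes.
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

public class NativeTransportGate
{
    /** Returns true if work drained before the deadline, false if we gave up. */
    static boolean awaitQuiescence(IntSupplier pendingWork, long maxWaitMillis, long pollMillis)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (pendingWork.getAsInt() > 0)
        {
            if (System.nanoTime() >= deadline)
                return false; // deadline hit: start anyway rather than wait forever
            try
            {
                Thread.sleep(pollMillis);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        // Simulated backlog that drains over a few polls.
        int[] backlog = {3};
        boolean drained = awaitQuiescence(() -> Math.max(0, backlog[0]--), 5000, 10);
        System.out.println("drained=" + drained); // then startNativeTransport() would run
    }
}
```

Compared with the fixed 120-second sleep above, this stops waiting as soon as the backlog clears, so fast nodes are not penalized.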
[jira] [Updated] (CASSANDRA-18935) Unable to write to counter table if native transport is disabled on startup
[ https://issues.apache.org/jira/browse/CASSANDRA-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18935: -- Attachment: 18935-3.11.patch > Unable to write to counter table if native transport is disabled on startup > --- > > Key: CASSANDRA-18935 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18935 > Project: Cassandra > Issue Type: Bug >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18935-3.11.patch > > > > {code:java} > if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || > (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) > { > startNativeTransport(); > StorageService.instance.setRpcReady(true); > } {code} > The startup code here only sets RpcReady if native transport is enabled. If > you call > {code:java} > nodetool enablebinary{code} > then this flag doesn't get set. > But with the change from CASSANDRA-13043 it requires RpcReady set to true in > order to get a leader for the counter update. > Not sure what the correct fix is here, seems to only really use this flag for > counters. So thinking perhaps the fix is to just move this outside the if > condition. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18935) Unable to write to counter table if native transport is disabled on startup
[ https://issues.apache.org/jira/browse/CASSANDRA-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18935: -- Description: {code:java} if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) { startNativeTransport(); StorageService.instance.setRpcReady(true); } {code} The startup code here only sets RpcReady if native transport is enabled. If you call {code:java} nodetool enablebinary{code} then this flag doesn't get set. But with the change from CASSANDRA-13043 it requires RpcReady set to true in order to get a leader for the counter update. Not sure what the correct fix is here, seems to only really use this flag for counters. So thinking perhaps the fix is to just move this outside the if condition. was: {code:java} if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) { startNativeTransport(); StorageService.instance.setRpcReady(true); } {code} The startup code here only sets RpcReady if native transport is enabled. If you call {code:java} nodetool enablebinary{code} then this flag doesn't get set. But with the change from CASSANDRA-13043 it requires RpcReady set to true in other to get a leader for the counter update. Not sure what the correct fix is here, seems to only really use this flag for counters. So thinking perhaps the fix is to just move this outside the if condition. 
> Unable to write to counter table if native transport is disabled on startup > --- > > Key: CASSANDRA-18935 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18935 > Project: Cassandra > Issue Type: Bug >Reporter: Cameron Zemek >Priority: Normal > > > {code:java} > if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || > (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) > { > startNativeTransport(); > StorageService.instance.setRpcReady(true); > } {code} > The startup code here only sets RpcReady if native transport is enabled. If > you call > {code:java} > nodetool enablebinary{code} > then this flag doesn't get set. > But with the change from CASSANDRA-13043 it requires RpcReady set to true in > order to get a leader for the counter update. > Not sure what the correct fix is here, seems to only really use this flag for > counters. So thinking perhaps the fix is to just move this outside the if > condition. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-18935) Unable to write to counter table if native transport is disabled on startup
Cameron Zemek created CASSANDRA-18935: - Summary: Unable to write to counter table if native transport is disabled on startup Key: CASSANDRA-18935 URL: https://issues.apache.org/jira/browse/CASSANDRA-18935 Project: Cassandra Issue Type: Bug Reporter: Cameron Zemek
{code:java}
if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) ||
    (nativeFlag == null && DatabaseDescriptor.startNativeTransport()))
{
    startNativeTransport();
    StorageService.instance.setRpcReady(true);
}
{code}
The startup code here only sets RpcReady if native transport is enabled. If you call
{code:java}
nodetool enablebinary{code}
then this flag doesn't get set. But with the change from CASSANDRA-13043, RpcReady must be set to true in order to get a leader for the counter update. I am not sure what the correct fix is here; the flag seems to only really be used for counters, so perhaps the fix is to just move this outside the if condition.
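The fix floated above, moving setRpcReady outside the if condition, can be illustrated with a toy model. Plain booleans stand in for the real DatabaseDescriptor/StorageService calls; this is a sketch of the proposal, not the actual startup code:

```java
// Sketch: RpcReady set unconditionally at startup, so counter writes can still
// elect a leader after a later `nodetool enablebinary`.
public class RpcReadySketch
{
    boolean rpcReady = false;
    boolean nativeRunning = false;

    /** Current behavior: RpcReady is only set when the transport starts at boot. */
    void startupCurrent(boolean startNativeTransport)
    {
        if (startNativeTransport)
        {
            nativeRunning = true;
            rpcReady = true;
        }
    }

    /** Proposed behavior: RpcReady no longer depends on the transport flag. */
    void startupProposed(boolean startNativeTransport)
    {
        if (startNativeTransport)
            nativeRunning = true;
        rpcReady = true; // moved outside the condition
    }

    public static void main(String[] args)
    {
        RpcReadySketch current = new RpcReadySketch();
        current.startupCurrent(false);         // binary disabled at boot...
        System.out.println(current.rpcReady);  // false: counter writes cannot get a leader

        RpcReadySketch proposed = new RpcReadySketch();
        proposed.startupProposed(false);
        System.out.println(proposed.rpcReady); // true even before enablebinary
    }
}
```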
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772415#comment-17772415 ] Cameron Zemek commented on CASSANDRA-18845: --- I have reworked the patch into a pull request here: [Wait for live endpoints as part of waiting for gossip to settle by grom358 · Pull Request #2778 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2778]. I created the PR against 4.1 since 5.x is not as stable. I still have not got around to making an automated test for this yet. It has the following behaviors:
* Must opt in by setting cassandra.gossip_settle_wait_live_max
* Waits up to the maximum number of polls defined by cassandra.gossip_settle_wait_live_max. Set to -1 to wait indefinitely.
* cassandra.skip_wait_for_gossip_to_settle still applies to cap the maximum number of polls.
* cassandra.gossip_settle_wait_live_required determines how many polls in a row without a change to live endpoint state are needed to consider gossip as settled, once opted in via cassandra.gossip_settle_wait_live_max
* If the live endpoint count equals the number of endpoints, consider live endpoints as settled.
* Requires at least 1 other live endpoint to begin considering live endpoints as settled.
Scenarios considered:
* One node cluster. Will skip this check since epSize == liveSize
* Entire cluster is down and starting up a node. Will wait cassandra.gossip_settle_wait_live_max polls
* Restarting a node when another node is down. Will wait cassandra.gossip_settle_wait_live_required polls
* On rare occasions it takes a while to see another node as UP. This is covered by requiring at least 1 other endpoint as up (`liveSize > 1`) to start the settlement process.
Being opt-in, this doesn't break any existing tests. This is also easier to use than the reverted patch, as you just need to set cassandra.gossip_settle_wait_live_max.
To restate, the purpose of this patch is to stop Native-Transport-Requests from starting before Cassandra has finished ECHO requests to other nodes. Starting early results in requests failing LOCAL_QUORUM/QUORUM consistency, as the endpoints are not yet considered live for the purposes of executing requests. This comes up every time we are rolling restarting large clusters for security patches and similar operations, where typically only a single node is allowed to be down at a time. With this pull request the wait for live endpoints ends once all endpoints are UP, which minimizes the time to perform rolling restarts while avoiding failed queries affecting clients.
> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18845-seperate.patch, delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>
> This is a follow up to CASSANDRA-18543
> Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we just observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP.
> The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I am proposing that (outside a single node cluster) we wait for an UP message from another node before considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
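The settle behavior listed in the PR comment above can be modeled as a small pure function. This is an illustration of the described polling rules (consecutive unchanged polls, the epSize == liveSize shortcut, the liveSize > 1 requirement, and a poll cap), not the patch itself; all names are hypothetical:

```java
// Model: given a sequence of (endpointCount, liveCount) poll samples, report after
// how many polls gossip would be declared settled under the rules described above.
public class LiveSettleSketch
{
    /** Returns the 1-based poll index at which we stop waiting. */
    static int pollsUntilSettled(int[][] samples, int required, int maxPolls)
    {
        int numOkay = 0;
        int prevLive = -1;
        for (int i = 0; i < samples.length && i < maxPolls; i++)
        {
            int epSize = samples[i][0], liveSize = samples[i][1];
            if (liveSize == epSize && liveSize > 1)
                return i + 1;                      // everyone is up: settled immediately
            if (liveSize == prevLive && liveSize > 1)
                numOkay++;                         // live set stable for another poll
            else
                numOkay = 0;
            prevLive = liveSize;
            if (numOkay >= required)
                return i + 1;                      // stable long enough: settled
        }
        return Math.min(samples.length, maxPolls); // gave up at the cap
    }

    public static void main(String[] args)
    {
        // 5-node cluster, one node stays down: live count sticks at 4, settles after
        // `required` stable polls rather than waiting out the full cap.
        int[][] oneDown = {{5, 1}, {5, 4}, {5, 4}, {5, 4}, {5, 4}};
        System.out.println(pollsUntilSettled(oneDown, 3, 30));
        // All nodes come up: settles on the poll where live == endpoints.
        int[][] allUp = {{5, 1}, {5, 3}, {5, 5}};
        System.out.println(pollsUntilSettled(allUp, 3, 30));
    }
}
```

The all-up shortcut is what keeps rolling restarts fast: the wait ends the moment every endpoint is UP instead of burning the full stability window.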
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769377#comment-17769377 ] Cameron Zemek commented on CASSANDRA-18866: --- [Cassandra 5.0 Pull Request #2733|https://github.com/apache/cassandra/pull/2733] [Cassandra 4.1 Pull Request #2734|https://github.com/apache/cassandra/pull/2734] [Cassandra 4.0 Pull Request #2735|https://github.com/apache/cassandra/pull/2735] [Cassandra 3.11 Pull Request #2736|https://github.com/apache/cassandra/pull/2736]
> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
> Issue Type: Improvement
> Components: Cluster/Gossip
> Reporter: Cameron Zemek
> Assignee: Cameron Zemek
> Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to only allow 1 inflight ECHO request at a time. As per 18854, some tests have an error rate due to this change. I am creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests inflight at the same time, so it is less likely that all ECHOs will time out or get lost.
> With the change from 18845, after adding in some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from a node and also see it retrying ECHOs when it doesn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but this is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive, even though other nodes consider the node UP while it is marked DOWN by this node.
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768462#comment-17768462 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/24/23 11:47 PM: - Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); inflightEcho.remove(addr); } } {code} [instaclustr/cassandra at CASSANDRA-18866-regressiontest (github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest] was (Author: cam1982): Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); } } {code} [instaclustr/cassandra at 
CASSANDRA-18866-regressiontest (github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest] > Node sends multiple inflight echos > -- > > Key: CASSANDRA-18866 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18866 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18866-regression.patch, duplicates.log, echo.log > > > CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, > 18845 had change to only allow 1 inflight ECHO request at a time. As per > 18854 some tests have an error rate due to this change. Creating this ticket > to discuss this further. As the current state also does not have retry logic, > it just allowing multiple ECHO requests inflight at the same time so less > likely that all ECHO will timeout or get lost. > With the change from 18845 adding in some extra logging to track what is > going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO > requests from a node and also see it retrying ECHOs when it doesn't get a > reply. > Therefore, I think the problem is more specific than the dropping of one ECHO > request. Yes there no retry logic for failed ECHO requests, but this is the > case even both before and after 18845. ECHO requests are only sent via gossip > verb handlers calling applyStateLocally. In these failed tests I therefore > assuming their cases where it won't call markAlive when other nodes consider > the node UP but its marked DOWN by a node. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768462#comment-17768462 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/24/23 11:42 PM: - Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); } } {code} [instaclustr/cassandra at CASSANDRA-18866-regressiontest (github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest] was (Author: cam1982): Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); } } {code} > Node sends multiple inflight echos > -- > > Key: 
CASSANDRA-18866 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18866 > Project: Cassandra > Issue Type: Improvement > Reporter: Cameron Zemek > Priority: Normal > Attachments: 18866-regression.patch, duplicates.log, echo.log > > > CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, > 18845 had a change to only allow 1 inflight ECHO request at a time. As per > 18854, some tests have an error rate due to this change. Creating this ticket > to discuss this further. The current state also does not have retry logic; > it just allows multiple ECHO requests inflight at the same time, so it is less > likely that all ECHOs will time out or get lost. > With the change from 18845, adding in some extra logging to track what is > going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO > requests from a node and also see it retrying ECHOs when it doesn't get a > reply. > Therefore, I think the problem is more specific than the dropping of one ECHO > request. Yes, there is no retry logic for failed ECHO requests, but this is the > case both before and after 18845. ECHO requests are only sent via gossip > verb handlers calling applyStateLocally. In these failed tests I am therefore > assuming there are cases where it won't call markAlive when other nodes consider > the node UP but it is marked DOWN by a node. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
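The gated retry in the comment above can be sketched as a standalone toy outside Cassandra. `gossipEnabled`, `sent`, and `sendEcho` below are hypothetical stand-ins for `Gossiper.isEnabled()` and `MessagingService` (they are not Cassandra APIs); the sketch only shows the callback shape: resend while gossip is enabled, drop the retry once it is disabled.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the retry-on-failure callback, gated on gossip being
// enabled. gossipEnabled and sent are illustrative stand-ins for
// Gossiper.isEnabled() and MessagingService; they are not Cassandra APIs.
public class EchoRetrySketch {
    static final AtomicBoolean gossipEnabled = new AtomicBoolean(true);
    static final AtomicInteger sent = new AtomicInteger();

    // Pretend to send an ECHO_REQ; a fresh "message" would be built per attempt.
    static void sendEcho(String addr) {
        sent.incrementAndGet();
    }

    // Failure callback: only resend while gossip is still enabled.
    static void onFailure(String addr) {
        if (gossipEnabled.get())
            sendEcho(addr);   // resend with a newly constructed message
        // else: drop the retry, gossip is shutting down
    }

    public static void main(String[] args) {
        onFailure("/127.0.0.2:7000");   // enabled -> resends
        gossipEnabled.set(false);
        onFailure("/127.0.0.2:7000");   // disabled -> no resend
        System.out.println(sent.get()); // prints 1
    }
}
```

The point of the gate is that a node disabling gossip would otherwise keep resending echoes forever from its own failure callbacks.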
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768255#comment-17768255 ] Cameron Zemek commented on CASSANDRA-18866:
---
{noformat}
pytest --count=500 --cassandra-dir=/home/grom/dev/cassandra-instaclustr transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup{noformat}
500/500 passes.
{noformat}
$ rg 'Resending'
1695355675403_test_move_forwards_between_and_cleanup[27-500]/node4_debug.log
1263:DEBUG [InternalResponseStage:1] 2023-09-22 14:07:06,461 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695362768506_test_move_forwards_between_and_cleanup[74-500]/node1_debug.log
1038:DEBUG [InternalResponseStage:1] 2023-09-22 16:05:20,772 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695362768506_test_move_forwards_between_and_cleanup[74-500]/node1_debug.log: WARNING: stopped searching binary file after match (found "\0" byte around offset 329646)
1695403170261_test_move_forwards_between_and_cleanup[342-500]/node1_debug.log
1029:DEBUG [InternalResponseStage:1] 2023-09-23 03:18:41,126 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695403170261_test_move_forwards_between_and_cleanup[342-500]/node1_debug.log: WARNING: stopped searching binary file after match (found "\0" byte around offset 331373)
1695366089957_test_move_forwards_between_and_cleanup[96-500]/node4_debug.log
1275:DEBUG [InternalResponseStage:1] 2023-09-22 17:00:41,140 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695422554318_test_move_forwards_between_and_cleanup[471-500]/node4_debug.log
1293:DEBUG [InternalResponseStage:1] 2023-09-23 08:41:45,750 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000{noformat}
So the retry happens in about 1% of runs with this test.
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767742#comment-17767742 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/22/23 3:00 AM:
Found some bugs:
{code:java}
if (inflightEcho.contains(addr))
{
    return;
}
inflightEcho.add(addr);
{code}
should be
{code:java}
if (!inflightEcho.add(addr))
{
    return;
}
{code}
Otherwise, a data race allows multiple inflight echos.

and
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
should be
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    logger.trace("Resending ECHO_REQ to {}", addr);
    Message echoMessage = Message.out(ECHO_REQ, noPayload);
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
That is, we need to construct a new message, not send the same message again.
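The first fix above is the classic check-then-act race: `contains()` followed by `add()` is not atomic, so two threads can both pass the check, while a single `add()` on a concurrent set admits exactly one winner because it returns `false` when the element was already present. A minimal standalone sketch (class and method names are illustrative, not Cassandra code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the dedupe fix: contains()-then-add() is a check-then-act race
// (two threads can both observe "not present"), while a single add() on a
// concurrent set is atomic and returns false for the loser.
public class InflightDedupe {
    static final Set<String> inflightEcho = ConcurrentHashMap.newKeySet();

    // Returns true only for the caller that actually claimed the slot.
    static boolean tryMarkInflight(String addr) {
        return inflightEcho.add(addr);   // atomic: at most one winner per addr
    }

    public static void main(String[] args) {
        System.out.println(tryMarkInflight("/127.0.0.2:7000")); // true
        System.out.println(tryMarkInflight("/127.0.0.2:7000")); // false
        inflightEcho.remove("/127.0.0.2:7000"); // echo completed or failed
        System.out.println(tryMarkInflight("/127.0.0.2:7000")); // true
    }
}
```

The single-call form also keeps the "claim" and the "check" under one memory operation, so no interleaving can produce two inflight echoes for the same address.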
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767787#comment-17767787 ] Cameron Zemek commented on CASSANDRA-18866:
---
{noformat}
pytest --count=100 --cassandra-dir=/home/grom/dev/cassandra-instaclustr transient_replication_ring_test.py::TestTransientReplicationRing::test_move_backwards_and_cleanup{noformat}
100/100 passes
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767555#comment-17767555 ] Cameron Zemek commented on CASSANDRA-18866:
---
{noformat}
pytest --count=100 --cassandra-dir=/home/grom/dev/cassandra transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup{noformat}
100/100 passes
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767361#comment-17767361 ] Cameron Zemek commented on CASSANDRA-18845:
---
{noformat}
Sep 21 03:01:42 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 21 03:01:48 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:01:49 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:01:50 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:02:00 ip-10-1-32-228 cassandra[52927]: INFO o.a.c.gms.GossipDigestAckVerbHandler Received a GossipDigestAckMessage from /15.223.140.86
Sep 21 03:02:00 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /44.229.153.229
...
Sep 21 03:03:40 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper InetAddress /44.229.153.229 is now UP{noformat}
Got a test run with an 18 second delay.
> Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement > Reporter: Cameron Zemek > Priority: Normal > Attachments: 18845-seperate.patch, delay.log, example.log, > image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, > this is tedious and error prone. On a node we just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP.
> The problem being that we do not want to start Native Transport until gossip > settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM as the node > thinks the replicas are still in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that we > (outside a single node cluster) wait for an UP message from another node before > considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
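The proposed condition reads as a pure predicate: endpoint counts stable *and* at least one node besides ourselves marked live. A minimal sketch under that reading (class and method names are illustrative, not the actual Gossiper code):

```java
// Sketch of the proposed settle check: in addition to the endpoint counts
// being stable between polls, require at least one *other* live endpoint
// (liveSize > 1) before declaring gossip settled, so Native Transport does
// not start while every peer is still marked DOWN.
public class GossipSettleCheck {
    static boolean looksSettled(int currentSize, int epSize,
                                int currentLive, int liveSize) {
        return currentSize == epSize     // total endpoint count stable
            && currentLive == liveSize   // live endpoint count stable
            && liveSize > 1;             // proposed extra condition: a peer is UP
    }

    public static void main(String[] args) {
        // Counts stable but no other node live yet -> not settled.
        System.out.println(looksSettled(108, 108, 1, 1));     // false
        // Counts stable and peers marked UP -> settled.
        System.out.println(looksSettled(108, 108, 107, 107)); // true
    }
}
```

In a single node cluster `liveSize > 1` can never hold, which is why the proposal explicitly carves that case out.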
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767358#comment-17767358 ] Cameron Zemek edited comment on CASSANDRA-18845 at 9/21/23 2:59 AM:
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80{noformat}
I am struggling to reproduce this ^ I have seen it twice, and after enabling more logging haven't been able to reproduce it again. What I do sometimes see, though, is it taking over 30 seconds to get the first ECHO response.

Since there are dtests that rely on having CQL up while nodes are down, I have attached a patch [^18845-seperate.patch] (against the 5.0 branch) that is opt-in. Having settle just check for currentLive == liveSize still allows NTR to start while nodes are marked down. Yes, you can increase cassandra.gossip_settle_poll_success_required (and/or the other properties) to mitigate it, but these increase the minimum startup time, whereas [^18845-seperate.patch] doesn't add to this when the cluster is healthy.

A more elaborate solution would be to specify the required consistency level, and for all token ranges owned by the node check whether you have the needed live endpoints to satisfy that consistency level.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767358#comment-17767358 ] Cameron Zemek commented on CASSANDRA-18845: --- {noformat} Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle... Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80{noformat} I am struggling to reproduce this ^ I seen it twice, and after enabling more logging haven't been able to reproduce again. What I do sometimes see though it taking over 30 seconds to get the first ECHO response. Since there are dtests that rely on having CQL up while nodes are down, I have attached a patch [^18845-seperate.patch] (against 5.0 branch) that is opt-in. Having settle just check for currentLive == liveSize is still allowing NTR to start while nodes are marked down. Yes you can increase cassandra.gossip_settle_poll_success_required (and/or the other properties) to mitigate it but these increase the minimum startup time. Whereas [^18845-seperate.patch] doesn't add to this when the cluster is healthy. A more elaborate solution would be to specify the required consistency level. And for all token ranges owned by the node you check if you have the needed live endpoints to satisfy the consistency level. > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18845-seperate.patch, delay.log, example.log, > image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms > this is tedious and error prone. 
On a node just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP. > The problem being that do not want to start Native Transport until gossip > settles otherwise queries can fail consistency such as LOCAL_QUORUM as it > thinks the replicas are still in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that > (outside single node cluster) wait for UP message from another node before > considering gossip as settled. Eg. > {code:java} > if (currentSize == epSize && currentLive == liveSize && liveSize > > 1) > { > logger.debug("Gossip looks settled."); > numOkay++; > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-seperate.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: 18845-seperate.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: 18845-seperate.patch
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767052#comment-17767052 ] Cameron Zemek commented on CASSANDRA-18845:
---
with this removed
{code:java}
(epSize == liveSize || liveSize > 1)
{code}
the j11_dtests just passed. [j11_dtests (120384) - instaclustr/cassandra (circleci.com)|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3180/workflows/2f7e6199-d865-4eee-a3b1-9511a4c88a45/jobs/120384]
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767007#comment-17767007 ] Cameron Zemek commented on CASSANDRA-18845:
---
[^stream.log] Without this patch I get nodes stuck, unable to join a large test cluster:
{noformat}
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: INFO o.a.cassandra.service.StorageService JOINING: Starting to bootstrap...
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: Exception (java.lang.RuntimeException) encountered during startup: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: java.lang.RuntimeException: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:294)
{noformat}
The node is in an endless restart cycle (since our service keeps retrying), with it reporting a different IP each time.
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767036#comment-17767036 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/20/23 7:37 AM:
---
Going to run overnight the broken dtest that was flagged by the ECHO changes, but with a potential fix:
{noformat}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{noformat}
Will report back in the morning.

was (Author: cam1982): !18866-regression.patch|width=7,height=7,align=absmiddle! going to run overnight the broken dtest that was flagged by the ECHO changes. But with potential fix: {noformat} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { MessagingService.instance().sendWithCallback(echoMessage, addr, this); }{noformat} will report back in the morning.

> Node sends multiple inflight echos
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to only allow one inflight ECHO request at a time. As per 18854, some tests had an error rate due to this change. Creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests inflight at the same time, making it less likely that all ECHOs will time out or get lost.
> With the change from 18845 plus some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from another node and also see it retrying ECHOs when it doesn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but that is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive when other nodes consider the node UP but it is marked DOWN by a node.
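To make the "single inflight echo, retried on failure" idea concrete, here is a minimal self-contained sketch. Everything in it is a stand-in invented for illustration — `Transport`, `EchoSender`, and `echoUntilAck` are not Cassandra's `MessagingService`/`RequestCallback` API, they just model the behaviour being discussed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of "at most one inflight echo per endpoint, retried until acked".
// All names here are hypothetical; the real code paths are in Cassandra's
// Gossiper / MessagingService, which this does NOT reproduce.
public class EchoRetrySketch {
    interface Transport { boolean send(); } // true = echo reply received

    static class EchoSender {
        private final AtomicBoolean inflight = new AtomicBoolean(false);
        int attempts = 0;

        /** Sends one echo and retries on failure (the onFailure-resend idea),
         *  while refusing to start a second echo when one is already inflight. */
        boolean echoUntilAck(Transport transport, int maxAttempts) {
            if (!inflight.compareAndSet(false, true))
                return false; // another echo already inflight: do not duplicate
            try {
                for (attempts = 1; attempts <= maxAttempts; attempts++)
                    if (transport.send())
                        return true; // reply arrived: caller can mark the node UP
                return false; // gave up; caller may reschedule later
            } finally {
                inflight.set(false);
            }
        }
    }

    public static void main(String[] args) {
        EchoSender sender = new EchoSender();
        int[] drops = {2}; // drop the first two echoes, answer the third
        boolean up = sender.echoUntilAck(() -> drops[0]-- <= 0, 5);
        System.out.println(up + " after " + sender.attempts + " attempts");
    }
}
```

The point of the sketch is that retry-on-failure and a single-inflight guard are independent: the rolled-back 18845 change coupled them, while trunk gets resilience only from allowing duplicates.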
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767037#comment-17767037 ] Cameron Zemek commented on CASSANDRA-18845:
---
the
{noformat}
(epSize == liveSize || liveSize > 1)
{noformat}
part breaks dtests. For example,
{noformat}
pytest --force-resource-intensive-tests --cassandra-dir=/home/grom/dev/cassandra materialized_views_test.py::TestMaterializedViews::test_throttled_partition_update
{noformat}
This test fails since it shuts down a 5 node cluster and then starts/stops each node one at a time, so liveSize > 1 is never true. Possible paths forward:
# The check for waiting for other nodes is off by default and requires setting a system property.
# Figure out why there is this large delay between the waitToSettle call and getting ECHO responses.
# Have the tests override cassandra.skip_wait_for_gossip_to_settle
# Some other option I haven't thought of yet.
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767036#comment-17767036 ] Cameron Zemek commented on CASSANDRA-18866:
---
!18866-regression.patch|width=7,height=7,align=absmiddle! going to run overnight the broken dtest that was flagged by the ECHO changes. But with potential fix:
{noformat}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{noformat}
will report back in the morning.
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18866:
--
Attachment: 18866-regression.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: stream.log
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766949#comment-17766949 ] Cameron Zemek commented on CASSANDRA-18845:
---
Still running, but sharing the results so far:
{noformat}
$ pytest --count=500 --cassandra-dir=/home/grom/dev/cassandra transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup
/home/grom/dtest/lib/python3.10/site-packages/ccmlib/common.py:773: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  return LooseVersion(match.group(1))
== test session starts ==
platform linux -- Python 3.10.12, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/grom/tmp/cassandra-dtest
configfile: pytest.ini
plugins: repeat-0.9.1, flaky-3.7.0, timeout-1.4.2
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 500 items

transient_replication_ring_test.py ... [ 11%]
{noformat}
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766694#comment-17766694 ] Cameron Zemek commented on CASSANDRA-18845:
---
[^delay.log] Attached a log from a 105 node test cluster that shows the delay between starting to wait for gossip and getting UP replies back. Snippet:
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80
Sep 19 08:10:57 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper InetAddress /54.149.62.104 is now UP
{noformat}
So the delay is in sending out the Echo.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: delay.log
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766678#comment-17766678 ] Cameron Zemek commented on CASSANDRA-18773:
---
[~blambov] I have updated the pull request with your feedback and it is ready for review.

> Compactions are slow
> Key: CASSANDRA-18773
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18773
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/Compaction
> Reporter: Cameron Zemek
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
> Attachments: 18773.patch, compact-poc.patch, flamegraph.png, stress.yaml
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> I have noticed that compactions involving a lot of sstables are very slow (for example major compactions). I have attached a cassandra-stress profile that can generate such a dataset under ccm. In my local test I have 2567 sstables at 4MB each.
> I added code to track the wall clock time of various parts of the code. One problematic part is the ManyToOne constructor: tracing through the code, every partition read creates a ManyToOne over all the sstable iterators. In my local test I get a measly 60KB/sec read speed, bottlenecked on a single CPU core (since this code is single threaded), with 85% of the wall clock time spent in the ManyToOne constructor.
> As another data point showing it is the merge-iterator part of the code: cfstats from [https://github.com/instaclustr/cassandra-sstable-tools/], which reads all the sstables but does no merging, gets 26MB/sec read speed.
> Tracking back from the ManyToOne call I see this in UnfilteredPartitionIterators::merge
> {code:java}
> for (int i = 0; i < toMerge.size(); i++)
> {
>     if (toMerge.get(i) == null)
>     {
>         if (null == empty)
>             empty = EmptyIterators.unfilteredRow(metadata, partitionKey, isReverseOrder);
>         toMerge.set(i, empty);
>     }
> }
> {code}
> I am not sure what the purpose of creating these empty rows is. But on a whim I removed all these empty iterators before passing to ManyToOne, and then all the wall clock time shifted to CompactionIterator::hasNext() and read speed increased to 1.5MB/s.
> So there are further bottlenecks in this code path it seems, but the first is this ManyToOne and having to build it for every partition read.
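The observation above — that null inputs could be skipped rather than replaced with freshly built empty iterators — can be illustrated with a toy k-way merge. This is a hedged sketch with stand-in types (a `List<Integer>` plays the role of a per-sstable iterator); it is not Cassandra's MergeIterator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy k-way merge that simply skips null inputs instead of substituting
// empty iterators, mirroring the experiment described in the ticket.
// Each int[] on the heap is {sourceIndex, positionWithinSource}.
public class MergeSkipNulls {
    static List<Integer> merge(List<List<Integer>> toMerge) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparingInt((int[] a) -> toMerge.get(a[0]).get(a[1])));
        for (int i = 0; i < toMerge.size(); i++)
            if (toMerge.get(i) != null && !toMerge.get(i).isEmpty())
                heap.add(new int[]{i, 0}); // skip nulls: never build empties
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<Integer> src = toMerge.get(top[0]);
            out.add(src.get(top[1]));
            if (++top[1] < src.size())
                heap.add(top); // advance this source and re-enter the heap
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> inputs = Arrays.asList(
                Arrays.asList(1, 4, 7), null, Arrays.asList(2, 3), null);
        System.out.println(merge(inputs)); // [1, 2, 3, 4, 7]
    }
}
```

The heap only ever holds real inputs, so its size (and per-element comparison cost) tracks the number of sstables that actually contain the partition, which is the effect the removed-empties experiment was probing.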
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766677#comment-17766677 ] Cameron Zemek edited comment on CASSANDRA-18845 at 9/19/23 7:32 AM:
---
Tested the patch 3 times to confirm it working. See test1.log test2.log and test3.log

was (Author: cam1982): !test1.log|width=7,height=7,align=absmiddle! !test2.log|width=7,height=7,align=absmiddle! !test3.log|width=7,height=7,align=absmiddle! Tested the patch 3 times to confirm it working.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766677#comment-17766677 ] Cameron Zemek commented on CASSANDRA-18845:
---
!test1.log|width=7,height=7,align=absmiddle! !test2.log|width=7,height=7,align=absmiddle! !test3.log|width=7,height=7,align=absmiddle! Tested the patch 3 times to confirm it working.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: test2.log
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: test3.log
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: test1.log
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766676#comment-17766676 ] Cameron Zemek commented on CASSANDRA-18866:
---
duplicates.log shows the problem the change was fixing that led to the regressions. echo.log shows a test with the changes rolled back, where the network link between two nodes was broken and then re-established.
> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to only allow 1 inflight ECHO request at a time. As per 18854, some tests have an error rate due to this change. Creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests inflight at the same time, so it is less likely that all ECHOs will time out or get lost.
> With the change from 18845 plus some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from another node and also saw it retrying ECHOs when it didn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but this is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive when other nodes consider the node UP but it is marked DOWN by a node.
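The "one inflight ECHO per endpoint" behaviour discussed above can be sketched roughly as follows. This is a hypothetical illustration, not Cassandra's actual implementation: a per-endpoint guard that suppresses a second ECHO while one is already pending, released from the reply (or failure) callback so a retry can go out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the single-inflight guard: remember which endpoints
// have an ECHO pending and refuse to send another until the first completes.
public class EchoGuard {
    private final Set<String> inflight = ConcurrentHashMap.newKeySet();

    // Returns true if no ECHO is pending for this endpoint, claiming the slot.
    public boolean trySend(String endpoint) {
        return inflight.add(endpoint);
    }

    // Invoked from the reply or failure callback; frees the slot for a retry.
    public void complete(String endpoint) {
        inflight.remove(endpoint);
    }

    public static void main(String[] args) {
        EchoGuard guard = new EchoGuard();
        System.out.println(guard.trySend("10.0.0.2")); // prints true: first ECHO sent
        System.out.println(guard.trySend("10.0.0.2")); // prints false: duplicate suppressed
        guard.complete("10.0.0.2");                    // reply (or timeout) handled
        System.out.println(guard.trySend("10.0.0.2")); // prints true: retry allowed
    }
}
```

Note that unless complete() is also wired to a failure path, a single lost ECHO would pin the slot forever, which matches the regression described in CASSANDRA-18854.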
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18866:
--
Attachment: duplicates.log
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766628#comment-17766628 ] Cameron Zemek commented on CASSANDRA-18845:
---
[Cassandra 18845 3.11 by grom358 · Pull Request #2701 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2701]
[Cassandra 18845 4.0 by grom358 · Pull Request #2702 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2702]
[Cassandra 18845 4.1 by grom358 · Pull Request #2703 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2703]
[Cassandra 18845 5.0 by grom358 · Pull Request #2704 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2704]
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-5.0.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-4.0.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-4.1.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-3.11.patch)
[jira] [Commented] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766620#comment-17766620 ] Cameron Zemek commented on CASSANDRA-18854:
---
Since this ticket is resolved and the changes have been reverted, I have created CASSANDRA-18866 as a followup to this one to discuss the regressions caused by the reverted change, as the change was to resolve the issue of multiple inflight ECHOs and we should still aim to improve that in my opinion. The wait-to-settle part already has followup ticket CASSANDRA-18845.
> Gossip never recovers from a single failed echo
> ---
>
> Key: CASSANDRA-18854
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18854
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Brandon Williams
> Assignee: Brandon Williams
> Priority: Normal
> Fix For: 3.11.17, 4.0.12, 4.1.4, 5.0-alpha2, 5.1
>
> Attachments: echo.log
>
> As discovered on CASSANDRA-18792, if an initial echo request is lost, the node will never be marked up. This appears to be a regression caused by CASSANDRA-18543.
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18866:
--
Attachment: echo.log
[jira] [Created] (CASSANDRA-18866) Node sends multiple inflight echos
Cameron Zemek created CASSANDRA-18866:
-
Summary: Node sends multiple inflight echos
Key: CASSANDRA-18866
URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
Project: Cassandra
Issue Type: Improvement
Reporter: Cameron Zemek
Attachments: echo.log
[jira] [Commented] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766265#comment-17766265 ] Cameron Zemek commented on CASSANDRA-18854:
---
[^echo.log] I added some logging and disabled networking between two nodes. Once I re-enabled the network it reconnected, so I am not sure why it is breaking those tests. Having said this, both pre and post those changes there is no retry logic on failed ECHO messages. Pre these changes (as seen in the logs where it skipped), multiple ECHO messages are sent out. That is probably why the tests work pre these changes, as there are more ECHOs.
[jira] [Updated] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18854:
--
Attachment: echo.log
[jira] [Updated] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18854:
--
Attachment: (was: example_echo.log)
[jira] [Updated] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18854:
--
Attachment: example_echo.log
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766192#comment-17766192 ] Cameron Zemek commented on CASSANDRA-18845:
---
CASSANDRA-18543 had 3 components:
# Allow for overriding the values used in waitToSettle
# Make waitToSettle also consider the liveEndpoint members as part of settling.
# Changes to the handling of ECHO requests to remove duplicate inflight ECHOs and duplicate log messages about the same node going into UP state ('is now UP')
With the reverting in CASSANDRA-18854, did the changes to waitToSettle need to be reverted? The problem seems to be the changes to ECHO.
> The next step for this ticket to move forward will be to create tests that demonstrate the problem and guard against regressions.
This is going to be very difficult to do. dtests set up clusters on loopback addresses, and the waitToSettle code path has a guard against it when using a loopback address. Also, the problems mostly become apparent with large clusters. If I redo the patch, remove the changes to ECHO, and show those tests do not have a regression, would this allow the ticket to move forward? I am also in the process of setting up a large test cluster.
[^example.log] shows an example of what happens without the patched waitToSettle: gossip settles before nodes have finished being marked as UP.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: example.log
[jira] [Commented] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766168#comment-17766168 ] Cameron Zemek commented on CASSANDRA-18854:
---
CASSANDRA-18543 changed how echo requests are handled (as there are a lot of duplicates, and on large clusters this results in log spam and a lot of tasks on the gossip stage), in addition to the fix for waiting for live endpoints in waitToSettle. At the very least, does the change to waitToSettle need to be reverted here?
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765429#comment-17765429 ] Cameron Zemek commented on CASSANDRA-18845:
---
Need to do more investigating around the slowness. I suspect it's due to the flood of gossip messages on startup. The previous patch, CASSANDRA-18543, removed the duplicate ECHO messages to cut down on this. The behavior I notice happening in production, though, is a large initial delay (> 10 seconds) before any node is marked as 'is now UP', and then it floods in. On large clusters it takes over a minute to complete receiving them all.
Prior to CASSANDRA-18543 it never checked liveSize at all, and so would start up regardless of the UP status of nodes. With that change, assuming the polling starts as UP statuses are received, it waits. So the problem now is waiting for that initial event. The previous patch from CASSANDRA-18543 allowed for overriding the gossip parameters, but in hindsight it's difficult to determine a suitable default for that initial wait as it's not consistent. The algorithm in waitToSettle relies on seeing a change in these values, so if that initial delay is greater than the wait time plus the polling phase, it will move on and start NTR even though we have yet to see any nodes as UP.
You are correct that even with this proposed patch it's possible to still start NTR too early, e.g. if one node reports UP but the delay for the next event is longer than the polling period, but I am not seeing that in production so far.
Therefore, the purpose of this patch is to have it wait for the first `is now UP` from a node instead of relying on cassandra.gossip_settle_min_wait_ms. > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement > Reporter: Cameron Zemek > Priority: Normal > Attachments: 18845-3.11.patch, 18845-4.0.patch, 18845-4.1.patch, > 18845-5.0.patch, image-2023-09-14-11-16-23-020.png > > > This is a follow-up to CASSANDRA-18543. > Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP. > The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node still thinks the replicas are in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside a single-node cluster) we wait for an UP message from another node before considering gossip settled. Eg. > {code:java} > if (currentSize == epSize && currentLive == liveSize && liveSize > 1) > { > logger.debug("Gossip looks settled."); > numOkay++; > } {code}
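The proposed check can be sketched as a small settle loop. This is a simplified illustration, not the actual Gossiper#waitToSettle code: the class name, the suppliers, and the bounded poll count are invented here for testability, and the real code sleeps between polls and reads sizes from the Gossiper singleton.

```java
import java.util.function.IntSupplier;

// Illustrative sketch of the proposed settle condition. Gossip is considered
// settled once endpoint/live counts are stable for REQUIRED_POLLS consecutive
// polls AND at least one peer besides ourselves is live (liveSize > 1),
// unless this is a single-node cluster (epSize == liveSize).
public class GossipSettleSketch
{
    static final int REQUIRED_POLLS = 3;

    public static boolean looksSettled(IntSupplier epSupplier, IntSupplier liveSupplier)
    {
        int numOkay = 0;
        int epSize = epSupplier.getAsInt();
        int liveSize = liveSupplier.getAsInt();
        for (int poll = 0; poll < 100; poll++) // bounded for the sketch; real code polls with sleeps
        {
            int currentSize = epSupplier.getAsInt();
            int currentLive = liveSupplier.getAsInt();
            // Proposed extra condition: don't call gossip settled while the only
            // live endpoint is ourselves, unless epSize == liveSize (single node).
            if (currentSize == epSize && currentLive == liveSize
                && (liveSize > 1 || epSize == liveSize))
            {
                numOkay++;
                if (numOkay >= REQUIRED_POLLS)
                    return true;
            }
            else
            {
                numOkay = 0; // any change resets the stability counter
            }
            epSize = currentSize;
            liveSize = currentLive;
        }
        return false;
    }
}
```

With stable counts, a single-node cluster (1/1) and a fully-live cluster (5/5) settle, while a node that only sees itself live (3 endpoints, liveSize == 1) keeps waiting, which is the behavior the patch is after.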
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764934#comment-17764934 ] Cameron Zemek commented on CASSANDRA-18845: --- [~brandon.williams] [~smiklosovic] the existing conditions {noformat} currentSize == epSize && currentLive == liveSize{noformat} are what stop it starting Native Transport too early while gossip is still being updated (for example while liveSize is changing). waitToSettle waits 5 seconds by default, then polls every 1 second, 3 times, checking whether either liveSize or epSize has changed, and resets numOkay if either changes. The problem is when, for example, it takes 79 seconds for the first change in liveSize: liveSize stays constantly at 1, so it decides gossip is settled because there were no changes in epSize or liveSize. The extra condition therefore is: don't consider gossip settled if there is only 1 live endpoint (the node itself), unless it's a single-node cluster (epSize == liveSize). > So when there is a cluster of 50 nodes, without this change, that "if" would return false (or it would not return true fast enough to increment numOkay to break from that while) as there would be new endpoints or live members detected each round. To rephrase: the problem is there are no new endpoint or live member changes, so waitToSettle currently considers it settled with liveSize == 1. > why it takes almost minute and a half This is a good question, but in general it takes quite a while for gossip to complete on clusters with multiple datacenters and/or a large number of nodes. I think that is a different, much more complex JIRA. The purpose of the attached patch is so you don't need to guess what cassandra.gossip_settle_min_wait_ms to use. It waits for at least one node to report `is now UP` in order to increment numOkay and continue with the rest of the waitToSettle logic. !image-2023-09-14-11-16-23-020.png!
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: image-2023-09-14-11-16-23-020.png
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-5.0.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-4.1.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-4.0.patch
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764467#comment-17764467 ] Cameron Zemek edited comment on CASSANDRA-18845 at 9/13/23 3:32 AM: I have attached the patch and tested it as follows: # Spin up a single-node cluster. Works, thanks to the epSize == liveSize check that bypasses the liveSize > 1 check. # Spin up a 3-node cluster. All 3 nodes start NTR as expected. # Shut down all nodes, then start the first node: it stays waiting in gossip due to the liveSize > 1 requirement. # Start the second node: now both nodes start NTR, since liveSize > 1 and there are no other incoming `is now UP` events, so gossip looks settled. NOTE: I had to disable the if condition guarding the call to Gossiper.waitToSettle(), since I was using loopback addresses.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764467#comment-17764467 ] Cameron Zemek commented on CASSANDRA-18845: --- I have attached the patch and tested it as follows: # Spin up a single-node cluster. Works, thanks to the epSize == liveSize check that bypasses the liveSize > 1 check. # Spin up a 3-node cluster. All 3 nodes start NTR as expected. # Shut down all nodes, then start the first node: it stays waiting in gossip due to the liveSize > 1 requirement. # Start the second node: now both nodes start NTR, since liveSize > 1 and there are no other incoming `is now UP` events, so gossip looks settled.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Description: This is a follow up to CASSANDRA-18543 Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms this is tedious and error prone. On a node just observed a 79 second gap between waiting for gossip and the first echo response to indicate a node is UP. The problem being that do not want to start Native Transport until gossip settles otherwise queries can fail consistency such as LOCAL_QUORUM as it thinks the replicas are still in DOWN state. Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside single node cluster) wait for UP message from another node before considering gossip as settled. Eg. {code:java} if (currentSize == epSize && currentLive == liveSize && liveSize > 1) { logger.debug("Gossip looks settled."); numOkay++; } {code} was: This is a follow up to CASSANDRA-18543 Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms this is tedious and error prone. On a node just observed a 79 second gap between waiting for gossip and the first echo response to indicate a node is UP. The problem being that do not want to start Native Transport until gossip settles otherwise queries can fail consistency such as LOCAL_QUORUM as it thinks the replicas are still in DOWN state. Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside single node cluster) wait for UP message from another node before considering gossip as settled. Eg. 
{code:java} if (currentSize == epSize && currentLive == liveSize && liveSize > 0) { logger.debug("Gossip looks settled."); numOkay++; } {code}
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-3.11.patch
[jira] [Created] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
Cameron Zemek created CASSANDRA-18845: - Summary: Waiting for gossip to settle on live endpoints Key: CASSANDRA-18845 URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 Project: Cassandra Issue Type: Improvement Reporter: Cameron Zemek This is a follow-up to CASSANDRA-18543. Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP. The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node still thinks the replicas are in DOWN state. Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside a single-node cluster) we wait for an UP message from another node before considering gossip settled. Eg. {code:java} if (currentSize == epSize && currentLive == liveSize && liveSize > 0) { logger.debug("Gossip looks settled."); numOkay++; } {code}
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759820#comment-17759820 ] Cameron Zemek commented on CASSANDRA-18773: --- [^18773.patch] I took your idea above and implemented a preserveOrder method on MergeIterator, which the CompactionIterator implementation disables when there is no index. {code:java} INFO [CompactionExecutor:2] 2023-08-28 22:19:37,162 CompactionTask.java:239 - Read=53.93% 7.03 MiB/s, Write=20.47% 7.31 MiB/s INFO [CompactionExecutor:2] 2023-08-28 22:20:37,162 CompactionTask.java:239 - Read=54.94% 6.97 MiB/s, Write=20.42% 7.24 MiB/s INFO [CompactionExecutor:2] 2023-08-28 22:21:37,162 CompactionTask.java:239 - Read=53.69% 6.82 MiB/s, Write=22.33% 7.08 MiB/s {code} This gives basically the same results as my proof of concept. [~blambov] what do you think about using background threads in compactions (to decouple read/write)? That change gives a further noticeable increase (40%): {noformat} INFO [CompactionExecutor:2] 2023-08-28 21:08:08,463 CompactionTask.java:266 - Read=37.27% 9.63 MiB/s, Write=28.22% 10 MiB/s INFO [CompactionExecutor:2] 2023-08-28 21:09:08,463 CompactionTask.java:266 - Read=37.93% 9.65 MiB/s, Write=27.87% 10.02 MiB/s{noformat} It copies the rows into memory to pass them across to the writer, so the reader can advance its file positions. Eg. {code:java} ArrayList rows = new ArrayList<>(); while (rowIterator.hasNext()) { rows.add(rowIterator.next()); }{code} So there is a tradeoff.
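The read/write decoupling described above can be sketched as a bounded producer/consumer hand-off. This is a hypothetical illustration, not the patch's actual classes: plain string lists stand in for rows, and the class and method names are invented. The copying of each partition's rows into memory is the tradeoff the comment mentions; the bounded queue limits how much of that buffered data can pile up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of decoupling compaction reads from writes: a reader thread copies
// each partition's rows into memory and hands them to the writer through a
// bounded queue, so the reader can advance its file positions while the
// writer is still flushing earlier partitions.
public class PipelinedCompactionSketch
{
    private static final List<String> END = new ArrayList<>(); // end-of-input sentinel

    public static List<String> compact(List<List<String>> partitions)
    {
        // Small bound keeps the amount of buffered row data limited.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(8);
        List<String> written = new ArrayList<>();

        Thread reader = new Thread(() -> {
            try
            {
                for (List<String> partition : partitions)
                    queue.put(new ArrayList<>(partition)); // copy rows out, then move on
                queue.put(END);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        try
        {
            // Writer side: drain buffered partitions in order and "write" them.
            List<String> batch;
            while ((batch = queue.take()) != END)
                written.addAll(batch);
            reader.join();
        }
        catch (InterruptedException e)
        {
            throw new RuntimeException(e);
        }
        return written;
    }
}
```

The output order is preserved because the queue is FIFO and there is a single reader; the parallelism comes purely from the reader running ahead of the writer by up to the queue bound.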
> Compactions are slow > > > Key: CASSANDRA-18773 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18773 > Project: Cassandra > Issue Type: Improvement > Components: Local/Compaction > Reporter: Cameron Zemek > Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > Attachments: 18773.patch, compact-poc.patch, flamegraph.png, > stress.yaml > > Time Spent: 10m > Remaining Estimate: 0h > > I have noticed that compactions involving a lot of sstables are very slow (for example major compactions). I have attached a cassandra-stress profile that can generate such a dataset under ccm. In my local test I have 2567 sstables at 4Mb each. > I added code to track wall clock time of various parts of the code. One problematic part is the ManyToOne constructor: tracing through the code, for every partition it creates a ManyToOne over all the sstable iterators. In my local test I get a measly 60Kb/sec read speed, bottlenecked on a single CPU core (since this code is single threaded), with 85% of the wall clock time spent in the ManyToOne constructor. > As another datapoint to show it's the merge-iterator part of the code: the cfstats tool from [https://github.com/instaclustr/cassandra-sstable-tools/], which reads all the sstables but does no merging, gets 26Mb/sec read speed. > Tracking back from the ManyToOne call I see this in UnfilteredPartitionIterators::merge > {code:java} > for (int i = 0; i < toMerge.size(); i++) > { > if (toMerge.get(i) == null) > { > if (null == empty) > empty = EmptyIterators.unfilteredRow(metadata, > partitionKey, isReverseOrder); > toMerge.set(i, empty); > } > } > {code} > I am not sure what the purpose of creating these empty iterators is, but on a whim I removed all of them before passing to ManyToOne, and all the wall clock time shifted to CompactionIterator::hasNext() while read speed increased to 1.5Mb/s. > So there are further bottlenecks in this code path it seems, but the first is this ManyToOne and having to build it for every partition read.
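The quoted loop replaces null entries in toMerge with a shared empty iterator, presumably so each merge input keeps its positional index (a guess on my part; the ticket itself notes the purpose is unclear), whereas the experiment drops the nulls, shrinking the merge fan-in. A toy illustration of the two approaches, with plain lists standing in for the unfiltered iterators and invented method names:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the padding loop quoted above: null entries are replaced with
// a shared empty list so each merge input keeps its positional index. Dropping
// nulls instead shrinks the merge fan-in, which is what made the ManyToOne
// construction cheaper in the experiment, at the cost of index alignment.
public class MergeInputSketch
{
    private static final List<Integer> EMPTY = new ArrayList<>(); // shared, like the lazily-built `empty`

    static List<List<Integer>> padWithEmpty(List<List<Integer>> toMerge)
    {
        List<List<Integer>> out = new ArrayList<>(toMerge);
        for (int i = 0; i < out.size(); i++)
            if (out.get(i) == null)
                out.set(i, EMPTY); // preserves index alignment with the sstable list
        return out;
    }

    static List<List<Integer>> dropNulls(List<List<Integer>> toMerge)
    {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> in : toMerge)
            if (in != null)
                out.add(in); // smaller fan-in; indices no longer align
        return out;
    }
}
```

For an input of three slots where the middle sstable has no data for the partition, padding keeps three merge inputs while dropping leaves two.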
[jira] [Updated] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18773: -- Attachment: 18773.patch
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758307#comment-17758307 ] Cameron Zemek commented on CASSANDRA-18773: --- I added the listener. I also separated the reading into its own background thread for a further performance increase. {noformat} INFO [CompactionExecutor:2] 2023-08-22 15:24:56,237 CompactionTask.java:264 - Read=34.65% 10.43 MiB/s, Write=28.96% 10.83 MiB/s INFO [CompactionExecutor:2] 2023-08-22 15:25:56,237 CompactionTask.java:264 - Read=34.88% 10.49 MiB/s, Write=28.92% 10.9 MiB/s{noformat}
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757742#comment-17757742 ]

Cameron Zemek commented on CASSANDRA-18773:
-------------------------------------------
[^compact-poc.patch] I did a patch as a proof of concept of the idea in my last comment.

Before:
{noformat}
INFO [CompactionExecutor:2] 2023-08-22 03:04:33,591 CompactionTask.java:241 - Read=56.21% 138.64 KiB/s, Write=42.50% 146.09 KiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:05:33,590 CompactionTask.java:241 - Read=56.58% 143.37 KiB/s, Write=42.84% 148.96 KiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:06:33,590 CompactionTask.java:241 - Read=56.51% 144.15 KiB/s, Write=42.91% 149.77 KiB/s
{noformat}
After:
{noformat}
INFO [CompactionExecutor:2] 2023-08-22 03:34:34,471 CompactionTask.java:241 - Read=53.12% 8.07 MiB/s, Write=18.75% 8.38 MiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:35:34,470 CompactionTask.java:241 - Read=55.08% 7.88 MiB/s, Write=17.99% 8.19 MiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:36:34,470 CompactionTask.java:241 - Read=54.51% 7.65 MiB/s, Write=18.75% 7.95 MiB/s
{noformat}
Roughly a 50-fold improvement in compaction speed.

> Compactions are slow
> --------------------
>
>                 Key: CASSANDRA-18773
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18773
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: compact-poc.patch, flamegraph.png, stress.yaml
>
> I have noticed that compactions involving a lot of sstables are very slow
> (for example major compactions). I have attached a cassandra-stress profile
> that can generate such a dataset under ccm. In my local test I have 2567
> sstables at 4Mb each.
> I added code to track the wall clock time of various parts of the code. One
> problematic part is the ManyToOne constructor: tracing through the code,
> every partition read creates a ManyToOne over all the sstable iterators for
> that partition. In my local test I get a measly 60Kb/sec read speed,
> bottlenecked on a single CPU core (since this code is single threaded), with
> 85% of the wall clock time spent in the ManyToOne constructor.
> As another data point showing it's the merge-iterator part of the code: the
> cfstats tool from [https://github.com/instaclustr/cassandra-sstable-tools/],
> which reads all the sstables but does no merging, gets a 26Mb/sec read speed.
> Tracking back from the ManyToOne call I see this in
> UnfilteredPartitionIterators::merge
> {code:java}
> for (int i = 0; i < toMerge.size(); i++)
> {
>     if (toMerge.get(i) == null)
>     {
>         if (null == empty)
>             empty = EmptyIterators.unfilteredRow(metadata, partitionKey, isReverseOrder);
>         toMerge.set(i, empty);
>     }
> }
> {code}
> I'm not sure what the purpose of creating these empty iterators is. But on a
> whim I removed all these empty iterators before passing to ManyToOne, and
> then all the wall clock time shifted to CompactionIterator::hasNext() and
> read speed increased to 1.5Mb/s.
> So it seems there are further bottlenecks in this code path, but the first is
> this ManyToOne and having to build it for every partition read.
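As a sanity check on the claimed speed-up, the ratio can be computed from the log lines quoted above (before ~144 KiB/s, after ~8.07 MiB/s):

```java
public class SpeedupCheck
{
    public static void main(String[] args)
    {
        double beforeKiB = 144.15;       // KiB/s, from the "Before" log lines
        double afterKiB = 8.07 * 1024;   // 8.07 MiB/s converted to KiB/s
        // Prints the read-throughput ratio; comes out to roughly 57x,
        // consistent with the "roughly 50-fold" claim.
        System.out.printf("%.0fx%n", afterKiB / beforeKiB);
    }
}
```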
[jira] [Updated] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek updated CASSANDRA-18773:
--------------------------------------
    Attachment: compact-poc.patch
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755832#comment-17755832 ]

Cameron Zemek commented on CASSANDRA-18773:
-------------------------------------------
Yes, this is not limited to just major compactions; that was just a way I could reproduce the issue reliably. The same thing is happening when switching from STCS to LCS for a customer: that operation has been going for 2 weeks now, on 1.5Tb of disk usage. Disk benchmarks show the disk able to do 120Mb/s with random reads of 16kb chunks, so the operation should have completed in about a day. Picking a random node, it has 5 compactions going with compaction throughput set to 64Mb/s, yet iotop shows a max of 26Mb/s.

I commented out a bunch of code in the hot paths:
{code:java}
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java
index 2eb5d8fde7..bd72117632 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java
@@ -532,7 +532,7 @@ public abstract class UnfilteredRowIterators
         public void close()
         {
             // This will close the input iterators
-            FileUtils.closeQuietly(mergeIterator);
+//            FileUtils.closeQuietly(mergeIterator);

             if (listener != null)
                 listener.close();
diff --git a/src/java/org/apache/cassandra/utils/MergeIterator.java b/src/java/org/apache/cassandra/utils/MergeIterator.java
index 6713dd0a43..5744dfb89b 100644
--- a/src/java/org/apache/cassandra/utils/MergeIterator.java
+++ b/src/java/org/apache/cassandra/utils/MergeIterator.java
@@ -42,7 +42,13 @@ public abstract class MergeIterator extends AbstractIterator implem
             return reducer.trivialReduceIsTrivial()
                  ? new TrivialOneToOne<>(sources, reducer)
                  : new OneToOne<>(sources, reducer);
         }
-        return new ManyToOne<>(sources, comparator, reducer);
+        ArrayList<Iterator<In>> filtered = new ArrayList<>(sources.size());
+        for (Iterator<In> it : sources) {
+            if (it != null) {
+                filtered.add(it);
+            }
+        }
+        return new ManyToOne<>(filtered, comparator, reducer);
     }

     public Iterable<? extends Iterator<In>> iterators()
@@ -361,7 +367,8 @@ public abstract class MergeIterator extends AbstractIterator implem
         this.iter = iter;
         this.comp = comp;
         this.idx = idx;
-        this.lowerBound = iter instanceof IteratorWithLowerBound ? ((IteratorWithLowerBound)iter).lowerBound() : null;
+        this.lowerBound = null;
+//        this.lowerBound = iter instanceof IteratorWithLowerBound ? ((IteratorWithLowerBound)iter).lowerBound() : null;
     }

     /** @return this if our iterator had an item, and it is now available, otherwise null */
{code}
It is still spending a significant chunk of time in UnfilteredRowMergeIterator, with the bulk of that in the ManyToOne constructor. Is there not a way to manage the sstable merging without creating so many objects like ManyToOne? E.g. have a state object for each sstable and use that throughout the whole compaction to manage the merging. This is what cassandra-sstable-tools does: it keeps the current partition key for each sstable and holds all the sstables in a priority queue (readerQueue). E.g.:
{code:java}
ArrayList<Reader> toMerge = new ArrayList<>(readerQueue.size());
while (!readerQueue.isEmpty())
{
    Reader reader = readerQueue.remove();
    toMerge.add(reader);
    DecoratedKey key = reader.key;
    // grab every other reader positioned on the same partition key
    while ((reader = readerQueue.peek()) != null && reader.key.equals(key))
    {
        readerQueue.remove();
        toMerge.add(reader);
    }
    doMerge(toMerge);
    for (Reader r : toMerge)
        readerNext(r); // advance the reader and re-add it to the priority queue if it has more
    toMerge.clear();
}
{code}
That is, each sstable reader sits positioned ready to read its current partition. Grab all the readers that belong to the partition to be merged; doMerge iterates the rows in those readers and performs the merging; then readerNext reads the next partition key and puts the reader back into the priority queue. It doesn't have to be a priority queue, just some efficient way to determine which sstables to include in each partition merge.
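The queue-driven merge sketched above can be made concrete as a self-contained toy. This is an illustration of the proposed scheme only, not Cassandra code: `Reader` here is a hypothetical stand-in for an sstable reader, with integer partition keys, and the "row merge" step is reduced to recording each distinct key once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class QueueMerge
{
    // Hypothetical stand-in for an sstable reader: a sorted stream of partition keys.
    static final class Reader
    {
        final Iterator<Integer> keys;
        Integer key; // current partition key, null when exhausted

        Reader(List<Integer> sortedKeys) { keys = sortedKeys.iterator(); advance(); }
        void advance() { key = keys.hasNext() ? keys.next() : null; }
    }

    // Visit every distinct partition key across all readers, in order,
    // pulling only the readers that actually contain that partition.
    static List<Integer> mergeKeys(List<Reader> readers)
    {
        PriorityQueue<Reader> queue =
            new PriorityQueue<Reader>(Comparator.comparing(r -> r.key));
        for (Reader r : readers)
            if (r.key != null)
                queue.add(r);

        List<Integer> merged = new ArrayList<>();
        List<Reader> toMerge = new ArrayList<>();
        while (!queue.isEmpty())
        {
            Reader reader = queue.remove();
            toMerge.add(reader);
            Integer key = reader.key;
            // grab every other reader positioned on the same partition key
            while ((reader = queue.peek()) != null && reader.key.equals(key))
            {
                queue.remove();
                toMerge.add(reader);
            }
            merged.add(key); // a real doMerge(toMerge) would merge rows here
            for (Reader r : toMerge)
            {
                r.advance();
                if (r.key != null)
                    queue.add(r); // re-queue readers that still have partitions
            }
            toMerge.clear();
        }
        return merged;
    }

    public static void main(String[] args)
    {
        List<Reader> readers = List.of(
            new Reader(List.of(1, 4, 7)),
            new Reader(List.of(1, 2, 7)),
            new Reader(List.of(3)));
        System.out.println(mergeKeys(readers)); // prints [1, 2, 3, 4, 7]
    }
}
```

The per-partition cost is one O(log n) queue operation per participating reader, and no per-partition merge objects are allocated beyond the reused `toMerge` list, which is the contrast being drawn with constructing a fresh ManyToOne for every partition.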
[jira] [Comment Edited] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755804#comment-17755804 ]

Cameron Zemek edited comment on CASSANDRA-18773 at 8/18/23 5:05 AM:
--------------------------------------------------------------------
!flamegraph.png|width=1508,height=691!

was (Author: cam1982):
!flamegraph.png!
[jira] [Updated] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek updated CASSANDRA-18773:
--------------------------------------
    Attachment: flamegraph.png
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755804#comment-17755804 ]

Cameron Zemek commented on CASSANDRA-18773:
-------------------------------------------
!flamegraph.png!