[jira] [Commented] (CASSANDRA-19776) Spinning trying to capture readers
[ https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871180#comment-17871180 ]

Cameron Zemek commented on CASSANDRA-19776:
-------------------------------------------

Note there are other places in the code that call selectAndReference with the CANONICAL set that would hit the same issue if there is a compaction ongoing. In fact, I have blacklisted the EstimatedPartitionCount metric as a workaround but still see this spinning occur (I have yet to trace the origin of those calls). Another interesting data point: all the occurrences of this I have seen are with TimeWindowCompactionStrategy.

> Spinning trying to capture readers
> ----------------------------------
>
>                 Key: CASSANDRA-19776
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: extract.log
>
> On a handful of clusters we are noticing spin locks occurring. I traced back all the calls to the EstimatedPartitionCount metric (e.g. org.apache.cassandra.metrics:type=Table,keyspace=testks,scope=testcf,name=EstimatedPartitionCount) using the following patched function:
> {code:java}
> public RefViewFragment selectAndReference(Function<View, Iterable<SSTableReader>> filter)
> {
>     long failingSince = -1L;
>     boolean first = true;
>     while (true)
>     {
>         ViewFragment view = select(filter);
>         Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
>         if (refs != null)
>             return new RefViewFragment(view.sstables, view.memtables, refs);
>         if (failingSince <= 0)
>         {
>             failingSince = System.nanoTime();
>         }
>         else if (System.nanoTime() - failingSince > TimeUnit.MILLISECONDS.toNanos(100))
>         {
>             List<SSTableReader> released = new ArrayList<>();
>             for (SSTableReader reader : view.sstables)
>                 if (reader.selfRef().globalCount() == 0)
>                     released.add(reader);
>             NoSpamLogger.log(logger, NoSpamLogger.Level.WARN, 1, TimeUnit.SECONDS,
>                              "Spinning trying to capture readers {}, released: {}", view.sstables, released);
>             if (first)
>             {
>                 first = false;
>                 try {
>                     throw new RuntimeException("Spinning trying to capture readers");
>                 } catch (Exception e) {
>                     logger.warn("Spin lock stacktrace", e);
>                 }
>             }
>             failingSince = System.nanoTime();
>         }
>     }
> }
> {code}
> Digging into this code I found it will fail if any of the sstables are in a released state (i.e. reader.selfRef().globalCount() == 0).
> See the attached extract.log for an example of one of these spin lock occurrences. Sometimes these spin locks last over 5 minutes. Across the worst cluster with this issue, I ran a log processing script that, every time the 'Spinning trying to capture readers' message differed from the previous one, output whether the released tables were in the Compacting state. Every single occurrence has it spin locking with released listing an sstable that is compacting.
> In the extract.log example it is spin locking saying that nb-320533-big-Data.db has been released, but you can see that prior to the spinning that sstable is involved in a compaction. The compaction completes at 01:03:36 and the spinning stops. nb-320533-big-Data.db is deleted at 01:03:49 along with the other 9 sstables involved in the compaction.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
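The all-or-nothing capture that selectAndReference relies on can be sketched as follows. This is a simplified, illustrative model (the class and method names are invented here, not Cassandra's actual Ref/Refs implementation): a negative count marks a released resource, and a single released member makes the whole group capture fail and roll back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of reference capture: count < 0 means released,
// and releasing one member of a group blocks capturing the whole group.
class TryRefDemo
{
    static class Resource
    {
        // count < 0 means released; >= 0 is the number of live references
        final AtomicInteger counts = new AtomicInteger(0);

        boolean ref()
        {
            while (true)
            {
                int cur = counts.get();
                if (cur < 0)
                    return false; // released: can never be re-referenced
                if (counts.compareAndSet(cur, cur + 1))
                    return true;
            }
        }

        void release()
        {
            counts.decrementAndGet();
        }
    }

    // Returns true only if EVERY resource could be referenced; otherwise
    // rolls back the references already taken (all-or-nothing capture).
    static boolean tryRefAll(List<Resource> resources)
    {
        List<Resource> taken = new ArrayList<>();
        for (Resource r : resources)
        {
            if (!r.ref())
            {
                for (Resource t : taken)
                    t.release();
                return false;
            }
            taken.add(r);
        }
        return true;
    }
}
```

With this shape, a caller that retries tryRefAll in a loop spins for as long as any member of the view stays released, which matches the reported behaviour.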
[jira] [Commented] (CASSANDRA-18543) Waiting for gossip to settle does not wait for live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870881#comment-17870881 ]

Cameron Zemek commented on CASSANDRA-18543:
-------------------------------------------

[~Aburadeh] can you refer to https://issues.apache.org/jira/browse/CASSANDRA-19580 to see if that is what you are running into?

> Waiting for gossip to settle does not wait for live endpoints
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-18543
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18543
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 3.11.16, 4.0.11, 4.1.3, 5.0-alpha1, 5.0
>
>         Attachments: gossip.patch, gossip4.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When a node starts it will get endpoint states (via the shadow round) but have all nodes marked as down. The problem is that the wait for gossip to settle only checks that the size of the endpoint state map is stable before starting the native transport. Once the native transport starts, the node will receive queries and fail consistency levels such as LOCAL_QUORUM, since it still thinks the other nodes are down.
> This is a problem for a number of our customers' large clusters. The cluster has quorum, but due to this issue a node restart causes a bunch of query errors.
> My initial solution was to check the live endpoint count in addition to the size of the endpoint state map. This worked, but while testing the fix I noticed there is also a lot of duplicated checking of the same node's liveness (via Echo messages). So the patch also removes this duplication of checking that a node is UP in markAlive.
> The final problem I found while testing is that sometimes a change in live endpoints could still be missed due to the 1 second polling interval, so the patch allows the settle parameters to be overridden. I could not reliably reproduce this, but I think it's worth providing a way to override these hardcoded values.
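The settle check described in the ticket (wait until both the endpoint-state count and the live-endpoint count are stable across several polls) can be sketched like this. All names and parameters here are illustrative, not Cassandra's actual StorageService code:

```java
// Sketch of a gossip settle loop that requires BOTH the endpoint-state
// count and the live-endpoint count to be unchanged for several polls.
class GossipSettler
{
    interface ClusterView
    {
        int endpointStateCount();
        int liveEndpointCount();
    }

    // Returns the number of polls it took to settle, or -1 if maxPolls ran out.
    static int waitToSettle(ClusterView view, int requiredStablePolls, int maxPolls)
    {
        int stable = 0;
        int lastStates = -1, lastLive = -1;
        for (int poll = 1; poll <= maxPolls; poll++)
        {
            int states = view.endpointStateCount();
            int live = view.liveEndpointCount();
            // both counts must be unchanged to count as a stable poll
            if (states == lastStates && live == lastLive)
                stable++;
            else
                stable = 0;
            lastStates = states;
            lastLive = live;
            if (stable >= requiredStablePolls)
                return poll;
        }
        return -1;
    }
}
```

Checking only `endpointStateCount()` (the pre-patch behaviour) would declare the cluster settled while `liveEndpointCount()` is still climbing, which is exactly the window in which LOCAL_QUORUM queries fail.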
[jira] [Commented] (CASSANDRA-19776) Spinning trying to capture readers
[ https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868832#comment-17868832 ]

Cameron Zemek commented on CASSANDRA-19776:
-------------------------------------------

Okay, I have found the cause of this. The EstimatedPartitionCount metric asks for a reference to the CANONICAL sstables:
{code:java}
ViewFragment view = select(filter);
Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
{code}
Meanwhile there is a compaction running that includes a fully expired sstable:
{code:java}
Set<SSTableReader> actuallyCompact = Sets.difference(transaction.originals(), fullyExpiredSSTables);
// ...
try (Refs<SSTableReader> refs = Refs.ref(actuallyCompact))
{code}
But the compaction doesn't take a reference on the fully expired sstable. So the selectAndReference call by EstimatedPartitionCount is stuck looping, trying to take a reference to the fully expired sstable; that sstable has no references, so the attempt fails the counts check:
{code:java}
boolean ref()
{
    while (true)
    {
        int cur = counts.get();
        if (cur < 0)
            return false;
        if (counts.compareAndSet(cur, cur + 1))
            return true;
    }
}
{code}
It spins until the compaction completes, at which point the fully expired sstable is removed from the CANONICAL set of sstables.
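The interplay described above can be replayed as a small sequential model. This is purely illustrative (the names and the step-wise "compaction" are invented for the sketch): the metric's capture attempt keeps failing while a released sstable is still a member of the canonical view, and succeeds only once the finished compaction removes it.

```java
import java.util.List;
import java.util.Set;

// Sequential replay of the spin: capture of the view fails while any
// member is released, and compaction progress eventually shrinks the view.
class SpinTimeline
{
    // capture succeeds only when no member of the view is already released
    static boolean tryCapture(Set<String> view, Set<String> released)
    {
        for (String sstable : view)
            if (released.contains(sstable))
                return false;
        return true;
    }

    // runs one "compaction step" after each failed attempt;
    // returns how many attempts the metric needed to capture the view
    static int metricAttempts(List<Runnable> compactionSteps, Set<String> view, Set<String> released)
    {
        int attempts = 0;
        int step = 0;
        while (true)
        {
            attempts++;
            if (tryCapture(view, released))
                return attempts;
            if (step < compactionSteps.size())
                compactionSteps.get(step++).run();
        }
    }
}
```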
[jira] [Created] (CASSANDRA-19776) Spinning trying to capture readers
Cameron Zemek created CASSANDRA-19776:
-----------------------------------------

             Summary: Spinning trying to capture readers
                 Key: CASSANDRA-19776
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
             Project: Cassandra
          Issue Type: Bug
            Reporter: Cameron Zemek
         Attachments: extract.log
[jira] [Assigned] (CASSANDRA-19703) Newly inserted prepared statements got evicted too early from cache that leads to race condition
[ https://issues.apache.org/jira/browse/CASSANDRA-19703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek reassigned CASSANDRA-19703:
-----------------------------------------

    Assignee: Cameron Zemek

> Newly inserted prepared statements got evicted too early from cache that leads to race condition
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19703
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19703
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Yuqi Yan
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.1.x
>
> We're upgrading from Cassandra 4.0 to Cassandra 4.1.3, and the system.prepared_statements table started growing to GB size after the upgrade. This slows down node startup significantly when it is doing preloadPreparedStatements.
> I can't share the exact log, but it's a race condition like this:
> # [Thread 1] Receives a prepare request for S1. Attempts to get S1 from the cache
> # [Thread 1] Cache miss, puts S1 into the cache
> # [Thread 1] Attempts to write S1 into the local table
> # [Thread 2] Receives a prepare request for S2. Attempts to get S2 from the cache
> # [Thread 2] Cache miss, puts S2 into the cache
> # [Thread 2] Cache is full, evicts S1 from the cache
> # [Thread 2] Attempts to delete S1 from the local table
> # [Thread 2] Tombstone inserted for S1, delete finished
> # [Thread 1] Record inserted for S1, write finished
> Thread 2 inserted a tombstone for S1 earlier than Thread 1 was able to insert the record into the table. Hence the data will not be removed, because the later insert has a newer write time than the tombstone.
> Whether this happens or not depends on how the cache decides what the next entry to evict is when it's full. We noticed that in 4.1.3 Caffeine was upgraded to 2.9.2 (CASSANDRA-15153).
> I did some research in the Caffeine commits. It seems this commit caused the entry to be evicted too early: "Eagerly evict an entry if it is too large to fit in the cache" (Feb 2021), available after 2.9.0:
> [https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> It was later fixed in: "Improve eviction when overflow or the weight is oversized" (Aug 2022), available after 3.1.2:
> [https://github.com/ben-manes/caffeine/commit/25b7d17b1a246a63e4991d4902a2ecf24e86d234]
> {quote}Previously an attempt to centralize evictions into one code path led to a suboptimal approach ([464bc19|https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]). This tried to move those entries into the LRU position for early eviction, but was confusing and could too aggressively evict something that is desirable to keep.
> {quote}
> I upgraded Caffeine to 3.1.8 (same as the 5.0 trunk) and this issue is gone. But I think that version is not compatible with Java 8.
> I'm not 100% sure if this is the root cause or what the correct fix is here. Would appreciate it if anyone can have a look, thanks.
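The eviction/write race enumerated above can be replayed sequentially under last-write-wins reconciliation. This is an illustrative model only (the map, the timestamp encoding, and the helper names are invented here; this is not Caffeine's or Cassandra's actual API): the tombstone from the premature eviction lands first, so the insert that finishes later shadows it, and the row is never really deleted.

```java
import java.util.HashMap;
import java.util.Map;

// Last-write-wins replay of the prepared-statement race: a tombstone
// written before the original insert completes is shadowed by that insert.
class PreparedStatementRace
{
    // value = write timestamp; a negative timestamp marks a tombstone
    static final Map<String, Long> table = new HashMap<>();

    static void write(String key, long ts)
    {
        // the cell with the larger absolute timestamp wins reconciliation
        table.merge(key, ts, (old, now) -> Math.abs(now) >= Math.abs(old) ? now : old);
    }

    static void delete(String key, long ts)
    {
        write(key, -ts); // tombstone, reconciled by the same last-write-wins rule
    }

    static boolean isLive(String key)
    {
        Long ts = table.get(key);
        return ts != null && ts > 0;
    }
}
```

Replaying the ticket's interleaving (delete at t=2, insert at t=3) leaves S1 live, which is how the table keeps growing despite the evictions.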
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845757#comment-17845757 ]

Cameron Zemek commented on CASSANDRA-18866:
-------------------------------------------

The Gossip Stage has only 1 thread, so this doesn't have a race condition. So the full patch is [^CASSANDRA-18866-4.0.patch]

> Node sends multiple inflight echos
> ----------------------------------
>
>                 Key: CASSANDRA-18866
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 18866-regression.patch, CASSANDRA-18866-4.0.patch, duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to allow only one in-flight ECHO request at a time. As per 18854, some tests had an error rate due to this change. I am creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests in flight at the same time, so it is less likely that all ECHOs will time out or get lost.
> With the change from 18845 plus some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from a node and also see it retrying ECHOs when it doesn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but this is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive when other nodes consider the node UP but it is marked DOWN by a node.
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek updated CASSANDRA-18866:
--------------------------------------

    Attachment: CASSANDRA-18866-4.0.patch
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845360#comment-17845360 ]

Cameron Zemek commented on CASSANDRA-18866:
-------------------------------------------

Found a bug with this patch:
{code:java}
private void handleMajorStateChange(InetAddressAndPort ep, EndpointState epState)
{
    // omitted for brevity
    endpointStateMap.put(ep, epState);
    if (localEpState != null)
    {
        // the node restarted: it is up to the subscriber to take whatever action is necessary
        for (IEndpointStateChangeSubscriber subscriber : subscribers)
            subscriber.onRestart(ep, localEpState);
    }
    if (!isDeadState(epState))
        markAlive(ep, epState);
{code}
markAlive is passed the remote epState that just got put into the endpointStateMap, which has isAlive = true on it.
{code:java}
private void markAlive(final InetAddressAndPort addr, final EndpointState localState)
{
    if (inflightEcho.contains(addr))
    {
        return;
    }
    inflightEcho.add(addr);
    localState.markDead();
{code}
But we don't enter markAlive when there is already an in-flight echo request. So endpointStateMap now has an entry with isAlive = true, but unreachableEndpoints has the down node. So now `nodetool status` and the down endpoint count do not match.
The fix is to have the onResponse to the ECHO update the entry currently in the map, and to always mark the passed-in state dead:
{code:java}
private void markAlive(final InetAddressAndPort addr, final EndpointState localState)
{
    localState.markDead();
    if (!inflightEcho.add(addr))
    {
        return;
    }
    Message echoMessage = Message.out(ECHO_REQ, noPayload);
    logger.trace("Sending ECHO_REQ to {}", addr);
    RequestCallback echoHandler = new RequestCallback()
    {
        @Override
        public void onResponse(Message msg)
        {
            // force processing of the echo response onto the gossip stage, as it comes in on the REQUEST_RESPONSE stage
            runInGossipStageBlocking(() -> {
                try
                {
                    EndpointState localEpStatePtr = endpointStateMap.get(addr);
                    realMarkAlive(addr, localEpStatePtr);
                }
                finally
                {
                    inflightEcho.remove(addr);
                }
            });
        }
    };
{code}
Not sure if this allows for a race condition around endpointStateMap (e.g. a call to handleMajorStateChange putting a new entry that gets marked dead after the call to get localEpStatePtr in the onResponse callback).
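The shape of the fix discussed above can be sketched with plain collections. This is a hypothetical, simplified model (String endpoints, a boolean alive flag, invented class names; not the Gossiper's actual types): always mark the passed-in state dead, suppress duplicate echoes via the set's `add()` return value, and have the response handler re-read the CURRENT map entry rather than the possibly stale captured one.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of single-in-flight echo tracking with a current-entry lookup
// in the response handler.
class EchoTracker
{
    static class State { boolean alive = true; }

    final Map<String, State> endpointStateMap = new ConcurrentHashMap<>();
    final Set<String> inflightEcho = ConcurrentHashMap.newKeySet();

    // returns true if an echo was actually sent (i.e. none already in flight)
    boolean markAlive(String addr, State incoming)
    {
        incoming.alive = false;       // always mark dead, even when skipping
        if (!inflightEcho.add(addr))  // add() is false when one is already in flight
            return false;
        // ... send ECHO_REQ here ...
        return true;
    }

    void onEchoResponse(String addr)
    {
        try
        {
            State current = endpointStateMap.get(addr); // re-read the current entry
            if (current != null)
                current.alive = true;
        }
        finally
        {
            inflightEcho.remove(addr);
        }
    }
}
```

The point of re-reading the map in onEchoResponse is that handleMajorStateChange may have replaced the entry while the echo was in flight; reviving the stale captured state would reintroduce the `nodetool status` mismatch.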
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842799#comment-17842799 ]

Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------

> Most of what you've described here are implementation details of how replace works, like how hibernate is handled, so I'm not sure if anything is wrong.

I do not follow what you mean by "not sure if anything is wrong". The problem is that you can't do the replacement if for any reason the node ends up in the hibernate state. It is forever stuck on the 'Unable to contact any seeds!' error; every attempt at replacement results in that error. This is a long-running issue that we have seen many times over the years but never managed to figure out the cause of.

I do not know what the correct solution to this is. There seem to be many possible approaches to a fix, and I am unaware of the reasons behind the current implementation, so I can't decide which would be the preferred method. For example, I don't understand why responses to a SYN do not include state for nodes that are not in the digest list. Gossip has been like this for a long time, so that seems a rather major thing to change. Another approach would be to no longer use hibernate, i.e. CASSANDRA-12344.

> Unable to contact any seeds with node in hibernate status
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-19580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Cameron Zemek
>            Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I have been able to reproduce this issue if I kill Cassandra as it is joining, which puts the node into hibernate status. Once a node is in hibernate it will no longer receive any SYN messages from other nodes during startup, and as it sends only itself as a digest in outbound SYN messages, it never receives any states in any of the ACK replies. So once it gets to the `seenAnySeed` check, it fails as the endpointStateMap is empty.
>
> A workaround is copying the system.peers table from another node, but this is less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
> /* Possibly gossip to a seed for facilitating partition healing */
> private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
> {
>     int size = seeds.size();
>     if (size > 0)
>     {
>         if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>         {
>             return;
>         }
>         if (liveEndpoints.size() == 0)
>         {
>             List<GossipDigest> gDigests = prod.payload.gDigests;
>             if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>             {
>                 gDigests = new ArrayList<GossipDigest>();
>                 GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                        DatabaseDescriptor.getPartitionerName(),
>                                                                        gDigests);
>                 MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                      digestSynMessage,
>                                                                                      GossipDigestSyn.serializer);
>                 sendGossip(message, seeds);
>             }
>             else
>             {
>                 sendGossip(prod, seeds);
>             }
>         }
>         else
>         {
>             /* Gossip with the seed with some probability. */
>             double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>             double randDbl = random.nextDouble();
>             if (randDbl <= probability)
>                 sendGossip(prod, seeds);
>         }
>     }
> }
> {code}
> The only problem is that this is the same as the SYN from the shadow round. It does resolve the issue, however, as the node then receives an ACK with all the states.
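Why the empty digest list in the workaround changes the outcome can be shown with a small model of the digest exchange. This is illustrative only (string endpoints and a map of states, not Cassandra's wire format): an ACK only carries state for endpoints named in the SYN's digest list, so a SYN listing only the sender learns nothing new, while an empty digest list (shadow-round style) gets every known state back.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Model of a seed answering a gossip SYN: states are returned only for the
// endpoints named in the digest list, except an empty list returns everything.
class DigestExchange
{
    static Map<String, String> ackFor(List<String> synDigests, Map<String, String> known)
    {
        Map<String, String> ack = new HashMap<>();
        if (synDigests.isEmpty())
        {
            ack.putAll(known); // shadow-round style: send every known state
            return ack;
        }
        for (String endpoint : synDigests) // normal round: only requested endpoints
            if (known.containsKey(endpoint))
                ack.put(endpoint, known.get(endpoint));
        return ack;
    }
}
```

Under this model the hibernating node, which puts only its own digest in the SYN, can never populate its endpointStateMap with the rest of the cluster, matching the 'Unable to contact any seeds!' failure.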
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842182#comment-17842182 ]

Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------

I don't understand why Gossiper::examineGossiper is implemented to only iterate over the digests in the SYN message. Why doesn't it also send back, in the delta, the entries missing from the digest list?
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841016#comment-17841016 ]

Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------

> Set compression to all so there are no special cases and test again.

My test was with all.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840266#comment-17840266 ] Cameron Zemek commented on CASSANDRA-19580:
---
> If you have internode_compression=dc then replacement with the same IP will not work, you need to use a different IP because the compression has already been negotiated on the other nodes.

Not to get too off topic from the issue at hand, but I am able to do a replacement with the same IP with internode compression enabled. So what doesn't work about this?
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840241#comment-17840241 ] Cameron Zemek commented on CASSANDRA-19580:
---
Yeah, so what breaks if we use the same state as when replacing with a different address? I looked through CASSANDRA-8523 and didn't understand what is different about replacing when reusing the same IP address. Why isn't the node in UJ state when doing replacements, that is, receiving writes but not reads?

What do you think would be the correct fix here? Is sending an empty SYN like the shadow round okay? Why does examineGossiper not send back states for missing digests (it only compares the digests listed in the SYN)? Considering that SYN messages are sent randomly, it seems like we could also end up on this 'Unable to contact any seeds!' path if none of the nodes randomly picks the replacement node to send a SYN to.
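On the last point, a rough back-of-envelope (purely illustrative, not the actual Gossiper selection logic; the class name is made up) shows the chance that one specific peer is missed by every node in a single round:

```java
public class SynMissProbability {
    // Chance that, in one gossip round, none of k gossiping nodes picks one
    // specific peer, when each node chooses 1 target uniformly at random
    // from m candidate endpoints.
    static double missProbability(int k, int m) {
        return Math.pow((m - 1) / (double) m, k);
    }

    public static void main(String[] args) {
        // e.g. a hypothetical 50-node cluster: 49 live nodes, each gossiping
        // to 1 of its 49 peers; the per-round miss chance is roughly 1/e.
        System.out.printf("per-round miss probability: %.3f%n", missProbability(49, 49));
    }
}
```

Because a round runs every second, the per-round miss probability compounds away within a few rounds, so a sustained 'Unable to contact any seeds!' points at the dead-state filtering rather than random target selection alone.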
[jira] [Comment Edited] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839901#comment-17839901 ] Cameron Zemek edited comment on CASSANDRA-19580 at 4/23/24 1:03 AM:
---
[~brandon.williams] do you know why it needs to use hibernate for a replacement with the same address? CASSANDRA-8523 added the BOOT_REPLACE status. I am not sure what I am breaking by doing this:
{code:java}
public void prepareToJoin() throws ConfigurationException
{
    // omitted for brevity
    else if (isReplacingSameAddress())
    {
        // only go into hibernate state if replacing the same address (CASSANDRA-8523)
        logger.warn("Writes will not be forwarded to this node during replacement because it has the same address as " +
                    "the node to be replaced ({}). If the previous node has been down for longer than max_hint_window_in_ms, " +
                    "repair must be run after the replacement process in order to make this node consistent.",
                    DatabaseDescriptor.getReplaceAddress());
        appStates.put(ApplicationState.STATUS, valueFactory.bootReplacing(DatabaseDescriptor.getReplaceAddress()));
    }
{code}
This stops the issue, as the node is no longer put into hibernate during replacement. So if the replacement fails, it is not left in a dead state.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839901#comment-17839901 ] Cameron Zemek commented on CASSANDRA-19580:
---
[~brandon.williams] do you know why it needs to use hibernate for a replacement with the same address? CASSANDRA-8523 added the BOOT_REPLACE status. I am not sure what I am breaking by doing this (same prepareToJoin change as in the edited version of this comment above): switching the STATUS app state from hibernate to valueFactory.bootReplacing(DatabaseDescriptor.getReplaceAddress()). This stops the issue, as the node is no longer put into hibernate during replacement. So if the replacement fails, it is not left in a dead state.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839887#comment-17839887 ] Cameron Zemek commented on CASSANDRA-19580:
---
PS: the customer is not doing step 2; that is just my reliable way to reproduce the issue. I have seen this 'Unable to contact any seeds!' in the past but never had enough information to go on. It seems to happen on larger clusters.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839886#comment-17839886 ] Cameron Zemek commented on CASSANDRA-19580:
---
The node trying to replace. So in my reproduction steps:
# replace a node using '-Dcassandra.replace_address=44.239.237.152'
# while it is replacing, kill off Cassandra
# wipe the Cassandra folders
# start Cassandra again, still using the replace address flag

After step 2, if I check 'nodetool gossipinfo', the node being replaced (44.239.237.152 in this example) has a status of hibernate. During step 4 the other nodes will say 'Not marking /44.239.237.152 alive due to dead state'.

I did a whole bunch of testing of this yesterday and this is the key issue as far as I can tell. Because the replacing node is in hibernate, the other nodes won't send it a SYN (see maybeGossipToUnreachableMember, which filters out endpoints in a dead state). And without a SYN message the replacing node never gets the gossip state of the cluster, as its own SYN messages only contain itself as a digest, so the ACK replies to those don't include other nodes.
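The filtering behaviour described above can be sketched as follows. This is a simplified stand-in for maybeGossipToUnreachableMember with made-up names and an assumed set of dead states; it only illustrates why an endpoint whose status is a dead state such as hibernate never becomes a SYN target, while a shutdown endpoint still does.

```java
import java.util.*;

public class DeadStateFilterSketch {
    // Gossip statuses treated as "dead" for SYN purposes; hibernate is the
    // one at issue here. The exact membership of this set is an assumption.
    static final Set<String> DEAD_STATES =
        new HashSet<>(Arrays.asList("hibernate", "removing", "removed", "left"));

    // Only unreachable endpoints NOT in a dead state remain SYN candidates.
    static List<String> synCandidates(Map<String, String> unreachableStatus) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, String> e : unreachableStatus.entrySet())
            if (!DEAD_STATES.contains(e.getValue()))
                candidates.add(e.getKey());
        return candidates;
    }

    public static void main(String[] args) {
        Map<String, String> unreachable = new LinkedHashMap<>();
        unreachable.put("/44.239.237.152", "hibernate"); // replacing node: filtered out
        unreachable.put("/10.120.156.99", "shutdown");   // still eligible for a SYN
        System.out.println(synCandidates(unreachable)); // prints [/10.120.156.99]
    }
}
```

The hibernating replacement node is therefore invisible to everyone else's gossip until it learns the cluster state some other way.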
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839882#comment-17839882 ] Cameron Zemek commented on CASSANDRA-19580:
---
Customer cluster has:
commitlog_compression=LZ4Compressor
hints_compression=null
internode_compression=dc

So it happens both with and without compression.
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839881#comment-17839881 ] Cameron Zemek commented on CASSANDRA-19580:
---
[~brandon.williams]
> Is compression enabled on this cluster?

Not sure which setting you are referring to. I just replicated the issue on a test cluster where I have:
commitlog_compression=null
internode_compression=none
hints_compression=null
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839879#comment-17839879 ] Cameron Zemek commented on CASSANDRA-19580:
---
Here is an extract of logs showing the issue:
{noformat}
INFO [main] 2024-04-17 17:57:45,766 MessagingService.java:750 - Starting Messaging Service on /10.120.156.42:7000 (eth0)
INFO [main] 2024-04-17 17:57:45,775 StorageService.java:681 - Gathering node replacement information for /10.120.156.42
TRACE [main] 2024-04-17 17:57:45,781 Gossiper.java:1613 - Sending shadow round GOSSIP DIGEST SYN to seeds [/10.120.156.17, /10.120.156.21, /10.120.156.9]
INFO [main] 2024-04-17 17:57:45,788 OutboundTcpConnection.java:108 - OutboundTcpConnection using coalescing strategy DISABLED
INFO [HANDSHAKE-/10.120.156.9] 2024-04-17 17:57:45,802 OutboundTcpConnection.java:561 - Handshaking version with /10.120.156.9
INFO [HANDSHAKE-/10.120.156.17] 2024-04-17 17:57:45,803 OutboundTcpConnection.java:561 - Handshaking version with /10.120.156.17
INFO [HANDSHAKE-/10.120.156.21] 2024-04-17 17:57:45,803 OutboundTcpConnection.java:561 - Handshaking version with /10.120.156.21
TRACE [GossipStage:1] 2024-04-17 17:57:45,875 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.9
TRACE [GossipStage:1] 2024-04-17 17:57:45,875 GossipDigestAckVerbHandler.java:52 - Received ack with 0 digests and 48 states
DEBUG [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:57 - Received an ack from /10.120.156.9, which may trigger exit from shadow round
DEBUG [GossipStage:1] 2024-04-17 17:57:45,876 Gossiper.java:1802 - Received a regular ack from /10.120.156.9, can now exit shadow round
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.21
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:45 - Ignoring GossipDigestAckMessage because gossip is disabled
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.17
TRACE [GossipStage:1] 2024-04-17 17:57:45,876 GossipDigestAckVerbHandler.java:45 - Ignoring GossipDigestAckMessage because gossip is disabled
WARN [main] 2024-04-17 17:57:46,825 StorageService.java:970 - Writes will not be forwarded to this node during replacement because it has the same address as the node to be replaced (/10.120.156.42). If the previous node has been down for longer than max_hint_window_in_ms, repair must be run after the replacement process in order to make this node consistent.
INFO [main] 2024-04-17 17:57:46,827 StorageService.java:877 - Loading persisted ring state
INFO [main] 2024-04-17 17:57:46,829 StorageService.java:1008 - Starting up server gossip
TRACE [main] 2024-04-17 17:57:46,854 Gossiper.java:1550 - gossip started with generation 171337
WARN [main] 2024-04-17 17:57:46,883 StorageService.java:1099 - Detected previous bootstrap failure; retrying
INFO [main] 2024-04-17 17:57:46,883 StorageService.java:1679 - JOINING: waiting for ring information
TRACE [GossipTasks:1] 2024-04-17 17:57:47,855 Gossiper.java:215 - My heartbeat is now 16
TRACE [GossipTasks:1] 2024-04-17 17:57:47,856 Gossiper.java:633 - Gossip Digests are : /10.120.156.42:171337:16
TRACE [GossipTasks:1] 2024-04-17 17:57:47,857 Gossiper.java:782 - Sending a GossipDigestSyn to /10.120.156.17 ...
TRACE [GossipTasks:1] 2024-04-17 17:57:47,857 Gossiper.java:911 - Performing status check ...
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 GossipDigestAckVerbHandler.java:41 - Received a GossipDigestAckMessage from /10.120.156.17
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 GossipDigestAckVerbHandler.java:52 - Received ack with 1 digests and 0 states
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1048 - local heartbeat version 16 greater than 0 for /10.120.156.42
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state STATUS: hibernate,true
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state SCHEMA: 59adb24e-f3cd-3e02-97f0-5b395827453f
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state DC: us-west2
TRACE [GossipStage:1] 2024-04-17 17:57:47,858 Gossiper.java:1063 - Adding state RACK: c
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state RELEASE_VERSION: 3.11.16
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state INTERNAL_IP: 10.120.156.42
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state RPC_ADDRESS: 10.120.156.42
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state NET_VERSION: 11
TRACE [GossipStage:1] 2024-04-17 17:57:47,859 Gossiper.java:1063 - Adding state HOST_ID: 4477-a899-4cc1-a9f9-2
{noformat}
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839870#comment-17839870 ] Cameron Zemek commented on CASSANDRA-19580: --- [~brandon.williams] sorry I did not clarify that exactly what doing, node replacements. In particular for same IP address. If I kill off the node during node replacement the other nodes in cluster will have that replacing node in hibernate status. At which point you will always get 'Unable to contact any seeds!' as SYN are not sent by other nodes to the replacing node when they have it in HIBERNATE status since that is a dead state. In a working replacement the other nodes have it in SHUTDOWN state. Then as part of bootstrap the node gets marked as alive and then one of the nodes end up sending a SYN. That is if there some failure during a node replacement end up in unrecoverable state. > Unable to contact any seeds with node in hibernate status > - > > Key: CASSANDRA-19580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19580 > Project: Cassandra > Issue Type: Bug >Reporter: Cameron Zemek >Priority: Normal > > We have customer running into the error 'Unable to contact any seeds!' . I > have been able to reproduce this issue if I kill Cassandra as its joining > which will put the node into hibernate status. Once a node is in hibernate it > will no longer receive any SYN messages from other nodes during startup and > as it sends only itself as digest in outbound SYN messages it never receives > any states in any of the ACK replies. So once it gets to the check > `seenAnySeed` in it fails as the endpointStateMap is empty. > > A workaround is copying the system.peers table from other node but this is > less than ideal. 
> I tested modifying maybeGossipToSeed as follows:
> {code:java}
> /* Possibly gossip to a seed for facilitating partition healing */
> private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
> {
>     int size = seeds.size();
>     if (size > 0)
>     {
>         if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>         {
>             return;
>         }
>         if (liveEndpoints.size() == 0)
>         {
>             List<GossipDigest> gDigests = prod.payload.gDigests;
>             if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>             {
>                 gDigests = new ArrayList<>();
>                 GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                        DatabaseDescriptor.getPartitionerName(),
>                                                                        gDigests);
>                 MessageOut<GossipDigestSyn> message = new MessageOut<>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                        digestSynMessage,
>                                                                        GossipDigestSyn.serializer);
>                 sendGossip(message, seeds);
>             }
>             else
>             {
>                 sendGossip(prod, seeds);
>             }
>         }
>         else
>         {
>             /* Gossip with the seed with some probability. */
>             double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>             double randDbl = random.nextDouble();
>             if (randDbl <= probability)
>                 sendGossip(prod, seeds);
>         }
>     }
> }
> {code}
> The only problem is this is the same as the SYN from the shadow round. It does resolve the issue, however, as the node then receives an ACK with all the states.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
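The gating described in the comment above (peers never SYN a node they hold in a dead state such as HIBERNATE, while SHUTDOWN still receives gossip) can be sketched as a toy model. The class, method, and the exact state list below are illustrative stand-ins, not Cassandra's actual Gossiper code:

```java
// Simplified model of why a node stuck in HIBERNATE never receives gossip SYNs:
// peers skip endpoints whose status is one of the "dead" states when choosing
// SYN targets, and hibernate is a dead state while shutdown is not.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SynTargetSketch
{
    // Illustrative set of gossip status strings treated as dead states.
    static final Set<String> DEAD_STATES =
        new HashSet<>(Arrays.asList("hibernate", "removing", "removed", "LEFT"));

    /** A peer only gossips SYNs to endpoints it does not consider dead. */
    static boolean willSendSynTo(String peerStatus)
    {
        return !DEAD_STATES.contains(peerStatus);
    }

    public static void main(String[] args)
    {
        // Working replacement: the old node is seen in shutdown state, SYNs still flow.
        System.out.println("shutdown  -> " + willSendSynTo("shutdown"));  // true
        // Failed replacement: node killed mid-replace is left in hibernate, no SYNs.
        System.out.println("hibernate -> " + willSendSynTo("hibernate")); // false
    }
}
```

This is why the maybeGossipToSeed change above helps: it gives the hibernating node a way to pull states from seeds itself instead of waiting for a SYN that will never arrive.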
[jira] [Updated] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-19580: -- Description: We have customer running into the error 'Unable to contact any seeds!' . I have been able to reproduce this issue if I kill Cassandra as its joining which will put the node into hibernate status. Once a node is in hibernate it will no longer receive any SYN messages from other nodes during startup and as it sends only itself as digest in outbound SYN messages it never receives any states in any of the ACK replies. So once it gets to the check `seenAnySeed` in it fails as the endpointStateMap is empty. A workaround is copying the system.peers table from other node but this is less than ideal. I tested modifying maybeGossipToSeed as follows: {code:java} /* Possibly gossip to a seed for facilitating partition healing */ private void maybeGossipToSeed(MessageOut prod) { int size = seeds.size(); if (size > 0) { if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress())) { return; } if (liveEndpoints.size() == 0) { List gDigests = prod.payload.gDigests; if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress())) { gDigests = new ArrayList(); GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(), DatabaseDescriptor.getPartitionerName(), gDigests); MessageOut message = new MessageOut(MessagingService.Verb.GOSSIP_DIGEST_SYN, digestSynMessage, GossipDigestSyn.serializer); sendGossip(message, seeds); } else { sendGossip(prod, seeds); } } else { /* Gossip with the seed with some probability. */ double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size()); double randDbl = random.nextDouble(); if (randDbl <= probability) sendGossip(prod, seeds); } } } {code} Only problem is this is the same as SYN from shadow round. 
It does resolve the issue, however, as the node then receives an ACK with all the states.
[jira] [Created] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status
Cameron Zemek created CASSANDRA-19580: - Summary: Unable to contact any seeds with node in hibernate status Key: CASSANDRA-19580 URL: https://issues.apache.org/jira/browse/CASSANDRA-19580 Project: Cassandra Issue Type: Bug Reporter: Cameron Zemek We have customer running into the error 'Unable to contact any seeds!' . I have been able to reproduce this issue if I kill Cassandra as its joining which will put the node into hibernate status. Once a node is in hibernate it will no longer receive any SYN messages from other nodes during startup and as it sends only itself as digest in outbound SYN messages it never receives any states in any of the ACK replies. So once it gets to the check `seenAnySeed` in it fails as the endpointStateMap is empty. A workaround is copying the system.peers table from other node but this is less than ideal. I tested modifying maybeGossipToSeed as follows: {code:java} /* Possibly gossip to a seed for facilitating partition healing */ private void maybeGossipToSeed(MessageOut prod) { int size = seeds.size(); if (size > 0) { if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress())) { return; } if (liveEndpoints.size() == 0) { List gDigests = prod.payload.gDigests; if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress())) { gDigests = new ArrayList(); GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(), DatabaseDescriptor.getPartitionerName(), gDigests); MessageOut message = new MessageOut(MessagingService.Verb.GOSSIP_DIGEST_SYN, digestSynMessage, GossipDigestSyn.serializer); sendGossip(message, seeds); } else { sendGossip(prod, seeds); } } else { /* Gossip with the seed with some probability. */ double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size()); double randDbl = random.nextDouble(); if (randDbl <= probability) sendGossip(prod, seeds); } } } {code} Only problem is this is the same as SYN from shadow round. 
It does resolve the issue, however, as the node then receives an ACK with all the states.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836370#comment-17836370 ] Cameron Zemek commented on CASSANDRA-18845: --- I have reworked the patch so that it is now a new method instead of modifying the existing waitToSettle, keeping the change to existing behavior as small as possible. It is called directly in MigrationCoordinator::awaitSchemaRequests to handle a bootstrapping node (since nodes need to be in UP state in order to get the schema and stream sstables from them), and just before enabling native transport. https://issues.apache.org/jira/secure/attachment/13068153/CASSANDRA-18845-4_0_12.patch
> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>
> This is a follow up to CASSANDRA-18543
> Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we just observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP.
> The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I am proposing that (outside a single node cluster) we wait for an UP message from another node before considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836370#comment-17836370 ] Cameron Zemek edited comment on CASSANDRA-18845 at 4/11/24 10:18 PM: - I have reworked the [patch| [^CASSANDRA-18845-4_0_12.patch]] more so it a new method instead of modifying the existing waitToSettle, so it has the least change to any existing behavior. It directly called in MigrationCoordinator::awaitSchemaRequests to handle if node bootstrapping (since need nodes in UP state in order to get schema and stream sstables from). And just before enabling native transport. was (Author: cam1982): I have reworked the patch more so it a new method instead of modifying the existing waitToSettle. So it has the least change to any existing behavior. It directly called in MigrationCoordinator::awaitSchemaRequests to handle if node bootstrapping (since need nodes in UP state in order to get schema and stream sstables from). And just before enabling native transport. https://issues.apache.org/jira/secure/attachment/13068153/CASSANDRA-18845-4_0_12.patch > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, > delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, > test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms > this is tedious and error prone. On a node just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP. 
> The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I am proposing that (outside a single node cluster) we wait for an UP message from another node before considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: CASSANDRA-18845-4_0_12.patch > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, > delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, > test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms > this is tedious and error prone. On a node just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP. > The problem being that do not want to start Native Transport until gossip > settles otherwise queries can fail consistency such as LOCAL_QUORUM as it > thinks the replicas are still in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that > (outside single node cluster) wait for UP message from another node before > considering gossip as settled. Eg. > {code:java} > if (currentSize == epSize && currentLive == liveSize && liveSize > > 1) > { > logger.debug("Gossip looks settled."); > numOkay++; > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19473) Latency Spike on NTR startup
Cameron Zemek created CASSANDRA-19473: - Summary: Latency Spike on NTR startup Key: CASSANDRA-19473 URL: https://issues.apache.org/jira/browse/CASSANDRA-19473 Project: Cassandra Issue Type: Improvement Reporter: Cameron Zemek
Firstly, you need the patch from https://issues.apache.org/jira/browse/CASSANDRA-18845 to solve consistency query errors on startup. With that patch there is still a further issue we see on some clusters, where latency spikes too high when initially starting. I see the pending compactions and hints metrics increase during this time. I tried lowering the hint delivery threshold across the cluster, thinking it was overloading the node starting up, but this didn't resolve the issue. So at this time I am not sure what the root cause is (I still think it is a combination of the compactions and hints). As a workaround I have this small code change:
{code:java}
int START_NATIVE_DELAY = Integer.getInteger("cassandra.start_native_transport_delay_secs", 120);
if (START_NATIVE_DELAY > 0)
{
    logger.info("Waiting an extra {} seconds before enabling NTR", START_NATIVE_DELAY);
    Uninterruptibles.sleepUninterruptibly(START_NATIVE_DELAY, TimeUnit.SECONDS);
}
startNativeTransport();
{code}
where we wait a configurable time before starting native transport. Delaying NTR startup resolved the issue. A better solution would be to wait for the hints/compactions, or whatever the root cause is, to complete.
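The "better solution" suggested at the end of this ticket, waiting for the backlog to drain rather than sleeping a fixed time, could look roughly like the following sketch. The gate and its `IntSupplier` are hypothetical stand-ins for whatever pending-work metric (compactions, hints) turns out to be the root cause; this is not the patch itself:

```java
// Sketch: poll a supplied measure of outstanding work and only proceed to
// startNativeTransport() once it has drained or a deadline passes.
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

public class NativeTransportGate
{
    /** Returns true if work drained before the deadline, false if we gave up. */
    static boolean awaitQuiescence(IntSupplier pendingWork, long maxWaitMillis, long pollMillis)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (pendingWork.getAsInt() > 0)
        {
            if (System.nanoTime() >= deadline)
                return false; // deadline hit: start anyway rather than wait forever
            try
            {
                Thread.sleep(pollMillis);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        // Simulated backlog that drains over a few polls.
        int[] backlog = {3};
        boolean drained = awaitQuiescence(() -> Math.max(0, backlog[0]--), 5000, 10);
        System.out.println("drained=" + drained); // then startNativeTransport() would run
    }
}
```

Compared with the fixed 120-second sleep above, this stops waiting as soon as the backlog clears, so fast nodes are not penalized.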
[jira] [Updated] (CASSANDRA-18935) Unable to write to counter table if native transport is disabled on startup
[ https://issues.apache.org/jira/browse/CASSANDRA-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18935: -- Attachment: 18935-3.11.patch > Unable to write to counter table if native transport is disabled on startup > --- > > Key: CASSANDRA-18935 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18935 > Project: Cassandra > Issue Type: Bug >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18935-3.11.patch > > > > {code:java} > if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || > (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) > { > startNativeTransport(); > StorageService.instance.setRpcReady(true); > } {code} > The startup code here only sets RpcReady if native transport is enabled. If > you call > {code:java} > nodetool enablebinary{code} > then this flag doesn't get set. > But with the change from CASSANDRA-13043 it requires RpcReady set to true in > order to get a leader for the counter update. > Not sure what the correct fix is here, seems to only really use this flag for > counters. So thinking perhaps the fix is to just move this outside the if > condition. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18935) Unable to write to counter table if native transport is disabled on startup
[ https://issues.apache.org/jira/browse/CASSANDRA-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18935: -- Description: {code:java} if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) { startNativeTransport(); StorageService.instance.setRpcReady(true); } {code} The startup code here only sets RpcReady if native transport is enabled. If you call {code:java} nodetool enablebinary{code} then this flag doesn't get set. But with the change from CASSANDRA-13043 it requires RpcReady set to true in order to get a leader for the counter update. Not sure what the correct fix is here, seems to only really use this flag for counters. So thinking perhaps the fix is to just move this outside the if condition. was: {code:java} if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) { startNativeTransport(); StorageService.instance.setRpcReady(true); } {code} The startup code here only sets RpcReady if native transport is enabled. If you call {code:java} nodetool enablebinary{code} then this flag doesn't get set. But with the change from CASSANDRA-13043 it requires RpcReady set to true in other to get a leader for the counter update. Not sure what the correct fix is here, seems to only really use this flag for counters. So thinking perhaps the fix is to just move this outside the if condition. 
> Unable to write to counter table if native transport is disabled on startup > --- > > Key: CASSANDRA-18935 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18935 > Project: Cassandra > Issue Type: Bug >Reporter: Cameron Zemek >Priority: Normal > > > {code:java} > if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) || > (nativeFlag == null && DatabaseDescriptor.startNativeTransport())) > { > startNativeTransport(); > StorageService.instance.setRpcReady(true); > } {code} > The startup code here only sets RpcReady if native transport is enabled. If > you call > {code:java} > nodetool enablebinary{code} > then this flag doesn't get set. > But with the change from CASSANDRA-13043 it requires RpcReady set to true in > order to get a leader for the counter update. > Not sure what the correct fix is here, seems to only really use this flag for > counters. So thinking perhaps the fix is to just move this outside the if > condition. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-18935) Unable to write to counter table if native transport is disabled on startup
Cameron Zemek created CASSANDRA-18935: - Summary: Unable to write to counter table if native transport is disabled on startup Key: CASSANDRA-18935 URL: https://issues.apache.org/jira/browse/CASSANDRA-18935 Project: Cassandra Issue Type: Bug Reporter: Cameron Zemek
{code:java}
if ((nativeFlag != null && Boolean.parseBoolean(nativeFlag)) ||
    (nativeFlag == null && DatabaseDescriptor.startNativeTransport()))
{
    startNativeTransport();
    StorageService.instance.setRpcReady(true);
}
{code}
The startup code here only sets RpcReady if native transport is enabled. If you call
{code:java}
nodetool enablebinary{code}
then this flag doesn't get set. But with the change from CASSANDRA-13043, RpcReady must be set to true in order to get a leader for the counter update. I am not sure what the correct fix is here; the flag seems to only really be used for counters, so perhaps the fix is to just move this outside the if condition.
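The fix floated above, moving setRpcReady outside the if condition, can be illustrated with a toy model. Plain booleans stand in for the real DatabaseDescriptor/StorageService calls; this is a sketch of the proposal, not the actual startup code:

```java
// Sketch: RpcReady set unconditionally at startup, so counter writes can still
// elect a leader after a later `nodetool enablebinary`.
public class RpcReadySketch
{
    boolean rpcReady = false;
    boolean nativeRunning = false;

    /** Current behavior: RpcReady is only set when the transport starts at boot. */
    void startupCurrent(boolean startNativeTransport)
    {
        if (startNativeTransport)
        {
            nativeRunning = true;
            rpcReady = true;
        }
    }

    /** Proposed behavior: RpcReady no longer depends on the transport flag. */
    void startupProposed(boolean startNativeTransport)
    {
        if (startNativeTransport)
            nativeRunning = true;
        rpcReady = true; // moved outside the condition
    }

    public static void main(String[] args)
    {
        RpcReadySketch current = new RpcReadySketch();
        current.startupCurrent(false);         // binary disabled at boot...
        System.out.println(current.rpcReady);  // false: counter writes cannot get a leader

        RpcReadySketch proposed = new RpcReadySketch();
        proposed.startupProposed(false);
        System.out.println(proposed.rpcReady); // true even before enablebinary
    }
}
```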
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772415#comment-17772415 ] Cameron Zemek commented on CASSANDRA-18845: --- I have reworked the patch into a pull request here: [Wait for live endpoints as part of waiting for gossip to settle by grom358 · Pull Request #2778 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2778]. I created the PR against 4.1 since 5.x is not as stable. I still have not got around to making an automated test for this yet. It has the following behaviors:
* Must opt in by setting cassandra.gossip_settle_wait_live_max
* Waits up to the maximum number of polls defined by cassandra.gossip_settle_wait_live_max. Set to -1 to wait indefinitely.
* cassandra.skip_wait_for_gossip_to_settle still applies to cap the maximum number of polls.
* cassandra.gossip_settle_wait_live_required determines how many polls in a row without a change to live endpoint state are needed to consider gossip as settled, once opted in via cassandra.gossip_settle_wait_live_max
* If the live endpoint count equals the number of endpoints, consider live endpoints as settled.
* Requires at least 1 other live endpoint to begin considering live endpoints as settled.
Scenarios considered:
* One node cluster. Will skip this check since epSize == liveSize
* Entire cluster is down and starting up a node. Will wait cassandra.gossip_settle_wait_live_max polls
* Restarting a node when another node is down. Will wait cassandra.gossip_settle_wait_live_required polls
* On rare occasions it takes a while to see another node as UP. This is covered by requiring at least 1 other endpoint as up (`liveSize > 1`) to start the settlement process.
Being opt-in, this doesn't break any existing tests. This is also easier to use than the reverted patch, as you just need to set cassandra.gossip_settle_wait_live_max.
To restate, the purpose of this patch is to stop Native-Transport-Requests from starting before Cassandra has finished ECHO requests to other nodes. Starting early results in requests failing LOCAL_QUORUM/QUORUM consistency, as the endpoints are not yet considered live for the purposes of executing requests. This comes up every time we are rolling restarting large clusters for security patches and similar operations, where typically only a single node is allowed to be down at a time. With this pull request the wait for live endpoints ends once all endpoints are UP, which minimizes the time to perform rolling restarts while avoiding failed queries affecting clients.
> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18845-seperate.patch, delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>
> This is a follow up to CASSANDRA-18543
> Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we just observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP.
> The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I am proposing that (outside a single node cluster) we wait for an UP message from another node before considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
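The settle behavior listed in the PR comment above can be modeled as a small pure function. This is an illustration of the described polling rules (consecutive unchanged polls, the epSize == liveSize shortcut, the liveSize > 1 requirement, and a poll cap), not the patch itself; all names are hypothetical:

```java
// Model: given a sequence of (endpointCount, liveCount) poll samples, report after
// how many polls gossip would be declared settled under the rules described above.
public class LiveSettleSketch
{
    /** Returns the 1-based poll index at which we stop waiting. */
    static int pollsUntilSettled(int[][] samples, int required, int maxPolls)
    {
        int numOkay = 0;
        int prevLive = -1;
        for (int i = 0; i < samples.length && i < maxPolls; i++)
        {
            int epSize = samples[i][0], liveSize = samples[i][1];
            if (liveSize == epSize && liveSize > 1)
                return i + 1;                      // everyone is up: settled immediately
            if (liveSize == prevLive && liveSize > 1)
                numOkay++;                         // live set stable for another poll
            else
                numOkay = 0;
            prevLive = liveSize;
            if (numOkay >= required)
                return i + 1;                      // stable long enough: settled
        }
        return Math.min(samples.length, maxPolls); // gave up at the cap
    }

    public static void main(String[] args)
    {
        // 5-node cluster, one node stays down: live count sticks at 4, settles after
        // `required` stable polls rather than waiting out the full cap.
        int[][] oneDown = {{5, 1}, {5, 4}, {5, 4}, {5, 4}, {5, 4}};
        System.out.println(pollsUntilSettled(oneDown, 3, 30));
        // All nodes come up: settles on the poll where live == endpoints.
        int[][] allUp = {{5, 1}, {5, 3}, {5, 5}};
        System.out.println(pollsUntilSettled(allUp, 3, 30));
    }
}
```

The all-up shortcut is what keeps rolling restarts fast: the wait ends the moment every endpoint is UP instead of burning the full stability window.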
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769377#comment-17769377 ] Cameron Zemek commented on CASSANDRA-18866: --- [Cassandra 5.0 Pull Request #2733|https://github.com/apache/cassandra/pull/2733] [Cassandra 4.1 Pull Request #2734|https://github.com/apache/cassandra/pull/2734] [Cassandra 4.0 Pull Request #2735|https://github.com/apache/cassandra/pull/2735] [Cassandra 3.11 Pull Request #2736|https://github.com/apache/cassandra/pull/2736]
> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
> Issue Type: Improvement
> Components: Cluster/Gossip
> Reporter: Cameron Zemek
> Assignee: Cameron Zemek
> Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to only allow 1 inflight ECHO request at a time. As per 18854, some tests have an error rate due to this change. I am creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests inflight at the same time, so it is less likely that all ECHOs will time out or get lost.
> With the change from 18845, after adding in some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from a node and also see it retrying ECHOs when it doesn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but this is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive, even though other nodes consider the node UP while it is marked DOWN by this node.
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768462#comment-17768462 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/24/23 11:47 PM: - Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); inflightEcho.remove(addr); } } {code} [instaclustr/cassandra at CASSANDRA-18866-regressiontest (github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest] was (Author: cam1982): Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); } } {code} [instaclustr/cassandra at 
CASSANDRA-18866-regressiontest (github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest] > Node sends multiple inflight echos > -- > > Key: CASSANDRA-18866 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18866 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18866-regression.patch, duplicates.log, echo.log > > > CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, > 18845 had change to only allow 1 inflight ECHO request at a time. As per > 18854 some tests have an error rate due to this change. Creating this ticket > to discuss this further. As the current state also does not have retry logic, > it just allowing multiple ECHO requests inflight at the same time so less > likely that all ECHO will timeout or get lost. > With the change from 18845 adding in some extra logging to track what is > going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO > requests from a node and also see it retrying ECHOs when it doesn't get a > reply. > Therefore, I think the problem is more specific than the dropping of one ECHO > request. Yes there no retry logic for failed ECHO requests, but this is the > case even both before and after 18845. ECHO requests are only sent via gossip > verb handlers calling applyStateLocally. In these failed tests I therefore > assuming their cases where it won't call markAlive when other nodes consider > the node UP but its marked DOWN by a node. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768462#comment-17768462 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/24/23 11:42 PM: - Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); } } {code} [instaclustr/cassandra at CASSANDRA-18866-regressiontest (github.com)|https://github.com/instaclustr/cassandra/tree/CASSANDRA-18866-regressiontest] was (Author: cam1982): Had to make the following change for some more dtests: Previous: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } {code} After: {code:java} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { if (isEnabled()) { logger.trace("Resending ECHO_REQ to {}", addr); Message echoMessage = Message.out(ECHO_REQ, noPayload); MessagingService.instance().sendWithCallback(echoMessage, addr, this); } else { logger.trace("Failed ECHO_REQ to {}, aborting due to disabled gossip", addr); } } {code} > Node sends multiple inflight echos > -- > > Key: 
CASSANDRA-18866 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18866 > Project: Cassandra > Issue Type: Improvement > Reporter: Cameron Zemek > Priority: Normal > Attachments: 18866-regression.patch, duplicates.log, echo.log > > > CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, > 18845 had a change to only allow 1 inflight ECHO request at a time. As per > 18854, some tests have an error rate due to this change. Creating this ticket > to discuss this further. The current state also does not have retry logic; > it just allows multiple ECHO requests inflight at the same time, so it is less > likely that all ECHOs will time out or get lost. > With the change from 18845, adding in some extra logging to track what is > going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO > requests from a node and also see it retrying ECHOs when it doesn't get a > reply. > Therefore, I think the problem is more specific than the dropping of one ECHO > request. Yes, there is no retry logic for failed ECHO requests, but this is the > case both before and after 18845. ECHO requests are only sent via gossip > verb handlers calling applyStateLocally. In these failed tests I am therefore > assuming there are cases where it won't call markAlive when other nodes consider > the node UP but it is marked DOWN by a node. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
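The gated retry in the comment above can be sketched as a standalone toy outside Cassandra. `gossipEnabled`, `sent`, and `sendEcho` below are hypothetical stand-ins for `Gossiper.isEnabled()` and `MessagingService` (they are not Cassandra APIs); the sketch only shows the callback shape: resend while gossip is enabled, drop the retry once it is disabled.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the retry-on-failure callback, gated on gossip being
// enabled. gossipEnabled and sent are illustrative stand-ins for
// Gossiper.isEnabled() and MessagingService; they are not Cassandra APIs.
public class EchoRetrySketch {
    static final AtomicBoolean gossipEnabled = new AtomicBoolean(true);
    static final AtomicInteger sent = new AtomicInteger();

    // Pretend to send an ECHO_REQ; a fresh "message" would be built per attempt.
    static void sendEcho(String addr) {
        sent.incrementAndGet();
    }

    // Failure callback: only resend while gossip is still enabled.
    static void onFailure(String addr) {
        if (gossipEnabled.get())
            sendEcho(addr);   // resend with a newly constructed message
        // else: drop the retry, gossip is shutting down
    }

    public static void main(String[] args) {
        onFailure("/127.0.0.2:7000");   // enabled -> resends
        gossipEnabled.set(false);
        onFailure("/127.0.0.2:7000");   // disabled -> no resend
        System.out.println(sent.get()); // prints 1
    }
}
```

The point of the gate is that a node disabling gossip would otherwise keep resending echoes forever from its own failure callbacks.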
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768255#comment-17768255 ] Cameron Zemek commented on CASSANDRA-18866:
---
{noformat}
pytest --count=500 --cassandra-dir=/home/grom/dev/cassandra-instaclustr transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup{noformat}
500/500 passes.
{noformat}
$ rg 'Resending'
1695355675403_test_move_forwards_between_and_cleanup[27-500]/node4_debug.log
1263:DEBUG [InternalResponseStage:1] 2023-09-22 14:07:06,461 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695362768506_test_move_forwards_between_and_cleanup[74-500]/node1_debug.log
1038:DEBUG [InternalResponseStage:1] 2023-09-22 16:05:20,772 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695362768506_test_move_forwards_between_and_cleanup[74-500]/node1_debug.log: WARNING: stopped searching binary file after match (found "\0" byte around offset 329646)
1695403170261_test_move_forwards_between_and_cleanup[342-500]/node1_debug.log
1029:DEBUG [InternalResponseStage:1] 2023-09-23 03:18:41,126 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695403170261_test_move_forwards_between_and_cleanup[342-500]/node1_debug.log: WARNING: stopped searching binary file after match (found "\0" byte around offset 331373)
1695366089957_test_move_forwards_between_and_cleanup[96-500]/node4_debug.log
1275:DEBUG [InternalResponseStage:1] 2023-09-22 17:00:41,140 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695422554318_test_move_forwards_between_and_cleanup[471-500]/node4_debug.log
1293:DEBUG [InternalResponseStage:1] 2023-09-23 08:41:45,750 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000{noformat}
So the retry happens in about 1% of runs with this test.
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767742#comment-17767742 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/22/23 3:00 AM:
Found some bugs:
{code:java}
if (inflightEcho.contains(addr))
{
    return;
}
inflightEcho.add(addr);
{code}
should be
{code:java}
if (!inflightEcho.add(addr))
{
    return;
}
{code}
Otherwise, a data race allows multiple inflight echos.

and
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
should be
{code:java}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    logger.trace("Resending ECHO_REQ to {}", addr);
    Message echoMessage = Message.out(ECHO_REQ, noPayload);
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{code}
That is, we need to construct a new message, not send the same message again.
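The first fix above is the classic check-then-act race: `contains()` followed by `add()` is not atomic, so two threads can both pass the check, while a single `add()` on a concurrent set admits exactly one winner because it returns `false` when the element was already present. A minimal standalone sketch (class and method names are illustrative, not Cassandra code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the dedupe fix: contains()-then-add() is a check-then-act race
// (two threads can both observe "not present"), while a single add() on a
// concurrent set is atomic and returns false for the loser.
public class InflightDedupe {
    static final Set<String> inflightEcho = ConcurrentHashMap.newKeySet();

    // Returns true only for the caller that actually claimed the slot.
    static boolean tryMarkInflight(String addr) {
        return inflightEcho.add(addr);   // atomic: at most one winner per addr
    }

    public static void main(String[] args) {
        System.out.println(tryMarkInflight("/127.0.0.2:7000")); // true
        System.out.println(tryMarkInflight("/127.0.0.2:7000")); // false
        inflightEcho.remove("/127.0.0.2:7000"); // echo completed or failed
        System.out.println(tryMarkInflight("/127.0.0.2:7000")); // true
    }
}
```

The single-call form also keeps the "claim" and the "check" under one memory operation, so no interleaving can produce two inflight echoes for the same address.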
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767787#comment-17767787 ] Cameron Zemek commented on CASSANDRA-18866:
---
{noformat}
pytest --count=100 --cassandra-dir=/home/grom/dev/cassandra-instaclustr transient_replication_ring_test.py::TestTransientReplicationRing::test_move_backwards_and_cleanup{noformat}
100/100 passes
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767555#comment-17767555 ] Cameron Zemek commented on CASSANDRA-18866:
---
{noformat}
pytest --count=100 --cassandra-dir=/home/grom/dev/cassandra transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup{noformat}
100/100 passes
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767361#comment-17767361 ] Cameron Zemek commented on CASSANDRA-18845:
---
{noformat}
Sep 21 03:01:42 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 21 03:01:48 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:01:49 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:01:50 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Gossip looks settled. epSize=108
Sep 21 03:02:00 ip-10-1-32-228 cassandra[52927]: INFO o.a.c.gms.GossipDigestAckVerbHandler Received a GossipDigestAckMessage from /15.223.140.86
Sep 21 03:02:00 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /44.229.153.229
...
Sep 21 03:03:40 ip-10-1-32-228 cassandra[52927]: INFO org.apache.cassandra.gms.Gossiper InetAddress /44.229.153.229 is now UP{noformat}
Got a test run with an 18 second delay.
> Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement > Reporter: Cameron Zemek > Priority: Normal > Attachments: 18845-seperate.patch, delay.log, example.log, > image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, > this is tedious and error prone. On a node we just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP.
> The problem being that we do not want to start Native Transport until gossip > settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM as the node > thinks the replicas are still in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that we > (outside a single node cluster) wait for an UP message from another node before > considering gossip as settled. Eg.
> {code:java}
> if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
> {
>     logger.debug("Gossip looks settled.");
>     numOkay++;
> }
> {code}
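The proposed condition reads as a pure predicate: endpoint counts stable *and* at least one node besides ourselves marked live. A minimal sketch under that reading (class and method names are illustrative, not the actual Gossiper code):

```java
// Sketch of the proposed settle check: in addition to the endpoint counts
// being stable between polls, require at least one *other* live endpoint
// (liveSize > 1) before declaring gossip settled, so Native Transport does
// not start while every peer is still marked DOWN.
public class GossipSettleCheck {
    static boolean looksSettled(int currentSize, int epSize,
                                int currentLive, int liveSize) {
        return currentSize == epSize     // total endpoint count stable
            && currentLive == liveSize   // live endpoint count stable
            && liveSize > 1;             // proposed extra condition: a peer is UP
    }

    public static void main(String[] args) {
        // Counts stable but no other node live yet -> not settled.
        System.out.println(looksSettled(108, 108, 1, 1));     // false
        // Counts stable and peers marked UP -> settled.
        System.out.println(looksSettled(108, 108, 107, 107)); // true
    }
}
```

In a single node cluster `liveSize > 1` can never hold, which is why the proposal explicitly carves that case out.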
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767358#comment-17767358 ] Cameron Zemek edited comment on CASSANDRA-18845 at 9/21/23 2:59 AM:
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80{noformat}
I am struggling to reproduce this ^ I have seen it twice, and after enabling more logging haven't been able to reproduce it again. What I do sometimes see, though, is it taking over 30 seconds to get the first ECHO response.

Since there are dtests that rely on having CQL up while nodes are down, I have attached a patch [^18845-seperate.patch] (against the 5.0 branch) that is opt-in. Having settle just check for currentLive == liveSize still allows NTR to start while nodes are marked down. Yes, you can increase cassandra.gossip_settle_poll_success_required (and/or the other properties) to mitigate it, but these increase the minimum startup time, whereas [^18845-seperate.patch] doesn't add to this when the cluster is healthy.

A more elaborate solution would be to specify the required consistency level, and for all token ranges owned by the node check whether you have the needed live endpoints to satisfy that consistency level.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767358#comment-17767358 ] Cameron Zemek commented on CASSANDRA-18845: --- {noformat} Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle... Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80{noformat} I am struggling to reproduce this ^ I seen it twice, and after enabling more logging haven't been able to reproduce again. What I do sometimes see though it taking over 30 seconds to get the first ECHO response. Since there are dtests that rely on having CQL up while nodes are down, I have attached a patch [^18845-seperate.patch] (against 5.0 branch) that is opt-in. Having settle just check for currentLive == liveSize is still allowing NTR to start while nodes are marked down. Yes you can increase cassandra.gossip_settle_poll_success_required (and/or the other properties) to mitigate it but these increase the minimum startup time. Whereas [^18845-seperate.patch] doesn't add to this when the cluster is healthy. A more elaborate solution would be to specify the required consistency level. And for all token ranges owned by the node you check if you have the needed live endpoints to satisfy the consistency level. > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement >Reporter: Cameron Zemek >Priority: Normal > Attachments: 18845-seperate.patch, delay.log, example.log, > image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log > > > This is a follow up to CASSANDRA-18543 > Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms > this is tedious and error prone. 
On a node just observed a 79 second gap > between waiting for gossip and the first echo response to indicate a node is > UP. > The problem being that do not want to start Native Transport until gossip > settles otherwise queries can fail consistency such as LOCAL_QUORUM as it > thinks the replicas are still in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that > (outside single node cluster) wait for UP message from another node before > considering gossip as settled. Eg. > {code:java} > if (currentSize == epSize && currentLive == liveSize && liveSize > > 1) > { > logger.debug("Gossip looks settled."); > numOkay++; > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-seperate.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: 18845-seperate.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: 18845-seperate.patch
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767052#comment-17767052 ] Cameron Zemek commented on CASSANDRA-18845:
---
with this removed
{code:java}
(epSize == liveSize || liveSize > 1)
{code}
the j11_dtests just passed. [j11_dtests (120384) - instaclustr/cassandra (circleci.com)|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3180/workflows/2f7e6199-d865-4eee-a3b1-9511a4c88a45/jobs/120384]
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767007#comment-17767007 ] Cameron Zemek commented on CASSANDRA-18845:
---
[^stream.log] Without this patch I get nodes stuck, unable to join a large test cluster:
{noformat}
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: INFO o.a.cassandra.service.StorageService JOINING: Starting to bootstrap...
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: Exception (java.lang.RuntimeException) encountered during startup: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: java.lang.RuntimeException: A node required to move the data consistently is down (/13.237.60.255). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
Sep 20 01:18:51 ip-10-7-20-120 cassandra[5521]: at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:294)
{noformat}
The node is in an endless restart cycle (since our service keeps retrying), with it reporting a different IP each time.
[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767036#comment-17767036 ] Cameron Zemek edited comment on CASSANDRA-18866 at 9/20/23 7:37 AM:
---
Going to run overnight the broken dtest that was flagged by the ECHO changes, but with a potential fix:
{noformat}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{noformat}
Will report back in the morning.

was (Author: cam1982): !18866-regression.patch|width=7,height=7,align=absmiddle! going to run overnight the broken dtest that was flagged by the ECHO changes. But with potential fix: {noformat} @Override public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason) { MessagingService.instance().sendWithCallback(echoMessage, addr, this); }{noformat} will report back in the morning.

> Node sends multiple inflight echos
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: 18866-regression.patch, duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to only allow one inflight ECHO request at a time. As per 18854, some tests had an error rate due to this change. Creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests inflight at the same time, making it less likely that all ECHOs will time out or get lost.
> With the change from 18845 plus some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from another node and also see it retrying ECHOs when it doesn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but that is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive when other nodes consider the node UP but it is marked DOWN by a node.
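To make the "single inflight echo, retried on failure" idea concrete, here is a minimal self-contained sketch. Everything in it is a stand-in invented for illustration — `Transport`, `EchoSender`, and `echoUntilAck` are not Cassandra's `MessagingService`/`RequestCallback` API, they just model the behaviour being discussed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of "at most one inflight echo per endpoint, retried until acked".
// All names here are hypothetical; the real code paths are in Cassandra's
// Gossiper / MessagingService, which this does NOT reproduce.
public class EchoRetrySketch {
    interface Transport { boolean send(); } // true = echo reply received

    static class EchoSender {
        private final AtomicBoolean inflight = new AtomicBoolean(false);
        int attempts = 0;

        /** Sends one echo and retries on failure (the onFailure-resend idea),
         *  while refusing to start a second echo when one is already inflight. */
        boolean echoUntilAck(Transport transport, int maxAttempts) {
            if (!inflight.compareAndSet(false, true))
                return false; // another echo already inflight: do not duplicate
            try {
                for (attempts = 1; attempts <= maxAttempts; attempts++)
                    if (transport.send())
                        return true; // reply arrived: caller can mark the node UP
                return false; // gave up; caller may reschedule later
            } finally {
                inflight.set(false);
            }
        }
    }

    public static void main(String[] args) {
        EchoSender sender = new EchoSender();
        int[] drops = {2}; // drop the first two echoes, answer the third
        boolean up = sender.echoUntilAck(() -> drops[0]-- <= 0, 5);
        System.out.println(up + " after " + sender.attempts + " attempts");
    }
}
```

The point of the sketch is that retry-on-failure and a single-inflight guard are independent: the rolled-back 18845 change coupled them, while trunk gets resilience only from allowing duplicates.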
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767037#comment-17767037 ] Cameron Zemek commented on CASSANDRA-18845:
---
the
{noformat}
(epSize == liveSize || liveSize > 1)
{noformat}
part breaks dtests. For example,
{noformat}
pytest --force-resource-intensive-tests --cassandra-dir=/home/grom/dev/cassandra materialized_views_test.py::TestMaterializedViews::test_throttled_partition_update
{noformat}
This test fails since it shuts down a 5 node cluster and then starts/stops each node one at a time, so liveSize > 1 is never true. Possible paths forward:
# The check for waiting for other nodes is off by default and requires setting a system property.
# Figure out why there is this large delay between the waitToSettle call and getting ECHO responses.
# Have the tests override cassandra.skip_wait_for_gossip_to_settle
# Some other option I haven't thought of yet.
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767036#comment-17767036 ] Cameron Zemek commented on CASSANDRA-18866:
---
!18866-regression.patch|width=7,height=7,align=absmiddle! going to run overnight the broken dtest that was flagged by the ECHO changes. But with potential fix:
{noformat}
@Override
public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
{
    MessagingService.instance().sendWithCallback(echoMessage, addr, this);
}
{noformat}
will report back in the morning.
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18866:
--
Attachment: 18866-regression.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: stream.log
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766949#comment-17766949 ] Cameron Zemek commented on CASSANDRA-18845:
---
Still running, but sharing the results so far:
{noformat}
$ pytest --count=500 --cassandra-dir=/home/grom/dev/cassandra transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup
/home/grom/dtest/lib/python3.10/site-packages/ccmlib/common.py:773: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  return LooseVersion(match.group(1))
== test session starts ==
platform linux -- Python 3.10.12, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/grom/tmp/cassandra-dtest
configfile: pytest.ini
plugins: repeat-0.9.1, flaky-3.7.0, timeout-1.4.2
timeout: 900.0s
timeout method: signal
timeout func_only: False
collected 500 items

transient_replication_ring_test.py ... [ 11%]
{noformat}
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766694#comment-17766694 ] Cameron Zemek commented on CASSANDRA-18845:
---
[^delay.log] Attached a log from a 105 node test cluster that shows the delay between starting to wait for gossip and getting UP replies back. Snippet:
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG org.apache.cassandra.gms.Gossiper Sending a EchoMessage to /35.83.14.80
Sep 19 08:10:57 ip-10-1-57-23 cassandra[131402]: INFO org.apache.cassandra.gms.Gossiper InetAddress /54.149.62.104 is now UP
{noformat}
So the delay is in sending out the Echo.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: delay.log
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766678#comment-17766678 ] Cameron Zemek commented on CASSANDRA-18773:
---
[~blambov] I have updated the pull request with your feedback and it is ready for review.

> Compactions are slow
> Key: CASSANDRA-18773
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18773
> Project: Cassandra
> Issue Type: Improvement
> Components: Local/Compaction
> Reporter: Cameron Zemek
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
> Attachments: 18773.patch, compact-poc.patch, flamegraph.png, stress.yaml
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> I have noticed that compactions involving a lot of sstables are very slow (for example major compactions). I have attached a cassandra-stress profile that can generate such a dataset under ccm. In my local test I have 2567 sstables at 4MB each.
> I added code to track the wall clock time of various parts of the code. One problematic part is the ManyToOne constructor: tracing through the code, every partition read creates a ManyToOne over all the sstable iterators. In my local test I get a measly 60KB/sec read speed, bottlenecked on a single CPU core (since this code is single threaded), with 85% of the wall clock time spent in the ManyToOne constructor.
> As another data point showing it is the merge-iterator part of the code: cfstats from [https://github.com/instaclustr/cassandra-sstable-tools/], which reads all the sstables but does no merging, gets 26MB/sec read speed.
> Tracking back from the ManyToOne call I see this in UnfilteredPartitionIterators::merge
> {code:java}
> for (int i = 0; i < toMerge.size(); i++)
> {
>     if (toMerge.get(i) == null)
>     {
>         if (null == empty)
>             empty = EmptyIterators.unfilteredRow(metadata, partitionKey, isReverseOrder);
>         toMerge.set(i, empty);
>     }
> }
> {code}
> I am not sure what the purpose of creating these empty rows is. But on a whim I removed all these empty iterators before passing to ManyToOne, and then all the wall clock time shifted to CompactionIterator::hasNext() and read speed increased to 1.5MB/s.
> So there are further bottlenecks in this code path it seems, but the first is this ManyToOne and having to build it for every partition read.
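The observation above — that null inputs could be skipped rather than replaced with freshly built empty iterators — can be illustrated with a toy k-way merge. This is a hedged sketch with stand-in types (a `List<Integer>` plays the role of a per-sstable iterator); it is not Cassandra's MergeIterator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy k-way merge that simply skips null inputs instead of substituting
// empty iterators, mirroring the experiment described in the ticket.
// Each int[] on the heap is {sourceIndex, positionWithinSource}.
public class MergeSkipNulls {
    static List<Integer> merge(List<List<Integer>> toMerge) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparingInt((int[] a) -> toMerge.get(a[0]).get(a[1])));
        for (int i = 0; i < toMerge.size(); i++)
            if (toMerge.get(i) != null && !toMerge.get(i).isEmpty())
                heap.add(new int[]{i, 0}); // skip nulls: never build empties
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<Integer> src = toMerge.get(top[0]);
            out.add(src.get(top[1]));
            if (++top[1] < src.size())
                heap.add(top); // advance this source and re-enter the heap
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> inputs = Arrays.asList(
                Arrays.asList(1, 4, 7), null, Arrays.asList(2, 3), null);
        System.out.println(merge(inputs)); // [1, 2, 3, 4, 7]
    }
}
```

The heap only ever holds real inputs, so its size (and per-element comparison cost) tracks the number of sstables that actually contain the partition, which is the effect the removed-empties experiment was probing.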
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766677#comment-17766677 ] Cameron Zemek edited comment on CASSANDRA-18845 at 9/19/23 7:32 AM:
---
Tested the patch 3 times to confirm it working. See test1.log test2.log and test3.log

was (Author: cam1982): !test1.log|width=7,height=7,align=absmiddle! !test2.log|width=7,height=7,align=absmiddle! !test3.log|width=7,height=7,align=absmiddle! Tested the patch 3 times to confirm it working.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766677#comment-17766677 ] Cameron Zemek commented on CASSANDRA-18845:
---
!test1.log|width=7,height=7,align=absmiddle! !test2.log|width=7,height=7,align=absmiddle! !test3.log|width=7,height=7,align=absmiddle! Tested the patch 3 times to confirm it working.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: test2.log
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: test3.log
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: test1.log
[jira] [Commented] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766676#comment-17766676 ] Cameron Zemek commented on CASSANDRA-18866:
---
duplicates.log shows the problem the change was fixing that led to the regressions. echo.log shows a test with the changes rolled back, where the network link between two nodes was broken and then re-established.
> Node sends multiple inflight echos
> --
>
> Key: CASSANDRA-18866
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Cameron Zemek
> Priority: Normal
> Attachments: duplicates.log, echo.log
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 18845 had a change to only allow 1 inflight ECHO request at a time. As per 18854, some tests have an error rate due to this change. Creating this ticket to discuss this further. The current state also does not have retry logic; it just allows multiple ECHO requests inflight at the same time, so it is less likely that all ECHOs will time out or get lost.
> With the change from 18845 plus some extra logging to track what is going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO requests from another node and also saw it retrying ECHOs when it didn't get a reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO request. Yes, there is no retry logic for failed ECHO requests, but this is the case both before and after 18845. ECHO requests are only sent via gossip verb handlers calling applyStateLocally. In these failed tests I am therefore assuming there are cases where it won't call markAlive when other nodes consider the node UP but it is marked DOWN by a node.
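The "one inflight ECHO per endpoint" behaviour discussed above can be sketched roughly as follows. This is a hypothetical illustration, not Cassandra's actual implementation: a per-endpoint guard that suppresses a second ECHO while one is already pending, released from the reply (or failure) callback so a retry can go out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the single-inflight guard: remember which endpoints
// have an ECHO pending and refuse to send another until the first completes.
public class EchoGuard {
    private final Set<String> inflight = ConcurrentHashMap.newKeySet();

    // Returns true if no ECHO is pending for this endpoint, claiming the slot.
    public boolean trySend(String endpoint) {
        return inflight.add(endpoint);
    }

    // Invoked from the reply or failure callback; frees the slot for a retry.
    public void complete(String endpoint) {
        inflight.remove(endpoint);
    }

    public static void main(String[] args) {
        EchoGuard guard = new EchoGuard();
        System.out.println(guard.trySend("10.0.0.2")); // prints true: first ECHO sent
        System.out.println(guard.trySend("10.0.0.2")); // prints false: duplicate suppressed
        guard.complete("10.0.0.2");                    // reply (or timeout) handled
        System.out.println(guard.trySend("10.0.0.2")); // prints true: retry allowed
    }
}
```

Note that unless complete() is also wired to a failure path, a single lost ECHO would pin the slot forever, which matches the regression described in CASSANDRA-18854.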
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18866:
--
Attachment: duplicates.log
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766628#comment-17766628 ] Cameron Zemek commented on CASSANDRA-18845:
---
[Cassandra 18845 3.11 by grom358 · Pull Request #2701 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2701]
[Cassandra 18845 4.0 by grom358 · Pull Request #2702 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2702]
[Cassandra 18845 4.1 by grom358 · Pull Request #2703 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2703]
[Cassandra 18845 5.0 by grom358 · Pull Request #2704 · apache/cassandra (github.com)|https://github.com/apache/cassandra/pull/2704]
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-5.0.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-4.0.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-4.1.patch)
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: (was: 18845-3.11.patch)
[jira] [Commented] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766620#comment-17766620 ] Cameron Zemek commented on CASSANDRA-18854:
---
Since this ticket is resolved and the changes have been reverted, I have created CASSANDRA-18866 as a followup to this one to discuss the regressions caused by the reverted change, as the change was to resolve the issue of multiple inflight ECHOs and we should still aim to improve that in my opinion. The wait-to-settle part already has followup ticket CASSANDRA-18845.
> Gossip never recovers from a single failed echo
> ---
>
> Key: CASSANDRA-18854
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18854
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Brandon Williams
> Assignee: Brandon Williams
> Priority: Normal
> Fix For: 3.11.17, 4.0.12, 4.1.4, 5.0-alpha2, 5.1
>
> Attachments: echo.log
>
> As discovered on CASSANDRA-18792, if an initial echo request is lost, the node will never be marked up. This appears to be a regression caused by CASSANDRA-18543.
[jira] [Updated] (CASSANDRA-18866) Node sends multiple inflight echos
[ https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18866:
--
Attachment: echo.log
[jira] [Created] (CASSANDRA-18866) Node sends multiple inflight echos
Cameron Zemek created CASSANDRA-18866:
-
Summary: Node sends multiple inflight echos
Key: CASSANDRA-18866
URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
Project: Cassandra
Issue Type: Improvement
Reporter: Cameron Zemek
Attachments: echo.log
[jira] [Commented] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766265#comment-17766265 ] Cameron Zemek commented on CASSANDRA-18854:
---
[^echo.log] I added some logging and disabled networking between two nodes. Once I re-enabled the network it reconnected, so I am not sure why it is breaking those tests. Having said this, both pre and post those changes there is no retry logic on failed ECHO messages. Pre these changes (as seen in the logs where it skipped), multiple ECHO messages are sent out. That is probably why the tests work pre these changes, as there are more ECHOs.
[jira] [Updated] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18854:
--
Attachment: echo.log
[jira] [Updated] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18854:
--
Attachment: (was: example_echo.log)
[jira] [Updated] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18854:
--
Attachment: example_echo.log
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766192#comment-17766192 ] Cameron Zemek commented on CASSANDRA-18845:
---
CASSANDRA-18543 had 3 components:
# Allow for overriding the values used in waitToSettle
# Make waitToSettle also consider the liveEndpoint members as part of settling.
# Changes to the handling of ECHO requests to remove duplicate inflight ECHOs and duplicate log messages about the same node going into UP state ('is now UP')
With the reverting in CASSANDRA-18854, did the changes to waitToSettle need to be reverted? The problem seems to be the changes to ECHO.
> The next step for this ticket to move forward will be to create tests that demonstrate the problem and guard against regressions.
This is going to be very difficult to do. dtests set up clusters on loopback addresses, and the waitToSettle code path has a guard against it when using a loopback address. Also, the problems mostly become apparent with large clusters. If I redo the patch, remove the changes to ECHO, and show those tests do not have a regression, would this allow the ticket to move forward? I am also in the process of setting up a large test cluster.
[^example.log] shows an example of what happens without the patched waitToSettle: gossip settles before nodes have finished being marked as UP.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845:
--
Attachment: example.log
[jira] [Commented] (CASSANDRA-18854) Gossip never recovers from a single failed echo
[ https://issues.apache.org/jira/browse/CASSANDRA-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766168#comment-17766168 ] Cameron Zemek commented on CASSANDRA-18854:
---
CASSANDRA-18543 changed how echo requests are handled (as there are a lot of duplicates, and on large clusters this results in log spam and a lot of tasks on the gossip stage), in addition to the fix for waiting for live endpoints in waitToSettle. At the very least, does the change to waitToSettle need to be reverted here?
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765429#comment-17765429 ] Cameron Zemek commented on CASSANDRA-18845:
---
Need to do more investigating around the slowness. I suspect it's due to the flood of gossip messages on startup. The previous patch, CASSANDRA-18543, removed the duplicate ECHO messages to cut down on this. The behavior I notice happening in production, though, is a large initial delay (> 10 seconds) before any node is marked as 'is now UP', and then it floods in. On large clusters it takes over a minute to complete receiving them all.
Prior to CASSANDRA-18543 it never checked liveSize at all, and so would start up regardless of the UP status of nodes. With that change, assuming the polling starts as UP statuses are received, it waits. So the problem now is waiting for that initial event. The previous patch from CASSANDRA-18543 allowed for overriding the gossip parameters, but in hindsight it's difficult to determine a suitable default for that initial wait as it's not consistent. The algorithm in waitToSettle relies on seeing a change in these values, so if that initial delay is greater than the wait time plus the polling phase, it will move on and start NTR even though we have yet to see any nodes as UP.
You are correct that even with this proposed patch it's possible to still start NTR too early, e.g. if one node reports UP but the delay for the next event is longer than the polling period, but I am not seeing that in production so far.
Therefore, the purpose of this patch is to have it wait for the first `is now UP` from a node instead of relying on cassandra.gossip_settle_min_wait_ms. > Waiting for gossip to settle on live endpoints > -- > > Key: CASSANDRA-18845 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 > Project: Cassandra > Issue Type: Improvement > Reporter: Cameron Zemek > Priority: Normal > Attachments: 18845-3.11.patch, 18845-4.0.patch, 18845-4.1.patch, > 18845-5.0.patch, image-2023-09-14-11-16-23-020.png > > > This is a follow-up to CASSANDRA-18543. > Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP. > The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node still thinks the replicas are in DOWN state. > Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside a single-node cluster) we wait for an UP message from another node before considering gossip settled. Eg. > {code:java} > if (currentSize == epSize && currentLive == liveSize && liveSize > 1) > { > logger.debug("Gossip looks settled."); > numOkay++; > } {code}
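The proposed check can be sketched as a small settle loop. This is a simplified illustration, not the actual Gossiper#waitToSettle code: the class name, the suppliers, and the bounded poll count are invented here for testability, and the real code sleeps between polls and reads sizes from the Gossiper singleton.

```java
import java.util.function.IntSupplier;

// Illustrative sketch of the proposed settle condition. Gossip is considered
// settled once endpoint/live counts are stable for REQUIRED_POLLS consecutive
// polls AND at least one peer besides ourselves is live (liveSize > 1),
// unless this is a single-node cluster (epSize == liveSize).
public class GossipSettleSketch
{
    static final int REQUIRED_POLLS = 3;

    public static boolean looksSettled(IntSupplier epSupplier, IntSupplier liveSupplier)
    {
        int numOkay = 0;
        int epSize = epSupplier.getAsInt();
        int liveSize = liveSupplier.getAsInt();
        for (int poll = 0; poll < 100; poll++) // bounded for the sketch; real code polls with sleeps
        {
            int currentSize = epSupplier.getAsInt();
            int currentLive = liveSupplier.getAsInt();
            // Proposed extra condition: don't call gossip settled while the only
            // live endpoint is ourselves, unless epSize == liveSize (single node).
            if (currentSize == epSize && currentLive == liveSize
                && (liveSize > 1 || epSize == liveSize))
            {
                numOkay++;
                if (numOkay >= REQUIRED_POLLS)
                    return true;
            }
            else
            {
                numOkay = 0; // any change resets the stability counter
            }
            epSize = currentSize;
            liveSize = currentLive;
        }
        return false;
    }
}
```

With stable counts, a single-node cluster (1/1) and a fully-live cluster (5/5) settle, while a node that only sees itself live (3 endpoints, liveSize == 1) keeps waiting, which is the behavior the patch is after.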
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764934#comment-17764934 ] Cameron Zemek commented on CASSANDRA-18845: --- [~brandon.williams] [~smiklosovic] the existing conditions {noformat} currentSize == epSize && currentLive == liveSize{noformat} are what stop it starting Native Transport too early while gossip is still being updated (for example while liveSize is changing). waitToSettle waits 5 seconds by default, then polls every 1 second, 3 times, checking whether either liveSize or epSize has changed, and resets numOkay if either changes. The problem is when, for example, it takes 79 seconds for the first change in liveSize: liveSize stays constantly at 1, so it decides gossip is settled because there were no changes in epSize or liveSize. The extra condition therefore is: don't consider gossip settled if there is only 1 live endpoint (the node itself), unless it's a single-node cluster (epSize == liveSize). > So when there is a cluster of 50 nodes, without this change, that "if" would return false (or it would not return true fast enough to increment numOkay to break from that while) as there would be new endpoints or live members detected each round. To rephrase: the problem is there are no new endpoint or live member changes, so waitToSettle currently considers it settled with liveSize == 1. > why it takes almost minute and a half This is a good question, but in general it takes quite a while for gossip to complete on clusters with multiple datacenters and/or a large number of nodes. I think that is a different, much more complex JIRA. The purpose of the attached patch is so you don't need to guess what cassandra.gossip_settle_min_wait_ms to use. It waits for at least one node to report `is now UP` in order to increment numOkay and continue with the rest of the waitToSettle logic. !image-2023-09-14-11-16-23-020.png!
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: image-2023-09-14-11-16-23-020.png
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-5.0.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-4.1.patch
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-4.0.patch
[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764467#comment-17764467 ] Cameron Zemek edited comment on CASSANDRA-18845 at 9/13/23 3:32 AM: I have attached the patch and tested it as follows: # Spin up a single-node cluster. Works, thanks to the epSize == liveSize check that bypasses the liveSize > 1 check. # Spin up a 3-node cluster. All 3 nodes start NTR as expected. # Shut down all nodes, then start the first node: it stays waiting in gossip due to the liveSize > 1 requirement. # Start the second node: now both nodes start NTR, since liveSize > 1 and there are no other incoming `is now UP` events, so gossip looks settled. NOTE: I had to disable the if condition guarding the call to Gossiper.waitToSettle(), since I was using loopback addresses.
[jira] [Commented] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764467#comment-17764467 ] Cameron Zemek commented on CASSANDRA-18845: --- I have attached the patch and tested it as follows: # Spin up a single-node cluster. Works, thanks to the epSize == liveSize check that bypasses the liveSize > 1 check. # Spin up a 3-node cluster. All 3 nodes start NTR as expected. # Shut down all nodes, then start the first node: it stays waiting in gossip due to the liveSize > 1 requirement. # Start the second node: now both nodes start NTR, since liveSize > 1 and there are no other incoming `is now UP` events, so gossip looks settled.
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Description: This is a follow up to CASSANDRA-18543 Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms this is tedious and error prone. On a node just observed a 79 second gap between waiting for gossip and the first echo response to indicate a node is UP. The problem being that do not want to start Native Transport until gossip settles otherwise queries can fail consistency such as LOCAL_QUORUM as it thinks the replicas are still in DOWN state. Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside single node cluster) wait for UP message from another node before considering gossip as settled. Eg. {code:java} if (currentSize == epSize && currentLive == liveSize && liveSize > 1) { logger.debug("Gossip looks settled."); numOkay++; } {code} was: This is a follow up to CASSANDRA-18543 Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms this is tedious and error prone. On a node just observed a 79 second gap between waiting for gossip and the first echo response to indicate a node is UP. The problem being that do not want to start Native Transport until gossip settles otherwise queries can fail consistency such as LOCAL_QUORUM as it thinks the replicas are still in DOWN state. Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside single node cluster) wait for UP message from another node before considering gossip as settled. Eg. 
{code:java} if (currentSize == epSize && currentLive == liveSize && liveSize > 0) { logger.debug("Gossip looks settled."); numOkay++; } {code}
[jira] [Updated] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18845: -- Attachment: 18845-3.11.patch
[jira] [Created] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints
Cameron Zemek created CASSANDRA-18845: - Summary: Waiting for gossip to settle on live endpoints Key: CASSANDRA-18845 URL: https://issues.apache.org/jira/browse/CASSANDRA-18845 Project: Cassandra Issue Type: Improvement Reporter: Cameron Zemek This is a follow-up to CASSANDRA-18543. Although that ticket added the ability to set cassandra.gossip_settle_min_wait_ms, this is tedious and error prone. On one node we observed a 79 second gap between waiting for gossip and the first echo response indicating a node is UP. The problem is that we do not want to start Native Transport until gossip settles, otherwise queries can fail consistency levels such as LOCAL_QUORUM because the node still thinks the replicas are in DOWN state. Instead of having to set gossip_settle_min_wait_ms I am proposing that (outside a single-node cluster) we wait for an UP message from another node before considering gossip settled. Eg. {code:java} if (currentSize == epSize && currentLive == liveSize && liveSize > 0) { logger.debug("Gossip looks settled."); numOkay++; } {code}
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759820#comment-17759820 ] Cameron Zemek commented on CASSANDRA-18773: --- [^18773.patch] I took your idea above and implemented a preserveOrder method on MergeIterator, which the CompactionIterator implementation disables when there is no index. {code:java} INFO [CompactionExecutor:2] 2023-08-28 22:19:37,162 CompactionTask.java:239 - Read=53.93% 7.03 MiB/s, Write=20.47% 7.31 MiB/s INFO [CompactionExecutor:2] 2023-08-28 22:20:37,162 CompactionTask.java:239 - Read=54.94% 6.97 MiB/s, Write=20.42% 7.24 MiB/s INFO [CompactionExecutor:2] 2023-08-28 22:21:37,162 CompactionTask.java:239 - Read=53.69% 6.82 MiB/s, Write=22.33% 7.08 MiB/s {code} This gives basically the same results as my proof of concept. [~blambov] what do you think about using background threads in compactions (to decouple read/write)? That change gives a further noticeable increase (40%): {noformat} INFO [CompactionExecutor:2] 2023-08-28 21:08:08,463 CompactionTask.java:266 - Read=37.27% 9.63 MiB/s, Write=28.22% 10 MiB/s INFO [CompactionExecutor:2] 2023-08-28 21:09:08,463 CompactionTask.java:266 - Read=37.93% 9.65 MiB/s, Write=27.87% 10.02 MiB/s{noformat} It copies the rows into memory to pass them across to the writer, so the reader can advance its file positions. Eg. {code:java} ArrayList rows = new ArrayList<>(); while (rowIterator.hasNext()) { rows.add(rowIterator.next()); }{code} So there is a tradeoff.
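The read/write decoupling described above can be sketched as a bounded producer/consumer hand-off. This is a hypothetical illustration, not the patch's actual classes: plain string lists stand in for rows, and the class and method names are invented. The copying of each partition's rows into memory is the tradeoff the comment mentions; the bounded queue limits how much of that buffered data can pile up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of decoupling compaction reads from writes: a reader thread copies
// each partition's rows into memory and hands them to the writer through a
// bounded queue, so the reader can advance its file positions while the
// writer is still flushing earlier partitions.
public class PipelinedCompactionSketch
{
    private static final List<String> END = new ArrayList<>(); // end-of-input sentinel

    public static List<String> compact(List<List<String>> partitions)
    {
        // Small bound keeps the amount of buffered row data limited.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(8);
        List<String> written = new ArrayList<>();

        Thread reader = new Thread(() -> {
            try
            {
                for (List<String> partition : partitions)
                    queue.put(new ArrayList<>(partition)); // copy rows out, then move on
                queue.put(END);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        try
        {
            // Writer side: drain buffered partitions in order and "write" them.
            List<String> batch;
            while ((batch = queue.take()) != END)
                written.addAll(batch);
            reader.join();
        }
        catch (InterruptedException e)
        {
            throw new RuntimeException(e);
        }
        return written;
    }
}
```

The output order is preserved because the queue is FIFO and there is a single reader; the parallelism comes purely from the reader running ahead of the writer by up to the queue bound.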
> Compactions are slow > > > Key: CASSANDRA-18773 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18773 > Project: Cassandra > Issue Type: Improvement > Components: Local/Compaction > Reporter: Cameron Zemek > Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > Attachments: 18773.patch, compact-poc.patch, flamegraph.png, > stress.yaml > > Time Spent: 10m > Remaining Estimate: 0h > > I have noticed that compactions involving a lot of sstables are very slow (for example major compactions). I have attached a cassandra-stress profile that can generate such a dataset under ccm. In my local test I have 2567 sstables at 4Mb each. > I added code to track wall clock time of various parts of the code. One problematic part is the ManyToOne constructor: tracing through the code, for every partition it creates a ManyToOne over all the sstable iterators. In my local test I get a measly 60Kb/sec read speed, bottlenecked on a single CPU core (since this code is single threaded), with 85% of the wall clock time spent in the ManyToOne constructor. > As another datapoint to show it's the merge-iterator part of the code: the cfstats tool from [https://github.com/instaclustr/cassandra-sstable-tools/], which reads all the sstables but does no merging, gets 26Mb/sec read speed. > Tracking back from the ManyToOne call I see this in UnfilteredPartitionIterators::merge > {code:java} > for (int i = 0; i < toMerge.size(); i++) > { > if (toMerge.get(i) == null) > { > if (null == empty) > empty = EmptyIterators.unfilteredRow(metadata, > partitionKey, isReverseOrder); > toMerge.set(i, empty); > } > } > {code} > I am not sure what the purpose of creating these empty iterators is, but on a whim I removed all of them before passing to ManyToOne, and all the wall clock time shifted to CompactionIterator::hasNext() while read speed increased to 1.5Mb/s. > So there are further bottlenecks in this code path it seems, but the first is this ManyToOne and having to build it for every partition read.
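The quoted loop replaces null entries in toMerge with a shared empty iterator, presumably so each merge input keeps its positional index (a guess on my part; the ticket itself notes the purpose is unclear), whereas the experiment drops the nulls, shrinking the merge fan-in. A toy illustration of the two approaches, with plain lists standing in for the unfiltered iterators and invented method names:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the padding loop quoted above: null entries are replaced with
// a shared empty list so each merge input keeps its positional index. Dropping
// nulls instead shrinks the merge fan-in, which is what made the ManyToOne
// construction cheaper in the experiment, at the cost of index alignment.
public class MergeInputSketch
{
    private static final List<Integer> EMPTY = new ArrayList<>(); // shared, like the lazily-built `empty`

    static List<List<Integer>> padWithEmpty(List<List<Integer>> toMerge)
    {
        List<List<Integer>> out = new ArrayList<>(toMerge);
        for (int i = 0; i < out.size(); i++)
            if (out.get(i) == null)
                out.set(i, EMPTY); // preserves index alignment with the sstable list
        return out;
    }

    static List<List<Integer>> dropNulls(List<List<Integer>> toMerge)
    {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> in : toMerge)
            if (in != null)
                out.add(in); // smaller fan-in; indices no longer align
        return out;
    }
}
```

For an input of three slots where the middle sstable has no data for the partition, padding keeps three merge inputs while dropping leaves two.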
[jira] [Updated] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Zemek updated CASSANDRA-18773: -- Attachment: 18773.patch
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758307#comment-17758307 ] Cameron Zemek commented on CASSANDRA-18773: --- I added the listener. I also separated the reading into its own background thread for a further performance increase. {noformat} INFO [CompactionExecutor:2] 2023-08-22 15:24:56,237 CompactionTask.java:264 - Read=34.65% 10.43 MiB/s, Write=28.96% 10.83 MiB/s INFO [CompactionExecutor:2] 2023-08-22 15:25:56,237 CompactionTask.java:264 - Read=34.88% 10.49 MiB/s, Write=28.92% 10.9 MiB/s{noformat}
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757742#comment-17757742 ]

Cameron Zemek commented on CASSANDRA-18773:
-------------------------------------------
[^compact-poc.patch] I did a patch as a proof of concept of the idea in my last comment.

Before:
{noformat}
INFO [CompactionExecutor:2] 2023-08-22 03:04:33,591 CompactionTask.java:241 - Read=56.21% 138.64 KiB/s, Write=42.50% 146.09 KiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:05:33,590 CompactionTask.java:241 - Read=56.58% 143.37 KiB/s, Write=42.84% 148.96 KiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:06:33,590 CompactionTask.java:241 - Read=56.51% 144.15 KiB/s, Write=42.91% 149.77 KiB/s
{noformat}
After:
{noformat}
INFO [CompactionExecutor:2] 2023-08-22 03:34:34,471 CompactionTask.java:241 - Read=53.12% 8.07 MiB/s, Write=18.75% 8.38 MiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:35:34,470 CompactionTask.java:241 - Read=55.08% 7.88 MiB/s, Write=17.99% 8.19 MiB/s
INFO [CompactionExecutor:2] 2023-08-22 03:36:34,470 CompactionTask.java:241 - Read=54.51% 7.65 MiB/s, Write=18.75% 7.95 MiB/s
{noformat}
Roughly a 50-fold improvement in compaction speed.

> Compactions are slow
> --------------------
>
>                 Key: CASSANDRA-18773
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18773
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: compact-poc.patch, flamegraph.png, stress.yaml
>
> I have noticed that compactions involving a lot of sstables are very slow
> (for example major compactions). I have attached a cassandra-stress profile
> that can generate such a dataset under ccm. In my local test I have 2567
> sstables at 4Mb each.
> I added code to track the wall clock time of various parts of the code. One
> problematic part is the ManyToOne constructor: tracing through the code,
> every partition read creates a ManyToOne over all the sstable iterators for
> that partition. In my local test I get a measly 60Kb/sec read speed,
> bottlenecked on a single CPU core (since this code is single threaded), with
> 85% of the wall clock time spent in the ManyToOne constructor.
> As another data point showing it's the merge-iterator part of the code: the
> cfstats tool from [https://github.com/instaclustr/cassandra-sstable-tools/],
> which reads all the sstables but does no merging, gets a 26Mb/sec read speed.
> Tracking back from the ManyToOne call I see this in
> UnfilteredPartitionIterators::merge
> {code:java}
> for (int i = 0; i < toMerge.size(); i++)
> {
>     if (toMerge.get(i) == null)
>     {
>         if (null == empty)
>             empty = EmptyIterators.unfilteredRow(metadata, partitionKey, isReverseOrder);
>         toMerge.set(i, empty);
>     }
> }
> {code}
> I'm not sure what the purpose of creating these empty iterators is. But on a
> whim I removed all these empty iterators before passing to ManyToOne, and
> then all the wall clock time shifted to CompactionIterator::hasNext() and
> read speed increased to 1.5Mb/s.
> So it seems there are further bottlenecks in this code path, but the first is
> this ManyToOne and having to build it for every partition read.
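As a sanity check on the claimed speed-up, the ratio can be computed from the log lines quoted above (before ~144 KiB/s, after ~8.07 MiB/s):

```java
public class SpeedupCheck
{
    public static void main(String[] args)
    {
        double beforeKiB = 144.15;       // KiB/s, from the "Before" log lines
        double afterKiB = 8.07 * 1024;   // 8.07 MiB/s converted to KiB/s
        // Prints the read-throughput ratio; comes out to roughly 57x,
        // consistent with the "roughly 50-fold" claim.
        System.out.printf("%.0fx%n", afterKiB / beforeKiB);
    }
}
```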
[jira] [Updated] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek updated CASSANDRA-18773:
--------------------------------------
    Attachment: compact-poc.patch
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755832#comment-17755832 ]

Cameron Zemek commented on CASSANDRA-18773:
-------------------------------------------
Yes, this is not limited to just major compactions; that was just a way I could reproduce the issue reliably. The same thing is happening when switching from STCS to LCS for a customer: that operation has been going for 2 weeks now, on 1.5Tb of disk usage. Disk benchmarks show the disk able to do 120Mb/s with random reads of 16kb chunks, so the operation should have completed in about a day. Picking a random node, it has 5 compactions going with compaction throughput set to 64Mb/s, yet iotop shows a max of 26Mb/s.

I commented out a bunch of code in the hot paths:
{code:java}
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java
index 2eb5d8fde7..bd72117632 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIterators.java
@@ -532,7 +532,7 @@ public abstract class UnfilteredRowIterators
         public void close()
         {
             // This will close the input iterators
-            FileUtils.closeQuietly(mergeIterator);
+//            FileUtils.closeQuietly(mergeIterator);

             if (listener != null)
                 listener.close();
diff --git a/src/java/org/apache/cassandra/utils/MergeIterator.java b/src/java/org/apache/cassandra/utils/MergeIterator.java
index 6713dd0a43..5744dfb89b 100644
--- a/src/java/org/apache/cassandra/utils/MergeIterator.java
+++ b/src/java/org/apache/cassandra/utils/MergeIterator.java
@@ -42,7 +42,13 @@ public abstract class MergeIterator extends AbstractIterator implem
             return reducer.trivialReduceIsTrivial()
                  ? new TrivialOneToOne<>(sources, reducer)
                  : new OneToOne<>(sources, reducer);
         }
-        return new ManyToOne<>(sources, comparator, reducer);
+        ArrayList<Iterator<In>> filtered = new ArrayList<>(sources.size());
+        for (Iterator<In> it : sources) {
+            if (it != null) {
+                filtered.add(it);
+            }
+        }
+        return new ManyToOne<>(filtered, comparator, reducer);
     }

     public Iterable<? extends Iterator<In>> iterators()
@@ -361,7 +367,8 @@ public abstract class MergeIterator extends AbstractIterator implem
         this.iter = iter;
         this.comp = comp;
         this.idx = idx;
-        this.lowerBound = iter instanceof IteratorWithLowerBound ? ((IteratorWithLowerBound)iter).lowerBound() : null;
+        this.lowerBound = null;
+//        this.lowerBound = iter instanceof IteratorWithLowerBound ? ((IteratorWithLowerBound)iter).lowerBound() : null;
     }

     /** @return this if our iterator had an item, and it is now available, otherwise null */
{code}
It is still spending a significant chunk of time in UnfilteredRowMergeIterator, with the bulk of that in the ManyToOne constructor. Is there not a way to manage the sstable merging without creating so many objects like ManyToOne? E.g. have a state object for each sstable and use that throughout the whole compaction to manage the merging. This is what cassandra-sstable-tools does: it keeps the current partition key for each sstable and holds all the sstables in a priority queue (readerQueue). E.g.:
{code:java}
ArrayList<Reader> toMerge = new ArrayList<>(readerQueue.size());
while (!readerQueue.isEmpty())
{
    Reader reader = readerQueue.remove();
    toMerge.add(reader);
    DecoratedKey key = reader.key;
    // grab every other reader positioned on the same partition key
    while ((reader = readerQueue.peek()) != null && reader.key.equals(key))
    {
        readerQueue.remove();
        toMerge.add(reader);
    }
    doMerge(toMerge);
    for (Reader r : toMerge)
        readerNext(r); // advance the reader and re-add it to the priority queue if it has more
    toMerge.clear();
}
{code}
That is, each sstable reader sits positioned ready to read its current partition. Grab all the readers that belong to the partition to be merged; doMerge iterates the rows in those readers and performs the merging; then readerNext reads the next partition key and puts the reader back into the priority queue. It doesn't have to be a priority queue, just some efficient way to determine which sstables to include in each partition merge.
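The queue-driven merge sketched above can be made concrete as a self-contained toy. This is an illustration of the proposed scheme only, not Cassandra code: `Reader` here is a hypothetical stand-in for an sstable reader, with integer partition keys, and the "row merge" step is reduced to recording each distinct key once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class QueueMerge
{
    // Hypothetical stand-in for an sstable reader: a sorted stream of partition keys.
    static final class Reader
    {
        final Iterator<Integer> keys;
        Integer key; // current partition key, null when exhausted

        Reader(List<Integer> sortedKeys) { keys = sortedKeys.iterator(); advance(); }
        void advance() { key = keys.hasNext() ? keys.next() : null; }
    }

    // Visit every distinct partition key across all readers, in order,
    // pulling only the readers that actually contain that partition.
    static List<Integer> mergeKeys(List<Reader> readers)
    {
        PriorityQueue<Reader> queue =
            new PriorityQueue<Reader>(Comparator.comparing(r -> r.key));
        for (Reader r : readers)
            if (r.key != null)
                queue.add(r);

        List<Integer> merged = new ArrayList<>();
        List<Reader> toMerge = new ArrayList<>();
        while (!queue.isEmpty())
        {
            Reader reader = queue.remove();
            toMerge.add(reader);
            Integer key = reader.key;
            // grab every other reader positioned on the same partition key
            while ((reader = queue.peek()) != null && reader.key.equals(key))
            {
                queue.remove();
                toMerge.add(reader);
            }
            merged.add(key); // a real doMerge(toMerge) would merge rows here
            for (Reader r : toMerge)
            {
                r.advance();
                if (r.key != null)
                    queue.add(r); // re-queue readers that still have partitions
            }
            toMerge.clear();
        }
        return merged;
    }

    public static void main(String[] args)
    {
        List<Reader> readers = List.of(
            new Reader(List.of(1, 4, 7)),
            new Reader(List.of(1, 2, 7)),
            new Reader(List.of(3)));
        System.out.println(mergeKeys(readers)); // prints [1, 2, 3, 4, 7]
    }
}
```

The per-partition cost is one O(log n) queue operation per participating reader, and no per-partition merge objects are allocated beyond the reused `toMerge` list, which is the contrast being drawn with constructing a fresh ManyToOne for every partition.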
[jira] [Comment Edited] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755804#comment-17755804 ]

Cameron Zemek edited comment on CASSANDRA-18773 at 8/18/23 5:05 AM:
--------------------------------------------------------------------
!flamegraph.png|width=1508,height=691!

was (Author: cam1982):
!flamegraph.png!
[jira] [Updated] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Zemek updated CASSANDRA-18773:
--------------------------------------
    Attachment: flamegraph.png
[jira] [Commented] (CASSANDRA-18773) Compactions are slow
[ https://issues.apache.org/jira/browse/CASSANDRA-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755804#comment-17755804 ]

Cameron Zemek commented on CASSANDRA-18773:
-------------------------------------------
!flamegraph.png!