[ https://issues.apache.org/jira/browse/CASSANDRA-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041862#comment-15041862 ]
Sylvain Lebresne commented on CASSANDRA-10477:
----------------------------------------------

bq. Which aspect of hint "overload" protection is missing? I see it increments a counter which I thought was the signal upstream.

This is about who is looking at said counter (to do something about it if it's too high). The normal write path is, and so incrementing the counter in CAS will potentially apply back-pressure on normal writes, but not on CAS requests themselves.

bq. Looking at it further, is it because it doesn't throw OverloadedException? So a better behavior would be to have the check and exception in a helper method and use that in commitPaxos() so that it can now throw OverloadedException?

Exactly.

bq. I do wonder what the unforeseen consequences of having CAS capable of throwing OE is going to do that we haven't seen or tested before.

It's a good question, and to be honest I'm not sure we have any test that covers {{OverloadedException}} at all (but I could be wrong). But in general, the commit part of Paxos is not very "sensitive": worst case, if not enough replicas get the commit, the next serial operation (including a read) on the partition will re-commit. So the main question is whether potentially throwing {{OverloadedException}} would surprise people. I would argue it shouldn't, because normal writes can do so and we never specified it was any different for CAS. That said, if we're uncomfortable with it, I'm totally fine committing that part of the change only in 3.2 (aka trunk currently).

bq. the read path now throws OE where it didn't before

Right. That's probably more justification for keeping that part in 3.2 only.
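For illustration, here is a minimal, self-contained sketch of the helper-method idea discussed above, under stated assumptions: all class, field, and method names in it ({{HintOverloadSketch}}, {{hintsInProgress}}, {{maxHintsInProgress}}, {{checkHintOverload}}, {{mutate}}) are hypothetical placeholders and this is not the actual {{StorageProxy}} code; only {{commitPaxos()}} and {{OverloadedException}} are names taken from the discussion.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of factoring the hint-overload check into a shared helper.
// All names here are hypothetical; this is not the real StorageProxy implementation.
public class HintOverloadSketch
{
    // Stand-in for Cassandra's OverloadedException ("too many in-flight hints").
    static class OverloadedException extends RuntimeException
    {
        OverloadedException(String message) { super(message); }
    }

    // Counter bumped when a hint is queued for a down replica.
    private final AtomicLong hintsInProgress = new AtomicLong();
    private final long maxHintsInProgress;

    HintOverloadSketch(long maxHintsInProgress)
    {
        this.maxHintsInProgress = maxHintsInProgress;
    }

    // The shared check: both the normal write path and the Paxos commit path
    // call it before queueing hints, so both can refuse work when overloaded.
    private void checkHintOverload()
    {
        long inFlight = hintsInProgress.get();
        if (inFlight > maxHintsInProgress)
            throw new OverloadedException("Too many in flight hints: " + inFlight);
    }

    void mutate()
    {
        checkHintOverload();          // existing behaviour on the normal write path
        hintsInProgress.incrementAndGet();
        // ... apply the mutation, hinting unreachable replicas ...
    }

    void commitPaxos()
    {
        checkHintOverload();          // new: CAS commit can now throw OverloadedException too
        hintsInProgress.incrementAndGet();
        // ... send the Paxos commit, hinting unreachable replicas ...
    }
}
{code}

The point is simply that once the check lives in one helper, the Paxos commit path sheds load the same way the normal write path already does, instead of only incrementing the hint counter.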
> java.lang.AssertionError in StorageProxy.submitHint
> ---------------------------------------------------
>
>                 Key: CASSANDRA-10477
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10477
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>         Environment: CentOS 6, Oracle JVM 1.8.45
>            Reporter: Severin Leonhardt
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>
> A few days after updating from 2.0.15 to 2.1.9 we have the following log entry on 2 of 5 machines:
> {noformat}
> ERROR [EXPIRING-MAP-REAPER:1] 2015-10-07 17:01:08,041 CassandraDaemon.java:223 - Exception in thread Thread[EXPIRING-MAP-REAPER:1,5,main]
> java.lang.AssertionError: /192.168.11.88
>         at org.apache.cassandra.service.StorageProxy.submitHint(StorageProxy.java:949) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:383) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.net.MessagingService$5.apply(MessagingService.java:363) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.utils.ExpiringMap$1.run(ExpiringMap.java:98) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_45]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_45]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_45]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
> {noformat}
> 192.168.11.88 is the broadcast address of the local machine.
> When this is logged the read request latency of the whole cluster becomes very bad, from 6 ms/op to more than 100 ms/op according to OpsCenter. Clients get a lot of timeouts. We need to restart the affected Cassandra node to get back normal read latencies. It seems write latency is not affected.
> Disabling hinted handoff using {{nodetool disablehandoff}} only prevents the assert from being logged. At some point the read latency becomes bad again. Restarting the node where hinted handoff was disabled results in the read latency being better again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
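As a footnote on the reported {{AssertionError}}: below is a hedged, illustrative sketch of the invariant the stack trace points at, namely that a node is not expected to store a hint for itself, so a hint target equal to the local broadcast address (192.168.11.88 in the report) trips an assertion instead of being handled. {{SubmitHintSketch}} and its fields are made-up names; this is not the actual {{StorageProxy.submitHint}} code.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative sketch of the invariant behind the AssertionError above: a node
// should never store a hint for itself. Names are hypothetical, not Cassandra code.
public class SubmitHintSketch
{
    private final InetAddress broadcastAddress;

    SubmitHintSketch(InetAddress broadcastAddress)
    {
        this.broadcastAddress = broadcastAddress;
    }

    void submitHint(InetAddress target)
    {
        // Hinting the local node makes no sense, so the target must be a remote endpoint.
        // If a callback-expiration path hands in the local address, this fires.
        assert !target.equals(broadcastAddress) : target;
        // ... otherwise schedule the hint for the remote target ...
    }

    public static void main(String[] args) throws UnknownHostException
    {
        InetAddress local = InetAddress.getByName("192.168.11.88");
        // Run with -ea to enable assertions: passing the local address reproduces
        // the shape of the failure reported above.
        new SubmitHintSketch(local).submitHint(local);
    }
}
{code}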